# optimal_estimation_of_sparse_topic_models__9fe5482f.pdf Journal of Machine Learning Research 21 (2020) 1-45 Submitted 1/20; Published 07/20 Optimal Estimation of Sparse Topic Models Xin Bing xb43@cornell.edu Florentina Bunea fb238@cornell.edu Department of Statistics and Data Science Cornell University Ithaca, NY 14850, USA Marten Wegkamp mhw73@cornell.edu Department of Mathematics and Department of Statistics and Data Science Cornell University Ithaca, NY 14850, USA Editor: Arnak Dalalyan Topic models have become popular tools for dimension reduction and exploratory analysis of text data which consists in observed frequencies of a vocabulary of p words in n documents, stored in a p n matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a p K word-topic matrix A and a K n topic-document matrix W. This paper studies the estimation of A that is possibly element-wise sparse, and the number of topics K is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such A and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of A and our analysis is valid for any finite n, p, K and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse A and sparse A, and has superior performance is many scenarios of interest. Keywords: topic models, minimax estimation, sparse estimation, adaptive estimation, high dimensional estimation, non-negative matrix factorization, separability, anchor words 1. Introduction Topic modeling has been a popular and powerful statistical model during the last two decades in machine learning and natural language processing for discovering thematic structures from a corpus of documents. Topic models have wide applications beyond the context in which was originally introduced, to genetics, neuroscience and social science (Blei, 2012), to name just a few areas in which they have been successfully employed. In the computer science and machine learning literature, topic models were first introduced as latent semantic indexing models by Deerwester et al. (1990); Papadimitriou et al. (1998); Hofmann (1999); Papadimitriou et al. (2000). For uniformity and clarity, we explain our methodology in the language typically associated with this set-up. A corpus of n c 2020 Xin Bing, Florentina Bunea and Marten Wegkamp. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/20-079.html. Bing, Bunea and Wegkamp documents is assumed to follow generative models based on the bag-of-word representation. Specifically, each document Xi Rp is a vector containing empirical (observed) frequencies of p words from a pre-specified dictionary, generated as Ni Multinomialp (Ni, Πi) , for each i [n] := {1, 2, . . . , n}. (1) Here Ni denotes the length (or the number of sampled words) in the ith document. The expected frequency vector Πi Rp is called the word-document vector, and is a convex combination of K word-topic vectors with weights corresponding to the allocation of K topics. Mathematically, one postulates that k=1 A k Wki (2) where A k = (A1k, . . . , Apk) is the word-topic vector for the kth topic and W i = (W1i, . . . , WKi) is the allocation of K topics in this ith document. From a probabilistic point of view, equation (2) has the conditional probability interpretation P(word j | document i) | {z } Πji k=1 P(word j | topic k) | {z } Ajk P(topic k | document i) | {z } Wki for each j [p], justified by Bayes theorem. As a result, the (expected) word-document frequency matrix Π = (Π1, . . . , Πn) Rp n has the following decomposition Π = AW = A(W1, . . . , Wn). (4) The entries of the columns of Π, A and W are probabilities, so they are non-negative and sum to one: j=1 Πji = 1, j=1 Ajk = 1, k=1 Wki = 1, for any k [K] and i [n]. (5) Since the number of topics, K, is typically much smaller than p and n, the matrix Π exhibits a low-rank structure. In the topic modeling literature, the main interest is to recover the matrix A when only the p n frequency matrix X = (X1, . . . , Xn) and the document lengths N1, . . . , Nn are observed. One direction of a large body of work is of Bayesian nature, and the most commonly used prior distribution on W is the Dirichlet distribution (Blei et al., 2003). Posterior inference on A is then typically conducted via variational inference (Blei et al., 2003), or sampling techniques involving MCMC-type solvers (Griffiths and Steyvers, 2004). We refer to Blei (2012) for a in-depth review. The computational intensive nature of Bayesian approaches, in high dimensions, motivated a separate line of recent work that develops efficient algorithms, with theoretical guarantees, from a frequentist perspective. Anandkumar et al. (2012) proposes an estimation method, with provable guarantees, that employs the third moments of Π via a Optimal Estimation of Sparse Topic Models tensor-decomposition. However, the success of this approach requires the topics not be correlated, and in many situations there is strong evidence suggesting the contrary (Blei and Lafferty, 2007; Li and Mc Callum, 2006). This motivated another line of work, similar in spirit with the work presented in this paper, which relies on the following separability condition on A, and allows for correlated topics. Assumption 1 (separability) For each topic k [K], there exists at least one word j such that Ajk > 0 and Ajℓ= 0 for any ℓ = k. The separability condition was first introduced by Donoho and Stodden (2004) to ensure uniqueness in the Non-negative Matrix Factorization (NMF) framework. Arora et al. (2012) introduce the separability condition to the topic model literature with the interpretation that, for each topic, there exist some words which only occur in this topic. These special words are called anchor words (Arora et al., 2012) and guarantee recovery of A, coupled with the following condition on W (Arora et al., 2012). Assumption 2 Assume the matrix n 1WW is strictly positive definite. Finding anchor words is the first step towards the recovery of the desired target A. Many algorithms are developed for this purpose, see, for instance, Arora et al. (2012); Recht et al. (2012); Arora et al. (2013); Ding et al. (2013); Ke and Wang (2017). All these works require the number of topics K be known, yet in practice K is rarely known. This motivated us Bing et al. (2020) to develop a method that estimates K consistently from the data under the incoherence Condition 3 on the topic-document matrix W given in Section 5. We defer to this for further discussion of other existing methods for finding anchor words. Despite the wide-spread interest and usage of topic models, most of the existing works are mainly devoted to the computational aspects of estimation, and relatively few works provide statistical guarantees for estimators of A. An exception is Arora et al. (2012, 2013) that provide upper bounds for the ℓ1-loss b A A 1 = Pp j=1 PK k=1 | b Ajk Ajk| of their estimator. Their analysis allows K, p and Ni to grow with n. Unfortunately, the convergence rate of their estimator is not optimal (Ke and Wang, 2017; Bing et al., 2020). The recent work of Ke and Wang (2017) is the first to establish the minimax lower bound for the estimator of A in topic models for known, fixed K. Their estimator provably achieves the minimax optimal rate under appropriate conditions. When K is allowed to grow with n, the minimax optimal rate of b A A 1 is established in Bing et al. (2020) and an optimal estimation procedure is proposed. Despite these recent advances, all the aforementioned results are established for a fully dense matrix A. In the modern big data era, the dictionary size p, the number of documents n and the number of topics K are large, as evidenced by real data in Section 6. Sparsity is likely to happen for large dictionaries (p) and when the number of topics K is large, one should expect that there are many words not occurring in all topics, that is, Ajk = P(word j | topic k) = 0 for some k. To the best of our knowledge, the minimax lower bound of b A A 1 in the topic model is unknown when the word-topic matrix A is element-wise sparse and no estimation procedure exists tailored to this scenario of sparse A and unknown K. Bing, Bunea and Wegkamp 1.1. Our Contributions We summarize our contributions in this paper. New minimax lower bound for b A A 1, when A is sparse. To understand the difference of estimating a dense A and a entry-wise sparse A in topic models, we first establish the minimax lower bound of estimators of A in Theorem 1 of Section 2. It shows that inf b A sup A PA b A A 1 c0 A 1 for some constants c0 > 0 and c1 (0, 1], by assuming N = N1 = N2 = = Nn for ease of presentation. The infimum is taken over all estimators b A while the supremum is over a prescribed parameter space A defined in (7) below. We have A 1 = K by (5) for all A. The term A 0 characterizes the overall sparsity of A, and the minimax rate of A becomes faster as A gets more sparse. When the rows Aj of non-anchor words j are dense in the sense Aj 0 = K, our result reduces to that in Bing et al. (2020). Our minimax lower bound is valid for all p, K, N and n and, to the best of our knowledge, the lower bound with dependency on the sparsity of A is new in the topic model literature. A new estimation procedure for sparse A. To the best of our knowledge, the only minimaxoptimal estimation procedure, for dense A and K large and unknown, is offered in Bing et al. (2020). While the procedure is computationally very fast, it is impractical to adjust it in simple ways in order to obtain a sparse estimator of A, that would hopefully be minimax-optimal. For instance, simply thresholding an estimator b A to encourage sparsity will require threshold levels that vary from row to row, resulting in too many tuning parameters. We propose a new estimation procedure in Section 3 that adapts to this unknown sparsity. To motivate our procedure, we start with the recovery of A in the noise-free case in Section 3.1, under Assumptions 1 and 2. Since several existing algorithms, including Bing et al. (2020), provably select the anchor words, we mainly focus on the estimation of the portion of A corresponding to non-anchor words. In the presence of noise, we propose our estimator in Section 3.2 and summarize the procedure in Algorithm 1. The new algorithm requires the solution of a quadratic program for each non-anchor row. Except for a ridge-type tuning parameter (which can often be set to zero), the procedure is devoid of any (further) tuning parameters. We give detailed comparisons with other methods in the topic model literature in Section 3.3. Adaptation to sparsity. We provide finite sample upper bounds on the ℓ1 loss of our new estimator in Section 4, valid for all p, K, n and N. As shown in Theorem 2, our estimator adapts to the unknown sparsity of A. To the best of our knowledge, our estimator is the first computationally fast estimator shown to adapt to the unknown sparsity of A. We further show in Corollary 3 that it is minimax optimal under reasonable scenarios. Simulation study. In Section 6, we provide experimental results based on both synthetic data and semi-synthetic data. We compare our new estimator with existing state-of-theart algorithms. The effect of sparsity on the estimation of A is verified in Section 6.1 for Optimal Estimation of Sparse Topic Models synthetic data, while we analyze two semi-synthetic data sets based on a corpus of NIPs articles and a corpus of New York Times (NYT) articles in Section 6.2. 1.2. Notation We introduce notation that we use throughout the paper. The integer set {1, . . . , n} is denoted by [n]. We use 1d to denote the d-dimensional vector with entries equal to 1 and use {e1, . . . , e K} to denote the canonical basis vectors in RK. For a generic set S, we denote |S| as its cardinality. For a generic vector v Rd, we let v q denote the vector ℓq norm, for q = 0, 1, 2, . . . , , and let supp(v) denote its support. We write v 2 = v for brevity. We denote by diag(v) a d d diagonal matrix with diagonal elements equal to v. For a generic matrix Q Rd m, we write Q 1 = P 1 i d,1 j m |Qij| and Q ,1 = max1 i d P 1 j m |Qij|. For the submatrix of Q, we let Qi and Q j be the ith row and jth column of Q. For a set S, we let QS and Q S denote its |S| m and d |S| submatrices. For a symmetric matrix Q, we denote its smallest eigenvalue by λmin(Q). We use an bn to denote there exists an absolute constant c > 0 such that an cbn, and write an bn if there exists two absolute constants c, c > 0 such that cbn an c bn. In the probabilities of our results, we might write c an as O(an) for some absolute constant c > 0. Finally, we write an = op(bn) if an/bn 0 with probability tending to 1. For a given word-topic matrix A, we let I := I(A) be the set of anchor words, and I be its partition relative to the K topics. That is, Ik := {j [p] : Ajk > 0, Ajℓ= 0 for all ℓ = k}, I := k=1 Ik, I := {I1, . . . , IK} . (6) We further write J := [p] \ I to denote the set of non-anchor words. For the convenience of our analysis, we assume all documents have the same number of sampled words, that is, N := N1 = = Nn, while our results can be extended to the general case. 2. Minimax lower bounds of b A A 1 In this short section, we establish the minimax lower bound of b A A 1 based on model (4) for any estimator b A of A over the parameter space A := n A Rp K + : A 1p = 1K, A satisfies Assumption 1 with A 0 n N o . (7) To prove the lower bound, it suffices to choose one particular W. We let W 0 = {e1, . . . , e1 | {z } n1 , e2, . . . , e2 | {z } n2 , . . . , e K, . . . , e K | {z } n K with PK k=1 nk = n and |nk nk | 1 for k, k [K]. Note that W 0 satisfies Assumption 2. Denote by PA the joint distribution of (X1, . . . , Xn) under model (4), for the chosen W 0. Theorem 1 Under topic model (4), assume (1). Then, there exist constants c0 > 0 and c1 (0, 1] such that inf b A sup A A PA b A A 1 c0 A 1 Bing, Bunea and Wegkamp The infimum is taken over all estimators b A of A. Remark 1 The estimate constructed in the next section achieves this lower bound in many scenarios. The lower bound rate of b A A 1 in (9) becomes faster as A 0 decreases, that is, if A becomes more sparse. Since each of the K columns of A sum to one, we always have A 1 = K. If the submatrix AJ, corresponding to the non-anchor words, is dense in the sense that AJ 0 = K|J|, Theorem 1 reduces to the result in (Bing et al., 2020, Theorem 6) for K = K(n), and the result in (Ke and Wang, 2017, Theorem 2.2) for fixed K. 3. Estimation of A In this section, we present our procedure for estimating A when a subset of anchor words L = SK k=1 Lk and its partition L = {L1, . . . , LK} are given. Moreover, we assume that, for each k [K], Lk Iπ(k) for some group permutation π : [K] [K]. For simplicity of presentation, we assume π is identity such that Lk Ik, for each k [K]. (10) We discuss methods for selecting L and L in Section 5. We start with the noise-free case, that is, we observe the expected word-document frequency matrix Π, in Section 3.1. Our strategy is as follows. Instead of inferring A from Π directly, we consider the scaled version B that has all its rows sum to 1, but critically retains the same sparsity pattern of A. The submatrix BL of B with rows corresponding to the representative set L is then immediate to find, as the i-th row of B equals the unit vector ek, for each i Lk. Next, we show that the remaining submatrix BLc of the sparse matrix B can be found, row-by-row, using a quadratic optimization over the probability simplex, making essential use of the fact that each row of B sums to one. Finally, we recover the original matrix A by columnwise renormalization of the obtained matrix B. Motivated by the developed algorithm in the noise-free case that recovers A, we propose the estimation procedure of A in Section 3.2 when we have access to X only, requiring slight modifications from the procedure for the noiseless case, including a hard-thresholding step to handle words with extremely low frequencies. The optimization problem is an efficient K K quadratic program, which gives it an edge over previous works such as Arora et al. (2013). 3.1. Recovery of A in the Noise-free Case Suppose that Π is given and write DΠ := n 1diag(Π1n) and DW := n 1diag(W1n). We recover A via its row-wisely normalized version B = D 1 Π ADW (11) as B enjoys the following three properties: supp(B) = supp(A), Bjk [0, 1], Bj 1 = 1, for all j [p], k [K]. (12) The row-wise sum-to-one property is critical in the later estimation step to adapt to the unknown sparsity of B (or equivalently, the sparsity of A). From L = {L1, . . . , LK} and (12), we can directly recover BL by setting Bi = ek, for any i Lk, k [K]. Optimal Estimation of Sparse Topic Models To recover BLc with Lc := [p] \ L, let R := D 1 Π ΘD 1 Π = B D 1 W 1 n WW D 1 W be a normalized version of Θ := n 1ΠΠ . Since R has the decomposition RLL = BLMB L , RLc L = BLc MB L . and Assumption 2 implies M is invertible, we arrive at the expressions M = (B L BL) 1B L RLLBL(B L BL) 1, (13) BLc = RLc LBL(B L BL) 1M 1. (14) Display (14) implies that MB Lc = (B L BL) 1B L RLLc := H, whence Mβ = h for each column β of B Lc (which is a row of BLc) and corresponding column h of H. Given M and H, the solution β of the equation Mβ = h is the minimizer of β Mβ 2β h over β 0 and β 1 = 1. This formulation will be used in the next subsection. After recovering B = (B L , B Lc), display (11) implies that A can be recovered by normalizing columns of DΠB to unit sums. 3.2. Estimation of A in the Noisy Case The estimation procedure of A follows the same idea of the noise-free case. We first estimate B defined in (11) by using the estimate b R = D 1 X bΘD 1 X (15) of R, based on DX = n 1diag(X1n) and the unbiased estimator Ni Ni 1Xi X i 1 Ni 1diag(Xi) (16) of the matrix Θ. We estimate BL by b Bi = ek, for any i Lk, k [K]. (17) c M = ( b B L b BL) 1 b B L b RLL b BL( b B L b BL) 1, b H = ( b B L b BL) 1 b B L b RLLc, (18) Bing, Bunea and Wegkamp we estimate row-by-row the remainder of the matrix B. We compute, for each j Lc, b Bj = 0, if (DX)jj 7 log(n p)/(n N), (19) b Bj = arg min β 0, β 1=1 β (c M + λIK)β 2β bh(j), otherwise, (20) where bh(j) is the corresponding column of b H. We set λ = 0 whenever c M is invertible and otherwise choose λ large enough such that c M + λIK is invertible. We detail the exact rate of λ when c M is not invertible in Section 4. Finally, we estimate A via normalizing DX b B to unit column sums. Remark 2 In our procedure, the hard-thresholding step in (19) is critical to obtain the optimal rate of the final estimator that does not rely on a lower bound condition on the word-frequencies. In contrast, the analysis of Arora et al. (2013) requires a lower bound for all word-frequencies. The thresholding level in (19) is carefully chosen from the element-wise control of the difference DX DΠ. For the reader s convenience, the estimation procedure is summarized in Algorithm 1. Algorithm 1 Sparse Topic Model solver (STM) Require: frequency data matrix X Rp n with document lengths N1, . . . , Nn; the partition of anchor words L 1: procedure 2: compute DX = n 1diag(X1n), bΘ from (16) and b R from (15) 3: compute b BL from (17) 4: compute c M and b H from (18) 5: solve b BLc from (19) (20) by using λ in (29) 6: compute b A by normalizing DX b B to unit column sums 7: return b A 3.3. Comparison with Existing Methods In this section, we provide comparisons between our estimation procedure and two existing methods, which are seemingly close to our procedure. 3.3.1. Comparison with Arora et al. (2013) This algorithm also estimates the same target B defined in (11) first. For a given set L of anchor words, there are two main differences for estimating B. 1. The algorithm in Arora et al. (2013) uses only one anchor word per topic to estimate B whereas our estimation procedure utilizes all anchor words. The benefit of using multiple anchor words per topic is substantial and verified in our simulation in Section 6. 2. The algorithm in Arora et al. (2013) is based on different quadratic programs with more parameters (p K versus K2). This makes it more computationally intensive Optimal Estimation of Sparse Topic Models and less accurate than the algorithm proposed here. This is verified in our simulations in Section 6.2. Specifically, write eΘ := D 1 Θ Θ = D 1 Θ A(n 1WW )A , Q := (n 1WW )A and e Q := D 1 Q Q with DΘ = diag(Θ1p) and DQ = diag(Q1p). Arora et al. (2013) utilizes the following observation eΘ = D 1 Θ AQ = D 1 Θ ADQ e Q = B e Q by noting that DΘ = DΠ and DQ = DW from (5). Based on the observation that eΘj Rp is a convex combination of eΘe L = e Q RK p for any j [p] \ e L, Arora et al. (2013) proposes to estimate Bj by solving b Bj = arg min β 0, β 1=1 beΘj β be Q where beΘj and be Q are the corresponding estimates of eΘj and e Q. The matrix e Q contains p K entries, while our estimation procedure in (20) only requires to estimate M RK K which has fewer parameters. The analysis of Arora et al. (2013) only holds for invertible estimates e Q e Q and the rate of the estimator from (21) depends on λmin( e Q e Q ). Our result holds as long as λmin(M) > 0 due to the ridge-type estimator in (20) and the rate of our estimator in (20) depends on λmin(M). Lemma 20 in the Appendix shows that λmin(M)λmin(n 1WW ) min k [K],i Ik A2 ik λmin( e Q e Q ) λmin(M). Since 0 < λmin(n 1WW ) 1/K as shown in Lemma 21 and 0 < mini Ik,k [K] A2 ik < 1, it is easy to see that λmin( e Q e Q ) could be much smaller comparing to λmin(M). This suggests that our procedure in (20) should be more accurate than (21), which is confirmed in our simulations in Section 6.2. 3.3.2. Comparison with Bing et al. (2020) Although both methods are based on the normalized second moment R, they differ significantly in estimating A. 1. The algorithm in Bing et al. (2020) uses R only to estimate the anchor words and relies on Θ for the estimation of B. Specifically, by observing Θ e L := ACAe L = AA 1 e L Ae LCAe L = AA 1 e L Θe Le L := e AΘe Le L with e L being a set that contains one anchor word per topic and Ae L RK K being a diagonal matrix, Bing et al. (2020) proposes to first estimate e A by bΘ e LbΩ. Here bΩis an estimator of Θ 1 e Le L obtained via solving a linear program. Instead of e A, we propose here to first estimate B defined in (11). This is a different scaled version of A with more desirable structures (12). Bing, Bunea and Wegkamp 2. Furthermore, our estimation of B is done row-by-row via quadratic programming instead of simple matrix multiplication. While this is more computationally expensive than estimating Θ 1 e Le L, it gives more accurate row-wise control of b B B. This control is the key to obtain a faster rate of b A A 1 that adapts to the unknown sparsity. 3. Finally, we emphasize that it is impractical to modify the estimator of Bing et al. (2020) to adapt to the sparsity of A. For instance, further thresholding the estimator of e A to encourage sparsity, will require the thresholding levels to vary row-by-row. This would involve too many tuning parameters. 4. Upper Bounds of b A A 1 To simplify notation and properly adjust the scales, for each j [p] and k [K], we define i=1 Πji, γk := K i=1 Wki, αj := p max 1 k K Ajk, (22) such that Pp j=1 µj = p, PK k=1 γk = K and p Pp j=1 αj p K from (5). For given set L satisfying (10), we further set µL = min i L µi, γ = max 1 k K γk, γ = min 1 k K γk, αL = min i L αi, ρj = αj/αL. (23) For future reference, we note that As our procedure depends whether the inverse of c M defined in (18) exists, we first give a critical bound on the control for the operator norm of c M M and provide insight on the choice of λ in (20). Lemma 3 Consider the topic model (4) under assumption 1 and min i L 1 n i=1 Πji c0 log(n p) N , min i L max 1 i n Πji c1 log2(n p) for some sufficiently large constants c0, c1 > 0. Then, with probability 1 O((n p) 1), we have p K log(n p) µLn N . (25) Remark 4 Arora et al. (2013) observe that the smallest frequency of anchor words plays an important role in the estimation of A. Condition (24) prevents the frequency of anchor words from being too small and also appeared in Bing et al. (2020). In case the matrix c M cannot be inverted, we select λ c M M op in (20). Lemma 3 thus suggests to choose λ as p K log(n p) µLn N , (26) Optimal Estimation of Sparse Topic Models for some absolute constant c > 0. Let b A be obtained via choosing λ as (26). The following theorem states the upper bound of b A A 1. Our procedure, its theoretical performance and its proof differ from those in Bing et al. (2020). While the proof borrows some preliminary lemmas from Bing et al. (2020), it requires a more refined analysis (see Lemmas 15 19 in the Appendix). We define sj = Aj 0 for j [p], s J = P j J sj and es J := P j Lc(αj/αL)sj = P j Lc ρjsj. Theorem 2 Under model (4), assume Assumptions 1, 2 with λmin := λK(n 1WW ) > 0 and (24). Then, with probability 1 O((n p) 1), we have b A A 1 I + II + III, n N + p K log(n p) max {s J + |I| |L|, es J} p log4(n p) max {s J + |I| |L|, es J} log(n p) Furthermore, if µLKn N (27) for some sufficiently large constant c2 > 0, then, with probability 1 O((n p) 1), b A obtained via λ = 0 enjoys the same rate with III = 0. Remark 5 The estimation error of A consists of three parts: I, II and III. Each part reflects errors made at different stages of our estimation procedure. Recall that b A first uses a hard-thresholding step in (19) and then relies on the estimates b B and DX of B and DΠ, respectively. The first term in I quantifies the error of DX DΠ, while the second term is due to the hard-thresholding step. The second term II is due to the error of b Bj Bj for those j [p] \ L that pass the test (19). Finally, III stems from the error incurred by the regularization choice of λ. Remark 6 Condition (27) is a lower bound for the smallest eigenvalue of the matrix n 1WW . If it holds, we can set λ = 0 with high probability and the rate of b A A 1 is improved by (at most) a factor of q K(γ/γ). Under (24), inequality (27) follows from Bing, Bunea and Wegkamp The following corollary provides sufficient conditions that guarantee that our estimator b A constructed in Section 3.2 achieves the optimal minimax rate. Corollary 3 (Attaining the minimax rate) Consider the topic model (4) with Assumptions 1 and 2. Suppose further that (i) A 0 log(n p) n N; (ii) γ γ, λmin 1/K; (iii) P j Lc ρjsj s J + |I| hold. Further, assume (24) holds with the condition on mini L n 1 Pn i=1 Πji replaced by min i L 1 n i=1 Πji c0 max 1, (s J + |I| |L|) log2(n p) Then, with probability 1 O((n p) 1), we have b A A 1 A 1 A 0 log(n p) Remark 7 (Conditions in Corollary 3) 1. Condition (i) is natural (up to the multiplicative logarithmic factor) as A 0 is the effective number of parameters to estimate while n N is the total sample size. 2. The first part of condition (ii), γ γ, requires that all topics have similar frequency. The ratio γ/γ is called the topic imbalance (Arora et al., 2012) and is expected to affect the estimation rate of A. 3. The second part of condition (ii), λmin 1/K, requires that topics are not too correlated. This is expected even for known W, playing the same role of the design matrix in the classical regression setting. 4. Condition (iii) puts a mild constraint on the word-topic matrix A between the selected anchor words and the other words (anchor and non-anchor). It is implied by sj P j Lc sj Aj min i L Ai 1, which in turn is implied by max 1 k K P {word j | topic k} k=1 P {word i | topic k} for any i L and j / L. The latter condition prevents the selected anchor words from being much less frequent than the other words. Optimal Estimation of Sparse Topic Models 5. Finally, condition (28) strengthens (24) by requiring a slightly larger lower bound for the frequency of selected anchor words. It is implied by N A 0 log2(n p) K2 (s J + |I| |L|) log2(n p) under (24). As discussed in Arora et al. (2012, 2013); Bing et al. (2020), usage of infrequent anchor words often leads to inaccurate estimation of A. 5. Practical Aspects of the Algorithm We discuss two practical concerns of our proposed algorithm in Section 3.2: 1. Selection of the number of topics K and subset of anchor words L 2. Data-driven choice of the tuning parameter λ in (26). 5.1. Selection of K and L Several existing algorithms with theoretical guarantees for finding anchor words in the topic model exist. Most methods rely on finding the vertices of a simplex structure, provided that the number of topics K is known beforehand. For known K, Recht et al. (2012) make clever use of the appropriately defined simplex structure on Θ = n 1ΠΠ implied by Assumption 1. However, their method needs to solve a linear program in dimension p p, which becomes rapidly computationally intractable. Arora et al. (2013) proposes a faster combinatorial algorithm which returns one anchor word per topic. The returned anchor words are shown to be close to anchor words within a specified tolerance level. Recently, Ke and Wang (2017) proposes another algorithm for finding anchor words by utilizing the simplex structure of the singular vectors of the word-document frequency matrix. However, their algorithm runs much slower than that of Arora et al. (2013). In practice, K is rarely known in advance. This situation is addressed in Bing et al. (2020). This work proposes a method that provably estimates K from the data, provided that the topic-document matrix W satisfies the following incoherence condition. Assumption 3 The inequality cos ( (Wi , Wj )) < ζi ζi for all 1 i = j K, holds, with ζi := Wi 2/ Wi 1. This additional assumption is not needed in the aforementioned work when K is known. When columns of W are i.i.d. samples of Dirichlet distribution, Assumption 3 holds with high probability under mild conditions on the hyper-parameter of Dirichlet distribution (Bing et al., 2020, Lemma 25 in the Supplement). In addition to the estimation of K, the algorithm in Bing et al. (2020) estimates both the set and the partition of all anchor words for each topic. This sets it further apart from Arora et al. (2013), as the latter only recovers one approximate anchor word for each topic. The algorithm of finding anchor words in Bing et al. (2020) is optimization-free and runs as fast as that in Arora et al. (2013). Hence, for selecting L, we can use Algorithm 4 in Arora et al. (2013) when K is known and Algorithm 2 in Bing et al. (2020) if K is known or needs to be estimated. Bing, Bunea and Wegkamp 5.2. Data-driven Choice of λ The precise rate for λ in (26) contains unknown quantities γ and µL. We proceed via crossvalidation over a specified grid. We prove in Lemma 22 in the Appendix that | mini L(DX)ii µL/p| = op( p log(n p)/(n N)) with DX = n 1diag(X1n). We recommend the following procedure for selecting λ. For some constant c0 (our empirical study suggests the choice c0 = 0.01), we take t = arg min n t {0, 1, 2, . . .} : c M + λ(t)IK is invertible o , λ(t) = t c0 K K log(n p) [mini L(DX)ii]n 1 !1/2 . (29) 6. Experimental Results In this section, we report on the empirical performance of the new algorithm proposed and compare it with existing competitors on both synthetic and semi-synthetic data. Notation. Recall that n denotes the number of documents, N denotes the number of words drawn from each document, p denotes the dictionary size, K denotes the number of topics, and |Ik| denotes the cardinality of anchor words for topic k. We write ξ := mink [K],i Ik Aik for the minimal frequency of anchor words. Larger values of ξ are more favorable for estimation. Methodology. For competing algorithms, we consider Latent Dirichlet Allocation (LDA) (Blei et al., 2003)1, the algorithm (AWR) proposed in Arora et al. (2013) and the TOP algorithm proposed in Bing et al. (2020). We use the default values of hyper-parameters for all algorithms. Both LDA and AWR need to specify the number of topics K. In our proposed Algorithm 1 (STM), we choose λ according to (29) and we select the anchor words either via AWR with specified K or via TOP (Bing et al., 2020), and proceed with the estimation of A as described in Section 3.2. We name the resulting estimates STM-AWR and STM-TOP, respectively. 6.1. Synthetic Data In this section, we use synthetic data to demonstrate the effect of the sparsity of A on the estimation error b A A 1/K for AWR, TOP, STM-AWR and STM-TOP. Both AWR and STM-AWR are given the correct K, while TOP and STM-TOP estimate K. To simulate synthetic data, we generate A satisfying Assumption 1 by the following strategy. Generate anchor words by Aik := ξ for any i Ik and k [K]. Each entry of non-anchor words is sampled from Uniform(0, 1). 1. We use the code of LDA from Riddell et al. (2016) implemented via the fast collapsed Gibbs sampling with the default of 1,000 iterations Optimal Estimation of Sparse Topic Models Normalize each sub-column AJk A k to have sum 1 P i I Aik. Draw columns of W from the symmetric Dirichlet distribution with parameter 0.3. Simulate N words from Multinomialp(N; AW). To change the sparsity of A, we randomly set s = ηK entries of each row in AJ to zero, for a given sparsity proportion η (0, 1). Normalizing the thresholded matrix gives A(η) and the sparsity of A(η) is calculated as s(η) = A(η) 0/(p K). We set N = 1500, p = n = 1000, K = 20, |Ik| = p/200 and ξ = K/p. For each η {0, 0.1, 0.2, . . . , 0.9}, we generate 50 data sets based on A(η) and report in Figure 1 the average estimation errors b A A(η) 1/K of the four different algorithms. The x-axis represents the corresponding sparsity level s(η). Since the selected anchor words are up to a group permutation, we align the columns of b A before calculating the estimation error. Conclusion. STM-TOP has the best performance overall. Both STM-AWR and STM-TOP perform increasingly better as A becomes sparser. The performance of AWR improves only if the sparsity level is sufficiently large, say s(η) < 0.5. As expected, TOP does not adapt to the sparsity. 0.9 0.82 0.72 0.64 0.55 0.46 0.36 0.28 0.18 0.1 TOP STM-TOP AWR STM-AWR Figure 1: Plots of the estimation error b A A(η) 1/K for η {0, 0.1, 0.2, . . . , 0.9}. 6.2. Semi-synthetic Data We evaluate two real-world data sets, a corpus of NIPs articles and a corpus of New York Times (NYT) articles (Dheeru and Karra Taniskidou, 2017). Following (Arora et al., 2013), 1. We removed common stopping words and rare words occurring in less than 150 documents. 2. For each preprocessed data set, we apply LDA with K = 100 and obtain an estimated word-topic matrix A(0). 3. For each document i [n], we generate the topics Wi from a specified distribution. 4. We sample N words from Multinomialp(N; A(0)W). Bing, Bunea and Wegkamp 6.2.1. NIPs corpus After this preprocessing stop, the NIPs data set consists of n = 1, 500 documents with dictionary size p = 1, 253 and mean document length 847. 1. We set N = 850 and vary n {2000, 4000, 6000, 8000, 10000} for generating semisynthetic data. 2. While the estimated A(0) from LDA does not have exact zero entries, we calculate the approximate sparsity level of A by sparsity = 1 p K k=1 1 Ajk 10 3p 1 0.696. (30) The calculated sparsity indicates that the posterior A(0) from LDA has many entries close to 0. 3. As in Arora et al. (2013), we manually add |Ik| = m anchor words for each topic with m {1, 5}. After adding m anchor words, we re-normalize the columns to obtain A(m). 4. The columns of W are generated from the symmetric Dirichlet distribution with parameter 0.03. We sample N words from Multinomialp(N; A(m)W). For each combination of n and m, we generate 20 data sets and the average estimation errors b A A 1/K of different algorithms are shown in Figure 2. The bars represent the standard deviations across 20 repetitions. Again, LDA, AWR and STM-AWR are given the correct K, while TOP and STM-TOP estimate K. Conclusion. STM-TOP has best overall performance and STM-AWR has the second best result. LDA is dominated by all other algorithms, although increasing the number of iterations might boost the performance of LDA. Both STM-TOP and TOP have better performance when one has more anchor words. Dirichlet 0.03 m = 1 Dirichlet 0.03 m = 5 2000 4000 6000 8000 10000 2000 4000 6000 8000 10000 Figure 2: Plots of the estimation errors b A A 1/K We also investigate the effect of the correlation among topics on the estimation of A. Following Arora et al. (2013), we simulate W from a log-normal distribution with block diagonal covariance matrix and different within-block correlation. To construct the block Optimal Estimation of Sparse Topic Models diagonal covariance structure, we divide 100 topics into 10 groups. For each group, the offdiagonal elements of the covariance matrix of topics is set to ρ, while the diagonal entries are set to 1. The parameter ρ {0.03, 0.3} reflects the magnitude of correlation among topics. We take the case m = 1 and the estimation errors of the algorithms are shown in Figure 3. Conclusion. STM-TOP has the best performance in all settings. As long as the number of documents n is large, STM-AWR is more robust to the correlation among topics than AWR. LDA and AWR are comparable. Logistic Normal rho = 0.03 Logistic Normal rho = 0.3 2000 4000 6000 8000 10000 2000 4000 6000 8000 10000 Figure 3: Plots of the estimation errors b A A 1/K for ρ = 0.03 and ρ = 0.3. Finally, we report the running times of the various algorithms in Table 1. As one can see, LDA is the slowest and does not scale well with n. On the other hand, TOP is the fastest and the other three algorithms (AWR, STM-AWR and STM-TOP) have comparable running times. Table 1: Running time (seconds) of different algorithms. TOP STM-TOP AWR STM-AWR LDA n = 2000 35.2 614.3 393.8 500.7 1918.7 n = 4000 32.8 611.2 447.0 466.2 3724.5 n = 6000 41.8 610.9 455.0 416.7 5616.6 n = 8000 44.7 605.1 458.4 463.5 7358.8 n = 10000 52.0 609.0 482.8 517.9 9130.6 6.2.2. New York Times (NYT) data set After the same preprocessing step, the NYT data set contains n = 299, 419 documents with dictionary size p = 3, 079 and mean document length 210. We choose N = 300 and vary n {30000, 40000, . . . , 70000}. The estimated A(0) from LDA has sparsity 0.679 calculated from (30). As in the NIPs corpus earlier, we manually add |Ik| = m {1, 5} anchor words per topic. For each m and n, we generate 20 data sets where columns of W are generated from the symmetric Dirichlet distribution with parameter 0.03. The average estimation errors b A A 1/K are shown in Figure 4. We also study the effect of correlation among topics on the estimation errors for the case m = 1 and with the columns of W generated from the log-normal distribution with block diagonal correlation and ρ = {0.1, 0.3}. Bing, Bunea and Wegkamp The result is shown in Figure 5. Conclusion. From Figure 4, in the presence of anchor words, we see that STM-TOP has the best overall performance and STM-AWR outperforms AWR. The errors of STM-TOP and TOP decrease if more anchor words are introduced. In Figure 5, STM-TOP outperforms the other algorithms in all cases. TOP has the second best performance while the other three algorithms are comparable. Dirichlet 0.03 m = 1 Dirichlet 0.03 m = 5 30000 40000 50000 60000 70000 30000 40000 50000 60000 70000 Figure 4: Plots of the estimation errors b A A 1/K Logistic Normal rho = 0.1 Logistic Normal rho = 0.3 30000 40000 50000 60000 70000 30000 40000 50000 60000 70000 Figure 5: Plots of the estimation errors b A A 1/K for ρ = 0.1 and ρ = 0.3. 7. Conclusion We have studied estimation of the word-topic matrix A when it is possibly entry-wise sparse and the number of topics K is unknown, under the separability condition. A new minimax lower bound of b A A 1 is derived and a computationally efficient procedure (STM) for estimating A is proposed. The estimator provably achieves the minimax lower bound (modulo a logarithmic factor) and adapts to the unknown sparsity. Extensive simulations corroborate the superior performance of our new estimation procedure in tandem with the existing algorithm in Bing et al. (2020) for selecting anchor words. Acknowledgments Bunea and Wegkamp are supported in part by NSF grants DMS-1712709 and DMS-2015195. Optimal Estimation of Sparse Topic Models Appendix A. Proofs The proofs rely on some lemmas in Bing et al. (2020). For the reader s convenience, we restate them in Section A.2 and use similar notations for simplicity. A.1. Notations and two useful expressions From the topic model specifications, the matrices Π, A and W are all scaled as j=1 Πji = 1, j=1 Ajk = 1, k=1 Wki = 1 (31) for any 1 j p, 1 i n and 1 k K. In order to adjust their scales properly, we denote mj = p max 1 i n Πji, µj = p i=1 Πji, αj = p max 1 k K Ajk, γk = K i=1 Wki, (32) so that K X k=1 γk = K, j=1 µj = p. (33) Recall that ρj = αj/αL and es J := P j Lc ρjsj. We define t=1 Xjt, for all 1 j p. (34) We write d := n p throughout the proof. Finally, note that Assumption 3 implies K < n. From model specifications (31) and (32), we derive three useful facts that are later repeatedly invoked. (a) For any j [p], by using (32), i=1 Πji = p k=1 Ajk Wki = p k=1 Ajkγk p K k=1 Ajk γ µj αj. (35) In particular, for any j Ik with any k [K], k=1 Ajk Wki = p K Ajkγk (32) = αjγk (b) For any j [p], mj (32) = p max 1 i n Πji = p max 1 i n k=1 Ajk Wki p max 1 k K Ajk (32) = αj µj mj αj, (37) by using 0 Wki 1 and P k Wki = 1 for any k [K] and i [n]. Bing, Bunea and Wegkamp (c) For any j [p] and k [K], define a=1 Aja Cak with C = n 1WW . (38) a=1 Aja Cak = a=1 Cak = 1 a=1 Wkt Wat (31) = 1 t=1 Wkt (32) = γk A.2. Useful results from Bing et al. (2020) Let εji := Xji Πji, for 1 i n and 1 j p and assume N1 = . . . = Nn = N for ease of presentation since similar results for different N can be derived by using the same arguments. Lemma 8 With probability 1 2d 1, we have np N + 4 log(d) n N , uniformly in 1 j p. (40) If min1 j p µj/p log(d)/(n N) holds, with probability 1 2d 1, np N , uniformly in 1 j p. Lemma 9 Recall Θ = n 1ΠΠ . With probability 1 2d 1, 6mℓΘjℓlog(d) np N + 2mℓlog(d) np N , uniformly in 1 j, ℓ p. Lemma 10 If min1 j p µj/p 2 log(d)/(3N), then with probability 1 4d 1, i=1 (εjiεℓi E[εjiεℓi]) Θjℓ+ (µj + µℓ) log(d) n N2 + 4d 3, holds, uniformly in 1 j, ℓ p. Lemma 11 Assume model (4) and min 1 j p 1 n i=1 Πji c log(d) for some sufficiently large constant c > 0. With probability greater than 1 O(d 1), |bΘjℓ Θjℓ| c0ηjℓ, | b Rjℓ Rjℓ| c1δjℓ, for all 1 j, ℓ p Optimal Estimation of Sparse Topic Models for some constant c0, c1 > 0, where N + (mj + mℓ) δjℓ:= p2ηjℓ µjµℓ + p2Θjℓ A.3. Proof of Theorem 1 in Section 2 We first choose {I1, . . . , IK} such that ||Ik| |Ik || 1 for any k, k [K]. This also implies |Ik| 2|I|/K. Further choose the integer set {g1, . . . , g K} such that PK k=1 gk = s J and |gk gk | 1 for any k, k [K], further implying gk 2s J/K. We first choose A(0). Let 1|I1| 1|I2| ... 1|IK| e1g1 e1g2 e1g K where, for any k [K], e1gk = 1gk if gk = |J| and e1gk = (1 gk, 0 ) otherwise. We then set A(0) = e A(0) 1 |I1|+g1 1 |I2|+g2 ... 1 |IK|+g K We start by constructing a set of hypotheses of A. Assume |Ik|+gk is even for 1 k K. Let M := {0, 1}(|I|+s J)/2. Following the Varshamov-Gilbert bound in Lemma 2.9 in Tsybakov (2009), there exists w(j) M for j = 0, 1, . . . , T, such that w(i) w(j) 1 |I| + s J 16 , for any 0 i = j T, (45) with w(0) = 0 and log(T) log(2) 16 (|I| + s J). (46) For each w(j) M, we divide it into K chunks as w(j) = w(j) 1 , w(j) 2 , . . . , w(j) K with w(j) k R(|Ik|+gk)/2. For each w(j) k , we write ew(j) k Rp as its augumented counterpart such that Bing, Bunea and Wegkamp [ ew(j) k ]Sk = [w(j) k , w(j) k ] and [ ew(j) k ]ℓ= 0 for any ℓ/ Sk, where Sk := supp(A(0) k ). For 1 j T, we choose A(j) as A(j) = A(0) + γ h ew(j) 1 ew(j) K log(2) 45(1 + c0) n N(|I| + s J) (48) for some constant c0 > 0. Under |I| + s J n N, it is easy to verify that A(j) A(|I|, s J) for all 0 j T. In order to apply Theorem 2.5 in Tsybakov (2009), we need to check the following conditions: (a) KL(PA(j), PA(0)) log(T)/16, for each i = 1, . . . , T. (b) A(i) A(j) 1 c1K p (|I| + s J)/(n N), for 0 i < j T and some constant c1 > 0. We first show part (a). Fix 1 j T and choose D(j) = A(j)W 0 where W 0 is defined in (8). Let mk be the set such that |mk| = nk and W 0 i = ek, for all i mk and k [K]. Since |Ik| + gk 2(|I| + s J)/K, it follows that k=1 A(0) ℓk W 0 ki = 1/(|Ik| + gk) 2 1K/(|I| + s J), if ℓ Sk, i mk, k [K] 0, otherwise . (49) for any i [n] and ℓ [p]. Similarly, we have D(j) ℓi D(0) ℓi = γ k=1 [ ew(j) k ]ℓW 0 ki γ, if ℓ Sk, i mk, k [K] 0, otherwise . (50) Thus, by |I| + s J n N, we have max (ℓ,i) T c |D(j) ℓi D(0) ℓi | D(0) ℓi 2γ |I| + s J K < 1, for any 1 j T Optimal Estimation of Sparse Topic Models where T := {(ℓ, i) [p] [n] : D(0) ℓi = 0} and T c := [p] [n] \ T . Observe that D(j) ℓi = 0 for any (ℓ, i) T and 1 j T, and invoke Lemma 12 to get KL(PA(j), PA(0)) (1 + c0) N X |D(j) ℓi D(0) ℓi |2 ℓ Sk γ2(|Ik| + gk) = (1 + c0) N i mk γ2(|Ik| + gk)2 (by |Sk| = |Ik| + gk) 4 (1 + c0) Nnγ2 (|I| + s J)2 (46) 1 16 log T. The second inequality uses (49) and (50) and the fourth line uses |Ik| + gk 2(|I| + s J)/K. This verifies (a). To show (b), (47) yields A(j) A(ℓ) 1 = A(j) k A(ℓ) k 1 w(j) k w(ℓ) k 1 = 2γ w(j) w(ℓ) 1 (45) γ 8(|I| + s J). After we plug this into the expression of γ, we obtain (b). Invoking (Tsybakov, 2009, Theorem 2.5) concludes the proof when |Ik|+gk is even for all k [K]. The complementary case is easy to derive with slight modifications. Specifically, denote by Sodd := {1 k K : |Ik| + gk is odd}. Then we change M := {0, 1}Card with |Ik| + gk 1 For each w(j), we write it as w(j) = (w(j) 1 , . . . , w(j) K ) and each w(j) k has length (|Ik|+gk 1)/2 if k Sodd and (|Ik|+gk)/2 otherwise. We then construct A(j) k = A(0) k +γ ew(j) k where ew(j) k Rp is the same augumented ccounterpart of w(j) k . The result follows from the same arguments and the proof is complete. Bing, Bunea and Wegkamp The upper bound of Kullback-Leibler divergence between two multinomial distributions is studied in (Ke and Wang, 2017, Lemma 6.7). We use the following modification of their bound. Lemma 12 Let D and D be two p n matrices such that each column of them is a weight vector. Under model (4), let P and P be the probability measures associated with D and D , respectively. Let T be the set such that T := (j, i) [p] [n] : Dji = D ji = 0 Let T c := ([p] [n]) \ T and η = max (j,i) T c |D ji Dji| Dji and assume η < 1. There exists a universal constant c0 > 0 such that KL(P , P) (1 + c0η)N X |D ji Dji|2 Proof With the convention that 0/0 = 1, we have KL(P , P) = N j=1 D ji log D ji Dji (j,i) T c D ji log (1 + ηji) . Then the proof follows by the same arugments in Ke and Wang (2017). A.4. Proofs of Section 4 We first give the proof of Lemma 3 and then prove our main Theorem 2. A.4.1. Proof of Lemma 3 From (18), we have c Mab = 1 |La||Lb| Further notice that 1 |La||Lb| i La,j Lb Rij = Mab. Using the fact that Q op Q ,1 for any symmetric matrix Q, yields c M M op c M M ,1 = max 1 k K i La,j Lb ( b Rij Rij) max 1 k K max i Lk a=1 max j La | b Rij Rij|. Optimal Estimation of Sparse Topic Models Invoking Lemma 11 for all i, j L under condition (24), with probability 1 O(d 1), we have c M M op max 1 k K max i Lk a=1 max j La δij. The result follows by invoking Lemma 14. A.4.2. Proof of Theorem 2 As our estimation procedure uses a thresholding step in (19), we first define i=1 Πji < log(d) i=1 Xji < 7 log(d) and write T c := [p] \ T and b T c := [p] \ b T. Recall that our final estimator b A is obtained by normalizing b B = DX b B to unit column sums with DX = diag (bu1/p, . . . , bup/p) where buj/p is defined in (34) for 1 j p. For any j [p] and k [K], we have b Ajk Ajk = b Bjk b Bk 1 Bjk Bk 1 where B = DΠB. Summing over 1 j p yields b Ak Ak 1 = b Bk 1 b Bjk Bk 1 + b Bjk Bjk | Bk 1 b Bk 1| Bk 1 + b Bk Bk 1 2 b Bk Bk 1 γk b Bk Bk 1. In the last equality, we use j=1 Ajk 1 n t=1 Wkt = γk Bing, Bunea and Wegkamp by observing that B = DΠB = ADW . Further recall that b Bjk = bµj b Bjk/p for j [p] and b Bj = 0 for any j b T. We have b Ak Ak 1 = 2K b Bjk |bµj µj| p | b Bjk Bjk| b Bjk |bµj µj| p | b Bjk Bjk| + X b Bjk |bµj µj| p | b Bjk Bjk| + 2 X We use Ajk = (µj/p)Bjk(K/γk) in the last line. Summing over 1 k K gives p b Bj Bj 1 p b Bj Bj 1 j b T Aj 1. We use b Bjk 1 = 1 in the first line and the fact b Bj = Bj for all j L in the second line. Next, we study the three terms on the right hand side. To bound the first term, we observe that P{T b T} = P{ b T c T c} = 1 2d 1, by Lemma 13. This fact, the second part of Lemma 8 and the inequality min j T c µj Further, the Cauchy-Schwarz inequality and P j T c µj/p Pp j=1 µj/p = 1 yield |T c| log(d) 1 4d 1. (53) Optimal Estimation of Sparse Topic Models To bound the third term, Lemma 13 yields j b T Aj 1 20K| b T| log(d) γn N 20Kp log(d) 1 2d 1. (54) The proof of the upper bound for the second term is more involved. We work on the intersection of the event { b T c T c} with EM := n λmin(c M + λIK) λmin(M) + λ c M M op λmin(M) o to establish an upper bound for X p b Bj Bj 1. Lemma 3 and the choice of λ guarantee P(EM) = 1 O(d 1). Pick any j T c \L and recall that b Bj is estimated via (20). Starting with b B j (c M + λIK) b Bj 2 b B j bh(j) B j (c M + λIK) 1Bj 2B j bh(j), standard arguments yield ( (j)) (c M + λIK) (j) 2 ( (j)) (bh(j) c MBj λBj ) 2 n |( (j)) (bh(j) h(j))| + |( (j)) (h(j) c MBj )| + λ (j) Bj o by writing (j) := b Bj Bj . Hence, on the event EM, we have (j) 2 λmin(M) ( |( (j)) (bh(j) h(j))| (j) + |( (j)) (h(j) c MBj )| (j) + λ Bj Let sj = Bj 0 and Sj = supp(Bj ). Since 0 = Bj 1 b Bj 1 = Bj Sj 1 b Bj Sj 1 b Bj Sc j 1 (j) Sj 1 (j) Sc j 1, we have (j) 1 2 (j) Sj 1 2 sj (j) Sj 2 sj (j) . (56) Combination of (56) with (55) gives X p b Bj Bj 1 ( |( (j)) (bh(j) h(j))| (j) + |( (j)) (h(j) c MBj )| (j) + λ Bj Bing, Bunea and Wegkamp The results of Lemmas 15 and 16 and the inequality λmin(M) λmin K2/γ2 give X p b Bj Bj 1 max {s J + |I| |L|, es J} max {s J + |I| |L|, es J} log(d) Finally, (53), (54) and (58) together imply that n N + p K log(d) max {s J + |I| |L|, es J} max {s J + |I| |L|, es J} log(d) holds with probability 1 O(d 1). After we invoke the result of Lemma 17, the proof of the first result follows. The second result follows by setting λ = 0 in (59) as P n λmin(c M) λmin(M) c M M op cλmin(M) o 1 O(d 1). A.5. Lemmas used in the proof of Theorem 2 Lemma 13 Let T and b T be defined in (51). With probability 1 2d 1, we have T b T and, for any 1 j p, if 1 n i=1 Xji < 7 log(d) we further have Aj 1 19K log(d) Proof Recall that Xji = Πji + εji such that bµj/p = µj/p + n 1 Pn i=1 εji. We work on the event np N + 4 log(d) which holds with probability 1 2d 1 from Lemma 8. Since, for any j T, p + |bµj µj| np N + 4 log(d) n N < 7 log(d) Optimal Estimation of Sparse Topic Models we have j b T, hence T b T. To prove the second statement, for any j such that buj/p 7 log(d)/(n N), we have For this j, since np N + 4 log(d) P(E1) = 1 2d 1, we have, with probability 1 2d 1, np N + 11 log(d) which implies µj/p 19 log(d)/(n N). The result then follows by using (35). Lemma 14 Let δij be defined in (43) for any i, j [p]. Let ψjk be defined in (38) for any j [p], k [K]. Under condition (24), we have max 1 k K max i Lk a=1 max j La δij K and, for any k [K] and j [p], µLn N max i Lk K max ℓ La δiℓ K ρjψjk log(d) For any j [p], if 1 n t=1 Πjt c log(d) for some constant c > 0, we further have p max i Lk δij (1 + ρj) K log(d) (1 + ρj) ψjk log(d) Proof For any i Lk and j La with a, k [K], we start with the expressions in (42) and (43). Note that (35), (37) and (64) imply p 2c1 log2(d) p , µi + µj t=1 (Πit + Πjt) 2c0 log(d) Bing, Bunea and Wegkamp Also, using mi αi from (37) and µi = αiγk/K from (36), together with t=1 Aik Wkt Wat Aja = Aik Aja Cka (22) = αiαj Cka αiγk + αjγa Kn N3 + Cka Using the Cauchy-Schwarz inequality and the fact that a=1 Cka = 1 a=1 Wkt Wat = 1 t=1 Wkt (22) = γk we further have, after a bit of algebra, a=1 max j La δij K2 p log(d) αLγn N + p K log(d) p log(d) αLγKn N + p3K log4(d) where we also use αi αL, γa γ. Note that the first term on the right-hand side dominates the other three as p K2 log(d) c0 , p2K log3(d) α2 LγN2 p log2(d) c0αLN 1 c0c1 by using K < n and the following observation from (24), (37) min i L mi p c1 log2(d) N , αL p K αLγ p c0 log(d) The first result then follows by using µL = αLγ/K from (36). To prove the second result, we argue K max i Lk,ℓ La δiℓ Ckap log(d) p3γ log4(d) Kα3 Ln N3 + Cka αLγn N + p log(d) Ckap log(d) αLn N + Cka αLγn N + p log(d) ρjψjk K log(d) ρjψjk K log(d) γγkn N + Aj 1 p K log(d) ρjψjk K log(d) γγkn N + Aj 1 p K log(d) Optimal Estimation of Sparse Topic Models The second line follows from (62), the third line uses p3γ log4(d) Kα3 Ln N3 = αLKN γp log2(d) and the fourth line uses the Cauchy-Schwarz inequality together with ρj = αj/αL, Cka γk/K and (38). Since s ρjψjk K log(d) ρjψjk log(d) The result now follows after observing that Aj 1 p K log(d) µLn N Aj 1 p K log(d) γkn N (by K < n). We proceed to prove the third result. Fix any j [p] and i Lk with k [K] and note that (60) still holds by replacing the constants 2 by 1. Since Θij = Aik 1 n a=1 Aja Wat (38) = Aikψjk, (65) and mj αj (37), µi = αiγk/K from (36) and ρj = αj/αL, the expressions of (42) and (43) yield (1 + ρj) ψjk log(d) n N + (1 + ρj) K log(d) p p2 log4(d) α2 Ln N3 + Kψjk αLγkn N + Kψjk We now simplify the three terms in the second line. Since a=1 Aja Wat Wkt 1 a=1 Aja Wat = 1 t=1 Πjt = µj Bing, Bunea and Wegkamp we have Kψjk Also note that (72) yields αjψjk log(d) ρjψjk log(d) Finally, by using p2 log4(d) α2 Ln N3 p log2(d) c1αLn N2 γk log(d) from (64) and γk γ, we can upper bound maxi Lk(µj/p)δij by (1 + ρj) K log(d) (1 + ρj) ψjk log(d) γkn N , (66) which completes the proof. The following three lemmas provide upper bounds for the three terms on the right-handside of (57). Recall that ρj = αj/αL, es J = P j Lc ρjsj and ψjk = PK a=1 Aja Cak for any j [p] and k [K]. Lemma 15 Under conditions of Theorem 2, with probability 1 O(d 1), p |( (j)) (bh(j) h(j))| max {s J + |I| |L|, es J} max {s J + |I| |L|, es J} log(d) Proof Pick any j T c \ L. From the definition of bh(j) in (18), we have p |( (j)) (bh(j) h(j))| k=1 | (j) k | µj bh(j) k h(j) k k=1 | (j) k | µj k=1 | (j) k | µj p max i Lk δjℓ, Optimal Estimation of Sparse Topic Models with probability 1 O(d 1), invoking Lemma 11 and inequality (52). Application of the third part of Lemma 14 further gives p |( (j)) (bh(j) h(j))| c1 (j) T (jk) 2 2 #1/2 + c1 (j) 1 max 1 k K T (jk) 1 (56) c1 (j) T (jk) 2 2 #1/2 + 2c1 sj (j) max 1 k K T (jk) 1 , T (jk) 1 = (1 + ρj) K log(d) µLn N3 (67) T (jk) 2 = K (1 + ρj) ψjk log(d) γkn N . (68) Hence, by the Cauchy-Schwarz inequality, p |( (j)) (bh(j) h(j))| j T c\L (1 + ρj)sj K2ψjk log(d) γ2 kn N + µj K log(d) j T c\L sj max k [K] T (jk) 1 . We conclude our proof by observing that X j T c\L sj s J + |I| |L| j=1 ψjk (39) K2 γkn N K2 log(d) by Pp j=1 µj = p. Lemma 16 Under conditions of Theorem 2, with probability 1 O(d 1), p |( (j)) (h(j) c MBj )| (j) µLn N3 + Kes J log(d) es J log(d) Bing, Bunea and Wegkamp Proof We work on the event n | b Riℓ Riℓ| c1δiℓ o Lemmas 8, 11 and (24) guarantee that P(E) 1 O(d 1). The event E and (24) further imply p , for all i L, (70) for some constants c, c > 0 and (64). Pick any j T c \ L and k [K]. Observe that h(j) = MBj and µj Aja γa K . (71) From (15) and (18), we have (c Mk Mk ) Bj = 1 Ajaγa |Lk||La| Ajaγa |Lk||La| buibuℓ p2Θiℓ Ajaγa |Lk||La| p2(bΘiℓ Θiℓ) Ajaγa |Lk||La| (µiµℓ bµibµℓ) := Rem1(jk) + Rem2(jk). For Rem2(jk), we find K 1 |Lk||La| [µi(µℓ bµℓ) + (µi bµi)bµℓ] K 1 |Lk||La| b Riℓ max i Lk,ℓ La µℓ + |bµi µi| ! 1 |Lk||La| i Lk,ℓ La (Riℓ+ c1δiℓ) K max i Lk,ℓ La δiℓ. Optimal Estimation of Sparse Topic Models We use (70) in the second line, the definition of the event E together with (36) in the third line and µiµℓ = K2Cka (follows from (36) and (61)) in the fourth line. We bound the first term on the right as αLγn N = ψjk K ρjψjk log(d) a=1 Aja Wat Wkt Aj 1 n a=1 Wat Wkt = αjγk Invoking the second result of Lemma 14 gives ρjψjk log(d) ρj Aj 1 log(d) γkn N . (73) We proceed to bound Rem1(jk). Recalling (71) and µℓ/p = Aℓaγa/K from (36), we find Aja |Lk||La| p(bΘiℓ Θiℓ) max i Lk p µi Aja Aℓa (bΘiℓ Θiℓ) Since, for any i Lk, j La, bΘiℓ Θiℓ= N N 1 n Aik W k εℓ+ 1 n Aℓa W a εi n E h ε i εℓ i cf. (Bing et al., 2020, page 11 in the Supplement), we obtain Rem1(jk) max i Lk p µi a=1 Aja 1 n n E h ε i εℓ i + Ajk := Rem11(jk) + Rem12(jk) + Rem13(jk) + Rem14(jk). (74) Bing, Bunea and Wegkamp In the sequel, we provides separate bounds the each of the four terms. We start with the last term and obtain on the event E Rem14(jk) max i Lk p µi µLn N3 . (75) by recalling that ρj = αj/αL. Observing that P a Aja Wat = Πjt, with probability 1 O(d 1), the second term can be upper bounded by using Lemma 9 as Rem12(jk) = max i Lk p µi max i Lk p µi 6mjΘji log(d) np N + 2mj log(d) 6αjψjk log(d) αin N + 2αj log(d) 6ρjψjk log(d) n N + 2ρj K log(d) where we also use(37) and (65) to derive the third line and use (36) to arrive at the last line. The upper bounds of Rem11(jk) and Rem13(jk) are proved in Lemmas 18 and 19. Combination of (73), (75), (76), (77) and (83) yields (c Mk Mk ) Bj ρj µLn N3 + 2ρj K log(d) ρjψjk log(d) ρj Aj 1 log(d) p ρj K log(d) with probability 1 O(d 1). Next we use similar arguments as in the proof of Lemma 15. Analogous to (67) (68), we define ρj R(k) 1 = µLn N3 + 2K log(d) ρj R(jk) 2 = K Aj 1 log(d) We can obtain ( (j)) (c M M)Bj (j) ρj R(jk) 2 2 #1/2 + ρj sj (j) max 1 k K R(k) 1 which, by the Cauchy-Schwarz inequality, further gives p |( (j)) (h(j) c MBj | (j) p 1 2 + es J max k [K] R(k) 1 . Optimal Estimation of Sparse Topic Models Finally, we calculate P j T c\L PK k=1(R(jk) 2 )2 as k=1 (R(jk) 2 )2 K2ψjk log(d) γγkn N + Aj 1 log(d) p ρj K log(d) We use (33), (39) and Pp j=1 Aj 1 = K to arrive at the last line. Lemma 17 Let λ be chosen as in (26). With probability 1 O(d 1), p sj Bj c K γ γ Kes J log(d) Proof Recall that Bjkµj/p = Ajkγk/K. We have p sj Bj = 1 k=1 A2 jkγ2 k From µL = αLγ/K and the choice of λ, it follows that, with probability 1 O(d 1), p K2 log(d) γes JK log(d) γes JK log(d) Bing, Bunea and Wegkamp Here we use the Cauchy-Schwarz inequality in the third line and the identity Pp j=1 µj = p in the last line. This completes the proof. A.6. Lemmas used in the proof of Lemma 16 Let Rem11(jk) and Rem13(jk), j [p], k [K], be defined as (74). Lemma 18 Under conditions of Theorem 2, with probability 1 2d 1, Rem11(jk) K 6ρjψjk log(d) n N + 4ρj K log(d) uniformly for any j [p] and k [K]. Proof We upper bound Rem11(jk) by studying max i Lk p µi Aik 1 |La| Aja Aℓa εℓt Recall that r=1 Z(ℓ) rt (78) where Z(ℓ) rt denotes the ℓth element of Zrt and Zrt has a centered Multinomialp(1; Πt) (subtracted its mean Mt). Next we will use Bernstein s inequality to bound 1 |La| Aja Aℓa Z(ℓ) rt from above. Note that E[Wktζrt] = 0 and |Wktζrt| ρj a=1 max ℓ La Z(ℓ) rt 2ρj. (80) To calculate the variance of Pn t=1 PN r=1 Wktζrt, observe that ζrt = η ZL rt with ZL rt denoting the sub-vector of Zrt corresponding to L and 1|L1| ... 1|LK| Aj1/|L1| ... Aj K/|LK| Optimal Estimation of Sparse Topic Models where (DL)ℓℓ= 1/Aℓa for any ℓ La and a [K]. We thus have t=1 W 2 ktη diag(ΠLt)η ℓ La Wat A2 ja Aℓa ρjn Nψjk, (82) using Wkt 1 and |La| 1 in the third line and (38) in the last line. Invoke Lemma 23 with B = 2ρj and v = ρjn Nψjk to obtain, for any t > 0, 2 exp n2N2t2/2 ρjn Nψjk + 2n Nρjt/3 which further implies 2e t/2, for any t > 0. Choosing t = 6 log(d) yields Rem11(jk) K 6ρjψjk log(d) n N + 4ρj K log(d) with probability 1 2d 3. Taking the union bound for probabilities completes the proof. Lemma 19 Under conditions of Theorem 2, with probability 1 6d 1, we have Rem13(jk) K ρjψjk log(d) p ρj K log(d) µLn N3 (83) uniformly for j [p] and k [K]. Proof Recall that Rem13(jk) = max i Lk p µi t=1 (εitξt E [εitξt]) Using (78) and (79), we have Aja Aℓa|La|εℓt = 1 Bing, Bunea and Wegkamp We will use similar truncation arguments in tandem with Hoeffding s inequality as in (Bing et al., 2020, proof of Lemma 15). This implies that, for any i L, 6Πit log(d) N + 2 log(d) To truncate ζrt, recall that E[ζrt] = 0, |ζrt| 2ρj from (80) and = Nη diag(ΠLt)η ℓ La Wat A2 ja Aℓa where η is defined in (81). Invoking Lemma 23 with B = 2ρj and v = NρjΠjt yields 6ρjΠjt log(d) N + 4ρj log(d) We define Yit = εit1St with St := {|εit| Tt} and Y t = ξt1S t with S t := {|ξt| T t}, for each i [p] and t [n], and set S := p i=1 n t=1 St S t. It follows that P(S) 1 4d 1. On the event S, we have t=1 (εitξt E[εitξt]) Yit Y t E[Yit Y t ] | {z } R1 E[εitξt] E[Yit Y t ] | {z } R2 E[εitξt] = E[Yit Y t ] + E h Yitξt1(S t)c i + E εit1Sc t ξt , E[εitξt] E[Yit Y t ] 1 E h Yitξt1(S t)c i + E εit1Sc t ξt t=1 2ρj P(Sc t ) + P((S t)c) (84) by using |Yit| |εit| 1 and |ξt| |ζrt| 2ρj in the second inequality. Optimal Estimation of Sparse Topic Models It remains to bound R1. Since |Yit| Tt, we know 2Tt T t Yit Y t E[Yit Y t ] 2Tt T t for all 1 t n. Applying Hoeffding s inequality (Lemma 24) with at = 2Tt T t and bt = 2Tt T t gives Yit Y t E[Yit Y t ] t 8 Pn t=1 T 2 t (T t)2 Taking t = q 24 Pn t=1 T 2 t (T t)2 log(d) yields Yit Y t E[Yit Y t ] 2 t=1 T 2 t (T t)2 log(d) with probability greater than 1 2d 3. Finally, note that i=1 T 2 t (T t)2 1 ( ΠitρjΠjt log2(d) N2 + ρ2 j log4(d) N4 + ρjΠjt log3(d) N3 + Πitρ2 j log3(d) = ρjΘij log2(d) N2 + ρ2 j log4(d) N4 + ρjµj log3(d) p N3 + ρ2 jµi log3(d) ρjαiψjk log2(d) p N2 + ρjµj log3(d) p N3 + ρ2 jµi log3(d) by using (32) in the second equality and (64) and (65) to obtain the last line. Finally, combining (84) - (86) gives Rem13(jk) K ρjψjkp log3(d) p p2 log4(d) α2 Ln N3 + ρj µLn N3 + pρj ρjψjk log(d) for all j, k, with probability 1 6d 1. We also use (64) and αLγk αLγ c0p K log(d)/N. This completes the proof. A.7. Auxilliary lemmas In this section, we state three lemmas which are used in the main paper. The following lemma gives the range of λmin( e Q e Q ) where e Q = D 1 Q Q and Q = CA . Lemma 20 Let C = n 1WW and M = D 1 W CD 1 W . We have min k [K],i Ik A2 ik λmin(C)λmin(M) λmin( e Q e Q ) λmin(M). Bing, Bunea and Wegkamp Proof Observe that DQ = Q1p = CA 1p = C1K (63) = DW , whence λmin( e Q e Q ) = inf v SK 1 v D 1 W CA ACD 1 W v. (87) On the one hand, inf v SK 1 v D 1 W CA ACD 1 W v C1/2A AC1/2 op inf v SK 1 v D 1 W CD 1 W v = ACA op λmin(M). The upper bound now follows using (5) and ACA op 1. The latter follows from the string of inequalities ACA op ACA ,1 = ACA 1p = AC1K = 1 The lower bound follows immediately from λmin( e Q e Q ) λmin(A A)λmin(C)λmin(M) λmin(A I AI)λmin(C)λmin(M) min k [K],i Ik A2 ik λmin(C)λmin(M). Lemma 21 Let C = n 1WW . We have Proof From the definition of the smallest eigenvalue, λmin(C) = inf v SK 1 v Cv min 1 k K Ckk = min 1 k K 1 n Wk 2. The result follows from 1 n Wk 2 1 n Wk 1 (22) = γk k=1 γk/K = 1/K. Optimal Estimation of Sparse Topic Models Lemma 22 Under condition (24), ( min i L (DX)ii µL Proof For any i L, note that (DX)ii µi/p = n 1 Pn t=1 εit. The result follows from Lemma 8 and the inequalities µj/p 1 for all j [p] by (22). For completeness, we state the well-known Bernstein and Hoeffding inequalities for bounded random variables. Lemma 23 (Bernstein s inequality for bounded random variables) For independent random variables Y1, . . . , Yn with bounded ranges [ B, B] and zero means, 2 exp n2x2/2 v + n Bx/3 , for any x 0, where v var(Y1 + . . . + Yn). Lemma 24 (Hoeffding s inequality) Let Y1, . . . , Yn be independent random variables with E[Yi] = 0 and P{ai Yi bi} = 1. For any t 0, we have 2 exp 2t2 Pn i=1(bi ai)2 Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-kai Liu. A spectral algorithm for latent dirichlet allocation. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 917 925. Curran Associates, Inc., 2012. Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models going beyond svd. In Foundations of Computer Science (FOCS), 2012, IEEE 53rd Annual Symposium, pages 1 10. IEEE, 2012. Sanjeev Arora, Rong Ge, Yonatan Halpern, David M Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML (2), pages 280 288, 2013. Xin Bing, Florentina Bunea, and Marten Wegkamp. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. Bernoulli, 26(3):1765 1796, 08 2020. doi: 10.3150/19-BEJ1166. URL https://doi.org/10.3150/19-BEJ1166. David M. Blei. Introduction to probabilistic topic models. Communications of the ACM, 55:77 84, 2012. Bing, Bunea and Wegkamp David M. Blei and John D. Lafferty. A correlated topic model of science. Ann. Appl. Stat., 1(1):17 35, 06 2007. doi: 10.1214/07-AOAS114. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, pages 993 1022, 2003. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391 407, 1990. Dua Dheeru and EfiKarra Taniskidou. UCI machine learning repository, 2017. Weicong Ding, Mohammad Hossein Rohban, Prakash Ishwar, and Venkatesh Saligrama. Topic discovery through data dependent and random projections. In Sanjoy Dasgupta and David Mc Allester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1202 1210, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL http://proceedings.mlr.press/ v28/ding13.html. David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. K. Saul, and P. B. Sch olkopf, editors, Advances in Neural Information Processing Systems 16, pages 1141 1148. MIT Press, 2004. Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228 5235, 2004. ISSN 0027-8424. doi: 10.1073/pnas. 0307752101. Thomas Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. Tracy Zheng Ke and Minzhe Wang. A new svd approach to optimal topic estimation. ar Xiv:1704.07016, 2017. Wei Li and Andrew Mc Callum. Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pages 577 584, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2. doi: 10.1145/1143844.1143917. Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 98, pages 159 168, New York, NY, USA, 1998. ACM. ISBN 0-89791-996-3. doi: 10.1145/ 275487.275505. Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2):217 235, 2000. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.2000. 1711. Optimal Estimation of Sparse Topic Models Ben Recht, Christopher Re, Joel Tropp, and Victor Bittorf. Factoring nonnegative matrices with linear programs. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1214 1222. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/ 4518-factoring-nonnegative-matrices-with-linear-programs.pdf. Allen Riddell, Timothy Hopper, and Andreas Grivas. lda: 1.0.4, July 2016. URL https: //doi.org/10.5281/zenodo.57927. Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009. doi: 10.1007/b13794.