# Algorithms for Generalized Topic Modeling

Avrim Blum, Toyota Technological Institute at Chicago, avrim@ttic.edu
Nika Haghtalab, Computer Science Department, Carnegie Mellon University, nhaghtal@cs.cmu.edu

Abstract

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In this work we consider a broad generalization of the traditional topic modeling framework, where we no longer assume that words are drawn i.i.d. and instead view a topic as a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a predictor that, given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this from unlabeled data only, and give efficient algorithms to do so, also discussing issues such as noise tolerance and sample complexity. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.

1 Introduction

Topic modeling is an area with significant recent work in the intersection of algorithms and machine learning (Arora et al. 2012; Arora, Ge, and Moitra 2012; Arora et al. 2013; Anandkumar et al. 2012; 2014; Bansal, Bhattacharyya, and Kannan 2014). In topic modeling, a topic (such as sports, business, or politics) is modeled as a probability distribution over words, expressed as a vector $a_i$. A document is generated by first selecting a mixture $w$ over topics, such as 80% sports and 20% business, and then choosing words i.i.d. from the associated mixture distribution, which in this case would be $0.8\,a_{\text{sports}} + 0.2\,a_{\text{business}}$. Given a large collection of such documents (and some assumptions about the distributions $a_i$ as well as the distribution over mixture vectors $w$), the goal is to recover the topic vectors $a_i$ and then to use the $a_i$ to correctly classify new documents according to their topic mixtures. Algorithms for this problem have been developed with strong provable guarantees even when documents consist of only two or three words each (Arora, Ge, and Moitra 2012; Anandkumar et al. 2012; Papadimitriou et al. 1998). In addition, algorithms based on this problem formulation perform well empirically on standard datasets (Blei, Ng, and Jordan 2003; Hofmann 1999).

As a theoretical model for document generation, however, an obvious problem with the standard topic modeling framework is that documents are not actually created by independently drawing words from some distribution. Moreover, important words within a topic often have meaningful correlations, like shooting a free throw or kicking a field goal. Better would be a model in which sentences are drawn i.i.d. from a distribution over sentences. Even better would be paragraphs drawn i.i.d. from a distribution over paragraphs (this would account for the word correlations that exist within a coherent paragraph). Or, even better, how about a model in which paragraphs are drawn non-independently, so that the second paragraph in a document can depend on what the first paragraph was saying, though presumably with some amount of additional entropy as well? This is the type of model we study here.
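To make the standard generative process described above concrete, here is a minimal Python sketch of it. The vocabulary size, number of topics, Dirichlet priors, and document length are illustrative assumptions of ours, not parameters taken from the paper.

```python
# Minimal sketch (assumed setup) of the standard topic-model generative process.
import numpy as np

rng = np.random.default_rng(0)
n_words, k = 1000, 3                          # vocabulary size and number of topics (assumed)
A = rng.dirichlet(np.ones(n_words), size=k).T # columns a_i: probability distributions over words

def generate_document(doc_len, alpha=0.5):
    """Draw a mixture w over topics, then draw doc_len words i.i.d. from sum_i w_i a_i."""
    w = rng.dirichlet(alpha * np.ones(k))     # topic mixture, e.g. 80% sports / 20% business
    word_dist = A @ w                         # mixture distribution over the vocabulary
    words = rng.choice(n_words, size=doc_len, p=word_dist)
    counts = np.bincount(words, minlength=n_words)
    return w, counts / doc_len                # normalized word-count vector (L1 norm = 1)

w_true, x_hat = generate_document(500)
```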
Note that an immediate problem with considering such a model is that now the task of learning an explicit distribution (over sentences or paragraphs) is hopeless. While a distribution over words can be reasonably viewed as a probability vector, one could not hope to learn or even represent an explicit distribution over sentences or paragraphs. Indeed, except in cases of plagiarism, one would not expect to see the same paragraph twice in the entire corpus. Moreover, this is likely to be true even if we assume paragraphs have some natural feature-vector representation. Instead, we bypass this issue by aiming to directly learn a predictor for documents; that is, a function that, given a document, predicts its mixture over topics without explicitly learning topic distributions. Another way to think of this is that our goal is not to learn a model that could be used to write a new document, but instead just a model that could be used to classify a document written by others. This is much as in standard supervised learning, where algorithms such as SVMs learn a decision boundary (such as a linear separator) for making predictions on the labels of examples without explicitly learning the distributions $D^+$ and $D^-$ over positive and negative examples, respectively. However, our setting is unsupervised (we are not given labeled data containing the correct classifications of the documents in the training set) and, furthermore, rather than each data item belonging to one of the $k$ classes (topics), each data item belongs to a mixture of the $k$ topics. Our goal is, given a new data item, to output what that mixture is.

We begin by describing our high-level theoretical formulation. This formulation can be viewed as a generalization both of standard topic modeling and of multi-view learning or co-training (Blum and Mitchell 1998; Dasgupta, Littman, and McAllester 2002; Chapelle, Schölkopf, and Zien 2010; Balcan, Blum, and Yang 2004; Sun 2013). We then describe several natural assumptions under which we can indeed efficiently solve the problem, learning accurate topic mixture predictors.

2 Preliminaries

We assume that paragraphs are described by $n$ real-valued features and so can be viewed as points $x$ in an instance space $X \subseteq \mathbb{R}^n$. We assume that each document consists of at least two paragraphs and denote it by $(x^1, x^2)$. Furthermore, we consider $k$ topics and partial membership functions $f_1, \ldots, f_k : X \to [0, 1]$, such that $f_i(x)$ determines the degree to which paragraph $x$ belongs to topic $i$, and $\sum_{i=1}^k f_i(x) = 1$. For any vector of probabilities $w \in \mathbb{R}^k$, which we sometimes refer to as mixture weights, we define $X_w = \{x \in \mathbb{R}^n \mid \forall i,\ f_i(x) = w_i\}$ to be the set of all paragraphs with partial membership values $w$. We assume that both paragraphs of a document have the same partial membership values, that is, $(x^1, x^2) \in X_w \times X_w$, although we also allow some noise later on. To better relate to the literature on multi-view learning, we also refer to topics as classes and refer to paragraphs as views of the document.

Much like the standard topic models, we consider an unlabeled sample set that is generated by a two-step process. First, we consider a distribution $P$ over vectors of mixture weights and draw $w$ according to $P$. Then we consider a distribution $D_w$ over the set $X_w \times X_w$ and draw a document $(x^1, x^2)$ according to $D_w$. We consider two settings. In the first setting, which is addressed in Section 3, the learner receives the instance $(x^1, x^2)$.
In the second setting, discussed in Section 4, the learner receives samples $(\hat{x}^1, \hat{x}^2)$ that have been perturbed by some noise. In both cases, the goal of the learner is to recover the partial membership functions $f_i$. More specifically, in this work we consider partial membership functions of the form $f_i(x) = f(v_i \cdot x)$, where $v_1, \ldots, v_k \in \mathbb{R}^n$ are linearly independent and $f : \mathbb{R} \to [0, 1]$ is a monotonic function.¹ For the majority of this work, we consider $f$ to be the identity function, so that $f_i(x) = v_i \cdot x$. Define $a_i \in \mathrm{span}\{v_1, \ldots, v_k\}$ such that $v_i \cdot a_i = 1$ and $v_j \cdot a_i = 0$ for all $j \neq i$. In other words, the matrix containing the $a_i$s as columns is the pseudoinverse of the matrix containing the $v_i$s as columns, and $a_i$ can be viewed as the projection of a paragraph that is purely of topic $i$ onto $\mathrm{span}\{v_1, \ldots, v_k\}$. Define $\Delta = \mathrm{CH}(\{a_1, \ldots, a_k\})$ to be the convex hull of $a_1, \ldots, a_k$.

Throughout this work, we use $\|\cdot\|_2$ to denote the spectral norm of a matrix or the $L_2$ norm of a vector. When it is clear from the context, we simply use $\|\cdot\|$ to denote these quantities. We denote by $B_r(x)$ the ball of radius $r$ around $x$. For a matrix $M$, we use $M^+$ to denote the pseudoinverse of $M$.

¹We emphasize that linear independence is a much milder assumption than the assumption that topic vectors are orthogonal.

Generalization of Standard Topic Modeling

Let us briefly discuss how the above model is a generalization of the standard topic modeling framework. In the standard framework, a topic is modeled as a probability distribution over $n$ words, expressed as a vector $a_i \in [0, 1]^n$, where $a_{ij}$ is the probability of word $j$ in topic $i$. A document is generated by first selecting a mixture $w \in [0, 1]^k$ over $k$ topics, and then choosing words i.i.d. from the associated mixture distribution $\sum_{i=1}^k w_i a_i$. The document vector $\hat{x}$ is then the vector of word counts, normalized by dividing by the number of words in the document so that $\|\hat{x}\|_1 = 1$.

As a thought experiment, consider infinitely long documents. In the standard framework, all infinitely long documents of a mixture weight $w$ have the same representation $x = \sum_{i=1}^k w_i a_i$. This representation implies $x \cdot v_i = w_i$ for all $i \in [k]$, where $V = (v_1, \ldots, v_k)$ is the pseudoinverse of $A = (a_1, \ldots, a_k)$. Thus, by partitioning the document into two halves (views) $x^1$ and $x^2$, our noise-free model with $f_i(x) = v_i \cdot x$ generalizes the standard topic model for long documents. However, our model is substantially more general: features within a view can be arbitrarily correlated, the views themselves can also be correlated, and even in the zero-noise case, documents of the same mixture can look very different so long as they have the same projection onto the span of $a_1, \ldots, a_k$.

For a shorter document $\hat{x}$, each feature $\hat{x}_i$ is drawn according to a distribution with mean $x_i$, where $x = \sum_{i=1}^k w_i a_i$. Therefore, $\hat{x}$ can be thought of as a noisy measurement of $x$. The fewer the words in a document, the larger is the noise in $\hat{x}$. Existing work in topic modeling, such as (Arora, Ge, and Moitra 2012; Anandkumar et al. 2014), provides elegant procedures for handling the large noise that is caused by drawing only 2 or 3 words according to the distribution induced by $x$. As we show in Section 4, our method can also tolerate large amounts of noise under some conditions. While our method cannot deal with documents that are only 2- or 3-words long, the benefit is a model that is much more general in many other respects.
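The following small sketch (our own, on an assumed random instance) illustrates the pseudoinverse relationship between the $a_i$ and $v_i$ and the fact that $f_i(x) = v_i \cdot x$ recovers the mixture $w$ even when a document has an arbitrary component orthogonal to $\mathrm{span}\{a_1, \ldots, a_k\}$.

```python
# Sketch (assumed random instance, not code from the paper) of the a_i / v_i relation.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
A = rng.standard_normal((n, k))          # columns a_i: "pure topic" representatives (assumed)
V = np.linalg.pinv(A).T                  # columns v_i: pseudoinverse relation, v_i . a_j = 1{i=j}
assert np.allclose(V.T @ A, np.eye(k), atol=1e-8)

w = np.array([0.8, 0.2, 0.0])            # true mixture weights
x = A @ w                                # an "infinitely long" document: x = sum_i w_i a_i

# Any component orthogonal to span{a_1, ..., a_k} leaves the prediction unchanged.
z = rng.standard_normal(n)
z -= A @ np.linalg.lstsq(A, z, rcond=None)[0]    # remove the component of z in span{a_i}
assert np.allclose(V.T @ (x + z), w, atol=1e-6)  # f_i(x) = v_i . x recovers w
```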
Generalization of Co-training Framework

Here, we briefly discuss how our model is a generalization of the co-training framework. The standard co-training framework of Blum and Mitchell considers learning a binary classifier from primarily unlabeled instances, where each instance $(x^1, x^2)$ is a pair of views that have the same classification. For example, Blum and Mitchell and Balcan and Blum show that if the views are independent of each other given the classification, then one can efficiently learn a halfspace from primarily unlabeled data. In the language of our model, this corresponds to a setting with $k = 2$ classes and unknown class vectors $v_1 = -v_2$, where each view of an instance belongs fully to one class, using membership function $f_i(x) = \mathrm{sign}(v_i \cdot x)$. Our work generalizes co-training by extending it to multiclass settings where each instance belongs to one or more classes partially, using a partial membership function $f_i(\cdot)$.

3 An Easier Case with Simplifying Assumptions

We make two main simplifying assumptions in this section, both of which will be relaxed in Section 4: (1) the documents are not noisy, i.e., $v_i \cdot x^1 = v_i \cdot x^2$; (2) there is non-negligible probability density on instances that belong purely to one class. In this section we demonstrate ideas and techniques.

The Setting: We make the following assumptions. The documents are not noisy; that is, for any document $(x^1, x^2)$ and for all $i \in [k]$, $v_i \cdot x^1 = v_i \cdot x^2$. Regarding distribution $P$, we assume that a non-negligible probability density is assigned to pure documents for each class. More formally, for some $\xi > 0$, for all $i \in [k]$, $\Pr_{w \sim P}[w = e_i] \geq \xi$. Regarding distribution $D_w$, we allow the two paragraphs in a document, i.e., the two views $(x^1, x^2)$ drawn from $D_w$, to be correlated as long as for any subspace $Z \subset \mathrm{null}\{v_1, \ldots, v_k\}$ of dimension strictly less than $n - k$, $\Pr_{(x^1, x^2) \sim D_w}[(x^1 - x^2) \in Z] \leq 1 - \zeta$ for some non-negligible $\zeta$. One way to view this in the context of topic modeling is that if, say, sports is a topic, then it should not be the case that the second paragraph always talks about the exact same sport as the first paragraph; else sports would really be a union of several separate but closely related topics. Thus, while we do not require independence, we do require some non-correlation between the paragraphs.

Algorithm and Analysis: The main idea behind our approach is to use the consistency of the two views of the samples to first recover the subspace spanned by $v_1, \ldots, v_k$ (Phase 1). Once this subspace is recovered, we show that the projection of a sample onto this space corresponds to the convex combination of class vectors using the appropriate mixture weight that was used for that sample. Therefore, we find vectors $a_1, \ldots, a_k$ that purely belong to each class by taking the extreme points of the projected samples (Phase 2). The class vectors $v_1, \ldots, v_k$ are the unique vectors (up to permutations) that classify $a_1, \ldots, a_k$ as pure samples. Phase 2 is similar to that of (Arora, Ge, and Moitra 2012). Algorithm 1 formalizes the details of this approach.

Algorithm 1: ALGORITHM FOR GENERALIZED TOPIC MODELS (NO NOISE)

Input: A sample set $S = \{(x^1_i, x^2_i) \mid i \in [m]\}$ such that for each $i$, first a vector $w$ is drawn from $P$ and then $(x^1_i, x^2_i)$ is drawn from $D_w$.
Phase 1: Let $X^1$ and $X^2$ be matrices where the $i$th column is $x^1_i$ and $x^2_i$, respectively. Let $P$ be the projection matrix on the last $k$ left singular vectors of $X^1 - X^2$.
Phase 2: Let $S' = \{P x^j_i \mid i \in [m],\ j \in \{1, 2\}\}$. Let $A$ be a matrix whose columns are the extreme points of the convex hull of $S'$. (These can be found using farthest traversal or linear programming.)
Output: Return the columns of $A^+$ as $v_1, \ldots, v_k$.
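Below is a rough implementation sketch of Algorithm 1 in NumPy. The helper for finding the extreme points is one of the options the algorithm mentions (a farthest-traversal heuristic for simplex vertices); the function and variable names are our own, and the sketch assumes exact (noise-free) samples.

```python
# Sketch of Algorithm 1 (noise-free case); names and the vertex-finding routine are ours.
import numpy as np

def farthest_traversal(points, k):
    """Pick the point of largest norm, then repeatedly pick the point farthest from the
    affine span of the points picked so far. Recovers the vertices of a simplex when all
    points lie in their convex hull and each vertex appears in the sample."""
    idx = [int(np.argmax(np.linalg.norm(points, axis=1)))]
    for _ in range(k - 1):
        origin = points[idx[0]]
        B = (points[idx] - origin).T                              # basis of current affine span
        coef, *_ = np.linalg.lstsq(B, (points - origin).T, rcond=None)
        resid = (points - origin) - (B @ coef).T                  # residual off that span
        idx.append(int(np.argmax(np.linalg.norm(resid, axis=1))))
    return points[idx]

def algorithm1(X1, X2, k):
    """X1, X2: n x m matrices whose i-th columns are the two views x^1_i, x^2_i."""
    # Phase 1: projection onto the last k left singular vectors of X1 - X2,
    # i.e., an estimate of span{v_1, ..., v_k}.
    U, _, _ = np.linalg.svd(X1 - X2)
    Uk = U[:, -k:]
    P = Uk @ Uk.T

    # Phase 2: project all views and take the extreme points of their convex hull.
    S = (P @ np.hstack([X1, X2])).T          # projected samples, one per row
    A_hat = farthest_traversal(S, k).T       # columns are estimates of a_1, ..., a_k
    return np.linalg.pinv(A_hat).T           # columns v_i satisfy v_i . a_j = 1{i=j}
```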
Figure 1: $v_1, v_2$ correspond to classes 1 and 2, and $a_1$ and $a_2$ correspond to canonical vectors purely of class 1 and 2, respectively.

In Phase 1, for recovering $\mathrm{span}\{v_1, \ldots, v_k\}$, note that for any sample $(x^1, x^2)$ drawn from $D_w$, we have that $v_i \cdot x^1 = v_i \cdot x^2 = w_i$. Therefore, regardless of which $w$ was used to produce the sample, we have that $v_i \cdot (x^1 - x^2) = 0$ for all $i \in [k]$. That is, $v_1, \ldots, v_k$ are in the null space of all such $(x^1 - x^2)$. The assumptions on $D_w$ show that after seeing sufficiently many samples, the $(x^1_i - x^2_i)$ span an $(n-k)$-dimensional subspace. So, $\mathrm{span}\{v_1, \ldots, v_k\}$ can be recovered by taking $\mathrm{null}\{(x^1 - x^2) \mid (x^1, x^2) \in X_w \times X_w,\ w \in \mathbb{R}^k\}$. This null space is spanned by the last $k$ singular vectors of $X^1 - X^2$, where $X^1$ and $X^2$ are matrices with columns $x^1_i$ and $x^2_i$, respectively. The next lemma, whose proof appears in the full version of this paper (Blum and Haghtalab 2016), formalizes this discussion.

Lemma 3.1. Let $Z = \mathrm{span}\{(x^1_i - x^2_i) \mid i \in [m]\}$. Then, $m = O\big(\frac{1}{\zeta}\big(n - k + \log\frac{1}{\delta}\big)\big)$ is sufficient such that with probability $1 - \delta$, $\mathrm{rank}(Z) = n - k$.

Using Lemma 3.1, Phase 1 of Algorithm 1 recovers $\mathrm{span}\{v_1, \ldots, v_k\}$. Next, we show that pure samples are the extreme points of the convex hull of all samples when projected onto the subspace $\mathrm{span}\{v_1, \ldots, v_k\}$. Figure 1 demonstrates the relation between the class vectors $v_i$, the projection of samples, and the projection of pure samples $a_i$. The next lemma, whose proof appears in the full version of this paper (Blum and Haghtalab 2016), formalizes this claim.

Lemma 3.2. For any $x$, let $\bar{x}$ represent the projection of $x$ onto $\mathrm{span}\{v_1, \ldots, v_k\}$. Then, $\bar{x} = \sum_{i \in [k]} (v_i \cdot x)\, a_i$.

With $\sum_{i \in [k]} (v_i \cdot x)\, a_i$ representing the projection of $x$ onto $\mathrm{span}\{v_1, \ldots, v_k\}$, it is clear that the extreme points of the set of all projected instances that belong to $X_w$, over all $w$, are $a_1, \ldots, a_k$. Since in a large enough sample set, with high probability there is a pure sample of type $i$ for all $i \in [k]$, the set of extreme points of the projected samples is also $\{a_1, \ldots, a_k\}$. The following lemma, whose proof appears in the full version of this paper (Blum and Haghtalab 2016), formalizes this discussion.

Lemma 3.3. Let $m = c \cdot \frac{1}{\xi}\log\frac{k}{\delta}$ for a large enough constant $c > 0$. Let $P$ be the projection matrix for $\mathrm{span}\{v_1, \ldots, v_k\}$ and let $S' = \{P x^j_i \mid i \in [m],\ j \in \{1, 2\}\}$ be the set of projected samples. With probability $1 - \delta$, $\{a_1, \ldots, a_k\}$ is the set of extreme points of $\mathrm{CH}(S')$.

Therefore, $a_1, \ldots, a_k$ can be learned by taking the extreme points of the convex hull of all samples projected onto $\mathrm{span}\{v_1, \ldots, v_k\}$. Furthermore, $V = A^+$ is unique; therefore $v_1, \ldots, v_k$ can easily be found by taking the pseudoinverse of the matrix $A$. Together, Lemmas 3.1 and 3.3 prove the next theorem regarding learning class vectors in the absence of noise.

Theorem 3.4 (No Noise). There is a polynomial-time algorithm for which $m = O\big(\frac{1}{\zeta}\big(n - k + \log\frac{1}{\delta}\big) + \frac{1}{\xi}\log\frac{k}{\delta}\big)$ is sufficient to recover $v_i$ exactly for all $i \in [k]$, with probability $1 - \delta$.
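As a quick numerical check of the identity in Lemma 3.2, the following sketch (our own, on an assumed random instance) verifies that the orthogonal projection of $x$ onto $\mathrm{span}\{v_1, \ldots, v_k\}$ equals $\sum_i (v_i \cdot x)\, a_i$.

```python
# Numerical check (assumed random instance) of Lemma 3.2.
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 4
V = rng.standard_normal((n, k))        # columns v_i (linearly independent with probability 1)
A = V @ np.linalg.inv(V.T @ V)         # columns a_i in span{v_i} with v_i . a_j = 1{i=j}

Q, _ = np.linalg.qr(V)                 # orthonormal basis of span{v_1, ..., v_k}
P = Q @ Q.T                            # orthogonal projection onto that span

x = rng.standard_normal(n)
assert np.allclose(P @ x, A @ (V.T @ x), atol=1e-8)   # P x == sum_i (v_i . x) a_i
```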
4 Relaxing the Assumptions

In this section, we relax the two main simplifying assumptions from Section 3. We relax the assumption on non-noisy documents and allow a large fraction of the documents to not satisfy $v_i \cdot x^1 = v_i \cdot x^2$. In the standard topic model, this corresponds to having a large fraction of short documents. Furthermore, we relax the assumption on the existence of pure documents to an assumption on the existence of almost-pure documents.

The Setting: We assume that any sampled document has a non-negligible probability of being non-noisy and that, with the remaining probability, the two views of the document are perturbed by additive Gaussian noise, independently. More formally, for a given sample $(x^1, x^2)$, with probability $p_0 > 0$ the algorithm receives $(x^1, x^2)$, and with the remaining probability $1 - p_0$ the algorithm receives $(\hat{x}^1, \hat{x}^2)$ such that $\hat{x}^j = x^j + e^j$, where $e^j \sim N(0, \sigma^2 I_n)$.

We assume that for each topic, the probability that a document is mostly about that topic is non-negligible. More formally, for any topic $i \in [k]$, $\Pr_{w \sim P}\big[\|e_i - w\|_1 \leq \epsilon\big] \geq g(\epsilon)$, where $g$ is a polynomial function of its input. A stronger form of this assumption, better known as the dominant admixture assumption, assumes that every document is mostly about one topic and has been empirically shown to hold on several real-world data sets (Bansal, Bhattacharyya, and Kannan 2014). Furthermore, in the Latent Dirichlet Allocation model, $\Pr_{w \sim P}\big[\max_{i \in [k]} w_i \geq 1 - \epsilon\big]$ is on the order of $\epsilon^2$ for typical values of the concentration parameter.

We also make assumptions on the distribution over instances. We assume that the covariance of the distribution over $(x^1_i - x^2_i)$, namely $\mathbb{E}_{(x^1_i, x^2_i)}\big[(x^1_i - x^2_i)(x^1_i - x^2_i)^\top\big]$, is larger than the noise covariance $\sigma^2$.³ That is, for some $\delta_0 > 0$, its least significant non-zero eigenvalue, equivalently its $(n-k)$th eigenvalue, is greater than $6\sigma^2 + \delta_0$. Moreover, we assume that the $L_2$ norm of each view of a sample is bounded by some $M > 0$. We also assume that $\|a_i\| \leq \alpha$ for all $i \in [k]$, for some $\alpha > 0$. At a high level, the $a_i$s are inversely proportional to the non-zero singular values of $V = (v_1, \ldots, v_k)$; therefore, $\|a_i\| \leq \alpha$ implies that the $k$ topic vectors are sufficiently different.

³This assumption is only used in Phase 1. One can ensure that this assumption holds by taking the average of several documents in Phase 1, where the average of documents $(\hat{x}^1_1, \hat{x}^2_1), \ldots, (\hat{x}^1_m, \hat{x}^2_m)$ is $\big(\sum_{i=1}^m \hat{x}^1_i / m,\ \sum_{i=1}^m \hat{x}^2_i / m\big)$. Since the noise shrinks in the averaged documents, the noise level falls under the required level. This would mildly increase the sample complexity.
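The following sketch (our own, on an assumed random instance; the Dirichlet prior over mixture weights is an illustrative choice, not the paper's) generates two-view documents under the noise model just described: both views agree on $v_i \cdot x = w_i$, and with probability $1 - p_0$ each view is perturbed by independent $N(0, \sigma^2 I_n)$ noise.

```python
# Sketch (assumed instance, hypothetical parameters) of the noisy two-view model.
import numpy as np

rng = np.random.default_rng(3)
n, k, p0, sigma = 60, 3, 0.3, 0.5
V = rng.standard_normal((n, k))               # columns v_1, ..., v_k
A = V @ np.linalg.inv(V.T @ V)                # columns a_i, so that v_i . a_j = 1{i=j}
P_null = np.eye(n) - A @ V.T                  # projector onto null{v_1, ..., v_k}

def sample_document():
    """Return one (possibly noisy) two-view document (x1_hat, x2_hat)."""
    w = rng.dirichlet(np.ones(k))             # mixture weights, w ~ P (assumed Dirichlet)
    z1 = P_null @ rng.standard_normal(n)      # view-specific content with v_i . z1 = 0
    z2 = P_null @ rng.standard_normal(n)
    x1, x2 = A @ w + z1, A @ w + z2           # both views satisfy v_i . x = w_i
    if rng.random() < p0:                     # non-noisy with probability p0
        return x1, x2
    return (x1 + sigma * rng.standard_normal(n),   # else add independent N(0, sigma^2 I_n)
            x2 + sigma * rng.standard_normal(n))
```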
Algorithm and Results: Our approach follows the general theme of the previous section: first recover $\mathrm{span}\{v_1, \ldots, v_k\}$ and then recover $a_1, \ldots, a_k$ by taking the extreme points of the projected samples. In this case, in the first phase we recover $\mathrm{span}\{v_1, \ldots, v_k\}$ approximately, by finding a projection matrix $\hat{P}$ such that $\|P - \hat{P}\| \leq \epsilon$ for an arbitrarily small $\epsilon$, where $P$ is the projection matrix on $\mathrm{span}\{v_1, \ldots, v_k\}$. At this point in the algorithm, the projection of samples using $\hat{P}$ can include points that are arbitrarily far from $\Delta$. This is due to the fact that the noisy samples are perturbed by $N(0, \sigma^2 I_n)$; so, for large values of $\sigma$, some noisy samples map to points that are quite far from $\Delta$. Therefore, we have to detect and remove these samples before continuing to the second phase. For this purpose, we show that the low-density regions of the projected samples can safely be removed such that the convex hull of the remaining points is close to $\Delta$. In the second phase, we consider the projection of each sample using $\hat{P}$. To approximately recover $a_1, \ldots, a_k$, we recover samples $x$ that are far from the convex hull of the remaining points when $x$ and a ball of points close to it are removed. We then show that such points are close to one of the pure class vectors $a_i$. Algorithm 2 and the details of the above approach and its performance are as follows.

Algorithm 2: ALGORITHM FOR GENERALIZED TOPIC MODELS WITH NOISE

Input: A sample set $\{(\hat{x}^1_i, \hat{x}^2_i) \mid i \in [m]\}$ such that for each $i$, first a vector $w$ is drawn from $P$ and then $(x^1_i, x^2_i)$ is drawn from $D_w$; then, with probability $p_0$, $\hat{x}^j_i = x^j_i$, else with probability $1 - p_0$, $\hat{x}^j_i = x^j_i + N(0, \sigma^2 I_n)$, for $i \in [m]$ and $j \in \{1, 2\}$.
Phase 1:
1. Take $m_1 = \Omega\big(\frac{1}{\zeta}\big(n - k + \log\frac{1}{\delta}\big) + \frac{n\sigma^2 M^4 r^2}{\delta_0^2 \epsilon^2}\,\mathrm{polylog}\big(\frac{nrM}{\epsilon\delta}\big)\big)$ samples.
2. Let $\hat{X}^1$ and $\hat{X}^2$ be matrices where the $i$th column is $\hat{x}^1_i$ and $\hat{x}^2_i$, respectively.
3. Let $\hat{P}$ be the projection matrix on the last $k$ left singular vectors of $\hat{X}^1 - \hat{X}^2$.
Denoising Phase:
4. Let $\epsilon' = \epsilon/(8r)$ and $\gamma = g\big(\epsilon'/(8k\alpha)\big)$.
5. Take $m_2 = \Omega\big(\frac{k}{p_0 \gamma}\ln\frac{1}{\delta}\big)$ fresh samples and let $\hat{S} = \{\hat{P}\hat{x}^1_i \mid i \in [m_2]\}$.
6. Remove from $\hat{S}$ any $\hat{x}$ for which there are fewer than $\frac{p_0 \gamma m_2}{2}$ points of $\hat{S}$ within distance $\frac{\epsilon'}{2}$.
Phase 2:
7. For all $\hat{x}$ in $\hat{S}$, if $\mathrm{dist}\big(\hat{x}, \mathrm{CH}(\hat{S} \setminus B_{6r\epsilon'}(\hat{x}))\big) > 2\epsilon'$, add $\hat{x}$ to $C$.
8. Cluster $C$ using single linkage with threshold $16r\epsilon'$. Assign any point from cluster $i$ as $\hat{a}_i$.
Output: Return $\hat{a}_1, \ldots, \hat{a}_k$.

Theorem 4.1. For any small enough $\epsilon > 0$ and any $\delta > 0$, there is an efficient algorithm for which an unlabeled sample set of size
$$O\!\left(\frac{1}{\zeta}\Big(n - k + \log\frac{1}{\delta}\Big) + \frac{n\sigma^2 M^4 r^2}{\delta_0^2 \epsilon^2}\,\mathrm{polylog}\Big(\frac{nrM}{\epsilon\delta}\Big) + \frac{k\ln(1/\delta)}{p_0\, g\big(\epsilon/(kr\alpha)\big)}\right)$$
is sufficient to recover $\hat{a}_i$ such that $\|\hat{a}_i - a_i\|_2 \leq \epsilon$ for all $i \in [k]$, with probability $1 - \delta$. Here, $r$ is a parameter that depends on the geometry of the simplex $\Delta$ and will be defined in Section 4.3.

The proof of Theorem 4.1 relies on the next lemmas regarding the performance of each phase of the algorithm. We formally state them here, but defer their proofs to Sections 4.1, 4.2, and 4.3.

Lemma 4.2 (Phase 1). For any $\sigma, \epsilon > 0$, it is sufficient to have an unlabeled sample set of size $O\big(\frac{1}{\zeta}\big(n - k + \log\frac{1}{\delta}\big) + \frac{n\sigma^2 M^2}{\delta_0^2 \epsilon^2}\,\mathrm{polylog}\big(\frac{n}{\epsilon\delta}\big)\big)$ so that, with probability $1 - \delta$, Phase 1 of Algorithm 2 returns a matrix $\hat{P}$ such that $\|P - \hat{P}\|_2 \leq \epsilon$.

Lemma 4.3 (Denoising). Let $\epsilon' > 0$ be sufficiently small relative to $\sigma$, $M$, and $\epsilon$ (see the full version for the precise condition), and let $\gamma = g\big(\frac{\epsilon'}{8k\alpha}\big)$. An unlabeled sample size of $m = O\big(\frac{k}{p_0\gamma}\ln\frac{1}{\delta}\big)$ is sufficient such that for $\hat{S}$ defined in Step 6 of Algorithm 2 the following holds with probability $1 - \delta$: for any $x \in \hat{S}$, $\mathrm{dist}(x, \Delta) \leq \epsilon'$, and for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}$ such that $\|\hat{a}_i - a_i\| \leq \epsilon'$.

Lemma 4.4 (Phase 2). Let $\hat{S}$ be a set for which the conclusion of Lemma 4.3 holds with the value $\epsilon' = \epsilon/(8r)$. Then, Phase 2 of Algorithm 2 returns $\hat{a}_1, \ldots, \hat{a}_k$ such that for all $i \in [k]$, $\|a_i - \hat{a}_i\| \leq \epsilon$.

We now prove our main Theorem 4.1 by directly leveraging the three lemmas we just stated.

Proof of Theorem 4.1. By Lemma 4.2, a sample set of size $m_1$ is sufficient such that Phase 1 of Algorithm 2 leads to $\|P - \hat{P}\| \leq \frac{\epsilon}{32Mr}$, with probability $1 - \delta/2$. Let $\epsilon' = \frac{\epsilon}{8r}$ and take a fresh sample of size $m_2$. By Lemma 4.3, with probability $1 - \delta/2$, for any $x \in \hat{S}$, $\mathrm{dist}(x, \Delta) \leq \epsilon'$, and for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}$ such that $\|\hat{a}_i - a_i\| \leq \epsilon'$. Finally, by Lemma 4.4 we have that Phase 2 of Algorithm 2 returns $\hat{a}_i$ such that for all $i \in [k]$, $\|a_i - \hat{a}_i\| \leq \epsilon$.

Theorem 4.1 discusses the approximation of $a_i$ for all $i \in [k]$. It is not hard to see that such an approximation also translates to an approximation of the class vectors $v_i$ for all $i \in [k]$. That is, using the properties of perturbations of pseudoinverse matrices (see the full version of the paper for details), one can show that $\|\hat{A}^+ - V\| \leq O(\|\hat{A} - A\|)$. Therefore, $\hat{V} = \hat{A}^+$ is a good approximation of $V$.
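The following sketch (our own helper names, not the paper's code) shows one way Step 7 of Algorithm 2 can be carried out: the distance to a convex hull is computed as a small quadratic program over the mixture weights, here solved with SciPy's SLSQP solver as a convenient, non-optimized choice.

```python
# Sketch of Step 7 of Algorithm 2: keep a projected point as a candidate pure point
# if it is far from the convex hull of the remaining points once a ball around it
# is removed. Helper names are ours; SLSQP is just one convenient QP solver.
import numpy as np
from scipy.optimize import minimize

def dist_to_convex_hull(x, points):
    """Euclidean distance from x to the convex hull of the rows of `points`."""
    m = len(points)
    if m == 0:
        return np.inf
    obj = lambda lam: np.sum((points.T @ lam - x) ** 2)
    res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    return np.sqrt(res.fun)

def candidate_pure_points(S, r, eps_prime):
    """S: denoised projected samples (one per row). Returns the candidate set C of Step 7."""
    C = []
    for x in S:
        far = np.linalg.norm(S - x, axis=1) > 6 * r * eps_prime   # S \ B_{6r eps'}(x)
        if dist_to_convex_hull(x, S[far]) > 2 * eps_prime:
            C.append(x)
    return np.array(C)
```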
4.1 Proof of Lemma 4.2 (Phase 1)

For $j \in \{1, 2\}$, let $X^j$ and $\hat{X}^j$ be $n \times m$ matrices with the $i$th column being $x^j_i$ and $\hat{x}^j_i$, respectively. As we demonstrated in Lemma 3.1, with high probability $\mathrm{rank}(X^1 - X^2) = n - k$. Note that the null space of the columns of $X^1 - X^2$ is spanned by the left singular vectors of $X^1 - X^2$ that correspond to its $k$ zero singular values. We show that this null space can be approximated to within any desirable accuracy by the space spanned by the $k$ least significant left singular vectors of $\hat{X}^1 - \hat{X}^2$, given a sufficiently large number of samples.

Let $D = X^1 - X^2$ and $\hat{D} = \hat{X}^1 - \hat{X}^2$. For ease of exposition, assume that all samples are perturbed by Gaussian noise $N(0, \sigma^2 I_n)$.⁴ Since each view of a sample is perturbed by an independent draw from a Gaussian noise distribution, we can view $\hat{D} = D + E$, where each column of $E$ is drawn i.i.d. from the distribution $N(0, 2\sigma^2 I_n)$. Then,
$$\frac{1}{m}\hat{D}\hat{D}^\top = \frac{1}{m}DD^\top + \frac{1}{m}\big(DE^\top + ED^\top\big) + \frac{1}{m}EE^\top.$$
As a thought experiment, consider this equation in expectation. Since $\mathbb{E}\big[\frac{1}{m}EE^\top\big] = 2\sigma^2 I_n$ is the covariance matrix of the noise and $\mathbb{E}\big[DE^\top + ED^\top\big] = 0$, we have
$$\frac{1}{m}\mathbb{E}\big[\hat{D}\hat{D}^\top\big] - 2\sigma^2 I_n = \frac{1}{m}\mathbb{E}\big[DD^\top\big]. \tag{1}$$
Moreover, the eigenvectors and their order are the same in $\frac{1}{m}\mathbb{E}[\hat{D}\hat{D}^\top]$ and $\frac{1}{m}\mathbb{E}[\hat{D}\hat{D}^\top] - 2\sigma^2 I_n$. Therefore, one can recover the null space of $\frac{1}{m}\mathbb{E}[DD^\top]$ by taking the space of the $k$ smallest eigenvectors of $\frac{1}{m}\mathbb{E}[\hat{D}\hat{D}^\top]$.

Next, we show how to recover the null space using $\hat{D}\hat{D}^\top$, rather than $\mathbb{E}[\hat{D}\hat{D}^\top]$. Assume that the following properties hold:
1. Equation 1 holds not only in expectation, but also with high probability. That is, with high probability, $\big\|\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top\big\|_2 \leq \epsilon$.
2. With high probability, $\lambda_{n-k}\big(\frac{1}{m}\hat{D}\hat{D}^\top\big) > 4\sigma^2 + \delta_0/2$, where $\lambda_i(\cdot)$ denotes the $i$th most significant eigenvalue.

Let $D = U\Sigma V^\top$ and $\hat{D} = \hat{U}\hat{\Sigma}\hat{V}^\top$ be SVD representations. We have that $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n = \hat{U}\big(\frac{1}{m}\hat{\Sigma}^2 - 2\sigma^2 I_n\big)\hat{U}^\top$. By Property 2, $\lambda_{n-k}\big(\frac{1}{m}\hat{\Sigma}^2\big) > 4\sigma^2 + \delta_0/2$. That is, the eigenvectors and their order are the same in $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n$ and $\frac{1}{m}\hat{D}\hat{D}^\top$. As a result, the projection matrix $\hat{P}$ on the $k$ least significant eigenvectors of $\frac{1}{m}\hat{D}\hat{D}^\top$ is the same as the projection matrix $Q$ on the $k$ least significant eigenvectors of $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n$.

⁴The assumption that with non-negligible probability a sample is non-noisy is not needed for the analysis and correctness of Phase 1 of Algorithm 2. This assumption only comes into play in the denoising phase.

Recall that $\hat{P}$, $P$, and $Q$ are the projection matrices on the $k$ least significant eigenvectors of $\frac{1}{m}\hat{D}\hat{D}^\top$, $\frac{1}{m}DD^\top$, and $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n$, respectively. As we discussed, $\hat{P} = Q$. Now, using the Wedin $\sin\theta$ theorem (Davis and Kahan 1970; Wedin 1972) from matrix perturbation theory (see the full version of the paper for details), we have
$$\|P - \hat{P}\|_2 = \|P - Q\|_2 \leq \frac{\big\|\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top\big\|_2}{\lambda_{n-k}\big(\frac{1}{m}\hat{D}\hat{D}^\top\big) - 2\sigma^2 - \lambda_{n-k+1}\big(\frac{1}{m}DD^\top\big)},$$
where we use Properties 1 and 2 and the fact that $\lambda_{n-k+1}\big(\frac{1}{m}DD^\top\big) = 0$ in the last transition.

Concentration: It remains to prove Properties 1 and 2. We briefly describe our proof that when $m$ is large, with high probability $\big\|\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top\big\|_2 \leq \epsilon$ and $\lambda_{n-k}\big(\frac{1}{m}\hat{D}\hat{D}^\top\big) > 4\sigma^2 + \delta_0/2$. Let us first describe $\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top$ in terms of the error matrices. We have
$$\frac{1}{m}\hat{D}\hat{D}^\top - 2\sigma^2 I_n - \frac{1}{m}DD^\top = \Big(\frac{1}{m}EE^\top - 2\sigma^2 I_n\Big) + \frac{1}{m}\big(DE^\top + ED^\top\big). \tag{2}$$
It suffices to show that for large enough $m > m_{\epsilon,\delta}$, $\Pr\big[\big\|\frac{1}{m}EE^\top - 2\sigma^2 I_n\big\|_2 \geq \epsilon\big] \leq \delta$ and $\Pr\big[\big\|\frac{1}{m}DE^\top + \frac{1}{m}ED^\top\big\|_2 \geq \epsilon\big] \leq \delta$. For the former, note that $\frac{1}{m}EE^\top$ is the sample covariance of the Gaussian noise matrix and $2\sigma^2 I_n$ is the true covariance matrix of the noise distribution. The next two claims follow from the convergence properties of sample covariances of Gaussians and the Matrix Bernstein inequality (Tropp 2015). See the full version of this paper (Blum and Haghtalab 2016) for more details.

Claim 1. $m = O\big(\frac{n\sigma^4}{\epsilon^2}\log\frac{n}{\delta}\big)$ is sufficient to get $\big\|\frac{1}{m}EE^\top - 2\sigma^2 I_n\big\|_2 \leq \epsilon$, with probability $1 - \delta$.

Claim 2. $m = O\big(\frac{n\sigma^2 M^2}{\epsilon^2}\,\mathrm{polylog}\frac{n}{\epsilon\delta}\big)$ is sufficient to get $\big\|\frac{1}{m}ED^\top\big\|_2 \leq \epsilon$, with probability $1 - \delta$.
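The two facts driving this argument are easy to see numerically. The sketch below (our own experiment, with arbitrary sizes) checks that the sample noise covariance $\frac{1}{m}EE^\top$ concentrates around $2\sigma^2 I_n$ as $m$ grows, and that subtracting a multiple of the identity shifts eigenvalues but leaves eigenvectors unchanged.

```python
# Empirical illustration (assumed sizes) of the two facts used above.
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 30, 0.4
for m in (100, 1000, 10000):
    E = np.sqrt(2) * sigma * rng.standard_normal((n, m))    # columns ~ N(0, 2 sigma^2 I_n)
    dev = np.linalg.norm(E @ E.T / m - 2 * sigma**2 * np.eye(n), ord=2)
    print(f"m = {m:6d}:  ||(1/m) E E^T - 2 sigma^2 I||_2 = {dev:.4f}")

# Subtracting 2 sigma^2 I changes the eigenvalues but not the eigenvectors (or their order).
M = rng.standard_normal((n, n)); M = M @ M.T                # an arbitrary PSD matrix
vecs1 = np.linalg.eigh(M)[1]
vecs2 = np.linalg.eigh(M - 2 * sigma**2 * np.eye(n))[1]
assert np.allclose(np.abs(vecs1.T @ vecs2), np.eye(n), atol=1e-6)
```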
We now prove that $\lambda_{n-k}\big(\frac{1}{m}\hat{D}\hat{D}^\top\big) > 4\sigma^2 + \delta_0/2$. Since for any two matrices the difference in $\lambda_{n-k}$ can be bounded by the spectral norm of their difference (see the full version of the paper for details), by Equation 2 we have
$$\lambda_{n-k}\Big(\frac{1}{m}\hat{D}\hat{D}^\top\Big) \geq \lambda_{n-k}\Big(\frac{1}{m}DD^\top\Big) + 2\sigma^2 - \Big\|\frac{1}{m}EE^\top - 2\sigma^2 I_n\Big\|_2 - \Big\|\frac{1}{m}\big(DE^\top + ED^\top\big)\Big\|_2 \geq \lambda_{n-k}\Big(\frac{1}{m}DD^\top\Big) + 2\sigma^2 - \frac{\delta_0}{4},$$
where in the last transition we use Claims 1 and 2 with the value $\delta_0/8$ to bound the last two terms by a total of $\delta_0/4$. Since $\lambda_{n-k}\big(\mathbb{E}[\frac{1}{m}DD^\top]\big) \geq 6\sigma^2 + \delta_0$, it is sufficient to show that $\big|\lambda_{n-k}\big(\mathbb{E}[\frac{1}{m}DD^\top]\big) - \lambda_{n-k}\big(\frac{1}{m}DD^\top\big)\big| \leq \delta_0/4$. Similarly to before, this is bounded by $\big\|\frac{1}{m}DD^\top - \mathbb{E}[\frac{1}{m}DD^\top]\big\|_2$. We use the Matrix Bernstein inequality to prove this concentration result; see the full version of this paper (Blum and Haghtalab 2016) for a proof.

Claim 3. $m = O\big(\frac{M^4}{\delta_0^2}\log\frac{n}{\delta}\big)$ is sufficient to get $\big\|\frac{1}{m}DD^\top - \mathbb{E}[\frac{1}{m}DD^\top]\big\|_2 \leq \frac{\delta_0}{4}$, with probability $1 - \delta$.

This completes the analysis of Phase 1 of our algorithm, and the proof of Lemma 4.2 follows directly from the above analysis and the application of Claims 1 and 2 with error $\epsilon\delta_0$, together with Claim 3.

4.2 Proof of Lemma 4.3 (Denoising Step)

We use the projection matrix $\hat{P}$ to partially denoise the samples while approximately preserving $\Delta = \mathrm{CH}(\{a_1, \ldots, a_k\})$. At a high level, we show that, in the projection of the samples by $\hat{P}$, (1) the regions around the $a_i$ have sufficiently high density, and (2) the regions that are far from $\Delta$ have low density.

We claim that if $\hat{x} \in \hat{S}$ is non-noisy and corresponds almost purely to one class, then $\hat{S}$ also includes a non-negligible number of points within $O(\epsilon')$ distance of $\hat{x}$. This is due to the fact that a non-negligible number of points (about $p_0\gamma m$ points) correspond to non-noisy and almost-pure samples that, using $P$, would get projected to points within distance $O(\epsilon')$ of each other. Furthermore, the inaccuracy in $\hat{P}$ can only perturb the projections by up to $O(\epsilon')$ distance. So, the projections of all non-noisy samples that are almost purely of class $i$ fall within $O(\epsilon')$ of $a_i$. The following claim, whose proof appears in the full version of this paper (Blum and Haghtalab 2016), formalizes this discussion. In the following claims, let $D$ denote the flattened distribution of the first paragraphs; that is, the distribution over $\hat{x}^1$ where we first take $w \sim P$, then take $(x^1, x^2) \sim D_w$, and finally take $\hat{x}^1$.

Claim 4. For all $i \in [k]$, $\Pr_{x \sim D}\big[\hat{P}x \in B_{\epsilon'/4}(a_i)\big] \geq p_0\gamma$.

On the other hand, any projected point that is far from the convex hull of $a_1, \ldots, a_k$ has to be noisy and, as a result, has been generated by a Gaussian distribution with variance $\sigma^2$. For a choice of $\epsilon'$ that is small with respect to $\sigma$, such points do not concentrate well within any ball of radius $\epsilon'$. In the next claim, we show that the regions that are far from the convex hull have low density.

Claim 5. For any $z$ such that $\mathrm{dist}(z, \Delta) \geq \epsilon'$, we have $\Pr_{x \sim D}\big[\hat{P}x \in B_{\epsilon'/2}(z)\big] \leq \frac{p_0\gamma}{4}$.

The next claim shows that in a large sample set, the fraction of samples that fall within any of the regions described in Claims 4 and 5 is close to the density of that region. The proof of this claim follows from the VC dimension of the set of balls.

Claim 6. Let $D'$ be any distribution over $\mathbb{R}^k$ and let $x_1, \ldots, x_m$ be $m$ points drawn i.i.d. from $D'$. Then $m = O\big(\frac{k}{\gamma}\log\frac{1}{\delta}\big)$ is sufficient so that with probability $1 - \delta$, for any ball $B \subseteq \mathbb{R}^k$ such that $\Pr_{x \sim D'}[x \in B] \geq 2\gamma$ we have $|\{x_i \mid x_i \in B\}| > \gamma m$, and for any ball $B \subseteq \mathbb{R}^k$ such that $\Pr_{x \sim D'}[x \in B] \leq \gamma/2$ we have $|\{x_i \mid x_i \in B\}| < \gamma m$.

Figure 2: Demonstrating the distinction between points close to $a_i$ and far from $a_i$. The convex hull $\mathrm{CH}(\hat{S} \setminus B_{r_2}(\hat{x}))$, which is a subset of the blue and gray region, intersects $B_{r_1}(\hat{x})$ only for $\hat{x}$ that is sufficiently far from the $a_i$s.
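The denoising rule that Claims 4-6 justify (Step 6 of Algorithm 2) is a simple neighbor-count filter; a minimal sketch of it, with helper names of our own, follows.

```python
# Sketch of the density-based denoising rule of Step 6: drop any projected point
# with fewer than p0*gamma*m/2 sample points within distance eps'/2 of it.
import numpy as np

def denoise(S_hat, eps_prime, p0, gamma):
    """S_hat: projected samples, one per row. Returns the retained subset."""
    m = len(S_hat)
    threshold = p0 * gamma * m / 2
    # Pairwise distances; O(m^2) memory is fine for a sketch, not for huge m.
    dists = np.linalg.norm(S_hat[:, None, :] - S_hat[None, :, :], axis=2)
    neighbor_counts = (dists <= eps_prime / 2).sum(axis=1)   # counts include the point itself
    return S_hat[neighbor_counts >= threshold]
```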
Therefore, upon seeing $\Omega\big(\frac{k}{p_0\gamma}\ln\frac{1}{\delta}\big)$ samples, with probability $1 - \delta$, for all $i \in [k]$ there are more than $p_0\gamma m/2$ projected points within distance $\epsilon'/4$ of $a_i$ (by Claims 4 and 6), and no point that is $\epsilon'$-far from $\Delta$ has more than $p_0\gamma m/2$ points in its $\epsilon'/2$-neighborhood (by Claims 5 and 6). Step 6 of Algorithm 2 leverages these properties of the set of projected points to denoise the samples while preserving $\Delta$: remove any point from $\hat{S}$ with fewer than $p_0\gamma m/2$ neighbors within distance $\epsilon'/2$. We conclude the proof of Lemma 4.3 by noting that the remaining points in $\hat{S}$ are all within distance $\epsilon'$ of $\Delta$. Furthermore, any point in $B_{\epsilon'/4}(a_i)$ has more than $p_0\gamma m/2$ points within distance $\epsilon'/2$. Therefore, such points remain in $\hat{S}$, and any one of them can serve as $\hat{a}_i$, for which $\|a_i - \hat{a}_i\| \leq \epsilon'/4$.

4.3 Proof of Lemma 4.4 (Phase 2)

At a high level, we consider two balls around each projected sample point $\hat{x} \in \hat{S}$ with an appropriate choice of radii $r_1 < r_2$ (see Figure 2). Consider the set of projections $\hat{S}$ when the points in $B_{r_2}(\hat{x})$ are removed from it. For $\hat{x}$ that is far from all $a_i$, this set still includes points that are close to $a_i$ for all topics $i \in [k]$. So, the convex hull of $\hat{S} \setminus B_{r_2}(\hat{x})$ is close to $\Delta$ and, in particular, intersects $B_{r_1}(\hat{x})$. On the other hand, for $\hat{x}$ that is close to some $a_i$, $\hat{S} \setminus B_{r_2}(\hat{x})$ does not include an extreme point of $\Delta$ or points close to it. So, the convex hull of $\hat{S} \setminus B_{r_2}(\hat{x})$ is considerably smaller than $\Delta$ and, in particular, does not intersect $B_{r_1}(\hat{x})$.

The geometry of the simplex and the angles between $a_1, \ldots, a_k$ play an important role in choosing the appropriate $r_1$ and $r_2$. Note that when the samples are perturbed by noise, $a_1, \ldots, a_k$ can only be approximately recovered if they are sufficiently far apart and the angle of the simplex at each $a_i$ is far from being flat. That is, we assume that for all $i \neq j$, $\|a_i - a_j\| \geq 3\epsilon$. Furthermore, define $r \geq 1$ to be the smallest value such that the distance between $a_i$ and $\mathrm{CH}(\Delta \setminus B_{r\epsilon}(a_i))$ is at least $\epsilon$. Note that such a value of $r$ always exists and depends entirely on the angles of the simplex defined by the class vectors. Therefore, the number of samples needed by our method depends on the value of $r$: the smaller the value of $r$, the larger the separation between the topic vectors and the easier it is to identify them (see Figure 3). The next claim, whose proof appears in the full version of this paper (Blum and Haghtalab 2016), demonstrates this concept.

Figure 3: Parameter $r$ is determined by the geometry of $\Delta$.

Claim 7. Let $\epsilon' = \epsilon/(8r)$. Let $\hat{S}$ be the set of denoised projections, as in Step 6 of Algorithm 2. For any $\hat{x} \in \hat{S}$ such that $\|\hat{x} - a_i\| > 8r\epsilon'$ for all $i$, we have $\mathrm{dist}\big(\hat{x}, \mathrm{CH}(\hat{S} \setminus B_{6r\epsilon'}(\hat{x}))\big) \leq 2\epsilon'$. Furthermore, for all $i \in [k]$ there exists $\hat{a}_i \in \hat{S}$ such that $\|\hat{a}_i - a_i\| < \epsilon'$ and $\mathrm{dist}\big(\hat{a}_i, \mathrm{CH}(\hat{S} \setminus B_{6r\epsilon'}(\hat{a}_i))\big) > 2\epsilon'$.

Given the above structure, it is clear that the points in $C$ are all within $\epsilon$ of one of the $a_i$s. So, we can cluster $C$ using single linkage with threshold $16r\epsilon'$ to recover each $a_i$ up to accuracy $\epsilon$.
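The final clustering step (Step 8 of Algorithm 2) can be sketched as follows; SciPy's hierarchical-clustering routines are used here purely as one convenient implementation, and the representative selection is our own simple choice.

```python
# Sketch of Step 8: single-linkage clustering of the candidate set C with a distance
# threshold; one representative per cluster is returned as the estimate a_hat_i.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_candidates(C, threshold):
    """C: candidate points (one per row); threshold: e.g. 16 * r * eps'."""
    if len(C) <= 1:
        return C
    Z = linkage(C, method="single")                       # single-linkage dendrogram
    labels = fcluster(Z, t=threshold, criterion="distance")
    return np.array([C[labels == c][0] for c in np.unique(labels)])
```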
5 Additional Results and Extensions

In this section, we briefly mention some additional results and extensions. We explain these, and discuss other extensions (such as alternative noise models), in more detail in the full version of this paper (Blum and Haghtalab 2016).

Sample Complexity Lower Bound: As we observed, the number of samples required by our method is $\mathrm{poly}(n)$. However, as the number of classes can be much smaller than the number of features, one might hope to recover $v_1, \ldots, v_k$ with a number of samples that is polynomial in $k$ rather than $n$. Here, we show that in the general case $\Omega(n)$ samples are needed to learn $v_1, \ldots, v_k$, regardless of the value of $k$. See the full version of this paper (Blum and Haghtalab 2016) for more information.

General function $f(\cdot)$: We also consider the general model described in Section 2, where $f_i(x) = f(v_i \cdot x)$ for an unknown strictly increasing function $f : \mathbb{R}^+ \to [0, 1]$ such that $f(0) = 0$. We describe how variations of the techniques discussed up to now extend to this more general setting. See the full version of this paper (Blum and Haghtalab 2016) for more information.

Alternative Noise Models: We also discuss two additional noise models and interesting open problems that arise in these settings. In the first model, we consider the problem of recovering $v_1, \ldots, v_k$ in the presence of agnostic noise, where for an $\epsilon$ fraction of the samples $(x^1, x^2)$, $x^1$ and $x^2$ correspond to different mixture weights. In the second model, we consider the case of $p_0 = 0$, that is, when every document is affected by Gaussian noise $N(0, \sigma^2 I_n)$, for $\sigma \gg \epsilon$.

Acknowledgments

This work was partially supported by NSF grants CCF-1525971, CCF-1535967, and CCF-1800317; a Microsoft Research Ph.D. Fellowship; and an IBM Ph.D. Fellowship.

References

Anandkumar, A.; Liu, Y.-k.; Hsu, D. J.; Foster, D. P.; and Kakade, S. M. 2012. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 917-925.
Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research 15(1):2773-2832.
Arora, S.; Ge, R.; Kannan, R.; and Moitra, A. 2012. Computing a nonnegative matrix factorization provably. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing (STOC), 145-162.
Arora, S.; Ge, R.; Halpern, Y.; Mimno, D. M.; Moitra, A.; Sontag, D.; Wu, Y.; and Zhu, M. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learning (ICML), 280-288.
Arora, S.; Ge, R.; and Moitra, A. 2012. Learning topic models: going beyond SVD. In Proceedings of the 53rd Symposium on Foundations of Computer Science (FOCS), 1-10.
Balcan, M.-F., and Blum, A. 2010. A discriminative model for semi-supervised learning. Journal of the ACM (JACM) 57(3):19.
Balcan, M.-F.; Blum, A.; and Yang, K. 2004. Co-training and expansion: Towards bridging theory and practice. In Advances in Neural Information Processing Systems (NIPS), 89-96.
Bansal, T.; Bhattacharyya, C.; and Kannan, R. 2014. A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Advances in Neural Information Processing Systems (NIPS), 1997-2005.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993-1022.
Blum, A., and Haghtalab, N. 2016. Generalized topic modeling. arXiv preprint arXiv:1611.01259.
Blum, A., and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Conference on Computational Learning Theory (COLT), 92-100.
Chapelle, O.; Schölkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press, 1st edition.
Dasgupta, S.; Littman, M. L.; and McAllester, D. 2002. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems (NIPS), 375-382.
Davis, C., and Kahan, W. M. 1970. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis 7(1):1-46.
Hofmann, T. 1999. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, 289-296.
Papadimitriou, C. H.; Tamaki, H.; Raghavan, P.; and Vempala, S. 1998. Latent semantic indexing: A probabilistic analysis. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 159-168.
Sun, S. 2013. A survey of multi-view machine learning. Neural Computing and Applications 23:2031-2038.
Tropp, J. A. 2015. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571.
Wedin, P.-Å. 1972. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12(1):99-111.