Journal of Machine Learning Research 18 (2018) 1-61 Submitted 9/16; Revised 10/17; Published 4/18

The Search Problem in Mixture Models

Avik Ray avik@utexas.edu
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, TX 78701, USA

Joe Neeman neeman@iam.uni-bonn.de
Department of Mathematics
Rheinische Friedrich-Wilhelms-Universität Bonn
D-53115 Bonn, Germany

Sujay Sanghavi sanghavi@mail.utexas.edu
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, TX 78701, USA

Sanjay Shakkottai shakkott@austin.utexas.edu
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, TX 78701, USA

Editor: Animashree Anandkumar

©2018 Avik Ray, Joe Neeman, Sujay Sanghavi, and Sanjay Shakkottai. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-483.html.

Abstract

We consider the task of learning the parameters of a single component of a mixture model, for the case when we are given side information about that component; we call this the search problem in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computational complexity compared to existing moment-based mixture model algorithms (e.g., tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvements in runtime and accuracy.

Keywords: mixture models, search, side information, semi-supervised, method of moments

1. Introduction

Mixture models denote the statistical setting where observed samples can come from one of several distinct underlying populations, each typically with its own probability distribution, but are not labeled as separate in the data presented. They have been used to model a wide variety of phenomena, and have seen great success in practice, going back as far as Pearson (1894). In this paper we consider (what we call) the search problem in the mixture model setting: given some special side information about one of the mixture components, is it possible to efficiently learn the parameters of that component only? Given that there are known methods for learning the entire set of parameters of various mixture models, efficient here means more efficient (statistically and/or computationally) than existing methods for learning all the parameters. As an example, we consider the latent Dirichlet allocation model for document generation. In this model, the underlying population corresponds to the set of topics in a document, which determines the frequencies of different words in the document.
Side information could be a word that is more common in the topic of interest than it is in any other topic: for example, the word semi-supervised might work if the topic of interest is machine learning. Side information could also consist of a small number of labelled examples. We might have a small collection of documents about machine learning and also a much larger corpus that includes documents from many topics. Our methods will allow us to leverage the large, unlabelled corpus to obtain good estimates for word frequencies in machine learning articles and these estimates will be much better than anything that could be learned from the small labelled sample. Main contributions: We propose a general setting for side information in mixture models, and show how to solve the search problem by estimating certain matrices of moments. We prove error bounds on the resulting estimates; our rates have a sharp dependence on the sample size (although they are possibly not sharp in the other parameters). We then specialize our approach to four popular families of mixture models: Gaussian mixture models with spherical covariances, latent Dirichlet allocation for topic models, mixed linear regression, and subspace clustering. We give concrete algorithms for these four families. Our results also include new moment derivations for mixed linear regression and subspace clustering models. Finally, we simulate our algorithm on both real and synthetic data sets for the Gaussian mixture model, topic model, and subspace clustering applications. For synthetic data set we compare its performance to the tensor decomposition methods discussed by Anandkumar et al. (2014) in both GMM and LDA models, and k-means for subspace clustering. We show that our methods outperform the baseline when the side information is informative. We also demonstrate the practical applicability of our algorithms on three real data sets the NY Times data set of news articles, Yelp data set of business reviews, and BSDS500 data set of images. In the first two text corpus, we show our algorithm recovers more coherent topics than topic modeling algorithm by Arora et al. (2013). In the BSDS500 data set, we demonstrate how our algorithm can be used for parallel image segmentation. In all three cases, our algorithm also exhibits significant computational gains over competing unsupervised and semi-supervised algorithms. The Search Problem in Mixture Models 1.1 Related Work There is a vast literature on mixture models; too much to even summarize here. We will therefore focus this section on two more closely related areas: method of moments estimators for mixture models, and learning with side information. Mixture models and method of moments: A common method for learning mixture models is the EM algorithm of Dempster et al. (1977), which outputs a complete set of model parameters. However, EM may converge slowly (or not at all) [Redner and Walker 1984]; this weakness of EM has spurred a resurgence in method-of-moments estimators for mixture models. Although these methods go back to the pioneering work of Pearson (1894) on Gaussian mixture models, the last several years have seen important advances. Moitra and Valiant (2010), and Hardt and Price (2015) showed that Gaussian mixture models with two components can be learned in polynomial time. Hsu and Kakade (2013) considered mixtures of more Gaussians, but constrained to have spherical covariances. 
They gave a method based on third-order tensor decompositions, which was later generalized to other models in Anandkumar et al. (2014).

Learning with side information: As has been observed many times, often in practice one has access to a set of data that is somewhat richer than standard models of data in learning theory. The term side information is used as a catch-all for extra data that doesn't fit into pre-existing models; as such, the literature contains many incomparable models of side information. Xing et al. (2002) and Yang et al. (2010) took unsupervised clustering as their starting point. For them, side information arrived as pairs of points that were known to belong to the same cluster; they showed how this extra information could substantially improve the performance of the k-means algorithm. Kuusela and Ocone (2004) developed a framework for side information in the PAC learning model, in which extra samples with a particular dependence on the original samples could sometimes give a substantial benefit. Many different types of metadata have been proposed for the latent Dirichlet allocation (LDA) model of document generation. Mcauliffe and Blei (2008) introduced the supervised LDA model, in which each document comes with an additional response variable from a generalized linear model. On the other hand, Rosen-Zvi et al. (2004) proposed the author-topic model, in which the metadata (author names) affects the distribution of the documents themselves. From a more experimental point of view, Lu and Zhai (2008) used long, detailed product reviews as side information for categorizing short snippets and blog entries. The notion of semi-supervised learning (see the book by Chapelle et al. (2006)) is also related to our framework of side information. In semi-supervised learning, the learner has access to a small number of labelled examples and a large number of unlabelled examples. This setting is useful for us too, although our general method does not strictly require data of this form.

2. Basic Idea and Algorithm

We now first briefly describe the basic mixture model setting, and then describe our method. These descriptions cover several popular specific examples of mixture models, and we detail the application to each of them in Section 3.

Setting: We are interested in the standard statistical setting of (parametric) mixture models: that is, samples are drawn i.i.d. from a distribution f given by

f(x) = Σ_{i=1}^k αi g(x; µi).

Here g corresponds to a known parametric class of distributions, and k is the number of mixture components. The corresponding parameter vectors are µ1, . . . , µk, and their mixture weights / probabilities are α1, . . . , αk. So, for example, in the case of the standard (spherical) Gaussian mixture model, g(x; µi) is the Gaussian pdf N(µi, I). Thus each sample can be considered to be drawn by first selecting a mixture component µi with probability αi, and then drawing the sample x according to g(x; µi). We assume all the µi's are linearly independent. This is a common assumption for learning mixture models using spectral methods.

Search problem: The standard parameter estimation problem is to find all the µi vectors given samples. In this paper we are interested in the search problem: we are given side information about one of the vectors (say µ1, without loss of generality) and we would like to recover only µ1.
Of course, we would like to do this with sample and computational complexity lower than what would be required to estimate all parameter vectors (i.e., lower complexity than the standard case).

Side information: Our general procedure requires the following model for side information: we assume that we have access to a vector v such that the inner product with the parameter vector µ1 (the special one we are searching for) is higher than the inner product with any of the other µi; i.e., there exists δ > 0 such that

⟨µ1, v⟩ ≥ (1 + δ)⟨µi, v⟩ for all i ≠ 1.

Section 3 shows how to obtain such side information in some specific models of interest: spherical Gaussian mixture models, mixed linear regression, subspace clustering and the LDA topic model. We remark that it is also possible (and perhaps more intuitive in some situations) to ask for side information satisfying |⟨µ1, v⟩| ≥ (1 + δ)|⟨µi, v⟩|. However, our assumption above is slightly weaker, since for any v satisfying the latter assumption, either v or −v satisfies the former assumption. Later, we show the above condition is sufficient for uniquely identifying the required parameter µ1 (but it may not be necessary). We refer to the side information vector v as informative about µ1 if it satisfies the above condition.

2.1 General Procedure

The main idea behind the method of moments is to use samples to estimate certain moments of the distribution f(x), from which we can recover the parameters of interest. For many mixture models (including the four common examples we detail), it is possible to easily and directly estimate, using first and second order moments, given sufficient samples, the vector

m = Σ_{i=1}^k αi µi, (1)

and the matrix

A = Σ_{i=1}^k αi µi µiᵀ. (2)

For example, in many models the estimate of the vector m is simply the sample mean, and the matrix A can be derived from the sample covariance matrix. The exact procedure for estimating m and A varies according to the particular parametric model g. The fact that m and A (and also higher-order tensors) can be estimated from samples is well known for many models; see Anandkumar et al. (2014) for a treatment of several different models, and for other pointers to the literature. Typically, all mixture model components cannot be identified from just the first and second order moments (or m and A). It is often necessary to compute even higher order moment terms. In our search problem, given the side information, we develop procedures to estimate an alternative matrix B, using higher order moments, given by

B = Σ_{i=1}^k αi ⟨µi, v⟩ µi µiᵀ. (3)

Again, the exact procedure for estimating B from samples depends on the particular parametric model g. For this section, we assume we are able to estimate A, B, m to within some accuracy. We will use the notation Â, B̂, m̂ to denote these finite sample estimates of A, B, m respectively, and n denotes the number of samples used to compute these estimates. With this in hand, we outline two general procedures for estimating µ1 (i.e., the component that we are interested in). The first procedure is based on a whitening step, much like the one that is used in the spectral algorithms of Hsu and Kakade (2013) and Anandkumar et al. (2012), and the tensor decomposition methods of Anandkumar et al. (2014) (please see the remarks in Section 3 for the differences for specific models). The second procedure uses a line search instead, and may be computationally favorable when k is large, because it avoids the need to invert a k × k matrix.
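To make the whitening procedure concrete, the following is a minimal numpy sketch of the steps that Algorithm 1 below formalizes. It assumes the estimates Â, B̂, m̂ and the number of components k are given, and that Â is symmetric positive semidefinite (as in Theorem 1); the function name and the way the coefficient a is read off in whitened coordinates are our own illustrative choices, not a reference implementation.

```python
import numpy as np

def whitening_search(A_hat, B_hat, m_hat, k):
    """Sketch of the whitening method: return estimates (mu1_hat, alpha1_hat)
    of the component singled out by the side information used to build B_hat."""
    s, U = np.linalg.eigh(A_hat)           # eigenvalues in ascending order
    V = U[:, -k:]                           # top-k eigenvectors of A_hat (d x k)
    d_half = np.sqrt(s[-k:])                # diag of D^{1/2}
    # Whitened B_hat: D^{-1/2} V^T B_hat V D^{-1/2}.
    Wb = (V / d_half).T @ B_hat @ (V / d_half)
    evals, evecs = np.linalg.eigh(Wb)
    u = evecs[:, -1]                        # largest eigenvector
    w = V @ (d_half * u)                    # w = V D^{1/2} u (equals sqrt(alpha1)*mu1 in the exact case)
    # Coefficient of w when V V^T m_hat is written as a*w + y with y in the
    # span of {V D^{1/2} v : v orthogonal to u}; in whitened coordinates this
    # is simply the component along u.
    a = u @ ((V.T @ m_hat) / d_half)
    return w / a, a ** 2
```

In the exact-moment case this returns µ1 and α1, mirroring Lemma 2 below; with estimated moments it returns the corresponding finite-sample estimates.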
Both Algorithms 1 and 2 take as input the estimates ˆA, ˆB, ˆm (where ˆB is constructed using side information vector v) and they output estimates of the first mixture component ˆµ1, and also the proportion of the first component ˆα1. 2.1.1 The Whitening Method Our main result about Algorithm 1 is that if ˆA and ˆB are good estimates of A and B then Algorithm 1 outputs good estimates for µ1 and α1. In order to interpret Theorem 1 as an error rate, note that if all parameters but ϵ are fixed then the error is O(ϵ). Since standard concentration results yield ϵ = O(n 1/2), where n is the number of samples; our error rate in terms of n is also O(n 1/2). This rate is sharp, since it is also the rate for estimating the mean of a single Gaussian vector (i.e. a GMM with only one component). Theorem 1 Suppose that µ1, . . . , µk are linearly independent, and that ˆA is positive semidefinite. Also suppose that µ1, v (1 + δ) µi, v for all i = 1. Assume that max{ A ˆA , B ˆB , m ˆm } ϵ < σk(A)/4, Ray, Neeman, Sanghavi, and Shakkottai Algorithm 1 Extracting a mixture component from side information: the whitening method. Input: ˆA, ˆB, ˆm Output: ˆµ1, ˆα1 1: let {σj, vj} be the singular values and singular vectors of ˆA, in non-increasing order 2: let V be the d k matrix whose jth column is vj 3: let D be the k k diagonal matrix with Djj = σj 4: let u be the largest eigenvector of D 1/2V T ˆBV D 1/2 5: let w = V D1/2u 6: let E be the span of {V D1/2v : v u} 7: write V V T ˆm (uniquely) as aw + y, where y E 8: return w/a and a2 and that the right hand side of (4) is at most α1. Then µ1 ˆµ1 CR|α 1/2 1 ˆα 1/2 1 | + C σ1(A) α1 η , and |α1 ˆα1| C α1(α1R + η) η + R ϵ σk(A) + ϵ (4) where η = ϵσ1 δσ5/2 k , R = maxi µi , σ1(A) σk(A) > 0 are the non-zero singular values of A = P i αiµiµT i , and C is a universal constant. Our error bounds are somewhat complicated, and depend on many different parameters, so let us elaborate on them slightly. First of all, the dependence on σ1(A) and σk(A) is of the order µ1 ˆµ1 σ1(A)3/2/σk(A)5/2, which is probably an artifact of the analysis, and not the true behavior of the algorithm. On the other hand, our dependence on ϵ is optimal: we have |α1 ˆα1| ϵ and µ1 ˆµ1 ϵ. Note also that our bound has no explicit dependence on k; this feature comes from the fact that our method is targeted at a single mixture component. By comparison, other methods typically give bounds in which the averaged per-mixture-component error does not depend on k. In terms of dependence on k, therefore, our bounds are better than previous bounds if there is only one component of interest. Finally, let us remark on the assumption that the right hand side of (4) is at most α1. This amounts to an assumption that ϵ is sufficiently small compared to all the other parameters. Without this assumption, the bound in (4) would not be very interesting, since |α1 ˆα1| α1 is too weak to give useful information about ˆα1 (it could even be zero). We defer the actual analysis of Algorithm 1 to the appendix, but we will motivate the algorithm and give the basic idea of the proof by showing that if ˆA, ˆB, and ˆm are equal to A, B and m respectively then Algorithm 1 outputs µ1 and α1 exactly. Lemma 2 Let m, A, and B be defined by in (1), (2), and (3), where µ1, . . . , µk are linearly independent. If µ1, v > µi, v for all i = 1 and we apply Algorithm 1 to A, B, and m, then it returns µ1 and α1. The Search Problem in Mixture Models Proof Let V and D be as defined in Algorithm 1. 
Since A has rank k, i=1 αi D 1/2V T µiµT i V D 1/2 = D 1/2V T AV D 1/2 = Ik. Defining ui := αi D 1/2V T µi, we have P i uiu T i = Ik, which implies that the ui are orthonormal in Rk. Now, D 1/2V T BV D 1/2 = i=1 αi µi, v D 1/2V T µiµT i V D 1/2 = i=1 µi, v uiu T i . Since µ1, v was assumed to be larger than all other µi, v , it follows that u1 is the largest eigenvector of D 1/2V T BV D 1/2. Now, if w = V D1/2u1 then w = α1µ1. Now, note that since the µi are linearly independent, there is a unique way to write m = V V T m = P i αiµ1 as aw +y, where y belongs to the span of {µ2, . . . , µk} (which is the same as the span of {V D1/2ui : i 2}. Moreover, the unique choice of a that allows this representation must satisfy aw = α1µ1, which implies that a = α1. Therefore, w/a = µ1 and a2 = α1. The proof of Lemma 2 is crucial to understanding the algorithm, and also the broader message of this article: if we can get hold of two different normalizations of something, then we can learn something about it. In the proof of Lemma 2, this happens twice: first, we use the fact that A and B contain the same components (but with differing normalizations) to extract the span of a single component of interest. The differing normalization is crucial, because A by itself does not uniquely determine the set {µ1, . . . , µk}, much less single out a specific component of interest. In the second step of Lemma 2, we know α1µ1, which is not enough to determine either α1 or µ1. However, we also have access to m, which involves a contribution of α1µ1. Exploiting the difference between these two normalizations, we recover both α1 and µ1. 2.1.2 The Cancellation Method Our second method avoids the matrix inversion in Algorithm 1, preferring a line search instead. In the above Algorithm 2, we assume µ1, v > 0. When this is not the case and B is a negative semi-definite matrix, we simply have to change the line search step to search for the smallest λ < 0 such that b V b V T ( b A λ b B)b V b V T is PSD. Theorem 3 shows that with m, A, B estimated up to O(ϵ) error, the parameter estimation error in Algorithm 2 is also bounded as O(ϵ). Theorem 3 Suppose {µ1, . . . , µk} are linearly independent and v satisfies µ1, v (1 + δ) µi, v for all i = 1. Suppose that max{ b A A , b B B , ˆm m } < ϵ, and λ1 := 1/ µ1, v . Then Algorithm 2 returns ˆµ1, ˆα1 with ˆµ1 µ1 < Cϵ α2 1a2 1 σ1(A) 1 + α1a1 σk 1(Zλ1) |ˆα1 α1| < Cσ1(A)ϵ η1 + η2Rη3 σk 1(Zλ1) Ray, Neeman, Sanghavi, and Shakkottai Algorithm 2 Extracting a mixture component from side information: the cancellation method. Input: ˆA, ˆB, ˆm Output: ˆµ1, ˆα1 1: let b V be the d k matrix of k largest eigenvectors of ˆA; 2: search over λ to find the largest λ = λ such that b V b V T ( b A λ b B)b V b V T is PSD; 3: let b Zλ = b A λ b B, and let {v2, . . . , vk} be the top k 1 singular vectors of b Zλ 4: let V1:(k 1) be the d (k 1) matrix with columns {v2, . . . , vk} 5: let x1 = ˆm V1:(k 1)V T 1:(k 1) ˆm 6: let v1 = x1/ x1 7: compute ci = v T 1 b Avi for i = 1 to k 8: let ai = ci/ x1 for i = 1 to k 9: return ˆµ1 = Pk i=1 aivi and ˆα1 = c1/a2 1 where η1 := max{α1a1(2a1 + 1), 20}, η2 := max{α1a2 1, 10}, η3 = max {1, λ1, σ1(B)} , R = max µi , a1 = µ1 Q V µ1 , where V = span{µ2, . . . , µk}, and C is an universal constant. Again, we will defer the actual analysis to the appendix, and instead show that Algorithm 2 returns the exact answer when fed exact initial data. We will do this in two lemmas: Lemmas 4 and 5. Lemma 4 Let Z = Pk i=1 γiµiµT i where {µ1, . . . 
, µk} are linearly independent, µi Rd, γi R and d > k. If γ1 < 0 and γi > 0 for all i = 1 then Z is not positive semi-definite. Proof Let Π denote the projection onto the orthogonal complement of span{µ2, . . . , µk}. Let x = Πµ1, and note that x, µ1 > 0 but x, µi = 0 for all i = 1. Hence, x T Zx = γ1 x, µ1 2 < 0 and so Z is not positive semi-definite. Lemma 5 Let m, A, and B be defined by in (1), (2), and (3), where µ1, . . . , µk are linearly independent. If µ1, v > µi, v for all i = 1 and we apply Algorithm 2 to A, B, and m, then it returns µ1 and α1. Proof Define wi = µi, v and let γi = αi(1 λwi), so that Zλ = A λB = i=1 γiµiµT i . Note that, in our case where b A = A, and b B = B, columns of b V simply form a common orthonormal bases of the row/column space of both matrices A, B. Therefore the matrix b V b V T (A λB)b V b V T = A λB = Zλ. Now for λ > 1 w1 , γ1 < 0 and for all λ 1 w1 , γi 0 for all i since w1 > wi, for every i = 1. By Lemma 4, λ = 1 w1 is the largest λ such that Zλ is PSD; hence, i=2 αi(1 λ wi)µiµT i . The Search Problem in Mixture Models From Lemma 26 in Appendix E.2 it follows that k 1 singular vectors {v2, . . . , vk} of Zλ form a basis of the subspace V = span{µ2, . . . , µk}. Let V be the perpendicular space of V, and write Π = I V1:(k 1)V T 1:(k 1) for the orthogonal projection onto V . Since Πµi = 0 for i = 1, we have x1 = Πm = αΠµ1. Now define b1, . . . , bk by µ1 = Pk i=1 bivi. In order to prove that the algorithm returns µ1 correctly, we need to show that bi = ai := ci/ x1 . Indeed, ci := v T 1 Avi = j=1 αjv T 1 µjµT j vi = α1b1bi, since v T 1 µj = 0 for j = 1. On the other hand, x1 = α Πµ1 = αb1, and so bi = ai, as claimed. Moreover, ˆα1 = c1 a2 1 = α1, as claimed. Optimization for λ : The first step of Algorithm 2 involves finding a smallest λ such that b Z λ = b V b V T ( b A λ b B)b V b V T is PSD using line search. Although b Z λ is a d d matrix, this step can be performed efficiently as follows. Instead of searching for λ directly for b Z λ, we do this for a smaller k k matrix b V T b Z λ b V = b V T ( b A λ b B)b V . This optimization step using line search can be performed in just O(k3 log |λ |) time. 3. Specific Models In this section we discuss how the search algorithms can be applied in four specific mixture models. 3.1 Gaussian Mixture Model with Spherical Covariance The model: Besides the mixture parameters α1, . . . , αk, the Gaussian mixture model (GMM) has mean parameters µ1, . . . , µk Rd and variance parameters σ1, . . . , σk R. The conditional densities g( ; µi, σi) are Gaussian, with mean µi and covariance σ2 i Id. Explicitly, g(x; µi, σi) = 1 (2πσ2 i )d/2 e x µi 2 Matrices A and B: We fix a vector v Rd, with the assumption that v, µ1 > v, µi for i = 1. Recall (from Section 2.1) that m = E[x] = P i αiµi, A = Pk i=1 αiµiµT i , and B = Pk i=1 αi µi, v µiµT i . To compute these quantities, we first define σ2 to be the (k+1)thlargest eigenvalue of the mixture covariance matrix E[(x m)(x m)T ], and let u be a corresponding eigenvector. Then let em = E[x(u T (x m))2]. Then it follows from moment computations (see Hsu and Kakade (2013)) that: A = E[xx T ] σ2Id B = E[ x, v xx T ] emv T v em T em, v Id, Given the samples {ˆxi}, we can now empirically evaluate these quantities (denoted by ˆm, ˆA, ˆB respectively) by replacing expectations above by the corresponding sample averages; for instance we replace E[xx T ] by b E[xx T ] .= (1/n) Pn j=1 ˆxj ˆx T j . 
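As an illustration, the following numpy sketch computes the empirical estimates m̂, Â, B̂ for the spherical GMM by replacing the expectations in the expressions above with sample averages. The function name is hypothetical; σ̂², û, and m̃ are estimated exactly as described (the (k+1)-th largest eigenvalue of the sample covariance, its eigenvector, and Ê[x(ûᵀ(x − m̂))²]).

```python
import numpy as np

def gmm_moment_estimates(X, v, k):
    """Empirical (m_hat, A_hat, B_hat) for a spherical GMM.
    X is an (n, d) array of samples, v the side information vector, k the
    number of components."""
    n, d = X.shape
    m_hat = X.mean(axis=0)
    Exx = X.T @ X / n                        # sample E[x x^T]
    cov = Exx - np.outer(m_hat, m_hat)       # sample covariance
    evals, evecs = np.linalg.eigh(cov)       # ascending eigenvalues
    sigma2 = evals[-(k + 1)]                 # (k+1)-th largest eigenvalue
    u = evecs[:, -(k + 1)]                   # corresponding eigenvector
    w = (X - m_hat) @ u                      # u^T (x - m_hat) for each sample
    m_tilde = (X * (w ** 2)[:, None]).mean(axis=0)   # E[x (u^T(x-m))^2]
    A_hat = Exx - sigma2 * np.eye(d)
    xv = X @ v                               # <x, v> for each sample
    B_raw = (X * xv[:, None]).T @ X / n      # E[<x,v> x x^T]
    B_hat = (B_raw - np.outer(m_tilde, v) - np.outer(v, m_tilde)
             - np.dot(m_tilde, v) * np.eye(d))
    return m_hat, A_hat, B_hat
```

These estimates can then be fed directly to the whitening or cancellation procedures of Section 2.1.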
Ray, Neeman, Sanghavi, and Shakkottai Examples of v: Assuming that µ1 2 > µ1, µi for all i = 1 this will be true, for example, if µi are all the same one can find a suitable vector v given a relatively small number of samples from the first mixture component. Specifically, if µ1 2 µ1, µi + δ and µi R for all i = 1 then standard Gaussian tail bounds imply the following: if v := ℓ 1 Pℓ j=1 xj where ℓ= Ω(R2δ 2 log k) and x1, . . . , xm are drawn independently from the distribution g( ; µ1, σ1) then with high probability v satisfies v, µ1 > v, µi for all i = 1. Here, high probability means probability converging to 1 as the hidden constant in ℓ= Ω( ) grows. Note here that the number of tagged samples is nowhere near sufficient to estimate µ1 by direct averaging; indeed to do so would require the number of samples to grow with the size of the underlying dimension. Remarks: We note that spectral algorithms which uses the whitening procedure has been proposed before in the context of GMM e.g. Hsu and Kakade (2013). The primary difference between the algorithm in Hsu and Kakade (2013) and Algorithm 1 is that the former, in absence of side information, takes a projection of the third order moment tensor M3 on a random unit vector to obtain the second matrix, where as our matrix B can be viewed as a projection of M3 on the side information vector v. The main advantage of projecting onto v is that, when we have reliable side information, this will give a good singular value separation resulting in better empirical performance. The Cancellation algorithm however is distinctly different from both and has not been studied before. 3.2 Latent Dirichlet Allocation The model: In the LDA model with k topics and a dictionary of size d, the parameters µ1, . . . , µk d 1 are the probability distributions corresponding to each topic ( d 1 denotes the probability simplex {y Rd : P i yi = 1, mini yi 0}). The LDA model introduced in Blei et al. (2003) differs slightly from the other models as the mixture distribution cannot be expressed exactly in the parametric form in Section 2. Instead we have a two level hierarchy as follows. Given α = (α1, . . . , αk), we first draw a topic distribution θ from the Dirichlet( α) distribution. Given this θ = (θ1, . . . , θk) each word in the document is drawn i.i.d. from the distribution Pk i=1 θiµi. However still we can compute the vector m and the matrices A, B as shown below. Then with an appropriate v our algorithms can recover the topic distribution µ1. Matrices A and B: Let x1 denote the random vector with x1(w) = 1 if the first word is w, and 0 otherwise. Similarly define vectors x2, x3 corresponding to the second and third word respectively, and let α0 = Pk i=1 αi. Then, moment computations under the LDA distribution yields the following expressions for (m, A, B), defined in (1), (2), (3): m = α0E[x1], A = α0(α0 + 1)E[x1x T 2 ] mm T B = α0(α0 + 1)(α0 + 2) 2 E[ x3, v x1x T 2 ] α0(α0 + 1) 2 m, v E[x1x T 2 ] + E[ x3, v x1m T ] + E[ x3, v mx T 2 ] + m, v mm T . With the given document samples, let ˆxi denote the normalized empirical word frequencies in the document i. Then, ˆm = α0 n Pn i=1 ˆxi, and b A, b B can be immediately estimated using the above expressions by replacing expectations with sample averages. The Search Problem in Mixture Models Using labeled words to find v: In order to recover the topic distribution µ1 we now require a vector v which satisfies µ1, v > µi, v for i = 1. 
Now suppose we are given a labeled word ℓsuch that its occurrence probability in topic 1 is the highest, i.e., µ1(ℓ) > µi(ℓ) for i = 1 (note that this does not mean ℓis the most frequent word in topic 1, there may be words with higher occurrence probability in this topic). Then we can simply choose v = eℓ(the standard basis element with 1 in the ℓ-th coordinate). For most topics of practical interest it is possible to find such labeled words. For example the word ball can be a labeled word for topic sport, party is a labeled word for topic politics and so on. However, a labeled word is merely indicative of a topic and is not exclusive to a topic (e.g. the word ball can occur in other contexts as well). In this sense, the labelled word is quite different from the anchor word described in Arora et al. (2013). Note however that anchor words are also labeled words (but not vice-versa) since for an anchor word ℓ, µ1(ℓ) > 0 and µi(ℓ) = 0 for i = 1. Using labeled documents to find v: If the different topics are not too similar, then we can estimate a suitable vector v from a small collection of documents that are mostly about the topic of interest. For example, if µi, µj η µi µj for all i = j, and if we observe a total of m words from some collection of documents with θ1 (1 + δ)(1/2 + η) then about m = Ω(δ 2 log k) words will suffice to find a suitable vector v. Remarks: Similar to the case of GMM, a spectral algorithm using whitening procedure to estimate LDA components have been presented before in Anandkumar et al. (2012). Again the main difference with our Whitening algorithm being the fact that in Anandkumar et al. (2012) the second matrix is constructed by taking a random projection of the third order moment tensor Triples, and in Algorithm 1 this is constructed as a projection onto v. Empirically this results is a more stable algorithm due to guaranteed singular value separation. The Cancellation algorithm has not been previously studied in LDA model. 3.3 Mixed Regression The model: In mixed linear regression the mixture samples generated are of the form y = x, µi + ξ, where x N(0, I) and noise ξ N(0, σ2). As before, a sample is generated using the i-th linear component µi, with probability αi. We have access to the observations (y, x) but the particular µi and ξ are unknown. Hence the conditional density g(x, y; µi, σ) is a multivariate Gaussian where x N(0, I), y N(0, µi 2 + σ2), and Cov(x, y) = µi. Matrices A and B: To compute A and B, we consider the following moments (for more detailed derivations, see Appendix C): Ray, Neeman, Sanghavi, and Shakkottai M1,1 = E[yx] = M2,2 = E[y2xx T ] = 2 i=1 αiµiµT i + i=1 αi(σ2 + µi 2)I M3,1 = E[y3x] = 3 i=1 αi(σ2 + µi 2)µi M3,3 = E[y3 x, v xx T ] = 6 i=1 αi µi, v µiµT i + M3,1v T + v MT 3,1 + M3,1, v I Let τ 2 be the smallest singular value of the matrix M2,2. Then we can compute m, A, B as follows. m = M1,1, A = 1 2(M2,2 τ 2I) B = 1 6(M3,3 (M3,1v T + v MT 3,1 + M3,1, v I)) As in the previous cases with finite samples the estimates ˆm, b A, b B can be computed by taking their empirical expectations e.g., c M1,1 = b E[yx] = 1 n Pn i=1 ˆyiˆxi and so on, where (ˆyi, ˆxi) denote the i-th sample. Examples of v: Suppose we are given a few random labeled examples from the first component. Then assuming µ1 2 > µ1, µi + δ, µi 2 R, similar to the GMM case we can estimate a v := 1 ℓ Pℓ j=1 ˆyj ˆxj using only ℓ= Ω R4δ 2 log k labeled samples so that µ1, v > µi, v holds with high probability. 
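For concreteness, here is a numpy sketch of the empirical moment estimates for mixed linear regression, following the expressions for M1,1, M2,2, M3,1, M3,3 and the resulting m, A, B given above; the function name is hypothetical and the code simply replaces expectations by sample averages.

```python
import numpy as np

def mixed_regression_moments(X, y, v):
    """Empirical (m_hat, A_hat, B_hat) for mixed linear regression.
    X is (n, d), y is (n,), v is the side information vector."""
    n, d = X.shape
    M11 = (y[:, None] * X).mean(axis=0)                 # E[y x]
    M22 = (X * (y ** 2)[:, None]).T @ X / n             # E[y^2 x x^T]
    M31 = ((y ** 3)[:, None] * X).mean(axis=0)          # E[y^3 x]
    xv = X @ v                                          # <x, v> per sample
    M33 = (X * (y ** 3 * xv)[:, None]).T @ X / n        # E[y^3 <x,v> x x^T]
    tau2 = np.linalg.svd(M22, compute_uv=False)[-1]     # smallest singular value of M22
    m_hat = M11
    A_hat = 0.5 * (M22 - tau2 * np.eye(d))
    corr = np.outer(M31, v) + np.outer(v, M31) + np.dot(M31, v) * np.eye(d)
    B_hat = (M33 - corr) / 6.0
    return m_hat, A_hat, B_hat
```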
Remarks: Our construction of the second matrix B is a consequence of some new moment results for the mixed linear regression model. We present these detailed moment derivations in Appendix C.4. This also results in improved sample complexity bounds over previous moment based algorithms (discussed in Section 3.5). 3.4 Subspace Clustering The model: Besides the mixture parameters α1, . . . , αk, the subspace clustering model has parameters U1, . . . , Uk Rd m and σ R, where the matrices U1, . . . , Uk have orthonormal columns. The conditional distribution g( ; Ui) is a standard Gaussian variable supported on the column space of Ui, plus independent Gaussian noise. More precisely, we sample y N(0, Id) and set x = Ui UT i y + ξ, where ξ N(0, σ2Id) is independent of y. Matrices A and B: The subspace clustering model does not quite fit into the basic method of Section 2; one motivation for presenting it is to show that the basic ideas in Section 2 are more flexible than they first appear. Suppose v Rd satisfies UT 1 v > UT i v The Search Problem in Mixture Models for all i = 1. We consider A := E[xx T ] σ2Id = i=1 αi Ui UT i B := E[ x, v 2xx T ] σ2v T Av Id σ2 v 2A σ4( v 2Id + vv T ) 2σ2(Avv T + vv T A) i=1 αi UT i v 2Ui UT i + 2 i=1 αi Ui UT i vv T Ui UT i and their empirical versions ˆA and ˆB (the computation giving the claimed formula for B is carried out in Appendix C). Now with these ˆA and ˆB, we can recover the subspace U1 using Algorithm 3. This algorithm uses the same principle behind the whitening method in Section 2.1.1, the key difference is that here we pick the top m eigenvectors of the whitened B matrix. Algorithm 3 Subspace clustering algorithm Input: ˆA, ˆB Output: ˆU 1: let {σj, vj} be the singular values and singular vectors of ˆA, in non-increasing order 2: let V be the d mk matrix whose jth column is vj 3: let D be the mk mk diagonal matrix with Djj = σj 4: let Y = [u1, . . . , um] be the matrix of m largest eigenvectors of D 1/2V T ˆBV D 1/2 5: let Z = V D1/2Y 6: let the columns of ˆU be the m eigenvectors of the matrix ZZT The following perturbation theorem guarantees that if the side information vector v is substantially more aligned with the subspace spanned by U1 than it is with any other subspace, and the matrices A, B are estimated within ϵ accuracy, then Algorithm 3 can recover the required subspace with a small error. Theorem 6 Suppose that ˆA A ϵ and ˆB B ϵ. Suppose that the side information vector v satisfies Uiv 2 (1/3 δ) U1v 2. Then output ˆU of Algorithm 3 satisfies ˆU ˆUT U1UT 1 Cϵα 1 1 σ1(A)2σmk(A) 2δ 1. We prove Theorem 6 in Appendix F. Note that the conditions on v can be satisfied if the spaces Ui satisfy a certain affinity condition and we have a few labelled samples from U1. Specifically, suppose that u, w < ( 1 3 η) u w for every u U1 and w Ui, i = 1. Then any v U1 will satisfy the assumption of Theorem 6. Hence, a single labelled sample from U1 (or several depending on η noisy samples) is enough to find a suitable v. Remarks: To the best of our knowledge Algorithm 3 is the first moment based algorithm for the subspace clustering model. The detailed moment derivations are presented in Appendix C.5. Also our generative model allows samples to be noisy, hence they do not lie exactly on the subspace but close to it. Such a setting has not been considered in most subspace clustering literature. 
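The following is a minimal numpy sketch of Algorithm 3, assuming Â and B̂ have already been estimated (with B̂ built from the side information vector v) and that the number of subspaces k and their dimension m are known; the function name is illustrative.

```python
import numpy as np

def subspace_search(A_hat, B_hat, k, m):
    """Sketch of Algorithm 3: recover an orthonormal basis for the target
    subspace from the moment estimates A_hat and B_hat."""
    s, U = np.linalg.eigh(A_hat)               # ascending eigenvalues
    V = U[:, -(m * k):]                         # top mk eigenvectors (d x mk)
    d_half = np.sqrt(s[-(m * k):])              # diag of D^{1/2}
    # Whitened B_hat: D^{-1/2} V^T B_hat V D^{-1/2}.
    Wb = (V / d_half).T @ B_hat @ (V / d_half)
    evals, evecs = np.linalg.eigh(Wb)
    Y = evecs[:, -m:]                            # top-m eigenvectors
    Z = V @ (d_half[:, None] * Y)                # Z = V D^{1/2} Y (d x m)
    # The column space of Z estimates the target subspace; take the top-m
    # eigenvectors of Z Z^T (an SVD of Z would give the same basis).
    _, Uz = np.linalg.eigh(Z @ Z.T)
    return Uz[:, -m:]
```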
Ray, Neeman, Sanghavi, and Shakkottai 3.5 Comparison In this section we compare the theoretical performance of the Whitening and Cancellation algorithms with other algorithms. Both Whitening and Cancellation algorithms require estimating the quantities m, A, B by computing moments from the samples. Therefore the sample complexity primarily depends on how well these quantities concentrate. We compute the specific sample complexities for each model in Appendix G. For Gaussian mixture model the sample complexity of our algorithm scales as Ω(dϵ 2 log d) similar to moment based algorithm by Hsu and Kakade (2013) and tensor decomposition based algorithm by Anandkumar et al. (2014). In terms of runtime the Whitening algorithm is faster than the tensor decomposition based algorithm by Anandkumar et al. (2014). This can be viewed as follows. The first step in both the algorithms take O(d2k) time to compute the whitening matrix and in subsequent whitening steps. However, computing the largest eigenvector in Algorithm 1 takes only O(k2) time, faster than O(k5 log k) time required for rank-k tensor power iteration (we also verify this in our experiments in Section 4). In LDA topic model our algorithms have a sample complexity of Ω(ϵ 2 log d), again similar to tensor decomposition based algorithm by Anandkumar et al. (2014), and nonnegative matrix factorization (NMF) based algorithm by Arora et al. (2013). The Whitening algorithm again is faster than tensor decomposition as argued for GMM case. The NMF based algorithm using optimization based Recover KL/Recover L2 procedures also has a runtime of O(d2k) similar to our algorithms (in Section 4 again we observe our algorithm to be faster in practice). The spectral topic modeling algorithm in Anandkumar et al. (2012) also has a computation complexity O(d2k) similar to our algorithms. However, its sample complexity has a high Ω(k5) dependence on the number of components. This spectral algorithm also suffer from instability in practice due to the random projection step (as noted in Anandkumar et al. 2014). In the case of mixed linear regression again our method has a sample complexity of Ω(dϵ 2 log d) similar (upto log factors) to the convex optimization based approach by Chen et al. (2014), alternating minimization based approach by Yi et al. (2014), but better than tensor decomposition based method of Sedghi et al. (2016) which has a sample complexity of Ω(d3ϵ 2). However unlike the convex optimization and alternating minimization based techniques our method is also applicable when the number of components k > 2. As argued in GMM case the Whitening algorithm is again faster than the tensor algorithm by Sedghi et al. (2016). Subspace clustering algorithms like greedy subspace clustering by Park et al. (2014), optimization based algorithms by Elhamifar and Vidal (2009), Soltanolkotabi and Candes (2012), requires the samples to exactly lie on a subspace. In contrast our moment based algorithm works even when the samples are noisy and perturbed from the actual subspace. Our subspace clustering algorithm also has a sample complexity of Ω(mϵ 2 log d) which is similar (up to log factors) to greedy subspace clustering algorithm by Park et al. (2014). We note that it is possible to use approximation methods like randomized svd to further speed up the Whitening, Cancellation and tensor decomposition based algorithms by Anandkumar et al. (2014), however this will result in decreased accuracy in both algorithms. We refer to Huang et al. 
(2015) for such stochastic optimization, and parallelization techniques used to speed up the tensor algorithms. The Search Problem in Mixture Models In a setting where side information is provided on each of the k components, observe that we can run the Whitening algorithm independently for each of the k components, possibly in parallel. Hence we can recover all k components, without loosing the runtime advantage of the Whitening algorithm. We demonstrate this application on real data set in Section 4.2. In terms of the overall computation time, it can be shown that running the Whitening algorithm for all k components is still faster than the tensor decomposition based algorithm by Anandkumar et al. (2014), when k = Ω(n 1 3 d 1 3 ). 4. Experiments In this section we present the empirical performance of our Whitening, Cancellation, and Subspace clustering algorithms. We consider three of the settings: the Gaussian Mixture Model (GMM), and Latent Dirichlet Allocation (LDA), and Subspace clustering, and validate our algorithms on both real and synthetic data sets. 4.1 Synthetic Data Set First we compare the sample complexity and runtime of our algorithms with the robust tensor decomposition algorithm by Anandkumar et al. (2014), which is based on tensor power iteration, for learning mixture models (we refer to this as the TPM algorithm). Our second baseline algorithm is a faster heuristic of TPM where we start the tensor power iterations initialized with side information vector v, and recover just the first component. We refer this as the Fast-TPM algorithm. For the Cancellation algorithm we compute the optimum λ for cancellation using two different techniques as follows. First, let b Z λ = V T b ZλV, where V is the matrix of top k singular vectors of b A. In the first method, we perform a line search over positive λ to find the minimum λ such that σk( b Z λ) falls below certain threshold. This method works well in GMM case. In a second method we minimize the convex function b Z λ + λ, subject to λ 0. This method performs better in the case of LDA. Note that for the Cancellation algorithm after estimating λ, instead of using m and A to find µ1 we can follow the same steps using m = Av and B to recover µ1. Theoretically it has the same performance, however empirically we observe this to work slightly better and we use this version for our experiments. We implement all algorithms for our synthetic data experiments using MATLAB. Performance metric: We compute the estimation error of parameter µ1 as E = ˆµ1 µ1 . In our figures we plot the quantity percentage relative error gain which is defined as G = 100(ET EA)/ET , where ET is the TPM error and EA is the error for Whitening / Cancellation / Fast-TPM algorithm. Note that a positive error gain implies that the TPM error is greater than that of the competing algorithm. In the subspace clustering model we plot similar percentage relative error gain over the baseline k-means algorithm. Gaussian mixture model: We generate synthetic data sets for GMM with different k, d, αi, σ, and v. Figure 1 shows the percentage relative error gains of the Whitening, Cancellation, and Fast-TPM algorithms over the TPM algorithm in a GMM with various values of k, d, αi, σ, and n. The µi were generated randomly over the sphere of norm r = 10. We define αmin := mini αi. The side information vector v was chosen as follows. Let {v1, . . . , vk} be a orthonormal basis of span{µ1, . . . , µk}, such that {v2, . . . , vk} span{µ2, . . . , µk}. 
Then we choose v = γ v1 + √((1 − γ)/(k − 1)) Σ_{i=2}^k vi for some γ ∈ (0, 1) such that the condition ⟨µ1, v⟩ > ⟨µi, v⟩ is satisfied.

Figure 1: Percentage relative error gain of the Whitening, Cancellation, and Fast-TPM algorithms over the TPM algorithm for 5 components of increasing size, in a GMM with k = 10, d = 500, αmin = .0253, σ ∈ {.4, .5}, and three different sample sizes (a) n = 6000 (b) n = 8000 (c) n = 10000. Our algorithms show increasingly better gain over TPM and Fast-TPM as αi, σ and n increase.

We observe that in all the cases, our algorithms have lower error (positive error gain) than both the tensor algorithms. Moreover, our methods' advantage increases with increasing proportion αi, increasing sample size n, and increasing variance σ. We also observe that the Fast-TPM algorithm has the same error performance as TPM (error gain close to zero). Figure 2 gives an example where the Whitening algorithm can successfully recover even rare components. Here we consider a GMM with k = 10, d = 500, with the rarest component having probability αmin = .0037. Again we observe positive relative error gains over the TPM algorithm for an increasing number of samples n. In Figure 3 we plot the speedup of the algorithms over TPM, and observe that the Whitening and Cancellation algorithms are much faster (high speedup) than the TPM algorithm. We also observe that the Fast-TPM algorithm is faster than the TPM and Cancellation algorithms, but slower than the Whitening algorithm. Note that, while it is also possible to speed up the basic TPM algorithm compared here using techniques such as randomized SVD and stochastic tensor gradient descent [Huang et al. 2015], such approximate methods will reduce the overall accuracy. Moreover, the randomized SVD techniques can also be applied to the search algorithms presented in this paper, to obtain further speedups.

Topic Modeling: We generate a synthetic LDA document corpus according to the model in Blei et al. (2003). The lengths of the documents are generated using a Poisson(L) distribution, where L is the mean document length. In Figure 4 we plot the percentage relative error gain of the Whitening, Cancellation, and Fast-TPM algorithms over the TPM algorithm. Our side information was a labeled word w satisfying µ1(w) > µi(w) for i ≠ 1. Again we observe positive error gains over the TPM algorithm.
Although the Fast-TPM algorithm sometimes performs better than TPM for more frequent topics, the Whitening algorithm still outperforms it. Note that the performance varies across topics, since the probability of the labeled word is different for each topic.

Figure 2: Percentage relative error gain of the Whitening algorithm over the TPM algorithm in the presence of rare components (αmin = .0037), for a GMM with k = 10, d = 500, σ ∈ {.3, .4, .5, .6}, and number of samples (a) n = 5000 (b) n = 6000 (c) n = 8000. The Whitening algorithm recovers even the rarest component, with increasing error gain over TPM as the number of samples increases.

Figure 3: Average speedup of the Whitening, Cancellation, and Fast-TPM algorithms over TPM, for 5 components of increasing size, in a GMM with k = 10, d = 500, αmin = .0253, σ ∈ {.4, .5}, and three different sample sizes (a) n = 6000 (b) n = 8000 (c) n = 10000. The Whitening algorithm is the fastest.

Figure 4: Percentage relative error gain in each component of the Whitening, Cancellation, and Fast-TPM algorithms over the TPM algorithm in an LDA model with k = 5, d = 500, mean document length L ∈ {2000, 3000}, and number of documents (a) n = 4000 (b) n = 6000 (c) n = 8000. The Whitening algorithm shows an improvement over TPM and Fast-TPM with increasing samples.

Subspace Clustering: We generate synthetic data for the subspace clustering model described in Section 3.4 using parameters d = 500, k = 5, m = 10, and αi ∈ [.1, .3]. First we generate k = 5 random subspaces with orthonormal bases {Ui}_{i=1}^k, each of dimension m = 10. Then we generate random points on these subspaces, and add white Gaussian perturbations with σ ∈ {.1, .2}. We choose the side information vector v similarly to the sensitivity experiment for the GMM, ensuring ‖U1ᵀ v‖ > ‖Uiᵀ v‖ for i ≠ 1. Note that due to the added Gaussian noise, our samples do not lie exactly on the subspaces {Ui}_{i=1}^k, but close to them. Traditional subspace clustering algorithms, which assume points lie exactly on a subspace, may not perform well. The TPM algorithm is also not well suited for this model since (a) the required moment tensor would be of 4th order, resulting in high computation cost, and (b) even if mk basis vectors of the tensor are recovered, finding the target subspace would involve a further combinatorial search over the mk-choose-m candidate subspaces to find the one having the strongest projection of v. Therefore we choose the k-means algorithm as our baseline for this model and compare with Algorithm 3. First we compute k clusters using k-means, then we find an m-dimensional basis for each cluster using SVD, and finally we choose the target subspace as the one having the largest projection of v. If Û1 is the estimated orthonormal basis for the target subspace U1, we compute the error as E = ‖Û1Û1ᵀ − U1U1ᵀ‖ / ‖U1U1ᵀ‖. Figure 5 shows that Algorithm 3 has much better error performance than k-means. In the speedup plots in Figure 6 we also observe that our subspace search algorithm is over 4X faster than k-means.

4.2 Real Data Sets

Topic Modeling: In this section we compare the performance of the Whitening algorithm with a recent non-negative matrix factorization based topic modeling algorithm by Arora et al. (2013) (we refer to this as the NMF algorithm), and also the semi-supervised version of this NMF algorithm (we refer to this as SS-NMF). We test on two large real data sets: (a) the New York Times news article data set [UCI 2008] (300,000 articles), and (b) the Yelp data set of business reviews [Yelp 2014] (335,022 reviews). We run both algorithms for k = 100 topics. For this experiment we do not consider the TPM algorithm by Anandkumar et al. (2014) since its runtime with k = 100 topics becomes extremely large on these data sets.1
In contrast, the NMF algorithm is known to be faster, and produce topics of comparable quality to more popular variational inference based algorithms [Blei et al. 2003]. The side information for this experiment are chosen as follows. First from the set of topics produced by NMF algorithm we choose a subset of interpretable topics, then we choose labeled words representative of these topics. We test with a set of 62 labeled words for NY Times data set and 54 labeled words for Yelp data set. Note that given labeled word wl the whitening algorithm produces one topic distribution µ1, but the NMF algorithm finds k topics. Therefore for NMF algorithm the target topic i is the one which has the highest probability of the labeled word i.e., µi(wl). For the semi-supervised NMF we first compute Ray, Neeman, Sanghavi, and Shakkottai the weighted word-word co-occurrence matrix Qw where we re-weigh each document by the normalized frequency of the labeled word wl. Then we apply the NMF algorithm [Arora et al. 2013] on this weighted matrix Qw. All three algorithms were implemented in Python. Performance metric: We compare the quality of the topics returned by Whitening, NMF, and SS-NMF algorithms using the pointwise mutual information (PMI) score, known to be a good metric for topic coherence [Newman et al. 2010; R oder et al. 2015]. However in order to also capture the relevance of the estimated topic to the labeled word we compute PMI score for topic i as, PMI(topic i) = 1 log p(wl, w) where wl is the labeled word, T i 20 is the set of top 20 words in the i-th topic. The probabilities p(wl, w), p(w), p(wl) are computed over a larger data set of English Wikipedia articles to reduce noise [Newman et al. 2011]. For whitening algorithm we choose α0 = .01. Note that other supervised topic modeling algorithms e.g. supervised LDA by Mcauliffe and Blei (2008), labeled LDA by Ramage et al. (2009) require a much stronger notion of side-information than just labeled words, hence we could not compare with them. Figure 7: Figure comparing the performance of Whitening, NMF [Arora et al. 2013], and semi-supervised NMF (SS-NMF) algorithms on NY Times and Yelp data sets. (a) Topics estimated by Whitening algorithm have the best PMI score in 40 out of 62 labeled words for NY Times data set, and 35 out of 54 labeled words in Yelp data set. (b) Whitening shows more than 2X speedup over competing algorithm in both data sets. In Figure 7 (a) we plot the percentage of labeled words for which each algorithm has the best PMI score. Observe that for most labeled words (40 out of 62 labeled words for NY Times data set, and 35 out of 54 labeled words in Yelp data set) the Whitening algorithm estimates topic with better PMI score over NMF and SS-NMF algorithms. The Whitening algorithm is also more than twice as fast as NMF and SS-NMF2 as shown in Figure 7 (b). 2. For large corpus the NMF algorithm runs much faster than Gibbs sampling and variational inference based algorithms [Arora et al. 2013]. The Search Problem in Mixture Models A complete list of topics and PMI scores returned by the algorithms for every labeled word is presented in Tables 2, 3 of Appendix B. Notice that the Whitening algorithm often estimates more coherent topics which are more relevant to the given labeled word than topics produced by the NMF/SS-NMF algorithm. 
For example in NY Times data set with the labeled word student the Whitening algorithm returns top five words in the topic as student, school, teacher, percent, program; however those returned by NMF algorithm are test, school, student, ignore, export; and those by SS-NMF algorithm are student, university, shooting, shot, rampage. Parallel image segmentation: One method to perform image segmentation is to use GMM clustering. In this experiment we demonstrate how GMM search algorithm can be used to parallelize image segmentation in vision applications. For this we consider the BSDS500 data set introduced in Arbelaez et al. (2011) and choose a subset of 70 images having less than 4 segments in the ground truth. Note that this data set has up to six ground truth segmentation by human users for each image. We randomly choose one pixel from each segment in ground truth as side-information v. We compare our Whitening algorithm with the seeded k-means clustering [Basu et al. 2002] where the centers are initialized by these side-information pixels (we refer to this as s-Kmeans). The Whitening algorithm uses one pixel from the i-th cluster to compute µi, in parallel for every i, and then it assigns each pixel to its closest µi. The segmentation quality is compared using normalized mutual information (NMI) metric [Manning et al. 2008]. To avoid local minimum in s-Kmeans we consider the maximum NMI over 5 initializations of side-information for each ground truth, and then we compute average NMI over all ground truths for an image. Figure 8: Figure comparing the performance of image segmentation by Whitening (row 3) and s-Kmeans (row 2) algorithms, with images selected from the BSDS500 data set. The side-information pixels are shown in red plus in the original image (row 1). In the segmented images (rows 2, 3) the segments are shown in different shades. Observe that the Whitening algorithm often isolates the foreground segment better than s-Kmeans. We summarize our result in Table 1. Observe that the Whitening algorithm has a slightly better NMI performance over s-Kmeans in the BSDS test data set and similar performance Ray, Neeman, Sanghavi, and Shakkottai Data set N NW NK TW (s) TK (s) NMIW NMIK BSDS test 30 17 13 6.7 81.5 0.17 0.13 BSDS train 25 12 13 8.2 89.8 0.15 0.15 BSDS val 15 8 7 10.6 117.2 0.11 0.09 Table 1: Table comparing the performance of Whitening and s-Kmeans algorithm on BSDS data set. N is the total number of images, NW is the number of images where segmentation produced by Whitening has a better NMI than s-Kmeans, and NK is the number of images where segmentation of s-Kmeans has a better NMI. TW is the median runtime of Whitening algorithm and TK is the median runtime of s-Kmeans. NMIW and NMIK are the median NMI scores for the Whitening and s-Kmeans algorithms respectively. Whitening runs much faster than s-Kmeans. in BSDS train and BSDS val data sets. However the Whitening algorithm runs an order of magnitude faster than s-Kmeans. 5. Conclusion and Discussion In this paper we developed a new, simple and flexible framework for incorporating side information into mixture model learning. The underlying motivation was to provide a principled way to take into account extra input (e.g. generated by human data analysts etc.). Even for cases where this input is very limited compared to the size/dimensionality of the data, we show meaningful statistical and computational performance improvement over baseline unsupervised and semi-supervised methods. 
More generally, developing methods which work with very limited human input is, in our opinion, a promising research direction.

Acknowledgments

We would like to acknowledge support from NSF grants CNS-1320175 and 0954059, ARO grants W911NF-15-1-0227, W911NF-14-1-0387, and W911NF-16-1-0377, and the US DoT supported D-STOP Tier 1 University Transportation Center. The authors also acknowledge the Texas Advanced Computing Center [TACC 2018] at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.

Appendix A. More Experiments for Gaussian Mixture Models

In Figure 9 we show the sensitivity of the Whitening and Cancellation algorithms for a GMM with k = 20, d = 500, all components having equal probability, and two different values of σ and n. Observe that the percentage error gain of the algorithms decreases with decreasing values of δ = \min_{i \neq 1} \langle µ_1, v \rangle / \langle µ_i, v \rangle, as we would expect, and it eventually becomes negative when the performance becomes worse than that of the TPM algorithm. Here too the Cancellation algorithm shows less sensitivity, and hence better performance, than the Whitening algorithm.

Figure 9: Sensitivity plots showing how the percentage relative error gain of the Whitening and Cancellation algorithms over the TPM algorithm decreases with decreasing values of the parameter δ = \min_{i \neq 1} \langle µ_1, v \rangle / \langle µ_i, v \rangle, for a GMM with k = 20, d = 500, all equal probability components (α_i = 0.05), for different values of the variance σ ∈ {0.5, 0.6}, and two different sample sizes (a) n = 6000, (b) n = 8000.

Appendix B. Complete Results on New York Times and Yelp Data Sets

In this section we provide more detailed results of our experiments on the NY Times and Yelp data sets. In Tables 2 and 3 we show, for every labeled word, the top five words in the topics computed by the Whitening, NMF, and SS-NMF algorithms along with their corresponding PMI scores.

Table 2: Results of topic search by the Whitening and NMF algorithms on the NY Times data set of 300,000 news articles using K = 100 topics and 62 labeled words.
NY Times data set Label word Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI Whitening flight security passenger airport hour 0.1424 NMF security government official percent bill 0.0499 SSNMF passenger plane flight fire crash 0.1711 Whitening coach season job team head 0.2637 Ray, Neeman, Sanghavi, and Shakkottai Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI NMF team coach season player jet 0.1740 SSNMF coach arrived assistant defenseman ended 0.1756 Whitening information question today eastern daily 0.0255 NMF art show dessert book home 0.0769 SSNMF art artist show painting museum 0.1250 Whitening campaign al gore money political republican 0.1530 NMF al gore campaign george bush president bush 0.1608 SSNMF nra florida article senator presidential 0.0926 Whitening corp meeting list dividend partial 0.0815 NMF corp meeting list group dividend 0.0570 SSNMF partial energy dividend meeting corp 0.0254 Whitening tax cut taxes percent income 0.2126 NMF graf president bush mail information 0.0722 SSNMF tax income cut taxes site 0.2279 Whitening cup minutes food article add 0.0227 NMF buy panelist flavor thought product 0.0130 SSNMF tobacco chef restaurant pastry article 0.1495 Whitening oil cup minutes prices companies 0.1460 NMF oil million prices percent market 0.0928 SSNMF oil company listing largest brazil 0.0902 Whitening court case law decision lawyer 0.2288 NMF official court case attack government 0.1285 SSNMF chicago court decision ruling justices 0.1834 Whitening election ballot vote voter florida 0.2132 NMF election ballot al gore bush vote 0.2155 SSNMF gained election article presidential independence 0.1702 Whitening case court lawyer death trial 0.1830 NMF official court case attack government 0.1017 SSNMF lawyer rat legal client jokes 0.1314 Whitening mail official anthrax attack worker 0.0600 NMF anthrax official mail worker letter 0.0156 SSNMF anthrax poverty cb show return -0.0776 Whitening tiger wood shot round player tour 0.1288 NMF tiger wood shot round player play 0.1356 SSNMF misstated master tee hit golf 0.1356 Whitening mail anthrax official test found -0.0763 NMF anthrax official mail worker letter -0.1097 SSNMF mas bacteria con una anos -0.2420 Whitening film movie director character actor 0.1906 NMF article misstated new york company million 0.0288 SSNMF kiss film actress article role 0.1295 Whitening million www percent building night 0.0481 NMF team tour lance armstrong won race -0.0405 SSNMF tourist million visitor official campaign 0.0995 Whitening race won win run track 0.1129 NMF race won horse win kentucky derby SSNMF horse truck road official killed 0.0433 Whitening campaign george bush bush election republican 0.2449 NMF al gore campaign george bush president bush 0.1868 SSNMF republican democrat democratic house parties 0.1053 Whitening computer system microsoft program software 0.1904 The Search Problem in Mixture Models Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI NMF company computer microsoft system companies 0.1533 SSNMF computer chip mail program buy 0.1903 palestinian Whitening palestinian israel israeli yasser arafat peace 0.2189 NMF palestinian israel official israeli yasser arafat 0.1950 SSNMF palestinian reformer reform authority arab 0.1519 Whitening film movie director character actor 0.1492 NMF film show actor movie thought 0.0901 SSNMF red sox movie interview seattle host 0.0388 Whitening player play won game women 0.1054 NMF game play player point andre agassi 0.1187 SSNMF motif tennis season pros image 
0.1480 Whitening won night fight win sport 0.0566 NMF fight mike tyson lennox lewis million round 0.1181 SSNMF fight pound fighter beat boxing 0.1254 Whitening music song record album band 0.2298 NMF music company million companies napster 0.0812 SSNMF music mp3 customer digital online 0.0150 Whitening cup minutes add oil tablespoon 0.0608 NMF cup minutes add tablespoon water 0.0431 SSNMF coffee bean tablespoon cup ground -0.0765 Whitening bush US official system administration 0.1223 NMF official bush government US nuclear 0.1356 SSNMF ibm nuclear computer research fastest -0.0253 Whitening race car driver team season 0.1443 NMF car race driver team season 0.1319 SSNMF sport file los angeles racing notebook -0.0640 Whitening military taliban war afghanistan us 0.0916 NMF taliban official afghanistan government us 0.0796 SSNMF russian war chechnya army veteran 0.1296 quarterback Whitening yard season game play team 0.2389 NMF game team play yard season 0.1773 SSNMF effort quarterback ucla heroic alabama 0.1472 Whitening stock market percent company fund 0.1585 NMF percent stock market company companies 0.1338 SSNMF stock market price shares investment 0.0507 Whitening game run yard play hit 0.1782 NMF run game inning hit season 0.1361 SSNMF ball hit run inning home 0.1708 Whitening patient doctor care health drug 0.2532 NMF official virus percent new york found 0.1003 SSNMF patient study doctor article brain 0.1334 Whitening won win round shot tiger wood 0.1029 NMF fight mike tyson lennox lewis million round 0.0955 SSNMF olympic champion final meet medalist 0.1177 Whitening business company question information companies 0.0887 NMF information eastern commentary daily business 0.0311 SSNMF publication business send released businesses 0.0996 Whitening government official country federal political 0.1524 NMF graf president bush mail information 0.0767 SSNMF program government computer local newspaper 0.0784 Whitening season team game games play 0.1799 NMF team game season play games 0.1406 Ray, Neeman, Sanghavi, and Shakkottai Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI SSNMF season cotton fact simple variety 0.0626 Whitening death case lawyer court trial 0.1333 NMF advise spot earlier held today -0.0340 SSNMF prison inmates security population bed 0.1472 Whitening file spot internet read output 0.0359 NMF file spot new york sport los angeles 0.0228 SSNMF wonderful mail al gore george bush message 0.0766 Whitening air part high wind rain 0.1963 NMF air wind shower rain storm 0.1939 SSNMF chicago sun times nominated rain east thought 0.0179 Whitening game team play games season 0.2000 NMF team game season play games 0.1722 SSNMF covering game tonight coverage celebration 0.0531 Whitening election ballot vote percent voter 0.2068 NMF election ballot al gore bush vote 0.1870 SSNMF voter poll percent primary election 0.2067 Whitening player team season game sport 0.1691 NMF team chicago white sox mariner season player 0.1803 SSNMF velocity baseball air shot test 0.0629 Whitening student school teacher percent program 0.2077 NMF test school student ignore export 0.0729 SSNMF student university shooting shot rampage 0.1396 Whitening president vice white house george bush executive 0.2116 NMF graf president bush mail information 0.0758 SSNMF hedge president television broadway produced 0.0226 Whitening taliban afghanistan military us war 0.1684 NMF taliban official afghanistan government us 0.1413 SSNMF afghan afghanistan blanket friend country 0.0577 Whitening team games won women 
american 0.1822 NMF team tour lance armstrong won race 0.0348 SSNMF endit medal honor winner newspaper 0.0786 Whitening school student teacher high program 0.1566 NMF test school student ignore export 0.0388 SSNMF teacher program pay school teaching 0.1499 Whitening show home network television night 0.1721 NMF los angeles daily new spot newspaper new york show 0.1456 SSNMF clinton home television survived tonight -0.0090 Whitening al gore campaign election political republican 0.1837 NMF al gore campaign george bush president bush 0.1677 SSNMF environmental democratic national committee nominee fund 0.0813 Whitening cup minutes add oil tablespoon 0.1039 NMF cup minutes add tablespoon water 0.1072 SSNMF flavor panelist ounces buy onion 0.1188 Whitening student school college teacher program 0.1314 NMF game season team play coach -0.0595 SSNMF campus operation aol building center 0.0645 Whitening car driver race racing seat 0.2047 NMF car race driver team season 0.1222 The Search Problem in Mixture Models Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI SSNMF car team race driver winston cup 0.1516 Whitening companies percent company business industry 0.1430 NMF music company million companies napster 0.0821 SSNMF xxx show trade software entertainment 0.1161 Whitening film today system movie team -0.0054 NMF wire inadvertently kill mandatory today -0.0750 SSNMF captor planet film kill astronomer 0.0949 Whitening bill money member system number 0.1257 NMF bill tax bush member percent 0.0287 SSNMF donation card credit account voted 0.1382 Whitening race car driver won win 0.1917 NMF car race driver team season 0.1814 SSNMF amazing race show tonight sit 0.0502 Whitening cup minutes food add oil 0.0499 NMF wine wines percent company million 0.0748 SSNMF wine wines bottle bottles age 0.1082 Whitening case death lawyer court trial 0.1952 NMF official court case attack government 0.1363 SSNMF prosecutor lawyer attorney incorrectly general 0.1406 Whitening team season game player play 0.1654 NMF team game season play games 0.1558 SSNMF team qualify olympic article member 0.1530 Whitening percent market economy stock cut 0.1528 NMF percent stock market company companies 0.1048 SSNMF percent economy quarter rate recession 0.1452 Whitening air high part wind rain 0.1909 NMF air wind shower rain storm 0.1895 SSNMF wash wind school winter white 0.1902 Whitening microsoft computer system company software 0.1981 NMF company computer microsoft system companies 0.1911 SSNMF xxx software industry show trade 0.1222 Table 3: Results of topic search by Whitening and NMF algorithms on Yelp data set of 335, 022 reviews of businesses using K = 100 topics and 54 labeled words. 
Yelp data set Label word Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI Whitening cheese pizza time sandwich back 0.1842 NMF bagel coffee bagels cheese sandwich 0.1666 SSNMF bartender cheese tasty made server 0.0555 Whitening hair salon nails nail back 0.0678 NMF hair absolute cut beautiful salon -0.0192 SSNMF salon manicure back nail clean 0.0375 Whitening mexican burrito tacos salsa cheese 0.0506 NMF mexican fresh burrito tacos time 0.0389 SSNMF exit mexican bland restaurants world -0.0720 Whitening chicken chinese rice hot fast 0.0978 NMF chicken chinese fast rice time 0.0717 SSNMF chinese area type lot east 0.0455 Ray, Neeman, Sanghavi, and Shakkottai Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI Whitening coffee find things tea starbucks 0.1079 NMF find store things tea oil 0.0470 SSNMF tea coffee starbucks safeway ice 0.1787 Whitening sushi roll happy rolls fish 0.0330 NMF cooks fun hash browns reasonable -0.0441 SSNMF 2nd sushi time location amazing -0.1112 Whitening nails nail pedicure salon time 0.1385 NMF nails nail pedicure time salon 0.1316 SSNMF nail nails grandma cut make 0.0658 Whitening car wash clean time job 0.0617 NMF car wash back time job 0.0583 SSNMF car wash feels clean time 0.0290 Whitening years business office recommend family 0.0856 NMF office work walk time insurance 0.0189 SSNMF insurance years business steve saved 0.0459 Whitening ice cream chocolate cold wait 0.1739 NMF ice cream school cone kids 0.1111 SSNMF cream ice wait stone cold 0.1494 Whitening hair beautiful absolute years salon 0.0749 NMF hair absolute cut beautiful salon 0.0507 SSNMF beautiful hair years cut time 0.0532 Whitening classes class yoga studio gym 0.0928 NMF yoga classes class studio time 0.0816 SSNMF yoga practice dave feel amazing 0.0391 Whitening tire tires oil car discount 0.0739 NMF tire car tires back time 0.0634 SSNMF tire tires car discount time 0.0274 Whitening time chicken thai rice chinese -0.0442 NMF pho chicken rice sauce back 0.0825 SSNMF vietnamese cake chinese back fresh -0.0105 Whitening donuts fresh coffee donut chocolate -0.0349 NMF donuts coffee donut store location -0.0040 SSNMF donuts donut chocolate time selection -0.1298 Whitening pizza crust wings sauce cheese 0.0068 NMF pizza crust wings time cheese -0.0503 SSNMF min pizza crust hut pretty -0.1131 Whitening ice cream cold chocolate flavors 0.1234 NMF ice cream school cone kids 0.0718 SSNMF ice cream wait stone cold 0.1312 Whitening store location big feel kids 0.0075 NMF store time location pharmacy helpful 0.0049 SSNMF pharmacy customer clean safeway rude -0.0127 Whitening bar time beer wings drinks 0.0900 NMF pizza brick pretty bar box -0.0190 SSNMF beers beer operated hand locally 0.0817 Whitening bike shop guys tires back 0.0053 NMF bike shop back bikes time 0.0525 SSNMF bike time gun pretty store -0.0293 Whitening yogurt flavors toppings frozen chocolate 0.0659 NMF yogurt flavors toppings frozen chocolate 0.0420 SSNMF yogurt flavors back ice shop -0.1370 Whitening sushi chinese time fresh rice -0.0311 NMF magazine market farmer farmers boston -0.0702 The Search Problem in Mixture Models Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI SSNMF korean chicken pretty fried spicy 0.0376 Whitening pizza crust wings time cheese 0.1491 NMF pizza brick pretty bar box 0.0582 SSNMF pizza ride brick long red 0.0518 Whitening coffee starbucks donuts tea time 0.2728 NMF coffee busy starbucks ice cream 0.2613 SSNMF coffee starbucks drinks latte work 0.0974 Whitening sandwich subway 
sandwiches bread time 0.1714 NMF sandwich subway fresh bread location 0.1311 SSNMF sandwich sandwiches ham chips limited 0.0083 Whitening time thai rice sauce back -0.2046 NMF pho chicken rice sauce back -0.1096 SSNMF pho rice beef vietnamese sauce -0.0911 Whitening classes class work gym yoga 0.1518 NMF link open isn working fast -0.0304 SSNMF gym fitness work open time 0.1117 Whitening dog park dogs area kids 0.1099 NMF park dog time area trail 0.1023 SSNMF park dog dogs lake area 0.1303 Whitening coffee starbucks drink time make -0.1617 NMF coffee busy starbucks ice cream 0.0802 SSNMF latte location work drink drinks -0.0539 Whitening park area phoenix time lot 0.1356 NMF park dog time area trail 0.1049 SSNMF trail parking street major easy 0.0267 Whitening office years dentist experience work 0.0734 NMF office dentist time work years 0.1169 SSNMF dentist office insurance made teeth 0.0766 Whitening starbucks drink coffee drinks times -0.0972 NMF coffee busy starbucks ice cream -0.0477 SSNMF starbucks drink argue smile times -0.1099 Whitening taco bell tacos fast sauce 0.0994 NMF mexican fresh burrito tacos time 0.1875 SSNMF taco bell ghetto pizza location -0.0042 Whitening mexican burrito tacos salsa fresh 0.0887 NMF mexican fresh burrito tacos time 0.0267 SSNMF salsa fresh tacos baja fish -0.0697 Whitening thai rice chinese hot chicken 0.0691 NMF thai chicken rice back sauce 0.1164 SSNMF thai pad tea dish green 0.0275 Whitening yogurt flavors chocolate cream ice 0.1923 NMF gelato flavors chocolate ice cream 0.1641 SSNMF chocolate caramel factory dark covered 0.1943 Whitening bar drinks night time beer 0.0142 NMF pizza brick pretty bar box -0.0143 SSNMF bar bit big seating beer -0.0086 Whitening chicken chinese rice thai sauce 0.2423 NMF pho chicken rice sauce back 0.2630 SSNMF chicken noodle rice back sauces 0.0910 Whitening burrito mexican stars tacos salsa 0.1320 NMF mexican fresh burrito tacos time 0.0638 SSNMF stars burrito green sauce mexican 0.0467 Whitening salad chicken fresh sandwich bar 0.1780 Ray, Neeman, Sanghavi, and Shakkottai Algo topword-1 topword-2 topword-3 topword-4 topword-5 PMI NMF pizza brick pretty bar box -0.0220 SSNMF salad bar salads soup competitors -0.0123 Whitening burger fries burgers fast time 0.1489 NMF link open isn working fast 0.0159 SSNMF stale burger meat bite king 0.0322 Whitening park area time lot back 0.0572 NMF park dog time area trail 0.0747 SSNMF hike park rock mountain water 0.1255 Whitening nails nail pedicure job salon 0.0189 NMF nails nail pedicure time salon 0.0158 SSNMF pedicure job nail close home -0.0931 Whitening burger fries burgers fast cheese -0.0413 NMF cut wait time hair manager -0.2616 SSNMF fries grease dirty dark slow -0.1629 Whitening dog dogs park pet hot 0.1501 NMF dog tony cut dogs style 0.0751 SSNMF dog door tie made serve 0.0080 Whitening chicken fast chinese rice time -0.1488 NMF chicken chinese fast rice time -0.1291 SSNMF panda orange rice fried bad -0.1327 Whitening mexican burrito chicken tacos salsa -0.0550 NMF mexican fresh burrito tacos time -0.1419 SSNMF trouble beans rice chicken marinated -0.1233 Whitening subway sandwich clean fresh location -0.0074 NMF sandwich subway fresh bread location -0.0445 SSNMF subway location clean super sandwich -0.0524 Whitening car wash back time work 0.1064 NMF car wash back time job 0.0874 SSNMF visited car back job weeks 0.0353 Whitening found cake chocolate shop yogurt 0.0754 NMF back time shop cake found 0.0099 SSNMF cake wanted wedding flavor perfect 0.0416 Whitening 
location fast makes feel quality -0.0672 NMF prices selection quality family helpful -0.1569 SSNMF difference fast steak sandwiches subs -0.1672 Whitening thai chicken rice chinese hot 0.1482 NMF thai chicken rice back sauce 0.1903 SSNMF chicken stew brown curry rice 0.0047 Whitening massage back amazing years spa 0.1359 NMF massage time back amazing hour -0.0035 SSNMF massage arts experience amazing hour -0.0168 Whitening sandwich pizza time back bread -0.0254 NMF gelato flavors chocolate ice cream 0.0241 SSNMF ice italian flavors cream chocolate -0.0231

Appendix C. Computation of A, B for Different Models

This section outlines the construction of the matrices A, B in various models via different moment computations. First we introduce some notation which we use in Appendices C, D, E, F, and G.

C.1 Notations

For a vector x, \|x\| denotes its ℓ2 norm. For a matrix X, \|X\| denotes its spectral norm. We use the notation \hat{X} or \hat{E}[X] to represent the sample estimate of a quantity X, unless mentioned otherwise. For a matrix M, let σ_k(M) denote the k-th largest singular value of M and λ_k(M) the k-th largest eigenvalue. n represents the number of samples used to obtain the sample estimates.

Next, we introduce some basic tensor notation. Let x, y, z ∈ R^d be three d-dimensional vectors. The order-3 tensor T_3 = x \otimes y \otimes z is defined as T_3(i, j, k) = x(i) y(j) z(k) for i, j, k ∈ [d]. Similarly, the order-2 tensor T_2 = x \otimes y is equivalent to the matrix outer product T_2 = x y^T. Finally, let v ∈ R^d be another d-dimensional vector and I the d × d identity matrix. The tensor contraction T_3(I, I, v) is equal to the order-2 tensor T_3(I, I, v) = \langle z, v \rangle x \otimes y, which is again equivalent to the matrix T_3(I, I, v) = \langle z, v \rangle x y^T. For order-2 tensors we will use the tensor and matrix notations interchangeably.

C.2 GMM Moments

In this section we show how the required matrices A, B can be computed in the GMM model. We restate the following useful theorem from Hsu and Kakade (2013), which computes three tensor moments for the GMM model.

Theorem 7 (Hsu and Kakade (2013)) Consider the GMM model with means {µ_1, . . . , µ_k} and corresponding variances {σ_1^2, . . . , σ_k^2}, and let α_i denote the proportion of the i-th component in the mixture. Let σ^2 = \sum_{i=1}^k α_i σ_i^2 be the smallest eigenvalue of the covariance matrix E[(x − E[x])(x − E[x])^T] (note that since \sum_i α_i µ_i µ_i^T has rank k, this is the same as the (k+1)-th largest eigenvalue), and let u be a unit-norm eigenvector corresponding to the eigenvalue σ^2. Define

\tilde{m} = E[x (u^T (x − E[x]))^2],
M_2 = E[x \otimes x] − σ^2 I,
M_3 = E[x \otimes x \otimes x] − \sum_{i=1}^d (\tilde{m} \otimes e_i \otimes e_i + e_i \otimes \tilde{m} \otimes e_i + e_i \otimes e_i \otimes \tilde{m}),

where {e_1, . . . , e_d} is the standard basis of R^d. Then,

\tilde{m} = \sum_{i=1}^k α_i σ_i^2 µ_i,   M_2 = \sum_{i=1}^k α_i µ_i \otimes µ_i,   M_3 = \sum_{i=1}^k α_i µ_i \otimes µ_i \otimes µ_i.

Theorem 8 In the GMM model define

m = E[x],   A = E[x x^T] − σ^2 I_d,
B = E[\langle x, v \rangle x x^T] − \tilde{m} v^T − v \tilde{m}^T − \langle \tilde{m}, v \rangle I_d.

Then m = \sum_i α_i µ_i, A = \sum_{i=1}^k α_i µ_i µ_i^T, and B = \sum_{i=1}^k α_i \langle µ_i, v \rangle µ_i µ_i^T.

Proof The expressions for m, A follow directly from Theorem 7 by noting that A = M_2 and µ_i \otimes µ_i = µ_i µ_i^T. To compute B consider the tensor contraction M_3(I, I, v), with M_3 as in Theorem 7. Then,

M_3(I, I, v) = E[\langle x, v \rangle x \otimes x] − \sum_{i=1}^d (v(i) \tilde{m} \otimes e_i + v(i) e_i \otimes \tilde{m} + \langle \tilde{m}, v \rangle e_i \otimes e_i)
            = E[\langle x, v \rangle x x^T] − \sum_{i=1}^d (v(i) \tilde{m} e_i^T + v(i) e_i \tilde{m}^T + \langle \tilde{m}, v \rangle e_i e_i^T)
            = E[\langle x, v \rangle x x^T] − \tilde{m} v^T − v \tilde{m}^T − \langle \tilde{m}, v \rangle I_d = B.

Also from Theorem 7, M_3(I, I, v) = \sum_{i=1}^k α_i \langle µ_i, v \rangle µ_i \otimes µ_i = \sum_{i=1}^k α_i \langle µ_i, v \rangle µ_i µ_i^T. Therefore B = \sum_{i=1}^k α_i \langle µ_i, v \rangle µ_i µ_i^T.
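To illustrate how Theorem 8 is used in practice, the following is a minimal sketch (not the paper's implementation) of forming plug-in estimates of m, A, and B from n samples drawn from a spherical GMM; the function gmm_search_moments and its variable names are ours, introduced only for illustration.

```python
import numpy as np

def gmm_search_moments(X, v, k):
    """Plug-in estimates of m, A, B from Theorem 8 for a spherical GMM.

    X : (n, d) array whose rows are samples
    v : (d,) side-information vector
    k : number of mixture components
    """
    n, d = X.shape
    m_hat = X.mean(axis=0)

    # sigma^2 is estimated by the (k+1)-th largest eigenvalue of the empirical
    # covariance; u is a corresponding unit-norm eigenvector.
    cov = np.cov(X, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
    sigma2_hat = eigvals[-(k + 1)]
    u = eigvecs[:, -(k + 1)]

    # m_tilde = E[x (u^T (x - E[x]))^2]
    proj2 = ((X - m_hat) @ u) ** 2                # (n,)
    m_tilde = (X * proj2[:, None]).mean(axis=0)

    # A = E[x x^T] - sigma^2 I
    A_hat = (X.T @ X) / n - sigma2_hat * np.eye(d)

    # B = E[<x, v> x x^T] - m_tilde v^T - v m_tilde^T - <m_tilde, v> I
    w = X @ v                                     # (n,) inner products <x_i, v>
    B_hat = (X.T * w) @ X / n
    B_hat -= np.outer(m_tilde, v) + np.outer(v, m_tilde) + np.dot(m_tilde, v) * np.eye(d)
    return m_hat, A_hat, B_hat, sigma2_hat
```

Using the (k+1)-th largest eigenvalue of the empirical covariance as the estimate of σ^2 mirrors the parenthetical remark in Theorem 7 that, when the means span a k-dimensional space, this eigenvalue coincides with the smallest one.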
C.3 LDA Moments In this section we show the m, A, B computation corresponding to the LDA model. Again we restate the following theorem from Anandkumar et al. (2014) which computes the first three tensor moments for LDA distribution. Theorem 9 (Anandkumar et al. (2014)) In an LDA model with parameters α = (α1, . . . , αk) , topic distributions µ1, . . . , µk. Let α0 = Pk i=1 αi. Define M1 = E[x1], M2 = E[x1 x2] α0 1 + α0 M1 M1 M3 = E[x1 x2 x3] α0 α0 + 2 (E[x1 x2 M1] + E[x1 M1 x3] + E[M1 x2 x3]) + 2α2 0 (α0 + 1)(α0 + 2)M1 M1 M1 αi α0 µi, M2 = αi α0(α0 + 1)µi µi 2αi α0(α0 + 1)(α0 + 2)µi µi µi Theorem 10 For an LDA model for any v Rd suppose m, A, B be defined as m = α0E[x1] A = α0(α0 + 1)E[x1x T 2 ] mm T B = α0(α0 + 1)(α0 + 2) 2 E[ x3, v x1x T 2 ] α0(α0 + 1) 2 m, v E[x1x T 2 ] + E[ x3, v x1m T ] +E[ x3, v mx T 2 ] + m, v mm T . The Search Problem in Mixture Models Then we can express m, A, B as follows. i=1 αiµi, A = i=1 αiµiµT i , B = i=1 αi µi, v µiµT i Proof The expressions for m and A follows easily from Theorem 9 since m = α0M1 and A = α0(α0+1)M2. To show the expression for B consider the tensor contraction M3(I, I, v), M3 defined as in Theorem 9. Then we have M3(I, I, v) = E[ x3, v x1 x2] α0 α0 + 2 (E[ M1, v x1 x2] + E[ x3, v x1 M1] +E[ x3, v M1 x2 x3]) + 2α2 0 (α0 + 1)(α0 + 2) M1, v M1 M1 = 2 α0(α0 + 1)(α0 + 2)B where we used x1 x2 is same as x1x T 2 and so on. We also get from Theorem 9 M3(I, I, v) = Pk i=1 2αi α0(α0+1)(α0+2) µi, v µi µi. Therefore we have B = α0(α0 + 1)(α0 + 2) 2 M3(I, I, v) = i=1 αi µi, v µiµT i . C.4 Mixed Regression Moments Recall in mixed regression we have y = x, µi + ξ where x N(0, I) and ξ N(0, σ2). In the following Lemmas we compute the various moments M1,1, M2,2, M3,1, M3,3 and show how they are used to compute m, A, B. Lemma 11 In mixed linear regression define M1,1 = E[yx], M2,2 = E[y2xx T ], M3,1 = E[y3x] and M3,3 = E[y3 x, v xx T ]. Then, i=1 αiµiµT i + (σ2 + i=1 αi µi 2)I i=1 αi(σ2 + µi 2)µi i=1 αi µi, v µiµT i + M3,1v T + v MT 3,1 + M3,1, v I Ray, Neeman, Sanghavi, and Shakkottai Proof We compute the moments as shown below. M1,1 = E[yx] = i=1 αi E[x T µix + ξx] = M2,2 = E[y2xx T ] = i=1 αi E[ µi, x 2xx T ] + E[ξ2]E[xx T ] i=1 αi E[ µi, x 2xx T ] + σ2I i=1 αi(2µiµT i + µi 2I) + σ2I i=1 αiµiµT i + i=1 αi(σ2 + µi 2)I Using the fact that all odd moments of normal random variable are zero. M3,1 = E[y3x] = i=1 αi E[( x, µi + ξ)3x] i=1 αi E[ x, µi 3x] + 3 i=1 αi E[ξ2]E[ x, µi x] i=1 αi µi 2µi + 3 i=1 αiσ2µi = 3 i=1 αi(σ2 + µi 2)µi We use the fact that for even p the moment E[zp] = (p 1)!! for a standard normal random variable z and !! denote the double factorial. Next we compute M3,3. M3,3 = E[y3 x, v xx T ] = i=1 αi E[( x, µi + ξ)3 x, v xx T ] i=1 αi E[ x, µi 3 x, v xx T ] + 3 i=1 αi E[ξ2]E[ x, v x, µi xx T ] i=1 αi E[ x, µi 3 x, v xx T ] + 3σ2 k X i=1 αi E[ x, v x, µi xx T ] (5) Now we compute these individual moments. E[ x, v x, µi xx T ] = µT i v + vµT i + µi, v I The Search Problem in Mixture Models Using the fact that any odd combination of the variables in x will be zero in expectation. Also, E[ x, µi 3 x, v xx T ] = 6 v, µi µiµT i + 3 µi 2[µT i v + vµT i + µi, v I] Again by using the moments of standard normal variable. This can be verified by considering the (a, b)-th entry of the matrix on the right as a polynomial in µi(l), the lth component of µi, and matching the corresponding coefficients from both sides of the equation. 
Combining with equation (5) we get, i=1 αi 6 v, µi µiµT i + 3 µi 2(µT i v + vµT i + µi, v I) i=1 αi[µT i v + vµT i + µi, v I] i=1 αi v, µi µiµT i + 3 i=1 αi(σ2 + µi 2)[µT i v + vµT i + µi, v I] i=1 αi v, µi µiµT i + M3,1v T + v MT 3,1 + M3,1, v I Theorem 12 Let m, A, B be defined as m = M1,1, A = 1 2(M2,2 τ 2I), B = 1 6(M3,3 (M3,1v T + v MT 3,1 + M3,1, v I)) where τ 2 is the smallest singular value of M2,2. Then, i=1 αiµi, A = i=1 αiµiµT i , B = i=1 αi µi, v µiµT i Proof The proof follows directly from Lemma 11. Note that since µi-s are linearly independent the smallest singular vector τ 2 of M2,2 is equal to Pk i=1 αi(σ2 + µi 2). Then A = 1 2 M2,2 τ 2I = Pk i=1 αiµiµT i . Similarly the expression for B holds. Ray, Neeman, Sanghavi, and Shakkottai C.5 Subspace Clustering Moments In this section we derive the necessary moments required for subspace clustering. Recall that in the subspace clustering model we have k dimension m subspaces U1, . . . , Uk Rd m (matrices U1, . . . , Uk have orthonormal columns). The data is generated as follows. We sample y N(0, Id) and set x = Ui UT i y + ξ, where ξ N(0, σ2Id) is additive noise. Theorem 13 Consider the subspace clustering model. Let M2, A, B be defined as, M2 := E[xx T ], A := M2 σ2Id B := E[ x, v 2xx T ] σ2(v T Av)Id σ2 v 2A σ4( v 2Id + vv T ) 2σ2(Avv T + vv T A) where σ2 = σmk+1(M2). Then, i=1 αi Ui UT i i=1 αi UT i v 2Ui UT i + 2 i=1 αi Ui UT i vv T Ui UT i Proof First we compute M2. M2 = E(xx T ) = i=1 αi E Ui UT i yy T Ui UT i + E[ξξT ] = i=1 αi Ui UT i + σ2Id Using E[yy T ] = I as y N(0, I) and UT i Ui = I since the columns are orthogonal. Since αi > 0, the mk + 1-th singular value of M2, σmk+1(M2) = σ2. Therefore it follows that, A = M2 σ2Id = i=1 αi Ui UT i Now we compute the moment E[ x, v 2xx T ]. Given a sample x = Ui UT i y + ξ from the i-th subspace we have, x, v 2 = v T Ui UT i yy T Ui UT i v + v T ξξT v + 2v T ξv T Ui UT i y xx T = Ui UT i yy T Ui UT i + Ui UT i yξT + ξy T Ui UT i + ξξT Then we can write, E[ x, v 2xx T ] i=1 αi E[v T Ui UT i yy T Ui UT i v Ui UT i yy T Ui UT i ] + E[v T Ui UT i yy T Ui UT i v]E[ξξT ] +E[v T ξξT v]E[Ui UT i yy T Ui UT i ] + E[v T ξξT vξξT ] + 2E[(v T ξv T Ui UT i y)Ui UT i yξT ] +2E[(v T ξv T Ui UT i y)ξy T Ui UT i ] = T1 + T2 + T3 + T4 + T5 + T6 (6) The Search Problem in Mixture Models where T1, . . . , T6 are as follows. We define vi := Ui UT i v, we use the Gaussian moment results E[ v, z z] = σ2v, and E[ v, z 2zz T ] = σ4( v 2Id + vv T ) whenever z N(0, σ2Id). i=1 αi E v T Ui UT i yy T Ui UT i v Ui UT i yy T Ui UT i i=1 αi E[ y, vi 2Ui UT i yy T Ui UT i ] = i=1 αi Ui UT i E[ y, vi 2yy T ]Ui UT i i=1 αi Ui UT i ( vi 2Id + 2viv T i )Ui UT i i=1 αi vi 2Ui UT i + 2 i=1 αi Ui UT i vv T Ui UT i i=1 αi UT i v 2Ui UT i + 2 i=1 αi Ui UT i vv T Ui UT i since vi = Ui UT i v = UT i v . 
i=1 αi E[v T Ui UT i yy T Ui UT i v]E[ξξT ] = i=1 αiv T Ui UT i v σ2Id = σ2(v T Av)Id i=1 αi E[v T ξξT v]E[Ui UT i yy T Ui UT i ] = σ2 v 2 k X i=1 αi Ui UT i = σ2 v 2A i=1 αi E[v T ξξT vξξT ] = E[ v, ξ 2ξξT ] = σ4( v 2Id + 2vv T ) i=1 αi2E[(v T ξv T Ui UT i y)Ui UT i yξT ] = 2 i=1 αi E[(v T Ui UT i y)Ui UT i y]E[ v, ξ ξT ] i=1 αi E[(v T Ui UT i y)Ui UT i y] σ2v T = 2σ2 k X i=1 αi E[(v T Ui UT i y)Ui UT i yv T ] i=1 αi E[Ui UT i v, y yv T ] = 2σ2 k X i=1 αi Ui UT i vv T = 2σ2Avv T i=1 αi E[(v T ξv T Ui UT i y)ξy T Ui UT i ] = 2 i=1 αi E[ v, ξ ξ]E[ vi, y y T Ui UT i ] i=1 αivv T i Ui UT i = σ2 k X i=1 αivv T Ui UT i = σ2vv T k X i=1 αi Ui UT i = 2σ2vv T A Ray, Neeman, Sanghavi, and Shakkottai B = E[ x, v 2xx T ] σ2(v T Av)Id σ2 v 2A σ4( v 2Id + vv T ) 2σ2(Avv T + vv T A) = E[ x, v 2xx T ] T2 T3 T4 T5 T6 = T1 i=1 αi UT i v 2Ui UT i + 2 i=1 αi Ui UT i vv T Ui UT i Appendix D. Finite-sample Analysis of the Whitening Method Suppose that where σk is the kth singular value of A. Let V be the n k matrix whose columns are the first k singular vectors of A, and let ˆV be the same for ˆA. Let D be the diagonal matrix of singular values of A, and let ˆD be the diagonal matrix of the first k singular values of ˆA. Then A = V DV T and V T V = ˆV T ˆV = Ik. This entire section is under the assumptions of Theorem 1; in particular, recall that ϵ σk(A)/4. It will be technically convenient for us to assume that B A = σ1(A). This assumption holds without loss of generality: if not, simply rescale the side information, setting vnew = v A B . This has the effect of rescaling B, so that Bnew = A ; define also ˆBnew = ˆB A B . Note that Bnew ˆBnew = B ˆB A under the assumption B ˆB ϵ. Now, the algorithm is homogeneous in ˆB: it will produce the same output given either ˆB or ˆBnew; hence, it suffices to prove Theorem 1 with v, B, and ˆB replaced by their new versions. Since the new versions satisfy Bnew A , we may assume this without loss of generality. From now on, we will drop the notation Bnew, and we will simply prove Theorem 1 under the assumption B A . Our basic tool is Wedin s theorem: Theorem 14 For a matrix A, let P A s be the orthogonal projection onto the subspace spanned by singular vectors of A with singular value at least s. Let P A s be the orthogonal projection onto the subspace spanned by singular vectors with singular value at most s. Then for any matrices A and B, and for any s < t, P A s P B t 2 A B The Search Problem in Mixture Models In applying Wedin s theorem, the following geometric lemma will be useful. In what follows, PE denotes the orthogonal projection onto E. Lemma 15 Let E and F be subspaces of Rn with PE PF δ. Then PF v 2 PEv 2+ 3δ v 2 for every v Rn. Lemma 16 If ϵ < σk/4 then for any u Rk, s σ2 k u ˆV T V u u . By a simple change of variables, if we define O = D 1/2 ˆV T V D1/2 then O is also an almost-isometry: for every u Rk, s σ2 k u Ou u . (7) Proof First, note that σk( ˆA) σk(A) A ˆA σk ϵ. If ϵ < σk/4, we also have σk+1( ˆA) σk+1(A) + ϵ σk/4 < σk ϵ, which implies that ˆV ˆV T = P ˆ A σk ϵ. Let ˆW be a d (d k) matrix whose columns form an orthonormal basis for the orthogonal complement of the column span of ˆV . Note that if ϵ < σk/2 then the kth singular value of ˆA is strictly larger than σk/2 and the (k + 1)th singular value is at most ϵ. Then P ˆ A ϵ = ˆW ˆW T . 
By Wedin s theorem, ˆW ˆW T V V T = P ˆ A ϵP A σk 2ϵ σk ϵ 4ϵ Now, ˆW T and V have norm 1, and so it follows that ˆW T V = ˆW T ( ˆW ˆW T V V T )V 4ϵ For any u Rk with u = 1, we have ˆV T V u 2 = 1 ˆW T V u 2 1 16ϵ2/σ2 k, from which the claimed lower bound follows. On the other hand, ˆV T V u u because both ˆV T and V have norm 1. Let M = D 1/2V T BV D 1/2 and ˆ M = ˆD 1/2 ˆV T ˆB ˆV ˆD 1/2. Then M is the infinitesample version of A s whitening matrix applied to B, and ˆ M is the finite-sample analogue. Recall from (7) that O = D 1/2 ˆV T V D1/2 is an almost-isometry of Rk. OMOT ˆ M C ϵσ1 Ray, Neeman, Sanghavi, and Shakkottai Proof The first step is to approximate OMOT by D 1/2 ˆV T B ˆV D 1/2. To this end, note that OMOT = D 1/2 ˆV T V V T BV V T ˆV D 1/2. Now, ˆV is an isometry of Rk into Rn; hence, ˆV T V V T ˆV T = ˆV ˆV T V V T ˆV ˆV T = P ˆ A σk ϵP A σk P ˆ A σk ϵ = P ˆ A σk ϵP A 0 , where the last equality used the fact that A has rank exactly k, and hence I P A σk = P A 0. Now, Wedin s theorem applied to the computation above implies that ˆV T V V T ˆV T 2ϵ σk ϵ 4ϵ (recalling that ϵ σk/4). Now, for general matrices X, Y, Y , Z we have XT Y T ZY X XT Y T Z Y X XT (Y Y )T ZY X + XT Y T Z(Y Y )X Y Y X 2 Z ( Y + Y ). We apply this with X = D 1/2, Y = ˆV , Y = ˆV V V T , and Z = B; since D 1/2 = σ 1/2 k , B σ1, and ˆV , V , V T = 1, OMOT D 1/2 ˆV T B ˆV D 1/2 8ϵσ1 Next, we will replace B by ˆB in the above inequality. Since ˆV = ˆV T = 1 and D 1/2 = σ 1/2 k , D 1/2 ˆV T B ˆV D 1/2 D 1/2 ˆV T ˆB ˆV D 1/2 = D 1/2 ˆV T (B ˆB) ˆV D 1/2 σ 1 k B ˆB ϵ Putting this together with the previous bound yields OMOT D 1/2 ˆV T ˆB ˆV D 1/2 ϵ It remains to relate D 1/2 ˆV T ˆB ˆV D 1/2 to ˆ M (which is the same, but with ˆD instead of D). Now, Weyl s inequality implies that D 1/2 ˆD 1/2 σ 1/2 k (σk ϵ) 1/2 ϵσ 3/2 k , where the second inequality follows from a first-order Taylor expansion and the fact that ϵ σk/2. Hence, D 1/2 ˆV T ˆB ˆV D 1/2 ˆ M D 1/2 ˆD 1/2 ˆV T ˆB ˆV D 1/2 + ˆD 1/2 ˆV T ˆB ˆV D 1/2 ˆD 1/2 4ϵσ1σ 2 k . The Search Problem in Mixture Models Combining this with (8) and the triangle inequality, we have OMOT ˆ M = ϵ Since O is almost an isometry, it follows that there is an orthogonal matrix O that is close to O (for example, if UDV T = O is an SVD, let O = UV T ). In this way, we may find an orthogonal O such that Now let u be the top eigenvector of M and let u O be the top eigenvector of OMOT . Then Ou is the top eigenvector of OM OT . The triangle inequality implies that OMOT OM OT 2 M O O 32ϵ2 On the other hand, M was assumed to have a spectral gap of δ M . By Wedin s theorem, it follows that u OT u O = Ou u O 64ϵ2 Finally, let ˆu be the top eigenvector of ˆ M. By Lemma 17 and Wedin s theorem, ˆu u O Cϵσ1 Ou ˆu O O + Ou ˆh C max ϵσ1 δσ2 k , (9) where the last inequality follows because ϵ σk/2 σ1/2. Next, we unpack O. Weyl s inequality implies that D 1/2 ˆD 1/2 σ 1/2 k (σk ϵ) 1/2 ϵσ 3/2 k , where the second inequality follows from a first-order Taylor expansion and the fact that ϵ σk/4. Hence, O ˆD 1/2 ˆV T V D1/2 D1/2 D 1/2 ˆD 1/2 ϵ σ1 The right hand side is smaller than ϵσ1 σ2 k , and so we may plug it into (9) to obtain ˆD 1/2 ˆV T V D1/2u ˆu Cϵσ1 Ray, Neeman, Sanghavi, and Shakkottai Finally, (again because ϵ σk/2), ˆD 1/2 (σk/2) 1/2, and so V D1/2u ˆV ˆD1/2ˆu Cϵσ1 δσ5/2 k . (10) Setting w = V D1/2u and ˆw = ˆV ˆD1/2ˆu and comparing this to the setting of Algorithm 1, (10) shows that the finite-sample algorithm gets almost the same w as the infinite-sample version. 
It remains to check the last few lines of Algorithm 1; i.e., to see that we recover the right scaling of w. Lemma 18 Let M be a symmetric matrix of rank k 1 and let E be the span of its columns. Then w dist(w, E) σk(M + ww T ). Proof It suffices to consider the case w = 1 (for a general w, apply the special case of the lemma to w/ w and M/ w 2). Let PE denote the orthogonal projection onto E, and note that w PEw = dist(w, E) Let F = span{E, w}. Since F has dimension k and y F implies (M + ww T )y = 0, it suffices to find some y F such that (M + ww T )y dist(w, E) y . Choose y = w PEw. Then My = 0 and so (M + ww T )y = |w T y| = w PEw 2 = dist(w, E) y . Lemma 19 Let E be a subspace and take w E. For x span{E, w}, let a(x) R be the unique solution to x = aw + e, e E. Then |a(x) a(y)| x y / dist(w, E). Proof Given x, y span{E, w}, we can write x y = (a(x) a(y))w + e, where e E. It follows that x y = (a(x) a(y))w + e inf e E (a(x) a(y))w + e = |a(x) a(y)| dist(w, E). Finally, we apply the preceding two lemmas to show that ˆα1 is accurate in Algorithm 1. Together with (10) (whose right hand side provides the value of η that we will use), this completes the proof of Theorem 1. Lemma 20 Let m = P i αiµi. If ˆA A ϵ, ˆm m ϵ and ˆw α1µ1 η then |ˆα1 α1| C α1|α1R + η| where R = maxi µi , provided that the right hand side above is at most α1. The Search Problem in Mixture Models Proof By Wedin s theorem, V V T ˆV ˆV T 2 ˆA A σk ˆA A 4 ϵ if ϵ σk/2. Hence, m ˆV ˆV T ˆm = V V T m ˆV ˆV T ˆm (V V T ˆV ˆV T )m + ˆV ˆV T (m ˆm) Now, let y = α1 ˆw + ˆV ˆV T Pk i=2 αiµi. Then m y α1 ˆw α1µ1 + i=2 αi(µi ˆV ˆV T µi) η + max i µi V V T ˆV ˆV T η + 4 max i µi ϵ Defining R = maxi µi , we have y ˆV ˆV T ˆm η + 8R ϵ Now, let ˆE be the span of { ˆV ˆD1/2v : v Rk, v ˆu}, and note that ˆE may also be written as the column space of ˆV ˆD1/2(Ik ˆuˆu T ) ˆD1/2 ˆV T = ˆV ˆD ˆV T ˆw ˆw T . Since ˆV ˆD1/2 is injective, ˆE has dimension k 1 and does not contain ˆw = ˆV ˆD1/2ˆu. Hence, y = α1 ˆw +e is the unique way to decompose y in span{ ˆw} ˆE. If we define a by the decomposition ˆm = a ˆw + e then Lemma 19 implies |a α1| y ˆm / dist( ˆw, ˆE) dist( ˆw, ˆE) On the other hand, Lemma 18 applied to ˆV ˆD ˆV T ˆw ˆw T and ˆw implies (because the kth singular value of ˆV ˆD ˆV T σk ϵ σk/2) that ˆw dist( ˆw, ˆE) σk/2. Therefore, |a α1| 2 ˆw σk + ϵ 2(α1 µ1 + η) Finally, note that |ˆα1 α1| = |a2 α1| = |a α1|(a + α1). We consider two cases: if a C α1 then |ˆα1 α1| (1 + C) α1|a α1|, which completes the proof. In the other case, we have |ˆα1 α1| ˆα1 C p which implies that |ˆα1 α1| C|a α1|2 Ray, Neeman, Sanghavi, and Shakkottai for some other constant C. This implies |ˆα1 α1| C (α1R + η) σk + ϵ 2 C α1 where the second inequality comes from the assumption that the right hand side in the lemma is bounded by α1. As we pointed out in Section 2, spectral algorithms similar to Algorithm 1 has been proposed before for GMM [Hsu and Kakade 2013] and LDA [Anandkumar et al. 2012] models, the main difference being how the second matrix (equivalent to B) is constructed. Since the underlying whitening procedure is the same in all these algorithms, the proof approach presented above is similar to those in Hsu and Kakade (2013); Anandkumar et al. (2012). The proofs diverge when computing the perturbation of the second matrix, matrix B in our algorithm, which introduces different dependence on various parameter models in the overall error bound. For example the error bound in Theorem 4.1 of Anandkumar et al. 
(2012) has a slightly worse dependence on k and σk than Theorem 1. Appendix E. Finite-sample Analysis of the Cancellation Method In this section we analyze the performance of Algorithm 2 when we have finite sample estimates of the matrices A, B and vector m. For ease of exposition we replaced the quantities V1:(k 1), vi, ai, ci in Algorithm 2 with the notation representing estimate b V1:(k 1), ˆvi, ˆai, ˆci respectively, since these are computed from sample estimates b A, b B. First, we show in Lemma 21 that we can have a good estimate for b Zλ using good estimates for A, B and λ1. Lemma 21 Let b Zλ = b A λ b B, Zλ = A λB. Suppose max{ b A A , b B B } < ϵ and λ1 = 1/w1. Then, b Zλ Zλ1 < ϵ 2 + 1 when |λ1 λ| < ϵ1 < 1. Proof We have, b Zλ Zλ1 b A A + λ b B λ1B < b A A + λ1 b B B + |λ1 λ| b B ϵ + λ1ϵ + ϵ1(σ1(B) + ϵ) < ϵ(1 + 1/w1 + ϵ1) + ϵ1σ1(B) < ϵ 2 + 1 since ϵ1 < 1. The following lemma will show that even with noisy estimates of A, B, the estimated λ is close to λ1. The Search Problem in Mixture Models Lemma 22 Let max{ b A A , b B B } < ϵ < σk(A)/2, and λ1 = 1/w1 > 0. Then, |λ λ1| = O(ϵ) Proof Define Z λ = V V T AV V T λV V T BV V T , V being the d k matrix of top k eigenvectors of A. The corresponding empirical estimate b Z λ = b V b V T b Ab V b V T λb V b V T b B b V b V T . The main proof idea is the following. We try to find λ2, λ3 > 0 such that: 1. λ > λ2, b Z λ is not PSD. 2. λ < λ3, b Z λ is PSD. The above two conditions imply that the optimum λ is bounded as λ3 λ λ2. We then simply bound λ λ1 as λ3 λ1 λ λ1 λ2 λ1. We now elaborate the above two steps. First, we bound the perturbation of empirical matrix b Z λ as follows. Using Wedin s theorem we have b V b V T V V T 4ϵ σk(A). Using this and the theorem assumptions we can compute the following bounds. b V b V T b Ab V b V T V V T AV V T 13ϵ b V b V T b B b V b V T V V T BV V T 1 + 12σk(B) Combining, we have b Z λ Z λ b V b V T b Ab V b V T V V T AV V T +λ b V b V T b B b V b V T V V T BV V T c1(1+λ)ϵ (11) where c1 = max{13, 1 + 12σk(B) σk(A) }. Step 1: Since matrices A and B share the same column and row space, V V T AV V T = A, V V T BV V T = B, and Z λ = Zλ = Pk i=1(1 λwi)αiµiµT i , wi = µi, v . Recall, V = span{µ2, . . . , µk} and Π denote the projection onto V , its perpendicular space. Let x1 = Πµ1/ Πµ1 , and x1 = V x1, x1 = x1 = 1. Consider the eigenvalues of the k k Hermitian matrix V T ZλV. Using variational theorem we can write: σk(V T ZλV ) = min x =0, x =1 x T V T ZλV x x T 1 V T ZλV x1 = x T 1 Zλx1 = (1 λw1)α1a 1 (12) where a 1 = | x1, µ1 |2 > 0. Now note that the matrices Z λ = V V T ZλV V T and V T ZλV have the same set of non-zero eigenvalues since V forms an orthonormal basis of the row/column space of Zλ. Therefore we can write from above, σk(Z λ) = σk(V T ZλV ) (1 λw1)α1a 1 (13) For λ = λ1 = 1/w1, Z λ1 is a rank k 1 matrix, and for any λ > λ1, Z λ has at least one negative eigenvalue. Consider λ2 > λ1 such that Z λ2 has one negative eigenvalue and k 1 positive eigenvalues. Since b Z λ2, Z λ2 are symmetric matrices, using Weyl s inequality we get, σk( b Z λ2) σk(Z λ2) + b Z λ2 Z λ2 σk(Z λ2) + c1(1 + λ2)ϵ (1 λ2w1)α1a 1 + c1(1 + λ2)ϵ a 1[(α1 + ϵ) λ2(w1α1 ϵ)] (14) Ray, Neeman, Sanghavi, and Shakkottai using equations (11), (13), and assuming a 1 > c1 (else we can simply rescale ϵ). Now for any λ > λ2 = α1+ϵ α1w1 ϵ we get σk( b Z λ) a 1[(α1 + ϵ) λ(w1α1 ϵ)] a 1[(α1 + ϵ) λ2(w1α1 ϵ)] = 0 Therefore, when λ > λ2 = α1+ϵ α1w1 ϵ, b Z λ is not PSD. This implies that λ2 λ . 
Then, λ λ1 λ2 λ1 = α1 + ϵ α1w1 ϵ 1 w1 = ϵ(w1 + 1) (α1w1 ϵ)w1 (15) Step 2: Consider λ3 < λ1 such that Z λ3 is PSD. Then we lower bound σk(Z λ3) as follows. Let vk,λ3 be the k th eigenvector of Z λ3 having eigenvalue σk(Z λ3). Then, σk(Z λ3) = v T k,λ3Z λ3 vk,λ3 = i=1 αi(1 λ3wi) v T k,λ3µiµT i vk,λ3 i=1 αi| vk,λ3, µi |2 (1 λ3w1)a 2 (16) since w1 > wi, i = 1, and where a 2 = infλ 0 Pk i=1 αi| vk,λ, µi |2 > 0. Now using the lower bound of Weyl s inequality, σk( b Z λ3) σk(Z λ3) b Z λ3 Z λ3 σk(Z λ3) c1(1 + λ3)ϵ (1 λ3w1)a 2 c1(1 + λ3)ϵ c1[(1 ϵ) λ3(w1 + ϵ)] using equation (16), and assuming c1 < a 2 (else we can simply rescale ϵ). Then, for any λ < λ3 = (1 ϵ) (w1+ϵ) we have σk( b Z λ) > 0, or b Z λ is PSD. This implies λ > λ3. Therefore, λ λ1 λ3 λ1 = (1 ϵ) w1 = (w1 + 1)ϵ (w1 + ϵ)w1 (17) Combining equations (15), (17) we get, |λ λ1| c3ϵ = O(ϵ) where c3 = max (w1+1) (w1+ϵ)w1 , (w1+1) (α1w1 ϵ)w1 In Lemma 22 we assume w1 = µ1, v is positive. When w1 < 0, we have to modify the line search and find the smallest λ < 0 such that b Z λ is PSD. However we can still apply similar arguments and prove that as long as the estimates of A, B, are within ϵ in spectral norm, Algorithm 2 can estimate λ within an O(ϵ) accuracy of λ1. Lemma 21 and 22 The Search Problem in Mixture Models together implies that b Zλ Zλ1 = O(ϵ) as follows, which will be used to prove Theorem 3. We have, b Zλ Zλ1 < ϵ 2 + 1 where in the last inequality we assume ϵ < α1w1/2, and η3 = max n 2, 1 w1 , c3σ1(B) o . Lemma 23 Let ˆm m < ϵ, b Zλ Zλ1 < ϵ2 < σk 1(Zλ1)/2 for λ1 = α1/β1. V1:(k 1) denote the d (k 1) matrix of k 1 largest singular vectors of Zλ1 and b V1:(k 1) be the d (k 1) matrix of k 1 largest singular vectors of b Zλ . Then, ˆx1 x1 < 2ϵ + 4ϵ2R σk 1(Zλ1) = ϵ3 ˆv1 v1 < 2ϵ3 α1a1 = ϵ4 where R = maxi [k] µi . Proof Since, b Zλ Zλ1 < ϵ2 < σk 1(Zλ1)/2, applying Wedin s theorem we get, b V1:(k 1) b V T 1:(k 1) V1:(k 1)V T 1:(k 1) 2 b Zλ Zλ1 σk 1(Zλ1) b Zλ Zλ1 4ϵ2 σk 1(Zλ1) (19) since ϵ2 < σk 1(Zλ1)/2. Now, ˆx1 x1 = ˆm b V1:(k 1) b V T 1:(k 1) ˆm m + V1:(k 1)V T 1:(k 1)m ˆm m + (b V1:(k 1) b V1:(k 1) V1:(k 1)V T 1:(k 1))m + b V1:(k 1) b V T 1:(k 1)(m ˆm) < 2 m ˆm + 4ϵ2 m σk 1(Zλ1) < 2ϵ + 4ϵ2R σk 1(Zλ1) := ϵ3 where we used equation 19 and m R. Recall that x1 = α1 Q V µ1 = α1a1v1, where V = span{µ2, . . . , µk} and a1 = µ1, v1 . To show the second bound, ˆv1 v1 = ˆx1 ˆx1 x1 x1 x1 + ˆx1 1 x1 1 ˆx1 x1 + | ˆx1 x1 | x1 2 ˆx1 x1 < 2ϵ3 α1a1 := ϵ4 Ray, Neeman, Sanghavi, and Shakkottai Lemma 24 Let b A A < ϵ, ˆv1 v1 < ϵ4. Define d k matrices V = [v1V1:(k 1)] and b V = [ˆv1 b V1:(k 1)]. Then, b V b V T b Aˆv1 V V T Av1 < σ1(A) 3ϵ4 + 4ϵ σk 1(Zλ1) + ϵ(1 + ϵ4) Proof Similar to Lemma 23 we have from Wedin s theorem b V1:(k 1) b V T 1:(k 1) V1:(k 1)V T 1:(k 1) < 4ϵ σk 1(Zλ1). Then we can bound, b V b V T V V T ˆv1ˆv T 1 v1v T 1 + b V1:(k 1) b V1:(k 1) V1:(k 1)V T 1:(k 1) < 2 ˆv1 v1 + 4ϵ σk 1(Zλ1) < 2ϵ4 + 4ϵ σk 1(Zλ1) (20) b V b V T b Aˆv1 V V T Av1 (b V b V T V V T )Av1 + b V b V T (A b A)v1 + b V b V T b A(v1 ˆv1) b V b V T V V T A + A b A + b A v1 ˆv1 < σ1(A) 2ϵ4 + 4ϵ σk 1(Zλ1) + ϵ + (σ1(A) + ϵ)ϵ4 where we use inequality (20), Av1 σ1(A) as v1 is unit norm, b V b V T < 1 since b V is orthonormal, and b A < A + ϵ. Combining, b V b V T b Aˆv1 V V T Av1 < σ1(A) 3ϵ4 + 4ϵ σk 1(Zλ1) + ϵ(1 + ϵ4) Lemma 25 Let b A A < ϵ, ˆx1 x1 < ϵ3 < α1a1 2 , and ˆv1 v1 < ϵ4. 
Then, |ˆa1 a1| < α1a1 (2σ1(A)ϵ4 + ϵ(1 + ϵ4)) + 2(σ1(A) + ϵ)ϵ3 Proof We first compute, |ˆv T 1 b Aˆv1 v T 1 Av1| |(v T 1 ˆv T 1 )Av1| + |ˆv T 1 (A b A)v1| + |ˆv T 1 b A(v1 ˆv1)| v T 1 ˆv T 1 σ1(A) + A b A + σ1( b A) v1 ˆv1 < σ1(A)ϵ4 + ϵ + (σ1(A) + ϵ)ϵ4 = 2σ1(A)ϵ4 + ϵ(1 + ϵ4) (21) The Search Problem in Mixture Models using the fact that v1, ˆv1 have unit norms. Now we can bound the error |ˆa1 a1| as follows. ˆv T 1 b Aˆv1 ˆx1 v T 1 Av1 x1 1 x1 |ˆv T 1 b Aˆv1 v T 1 Av1| + |ˆv T 1 b Aˆv1|| x1 ˆx1 | From equation (21) and using | x1 ˆx1 | < ˆx1 x1 < ϵ3, x1 = α1a1 we get, |ˆa1 a1| < 2σ1(A)ϵ4 + ϵ(1 + ϵ4) α1a1 + (σ1(A) + ϵ)ϵ3 α1a1(α1a1 ϵ3) < α1a1 (2σ1(A)ϵ4 + ϵ(1 + ϵ4)) + 2(σ1(A) + ϵ)ϵ3 α2 1a2 1 since ϵ3 < α1a1 Note that from Lemma 23 taking 2ϵ3 α1a1 = ϵ4 the above bound becomes |ˆa1 a1| < 6σ1(A)ϵ3+ϵα1a1+4ϵϵ3 E.1 Proof of Theorem 3 We now proof Theorem 3. Assume b Zλ Zλ1 ϵ2. Under the assumptions we have using Lemma 23 ˆx1 x1 < ϵ3 = 2ϵ + 4ϵ2R σk 1(Zλ1), ˆv1 v1 < ϵ4 = 2ϵ3 α1a1 . Also from Lemma 24 we have b V b V T b Aˆv1 V V T Av1 < σ1(A) 3ϵ4 + 4ϵ σk 1(Zλ1) + ϵ(1 + ϵ4). Using these we compute the first bound as follows. b V b V T b Aˆv1 ˆx1 V V T Av1 b V b V T b Aˆv1 1 ˆx1 1 x1 + 1 x1 b V b V T b Aˆv1 V V T Av1 ˆx1 x1 + 1 x1 b V b V T b Aˆv1 V V T Av1 Now using bounds from Lemma 23, 24 we get, ˆµ1 µ1 < (σ1(A) + ϵ)ϵ3 α1a1(α1a1 ϵ3) + σ1(A) 3ϵ4 + 4ϵ σk 1(Zλ1) + ϵ(1 + ϵ4) < 2 α2 1a2 1 [(σ1(A) + ϵ) ϵ3 + α1a1 ((3σ1(A) + ϵ)ϵ4 +ϵ (1 + 4σ1(A)/σk 1(Zλ1)))] < 2 α2 1a2 1 [(σ1(A) + ϵ) ϵ3 + 2 (3σ1(A) + ϵ) ϵ3 +α1a1ϵ (1 + 4σ1(A)/σk 1(Zλ1))] 2 10σ1(A)ϵ3 + 5α1a1ϵ σ1(A) σk 1(Zλ1) α2 1a2 1 Ray, Neeman, Sanghavi, and Shakkottai assuming ϵ3 α1a1 2 , σ1(A) ϵ, and σ1(A) > σk 1(Zλ1). Now expanding ϵ3 and rearranging terms we have, ˆµ1 µ1 < 1 α2 1a2 1 40 + 10 α1a1 σk 1(Zλ1) σ1(A)ϵ + 80 σ1(A)Rϵ2 < 80 α2 1a2 1 σ1(A)ϵ 1 + α1a1 σk 1(Zλ1) To prove the second bound from Lemma 25 and assuming ϵ < σ1(A) we have |ˆa1 a1| 10σ1(A)ϵ3+α1a1ϵ α2 1a2 1 . Then, ˆa1(α1 ˆα1) = ˆa1α1 ˆa1ˆα1 = a1α1 ˆa1ˆα1 + ˆa1α1 a1α1 ˆa1|α1 ˆα1| |a1α1 ˆa1ˆα1| + α1|ˆa1 a1| |α1 ˆα1| 1 ˆa1 ( x1 ˆx1 + α1|ˆa1 a1|) < ϵ3 + α1|ˆa1 a1| a1 |ˆa1 a1| 2 ϵ3 + (10σ1(A)ϵ3+α1a1ϵ) using |ˆa1 a1| < a1 2 . We have, |α1 ˆα1| 2α1a2 1ϵ3 + 10σ1(A)ϵ3 + α1a1ϵ α1a2 1 + 10σ1(A) (2ϵ + 4Rϵ2/σk 1(Zλ1)) + α1a1ϵ η1ϵ + η2Rϵ2 σk 1(Zλ1) where η1 := max{α1a1(2a1 + 1), 20}, and η2 := max{α1a2 1, 10}. Finally using equation (18) we can bound b Zλ Zλ1 ϵ2 3η3ϵ, where η3 = w1 , c3σ1(B) o . Using this in equations (22) and (23) proves the theorem. E.2 Related Lemmas In this section we prove a supporting lemma for Lemma 5. Lemma 26 Let {µ2, . . . , µk} be linearly independent. Suppose matrix Zλ be expressed as, i=2 αi(1 λ wi)µiµT i = V1:(k 1)Σ1:(k 1)V T 1:(k 1) = i=2 σi 1(Zλ )viv T i , (24) where wi = µi, v , V1:(k 1) = [v2, . . . , vk] the matrix of k 1 singular vectors, and Σ1:(k 1) is a diagonal matrix of singular values of Zλ . Then {v2, . . . , vk} forms a basis of span{µ2, . . . , µk}. The Search Problem in Mixture Models Proof Define VZλ as the column space of matrix Zλ . First observe that from equation (24) each column of Zλ can be written as a linear combination of {µ2, . . . , µk}. Therefore any vector in the column space VZλ can be written as a linear combination of {µ2, . . . , µk}. this implies, VZλ span{µ2, . . . , µk} (25) Now any vector y VZλ can be written as y = Zλ x = Pk i=2 σi 1(Zλ ) vi, x vi using equation (24). This implies, VZλ span{v2, . . . , vk} (26) Conversely any vector s span{v2, . . . 
, vk} can be written as s = V1:(k 1)r = Zλ V1:(k 1)Σ 1 1:(k 1)r = Zλ r , using equation (24), where r = V1:(k 1)Σ 1 1:(k 1)r. This implies, span{v2, . . . , vk} VZλ (27) Therefore combining equations (25),(26),(27) we get, span{v2, . . . , vk} = VZλ span{µ2, . . . , µk} (28) Note that both the vector spaces span{v2, . . . , vk} and span{µ2, . . . , µk} have rank k 1 since {v2, . . . , vk} are orthonormal, and {µ2, . . . , µk} are linearly independent. Then from this rank constraint and equation (28) we must have: span{v2, . . . , vk} = span{µ2, . . . , µk} This implies {v2, . . . , vk} forms a basis of span{µ2, . . . , µk}. Appendix F. Subspace Clustering Proofs In this section we prove Theorem 6 and the necessary lemmas. The main point is the following infinite-sample analysis, which shows that the top m eigenvectors of the whitened matrix B can be used to recover the subspace U1. Theorem 27 Suppose that there is some δ > 0 such that Uiv 2 (1/3 δ) U1v 2 for all i = 1. Let Y = [u1, ..., um] be the matrix of top m eigenvectors of R = D 1/2V T BV D 1/2 and Z = V D1/2Y. Let Z be the subspace spanned by columns of Z. Then, 2. σm(R) σm+1(R) 3δ U1v 2 Proof Define wi = Ui UT i v = UT i v , and Ui := αi D 1/2V T Ui; note that Pk i=1 Ui UT i is the (km) (km) identity matrix, which implies that each Ui has orthonormal columns. Consider the whitened B matrix. Using Theorem 13, D 1/2V T BV D 1/2 = i=1 w2 i Ui UT i + 2 i=1 Ui UT i vv T Ui UT i i=1 w2 i Ui UT i + 2 i=1 vi v T i = i=1 (w2 i Ui UT i + 2 vi v T i ) Ray, Neeman, Sanghavi, and Shakkottai where vi = Ui UT i v. Note that vi are orthogonal to each other and each vi is in the space Ui, the span of corresponding Ui. Moreover, vi = wi. Now for each i consider a different orthonormal basis Vi of Ui such that in this basis the first unit vector is aligned along vi. Define a rotation Ri such that Vi = Ui Ri. Then Vi V T i = Ui UT i . Therefore we can write the above equation as R = D 1/2V T BV D 1/2 = i=1 Vi Di V T i (29) where each Di is a diagonal matrix with one maximum value of 3w2 i and all other values w2 i , and also the matrices Vi are orthogonal. Under the assumption that w2 i (1/3 δ)w2 1, it follows that the top m eigenvectors of R are the columns of Vi, and that the corresponding eigenvalues are 3w2 1 and then w2 1 repeated m 1 times. Therefore we can write Y = Ui O, where O is an m m orthogonal matrix. Then, Z = V D1/2Y = V D1/2 Ui O = α1U1O This proves the first statement that Z, the span of the columns of Z, is the subspace U1, the span of columns of U1. The second statement follows from equation (29) since the maximum value of the m + 1-th eigenvalue is 3w2 i for some i = 1. Hence, σm(R) σm+1(R) w2 1 3 max i =1 w2 i 3δw2 1 = 3δ U1v 2. Lemma 28 Let ˆA A < ϵ < σmk(A)/4. A = V DV T and ˆA = ˆV ˆD ˆV T be the eigen decompositions of A, ˆA. Let ˆW = ˆV ˆD 1/2 be the whitening matrix. Then, Ik ( ˆW T A ˆW) 1/2 4ϵ σmk(A) Proof We prove this along the lines in Hsu and Kakade (2013). The matrix ˆW whitens ˆA since, ˆW T ˆA ˆW = ˆD 1/2 ˆV T ˆA ˆV ˆD 1/2 = Ik Also ϵ < σmk(A)/2, hence using Weyl s inequality σmk( ˆA) σmk(A)/2. This implies Ik ˆW T A ˆW = ˆW T ( ˆA A) ˆW ˆW 2 ˆA A < 2ϵ σmk(A) Therefore all eigenvalues of the matrix ˆW T A ˆW lie in the interval (1 2ϵ/σmk(A), 1 + 2ϵ/σmk(A)) . This implies the eigenvalues of ( ˆW T A ˆW) 1 lie in the interval (1/(1 + 2ϵ/σmk(A)), 1/(1 2ϵ/σmk(A))). 
Then, The Search Problem in Mixture Models (Ik ( ˆW T A ˆW) 1/2)(Ik + ( ˆW T A ˆW) 1/2) = Ik ( ˆW T A ˆW) 1 Ik ( ˆW T A ˆW) 1/2 = Ik ( ˆW T A ˆW) 1 (Ik + ( ˆW T A ˆW) 1/2) 1 Ik ( ˆW T A ˆW) 1/2 Ik ( ˆW T A ˆW) 1 1 1 2ϵ/σmk(A) 1 4ϵ σmk(A) Lemma 29 (Whitening matrix perturbation) Assume ˆA A < ϵ < σmk(A)/4. Let ˆW = ˆV ˆD 1/2 be the whitening matrix. Define W := ˆW( ˆW T A ˆW) 1/2 . Then, ˆW W 8ϵ σmk(A)3/2 Proof We note that the matrix W whitens the matrix A, since W T AW = ( ˆW T A ˆW) 1/2 ˆW T A ˆW( ˆW T A ˆW) 1/2 = Ik We can bound the perturbation as follows. ˆW W = ˆW(Ik ( ˆW T A ˆW) 1/2) ˆW Ik ( ˆW T A ˆW) 1/2 4ϵ σmk(A) = 8ϵ σmk(A)3/2 where the last inequality follows from Lemma 28. Lemma 30 Let max{ ˆA A , ˆB B } < ϵ, and also let ϵ < min{σ1(B)/2, σmk(A) 16 }. W = ˆW( ˆW T A ˆW) 1/2 be the whitening matrix. Define R = W T BW as the whitened B matrix, and ˆR = ˆW T ˆB ˆW is its estimate. Then, ˆR R < 51σ1(B)ϵ σmk(A)2 := ϵ1 Proof From Lemma 29 we have ˆW W 8ϵ σmk(A)3/2 < ˆW /2. Also we know ˆW p 2/σmk(A). We obtain the required bound as follows. Ray, Neeman, Sanghavi, and Shakkottai ˆR R = ˆW T ˆB ˆW W T BW ( ˆW W)T ˆB ˆW + W T ( ˆB B) ˆW + W T B( ˆW W) 3 2 ˆW W B ˆW + 3 2 ˆW 2 ˆB B + 3 2 ˆW T B ˆW W = 3 ˆW W B ˆW + 3 2 ˆW 2 ˆB B < 48 σ1(B)ϵ σmk(A)2 + 3ϵ σmk(A) < 51σ1(B)ϵ Lemma 31 Suppose Y = [u1, . . . , um] be the matrix of m largest eigenvectors of R = W T BW, and ˆY be that of ˆR = ˆW T ˆB ˆW. Let ˆZ = ˆV ˆD1/2 ˆY . Then, ˆZ ˆZT ZZT C1 σ1(A)σ1(B)ϵ (σm(R) σm+1(R))σmk(A)2 where Z satisfies Y = W T Z, and C1 is a constant. Proof First using Wedin s theorem for the matrix A and ˆA we get ˆV ˆV T V V T < 4ϵ σmk(A). (30) From Lemma 30 we have ˆR R < 51σ1(B)ϵ σmk(A)2 = ϵ1. Therefore we can again use Wedin s theorem on the matrices R, ˆR to bound the perturbation of the subspace spanned by Y. ˆY ˆY T Y Y T 4 ˆR R σm(R) σm+1(R) = 4ϵ1 σm(R) σm+1(R). (31) We now bound the following term. ˆV ˆD1/2W T ˆV ˆV T = ˆV ˆD1/2( ˆW T A ˆW) 1/2 ˆW T ˆV ˆV T = ˆV ˆD1/2( ˆW T A ˆW) 1/2 ˆD 1/2 ˆV T ˆV ˆV T ˆD1/2( ˆW T A ˆW) 1/2 ˆD 1/2 Ik ˆD1/2 ( ˆW T A ˆW) 1/2 Ik ˆD 1/2 4ϵ σmk(A) 8σ1(A)1/2ϵ σmk(A)3/2 (32) where the second to last inequality follows from Lemma 28. Next we show that ˆZ ˆZT is close to the projection of ZZT onto the subspace ˆV ˆV T . The Search Problem in Mixture Models ˆZ ˆZT ˆV ˆV T ZZT ˆV ˆV T = ˆV ˆD1/2 ˆY ˆY T ˆD1/2 ˆV T ˆV ˆV T ZZT ˆV ˆV T ˆV ˆD1/2( ˆY ˆY T Y Y T ) ˆD1/2 ˆV T + ˆV ˆD1/2Y Y T ˆD1/2 ˆV T ˆV ˆV T ZZT ˆV ˆV T σ1( ˆA) ˆY ˆY T Y Y T + ˆV ˆD1/2W T ZZT W ˆD1/2 ˆV T ˆV ˆV T ZZT ˆV ˆV T (33) We bound the second term as follows. Observe that the matrix D 1/2V T also whitens the matrix A. Therefore Z can be expressed as Z = V D1/2U where U is a matrix with orthonormal columns. This implies ZZT = V D1/2U U T D1/2V T σ1(A). ˆV ˆD1/2W T ZZT W ˆD1/2 ˆV T ˆV ˆV T ZZT ˆV ˆV T ( ˆV ˆD1/2W T ˆV ˆV T )ZZT W ˆD1/2 ˆV T + ˆV ˆV T ZZT (W ˆD1/2 ˆV T ˆV ˆV T ) ( ˆV ˆD1/2W T ˆV ˆV T )ZY T ˆD1/2 ˆV T + ZZT W ˆD1/2 ˆV T ˆV ˆV T ˆV ˆD1/2W T ˆV ˆV T Z ˆD1/2 + ZZT W ˆD1/2 ˆV T ˆV ˆV T σmk(A)3/2 2σ1(A) + σ1(A) 8σ1(A)1/2ϵ = 24 σ1(A)3/2ϵ The second to last step follows from equation 32. Now using the above bound in equation 33 we get, ˆZ ˆZT ˆV ˆV T ZZT ˆV ˆV T σ1( ˆA) ˆY ˆY T Y Y T + 24 σ1(A)3/2ϵ 8σ1(A)ϵ1 σm(R) σm+1(R) + 24 σ1(A)3/2ϵ σmk(A)3/2 (34) where the last step follows from inequalities (31). We compute the required bound by combining equations (30) and (34) as follows. 
ˆZ ˆZT ZZT = ˆZ ˆZT V V T ZZT V V T ˆZ ˆZT ˆV ˆV T ZZT ˆV ˆV T + 3 V V T ˆV ˆV T ZZT 8σ1(A)ϵ1 σm(R) σm+1(R) + 24 σ1(A)3/2ϵ σmk(A)3/2 + 12σ1(A)ϵ C1 σ1(A)σ1(B)ϵ (σm(R) σm+1(R))σmk(A)2 where C1 is a constant. Ray, Neeman, Sanghavi, and Shakkottai F.1 Proof of Theorem 6 The proof follows from Theorem 27 and Lemma 31. Note that the matrix Z has all singular values equal to α1, therefore ZZT has singular values α1. Under the affinity condition from Theorem 27, we have σm(R) σm+1(R) 3δ U1v 2 Combining with Lemma 31 we get ˆZ ˆZT ZZT C2σ1(A)σ1(B)ϵ δ U1v 2σmk(A)2 where C2 is a constant. Finally applying Wedin s theorem for the matrices ˆZ ˆZT and ZZT , we have ˆU ˆUT U1UT 1 C3σ1(A)σ1(B)ϵ α1δ U1v 2σmk(A)2 Cσ1(A)2ϵ α1δσmk(A)2 where C3 = 4C2. Appendix G. Sample Complexity Analysis Since the basic application of our method requires the estimation of certain covariance matrices, we need to show that one can estimate these matrices. There is a large literature on estimating covariance matrices, but for simplicity we will only focus on the simplest estimator: the sample covariance matrix. By well-known matrix concentration inequalities, one can show that the sample covariance matrix will be close to the covariance matrix with high probability if the sample size is large enough: Theorem 32 Tropp (2015) Let A1, . . . , An be i.i.d. symmetric random d d matrices. If A1 L a.s. then G.1 Truncation Unfortunately, the matrices we will be dealing with do not usually have almost sure bounds on their norm. Here, we develop some straightforward truncation arguments in order to adapt Theorem 32. Theorem 33 Suppose that A1, . . . , An are i.i.d. symmetric random d d matrices satisfying the tail bound Pr( A1 t) Ce ctα for some α > 0. Then for any ϵ, δ > 0, if n Ωα(ϵ 2 log(d/δ)) then Pr( ˆEA EA ϵ) δ, where Ωα(k) means C(α)Ω(k log C(α) k). The Search Problem in Mixture Models Proof Fix L > 0 (to be determined later) and define the random matrix Bi by Bi = Ai1{ Ai L}. Then Theorem 32 applies to Bi: if n Ω(L2ϵ 2 log(d/δ)) then Pr( ˆEB EB ϵ) δ. To compare this with the similar quantity involving A, we will consider ˆE(A B) and E(A B) separately. First, note that Pr(Ai = Bi) = Pr( A L) C exp( c Lα). If L = Ω(log1/α(n/δ)) then Pr(Ai = Bi) δ/n. By a union bound, Pr(ˆEA = ˆEB) δ. (35) Now we fix L = C log1/α(n/(δ ϵ)) and we consider E(A B) . By the triangle inequality, E(A B) = EA1{ A L} E A 1{ A L}. On the other hand, we can bound E A 1{ A L} = Z L Pr( A t) dt C Z L e ctα dt. With the change of variables t = u1/α, we have E A 1{ A L} 1 Lα u1/αe cu du. Now, if u C 1 α for large enough C then u1/αe cu e cu/2. Hence, if Lα C 1 E A 1{ A L} 1 Lα e cu/2 du C(α)e c Lα/2 C(α)ϵ where the last inequality holds if the constant C in the definition of L is large enough compared to c. On the other hand, if Lα < C 1 α then we must have ϵ > c(α) for some c(α) > 0. In this case, E A 1{ A L} C C(α)ϵ trivially. To summarize, in every case we have E(A B) C(α)ϵ. Putting this together with (35), we have that if n Ω(L2ϵ 2 log(d/δ)) then with probability at least 1 2δ, ˆEA EA ˆEB EB + ˆEA ˆEB + EA EB (1 + C(α))ϵ. Finally, recalling that L = polylog(n, 1/ϵ, 1/δ) (with the polynomial depending on α), we see that n = Ωα(ϵ 2 log(d/δ)) suffices. Finally, we can absorb the constant C(α) into ϵ. We will now show how Theorem 33 bounds the error in estimating the various matrices that we had to estimate for the various different models we considered. 
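To illustrate the truncation step behind Theorem 33, here is a minimal sketch (our own, under the stated tail assumption) of estimating E[A] from i.i.d. symmetric sample matrices after discarding those whose spectral norm exceeds a threshold L; the constant in the choice of L is an arbitrary placeholder rather than a tuned value.

```python
import numpy as np

def truncated_matrix_mean(matrices, delta=0.05, alpha=1.0):
    """Estimate E[A] from i.i.d. symmetric matrices A_1, ..., A_n, truncating
    samples with large spectral norm as in the proof of Theorem 33.

    matrices : iterable of (d, d) symmetric numpy arrays
    delta    : target failure probability
    alpha    : assumed tail exponent, Pr(||A|| >= t) <= C exp(-c t^alpha)
    """
    A_list = list(matrices)
    n = len(A_list)
    # L = O(log^{1/alpha}(n / delta)); the factor 10 is a placeholder constant.
    L = 10.0 * np.log(n / delta) ** (1.0 / alpha)

    d = A_list[0].shape[0]
    acc = np.zeros((d, d))
    kept = 0
    for A in A_list:
        if np.linalg.norm(A, 2) <= L:      # spectral-norm truncation
            acc += A
            kept += 1
    # Average over all n samples (truncated samples contribute zero),
    # mirroring B_i = A_i 1{||A_i|| <= L} in the proof.
    return acc / n, kept, L
```

For instance, for the Gaussian mixture matrices treated in the next subsection one would pass the rank-one matrices x_i x_i^T / \sqrt{d}, matching the scaling used in that proof.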
We will now show how Theorem 33 bounds the error in estimating the moment matrices that arise in the different models we considered. Essentially, we will repeatedly use the observation that if $z$ is a standard Gaussian variable then $z^{2/\alpha}$ has a tail that decays like $e^{-ct^\alpha}$; in other words, moments of Gaussians naturally lead to tail conditions of the form assumed in Theorem 33.

G.2 Gaussian Mixture Model

For the following theorem, we revert to the notation of the Gaussian mixture model.

Theorem 34 Fix $\epsilon, \delta > 0$. Let $\hat A = \hat{\mathbb{E}}[xx^T]$ and $\hat B = \hat{\mathbb{E}}[\langle x, v\rangle^2 xx^T]$, where $\hat{\mathbb{E}}$ is taken with $n$ i.i.d. samples. If $n \ge \Omega(d\epsilon^{-2}\log(d/\delta))$ then with probability at least $1 - \delta$, $\|\hat A - A\| \le \epsilon$ and $\|\hat B - B\| \le \epsilon$.

Proof To estimate $A$, first note that $\|xx^T\| = \|x\|^2$. Now, $\mathbb{E}\|x\|^2 \le R^2 + d\sigma^2$, where $R = \max_i\|\mu_i\|$, and also $\Pr(\|x\|^2 \ge \mathbb{E}\|x\|^2 + t\sqrt{d}) \le Ce^{-ct}$. Hence, we may apply Theorem 33 with $A_i = x_ix_i^T/\sqrt{d}$ and $\alpha = 1$; this yields the claimed bound on $\|\hat A - A\|$.

To estimate $B$, note that $\|\langle x, v\rangle^2 xx^T\| = \langle x, v\rangle^2\|x\|^2$. Now, the triangle inequality implies that $\langle x, v\rangle^2\|x\|^2$ is stochastically dominated by $4R^4 + 4\mathbb{E}[\langle z, v\rangle^2\|z\|^2] = 4R^4 + 4\mathbb{E}[z_1^2\|z\|^2]$, where $z$ is a standard (i.e., centered) Gaussian vector. Then $\mathbb{E}[z_1^2\|z\|^2] = 2 + d$, and $z_1^2\|z\|^2$ has tails of order $e^{-ct^{1/2}}$; that is, it satisfies the assumptions of Theorem 33 with $\alpha = 1/2$. Applying Theorem 33 with $A_i = \langle x_i, v\rangle^2 x_ix_i^T/\sqrt{d}$ then yields the claimed bound on $\|\hat B - B\|$.

G.3 LDA Topic Model

For the following theorem, we revert to the notation of the LDA topic model, where $d$ is the size of the dictionary.

Theorem 35 Fix $\epsilon, \delta > 0$. Let $\hat A = \hat{\mathbb{E}}[x_1x_2^T]$ and $\hat B = \hat{\mathbb{E}}[\langle x_3, v\rangle x_1x_2^T]$, where $\hat{\mathbb{E}}$ is taken with $n$ i.i.d. samples. If $n \ge \Omega(\epsilon^{-2}\log(d/\delta))$ then with probability at least $1 - \delta$, $\|\hat A - A\| \le \epsilon$ and $\|\hat B - B\| \le \epsilon$.

Proof We can apply Theorem 32 directly, since $\|x_1x_2^T\| \le 1$ and $\|\langle x_3, v\rangle x_1x_2^T\| \le 1$.

G.4 Mixed Regression

For the following theorem, we revert to the notation of the mixed regression model.

Theorem 36 Fix $\epsilon, \delta > 0$. Let $\hat A = \hat{\mathbb{E}}[y^2xx^T]$ and $\hat B = \hat{\mathbb{E}}[y^3\langle x, v\rangle xx^T]$, where $\hat{\mathbb{E}}$ is taken with $n$ i.i.d. samples. Let $R = \max_i\|\mu_i\|$. If $n \ge \Omega((R^2 + \sigma^2)\epsilon^{-2}d\log(d/\delta))$ then with probability at least $1 - \delta$, $\|\hat A - A\| \le \epsilon$ and $\|\hat B - B\| \le \epsilon$.

Proof Recalling that in cluster $i$ we have $y = \langle x, \mu_i\rangle + \xi$, we have $\|y^2xx^T\| \le 2\langle x, \mu_i\rangle^2\|x\|^2 + 2\xi^2\|x\|^2$. Hence, $\mathbb{E}\|y^2xx^T\| \le 2R^2(2 + d) + \sigma^2 d$, with tails that decay at the rate $e^{-ct^{1/2}}$. Applying Theorem 33 implies the claimed bounds for $A$. The case of $B$ is analogous, except that since it involves sixth moments the tails will decay at the rate $e^{-ct^{1/3}}$; this only affects the poly-logarithmic terms hidden in the $\Omega$ notation.

G.5 Subspace Clustering

For the following theorem, we revert to the notation of the subspace clustering model. We assume for simplicity that $\sigma$ is known, since if it is not then it can be easily and accurately estimated.

Theorem 37 Fix $\epsilon, \delta > 0$. Let $\hat A = \hat{\mathbb{E}}[xx^T] - \sigma^2 I_d$ and
$$\hat B = \hat{\mathbb{E}}[\langle x, v\rangle^2 xx^T] - \sigma^2(v^T\hat Av)I_d - \sigma^2\|v\|^2\hat A - \sigma^4(\|v\|^2 I_d + vv^T) - 2\sigma^2(\hat Avv^T + vv^T\hat A),$$
where $\hat{\mathbb{E}}$ is taken with respect to $n$ i.i.d. samples. If $n \ge \Omega(\epsilon^{-2}(1 + \sigma^2)\|v\|^2 m\log(d/\delta))$ then with probability at least $1 - \delta$, $\|\hat A - A\| \le \epsilon$ and $\|\hat B - B\| \le \epsilon$.

Proof Since $x/\sigma$ is an $m$-dimensional Gaussian vector, $\|x\|^2/(\sigma^2 m)$ is concentrated around its mean ($1$) with tails of order $e^{-ct}$. In other words, Theorem 33 (with $\alpha = 1$) implies our claim for $A$. The claim for $B$ is analogous, except that since it involves fourth moments, the tails will decay at the rate $e^{-ct^{1/2}}$.
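For concreteness, the plug-in estimates of Theorem 37 can be computed from a sample matrix as in the following minimal NumPy sketch (assuming, as in the theorem, that $\sigma$ is known; the function and variable names are ours for illustration, not the paper's implementation).

```python
import numpy as np

def subspace_moment_estimates(X, v, sigma):
    """Plug-in estimates A_hat and B_hat from Theorem 37.

    X     : (n, d) array of samples, one sample x_i per row
    v     : (d,) side-information vector
    sigma : known noise standard deviation
    """
    n, d = X.shape
    I = np.eye(d)

    # A_hat = E_hat[x x^T] - sigma^2 I
    A_hat = X.T @ X / n - sigma**2 * I

    # B_hat = E_hat[<x, v>^2 x x^T] minus the correction terms of Theorem 37
    w = X @ v                                   # <x_i, v> for each sample
    B_raw = (X * (w**2)[:, None]).T @ X / n     # E_hat[<x, v>^2 x x^T]
    vvT = np.outer(v, v)
    B_hat = (B_raw
             - sigma**2 * (v @ A_hat @ v) * I
             - sigma**2 * (v @ v) * A_hat
             - sigma**4 * ((v @ v) * I + vvT)
             - 2 * sigma**2 * (A_hat @ vvT + vvT @ A_hat))
    return A_hat, B_hat
```

The same pattern, with the appropriate weighting ($\langle x, v\rangle^2$ for the Gaussian mixture model, $\langle x_3, v\rangle$ for LDA, $y^3\langle x, v\rangle$ for mixed regression) and without the noise corrections where they are not needed, gives the estimators of Theorems 34-36.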
References

Anima Anandkumar, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.

Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, May 2011.

Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In Proceedings of ICML-2013, pages 280–288, 2013.

Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, 2006.

Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In COLT, pages 560–604, 2014.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition (CVPR 2009), IEEE Conference on, pages 2790–2797. IEEE, 2009.

Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two Gaussians. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 753–760, 2015.

Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.

Furong Huang, U. N. Niranjan, Mohammad Umar Hakeem, and Animashree Anandkumar. Online tensor methods for learning latent variable models. Journal of Machine Learning Research, 16:2797–2835, 2015.

Pirkko Kuusela and Daniel Ocone. Learning with side information: PAC learning bounds. Journal of Computer and System Sciences, 68(3):521–545, 2004.

Yue Lu and Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. In Proceedings of the 17th International Conference on World Wide Web, pages 121–130. ACM, 2008.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

Jon D. Mcauliffe and David M. Blei. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128, 2008.

Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93–102. IEEE, 2010.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics, 2010.

David Newman, Edwin V. Bonilla, and Wray Buntine. Improving topic coherence with regularized topic models. In Advances in Neural Information Processing Systems, pages 496–504, 2011.

Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems, pages 2753–2761, 2014.
Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, pages 71–110, 1894.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 248–256. Association for Computational Linguistics, 2009.

Richard A. Redner and Homer F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.

M. Röder, A. Both, and A. Hinneburg. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408. ACM, 2015.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494, 2004.

Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning mixtures of generalized linear models. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2016), pages 1223–1231, 2016.

Mahdi Soltanolkotabi and Emmanuel J. Candès. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, pages 2195–2238, 2012.

TACC. Texas Advanced Computing Center, 2018. http://www.tacc.utexas.edu.

Joel Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

UCI. NY Times dataset, 2008. http://mlr.cs.umass.edu/ml/machine-learning-databases/.

Eric P. Xing, Michael I. Jordan, Stuart Russell, and Andrew Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 505–512, 2002.

Tianbao Yang, Rong Jin, and Anil K. Jain. Learning from noisy side information by generalized maximum entropy model. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1199–1206, 2010.

Yelp. Yelp dataset, 2014. http://www.yelp.com/dataset_challenge/.

Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In Proceedings of the International Conference on Machine Learning (ICML 2014), pages 613–621, 2014.