Learning Mixtures of Linear Classifiers

Yuekai Sun  YUEKAI@STANFORD.EDU
Institute for Computational and Mathematical Engineering, Stanford University, 475 Via Ortega, Stanford, CA 94305

Stratis Ioannidis  STRATIS.IOANNIDIS@TECHNICOLOR.COM
Technicolor, 175 S San Antonio Rd, Los Altos, CA 94022

Andrea Montanari  MONTANARI@STANFORD.EDU
Dept. of Electrical Engineering & Dept. of Statistics, Stanford University, 350 Serra Mall, Stanford, CA 94305

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We consider a discriminative learning (regression) problem, whereby the regression function is a convex combination of k linear classifiers. Existing approaches are based on the EM algorithm, or similar techniques, without provable guarantees. We develop a simple method based on spectral techniques and a "mirroring" trick, that discovers the subspace spanned by the classifiers' parameter vectors. Under a probabilistic assumption on the feature vector distribution, we prove that this approach has nearly optimal statistical efficiency.

1. Introduction

Since Pearson's seminal contribution (Pearson, 1894), and most notably after the introduction of the EM algorithm (Dempster et al., 1977), mixture models and latent variable models have played a central role in statistics and machine learning, with numerous applications; see, e.g., McLachlan & Peel (2004), Bishop (1998), and Bartholomew et al. (2011).

Despite their ubiquity, fitting the parameters of a mixture model remains a challenging task. The most popular methods (e.g., the EM algorithm or likelihood maximization by gradient ascent) are plagued by local optima and come with little or no guarantees. Computationally efficient algorithms with provable guarantees are the exception in this area. Even the idealized problem of learning mixtures of Gaussians has motivated a copious theoretical literature (Arora & Kannan, 2001; Moitra & Valiant, 2010).

In this paper we consider the problem of modeling a regression function as a mixture of k components. Namely, we are given labels $Y_i \in \mathbb{R}$ and feature vectors $X_i \in \mathbb{R}^d$, $i \in [n] \equiv \{1, 2, \dots, n\}$, and we seek estimates of the parameters of a mixture model

$$Y_i \mid X_i = x_i \;\sim\; \sum_{\ell=1}^{k} p_\ell\, f(y_i \mid x_i, u_\ell)\,. \tag{1}$$

Here $k$ is the number of components, $(p_\ell)_{\ell\in[k]}$ are the weights of the components, and $u_\ell$ is a vector of parameters for the $\ell$-th component. Models of this type have been intensely studied in the neural network literature since the early nineties (Jordan & Jacobs, 1994; Bishop, 1998). They have also found numerous applications, ranging from object recognition (Quattoni et al., 2004) to machine translation (Liang et al., 2006). These studies are largely based on learning algorithms without consistency guarantees.

Recently, Chaganty & Liang (2013) considered mixtures of linear regressions, whereby the relation between labels and feature vectors is linear within each component; i.e., $Y_i = \langle u_\ell, X_i\rangle + \text{noise}$ (here and below $\langle a, b\rangle$ denotes the standard inner product in $\mathbb{R}^m$). Equivalently, $f(y_i \mid x_i, u_\ell) = f_0(y_i - \langle x_i, u_\ell\rangle)$ with $f_0(\,\cdot\,)$ a density of mean zero. Building on a new approach developed by Hsu et al. (2012) and Anandkumar et al. (2012), these authors propose an algorithm for fitting mixtures of linear regressions with provable guarantees. The main idea is to regress $Y_i^q$, for $q \in \{1, 2, 3\}$, against the tensors $X_i$, $X_i \otimes X_i$, $X_i \otimes X_i \otimes X_i$. The coefficients of these regressions are tensors whose decomposition yields the parameters $u_\ell, p_\ell$.
While the work of Chaganty & Liang (2013) is a significant step forward, it leaves several open problems:

Statistical efficiency. Consider a standard scaling of the feature vectors, whereby the components $(X_{i,j})_{j\in[d]}$ are of order one. Then the mathematical guarantees of Chaganty & Liang (2013) require a sample size $n \gtrsim d^6$. This is substantially larger than the information-theoretically optimal scaling, and is an unrealistic requirement in high dimension (large $d$). As noted in (Chaganty & Liang, 2013), this scaling is an intrinsic drawback of the tensor approach, as it operates in a higher-dimensional space (tensor space) than the space in which the data naturally live.

Linear regression versus classification. In virtually all applications of the mixture model (1), the labels $Y_i$ are categorical; see, e.g., Jordan & Jacobs (1994), Bishop (1998), Quattoni et al. (2004), Liang et al. (2006). In this case, the very first step of Chaganty & Liang, namely regressing $Y_i^2$ on $X_i^{\otimes 2}$ and $Y_i^3$ on $X_i^{\otimes 3}$, breaks down. Consider, to be definite, the important case of binary labels (e.g., $Y_i \in \{0, 1\}$ or $Y_i \in \{+1, -1\}$). Then powers of the labels do not provide additional information (e.g., if $Y_i \in \{0, 1\}$, then $Y_i = Y_i^2$). Also, since $Y_i$ is non-linearly related to $u_\ell$, $Y_i^2$ does not depend only on $u_\ell^{\otimes 2}$.

Computational complexity. The method of Chaganty & Liang (2013) solves a regularized linear regression in $d^3$ dimensions and factorizes a third-order tensor in $d$ dimensions. Even under optimistic assumptions (finite convergence of iterative schemes), this requires $O(d^3 n + d^4)$ operations.

In this paper, we develop a spectral approach to learning mixtures of linear classifiers in high dimension. For the sake of simplicity, we shall focus on the case of binary labels $Y_i \in \{+1, -1\}$, but we expect our ideas to be more broadly applicable. We consider regression functions of the form $f(y_i \mid x_i, u_\ell) = f(y_i \mid \langle x_i, u_\ell\rangle)$, i.e., each component corresponds to a generalized linear model with parameter vector $u_\ell \in \mathbb{R}^d$.

In a nutshell, our method constructs a symmetric matrix $\hat Q \in \mathbb{R}^{d\times d}$ by taking a suitable empirical average of the data. The matrix $\hat Q$ has the following property: $(d-k)$ of its eigenvalues are roughly degenerate. The remaining $k$ eigenvalues correspond to eigenvectors that approximately span the same subspace as $u_1, \dots, u_k$. Once this subspace is accurately estimated, the problem dimensionality is reduced to $k$; as such, it is easy to come up with effective prediction methods (as a matter of fact, simple K-nearest neighbors works very well).

The resulting algorithm is computationally efficient, as its most expensive step is computing the eigenvector decomposition of a $d\times d$ matrix (which takes $O(d^3)$ operations). Assuming Gaussian feature vectors $X_i \in \mathbb{R}^d$, we prove that our method is also statistically efficient, i.e., it only requires $n \gtrsim d$ samples to accurately reconstruct the subspace spanned by $u_1, \dots, u_k$. This is the same amount of data needed to estimate the covariance of the feature vectors $X_i$, or a parameter vector $u_1 \in \mathbb{R}^d$ in the trivial case of a mixture with a single component, $k = 1$. It is unlikely that significantly better efficiency can be achieved without additional structure.
The assumption of Gaussian feature vectors $X_i$ is admittedly restrictive. On one hand, as for the problem of learning mixtures of Gaussians (Arora & Kannan, 2001; Moitra & Valiant, 2010), we believe that useful insights can be gained by studying this simple setting. On the other hand, as discussed below, our proof does not really require the distribution of the $X_i$ to be Gaussian, and a strictly weaker assumption is sufficient. We expect that future work will succeed in further relaxing this assumption.

1.1. Technical contribution and related work

Our approach is related to the principal Hessian directions (pHd) method proposed by Li (1992) and further developed by Cook (1998) and co-workers. pHd is an approach to dimensionality reduction and data visualization. It generalizes principal component analysis to the regression (discriminative) setting, whereby each data point consists of a feature vector $X_i \in \mathbb{R}^d$ and a label $Y_i \in \mathbb{R}$. Summarizing, the idea is to form the Hessian matrix $\hat H = n^{-1}\sum_{i=1}^n Y_i\, X_i X_i^T \in \mathbb{R}^{d\times d}$. (We assume here, for ease of exposition, that the $X_i$ have zero mean and unit covariance.) The eigenvectors associated to the eigenvalues with largest magnitude are used to identify a subspace in $\mathbb{R}^d$ onto which to project the feature vectors $X_i$.

Unfortunately, the pHd approach fails in general for the mixture models of interest here, namely mixtures of linear classifiers. For instance, it fails when each component of (1) is described by a logistic model $f(y_i = +1 \mid z) = (1 + e^{-z})^{-1}$ and the features are centered at $\mathbb{E}(X_i) = 0$; a proof can be found in the extended version of this paper (Sun et al., 2013). Our approach overcomes this problem by constructing $\hat Q = n^{-1}\sum_{i=1}^n Z_i\, X_i X_i^T \in \mathbb{R}^{d\times d}$. The $Z_i$ are pseudo-labels obtained by applying a "mirroring" transformation to the $Y_i$. Unlike with $\hat H$, the eigenvector structure of $\hat Q$ enables us to estimate the span of $u_1, \dots, u_k$.

As an additional technical contribution, we establish non-asymptotic bounds on the estimation error that allow us to characterize the trade-off between the data dimension $d$ and the sample size $n$. In contrast, rigorous analysis of pHd is limited to the low-dimensional regime of $d$ fixed as $n \to \infty$. It would be interesting to generalize the analysis developed here to characterize the high-dimensional properties of pHd as well.

2. Problem Formulation

[Figure 1. The mirroring process applied to a mixture of two 3-dimensional classifiers. (a) Labels generated by two classifiers in $\mathbb{R}^3$, including the parameter profiles and the corresponding classification surfaces. (b) The mirroring direction $\hat r$ (dashed vector), computed by (5), and the plane it defines; note that $\hat r$ lies approximately within the positive cone spanned by the two classifier profiles. (c) The result of the mirroring process: the region of points that was predominantly positive has remained unaltered, while the region of points that was predominantly negative has been flipped.]

Consider a dataset comprising $n$ i.i.d. pairs $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$, $i \in [n]$. We refer to the vectors $X_i \in \mathbb{R}^d$ as features and to the binary variables $Y_i$ as labels. We assume that the features $X_i \in \mathbb{R}^d$ are sampled from a Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and a positive definite covariance $\Sigma \in \mathbb{R}^{d\times d}$. The labels $Y_i \in \{-1, +1\}$ are generated by a mixture of linear classifiers, i.e.,

$$\Pr(Y_i = +1 \mid X_i) = \sum_{\ell=1}^{k} p_\ell\, f(\langle u_\ell, X_i\rangle)\,. \tag{2}$$
Here, $k \ge 2$ is the number of components in the mixture; $(p_\ell)_{\ell\in[k]}$ are the weights, satisfying $p_\ell > 0$ and $\sum_{\ell=1}^k p_\ell = 1$; and $(u_\ell)_{\ell\in[k]}$, $u_\ell \in \mathbb{R}^d$, are the normals to the hyperplanes defining the $k$ linear classifiers. We refer to each normal $u_\ell$ as the parameter profile of the $\ell$-th classifier; we assume that the profiles $u_\ell$, $\ell\in[k]$, are linearly independent, and that $k < d/2$.

We assume that the function $f : \mathbb{R} \to [0, 1]$, characterizing the classifier response, is analytic, non-decreasing, strictly concave in $[0, +\infty)$, and satisfies

$$\lim_{t\to+\infty} f(t) = 1, \qquad \lim_{t\to-\infty} f(t) = 0, \qquad 1 - f(t) = f(-t). \tag{3}$$

As an example, it is useful to keep in mind the logistic function $f(t) = (1 + e^{-t})^{-1}$. Fig. 1(a) illustrates a mixture of $k = 2$ classifiers over $d = 3$ dimensions.

2.2. Subspace Estimation, Prediction and Clustering

Our main focus is the following task:

Subspace Estimation: After observing $(X_i, Y_i)$, $i \in [n]$, estimate the subspace spanned by the profiles of the $k$ classifiers, i.e., $U \equiv \mathrm{span}(u_1, \dots, u_k)$.

For $\hat U$ an estimate of $U$, we characterize performance via the largest principal angle between the two subspaces, namely

$$d_P(U, \hat U) = \max_{x \in U}\, \min_{y \in \hat U}\, \arccos \frac{\langle x, y\rangle}{\|x\|\,\|y\|}\,.$$

Notice that projecting the features $X_i$ onto $U$ entails no loss of information w.r.t. (2). This can be exploited to improve the performance of several learning tasks through dimensionality reduction, by projecting the features onto the estimate of the subspace $U$. Two such tasks are:

Prediction: Given a new feature vector $X_{n+1}$, predict the corresponding label $Y_{n+1}$.

Clustering: Given a new feature vector and label pair $(X_{n+1}, Y_{n+1})$, identify the classifier that generated the label.

As we will see in Section 5, our subspace estimate can be used to significantly improve the performance of both prediction and clustering.

2.3. Technical Preliminary

We review here a few definitions used in our exposition. The sub-gaussian norm of a random variable $X$ is

$$\|X\|_{\psi_2} = \sup_{p\ge 1}\, p^{-1/2}\,\big(\mathbb{E}[|X|^p]\big)^{1/p}.$$

We say $X$ is sub-gaussian if $\|X\|_{\psi_2} < \infty$. We say that a random vector $X \in \mathbb{R}^d$ is sub-gaussian if $\langle y, X\rangle$ is sub-gaussian for any $y$ on the unit sphere $S^{d-1}$.

We use the following variant of Stein's identity (Stein, 1973; Liu, 1994). Let $X \in \mathbb{R}^d$, $X' \in \mathbb{R}^{d'}$ be jointly Gaussian random vectors, and consider a function $h : \mathbb{R}^{d'} \to \mathbb{R}$ that is almost everywhere (a.e.) differentiable and satisfies $\mathbb{E}[|\partial h(X')/\partial x_i|] < \infty$ for all $i \in [d']$. Then the following identity holds:

$$\mathrm{Cov}(X, h(X')) = \mathrm{Cov}(X, X')\, \mathbb{E}[\nabla h(X')]. \tag{4}$$
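Since identity (4) is used repeatedly below, a quick Monte Carlo sanity check may help to fix ideas. The following sketch is ours, not part of the paper; the test function $h(x') = \tanh(\langle w, x'\rangle)$ and all dimensions are arbitrary choices.

```python
import numpy as np

# Monte Carlo sanity check of Stein's identity (4):
#   Cov(X, h(X')) = Cov(X, X') E[grad h(X')]
# for jointly Gaussian (X, X'). The test function h(x') = tanh(<w, x'>) is an
# arbitrary choice; its gradient is (1 - tanh(<w, x'>)^2) w.
rng = np.random.default_rng(0)
d, d_prime, n = 3, 2, 200_000

# Jointly Gaussian (X, X') with a random positive definite covariance.
A = rng.standard_normal((d + d_prime, d + d_prime))
cov = A @ A.T + np.eye(d + d_prime)
joint = rng.multivariate_normal(np.zeros(d + d_prime), cov, size=n)
X, Xp = joint[:, :d], joint[:, d:]

w = rng.standard_normal(d_prime)
h = np.tanh(Xp @ w)                              # h(X'), shape (n,)
grad_h = (1.0 - h ** 2)[:, None] * w             # grad h(X'), shape (n, d')

lhs = ((X - X.mean(0)) * (h - h.mean())[:, None]).mean(0)   # Cov(X, h(X'))
rhs = cov[:d, d:] @ grad_h.mean(0)                          # Cov(X, X') E[grad h(X')]
print(np.round(lhs, 3), np.round(rhs, 3))        # the two vectors should nearly agree
```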
3. Subspace Estimation

In this section, we present our algorithm for subspace estimation, which we refer to as SPECTRALMIRROR. Our main technical contribution, stated formally below, is that the output $\hat U$ of SPECTRALMIRROR is a consistent estimator of the subspace $U$ as soon as $n \ge C\,d$, for a sufficiently large constant $C$.

3.1. Spectral Mirror Algorithm

We begin by presenting our algorithm for estimating the subspace $U$. Our algorithm consists of three main steps. First, as pre-processing, we estimate the mean and covariance of the underlying features $X_i$. Second, using these estimates, we identify a vector $\hat r$ that concentrates near the convex cone spanned by the profiles $(u_\ell)_{\ell\in[k]}$. We use this vector to perform an operation we call mirroring: we flip all labels lying in the negative half-space determined by $\hat r$. Finally, we compute a weighted covariance matrix $\hat Q$ over all $X_i$, where each point's contribution is weighted by its mirrored label; the eigenvectors of this matrix, appropriately transformed, yield the span $U$. These operations are summarized in Algorithm 1.

Algorithm 1 SPECTRALMIRROR
Input: pairs $(X_i, Y_i)$, $i \in [n]$
Output: subspace estimate $\hat U$
1: $\hat\mu \leftarrow \frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} X_i$
2: $\hat\Sigma \leftarrow \frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} (X_i-\hat\mu)(X_i-\hat\mu)^T$
3: $\hat r \leftarrow \frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} Y_i\,\hat\Sigma^{-1}(X_i-\hat\mu)$
4: for each $i \in \{\lfloor n/2\rfloor+1, \dots, n\}$: $Z_i \leftarrow Y_i\,\mathrm{sgn}\langle \hat r, X_i\rangle$
5: $\hat Q \leftarrow \frac{1}{\lceil n/2\rceil}\sum_{i=\lfloor n/2\rfloor+1}^{n} Z_i\,\hat\Sigma^{-1/2}(X_i-\hat\mu)(X_i-\hat\mu)^T\hat\Sigma^{-1/2}$
6: Find the eigendecomposition $\sum_{\ell=1}^{d}\lambda_\ell w_\ell w_\ell^T$ of $\hat Q$
7: Let $\lambda_{(1)}, \dots, \lambda_{(k)}$ be the $k$ eigenvalues furthest from the median
8: $\hat U \leftarrow \mathrm{span}\big(\hat\Sigma^{-1/2} w_{(1)}, \dots, \hat\Sigma^{-1/2} w_{(k)}\big)$

We discuss each of the main steps in more detail below.

Pre-processing. (Lines 1-2) We split the dataset into two halves. Using the first half (i.e., all $X_i$ with $1 \le i \le \lfloor n/2\rfloor$), we construct estimates $\hat\mu \in \mathbb{R}^d$ and $\hat\Sigma \in \mathbb{R}^{d\times d}$ of the feature mean and covariance, respectively. Standard Gaussian (i.e., "whitened") versions of the features $X_i$ can be constructed as $\hat\Sigma^{-1/2}(X_i - \hat\mu)$.

Mirroring. (Lines 3-4) We compute the vector

$$\hat r = \frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} Y_i\,\hat\Sigma^{-1}(X_i - \hat\mu) \in \mathbb{R}^d. \tag{5}$$

We refer to $\hat r$ as the mirroring direction. In Section 4, we show that $\hat r$ concentrates around its population ($n = \infty$) version $r \equiv \mathbb{E}[Y\,\Sigma^{-1}(X - \mu)]$. Crucially, $r$ lies in the interior of the convex cone spanned by the parameter profiles, i.e., $r = \sum_{\ell=1}^k \alpha_\ell u_\ell$ for some positive $\alpha_\ell > 0$, $\ell\in[k]$ (see Lemma 2 and Fig. 1(b)). Using this $\hat r$, we mirror the labels in the second part of the dataset: $Z_i = Y_i\,\mathrm{sgn}\langle \hat r, X_i\rangle$, for $\lfloor n/2\rfloor < i \le n$. In words, $Z_i$ equals $Y_i$ for all $i$ in the positive half-space defined by the mirroring direction; instead, all labels for points $i$ in the negative half-space are flipped (i.e., $Z_i = -Y_i$). This is illustrated in Figure 1(c).

Spectral Decomposition. (Lines 5-8) The mirrored labels are used to compute a weighted covariance matrix over the whitened features:

$$\hat Q = \frac{1}{\lceil n/2\rceil}\sum_{i=\lfloor n/2\rfloor+1}^{n} Z_i\,\hat\Sigma^{-1/2}(X_i - \hat\mu)(X_i - \hat\mu)^T\hat\Sigma^{-1/2}.$$

The spectrum of $\hat Q$ has a specific structure that reveals the span $U$. In particular, as we will see in Section 4, $\hat Q$ converges to a matrix $Q$ that contains an eigenvalue with multiplicity $d - k$; crucially, the eigenvectors corresponding to the remaining $k$ eigenvalues, subject to the linear transform $\hat\Sigma^{-1/2}$, span the subspace $U$. As such, the final steps of the algorithm amount to discovering the eigenvalues that "stand out" (i.e., are different from the eigenvalue with multiplicity $d - k$) and rotating the corresponding eigenvectors to obtain $\hat U$. More specifically, let $(\lambda_\ell, w_\ell)_{\ell\in[d]}$ be the eigenvalues and eigenvectors of $\hat Q$. Recall that $k < d/2$. The algorithm computes the median of all eigenvalues and identifies the $k$ eigenvalues furthest from this median; these are the "outliers". The corresponding $k$ eigenvectors, multiplied by $\hat\Sigma^{-1/2}$, yield the subspace estimate $\hat U$.

The algorithm does not require knowledge of the classifier response function $f$. Also, while we assume knowledge of $k$, an eigenvalue/eigenvector statistic (see, e.g., Zelnik-Manor & Perona (2004)) can be used to estimate $k$ as the number of outlier eigenvalues.
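To make the steps above concrete, the following is a minimal NumPy sketch of Algorithm 1. It is our own illustration rather than the authors' code; the function name spectral_mirror and the eigendecomposition-based matrix square root are our choices.

```python
import numpy as np

def spectral_mirror(X, Y, k):
    """Sketch of Algorithm 1 (SPECTRALMIRROR).

    X: (n, d) features, Y: (n,) labels in {-1, +1}, k: number of classifiers.
    Returns a (d, k) matrix whose columns span the subspace estimate U_hat.
    """
    n, d = X.shape
    m = n // 2
    X1, Y1, X2, Y2 = X[:m], Y[:m], X[m:], Y[m:]

    # Lines 1-2: mean and covariance estimates from the first half of the data.
    mu = X1.mean(axis=0)
    Sigma = (X1 - mu).T @ (X1 - mu) / m

    # Line 3: mirroring direction r_hat = Sigma^{-1} * average of Y_i (X_i - mu).
    r = np.linalg.solve(Sigma, ((X1 - mu) * Y1[:, None]).mean(axis=0))

    # Line 4: mirrored pseudo-labels on the second half.
    Z = Y2 * np.sign(X2 @ r)

    # Line 5: weighted covariance of the whitened features.
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W = (X2 - mu) @ Sigma_inv_sqrt
    Q = (W * Z[:, None]).T @ W / W.shape[0]

    # Lines 6-8: the k eigenvalues furthest from the median are the "outliers";
    # their eigenvectors, multiplied by Sigma^{-1/2}, span the estimate.
    lam, w = np.linalg.eigh(Q)
    outliers = np.argsort(np.abs(lam - np.median(lam)))[-k:]
    return Sigma_inv_sqrt @ w[:, outliers]
```

The quality of the returned span can then be assessed, for instance, with scipy.linalg.subspace_angles against the true profiles, as in the experiment sketch of Section 5.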
3.2. Main Result

Our main result states that SPECTRALMIRROR is a consistent estimator of the subspace spanned by $(u_\ell)_{\ell\in[k]}$. This is true for most $\mu \in \mathbb{R}^d$. Formally, we say that an event occurs for generic $\mu$ if, adding an arbitrarily small random perturbation to $\mu$, the event occurs with probability 1 w.r.t. this perturbation.

Theorem 1. Denote by $\hat U$ the output of SPECTRALMIRROR, and let $P_r^{\perp} \equiv I - rr^T/\|r\|^2$ be the projector orthogonal to $r$, where $r$ is given by (6). Then, for generic $\mu$, as well as for $\mu = 0$, there exists $\epsilon_0 > 0$ such that, for all $\epsilon \in [0, \epsilon_0)$,

$$\Pr\big(d_P(P_r^{\perp} U, \hat U) > \epsilon\big) \le C_1 \exp\Big(-C_2\,\frac{n\epsilon^2}{d}\Big).$$

Here $C_1$ is an absolute constant, and $C_2 > 0$ depends on $\mu$, $\Sigma$, $f$ and $(u_\ell)_{\ell\in[k]}$.

In other words, $\hat U$ provides an accurate estimate of $P_r^{\perp} U$ as soon as $n$ is significantly larger than $d$. This holds for generic $\mu$, but we also prove that it holds for the specific and important case where $\mu = 0$; in fact, it also holds for all small enough $\mu$. Note that this does not guarantee that $\hat U$ spans the direction $r \in U$; nevertheless, as shown below, the latter is accurately estimated by $\hat r$ (see Lemma 1) and can be added to the span, if necessary. Moreover, our experiments suggest this is rarely needed in practice, as $\hat U$ indeed includes the direction $r$ (see Section 5).

4. Proof of Theorem 1

Recall that we denote by $r$ the population ($n = \infty$) version of $\hat r$. Let $g(s) \equiv 2f(s) - 1$, for $s \in \mathbb{R}$, and observe that $\mathbb{E}[Y \mid X = x] = \sum_{\ell=1}^k p_\ell\, g(\langle u_\ell, x\rangle)$. Hence,

$$r = \mathbb{E}\Big[\Sigma^{-1}(X - \mu)\sum_{\ell=1}^k p_\ell\, g(\langle u_\ell, X\rangle)\Big]. \tag{6}$$

Then, the following concentration result holds:

Lemma 1. There exist an absolute constant $C > 0$ and constants $c_1, c_1', c_2'$ that depend on $\|X\|_{\psi_2}$ such that

$$\Pr(\|\hat r - r\|_2 \ge \epsilon) \le C\,\exp\Big(-\min\Big\{\frac{c_1 n\epsilon^2}{d},\; c_1' n\epsilon - c_2' d\Big\}\Big).$$

The proof of Lemma 1 relies on a large deviation inequality for sub-gaussian vectors, and is provided in (Sun et al., 2013). Crucially, $r$ lies in the interior of the convex cone spanned by the parameter profiles:

Lemma 2. $r = \sum_{\ell=1}^k \alpha_\ell u_\ell$ for some $\alpha_\ell > 0$, $\ell\in[k]$.

Proof. From (6), $r = \sum_{\ell=1}^k p_\ell\, \Sigma^{-1}\mathbb{E}[(X-\mu)\,g(\langle u_\ell, X\rangle)]$. It thus suffices to show that $\Sigma^{-1}\mathbb{E}[(X-\mu)\,g(\langle u, X\rangle)] = \alpha u$ for some $\alpha > 0$. Note that $X' \equiv \langle u, X\rangle$ is normal with mean $\mu_0 = u^T\mu$ and variance $\sigma_0^2 = u^T\Sigma u > 0$. Since $f$ is analytic and non-decreasing, so is $g$; moreover, $g' \ge 0$. This, and the fact that $g$ is non-constant, implies $\mathbb{E}[g'(X')] > 0$. On the other hand, from Stein's identity (4), $\mathbb{E}[g'(X')] = \sigma_0^{-2}\,\mathrm{Cov}(X', g(X')) < \infty$, as $g$ is bounded. Hence,

$$\Sigma^{-1}\mathbb{E}[(X-\mu)\,g(\langle u, X\rangle)] \overset{(4)}{=} \Sigma^{-1}\,\mathrm{Cov}(X, \langle u, X\rangle)\,\mathbb{E}[g'(X')] = \Sigma^{-1}\,\mathbb{E}[(X-\mu)(X-\mu)^T]\,u\;\mathbb{E}[g'(X')] = \Sigma^{-1}\Sigma u\;\mathbb{E}[g'(X')] = \mathbb{E}[g'(X')]\,u,$$

and the lemma follows.
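As a numerical sanity check of Lemma 2 (again a sketch of ours, not part of the paper), one can estimate $r$ by Monte Carlo for a logistic $f$ and verify that its coefficients in the basis of the profiles are positive. The dimensions, weights, mean and covariance below are arbitrary choices.

```python
import numpy as np

# Lemma 2 check: r = E[Sigma^{-1}(X - mu) sum_l p_l g(<u_l, X>)] should equal
# sum_l alpha_l u_l with all alpha_l > 0. Here g = 2f - 1 with f the logistic function.
rng = np.random.default_rng(1)
d, k, n = 6, 2, 500_000

U = rng.standard_normal((d, k))                  # profiles u_1, ..., u_k as columns
p = np.array([0.3, 0.7])                         # mixture weights
mu = rng.standard_normal(d)                      # a "generic" mean
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)

X = rng.multivariate_normal(mu, Sigma, size=n)
g = 2.0 / (1.0 + np.exp(-(X @ U))) - 1.0         # g(<u_l, X_i>), shape (n, k)
ey_given_x = g @ p                               # E[Y | X_i] = sum_l p_l g(<u_l, X_i>)

r = np.linalg.solve(Sigma, ((X - mu) * ey_given_x[:, None]).mean(axis=0))

# Express r in the basis of the profiles: r ~ U @ alpha, with alpha > 0 per Lemma 2.
alpha, *_ = np.linalg.lstsq(U, r, rcond=None)
print(np.round(alpha, 4))                        # both entries should be positive
```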
For $r$ and $(\alpha_\ell)_{\ell\in[k]}$ as in Lemma 2, define

$$z(x) \equiv \mathbb{E}[Y\,\mathrm{sgn}(\langle r, X\rangle) \mid X = x] = \Big(\sum_{\ell=1}^k p_\ell\, g(\langle x, u_\ell\rangle)\Big)\,\mathrm{sgn}\Big(\sum_{\ell=1}^k \alpha_\ell\,\langle x, u_\ell\rangle\Big).$$

Observe that $z(x)$ is the expectation of the mirrored label at a point $x$, presuming that the mirroring direction is exactly $r$. Let $Q \in \mathbb{R}^{d\times d}$ be the matrix

$$Q = \mathbb{E}\big[z(X)\,\Sigma^{-1/2}(X-\mu)(X-\mu)^T\Sigma^{-1/2}\big].$$

Then $\hat Q$ concentrates around $Q$, as stated below.

Lemma 3. Let $\epsilon_0 \equiv \min\{\alpha_1, \dots, \alpha_k\}\,\sigma_{\min}(U)$, where the $\alpha_\ell > 0$ are defined as per Lemma 2 and $\sigma_{\min}(U)$ is the smallest non-zero singular value of $U$. Then, for $\epsilon < \min(\epsilon_0, \|r\|/2)$,

$$\Pr(\|\hat Q - Q\|_2 > \epsilon) \le C\,\exp\{-F(\epsilon^2)\}, \qquad F(\epsilon) \equiv \min\Big\{\frac{c_1 n\epsilon^2}{d},\; c_1' n\epsilon - c_2' d^2\Big\},$$

where $C$ is an absolute constant, and $c_1, c_1', c_2'$ depend on $\mu$, $\Sigma$, and $\|r\|$.

The proof of Lemma 3 is also provided in (Sun et al., 2013). We again rely on large deviation bounds for sub-gaussian random variables; nevertheless, our proof diverges from standard arguments because $\hat r$, rather than $r$, is used as the mirroring direction. Additional care is needed to ensure that (a) when $\hat r$ is close enough to $r$, its projection onto $U$ lies in the interior of the convex cone spanned by the profiles, and (b) although $\hat r$ may have a (vanishing) component outside the convex cone, the effect this has on $\hat Q$ is negligible for $n$ large enough.

An immediate consequence of Lemma 2 is that $r$ reveals a direction in the span $U$. The following lemma states that the eigenvectors of $Q$, subject to a rotation, yield the remaining $k - 1$ directions:

Lemma 4. Matrix $Q$ has at most $k + 1$ distinct eigenvalues. One eigenvalue, termed $\lambda_0$, has multiplicity $d - k$. For generic $\mu$, as well as for $\mu = 0$, the eigenvectors $w_1, \dots, w_k$ corresponding to the remaining eigenvalues $\lambda_1, \dots, \lambda_k$ are such that $P_r^{\perp} U = \mathrm{span}(P_r^{\perp}\Sigma^{-1/2}w_1, \dots, P_r^{\perp}\Sigma^{-1/2}w_k)$, where $P_r^{\perp}$ is the projector orthogonal to $r$.

Proof. Note that

$$Q = \mathbb{E}\big[z(X)\,\Sigma^{-1/2}(X-\mu)(X-\mu)^T\Sigma^{-1/2}\big] = \mathbb{E}\big[z(\Sigma^{1/2}W+\mu)\,WW^T\big], \quad\text{where } W \sim N(0, I),$$

$$= \mathbb{E}\Big[\sum_{\ell=1}^k p_\ell\, g(\langle \Sigma^{1/2}W+\mu,\, u_\ell\rangle)\,\mathrm{sgn}(\langle \Sigma^{1/2}W+\mu,\, r\rangle)\,WW^T\Big] = \mathbb{E}\Big[\sum_{\ell=1}^k p_\ell\, g(\langle W+\tilde\mu,\, \tilde u_\ell\rangle)\,\mathrm{sgn}(\langle W+\tilde\mu,\, \tilde r\rangle)\,WW^T\Big]$$

for $\tilde u_\ell \equiv \Sigma^{1/2}u_\ell$, $\tilde r \equiv \Sigma^{1/2}r$, and $\tilde\mu \equiv \Sigma^{-1/2}\mu$. Hence $Q = \sum_{\ell=1}^k p_\ell Q_\ell$, where $Q_\ell = \mathbb{E}[g(\langle \tilde u_\ell, W+\tilde\mu\rangle)\,\mathrm{sgn}(\langle \tilde r, W+\tilde\mu\rangle)\,WW^T]$. By a rotation invariance argument, $Q_\ell$ can be written as

$$Q_\ell = a_\ell I + b_\ell(\tilde u_\ell \tilde r^T + \tilde r\,\tilde u_\ell^T) + c_\ell\,\tilde u_\ell \tilde u_\ell^T + d_\ell\,\tilde r\,\tilde r^T \tag{7}$$

for some $a_\ell, b_\ell, c_\ell, d_\ell \in \mathbb{R}$. To see this, let $Q_\ell = [\tilde q_{ij}]_{i,j\in[d]}$, and suppose first that

$$\tilde r = [\tilde r_1, \tilde r_2, 0, \dots, 0] \quad\text{and}\quad \tilde u_\ell = [\tilde u_{\ell 1}, \tilde u_{\ell 2}, 0, \dots, 0]. \tag{8}$$

Since $W$ is whitened, its coordinates are independent. Thus, under (8), $\tilde q_{ij} = 0$ for all $i \ne j$ s.t. $i, j > 2$, and $\tilde q_{ii} = a_\ell$ for $i > 2$, for some $a_\ell$. Thus $Q_\ell = a_\ell I + B$, where $B$ is symmetric and 0 everywhere except perhaps on $B_{11}, B_{12}, B_{21}, B_{22}$ (the top left block). Since the profiles $u_\ell$ are linearly independent, so are $\tilde u_\ell$ and $\tilde r$, by Lemma 2. Hence, the matrices $\tilde u_\ell \tilde r^T + \tilde r\,\tilde u_\ell^T$, $\tilde u_\ell\tilde u_\ell^T$, $\tilde r\tilde r^T$ span all such $B$, so (7) follows. Moreover, since $W$ is whitened, $Q_\ell$ is rotation invariant and thus (7) extends beyond (8); indeed, if $\tilde r' = R\tilde r$, $\tilde u_\ell' = R\tilde u_\ell$, $\tilde\mu' = R\tilde\mu$, where $R$ is a rotation matrix (i.e., $RR^T = I$), then $Q' = RQR^T$. Hence, as (8) holds for some orthonormal basis, (7) holds for all bases.

Let $\bar a \equiv \sum_{\ell=1}^k p_\ell a_\ell$. Then

$$Q = \bar a I + \Big(\sum_{\ell=1}^k p_\ell d_\ell\Big)\tilde r\tilde r^T + \tilde r\Big(\sum_{\ell=1}^k p_\ell b_\ell \tilde u_\ell\Big)^T + \Big(\sum_{\ell=1}^k p_\ell b_\ell \tilde u_\ell\Big)\tilde r^T + \sum_{\ell=1}^k p_\ell c_\ell\, \tilde u_\ell\tilde u_\ell^T.$$

Let $P_{\tilde r}^{\perp}$ be the projector orthogonal to $\tilde r$, i.e., $P_{\tilde r}^{\perp} = I - \tilde r\tilde r^T/\|\tilde r\|_2^2$. Let $\tilde v_\ell \equiv P_{\tilde r}^{\perp}\tilde u_\ell$. Lemma 2 and the linear independence of the $\tilde u_\ell$ imply that $\tilde v_\ell \ne 0$ for all $\ell\in[k]$. Define

$$\tilde R \equiv P_{\tilde r}^{\perp}(Q - \bar a I)P_{\tilde r}^{\perp} = \sum_{\ell=1}^k \gamma_\ell\, \tilde v_\ell\tilde v_\ell^T, \qquad \gamma_\ell = p_\ell c_\ell,\ \ell\in[k].$$

We will show below that for generic $\mu$, as well as for $\mu = 0$, $\gamma_\ell \ne 0$ for all $\ell\in[k]$. This implies that $\mathrm{rank}(\tilde R) = k - 1$. Indeed, $\tilde R = P_{\tilde r}^{\perp}\big(\sum_\ell \gamma_\ell\, \tilde u_\ell\tilde u_\ell^T\big)P_{\tilde r}^{\perp} = P_{\tilde r}^{\perp}\bar R\,P_{\tilde r}^{\perp}$, where $\bar R$ has rank $k$ by the linear independence of the profiles. As $P_{\tilde r}^{\perp}$ is a projector orthogonal to a 1-dimensional space, $\tilde R$ has rank at least $k - 1$. On the other hand, $\mathrm{range}(\tilde R) \subseteq \tilde U$, for $\tilde U \equiv \mathrm{span}(\tilde u_1, \dots, \tilde u_k)$, and $\tilde r^T\tilde R\,\tilde r = 0$ with $\tilde r \in \tilde U\setminus\{0\}$, so $\mathrm{rank}(\tilde R) = k - 1$. The latter also implies that $\mathrm{range}(\tilde R) = P_{\tilde r}^{\perp}\tilde U$, as $\mathrm{range}(\tilde R) \perp \tilde r$, $\mathrm{range}(\tilde R) \subseteq \tilde U$, and $\dim(\mathrm{range}(\tilde R)) = k - 1$.

The above imply that $Q$ has one eigenvalue of multiplicity $d - k$, namely $\bar a$. Moreover, the eigenvectors $w_1, \dots, w_k$ corresponding to the remaining eigenvalues (or, the non-zero eigenvalues of $Q - \bar a I$) are such that $P_{\tilde r}^{\perp}\Sigma^{1/2}U = P_{\tilde r}^{\perp}\,\mathrm{span}(w_1, \dots, w_k)$. The lemma thus follows by multiplying both sides of this equality by $P_r^{\perp}\Sigma^{-1/2}$, and using the fact that $P_r^{\perp}\Sigma^{-1/2}P_{\tilde r}^{\perp} = P_r^{\perp}\Sigma^{-1/2}$.

It remains to show that $\gamma_\ell \ne 0$ for all $\ell\in[k]$ when $\mu$ is generic or 0. Note that

$$c_\ell\,\langle\tilde u_\ell, \tilde v_\ell\rangle^2 \overset{(7)}{=} \langle\tilde v_\ell, (Q_\ell - a_\ell I)\tilde v_\ell\rangle = \mathrm{Cov}\big(g(\langle\tilde u_\ell, W+\tilde\mu\rangle)\,\mathrm{sgn}(\langle\tilde r, W+\tilde\mu\rangle);\ \langle W, \tilde v_\ell\rangle^2\big). \tag{9}$$

It thus suffices to show that $c_\ell \ne 0$. Lemma 2 implies that $\tilde u_\ell = \tilde v_\ell + \bar c\,\tilde r$ for some $\bar c > 0$; hence, up to the positive factor $\langle\tilde u_\ell, \tilde v_\ell\rangle^2$, $c_\ell = \mathrm{Cov}\big[g(X + \bar c\,Y + z_\ell(\mu))\,\mathrm{sgn}(Y + z_0(\mu));\ X^2\big]$, where $X \equiv \langle\tilde v_\ell, W\rangle$ and $Y \equiv \langle\tilde r, W\rangle$ are independent Gaussians with mean 0, and $z_\ell(\mu) \equiv \langle\tilde u_\ell, \tilde\mu\rangle$, $z_0(\mu) \equiv \langle\tilde r, \tilde\mu\rangle$. Hence, $c_\ell = \mathrm{Cov}[F(X); X^2]$, where

$$F(x) = \mathbb{E}_Y\big[g(x + \bar c\,Y + z_\ell(\mu))\,\mathrm{sgn}(Y + z_0(\mu))\big] = \int_{-z_0(\mu)}^{\infty} g(x + \bar c\,y + z_\ell(\mu))\,\phi(y)\,\mathrm{d}y - \int_{-\infty}^{-z_0(\mu)} g(x + \bar c\,y + z_\ell(\mu))\,\phi(y)\,\mathrm{d}y,$$

with $\phi$ the p.d.f. of $Y$. Assume first that $\mu = 0$. By (3), $g$ is anti-symmetric, i.e., $g(-x) = -g(x)$. Thus,

$$F(-x) = \mathbb{E}_Y[g(-x + \bar c\,Y)\,\mathrm{sgn}(Y)] \overset{Y\to -Y}{=} \mathbb{E}_Y[g(-x - \bar c\,Y)\,\mathrm{sgn}(-Y)] = F(x),$$

i.e., $F$ is symmetric.
Further,

$$F'(x) = \mathbb{E}_Y[g'(x + \bar c\,Y)\,\mathrm{sgn}(Y)] = \int_0^{\infty}\big(g'(x + \bar c\,y) - g'(x - \bar c\,y)\big)\,\phi(y)\,\mathrm{d}y.$$

The strict concavity of $g$ in $[0, \infty)$ implies that $g'$ is decreasing in $[0, +\infty)$, and the anti-symmetry of $g$ implies that $g'$ is symmetric. Take $x > 0$: if $x > \bar c y \ge 0$, then $g'(x + \bar c y) < g'(x - \bar c y)$, while if $x \le \bar c y$, then $g'(x - \bar c y) = g'(\bar c y - x) > g'(\bar c y + x)$; so $F'(x)$ is negative for $x > 0$. By the symmetry of $F$, $F'(x)$ is positive for $x < 0$. As such, $F(x) = G(x^2)$ for some strictly decreasing $G$, and $c_\ell = \mathrm{Cov}(G(Z); Z)$ for $Z = X^2$; hence $c_\ell < 0$, for all $\ell\in[k]$.

To see that $c_\ell \ne 0$ for generic $\mu$, recall that $f$ is analytic and hence so is $g$. Hence, $c_\ell$ is an analytic function of $\mu$, for every $\ell\in[k]$; also, as $c_\ell(\mu) < 0$ for $\mu = 0$, it is not identically 0. Hence, the sets $\{\mu \in \mathbb{R}^d : c_\ell(\mu) = 0\}$, $\ell\in[k]$, have Lebesgue measure 0 (see, e.g., pg. 83 in (Krantz & Parks, 2002)), and so does their union $\mathcal{Z}$. As such, $c_\ell \ne 0$ for generic $\mu$; if not, there would exist a ball $B \subset \mathbb{R}^d$ such that $B \cap \mathcal{Z}$ has positive Lebesgue measure, a contradiction.

[Figure 2. Convergence of $\hat U$ to $U$: (a) $\sin(d_P)$ vs. $n$; (b) $\sin(d_P)$ vs. $n/d$, for $d = 10, 20, 30$.]

[Figure 3. Predicting the expected label given the features using K-NN (RMSE), for (a) $K = \sqrt{n}$ and (b) $K = \log(n)$, with $d = 10, 20, 30$. Dotted lines are for K-NN after projecting the features $X_i$ onto $\hat U$.]

Denote by $\lambda_0$ the eigenvalue of multiplicity $d - k$ in Lemma 4. Let $\Delta \equiv \min_{\ell\in[k]}|\lambda_0 - \lambda_\ell|$ be the gap between $\lambda_0$ and the remaining eigenvalues. Then, the following lemma holds; this, along with Lemma 4, yields Theorem 1.

Lemma 5. Let $\hat U$ be our estimate for $U$. If $\lambda_1, \dots, \lambda_k$ are separated from $\lambda_0$ by at least $\Delta$, then for $\epsilon \le \min(\epsilon_0/\Delta, \tfrac14)$ we have

$$\Pr\big(d_P(U, \hat U) > \epsilon\big) \le C\,\exp\big(-F(\Delta\epsilon)\big),$$

where $\epsilon_0$, $F$ are defined as per Lemma 3.

Proof. If we ensure $\|\hat Q - Q\| \le \Delta/4$, then, by Weyl's theorem (Horn & Johnson, 2012), $d - k$ eigenvalues of $\hat Q$ are contained in $[\lambda_0 - \Delta/4, \lambda_0 + \Delta/4]$, the remaining eigenvalues are outside this set, and they will be detected by SPECTRALMIRROR. Moreover, by the Davis-Kahan $\sin\theta$ theorem,

$$d_P\big(\mathrm{range}(Q), \mathrm{range}(\hat Q)\big) \le \frac{\|\hat Q - Q\|_2}{\Delta - \|\hat Q - Q\|_2} = \frac{1}{\Delta/\|\hat Q - Q\|_2 - 1}.$$

Thus the event $d_P(U, \hat U) \le \epsilon$ is implied by $\|\hat Q - Q\|_2 \le \frac{\epsilon}{1+\epsilon}\Delta \le \epsilon\Delta$. Moreover, this implies that a sufficient condition for $\|\hat Q - Q\|_2 \le \Delta/4$ (which is required for SPECTRALMIRROR to detect the correct eigenvalues) is that $\epsilon \le \tfrac14$. The lemma thus follows from Lemma 3.

Note that the Gaussianity of $X$ is crucially used in the fact that the whitened features $W$ are uncorrelated, which in turn yields Eq. (7). We believe that the theorem can be extended to more general distributions, provided that the transform $\Sigma^{-1/2}$ de-correlates the coordinates of $X$.

5. Experiments

We conduct computational experiments to validate the performance of SPECTRALMIRROR on subspace estimation, prediction, and clustering. We generate synthetic data using $k = 2$, with profiles $u_\ell \sim N(0, I)$, $\ell = 1, 2$, and mixture weights $p_\ell$ sampled uniformly at random from the $k$-dimensional simplex. Features are also Gaussian: $X_i \sim N(0, I)$, $i = 1, \dots, n$; labels generated by the $\ell$-th classifier are given by $y_i = \mathrm{sgn}(u_\ell^T X_i)$, $i = 1, \dots, n$.

Convergence. We study first how well SPECTRALMIRROR estimates the span $U$. Figure 2(a) shows the convergence of $\hat U$ to $U$ in terms of (the sine of) the largest principal angle between the subspaces versus the sample size $n$. We also plot the convergence versus the effective sample size $n/d$ (Figure 2(b)). The curves for different values of $d$ align in Figure 2(b), indicating that the upper bound in Thm. 1 correctly predicts the sample complexity as $n = \Theta(d)$. Though not guaranteed by Theorem 1, in all experiments $r$ was indeed spanned by $\hat U$, so the addition of $\hat r$ to $\hat U$ was not necessary.
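The synthetic setup above is straightforward to reproduce. The following sketch is ours (it assumes the hypothetical spectral_mirror function from the Section 3 snippet is available); it generates data for $k = 2$ and reports the sine of the largest principal angle between $\hat U$ and $U$:

```python
import numpy as np
from scipy.linalg import subspace_angles

def make_data(n, d, k=2, seed=2):
    """Synthetic data as in Section 5: Gaussian profiles and features, and
    labels y_i = sgn(<u_l, X_i>) with the classifier l drawn from weights p."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((d, k))              # profiles u_1, ..., u_k as columns
    p = rng.dirichlet(np.ones(k))                # weights, uniform on the simplex
    X = rng.standard_normal((n, d))              # features X_i ~ N(0, I)
    comp = rng.choice(k, size=n, p=p)            # latent classifier for each point
    Y = np.sign(np.sum(X * U[:, comp].T, axis=1))
    return X, Y, U, p

for d in (10, 20, 30):
    for n in (1000, 2000, 5000):
        X, Y, U, _ = make_data(n, d)
        U_hat = spectral_mirror(X, Y, k=2)       # sketch from Section 3
        angle = subspace_angles(U, U_hat).max()  # largest principal angle
        print(f"d={d:2d}  n={n:4d}  sin(d_P)={np.sin(angle):.3f}")
```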
Prediction through K-NN. Next, we use the estimated subspace to aid in the prediction of expected labels. Given a new feature vector $X$, we use the average label of its $K$ nearest neighbors (K-NN) in the training set to predict its expected label. We do this in two settings: once over the raw data (the ambient space), and once over data for which the features $X$ are first projected onto $\hat U$, the estimated span (of dimension 2). For each $n$, we repeat this procedure 25 times with $K = \sqrt{n}$ and $K = \log n$. We record the average root mean squared error between predicted and expected labels over the 25 runs. Figures 3(a) and 3(b) show that, despite the error in $\hat U$, using K-NN on this subspace outperforms K-NN on the ambient space.

Prediction and Clustering through EM. We next study the performance of prediction and clustering using the Expectation-Maximization (EM) algorithm. We use EM to fit the individual profiles both over the training set, as well as on the dataset projected onto the estimated subspace $\hat U$. We conducted two experiments in this setting: (a) initialize EM close to the true profiles $u_\ell$, $\ell\in[k]$, and (b) randomly initialize EM and choose the best set of profiles from 30 runs. For each $n$ we run EM 10 times.

In the first set of prediction experiments, we again compare expected labels to the predicted labels, using for the latter the profiles $u_\ell$ and mixture probabilities $p_\ell$ as estimated by EM.

[Figure 4. (a) Predicting the label given the features and the classifier using EM (normalized 0-1 loss), from a starting point close to the ground truth; dotted lines are for EM after projecting the features onto the estimated subspace. (b) Predicting the label given the features and the classifier using EM (normalized 0-1 loss), from a random starting point. (c) Predicting the classifier given the features and the label.]

Figure 4(a) measures the statistical efficiency of EM over the estimated subspace versus EM over the ambient space, when EM is initialized close to the true profiles. The second set of experiments, illustrated in Figure 4(b), aims to capture the additional improvement due to the reduction in the number of local minima in the reduced space. In both cases we see that constraining the estimated profiles to lie in the estimated subspace improves the statistical efficiency of EM; in the more realistic random-start experiments, enforcing the subspace constraint also improves the performance of EM by reducing the number of local minima. We also observe an overall improvement compared to prediction through K-NN.

Finally, we use the fitted profiles $u_\ell$ to identify the classifier generating a label, given the features and the label. To do this, once the profiles $u_\ell$ have been detected by EM, we use a logistic-model margin condition to identify the classifier that generated a label, given the label and its features. Figure 4(c) shows the result for EM initialized at a random point, after choosing the best set of profiles out of 30 runs. We evaluate the performance of this clustering procedure using the normalized 0-1 loss. Again, constraining the estimated profiles to the estimated subspace significantly improves the performance of this clustering task.
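Returning to the K-NN comparison of Figure 3, a minimal version of that experiment might look as follows. This is our sketch, reusing the hypothetical make_data and spectral_mirror helpers from the earlier snippets:

```python
import numpy as np

def knn_predict(X_train, Y_train, X_query, K):
    """K-NN estimate of E[Y | X]: average label of the K nearest training points."""
    preds = np.empty(len(X_query))
    for j, x in enumerate(X_query):
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:K]
        preds[j] = Y_train[nearest].mean()
    return preds

d, n, n_test, k = 20, 2000, 200, 2
X, Y, U, p = make_data(n + n_test, d, k)             # sketch from Section 5
X_tr, Y_tr, X_te = X[:n], Y[:n], X[n:]
ey_te = np.sign(X_te @ U) @ p                        # expected labels E[Y | X]

U_hat = spectral_mirror(X_tr, Y_tr, k)               # sketch from Section 3
P, _ = np.linalg.qr(U_hat)                           # orthonormal basis of U_hat
K = int(np.sqrt(n))

rmse_ambient = np.sqrt(np.mean((knn_predict(X_tr, Y_tr, X_te, K) - ey_te) ** 2))
rmse_projected = np.sqrt(np.mean(
    (knn_predict(X_tr @ P, Y_tr, X_te @ P, K) - ey_te) ** 2))
print(rmse_ambient, rmse_projected)                  # projection typically lowers the error
```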
6. Conclusions

We have proposed SPECTRALMIRROR, a method for discovering the span of a mixture of linear classifiers. Our method relies on a non-linear transform of the labels, which we refer to as "mirroring". Moreover, we have provided consistency guarantees and non-asymptotic bounds, which also imply the near-optimal statistical efficiency of the method. Finally, we have shown that, despite the fact that SPECTRALMIRROR discovers the span only approximately, this is sufficient to allow for a significant improvement in both prediction and clustering, when the features are projected onto the estimated span.

We have already discussed several technical issues that remain open, and that we believe are amenable to further analysis. These include relaxing the Gaussianity assumption, and applying our bounds to other pHd-inspired methods. An additional research topic is to further improve the computational complexity of the estimation of the eigenvectors of the mirrored matrix $\hat Q$. This is of greatest interest in cases where the covariance $\Sigma$ and mean $\mu$ are a priori known. This would be the case when, e.g., the method is applied repeatedly and, although the features $X$ are sampled from the same distribution each time, the labels $Y$ are generated from a different mixture of classifiers. In this case, SPECTRALMIRROR can skip the pre-processing step, which requires estimating $\Sigma$ and is thus computationally intensive; the remaining operations amount to discovering the spectrum of $\hat Q$, an operation that can be performed more efficiently. For example, we can use a regularized M-estimator to exploit the fact that $\Sigma^{1/2}\hat Q\,\Sigma^{1/2}$ should be the sum of a multiple of the identity and a low-rank matrix; see, e.g., Negahban et al. (2012).

References

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models, 2012. arXiv preprint arXiv:1210.7559.

Arora, S. and Kannan, R. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pp. 247-257. ACM, 2001.

Bartholomew, D. J., Knott, M., and Moustaki, I. Latent Variable Models and Factor Analysis: A Unified Approach, volume 899. Wiley & Sons, 2011.

Bishop, C. M. Latent variable models. In Learning in Graphical Models, pp. 371-403. Springer, 1998.

Chaganty, A. T. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.

Cook, R. D. Principal Hessian directions revisited. Journal of the American Statistical Association, 93(441):84-94, 1998.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 2012.

Hsu, D., Kakade, S. M., and Zhang, T. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460-1480, 2012.

Jordan, M. I. and Jacobs, R. A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.

Krantz, S. G. and Parks, H. R. A Primer of Real Analytic Functions. Springer, 2002.
Li, K.-C. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025-1039, 1992.

Liang, P., Bouchard-Côté, A., Klein, D., and Taskar, B. An end-to-end discriminative approach to machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pp. 761-768. Association for Computational Linguistics, 2006.

Liu, J. S. Siegel's formula via Stein's identities. Statistics & Probability Letters, 21(3):247-251, 1994.

McLachlan, G. and Peel, D. Finite Mixture Models. Wiley & Sons, 2004.

Moitra, A. and Valiant, G. Settling the polynomial learnability of mixtures of Gaussians. In 51st Annual IEEE Symposium on Foundations of Computer Science, pp. 93-102. IEEE, 2010.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.

Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71-110, 1894.

Quattoni, A., Collins, M., and Darrell, T. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems, pp. 1097-1104, 2004.

Stein, C. M. Estimation of the mean of a multivariate normal distribution. In Prague Symposium on Asymptotic Statistics, 1973.

Sun, Y., Ioannidis, S., and Montanari, A. Learning mixtures of linear classifiers, 2013. arXiv:1311.2547.

Zelnik-Manor, L. and Perona, P. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems, 2004.