# Alternating Minimization for Mixed Linear Regression

Xinyang Yi (YIXY@UTEXAS.EDU), Constantine Caramanis (CONSTANTINE@MAIL.UTEXAS.EDU), Sujay Sanghavi (SANGHAVI@MAIL.UTEXAS.EDU)
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, 78712

Abstract

Mixed linear regression involves the recovery of two (or more) unknown vectors from unlabeled linear measurements; that is, each sample comes from exactly one of the vectors, but we do not know which one. It is a classic problem, and the natural and empirically most popular approach to its solution has been the EM algorithm. As in other settings, EM is prone to bad local minima; however, each iteration is very fast (alternating between guessing labels and solving with those labels). In this paper we provide a new initialization procedure for EM, based on finding the leading two eigenvectors of an appropriate matrix. We then show that with this initialization, a re-sampled version of the EM algorithm provably converges to the correct vectors, under natural assumptions on the sampling distribution, and with nearly optimal (unimprovable) sample complexity. This provides not only the first characterization of EM's performance, but also much lower sample complexity as compared to both standard (randomly initialized) EM and other methods for this problem.

1. Introduction

In this paper we consider the mixed linear regression problem: we would like to recover vectors from linear observations of each, except that these observations are unlabeled. In particular, consider, for i = 1, ..., N,

y_i = ⟨x_i, β_1^*⟩ z_i + ⟨x_i, β_2^*⟩ (1 − z_i) + w_i,

where each z_i is either 1 or 0, and w_i is noise independent of everything else. A value z_i = 1 means the ith measurement comes from β_1^*, and z_i = 0 means it comes from β_2^*. Our objective is to infer β_1^*, β_2^* ∈ R^k given (y_i, x_i), i = 1, ..., N; in particular, we do not have access to the labels z_i. For now¹, we do not make a priori assumptions on the β^*'s; thus we are necessarily in the regime where the number of samples, N, exceeds the dimensionality, k (N > k). We show in Section 4 that this problem is NP-hard in the absence of any further assumptions. We therefore focus on the case where the measurement vectors x_i are independent, uniform (isotropic) Gaussian vectors in R^k. While our algorithm works in the noisy case, our performance guarantees currently apply only to the setting of no noise, i.e., w_i = 0.

Mixed linear regression naturally arises in any application where measurements come from multiple latent classes and we are interested in parameter estimation. See (Deb & Holmes, 2000) for an application of mixed linear regression in health care and (Grün et al., 2007) for some related datasets.

The natural, and empirically most popular, approach to solving this problem (as with other problems with missing information) is the Expectation-Maximization, or EM, algorithm; see, e.g., (Viele & Tong, 2002). In our context, EM involves iteratively alternating between updating estimates for β_1, β_2, and estimates for the labels; typically, unless there is specific side information, the initialization is random. Each step can be solved in closed form, and hence is very computationally efficient.
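To make the measurement model concrete, the following NumPy sketch draws samples from the model above under the Gaussian design used throughout the paper; the function name `generate_mixed_lr` and its signature are choices made here for illustration and are not part of the paper.

```python
import numpy as np

def generate_mixed_lr(N, k, beta1, beta2, p1=0.5, noise_std=0.0, rng=None):
    """Draw N samples from the mixed linear regression model:
    y_i = <x_i, beta1> z_i + <x_i, beta2> (1 - z_i) + w_i,
    with x_i ~ N(0, I_k) and hidden labels z_i ~ Bernoulli(p1)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((N, k))          # measurement vectors x_i
    z = rng.random(N) < p1                   # hidden labels (never shown to the solver)
    y = np.where(z, X @ beta1, X @ beta2)    # noiseless responses
    y = y + noise_std * rng.standard_normal(N)
    return X, y, z
```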
However, as widely acknowledged, there has to date been no way to analytically pre-determine the performance of EM; as in other contexts, it is prone to getting trapped in local minima (Wu, 1983).

¹As we discuss in more detail below, some work has been done on the sparse version of the problem, though the work we are aware of does not give an efficient algorithm with performance guarantees on ‖β̂_i − β_i^*‖, i = 1, 2.

Contribution of our paper: We provide the first analytical guarantees on the performance of the EM algorithm for mixed linear regression. A key contribution of our work, both algorithmically and for the analysis, is the initialization step. In particular, we develop an initialization scheme and show that, with it, EM converges at least exponentially fast to the correct β^*'s and finally recovers the ground truth exactly, with O(k log² k) samples for a problem of dimension k. This sample complexity is optimal, up to logarithmic factors, in the dimension and in the error parameter. We are currently investigating the proposed algorithm in the noisy case; in this paper we present only the noiseless result.

1.1. Related Work

There is of course a huge amount of work on both latent variable modeling and finite mixture models; here we do not attempt to cover this broad spectrum, but instead focus on the most relevant work, pertaining directly to mixed linear regression. The work in (Viele & Tong, 2002) describes the application of the EM algorithm to the mixed linear regression problem, both with Bayesian priors on the frequencies of each mixture component, and in the non-parametric setting (i.e., where one does not a priori know the relative fractions from each β). More recently, in the high-dimensional case where N < k but the β's to be recovered are sparse, the work in (Städler et al., 2010) proposes modifying vanilla EM for this problem by adding a Lasso penalty to the β update step. For this method, and given sufficiently many samples, they show that there exists a local minimizer which selects the correct support. This can be viewed as an interesting extension to the sparse case of the known fact about EM that it has good local minima; however, there are no guarantees that any (or even several) runs of this modified EM will actually find this good local minimum.

In recent years, an interesting line of work (e.g., (Hsu & Kakade, 2012), (Anandkumar et al., 2012)) has shown the possibility of resolving latent variable models by considering spectral properties of appropriate third-order tensors. Very recent work (Chaganty & Liang, 2013) applies this approach to mixed linear regression. Their method suffers from high sample complexity; in the setting of our problem, their theoretical analysis requires N > O(k^6). Additionally, this method has much higher computational complexity than the methods in our paper (both EM and the initialization), since it needs to work with third-order tensors.

A quite similar problem that has attracted extensive attention is subspace clustering, where the goal is to learn an unknown number of linear subspaces of varying dimensions from sample points. Putting our problem in this setting, each sample (y, x) is a vector in R^{k+1}; the points from β_1 correspond to one k-dimensional subspace, and those from β_2 to another k-dimensional subspace.
Note that this makes for a very hard instance of subspace clustering, as not only are the dimensions of each subspace very high (only one less than the ambient dimension), but the projections of the points onto the first k coordinates are exactly the same. Even without the latter restriction, typical methods, e.g., (Vidal et al., 2003), (Elhamifar & Vidal, 2012), require N ≥ O(k²) to have a unique solution.

1.2. Notation

For a matrix X, we use σ_i(X) to denote its ith singular value. We denote the spectral, or operator, norm by ‖X‖ := max_i σ_i(X). For any vector x and scalar p, ‖x‖_p is the usual ℓ_p norm. For two vectors x, y we use ⟨x, y⟩ to denote their inner product and x ⊗ y to denote their outer product. x^⊤ is the transpose of x. We define T(x, y) to be the subspace spanned by x and y, and P_{T(x,y)} is the orthogonal projection onto T(x, y). We use N to denote the number of samples and k to denote the dimension of the unknown parameters.

2. Algorithm

In this section we describe the classical EM algorithm as applied to our problem of mixed linear regression, and our new initialization procedure. Since our analytical results are currently only for the noiseless case, we focus here on EM for this setting, even though EM and our initialization procedure easily apply to the general setting.

The iterations of EM alternate between (a) given the current β_1, β_2, partitioning the samples into J_1 (those more likely to have come from β_1) and J_2 (respectively, from β_2), and then (b) updating each of β_1, β_2 given the new sample sets J_1, J_2 corresponding to each, respectively. Both parts of the iteration are extremely efficient and can be scaled easily to large problem sizes. In the typical application, in the absence of any extraneous side information, the initial β^(0)'s are chosen at random.

Algorithm 1 EM (noiseless case)
input: Initial β_1^(0), β_2^(0), number of iterations t_0, samples {(y_i, x_i), i = 1, 2, ..., N}
1: for t = 0, ..., t_0 − 1 do
2:   {EM Part I: Guess the labels}
3:   J_1, J_2 ← ∅
4:   for i = 1, 2, ..., N do
5:     if |y_i − ⟨x_i, β_1^(t)⟩| < |y_i − ⟨x_i, β_2^(t)⟩| then
6:       J_1 ← J_1 ∪ {i}
7:     else
8:       J_2 ← J_2 ∪ {i}
9:     end if
10:  end for
11:  {EM Part II: Solve least squares}
12:  β_1^(t+1) ← argmin_{β ∈ R^k} ‖y_{J_1} − X_{J_1} β‖_2
13:  β_2^(t+1) ← argmin_{β ∈ R^k} ‖y_{J_2} − X_{J_2} β‖_2
14: end for
output: β_1^(t_0), β_2^(t_0)

It is not hard to see that each iteration of the above procedure results in a decrease in the loss function

L(β_1, β_2) = Σ_i min_{z_i ∈ {0,1}} ( y_i − ⟨x_i, z_i β_1 + (1 − z_i) β_2⟩ )².   (1)

Note that L, being built from minima of convex functions, is neither convex nor concave; hence, while EM is guaranteed to converge, all that can be said a priori is that it will reach a local minimum. Indeed, our hardness result in Section 4 confirms that for general x_i this must be the case. Yet even for the Gaussian case we consider, this has essentially been the state of analytical understanding of EM for this problem to date; in particular, there are no global guarantees on convergence to the true solutions, under any assumptions, as far as we are aware.

The main algorithmic innovation of our paper is a more principled initialization procedure. In practice, this allows for faster convergence, and with fewer samples, to the true β_1^*, β_2^*. Additionally, it allows us to establish global guarantees for EM when it is started from this initialization. We now describe this initialization.
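Before turning to the initialization, here is a minimal NumPy sketch of the EM loop in Algorithm 1; the function name `em_noiseless` and the use of `np.linalg.lstsq` are implementation choices made here for illustration, not prescribed by the paper.

```python
import numpy as np

def em_noiseless(X, y, beta1, beta2, t0=20):
    """One possible implementation of Algorithm 1: alternate between guessing
    labels by nearest residual and re-solving least squares on each group."""
    for _ in range(t0):
        # EM Part I: guess the labels by comparing residuals.
        r1 = np.abs(y - X @ beta1)
        r2 = np.abs(y - X @ beta2)
        J1 = r1 < r2
        J2 = ~J1
        # EM Part II: least squares on each group.
        # (A production implementation would guard against an empty group.)
        beta1, *_ = np.linalg.lstsq(X[J1], y[J1], rcond=None)
        beta2, *_ = np.linalg.lstsq(X[J2], y[J2], rcond=None)
    return beta1, beta2
```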
2.1. Initialization

Our initialization procedure is based on estimating the top rank-2 subspace of the following response-weighted covariance matrix:

M = (1/N) Σ_{i=1}^{N} y_i² x_i ⊗ x_i,

where ⊗ represents the outer product of two vectors. As we will show later, this second-moment construction M is an unbiased estimator of a matrix whose top two eigenvectors span the same space as the true β_1^*, β_2^*. We now present the idea, and then formally describe the procedure.

Idea: The expected value of M is given by

E[M] = p_1 A_1 + p_2 A_2,

where p_1, p_2 are the fractions of observations of β_1^*, β_2^* respectively, and the matrices A_i, i = 1, 2, are given by

A_i := E[ ⟨x, β_i^*⟩² x ⊗ x ],

where the expectation is over the random vector x, which in our setting is standard normal. It is not hard to see that this matrix evaluates to

A_i = ‖β_i^*‖_2² I + 2 (β_i^* ⊗ β_i^*),

where I is the identity matrix. One can expect that with sufficiently many samples N, the top-2 eigenspace of M will be a decent approximation of the space spanned by β_1^*, β_2^*. However, in general β_1^*, β_2^* are not identifiable from the top-2 eigenvectors of M, or even of E[M]. Note that even for the expected matrix E[M], when p_1 = p_2 and ‖β_1^*‖_2 = ‖β_2^*‖_2, the top two eigenvectors will not be β_1^*, β_2^*. We thus need to run a simple one-dimensional grid search on the unit circle in this space to find good approximations to the individual vectors β_1^*, β_2^*, as opposed to just the space spanned by them. Our algorithm uses the empirical loss of every candidate pair produced by the grid search in order to select a good initial starting point. The details of the above idea are given below, along with the formal description of our procedure, in Algorithm 2.

Algorithm 2 Initialization
input: Grid resolution δ, samples {(y_i, x_i), i = 1, 2, ..., N}
1: M ← (1/N) Σ_{i=1}^{N} y_i² x_i ⊗ x_i
2: Compute top 2 eigenvectors v_1, v_2 of M
3: {Make the grid points} G ← {u : u = v_1 cos(δt) + v_2 sin(δt), t = 0, 1, ..., ⌈2π/δ⌉}
4: {Pick the pair that has the lowest loss} (β_1^(0), β_2^(0)) ← argmin_{u_1, u_2 ∈ G} L(u_1, u_2)
output: β_1^(0), β_2^(0)

Choice of grid resolution δ. In Section 4, we show that it is sufficient to choose

δ < c ‖β_1^* − β_2^*‖_2 (min{p_1, p_2})^{3/2}

for some universal constant c. Even though we have no knowledge of the ground truth, a successful choice of δ only relies on a conservative estimate of ‖β_1^* − β_2^*‖_2 and min{p_1, p_2}. Note that this upper bound does not scale with the problem size: the number of candidate pairs is independent of (k, N).

Search-avoidance method using prior knowledge of the proportions. When p_1, p_2 are known, approximations of β_1^*, β_2^* can be computed from the top two eigenvectors of M in closed form. Suppose (v_b^*, λ_b^*), b = 1, 2, are the eigenvectors and eigenvalues of E[(M − I)/2]. We define

sign(b) := { 1, b = 1;  −1, b = 2 }.

It is easy to check that when λ_1^* ≠ λ_2^* (we use b̄ to denote {1, 2} \ b),

β_b^* = √((1 − γ_b)/2) v_b̄^* + sign(b) √((1 + γ_b)/2) v_b^*,   b = 1, 2,   (2)

γ_b = ( (λ_b^* − λ_b̄^*)² + p_b² − p_b̄² ) / ( 2 (λ_b^* − λ_b̄^*) p_b ),   b = 1, 2.

Duplicate eigenvalues. λ_1^* = λ_2^* if and only if p_1 = p_2 and ⟨β_1^*, β_2^*⟩ = 0. In this case {β_1^*, β_2^*} are not identifiable from the spectral structure of E[M], because any linear combination of {β_1^*, β_2^*} is an eigenvector of E[M]. We fall back to Algorithm 2 in this case. Based on the above analysis, we propose an alternative initialization method that uses the proportion information when the eigenvalues are distinct, in Algorithm 3.
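A sketch of the grid-search initialization (Algorithm 2) in NumPy follows; `init_grid` and the helper `mixed_loss` are names introduced here, and the exhaustive search over candidate pairs is one straightforward way to implement the last step.

```python
import numpy as np

def mixed_loss(X, y, b1, b2):
    """Empirical loss (1): each sample is charged to its better-fitting candidate."""
    r1 = (y - X @ b1) ** 2
    r2 = (y - X @ b2) ** 2
    return np.minimum(r1, r2).sum()

def init_grid(X, y, delta=0.3):
    """Sketch of Algorithm 2: top-2 eigenspace of M = (1/N) sum_i y_i^2 x_i x_i^T,
    followed by a 1-D grid search on the unit circle in that plane."""
    N = X.shape[0]
    M = (X * (y ** 2)[:, None]).T @ X / N        # (1/N) sum_i y_i^2 x_i x_i^T
    _, eigvec = np.linalg.eigh(M)                # eigenvalues in ascending order
    v1, v2 = eigvec[:, -1], eigvec[:, -2]        # top two eigenvectors
    grid = [np.cos(t) * v1 + np.sin(t) * v2
            for t in np.arange(0.0, 2 * np.pi, delta)]
    # Evaluate every candidate pair and keep the one with the lowest loss.
    best = min(((mixed_loss(X, y, u1, u2), i, j)
                for i, u1 in enumerate(grid)
                for j, u2 in enumerate(grid) if i <= j),
               key=lambda triple: triple[0])
    return grid[best[1]], grid[best[2]]
```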
Algorithm 3 Init with proportion information
input: p_1, p_2, samples {(y_i, x_i), i = 1, 2, ..., N}
1: M ← (1/N) Σ_{i=1}^{N} y_i² x_i ⊗ x_i
2: Compute top 2 eigenvectors and eigenvalues (v_b, λ_b), b = 1, 2, of (M − I)/2
3: Compute β_1^(0), β_2^(0) via equation (2) (using the empirical version, i.e., with the superscript * removed)
output: β_1^(0), β_2^(0)

In Section 3, we demonstrate empirically the importance of this initialization technique; we show that randomly initialized EM converges remarkably more slowly than EM initialized by Algorithm 2. Our theoretical results, presented in Section 4, confirm this observation analytically.

3. Numerical Results

In this section, we present the empirical performance of our algorithm on synthetic data. The results highlight two important features. First, the simulations corroborate our theoretical results in Section 4, which show that our algorithm is nearly optimal (unimprovable) in terms of sample complexity. Indeed, we show here that EM+SVD succeeds when given about as many samples as dimensions (in the absence of additional structure, e.g., sparsity, it is not possible to do better). Second, our results show that the SVD initialization appears to be critical: without it, EM's performance is significantly degraded.

Experiment settings. Each input vector x_i is generated independently from the standard Gaussian distribution with mean 0 and covariance matrix I. We then choose the mixture label of each sample with equal probability, i.e., we set p_1 = p_2 = 0.5. Also, in each trial, we generate β_1^* and β_2^* randomly but keep ‖β_1^* − β_2^*‖_2 = 1.73; this constant 1.73 is chosen arbitrarily. In this case, β_1^* and β_2^* are non-orthogonal, and it is impossible to recover them from the SVD step alone due to the rotational ambiguity. We run Algorithm 2 with a fairly coarse grid, δ = 0.3. We also test Algorithm 3 using p_1 = p_2. We use the following metric, which measures global optimality:

err(t) := max{ ‖β_1^(t) − β_1^*‖_2, ‖β_2^(t) − β_2^*‖_2 }.   (3)

Here t is the iteration index.

Sample complexity. In Figure 1 we empirically investigate how the number of samples N needed for exact recovery scales with the dimension k. Each point in Figure 1 represents 1000 trials, and the corresponding value of N is the number of samples at which the success rate exceeded 0.99. We use Algorithm 2 for initialization. In Figure 2, we show the phase-transition curves for a few (N, k) pairs.

Figure 1. Number of samples N needed for a success rate greater than 0.99 using SVD+EM, as a function of the dimension k. The dotted line is the least-squares fit of the experimental data.

Effect of initialization. We compare our eigenvector-based initialization + EM with the usual randomly initialized EM. For N = 300 samples and k = 10 dimensions, Figure 3 shows how the error err converges as a function of the number of iterations. Each curve is averaged over 200 trials. We observe that the final error of SVD+EM is about 10^{−35}; this residual level is an artifact of floating-point computation. For each trial, the blue and green curves show that exact recovery occurred after 7 iterations. This is possible since we are in the noiseless case. As can be clearly seen, initialization has a profound effect on the performance of EM in our setting; it allows for exact recovery with high probability in a small number of iterations, while random initialization does not.

Figure 2. Exact recovery probability vs. the normalized number of samples N/k, for k = 10, 50, 100.
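As a rough illustration of these experiment settings (not the authors' code), the following run strings together the hypothetical helper sketches introduced earlier — `generate_mixed_lr`, `init_grid`, and `em_noiseless` — and draws unit-norm β^*'s at random rather than reproducing the exact ‖β_1^* − β_2^*‖_2 = 1.73 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 10, 300
beta1_star = rng.standard_normal(k); beta1_star /= np.linalg.norm(beta1_star)
beta2_star = rng.standard_normal(k); beta2_star /= np.linalg.norm(beta2_star)

X, y, _ = generate_mixed_lr(N, k, beta1_star, beta2_star, p1=0.5, rng=rng)
b1, b2 = init_grid(X, y, delta=0.3)            # SVD + grid-search initialization
b1, b2 = em_noiseless(X, y, b1, b2, t0=10)     # EM refinement

# Error metric (3); the outer min resolves the label-permutation ambiguity
# between the two recovered vectors.
err = min(max(np.linalg.norm(b1 - beta1_star), np.linalg.norm(b2 - beta2_star)),
          max(np.linalg.norm(b1 - beta2_star), np.linalg.norm(b2 - beta1_star)))
print(f"final err = {err:.2e}")
```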
4. Main Results

In this section, we present the main results of our paper: provable statistical guarantees for EM, initialized with our Algorithm 2, for solving the mixed linear regression problem. We first show that for general {x_i}, the problem is NP-hard, even without noise. Then, we focus on the setting where each measurement vector x_i is i.i.d. and sampled from the standard normal distribution N(0, I). We also assume that the true vectors β_1^*, β_2^* are equal in magnitude, which, without loss of generality, we take to be 1. Intuitively, equal magnitudes represent a hard case, as in this setting the y_i's from the two β^*'s are statistically identical.²

Our proof can be broken into two key results. We first show that using O(k log² k) samples, with high probability our initialization procedure returns β_1^(0), β_2^(0) which are within a constant distance of the true β_1^*, β_2^*. We note that for our scaling guarantees to hold, this constant need only be independent of the dimension; in particular, it need not depend on the final desired precision. Results with a 1/error or even 1/error² dependence, as would be required in order for the SVD step alone to obtain an approximation of β_i^*, i = 1, 2, to within some error tolerance, are exponentially worse than what our two-step algorithm guarantees.

We then show that, given this good initialization, at any subsequent step t with current estimate (β_1^(t), β_2^(t)), doing one step of the EM iteration with samples that are independent of these β_i^(t) results in the error decreasing by a factor of one half, hence implying geometric convergence. As we explain below, our analysis providing this guarantee depends on using a new set of samples at every iteration, i.e., the analysis does not allow re-using samples across iterations, as is typically done in EM. We believe this is an artifact of the analysis; and of course, in practice, reusing the samples in each iteration seems to be advantageous. Thus, our analytical results are for resampled versions of EM and the initialization scheme, which we state as Algorithms 4 and 5 below. Essentially, resampling involves splitting the set of samples into disjoint sets and using one set for each iteration of EM; otherwise the algorithm is identical to before. Since we have a geometric decrease in the error, achieving ϵ accuracy comes at an additional cost of a factor log(1/ϵ) in the sample complexity, as compared to what may have been possible in the non-resampled case. We then show that when ϵ ≤ O(1/k²), the error decays to zero with high probability. In other words, we need in total O(log k) iterations in order to achieve exact recovery.

²In particular, each y_i has mean 0, and variance ‖β_1^*‖² if it comes from the first vector, and ‖β_2^*‖² if it comes from the second. Having these be equal, i.e., ‖β_1^*‖² = ‖β_2^*‖², makes the y_i's statistically identical.

Figure 3. Decay of the error err as a function of the number of EM iterations, for SVD initialization (grid search), SVD initialization (with proportion information), and random initialization. Initialization allows for exact recovery (the 10^{−35} error is the precision of Matlab) in a small number of iterations, while the standard random initialization is still not close.
Additionally, and this is the main contribution of this paper, the resampled version given here represents the only known algorithm, EM or otherwise, with provable global statistical guarantees for the mixed linear regression problem with sample complexity close to O(k). Similarly, in the initialization procedure, for analytical guarantees we require two separate sets of samples: one set S_− for finding the top-2 eigenspace, and another set S_+ for evaluating the loss function at the grid points.

Algorithm 4 EM with resampling
input: Initial β_1^(0), β_2^(0), number of iterations t_0, samples {(y_i, x_i), i = 1, 2, ..., N}
1: Partition the samples {(y_i, x_i)} into t_0 disjoint sets: S_1, ..., S_{t_0}.
2: for t = 1, ..., t_0 do
3:   Use S_t to run lines 2 to 13 of Algorithm 1.
4: end for
output: β_1^(t_0), β_2^(t_0)

Algorithm 5 Initialization with resampling
input: Grid resolution δ, samples {(y_i, x_i), i = 1, 2, ..., N}
1: Partition the samples {(y_i, x_i)} into two disjoint sets: S_−, S_+
2: M ← (1/|S_−|) Σ_{i ∈ S_−} y_i² x_i ⊗ x_i
3: Compute top 2 eigenvectors v_1, v_2 of M
4: {Make the grid points} G ← {u : u = v_1 cos(δt) + v_2 sin(δt), t = 0, 1, ..., ⌈2π/δ⌉}
5: {Pick the pair that has the lowest loss} (β_1^(0), β_2^(0)) ← argmin_{u_1, u_2 ∈ G} L_+(u_1, u_2), where the loss L_+ is evaluated as in (1) using the samples in S_+
output: β_1^(0), β_2^(0)

First, we provide the hardness result for the case of general {x_i}.

Proposition 4.1. Deciding whether a general instance of the mixed linear equations problem specified by (y, X) has a solution β_1, β_2 is NP-hard.

The proof follows via a reduction from the so-called SUBSET-SUM problem, which is known to be NP-hard (Garey & Johnson, 1979). We postpone the details to the supplemental material.

We now state two theoretical guarantees for the initialization algorithms. Recall that the error err(t) is as given in (3), and p_1, p_2 are the fractions of observations that come from β_1^*, β_2^* respectively. The following result guarantees a good initialization (Algorithm 5) without requiring a sample complexity that depends on the final target error of the ultimate solution. Essentially, it says that we obtain an initialization that is good enough using O(k log² k) samples.

Proposition 4.2. Given any constant ĉ < 1/2, with probability at least 1 − c_3 k^{−2}, Algorithm 5 produces an initialization (β_1^(0), β_2^(0)) satisfying

err(0) ≤ ĉ min{p_1, p_2} ‖β_1^* − β_2^*‖_2,

as long as we choose the grid resolution δ ≤ 2^{−11} ĉ ‖β_1^* − β_2^*‖_2 (min{p_1, p_2})^{3/2}, and the numbers of samples |S_−| and |S_+| satisfy |S_−| ≥ c_1 k log² k and |S_+| ≥ c_2 k / min{p_1, p_2}, where c_1, c_2 and c_3 depend on ĉ and min{p_1, p_2} but not on the dimension k, and where κ := √(1 − 4(1 − ⟨β_1^*, β_2^*⟩²) p_1 p_2).

Algorithm 3 can be analyzed without a resampling argument. With input sample set S_−, we have the following conclusion.

Proposition 4.3. Consider the initialization method in Algorithm 3. Given any constant ĉ < 1/2, with probability at least 1 − 1/k², the approach produces an initialization (β_1^(0), β_2^(0)) satisfying

err(0) ≤ ĉ min{p_1, p_2} ‖β_1^* − β_2^*‖_2,

provided |S_−| ≥ c_1 κ^{−2} k log² k. Here c_1 is a constant that depends on ĉ and on (min{p_1, p_2})^{3/2} ‖β_1^* − β_2^*‖_2, where κ = √(1 − 4(1 − ⟨β_1^*, β_2^*⟩²) p_1 p_2) as above.

Comparing the requirement here with that in Proposition 4.2, we note the additional dependence on κ. Indeed, κ is (up to scaling) the gap between the top two eigenvalues of E[M]; this factor characterizes the hardness of identifying the two vectors via the search-avoiding method. The proofs of Propositions 4.2 and 4.3 rely on standard concentration results and eigenspace perturbation analysis. We postpone the details to the supplemental material.
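Before stating the main convergence theorem, here is a minimal NumPy sketch of the resampling wrapper in Algorithm 4; the function name `em_resampled` and the use of `np.array_split` for the disjoint partition are choices made here for illustration.

```python
import numpy as np

def em_resampled(X, y, beta1, beta2, t0, rng=None):
    """Sketch of Algorithm 4: split the samples into t0 disjoint batches and use
    a fresh batch for each EM iteration (lines 2-13 of Algorithm 1)."""
    rng = np.random.default_rng(rng)
    batches = np.array_split(rng.permutation(len(y)), t0)  # disjoint sets S_1, ..., S_t0
    for S in batches:
        XS, yS = X[S], y[S]
        # EM Part I: guess the labels using only the fresh batch.
        J1 = np.abs(yS - XS @ beta1) < np.abs(yS - XS @ beta2)
        # EM Part II: least squares on each group within the batch.
        beta1, *_ = np.linalg.lstsq(XS[J1], yS[J1], rcond=None)
        beta2, *_ = np.linalg.lstsq(XS[~J1], yS[~J1], rcond=None)
    return beta1, beta2
```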
The main theorem of the paper guarantees geometric decay of the error, assuming a good initialization. Essentially, this says that to achieve error less than ϵ, we need log(1/ϵ) iterations, each using O(k) samples. Again, we note the absence of higher-order dependence on the dimension k, or on anything other than the mild dependence on the final error tolerance ϵ.

Theorem 4.4. Consider one iteration of Algorithm 4. For fixed (β_1^(t−1), β_2^(t−1)), there exist absolute constants c̃, c_1, c_2 such that if

err(t−1) ≤ c̃ min{p_1, p_2} ‖β_1^* − β_2^*‖_2,

and if the number of samples in that iteration satisfies |S_t| ≥ c_1 k / min{p_1, p_2}, then with probability greater than 1 − exp(−c_2 k) we have a geometric decrease in the error at the next stage, i.e.,

err(t) ≤ (1/2) err(t−1).

Note that the decrease factor 1/2 is chosen arbitrarily here. To put the above results together, we choose the constant ĉ in Propositions 4.2 and 4.3 to be less than the constant c̃ in Theorem 4.4. Then, in each iteration of alternating minimization, with O(k) fresh samples, the error decays geometrically by a constant factor with probability greater than 1 − exp(−ck). If we are satisfied with error level ϵ, the resampling regime thus requires O(k log² k + k log(1/ϵ)) samples in total.

Let J_b^* denote the set of samples generated from β_b^*, b = 1, 2. It is not hard to see that in the noiseless case, exact recovery occurs when J_b = J_b^*. The next result shows that when ϵ < (c/k²) ‖β_1^* − β_2^*‖_2, a fresh batch of Θ(k) samples will be clustered correctly, which results in exact recovery.

Proposition 4.5 (Exact Recovery). There exist absolute constants c_1, c_2 such that if err(t−1) ≤ (c_1/k²) ‖β_1^* − β_2^*‖_2 and k / min{p_1, p_2} < |S_t| < c_2 k, then with probability greater than 1 − 1/k,

err(t) = 0.

By setting ϵ = O(1/k²), it follows that exact recovery needs O(k log² k) samples in total. When running alternating minimization, the approximation error first decays geometrically; once the error drops below this level, exact recovery occurs and the ground truth is found. The simulation results in Figure 3 support this conclusion.

5. Proof of Theorem 4.4

In this section, we prove our main theorem: we show that with a good starting point, EM exhibits geometric convergence, reducing the error by a factor of 2 at each iteration. The following lemma is crucial.

Lemma 5.1. Assume x ∈ R^k is a standard normal random vector. Let u, v be two fixed vectors in R^k. Define

α(u, v) := cos^{−1}( ⟨v − u, v + u⟩ / (‖u + v‖_2 ‖u − v‖_2) ),   α(u, v) ∈ [0, π].

Let Σ = E[ x x^⊤ | (x^⊤ u)² > (x^⊤ v)² ]. Then,

σ_max(Σ) = 1 + sin(α(u, v)) / α(u, v),   (4)
σ_min(Σ) = 1 − sin(α(u, v)) / α(u, v),   (5)

and

P( (x^⊤ u)² > (x^⊤ v)² ) ≥ 1/2 if ‖u‖_2 > ‖v‖_2,  while  P( (x^⊤ u)² > (x^⊤ v)² ) ≤ 2 ‖u‖_2 / ‖v‖_2 if ‖u‖_2 < ‖v‖_2.   (6)

To simplify notation, we drop the iteration index t, let (β_1, β_2) denote the input to the EM iteration, and let (β_1^+, β_2^+) denote its output. Similarly, we write err := max_i ‖β_i − β_i^*‖ and err^+ := max_i ‖β_i^+ − β_i^*‖. We denote by J_1^* and J_2^* the sets of samples that come from β_1^* and β_2^* respectively, and we denote the sets produced by the E-step using the current iterate (β_1, β_2) by J_1 and J_2. Thus we have

J_1^* := {i ∈ S_t : y_i = x_i^⊤ β_1^*},   J_1 := {i ∈ S_t : (y_i − x_i^⊤ β_1)² < (y_i − x_i^⊤ β_2)²},

and similarly for J_2^* and J_2. We define a diagonal matrix W ∈ R^{|S_t| × |S_t|} that picks out the rows in J_1 when used for left multiplication: let W_ii = 1 if i ∈ J_1, and zero otherwise. Let W^* be defined similarly, using J_1^*. Thus, β_1^+ is the least squares solution to W y = W X β, β_2^+ is the least squares solution to (I − W) y = (I − W) X β, and

y = W^* X β_1^* + (I − W^*) X β_2^*.
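As a quick numerical illustration of this setup (a NumPy sketch with hypothetical variable names, not part of the proof), one can check the closed form of the least squares solution and the error identity derived next:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 5
X = rng.standard_normal((n, k))
beta1_star, beta2_star = rng.standard_normal(k), rng.standard_normal(k)

Wstar = np.diag((rng.random(n) < 0.5).astype(float))            # picks J1*
y = Wstar @ X @ beta1_star + (np.eye(n) - Wstar) @ X @ beta2_star

beta1, beta2 = rng.standard_normal(k), rng.standard_normal(k)   # current iterate
W = np.diag(((y - X @ beta1) ** 2 < (y - X @ beta2) ** 2).astype(float))  # picks J1

# Closed-form least squares solution of W y = W X beta, and the error identity
# beta1+ - beta1* = (X^T W X)^{-1} X^T (W W* - W) X (beta1* - beta2*).
beta1_plus = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
lhs = beta1_plus - beta1_star
rhs = np.linalg.solve(X.T @ W @ X,
                      X.T @ (W @ Wstar - W) @ X @ (beta1_star - beta2_star))
print(np.allclose(lhs, rhs))   # expected: True
```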
Observing that W² = W, we have that β_1^+ has the closed form

β_1^+ = (X^⊤ W X)^{−1} X^⊤ W y.

By a simple algebraic calculation, we find

β_1^+ − β_1^* = (X^⊤ W X)^{−1} X^⊤ (W W^* − W) X (β_1^* − β_2^*).

In order to bound the magnitude of the error, and hence of the right-hand side, we write

‖β_1^+ − β_1^*‖_2 ≤ A B,   (7)

where

A = ‖(X^⊤ W X)^{−1}‖,   B = ‖X^⊤ (W − W W^*) X (β_1^* − β_2^*)‖_2.

Bounding A. Observe that X^⊤ W X = Σ_{i ∈ J_1} x_i x_i^⊤. Decomposing J_1 = (J_1 ∩ J_1^*) ∪ (J_1 ∩ J_2^*), we have

σ_min(X^⊤ W X) ≥ σ_min( Σ_{i ∈ J_1 ∩ J_1^*} x_i x_i^⊤ ).

We need to control this quantity. We do so by lower bounding the number of terms in J_1 ∩ J_1^*, and also the smallest singular value of the matrix Σ = E[ x_i x_i^⊤ | i ∈ J_1 ∩ J_1^* ]. If the current error satisfies

err ≤ ‖β_1^* − β_2^*‖_2 / 2,   (8)

we have ‖β_1^* − β_2‖_2 > ‖β_1^* − β_1‖_2. Now, from Lemma 5.1, we have

P( (x_i^⊤ (β_1^* − β_1))² < (x_i^⊤ (β_1^* − β_2))² ) > 1/2   and   σ_min(Σ) ≥ 1 − 2/π.

Using Hoeffding's inequality, with probability greater than 1 − e^{−p_1 |S_t| / 8}, we have the bound |J_1 ∩ J_1^*| ≥ (1/4) p_1 |S_t|. By a standard concentration argument (see, e.g., Corollary 50 in (Vershynin, 2010)), we conclude that for any η ∈ (0, 1 − 2/π), there exists a constant c_3 such that if

|S_t| ≥ c_3 k / (η p_1),   (9)

then

A ≤ 4 / ( (1 − 2/π − η) p_1 |S_t| ),   (10)

with probability at least 1 − e^{−k}.

Bounding B. Let Q := X^⊤ (W − W W^*) X. We have

B² ≤ σ_max(Q) (β_1^* − β_2^*)^⊤ Q (β_1^* − β_2^*),

and

(β_1^* − β_2^*)^⊤ Q (β_1^* − β_2^*) = Σ_{i ∈ J_1 ∩ J_2^*} (x_i^⊤ (β_1^* − β_2^*))²
  ≤ Σ_{i ∈ J_1 ∩ J_2^*} [ 2 (x_i^⊤ (β_1^* − β_1))² + 2 (x_i^⊤ (β_2^* − β_1))² ]
  ≤ Σ_{i ∈ J_1 ∩ J_2^*} [ 2 (x_i^⊤ (β_1^* − β_1))² + 2 (x_i^⊤ (β_2^* − β_2))² ].

The last inequality results from the decision rule labeling β_1 and β_2. This immediately implies that

B ≤ 2 σ_max(Q) err.   (11)

Using Lemma 5.1, σ_max( E[ x_i x_i^⊤ | i ∈ J_1 ∩ J_2^* ] ) ≤ 2. Following Theorem 39 in (Vershynin, 2010), we claim that there exist constants c_4, c_5 such that with probability greater than 1 − 2 e^{−c_4 k},

σ_max(Q) ≤ |J_1 ∩ J_2^*| (2 + max(η̂, η̂²)),   where η̂ = c_5 √( k / |J_1 ∩ J_2^*| ).

Letting c_6 = 2 + c_5², we have σ_max(Q) ≤ c_6 max(k, |J_1 ∩ J_2^*|). Now, using Lemma 5.1 again, we find

E[ |J_1 ∩ J_2^*| ] ≤ ( 2 err / ‖β_1^* − β_2^*‖_2 ) p_2 |S_t|.

By Hoeffding's inequality, with high probability |J_1 ∩ J_2^*| ≤ 2 E[ |J_1 ∩ J_2^*| ].

Now we can combine the bounds on A in (10) and on B in (11). Setting η = (1 − 2/π)/2, and requiring

err ≤ ( (1 − 2/π) / (64 c_6) ) p_1 ‖β_1^* − β_2^*‖_2   (12)

and

|S_t| ≥ (16 c_6 / 0.18) · k / p_1,   (13)

we conclude that ‖β_1^+ − β_1^*‖_2 ≤ (1/2) err. Repeating the steps for β_2^+, we obtain a similar result, and hence we conclude that

err^+ ≤ (1/2) err,

as claimed.

6. Conclusion and Future Work

In this paper, we provide a sufficient condition under which alternating minimization, a popular method for solving non-convex optimization problems, achieves the global optimum when used to solve mixed linear regression. Based on that, under mild conditions, we show that a novel spectral initialization algorithm is guaranteed to satisfy the sufficient condition of alternating minimization with near-optimal sample complexity. As a future direction, it would be interesting and challenging to analyze alternating minimization in the face of noise.

Acknowledgement

The authors would like to acknowledge NSF grants 1302435, 0954059, 1017525 and DTRA grant HDTRA113-1-0024 for supporting this research.

References

Anandkumar, Anima, Ge, Rong, Hsu, Daniel, Kakade, Sham M., and Telgarsky, Matus. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012.

Chaganty, A. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), 2013.

Deb, Partha and Holmes, Ann M. Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models. Health Economics, 9(6):475–489, 2000.
Elhamifar, Ehsan and Vidal, René. Sparse subspace clustering: Algorithm, theory, and applications. CoRR, abs/1203.1005, 2012.

Garey, M. R. and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Series of Books in the Mathematical Sciences. W. H. Freeman, 1979.

Grün, Bettina, Leisch, Friedrich, et al. Applications of finite mixtures of regression models. URL: http://cran.r-project.org/web/packages/flexmix/vignettes/regression-examples.pdf, 2007.

Hsu, Daniel and Kakade, Sham M. Learning Gaussian mixture models: Moment methods and spectral decompositions. CoRR, abs/1206.5766, 2012.

Städler, Nicolas, Bühlmann, Peter, and van de Geer, Sara. ℓ1-penalization for mixture regression models. TEST, 19(2):209–256, 2010. ISSN 1133-0686.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv e-prints, November 2010.

Vidal, René, Ma, Yi, and Sastry, Shankar. Generalized principal component analysis (GPCA). In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pp. I-621. IEEE, 2003.

Viele, Kert and Tong, Barbara. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4), 2002. ISSN 0960-3174. URL http://dx.doi.org/10.1023/A%3A1020779827503.

Wu, C. F. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.