High Dimensional Structured Superposition Models

Qilong Gu
Dept of Computer Science & Engineering, University of Minnesota, Twin Cities
guxxx396@cs.umn.edu

Arindam Banerjee
Dept of Computer Science & Engineering, University of Minnesota, Twin Cities
banerjee@cs.umn.edu

Abstract

High dimensional superposition models characterize observations using parameters which can be written as a sum of multiple component parameters, each with its own structure, e.g., a sum of low-rank and sparse matrices, a sum of sparse and rotated sparse vectors, etc. In this paper, we consider general superposition models which allow a sum of any number of component parameters, where each component structure can be characterized by any norm. We present a simple estimator for such models, give a geometric condition under which the components can be accurately estimated, characterize the sample complexity of the estimator, and give high probability non-asymptotic bounds on the componentwise estimation error. We use tools from empirical processes and generic chaining for the statistical analysis, and our results, which substantially generalize prior work on superposition models, are in terms of Gaussian widths of suitable sets.

1 Introduction

For high-dimensional structured estimation problems [3, 15], considerable advances have been made in accurately estimating a sparse or structured parameter $\theta \in \mathbb{R}^p$ even when the sample size $n$ is far smaller than the ambient dimensionality of $\theta$, i.e., $n \ll p$. Instead of a single structure, such as sparsity or low rank, recent years have seen interest in parameter estimation when the parameter $\theta$ is a superposition, or sum, of multiple different structures, i.e., $\theta = \sum_{i=1}^k \theta_i$, where $\theta_1$ may be sparse, $\theta_2$ may be low rank, and so on [1, 6, 7, 9, 11, 12, 13, 23, 24]. In this paper, we substantially generalize the non-asymptotic estimation error analysis for such superposition models such that (i) the parameter $\theta$ can be the superposition of any number of component parameters $\theta_i$, and (ii) the structure in each $\theta_i$ can be captured by any suitable norm $R_i(\theta_i)$. We will analyze the following linear measurement based superposition model
$$ y = X \sum_{i=1}^k \theta_i + \omega, \qquad (1) $$
where $X \in \mathbb{R}^{n \times p}$ is a random sub-Gaussian design or compressive matrix, $k$ is the number of components, $\theta_i$ is one component of the unknown parameters, $y \in \mathbb{R}^n$ is the response vector, and $\omega \in \mathbb{R}^n$ is random noise independent of $X$. The structure in each component $\theta_i$ is captured by a suitable norm $R_i(\cdot)$, chosen so that $R_i(\theta_i)$ has a small value, e.g., sparsity captured by $\|\theta_i\|_1$, low rank (for matrix $\theta_i$) captured by the nuclear norm $\|\theta_i\|_*$, etc. Popular models such as Morphological Component Analysis (MCA) [10] and Robust PCA [6, 9] can be viewed as special cases of this framework (see Appendix D).

The superposition estimation problem can be posed as follows: Given $(y, X)$ generated following (1), estimate component parameters $\{\hat\theta_i\}$ such that all the componentwise estimation errors $\Delta_i = \hat\theta_i - \theta_i^*$, where $\theta_i^*$ is the population mean, are small. Ideally, we want to obtain high-probability non-asymptotic bounds on the total componentwise error measured as $\sum_{i=1}^k \|\hat\theta_i - \theta_i^*\|_2$, with the bound improving (getting smaller) with an increase in the number $n$ of samples.

We propose the following estimator for the superposition model in (1):
$$ \min_{\{\theta_1,\dots,\theta_k\}} \ \Big\| y - X \sum_{i=1}^k \theta_i \Big\|_2^2 \quad \text{s.t.} \quad R_i(\theta_i) \le \alpha_i, \ \ i = 1, \dots, k, \qquad (2) $$
where the $\alpha_i$ are suitable constants.
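The paper treats the estimator (2) abstractly. As a concrete illustration, the following Python sketch (ours, not from the paper) solves the constrained least-squares problem with cvxpy for a generic list of component norms; the function name and the use of cvxpy are our own assumptions, and the oracle constraint levels $\alpha_i = R_i(\theta_i^*)$ are assumed known, as in the analysis below.

```python
# A minimal sketch of estimator (2), assuming cvxpy is available and the
# constraint levels alpha_i = R_i(theta_i^*) are known (oracle setting).
# Names and conventions are illustrative, not from the paper.
import numpy as np
import cvxpy as cp

def superposition_estimator(y, X, norms, alphas):
    """Constrained least squares over a sum of k structured components.

    norms[i]  : callable mapping a cvxpy variable to the norm R_i
    alphas[i] : the constraint level alpha_i in (2)
    """
    n, p = X.shape
    thetas = [cp.Variable(p) for _ in norms]
    residual = y - X @ sum(thetas)                      # y - X * sum_i theta_i
    constraints = [R(th) <= a for R, th, a in zip(norms, thetas, alphas)]
    cp.Problem(cp.Minimize(cp.sum_squares(residual)), constraints).solve()
    return [th.value for th in thetas]
```

For $k = 2$ with $R_1 = \|\cdot\|_1$ and $R_2 = \|Q\cdot\|_1$ this instantiates the MCA estimator of Section 6.1; for matrix components one could pass a nuclear-norm atom such as `cp.normNuc` as a component norm.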
In this paper, we focus on the case where $\alpha_i = R_i(\theta_i^*)$; e.g., if $\theta_i^*$ is $s$-sparse with $\|\theta_i^*\|_2 = 1$ and $R_i(\cdot) = \|\cdot\|_1$, one can take $\alpha_i = \sqrt{s}$ since $R_i(\theta_i^*) \le \sqrt{s}$. Recent advances [16] can be used to extend our results to more general settings. The superposition estimator in (2) succeeds if a certain geometric condition, which we call structural coherence (SC), is satisfied by certain sets (cones) associated with the component norms $R_i(\cdot)$. Since the estimate $\hat\theta_i = \theta_i^* + \Delta_i$ is in the feasible set of the optimization problem (2), the error vector $\Delta_i$ satisfies the constraint $R_i(\theta_i^* + \Delta_i) \le \alpha_i$ where $\alpha_i = R_i(\theta_i^*)$. The SC condition is a geometric relationship between the corresponding error cones $C_i = \mathrm{cone}\{\Delta_i \,|\, R_i(\theta_i^* + \Delta_i) \le R_i(\theta_i^*)\}$. If SC is satisfied, then we can show that the sum of componentwise estimation errors is bounded with high probability, and the bound takes the form:
$$ \sum_{i=1}^k \|\hat\theta_i - \theta_i^*\|_2 \ \le\ c\, \frac{\max_i w(C_i \cap \mathbb{B}^p) + \sqrt{\log k}}{\sqrt{n}}, \qquad (3) $$
where $n$ is the sample size, $k$ is the number of components, and $w(C_i \cap \mathbb{B}^p)$ is the Gaussian width [3, 8, 22] of the intersection of the error cone $C_i$ with the unit Euclidean ball $\mathbb{B}^p \subset \mathbb{R}^p$. Interestingly, the estimation error decreases at the rate of $1/\sqrt{n}$, similar to the case of single parameter estimators [15, 3], and depends only logarithmically on the number of components $k$. Further, while the dependence of the error on the Gaussian width of the error cone has been established in recent results involving a single parameter [3, 22], the bound in (3) depends on the maximum of the Gaussian widths of the individual error cones, not their sum. The analysis thus gives a general way to construct estimators for superposition problems, along with high-probability non-asymptotic upper bounds on the sum of componentwise errors. To show the generality of our work, we review and compare related work in Appendix B.

Notation: In this paper, we use $\|\cdot\|$ to denote a vector norm and $|||\cdot|||$ to denote a matrix norm. For example, $\|\cdot\|_2$ is the Euclidean norm of a vector or matrix, and $|||\cdot|||_*$ is the nuclear norm of a matrix. We denote by $\mathrm{cone}\{E\}$ the smallest closed cone that contains a given set $E$, and by $\langle \cdot, \cdot \rangle$ the inner product.

The rest of this paper is organized as follows: We start with a deterministic estimation error bound in Section 2, while laying down the key geometric and statistical quantities involved in the analysis. In Section 3, we discuss the geometry of the structural coherence (SC) condition, and in Section 4 we show that the geometric SC condition implies the statistical restricted eigenvalue (RE) condition. In Section 5, we develop the main error bound on the sum of componentwise errors, which holds with high probability for sub-Gaussian designs and noise. We apply our error bound to practical problems in Section 6, and present experimental results in Section 7. We conclude in Section 8. In the Appendix, we compare an estimator based on the infimal convolution [18] of norms with our estimator (2) for the noiseless case, and provide some additional examples and experiments. The proofs of all technical results are also in the Appendix.
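As a quick empirical preview of the $1/\sqrt{n}$ behavior in (3), the following sketch (ours, with illustrative sizes and noise level, reusing the hypothetical `superposition_estimator` from the sketch above) generates a sparse plus rotated-sparse superposition and tracks how the total componentwise error shrinks as $n$ grows.

```python
# Empirical look at the 1/sqrt(n) decay suggested by (3); assumes the
# superposition_estimator sketch above is in scope. Sizes are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
p, s = 200, 5
Q = np.linalg.qr(rng.standard_normal((p, p)))[0]        # random orthogonal rotation
theta1_star = np.r_[np.ones(s), np.zeros(p - s)]        # sparse component
theta2_star = Q.T @ np.r_[np.ones(s), np.zeros(p - s)]  # sparse after rotation by Q
norms = [cp.norm1, lambda t: cp.norm1(Q @ t)]
alphas = [np.linalg.norm(theta1_star, 1), np.linalg.norm(Q @ theta2_star, 1)]

for n in (100, 200, 400, 800):
    X = rng.standard_normal((n, p))
    y = X @ (theta1_star + theta2_star) + 0.1 * rng.standard_normal(n)
    est1, est2 = superposition_estimator(y, X, norms, alphas)
    err = np.linalg.norm(est1 - theta1_star) + np.linalg.norm(est2 - theta2_star)
    print(f"n = {n:4d}: total componentwise error = {err:.3f}")
```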
2 Error Structure and Recovery Guarantees

In this section, we start with some basic results and, under suitable assumptions, provide a deterministic bound for the componentwise estimation error in superposition models. Subsequently, we will show that the assumptions made here hold with high probability as long as a purely geometric, non-probabilistic condition characterized by structural coherence (SC) is satisfied.

Let $\{\hat\theta_i\}$ be a solution to the superposition estimation problem in (2), and let $\{\theta_i^*\}$ be the optimal (population) parameters involved in the true data generation process. Let $\Delta_i = \hat\theta_i - \theta_i^*$ be the error vector for component $i$ of the superposition. Our goal is to provide a preliminary understanding of the structure of the error sets where the $\Delta_i$ live, identify conditions under which a bound on the total componentwise error $\sum_{i=1}^k \|\hat\theta_i - \theta_i^*\|_2$ will hold, and provide a preliminary version of such a bound, which will be subsequently refined to the form in (3) in Section 5.

Since $\hat\theta_i = \theta_i^* + \Delta_i$ lies in the feasible set of (2), as discussed in Section 1, each error vector $\Delta_i$ lies in the error set $E_i = \{\Delta_i \in \mathbb{R}^p \,|\, R_i(\theta_i^* + \Delta_i) \le R_i(\theta_i^*)\}$. For the analysis, we will be focusing on the cone of such error sets, given by
$$ C_i = \mathrm{cone}\{\Delta_i \in \mathbb{R}^p \,|\, R_i(\theta_i^* + \Delta_i) \le R_i(\theta_i^*)\}. \qquad (4) $$
Let $\theta^* = \sum_{i=1}^k \theta_i^*$, $\hat\theta = \sum_{i=1}^k \hat\theta_i$, and $\Delta = \sum_{i=1}^k \Delta_i$, so that $\Delta = \hat\theta - \theta^*$. From the optimality of $\hat\theta$ as a solution to (2), we have
$$ \|y - X\hat\theta\|_2^2 \le \|y - X\theta^*\|_2^2 \ \Longrightarrow\ \|X\Delta\|_2^2 \le 2\,\omega^T X\Delta, \qquad (5) $$
using $\hat\theta = \theta^* + \Delta$ and $y = X\theta^* + \omega$. In order to establish recovery guarantees, under suitable assumptions we construct a lower bound on $\|X\Delta\|_2^2$, the left hand side of (5). The lower bound is a generalized form of the restricted eigenvalue (RE) condition studied in the literature [4, 5, 17]. We also construct an upper bound on $\omega^T X\Delta$, the right hand side of (5), which requires a careful analysis of the noise-design (ND) interaction, i.e., between the noise $\omega$ and the design $X$.

We start by assuming that a generalized form of the RE condition is satisfied by the superposition of errors: there exists a constant $\kappa > 0$ such that for all $\Delta_i \in C_i$, $i = 1, 2, \dots, k$:
$$ \frac{1}{\sqrt{n}} \Big\| X \sum_{i=1}^k \Delta_i \Big\|_2 \ \ge\ \kappa \sum_{i=1}^k \|\Delta_i\|_2. \qquad (6) $$
The above RE condition involves the following set:
$$ H = \Big\{ \textstyle\sum_{i=1}^k \Delta_i \ :\ \Delta_i \in C_i,\ \sum_{i=1}^k \|\Delta_i\|_2 = 1 \Big\}, \qquad (7) $$
which involves all $k$ error cones, and the lower bound is over the sum of the norms of the componentwise errors. If $k = 1$, the RE condition in (6) simplifies to the widely studied RE condition in the current literature on Lasso-type and Dantzig-type estimators [4, 17, 3], where only one error cone is involved. If we set all components but $\Delta_i$ to zero, then (6) becomes the RE condition only for component $i$. We also note that the general RE condition as explicitly stated in (6) has been implicitly used in [1] and [24]. For subsequent analysis, we introduce the set $\bar H$ defined as
$$ \bar H = \Big\{ \textstyle\sum_{i=1}^k \Delta_i \ :\ \Delta_i \in C_i,\ \sum_{i=1}^k \|\Delta_i\|_2 \le 1 \Big\}, \qquad (8) $$
noting that $H \subseteq \bar H$.

The general RE condition in (6) depends on the random design matrix $X$, and is hence an inequality which holds with a certain probability depending on $X$ and the set $H$. For superposition problems, the probabilistic RE condition in (6) is intimately related to the following deterministic structural coherence (SC) condition on the interaction of the different component cones $C_i$, which makes no explicit reference to the random design matrix $X$: there is a constant $\rho > 0$ such that for all $\Delta_i \in C_i$, $i = 1, \dots, k$,
$$ \Big\| \sum_{i=1}^k \Delta_i \Big\|_2 \ \ge\ \rho \sum_{i=1}^k \|\Delta_i\|_2. \qquad (9) $$
If $k = 1$, the SC condition is trivially satisfied with $\rho = 1$. Since most existing literature on high-dimensional structured models focuses on the $k = 1$ setting [4, 17, 3], there was no reason to study the SC condition carefully. For $k > 1$, the SC condition (9) implies a non-trivial relationship among the component cones. In particular, if the SC condition is true, then the sum $\sum_{i=1}^k \Delta_i$ being zero implies that each component $\Delta_i$ must also be zero. As presented in (9), the SC condition comes across as an algebraic condition.
In Section 3, we present a geometric characterization of the SC condition [13], and illustrate that the condition is both necessary and sufficient for accurate recovery of each component. In Section 4, we show that for sub-Gaussian design matrices $X$, the SC condition in (9) in fact implies that the RE condition in (6) holds with high probability once the number of samples crosses a certain sample complexity, which depends on the Gaussian widths of the component cones. For now, we assume the RE condition in (6) to hold, and proceed with the error bound analysis.

To establish the recovery guarantee, following (5), we need an upper bound on the interaction between the noise $\omega$ and the design $X$ [3, 14]. In particular, we consider the noise-design (ND) interaction
$$ s_n(\gamma) \ =\ \inf_{s > 0} \Big\{ s \ :\ \sup_{u \in s\bar H} \tfrac{1}{n} \omega^T X u \ \le\ \gamma s^2 \Big\}, \qquad (10) $$
where $\gamma > 0$ is a constant, and $s\bar H$ is the scaled version of $\bar H$ with scaling factor $s > 0$. Here, $s_n(\gamma)$ denotes the minimal scaling needed on $\bar H$ such that one obtains a uniform bound over $s\bar H$ of the form $\tfrac{1}{n} \omega^T X\Delta \le \gamma s_n^2(\gamma)$. Then, from the basic inequality in (5), with the bounds implied by the RE condition and the ND interaction, we have
$$ \kappa^2 \Big( \sum_{i=1}^k \|\Delta_i\|_2 \Big)^2 \ \le\ \frac{1}{n} \|X\Delta\|_2^2 \ \le\ \frac{2}{n}\,\omega^T X\Delta \ \le\ 2\gamma\, s_n(\gamma) \sum_{i=1}^k \|\Delta_i\|_2, \qquad (11) $$
which implies a bound on the componentwise error. The main deterministic bound below states the result formally:

Theorem 1 (Deterministic bound) Assume that the RE condition in (6) is satisfied in $H$ with parameter $\kappa$. Then, if $\kappa^2 > \gamma$, we have $\sum_{i=1}^k \|\Delta_i\|_2 \le 2 s_n(\gamma)$.

The above bound is deterministic and holds only when the RE condition in (6) is satisfied with a constant $\kappa$ such that $\kappa^2 > \gamma$. In the sequel, we first give a geometric characterization of the SC condition in Section 3, and show that the SC condition implies the RE condition with high probability in Section 4. Further, in Section 5 we give a high probability characterization of $s_n(\gamma)$, based on the noise $\omega$ and design $X$, in terms of the Gaussian widths of the component cones, and also illustrate how one can choose $\gamma$. With these characterizations, we will obtain the desired componentwise error bound of the form (3).

[Figure 1: Geometry of the SC condition when $k = 2$. The error sets $E_1$ and $E_2$ are respectively shown as blue and green squares, and the corresponding error cones are $C_1$ and $C_2$. $-C_1$ is the reflection of the error cone $C_1$. If $-C_1$ and $C_2$ do not share a ray, i.e., the angle $\alpha$ between the cones is larger than 0, then $\delta_0 < 1$, and the SC condition holds.]

3 Geometry of Structural Coherence

In this section, we give a geometric characterization of the structural coherence (SC) condition in (9). We start with the simplest case of two vectors $x, y$. If they are not reflections of each other, i.e., $x \neq -y$, then the following relationship holds:

Proposition 2 If there exists a $\delta < 1$ such that $\langle x, y \rangle \ge -\delta \|x\|_2 \|y\|_2$, then
$$ \|x + y\|_2 \ \ge\ \sqrt{\tfrac{1-\delta}{2}}\, \big( \|x\|_2 + \|y\|_2 \big). \qquad (12) $$

Next, we generalize the condition of Proposition 2 to vectors in two different cones $C_1$ and $C_2$. Given the cones, define
$$ \delta_0 \ =\ \sup_{x \in -C_1 \cap \mathbb{S}^{p-1},\ y \in C_2 \cap \mathbb{S}^{p-1}} \langle x, y \rangle. \qquad (13) $$
By construction, $\langle x, y \rangle \ge -\delta_0 \|x\|_2 \|y\|_2$ for all $x \in C_1$ and $y \in C_2$. If $\delta_0 < 1$, then (12) continues to hold for all $x \in C_1$ and $y \in C_2$ with constant $\sqrt{(1-\delta_0)/2} > 0$. Note that this corresponds to the SC condition with $k = 2$ and $\rho = \sqrt{(1-\delta_0)/2}$. We can interpret this geometrically as follows: first reflect the cone $C_1$ to get $-C_1$; then $\delta_0$ is the cosine of the minimum angle between $-C_1$ and $C_2$. If $\delta_0 = 1$, then $-C_1$ and $C_2$ share a ray, and structural coherence does not hold. Otherwise $\delta_0 < 1$, implying $-C_1 \cap C_2 = \{0\}$, i.e., the two cones intersect only at the origin, and structural coherence holds.
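As a numerical illustration of (13) and Proposition 2 (our own sketch, using two simple polyhedral cones as stand-ins for the error cones, which are harder to sample from in general), one can lower bound $\delta_0$ by Monte Carlo and compare the implied SC constant $\sqrt{(1-\delta_0)/2}$ with a direct empirical estimate of $\rho$ in (9).

```python
# Monte Carlo sketch for delta_0 in (13) and the SC constant rho in (9).
# C1, C2 are illustrative polyhedral cones (nonnegative combinations of a few
# random unit generators), not the actual l1 error cones of the paper.
import numpy as np

rng = np.random.default_rng(0)
p, m, trials = 50, 5, 50_000

A1 = rng.standard_normal((p, m)); A1 /= np.linalg.norm(A1, axis=0)
A2 = rng.standard_normal((p, m)); A2 /= np.linalg.norm(A2, axis=0)

def sample_cone_sphere(A, size):
    """Random points of cone(A) intersected with the unit sphere."""
    x = rng.random((size, A.shape[1])) @ A.T            # nonnegative combinations
    return x / np.linalg.norm(x, axis=1, keepdims=True)

x = sample_cone_sphere(A1, trials)
y = sample_cone_sphere(A2, trials)

# Sampling only explores part of each cone, so delta0_hat is a lower bound on
# delta_0, and the implied rho is therefore an optimistic (upper) estimate.
delta0_hat = np.max(np.sum(-x * y, axis=1))             # sup <u, v>, u in -C1, v in C2
rho_from_delta = np.sqrt((1 - delta0_hat) / 2)

# Direct check of (9) with k = 2 on the same samples, splitting the unit budget.
lam = rng.random((trials, 1))
rho_direct = np.min(np.linalg.norm(lam * x + (1 - lam) * y, axis=1))
print(f"delta_0 >= {delta0_hat:.3f}, implied rho <= {rho_from_delta:.3f}, "
      f"direct empirical rho <= {rho_direct:.3f}")
```

Both printed quantities are sampling-based: the first can only underestimate $\delta_0$, and both estimates of $\rho$ can only overestimate the true SC constant.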
For the general case involving $k$ cones, denote
$$ \delta_i \ =\ \sup_{u \in -C_i \cap \mathbb{S}^{p-1},\ v \in (\sum_{j \neq i} C_j) \cap \mathbb{S}^{p-1}} \langle u, v \rangle. \qquad (14) $$
In recent work, [13] concluded that if $\delta_i < 1$ for each $i = 1, \dots, k$, then $-C_i$ and $\sum_{j \neq i} C_j$ do not share a ray, and the original signal can be recovered in the noiseless case. We show that this condition in fact implies $\rho > 0$ for the SC condition in (9), which is sufficient for accurate recovery even in the noisy case. In particular, with $\delta := \max_i \delta_i$, we have the following result:

Theorem 3 (Structural Coherence (SC) Condition) Let $\delta := \max_i \delta_i$ with $\delta_i$ as defined in (14). If $\delta < 1$, then there exists a $\rho > 0$ such that for any $\Delta_i \in C_i$, $i = 1, \dots, k$, the SC condition in (9) holds, i.e.,
$$ \Big\| \sum_{i=1}^k \Delta_i \Big\|_2 \ \ge\ \rho \sum_{i=1}^k \|\Delta_i\|_2. \qquad (15) $$

Thus, the SC condition is satisfied in the general case as long as the reflection $-C_i$ of any cone $C_i$ does not intersect, i.e., share a ray with, the Minkowski sum $\sum_{j \neq i} C_j$ of the other cones.

4 Restricted Eigenvalue Condition for Superposition Models

Assuming that the SC condition is satisfied by the error cones $\{C_i\}$, $i = 1, \dots, k$, in this section we show that the general RE condition in (6) is satisfied with high probability when the number of samples $n$ in the sub-Gaussian design matrix $X \in \mathbb{R}^{n \times p}$ crosses the sample complexity $n_0$. We give a precise characterization of the sample complexity $n_0$ in terms of the Gaussian width of the set $H$. Our analysis is based on the results and techniques in [20, 14], and we note that [3] has related results using mildly different techniques.

We start with a restricted eigenvalue condition on $H$. For a random vector $Z \in \mathbb{R}^p$, we define the marginal tail function for an arbitrary set $E$ as
$$ Q_\xi(E; Z) \ =\ \inf_{u \in E} \mathbb{P}\big( |\langle Z, u \rangle| \ge \xi \big), \qquad (16) $$
noting that it is deterministic given the set $E \subseteq \mathbb{R}^p$. Let $\epsilon_i$, $i = 1, \dots, n$, be independent Rademacher random variables, i.e., random variables taking values $+1$ or $-1$ with probability $\tfrac{1}{2}$ each, and let $X_i$, $i = 1, \dots, n$, be independent copies of $Z$. We define the empirical width of $E$ as
$$ W_n(E; Z) \ =\ \sup_{u \in E} \langle h, u \rangle, \quad \text{where } h = \frac{1}{\sqrt{n}} \sum_{i=1}^n \epsilon_i X_i. \qquad (17) $$
With this notation, we recall the following result from [20]:

Lemma 1 Let $X \in \mathbb{R}^{n \times p}$ be a random design matrix with each row an independent copy of the sub-Gaussian random vector $Z$. Then for any $\xi, \rho, t > 0$, we have
$$ \inf_{u \in H} \|Xu\|_2 \ \ge\ \rho\xi \sqrt{n}\, Q_{2\rho\xi}(H; Z) - 2 W_n(H; Z) - \rho\xi t \qquad (18) $$
with probability at least $1 - e^{-t^2}$.

In order to obtain a lower bound on $\kappa$ in the RE condition (6), we need to lower bound $Q_{2\rho\xi}(H; Z)$ and upper bound $W_n(H; Z)$. To lower bound $Q_{2\rho\xi}(H; Z)$, we consider the spherical cap
$$ A \ =\ \Big( \sum_{i=1}^k C_i \Big) \cap \mathbb{S}^{p-1}. \qquad (19) $$
From [20, 14], one can obtain a lower bound on $Q_\xi(A; Z)$ based on the Paley-Zygmund inequality, which lower bounds the tail probability of a random variable in terms of its second moment. For an arbitrary vector $u$, we use the following version of the inequality:
$$ \mathbb{P}\big( |\langle Z, u \rangle| \ge 2\xi \big) \ \ge\ \frac{\big[ \mathbb{E}|\langle Z, u \rangle| - 2\xi \big]_+^2}{\mathbb{E}|\langle Z, u \rangle|^2}. \qquad (20) $$
In the current context, the following result is a direct consequence of the SC condition; it shows that $Q_{2\rho\xi}(H; Z)$ is lower bounded by $Q_{2\xi}(A; Z)$, which in turn is strictly bounded away from 0. The proof of Lemma 2 is given in Appendix H.1.

Lemma 2 Let the sets $H$ and $A$ be as defined in (7) and (19) respectively. If the SC condition in (9) holds, then the marginal tail functions of the two sets satisfy
$$ Q_{\rho\xi}(H; Z) \ \ge\ Q_\xi(A; Z). \qquad (21) $$

Next we discuss how to upper bound the empirical width $W_n(H; Z)$. Let the set $E$ be arbitrary, and let $g \sim N(0, I_p)$ be a standard Gaussian random vector in $\mathbb{R}^p$. The Gaussian width [3] of $E$ is defined as
$$ w(E) \ =\ \mathbb{E} \sup_{u \in E} \langle g, u \rangle. \qquad (22) $$
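As a concrete illustration of definition (22) (our own sketch, with illustrative sizes), one can Monte Carlo estimate the Gaussian width of the set of unit-norm $s$-sparse vectors, a standard proxy for the spherical cap of an $\ell_1$ error cone; for this set the supremum in (22) has a closed form, namely the $\ell_2$ norm of the $s$ largest entries of $|g|$.

```python
# Monte Carlo estimate of the Gaussian width (22) for
# E = {u : ||u||_2 = 1, u has at most s nonzeros} (illustrative proxy set).
import numpy as np

rng = np.random.default_rng(0)
p, s, trials = 1000, 10, 2000

g = rng.standard_normal((trials, p))
top_s = np.sort(np.abs(g), axis=1)[:, -s:]              # s largest |g_i| per draw
width_mc = np.linalg.norm(top_s, axis=1).mean()         # E sup_{u in E} <g, u>
print(f"Monte Carlo width ~ {width_mc:.2f}; "
      f"sqrt(2 s log(p/s)) ~ {np.sqrt(2 * s * np.log(p / s)):.2f}")
```

Widths of this order, $O(\sqrt{s \log p})$, are what enter the bound (3) and Theorem 6 for $\ell_1$-constrained components.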
The empirical width $W_n(H; Z)$ can be seen as the supremum of a stochastic process. One way to upper bound the supremum of a stochastic process is generic chaining [19, 3, 20]: using generic chaining, we can upper bound the empirical width by the supremum of a corresponding Gaussian process, which is the Gaussian width.

Having bounded $Q_{2\rho\xi}(H; Z)$ and $W_n(H; Z)$, we arrive at the conclusion on the RE condition. Let $X \in \mathbb{R}^{n \times p}$ be a random matrix where each row is an independent copy of the sub-Gaussian random vector $Z \in \mathbb{R}^p$, and where $Z$ has sub-Gaussian norm $|||Z|||_{\psi_2} \le \sigma_x$ [21]. Let $\alpha = \inf_{u \in \mathbb{S}^{p-1}} \mathbb{E}[|\langle Z, u \rangle|]$, so that $\alpha > 0$ [14, 20]. We have the following lower bound for the RE condition. The proof of Theorem 4 is based on the proof of [20, Theorem 6.3], and we give it in Appendix H.2.

Theorem 4 (Restricted Eigenvalue Condition) Let $X$ be a sub-Gaussian design matrix that satisfies the assumptions above. If the SC condition (9) holds with a $\rho > 0$, then with probability at least $1 - \exp(-t^2/2)$, we have
$$ \inf_{u \in H} \|Xu\|_2 \ \ge\ c_1 \rho \sqrt{n} - c_2 w(H) - c_3 \rho t, \qquad (23) $$
where $c_1$, $c_2$ and $c_3$ are positive constants determined by $\sigma_x$, $\sigma_\omega$ and $\alpha$.

To get a $\kappa > 0$ in (6), one can simply choose $t = (c_1 \rho \sqrt{n} - c_2 w(H)) / 2 c_3 \rho$. Then, as long as $n > c_4 w^2(H)/\rho^2$ with $c_4 = c_2^2 / c_1^2$, we have
$$ \kappa \ =\ \inf_{u \in H} \frac{1}{\sqrt{n}} \|Xu\|_2 \ \ge\ \frac{1}{2} \Big( c_1 \rho - c_2 \frac{w(H)}{\sqrt{n}} \Big) \ >\ 0 $$
with high probability.

From the discussion above, if the SC condition holds and the sample size $n$ is large enough, then we can find a matrix $X$ such that the RE condition holds. Conversely, once there is a matrix $X$ such that the RE condition holds, SC must also be true. The proof is given in Appendix H.3.

Proposition 5 If $X$ is a matrix such that the RE condition (6) holds for $\Delta_i \in C_i$, then the SC condition (9) holds.

Proposition 5 demonstrates that the SC condition is necessary for the RE condition to be possible. If the SC condition does not hold, then there are $\{\Delta_i\}$ with $\Delta_i \neq 0$ for some $i = 1, \dots, k$, but $\|\sum_{i=1}^k \Delta_i\|_2 = 0$, which implies $\sum_{i=1}^k \Delta_i = 0$. Then, for every matrix $X$, we have $X \sum_{i=1}^k \Delta_i = 0$, and the RE condition cannot hold.

5 General Error Bound

Recall that the error bound in Theorem 1 is given in terms of the noise-design (ND) interaction
$$ s_n(\gamma) \ =\ \inf_{s > 0} \Big\{ s \ :\ \sup_{u \in s\bar H} \tfrac{1}{n} \omega^T X u \ \le\ \gamma s^2 \Big\}. \qquad (24) $$
In this section, we give a characterization of the ND interaction, which yields the final bound on the componentwise error as long as $n \ge n_0$, i.e., the sample complexity is satisfied. Let $\omega$ be a centered sub-Gaussian random vector with sub-Gaussian norm $|||\omega|||_{\psi_2} \le \sigma_\omega$, and let $X$ be a row-wise i.i.d. sub-Gaussian random matrix where each row $Z$ has sub-Gaussian norm $|||Z|||_{\psi_2} \le \sigma_x$. The ND interaction can be bounded by the following result; the proof of Lemma 3 is given in Appendix I.1.

Lemma 3 Let the design $X \in \mathbb{R}^{n \times p}$ be a row-wise i.i.d. sub-Gaussian random matrix, and let the noise $\omega \in \mathbb{R}^n$ be a centered sub-Gaussian random vector. Then
$$ s_n(\gamma) \ \le\ c\, \frac{w(\bar H)}{\gamma \sqrt{n}} $$
for some constant $c > 0$, with probability at least $1 - c_1 \exp(-c_2 w^2(\bar H)) - c_3 \exp(-c_4 n)$. The constant $c$ depends on $\sigma_x$ and $\sigma_\omega$.

Lemma 3 and Theorem 6 involve the Gaussian widths of $\bar H$ and $H$. By definition, both $H$ and $\bar H$ are built from a combination of the different cones; therefore, bounding their widths directly may be difficult. We have the following bounds on $w(H)$ and $w(\bar H)$ in terms of the widths of the component spherical caps. The proof of Lemma 4 is given in Appendix I.2.

Lemma 4 (Gaussian width bound) Let $H$ and $\bar H$ be as defined in (7) and (8) respectively. Then, we have
$$ w(H) = O\Big( \max_i w(C_i \cap \mathbb{S}^{p-1}) + \sqrt{\log k} \Big) \quad \text{and} \quad w(\bar H) = O\Big( \max_i w(C_i \cap \mathbb{B}^p) + \sqrt{\log k} \Big). $$
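Before assembling the final bound, the following sketch (ours; it reuses the polyhedral stand-in cones of the earlier $\delta_0$ sketch rather than actual error cones) gives a rough empirical view of Theorem 4: it samples points of the set $H$ in (7) and tracks $\min_u \tfrac{1}{\sqrt{n}}\|Xu\|_2$ over the samples as $n$ grows. Sampling only yields an upper estimate of the true infimum $\kappa$.

```python
# Rough empirical look at Theorem 4: kappa_hat = min over sampled u in H of
# ||X u||_2 / sqrt(n), for Gaussian designs of increasing sample size n.
# The cones are illustrative polyhedral stand-ins, as in the delta_0 sketch.
import numpy as np

rng = np.random.default_rng(2)
p, m, draws = 50, 5, 5000

A1 = rng.standard_normal((p, m)); A1 /= np.linalg.norm(A1, axis=0)
A2 = rng.standard_normal((p, m)); A2 /= np.linalg.norm(A2, axis=0)

def sample_H(size):
    """Points u = d1 + d2 with d_i in C_i and ||d1||_2 + ||d2||_2 = 1, as in (7)."""
    def unit_cone_points(A):
        x = rng.random((size, A.shape[1])) @ A.T
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    lam = rng.random((size, 1))                          # split the unit budget
    return lam * unit_cone_points(A1) + (1 - lam) * unit_cone_points(A2)

U = sample_H(draws)                                      # draws x p
for n in (25, 100, 400, 1600):
    X = rng.standard_normal((n, p))
    kappa_hat = np.min(np.linalg.norm(X @ U.T, axis=0)) / np.sqrt(n)
    print(f"n = {n:5d}: empirical RE constant over sampled H ~ {kappa_hat:.3f}")
```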
By applying Lemma 4, we can derive the error bound in terms of the Gaussian widths of the individual error cones. Following the deterministic bound in Theorem 1, we choose an appropriate $\gamma$ such that $\kappa^2 > \gamma$. Then, by combining the results of Theorem 1, Theorem 4, Lemma 3 and Lemma 4, we obtain the final form of the bound, as previewed in (3):

Theorem 6 For the estimator (2), let $C_i = \mathrm{cone}\{\Delta : R_i(\theta_i^* + \Delta) \le R_i(\theta_i^*)\}$, let the design $X$ be a random matrix with each row an independent copy of the sub-Gaussian random vector $Z$, let the noise $\omega$ be a centered sub-Gaussian random vector, and let $\mathbb{B}^p \subset \mathbb{R}^p$ be the centered unit Euclidean ball. If the sample size $n > c\,(\max_i w^2(C_i \cap \mathbb{S}^{p-1}) + \log k)/\rho^2$, then with probability at least $1 - \eta_1 k \exp(-\eta_2 \max_i w^2(C_i \cap \mathbb{S}^{p-1})) - \eta_3 \exp(-\eta_4 n)$, we have
$$ \sum_{i=1}^k \|\hat\theta_i - \theta_i^*\|_2 \ \le\ C\, \frac{\max_i w(C_i \cap \mathbb{B}^p) + \sqrt{\log k}}{\rho^2 \sqrt{n}}, \qquad (25) $$
for constants $c, C > 0$ that depend on the sub-Gaussian norms $|||Z|||_{\psi_2}$ and $|||\omega|||_{\psi_2}$.

Thus, assuming the SC condition in (9) is satisfied, the sample complexity and error bound of the estimator depend on the largest Gaussian width, rather than the sum of the Gaussian widths. The result can be viewed as a direct generalization of existing results for $k = 1$, where the SC condition is always satisfied, and the sample complexity and error are given by $w^2(C_1 \cap \mathbb{S}^{p-1})$ and $w(C_1 \cap \mathbb{B}^p)$ respectively [3, 8].

6 Application of General Bound

In this section, we instantiate the general error bounds on Morphological Component Analysis (MCA) and on low-rank plus sparse matrix decomposition. Comprehensive results are provided in Appendix D.

6.1 Morphological Component Analysis

In Morphological Component Analysis [10], we consider the following linear model
$$ y \ =\ X(\theta_1^* + \theta_2^*) + \omega, \qquad (26) $$
where the vector $\theta_1^*$ is sparse and $\theta_2^*$ is sparse under a rotation $Q$. Consider the following estimator:
$$ \min_{\theta_1, \theta_2} \ \|y - X(\theta_1 + \theta_2)\|_2^2 \quad \text{s.t.} \quad \|\theta_1\|_1 \le \|\theta_1^*\|_1,\ \ \|Q\theta_2\|_1 \le \|Q\theta_2^*\|_1, \qquad (27) $$
where the vector $y \in \mathbb{R}^n$ is the observation, the vectors $\theta_1, \theta_2 \in \mathbb{R}^p$ are the parameters we want to estimate, the matrix $X \in \mathbb{R}^{n \times p}$ is a sub-Gaussian random design, and the matrix $Q \in \mathbb{R}^{p \times p}$ is orthogonal. We assume $\theta_1^*$ and $Q\theta_2^*$ are $s_1$-sparse and $s_2$-sparse vectors respectively. Note that $\|Q\cdot\|_1$ is still a norm. Using the standard bound $w(C \cap \mathbb{B}^p) = O(\sqrt{s \log p})$ for the $\ell_1$ error cone of an $s$-sparse vector, and treating the SC constant $\rho$ as fixed, Theorem 6 yields the following error bound:
$$ \|\hat\theta_1 - \theta_1^*\|_2 + \|\hat\theta_2 - \theta_2^*\|_2 \ =\ O\Big( \max\big\{ \sqrt{s_1 \log p},\ \sqrt{s_2 \log p} \big\} \Big/ \sqrt{n} \Big). $$

6.2 Low-rank and Sparse Matrix Decomposition

To recover a sparse matrix and a low-rank matrix from their sum [6, 9], one can use the $\ell_1$ norm to induce sparsity and the nuclear norm to induce low rank. These two norms ensure that the sparsity and the rank of the estimated matrices are small. Suppose we have a rank-$r$ matrix $L^*$ and a sparse matrix $S^*$ with $s$ nonzero entries, $S^*, L^* \in \mathbb{R}^{d_1 \times d_2}$. Our observations come from the model
$$ Y_i \ =\ \langle X_i, L^* + S^* \rangle + E_i, \quad i = 1, \dots, n, $$
where each $X_i \in \mathbb{R}^{d_1 \times d_2}$ is a sub-Gaussian random design matrix and $E_i$ is the noise. The estimator takes the form:
$$ \min_{L, S} \ \sum_{i=1}^n \big( Y_i - \langle X_i, L + S \rangle \big)^2 \quad \text{s.t.} \quad |||L|||_* \le |||L^*|||_*,\ \ \|S\|_1 \le \|S^*\|_1. \qquad (28) $$
By using Theorem 6 and existing results on Gaussian widths (the nuclear-norm error cone of a rank-$r$ matrix has width $O(\sqrt{r(d_1+d_2)})$), the error bound is given by
$$ \|\hat L - L^*\|_2 + \|\hat S - S^*\|_2 \ =\ O\Big( \max\big\{ \sqrt{r(d_1+d_2)},\ \sqrt{s \log(d_1 d_2)} \big\} \Big/ \sqrt{n} \Big). $$
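A minimal sketch of the decomposition estimator (28), assuming cvxpy is available; the problem sizes, noise level, and the trace-based way of writing $\langle X_i, L+S \rangle$ are illustrative choices of ours, not prescriptions from the paper.

```python
# Sketch of estimator (28): least squares with oracle nuclear-norm and l1 budgets.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
d1, d2, r, s, n = 15, 15, 2, 10, 250                     # hypothetical sizes

# Ground truth: rank-r L* plus s-sparse S*.
L_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
S_star = np.zeros((d1, d2))
S_star.flat[rng.choice(d1 * d2, s, replace=False)] = rng.standard_normal(s)

Xs = rng.standard_normal((n, d1, d2))                    # sub-Gaussian designs X_i
y = np.einsum("nij,ij->n", Xs, L_star + S_star) + 0.1 * rng.standard_normal(n)

L, S = cp.Variable((d1, d2)), cp.Variable((d1, d2))
residuals = cp.hstack([y[i] - cp.trace(Xs[i].T @ (L + S)) for i in range(n)])
prob = cp.Problem(cp.Minimize(cp.sum_squares(residuals)),
                  [cp.normNuc(L) <= np.linalg.norm(L_star, "nuc"),
                   cp.sum(cp.abs(S)) <= np.abs(S_star).sum()])
prob.solve()
print(np.linalg.norm(L.value - L_star) + np.linalg.norm(S.value - S_star))
```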
7 Experimental Results

In this section, we confirm the theoretical results of the paper with some simple experiments under different settings. In our experiments we focus on MCA with $k = 2$. The design matrix $X$ is generated from a Gaussian distribution such that every entry of $X$ follows $N(0, 1)$. The noise $\omega$ is generated from a Gaussian distribution such that every entry of $\omega$ follows $N(0, 1)$. We implement our algorithm in MATLAB. We use synthetic data in all our experiments, and let the true signals be
$$ \theta_1^* = (\underbrace{1, \dots, 1}_{s_1}, 0, \dots, 0), \qquad Q\theta_2^* = (\underbrace{1, \dots, 1}_{s_2}, 0, \dots, 0). $$
We generate the data in different ways for our three experiments.

[Figure 2: (a) Effect of the parameter $\rho$ on the estimation error with noise $\omega \neq 0$; the parameter $\rho$ is 0, $1/\sqrt{2}$, or the value induced by a randomly sampled $Q$ ($\rho \approx 0.026$); axes: number of samples vs. $\|\hat\theta_1 - \theta_1^*\|_2 + \|\hat\theta_2 - \theta_2^*\|_2$. (b) Effect of the dimension $p$ on the fraction of successful recovery in the noiseless case, for $p \in \{20, 40, 80, 160\}$; axes: number of samples vs. fraction of successful recovery.]

7.1 Recovery From Noisy Observation

In our first experiment, we test the impact of $\rho$ on the estimation error. We choose three different matrices $Q$, and $\rho$ is determined by the choice of $Q$. The first $Q$ is given by random sampling: we sample a random orthogonal matrix $Q$ such that $Q_{ij} > 0$, and $\rho$ is lower bounded by (42). The second and third $Q$ are given by the identity matrix $I$ and its negative $-I$, for which $\rho = 1/\sqrt{2}$ and $\rho = 0$ respectively. We choose dimension $p = 1000$, and let $s_1 = s_2 = 1$. The number of samples $n$ varies between 1 and 1000. The observation $y$ is given by $y = X(\theta_1^* + \theta_2^*) + \omega$. In this experiment, given $Q$, for each $n$ we generate 100 pairs of $X$ and $\omega$. For each $(X, \omega)$ pair, we obtain a solution $\hat\theta_1$ and $\hat\theta_2$, and we average $\|\hat\theta_1 - \theta_1^*\|_2 + \|\hat\theta_2 - \theta_2^*\|_2$ over all pairs. Figure 2(a) shows the plot of the number of samples versus the average error. From Figure 2(a), we can see that the error curve for the random $Q$ lies between the curves of the two extreme cases, and a larger $\rho$ gives a lower curve. In Appendix E, we provide an additional experiment using the k-support norm [2].

7.2 Recovery From Noiseless Observation

In our second experiment, we test how the dimension $p$ affects successful recovery of the true value. We choose dimensions $p = 20$, $p = 40$, $p = 80$, and $p = 160$, and let $s_1 = s_2 = 1$. To average out the impact of $\rho$, for each sample size $n$ we sample 100 random orthogonal matrices $Q$. The observation $y$ is given by $y = X(\theta_1^* + \theta_2^*)$. For each solution $\hat\theta_1$ and $\hat\theta_2$ of (41), we calculate the proportion of $Q$ such that $\|\hat\theta_1 - \theta_1^*\|_2 + \|\hat\theta_2 - \theta_2^*\|_2 \le 10^{-4}$. We increase $n$ from 1 to 40; the resulting plot is Figure 2(b). From Figure 2(b), we find that the sample complexity required to recover $\theta_1^*$ and $\theta_2^*$ increases with the dimension $p$.

8 Conclusions

We present a simple estimator for general superposition models and give a purely geometric characterization, based on structural coherence, of when accurate estimation of each component is possible. Further, we establish the sample complexity of the estimator and upper bounds on the componentwise estimation error, and show that both, interestingly, depend on the largest Gaussian width among the spherical caps induced by the error cones corresponding to the component norms. Going forward, it will be interesting to investigate specific component structures which satisfy structural coherence, and also to extend our results to allow more general measurement models.

Acknowledgements: The research was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.
[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, 2012.
[3] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems, 2014.
[4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[5] P. Buhlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.
[7] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935–1967, 2012.
[8] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12:805–849, 2012.
[9] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[10] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
[11] R. Foygel and L. Mackey. Corrupted sensing: Novel guarantees for separating structured signals. IEEE Transactions on Information Theory, 60(2):1223–1247, 2014.
[12] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.
[13] M. B. McCoy and J. A. Tropp. The achievable performance of convex demixing. arXiv, 2013.
[14] S. Mendelson. Learning without concentration. Journal of the ACM, 62(3):21:1–21:25, 2015.
[15] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[16] S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time-data tradeoffs for linear inverse problems. arXiv e-prints, 2015.
[17] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.
[18] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[19] M. Talagrand. Upper and Lower Bounds for Stochastic Processes. A Series of Modern Surveys in Mathematics. Springer-Verlag Berlin Heidelberg, 2014.
[20] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. arXiv, 2014.
[21] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, pages 210–268. Cambridge University Press, 2012.
[22] R. Vershynin. Estimation in high dimensions: a geometric perspective. In Sampling Theory, a Renaissance, pages 3–66, 2015.
[23] J. Wright, A. Ganesh, K. Min, and Y. Ma. Compressive principal component pursuit. In IEEE International Symposium on Information Theory, pages 1276–1280, 2012.
[24] E. Yang and P. Ravikumar. Dirty statistical models. In Advances in Neural Information Processing Systems, pages 1–9, 2012.