# Structured Sparse Regression via Greedy Hard-thresholding

Prateek Jain (Microsoft Research India), Nikhil Rao (Technicolor), Inderjit Dhillon

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Several learning applications require solving high-dimensional regression problems where the relevant features belong to a small number of (overlapping) groups. For very large datasets and under standard sparsity constraints, hard thresholding methods have proven to be extremely efficient, but such methods require NP-hard projections when dealing with overlapping groups. In this paper, we show that such NP-hard projections can not only be avoided by appealing to submodular optimization, but that the resulting methods come with strong theoretical guarantees even in the presence of poorly conditioned data (say, when two features have correlation 0.99), which existing analyses cannot handle. These methods exhibit an interesting computation-accuracy trade-off and can be extended to significantly harder problems such as sparse overlapping groups. Experiments on both real and synthetic data validate our claims and demonstrate that the proposed methods are orders of magnitude faster than other greedy and convex relaxation techniques for learning with group-structured sparsity.

1 Introduction

High dimensional problems where the regressor belongs to a small number of groups play a critical role in many machine learning and signal processing applications, such as computational biology and multitask learning. In most of these cases the groups overlap, i.e., the same feature can belong to multiple groups. For example, gene pathways overlap in computational biology applications, and parent-child pairs of wavelet transform coefficients overlap in signal processing applications.

The existing state-of-the-art methods for solving such group sparsity structured regression problems can be categorized into two broad classes: a) convex relaxation based methods, and b) iterative hard thresholding (IHT) or greedy methods. In practice, IHT methods tend to be significantly more scalable than the (group-)lasso style methods that solve a convex program. But these methods require a certain projection operator which in general is NP-hard to compute, and often simple heuristics are used with relatively weak theoretical guarantees. Moreover, existing guarantees for both classes of methods require relatively restrictive assumptions on the data, like the Restricted Isometry Property or variants thereof [2, 7, 16], that are unlikely to hold in most common applications. In fact, even under such settings, the group sparsity based convex programs offer at most polylogarithmic gains over standard sparsity based methods [16].

Concretely, let us consider the following linear model:

y = Xw* + β,  (1)

where β ~ N(0, λ²I), X ∈ R^{n×p}, each row of X is sampled i.i.d. as x_i ~ N(0, Σ), 1 ≤ i ≤ n, and w* is a k*-group sparse vector, i.e., w* can be expressed in terms of only k* groups G_j ⊆ [p]. The existing analyses for both convex as well as hard thresholding based methods require κ = σ_1/σ_p ≤ c, where c is an absolute constant (say, 3) and σ_i is the i-th largest eigenvalue of Σ. This is a significantly restrictive assumption, as it requires all the features to be nearly independent of each other. For example, if features 1 and 2 have correlation more than, say, .99, then the restriction on κ required by the existing results does not hold.
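To make the conditioning issue concrete, here is a minimal NumPy sketch of model (1) with two highly correlated features; the dimensions, the support of w*, and the covariance below are our own illustrative choices, not values from the paper.

```python
import numpy as np

# Toy instance of model (1): two highly correlated features make Sigma
# ill-conditioned, so analyses requiring kappa = sigma_1/sigma_p <= c break down.
p, n, lam = 100, 500, 0.1
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = 0.99          # corr(feature 1, feature 2) = 0.99

eigvals = np.linalg.eigvalsh(Sigma)        # ascending eigenvalues of Sigma
kappa = eigvals[-1] / eigvals[0]
print(f"condition number kappa = {kappa:.1f}")   # ~199, far above any small constant c

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows x_i ~ N(0, Sigma)
w_star = np.zeros(p)
w_star[:10] = 1.0                                          # an illustrative sparse regressor
y = X @ w_star + lam * rng.standard_normal(n)              # y = X w* + beta, beta ~ N(0, lam^2 I)
```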
Moreover, in this setting (i.e., when κ = O(1)), the number of samples required to exactly recover w* (with λ = 0) is n = Ω(s + k* log M) [16], where s is the maximum support size of a union of k* groups and M is the number of groups. In contrast, if one were to directly use sparse regression techniques (ignoring group sparsity altogether), then the number of samples is n = Ω(s log p). Hence, even in the restricted setting of κ = O(1), group-sparse regression improves upon standard sparse regression only by logarithmic factors.

Greedy, iterative hard thresholding (IHT) methods have been considered for group sparse regression problems, but they involve NP-hard projections onto the constraint set [3]. While this can be circumvented using approximate operations, the guarantees they provide are along the same lines as the ones that exist for convex methods.

In this paper, we show that IHT schemes with approximate projections for the group sparsity problem yield much stronger guarantees. Specifically, our result holds for arbitrarily large κ and arbitrary group structures. In particular, using IHT with greedy projections, we show that n = Ω((s + κ² k* log M) log(1/ε)) samples suffice to recover an ε-approximation to w* when λ = 0. On the other hand, IHT for standard sparse regression [10] requires n = Ω(κ² s log p). Moreover, for general noise variance λ², our method recovers ŵ s.t. ‖ŵ − w*‖ ≤ ε + λ √((s + κ² k* log M)/n). On the other hand, the existing state-of-the-art result for IHT for group sparsity [4] guarantees ‖ŵ − w*‖ ≤ λ √(s + k* log M) only for κ ≤ 3, i.e., ŵ is not a consistent estimator of w* even for a small condition number.

Our analysis is based on an extension of the sparse regression result of [10], which requires exact projections. However, a critical challenge in the case of overlapping groups is that projection onto the set of group-sparse vectors is NP-hard in general. To alleviate this issue, we use the connection between submodularity and overlapping group projections, and show that a greedy selection based projection is good enough. The main contribution of this work is to carefully combine this greedy projection procedure with hard thresholding iterates so as to guarantee convergence to the global optimum as long as enough i.i.d. data points are generated from model (1).

Moreover, the simplicity of our hard thresholding operator allows us to easily extend it to more complicated sparsity structures. In particular, we show that the methods we propose can be generalized to the sparse overlapping group setting, and to hierarchies of (overlapping) groups.

We also provide extensive experiments on both real and synthetic datasets showing that our methods are not only faster than several other approaches, but are also accurate despite performing approximate projections. Indeed, even for poorly-conditioned data, IHT methods are an order of magnitude faster than other greedy and convex methods. We observe a similar phenomenon when dealing with sparse overlapping groups.

1.1 Related Work

Several papers, notably [5] and references therein, have studied convergence properties of IHT methods for sparse signal recovery under standard RIP conditions. [10] generalized the method to settings where RIP does not hold, and also to the low rank matrix recovery setting. [21] used a similar analysis to obtain results for nonlinear models. However, these techniques apply only to cases where exact projections can be performed onto the constraint set.
Forward greedy selection schemes for sparse [9] and group sparse [18] constrained programs have been considered previously, where a single group is added at each iteration. The authors in [2] propose a variant of CoSaMP to solve problems of the kind we consider, and again, these methods require exact projections. Several works have studied approximate projections in the context of IHT [17, 6, 12]. However, these results require that the data satisfy RIP-style conditions, which typically do not hold in real-world regression problems. Moreover, these analyses do not guarantee a consistent estimate of the optimal regressor when the measurements have zero-mean random noise. In contrast, we provide results under a more general RSC/RSS condition, which is weaker [20], and provide crisp rates for the error bounds when the noise in the measurements is random.

2 Group Iterative Hard Thresholding for Overlapping Groups

In this section, we formally set up the group sparsity constrained optimization problem, and then briefly present the IHT algorithm for the same. Suppose we are given a set of M groups that can arbitrarily overlap, G = {G_1, . . . , G_M}, where G_i ⊆ [p] and ∪_{i=1}^M G_i = {1, 2, . . . , p}. We let ‖w‖ denote the Euclidean norm of w, and supp(w) the support of w. For any vector w ∈ R^p, [8] defined the overlapping group norm as

‖w‖_G := inf Σ_{i=1}^M ‖a_{G_i}‖  s.t.  Σ_{i=1}^M a_{G_i} = w,  supp(a_{G_i}) ⊆ G_i.  (2)

We also introduce the notion of group-support of a vector and its group-ℓ0 pseudo-norm:

G-supp(w) := {i s.t. ‖a_{G_i}‖ > 0},   ‖w‖_0^G := inf Σ_{i=1}^M 1{‖a_{G_i}‖ > 0},  (3)

where the a_{G_i} satisfy the constraints of (2), and 1{·} is the indicator function, taking the value 1 if the condition is satisfied and 0 otherwise. For a set of groups G̃, supp(G̃) = ∪_{i ∈ G̃} G_i. Similarly, for a set of coordinates S, G-supp(S) = G-supp(w_S).

Suppose we are given a function f : R^p → R and M groups G = {G_1, . . . , G_M}. The goal is to solve the following group sparsity structured problem (GS-Opt):

GS-Opt:  min_w f(w)  s.t.  ‖w‖_0^G ≤ k.  (4)

f can be thought of as a loss function over the training data, for instance the logistic or least squares loss. In the high dimensional setting, problems of the form (4) are somewhat ill posed and are NP-hard in general. Hence, additional assumptions on the loss function f are warranted to guarantee a reasonable solution. Here, we focus on problems where f satisfies the restricted strong convexity and smoothness conditions:

Definition 2.1 (RSC/RSS). The function f : R^p → R satisfies restricted strong convexity (RSC) and restricted strong smoothness (RSS) of order k with constants α_k and L_k, respectively, if the following holds:

α_k I ⪯ H(w) ⪯ L_k I,

where H(w) is the Hessian of f at any w ∈ R^p s.t. ‖w‖_0^G ≤ k.

Note that the goal of our algorithms/analysis is to solve the problem for arbitrary α_k > 0 and L_k < ∞. In contrast, adapting existing IHT results to this setting leads to results that only allow L_k/α_k less than a constant (say, 3). We are especially interested in the linear model described in (1), and in recovering w* consistently (i.e., recovering w* exactly as n → ∞). To this end, we look to solve the following (non-convex) constrained least squares problem (GS-LS):

GS-LS:  ŵ = arg min_w f(w) := (1/2n) ‖y − Xw‖_2²  s.t.  ‖w‖_0^G ≤ k,  (5)

with k ≥ k* being a positive, user-defined integer¹. In this paper, we propose to solve (5) using an Iterative Hard Thresholding (IHT) scheme.
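As a small illustration of the group-ℓ0 pseudo-norm (3), the following sketch (our own toy example, not from the paper) covers the support of w greedily with groups. The greedy cover only upper bounds the infimum in (3), but for this example it is exact, and it shows how overlap makes ‖w‖_0^G smaller than a naive count of the groups that intersect supp(w).

```python
import numpy as np

# Three contiguous groups of size 4, each overlapping the next on 2 indices.
groups = [np.array([0, 1, 2, 3]),      # G_1
          np.array([2, 3, 4, 5]),      # G_2 (overlaps G_1 on {2, 3})
          np.array([4, 5, 6, 7])]      # G_3

w = np.zeros(8)
w[[2, 3]] = [1.0, -2.0]                # supported entirely inside the overlap of G_1 and G_2

def group_l0_upper_bound(w, groups):
    """Greedy set-cover style upper bound on ||w||_0^G: repeatedly pick the
    group covering the most not-yet-covered support coordinates."""
    remaining = set(np.flatnonzero(w))
    count = 0
    while remaining:
        best = max(groups, key=lambda G: len(remaining & set(G)))
        covered = remaining & set(best)
        if not covered:
            raise ValueError("support not covered by the given groups")
        remaining -= covered
        count += 1
    return count

print(group_l0_upper_bound(w, groups))   # 1: a single group (G_1 or G_2) explains w
print(len(np.flatnonzero(w)))            # plain coordinate sparsity is 2
```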
IHT methods iteratively take a gradient descent step and then project the resulting vector g onto the (non-convex) constraint set of group sparse vectors, i.e.,

P_k^G(g) = arg min_w ‖w − g‖_2²  s.t.  ‖w‖_0^G ≤ k.  (6)

Computing the gradient is easy, and hence the complexity of the overall algorithm depends heavily on the complexity of performing the aforementioned projection. Algorithm 1 details the IHT procedure for the group sparsity problem (4). Throughout the paper we consider the same high-level procedure, but use different projection operators P̂_k^G(g) for different settings of the problem.

¹ Typically chosen via cross-validation.

Algorithm 1: IHT for Group-sparsity
1: Input: data y, X, parameter k, iterations T, step size η
2: Initialize: w^0 ∈ R^p, a k-group sparse vector
3: for t = 1, 2, . . . , T do
4:   g^t = w^{t−1} − η ∇f(w^{t−1})
5:   w^t = P̂_k^G(g^t), where P̂_k^G(·) performs an (approximate) projection
6: end for
7: Output: w^T

Algorithm 2: Greedy Projection
Require: g ∈ R^p, parameter k, groups G
1: û = 0, v = g, Ĝ = ∅
2: for t = 1, 2, . . . , k do
3:   Find G* = arg max_{G ∈ G \ Ĝ} ‖v_G‖
4:   Ĝ = Ĝ ∪ {G*}
5:   û = û + v_{G*}
6:   v = v − v_{G*}
7: end for
8: Output: û := P̂_k^G(g), Ĝ = G-supp(û)

2.1 Submodular Optimization for General G

Suppose we are given a vector g ∈ R^p, which needs to be projected onto the constraint set ‖u‖_0^G ≤ k (see (6)). Solving (6) is NP-hard when G contains arbitrary overlapping groups. To overcome this, P_k^G(·) can be replaced by an approximate operator P̂_k^G(·) (step 5 of Algorithm 1). Indeed, the projection problem reduces to a submodular optimization problem [3], for which the standard greedy procedure can be used (Algorithm 2). For completeness, we detail this in Appendix A, where we also prove the following:

Lemma 2.2. Given an arbitrary vector g ∈ R^p, suppose we obtain û, Ĝ as the output of Algorithm 2 with input g and target group sparsity k. Let u* = P_{k̃}^G(g) be as defined in (6), for any k̃ ≤ k. Then

‖û − g‖_2² ≤ e^{−k/k̃} ‖(g)_{supp(u*)}‖_2² + ‖u* − g‖_2²,

where e is the base of the natural logarithm.

Note that the term with the exponent in Lemma 2.2 approaches 0 as k increases. Increasing k implies that more samples are needed for recovery of w*. Hence, this lemma hints at the possibility of trading off sample complexity for better accuracy, despite the projections being approximate. See Section 3 for more details. Algorithm 2 can be applied to any G, and is extremely efficient.

2.2 Incorporating Full Corrections

IHT methods can be improved by incorporating corrections after each projection step. This merely entails adding the following step to Algorithm 1 after step 5:

w^t = arg min_{w̃} f(w̃)  s.t.  supp(w̃) = supp(P̂_k^G(g^t)).

When f(·) is the least squares loss, as we consider here, this step can be solved efficiently using Cholesky decompositions (via the backslash operator in MATLAB). We refer to this procedure as IHT-FC. Fully corrective methods in greedy algorithms typically yield significant improvements, both theoretically and in practice [10].

3 Theoretical Performance Bounds

We now provide theoretical guarantees for Algorithm 1 when applied to the overlapping group sparsity problem (4). We then specialize the results to the linear regression model (5).

Theorem 3.1. Let w* = arg min_{w : ‖w‖_0^G ≤ k*} f(w), and let f satisfy RSC/RSS with constants α_{k'}, L_{k'}, respectively (see Definition 2.1). Set k ≥ 32 (L_{k'}/α_{k'})² k* log(L_{k'} ‖w*‖_2 / ε) and let k' = 2k + k*. Suppose we run Algorithm 1 with η = 1/L_{k'} and projections computed according to Algorithm 2.
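To make the procedure concrete, here is a minimal NumPy sketch of Algorithms 1 and 2 for the least squares loss in (5). It is an illustrative implementation under our own interface choices (groups passed as a list of index arrays, a fixed iteration count, an optional IHT-FC correction), not the authors' reference code.

```python
import numpy as np

def greedy_group_projection(g, groups, k):
    """Algorithm 2: greedily pick k groups with the largest residual norm and
    return the (approximate) projection of g onto {w : ||w||_0^G <= k}."""
    u = np.zeros_like(g)
    v = g.copy()
    chosen = []
    remaining = set(range(len(groups)))
    for _ in range(min(k, len(groups))):
        # group with the largest Euclidean norm of the current residual v
        best = max(remaining, key=lambda i: np.linalg.norm(v[groups[i]]))
        remaining.discard(best)
        chosen.append(best)
        u[groups[best]] += v[groups[best]]   # u = u + v_{G*}
        v[groups[best]] = 0.0                # v = v - v_{G*}
    return u, chosen

def iht_group_sparse(X, y, groups, k, step, iters=100, full_correction=False):
    """Algorithm 1 (IHT) for f(w) = ||y - Xw||^2 / (2n) with the greedy
    projection above; full_correction=True gives the IHT-FC variant."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        g = w - step * grad
        w, chosen = greedy_group_projection(g, groups, k)
        if full_correction:
            # re-fit least squares restricted to the selected support (IHT-FC)
            S = np.unique(np.concatenate([groups[i] for i in chosen]))
            w = np.zeros(p)
            w[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
    return w
```

For the Gaussian design of Theorem 3.2 below, one would take the step size on the order of 1/(4σ_max) and k a constant multiple of κ²k*; in practice both are chosen by cross-validation, as in Section 5.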
Then, the following holds after t + 1 iterations:

‖w^{t+1} − w*‖_2 ≤ (1 − α_{k'}/(10 L_{k'})) ‖w^t − w*‖_2 + γ + ε α_{k'}/(10 L_{k'}),

where γ = (2/L_{k'}) max_{S : |G-supp(S)| ≤ k} ‖(∇f(w*))_S‖_2. Specifically, the output of the T = O((L_{k'}/α_{k'}) log(‖w*‖_2/ε))-th iteration of Algorithm 1 satisfies:

‖w^T − w*‖_2 ≤ 2ε + (10 L_{k'}/α_{k'}) γ.

The proof uses the fact that Algorithm 2 performs approximately good projections, combined with results from convex analysis (RSC/RSS) and a careful setting of parameters. We prove this result in Appendix B.

Theorem 3.1 shows that Algorithm 1 recovers w* up to O(γ) error. If ‖arg min_w f(w)‖_0^G ≤ k, then γ = 0. In general our result has an additive error, which is weaker than what one can obtain for a convex optimization problem. However, for typical statistical problems, we show that γ is small and gives us nearly optimal statistical generalization error (for example, see Theorem 3.2).

Theorem 3.1 displays an interesting interplay between the desired accuracy ε and the penalty γ we pay as a result of performing approximate projections. Specifically, as ε is made small, k becomes large, and thus so does γ. Conversely, we can let ε be large so that the projections are coarse, but incur a smaller penalty via the γ term; since the projections need not be very accurate in this case, we can also get away with fewer iterations. Thus, there is a tradeoff between the estimation error ε and the model selection error γ. Also note that the dependence of k on 1/ε is only logarithmic.

We stress that our results do not hold for arbitrary approximate projection operators. Our proof critically uses the greedy scheme (Algorithm 2) via Lemma 2.2. Also, as discussed in Section 4, the proof easily extends to other structured sparsity sets that allow such greedy selection steps. We obtain a similar result to [10] for the standard sparsity case, i.e., when the groups are singletons; however, our proof is significantly simpler and allows for a significantly easier setting of the step size η.

3.1 Linear Regression Guarantees

We next turn to the standard linear regression model considered in (5). To the best of our knowledge, this is the first consistency result for overlapping group sparsity problems, especially when the data can be arbitrarily conditioned. Recall that σ_max (σ_min) is the maximum (minimum) singular value of Σ, and κ := σ_max/σ_min is the condition number of Σ.

Theorem 3.2. Let the observations y follow the model in (1). Suppose w* is k*-group sparse and let f(w) := (1/2n)‖Xw − y‖_2². Let the number of samples satisfy

n ≥ Ω((s + κ² k* log M) log(κ/ε)),

where s = max_{w : ‖w‖_0^G ≤ k} |supp(w)|. Then, applying Algorithm 1 with k = 8 κ² k* log(κ/ε) and η = 1/(4σ_max) guarantees the following after T = O(κ log(‖w*‖_2/ε)) iterations (w.p. ≥ 1 − 1/n^8):

‖w^T − w*‖_2 ≤ ε + O(λ √((s + κ² k* log M)/n)).

Note that one can ignore the group sparsity constraint and instead recover the (at most) s-sparse vector w* using IHT methods for ℓ0 optimization [10]. However, the corresponding sample complexity is n ≥ κ² s log(p). Hence, for an ill conditioned Σ, using group sparsity based methods provides a significantly stronger result, especially when the groups overlap significantly.

Note that the number of samples required increases logarithmically with the accuracy ε. Theorem 3.2 thus displays an interesting phenomenon: by obtaining more samples, one can achieve a smaller recovery error while incurring a larger approximation error (since we choose more groups).

Our proof critically requires that, when restricted to group sparse vectors, the least squares objective f(w) = (1/2n)‖y − Xw‖_2² is strongly convex as well as strongly smooth:

Lemma 3.3.
Let X ∈ R^{n×p} be such that each row x_i ~ N(0, Σ). Let w ∈ R^p be k-group sparse over groups G = {G_1, . . . , G_M}, i.e., ‖w‖_0^G ≤ k, and let s = max_{w : ‖w‖_0^G ≤ k} |supp(w)|. Let the number of samples satisfy n ≥ C (k log M + s) for a large enough constant C. Then, with probability ≥ 1 − 1/n^{10}, f(w) = (1/2n)‖y − Xw‖_2² satisfies RSC and RSS of order k with constants α_k = Ω(σ_min) and L_k = O(σ_max).

We prove Lemma 3.3 in Appendix C. Theorem 3.2 then follows by combining Lemma 3.3 with Theorem 3.1. Note that in the least squares case, these are Restricted Eigenvalue conditions on the matrix X, which, as explained in [20], are much weaker than traditional RIP assumptions on the data. In particular, RIP requires almost zero correlation between any two features, while our assumption allows arbitrarily high correlations, albeit at the cost of a larger number of samples.

3.2 IHT with Exact Projections P_k^G

We now consider the setting where P_k^G(·) can be computed exactly and efficiently for any k. Examples include the dynamic programming based method of [3] for certain group structures, or Algorithm 2 when the groups do not overlap. Since the exact projection operator can be arbitrary, our proof of Theorem 3.1 does not apply directly in this case. However, we show that by exploiting the structure of hard thresholding, we can still obtain a similar result:

Theorem 3.4. Let w* = arg min_{w : ‖w‖_0^G ≤ k*} f(w). Let f satisfy RSC/RSS with constants α_{2k+k*}, L_{2k+k*}, respectively (see Definition 2.1). Then, the following holds for the T = O((L_{k'}/α_{k'}) log(‖w*‖_2/ε))-th iterate of Algorithm 1 (with η = 1/L_{2k+k*}) with P̂_k^G(·) = P_k^G(·) being the exact projection:

‖w^T − w*‖_2 ≤ ε + (10 L_{k'}/α_{k'}) γ,

where k' = 2k + k*, k = O((L_{k'}/α_{k'})² k*), and γ = (2/L_{k'}) max_{S : |G-supp(S)| ≤ k} ‖(∇f(w*))_S‖_2.

See Appendix D for a detailed proof. Note that, unlike for the greedy projection method (see Theorem 3.1), here k is independent of ε. Also, in the linear model, the above result again leads to a consistent estimate of w*.

4 Extension to Sparse Overlapping Groups (SoG)

The SoG model generalizes the overlapping group sparse model by allowing the selected groups themselves to be sparse. Given positive integers k_1, k_2 and a set of groups G, IHT for SoG performs projections onto the following set:

{w = Σ_i a_{G_i} : ‖w‖_0^G ≤ k_1, ‖a_{G_i}‖_0 ≤ k_2}.  (7)

As in the case of the overlapping group lasso, projection onto (7) is NP-hard in general. Motivated by our greedy approach in Section 2, we propose a similar method for SoG (see Algorithm 3). The algorithm greedily selects groups and retains only their top-k_2 entries by magnitude. Below, we show that IHT (Algorithm 1) combined with the greedy projection (Algorithm 3) indeed converges to the optimal solution. Moreover, our experiments (Section 5) reveal that this method, when combined with full corrections, yields highly accurate results significantly faster than the state-of-the-art.

We suppose that there exists a set of supports S_{k*} such that supp(w*) ∈ S_{k*}. Then, we obtain the following result, proved in Appendix E:

Theorem 4.1. Let w* = arg min_{w : supp(w) ∈ S_{k*}} f(w), where S_{k*} ⊆ S_k ⊆ {0, 1}^p is a fixed set of supports parameterized by k*. Let f satisfy RSC/RSS with constants α_{2k+k*}, L_{2k+k*}, respectively. Furthermore, assume that there exists an approximately good projection operator for the set defined in (7) (for example, Algorithm 3). Then, the following holds for the T = O((L_{2k+k*}/α_{2k+k*}) log(‖w*‖_2/ε))-th iterate of Algorithm 1:

‖w^T − w*‖_2 ≤ 2ε + (10 L_{2k+k*}/α_{2k+k*}) γ,

where k = O((L_{2k+k*}/α_{2k+k*})² k* log(L_{2k+k*} ‖w*‖_2/ε)) and γ = (2/L_{2k+k*}) max_{S ∈ S_k} ‖(∇f(w*))_S‖_2.

Algorithm 3: Greedy Projections for SoG
Require: g ∈ R^p, parameters k_1, k_2, groups G
1: û = 0, v = g, Ĝ = ∅, Ŝ = ∅
2: for t = 1, 2, . . . , k_1 do
3:   Find G* = arg max_{G ∈ G \ Ĝ} ‖v_G‖
4:   Ĝ = Ĝ ∪ {G*}
5:   Let S be the indices of the top k_2 entries of v_{G*} by magnitude
6:   Define ṽ ∈ R^p with ṽ_S = (v_{G*})_S and ṽ_i = 0 for i ∉ S
7:   Ŝ = Ŝ ∪ S
8:   û = û + ṽ
9:   v = v − ṽ
10: end for
11: Output: û, Ĝ, Ŝ

Similar to Theorem 3.1, there is a tradeoff between obtaining accurate projections (small ε) and the model mismatch γ. Specifically, in this case one can obtain a small ε by increasing k_1, k_2 in Algorithm 3; however, this means we select a large number of groups, and consequently γ increases. A result similar to Theorem 3.2 can be obtained for the case when f is the least squares loss. Specifically, the sample complexity evaluates to n = Ω(κ² (k* log M + k* k_2 log(max_i |G_i|))). We derive the least squares results in Appendix F.

An interesting extension of the SoG case is a hierarchy of overlapping, sparsely activated groups. When the groups at each level do not overlap, this reduces to the case considered in [11]. However, our theory shows that when a corresponding approximate projection operator is defined for the hierarchical overlapping case (extending Algorithm 3), IHT methods can be used to obtain the solution in an efficient manner.

5 Experiments and Results

Figure 1: (From left to right) Objective value as a function of time for various methods (IHT, IHT+FC, CoGEnT, FW, GOMP), when data is well conditioned and poorly conditioned. The latter two figures show the phase transition plots (number of measurements vs. condition number) for poorly conditioned data, for IHT and GOMP respectively.

In this section, we empirically compare and contrast our proposed group IHT methods against existing approaches for the overlapping group sparsity problem. At a high level, we observe that our proposed variants of IHT indeed outperform the existing state-of-the-art methods for group-sparse regression in terms of time complexity. Encouragingly, IHT also performs competitively with the existing methods in terms of accuracy. In fact, our results on the breast cancer dataset show a 10% relative improvement in accuracy over existing methods.

Greedy methods for group sparsity have been shown to outperform proximal point schemes, and hence we restrict our comparison to greedy procedures. We compared the following methods: our algorithm with (IHT-FC) and without (IHT) the fully corrective step, the Frank-Wolfe (FW) method [19], CoGEnT [15], and Group OMP (GOMP) [18]. All relevant hyper-parameters were chosen via a grid search, and experiments were run on a MacBook laptop with a 2.5 GHz processor and 16 GB memory. Additional experimental results are presented in Appendix G.

Figure 2: (Left) SoG: error vs. time comparison for various methods. (Center) SoG: reconstruction of the true signal (top) from IHT-FC (middle) and CoGEnT (bottom). (Right) Tumor classification: misclassification rate of the various methods:

| Method  | Error % | time (sec) |
|---------|---------|------------|
| FW      | 29.41   | 6.4538     |
| IHT     | 27.94   | 0.0400     |
| GOMP    | 25.01   | 0.2891     |
| CoGEnT  | 23.53   | 0.1414     |
| IHT-FC  | 21.65   | 0.1601     |

Synthetic data, well conditioned: We first compared the various greedy schemes for solving the overlapping group sparsity problem on synthetic data.
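A matching sketch of the SoG projection (Algorithm 3), under the same assumptions as the earlier IHT sketch (groups passed as a list of index arrays); it can be dropped into that IHT loop in place of greedy_group_projection to obtain the SoG variant used in the experiments below.

```python
import numpy as np

def greedy_sog_projection(g, groups, k1, k2):
    """Sketch of Algorithm 3: greedily pick k1 groups, keeping only the
    top-k2 entries (by magnitude) of each selected group's residual block."""
    u = np.zeros_like(g)
    v = g.copy()
    chosen_groups, chosen_support = [], []
    remaining = set(range(len(groups)))
    for _ in range(min(k1, len(groups))):
        best = max(remaining, key=lambda i: np.linalg.norm(v[groups[i]]))
        remaining.discard(best)
        chosen_groups.append(best)
        block = groups[best]
        # global indices of the top-k2 entries of v restricted to the selected group
        top = block[np.argsort(-np.abs(v[block]))[:k2]]
        chosen_support.extend(top.tolist())
        u[top] += v[top]          # u = u + v-tilde
        v[top] = 0.0              # v = v - v-tilde
    return u, chosen_groups, sorted(set(chosen_support))
```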
We generated M = 1000 groups of contiguous indices, each of size 25; the last 5 entries of one group overlap with the first 5 of the next. We randomly set 50 of these groups to be active, populated by uniform [−1, 1] entries. This yields w* ∈ R^p with p ≈ 22000. X ∈ R^{n×p} with n = 5000 and X_ij i.i.d. N(0, 1). Each measurement is corrupted with additive white Gaussian noise (AWGN) with standard deviation λ = 0.1. IHT methods achieve orders of magnitude speedup compared to the competing schemes, and achieve almost the same (final) objective function value despite approximate projections (Figure 1, left).

Synthetic data, poorly conditioned: Next, we consider the exact same setup, but with each row of X given by x_i ~ N(0, Σ) with κ = σ_max(Σ)/σ_min(Σ) = 10. Figure 1 (center-left) again shows the advantages of using IHT methods; IHT-FC is about 10 times faster than the next best method, CoGEnT. We next generate phase transition plots for recovery by our method (IHT) as well as the state-of-the-art GOMP method. We generate vectors in the same vein as the above experiment, with M = 500, B = 15, k = 25, p ≈ 5000. We vary the condition number of the data covariance Σ as well as the number of measurements n. Figure 1 (center-right and right) shows the phase transition plots as the measurements and the condition number are varied, for IHT and GOMP respectively. The results are averaged over 10 independent runs. It can be seen that even for condition numbers as high as 200, n ≈ 1500 measurements suffice for IHT to exactly recover w*, whereas GOMP in the same setting is not able to recover w* even once.

Tumor classification, breast cancer dataset: We next compare the aforementioned methods on a gene selection problem for breast cancer tumor classification. We use the data used in [8]². We ran a 5-fold cross validation scheme to choose parameters, varying the respective hyper-parameters over {2^-5, 2^-4, . . . , 2^3}, k ∈ {2, 5, 10, 15, 20, 50, 100}, and {2^3, 2^4, . . . , 2^13}. Figure 2 (right) shows that the vanilla hard thresholding method is competitive despite performing approximate projections, and that the method with full corrections obtains the best performance among the methods considered. We randomly chose 15% of the data to test on.

Sparse overlapping group lasso: Finally, we study the sparse overlapping group (SoG) problem that was introduced and analyzed in [14] (Figure 2). We perform projections as detailed in Algorithm 3. We generated synthetic vectors with 100 groups of size 50, randomly selected 5 groups to be active, and within each active group set only 30 coefficients to be non-zero. The groups themselves overlap, with the last 10 entries of one group shared with the first 10 of the next, yielding p ≈ 4000. We chose the best parameters from a grid, and set k = 2k* for the IHT methods.

6 Conclusions and Discussion

We proposed a greedy IHT method that can be applied to regression problems over sets of group sparse vectors. Our proposed solution is efficient, scalable, and provides fast convergence guarantees under general RSC/RSS style conditions, unlike existing methods. We extended our analysis to handle even more challenging structures like sparse overlapping groups. Our experiments show that IHT methods achieve fast, accurate results even with greedy and approximate projections.

² Downloaded at http://cbio.ensmp.fr/~ljacob/

References

[1] Francis Bach. Convex analysis and optimization with submodular functions: A tutorial. arXiv preprint arXiv:1010.4207, 2010.
[2] Richard G. Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde.
Model-based compressive sensing. Information Theory, IEEE Transactions on, 56(4):1982-2001, 2010.
[3] Nirav Bhan, Luca Baldassarre, and Volkan Cevher. Tractability of interpretability via selection of group-sparse models. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pages 1037-1041. IEEE, 2013.
[4] Thomas Blumensath and Mike E. Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. Information Theory, IEEE Transactions on, 55(4):1872-1882, 2009.
[5] Thomas Blumensath and Mike E. Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. Selected Topics in Signal Processing, IEEE Journal of, 4(2):298-309, 2010.
[6] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. Information Theory, IEEE Transactions on, 61(9):5129-5147, 2015.
[7] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371-3412, 2011.
[8] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433-440. ACM, 2009.
[9] Prateek Jain, Ambuj Tewari, and Inderjit S. Dhillon. Orthogonal matching pursuit with replacement. In Advances in Neural Information Processing Systems, pages 1215-1223, 2011.
[10] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685-693, 2014.
[11] Rodolphe Jenatton, Julien Mairal, Francis R. Bach, and Guillaume R. Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 487-494, 2010.
[12] Anastasios Kyrillidis and Volkan Cevher. Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 2216-2220. IEEE, 2012.
[13] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265-294, 1978.
[14] Nikhil Rao, Christopher Cox, Rob Nowak, and Timothy T. Rogers. Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In Advances in Neural Information Processing Systems, pages 2202-2210, 2013.
[15] Nikhil Rao, Parikshit Shah, and Stephen Wright. Forward-backward greedy algorithms for atomic norm regularization. Signal Processing, IEEE Transactions on, 63(21):5798-5811, 2015.
[16] Nikhil S. Rao, Ben Recht, and Robert D. Nowak. Universal measurement bounds for structured sparse signal recovery. In International Conference on Artificial Intelligence and Statistics, pages 942-950, 2012.
[17] Parikshit Shah and Venkat Chandrasekaran. Iterative projections for signal identification on manifolds: Global recovery guarantees. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 760-767. IEEE, 2011.
[18] Grzegorz Swirszcz, Naoki Abe, and Aurelie C. Lozano. Grouped orthogonal matching pursuit for variable selection and prediction. In Advances in Neural Information Processing Systems, pages 1150-1158, 2009.
[19] Ambuj Tewari, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems.
In Advances in Neural Information Processing Systems, pages 882-890, 2011.
[20] Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
[21] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 127-135, 2014.