Published as a conference paper at ICLR 2023

LEARNING SPARSE GROUP MODELS THROUGH BOOLEAN RELAXATION

Yijie Wang1, Yuan Zhou2, Xiaoqing Huang3, Kun Huang3, Jie Zhang4, Jianzhu Ma5
1Computer Science Department, Indiana University Bloomington
2Yau Mathematical Sciences Center and Department of Mathematical Sciences, Tsinghua University
3Department of Biostatistics & Health Data Science, Indiana University
4Department of Medical and Molecular Genetics, Indiana University
5Institute for AI Industry Research, Tsinghua University
yijwang@iu.edu, {yuan-zhou, majianzhu}@tsinghua.edu.cn (equal contribution)

ABSTRACT

We introduce an efficient algorithmic framework for learning sparse group models, formulated as the natural convex relaxation of a cardinality-constrained program with Boolean variables. We provide theoretical techniques to characterize the equivalent condition under which the relaxation achieves the exact integral optimal solution, as well as a rounding algorithm to produce a feasible integral solution when the optimal relaxation solution is fractional. We demonstrate the power of our equivalent condition by applying it to two ensembles of random problem instances that are challenging and widely used in the literature, and we prove that our method achieves exactness with overwhelming probability and nearly optimal sample complexity. Empirically, we use synthetic datasets to demonstrate that our proposed method significantly outperforms state-of-the-art group sparse learning models in terms of individual and group support recovery when the number of samples is small. Furthermore, we show the superior performance of our method in cancer drug response prediction.

1 INTRODUCTION

Sparsity is one of the most important concepts in statistical machine learning, and it is strongly connected to the data and computational efficiency, generalizability, and interpretability of a model. Traditional sparse estimation tasks aim at selecting sparse features at the individual level Tibshirani (1996); Negahban et al. (2012). However, in many real-world scenarios, structural properties among the individual features can be assumed thanks to prior knowledge, and leveraging these structures may improve both model accuracy and learning efficiency Gramfort & Kowalski (2009); Kim & Xing (2012). In this paper, we focus on learning sparse group models for intersection-closed group sparsity, where groups of variables are either selected or discarded together. The general task of learning sparse group models has been investigated extensively in the literature, where most of the prior studies are based on structured sparsity-inducing norm regularization Friedman et al. (2010); Huang et al. (2011); Zhao et al. (2009); Simon et al. (2013), which stems from Lasso Tibshirani (1996), the traditional and popular technique for sparse estimation at the individual feature level. As reviewed in Bach et al. (2012); Jenatton et al. (2011), structured sparsity-inducing norms are quite general and can encode structural assumptions such as trees Kim & Xing (2012); Liu & Ye (2010), contiguous groups Rapaport et al. (2008), directed acyclic graphs Zheng et al. (2018), and general overlapping groups Yuan et al. (2011). Another type of approach for learning sparse group models is to view the task as a cardinality-constrained program, where the constraint set encodes the group structures as well as restricts the number of groups of variables being selected.
Baldassarre et al. (2013) investigate the projection onto such a cardinality-constrained set. However, due to the combinatorial nature of the projection, directly applying projected gradient descent with this projection Baldassarre et al. (2013) to solve general learning problems with typical loss functions might not yield good results Kyrillidis et al. (2015). Recent work Pilanci et al. (2015) studies the Boolean relaxation of the learning problem with cardinality constraints on the individual variables. This work Pilanci et al. (2015) can be viewed as a special case of sparse group models, where each group contains only one variable. Both the original work of Pilanci et al. (2015) and several follow-up papers Bertsimas & Parys (2020); Bertsimas et al. (2020) show that the Boolean relaxation empirically outperforms sparse estimation methods using sparsity-inducing norms (Lasso Tibshirani (1996) and elastic net Zou & Hastie (2005)), especially when the sample size is small and the feature dimension is large. However, the results in Pilanci et al. (2015) cannot be applied to sparse group models with arbitrary group structures. To fill the gap, in this paper, we study sparse group models through a cardinality-constrained program. We first propose the Boolean relaxation for sparse group models. We further establish an analytical and algorithmic framework for our Boolean relaxation, which includes a theorem stating the equivalent condition for the relaxation to achieve exactness (i.e., the optimal integral solution) and a rounding scheme that produces an integral solution when the optimal relaxation solution is fractional. We demonstrate the power of our equivalent condition theorem by applying it to two ensembles of random problem instances that are challenging and widely used in the literature, and by proving that our Boolean relaxation achieves exactness with high probability and nearly optimal sample complexity. Our contributions are threefold: 1) We propose a novel framework that uses constraints to induce intersection-closed group sparsity. Baldassarre et al. (2013) investigate the projection onto the group sparsity constraints, whereas our framework extends to any convex loss function with the group sparsity constraints. 2) We prove that our framework is tight and can achieve exactness with high probability and nearly optimal sample complexity for two ensembles of random problem instances. This result is inspired by Pilanci et al. (2015), but our derivations and proofs are not straightforward extensions (e.g., due to the group structure, we need to analyze more complex feature-group matrices, prove new matrix concentration properties, and carefully choose different regularization parameters). 3) Empirically, we perform extensive experiments to demonstrate that our framework significantly outperforms state-of-the-art methods on simulated datasets when the sample size is small. Furthermore, we show the superior performance of our framework in cancer drug response prediction.

1.1 RELATED WORKS

Convex programming relaxations and their rounding techniques have been widely used for approximating many combinatorial optimization problems that are computationally intractable (see, e.g., Williamson & Shmoys (2011)). The specific algorithmic technique in this work is inspired by the Boolean relaxation method introduced in Pilanci et al. (2015) for learning sparsity at the individual feature level.
However, the additional group structure in our problem raises new algorithmic challenges, and both our Boolean relaxation formulation and its theoretical analysis (e.g., the equivalent condition for exactness) are different from their counterparts in Pilanci et al. (2015). As mentioned before, sparse estimation using structured sparsity-inducing norms has been thoroughly studied for learning structured sparsity under different structural assumptions motivated by various practical scenarios Friedman et al. (2010); Huang et al. (2011); Zhao et al. (2009); Simon et al. (2013); Tibshirani (1996); Bach et al. (2012); Kim & Xing (2012); Liu & Ye (2010); Rapaport et al. (2008); Zheng et al. (2018); Yuan et al. (2011); Jenatton et al. (2011). However, none of these algorithms provides rigorous theoretical techniques, as in this work, to verify whether the algorithm has produced the exact optimal solution. Also, as we will show in the experiments section, our proposed method outperforms these algorithms on both synthetic and real-world datasets. In our experiments, we also compare with the elastic net method Zou & Hastie (2005), which can only control the sparsity at the individual feature level. There exists another family of structured sparsity-inducing norms Jacob et al. (2009) that aims to model union-closed families of supports, where the support of the solution is a union of groups. This differs from our proposed models, in which the support of the solution is the intersection of the complements of some of the groups considered (intersection-closed group sparsity) Jenatton et al. (2009). Another approach is to learn sparse group models by introducing penalty functions for the constraints and applying a convex relaxation to them. Bach (2010) investigates how to design norms from submodular set-functions. El Halabi & Cevher (2015); Halabi et al. (2018) study how to induce group sparsity using tight convex relaxations of linear matrix inequalities and combinatorial penalties. Note that these works use convex regularizers to induce group sparsity while we use constraints. El Halabi & Cevher (2015); Halabi et al. (2018) study general equivalent conditions to characterize the tightness of their relaxations, while our theoretical results work for specific distributions where their general conditions cannot be easily verified. We use different analytical frameworks, and thus the theoretical results cannot be directly compared.

2 BOOLEAN RELAXATION FOR SPARSE GROUP MODEL

The organization of this section is as follows. In Section 2.1, we introduce the original problem and its exact Boolean representation. In Section 2.2, we propose the Boolean relaxed program and provide the condition under which the relaxed program is guaranteed to have an integer solution and hence be tight. In Section 2.3, we propose the rounding strategy for the case where the relaxed program does not generate integral solutions.

2.1 SPARSE GROUP MODEL AND ITS FORMULATION VIA BOOLEAN CONSTRAINTS

We consider a learning problem for a collection of n samples {(x_i, y_i) ∈ R^d × Y}_{i=1}^n and define the design matrix as X ∈ R^{n×d}, where x_i ∈ R^d is the i-th row of X. This setup is flexible enough to model various problems including binary classification (where the label space Y = {−1, +1}) and regression problems (where the label space Y = R).
For a linear model x ↦ w^⊤x, our goal is to learn a sparse weight vector w ∈ R^d whose support encodes certain structures reflecting the relationships among the features, which are usually defined by prior knowledge. More formally, we need to solve the following mathematical program:

P^* = \min_{w \in \Theta} \sum_{i=1}^{n} f(w^\top x_i; y_i) + \frac{1}{2\rho} \|w\|_2^2.   (1)

Here, the loss function f(·; ·) measures the prediction error of our linear model, where common choices include the squared loss for least-squares regression, the log loss for logistic regression, and the hinge loss for the support vector machine. The regularization term (1/(2ρ))‖w‖_2^2 in (1) makes sure that the objective function is strongly convex and therefore has a unique optimal solution w^* ∈ R^d. Finally, the constraint set Θ encodes the sparsity requirements for both individual features and groups of features. We use g_i to denote the set of the indices of features in the i-th group, and for any vector w ∈ R^d, we use w_{g_i} to denote the vector containing all entries of w corresponding to the indices in g_i. We further assume that we have b predefined groups, and then the cardinality constraint set Θ can be written as

\Theta = \Big\{ w \in \mathbb{R}^d : \|w\|_0 \le k, \ \sum_{j=1}^{b} \mathbb{1}\big[\|w_{g_j}\|_0 > 0\big] \le h \Big\},   (2)

where ‖·‖_0 denotes the ℓ0 norm and 1[·] denotes the indicator variable that takes the value 1 when the corresponding condition holds and 0 otherwise. The first constraint enforces the number of contributing features to be at most k, and the second constraint makes sure the number of groups that contain those selected features is at most h. We also remark that the structured sparsity constraints defined by Θ in equation 2 are very flexible. First, the ‖w‖_0 ≤ k constraint imposes the feature-level sparsity requirement and encompasses the unstructured sparsity model (as investigated in Pilanci et al. (2015)) as a special case. Second, the group-level sparsity constraint introduced by Σ_{j=1}^b 1[‖w_{g_j}‖_0 > 0] ≤ h covers the needs for structured sparsity arising from many practical scenarios such as neuroimaging Gramfort & Kowalski (2009); Xi et al. (2009), genomic analysis Rapaport et al. (2008); Kim & Xing (2012), and wavelet-based denoising Zhao et al. (2009); Huang et al. (2011). The groups {g_i}_{i∈{1,2,...,b}} can be arbitrary sets of features and may model not only non-overlapping structured sparsity (when g_i ∩ g_j = ∅ for all i ≠ j) but also various overlapping patterns including the contiguous pattern, the block pattern, and the hierarchical pattern, as reviewed in Bach et al. (2012); Jenatton et al. (2011). Note that the first term in equation 2 can be absorbed into the second term, which however forfeits the sparsity control at the individual level.
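To make the constraint set Θ in equation 2 concrete, the following minimal sketch (our illustration in Python/NumPy, not code from the paper; function and variable names are hypothetical) checks whether a candidate weight vector satisfies both cardinality constraints for a given list of (possibly overlapping) groups.

```python
import numpy as np

def in_theta(w, groups, k, h, tol=0.0):
    """Check membership in Theta (equation 2): at most k non-zero entries of w,
    and the non-zero entries touch at most h of the predefined groups."""
    support = np.flatnonzero(np.abs(w) > tol)            # indices with w_i != 0
    n_selected_features = support.size                   # ||w||_0
    # a group counts as active if it contains at least one selected feature
    n_active_groups = sum(1 for g in groups if np.intersect1d(g, support).size > 0)
    return n_selected_features <= k and n_active_groups <= h

# toy example: d = 6 features split into b = 3 non-overlapping groups of size 2
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
w = np.array([0.7, 0.0, 0.0, 0.0, -1.2, 0.3])
print(in_theta(w, groups, k=3, h=2))   # True: 3 non-zeros spread over 2 groups
print(in_theta(w, groups, k=3, h=1))   # False: 2 active groups exceed h = 1
```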
Note that the structured sparse group learning problem P^* defined in equation 1 involves only real variables. In the following theorem, we show that the problem can be reformulated as a convex program with additional Boolean variables and constraints, which will naturally lead to the Boolean relaxation algorithm in the later sections.

Theorem 2.1 (Exact representation with Boolean constraints). Suppose that for each y ∈ Y, the function t ↦ f(t; y) is closed and convex, and let the Legendre-Fenchel conjugate of f be f^*(s; y) := sup_{t∈R} {st − f(t; y)}. Then for any ρ > 0, the structured sparse learning problem P^* in equation 1 can be represented by the following Boolean convex program:

P^* = \min_{(u,z) \in \Gamma} \underbrace{\max_{v \in \mathbb{R}^n} \Big\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \Big\}}_{H(u)},   (3)

where D(u) := diag(u) is the diagonal matrix with u on its diagonal, and Γ is the constraint set for u and z, defined as follows:

\Gamma = \Big\{ (u, z) : \sum_{i=1}^{d} u_i \le k, \ \sum_{j=1}^{b} z_j \le h, \ u_i \le z_j \ \forall i \in g_j, \ u \in \{0,1\}^d, \ z \in \{0,1\}^b \Big\}.

The proof of Theorem 2.1 can be found in the supplementary materials Section A. In the theorem statement, u is a vector of Boolean indicators for the supports of the individual features, and z is a vector of Boolean indicators for the supports of the groups. H(u) in equation 3 is convex in u because it is the maximum of a family of functions that are linear in u. However, the whole program is still computationally difficult due to the Boolean constraints u ∈ {0,1}^d and z ∈ {0,1}^b. In the next subsection, we relax these Boolean constraints, which leads to a convex program that can be efficiently solved for many popular loss functions f.

2.2 CONVEX PROGRAM THROUGH BOOLEAN RELAXATION AND THEORETICAL CONDITIONS FOR EXACTNESS

We apply interval relaxation to both Boolean vector variables u and z, and obtain the Boolean relaxation of P^* as follows:

P_{BR} = \min_{(u,z) \in \Omega} \max_{v \in \mathbb{R}^n} \Big\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \Big\},
\Omega = \Big\{ (u, z) : \sum_{i=1}^{d} u_i \le k, \ \sum_{j=1}^{b} z_j \le h, \ u_i \le z_j \ \forall i \in g_j, \ u \in [0,1]^d, \ z \in [0,1]^b \Big\}.

P_{BR} is a convex program and can be solved by sub-gradient-based optimization algorithms Nesterov (2009) if the inner maximization problem can be solved efficiently. In general, P_{BR} can also be converted into a minimax optimization problem and solved by the methods in Lin et al. (2020). We now investigate when P_{BR} achieves the exact solution of P^*, under the assumption that the groups are non-overlapping. Note that P_{BR} is a relaxation of P^* as defined in equation 3. The relaxation is exact if and only if the optimal solution of P_{BR} also happens to be integral and therefore feasible for P^*. The following theorem (proved in the supplementary materials Section B.1) characterizes the equivalent condition for exactness.

Theorem 2.2. Suppose that each feature belongs to only one group and that the optimal integral solution (û, ẑ) of P^* as defined in equation 3 selects k features and h groups. Then the optimal solution of P_{BR} also recovers (û, ẑ) if and only if there exist non-negative λ_k and λ_h such that, for

\hat{v} \in \arg\max_{v \in \mathbb{R}^n} \Big\{ -\frac{1}{2\rho} v^\top X D(\hat{u}) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \Big\},   (5)

the following conditions hold:
1. For each group i such that ẑ_i = 1, it holds that
   (X_p^⊤ v̂)² > λ_k for every p ∈ g_i with û_p = 1, and   (6)
   (X_p^⊤ v̂)² ≤ λ_k for every p ∈ g_i with û_p = 0.   (7)
2. For each group i such that ẑ_i = 1, it holds that
   Σ_{p ∈ g_i, û_p = 1} ((X_p^⊤ v̂)² − λ_k) > λ_h.   (8)
3. For each group i such that ẑ_i = 0,
   Σ_{p ∈ g_i} max{(X_p^⊤ v̂)² − λ_k, 0} ≤ λ_h.   (9)
Here, X_p denotes the p-th column of the design matrix X.

The special case of least-squares regression. Among all candidate choices of the loss function f, the squared loss f(t; y) = (1/2)(t − y)² for least-squares regression is the most important and popular one, with many real-world applications Kim et al. (2007); Nguyen & Rocke (2002); Boulesteix & Strimmer (2007); Fort & Lambert-Lacroix (2004), and the corresponding Legendre-Fenchel conjugate becomes f^*(s; y) = s²/2 + sy. In this special case of structured sparse learning for least-squares regression, the relaxed convex program P_{BR} becomes the following form:

L_{BR} = \min_{(u,z) \in \Omega} \Big\{ G(u) := y^\top \big( \tfrac{1}{\rho} X D(u) X^\top + I \big)^{-1} y \Big\}.   (10)

The detailed derivation of equation 10 can be found in the supplementary materials Section B.2.
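For reference, the sketch below (ours, not the authors' implementation; names are illustrative) evaluates the relaxed least-squares objective G(u) of equation 10 at a possibly fractional u, together with its gradient. A projected sub-gradient or quasi-Newton solver for L_BR would call such a routine repeatedly; note that the per-feature gradient magnitudes reuse the same scores (X_p^⊤ K^{-1} y)², which reappear (as (X_p^⊤ My)²) in the exactness conditions of Corollary 2.3 below.

```python
import numpy as np

def relaxed_objective_and_grad(u, X, y, rho):
    """G(u) = y^T (X diag(u) X^T / rho + I)^{-1} y and its gradient w.r.t. u.
    Using dK/du_p = X_p X_p^T / rho, one gets dG/du_p = -(X_p^T K^{-1} y)^2 / rho."""
    n = X.shape[0]
    K = X @ np.diag(u) @ X.T / rho + np.eye(n)   # rho^{-1} X D(u) X^T + I_n
    Kinv_y = np.linalg.solve(K, y)
    grad = -(X.T @ Kinv_y) ** 2 / rho            # one entry per feature
    return float(y @ Kinv_y), grad

# toy usage with arbitrary dimensions and rho
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 50)), rng.standard_normal(20)
u_frac = rng.uniform(size=50)                    # a fractional point of [0, 1]^d
val, grad = relaxed_objective_and_grad(u_frac, X, y, rho=np.sqrt(20))
```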
We let S denote the support of the unique optimal solution u^* to the original Boolean program

L^* := \min_{(u,z) \in \Gamma} \{ G(u) \}   (11)

and define the n×n matrix M by

M := \big( I_n + \rho^{-1} X_S X_S^\top \big)^{-1}.   (12)

Now we are ready to apply Theorem 2.2 to the squared loss function and derive the sufficient and necessary condition for the exactness of L_{BR} (assuming non-overlapping groups), as follows.

Corollary 2.3. Suppose that each feature belongs to only one group and the optimal integral solution (û, ẑ) selects k features and h groups. Then L_{BR} = L^* if and only if there exist non-negative λ_k and λ_h such that:
1. For each group i such that ẑ_i = 1, it holds that
   (X_p^⊤ My)² > λ_k for every p ∈ g_i with û_p = 1, and   (13)
   (X_p^⊤ My)² ≤ λ_k for every p ∈ g_i with û_p = 0.   (14)
2. For each group i such that ẑ_i = 1, it holds that
   Σ_{p ∈ g_i, û_p = 1} ((X_p^⊤ My)² − λ_k) > λ_h.   (15)
3. For each group i such that ẑ_i = 0,
   Σ_{p ∈ g_i} max{(X_p^⊤ My)² − λ_k, 0} ≤ λ_h.   (16)

The proof of Corollary 2.3 can be found in the supplementary materials Section B.3. Corollary 2.3 creates an analysis framework for the exactness of the Boolean relaxation L_{BR}, where one only has to construct two scalars λ_k and λ_h and verify the conditions in equation 14, equation 15, and equation 16. In Section 3, we will follow this framework to theoretically prove the exactness of L_{BR} for several classes of problem instances that are popularly studied in the literature, demonstrating the power of Corollary 2.3.

2.3 RANDOMIZED ROUNDING

When the Boolean relaxation is not exact (i.e., the optimal solution of the Boolean relaxation turns out to be fractional), we describe in this section a rounding method to recover an integral solution. We apply randomized rounding, a state-of-the-art technique for converting fractional solutions to integer solutions with provable approximation guarantees Pilanci et al. (2015), to the solution of the relaxed problem P_{BR}. Given the fractional solution ũ ∈ [0,1]^d and z̃ ∈ [0,1]^b, our goal is to recover a feasible Boolean solution ū ∈ {0,1}^d and z̄ ∈ {0,1}^b. For simplicity of exposition, we present the rounding scheme for the case when each feature belongs to exactly one group; the algorithm can easily be generalized to the case of overlapping groups. In our rounding algorithm, we first generate a feasible Boolean solution at the group level, z̄ ∈ {0,1}^b. For each group j, we independently set z̄_j according to the following probability distribution:

Pr[z̄_j = 1] = z̃_j and Pr[z̄_j = 0] = 1 − z̃_j.   (17)

Once the groups are decided, ū_i is set to 0 for each feature i that belongs to a non-selected group. For each selected group g_j and each feature i that belongs to the group, we independently set ū_i according to the following probability distribution:

Pr[ū_i = 1] = ũ_i / z̃_j and Pr[ū_i = 0] = 1 − ũ_i / z̃_j.   (18)

It is easy to verify that the Boolean solution generated by the method above matches the fractional solution in expectation, i.e., E[z̄] = z̃ and E[ū] = ũ. Furthermore, their expected ℓ0 norms are given by E[‖z̄‖_0] = Σ_{j=1}^b Pr[z̄_j = 1] = Σ_{j=1}^b z̃_j ≤ h and E[‖ū‖_0] = Σ_{i=1}^d (Pr[ū_i = 1, z̄_j = 1] + Pr[ū_i = 1, z̄_j = 0]) = Σ_{i=1}^d ũ_i ≤ k, where g_j is the group containing feature i. With these expectation bounds in hand and applying standard concentration inequalities, we can show that, letting G = max_j |g_j| be the size of the largest group, for any δ ∈ (0, 1/3), with probability at least 1 − exp(−Ω(hδ²)) − exp(−Ω(k²δ²/(bG²))), we have ‖z̄‖_0 ≤ (1 + δ)h and ‖ū‖_0 ≤ (1 + δ)k. This means that when the group sizes are relatively small, our randomized rounding scheme produces a nearly optimal solution with high probability.
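The rounding distribution in equations 17 and 18 is easy to implement; here is a minimal sketch (our own, assuming non-overlapping groups; names are illustrative) that takes the fractional pair (ũ, z̃) and returns a Boolean pair (ū, z̄).

```python
import numpy as np

def randomized_rounding(u_frac, z_frac, groups, rng=None):
    """Round a fractional solution of P_BR to a Boolean one (non-overlapping
    groups): draw each group indicator z_j ~ Bernoulli(z_frac[j]); inside every
    selected group draw each feature indicator u_i ~ Bernoulli(u_frac[i] / z_frac[j]);
    features in non-selected groups are set to 0."""
    rng = rng or np.random.default_rng()
    z_int = (rng.random(len(groups)) < np.asarray(z_frac)).astype(int)
    u_int = np.zeros(len(u_frac), dtype=int)
    for j, g in enumerate(groups):
        if z_int[j] == 1:
            p = np.clip(np.asarray(u_frac)[g] / max(z_frac[j], 1e-12), 0.0, 1.0)
            u_int[g] = (rng.random(len(g)) < p).astype(int)
    return u_int, z_int
```

The constraint u_i ≤ z_j of Ω guarantees that the ratio ũ_i/z̃_j is a valid probability; the clipping in the sketch is only a numerical guard.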
Finally, once we have obtained the integral solution ū, the weight vector w̄ for the original problem equation 1 can be computed by w̄ := arg min_{w∈R^d} F(D(ū)w), where F denotes the objective of equation 1.

Value guarantees for least-squares regression. For the least-squares loss, we are also able to establish theoretical guarantees for the value (i.e., H(u) as defined in equation 3) produced by our rounding scheme. Without loss of generality, let us assume that the columns of the design matrix X are normalized, i.e., ‖X_p‖_2 ≤ 1 for p ∈ {1, 2, ..., d}, and that ‖y‖_2 = 1. We have the following theorem; its proof is in the supplementary materials Section C.

Theorem 2.4. Let (ũ, z̃) be the optimal solution to the relaxed program. Let r_z be the number of fractional entries in z̃ and let r_u be the number of fractional entries in ũ. Let (ū, z̄) be the integral solution returned by our rounding scheme. For any δ > 0, with probability at least 1 − δ, it holds that

H(\bar{u}) - P^* \le O\Big( \frac{1}{\rho} \Big( \sqrt{r_z \log(r_z/\delta)}\, G + \sqrt{r_u \log(r_u/\delta)} \Big) \Big).

3 THEORETICAL GUARANTEES OF L_BR ON ENSEMBLES OF RANDOM INSTANCES

In this section, we apply Corollary 2.3 to prove that our relaxed program is tight and achieves exactness with high probability and nearly optimal sample complexity for two ensembles of random problem instances. We focus on the case of least-squares regression and its corresponding relaxation L_{BR}. We will introduce two ensembles of random problem instances and theoretically analyze the performance of our L_{BR} on them. The first random ensemble has been widely used in the literature to evaluate the ℓ1 Sparse Group Lasso algorithms Simon et al. (2013); Friedman et al. (2010). The second random ensemble is designed by us. It is more challenging than the first ensemble, as there is more than one optimal weight vector w at the feature level and the algorithm has to identify the one with the most group sparsity. For both ensembles, we will show that our L_{BR} successfully recovers both the group and feature sparsity with overwhelming probability and almost optimal sample complexity (i.e., n, the number of observations).

3.1 RANDOM ENSEMBLE I

The first class of random instances can be generated as follows (illustrated in Fig. 1(a)). We first generate the random design matrix X ∈ R^{n×d} with i.i.d. N(0, 1) entries. The d features are divided into b groups, where the size of each group is d/b. We construct the regression weight vector w^* ∈ R^d such that the first h groups have non-zero coefficients and the coefficients in the rest of the groups are 0. The number of non-zero coefficients in the j-th group (for j ∈ {1, 2, ..., h}) is k_j, and we have Σ_{j=1}^h k_j = k, where k is the total number of non-zero coefficients in w^*. For each group j ∈ {1, 2, ..., h}, we arbitrarily choose k_j features in the group and set the corresponding coefficients in w^* to 1/√k. Finally, y = Xw^* + ε, where the noise vector ε ∈ R^n has i.i.d. N(0, γ²) entries. The goal is to identify the support of w^* and the corresponding coefficients.
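The generator below is our reconstruction of Random Ensemble I (not the authors' code); in particular, the value 1/√k for the non-zero coefficients follows the construction described above and should be treated as an assumption of this sketch.

```python
import numpy as np

def random_ensemble_1(n, d, b, k_per_group, h, gamma, seed=0):
    """Random Ensemble I (Section 3.1): i.i.d. N(0,1) design, b equal-sized
    groups, non-zero coefficients of magnitude 1/sqrt(k) confined to the first
    h groups, and N(0, gamma^2) additive noise on the responses."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    size = d // b
    groups = [np.arange(j * size, (j + 1) * size) for j in range(b)]
    k = sum(k_per_group)
    w_star = np.zeros(d)
    for j in range(h):                           # only the first h groups are active
        idx = rng.choice(groups[j], size=k_per_group[j], replace=False)
        w_star[idx] = 1.0 / np.sqrt(k)
    y = X @ w_star + gamma * rng.standard_normal(n)
    return X, y, w_star, groups

# the Section 4.1 setting: d = 1000, b = 10, k_j = 10 for j = 1..5, h = 5, gamma = 2.5
X, y, w_true, groups = random_ensemble_1(n=400, d=1000, b=10,
                                         k_per_group=[10] * 5, h=5, gamma=2.5)
```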
The following theorem provides the theoretical guarantee that, given a sufficient number of observations, L_{BR} recovers the individual- and group-level sparsity for this random ensemble.

Theorem 3.1. Consider the random instance described above with parameters (n, d, k, γ, b, h) and let y = Xw^* + ε be the observed response vector. Suppose that γ ≤ 1. Let ρ = n^{1/2+δ} with δ ∈ (0, 1/2). With probability at least 1 − d·exp(−Ω(n^{2δ}/(γ²k))) − d·exp(−Ω(n^{1−2δ})), the relaxed program L_{BR} admits the optimal solution u^* and z^*, where u^*_i = 1[w^*_i ≠ 0] and z^*_j = 1[j ∈ {1, 2, ..., h}].

We remark that the regularization parameter ρ is set to n^{1/2+δ} in our theorem, while in contrast, ρ was set to n in Pilanci et al. (2015) for the sparse learning problem without group constraints. In fact, if we are only looking for a 1 − o(1) success probability, we can set δ as large (close to 1/2) as possible. For example, if we set δ = 1/2 − (log log n)/(log n), the success probability is 1 − o(1) as long as n/(γ²k log n) = ω(1), which means that we only need n = ω((k/γ²) log(k/γ²)) to achieve the 1 − o(1) success probability; this almost matches the information-theoretic lower bound Wainwright (2007) up to the logarithmic factor.

3.2 RANDOM ENSEMBLE II

We now describe another class of random instances, which is more challenging due to the multiple candidate regression vectors (illustrated in Fig. 2(a)). We first generate a random design matrix X ∈ R^{n×d} together with two candidate k-sparse cross-fit regression vectors w^{(1)}, w^{(2)} ∈ R^d, such that both vectors lead to the same expected response (i.e., Xw^{(1)} = Xw^{(2)}). The response vector is generated by y = Xw^{(1)} + ε, where ε ∈ R^n has i.i.d. N(0, γ²) entries. The d feature dimensions are divided into b groups of the same size d/b (for simplicity we assume that d is a multiple of b). The groups are arranged so that the non-zero coefficients of w^{(1)} span h groups (where h ≤ d/b ≤ k), and the non-zero coefficients of w^{(2)} span ζh groups (ζ > 1). Given (X, y) (after randomly permuting the indices of the coordinates and the groups), the goal is to identify the support of w^{(1)}, since it is sparser at the group level. We next specifically describe how we generate w^{(1)}, w^{(2)} and design the groups. For any vector w̄ ∈ R^k with no zero entries, we let the first k entries of w^{(1)} be filled by w̄ and the rest be 0; we also let the (k+1)-th to the 2k-th entries of w^{(2)} be filled by w̄ and the rest be 0. We let each of the first h groups contain k/h of the non-zero coordinates of w^{(1)}, and let each of the next ζh groups contain k/(ζh) of the non-zero coordinates of w^{(2)}. We finally fill the b groups with the coordinates numbered from (2k+1) to d so that each group has the same size d/b. We next describe how we generate the design matrix X. We first generate a random orthogonal matrix Q such that Qw̄ = w̄. This can be done by first fixing an arbitrary orthogonal matrix with w̄/‖w̄‖_2 as its first column (i.e., letting P = [w̄/‖w̄‖_2, β_1, β_2, ..., β_{k−1}]), generating a random (k−1)×(k−1) orthogonal matrix Q′, and finally letting Q = P diag(1, Q′) P^⊤. We then generate the matrices X_1 ∈ R^{n×k} and X_3 ∈ R^{n×(d−2k)} with i.i.d. N(0, 1) entries and let X_2 = X_1 Q. Since Q is orthogonal, it is easy to see that in the marginal distribution of X_2 each entry is also i.i.d. N(0, 1). We finally let X = [X_1, X_2, X_3]. One can verify that Xw^{(1)} = X_1 w̄ as well as Xw^{(2)} = X_1 Q w̄ = X_1 w̄.
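The core of this construction, the orthogonal matrix Q that fixes w̄ and the design X = [X1, X1Q, X3], can be mirrored in a few lines. The sketch below is ours (names are hypothetical), and it omits the bookkeeping that distributes the supports of w(1) and w(2) over the first h and the next ζh groups.

```python
import numpy as np

def random_ensemble_2(n, d, k, gamma, w_bar, seed=0):
    """Random Ensemble II (Section 3.2): X = [X1, X1 Q, X3] with Q orthogonal and
    Q w_bar = w_bar, so the k-sparse vectors w1 (support 1..k) and w2 (support
    k+1..2k) satisfy X w1 = X w2; responses are generated from w1 plus noise."""
    rng = np.random.default_rng(seed)
    # Q = P diag(1, Q') P^T, where P's first column is proportional to w_bar
    # and Q' is a random (k-1) x (k-1) orthogonal matrix
    P, _ = np.linalg.qr(np.column_stack(
        [w_bar / np.linalg.norm(w_bar), rng.standard_normal((k, k - 1))]))
    Q_prime, _ = np.linalg.qr(rng.standard_normal((k - 1, k - 1)))
    Q = P @ np.block([[np.eye(1), np.zeros((1, k - 1))],
                      [np.zeros((k - 1, 1)), Q_prime]]) @ P.T
    X1 = rng.standard_normal((n, k))
    X3 = rng.standard_normal((n, d - 2 * k))
    X = np.hstack([X1, X1 @ Q, X3])              # X2 = X1 Q
    w1 = np.zeros(d); w1[:k] = w_bar
    w2 = np.zeros(d); w2[k:2 * k] = w_bar
    y = X @ w1 + gamma * rng.standard_normal(n)
    return X, y, w1, w2

# sanity check: both candidate vectors explain the noiseless responses equally well
X, y, w1, w2 = random_ensemble_2(n=200, d=500, k=80, gamma=0.1,
                                 w_bar=np.full(80, 1 / np.sqrt(80)))
assert np.allclose(X @ w1, X @ w2)
```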
For the theoretical guarantee of L_{BR} on our second random ensemble, we have the following theorem.

Theorem 3.2. Let X = [X_1, X_2, X_3] and y = Xw^{(1)} + ε be a random instance described above with parameters (n, d, k, γ, b, h, ζ, w̄). Suppose there exists ξ > 0 such that ξ ≤ |w̄_i| ≤ ζ^{1/4}ξ for all i ∈ {1, 2, ..., k}. Also suppose that γ ≤ 1. Let ρ = n^{1/2+δ} with δ ∈ (0, 1/2). For a large enough constant ζ, with probability at least 1 − d·exp(−Ω(n^{2δ}ξ²/γ²)) − d·exp(−Ω(n^{1−2δ})), the relaxed program L_{BR} admits the optimal solution u^* and z^*, where u^*_i = 1[w^{(1)}_i ≠ 0] and z^*_g = 1[∃ i ∈ g : w^{(1)}_i ≠ 0]. Here, we use g to denote both the index of a group and the set of the features included in the group.

We first remark that the smallest possible value of ζ for the theorem to hold can be made arbitrarily close to 1 (but greater than 1); this relaxation would only affect the constant coefficients in the Ω(·) notations in the success probability bound. Also, similarly to the remark in Section 3.1, we may set δ = 1/2 − (log log n)/(log n). The success probability is then 1 − o(1) as long as nξ²/(γ² log n) = ω(1). Since ξ² is usually Θ(1/k), again we only need n = ω((k/γ²) log(k/γ²)) to achieve the 1 − o(1) success probability, almost matching the information-theoretic lower bound Wainwright (2007) up to the logarithmic factor.

4 EXPERIMENTS

In this section, we perform extensive experiments to investigate the effectiveness of the proposed sparse group models under the setting of ℓ₂²-regularized least-squares regression specified in equation 10. We use both simulated datasets (non-overlapping groups) introduced in Sections 3.1 and 3.2 and a real-world application (overlapping groups) in cancer to evaluate the performance. To assess the performance on simulated data, we compute the recovery accuracy of both individual supports and group supports, defined as A_I(w) and A_G(w), respectively:

A_I(w) := \frac{|\{i : w_i \ne 0, \ w^{true}_i \ne 0\}|}{|\{i : w^{true}_i \ne 0\}|}, \qquad A_G(w) := \frac{|\{j : w_{g_j} \ne 0, \ w^{true}_{g_j} \ne 0\}|}{|\{j : w^{true}_{g_j} \ne 0\}|}.   (19)

Here w^{true}_i is the i-th element of the ground-truth vector w^{true} and w^{true}_{g_j} is the weight vector for group g_j. We apply the projected quasi-Newton method Schmidt et al. (2009) to efficiently solve equation 10; the details of the optimization can be found in the supplementary materials Section E. If the solution of equation 10 is not integral, we use the rounding scheme proposed in Section 2.3. For the non-overlapping setting, we compare the performance of our method with Sparse Group Lasso (SGL) Simon et al. (2013), Sparse G-group cover (SGCover) El Halabi & Cevher (2015), and a variant of SGL with group-level sparsity induced by the ℓ∞ norm and individual-level sparsity induced by the ℓ1 norm; for the overlapping group setting, we compare with SGL-Overlap Yuan et al. (2011) and SGCover El Halabi & Cevher (2015). We also compare our method with Elastic Net (ENet) Zou &
We evaluate the performance of all the methods on recovering the ground truth weight vector wtrue with k = 50 contributing features distributed in h = 5 groups. Feature selection with given support sizes k and h: We first consider the case when k and h are given and equal to the ground-truth for all methods, while all other hyper-parameters are selected by cross-validation. It is hard to let SGL, SGCover, and SGL to select exact k individual features and h groups of features so we indirectly control k and h by sweeping the regularization parameters. For cases where these methods do not yield exact k individual features and h groups of features, we rank their results and get the top k individual features and h groups of features instead. The same procedure was adopted to ENet to select k features. In this setting, we do not need to worry about false discovery rate (FDR), because it is complementary to the accuracy defined in equation 19. As shown in Fig. 1(b) and (c), AI and AG of all competing methods converge to 1 with the increasing sample size n. However, our proposed method converges the fastest among all the methods, indicating its effectiveness on recovering the structured sparsity when the sample size is small. We further conduct similar experiments with larger γ and show the results in the supplementary materials Section G. Based on these results, we could confirm that our proposed method outperforms conventional methods in selecting contributing features at both the individual level and group level. Estimation of support sizes k and h: In the real-world scenario, we might not know the number of contributing features and groups. Typically, they could be selected by cross-validation based on the out-of-sample mean square error (MSE). Motivated by this practical need, in this study, we also investigate whether the competing methods could accurately recover the ground truth k and h based on the out-of-sample of MSE. As shown in Fig. 1(d) (g), only our proposed method is able to provide the smallest MSE at h = 5. In addition, for the number of contributing feature k, only our method achieves the smallest MSE around the number of features k = 50. Interpreting the superior performance of our method. All the competing methods are norm-based and they rely on the regularization factor before the penalty norm to adjust to the support size (group number and feature number) requirements. However, this connection is not explicit as our Boolean relaxation where we directly require that Pd i=1 ui k and Pb i=1 zi h. Together with our randomized rounding scheme, our Boolean relaxation method might be at an advantageous position to utilize the prior knowledge on support sizes and/or recover the support sizes from the out-of-sample MSE. 4.2 SIMULATION EVALUATION OF RANDOM ENSEMBLE II Next, we consider a more challenging simulated data generated from the procedure introduced in Section 3.2. We set d = 500, k = 80, h = 10, ζ = 4, and γ = 0.1 as described in Section 3.2. As shown in Fig. 2(a), both w(1) and w(2) are optimal for the linear regression problem. However, the Published as a conference paper at ICLR 2023 Figure 2: (a) Illustration of the data generation process for Random Ensemble II. (b) AI as n increases. (c) AG as n increases. We average results over 100 datasets and the error bar means 95% confidence interval. supports of w(1) are in h = 10 groups but the supports of w(2) are in ζh = 40 groups. 
The goal is to test whether each method can recover the solution w(1), which is more sparse on the group level. We control the parameters of each method to make it select k = 80 features in h = 10 groups. As shown in Fig. 2(b) and (c), the recovery accuracy (AG and AI) of the proposed method rapidly converges to 1 and 0.93 when more training samples are provided. Surprisingly, the recovery accuracy of all the rest of the competing methods converge very slowly. Note that here we do not need to consider false discovery rate (FDR) because each method selects exact k features in h groups, therefore, FDR is complementary to the accuracy. Overall, under the setting of Random Ensemble II, the performance improvement of our method over the-state-of-art algorithms is significant in terms of both recovery accuracy. 4.3 CANCER DRUG RESPONSE PREDICTION We further adapt our model and algorithm on a real-world application to predict the drug response. The goal of the task is to find the essential genes and pathways that are responsible for the ineffectiveness of cancer therapy. For this task, we chose ℓ2 2-regularized least-square regression model to predict the continuous value of drug response. Table 1: Result comparison for IMATNIB. Method k(s.d.) h(s.d.) Out-of-sample MSE 95%CI Proposed 46 7 32.6 2.2 SGL-Overlap 92 (5.4) 19 (0.5) 46.9 3.7 ENet 60 (8.2) 18 (2.3) 39.6 4.2 SGCover 321 (10.5) 13 (1.7) 55.4 6.9 We collect drug response data from the Cancer Therapeutics Response Portal (CTRP) v2 and the Genomics of Drug Sensitivity in Cancer (GDSC) database Seashore-Ludlow et al. (2015); Yang et al. (2013) with 684 chemotherapy and targeted therapy drugs. For each drug, we create a separate machine learning task to predict the drug response from the expression value of each gene for different tumor samples. In total, we include 1,225 tumor samples and use the gene expression value of 2,369 genes. We focus on the signal transduction pathways which are mined and collected by the Reactome database Jassal et al. (2020). We only consider pathways that contain more than 10 genes and less than 80 genes. We collect 207 pathways (gene groups), in which the average number of genes in each group is 28.9. The gene expression data are extracted from CCLE Barretina et al. (2012). For each drug (machine learning task), we hold 20% of the samples as the test set and used the remaining samples as training and validation set. For each competing method, we use standard cross-validation to determine the hyper-parameter based on the out-of-sample square error MSE on the validation set. Here we only show the performance of the drug IMATNIB as a representative and put the performance of other drugs in the supplementary materials Section I. Table 1 illustrates the estimation of k and h and the out-of-sample MSE on the test set for drug IMATNIB with 10 bootstrap samples. We do not compare with SGL because the Spa SM package Sjöstrand et al. (2018) we used to solve SGL cannot handle overlapping groups. We find that our proposed method achieves the smallest out-of-sample MSE as well as selects the smallest number of genes and pathways. More importantly, as shown in supplementary materials Table S3, the selected pathways are all well associated with the drug response and functional mechanisms of IMATNIB in different types of tumor cells supported by rich literature. We also make predictions for other drugs and show the results in Table. S4. 
5 CONCLUSION In this paper, we propose a novel convex framework for learning structured sparsity. We provide theoretical tools to verify the exactness of the solution of the relaxation, and a rounding algorithm to produce the feasible integral solution when the relaxation solution is fractional. For the case of least-squares loss, we perform extensive experiments to demonstrate the effectiveness of the proposed framework. Published as a conference paper at ICLR 2023 Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569 579, 2002. Francis Bach. Structured sparsity-inducing norms through submodular functions. In NIPS 2010 : Twenty-Fourth Annual Conference on Neural Information Processing Systems, pp. NIPS, Vancouver, Canada, 2010. URL https://hal.archives-ouvertes.fr/hal-00511310. Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured Sparsity through Convex Optimization. Statistical Science, 27(4):450 468, November 2012. ISSN 08834237. doi: 10.1214/12-sts394. URL http://dx.doi.org/10.1214/12-sts394. Luca Baldassarre, Nirav Bhan, Volkan Cevher, Anastasios Kyrillidis, and Siddhartha Satpathi. Groupsparse model selection: Hardness and relaxations. ar Xiv preprint ar Xiv:1303.3207, 2013. J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Lehár, G. V. Kryukov, D. Sonkin, A. Reddy, M. Liu, L. Murray, M. F. Berger, J. E. Monahan, P. Morais, J. Meltzer, A. Korejwa, J. Jané-Valbuena, F. A. Mapa, J. Thibault, E. Bric-Furlong, P. Raman, A. Shipway, I. H. Engels, J. Cheng, G. K. Yu, J. Yu, P. Aspesi, M. de Silva, K. Jagtap, M. D. Jones, L. Wang, C. Hatton, E. Palescandolo, S. Gupta, S. Mahan, C. Sougnez, R. C. Onofrio, T. Liefeld, L. Mac Conaill, W. Winckler, M. Reich, N. Li, J. P. Mesirov, S. B. Gabriel, G. Getz, K. Ardlie, V. Chan, V. E. Myer, B. L. Weber, J. Porter, M. Warmuth, P. Finan, J. L. Harris, M. Meyerson, T. R. Golub, M. P. Morrissey, W. R. Sellers, R. Schlegel, and L. A. Garraway. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391): 603 607, Mar 2012. Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. The Annals of Statistics, 48(1):300 323, 2020. doi: 10.1214/18-AOS1804. URL https://doi.org/10.1214/18-AOS1804. Dimitris Bertsimas, Jean Pauphilet, and Bart Van Parys. Sparse Regression: Scalable Algorithms and Empirical Performance. Statistical Science, 35(4):555 578, 2020. doi: 10.1214/19-STS701. URL https://doi.org/10.1214/19-STS701. A. L. Boulesteix and K. Strimmer. Partial least squares: a versatile tool for the analysis of highdimensional genomic data. Brief Bioinform, 8(1):32 44, Jan 2007. B. S. Braun and K. Shannon. Targeting Ras in myeloid leukemias. Clin Cancer Res, 14(8):2249 2252, Apr 2008. S. C. Chen, T. T. Liao, and M. H. Yang. Emerging roles of epithelial-mesenchymal transition in hematological malignancies. J Biomed Sci, 25(1):37, Apr 2018. Y. J. Chung, T. M. Kim, D. W. Kim, H. Namkoong, H. K. Kim, S. A. Ha, S. Kim, S. M. Shin, J. H. Kim, Y. J. Lee, H. M. Kang, and J. W. Kim. Gene expression signatures associated with the resistance to imatinib. Leukemia, 20(9):1542 1550, Sep 2006. A. M. Coluccia, A. Vacca, M. Duñach, L. Mologni, S. Redaelli, V. H. Bustos, D. Benati, L. A. Pinna, and C. Gambacorti-Passerini. 
Bcr-Abl stabilizes beta-catenin in chronic myeloid leukemia through its tyrosine phosphorylation. EMBO J, 26(5):1456 1466, Mar 2007. S. Corrêa, R. Binato, B. Du Rocher, M. T. Castelo-Branco, L. Pizzatti, and E. Abdelhay. Wnt/bcatenin pathway regulates ABCB1 transcription in chronic myeloid leukemia. BMC Cancer, 12: 303, Jul 2012. Kenneth R Davidson and Stanislaw J Szarek. Local operator theory, random matrices and banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001. Damek Davis. Lecture notes for mathematical programming, 2020. Published as a conference paper at ICLR 2023 Marwa El Halabi and Volkan Cevher. A totally unimodular view of structured sparsity. In Guy Lebanon and S. V. N. Vishwanathan (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 223 231, San Diego, California, USA, 09 12 May 2015. PMLR. URL https://proceedings.mlr.press/v38/elhalabi15.html. Gersende Fort and Sophie Lambert-Lacroix. Classification using partial least squares with penalized logistic regression. Bioinformatics, 21(7):1104 1111, 11 2004. ISSN 1367-4803. doi: 10.1093/ bioinformatics/bti114. URL https://doi.org/10.1093/bioinformatics/bti114. J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso, 2010. Alexandre Gramfort and Matthieu Kowalski. Improving m/eeg source localization with an intercondition sparse prior. In 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 141 144, 2009. doi: 10.1109/ISBI.2009.5193003. J. J. Gu, J. R. Ryu, and A. M. Pendergast. Abl tyrosine kinases in T-cell signaling. Immunol Rev, 228 (1):170 183, Mar 2009. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL https://www. gurobi.com. Marwa El Halabi, Francis Bach, and Volkan Cevher. Combinatorial Penalties: Which structures are preserved by convex relaxations? In AISTATS 2018 - 22nd International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, April 2018. URL https://hal. archives-ouvertes.fr/hal-01652151. C. H. Heldin. Targeting the PDGF signaling pathway in tumor treatment. Cell Commun Signal, 11: 97, Dec 2013. T. C. Hoang, T. K. Bui, T. Taguchi, T. Watanabe, and Y. Sato. All-trans retinoic acid inhibits KIT activity and induces apoptosis in gastrointestinal stromal tumor GIST-T1 cell line by affecting on the expression of survivin and Bax protein. J Exp Clin Cancer Res, 29:165, Dec 2010. F. F. Huang, L. Zhang, D. S. Wu, X. Y. Yuan, Y. H. Yu, X. L. Zhao, F. P. Chen, and H. Zeng. PTEN regulates BCRP/ABCG2 and the side population through the PI3K/Akt pathway in chronic myeloid leukemia. PLo S One, 9(3):e88298, 2014. Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. Journal of Machine Learning Research, 12(103):3371 3412, 2011. URL http://jmlr.org/papers/ v12/huang11b.html. Y. Huang, E. O. Comiskey, R. S. Dupree, S. Li, A. J. Koleske, and J. K. Burkhardt. The c-Abl tyrosine kinase regulates actin remodeling at the immune synapse. Blood, 112(1):111 119, Jul 2008. Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09. ACM Press, 2009. doi: 10.1145/1553374.1553431. URL https://doi.org/10.1145/ 1553374.1553431. B. Jassal, L. Matthews, G. Viteri, C. Gong, P. Lorente, A. Fabregat, K. Sidiropoulos, J. Cook, M. 
Gillespie, R. Haw, F. Loney, B. May, M. Milacic, K. Rothfels, C. Sevilla, V. Shamovsky, S. Shorser, T. Varusai, J. Weiser, G. Wu, L. Stein, H. Hermjakob, and P. D Eustachio. The reactome pathway knowledgebase. Nucleic Acids Res, 48(D1):D498 D503, 01 2020. Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Structured variable selection with sparsityinducing norms. 2009. doi: 10.48550/ARXIV.0904.3523. URL https://arxiv.org/abs/ 0904.3523. Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. Structured Variable Selection with Sparsity-Inducing Norms. Journal of Machine Learning Research, 12:2777 2824, 2011. URL https://hal.inria.fr/inria-00377732. Published as a conference paper at ICLR 2023 J. Kim, D. G. Bates, I. Postlethwaite, P. Heslop-Harrison, and K. H. Cho. Least-squares methods for identifying biochemical regulatory networks from noisy measurements. BMC Bioinformatics, 8:8, Jan 2007. Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eqtl mapping. The Annals of Applied Statistics, 6(3), Sep 2012. ISSN 1932-6157. doi: 10.1214/12-aoas549. URL http://dx.doi.org/10.1214/ 12-AOAS549. Anastasios Kyrillidis, Luca Baldassarre, Marwa El Halabi, Quoc Tran-Dinh, and Volkan Cevher. Structured sparsity: Discrete and convex approaches. Co RR, abs/1507.05367, 2015. URL http://arxiv.org/abs/1507.05367. E. Leo, M. Mancini, M. Aluigi, S. Luatti, F. Castagnetti, N. Testoni, S. Soverini, M. A. Santucci, and G. Martinelli. BCR-ABL1-associated reduction of beta catenin antagonist Chibby1 in chronic myeloid leukemia. PLo S One, 8(12):e81425, 2013. L. Li, D. K. Blumenthal, T. Masaki, C. M. Terry, and A. K. Cheung. Differential effects of imatinib on PDGF-induced proliferation and PDGF receptor signaling in human arterial and venous smooth muscle cells. J Cell Biochem, 99(6):1553 1563, Dec 2006. Tianyi Lin, Chi Jin, and Michael. I. Jordan. Near-optimal algorithms for minimax optimization, 2020. Jun Liu and Jieping Ye. Moreau-Yosida regularization for grouped tree structure learning. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta (eds.), Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/file/ d490d7b4576290fa60eb31b5fc917ad1-Paper.pdf. C. J. Malavaki, A. E. Roussidis, C. Gialeli, D. Kletsas, T. Tsegenidis, A. D. Theocharis, G. N. Tzanakakis, and N. K. Karamanos. Imatinib as a key inhibitor of the platelet-derived growth factor receptor mediated expression of cell surface heparan sulfate proteoglycans and functional properties of breast cancer cells. FEBS J, 280(10):2477 2489, May 2013. Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, 27(4):538 557, 2012. doi: 10.1214/12-STS400. URL https://doi.org/10. 1214/12-STS400. Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221 259, 2009. doi: 10.1007/s10107-007-0149-x. URL https://doi.org/10. 1007/s10107-007-0149-x. Danh V. Nguyen and David M. Rocke. Tumor classification by partial least squares using microarray gene expression data . Bioinformatics, 18(1):39 50, 01 2002. ISSN 1367-4803. doi: 10.1093/ bioinformatics/18.1.39. URL https://doi.org/10.1093/bioinformatics/18.1. 39. C. Nishioka, T. Ikezoe, J. Yang, K. Udaka, and A. Yokoyama. 
Imatinib causes epigenetic alterations of PTEN gene via upregulation of DNA methyltransferases and polycomb group proteins. Blood Cancer J, 1(12):e48, Dec 2011. Roberto Oliveira et al. Sums of random hermitian matrices and an inequality by Rudelson. Electronic Communications in Probability, 15:203 212, 2010. C. Peng, Y. Chen, Z. Yang, H. Zhang, L. Osterby, A. G. Rosmarin, and S. Li. PTEN is a tumor suppressor in CML stem cells and BCR-ABL-induced leukemias in mice. Blood, 115(3):626 635, Jan 2010. Mert Pilanci, Martin J Wainwright, and Laurent El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming, 151(1):63 87, 2015. F. Rapaport, E. Barillot, and J. P. Vert. Classification of array CGH data using fused SVM. Bioinformatics, 24(13):i375 382, Jul 2008. Published as a conference paper at ICLR 2023 Mark Schmidt, Ewout van den Berg, Michael Friedlander, and Kevin Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-newton algorithm. In David van Dyk and Max Welling (eds.), Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pp. 456 463, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16 18 Apr 2009. PMLR. URL http://proceedings.mlr.press/v5/schmidt09a.html. B. Seashore-Ludlow, M. G. Rees, J. H. Cheah, M. Cokol, E. V. Price, M. E. Coletti, V. Jones, N. E. Bodycombe, C. K. Soule, J. Gould, B. Alexander, A. Li, P. Montgomery, M. J. Wawer, N. Kuru, J. D. Kotz, C. S. Hon, B. Munoz, T. Liefeld, V. Danˇcík, J. A. Bittker, M. Palmer, J. E. Bradner, A. F. Shamji, P. A. Clemons, and S. L. Schreiber. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset. Cancer Discov, 5(11):1210 1223, Nov 2015. Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013. Karl Sjöstrand, Line Harder Clemmensen, Rasmus Larsen, Gudmundur Einarsson, and Bjarne Ersbøll. Spasm: A matlab toolbox for sparse statistical modeling. Journal of Statistical Software, 84(10), 2018. doi: 10.18637/jss.v084.i10. URL https://doi.org/10.18637/jss.v084.i10. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267 288, 1996. Martin Wainwright. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. In 2007 IEEE International Symposium on Information Theory, pp. 961 965. IEEE, 2007. David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge university press, 2011. Yongxin Xi, Uri Hasson, Peter J Ramadge, and Zhen Xiang. Boosting with spatial regularization. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta (eds.), Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper/2009/file/ f2217062e9a397a1dca429e7d70bc6ca-Paper.pdf. W. Yang, J. Soares, P. Greninger, E. J. Edelman, H. Lightfoot, S. Forbes, N. Bindal, D. Beare, J. A. Smith, I. R. Thompson, S. Ramaswamy, P. A. Futreal, D. A. Haber, M. R. Stratton, C. Benes, U. Mc Dermott, and M. J. Garnett. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res, 41(Database issue):D955 961, Jan 2013. Lei Yuan, Jun Liu, and Jieping Ye. Efficient methods for overlapping group lasso. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. 
Pereira, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper/2011/file/ 03c6b06952c750899bb03d998e631860-Paper.pdf. H. Zhang, Y. Wang, H. Yang, Z. Huang, X. Wang, and W. Feng. TCF7 knockdown inhibits the imatinib resistance of chronic myeloid leukemia K562/G01 cells by neutralizing the Wnt/b-catenin/TCF7/ABC transporter signaling axis. Oncol Rep, 45(2):557 568, Feb 2021. Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468 3497, 2009. doi: 10.1214/07-AOS584. URL https://doi.org/10.1214/07-AOS584. Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/ paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf. Published as a conference paper at ICLR 2023 L. Zhou, N. An, R. C. Haydon, Q. Zhou, H. Cheng, Y. Peng, W. Jiang, H. H. Luu, P. Vanichakarn, J. P. Szatkowski, J. Y. Park, B. Breyer, and T. C. He. Tyrosine kinase inhibitor STI-571/Gleevec down-regulates the beta-catenin signaling activity. Cancer Lett, 193(2):161 170, Apr 2003. Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301 320, 2005. Published as a conference paper at ICLR 2023 Supplementary Materials A PROOF OF THEOREM 2.1 Proof Theorem 2.1. Let ui {0, 1} indicate whether the i-th feature is selected and ugj be a vector containing ui, i gj. We then define D(u) := diag(u) and D(ugj) := diag(ugj). Considering the change of variable w = D(u)w, we find the original problem equation 1 is equivalent to P = min D(u)w 0 k Pb j=1 1 h D(ugj )wgj i=1 f(w D(u)xi; yi) + 1 2ρ D(u)w 2 2 We further introduce zj {0, 1} to indicate whether the group of features gj is selected and obtain the following equivalent formulation P = min D(u)w 0 k D(ugj )wgj 0 zj, j Pb j=1 zj h i=1 f(w D(u)xi; yi) + 1 2ρ D(u)w 2 2 We further split w and u and then equation 21 becomes P = min (u,z) Γ min w Rd i=1 f(w D(u)xi; yi) + 1 j=1 zj h, ui zj, i gj, u {0, 1}d, z {0, 1}b It is easy to verify that equation 22 achieves the same objective function value of equation 1 at the same unique optimal solution w . It remains to prove the inner minimization is equivalent to i=1 f(w D(u)xi; yi) + 1 2ρv XD(u)X v i=1 f (vi; yi) (23) Replacing f by its Legendre-Fenchel conjugate f , we have min w Rd max v Rn i=1 w D(u)xi vi f (vi : yi) + 1 Under the stated assumptions, strong duality must hold and therefore minimum and maximum can be exchanged. max v Rn min w Rd i=1 w D(u)xi vi f (vi : yi) + 1 The objective function is strongly convex with respect to w. Hence, we can obtain the unique minimizer w = 1 ρ Pn i=1 D(u)xivi. Substituting w yields equation 23. B PROOFS AND DEVIATIONS FOR THE CONVEX FORMULATION B.1 PROOF OF THEOREM 2.2 To prove Theorem 2.2, let us first provide the sufficient and necessary conditions for PBR to have integral solutions in the following Lemma. Lemma B.1. Suppose that each feature belongs to exactly one group. Then suppose that the integral solution (ˆu, ˆz) of equation 3 selects exactly k features and h groups. 
This solution is also the optimal solution of the relaxed program PBR if and only if there exist non-negative {λup 1, λup 0}p [d], {λzi 1, λzi 0}i [b], {λup zi} p gi, λk, and λh, such that ˆv arg max v Rn 2ρv XD(ˆu)X v i=1 f (vi; yi) Published as a conference paper at ICLR 2023 λup 1 λup 0 + λk = (X p ˆv)2 λup zi, p gi; (27) λzi 1 λzi 0 + λh = X p gi λup zi, i [b]; (28) λup 1 = 0, p : ˆup = 0; (29) λup 0 = 0, p : ˆup = 1; (30) λzi 1 = 0, i : ˆzi = 0; (31) λzi 0 = 0, i : ˆzi = 1; (32) λup zi = 0, p gi : ˆup < ˆzi. (33) To prove Lemma B.1, we need to use the following two theorems. Theorem B.2 (Davis (2020)). Suppose x is a local minimizer of f : Rd R on a closed convex set X Rd. If f is differentiable at x, it holds that f( x) NX ( x). (34) Theorem B.3 (Davis (2020)). Let A Rm n and let β Rm. Consider the polyhedron Q(A, β) = {x|Ax β}. Suppose x Q(A, β), then the normal cone at x is NQ(A,β)(x) = {A y|y Rm such that y 0 and y (β Ax) = 0}. Proof of Lemma B.1. PBR is PBR = min (u,z) Ωmax v Rn 2ρv XD(u)X v i=1 f (vi; yi) | {z } F (u,z) where Ω= {(u, z)| P i zi h; ui zj; i gj; u [0, 1]d; z [0, 1]b} is the feasible set for (u, z). By Theorem B.2, we know (ˆu, ˆz) is optimal if and only if the following inclusion holds: F(ˆu, ˆz) NΩ(ˆu, ˆz), (36) where NΩ(ˆu, ˆz) is the normal cone at (ˆu, ˆz). Regarding the Left-Hand-Side of equation 36, by standard calculation, we have that ui F(ˆu) = (X i ˆv)2 (ˆv is defined in equation 26) and zi F(ˆz) = 0. Therefore, F(ˆu, ˆz) = (X 2 ˆv)2 ... (X d ˆv)2 0 0 ... 0 Regarding the Right-Hand-Side of equation 36, we obtain the normal cone NΩ(ˆu, ˆz) using Theorem B.3. Specifically, the feasible set Ωis a polyhedron that can be presented by Ω(A, b) = ˆu ˆz β , where the A matrix and β vector are constructed as follows. Published as a conference paper at ICLR 2023 feature block: d columns z }| { 1 group block: b columns z }| { 1 1 1 ... 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 ˆu1 ˆu2 ... ˆud ˆz1 ˆz2 ... ˆzb 1 0 1 0 ... 1 0 k 1 0 1 0 ... 1 0 h 0 ... 0 Here, A is the feature-group relation matrix where for each feature group relation p gi, there is a row in A where the p-th entry in the feature block is 1, the i-th entry in the group block is 1, and all the rest entries are 0. According to Theorem B.3, we know the normal cone NΩ(ˆu, ˆz) = NΩ(A,b)(ˆu, ˆz) = A λ λ (R 0) c : λ b A ˆu ˆz = 0 , where c denotes the number of constraints (i.e., the row dimension of A). We identify the entries of λ as follows: λup 1 denotes the dual parameter corresponding to the constraint up 1 for each feature p; λup 0 denotes the dual parameter corresponding to the constraint up 0 for each feature p; λk denotes the dual parameter corresponding to the constraint P j uj k; λzi 1 denotes the dual parameter corresponding to the constraint zi 1 for each group i; λzi 0 denotes the dual parameter corresponding to the constraint zi 0 for each group i; λh denotes the dual parameter corresponding to the constraint P i zi h; λup zi denotes the dual parameter corresponding to the constraint up zi for each feature p and group gi such that p gi. Finally, we conclude that the equivalent condition of equation 36 is there exists λ (R 0)c such that F(ˆu, ˆz) = A λ and λ b A ˆu ˆz = 0. By equation 37, we obtain equa- tion 27 and equation 28 as the equivalent condition of F(ˆu, ˆz) = A λ. We also obtain equation 29, equation 30, equation 31, equation 32, and equation 33 as the equivalent condition of λ b A ˆu ˆz With Lemma B.1 in hand, we can prove Theorem 2.2 in the following. Proof of Theorem 2.2. 
With Lemma B.1 in hand, we can now prove Theorem 2.2.

Proof of Theorem 2.2. We first prove the sufficient condition. Given $\lambda_k$ and $\lambda_h$, we only need to construct non-negative $\{\lambda^{u_p}_1, \lambda^{u_p}_0\}_{p\in[d]}$, $\{\lambda^{z_i}_1, \lambda^{z_i}_0\}_{i\in[b]}$, and $\{\lambda^{u_p z_i}\}_{p\in g_i}$ that satisfy equation 27–equation 33. By Lemma B.1, this will establish the optimality of $(\hat u, \hat z)$ in the relaxed program $P_{BR}$. We construct these dual parameters as follows.

1. For each group $i$ such that $\hat z_i = 1$ and each $p \in g_i$ with $\hat u_p = 1$, set $\lambda^{u_p}_1 = \lambda^{u_p}_0 = 0$ and $\lambda^{u_p z_i} = (X_p^\top \hat v)^2 - \lambda_k$. For each $p \in g_i$ with $\hat u_p = 0$, set $\lambda^{u_p}_0 = \lambda_k - (X_p^\top \hat v)^2$ and $\lambda^{u_p}_1 = \lambda^{u_p z_i} = 0$. By equation 14, one may verify that all values constructed in this step are non-negative.

2. For each group $i$ such that $\hat z_i = 1$, set $\lambda^{z_i}_0 = 0$ and $\lambda^{z_i}_1 = \sum_{p\in g_i,\ \hat u_p=1}\big((X_p^\top \hat v)^2 - \lambda_k\big) - \lambda_h$, which is non-negative due to equation 15.

3. For each group $i$ such that $\hat z_i = 0$ and each $p \in g_i$ with $(X_p^\top \hat v)^2 > \lambda_k$, set $\lambda^{u_p}_1 = \lambda^{u_p}_0 = 0$ and $\lambda^{u_p z_i} = (X_p^\top \hat v)^2 - \lambda_k$. For each $p \in g_i$ with $(X_p^\top \hat v)^2 \le \lambda_k$, set $\lambda^{u_p}_0 = \lambda_k - (X_p^\top \hat v)^2$ and $\lambda^{u_p}_1 = \lambda^{u_p z_i} = 0$. Observe that we always have $\lambda^{u_p z_i} = \max\{(X_p^\top \hat v)^2 - \lambda_k, 0\}$ for each $p \in g_i$.

4. For each group $i$ such that $\hat z_i = 0$, set $\lambda^{z_i}_1 = 0$ and $\lambda^{z_i}_0 = \lambda_h - \sum_{p\in g_i}\max\{(X_p^\top \hat v)^2 - \lambda_k, 0\}$, which is non-negative due to equation 16.

Finally, it is straightforward to verify that equation 27–equation 33 are satisfied by our constructed $\lambda$, and therefore, by Lemma B.1, we conclude that $(\hat u, \hat z)$ is the optimal solution of the relaxed program $P_{BR}$.

We next prove the necessary condition. Given that $P_{BR}$ and $P^*$ have the same integral solution, we only need to show that there exist $\lambda_k$ and $\lambda_h$ that satisfy equation 14–equation 16. By Lemma B.1 and the fact that $(\hat u, \hat z)$ is the integral solution, we have:

1. For each group $i$ such that $\hat z_i = 1$ and each $p \in g_i$ with $\hat u_p = 1$, we have
$(X_p^\top \hat v)^2 = \lambda_k + \lambda^{u_p}_1 + \lambda^{u_p z_i} \ \Longrightarrow\ (X_p^\top \hat v)^2 \ge \lambda_k. \quad (39)$
For each group $i$ such that $\hat z_i = 1$ and each $p \in g_i$ with $\hat u_p = 0$, we have
$(X_p^\top \hat v)^2 = \lambda_k - \lambda^{u_p}_0 \ \Longrightarrow\ (X_p^\top \hat v)^2 \le \lambda_k. \quad (40)$

2. For each group $i$ such that $\hat z_i = 1$, we apply equation 39 and equation 28 and have
$\sum_{p\in g_i,\ \hat u_p=1}\big((X_p^\top \hat v)^2 - \lambda_k\big) \ \ge\ \sum_{p\in g_i}\lambda^{u_p z_i} \ =\ \lambda_h + \lambda^{z_i}_1 \ \ge\ \lambda_h. \quad (41)$

3. For each group $i$ such that $\hat z_i = 0$, we apply equation 40 and equation 28 and have
$\sum_{p\in g_i}\max\{(X_p^\top \hat v)^2 - \lambda_k, 0\} \ \le\ \sum_{p\in g_i}\lambda^{u_p z_i} \ =\ \lambda_h - \lambda^{z_i}_0 \ \le\ \lambda_h. \quad (42)$

These three sets of inequalities are exactly the conditions in equation 14–equation 16, which completes the proof.

B.2 DERIVATION OF EQUATION 10

In the case of least-squares regression, the Legendre–Fenchel conjugate of the least-squares loss $f(t; y) = \frac{1}{2}(t-y)^2$ is given by $f^*(s; y) = \frac{s^2}{2} + sy$. Substituting this conjugate function into $H(u)$ in equation 3 in Theorem 2.1, we have

$H(u) = \max_{v\in\mathbb{R}^n} -\frac{1}{2\rho} v^\top X D(u) X^\top v - \frac{1}{2}\|v\|_2^2 + v^\top y. \quad (43)$

We can verify that the unique optimal solution of $H(u)$ is

$\hat v = \Big(\frac{X D(u) X^\top}{\rho} + I\Big)^{-1} y. \quad (44)$

Substituting $\hat v$ back into equation 3 and applying Theorem 2.1 yield the representation

$L_{BR} = \min_{(u,z)\in\Omega}\ \frac{1}{2}\, y^\top \Big(\frac{X D(u) X^\top}{\rho} + I\Big)^{-1} y.$

B.3 PROOF OF COROLLARY 2.3

Proof of Corollary 2.3. Under the least-squares regression setting, we apply Theorem 2.2 and, based on equation 5, we know $\hat v = \big(\frac{XD(\hat u)X^\top}{\rho} + I\big)^{-1} y$. We further define $M := \big(I_n + \rho^{-1}X_S X_S^\top\big)^{-1}$, where $S$ is the support set indicated by $\hat u$, and then $\hat v = My$. Substituting $\hat v = My$ into the conditions of Theorem 2.2, which depend on $\hat v$ only through $(X_p^\top \hat v)^2$, proves the corollary.
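The sufficient conditions used in the construction above can be checked mechanically for a candidate integral pair $(\hat u, \hat z)$. The sketch below is our own transcription of those conditions for the least-squares case of Corollary 2.3 (where $\hat v = My$); equations 14–16 themselves live in the main text and are not reproduced here, so the strict and non-strict inequalities below follow the proof as written, and the function name and signature are ours.

```python
import numpy as np

def exactness_conditions(X, y, u_hat, z_hat, groups, rho, lam_k, lam_h):
    """Check, for an integral candidate (u_hat, z_hat), the sufficient
    conditions used in the proof of Theorem 2.2 in the least-squares case
    of Corollary 2.3: the scores are (X_p^T v_hat)^2 with v_hat = M y."""
    n = X.shape[0]
    S = np.flatnonzero(u_hat)
    M = np.linalg.inv(np.eye(n) + X[:, S] @ X[:, S].T / rho)
    v_hat = M @ y
    score = (X.T @ v_hat) ** 2                     # (X_p^T v_hat)^2 for every p
    ok = True
    for i, g in enumerate(groups):
        g = np.asarray(g)
        sel = g[u_hat[g] == 1]
        uns = g[u_hat[g] == 0]
        if z_hat[i] == 1:
            # selected features clear lam_k, unselected ones do not
            ok &= np.all(score[sel] > lam_k) and np.all(score[uns] <= lam_k)
            # the selected group clears lam_h
            ok &= np.sum(score[sel] - lam_k) > lam_h
        else:
            # an unselected group stays below lam_h
            ok &= np.sum(np.maximum(score[g] - lam_k, 0.0)) <= lam_h
    return bool(ok)

# toy usage: two groups of two features, first group selected
rng = np.random.default_rng(0)
X, y = rng.standard_normal((30, 4)), rng.standard_normal(30)
u_hat, z_hat = np.array([1, 1, 0, 0]), np.array([1, 0])
print(exactness_conditions(X, y, u_hat, z_hat, [[0, 1], [2, 3]],
                           rho=2.0, lam_k=0.5, lam_h=1.0))
```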
C PROOF OF THEOREM 2.4

Proof of Theorem 2.4. For the least-squares loss $f(t; y) = \frac{1}{2}(t-y)^2$, we have that

$H(u) = \max_{v\in\mathbb{R}^n} -\frac{1}{2\rho} v^\top X D(u) X^\top v - \frac{1}{2}\|v\|_2^2 + v^\top y.$

Since the optimal value is non-negative, the optimal dual parameter $v^* \in \mathbb{R}^n$ must satisfy $\|v^*\|_2^2 \le \|y\|_2^2$. Note that

$H(u) - P^* \ \le\ H(u) - H(\tilde u) \ =\ \big[H(u) - H(\bar u)\big] + \big[H(\bar u) - H(\tilde u)\big], \quad (46)$

where we set $\bar u_i = u_i/\tilde z_j$ if feature $i$ belongs to group $j$ and $\tilde z_j \neq 0$, and $\bar u_i = 0$ otherwise.

For $H(u) - H(\bar u)$, we have that

$H(u) - H(\bar u) \ \le\ \Big[-\frac{1}{2\rho} v^{*\top} X D(u) X^\top v^* - \frac{1}{2}\|v^*\|_2^2 + v^{*\top} y\Big] - \max_{v\in\mathbb{R}^n}\Big[-\frac{1}{2\rho} v^\top X D(\bar u) X^\top v - \frac{1}{2}\|v\|_2^2 + v^\top y\Big] \ \le\ \frac{1}{2\rho}\Big|v^{*\top} X\big(D(u)-D(\bar u)\big)X^\top v^*\Big| \ \le\ \frac{\|y\|_2^2}{2\rho}\,\sigma_{\max}\Big(X\big(D(u)-D(\bar u)\big)X^\top\Big), \quad (47)$

where $v^*$ denotes the maximizer attaining $H(u)$ and $\sigma_{\max}(\cdot)$ denotes the maximum singular value of the matrix. Similarly, for $H(\bar u) - H(\tilde u)$, we have that

$H(\bar u) - H(\tilde u) \ \le\ \frac{\|y\|_2^2}{2\rho}\,\sigma_{\max}\Big(X\big(D(\bar u)-D(\tilde u)\big)X^\top\Big). \quad (48)$

For $X(D(\bar u)-D(\tilde u))X^\top$, we rewrite it as

$X\big(D(\bar u)-D(\tilde u)\big)X^\top = \sum_{j=1}^{b}\sum_{i\in g_j}(\bar u_i - \tilde u_i)\,X_i X_i^\top.$

Note that, by our assumption, the operator norm of each group summand $\sum_{i\in g_j}(\bar u_i - \tilde u_i)X_iX_i^\top$ is at most $|g_j| \le G$, and the mean of the random matrix is $0$. By the Ahlswede–Winter matrix concentration bound Ahlswede & Winter (2002); Oliveira et al. (2010), we have that

$\Pr\Big[\sigma_{\max}\Big(X\big(D(\bar u)-D(\tilde u)\big)X^\top\Big) \ge \sqrt{r_z}\,G\,t\Big] \ \le\ r_z \exp(-\Omega(t^2)), \quad (49)$

where $r_z$ is the number of fractional entries in $z$. The standard matrix concentration bound (e.g., matrix Hoeffding) states that for $n$-dimensional random matrices $X_1, X_2, \ldots, X_M$ we have $\Pr[\sigma_{\max}(X_1 + \cdots + X_M) \ge \alpha] \le 2n\exp(-\alpha^2/(8M\sigma^2))$, where $\sigma$ upper bounds the operator norms of $X_1, X_2, \ldots, X_M$ almost surely. In the context of Eq. (49), we have $\sigma = G$ and we set $\alpha = \sqrt{r_z}\,G\,t$. Using this matrix concentration inequality, we can upper bound the left-hand side of Eq. (49) by $2n\exp(-r_z t^2/(8M)) = 2n\exp(-t^2/8)$ (since only the $r_z$ groups with fractional $z_j$ contribute non-zero summands, we may take $M = r_z$). Note that this bound is already good enough, since the only difference from the right-hand side of Eq. (49) is the factor $2n$ instead of $r_z$; these factors only enter the logarithmic factor in the final error bound of Theorem 2.4, so they do not make much difference. To improve the factor $2n$ to $r_z$ in Eq. (49), we need to show that there exists an $r_z$-dimensional subspace, with basis vectors given by the columns of $Q \in \mathbb{R}^{n\times r_z}$, such that $X(D(\bar u)-D(\tilde u))X^\top$ can be written as $Q(X'_1 + \cdots + X'_M)Q^\top$ almost surely (where $X'_1, \ldots, X'_M$ are $r_z$-dimensional matrices whose operator norms are also bounded by $G$). The construction of $Q$ and $X'_1, \ldots, X'_M$ is possible because each matrix associated with a fractional variable $z_j$ is of low rank and there are only $r_z$ fractional $z_j$'s. We then apply the above matrix concentration bound to $X'_1 + \cdots + X'_M$ to derive Eq. (49).

For $X(D(u)-D(\bar u))X^\top$, we have that $X(D(u)-D(\bar u))X^\top = \sum_i (u_i - \bar u_i) X_i X_i^\top$. Again, by our assumption, the operator norm of $(u_i - \bar u_i)X_iX_i^\top$ is at most $1$, and the mean of the random matrix is $0$ (even when conditioned on $z$). Therefore, by the Ahlswede–Winter matrix concentration bound Ahlswede & Winter (2002); Oliveira et al. (2010), we have that

$\Pr\Big[\sigma_{\max}\Big(X\big(D(u)-D(\bar u)\big)X^\top\Big) \ge \sqrt{r_u}\,t \ \Big|\ z\Big] \ \le\ r_u \exp(-\Omega(t^2)), \quad (50)$

where $r_u$ is the number of fractional entries in $u$.

Combining equation 46, equation 47, equation 48, equation 49, and equation 50, we have that with probability at least $1-\delta$ it holds that

$H(u) - P^* \ \le\ O\!\left(\frac{G\sqrt{r_z\log(r_z/\delta)} + \sqrt{r_u\log(r_u/\delta)}}{\rho}\cdot\|y\|_2^2\right),$

proving the theorem.
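The only probabilistic ingredient of the bound above is matrix concentration for sums of independent, zero-mean matrices with bounded operator norm. The toy experiment below (a generic illustration of that phenomenon, not the paper's rounding scheme) sums $M$ such rank-one matrices and confirms that the operator norm of the sum is on the order of $\sqrt{M}$ rather than $M$, which is what drives the $\sqrt{r_z}$ and $\sqrt{r_u}$ factors in equations 49 and 50.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, trials = 40, 200, 20
norms = []
for _ in range(trials):
    total = np.zeros((n, n))
    for _ in range(M):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)                 # so that ||x x^T||_op = 1
        eps = rng.choice([-1.0, 1.0])          # zero-mean random sign
        total += eps * np.outer(x, x)
    norms.append(np.linalg.norm(total, 2))     # operator (spectral) norm
print(np.mean(norms), np.sqrt(M))              # comparable magnitudes, both far below M
```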
D ANALYSIS OF THE PERFORMANCE OF LBR ON SYNTHETIC RANDOM ENSEMBLES

D.1 PROOF OF THEOREM 3.1

Proof of Theorem 3.1. Let $M := \big(\frac{XD(u^*)X^\top}{\rho} + I\big)^{-1}$. For each feature index $i \in \{1, 2, \ldots, d\}$, by $y = Xw^* + \epsilon$ we have that

$e_i^\top X^\top M y = e_i^\top X^\top M X w^* + e_i^\top X^\top M \epsilon. \quad (51)$

We first bound $e_i^\top X^\top M X w^*$ as follows. Applying Lemma D.1 to each feature index $i$ such that $w^*_i \neq 0$, and via a union bound, we have that there exists a universal constant $c_1 \in (0, 1/3200)$ such that, with probability at least $1 - 3k\exp(-c_1 n^{1-2\delta})$, it holds that

$\forall i: w^*_i \neq 0, \quad \frac{1}{\rho}\, e_i^\top X^\top M X w^* \in \Big[\frac{0.9}{\sqrt{k}}, \frac{1.1}{\sqrt{k}}\Big]. \quad (52)$

Also, applying Lemma D.2 to each feature index $i$ such that $w^*_i = 0$, and via a union bound, we have that with probability at least $1 - 4(d-k)\exp(-c_1 n/k)$, it holds that

$\forall i: w^*_i = 0, \quad \Big|\frac{1}{\rho}\, e_i^\top X^\top M X w^*\Big| \le \frac{0.1}{\sqrt{k}}. \quad (53)$

We then bound $e_i^\top X^\top M\epsilon$ as follows. Note that $M \preceq I$. Therefore, applying Lemma D.3 to each feature index $i$ (with $\tau = 0.1\rho/\sqrt{k}$), and via a union bound, we have that with probability at least $1 - 2d\exp(-n/8) - 2d\exp(-n^{2\delta}/(400\gamma^2 k))$, it holds that

$\forall i \in \{1, 2, \ldots, d\}, \quad \big|e_i^\top X^\top M \epsilon\big| \le \frac{0.1\rho}{\sqrt{k}}. \quad (54)$

Now we condition on the event that all of equation 52, equation 53, and equation 54 happen. By equation 51, we have that

$\forall i: w^*_i \neq 0, \quad e_i^\top X^\top M y \in \Big[\frac{0.8\rho}{\sqrt{k}}, \frac{1.2\rho}{\sqrt{k}}\Big], \quad (55)$
$\forall i: w^*_i = 0, \quad \big|e_i^\top X^\top M y\big| \in \Big[0, \frac{0.2\rho}{\sqrt{k}}\Big]. \quad (56)$

We now set $\lambda_k = (0.2\rho/\sqrt{k})^2$ and $\lambda_h = \big((0.79\rho/\sqrt{k})^2 - \lambda_k\big)\cdot k_{\min}$ (where we let $k_{\min} = \min_{j\in\{1,2,\ldots,h\}} k_j$ be the size of the smallest non-empty group, in terms of non-zero coefficients), and verify the conditions in Corollary 2.3 as follows:

1. Fix any group $g$ such that $z^*_g = 1$. For each $i \in g$ such that $u^*_i = 1$, we have $w^*_i \neq 0$ and therefore $(e_i^\top X^\top M y)^2 \ge (0.8\rho/\sqrt{k})^2 > \lambda_k$ by equation 55. For each $i \in g$ such that $u^*_i = 0$, we have $w^*_i = 0$ and therefore $(e_i^\top X^\top M y)^2 \le (0.2\rho/\sqrt{k})^2 = \lambda_k$ by equation 56.

2. Fix any group $g$ such that $z^*_g = 1$. By equation 55, we have $\sum_{i\in g: u^*_i=1}\big((e_i^\top X^\top M y)^2 - \lambda_k\big) \ge \big((0.8\rho/\sqrt{k})^2 - \lambda_k\big)\cdot k_{\min} > \lambda_h$.

3. Fix any group $g$ such that $z^*_g = 0$. By equation 55 and equation 56, we have $\sum_{i\in g} \max\{(e_i^\top X^\top M y)^2 - \lambda_k, 0\} = 0 < \lambda_h$.

Finally, the theorem is proved by collecting the failure probabilities of the desired events (equation 52, equation 53, and equation 54).
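The separation in equations 55 and 56 is easy to observe empirically. The simulation below is our own sketch: following the description of Random Ensemble I (i.i.d. $N(0, I)$ features, as stated in Section H) and taking the non-zero coefficients equal to $1/\sqrt{k}$, as the interval $[0.9/\sqrt{k}, 1.1/\sqrt{k}]$ in equation 52 suggests, it computes the scores $(e_i^\top X^\top M y)^2$ and compares them on and off the support; the particular sizes below are chosen only to make the experiment fast, not to match the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
d, b, gsz, h = 200, 10, 20, 2          # 10 groups of 20 features, 2 active groups
n, delta, gamma = 4000, 0.25, 0.1
rho = n ** (0.5 + delta)

groups = [np.arange(i * gsz, (i + 1) * gsz) for i in range(b)]
support = np.concatenate(groups[:h])   # k = 40 non-zero features
k = len(support)
w = np.zeros(d)
w[support] = 1.0 / np.sqrt(k)

X = rng.standard_normal((n, d))        # i.i.d. N(0, 1) features
y = X @ w + gamma * rng.standard_normal(n)

# M y via the Woodbury identity, avoiding the n x n inverse:
# (I + X_S X_S^T / rho)^{-1} y = y - X_S (rho I + X_S^T X_S)^{-1} X_S^T y
XS = X[:, support]
My = y - XS @ np.linalg.solve(rho * np.eye(k) + XS.T @ XS, XS.T @ y)
score = (X.T @ My) ** 2                # (e_i^T X^T M y)^2 for every feature i

off = np.setdiff1d(np.arange(d), support)
print(score[support].min(), score[off].max())
# In this low-noise, large-n regime every score on the support exceeds every
# score off the support, which is the separation asserted by equations 55-56.
```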
D.2 PROOF OF THEOREM 3.2

Proof of Theorem 3.2. Let $M := \big(\frac{XD(u^*)X^\top}{\rho} + I\big)^{-1} = \big(\frac{1}{\rho}X_1X_1^\top + I\big)^{-1} = \big(\frac{1}{\rho}X_2X_2^\top + I\big)^{-1}$. For each feature index $i \in \{1, 2, \ldots, d\}$, by $y = Xw^{(1)} + \epsilon$ we have that

$e_i^\top X^\top M y = e_i^\top X^\top M X w^{(1)} + e_i^\top X^\top M \epsilon. \quad (57)$

We first bound $e_i^\top X^\top M X w^{(1)}$ as follows. Applying Lemma D.1 to each feature index $i$ such that $w^{(1)}_i \neq 0$, and via a union bound, we have that there exists a universal constant $c'_1 \in (0, 1/3200)$ such that, with probability at least $1 - 3k\exp(-c'_1 n^{1-2\delta})$, it holds that

$\forall i: w^{(1)}_i \neq 0, \quad \frac{1}{\rho}\, e_i^\top X^\top M X w^{(1)} \in \big[0.9|w^{(1)}_i|,\ 1.1|w^{(1)}_i|\big] \subseteq \big[0.9\xi,\ 1.1\zeta^{1/4}\xi\big]. \quad (58)$

Similarly, applying Lemma D.1 to each feature index $i$ such that $w^{(2)}_i \neq 0$, and via a union bound, we have that with probability at least $1 - 3k\exp(-c'_1 n^{1-2\delta})$, it holds that

$\forall i: w^{(2)}_i \neq 0, \quad \frac{1}{\rho}\, e_i^\top X^\top M X w^{(1)} = \frac{1}{\rho}\, e_i^\top X^\top M X w^{(2)} \in \big[0.9|w^{(2)}_i|,\ 1.1|w^{(2)}_i|\big] \subseteq \big[0.9\xi,\ 1.1\zeta^{1/4}\xi\big]. \quad (59)$

Finally, applying Lemma D.2 to each feature index $i$ such that both $w^{(1)}_i = 0$ and $w^{(2)}_i = 0$, and via a union bound, we have that with probability at least $1 - 4(d-2k)\exp(-c'_1 n\xi^2)$, it holds that

$\forall i: w^{(1)}_i = 0 \wedge w^{(2)}_i = 0, \quad \Big|\frac{1}{\rho}\, e_i^\top X^\top M X w^{(1)}\Big| \le 0.1\xi. \quad (60)$

We then bound $e_i^\top X^\top M\epsilon$ as follows. Note that $M \preceq I$. Therefore, applying Lemma D.3 to each feature index $i$ (with $\tau = 0.1\xi\rho$), and via a union bound, we have that with probability at least $1 - 2d\exp(-n/8) - 2d\exp(-\xi^2 n^{2\delta}/(400\gamma^2))$, it holds that

$\forall i \in \{1, 2, \ldots, d\}, \quad \big|e_i^\top X^\top M \epsilon\big| \le 0.1\xi\rho. \quad (61)$

Now we condition on the event that all of equation 58, equation 59, equation 60, and equation 61 happen. By equation 57, we have that

$\forall i: w^{(1)}_i \neq 0 \vee w^{(2)}_i \neq 0, \quad e_i^\top X^\top M y \in \big[0.8\xi\rho,\ 1.2\zeta^{1/4}\xi\rho\big], \quad (62)$
$\forall i: w^{(1)}_i = 0 \wedge w^{(2)}_i = 0, \quad \big|e_i^\top X^\top M y\big| \in \big[0,\ 0.2\xi\rho\big]. \quad (63)$

We now set $\lambda_k = (0.2\xi\rho)^2$ and $\lambda_h = \big((0.79\xi\rho)^2 - \lambda_k\big)\cdot(k/h)$, and verify the conditions in Corollary 2.3 as follows:

1. Fix any group $g$ such that $z^*_g = 1$. For each $i \in g$ such that $u^*_i = 1$, we have $w^{(1)}_i \neq 0$ and therefore $(e_i^\top X^\top M y)^2 \ge (0.8\xi\rho)^2 > \lambda_k$ by equation 62. For each $i \in g$ such that $u^*_i = 0$, we have $w^{(1)}_i = w^{(2)}_i = 0$ and therefore $(e_i^\top X^\top M y)^2 \le (0.2\xi\rho)^2 = \lambda_k$ by equation 63.

2. Fix any group $g$ such that $z^*_g = 1$. By equation 62, we have $\sum_{i\in g: u^*_i=1}\big((e_i^\top X^\top M y)^2 - \lambda_k\big) \ge \big((0.8\xi\rho)^2 - \lambda_k\big)\cdot(k/h) > \lambda_h$.

3. Fix any group $g$ such that $z^*_g = 0$. By equation 62 and equation 63, we have
$\sum_{i\in g} \max\{(e_i^\top X^\top M y)^2 - \lambda_k, 0\} \ \le\ \big((1.2\zeta^{1/4}\xi\rho)^2 - \lambda_k\big)\cdot\frac{k}{\zeta h} \ \le\ \frac{1.44(\xi\rho)^2}{\sqrt{\zeta}}\cdot\frac{k}{h} \ <\ \lambda_h,$
where the last inequality holds for a large enough constant $\zeta > 1$.

Finally, the theorem is proved by collecting the failure probabilities of the desired events (equation 58, equation 59, equation 60, and equation 61).

D.3 TECHNICAL LEMMAS

Lemma D.1. Suppose $k \le n/4$. Let $X \in \mathbb{R}^{n\times k}$ be a matrix with i.i.d. $N(0,1)$ entries. Let $\rho = n^{1/2+\delta}$ ($\delta \ge 0$) and $M = (I + \frac{1}{\rho}XX^\top)^{-1}$. For any $\epsilon \ge 32 n^{\delta-1/2}$, any fixed vector $z \in \mathbb{R}^k$ such that $\|z\|_\infty \le 1$, and any fixed index $i \in \{1, 2, \ldots, k\}$, with probability at least $1 - 3\exp(-n^{1-2\delta}\epsilon^2/2048)$ it holds that

$\Big| e_i^\top \Big(\frac{1}{\rho} X^\top M X - I\Big) z \Big| \le \epsilon,$

where $e_i$ is the $i$-th (column) basis vector.

Proof. The proof follows similar lines to Part (1) of the proof of Lemma 2 in Pilanci et al. (2015); however, we adopt a different regularization parameter $\rho$. We write $X = UDV^\top$ for the singular value decomposition of $X$. By standard results on the singular values of Gaussian random matrices (e.g., Davidson & Szarek (2001)), for each $t \ge 0$ it holds that

$\Pr\big[\forall j \in \{1, 2, \ldots, k\}: \sqrt{n} - \sqrt{k} - t \le D_{jj} \le \sqrt{n} + \sqrt{k} + t\big] \ge 1 - 2\exp(-t^2/2).$

In particular, if we set $t = \sqrt{n}/4$, we have that

$\Pr\big[\forall j \in \{1, 2, \ldots, k\}: \sqrt{n}/4 \le D_{jj} \le 7\sqrt{n}/4\big] \ge 1 - 2\exp(-n/32). \quad (64)$

The rest of the proof is carried out conditioning on the successful event in equation 64. Note that $\frac{1}{\rho}X^\top M X = V(\rho I + D^2)^{-1}D^2 V^\top$, and therefore

$\frac{1}{\rho}X^\top M X - I = V\big[(\rho I + D^2)^{-1}D^2 - I\big]V^\top = V\tilde D V^\top,$

where we let $\tilde D := \mathrm{diag}\big(\{\frac{D_{jj}^2}{\rho + D_{jj}^2} - 1\}_{j\in\{1,\ldots,k\}}\big)$, so that $\tilde D_{jj} \le 0$ and $|\tilde D_{jj}| \le 16 n^{\delta-1/2}$ for all $j \in \{1, 2, \ldots, k\}$ due to the successful event in equation 64. Note that

$e_i^\top\Big(\frac{1}{\rho}X^\top M X - I\Big)z = e_i^\top V\tilde D V^\top z = \sum_j V_{ij}^2 \tilde D_{jj} z_i + \sum_j V_{ij}\tilde D_{jj}\sum_{q: q\neq i} V_{qj} z_q. \quad (65)$

It is easy to bound the first term in equation 65 by

$\Big|\sum_j V_{ij}^2 \tilde D_{jj} z_i\Big| \le 16 n^{\delta-1/2}. \quad (66)$

Let $V'$ be the $(k-1)\times k$ matrix obtained by removing the $i$-th row from $V$, and let $z'$ be the $(k-1)$-dimensional vector obtained by removing the $i$-th entry of $z$. We can rewrite the second term in equation 65 as

$\sum_j V_{ij}\tilde D_{jj}\sum_{q: q\neq i} V_{qj} z_q = e_i^\top V \tilde D V'^\top z'. \quad (67)$

Observe that, even when conditioned on $D$ (and therefore $\tilde D$), $e_i^\top V\tilde D V'^\top$ is a $(k-1)$-dimensional vector pointing in a uniformly random direction, and its $2$-norm is at most $16 n^{\delta-1/2}$. On the other hand, $z'$ is a fixed $(k-1)$-dimensional vector with $\|z'\|_2 \le \sqrt{k-1}$. Therefore, by a standard spherical concentration inequality, we have that

$\Pr\Big[\big|e_i^\top V\tilde D V'^\top z'\big| \le \epsilon - 16 n^{\delta-1/2}\Big] \ge 1 - \exp\big(-n^{1-2\delta}(\epsilon - 16n^{\delta-1/2})^2/512\big) \ge 1 - \exp\big(-n^{1-2\delta}\epsilon^2/2048\big). \quad (68)$

Combining equation 65, equation 66, equation 67, and equation 68, and collecting the probabilities, we prove the desired result.

Lemma D.2. Suppose $k \le n/4$. Let $X \in \mathbb{R}^{n\times k}$ be a matrix with i.i.d. $N(0,1)$ entries, and let $u \in \mathbb{R}^n$ be a column vector with i.i.d. $N(0,1)$ entries. Let $\rho = n^{1/2+\delta}$ ($\delta \ge 0$) and $M = (I + \frac{1}{\rho}XX^\top)^{-1}$. For any $\epsilon \in (0,1)$ and any fixed vector $z \in \mathbb{R}^k$ such that $\|z\|_2 \le 1$, with probability at least $1 - 4\exp(-n\epsilon^2/32)$ it holds that $\big|\frac{1}{\rho}u^\top M X z\big| \le \epsilon$.

Proof. The following is based on a standard calculation, which also appears in Part (2) of the proof of Lemma 2 in Pilanci et al. (2015). Write $X = UDV^\top$ for the singular value decomposition of $X$. Again, we have equation 64 and condition on its successful event for the rest of the proof. Note that $\frac{1}{\rho}MXz = U(\rho D^{-1} + D)^{-1}V^\top z$, and therefore $\big\|\frac{1}{\rho}MXz\big\|_2 \le 4/\sqrt{n}$ under the successful event in equation 64. Thus, conditioned on $X$, $\frac{1}{\rho}u^\top MXz$ is a centered Gaussian with standard deviation at most $4/\sqrt{n}$, and the probability that $\big|\frac{1}{\rho}u^\top MXz\big| > \epsilon$ is at most $2\exp(-\epsilon^2 n/32)$ by the standard Gaussian tail bound.
The lemma is proved by collecting the failure probabilities.

Lemma D.3. Let $u \in \mathbb{R}^n$ be a column vector with i.i.d. $N(0,1)$ entries. Let $M \in \mathbb{R}^{n\times n}$ be any PSD matrix (which might depend on $u$) whose eigenvalues are at most $1$. Let $\epsilon \in \mathbb{R}^n$ be an independent column noise vector with i.i.d. $N(0,\gamma^2)$ entries. For any $\tau > 0$, with probability at least $1 - 2\exp(-n/8) - 2\exp(-\tau^2/(4\gamma^2 n))$ it holds that $|u^\top M\epsilon| \le \tau$.

Proof. By standard $\chi^2$ concentration results, with probability at least $1 - 2\exp(-n/8)$ it holds that $\|u\|_2^2 \le 2n$. The rest of the proof is carried out conditioning on this event. Note that $\|Mu\|_2 \le \|u\|_2$. Therefore, $u^\top M\epsilon \sim N(0, \gamma^2\|Mu\|_2^2)$ conditioned on $u$. By the standard Gaussian tail bound, we have that $\Pr[|u^\top M\epsilon| \ge \tau] \le 2\exp(-\tau^2/(2\gamma^2\|Mu\|_2^2)) \le 2\exp(-\tau^2/(4\gamma^2 n))$. The lemma is proved by collecting the failure probabilities.

E OPTIMIZATION

We use the projected quasi-Newton (PQN) method to solve the optimization problem $L_{BR}$ defined in equation 10. The details of PQN are elaborated in Algorithm 1 of Schmidt et al. (2009), so we refer the interested reader to Schmidt et al. (2009). To apply PQN, we need the gradient of the objective function in equation 10. The partial gradient of $G(u)$ with respect to $u_i$ can be written as

$\frac{\partial G(u)}{\partial u_i} = -\frac{1}{2\rho}\Big( X_i^\top \Big(\frac{XD(u)X^\top}{\rho} + I\Big)^{-1} y \Big)^2. \quad (69)$

Computing such a gradient requires the solution of a rank-$\|u\|_0$ linear system of size $n$, which can be calculated in time $O(\|u\|_0^3) + O(nd)$ via the QR decomposition. When the sparsity level $k$ is relatively small, such computation is not expensive. We also need the following projection in PQN:

$\min_{x\in\Omega} \|x - y\|_2^2, \quad (70)$

where $\Omega$ is defined in Section 2.2. The projection onto the relaxed constraint set $\Omega$ can be efficiently computed by a commercial solver (we use Gurobi Gurobi Optimization, LLC (2022)).
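PQN needs two ingredients besides the projection: the value of the relaxed least-squares objective in equation 10 and its partial gradients in equation 69. The sketch below is ours (the paper's released code and its Gurobi-based projection are not reproduced); it evaluates both and confirms equation 69 against a finite-difference approximation.

```python
import numpy as np

def lbr_value_and_grad(u, X, y, rho):
    """Relaxed least-squares objective of equation 10 and its gradient from
    equation 69: G(u) = 0.5 * y^T (X D(u) X^T / rho + I)^{-1} y and
    dG/du_i = -(1/(2 rho)) * (X_i^T (X D(u) X^T / rho + I)^{-1} y)^2."""
    n = X.shape[0]
    B = np.eye(n) + (X * u) @ X.T / rho      # X D(u) X^T / rho + I
    v = np.linalg.solve(B, y)                # v = B^{-1} y (the dual variable)
    value = 0.5 * y @ v
    grad = -(X.T @ v) ** 2 / (2.0 * rho)
    return value, grad

# quick finite-difference check of equation 69
rng = np.random.default_rng(4)
n, d, rho = 30, 8, 2.0
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
u = rng.uniform(0, 1, d)                     # a fractional point in [0, 1]^d
val, grad = lbr_value_and_grad(u, X, y, rho)
i, eps = 3, 1e-6
val_eps, _ = lbr_value_and_grad(u + eps * np.eye(d)[i], X, y, rho)
print(grad[i], (val_eps - val) / eps)        # the two numbers agree
```

For small $\|u\|_0$ the linear solve can be replaced by the Woodbury identity to obtain the $O(\|u\|_0^3) + O(nd)$ cost quoted above, and the projection of equation 70 is a quadratic program over the polyhedron assembled after Lemma B.1, which any QP solver (Gurobi in the paper) can handle.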
F NUMERICAL VERIFICATION OF COROLLARY 2.3

Corollary 2.3 states that, under the least-squares regression setting and certain conditions, the solution of the relaxed program is integral and coincides with the optimal solution of the problem with Boolean constraints. In this section, we numerically verify the equivalence established in Corollary 2.3. We generate synthetic data following Random Ensemble I, where we set $d = 200$, $b = 10$, $k = 50$ ($k_i = 10$, $i \in \{1, 2, \ldots, 5\}$), and $h = 5$. We vary the sample size $n$ and the SNR (controlled by $\gamma$) to see whether the solution of the relaxed program is the optimal integral solution of the program with Boolean constraints. For each $n$ and SNR, we randomly generate 10 datasets, run our relaxed program on them, and count the number of solutions that coincide with the integral solutions of the program with Boolean constraints. As shown in Table S1, the proposed equivalence is achieved when the sample size and SNR are large.

Table S1: Number of solutions that coincide with the integral solutions of the program with Boolean constraints.

SNR | n = 100 | n = 200 | n = 300 | n = 1,000
40  | 0/10    | 10/10   | 10/10   | 10/10
30  | 0/10    | 10/10   | 10/10   | 10/10
20  | 0/10    | 0/10    | 0/10    | 10/10

G ADDITIONAL EXPERIMENTS FOR RANDOM ENSEMBLE I

We compare our proposed method with other state-of-the-art methods on simulation data generated from Random Ensemble I with different values of $\gamma$. We first set $\gamma = 3.5$ and compare the performance of the competing methods on recovering the individual and group supports. Note that when $\gamma = 3.5$, the signal-to-noise ratio is around 3. We compare the competing methods on 10 different datasets generated from Random Ensemble I with $\gamma = 3.5$. As illustrated in Fig. S1 (a) and (b), our proposed method still outperforms all the competing methods in terms of both AI and AG as the number of samples increases.

Figure S1: Performance comparison for different values of $\gamma$. (a) AI as $n$ increases when $\gamma = 3.5$. (b) AG as $n$ increases when $\gamma = 3.5$. We average results over 10 datasets and the error bars denote 95% confidence intervals.

H EXPERIMENTS FOR SYNTHETIC DATA SATISFYING THE MUTUAL INCOHERENCE CONDITION

In Random Ensemble I, we generate $x_i \sim N(0, I)$ as i.i.d. features. In this experiment, we aim to evaluate the performance of the competing methods in the presence of correlation between features. We follow the data-generation procedure of Random Ensemble I, where we set $d = 1000$, $b = 10$, $k = 50$ ($k_i = 10$, $i \in \{1, 2, \ldots, 5\}$), $h = 5$, and $\gamma = 0.1$. The only difference is that we generate $x_i \sim N(0, \Sigma)$, where $\Sigma$ is the Toeplitz covariance matrix with $\Sigma_{ij} = p^{|i-j|}$. Such matrices satisfy the mutual incoherence condition required by $\ell_1$-regularized estimators to be statistically consistent. We consider $p = 0.2$ and $p = 0.7$ for $\Sigma$ to evaluate the competing methods' performance under different correlation levels. The performance is shown in Fig. S2. We find that when $p = 0.2$, support recovery (at both the individual level, Fig. S2(a), and the group level, Fig. S2(b)) can be easily achieved by most of the models, and our model outperforms the other competing methods. However, when $p = 0.7$, all models have difficulty accurately recovering the support at the individual level (Fig. S2(c)), which makes sense because the correlations between features are high, so finding the correct features becomes more challenging. Our model still outperforms the other competing methods at both the individual level (Fig. S2(c)) and the group level (Fig. S2(d)).

Figure S2: Performance comparison for different values of $p$. (a) AI as $n$ increases when $p = 0.2$. (b) AG as $n$ increases when $p = 0.2$. (c) AI as $n$ increases when $p = 0.7$. (d) AG as $n$ increases when $p = 0.7$. We average results over 10 datasets and the error bars denote 95% confidence intervals for (a)–(d).

I ADDITIONAL EXPERIMENTS FOR CANCER DRUG RESPONSE PREDICTION

We first show Table S3, which lists the pathways and genes identified by the proposed method for the drug IMATINIB and the corresponding studies that support the findings. We further identify the targeted pathways and genes for three other drugs: GEFITINIB, BEXAROTENE, and BOSUTINIB. In Table S4, we report the number of targeted pathways and genes identified by each competing method and the out-of-sample MSE. As shown, our method achieves the smallest out-of-sample MSE and selects the fewest pathways and genes.

J TIME COMPARISON FOR RANDOM ENSEMBLE I

We compare the running times for the experiments of Random Ensemble I when the sample size is 1,000 in Table S2.

Table S2: Time comparison (mean ± s.d.) for Random Ensemble I when the sample size is 1,000.

Method   | SGL       | SGL_      | SGCover    | ENet       | Proposed
Time (s) | 1.4 ± 0.1 | 4.2 ± 0.2 | 20.8 ± 0.9 | 0.2 ± 0.03 | 24.4 ± 2.1

K IMPLEMENTATION DETAILS

For the subsection "Feature selection with given support sizes k and h" of Section 4.1 and for Section 4.2, we select the parameters as follows.
For our method, because $k$ and $h$ are given, the only remaining parameter is $\rho$ in (1), which we select by 5-fold cross-validation in terms of MSE. For the methods that cannot control $k$ and $h$ directly, we sweep their parameters so that they yield the desired $k$ and $h$. For the real-world application, we select the parameters in terms of out-of-sample MSE.

L CODE AVAILABILITY

The code for the proposed method can be found here: https://anonymous.4open.science/r/L0GL-F107/Readme

Table S3: Pathways and genes identified by the proposed method for IMATINIB.

Pathway | Genes | Reference
RHO GTPases Activate WASPs and WAVEs | ARPC1B, WASF1, ARPC5, WASL, CYFIP1, ACTG1, ACTR3 | Gu et al. (2009); Huang et al. (2008); Chen et al.
Regulation of PTEN gene transcription | LAMTOR3, LAMTOR4, SNAI1, RPTOR, RRAGA, RRAGB, MBD3, RRAGD, PHC3, GATAD2A, RCOR1, MECOM, CBX8, LAMTOR2 | Nishioka et al. (2011); Peng et al. (2010); Huan
Signaling by PDGF | PDGFC, COL4A3, COL6A2, COL6A3, COL9A3 | Malavaki et al. (2013); Li et al. (2006); Heldin
Retinoid metabolism and transport | CLPS, LRP8, APOC3, SDC4, LPL, LRP10, LRP12, APOA2 | Hoang et al. (2010)
TCF transactivating complex | RBBP5, KAT5, PYGO1, PYGO2, BCL9 | Zhang et al. (2021); Coluccia et al. (2007); Cor
Deactivation of the beta-catenin transactivating complex | RBBP5, SOX3, SRY, PYGO1, PYGO2, CBY1, BCL9 | Zhou et al. (2003); Leo et al. (2013)
RAS processing | ZDHHC9, GOLGA7, BCL2L1, ABHD17B | Chung et al. (2006); Braun & Shannon (2008)

Table S4: Result comparison for three other drugs.

Drug | Method | k (s.d.) | h (s.d.) | Out-of-sample MSE (± 95% CI)
BOSUTINIB | Proposed method | 36 | 5 | 29.4 ± 2.1
BOSUTINIB | SGL-Overlap | 64 (4.3) | 9 (0.9) | 42.6 ± 2.3
BOSUTINIB | ENet | 48 (5.3) | 18 (3.2) | 35.4 ± 3.1
BOSUTINIB | SGCover | 240 (10.4) | 15 (1.8) | 52.4 ± 4.2
GEFITINIB | Proposed method | 42 | 6 | 35.4 ± 1.4
GEFITINIB | SGL-Overlap | 78 (4.6) | 13 (1.3) | 4.78 ± 2.1
GEFITINIB | ENet | 49 (5.6) | 15 (2.1) | 38.7 ± 2.9
GEFITINIB | SGCover | 278 (13.4) | 12 (2.4) | 59.6 ± 3.6
BEXAROTENE | Proposed method | 52 | 8 | 36.9 ± 1.8
BEXAROTENE | SGL-Overlap | 86 (3.8) | 15 (1.5) | 61.8 ± 2.7
BEXAROTENE | ENet | 64 (5.2) | 17 (3.5) | 38.4 ± 2.3
BEXAROTENE | SGCover | 312 (16.3) | 18 (2.3) | 58.2 ± 3.1