# bilevel_learning_of_the_group_lasso_structure__9a66d929.pdf Bilevel Learning of the Group Lasso Structure Jordan Frecon ,1 Saverio Salzo ,1 Massimiliano Pontil1,2 1 Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia (Italy) 2 Department of Computer Science, University College London (UK) Regression with group-sparsity penalty plays a central role in high-dimensional prediction problems. However, most existing methods require the group structure to be known a priori. In practice, this may be a too strong assumption, potentially hampering the effectiveness of the regularization method. To circumvent this issue, we present a method to estimate the group structure by means of a continuous bilevel optimization problem where the data is split into training and validation sets. Our approach relies on an approximation scheme where the lower level problem is replaced by a smooth dual forward-backward algorithm with Bregman distances. We provide guarantees regarding the convergence of the approximate procedure to the exact problem and demonstrate the well behaviour of the proposed method on synthetic experiments. Finally, a preliminary application to genes expression data is tackled with the purpose of unveiling functional groups. 1 Introduction With recent technological advances, high-dimensional datasets have become massively widespread in numerous applications ranging from social sciences to computational biology [20, 25, 1]. In addition, in many statistical problems, the number of unknown parameters can be significantly larger than the number of data samples, thus leading to underdetermined and computationally intractable problems. Nonetheless, many classes of datasets exhibit a sparse representation when expressed as a linear combination of suitable dictionary elements. This has led, over the past decades, to the development of sparsity inducing norms and regularizers to unveil structure in the data. However, beyond the sparsity patterns of the data, there might also be a more complex structure, which is widely referred to as structured sparsity [16, 22, 23, 29]. In this line of research, a lot of work has been devoted to encode a priori structure of the data in (possibly overlapping) groups or hierarchical trees [31, 15, 17]. In the present paper, we restrict our study to the popular Group Lasso problem [30]. Given a vector of outputs y RN and a design matrix X RN P , the Group Lasso problem amounts in finding ˆw argmin w RP 1 2 y Xw 2 + λ l=1 w Gl 2, (1) for some regularization parameter λ > 0 and a non-overlapping group structure, i.e., an unordered partition of the features in L groups {G1, . . . , GL} such that L l=1Gl = {1, . . . , P} and ( l = l ), Gl Gl = . The specific form of the regularizer permits to enforce sparsity at the group-level, thus often leading to a better interpretability of the features than standard Lasso. However, in many applications, we might have hundreds or thousands of features whose groupstructure {G1, . . . , GL} may be unknown, or only partially known. In addition, the number of groups L itself might not be known. Nonetheless, the prior knowledge of the group structure is crucial in order to achieve a lower prediction error [19]. Note that this problem can be seen as purely Equal contribution. 32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada. combinatorial since it amounts in searching the best partition amongst (LP /L!) possible unordered partitions. In this paper we address the problem of learning the Group Lasso structure, within the setting of multi-task learning, through a bilevel optimization approach. We establish some basic mathematical properties of this methodology and demonstrate that it works well in practice. Related works. We are only aware of few approaches devoted to infer the Group Lasso structure. A probabilistic modeling approach has been investigated in [14] to learn the relevance of pairs of features only. More recently, [27] considered a broad family of heavy-tailed priors for the group variables along with a variational inference scheme to learn the parameters of these priors. However, this approach becomes prohibitive when dealing with a large number of features. In addition, the setting is different: it analyzes the latent Group Lasso and it assumes that the learning tasks have a similar structure, meaning that the relevance of a given group is largely shared across the tasks. Contributions and outline. The principal contribution of this paper is the formulation of the problem of learning the Group Lasso structure as a continuous bilevel optimization problem. In Section 2, we present, in a formal way, our bilevel approach. A new algorithmic solution based on an upper stochastic gradient descent and a lower dual forward-backward scheme with Bregman distances is devised in Section 3. The performance of the proposed approach are quantitatively assessed on synthetic data in Section 4, and shown to favorably compare against standard approaches. In addition, an application to real data in the context of gene expression analysis is provided with the goal of discovering functional groups. Finally, conclusions and perspectives are drawn in Section 5. Notations. Let X be an Euclidian space. Γ0(X) denotes the space of functions h: X ] , + ] closed, proper and convex. We also denote by argmin h the set of minimizers of h or the minimizer of h when it is unique. 2 Proposed Bilevel Problem for Learning the Groups In this section, we describe a bilevel framework for estimating the Group Lasso structure based on a multi-task learning problem, without any further a priori information. 2.1 Original Problem We encapsulate the group structure by means of an hyperparameter θ = [θ1 . . . θL] {0, 1}P L, defining at most L groups, such that ( p {1, . . . , P}, l {1, . . . , L}), θp,l = 1 if the p-th feature belongs to the l-th group, and 0 otherwise. Note that when no prior information on the number of groups is given, one should consider the extreme setting where there might be at most L = P groups. In order to properly select θ, we propose to consider the following bilevel problem. Problem 2.1 (Mixed Integer-Continuous Bilevel Problem.). Given some vectors of outputs yt RN and design matrices Xt RN P for t {1, . . . , T}, as well as a regularization parameter λ > 0, find ˆθ argmin θ {0,1}P L C(ˆw(θ)) s.t. l=1 θl = 1P , (2) where C(ˆw(θ)) = (1/T) PT t=1 Ct( ˆwt(θ)), Ct : RP R is a smooth function and ˆw(θ) = ( ˆw1(θ), . . . , ˆw T (θ)) is a minimizer of T separated Group Lasso problems sharing a common group structure, i.e., it solves minimize (w1,...,w T ) RP T 1 T 1 2 yt Xtwt 2 + λ l=1 θl wt 2 where denotes the Hadamard product. The constraint in the right-hand side of Problem (2) ensures that every feature belongs to a single group. A natural choice for Ct is to consider the validation error Ct( ˆwt(θ)) = 1 2 y(val) t X(val) t ˆwt(θ) 2 evaluated on a set {y(val), X(val)}. For such choice, the selection of θ is motivated by the need of generalizing well to unseen data. In practice, it is often a good surrogate to the estimation error when the true features are unknown. We note that directly solving Problem 2.1 is a challenge, since it is a mixed integer-continuous bilevel problem. To overcome this difficulty, we consider in the next section a relaxation of the problem in the continuous setting. 2.2 Relaxed Problem We propose to consider a continuous relaxed version of Problem 2.1 where θ [0, 1]P L and the penalty term ϵ 2T PT t=1 wt 2 2 is added to (3), for some ϵ > 0, in order to ensure strong convexity of the lower level objective function and hence the uniqueness of its minimizer. The resulting problem lies within the framework of continuous bilevel optimization [10] which has recently been gaining a renewed interest in image processing [18, 5] as well as in neural networks and machine learning (see e.g. [4, 21, 24, 12]). Our relaxation of Problem 2.1 is formulated as follows. Problem 2.2 (Exact Bilevel Problem). Let C be as in Problem 2.1 and let ψ, ξ : [0, 1] R be increasing continuous functions such that ψ(0) = ξ(0) = 0 and ψ(1) = ξ(1) = 1. Given some vectors of outputs yt RN and design matrices Xt RN P for t {1, . . . , T}, as well as some regularization parameters λ > 0 and ϵ > 0, solve minimize θ Θ U(θ) with ˆw(θ) = argminw RP T L(w, θ), U(θ) = C(ˆw(θ)), (4) L(w, θ) := 1 t=1 Lt(wt, θ), Lt(wt, θ) = 1 2 yt Xtwt 2 2 + ϵ 2 wt 2 2 + λ l=1 ψ(θl) wt 2, Θ = θ = [θ1 . . . θL] [0, 1]P L l=1 ξ(θl) = 1P (5) and ψ and ξ are applied component-wise to the vectors θl s. Remark 2.1. The functions φ and ξ permit to cover different continuous relaxations of Problem 2.1 and the conditions ψ(0) = ξ(0) = 0 and ψ(1) = ξ(1) = 1 are compatibility conditions with Problem 2.1. Among the different choices of ψ and ξ we point out ψ = ξ = Id, which corresponds to a convex relaxation in which Θ is the unit simplex. The following result establishes the existence of solution of Problem 2.2. The proof is given in the supplementary material (Section A.1). Proposition 2.1 (Existence of Solutions). Suppose that Θ is a compact nonempty subset of RP L + and that C and ψ are continuous functions. Then θ 7 ˆw(θ) is continuous and hence Problem 2.2 admits solutions. 2.3 Approximate Problem Usually, we don t have a closed form expression for ˆw(θ) but we rather have an iterative procedure converging to ˆw(θ) that we arbitrarily stop after Q iterations. Therefore, we actually solve an approximate problem of the following form. Problem 2.3 (Approximate Bilevel Problem). Let C and Θ be as in Problem 2.2. Given two mappings A and B, as well as a maximum number of inner iterations Q N, solve minimize θ Θ U(Q)(θ), where u(0)(θ) is chosen arbitrarily for q = 0, 1, . . . , Q 1 u(q+1)(θ) = A(u(q)(θ), θ) w(Q)(θ) = B(u(Q)(θ), θ), U(Q)(θ) = C(w(Q)(θ)). Remark 2.2. Problem 2.3 encompasses many situations encountered in practice. For example, when B = Id, it reduces to the usual case where w(q+1)(θ) = A(w(q)(θ), θ). In addition, this formulation also covers dual algorithms: in this case A corresponds to the dual variable update, and B denotes the primal-dual relationship (see, e.g., [3]). The following theorem gives the conditions under which the approximate problem converges to the exact one as the number of inner iterations Q grows. Theorem 2.1 (Convergence of the Approximate Problem). In addition to the assumptions of Problem 2.3, suppose that the iterates {w(Q)(θ)}Q N converge to ˆw(θ) uniformly on Θ as Q + . Then the approximate Problem 2.3 converges to the exact Problem 2.2 in the following sense inf θ Θ U(Q)(θ) Q + inf θ Θ U(θ) and argmin θ Θ U(Q)(θ) Q + argmin θ Θ U(θ), (7) where the latter convergence is meant as set convergence, i.e., for every sequence (ˆθ(Q))Q N such that ˆθ(Q) argmin U(Q), we have dist(ˆθ(Q), argmin U) 0 as Q + , which is equivalent to max{dist(ˆθ, argmin U) | ˆθ argmin U(Q)} 0 as Q + . Theorem 2.1 justifies the minimization of U(Q) (for sufficiently large Q) instead of U. 3 Algorithmic Solution The lower level problem in (4)-(5), can be, in principle, addressed by several available solvers. However, since this problem is nonsmooth, these solvers are usually nonsmooth as well, that is A and B in (6) are nonsmooth. This causes U(Q) to be nonsmooth, besides being nonconvex. In that case, minimizing U(Q) is a challenge. Indeed, even just determining a (hyper)subgradient of U(Q) in a stable fashion by recursively computing a subgradient of u(q)(θ) might be hopeless. Therefore, we embrace the idea proposed in [24] to devise a smooth algorithm by relying on Bregman proximity operators and we make two advances. First, we propose a new algorithm based on a dual forwardbackward scheme with Bregman distances where A and B are smooth. Second, by relying on [2], we prove the uniform convergence of such algorithm to the solution of the lower level problem, so to meet the requirements of Theorem 2.1. This approach finally gives a smooth function U(Q) whose gradient can be recursively computed by applying the standard chain rule [13]. 3.1 Principle Since the proposed bilevel problem is a nonconvex problem with possibly many minima, finding the global optimum is out of reach. However, local minima can still be of high quality, meaning that no improvements in the objective can be obtained by small perturbations and that the corresponding objective value is close to the infimum. Let us remark that, since in the parametrization of the groups the ordering is not relevant, the upper level objective function is invariant under permutations of (θ1, . . . , θL), so there are L! equivalent solutions. In order to solve the bilevel problem, we rely on the following projected gradient descent algorithm ( k {0, . . . , K 1}), θ(k+1) = PΘ θ(k) γ U(Q)(θ(k)) , (8) where PΘ denotes the projection onto Θ (see [8] for an efficient projection method when Θ is the unit simplex) and γ > 0 is a given step-size. Overall, this procedure requires to compute the Q-th iterate w(Q)(θ(k)) as well as the hypergradient U(Q)(θ(k)). Since both the lower and upper level problems are separable with respect to the tasks, the hypergradient is the sum of T terms. In Section 3.4, we design a stochastic variant of (8) taking advantage of this structure. 3.2 Solving the Lower Level Problem In this section, we address the lower level problem in (4)-(5). Since it is is separable with respect to the tasks, without loss of generality we can deal with a single task omitting the index t. Problem 3.1. Given some vectors of outputs y RN, a design matrix X RN P , regularization parameters λ > 0 and ϵ > 0, as well as some group structure θ Θ, find ˆw(θ) = argmin w RP n L(w, θ) := 1 2 y Xw 2 2 + ϵ 2 w 2 2 | {z } f(w) l=1 ψ(θl) w 2 | {z } g(Aθw) where f Γ0(RP ) is smooth and ϵ-strongly convex, g Γ0(RP L) is nonsmooth and Aθ is the linear operator defined as Aθ : w RP 7 (ψ(θ1) w, . . . , ψ(θL) w) RP L. Let us note that in order to solve Problem 3.1 we cannot use the standard forward-backward algorithm [6, 7] since the proximity operator of g Aθ cannot be computed in closed form. Moreover, we also ask for a smooth algorithm, meaning one for which A and B in (6) are smooth. Therefore, we tackle the dual of Problem 3.1. Problem 3.2. Find a solution ˆu(θ) of minimize u RP L f ( A θ u) + g (u), (10) where f and g denote the Fenchel conjugates of f and g respectively, and where A θ is the transpose of the operator Aθ, that is, A θ : u RP L 7 PL l=1 ψ(θl) ul RP . Note that the dual Problem 3.2 admits a solution, since strong duality holds and the primal Problem 3.1 has solutions [3]. Moreover it is a smooth constrained convex optimization problem. Indeed, since f is closed and ϵ-strongly convex, it follows that f is everywhere differentiable with ϵ 1Lipschitz continuous gradient and hence [f ( A θ )] = Aθ f ( A θ ) is Aθ 2ϵ 1-Lipschitz continuous. Besides, we have f = ( f) 1 = (X X + ϵId P ) 1( + X y). On the other hand, g is the indicator function of the product of L balls B2(λ) . . . B2(λ) := B2(λ)L, i.e., g (u) = PL l=1 ıB2(λ)(ul), where B2(λ) is the closed ball of RP centered at zero and of radius λ. We propose to solve Problem 3.1 by applying a forward-backward algorithm with Bregman distances to the dual Problem 3.2 [2, 28] and using the primal-dual link w = f ( A θ u). This algorithm calls for a Bregman proximity operator of g which can be made smooth with an appropriate choice of the Bregman distance. In the following, we provide the related details. Definition 3.1 (Bregman Proximity Operator [28]). Let X be an Euclidean space, h Γ0(X) and let Φ Γ0(X) be a Legendre function. Then, the Bregman proximity operator (in Van Nguyen sense) of h with respect to Φ is proxΦ h (v) = argmin u X h(u) + Φ(u) u, v . (11) The dual forward-backward algorithm with Bregman distances (FBB) for Problem 3.1 is as follows. Given some step-size γ > 0 and u(0)(θ), then for q = 0, 1, . . . , Q 1 u(q+1)(θ) = proxΦ γg Φ(u(q)(θ)) + γAθ f ( A θ u(q)(θ)) w(Q)(θ) = f ( A θ u(Q)(θ)). The updating rules in (12) define the mappings A and B in Problem 2.3. We note that in this case, since f is an affine mapping, B is smooth. Whereas the smoothness of A depends on the choice of the Legendre function Φ. We consider Φ(u) = PL l=1 φ(ul) with dom φ = B2(λ), so that, for every l {1, . . . , L}, u(q+1) l (θ) = proxφ γıB2(λ) φ(u(q) l (θ)) + γψ(θl) f ( A θ u(q)(θ) = argmin u RP ıB2(λ)(u) + φ(u) u, φ(u(q) l (θ)) + γψ(θl) f ( A θ u(q)(θ)) = argmin u RP φ(u) u, φ(u(q) l (θ)) + γψ(θl) f ( A θ u(q)(θ)) = φ ( φ(u(q) l (θ)) + γψ(θl) f ( A θ u(q)(θ))). (13) Therefore, in order to make A smooth, we need to choose the Legendre function φ so that φ is twice differentiable. Here, we propose to resort to the following function. Definition 3.2 (Separable Hellinger-like Function [2]). The separable Hellinger-like function is defined as Φ(u) = PL l=1 φ(ul) where for every ul RP , λ2 ul 2 2, if ul B2(λ), + , otherwise. (14) Algorithm 1 Dual forward-backward with Bregman distances: FBB-GLasso(y, X, λ, θ) Require: Data y, design matrix X, regularization parameter λ and group-structure θ Set the number of iterations Q N and the step-size γ < ϵλ 1 Aθ 2. Set L to the number of groups in θ Initialize u(0)(θ) 0P L for q = 0 to Q 1 do w(q)(θ) = (X X + ϵId P ) 1 X y PL l=1 ψ(θl) u(q) l (θ) for l = 1 to L do v(q+1) l (θ) = u(q) l (θ) λ2 uq l (θ) 2 + γψ(θl) w(q)(θ) u(q+1) l (θ) = λ q 1+ v(q+1) l (θ) 2 2 v(q+1) l (θ) end for end for w(Q)(θ) = (X X + ϵId P ) 1 X y PL l=1 ψ(θl) u(Q) l (θ) output w(Q) For such choice, we have that for every v RP , φ (v) = λv/ p 1 + v 2 2. The corresponding forward-backward scheme with Bregman distance is given in Algorithm 1 where, for the sake of readability, we introduced the primal iterates w(q)(θ) and the auxiliary variables v(q) l (θ), denoting the argument of φ in (13). The following theorem addresses the convergence of Algorithm 1. The corresponding proof is given in the supplementary material (Section A.2). Theorem 3.1 (Convergence of the Dual FBB Scheme). The sequence {w(Q)(θ)}Q N generated by Algorithm 1 converges to the solution ˆw(θ) of Problem 3.1 for any step-size 0 < γ < ϵλ 1 Aθ 2. In addition if γ = ϵλ 1 Aθ 2/2, then ( Q N) 1 2 w(Q)(θ) ˆw(θ) 2 2 2λϵ 2 Q Aθ 2DΦ(ˆu(θ), u(0)), (15) where DΦ is the Bregman distance associated to Φ, i.e., ( u dom Φ, v int dom Φ), DΦ(u, v) = Φ(u) Φ(v) Φ(v), u v . (16) Remark 3.1. Since ran(ψ) [0, 1], Aθ 2 L. If ψ2 ξ, then Aθ 1; equality is obtained when ψ2 = ξ. Therefore, since DΦ( , u(0)) is continuous on dom Φ = B(λ)L, Aθ 2DΦ(ˆu(θ), u(0)) in (15) can be uniformly bounded from above on Θ. Theorem 3.1 and Remark 3.1 establish that {w(Q)(θ)}Q N converges to ˆw(θ) uniformly on Θ as Q + with a sublinear rate. This result applies to every task of the lower level objective in Problem 2.2 and hence it also applies to the collection of tasks {w(Q)(θ)}Q N and ˆw(θ). Therefore, the requirements of Theorem 2.1 are met and the solutions of Problems 2.3 converge to the solutions of Problem 2.2 as Q + . 3.3 Computing the Hypergradient In this section, we discuss the computation of the (hyper)gradient of U(Q). It follows from (6) that, for every θ Θ, U(Q)(θ) = 1 t=1 [(w(Q) t ) (θ)] | {z } R(P L) P Ct(w(Q) t (θ)) | {z } RP RP L, (17) where Ct(w(Q) t (θ)) = X(val) t (X(val) t w(Q) t (θ) y(val) t ). In equation (17) the derivative (w(Q) t ) (θ) can be computed by recursively differentiating the formulas in (12). This is the so-called forward mode for the computation of the hypergradient. However in our setting, this requires storing the derivatives (u(q)) (θ) which have size (P L) (P L). Here, since we are interested in the product [(w(Q) t ) (θ)] Ct(w(Q) t (θ)) we implement the reverse mode differentiation [13] (see also [11]) which gives a more efficient procedure that only requires storing matrices of size P L. The details are given in Algorithm 2 in the supplementary material (Section B.1). Finally, as suggested in [13, Chapter 15] and more recently in [24], we implement a variant of Algorithm 2 in which all the derivatives of the mapping A are evaluated at the last iterate u(Q) (instead of varying during the iterations), This reduces the execution time and memory requirements. In our experiments, we observe that the hypergradient is left unchanged by this operation as long as Q is large enough. 3.4 Solving the Approximate Bilevel Problem Since the hypergradient in (17) has the form of a sum of T terms, each one depending on a single task, we implement a stochastic solver, by estimating the hypergradient U(Q) on a single task chosen at random. Here, we resort to the prox SAGA algorithm [26], which is a nonconvex proximal variant of SAGA [9]. The details are given in the supplementary material (Section B.2). In the following we provide the related convergence theorem. Theorem 3.2 (Convergence of the Proposed Bilevel Scheme). Let β be the Lipschitz constant of U(Q) and γ 1/(5βT). Let {θ(k)}K k=1 be generated according to Algorithm 4 in the supplementary material (Section B.2). Then, for k uniformly sampled from {1, . . . , K}, the following holds: E h Gγ(θ( k)) 2i 50βT 2 5T 2 U(Q)(θ(0)) U(Q)(θ ) where θ is a minimizer of U(Q) and Gγ is the gradient mapping θ PΘ(θ γ U(Q)(θ) . (19) We note that computing the Lipschitz constant β is out of reach. Thus, we choose γ small enough such that the algorithm converges. Finally, we suggest the initialization θ(0) = PΘ(L 11P L+n) where n N(0P L, 0.1L 11P L) in order to be as uninformative as possible regarding the group structure while still breaking the symmetry by adding a small perturbation. 4 Numerical Experiments In this section, we first devise synthetic experiments to illustrate and assess the performance of the proposed method. Then, we tackle a real-data experiment in the context of gene expression analysis. A MATLAB R toolbox is available upon request to the authors. 4.1 Synthetic Experiments Experimental setting. We consider the setting where N = 50, P = 100, and the group structure θ is made of L = 10 groups equally distributed over the features such that ( p {1, . . . , P}, l {1, . . . , L }), θ p,l = 1 if p Gl := {1 + (l 1)(P/L ), . . . , l(P/L )} and 0 otherwise. In addition, if not stated otherwise, we fix T = 500 tasks and every regressor w t is set to have non-zero coefficients equal to 1 in at most 2 groups chosen at random. Both the training, validation and test sets are synthesized as follow. For every task t {1, . . . , T}, the design matrix Xt RN P is first drawn from a standard normal distribution N(0N P , 1N P ) and then normalized column-wise. Finally we define the vector of outputs yt = Xtw t + n where n N(0N, 0.31N). We consider the convex relaxation pointed in Remark 2.1, set (Q = 500, ϵ = 10 3, γ = 0.1, K = 2000) and denote the proposed solution as θBi GL. We also consider its threshold counterpart θBi GLThr where each feature is assigned to its most dominant group. These two solutions are compared with Lasso and oracle Group Lasso, computed respectively for θLasso = Id P and θGL = θ . In addition to the validation error, performance are quantified in terms of test and estimation error, (1/2T) PT t=1 y(test) t X(test) t 2 2 and (1/2T) PT t=1 w t 2 2 respectively. Illustration of the method. First, we illustrate the well-behaviour of the algorithmic solution, for various values of λ, when L is known. We consider the previously mentioned setting and display in Fig.5 (in the supplementary material) the corresponding oracle w (top left) exhibiting 10 groups. 0 1000 2000 3000 Figure 1: The minimization of U(Q) is displayed in the left plot for various λ. Comparison of estimation errors (middle) show that the proposed Bi GL and Bi GLThr estimates yield performance close to the oracle GL. In addition, θBi GLThr satisfactorily agrees with the oracle θ (right). 50 500 5000 101 102 103 2 Figure 2: Left and middle plots illustrate the impact of Q and T on the validation and the estimation error respectively. The right-hand figure shows that an adequate estimation of the groups can be obtained even when the number of groups is set to 20 instead of 10. Figure 1 (left) shows, for several values of λ, how the upper level objective decreases as the number of outer iterations k grows. Even though convergence is not yet fully reached, the corresponding solutions still yield performance close to oracle Group Lasso as shown by the validation, test and estimation error (see Figure 1 and supplementary material). More importantly, Figure 1 (right) shows that for the λ minimizing the validation error, denoted λmin, the corresponding estimated groups θBi GLThr satisfactorily agree with the oracle θ , thus confirming that minimizing the validation error is an adequate way to learn the groups. Impact of the number of inner iterations Q. Now that a proof of concept has been provided, we propose to investigate the impact of Q on the validation error. To do so, we repeat the same experiment for λ = λmin and different values of Q. Once the estimates θBi GL and θBi GLThr are obtained, the validation error (where the ˆw( ) s are computed a posteriori for 104 iterations) for each of the four methods is plotted as a function of Q in Fig.2 (left). The results show that increasing Q sufficiently permits reaching performance close to GL. In addition, we stress that, for Q 500, the performance of Bi GL and Bi GLThr become indistinguishable, thus showing that the algorithm does tend to assign a single group to each feature. Impact of the number of tasks T. Here we investigate how the estimation errors varies as the number of task T increases, see in Fig.2 (middle). While the performance of Lasso and GL do not significantly depend on T, we observe that the performance of Bi GL and Bi GLThr get close to those of GL as T grows. Similar conclusions can be drawn regarding the test error. Hence, this confirms that learning the groups is intrinsically a multi-task problem that benefits from having a large number of tasks. Impact of the number of groups L. While in the previous experiments the number of groups was known a priori (L = 10), here we relax this assumption and let the algorithm find at most L = 20 groups. We repeat the experiment and show the results in Fig. 2 (right). Note that 9 out of the 10 extra groups are not displayed since they were found empty, while the remaining group contains very few features. Overall, the oracle θ is still satisfactorily estimated. Impact of groups sizes. We repeat the same experiment except that θ is now made of 5 groups of 5 features and 5 groups of 15 features. The proposed method still satisfactorily estimates groups of different size, as Figure 3 shows, in case L is known as well as if L is overestimated (L = 20). Figure 3: Illustration of the estimated group-structure when the oracle groups have different sizes. Left: initialization with the correct number of groups L = 10. Right: initialization with an overestimation L = 20 of the number of groups. 10-1 100 0.6 10-1 100 0.15 Figure 4: Application to the prediction of gene ontology classes from regulatory motifs. Our approach is able to reach a lower prediction error than Lasso by partitioning the features into 30 groups. 4.2 Application to Real Data Understanding the complexity of gene expression networks and the mechanisms involved in its regulation constitutes an extremely difficult task [25]. In this section, we lead a preliminary experiment on gene expression data collected from https://www.ensembl.org/ using Bio Mart. The data consists of N = 60 genes each one characterized by P = 50 features, corresponding to the regulatory motifs in promoters. These samples may belong to at most 108 gene ontology classes. Each class corresponds to a very specific molecular function of the transcripts. The data set is split into training, validation and test sets of 20 genes each. We perform a multi-task classification (T = 108) where each task consists of a one versus all classification problem. Our bilevel algorithm is initialized with L = 50 possible groups. Validation and test errors are displayed in Fig. 4 as functions of λ. The results show a significant decrease in prediction error when using the proposed method compared with Lasso. In addition, θBi GLThr suggests that there exist 30 relevant groups among the features. This preliminary experiment is encouraging and paves the way to set extended and more comprehensive experiments in gene data analysis. 5 Conclusion This contribution studied the problem of learning the groups by solving a continuous bilevel problem. We replaced the exact Group Lasso optimization problem by a smooth dual forward-backward algorithm with Bregman distances. This method is in the line of what has been proposed in [24]. We also provided theoretical justifications of this approximation method which, to the best of our knowledge, is new. When compared to standard sparse regression methods, the proposed procedure achieved equivalent performance to the oracle Group Lasso where the true groups are known. Moreover, when the numbers of tasks and inner iterations are sufficiently large, a satisfactory estimate of the groups can be obtained even if the number of groups are unknown. One of the advantages of the proposed approach is that it can be easily adapted to different convex losses with Lipschitz-continuous gradient. Future works notably include the extension to overlapping groups [15] and could also aim at learning the groups in group-sparse classification problems. Acknowledgments We wish to thank Luca Franceschi and the anonymous referees for their useful comments. We also would like to thank Giorgio Valentini for providing the gene expression dataset. This work was supported in part by SAP SE. [1] A. Ahmed and E. Xing. Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences, 106(29):11878 11883, 2009. [2] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330 348, 2016. [3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, USA, 2nd edition, 2017. [4] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889 1900, 2000. [5] L. Calatroni, C. Chung, J. C. De los Reyes, C.-B. Schönlieb, and T. Valkonen. Bilevel approaches for learning of variational imaging models. Variational Methods, pages 252 290, 2016. [6] G. Chen and R. T. Rockafellar. Convergence rates in forward backward splitting. SIAM Journal on Optimization, 7(2):421 444, 1997. [7] P. L Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Modeling & Simulation, 4(4):1168 1200, 2005. [8] L. Condat. Fast projection onto the simplex and the ℓ1-ball. Mathematical Programming, 158(1-2):575 585, 2016. [9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646 1654, Montreal, Canada, 08 13 Dec 2014. [10] S. Dempe. Foundations of Bilevel Programming. Springer, Boston, USA, 2002. [11] L. Franceschi, N. Donini, P. Frasconi, and M. Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1165 1173, Sydney, Australia, 06 11 Aug 2017. [12] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter otimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1568 1577, Stockholm, Sweden, 10 15 Jul 2018. [13] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, USA, 2nd edition, 2008. [14] D. Hernández-Lobato and J. M. Hernández-Lobato. Learning feature selection dependencies in multi-task learning. In Advances in Neural Information Processing Systems 26, pages 746 754, Lake Tahoe, USA, 05 10 Dec 2013. [15] L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433 440, Montreal, Canada, 14 18 Jun 2009. [16] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777 2824, 2011. [17] S. Kim and E. Xing. Tree-guided group Lasso for multi-response regression with structured sparsity, with an application to e QTL mapping. The Annals of Applied Statistics, 6(3):1095 1117, 2012. [18] K. Kunisch and T. Pock. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938 983, 2013. [19] K. Lounici, M. Pontil, S. Van De Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164 2204, 2011. [20] S. Ma, X. Song, and J. Huang. Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics, 8(1):60, 2007. [21] D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2113 2122, Lille, France, 2015. [22] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671 690, 2012. [23] C. A. Micchelli, J. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455 489, 2013. [24] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175 194, 2016. [25] M. Re and G. Valentini. Predicting gene expression from heterogeneous data. In International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, Genoa, Italy, 15-17 Oct 2009. [26] S. Reddi, S. Sra, B. Poczos, and A. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145 1153, Barcelona, Spain, 05 10 Dec 2016. [27] N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894 4902, 2015. [28] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519 539, 2017. [29] M. J. Wainwright. Structured regularizers for high-dimensional problems: statistical and computational issues. Annual Review of Statistics and Its Application, 1:233 253, 2014. [30] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49 67, 2006. [31] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468 3497, 2009.