Journal of Machine Learning Research 18 (2017) 1-52. Submitted 12/15; Revised 2/17; Published 4/17.

Group Sparse Optimization via ℓp,q Regularization

Yaohua Hu (mayhhu@szu.edu.cn), College of Mathematics and Statistics, Shenzhen University, Shenzhen 518060, P. R. China
Chong Li (cli@zju.edu.cn), School of Mathematical Sciences, Zhejiang University, Hangzhou 310027, P. R. China
Kaiwen Meng (mkwfly@126.com), School of Economics and Management, Southwest Jiaotong University, Chengdu 610031, P. R. China
Jing Qin (qinjing@cuhk.edu.hk), School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, and Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen 518057, P. R. China
Xiaoqi Yang (mayangxq@polyu.edu.hk), Department of Applied Mathematics, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Editor: Mark Schmidt

Abstract

In this paper, we investigate a group sparse optimization problem via ℓp,q regularization in three aspects: theory, algorithm and application. In the theoretical aspect, by introducing a notion of group restricted eigenvalue condition, we establish an oracle property and a global recovery bound of order O(λ^{2/(2-q)}) for any point in a level set of the ℓp,q regularization problem, and by virtue of modern variational analysis techniques, we also provide a local analysis of recovery bound of order O(λ^2) for a path of local minima. In the algorithmic aspect, we apply the well-known proximal gradient method to solve the ℓp,q regularization problems, either by analytically solving some specific ℓp,q regularization subproblems, or by using the Newton method to solve general ℓp,q regularization subproblems. In particular, we establish a local linear convergence rate of the proximal gradient method for solving the ℓ1,q regularization problem under some mild conditions, by first proving a second-order growth condition. As a consequence, the local linear convergence rate of the proximal gradient method for solving the usual ℓq regularization problem (0 < q < 1) is obtained. Finally, in the aspect of application, we present some numerical results on both simulated data and real data in gene transcriptional regulation.

Keywords: group sparse optimization, lower-order regularization, nonconvex optimization, restricted eigenvalue condition, proximal gradient method, iterative thresholding algorithm, gene regulation network

1. Introduction

In recent years, a great amount of attention has been paid to sparse optimization, which is to find the sparse solutions of an underdetermined linear system. The sparse optimization problem arises in a wide range of fields, such as compressive sensing, machine learning, pattern analysis and graphical modeling; see Blumensath and Davies (2008); Candès et al. (2006b); Chen et al. (2001); Donoho (2006a); Fan and Li (2001); Tibshirani (1994) and references therein. In many applications, the underlying data can usually be represented approximately by a linear system of the form

Ax = b + ε,

where A ∈ R^{m×n} and b ∈ R^m are known, ε ∈ R^m is an unknown noise vector, and x = (x_1, x_2, ..., x_n)^⊤ ∈ R^n is the variable to be estimated.
If m ≪ n, the above linear system is seriously ill-conditioned and may have infinitely many solutions. The sparse optimization problem is to recover x from the information b such that x is of a sparse structure. The sparsity of the variable x has been measured by the ℓp norm ‖x‖_p (p = 0, see Blumensath and Davies (2008); p = 1, see Beck and Teboulle (2009); Chen et al. (2001); Daubechies et al. (2004); Donoho (2006a); Tibshirani (1994); Wright et al. (2009); Yang and Zhang (2011); and p = 1/2, see Chartrand and Staneva (2008); Xu et al. (2012)). The ℓp norm ‖x‖_p for p > 0 is defined by

‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p},

while the ℓ0 norm ‖x‖_0 is defined as the number of nonzero components of x. The sparse optimization problem can be modeled as

min ‖Ax − b‖_2  s.t. ‖x‖_0 ≤ s,

where s is the given sparsity level. For the sparse optimization problem, a popular and practical technique is the regularization method, which is to transform the sparse optimization problem into an unconstrained optimization problem, called the regularization problem. For example, the ℓ0 regularization problem is

min_{x∈R^n} ‖Ax − b‖_2^2 + λ‖x‖_0,

where λ > 0 is the regularization parameter, providing a tradeoff between accuracy and sparsity. However, the ℓ0 regularization problem is nonconvex and non-Lipschitz, and thus it is generally intractable to solve directly (indeed, it is NP-hard; see Natarajan, 1995). To overcome this difficulty, two typical relaxations of the ℓ0 regularization problem are introduced, namely the ℓ1 regularization problem

min_{x∈R^n} ‖Ax − b‖_2^2 + λ‖x‖_1  (1)

and the ℓq regularization problem (0 < q < 1)

min_{x∈R^n} ‖Ax − b‖_2^2 + λ‖x‖_q^q.  (2)

1.1 ℓq Regularization Problems

The ℓ1 regularization problem (1), also called the Lasso (Tibshirani, 1994) or Basis Pursuit (Chen et al., 2001), has attracted much attention and has been accepted as one of the most useful tools for sparse optimization. Since the ℓ1 regularization problem is a convex optimization problem, many efficient algorithms have been proposed and developed for solving problem (1); see Beck and Teboulle (2009); Combettes and Wajs (2005); Daubechies et al. (2004); Hu et al. (2016); Nesterov (2012, 2013); Xiao and Zhang (2013); Yang and Zhang (2011). However, the ℓ1 regularization problem (1) suffers some frustrations in practical applications. It was revealed by extensive empirical studies that the solutions obtained from the ℓ1 regularization problem are much less sparse than the true sparse solution, that it cannot recover a signal or image with the least measurements when applied to compressed sensing, and that it often leads to sub-optimal sparsity in reality; see Chartrand (2007); Xu et al. (2012); Zhang (2010). Recently, to overcome these drawbacks of the ℓ1 regularization, the lower-order regularization technique (that is, the ℓq regularization with 0 < q < 1) has been proposed to improve the sparsity recovery performance of the ℓ1 regularization problem. Chartrand and Staneva (2008) claimed that a weaker restricted isometry property is sufficient to guarantee perfect recovery in the ℓq regularization, and that it can recover sparse signals from fewer linear measurements than those required by the ℓ1 regularization. Xu et al. (2012) showed that the ℓ1/2 regularization admits a significantly stronger sparsity promoting capability than the ℓ1 regularization, in the sense that it allows one to obtain a sparser solution or to predict a sparse signal from a smaller number of samplings.
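To make the objectives above concrete, the following minimal Python sketch (the function names are ours, not from the paper) evaluates the ℓq penalty and the regularized objectives in (1) and (2); the ℓ0 case simply counts nonzero entries.

```python
import numpy as np

def lq_penalty(x, q):
    """Return ||x||_q^q for q > 0, or ||x||_0 (the number of nonzeros) for q = 0."""
    x = np.asarray(x, dtype=float)
    if q == 0:
        return float(np.count_nonzero(x))
    return float(np.sum(np.abs(x) ** q))

def regularized_objective(A, b, x, lam, q):
    """Objective ||Ax - b||_2^2 + lam * ||x||_q^q: q = 1 gives (1), 0 < q < 1 gives (2)."""
    r = A @ x - b
    return float(r @ r) + lam * lq_penalty(x, q)
```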
Qin et al. (2014) exhibited that the ℓ1/2 regularization achieves a more reliable solution in a biological sense than the ℓ1 regularization when applied to infer gene regulatory networks from gene expression data of mouse embryonic stem cells. However, the ℓq regularization problem is nonconvex, nonsmooth and non-Lipschitz, and thus it is difficult in general to design efficient algorithms for solving it. It was shown in Ge et al. (2011) that finding the global minimal value of the ℓq regularization problem (2) is strongly NP-hard; fortunately, computing a local minimum can be done in polynomial time. Some effective and efficient algorithms have been proposed to find a local minimum of problem (2), such as the interior-point potential reduction algorithm (Ge et al., 2011), smoothing methods (Chen, 2012; Chen et al., 2010), splitting methods (Li and Pong, 2015a,b) and iterative reweighted minimization methods (Lai and Wang, 2011; Lai et al., 2013; Lu, 2014). The ℓq regularization problem (2) is a variant of the lower-order penalty problems, investigated in Huang and Yang (2003); Luo et al. (1996); Yang and Huang (2001), for a constrained optimization problem. The main advantage of the lower-order penalty functions over the classical ℓ1 penalty function is that they require weaker conditions to guarantee an exact penalization property and that their least exact penalty parameter is smaller; see Huang and Yang (2003). It was reported in Yang and Huang (2001) that the first- and second-order necessary optimality conditions of lower-order penalty problems converge to those of the original constrained optimization problem under a linearly independent constraint qualification. Besides the preceding numerical algorithms, one of the most widely studied methods for solving the sparse optimization problem is the class of iterative thresholding algorithms, which can be studied in a unified framework of proximal gradient methods; see Beck and Teboulle (2009); Blumensath and Davies (2008); Combettes and Wajs (2005); Daubechies et al. (2004); Gong et al. (2013); Nesterov (2013); Xu et al. (2012) and references therein. It converges fast and has very low computational complexity. Benefitting from its simple formulation and low storage requirement, it is very efficient and applicable for large-scale sparse optimization problems. In particular, the iterative hard (resp. soft, half) thresholding algorithm for the ℓ0 (resp. ℓ1, ℓ1/2) regularization problem was studied in Blumensath and Davies (2008) (resp. Daubechies et al., 2004; Xu et al., 2012).

1.2 Global Recovery Bound

To estimate how far the solution of a regularization problem is from that of the linear system, the global recovery bound (also called the ℓ2 consistency) of the ℓ1 regularization problem has been investigated in the literature. More specifically, under some mild conditions on A, such as the restricted isometry property (RIP; Candès and Tao, 2005) or the restricted eigenvalue condition (REC; Bickel et al., 2009), van de Geer and Bühlmann (2009) established a deterministic recovery bound for the (convex) ℓ1 regularization problem:

‖x*(ℓ1) − x̄‖_2^2 = O(λ^2 s),  (3)

where x*(ℓ1) is a solution of problem (1), x̄ is a solution of the linear system Ax = b, and the sparsity is s := ‖x̄‖_0. In the statistics literature, Bickel et al. (2009); Bunea et al.
(2007); Meinshausen and Yu (2009); Zhang (2009) provided recovery bounds that hold with high probability for the ℓ1 regularization problem when the size of the variable tends to infinity, under the REC/RIP or some related conditions. However, to the best of our knowledge, the recovery bound for the general (nonconvex) ℓq regularization problem is still undiscovered. We will establish such a deterministic property in section 2.

1.3 Group Sparse Optimization

In applications, a wide class of problems usually have certain special structures, and recently, enhancing recoverability by exploiting such special structures has become an active topic in sparse optimization. One of the most popular structures is the group sparsity structure, that is, the solution has a natural grouping of its components, and the components within each group are likely to be either all zeros or all nonzeros. In general, the grouping information can be an arbitrary partition of x, and it is usually pre-defined based on prior knowledge of specific problems. Let x := (x_{G_1}^⊤, ..., x_{G_r}^⊤)^⊤ represent the group structure of x. The group sparsity of x with such a group structure can be measured by an ℓp,q norm, defined by

‖x‖_{p,q} := (Σ_{i=1}^r ‖x_{G_i}‖_p^q)^{1/q}.

Exploiting the group sparsity structure can reduce the degrees of freedom in the solution, thereby leading to better recovery performance. Benefitting from these advantages, the group sparse optimization model has been applied in birthweight prediction (Bach, 2008; Yuan and Lin, 2006), dynamic MRI (Usman et al., 2011) and gene finding (Meier et al., 2008; Yang et al., 2010) with the ℓ2,1 norm. More specifically, the ℓ2,1 regularization problem

min_{x∈R^n} ‖Ax − b‖_2^2 + λ‖x‖_{2,1}

was introduced by Yuan and Lin (2006) to study the grouped variable selection in statistics under the name of group Lasso. The ℓ2,1 regularization, an important extension of the ℓ1 regularization, imposes an ℓ2 regularization for each group and ultimately yields sparsity in a group manner. Since the ℓ2,1 regularization problem is a convex optimization problem, some effective algorithms have been proposed, such as the spectral projected gradient method (van den Berg et al., 2008), SpaRSA (Wright et al., 2009) and the alternating direction method (Deng et al., 2011).

1.4 The Aim of This Paper

In this paper, we will investigate group sparse optimization via ℓp,q regularization (p ≥ 1, 0 ≤ q ≤ 1), also called the ℓp,q regularization problem

min_{x∈R^n} F(x) := ‖Ax − b‖_2^2 + λ‖x‖_{p,q}^q.  (4)

We will investigate the oracle property and recovery bound for the ℓp,q regularization problem, which extends the existing results in two ways: one is the lower-order regularization, including the ℓq regularization problem (q < 1); the other is group sparse optimization, including the ℓ2,1 regularization problem (group Lasso) as a special case. To this end, we will introduce weaker notions of the REC: the lower-order REC and the group REC (GREC). We will further establish the relationships between the new notions and the classical one: the lower-order REC is weaker than the classical REC, but the reverse is not true (see Example 1); and the GREC is weaker than the REC. Under the lower-order GREC, we will provide the oracle property and the global recovery bound for the ℓp,q regularization problem (see Theorem 9). Furthermore, we will conduct a local analysis of the recovery bound for the ℓp,q regularization problem by virtue of modern variational analysis techniques (Rockafellar and Wets, 1998).
More precisely, we assume that any nonzero group of x̄ is active and that the columns of A corresponding to the active components of x̄ (a solution of Ax = b) are linearly independent, which matches the nature of the group sparsity structure. This leads us to the application of the implicit function theorem and thus guarantees the existence of a local path around x̄ which satisfies a second-order growth condition. As such, we will establish a uniform local recovery bound O(λ^2 S) for all the ℓp,q regularization problems; see Theorem 11.

The proximal gradient method is one of the most popular and practical methods for sparse optimization problems, either convex or nonconvex. We will apply the proximal gradient method to solve the ℓp,q regularization problem (4). The advantage of the proximal gradient method is that the proximal optimization subproblems of some specific regularizations have analytical solutions, and the resulting algorithm is thus practically attractive. In the general case, when analytical solutions of the proximal optimization subproblems are not available, we will employ the Newton method to solve them. Furthermore, we will investigate the local linear convergence rate of the proximal gradient method for solving the ℓp,q regularization problem when p = 1 and 0 < q < 1, under the assumption that any nonzero group of a local minimum is active. Problem (4) in the case p = 1 and 0 < q < 1 possesses the properties that the regularization term ‖·‖_{p,q}^q is concave near a local minimum and that the objective function F(·) of (4) satisfies a second-order growth condition, which plays an important role in the establishment of the local linear convergence rate. To the best of our knowledge, this is the first attempt to study the local linear convergence rate of the proximal gradient method for solving lower-order optimization problems. As a consequence of this result, we will obtain the local linear convergence rate of the proximal gradient method for solving the ℓq regularization problem (0 < q < 1), which includes the iterative half thresholding algorithm (q = 1/2) proposed in Xu et al. (2012) as a special case. The result on the local linear convergence rate of the proximal gradient method for solving the ℓq regularization problem is still new, as far as we know.

In the aspect of application, we will conduct some numerical experiments on both simulated data and real data in gene transcriptional regulation to demonstrate the performance of the proposed proximal gradient method. The numerical results demonstrate that the ℓp,1/2 regularization is the best one among the ℓp,q regularizations for q ∈ [0, 1], and that it outperforms the ℓp,1 and ℓp,0 regularizations in both accuracy and robustness. This observation is consistent with several previous numerical studies on the ℓp regularization problem; see Chartrand and Staneva (2008); Xu et al. (2012). It is also illustrated by the numerical results that the proximal gradient method (ℓ2,1/2) outperforms most solvers in group sparse learning, such as OMP (Cai and Wang, 2011), FoBa (Zhang, 2011), ℓ1-Magic (Candès et al., 2006a), ISTA (Daubechies et al., 2004) and YALL1 (Yang and Zhang, 2011).
The R package of the proximal gradient method for solving group sparse optimization, named GSparO in CRAN, is available at https://CRAN.R-project.org/package=GSparO.

1.5 Main Contributions

This paper investigates group sparse optimization under a unified framework of the ℓp,q regularization problem (4). In this paper, we establish the oracle property and recovery bound, design an efficient numerical method for problem (4), and apply the proposed method to solve the problem of gene transcriptional regulation. The main contributions are presented as follows.

(i) We establish the following global recovery bound for the ℓp,q regularization problem (4) under the (p, q)-GREC:

‖x − x̄‖_2^2 ≤ O(λ^{2/(2−q)} S) if 2^{K−1}q = 1, and ‖x − x̄‖_2^2 ≤ O(λ^{2/(2−q)} S^{(3−q)/(2−q)}) if 2^{K−1}q > 1,  (5)

where x̄ is a solution of Ax = b, S := ‖x̄‖_{p,0} is the group sparsity, 0 < q ≤ 1 ≤ p ≤ 2, x is any point in the level set lev_F(x̄) of problem (4), and K is the smallest integer such that 2^{K−1}q ≥ 1.

(ii) By virtue of the variational analysis technique, for all the ℓp,q regularization problems, we establish a uniform local recovery bound

‖x*_{p,q}(λ) − x̄‖_2^2 ≤ O(λ^2 S) for small λ,

where 0 < q < 1 ≤ p and x*_{p,q}(λ) is a local optimal solution of problem (4) (near x̄).

(iii) We present the analytical formulae for the proximal optimization subproblems of specific ℓp,q regularizations when p = 1, 2 and q = 0, 1/2, 2/3, 1. Moreover, we prove that any sequence {x^k}, generated by the proximal gradient method for solving the ℓ1,q regularization problem, linearly converges to a local minimum x* under some mild conditions; that is, there exist N ∈ ℕ, C > 0 and η ∈ (0, 1) such that

F(x^k) − F(x*) ≤ Cη^k and ‖x^k − x*‖_2 ≤ Cη^k for any k ≥ N.

(iv) Our numerical experiments show that, measured by the biological gold standards, the accuracy of gene regulation network forecasting can be improved by exploiting the group structure of TF complexes. The successful application of group sparse optimization to gene transcriptional regulation will help biologists to study the gene regulation of higher model organisms on a genome-wide scale.

1.6 The Organization of This Paper

This paper is organized as follows. In section 2, we introduce the notions of the q-REC and GREC, and establish the oracle property and (global and local) recovery bounds for the ℓp,q regularization problem. In section 3, we apply the proximal gradient method to solve the group sparse optimization problem using different types of ℓp,q regularization, and investigate the local linear convergence rate of the resulting proximal gradient method. Finally, section 4 exhibits the numerical results on both simulated data and real data in gene transcriptional regulation.

2. Global and Local Recovery Bounds

This section is devoted to the study of the oracle property and (global and local) recovery bounds for the ℓp,q regularization problem (4). To this end, we first present some basic inequalities of the ℓp norm and introduce the notions of RECs, as well as their relationships. The notations adopted in this paper are described as follows. We let lowercase letters x, y, z denote vectors, calligraphic letters I, T, S, J, N denote index sets, and capital letters N, S denote the numbers of groups in the index sets. In particular, we use G_i to denote the index set corresponding to the i-th group and G_S to denote the index set {G_i : i ∈ S}. For x ∈ R^n and T ⊆ {1, ..., n}, we use x_T to denote the subvector of x corresponding to T. We use sign: R → R to denote the signum function, defined by

sign(t) := 1 if t > 0, 0 if t = 0, and −1 if t < 0.
Throughout this paper, we assume that the group sparse optimization problem has the group structure described as follows. Let x := (x_{G_1}^⊤, ..., x_{G_r}^⊤)^⊤ represent the group structure of x, where {x_{G_i} ∈ R^{n_i} : i = 1, ..., r} is the grouping of x, Σ_{i=1}^r n_i = n and n_max := max{n_i : i ∈ {1, ..., r}}. For a group x_{G_i}, we use x_{G_i} = 0 (resp. x_{G_i} ≠ 0, x_{G_i} ≠_a 0) to denote a zero (resp. nonzero, active) group, where x_{G_i} = 0 means that x_j = 0 for all j ∈ G_i; x_{G_i} ≠ 0 means that x_j ≠ 0 for some j ∈ G_i; and x_{G_i} ≠_a 0 means that x_j ≠ 0 for all j ∈ G_i. It is trivial to see that x_{G_i} ≠_a 0 implies x_{G_i} ≠ 0. For this group structure and p > 0, the ℓp,q norm of x is defined by

‖x‖_{p,q} := (Σ_{i=1}^r ‖x_{G_i}‖_p^q)^{1/q} if q > 0, and ‖x‖_{p,0} := Σ_{i=1}^r ‖x_{G_i}‖_p^0 if q = 0,  (6)

which applies the ℓp norm within each group and then the ℓq norm to the resulting vector of group norms. When p = q, the ℓp,q norm coincides with the ℓp norm, that is, ‖x‖_{p,p} = ‖x‖_p. Furthermore, all ℓp,0 norms share the same formula, that is, ‖x‖_{p,0} = ‖x‖_{2,0} for all p > 0. In particular, when the grouping structure degenerates to the individual feature level, that is, if n_max = 1 or n = r, we have ‖x‖_{p,q} = ‖x‖_q for all p > 0 and q > 0. Moreover, we assume that A and b in (4) are related by a (noiseless) linear model

Ax̄ = b.

Let S := {i ∈ {1, ..., r} : x̄_{G_i} ≠ 0} be the index set of nonzero groups of x̄, S^c := {1, ..., r} \ S be the complement of S, S := |S| be the group sparsity of x̄, and n_a := Σ_{i∈S} n_i.

2.1 Inequalities of the ℓp,q Norm

We begin with some basic inequalities of the ℓp and ℓp,q norms, which will be useful in the later discussion of RECs and recovery bounds. First, we recall the following well-known inequality

(Σ_{i=1}^n |x_i|^{γ_2})^{1/γ_2} ≤ (Σ_{i=1}^n |x_i|^{γ_1})^{1/γ_1} if 0 < γ_1 ≤ γ_2,  (7)

or equivalently (for x = (x_1, x_2, ..., x_n)^⊤), ‖x‖_{γ_2} ≤ ‖x‖_{γ_1} if 0 < γ_1 ≤ γ_2. The following lemma improves Huang and Yang (2003, lem. 4.1) and extends it to the ℓp,q norm. It will be useful in providing a sharper global recovery bound (see Theorem 9 later).

Lemma 1 Let 0 < q ≤ p ≤ 2, x ∈ R^n, and let K be the smallest integer such that 2^{K−1}q ≥ 1. Then the following relations hold.
(i) ‖x‖_q^q ≤ n^{1−2^{−K}} ‖x‖_2^q.
(ii) ‖x‖_{p,q}^q ≤ r^{1−2^{−K}} ‖x‖_{p,2}^q.

Proof (i) Repeatedly using the property that ‖x‖_1 ≤ √n ‖x‖_2, one has that

Σ_{i=1}^n |x_i|^q ≤ n^{1/2} (Σ_{i=1}^n |x_i|^{2q})^{1/2} ≤ ... ≤ n^{1/2 + ... + 2^{−K}} (Σ_{i=1}^n |x_i|^{2^K q})^{2^{−K}}.

Since 2^{K−1}q ≥ 1, by (7), we obtain that

(Σ_{i=1}^n |x_i|^{2^K q})^{2^{−K}} = ((Σ_{i=1}^n (|x_i|^2)^{2^{K−1}q})^{1/(2^{K−1}q)})^{q/2} ≤ (Σ_{i=1}^n |x_i|^2)^{q/2} = ‖x‖_2^q.

Therefore, noting that 1/2 + ... + 2^{−K} = 1 − 2^{−K}, we arrive at the conclusion that ‖x‖_q^q ≤ n^{1−2^{−K}} ‖x‖_2^q.
(ii) By (6), it is a direct consequence of (i). For example, if q = 1, then K = 1; if q = 1/2, then K = 2.

The following lemma describes the triangle inequality of ‖·‖_{p,q}^q.

Lemma 2 Let 0 < q ≤ 1 ≤ p and x, y ∈ R^n. Then ‖x‖_{p,q}^q − ‖y‖_{p,q}^q ≤ ‖x − y‖_{p,q}^q.

Proof By the subadditivity of the ℓp norm and (7), it is easy to see that ‖x_{G_i}‖_p^q − ‖y_{G_i}‖_p^q ≤ ‖x_{G_i} − y_{G_i}‖_p^q for i = 1, ..., r. Consequently, the conclusion directly follows from (6).

The following lemma will be beneficial to studying properties of the lower-order REC in Proposition 5 later.

Lemma 3 Let γ ≥ 1, and let two finite sequences {y_i : i ∈ I} and {x_j : j ∈ J} satisfy y_i ≥ x_j ≥ 0 for all (i, j) ∈ I × J. If Σ_{i∈I} y_i ≥ Σ_{j∈J} x_j, then Σ_{i∈I} y_i^γ ≥ Σ_{j∈J} x_j^γ.

Proof Set ȳ := (1/|I|) Σ_{i∈I} y_i and α := min_{i∈I} y_i. By Huang and Yang (2003, lem. 4.1), one has that

Σ_{i∈I} y_i^γ ≥ (1/|I|^{γ−1}) (Σ_{i∈I} y_i)^γ = |I| ȳ^γ.  (8)

On the other hand, let M ∈ ℕ and β ∈ [0, α) be such that Σ_{j∈J} x_j = Mα + β. Observing that γ ≥ 1 and 0 ≤ x_j ≤ α for all j ∈ J, we obtain that x_j^γ ≤ x_j α^{γ−1}, and thus Σ_{j∈J} x_j^γ ≤ Mα^γ + α^{γ−1}β. By (8), it remains to show that

|I| ȳ^γ ≥ Mα^γ + α^{γ−1}β.  (9)

If |I| > M, the relation (9) is trivial since ȳ ≥ α > β; otherwise, |I| ≤ M, and from the facts that |I| ȳ ≥ Mα + β (that is, Σ_{i∈I} y_i ≥ Σ_{j∈J} x_j) and that γ ≥ 1, it follows that

|I| ȳ^γ ≥ M^{1−γ}(Mα + β)^γ ≥ M^{1−γ}(M^γ α^γ + γ M^{γ−1} α^{γ−1} β) ≥ Mα^γ + α^{γ−1}β.

Therefore, we verify the relation (9), and the proof is complete.
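As a quick illustration of the ℓp,q norm in definition (6), the following sketch (our own helper, with groups given as index lists) computes ‖x‖_{p,q}; with singleton groups it reduces to the ordinary ℓq norm, as noted above.

```python
import numpy as np

def group_lpq_norm(x, groups, p, q):
    """||x||_{p,q} of (6): the l_p norm within each group, then the l_q norm
    across groups (or, for q = 0, the number of nonzero groups)."""
    x = np.asarray(x, dtype=float)
    group_norms = np.array([np.linalg.norm(x[g], ord=p) for g in groups])
    if q == 0:
        return float(np.count_nonzero(group_norms))        # ||x||_{p,0}
    return float(np.sum(group_norms ** q) ** (1.0 / q))    # (sum_i ||x_{G_i}||_p^q)^{1/q}

# With singleton groups (n_max = 1), ||x||_{p,q} coincides with ||x||_q.
x = np.array([1.0, -2.0, 0.0, 3.0])
singletons = [[0], [1], [2], [3]]
assert np.isclose(group_lpq_norm(x, singletons, p=2, q=1), np.abs(x).sum())
```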
2.2 Group Restricted Eigenvalue Conditions

This subsection aims at the development of the critical conditions on the matrix A to guarantee the oracle property and the global recovery bound of the ℓp,q regularization problem (4). In particular, we will focus on the restricted eigenvalue condition (REC), extend it to the lower-order setting, and equip it with the group structure. In the scenario of sparse optimization, given the sparsity level s, it is always assumed that the 2s-sparse minimal eigenvalue of A^⊤A is positive (see Bickel et al., 2009; Bunea et al., 2007; Meinshausen and Yu, 2009), that is,

φ_min(2s) := min_{‖x‖_0 ≤ 2s} (x^⊤A^⊤Ax)/(x^⊤x) > 0,  (10)

which is the minimal eigenvalue of any 2s × 2s dimensional principal submatrix of A^⊤A. It is well known that the solution of the linear system Ax = b at sparsity level s is unique if condition (10) is satisfied; otherwise, assume that there are two distinct vectors x̂ and x̃ such that Ax̂ = Ax̃ and ‖x̂‖_0 = ‖x̃‖_0 = s. Then x := x̂ − x̃ is a vector such that Ax = 0 and ‖x‖_0 ≤ 2s, and thus φ_min(2s) = 0, which contradicts (10). Therefore, if the 2s-sparse minimal eigenvalue of A^⊤A is zero (that is, φ_min(2s) = 0), one has no hope of recovering the true sparse solution from noisy observations. However, condition (10) alone is not enough, and some further condition is required to guarantee the nice recovery properties of regularization problems; see Bickel et al. (2009); Bunea et al. (2007); Meinshausen and Yu (2009); van de Geer and Bühlmann (2009); Zhang (2009) and references therein. For example, the REC was introduced in Bickel et al. (2009) to investigate the ℓ2 consistency of the ℓ1 regularization problem (Lasso), where the minimum in (10) is replaced by a minimum over a restricted set of vectors measured by an ℓ1 norm inequality and the denominator is replaced by the ℓ2 norm of only a part of x.

We now introduce the notion of the lower-order REC. Note that the residual x̂ := x*(ℓq) − x̄ of the ℓq regularization problem, where x*(ℓq) is an optimal solution of the ℓq regularization problem and x̄ is a sparse solution of Ax = b, always satisfies

‖x̂_{S^c}‖_q ≤ ‖x̂_S‖_q,  (11)

where S is the support of x̄. Thus we introduce a lower-order REC, where the minimum is taken over a restricted set measured by an ℓq norm inequality such as (11), for establishing the global recovery bound of the ℓq regularization problem. Given s ≤ t ≤ n, x ∈ R^n and I ⊆ {1, ..., n}, we denote by I(x; t) the subset of {1, ..., n} corresponding to the first t largest coordinates in absolute value of x in I^c.

Definition 4 Let 0 ≤ q ≤ 1. The q-restricted eigenvalue condition relative to (s, t) (q-REC(s, t)) is said to be satisfied if

φ_q(s, t) := min { ‖Ax‖_2 / ‖x_T‖_2 : |I| ≤ s, ‖x_{I^c}‖_q ≤ ‖x_I‖_q, T = I(x; t) ∪ I } > 0.

The q-REC describes a kind of restricted positive definiteness of A^⊤A, which is valid only for the vectors satisfying the relation measured by an ℓq norm. The q-REC presents a unified framework of REC-type conditions for q ∈ [0, 1].

Figure 1: The geometric interpretation of the RECs: the gray regions show the feasible sets Cq(s) (q = 1, 1/2, 0). The q-REC holds if and only if the null space of A does not intersect the gray region.

In particular, we note by
definition that 1-REC reduces to the classical REC (Bickel et al., 2009), and that φmin(2s) = φ2 0(s, s), and thus (10) 0-REC(s, s) is satisfied. It is well-known in the literature that the 1-REC is a stronger condition than the 0-REC (10). A natural question arises what are the relationships between the general q-RECs. To answer this question, associated with the q-REC, we consider the feasible set Cq(s) := {x Rn : x Ic q x I q for some |I| s}, which is a cone. Since the objective function associated with the q-REC is homogeneous, the q-REC(s, t) says that the null space of A does not cross over Cq(s). Figure 1 presents the geometric interpretation of the q-RECs. It is shown in Figure 1 that C0(s) C1/2(s) C1(s), and thus 1-REC 1/2-REC 0-REC. It is also observed from Figure 1 that the gap between the 1-REC and 1/2-REC and that between 1/2-REC and 0-REC are the matrices whose null spaces fall in the cones of C1(s) \ C1/2(s) and C1/2(s) \ C0(s), respectively. We now provide a rigorous proof in the following proposition to identify the relationship between the feasible sets Cq(s) and between the general q-RECs: the lower the q, the smaller the cone Cq(s), and the weaker the q-REC. Proposition 5 Let 0 q1 q2 1 and 1 s t n. Then the following statements are true: (i) Cq1(s) Cq2(s), and (ii) if the q2-REC(s, t) holds, then the q1-REC(s, t) holds. Proof (i) Fix x Cq1(s). We use I to denote the index set of the first s largest coordinates in absolute value of x. Since x Cq1(s), it follows that x Ic q1 x I q1 (|I | s due to the construction of I ). By Lemma 3 (taking γ = q2/q1), one has that x Ic q2 x I q2, Hu, Li, Meng, Qin, and Yang that is, x Cq2(s). Hence it follows that Cq1(s) Cq2(s). (ii) As proved by (i) that Cq1(s) Cq2(s), by the definition of q-REC, it follows that φq1(s, t) φq2(s, t) > 0. The proof is complete. To the best of our knowledge, this is the first work on introducing the lower-order REC and establishing the relationship of the lower-order RECs. In the following, we provide a counter example to show that the reverse of Proposition 5 is not true. Example 1 (A matrix satisfying 1/2-REC but not REC) Consider the matrix A := a a + c a c a a c a + c where a > c > 0 and a > c > 0. This matrix A does not satisfy the REC(1, 1). Indeed, by letting J = {1} and x = (2, 1, 1) , we have Ax = 0 and thus φ(1, 1) = 0. Below, we claim that A satisfies the 1/2-REC(1, 1). It suffices to show that φ1/2(1, 1) > 0. Let x = (x1, x2, x3) satisfy the constraint associated with 1/2-REC(1, 1). As s = 1, the deduction is divided into the following three cases. (i) J = {1}. Then |x1| x J c 1/2 = |x2| + |x3| + 2|x2|1/2|x3|1/2. (12) Without loss of generality, we assume |x1| |x2| |x3|. Hence, T = {1, 2} and Ax 2 x T 2 min{a, a}|x1 + x2 + x3| + min{c, c}|x2 x3| |x1| + |x2| . (13) 3|x1|, (13) reduces to 3 |x1| 4 3|x1| = min{a, a} 3|x1|, substituting (12) into (13), one has that ( min{c, c} 2|x2|, min{a, a} 2|x2|. (15) (ii) J = {2}. Since T = {2, 1} or {2, 3}, it follows from Huang and Yang (2003, lem. 4.1) that |x2| x J c 1/2 |x1| + |x3|. (16) Thus, it is easy to verify that x T 2 2|x2| and that Ax 2 x T 2 |ax1 + (a + c)x2 + (a c)x3| 2|x2| = |a(x1 + x2 + a c a x3) + cx2| 2|x2| c where the last inequality follows from (16) and the fact that a > c. Group Sparse Optimization (iii) J = {3}. Similar to the deduction of (ii), one has that Ax 2 x T 2 | ax1 + ( a c)x2 + ( a + c)x3| Therefore, by (14)-(15) and (17)-(18), we conclude that φ1/2(1, 1) 1 8 min{c, c} > 0, and thus, the matrix A satisfies the 1/2-REC(1, 1). 
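Example 1 can be checked numerically. The sketch below (our own, using the concrete values a = ā = 2, c = c̄ = 1 that are reused in Example 2) verifies that x = (2, −1, −1)^⊤ lies in the null space of A and belongs to C1(1), so the classical REC(1,1) fails, while the same vector violates the ℓ1/2 constraint for every choice of |I| = 1, which is consistent with the claim that A satisfies the 1/2-REC(1,1).

```python
import numpy as np

a, c = 2.0, 1.0            # any a > c > 0 works; these values are reused in Example 2
ab, cb = 2.0, 1.0          # a-bar, c-bar
A = np.array([[a,  a + c,   a - c],
              [ab, ab - cb, ab + cb]])

x = np.array([2.0, -1.0, -1.0])
print(A @ x)               # [0. 0.]: x is in the null space of A

def in_Cq(x, I, q):
    """Check the q-REC constraint ||x_{I^c}||_q <= ||x_I||_q for an index set I."""
    Ic = [j for j in range(len(x)) if j not in I]
    lq = lambda v: np.sum(np.abs(v) ** q) ** (1.0 / q)
    return lq(x[Ic]) <= lq(x[I])

print(in_Cq(x, [0], q=1))                              # True: the (1-)REC(1,1) fails
print(any(in_Cq(x, [i], q=0.5) for i in range(3)))     # False: x lies outside C_{1/2}(1)
```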
In order to establish the oracle property and the global recovery bound for the ℓp,q regularization problem, we further introduce the notion of group restricted eigenvalue condition (GREC). Given S N r, x Rn and J {1, . . . , r}, we use ranki(x) to denote the rank of x Gi p among { x Gj p : j J c} (in a decreasing order), J (x; N) to denote the index set of the first N largest groups in the value of x Gi p among { x Gj p : j J c}, that is, J (x; N) := {i J c : ranki(x) {1, . . . , N}} . Furthermore, by letting R := r |J | N , we denote Jk(x; N) := {i J c : ranki(x) {k N + 1, . . . , (k + 1)N}} , k = 1, . . . , R 1, {i J c : ranki(x) {RN + 1, . . . , r |J |}} , k = R. (19) Note that the residual ˆx := x (ℓp,q) x of the ℓp,q regularization problem always satisfies ˆx GSc p,q ˆx GS p,q. Thus we introduce the notion of GREC, where the minimum is taken over a restricted set measured by an ℓp,q norm inequality, as follows. Definition 6 Let 0 < q p 2. The (p, q)-group restricted eigenvalue condition relative to (S, N) ((p, q)-GREC(S, N)) is said to be satisfied if φp,q(S, N) := min Ax 2 x GN p,2 : |J | S, x GJ c p,q x GJ p,q, N = J (x; N) J > 0. The (p, q)-GREC extends the q-REC to the setting equipping with a pre-defined group structure. Handling the components in each group as one element, the (p, q)-GREC admits the fewer degree of freedom, which is S (about s/nmax), on its associated constraint than that of the q-REC, and thus it characterizes a weaker condition than the q-REC. For example, the 0-REC(s, s) is to indicate the restricted positive definiteness of A A, which is valid only for the vectors whose cardinality is less than 2s; while the (p, 0)-GREC(S, S) is to describe the restricted positive definiteness of A A on any 2S-group support, whose degree of freedom is much less than the 2s-support. Thus the (p, 0)-GREC(S, S) provides a broader condition than the 0-REC(s, s). Similar to the proof of Proposition 5, we can show that if 0 q1 q2 1 p 2 and the (p, q2)-GREC(S, N) holds, then the (p, q1)-GREC(S, N) also holds. We end this subsection by providing the following lemma, which will be useful in establishing the global recovery bound for the ℓp,q regularization problem in Theorem 9. Lemma 7 Let 0 < q 1 p, τ 1 and x Rn, N := J (x; N) J and Jk := Jk(x; N) for k = 1, . . . , R. Then the following inequalities hold k=1 x GJk p,τ N 1 τ 1 q x GJ c p,q. Hu, Li, Meng, Qin, and Yang Proof By the definition of Jk (19), for each j Jk, one has that x Gj p x Gi p, for each i Jk 1, and thus x Gj q p 1 i Jk 1 x Gi q p = 1 N x GJk 1 q p,q. Consequently, we obtain that x GJk τ p,τ = X i Jk x Gi τ p N1 τ/q x GJk 1 τ p,q. Further by Huang and Yang (2003, lem. 4.1) (τ 1 and q 1), it follows that x GN c p,τ = PR k=1 P i Jk x Gi τ p 1/τ PR k=1 x GJk p,τ N 1 τ 1 q PR k=1 x GJk 1 p,q q x GJ c p,q. The proof is complete. 2.3 Global Recovery Bound In recent years, many articles have been devoted to establishing the oracle property and the global recovery bound for the ℓ1 regularization problem (1) under the RIP or REC; see Bickel et al. (2009); Meinshausen and Yu (2009); van de Geer and B uhlmann (2009); Zhang (2009). However, to the best of our knowledge, there is few paper devoted to investigating these properties for the lower-order regularization problem. In the preceding subsections, we have introduced the general notion of (p, q)-GREC. Under the (p, q)-GREC(S, S), the solution of Ax = b with group sparsity being S is unique. 
In this subsection, we will present the oracle property and the global recovery bound for the ℓp,q regularization problem (4) under the (p, q)-GREC. The oracle property provides an upper bound on the squares error of the linear system and the violation of the true nonzero groups for each point in the level set of the objective function of problem (4) lev F ( x) := {x Rn : Ax b 2 2 + λ x q p,q λ x q p,q}. Proposition 8 Let 0 < q 1 p, S > 0 and let the (p, q)-GREC(S, S) hold. Let x be the unique solution of Ax = b at a group sparsity level S, and S be the index set of nonzero groups of x. Let K be the smallest integer such that 2K 1q 1. Then, for any x lev F ( x), the following oracle inequality holds Ax A x 2 2 + λ x GSc q p,q λ 2 2 q S(1 2 K) 2 2 q /φ 2q 2 q p,q (S, S). (20) Moreover, letting N := S S(x ; S), we have x GN x GN 2 p,2 λ 2 2 q S(1 2 K) 2 2 q /φ 4 2 q p,q (S, S). Group Sparse Optimization Proof Let x lev F ( x). That is, Ax b 2 2 + λ x q p,q λ x q p,q. By Lemmas 1(ii) and 2, one has that Ax A x 2 2 + λ x GSc q p,q λ x GS q p,q λ x GS q p,q λ x GS x GS q p,q λS1 2 K x GS x GS q p,2. (21) Noting that x GSc x GSc q p,q x GS x GS q p,q x GSc q p,q ( x GS q p,q x GS q p,q) = x q p,q x q p,q 0. Then the (p, q)-GREC(S, S) implies that x GS x GS p,2 Ax A x 2/φp,q(S, S). This, together with (21), yields that Ax A x 2 2 + λ x GSc q p,q λS1 2 K Ax A x q 2/φq p,q(S, S), (22) and consequently, Ax A x 2 λ 1 2 q S(1 2 K)/(2 q)/φ q 2 q p,q (S, S). (23) Therefore, by (22) and (23), we arrive at the oracle inequality (20). Furthermore, by the definition of N , the (p, q)-GREC(S, S) implies that x GN x GN 2 p,2 Ax A x 2 2/φ2 p,q(S, S) λ 2 2 q S(1 2 K) 2 2 q /φ 4 2 q p,q (S, S). The proof is complete. One of the main results of this section is presented as follows, where we establish the global recovery bound for the ℓp,q regularization problem under the (p, q)-GREC. We will apply oracle inequality (20) and Lemma 7 in our proof. Theorem 9 Let 0 < q 1 p 2, S > 0 and let the (p, q)-GREC(S, S) hold. Let x be the unique solution of Ax = b at a group sparsity level S, and S be the index set of nonzero groups of x. Let K be the smallest integer such that 2K 1q 1. Then, for any x lev F ( x), the following global recovery bound for problem (4) holds x x 2 2 2λ 2 2 q S q +(1 2 K) 4 q(2 q) /φ 4 2 q p,q (S, S). (24) More precisely, ( O(λ 2 2 q S), 2K 1q = 1, O(λ 2 2 q S 3 q 2 q ), 2K 1q > 1. (25) Hu, Li, Meng, Qin, and Yang Proof Let N := S S(x ; S) as in Proposition 8. Since p 2, it follows from Lemma 7 and Proposition 8 that x GN c 2 2 x GN c 2 p,2 S1 2/q x GSc 2 p,q λ 2 2 q S q +(1 2 K) 4 q(2 q) /φ 4 2 q p,q (S, S). Then by Proposition 8, one has that x x 2 2 = x GN x GN 2 2 + x GN c 2 2 λ 2 2 q S(1 2 K) 2 2 q /φ 4 2 q p,q (S, S) + λ 2 2 q S q +(1 2 K) 4 q(2 q) /φ 4 2 q p,q (S, S) q +(1 2 K) 4 q(2 q) /φ 4 2 q p,q (S, S), where the last inequality follows from the fact that 2K 1q 1. This proves (24). In particular, if 2K 1q = 1, then q 2 q + 1 2 K 4 q(2 q) = 1 and thus x x 2 2 O(λ 2 2 q S). If 2K 1q > 1, then 2K 2q < 1. Hence, q 2 q + 1 2 K 4 q(2 q) < 3 q 2 q, and consequently x x 2 2 O(λ 2 2 q S Hence (25) is obtained. The proof is complete. The global recovery bound (25) is deduced under general (p, q)-GREC(S, S), which is weaker than the REC as used in van de Geer and B uhlmann (2009). It shows that the sparse solution x may be recovered by any point x in the level set lev F ( x), in particular, when x is a global optimal solution of problem (4), as long as λ is sufficiently small. 
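To make the orders in (25) concrete, the following worked instances (our own illustration, using the definition of K stated above) specialize the bound to the two values of q that recur in this paper: q = 1 recovers the Lasso-type order of (3) when every group is a singleton, and q = 1/2 gives the order examined numerically in Example 2 below.

```latex
% Worked instances of (25); K is the smallest integer with 2^{K-1} q >= 1.
q = 1:\quad K = 1,\ 2^{K-1}q = 1 \ \Longrightarrow\ \|x-\bar x\|_2^2 \le O\big(\lambda^{2/(2-1)}S\big) = O(\lambda^{2} S),
\qquad
q = \tfrac12:\quad K = 2,\ 2^{K-1}q = 1 \ \Longrightarrow\ \|x-\bar x\|_2^2 \le O\big(\lambda^{2/(2-1/2)}S\big) = O(\lambda^{4/3} S).
```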
It is well-known that when p = 2 and q = 1, convex regularization problem (4) is numerically intractable for finding the sparse solution and that when q < 1 finding a point in the nonconvex level set lev F ( x) is equivalent to finding a global minimum of minimizing the indicator function of the nonconvex level set, which is NP-hard. Thus Theorem 9 is only a theoretical result and does not provide any insight for the numerical computation of a sparse solution by virtue of problem (4). We will design a proximal gradient method in section 3, test its numerical efficiency and provide some general guidance on which q is more attractive in practical applications in section 4. To conclude this subsection, we illustrate by an example in which (24) does not hold when q = 1, but it does and is also tight when q = 1 2. We will testify the recovery bound O(λ4/3S) in (24) when q = 1 2 by using a global optimization method. Example 2 By letting a = a = 2 and c = c = 1 in Example 1, we consider the following matrix: A := 2 3 1 2 1 3 We set b := (2, 2) and then a true solution of Ax = b is x := (1, 0, 0) . Denoting x := (x1, x2, x3) , the objective function associated with the ℓ1 regularization problem (1) is F(x) := Ax b 2 2 + λ x 1 = (2x1 + 3x2 + x3 2)2 + (2x1 + x2 + 3x3 2)2 + λ(|x1| + |x2| + |x3|). Group Sparse Optimization Let x (ℓ1) := (x 1, x 2, x 3) be an optimal solution of problem (1). Without loss of generality, we assume λ 1. The necessary condition of x (ℓ1) being an optimal solution of problem (1) is 0 F(x (ℓ1)), that is, 0 16x 1 + 16x 2 + 16x 3 16 + λ |x 1|, (26a) 0 16x 1 + 20x 2 + 12x 3 16 + λ |x 2|, (26b) 0 16x 1 + 12x 2 + 20x 3 16 + λ |x 3|, (26c) where |µ| := sign(µ), µ = 0, [ 1, 1], µ = 0. We first show that x i 0 for i = 1, 2, 3 by contradiction. Indeed, if x 1 < 0, (26a) reduces to 16x 1 + 16x 2 + 16x 3 16 = λ. Summing (26b) and (26c), we further have λ = 16x 1 + 16x 2 + 16x 3 16 λ 2 ( |x 2| + |x 3|), which implies that x 2 0 and x 3 0. Hence, it follows that F(x ) > F(0), which indicates that x is not an optimal solution of problem (1), and thus, x 1 < 0 is impossible. Similarly, we can show that x 2 0 and x 3 0. Next, we find the optimal solution x (ℓ1) by only considering x (ℓ1) 0. It is easy to obtain that the solution of (26) and the corresponding objective value associated with problem (1) can be represented respectively by 16 2x 3, x 2 = x 3 , and F (x (ℓ1)) = λ λ2 Hence, x (ℓ1) := 0, 1 32 is an optimal solution of problem (1). The estimated error for this x (ℓ1) is x (ℓ1) x 2 2 = 1 + 1 which does not meet the recovery bound (25) for any λ 1. It is revealed from Example 1 that this matrix A satisfies the 1/2-REC(1, 1). Then the hypothesis of Theorem 9 is verified, and thus, Theorem 9 is applicable to establishing the recovery bound (25) for the ℓ1/2 regularization problem. Although we cannot obtain the closed-form solution of this nonconvex ℓ1/2 regularization problem, as it is of only 3dimensions, we can apply a global optimization method, the filled function method (Ge, 1990), to find the global optimal solution x (ℓ1/2) and thus to testify the recovery bound (25). This is done by computing the ℓ1/2 regularization problem for many λ to plot the curve x (ℓ1/2) x 2 2. Figure 2 illustrates the variation of the estimated error x (ℓ1/2) x 2 2 and the bound 2λ4/3 (that is the right-hand side of (24), where S = 1 and φ1/2(1, 1) 1 (see Example 1)), when varying the regularization parameter λ from 10 8 to 1. 
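A rough numerical replication of this experiment is sketched below. The paper uses the filled function method (Ge, 1990) to find global minima of the ℓ1/2 regularization problem; as a stand-in, this sketch (our own, with hypothetical choices of the λ grid and number of restarts) uses a crude multi-start Nelder-Mead search, so it only approximates the global solution x*(ℓ1/2) before comparing the squared error with the bound 2λ^{4/3}.

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[2.0, 3.0, 1.0],
              [2.0, 1.0, 3.0]])
b = np.array([2.0, 2.0])
x_bar = np.array([1.0, 0.0, 0.0])          # true sparse solution of Ax = b

def F(x, lam, q=0.5):
    """Objective of the l_{1/2} regularization problem for this example."""
    r = A @ x - b
    return r @ r + lam * np.sum(np.abs(x) ** q)

rng = np.random.default_rng(0)
for lam in [0.01, 0.1, 0.5, 1.0]:
    # multi-start local search as a crude substitute for the filled function method
    best = min((minimize(F, rng.uniform(-2, 2, 3), args=(lam,), method="Nelder-Mead")
                for _ in range(50)), key=lambda res: res.fun)
    err = np.sum((best.x - x_bar) ** 2)
    print(f"lambda={lam:5.2f}  ||x*-xbar||^2={err:.3e}  bound 2*lambda^(4/3)={2*lam**(4/3):.3e}")
```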
Figure 2 illustrates that the recovery bound (25) is satisfied, and indeed tight, for this example.

Figure 2: The illustration of the recovery bound (24) and estimated error.

2.4 Local Recovery Bound

In the preceding subsection, we provided the global analysis of the recovery bound for the ℓp,q regularization problem under the (p, q)-GREC; see Theorem 9. One can also observe from Figure 2 that the global recovery bound (25) is tight for the ℓ1/2 regularization problem, as the curves come together at λ ≈ 0.5, but there is still a big gap for improvement when λ is small. This subsection is devoted to providing a local analysis of the recovery bound for the ℓp,q regularization problem by virtue of the variational analysis technique (Rockafellar and Wets, 1998). For x ∈ R^n and δ ∈ R_+, we use B(x, δ) to denote the open ball of radius δ centered at x. For a lower semi-continuous (lsc) function f: R^n → R and x̄, w ∈ R^n, the subderivative of f at x̄ along the direction w is defined by

df(x̄)(w) := liminf_{τ ↓ 0, w' → w} [f(x̄ + τw') − f(x̄)] / τ.

To begin with, we show in the following lemma a significant advantage of the lower-order regularization over the ℓ1 regularization: the lower-order regularization term can easily induce the sparsity of a local minimum.

Lemma 10 Let 0 < q < 1 ≤ p. Let f: R^n → R be a lsc function satisfying df(0)(0) = 0. Then the function F := f + λ‖·‖_{p,q}^q has a local minimum at 0 with the first-order growth condition being fulfilled; that is, there exist some ϵ > 0 and δ > 0 such that

F(x) ≥ F(0) + ϵ‖x‖_2 for any x ∈ B(0, δ).

Proof Let ϕ := λ‖·‖_{p,q}^q and then F = f + ϕ. Since ϕ is grouped separable, by Rockafellar and Wets (1998, prop. 10.5), it follows from the definition that dϕ(0) = δ_{{0}}, where δ_X denotes the indicator function of X. Applying Rockafellar and Wets (1998, prop. 10.9), it follows that

dF(0) ≥ df(0) + dϕ(0) = df(0) + δ_{{0}}.  (27)

By the assumption that f is finite and df(0)(0) = 0, its subderivative df(0) is proper (see Rockafellar and Wets, 1998, ex. 3.19). Noting that df(0)(0) = 0, we obtain that df(0) + δ_{{0}} = δ_{{0}}. This, together with (27), yields that dF(0) ≥ δ_{{0}}. Therefore, by definition, there exist some ϵ > 0 and δ > 0 such that F(x) ≥ F(0) + ϵ‖x‖_2 for any x ∈ B(0, δ). The proof is complete.

With the help of the above lemma, we can present in the following a local version of the recovery bound. This is done by constructing a path of local minima depending on the regularization parameter λ for the regularization problem, which starts from a sparse solution of the original problem and shares the same support as this sparse solution, resulting in a sharper bound in terms of λ^2.

Theorem 11 Let x̄ be a solution of Ax = b, S be the group sparsity of x̄, and B be a submatrix of A consisting of its columns corresponding to the active components of x̄. Suppose that any nonzero group of x̄ is active, and that the columns of A corresponding to the active components of x̄ are linearly independent. Let 0 < q < 1 ≤ p. Then there exist κ > 0 and a path of local minima of problem (4), x*(λ), such that

‖x*(λ) − x̄‖_2^2 ≤ λ^2 q^2 S ‖(B^⊤B)^{−1}‖^2 max_{x̄_{G_i} ≠ 0} ‖x̄_{G_i}‖_p^{2(q−p)} ‖x̄_{G_i}‖_{2p−2}^{2p−2} for any λ < κ.

Proof Without loss of generality, we let x̄ be of the structure x̄ = (z̄^⊤, 0^⊤)^⊤ with z̄ = (x̄_{G_1}^⊤, ..., x̄_{G_S}^⊤)^⊤ and x̄_{G_i} ≠_a 0 for i = 1, ..., S, and let s be the sparsity of x̄. Let A = (B, D) with B being the submatrix consisting of the first s columns of A (corresponding to the active components of x̄).
By the assumption, we have that B is of full column rank and thus B B is invertible. In this setting, the linear relation A x = b reduces to B z = b. The proof of this theorem is divided into the following three steps: (a) construct a smooth path from x by the implicit function theorem; (b) validate that every point of the constructed path is a local minimum of (4); and (c) establish the recovery bound for the constructed path. First, to show (a), we define H : Rs+1 Rs by H(z, λ) = 2B (Bz b) + λq z G1 q p p σ(z G1) ... z GS q p p σ(z GS) where σ(z Gi) = vector |zj|p 1sign(zj) Gi, denoting a vector consisting of |zj|p 1sign(zj) for all j Gi. Let δ > 0 be sufficiently small such that sign(z) = sign( z) for any z B( z, δ) Hu, Li, Meng, Qin, and Yang and thus H is smooth on B( z, δ) R. Note that H( z, 0) = 0 and H z ( z, 0) = 2B B. By the implicit function theorem (Rudin, 1976), there exist some κ > 0, δ (0, δ) and a unique smooth function ξ : ( κ, κ) B( z, δ) such that {(z, λ) B( z, δ) ( κ, κ) : H(z, λ) = 0} = {(ξ(λ), λ) : λ ( κ, κ)}, (28) 0 ... 0 0 0 MS ξ(λ)G1 q p p σ(ξ(λ)G1) ... ξ(λ)GS q p p σ(ξ(λ)GS) where Mi for each i = 1, . . . , S is denoted by Mi = (q p) ξ(λ)Gi q 2p p (σ(ξ(λ)Gi))(σ(ξ(λ)Gi)) + (p 1) ξ(λ)Gi q p p diag |ξ(λ)j|p 2 , and diag |ξ(λ)j|p 2 denotes a diagonal matrix generated by vector |ξ(λ)j|p 2 . Thus, by (28) and (29), we have constructed a smooth path ξ(λ) near z, λ ( κ, κ), such that 2B (Bξ(λ) b) + λq ξ(λ)G1 q p p σ(ξ(λ)G1) ... ξ(λ)GS q p p σ(ξ(λ)GS) 0 ... 0 0 0 MS This shows that (a) is done as desired. For fixed λ ( κ, κ), let x (λ) := (ξ(λ) , 0) . To verify (b), we prove that x (λ), with ξ(λ) satisfying (30) and (31), is a local minimum of problem (4). Let h : Rs R be a function with h(z) := Bz b 2 2 + λ z q p,q for any z Rs. Note that h(ξ(λ)) = Ax (λ) b 2 2 +λ x (λ) q p,q and that h is smooth around ξ(λ). By noting that ξ(λ) satisfies (30) and (31) (the firstand secondderivative of h at ξ(λ)), one has that h satisfies the second-order growth condition at ξ(λ), that is, there exist ϵλ > 0 and δλ > 0 such that h(z) h(ξ(λ)) + 2ϵλ z ξ(λ) 2 2 for any z B(ξ(λ), δλ). (32) In what follows, let ϵλ > 0 and δλ > 0 be given as above, and select ϵ0 > 0 such that ϵλϵ0 B D > 0. (33) According to Lemma 10 (with D 2 2 + 2 Bξ(λ) b, D 2ϵ0 2 2 in place of f), there exists δ0 > 0 such that Dy 2 2 + 2 Bξ(λ) b, Dy 2ϵ0 y 2 2 + λ y q p,q 0 for any y B(0, δ0). (34) Group Sparse Optimization Thus, for each x := (z, y) B(ξ(λ), δλ) B(0, δ0), it follows that Ax b 2 2 + λ x q p,q = Bz b + Dy 2 2 + λ z q p,q + λ y q p,q = Bz b 2 2 + λ z q p,q + Dy 2 2 + 2 Bz b, Dy + λ y q p,q = h(z) + Dy 2 2 + 2 Bξ(λ) b, Dy + λ y q p,q + 2 B(z ξ(λ)), Dy . By (32) and (34), it yields that Ax b 2 2 + λ x q p,q h(ξ(λ)) + 2ϵλ z ξ(λ) 2 2 + 2ϵ0 y 2 2 + 2 B(z ξ(λ)), Dy h(ξ(λ)) + 4 ϵλϵ0 z ξ(λ) 2 y 2 2 B D z ξ(λ) 2 y 2 = Ax (λ) b 2 2 + λ x (λ) q p,q + 2(2 ϵλϵ0 B D ) z ξ(λ) 2 y 2 Ax (λ) b 2 2 + λ x (λ) q p,q, where the last inequality follows from (33). Hence x (λ) is a local minimum of problem (4), and (b) is verified. Finally, we check (c) by providing an upper bound on the distance from ξ(λ) to z. By (30), one has that ξ(λ) z = λq 2 ((B B) 1) ξ(λ)G1 q p p σ(ξ(λ)G1) ... ξ(λ)GS q p p σ(ξ(λ)GS) Noting that {ξ(λ) : λ ( κ, κ)} B( z, δ), without loss of generality, we assume for any λ < κ that ξ(λ)Gi 2(q p) p 2 z Gi 2(q p) p and ξ(λ)Gi 2p 2 2p 2 2 z Gi 2p 2 2p 2 for i = 1, . . . , S (otherwise, we choose a smaller δ). Recall that σ(ξ(λ)Gi) = vector |ξ(λ)j|p 1sign(ξ(λ)j) Gi. 
We obtain from (35) that ξ(λ) z 2 2 λ2q2 4 (B B) 1 2 PS i=1 ξ(λ)Gi 2(q p) p P j Gi |ξ(λ)j|2p 2 4 (B B) 1 2 PS i=1 ξ(λ)Gi 2(q p) p ξ(λ)Gi 2p 2 2p 2 4 (B B) 1 2S max i=1,...,S ξ(λ)Gi 2(q p) p ξ(λ)Gi 2p 2 2p 2 λ2q2S (B B) 1 2 max i=1,...,S z Gi 2(q p) p z Gi 2p 2 2p 2 . Hence we arrive at that x (λ) x 2 2 = ξ(λ) z 2 2 λ2q2S (B B) 1 2 max x Gi =0 x Gi 2(q p) p x Gi 2p 2 2p 2 for any λ < κ, and the proof is complete. Hu, Li, Meng, Qin, and Yang Theorem 11 provides a uniform local recovery bound for all the ℓp,q regularization problems (0 < q < 1 p), which is x p,q(λ) x 2 2 O(λ2S), where x p,q(λ) is a local optimal solution of problem (4) (near x). This bound improves the global recovery bound given in Theorem 9 (of order O(λ 2 2 q )) and shares the same one with the ℓp,1 regularization problem (group Lasso); see Blumensath and Davies (2008); van de Geer and B uhlmann (2009). It is worth noting that our proof technique is not working when q = 1 as Lemma 10 fails in this case. 3. Proximal Gradient Method for Group Sparse Optimization Many efficient algorithms have been proposed to solve sparse optimization problems, and one of the most popular optimization algorithms is the proximal gradient method (PGM); see Beck and Teboulle (2009); Combettes and Wajs (2005); Xiao and Zhang (2013) and references therein. It was reported in Combettes and Wajs (2005) that the PGM for solving the ℓ1 regularization problem (1) reduces to the well-known iterative soft thresholding algorithm (ISTA), and that the ISTA has a local linear convergence rate under some assumptions; see Bredies and Lorenz (2008); Hale et al. (2008); Tao et al. (2016). Recently, the global convergence of the PGM for solving some types of nonconvex regularization problems have been studied under the framework of the Kurdyka Lojasiewicz theory (Attouch et al., 2010; Bolte et al., 2013), the majorization-minimization scheme (Mairal, 2013), the coordinate gradient descent method (Tseng and Yun, 2009), the general iterative shrinkage and thresholding (Gong et al., 2013) and the successive upper-bound minimization approach (Razaviyayn et al., 2013). In this section, we apply the PGM to solve the group sparse optimization problem (4) (PGM-GSO), which is stated as follows. Algorithm 1 (PGM-GSO) Select a stepsize v, start with an initial point x0 Rn, and generate a sequence {xk} Rn via the iteration zk = xk 2v A (Axk b), (36) xk+1 arg min x Rn λ x q p,q + 1 2v x zk 2 2 Global convergence of the PGM-GSO falls in the framework of the Kurdyka Lojasiewicz theory (see Attouch et al., 2010). In particular, following from Bolte et al. (2013, prop. 3), the sequence generated by the PGM-GSO converges to a critical point, especially a global minimum when q 1 and a local minimum when q = 0 (inspired by the idea in Blumensath and Davies, 2008), as summarized as follows. Theorem 12 Let p 1, and let {xk} be a sequence generated by the PGM-GSO with v < 1 2 A 2 2 . Then the following statements hold: (i) if q 1, then {xk} converges to a global minimum of problem (4), Group Sparse Optimization (ii) if q = 0, then {xk} converges to a local minimum of problem (4), and (iii) if 0 < q < 1, then {xk} converges to a critical point 1 of problem (4). Although the global convergence of the PGM-GSO has been provided in Theorem 12, some important issues of the PGM-GSO have not been discovered yet. The section is to continue the development of the PGM-GSO, concentrating on its efficiency and applicability. 
In particular, we will establish the local convergence rate of the PGM-GSO under some mild conditions, and derive the analytical solutions of subproblem (37) for some specific p and q. 3.1 Local Linear Convergence Rate In this subsection, we establish the local linear convergence rate of the PGM-GSO for the case when p = 1 and 0 < q < 1. For the reminder of this subsection, we always assume that p = 1 and 0 < q < 1. To begin with, by virtue of the second-order necessary condition of subproblem (37), the following lemma provides a lower bound for nonzero groups of sequence {xk} generated by the PGM-GSO and shows that the index set of nonzero groups of {xk} maintains constant for large k. Lemma 13 Let K = (vλq(1 q)) 1 2 q , and let {xk} be a sequence generated by the PGMGSO with v < 1 2 A 2 2 . Then the following statements hold: (i) For any i and k, if xk Gi = 0, then xk Gi 1 K. (ii) xk shares the same index set of nonzero groups for large k, that is, there exist N N and I {1, . . . , r} such that xk Gi = 0, i I, xk Gi = 0, i / I, for all k N. Proof (i) For each group xk Gi, by (37), one has that xk Gi arg min x Rni λ x q 1 + 1 2v x zk 1 Gi 2 2 If xk Gi = 0, we define Ak i := {j Gi : xk j = 0} and ak i := |Ak i |. Without loss of generality, we assume that the first ak i components of xk Gi are nonzeros. Then (38) implies that xk Gi arg min x Rak i {0} λ x q 1 + 1 2v x zk 1 Gi 2 2 Its second-order necessary condition says that 1 v Ik i + λq(q 1)Mk i 0, 1. A point x is said to be a critical point of F if 0 belongs to its limiting subdifferential at x; see Mordukhovich (2006). Hu, Li, Meng, Qin, and Yang where Ik i is the identity matrix in Rak i ak i and Mk i = xk Ak i q 2 1 (sign(xk Ak i ))(sign(xk Ak i )) . Let e be the first column of Ik i . Then, we obtain that 1 ve Ik i e + λq(q 1)e Mk i e 0, that is, 1 v + λq(q 1) xk Ak i q 2 1 0. Consequently, it implies that xk Gi 1 = xk Ak i 1 (vλq(1 q)) 1 2 q = K. Hence, it completes the proof of (i). (ii) Recall from Theorem 12 that {xk} converges to a critical point x . Then there exists N N such that xk x 2 < K 2 n, and thus, xk+1 xk 2 xk+1 x 2 + xk x 2 < K n, (39) for any k N. Proving by contradiction, without loss of generality, we assume that there exist k N and i {1, . . . , r} such that xk+1 Gi = 0 and xk Gi = 0. Then it follows from (i) that xk+1 xk 2 1 n xk+1 xk 1 1 n xk+1 Gi xk Gi 1 K n, which yields a contradiction with (39). The proof is complete. Let x Rn, and let S := i {1, . . . , r} : x Gi = 0 and B := (A j)j GS. Consider the following restricted problem min y Rna f(y) + ϕ(y), (40) where na := P i S ni, and f : Rna R with f(y) := By b 2 2 for any y Rna, ϕ : Rna R with ϕ(y) := λ y q 1,q for any y Rna. The following lemma provides the firstand second-order conditions for a local minimum of the ℓ1,q regularization problem, and shows a second-order growth condition for the restricted problem (40), which is useful for establishing the local linear convergence rate of the PGMGSO. Group Sparse Optimization Lemma 14 Assume that x is a local minimum of problem (4). Suppose that any nonzero group of x is active, and the columns of B are linearly independent 2. Then the following statements are true: (i) The following firstand second-order conditions hold 2B (By b) + λq y G1 q 1 1 sign(y G1) ... y GS q 1 1 sign(y GS) 2B B + λq(q 1) 0 ... 0 0 0 M S where M i = y Gi q 2 1 sign(y Gi) sign(y Gi) . 
(ii) The second-order growth condition holds at y for problem (40), that is, there exist ε > 0 and δ > 0 such that (f + ϕ)(y) (f + ϕ)(y ) + ε y y 2 2 for any y B(y , δ). (43) Proof Without loss of generality, we assume that S := {1, . . . , S}. By assumption, x is of structure x := (y , 0) with y := (x G1 , . . . , x GS ) and x Gi =a 0 for i = 1, . . . , S. (44) (i) By (44), one has that ϕ( ) is smooth around y with its firstand second-derivatives being ϕ (y ) = λq y G1 q 1 1 sign(y G1) ... y GS q 1 1 sign(y GS) ϕ (y ) = λq(q 1) 0 ... 0 0 0 M S hence (f + ϕ)( ) is also smooth around y . Therefore, we obtain the following firstand second-order necessary conditions of problem (40) f (y ) + ϕ (y ) = 0 and f (y ) + ϕ (y ) 0, 2. This assumption is mild, and it holds automatically for the case when nmax = 1 (see Chen et al., 2010, thm. 2.1). Hu, Li, Meng, Qin, and Yang which are (41) and 2B B + λq(q 1) 0 ... 0 0 0 M S respectively. Proving by contradiction, we assume that (42) does not hold, that is, there exists some w = 0 such that 2w B Bw + λq(q 1) j Gi wjsign(y j ) By assumption, one has that B B 0, and thus it follows from (45) that j Gi wjsign(y j ) > 0 for some i {1, . . . , S}. (46) Let h : R R with h(t) := B(y + tw) b 2 2 + λ y + tw p p for any t R. Clearly, h( ) has a local minimum at 0, and h( ) is smooth around 0 with its derivatives being h (0) = 2w B (By b) + λq y Gi q 1 1 X j Gi wjsign(y j ) h (0) = 2w B Bw + λq(q 1) j Gi wjsign(y j ) = 0 (by (45)), h(3)(0) = λq(q 1)(q 2) j Gi wjsign(y j ) h(4)(0) = λq(q 1)(q 2)(q 3) j Gi wjsign(y j ) due to (46). However, by elementary of calculus, it is clear that h(4)(0) must be nonnegative (since h( ) obtains a local minimum at 0), which yields a contradiction to (47). Therefore, we proved (42). (ii) By the structure of y (44), ϕ( ) is smooth around y , and thus, (f +ϕ)( ) is also smooth around y with its derivatives being f (y ) + ϕ (y ) = 0 and f (y ) + ϕ (y ) 0; due to (41) and (42). Hence (43) follows from Rockafellar and Wets (1998, thm. 13.24). This completes the proof. The key for the study of local convergence rate of the PGM-GSO is the descent property of the function f+ϕ in each iteration step. The following lemma states some basic properties of active groups of sequence {xk} generated by the PGM-GSO. Group Sparse Optimization Lemma 15 Let {xk} be a sequence generated by the PGM-GSO with v < 1 2 A 2 2 , which converges to x (by Theorem 12). Assume that the assumptions in Lemma 14 are satisfied. We define α := B 2 2, L := 2 A 2 2 and Dk := ϕ(yk) ϕ(yk+1) + f (yk), yk yk+1 . Then there exist δ > 0 and N N such that the following inequalities hold for any k N: v α yk yk+1 2 2, (48) (f + ϕ)(yk+1) (f + ϕ)(yk) 1 Lv 2(1 vα) Proof By Lemma 13(ii) and the fact that {xk} converges to x , one has that xk shares the same index set of nonzero groups with that of x for large k; further by the structure of y (44), we obtain that all components in nonzero groups of yk are nonzero for large k. In another word, we have there exists N N such that yk =a 0 and xk GSc = 0 for any k N; (50) hence ϕ( ) is smooth around yk for any k N. In view of PGM-GSO and the decomposition of x = y , z , one has that yk+1 arg min ϕ(y) + 1 y yk vf (yk) 2 Its first-order necessary condition is ϕ (yk+1) = 1 yk vf (yk) yk+1 . (51) Recall from (42) that ϕ (y ) 2B B. Since ϕ( ) is smooth around y , then there exists δ > 0 such that ϕ (w) 2B B for any w B(y , δ). 
Noting that {yk} converges to y , without loss of generality, we assume that yk y < δ for any k N (otherwise, we can choose a larger N). Therefore, one has that ϕ (yk) 2B B for any k N. Then by Taylor expansion, we can assume without loss of generality that the following inequality holds for any k N and any w B(y , δ) (otherwise, we can choose a smaller δ): ϕ(w) > ϕ(yk+1) + ϕ (yk+1), w yk+1 α w yk+1 2 2. Hence, by (51), it follows that ϕ(w) ϕ(yk+1) > 1 v yk vf (yk) yk+1, w yk+1 α w yk+1 2 2. (52) Then (48) follows by setting w = yk. Furthermore, by the definition of f( ), it is of class C1,1 L and it follows from Bertsekas (1999, prop. A.24) that f(y) f(x) f (x)(y x) L 2 y x 2 for any x, y. Hu, Li, Meng, Qin, and Yang Then, by the definition of Dk, it follows that (f + ϕ)(yk+1) (f + ϕ)(yk) + Dk = f(yk+1) f(yk) + f (yk), yk yk+1 L 2 yk yk+1 2 2 Lv 2(1 vα)Dk, where the last inequality follows from (48), and thus, (49) is proved. The main result of this subsection is presented as follows, in which we establish the local linear convergence rate of the PGM-GSO to a local minimum for the case when p = 1 and 0 < q < 1 under some mild assumptions. Theorem 16 Let {xk} be a sequence generated by the PGM-GSO with v < 1 2 A 2 2 . Then {xk} converges to a critical point x of problem (4). Assume that x is a local minimum of problem (4). Suppose that any nonzero group of x is active, and the columns of B are linearly independent. Then there exist N N, C > 0 and η (0, 1) such that F(xk) F(x ) Cηk and xk x 2 Cηk for any k N. (53) Proof The convergence of {xk} to a critical point x of problem (4) directly follows from Theorem 12. Let Dk, N and δ be defined as in Lemma 15, and let rk := F(xk) F(x ). Note in (50) that yk =a 0 and xk Gc S = 0 for any k N. Thus rk = (f + ϕ)(yk) (f + ϕ)(y ) for any k N. It is trivial to see that ϕ( ) is smooth around y (as it is active) and that ϕ (y ) = λq(q 1) 0 ... 0 0 0 M S 0, f (y ) + ϕ (y ) 0; as shown in (42). This shows that ϕ( ) is concave around y , while (f + ϕ)( ) is convex around y . Without loss of generality, we assume that ϕ( ) is concave and (f + ϕ)( ) is convex in B(y , δ) and that yk B(y , δ) for any k N (since {yk} converges to y ). By the convexity of (f + ϕ)( ) in B(y , δ), it follows that for any k N rk = (f + ϕ)(yk) (f + ϕ)(y ) f (yk) + ϕ (yk), yk y = f (yk) + ϕ (yk), yk yk+1 + f (yk) + ϕ (yk), yk+1 y = Dk ϕ(yk) + ϕ(yk+1) + ϕ (yk), yk yk+1 + f (yk) + ϕ (yk), yk+1 y . Noting that ϕ( ) is concave in B(y , δ), it follows that ϕ(yk) ϕ(yk+1) ϕ (yk), yk yk+1 . Group Sparse Optimization Consequently, (54) is reduced to rk Dk + f (yk) + ϕ (yk), yk+1 y = Dk + ϕ (yk) ϕ (yk+1), yk+1 y + f (yk) + ϕ (yk+1), yk+1 y Dk + Lϕ + 1 v yk yk+1 2 yk+1 y 2, (55) where the last inequality follows from the smoothness of ϕ on B(y , δ) and (51), and Lϕ is the Lipschitz constant of ϕ ( ) on B(y , δ). Let β := 1 Lv 2(1 vα) (0, 1) (due to the assumption v < 1 L). Then, (49) is reduced to rk rk+1 = (f + ϕ)(yk) (f + ϕ)(yk+1) βDk > 0, and thus, it follows from (55) and (48) that βrk βDk + β Lϕ + 1 v yk yk+1 2 yk+1 y 2 rk rk+1 + β Lϕ + 1 v yk+1 y 2 q rk rk+1 + Lϕ + 1 vβ 1 vα yk+1 y 2 rk rk+1. Recall from Lemma 14(ii) that there exists c > 0 such that y y 2 2 c ((f + ϕ)(y) (f + ϕ)(y )) for any y B(y , δ). Thus, it follows that yk+1 y 2 2 crk+1 crk for any k N. (57) v 2. By Young s inequality, (56) yields that βrk rk rk+1 + 1 2ϵ yk+1 y 2 2 Lϕ + 1 v 2 + ϵvβ 2(1 vα)(rk rk+1) rk rk+1 + β 2 rk + cv 2(1 vα) Lϕ + 1 v 2 (rk rk+1). (58) Let γ := cv 2(1 vα) Lϕ + 1 v 2 > 0. 
Then, (58) is reduced to rk+1 1 + γ β 2 1 + γ rk = η1rk, where η1 := 1+γ β 2 1+γ (0, 1). Thus, by letting C1 := r Nη N 1 , it follows that rk ηk N 1 r N = C1ηk 1 for any k N. By letting η2 = η1 and C2 = c C1, it follows from (57) that xk x 2 = yk y 2 (crk)1/2 C2ηk 2 for any k N. Letting C := max{C1, C2} and η := max{η1, η2}, we obtain (53). The proof is complete. Hu, Li, Meng, Qin, and Yang Theorem 16 establishes the linear convergence rate of the PGM for solving the ℓ1,q regularization problem under two assumptions: (i) the critical point x of the sequence produced by the PGM is a local minimum of problem (4), and (ii) any nonzero group of the local minimum is an active one. The assumption (i) is important by which we are able to establish a second-order growth property, which plays a crucial role in our analysis. Note that the assumption (ii) is satisfied automatically for the sparse optimization problem (nmax = 1). Hence, when nmax = 1, we obtain the linear convergence rate of the PGM for solving ℓq regularization problem (0 < q < 1). This result is stated below as a corollary. Corollary 17 Let 0 < q < 1, and let {xk} be a sequence generated by the PGM for solving the following ℓq regularization problem min x Rn F(x) := Ax b 2 2 + λ x q q (59) 2 A 2 2 . Then {xk} converges to a critical point x of problem (59). Further assume that x is a local minimum of problem (59). Then there exist N N, C > 0 and η (0, 1) such that F(xk) F(x ) Cηk and xk x 2 Cηk for any k N. While we are carrying out the revision of our manuscript, we have found that the local linear convergence rate of the PGM has been studied in the literature. On one hand, the local linear convergence rate of the PGM for solving the ℓ1 regularization problem (ISTA) has been established under some assumptions in Bredies and Lorenz (2008); Hale et al. (2008); Tao et al. (2016), and, under the framework of the so-called KL theory, it is established that the sequence generated by the PGM linearly converges to a critical point of a KL function if its KL exponent is in (0,1 2]; see Frankel et al. (2015); Li and Pong (2016); Xu and Yin (2013). However, the KL exponent of the ℓq regularized function is still unknown, and thus the linear convergence result in these references cannot directly be applied to the ℓq regularization problem. On the other hand, Zeng et al. (2015) has obtained the linear convergence rate of the ℓq regularization problem under the framework of a restricted KL property. However, it seems that their result is restrictive as it is assumed that the stepsize v and the regularization component q satisfy q 2 < λmin(AT SAS) A 2 2 and q 4λmin(AT SAS) < v < 1 2 A 2 2 , where S is the active index of the limiting point x , while our result in Corollary 17 holds for all the stepsize v < 1 2 A 2 2 and the regularization component 0 < q < 1. 3.2 Analytical Solutions of Proximal Subproblems Since the main computation of the PGM is the proximal step (37), it is significant to investigate the solutions of subproblem (37) for the specific applications so as to spread the application of the PGM. Note that x q p,q and x zk 2 2 are both grouped separable. Then Group Sparse Optimization the proximal step (37) can be achieved parallelly in each group, and is equivalent to solve a cycle of low dimensional proximal optimization subproblems xk+1 Gi arg min x Rni λ x Gi q p + 1 2v x Gi zk Gi 2 2 for i = 1, , r. 
(60) When p and q are given as some specific numbers, such as p = 1, 2 and q = 0, 1/2, 2/3, 1, the solution of subproblem (60) of each group can be given explicitly by an analytical formula, as shown in the following proposition. Proposition 18 Let z Rl, v > 0 and the proximal regularization be Qp,q(x) := λ x q p + 1 2v x z 2 2 for any x Rl. Then the proximal operator Pp,q(z) arg min x Rl {Qp,q(x)} has the following analytical formula: (i) if p = 2 and q = 1, then z, z 2 > vλ, 0, otherwise, (ii) if p = 2 and q = 0, then 2vλ, 0 or z, z 2 = 2vλ, 0, z 2 < (iii) if p = 2 and q = 1/2, then P2,1/2(z) = 16 z 3/2 2 cos3 π 3 ψ(z) 3vλ+16 z 3/2 2 cos3 π 3 ψ(z) 3 z, z 2 > 3 0 or 16 z 3/2 2 cos3 π 3 ψ(z) 3vλ+16 z 3/2 2 cos3 π 3 ψ(z) 3 z, z 2 = 3 ψ(z) = arccos (iv) if p = 1 and q = 1/2, then P1,1/2(z) = z, Q1,1/2( z) < Q1,1/2(0), 0 or z, Q1,1/2( z) = Q1,1/2(0), 0, Q1,1/2( z) > Q1,1/2(0), Hu, Li, Meng, Qin, and Yang z 1 cos π 3 ξ(z) 3 sign(z), ξ(z) = arccos (v) if p = 2 and q = 2/3, then P2,2/3(z) = 32vλa2+3 a3/2+ 2 z 2 a3 z, z 2 > 2 2 0 or 3 a3/2+ 32vλa2+3 a3/2+ 2 z 2 a3 z, z 2 = 2 2 0, z 2 < 2 2 3(2vλ)1/4 cosh ϕ(z) 1/2 , ϕ(z) = arccosh 27 z 2 2 16(2vλ)3/2 (vi) if p = 1 and q = 2/3, then P1,2/3(z) = z, Q1,2/3( z) < Q1,2/3(0), 0 or z, Q1,2/3( z) = Q1,2/3(0), 0, Q1,2/3( z) > Q1,2/3(0), z = z 4vλ a1/2 2 z 1 a3 sign(z), 3(2vλl)1/4 cosh ζ(z) 1/2 , ζ(z) = arccosh 27 z 2 1 16(2vλl)3/2 Proof Since the proximal regularization Qp,q( ) := λ q p + 1 2v z 2 2 is non-differentiable only at 0, Pp,q(z) must be 0 or some point x( = 0) satisfying the first-order optimality condition | x1|p 1sign( x1) ... | xl|p 1sign( xl) v( x z) = 0. (65) Thus, to derive the analytical formula of the proximal operator Pp,q(z), we just need to calculate such x via (65), and then compare the objective function values Qp,q( x) and Qp,q(0) to obtain the solution inducing a smaller value. The proofs of the six statements follow in the above routine, and we only provide the detailed proofs of (iii) and (v) as samples. (iii) When p = 2 and q = 1/2, (65) reduces to 2 x 3/2 2 + 1 v( x z) = 0, (66) Group Sparse Optimization and consequently, x 3/2 2 z 2 x 1/2 2 + 1 2vλ = 0. (67) Denote η = x 1/2 2 > 0. The equation (67) can be transformed into the following cubic algebraic equation η3 z 2η + 1 2vλ = 0. (68) Due to the hyperbolic solution of the cubic equation (see Short, 1937), by denoting 3 , α = arccos and β = arccosh the solution of (68) can be expressed as the follows. (1) If 0 z 2 3 vλ 4 2/3, then the three roots of (68) are given by η1 = r cosh β 3r 2 sinh β 3r 2 sinh β where i denotes the imaginary unit. However, this β does not exist since the value of hyperbolic cosine must be positive. Thus, in this case, P2,1/2(z) = 0. (2) If z 2 > 3 vλ 4 2/3, then the three roots of (68) are η1 = r cos π , η2 = r sin π , η3 = r cos 2π The unique positive solution of (68) is x 1/2 2 = η1, and thus, the unique solution of (66) is given by x = 2η3 1 vλ + 2η3 1 z = 16 z 3/2 2 cos3 π 3 ψ(z) 3vλ + 16 z 3/2 2 cos3 π 3 ψ(z) Finally, we compare the objective function values Q2,1/2( x) and Q2,1/2(0). For this purpose, when z 2 > 3 vλ 4 2/3, we define H( z 2) := v x 2 Q2,1/2(0) Q2,1/2( x) 1 2v z 2 2 λ x 1/2 2 1 = z 2 x 2 2+2vλ x 1/2 2 2 x 2 = 1 4vλ x 1/2 2 , where the third equality holds since that x is proportional to z, and fourth equality follows from (67). Since both z 2 and x 2 are strictly increasing on z 2, H( z 2) is also strictly increasing when z 2 > 3 vλ 4 2/3. 
Thus the unique solution of H( z 2) = 0 satisfies z 2 x 1/2 2 = 3 Hu, Li, Meng, Qin, and Yang and further, (67) implies that the solution of H( z 2) = 0 is Therefore, we arrive at the formulae (61) and (62). (v) When p = 2 and q = 2/3, (65) reduces to 3 x 4/3 2 + 1 v( x z) = 0, (69) and consequently, x 4/3 2 z 2 x 1/3 2 + 2 3vλ = 0. (70) Denote η = x 1/3 2 > 0 and h(t) = t4 z 2t + 2 3vλ for any t R. Thus, η is the positive solution of h(t) = 0. Next, we seek η by the method of undetermined coefficients. Assume that h(t) = t4 z 2t + 2 3vλ = (t2 + at + b)(t2 + ct + d), where a, b, c, d R. (71) By expansion and comparison, we have that a + c = 0, b + d + ac = 0, ad + bc = z 2, bd = 2 c = a, b = 1 a4 z 2 2 a2 By letting M = a2, the last one of the above equalities reduces to the following cubic algebraic equation 3vλM z 2 2 = 0. (73) According to the Cardano formula for the cubic equation, the root of (73) can be represented by which can also be reformulated in the following hyperbolic form (see Short, 1937) 2vλ cosh ϕ(z) where ϕ(z) is given by (64). By (71) and (72), we have that η, the positive root of h(t) = 0, satisfies η2 + aη + 1 = 0 or η2 aη + 1 Group Sparse Optimization Hence, the real roots of the above equations, that is, the real roots of h(t) = 0, are It is easy to see that η1 > η2 and that η2 should be discarded as it induces the saddle point rather than a minimum (since h(t) > 0 when t < η2). Thus, by (69), (74) and (75), one has x = 3η4 1 2vλ + 3η4 1 z = 3 a3/2 + p 32vλa2 + 3 a3/2 + p 2 z 2 a3 z, where a is given by (64). Finally, we compare the objective function values Q2,2/3( x) and Q2,2/3(0). For this purpose, we define H( z 2) := v x 2 Q2,2/3(0) Q2,2/3( x) 1 2v z 2 2 λ x 2/3 2 1 = z 2 x 2 2+2vλ x 2/3 2 2 x 2 = 1 3vλ x 1/3 2 , where the third equality holds since that x is proportional to z, and fourth equality follows from (70). Since both z 2 and x 2 are strictly increasing on z 2, H( z 2) is also strictly increasing when z 2 > 4(2 9vλ)3/4. Thus the unique solution of H( z 2) = 0 satisfies z 2 x 1/3 2 = 4 and further, (70) implies that the solution of H( z 2) = 0 is Therefore, we arrive at the formulae (63) and (64). The proof is complete. Remark 19 Note from Proposition 18 that the solutions of the proximal optimization subproblems might not be unique when Qp,q( x) = Qp,q(0). To avoid this obstacle in numerical computations, we select the solution Pp,q(z) = 0 whenever Qp,q( x) = Qp,q(0), which achieves a more sparse solution, in the definition of the proximal operator to guarantee a unique update. Remark 20 By Proposition 18, one sees that the PGM-GSO meets the group sparsity structure, since the components of each iterate within each group are likely to be either all zeros or all nonzeros. When nmax = 1, the data do not form any group structure in the feature space, and the sparsity is achieved only on the individual feature level. In this case, the proximal operators P2,1(z), P2,0(z), and P2,1/2(z) and P1,1/2(z) reduce to the soft thresholding function in Daubechies et al. (2004), the hard thresholding function in Blumensath and Davies (2008) and the half thresholding function in Xu et al. (2012), respectively. Hu, Li, Meng, Qin, and Yang Remark 21 Proposition 18 presents the analytical solution of the proximal optimization subproblems (60) when q = 0, 1/2, 2/3, 1. 
However, in other cases the analytical solution of subproblem (60) seems unavailable, since the algebraic equation (65) does not admit a closed-form solution (closed-form solutions are in general out of reach for algebraic equations of degree larger than four). Thus, for general q in (0, 1), we alternatively use the Newton method to solve the nonlinear equation (65), which is the optimality condition of the proximal optimization subproblem. The numerical simulation in Figure 5 of Section 4 shows that the Newton method works for solving the proximal optimization subproblems (60) for general q, while the ℓp,1/2 regularization is the best one among the ℓp,q regularizations for q in [0, 1].

4. Numerical Experiments

The purpose of this section is to carry out numerical experiments on the proposed PGM for the ℓp,q regularization problem. We illustrate the performance of the PGM-GSO for different types of ℓp,q regularization, in particular (p, q) = (2,1), (2,0), (2,1/2), (1,1/2), (2,2/3) and (1,2/3), by comparing them with several state-of-the-art algorithms on simulated data and by applying them to infer a gene regulatory network from gene expression data of mouse embryonic stem cells. All numerical experiments are implemented in Matlab R2013b and executed on a personal desktop (Intel Core Duo E8500, 3.16 GHz, 4.00 GB of RAM). The R package of the PGM for solving group sparse optimization, named GSparO, is available on CRAN at https://CRAN.R-project.org/package=GSparO

4.1 Simulated Data

In the numerical experiments on simulated data, the data are generated as follows. We first randomly generate an i.i.d. Gaussian ensemble A in R^{m x n} satisfying AA⊤ = I (that is, with orthonormalized rows). Then we generate a group sparse solution x̄ in R^n by randomly splitting its components into r groups and randomly picking k of them as active groups, whose entries are also i.i.d. Gaussian, while the remaining groups are set to zero. We generate the data b by the Matlab script b = A*x + sigma*randn(m,1), where sigma is the standard deviation of the additive Gaussian noise. The problem size is set to n = 1024 and m = 256, and we test on noisy measurement data with sigma = 0.1%. Assuming the group sparsity level S is predefined, the regularization parameter λ is iteratively updated by the following rule: we set the iterative threshold to be the S-th largest value of ‖z^k_{Gi}‖₂ and solve for λ by virtue of Proposition 18. For each given sparsity level, which is k/r, we randomly generate the data A, x̄, b (as above) 500 times, run the algorithm, and average the 500 numerical results to illustrate the performance of the algorithm. We choose the stepsize v = 1/2 in the tests of the PGM-GSO. Two key criteria used to characterize the performance are the relative error ‖x − x̄‖₂/‖x̄‖₂ between the recovered solution x and the true solution x̄, and the successful recovery rate, where a recovery is counted as a success when this relative error is smaller than 0.5%, and as a failure otherwise.

Figure 3: Convergence results and recovery rates for different sparsity levels. Panels (a)-(c) plot the relative error against the iterations at sparsity levels 1%, 5% and 10% for ℓ2,1, ℓ2,0, ℓ2,1/2, ℓ1,1/2, ℓ2,2/3 and ℓ1,2/3; panel (d) plots the successful recovery rate against the sparsity level.
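For illustration, a minimal Python/NumPy sketch of this setup is given below. It is not part of the GSparO package: the reduced problem sizes, the fixed value of λ (instead of the sparsity-level-based update described above, a version of which is sketched at the end of this subsection), and the function names are our own assumptions. The proximal step implements the group soft-thresholding operator for (p, q) = (2, 1), that is, the operator P2,1 of Proposition 18(i).

```python
import numpy as np

def prox_group_l21(z, tau):
    """Group soft-thresholding: minimizer of tau*||x||_2 + (1/2)*||x - z||_2^2."""
    nrm = np.linalg.norm(z)
    return np.zeros_like(z) if nrm <= tau else (1.0 - tau / nrm) * z

def pgm_gso_l21(A, b, groups, lam, v, iters=500):
    """PGM-GSO for min ||Ax - b||_2^2 + lam * sum_i ||x_Gi||_2 with stepsize v."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - 2.0 * v * A.T @ (A @ x - b)      # gradient step on ||Ax - b||_2^2
        for g in groups:                          # proximal step, separable over groups
            x[g] = prox_group_l21(z[g], v * lam)
    return x

# simulated group sparse data in the spirit of Section 4.1 (sizes reduced for illustration)
rng = np.random.default_rng(0)
m, n, r, k = 128, 256, 32, 3                      # r groups, k of them active
A = rng.standard_normal((m, n)) / np.sqrt(m)
groups = np.array_split(np.arange(n), r)
x_true = np.zeros(n)
for g in rng.choice(r, size=k, replace=False):
    x_true[groups[g]] = rng.standard_normal(len(groups[g]))
b = A @ x_true + 0.001 * rng.standard_normal(m)

v = 0.49 / np.linalg.norm(A, 2) ** 2              # stepsize below 1/(2*||A||_2^2)
x_hat = pgm_gso_l21(A, b, groups, lam=0.05, v=v)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Replacing prox_group_l21 by the operators P2,0, P2,1/2 or P2,2/3 of Proposition 18 would yield the other variants tested below.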
We carry out six experiments, with the initial point x0 = 0 unless otherwise specified. In the first experiment, setting r = 128 (so the group size is G = 1024/128 = 8), we compare the convergence rates and the successful recovery rates of the PGM-GSO with (p, q) = (2, 1), (2, 0), (2, 1/2), (1, 1/2), (2, 2/3) and (1, 2/3) at different sparsity levels. In Figure 3, panels (a), (b) and (c) illustrate the convergence results at sparsity levels 1%, 5% and 10%, respectively, while panel (d) plots the successful recovery rates at different sparsity levels. When the solution is highly sparse, as shown in Figure 3(a), all ℓp,q regularization problems perform perfectly and achieve a fast convergence rate. As demonstrated in Figure 3(b), when the sparsity level drops to 5%, ℓp,1/2 and ℓp,2/3 (p = 1 and 2) perform better and reach a more accurate level than ℓ2,1 and ℓ2,0. As illustrated in Figure 3(c), when the sparsity level is 10%, ℓp,1/2 further outperforms ℓp,2/3 (p = 1 or 2), and, somewhat surprisingly, ℓ2,q performs better than ℓ1,q (q = 1/2 or 2/3). Figure 3(d) shows that ℓp,1/2 achieves a better successful recovery rate than ℓp,2/3 (p = 1 or 2), which in turn outperforms ℓ2,0 and ℓ2,1. Moreover, we again observe that ℓ2,q outperforms ℓ1,q (q = 1/2 or 2/3) in terms of the successful recovery rate. In short, ℓ2,1/2 performs best among these six regularizations in both accuracy and robustness. In this experiment, we also note that the running times are at the same level, about 0.9 seconds per 500 iterations.

The second experiment is a sensitivity analysis on the group size (G = 4, 8, 16, 32) for the PGM-GSO with the six types of ℓp,q regularization. As shown in Figure 4, all six types of ℓp,q regularization reach a higher successful recovery rate for larger group sizes. We also note that the larger the group size, the shorter the running time.

Figure 4: Sensitivity analysis on group size (successful recovery rate versus sparsity level for G = 4, 8, 16, 32, one panel for each of the six regularizations).

The third experiment studies the behavior of the PGM-GSO when the regularization order q varies (with p = 2 fixed). Recall from Proposition 18 that the analytical solution of the proximal optimization subproblem (60) is available when q = 0, 1/2, 2/3, 1. In the other cases, the analytical solution of subproblem (60) seems unavailable, and thus we apply the Newton method to solve the nonlinear equation (65), which is the optimality condition of the proximal optimization subproblem. Figure 5 shows the variation of the successful recovery rates as the regularization order q decreases from 1 to 0. It is illustrated that the PGM-GSO achieves the best successful recovery rate when q = 1/2, which agrees with the conclusion of the first experiment. The farther q is from 1/2, the lower the successful recovery rate.
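To make the Newton-based proximal step concrete, the following sketch (our own illustration; the paper does not spell out its Newton implementation, and the helper name prox_group_l2q is hypothetical) treats the case p = 2 with general q in (0, 1). Since the minimizer of subproblem (60) is then collinear with z^k_{Gi}, the problem reduces to a scalar equation, namely the form that (65) takes in this case, which is solved by Newton's method started at t = ‖z‖₂; the candidate is kept only if it improves on the zero solution, as in Proposition 18.

```python
import numpy as np

def prox_group_l2q(z, lam, v, q, tol=1e-10, max_iter=50):
    """Approximate minimizer of lam*||x||_2^q + (1/(2v))*||x - z||_2^2 for 0 < q < 1.
    The minimizer has the form t*z/||z||_2 with t >= 0 solving
        g(t) = lam*q*t**(q-1) + (t - ||z||_2)/v = 0,
    which is solved here by Newton's method started at t = ||z||_2."""
    s = float(np.linalg.norm(z))
    if s == 0.0:
        return np.zeros_like(z)
    t = s
    for _ in range(max_iter):
        g = lam * q * t ** (q - 1.0) + (t - s) / v
        dg = lam * q * (q - 1.0) * t ** (q - 2.0) + 1.0 / v
        t_new = t - g / dg
        if t_new <= 0.0:                      # no useful interior stationary point
            return np.zeros_like(z)
        if abs(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    # keep the candidate only if it beats the zero solution (cf. Proposition 18)
    if lam * t ** q + (t - s) ** 2 / (2.0 * v) < s ** 2 / (2.0 * v):
        return (t / s) * z
    return np.zeros_like(z)

# small check: for q = 1/2 the operator should behave as a (group) half thresholding
z = np.array([0.8, -0.6])                     # ||z||_2 = 1
print(prox_group_l2q(z, lam=0.05, v=0.5, q=0.5))
```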
The fourth experiment compares the PGM-GSO with several state-of-the-art algorithms for sparse optimization, both convex and nonconvex, as listed in Table 1. All these algorithms, including the PGM-GSO, can successfully recover the signal when the solution is highly sparse. However, some of them fail to capture the group sparsity structure as the group sparsity level decreases. Figure 7 plots the signals estimated by these algorithms in a random trial at a group sparsity level of 15%. It is illustrated that the solutions of MultiFoBa and of the PGM-GSO type solvers exhibit the group sparsity structure, while the other algorithms do not produce group sparse solutions.

Figure 5: Variation of the PGM-GSO when varying the regularization order q (successful recovery rate versus sparsity level for q = 0, 0.1, 0.3, 0.5, 0.7, 0.9, 1).

Figure 6: Comparison between the PGM-GSO and several state-of-the-art algorithms (successful recovery rate versus sparsity level for MultiFoBa, LqRecovery and PGM(ℓ2,1/2), among others).

Among these solvers, MultiFoBa and ℓ2,1 obtain the true active groups but inaccurate weights, ℓ2,0 achieves accurate weights but some incorrect active groups, while ℓp,1/2 and ℓp,2/3 recover essentially perfect solutions in terms of both the true active groups and accurate weights. Figure 6 demonstrates the overall performance of these algorithms by plotting the successful recovery rates at different sparsity levels. It indicates that ℓ2,1/2 achieves a higher successful recovery rate than the other algorithms by exploiting both the group sparsity structure and the lower-order regularization. This experiment demonstrates that the PGM-GSO (ℓ2,1/2) outperforms most sparse learning solvers on sparse optimization problems with group structure.

The fifth experiment is devoted to a phase diagram study (see Donoho, 2006b) of the ℓp,q regularization problem, which further demonstrates the stronger sparsity promoting capability of the ℓp,1/2 and ℓp,2/3 (p = 1, 2) regularizations over the ℓ2,1 regularization. In this experiment, we consider a noiseless signal recovery problem with n = 512, G = 4, r = 128 and sigma = 0 as a prototype. More specifically, for each fixed m, we vary the number of active groups k from 1 to m/G, that is, the sparsity level varies from 1/r to m/(Gr); we then increase m from G to n in such a way that the 128 equidistributed values mj = jG are considered. For each specific problem size, we randomly generate the data 500 times and apply the PGM-GSO to solve the ℓp,q regularization problem.

SL0: an algorithm for finding the sparsest solutions of an underdetermined system of linear equations based on the smoothed ℓ0 norm (Mohimani et al., 2009). The package is available at http://ee.sharif.edu/~SLzero/
SPGL1: a Matlab solver for large-scale sparse reconstruction (van den Berg and Friedlander, 2008). The package is available at http://www.cs.ubc.ca/~mpf/spgl1/
YALL1: YALL1 (Your ALgorithm for L1) is a package of Matlab solvers for ℓ1 sparse reconstruction based on the alternating direction method (Deng et al., 2011; Yang and Zhang, 2011). The package is available at http://yall1.blogs.rice.edu/
OMP: the Orthogonal Matching Pursuit algorithm for the recovery of a high-dimensional sparse signal (Cai and Wang, 2011). The package is available at MathWorks.
CoSaMP: Compressive Sampling Matching Pursuit algorithms for the recovery of a high-dimensional sparse signal (Needell and Tropp, 2009). The packages are available at MathWorks.
FoBa: an adaptive forward-backward greedy algorithm for sparse learning (Zhang, 2011).
The R package is available at https://CRAN.R-project.org/package=foba
MultiFoBa: group FoBa for multi-task learning (Tian et al., 2016).
ℓ1-Magic: a collection of Matlab routines for solving the convex optimization programs central to compressive sampling, based on standard interior-point methods (Candès et al., 2006a). The package is available at https://statweb.stanford.edu/~candes/l1magic/
ISTA: the Iterative Shrinkage/Thresholding Algorithm (Daubechies et al., 2004) for solving the ℓ1 regularization problem.
GBM: a Gradient Based Method for solving the ℓ1/2 regularization problem (Wu et al., 2014). As suggested by the authors, we choose the initial point to be the solution obtained by ℓ1-Magic.
LqRecovery: an iterative algorithm for ℓq norm minimization (Foucart and Lai, 2009). The code is available at http://www.math.drexel.edu/~foucart/software.htm
HardTA: the Iterative Hard Thresholding Algorithm (Blumensath and Davies, 2008) for solving the ℓ0 regularization problem.
HalfTA: the Iterative Half Thresholding Algorithm (Xu et al., 2012) for solving the ℓ1/2 regularization problem.

Table 1: List of the state-of-the-art algorithms for sparse learning.

Figure 7: Signals estimated by the PGM-GSO and several state-of-the-art algorithms in a random trial (relative errors: SL0 8.8e-1, ℓ1-Magic 8.5e-1, FoBa 1.0, ISTA 8.3e-1, PGM(ℓ2,1) 3.9e-1, PGM(ℓ1,1/2) 4.6e-3, SPGL1 8.2e-1, OMP 1.3, MultiFoBa 2.9e-1, HardTA 9.8e-1, PGM(ℓ2,0) 3.8e-1, PGM(ℓ2,2/3) 4.7e-3, YALL1 8.5e-1, CoSaMP 1.3, LqRecovery 9.2e-1, HalfTA 9.5e-1, PGM(ℓ2,1/2) 4.6e-3, PGM(ℓ1,2/3) 4.9e-3).

For these noiseless data, a recovery is counted as a success whenever the relative error between the recovered data and the true data is smaller than 10^{-5}, and as a failure otherwise. Moreover, we mark a pixel blue in the case of success and red in the case of failure. In this way, a phase diagram of an algorithm is plotted in Figure 8, where the color of each cell reflects the empirical recovery rate (scaled between 0 and 1). Figure 8 illustrates that the phase transition phenomenon does appear for the PGM-GSO with the six types of ℓp,q regularization. It shows that the ℓp,1/2 and ℓp,2/3 (p = 1, 2) regularizations are more robust than ℓ2,0 and ℓ2,1 in the sense that they achieve higher recovery rates and recover a sparse signal from a smaller number of samples. Figure 8 also reveals that the ℓ2,1/2 regularization is the most robust of these six regularizations, which matches the conclusions of the first and third experiments.

Figure 8: Phase diagram study of the ℓp,q regularization for group sparse optimization. Blue denotes perfect recovery in all experiments, and red denotes failure in all experiments.

Even though a global optimization method, such as the filled function method (Ge, 1990), can find the global solution of the lower-order regularization problem as in Example 2, it does not work for large-scale sparse optimization problems: in the filled function method, all directions need to be searched or compared in each iteration, which costs a large amount of time and hampers the efficiency for solving large-scale problems.
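The experiments above all fix a target group sparsity level S and update λ from the S-th largest value of ‖z^k_{Gi}‖₂, as described at the beginning of this subsection. A minimal sketch of how such an update might be computed for (p, q) = (2, 1) and (2, 0) is given below; the helper name is hypothetical, and the formulas simply invert the thresholds ‖z‖₂ > vλ and ‖z‖₂ > sqrt(2vλ) suggested by Proposition 18(i) and (ii).

```python
import numpy as np

def lambda_from_sparsity(z, groups, S, v, pq=(2, 1)):
    """Pick lambda so that the proximal step keeps roughly S groups of z.
    With T the S-th largest group norm, Proposition 18 suggests
      (p, q) = (2, 1): a group survives iff ||z_Gi||_2 > v*lambda        -> lambda = T/v
      (p, q) = (2, 0): a group survives iff ||z_Gi||_2 > sqrt(2*v*lambda) -> lambda = T^2/(2v)."""
    norms = np.sort([np.linalg.norm(z[g]) for g in groups])[::-1]
    T = norms[S - 1]
    if pq == (2, 1):
        return T / v
    if pq == (2, 0):
        return T ** 2 / (2.0 * v)
    raise ValueError("only (2, 1) and (2, 0) are handled in this sketch")

# example: 8 groups of size 4, aiming to keep about S = 2 groups
rng = np.random.default_rng(1)
z = rng.standard_normal(32)
groups = np.array_split(np.arange(32), 8)
print(lambda_from_sparsity(z, groups, S=2, v=0.5))
```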
4.2 Real Data in Gene Transcriptional Regulation

Gene transcriptional regulation is the process by which a combination of transcription factors (TFs) acts in concert to control the transcription of target genes. Inferring gene regulatory networks from high-throughput genome-wide data remains a major challenge in systems biology, especially when the number of genes is large but the number of experimental samples is small. In large genomes, such as human and mouse, the complexity of the gene regulatory system increases dramatically. Thousands of TFs combine in different ways to regulate tens of thousands of target genes in various tissues or biological processes. However, only a few TFs collaborate, usually forming complexes (groups of cooperative TFs), to control the expression of a specific gene in a specific cell type or developmental stage. Thus, the prevalence of TF complexes gives the solution of the gene regulatory network a group structure, and gene regulatory network inference in such large genomes becomes a group sparse optimization problem: for each target gene, search for a small number of TF complexes (or TFs) from a pool of thousands of TF complexes (or TFs), based on the dependencies between the expression of the TF complexes (or TFs) and the targets. Even though TFs often work in the form of complexes (Xie et al., 2013), and TF complexes are very important in the control of cell identity and disease (Hnisz et al., 2013), current methods for inferring gene regulatory networks usually consider each TF separately. To take the grouping information of TF complexes into consideration, we can apply group sparse optimization to gene regulatory network inference with the prior knowledge of TF complexes as the pre-defined grouping.
Indeed, if TF i has a binding site around the gene j promoter within a defined distance (10 kbp), a non-zero number is assigned on Z0 ij as a prior value. Figure 9: Workflow of gene regulatory network inference with ℓp,q regularization. Hu, Li, Meng, Qin, and Yang Now we add the grouping information (TF complexes) into this linear system. The TF complexes are inferred from Ch IP-seq data (Table 2) via the method described in Giannopoulou and Elemento (2013). Let the group structure of Z be a matrix W R2257 939 (actually, the number of groups is 1439), whose Moore-Penrose pseudoinverse is denoted by W + (Horn and Johnson., 1985). We further let A := HW + and X := WZ. Then the linear system can be converted into AX = B + ϵ, where A denotes expression profiles of TF complexes, and X represents connections between TF complexes and targets (Figure 9A). Factor GEO accession Pubmed ID Factor GEO accession Pubmed ID Atf7ip GSE26680 - Rad21 GSE24029 21589869 Atrx GSE22162 21029860 Rbbp5 GSE22934 21477851 Cdx2 GSE16375 19796622 Rcor1 GSE27844 22297846 Chd4 GSE27844 22297846 Rest GSE26680 - Ctcf GSE11431 18555785 Rest GSE27844 22297846 Ctcf GSE28247 21685913 Rnf2 GSE13084 18974828 Ctr9 GSE20530 20434984 Rnf2 GSE26680 - Dpy30 GSE26136 21335234 Rnf2 GSE34518 22305566 E2f1 GSE11431 18555785 Setdb1 GSE17642 19884257 Ep300 GSE11431 18555785 Smad1 GSE11431 18555785 Ep300 GSE28247 21685913 Smad2 GSE23581 21731500 Esrrb GSE11431 18555785 Smarca4 GSE14344 19279218 Ezh2 GSE13084 18974828 Smc1a GSE22562 20720539 Ezh2 GSE18776 20064375 Smc3 GSE22562 20720539 Jarid2 GSE18776 20064375 Sox2 GSE11431 18555785 Jarid2 GSE19365 20075857 Stat3 GSE11431 18555785 Kdm1a GSE27844 22297846 Supt5h GSE20530 20434984 Kdm5a GSE18776 20064375 Suz12 GSE11431 18555785 Klf4 GSE11431 18555785 Suz12 GSE13084 18974828 Lmnb1 GSE28247 21685913 Suz12 GSE18776 20064375 Med1 GSE22562 20720539 Suz12 GSE19365 20075857 Med12 GSE22562 20720539 Taf1 GSE30959 21884934 Myc GSE11431 18555785 Taf3 GSE30959 21884934 Mycn GSE11431 18555785 Tbp GSE30959 21884934 Nanog GSE11431 18555785 Tbx3 GSE19219 20139965 Nipbl GSE22562 20720539 Tcfcp2l1 GSE11431 18555785 Nr5a2 GSE19019 20096661 Tet1 GSE26832 21451524 Pou5f1 GSE11431 18555785 Wdr5 GSE22934 21477851 Pou5f1 GSE22934 21477851 Whsc2 GSE20530 20434984 Prdm14 GSE25409 21183938 Zfx GSE11431 18555785 Table 2: Ch IP-seq data for TF complex inference. A literature-based golden standard (low-throughput golden standard) TF-target pair set from biological studies (Figure 9C), including 97 TF-target interactions between 23 TFs and 48 target genes, is downloaded from i Sc Mi D (Integrated Stem Cell Molecular Interactions Database). Each TF-target pair in this golden standard data set has been verified by biological experiments. Another more comprehensive golden standard m ESC network is constructed from high-throughput data (high-throughput golden standard) by Ch IP-Array (Qin et al., 2011) using the methods described in Qin et al. (2014). It contains 40006 TF-target pairs between 13092 TFs or targets (Figure 9C). Basically, each TF-target pair in the network is evidenced by a cell-type specific binding site of the TF on the target s promoter and the expression change of the target in the perturbation experiment of the TF, Group Sparse Optimization 0 0.2 0.4 0.6 0.8 1 0 True positive rate False positive rate CQA (AUC:0.730) ℓ2,1 (AUC:0.746) ℓ2,0 (AUC:0.914) ℓ2,1/2 (AUC:0.911) ℓ1,1/2 (AUC:0.915) (a) Evaluation with high-throughput golden standard. 
(b) Evaluation with the literature-based low-throughput golden standard: CQA (AUC 0.722), ℓ2,1 (0.653), ℓ2,0 (0.803), ℓ2,1/2 (0.736), ℓ1,1/2 (0.837).

Figure 10: ROC curves and AUCs of the PGM-GSO on mESC gene regulatory network inference.

which is generally accepted as a true TF-target regulation. These two independent golden standards are both used to validate the accuracy of the inferred gene regulatory networks.

4.2.2 Numerical Results

We apply the PGM-GSO, starting from the initial matrix X0 := WZ0, to the gene regulatory network inference problem (Figure 9B), and compare it with the CQ algorithm (CQA), which is shown in Wang et al. (2017) to be an efficient solver for gene regulatory network inference based on the group structure of TF complexes. The area under the curve (AUC) of a receiver operating characteristic (ROC) curve is widely recognized as an important index of the overall classification performance of an algorithm (see Fawcett, 2006). Here, we use the AUC to evaluate the performance of the PGM-GSO with four types of ℓp,q regularization, (p, q) = (2, 1), (2, 0), (2, 1/2) and (1, 1/2), as well as the CQA. A series of numbers of predictive TF complexes (or TFs), denoted by k, from 1 to 100 (that is, the sparsity level varies from about 0.07% to 7%) is tested. For each k and each pair of TF complex (or TF) i and target j, if X^(k)_{Gi j} is nonzero, this TF complex (or TF) is regarded as a potential regulator of this target in this test. In the biological sense, we are only concerned with whether the true TF is predicted, not with the weight of this TF. We also expect the TF complexes (or TFs) that are predicted at a higher sparsity level to be more important than those that are only reported at a lower sparsity level. Thus, when calculating the AUC, a score Score_ij is used as the predictor for TF i on target j: Score_ij = max{1/k : X^(k)_{Gi j} ≠ 0} when TF complex (or TF) i is selected for target j at some k, and Score_ij = 0 otherwise. Both the high-throughput and the low-throughput golden standards are used to draw the ROC curves of the PGM-GSO with the four types of ℓp,q regularization and the CQA in Figure 10, so as to compare their accuracy. When matched with the high-throughput golden standard, Figure 10(a) shows that ℓ2,1/2, ℓ1,1/2 and ℓ2,0 perform almost the same (as indicated by nearly identical AUC values) and significantly outperform ℓ2,1 and the CQA. With the low-throughput golden standard, Figure 10(b) shows that ℓ1,1/2 is slightly better than ℓ2,1/2, ℓ2,0 and the CQA, and these three regularizations perform much better than ℓ2,1. These results are basically consistent with the results on simulated data. Since the golden standards used here are obtained from real biological experiments, which are well accepted as true TF-target regulations, the higher the AUC, the more biologically accurate the inferred gene regulatory network. Thus, our results indicate that the ℓp,1/2 and ℓp,0 regularizations are applicable to gene regulatory network inference in biological research that studies higher organisms but generates transcriptome data for only a small number of samples, which helps biologists analyze gene regulation at a systems level.

Acknowledgments

We are grateful to the anonymous reviewers for their valuable suggestions and remarks, which helped to improve the quality of the paper. We are also thankful to Professor Marc Teboulle for providing the reference Bolte et al.
(2013) and the suggestion that the global convergence of the PGM-GSO can be established by using the so-called Kurdyka Lojasiewicz theory, and Professor Junwen Wang for providing the gene expression data and biological golden standards for inferring gene regulatory network of mouse embryonic stem cell. Hu s work was supported in part by the National Natural Science Foundation of China (11601343), Natural Science Foundation of Guangdong (2016A030310038) and Foundation for Distinguished Young Talents in Higher Education of Guangdong (2015KQNCX145). Li s work was supported in part by the National Natural Science Foundation of China (11571308, 11371325, 91432302). Meng s work was supported in part by the National Natural Science Foundation of China (11671329, 31601066, 71601162). Qin s work was supported in part by the National Natural Science Foundation of China (41606143). Yang s work was supported in part by the Research Grants Council of Hong Kong (Poly U 152167/15E) and the National Natural Science Foundation of China (11431004). H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka Lojasiewicz inequality. Mathematics of Operations Research, 35(2):438 457, 2010. F. R. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179 1225, 2008. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009. D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Cambridge, 1999. Group Sparse Optimization P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705 1732, 2009. T. Blumensath and M. E. Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14:629 654, 2008. J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459 494, 2013. K. Bredies and D. A. Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5):813 837, 2008. F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169 194, 2007. T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57:4680 4688, 2011. E. J. Cand es, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489 509, 2006a. E. J. Cand es and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203 4215, 2005. E. J. Cand es, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207 1223, 2006b. R. Chartrand. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10):707 710, 2007. R. Chartrand and V. Staneva. Restricted isometry properties and nonconvex compressive sensing. Inverse Problems, 24:1 14, 2008. S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43:129 159, 2001. X. Chen, F. Xu, and Y. Ye. Lower bound theory of nonzero entries in solutions of ℓ2-ℓp minimization. 
SIAM Journal on Scientific Computing, 32(5):2832 2852, 2010. X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1):71 99, 2012. P. Combettes and V. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168 1200, 2005. L. S. Correa-Cerro, Y. Piao, A. A. Sharov, A. Nishiyama, J. S. Cadet, H. Yu, L. V. Sharova, L. Xin, H. G. Hoang, M. Thomas, Y. Qian, D. B. Dudekula, E. Meyers, B. Y. Binder, G. Mowrer, U. Bassey, D. L. Longo, D. Schlessinger, and M. S. Ko. Generation of mouse ES cell lines engineered for the forced induction of transcription factors. Scientific Reports, 1:167, 2011. Hu, Li, Meng, Qin, and Yang I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413 1457, 2004. W. Deng, W. Yin, and Y. Zhang. Group sparse optimization by alternating direction method. Technical report, Rice University, 2011. D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(8): 1289 1306, 2006a. D. L. Donoho. High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete and Computational Geometry, 35(4):617 652, 2006b. J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348 1360, 2001. T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861 874, 2006. S. Foucart and M.-J. Lai. Sparsest solutions of underdetermined linear systems via ℓqminimization for 0 < q 1. Applied and Computational Harmonic Analysis, 26(3): 395 407, 2009. P. Frankel, G. Garrigos, and J. Peypouquet. Splitting methods with variable metric for Kurdyka Lojasiewicz functions and general convergence rates. Journal of Optimization Theory and Applications, 165(3):874 900, 2015. D. Ge, X. Jiang, and Y. Ye. A note on complexity of Lp minimization. Mathematical Programming, 129:285 299, 2011. R. Ge. A filled function method for finding a global minimizer of a function of several variables. Mathematical Programming, 46(1):191 204, 1990. E. G. Giannopoulou and O. Elemento. Inferring chromatin-bound protein complexes from genome-wide binding assays. Genome Research, 23(8):1295 1306, 2013. P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. International Conference on Machine Learning (ICML), 2013. E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107 1130, 2008. D. Hnisz, B. J. Abraham, T. I. Lee, A. Lau, V. Saint-Andre, A. A. Sigova, H. A. Hoke, and R. A. Young. Super-enhancers in the control of cell identity and disease. Cell, 155(4): 934 947, 2013. R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, 1985. Group Sparse Optimization Y. Hu, C. Li, and X. Yang. On convergence rates of linearized proximal algorithms for convex composite optimization with applications. SIAM Journal on Optimization, 26(2): 1207 1235, 2016. X. Huang and X. Yang. A unified augmented Lagrangian approach to duality and exact penalization. Mathematics of Operations Research, 28(3):533 552, 2003. M.-J. Lai and J. Wang. An unconstrained ℓq minimization with 0 < q 1 for sparse solution of underdetermined linear systems. 
SIAM Journal on Optimization, 21(1):82 101, 2011. M.-J. Lai, Y. Xu, and W. Yin. Improved iteratively reweighted least squares for unconstrained smoothed ℓq minimization. SIAM Journal on Numerical Analysis, 51(2):927 957, 2013. G. Li and T. K. Pong. Douglas Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Mathematical Programming, 159(1):1 31, 2015a. G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434 2460, 2015b. G. Li and T. K. Pong. Calculus of the exponent of Kurdyka Lojasiewicz inequality and its applications to linear convergence of first-order methods. https://ar Xiv:1602.02915, 2016. Z. Lu. Iterative reweighted minimization methods for lp regularized unconstrained nonlinear programming. Mathematical Programming, 147(1):277 307, 2014. Z.-Q. Luo, J. S. Pang, and D. Ralph. Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge, 1996. J. Mairal. Optimization with first-order surrogate functions. International Conference on Machine Learning (ICML), 2013. L. Meier, S. A. van de Geer, and P. B uhlmann. The group Lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70:53 71, 2008. N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for highdimensional data. Annals of Statistics, 37:246 270, 2009. H. Mohimani, M. Babaie-Zadeh, and C. Jutten. A fast approach for overcomplete sparse decomposition based on smoothed l0 norm. IEEE Transactions on Signal Processing, 57 (1):289 301, 2009. B. S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory. Springer, Berlin, 2006. B. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227 234, 1995. Hu, Li, Meng, Qin, and Yang D. Needell and J. A. Tropp. Co Sa MP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301 321, 2009. Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341 362, 2012. Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125 161, 2013. A. Nishiyama, L. Xin, A. A. Sharov, M. Thomas, G. Mowrer, E. Meyers, Y. Piao, S. Mehta, S. Yee, Y. Nakatake, C. Stagg, L. Sharova, L. S. Correa-Cerro, U. Bassey, H. Hoang, E. Kim, R. Tapnio, Y. Qian, D. Dudekula, M. Zalzman, M. Li, G. Falco, H. T. Yang, S. L. Lee, M. Monti, I. Stanghellini, M. N. Islam, R. Nagaraja, I. Goldberg, W. Wang, D. L. Longo, D. Schlessinger, and M. S. Ko. Uncovering early response of gene regulatory networks in ESCs by systematic induction of transcription factors. Cell Stem Cell, 5: 420 433, 2009. A. Nishiyama, A. A. Sharov, Y. Piao, M. Amano, T. Amano, H. G. Hoang, B. Y. Binder, R. Tapnio, U. Bassey, J. N. Malinou, L. S. Correa-Cerro, H. Yu, L. Xin, E. Meyers, M. Zalzman, Y. Nakatake, C. Stagg, L. Sharova, Y. Qian, D. Dudekula, S. Sheer, J. S. Cadet, T. Hirata, H. T. Yang, I. Goldberg, M. K. Evans, D. L. Longo, D. Schlessinger, and M. S K.o. Systematic repression of transcription factors reveals limited patterns of gene expression changes in ES cells. Scientific Reports, 3:1390, 2013. J. Qin, M. J. Li, P. Wang, M. Q. Zhang, and J. Wang. Ch IP-Array: Combinatory analysis of Ch IP-seq/chip and microarray gene expression data to discover direct/indirect targets of a transcription factor. 
Nucleic Acids Research, 39:W430 436, 2011. J. Qin, Y. Hu, F. Xu, H. K. Yalamanchili, and J. Wang. Inferring gene regulatory networks by integrating Ch IP-seq/chip and transcriptome data via LASSO-type regularization methods. Methods, 67(3):294 303, 2014. M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23 (2):1126 1153, 2013. R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer-Verlag, Berlin, 1998. W. Rudin. Principles of Mathematical Analysis. Mc Graw-Hill, New York, 1976. W. T. Short. Hyperbolic solution of the cubic equation. National Mathematics Magazine, 12(3):111 114, 1937. S. Tao, D. Boley, and S. Zhang. Local linear convergence of ISTA and FISTA on the LASSO problem. SIAM Journal on Optimization, 26(1):313 336, 2016. L. Tian, P. Xu, and Q. Gu. Forward backward greedy algorithms for multi-task learning with faster rates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 735 744, 2016. Group Sparse Optimization R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58:267 288, 1994. P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387 423, 2009. M. Usman, C. Prieto, T. Schaeffter, and P. G. Batchelor. k-t Group sparse: A method for accelerating dynamic MRI. Magnetic Resonance in Medicine, 66(4):1163 1176, 2011. S. A. van de Geer and P. B uhlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360 1392, 2009. E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890 912, 2008. E. van den Berg, M. Schmidt, M. P. Friedlander, and K. Murphy. Group sparsity via linear-time projection. Technical report, University of British Columbia, 2008. J. Wang, Y. Hu, C. Li, and J.-C. Yao. Linear convergence of CQ algorithms and applications in gene regulatory network inference. Inverse Problems, to appear, 2017. S. J. Wright, R. D. Nowak, and M. A T Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479 2493, 2009. L. Wu, Z. Sun, and D.-H. Li. A gradient based method for the l2-l1/2 minimization and application to compressive sensing. Pacific Journal of Optimization, 10(2):401 414, 2014. L. Xiao and T. Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062 1091, 2013. D. Xie, A. P. Boyle, L. Wu, J. Zhai, T. Kawli, and M. Snyder. Dynamic trans-acting factor colocalization in human cells. Cell, 155(3):713 724, 2013. Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758 1789, 2013. Z. Xu, X. Chang, F. Xu, and H. Zhang. L1/2 regularization: A thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems, 23:1013 1027, 2012. H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group Lasso. International Conference on Machine Learning (ICML), 2010. J. Yang and Y. Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250 278, 2011. X. Yang and X. Huang. 
A nonlinear Lagrangian approach to constrained optimization problems. SIAM Journal on Optimization, 11(4):1119 1144, 2001. Hu, Li, Meng, Qin, and Yang M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of The Royal Statistical Society: Series B, 68:49 67, 2006. J. Zeng, S. Lin, and Z. Xu. Sparse regularization: Convergence of iterative jumping thresholding algorithm. https://arxiv.org/abs/1402.5744, 2015. T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689 4708, 2011. T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109 2144, 2009. T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081 1107, 2010.