# Fast Sparse Group Lasso

Yasutoshi Ida¹,³  Yasuhiro Fujiwara²  Hisashi Kashima³,⁴
¹NTT Software Innovation Center  ²NTT Communication Science Laboratories  ³Kyoto University  ⁴RIKEN AIP
yasutoshi.ida@ieee.org  yasuhiro.fujiwara.kh@hco.ntt.co.jp  kashima@i.kyoto-u.ac.jp

Sparse Group Lasso is a method of linear regression analysis that finds sparse parameters in terms of both feature groups and individual features. Block Coordinate Descent is a standard approach for obtaining the parameters of Sparse Group Lasso, and it iteratively updates the parameters for each parameter group. However, because the update of even a single parameter group depends on all the parameter groups or data points, the computation cost is high when the number of parameters or data points is large. This paper proposes a fast Block Coordinate Descent for Sparse Group Lasso. It efficiently skips the updates of groups whose parameters must be zeros by using only the parameters within each group. In addition, it preferentially updates the parameters in a candidate group set, which contains the groups whose parameters must not be zeros. Theoretically, our approach guarantees the same results as the original Block Coordinate Descent. Experiments show that our algorithm enhances the efficiency of the original algorithm without any loss of accuracy.

## 1 Introduction

Sparse Group Lasso (SGL) [3, 17] is a popular feature-selection method based on the linear regression model for data that have group structures. For the analysis of such data, it is important to identify not only individual features but also groups of features that have some relationship with the response. SGL finds such groups and features by obtaining sparse parameters corresponding to the features in the linear regression model. In particular, SGL effectively achieves parameter sparsity by utilizing two types of regularization: feature- and group-level regularization. Owing to its effectiveness, SGL is used in the analysis of various kinds of data, e.g., gene expression data [11, 16] and climate data [13].

In order to obtain the sparse parameters in SGL, Block Coordinate Descent (BCD) is used as a standard approach [3, 17]. BCD iteratively updates the parameters for each group until convergence. In particular, it first checks whether the parameters in a group are zeros by using all the parameters or data points. This process induces group-level sparsity. If the parameters in the group are determined to be nonzeros, BCD updates the parameters in the group. It applies the aforementioned steps to the parameters of each group until the parameters of all groups converge.

Although SGL is practical for analyzing group-structured data, BCD suffers from high computation costs. The main bottleneck is the computation that checks whether the parameters of a group are zeros, because this computation uses all the parameters or data points. The screening technique is the main existing approach for reducing the computation cost of BCD by reducing the data size [18, 12-14]. This technique eliminates features and groups whose parameters are zeros before entering the iterations of BCD. However, screening cannot be expected to reduce the data size when the initial parameters are far from optimal [9]. Screening often faces this problem in practice, and in such cases the efficiency of BCD is not improved. Therefore, speeding up BCD is still an important topic of study for handling large data sizes.
This paper proposes a fast BCD for SGL. Our main idea is to identify the groups whose parameters must be zeros by using only the parameters within each group, whereas the standard method uses all the parameters or data points. As the number of parameters in one group is much smaller than the total number of parameters or data points, our method efficiently skips the computation of groups whose parameters must be zeros. Another idea is to extract a candidate group set, which contains the groups whose parameters must not be zeros. As the parameters in the set are likely to contribute strongly to the prediction [5, 4], we can expect BCD to optimize the objective function effectively by preferentially updating the parameters in the set. An attractive point of our method is that it does not need any additional hyperparameters, which would incur additional tuning costs. In addition, it provably converges to the same value of the objective function as the original BCD. Experiments demonstrate that our method enhances the efficiency of BCD while achieving the same prediction error. Although we consider the case of non-overlapping groups in this paper, our method can be extended to overlapping groups relatively easily by using the overlap norm [8].

## 2 Preliminary

### 2.1 Sparse Group Lasso

This section defines SGL, a method of linear regression analysis that finds a small set of groups, in addition to a small set of features, that achieves high prediction accuracy for the response. Let $n$ be the number of data points, where each data point is represented as a $p$-dimensional feature vector, let $y \in \mathbb{R}^n$ be a continuous response, and let $G$ be the total number of feature groups. The feature matrix $X \in \mathbb{R}^{n \times p}$ is represented as $X = [X^{(1)}, X^{(2)}, \dots, X^{(G)}]$, where $X^{(g)} \in \mathbb{R}^{n \times p_g}$ is the block matrix of $X$ corresponding to the $g$-th feature group with $p_g$ features. Similarly, the parameter vector $\beta \in \mathbb{R}^p$ is represented as $\beta = [\beta^{(1)\top}, \beta^{(2)\top}, \dots, \beta^{(G)\top}]^\top$, where $\beta^{(g)} \in \mathbb{R}^{p_g}$ is the parameter (coefficient) vector of group $g$. Therefore, the linear regression model in SGL is represented as $y = X\beta = X^{(1)}\beta^{(1)} + \cdots + X^{(G)}\beta^{(G)}$. The solution $\hat{\beta}$ is obtained by solving the following problem:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; \tfrac{1}{2n}\Big\|y - \sum_{g=1}^{G} X^{(g)}\beta^{(g)}\Big\|_2^2 + (1-\alpha)\lambda \sum_{g=1}^{G} \sqrt{p_g}\,\|\beta^{(g)}\|_2 + \alpha\lambda\|\beta\|_1, \qquad (1)$$

where $\alpha \in [0, 1]$ and $\lambda \ge 0$ are regularization constants; $\alpha$ controls the balance of the convex combination of the $\ell_1$ and $\ell_2$ norm penalties, and $\lambda$ controls the degree of sparsity of the solution.

### 2.2 Block Coordinate Descent

BCD is a standard approach used to obtain the solution $\hat{\beta}$ of SGL [17]. It consists of a group-level outer loop and an element-level loop. The group-level outer loop checks whether the parameter vector $\beta^{(g)}$ of each feature group is a zero vector. If $\beta^{(g)}$ turns out to be a nonzero vector, the element-level loop updates each parameter in $\beta^{(g)}$. The process terminates when the whole parameter vector converges.

In the element-level loop, BCD updates $\beta^{(g)}$ of group $g$ if the parameter vector of the group is not a zero vector. The updated parameter vector $\beta^{(g)}_{\mathrm{new}}$ is defined as follows:

$$\beta^{(g)}_{\mathrm{new}} = \Big(1 - \tfrac{t(1-\alpha)\lambda}{\|S(Z^{(g)},\, t\alpha\lambda)\|_2}\Big)_{+} S(Z^{(g)}, t\alpha\lambda). \qquad (2)$$

In Equation (2), $Z^{(g)} = \beta^{(g)} + \tfrac{t}{n}\big(X^{(g)\top}r^{(-g)} - X^{(g)\top}X^{(g)}\beta^{(g)}\big)$, where $t \ge 0$ is the step size and $r^{(-g)}$ is the partial residual, defined as follows:

$$r^{(-g)} = y - \sum_{l \neq g} X^{(l)}\beta^{(l)}. \qquad (3)$$

$S(\cdot)$ is the coordinate-wise soft-threshold operator; its $j$-th element is computed as $S(z, \gamma)[j] = \mathrm{sign}(z[j])(|z[j]| - \gamma)_{+}$. $\beta^{(g)}$ is iteratively updated using Equation (2) until convergence.
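To make the update concrete, the following is a minimal NumPy sketch of the soft-threshold operator and the group update of Equation (2). It is an illustration rather than the authors' implementation; the argument names (`X_g`, `beta_g`, `r_minus_g`, and the fixed step size `t`) are our own conventions.

```python
import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma)[j] = sign(z[j]) * (|z[j]| - gamma)_+
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def update_group(X_g, beta_g, r_minus_g, t, alpha, lam, n):
    # Z^(g) = beta^(g) + (t/n) * (X^(g)^T r^(-g) - X^(g)^T X^(g) beta^(g))
    Z_g = beta_g + (t / n) * (X_g.T @ r_minus_g - X_g.T @ (X_g @ beta_g))
    S_Z = soft_threshold(Z_g, t * alpha * lam)
    norm = np.linalg.norm(S_Z)
    if norm == 0.0:
        return np.zeros_like(beta_g)
    # Eq. (2): beta_new^(g) = (1 - t(1-alpha)lam / ||S(Z^(g), t*alpha*lam)||_2)_+ * S(Z^(g), t*alpha*lam)
    scale = max(1.0 - t * (1.0 - alpha) * lam / norm, 0.0)
    return scale * S_Z
```

Here `r_minus_g` stands for the partial residual $r^{(-g)}$ of Equation (3) for the group being updated.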
If the parameter vector of the group is determined to be a zero vector, Equation (2) is skipped. The computation cost of Equation (2) is $O(p_g^2)$ time because $X^{(g)\top}X^{(g)}$ in $Z^{(g)}$ is precomputed before entering the main loop. In addition, $X^{(g)\top}r^{(-g)}$ has already been computed in the group-level outer loop, as described next.

In the group-level outer loop, $\beta^{(g)}$ is the zero vector if the following condition holds:

$$\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2 \le \sqrt{p_g}\,(1-\alpha)\lambda. \qquad (4)$$

In other words, if Equation (4) holds, the parameter vector of group $g$ is a zero vector; Equation (2) is then skipped, and the parameter vector is not updated. $X^{(g)\top}r^{(-g)}$ in Equation (4) is computed using the following equation, which consists only of matrix operations:

$$X^{(g)\top}r^{(-g)} = X^{(g)\top}y - X^{(g)\top}X\beta + X^{(g)\top}X^{(g)}\beta^{(g)}. \qquad (5)$$

In this equation, $X^{(g)\top}y$, $X^{(g)\top}X$, and $X^{(g)\top}X^{(g)}$ are precomputed before entering the loops. The costs of these precomputations have a relatively low impact on the total computation cost because they are performed only once and are easily parallelized. On the other hand, the computation cost of Equation (5) is still high: it requires $O(pp_g + p_g^2)$ time and is repeatedly performed until convergence. As a result, we need $O(pp_g + p_g^2)$ time for Equation (4) at every iteration. We can reduce Equation (5) to $O(np_g)$ time by maintaining the partial residuals of Equation (3), as described in [7]. However, in either case, the computation cost of Equation (4) depends on $p$ or $n$. Therefore, Equation (4) incurs a large computation cost for high-dimensional features or a large number of data points.

## 3 Proposed Approach

In this section, we introduce our algorithm, which efficiently obtains the solution of SGL. First, we explain the ideas underlying our algorithm. Next, we introduce several lemmas that are necessary to derive the algorithm. We then describe the algorithm itself. (All proofs are given in the supplementary material.)

In SGL, obtaining the solution through BCD incurs a high computation cost. This is because (i) Equation (4) requires $O(pp_g + p_g^2)$ or $O(np_g)$ time, which is expensive for large feature vectors or a large number of data points, and (ii) BCD checks all of the feature groups with Equation (4) at every iteration even when most of the groups have zero vectors.

Our main idea is to identify groups whose parameter vectors must be zero vectors by approximating Equation (4), which checks whether the parameter vector of each group is zero. In particular, we compute an upper bound of $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ instead of the exact value. If the upper bound is no larger than $\sqrt{p_g}(1-\alpha)\lambda$, the parameter vector of the group must be a zero vector, and we can safely skip the computation for that group. As our upper bound requires only $O(p_g)$ time instead of the $O(pp_g + p_g^2)$ or $O(np_g)$ time of the original Equation (4), we can effectively reduce the computation cost.

Another idea is to extract a candidate group set, which contains the groups whose parameter vectors must not be zero vectors. As the parameters in the set are likely to contribute strongly to the prediction [5, 4], we can expect BCD to optimize the objective function effectively by preferentially updating the parameters in the set. In addition, our method requires only $O(G)$ time to construct the set, so its computation cost is relatively low.
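Before deriving the bounds, the following sketch recaps the standard group-level check of Equations (4) and (5) that the above ideas aim to avoid. It reuses `soft_threshold` and NumPy from the previous sketch, and the layout of the precomputed quantities in `pre` (keys `'Xg_y'`, `'Xg_X'`, `'Xg_Xg'`) is an illustrative assumption.

```python
def group_is_zero(g, beta, groups, pre, alpha, lam):
    """Standard zero-check of Eq. (4) for group g, using the precomputed terms of Eq. (5).

    `groups[g]` holds the column indices of group g; `pre['Xg_y'][g]`, `pre['Xg_X'][g]`,
    and `pre['Xg_Xg'][g]` hold X^(g)^T y, X^(g)^T X, and X^(g)^T X^(g) (assumed layout).
    """
    idx = groups[g]
    # Eq. (5): X^(g)^T r^(-g) = X^(g)^T y - X^(g)^T X beta + X^(g)^T X^(g) beta^(g)
    xg_r = pre['Xg_y'][g] - pre['Xg_X'][g] @ beta + pre['Xg_Xg'][g] @ beta[idx]
    # Eq. (4): the group is zero if ||S(X^(g)^T r^(-g), alpha*lam)||_2 <= sqrt(p_g)*(1-alpha)*lam
    return np.linalg.norm(soft_threshold(xg_r, alpha * lam)) <= np.sqrt(len(idx)) * (1 - alpha) * lam
```

The matrix-vector product with `pre['Xg_X'][g]` is the $O(pp_g)$ term that our bounds let us skip for most groups.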
### 3.2 Upper Bound for Skipping Computations

We introduce an upper bound of $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ in Equation (4). To derive a tight upper bound, we introduce reference parameter vectors and the corresponding partial residuals of Equation (3), both computed before entering the group-level outer loop; we denote these reference quantities with a bar. Specifically, we obtain a tight bound by explicitly utilizing a term that represents the difference between the reference and current parameter vectors. As many parameter vectors converge rapidly during the iterations, this difference decreases rapidly. We define the upper bound as follows:

**Definition 1 (Upper bound)** Let $U^{(g)}$ be an upper bound of $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ in Equation (4), and let $\bar{r}^{(-g)}$ be the partial residual of Equation (3) computed before entering the group-level outer loop. Then, $U^{(g)}$ is defined as follows:

$$U^{(g)} = \|X^{(g)\top}\bar{r}^{(-g)}\|_2 + \Lambda(g, g) + \sum_{l=1}^{G}\Lambda(g, l), \qquad (6)$$

where $\Lambda(g, l) = \|\hat{K}^{(g)}[l]\|_2\,\|\beta^{(l)} - \bar{\beta}^{(l)}\|_2$. The $i$-th element of $\hat{K}^{(g)}[l] \in \mathbb{R}^{p_g}$ is given as $\|K^{(g,l)}[i, :]\|_2$, i.e., the $\ell_2$ norm of the $i$-th row vector of the block matrix $K^{(g,l)} \in \mathbb{R}^{p_g \times p_l}$ of $K := X^\top X \in \mathbb{R}^{p \times p}$. $\bar{\beta}^{(g)}$ is the parameter vector of group $g$ before entering the group-level outer loop.

Note that we can precompute $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$ and $\|\hat{K}^{(g)}[\cdot]\|_2$ before entering the group-level outer loop and the main loop, respectively. The following lemma shows the property of the upper bound for groups whose parameters must be zeros:

**Lemma 1 (Groups with zero vectors)** If $U^{(g)}$ satisfies $U^{(g)} \le \sqrt{p_g}(1-\alpha)\lambda$, parameter vector $\beta^{(g)}$ of group $g$ is a zero vector.

Lemma 1 indicates that we can identify groups whose parameters must be zeros by using the upper bound $U^{(g)}$ instead of $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$. The error bound of $U^{(g)}$ with respect to $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ is described in a later section.

### 3.3 Online Update Scheme of the Upper Bound

Although we can identify groups whose parameters must be zeros by using the upper bound $U^{(g)}$, computing Equation (6) still requires $O(p + p_g)$ time even if we precompute $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$ and $\|\hat{K}^{(g)}[\cdot]\|_2$. As the standard approach requires $O(pp_g + p_g^2)$ or $O(np_g)$ time, the efficiency gain would be moderate. This is the motivation behind our online update scheme for the upper bound. In particular, when the parameter vector of a group is updated, we use the following definition for the upper-bound computation:

**Definition 2 (Online update scheme of the upper bound)** If $\beta^{(g)}$ is updated to $\beta^{(g)\prime}$, we update the upper bound $U^{(g)}$ of Equation (6) as follows:

$$U^{(g)\prime} = U^{(g)} - 2\Lambda(g, g) + 2\|\hat{K}^{(g)}[g]\|_2\,\|\beta^{(g)\prime} - \bar{\beta}^{(g)}\|_2. \qquad (7)$$

Equation (7) clearly holds because we subtract the old value $2\Lambda(g, g)$ from Equation (6) and add the updated value $2\|\hat{K}^{(g)}[g]\|_2\|\beta^{(g)\prime} - \bar{\beta}^{(g)}\|_2$. In terms of the computation cost, we have the following lemma:

**Lemma 2 (Computation cost of the online update scheme)** Given precomputed $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$ and $\|\hat{K}^{(g)}[\cdot]\|_2$, the computation of Equation (7) requires $O(p_g)$ time when the parameter vector of group $g$ is updated.

The above lemma shows that we can update the upper bound in $O(p_g)$ time. This cost is significantly lower than those of Equations (4) and (6), which require $O(pp_g + p_g^2)$ (or $O(np_g)$) and $O(p + p_g)$ time, respectively. Therefore, we can efficiently identify groups whose parameters must be zeros on the basis of Lemma 1 and Definition 2.
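A minimal sketch of Definition 1, Definition 2, and the skip test of Lemma 1, assuming (our own naming, not the paper's) that the precomputed norms are stored in simple containers: `Xg_rbar_norm[g]` for $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$, `K_hat_norm[g][l]` for $\|\hat{K}^{(g)}[l]\|_2$, and `diff[l]` for the maintained $\|\beta^{(l)} - \bar{\beta}^{(l)}\|_2$ (NumPy as above).

```python
def initial_upper_bound(g, G, Xg_rbar_norm, K_hat_norm, diff):
    # Eq. (6): U^(g) = ||X^(g)^T rbar^(-g)||_2 + Lambda(g,g) + sum_{l=1..G} Lambda(g,l),
    # where Lambda(g,l) = ||K_hat^(g)[l]||_2 * ||beta^(l) - beta_bar^(l)||_2.
    return (Xg_rbar_norm[g] + K_hat_norm[g][g] * diff[g]
            + sum(K_hat_norm[g][l] * diff[l] for l in range(G)))

def refresh_upper_bound(U_g, g, K_hat_norm, old_diff_g, new_diff_g):
    # Eq. (7): when beta^(g) changes, swap the two Lambda(g,g) contributions.
    # Overall O(p_g), since computing new_diff_g = ||beta^(g)' - beta_bar^(g)||_2 costs O(p_g).
    return U_g - 2.0 * K_hat_norm[g][g] * old_diff_g + 2.0 * K_hat_norm[g][g] * new_diff_g

def can_skip(U_g, p_g, alpha, lam):
    # Lemma 1: if U^(g) <= sqrt(p_g) * (1 - alpha) * lam, beta^(g) must be the zero vector.
    return U_g <= np.sqrt(p_g) * (1.0 - alpha) * lam
```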
### 3.4 Candidate Group Set for Selective Updates

In this section, we introduce a method to extract the candidate group set, which contains the groups whose parameters must not be zeros. We expect BCD to update the parameter vectors effectively by preferentially updating the parameter vectors in the candidate group set. To extract the candidate group set, we utilize a criterion that approximates $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ in Equation (4). If the criterion, defined as follows, is above a threshold, the group is included in the set.

**Definition 3 (Criterion for extracting the candidate group set)** Let $C^{(g)}$ be the criterion used to check whether group $g$ is included in the candidate group set. Then, $C^{(g)}$ is defined as follows:

$$C^{(g)} = \|X^{(g)\top}\bar{r}^{(-g)}\|_2 - \alpha\lambda\sqrt{p_g}/2, \qquad (8)$$

where $\bar{r}^{(-g)}$ is the partial residual of Equation (3) computed before entering the group-level outer loop.

The error bounds of $C^{(g)}$ and $U^{(g)}$ with respect to $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ are as follows:

**Lemma 3 (Error bound)** Let $\epsilon$ be the error bound of $C^{(g)}$ for $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ such that $|C^{(g)} - \|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2| \le \epsilon$. Then $\epsilon = \Lambda(g, g) + \sum_{l=1}^{G}\Lambda(g, l) + \alpha\lambda\sqrt{p_g}/2$. In addition, we have $|U^{(g)} - \|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2| \le 2\epsilon$.

The above lemma suggests that $C^{(g)}$ approximates $\|S(X^{(g)\top}r^{(-g)}, \alpha\lambda)\|_2$ better than $U^{(g)}$ because the error bound of $C^{(g)}$ is half that of $U^{(g)}$. We extract candidate group set $\mathcal{C}$ on the basis of $C^{(g)}$ using the following definition:

**Definition 4 (Candidate group set)** Candidate group set $\mathcal{C}$ is defined as

$$\mathcal{C} = \{g \in \{1, \dots, G\} \mid C^{(g)} > \sqrt{p_g}(1-\alpha)\lambda\}. \qquad (9)$$

The candidate group set has the following property:

**Lemma 4 (Groups containing nonzero vectors)** Candidate group set $\mathcal{C}$ contains the groups whose parameters must be nonzeros.

The above lemma indicates that the candidate group set comprises not only the groups whose parameters must be nonzeros but also groups whose parameters can be nonzeros. In terms of the computation cost of extracting the candidate group set, we have the following lemma:

**Lemma 5 (Computation cost of the candidate group set)** Given precomputed $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$, we can extract candidate group set $\mathcal{C}$ in $O(G)$ time.
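The set of Definition 4 can thus be built in a single $O(G)$ pass (Lemma 5). A sketch under the same assumed conventions as above, with `Xg_rbar_norm[g]` again denoting the precomputed $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$:

```python
def candidate_group_set(G, group_sizes, Xg_rbar_norm, alpha, lam):
    # Eq. (8): C^(g) = ||X^(g)^T rbar^(-g)||_2 - alpha*lam*sqrt(p_g)/2
    # Eq. (9): C = { g : C^(g) > sqrt(p_g)*(1-alpha)*lam }; O(G) given Xg_rbar_norm.
    C = []
    for g in range(G):
        p_g = group_sizes[g]
        c_g = Xg_rbar_norm[g] - alpha * lam * np.sqrt(p_g) / 2.0
        if c_g > np.sqrt(p_g) * (1.0 - alpha) * lam:
            C.append(g)
    return C
```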
### 3.5 Algorithm

This section describes our algorithm, which utilizes the above definitions and lemmas. Algorithm 1 gives a full description of our approach, which is based on BCD with the sequential rule [6], a standard approach for SGL. The sequential rule is used to tune the regularization constant $\lambda$ over the sequence $(\lambda_q)_{q=0}^{Q-1}$, where $\lambda_0 > \lambda_1 > \dots > \lambda_{Q-1}$: it sequentially optimizes the parameter vector for each value in $(\lambda_q)_{q=0}^{Q-1}$ and reuses the solution for the previous $\lambda$ as the initial parameters for the current $\lambda$.

Our main idea is to skip groups whose parameters must be zeros during the optimization by utilizing Lemma 1. As the upper bound $U^{(g)}$ in Lemma 1 can be computed at low cost, as described in Lemma 2, we can efficiently avoid the computation of Equation (4), which is the main bottleneck of standard BCD. In addition, we extract the candidate group set before starting to optimize the parameters for the current $\lambda$. The impact of this computation on the total cost is relatively low, as shown in Lemma 5. We expect BCD to become more effective by preferentially updating the parameters in the set, on the basis of Lemma 4.

Algorithm 1 first precomputes $\|\hat{K}^{(g)}[l]\|_2$, which is used for computing the upper bounds (lines 2-4). In the loop of the sequential rule, we construct the candidate group set (lines 6-10). When constructing the set by Equation (9), the term $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$ is computed directly in the initial iteration and reused from the previous iteration otherwise. Next, BCD is performed on the parameter vectors of the set (lines 11-19). Then, the algorithm enters the loop of another BCD that uses the upper bounds (lines 20-36). The reference parameter vector is set (line 21), and $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$ is precomputed, which is also used for the computation of the upper bounds (lines 22 and 23). In the group-level outer loop, the upper bound $U^{(g)}$ of group $g$ is computed using Equation (7) (line 25); note that Equation (6) is used for the initial computation of the upper bound. If $U^{(g)}$ is no larger than the threshold $\sqrt{p_g}(1-\alpha)\lambda$, the parameters of the group are set to zeros following Lemma 1 (lines 26 and 27). Otherwise, the same procedure as in the original BCD is performed (lines 28-34). Finally, $\|\beta^{(g)} - \bar{\beta}^{(g)}\|_2$, which is used for the computation of the upper bound, is updated (line 35).

**Algorithm 1** Fast Sparse Group Lasso
```
 1: A = {1, ..., G}, β ← 0, β̄ ← 0                    ▷ A holds all the group indices
 2: for each g ∈ A do                                 ▷ precomputation for the bounds
 3:   for each l ∈ A do
 4:     compute ||K̂^(g)[l]||_2
 5: for q = 0 to Q−1 do                               ▷ loop of the sequential rule over (λ_q)_{q=0}^{Q−1}
 6:   C ← ∅                                           ▷ initialize candidate group set C
 7:   for each g ∈ A do                               ▷ loop for extracting the candidate group set
 8:     compute C^(g) by Equation (8)
 9:     if C^(g) > √p_g (1−α)λ_q then                 ▷ add groups to C following Lemma 4
10:       add g to C
11:   repeat                                          ▷ main loop of BCD on candidate group set C
12:     for each g ∈ C do                             ▷ group-level outer loop
13:       if ||S(X^(g)⊤ r^(−g), αλ_q)||_2 ≤ √p_g (1−α)λ_q then   ▷ check the condition of Equation (4)
14:         β^(g) ← 0
15:       else
16:         repeat                                    ▷ element-level loop
17:           update β^(g) by Equation (2)
18:         until β^(g) converges
19:   until β converges
20:   repeat                                          ▷ main loop of BCD with the upper bounds
21:     β̄ ← β                                        ▷ set the reference parameter vector
22:     for each g ∈ A do                             ▷ precomputation for the upper bounds
23:       compute ||X^(g)⊤ r̄^(−g)||_2
24:     for each g ∈ A do                             ▷ group-level outer loop
25:       compute U^(g) by Equation (7)               ▷ Equation (6) is used for the initial computation
26:       if U^(g) ≤ √p_g (1−α)λ_q then               ▷ skip groups whose parameters must be zeros (Lemma 1)
27:         β^(g) ← 0
28:       else
29:         if ||S(X^(g)⊤ r^(−g), αλ_q)||_2 ≤ √p_g (1−α)λ_q then   ▷ check the condition of Equation (4)
30:           β^(g) ← 0
31:         else
32:           repeat                                  ▷ element-level loop
33:             update β^(g) by Equation (2)
34:           until β^(g) converges
35:       update ||β^(g) − β̄^(g)||_2                  ▷ for the online update scheme of the upper bounds
36:   until β converges
```

In terms of the computation cost, our algorithm has the following property:

**Theorem 1 (Computation cost)** Let $S$ and $S'$ be the rates of un-skipped groups when Lemma 1 and Equation (4) are used, respectively. Suppose that all groups have the same size $p_g$. If $t_m$ and $t_f$ are the numbers of iterations of BCD for the main loop and the element-level loop, respectively, our approach requires $O(G\{(Q + St_m)(pp_g + p_g^2) + S'p_g t_m(t_f p_g + 1) + Q\})$ or $O(G\{(Q + St_m)np_g + S'p_g t_m(t_f p_g + 1) + Q\})$ time.

According to Theorem 1, when a large number of groups are skipped on the basis of the upper bound, the rate $S$ of un-skipped groups is small. As a result, the total computation cost is effectively reduced. In terms of convergence, our algorithm has the following property:

**Theorem 2 (Convergence property)** Suppose that the regularization constants in Algorithm 1 are the same as those of the original BCD and that the BCD converges. Then, the solution of Algorithm 1 attains the same value of the objective function as that of the original BCD.

Theorem 2 shows that our algorithm returns the same value of the objective function as the original approach. Therefore, our approach does not decrease the accuracy compared with the original approach.
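To show how the pieces fit together, the following is a compressed, illustrative driver in the spirit of Algorithm 1 for a single value of $\lambda$. It reuses `soft_threshold`, `update_group`, and `candidate_group_set` from the sketches above, and its bookkeeping is deliberately naive: the bound of Equation (6) is recomputed from scratch instead of being maintained via Equation (7), precomputations are not shared across $\lambda$ values, and convergence checks are simplified. It is not the authors' implementation.

```python
def fast_bcd_one_lambda(X, y, beta, groups, alpha, lam, t, max_sweeps=100, tol=1e-5):
    """Illustrative single-lambda pass in the spirit of Algorithm 1 (naive bookkeeping).

    `groups` is a list of integer index arrays; `beta` is modified in place.
    """
    n = X.shape[0]
    G = len(groups)
    sizes = [len(idx) for idx in groups]
    K = X.T @ X
    # ||K_hat^(g)[l]||_2 equals the Frobenius norm of the (g, l) block of K = X^T X.
    K_norm = [[np.linalg.norm(K[np.ix_(gi, li)]) for li in groups] for gi in groups]

    def xg_rmg(g):
        # Partial residual r^(-g) of Eq. (3) and X^(g)^T r^(-g) for the current beta.
        r_mg = y - X @ beta + X[:, groups[g]] @ beta[groups[g]]
        return X[:, groups[g]].T @ r_mg, r_mg

    def exact_update(g):
        # Original BCD step: zero-check of Eq. (4), then the element-level loop of Eq. (2).
        v, r_mg = xg_rmg(g)
        if np.linalg.norm(soft_threshold(v, alpha * lam)) <= np.sqrt(sizes[g]) * (1 - alpha) * lam:
            beta[groups[g]] = 0.0
            return
        for _ in range(max_sweeps):
            old = beta[groups[g]].copy()
            beta[groups[g]] = update_group(X[:, groups[g]], old, r_mg, t, alpha, lam, n)
            if np.linalg.norm(beta[groups[g]] - old) <= tol:
                break

    # Pass 1: BCD restricted to the candidate group set of Definition 4.
    rbar_norm = [np.linalg.norm(xg_rmg(g)[0]) for g in range(G)]
    cand = candidate_group_set(G, sizes, rbar_norm, alpha, lam)
    for _ in range(max_sweeps):
        before = beta.copy()
        for g in cand:
            exact_update(g)
        if np.linalg.norm(beta - before) <= tol:
            break

    # Pass 2: BCD over all groups, skipping groups via the upper bound of Lemma 1.
    for _ in range(max_sweeps):
        beta_bar = beta.copy()                                    # reference parameter vector
        rbar_norm = [np.linalg.norm(xg_rmg(g)[0]) for g in range(G)]
        before = beta.copy()
        for g in range(G):
            diff = [np.linalg.norm(beta[idx] - beta_bar[idx]) for idx in groups]
            U_g = (rbar_norm[g] + K_norm[g][g] * diff[g]
                   + sum(K_norm[g][l] * diff[l] for l in range(G)))   # Eq. (6)
            if U_g <= np.sqrt(sizes[g]) * (1 - alpha) * lam:          # Lemma 1: safe skip
                beta[groups[g]] = 0.0
            else:
                exact_update(g)
        if np.linalg.norm(beta - before) <= tol:
            break
    return beta
```

A faithful implementation would additionally reuse $\|X^{(g)\top}\bar{r}^{(-g)}\|_2$ across the sequential rule and maintain the bound with the $O(p_g)$ online update of Definition 2, as Algorithm 1 does.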
## 4 Related Work

To improve the efficiency of optimization with sparsity-inducing regularization, safe screening is generally used [6]; it eliminates parameters that are zero in the solution before the optimization starts. As the size of the feature vector can be reduced before solving the problem, the efficiency of the optimization is improved. The current state-of-the-art safe screening method for SGL is the GAP Safe rule [18, 12-14], which is based on the computation of the duality gap, i.e., the difference between the primal and dual objectives of SGL. A safe region that contains the solution is defined on the basis of the duality gap. By utilizing the safe region, this approach can identify groups and features that must be inactive and eliminates them. If the safe region is small, this approach eliminates groups and features effectively. However, unless $\lambda$ is large or a good approximate solution is already known, the screening is often ineffective [9]. To overcome this problem, Ndiaye et al. [14] combined the dynamic safe rule [1, 2] with the GAP Safe rule for SGL. This dynamic GAP Safe rule effectively eliminates groups and features by repeatedly applying the GAP Safe rule during the iterations of BCD.

## 5 Experiments

We evaluated the processing time and prediction error of our approach by conducting experiments on six datasets from the LIBSVM website² (abalone, cpusmall, boston, bodyfat, eunite2001, and pyrim). The numbers of data points are 4177, 8192, 506, 252, 336, and 74, respectively. To obtain group structure, we used polynomial features of these datasets [15]. In particular, we created second-order polynomial features by following the method used in [16]; the groups, which consist of products over combinations of features up to the second degree, were created by using a polynomial kernel. As a result, the numbers of groups for the datasets are 36, 78, 91, 105, 136, and 378, respectively, and the total numbers of features are 176, 408, 481, 560, 736, and 2133, respectively.

²https://www.csie.ntu.edu.tw/~cjlin/libsvm/

We compared our method with the original BCD, the GAP Safe rule [13], and the dynamic GAP Safe rule [14]. We tuned $\lambda$ for all approaches with the sequential rule by following [18, 12-14]. The search space is a non-increasing sequence of $Q$ parameters $(\lambda_q)_{q=0}^{Q-1}$ defined as $\lambda_q = \lambda_{\max}10^{-\delta q/(Q-1)}$; we used $\delta = 4$ and $Q = 100$ [18, 12-14]. For a fair comparison, $\lambda_{\max}$ was computed from the dual norm by following the concept of the GAP Safe rule [13]; the GAP Safe rule safely eliminates groups and features under this setting. For the dynamic GAP Safe rule, the interval of duality-gap computations was set to 10 [14]. For the other tuning parameter, we used $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$. For all approaches, we stopped the algorithm for each $\lambda_q$ when the relative tolerance $\|\beta - \beta_{\mathrm{new}}\|_2/\|\beta_{\mathrm{new}}\|_2$ dropped below $10^{-5}$ [9, 10]. All experiments were conducted on a Linux server with a 2.20 GHz Intel Xeon CPU and 264 GB of main memory.
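For reference, the regularization path described above can be generated as follows; `lambda_max` (computed from the dual norm, not shown here) is assumed to be given.

```python
import numpy as np

def lambda_grid(lambda_max, delta=4.0, Q=100):
    # lambda_q = lambda_max * 10^(-delta * q / (Q - 1)): a non-increasing sequence of Q values.
    return np.array([lambda_max * 10.0 ** (-delta * q / (Q - 1)) for q in range(Q)])
```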
### 5.1 Processing Time

We evaluated the processing times of the sequential rules for each $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$. Figure 1 shows the processing time of each approach on the six datasets; the processing times include the precomputation times for a fair comparison. In the figure, the labels origin, GAP, dynamic GAP, and ours denote the standard BCD, the GAP Safe rule [13], the dynamic GAP Safe rule [14], and our approach, respectively.

Our approach is faster than the previous approaches for all datasets and hyperparameters; it reduces the processing time by up to 97% compared with the standard approach, as shown in Figure 1(f). Table 1 shows the number of computations of Equation (4), which is the main bottleneck of BCD. The results suggest the effectiveness of the upper bound and the candidate group set: they effectively reduce the number of computations of Equation (4) and thereby contribute to the reduction of the processing time shown in Figure 1. The GAP Safe rule and the dynamic GAP Safe rule eliminate groups and features that must be inactive and thus increase the efficiency of BCD. However, when they cannot eliminate a significant number of groups and features, the remaining BCD still incurs a large computation cost. Specifically, large numbers of groups and features remain when $\lambda$ is small even if the dynamic GAP Safe rule is used, because the safe region is large for small $\lambda$ [12, 13] and contains many groups and features that may be active. Furthermore, if the screening cannot eliminate a significant number of groups and features, the processing time may even increase owing to the computation of the duality gap, as observed for $\alpha = 0.4$ in Figure 1(a).

*Figure 1: Processing times of the sequential rules for each hyperparameter α (wall clock time in seconds) on (a) abalone, (b) cpusmall, (c) boston, (d) bodyfat, (e) eunite2001, and (f) pyrim.*

Table 1: Numbers of computations of Eq. (4).

| dataset | origin | ours |
|---|---|---|
| abalone | 1.141×10⁵ | 3.168×10³ |
| cpusmall | 2.105×10⁵ | 7.768×10⁴ |
| boston | 1.248×10⁶ | 9.998×10⁴ |
| bodyfat | 1.694×10⁷ | 2.403×10⁶ |
| eunite2001 | 8.629×10⁵ | 2.052×10⁵ |
| pyrim | 7.667×10⁷ | 7.523×10⁶ |

Table 2: Prediction errors.

| dataset | origin | ours |
|---|---|---|
| abalone | 2.232 | 2.232 |
| cpusmall | 7.886 | 7.886 |
| boston | 9.887 | 9.887 |
| bodyfat | 5.434×10⁻³ | 5.434×10⁻³ |
| eunite2001 | 2.010×10² | 2.010×10² |
| pyrim | 4.615×10⁻³ | 4.615×10⁻³ |

### 5.2 Accuracy

In this section, we evaluate the prediction error on test data to confirm the effectiveness of our algorithm. We split each dataset into training and test data; 50% of each dataset was used as test data for evaluating the prediction error in terms of the squared loss of the response. The results are shown in Table 2. The squared losses of our approach are the same as those of the original approach. This is because our approach is guaranteed to yield the same value of the objective function as the original approach, as described in Theorem 2. The results in Table 2 thus indicate that our approach matches the prediction results of the original approach while improving the efficiency.

## 6 Conclusion

We proposed a fast Block Coordinate Descent for Sparse Group Lasso. The main bottleneck of the original Block Coordinate Descent is the computation that checks whether each group has a zero or nonzero parameter vector, because it uses all the parameters or data points. In contrast, our approach identifies groups whose parameters must be zeros by using only the parameters within each group, and skips their computation. Furthermore, the proposed approach identifies the candidate group set, which contains the groups whose parameters must not be zeros.
The parameters in this set are preferentially updated to raise the effectiveness of Block Coordinate Descent. An attractive point of our method is that it does not need any additional hyperparameters. In addition, it provably guarantees the same results as the original method. The experimental results showed that our method reduces the processing time by up to 97% compared with the original method, without any loss of accuracy.

## References

[1] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and Rémi Gribonval. A Dynamic Screening Principle for the Lasso. In European Signal Processing Conference, pages 6-10, 2014.

[2] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and Rémi Gribonval. Dynamic Screening: Accelerating First-Order Algorithms for the Lasso and Group-Lasso. IEEE Transactions on Signal Processing, 63(19):5121-5132, 2015.

[3] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A Note on the Group Lasso and a Sparse Group Lasso. arXiv preprint arXiv:1001.0736, 2010.

[4] Yasuhiro Fujiwara, Yasutoshi Ida, Junya Arai, Mai Nishimura, and Sotetsu Iwamura. Fast Algorithm for the Lasso based L1-Graph Construction. PVLDB, 10(3):229-240, 2016.

[5] Yasuhiro Fujiwara, Yasutoshi Ida, Hiroaki Shiokawa, and Sotetsu Iwamura. Fast Lasso Algorithm via Selective Coordinate Descent. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1561-1567, 2016.

[6] Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. Strong Rules for Discarding Predictors in Lasso-type Problems. Pacific Journal of Optimization, 8(4):667-698, 2012.

[7] Jian Huang, Patrick Breheny, and Shuangge Ma. A Selective Review of Group Selection in High-Dimensional Models. Statistical Science, 27(4):481-499, 2012.

[8] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group Lasso with Overlap and Graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), pages 433-440, 2009.

[9] Tyler B. Johnson and Carlos Guestrin. Unified Methods for Exploiting Piecewise Linear Structure in Convex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 4754-4762, 2016.

[10] Tyler B. Johnson and Carlos Guestrin. StingyCD: Safely Avoiding Wasteful Updates in Coordinate Descent. In Proceedings of the International Conference on Machine Learning (ICML), pages 1752-1760, 2017.

[11] Dougu Nam and Seon-Young Kim. Gene-set Approach for Expression Pattern Analysis. Briefings in Bioinformatics, 9(3):189-197, 2008.

[12] Eugène Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. GAP Safe Screening Rules for Sparse Multi-task and Multi-class Models. In Advances in Neural Information Processing Systems (NeurIPS), pages 811-819, 2015.

[13] Eugène Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. GAP Safe Screening Rules for Sparse-Group Lasso. In Advances in Neural Information Processing Systems (NeurIPS), pages 388-396, 2016.

[14] Eugène Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap Safe Screening Rules for Sparsity Enforcing Penalties. Journal of Machine Learning Research (JMLR), 18(1):4671-4703, 2017.

[15] Paul Pavlidis, Jason Weston, Jinsong Cai, and William Noble Grundy. Gene Functional Classification from Heterogeneous Data. In Proceedings of the Fifth Annual International Conference on Computational Biology, pages 249-255. ACM, 2001.

[16] Volker Roth and Bernd Fischer. The Group-Lasso for Generalized Linear Models: Uniqueness of Solutions and Efficient Algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pages 848-855, 2008.
[17] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A Sparse-Group Lasso. Journal of Computational and Graphical Statistics, 22(2):231-245, 2013.

[18] Jie Wang and Jieping Ye. Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2132-2140, 2014.