# Group Sparse Additive Machine

Hong Chen¹, Xiaoqian Wang¹, Cheng Deng², Heng Huang¹
¹ Department of Electrical and Computer Engineering, University of Pittsburgh, USA
² School of Electronic Engineering, Xidian University, China
chenh@mail.hzau.edu.cn, xqwang1991@gmail.com, chdeng@mail.xidian.edu.cn, heng.huang@pitt.edu

A family of learning algorithms generated from additive models has attracted much attention recently for its flexibility and interpretability in high-dimensional data analysis. Among them, learning models with grouped variables have shown competitive performance for prediction and variable selection. However, previous works mainly focus on the least squares regression problem rather than the classification task. It is therefore desirable to design a new additive classification model with variable selection capability for the many real-world applications that center on high-dimensional data classification. To address this challenging problem, we investigate classification with group sparse additive models in reproducing kernel Hilbert spaces. A novel classification method, called the group sparse additive machine (GroupSAM), is proposed to explore and utilize the structure information among the input variables. A generalization error bound is derived by integrating a sample error analysis based on empirical covering numbers with a hypothesis error estimate based on the stepping-stone technique. Our new bound shows that GroupSAM can achieve a satisfactory learning rate with polynomial decay. Experimental results on synthetic data and seven benchmark datasets consistently show the effectiveness of our new approach.

1 Introduction

Additive models based on statistical learning methods have been playing important roles in high-dimensional data analysis due to their good performance on prediction tasks and variable selection (deep learning models often do not work well when the amount of training data is limited). In essence, additive models inherit the representation flexibility of nonlinear models and the interpretability of linear models. A learning approach under additive models has two key components: the hypothesis function space and the regularizer that imposes certain restrictions on the estimator. Different from traditional learning methods, the hypothesis space used in additive models relies on a decomposition of the input vector. Usually, each input vector $x \in \mathbb{R}^p$ is divided into $p$ parts directly [17, 30, 6, 28] or into subgroups according to prior structural information among the input variables [27, 26]. A component function is defined on each decomposed input, and the hypothesis function is constructed as the sum of all component functions. Typical examples of hypothesis spaces include kernel-based function spaces [16, 6, 11] and spline-based function spaces [13, 15, 10, 30]. Moreover, the Tikhonov regularization scheme has been used extensively for constructing additive models, where the regularizer is employed to control the complexity of the hypothesis space. Examples of regularizers include the kernel-norm regularization associated with the reproducing kernel Hilbert space (RKHS) [5, 6, 11] and various sparse regularizers [17, 30, 26].
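To fix notation for what follows, a generic additive hypothesis over grouped variables takes the form below (the grouping shown here is illustrative; the precise construction used by GroupSAM is given in Section 2):

$$f(x) \;=\; \sum_{j=1}^{d} f^{(j)}\big(x^{(j)}\big), \qquad x = \big(x^{(1)}, \dots, x^{(d)}\big) \in \mathbb{R}^{p},$$

where each component function $f^{(j)}$ acts only on its own block of variables $x^{(j)}$ and lives in its own function space (an RKHS $\mathcal{H}_{K^{(j)}}$ in our setting), and a regularizer, e.g., the sum of kernel norms or a sparsity-inducing norm over the groups, controls the complexity of the components.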
More recently, several group sparse additive models have been proposed to tackle the high-dimensional regression problem due to their nice theoretical properties and empirical effectiveness [15, 10, 26]. However, most existing additive-model-based learning approaches are limited to the least squares regression problem and spline-based hypothesis spaces. Surprisingly, there is no algorithmic design or theoretical analysis for the classification problem with group sparse additive models in RKHS. This paper fills this gap in algorithmic design and learning theory for additive models. A novel sparse classification algorithm, called the group sparse additive machine (GroupSAM), is proposed under a coefficient-based regularized framework, which is connected to the linear programming support vector machine (LPSVM) [22, 24]. By incorporating grouped variables with prior structural information and an $\ell_{2,1}$-norm based structured sparse regularizer, the new GroupSAM model can conduct nonlinear classification and variable selection simultaneously. Similar to the sparse additive machine (SAM) in [30], our GroupSAM model can be solved efficiently via a proximal gradient descent algorithm. The main contributions of this paper are two-fold. First, a new group sparse nonlinear classification algorithm (GroupSAM) is proposed by extending previous additive regression models to the classification setting; it contains the LPSVM with an additive kernel as a special case. To the best of our knowledge, this is the first algorithmic exploration of additive classification models with group sparsity. Second, theoretical analysis and empirical evaluations of the generalization ability are presented to support the effectiveness of GroupSAM. Based on a constructive analysis of the hypothesis error, we obtain an estimate of the excess generalization error, which shows that our GroupSAM model can achieve the fast convergence rate $O(n^{-1})$ under mild conditions. Experimental results demonstrate the competitive performance of GroupSAM over related methods on both simulated and real data.

Before ending this section, we discuss related works. In [5], a support vector machine (SVM) with additive kernels was proposed and its classification consistency was established. Although that method can also be used for grouped variables, it only focuses on the kernel-norm regularizer and does not address sparseness for variable selection. In [30], SAM was proposed to deal with sparse representation over an orthogonal basis of the hypothesis space. Despite good computational and generalization performance, SAM does not explore the structure information of input variables and ignores the interactions among variables. More importantly, different from the finite spline approximation in [30], our approach enables us to estimate each component function directly in an RKHS. As illustrated in [20, 14], the RKHS-based method is flexible and depends on only a few tuning parameters, whereas the commonly used spline methods need to specify the number of basis functions and the sequence of knots. It should be noticed that group sparse additive models (GroupSpAM in [26]) also address sparsity on grouped variables. However, there are key differences between GroupSAM and GroupSpAM: 1) Hypothesis space.
The component functions in our model are obtained by searching in kernel-based data dependent hypothesis spaces, whereas the method in [26] uses a data independent hypothesis space (not associated with a kernel). As shown in [19, 18, 4, 25], data dependent hypothesis spaces can provide much more adaptivity and flexibility for nonlinear prediction. The advantage of kernel-based hypothesis spaces for additive models is also discussed in [14]. 2) Loss function. The hinge loss used in our classification model is different from the least squares loss in [26]. 3) Optimization. Our GroupSAM only needs to construct one component function for each variable group, while the model in [26] needs to find a component function for each variable in a group; thus our method is usually more efficient. Due to the kernel-based component functions and the non-smooth hinge loss, the optimization of GroupSpAM cannot be extended to our model directly. 4) Learning theory. We establish the generalization bound of GroupSAM by an error estimate technique for data dependent hypothesis spaces, while no such error bound is given in [26]. We present a brief summary in Table 1 to better illustrate the differences between GroupSAM and the other methods.

The rest of this paper is organized as follows. In the next section, we revisit related classification formulations and propose the new GroupSAM model. A theoretical analysis of the generalization error bound is established in Section 3. In Section 4, experimental results on both simulated examples and real data are presented and discussed. Finally, Section 5 concludes this paper.

Table 1: Properties of different additive models.

| | SAM [30] | Group Lasso [27] | GroupSpAM [26] | GroupSAM |
|---|---|---|---|---|
| Hypothesis space | data-independent | data-independent | data-independent | data-dependent |
| Loss function | hinge loss | least squares | least squares | hinge loss |
| Group sparsity | No | Yes | Yes | Yes |
| Generalization bound | Yes | No | No | Yes |

2 Group sparse additive machine

In this section, we first revisit the basic background of binary classification and additive models, and then introduce our new GroupSAM model.

Let $\mathcal{Z} := \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{p+1}$, where $\mathcal{X} \subset \mathbb{R}^{p}$ is a compact input space and $\mathcal{Y} = \{-1, 1\}$ is the set of labels. We assume that the training samples $\mathbf{z} := \{z_i\}_{i=1}^{n} = \{(x_i, y_i)\}_{i=1}^{n}$ are independently drawn from an unknown distribution $\rho$ on $\mathcal{Z}$, where each $x_i \in \mathcal{X}$ and $y_i \in \{-1, 1\}$. Let us denote the marginal distribution of $\rho$ on $\mathcal{X}$ by $\rho_{\mathcal{X}}$ and its conditional distribution for given $x \in \mathcal{X}$ by $\rho(\cdot\,|x)$. For a real-valued function $f : \mathcal{X} \to \mathbb{R}$, we define its induced classifier as $\mathrm{sgn}(f)$, where $\mathrm{sgn}(f)(x) = 1$ if $f(x) \ge 0$ and $\mathrm{sgn}(f)(x) = -1$ if $f(x) < 0$. The prediction performance of $f$ is measured by the misclassification error:

$$\mathcal{R}(f) = \mathrm{Prob}\{Y f(X) \le 0\} = \int_{\mathcal{X}} \mathrm{Prob}(Y \ne \mathrm{sgn}(f)(x)\,|\,x)\, d\rho_{\mathcal{X}}. \qquad (1)$$

It is well known that the minimizer of $\mathcal{R}(f)$ is the Bayes rule:

$$f_c(x) = \mathrm{sgn}\Big(\int_{\mathcal{Y}} y\, d\rho(y|x)\Big) = \mathrm{sgn}\big(\mathrm{Prob}(y=1|x) - \mathrm{Prob}(y=-1|x)\big).$$

Since the Bayes rule involves the unknown distribution $\rho$, it cannot be computed directly. In the machine learning literature, a classification algorithm usually aims to find a good approximation of $f_c$ by minimizing the empirical misclassification risk:

$$\mathcal{R}_{\mathbf{z}}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(y_i f(x_i) \le 0), \qquad (2)$$

where $\mathbb{I}(A) = 1$ if $A$ is true and $0$ otherwise. However, the minimization problem associated with $\mathcal{R}_{\mathbf{z}}(f)$ is NP-hard due to the $0$-$1$ loss $\mathbb{I}$. To alleviate the computational difficulty, various convex losses have been introduced to replace the $0$-$1$ loss, e.g., the hinge loss, the least squares loss, and the exponential loss [29, 1, 7].
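As a concrete illustration (our own sketch, not code from the paper), the following Python snippet evaluates the $0$-$1$ loss together with three standard convex surrogates on the margin $yf(x)$; each surrogate upper-bounds the $0$-$1$ loss, which is what makes the relaxation meaningful:

```python
import numpy as np

def zero_one(y, fx):
    # 0-1 loss I(y f(x) <= 0): the quantity we care about, but NP-hard to minimize directly.
    return (y * fx <= 0).astype(float)

def hinge(y, fx):
    # Hinge surrogate (1 - y f(x))_+, the loss used by GroupSAM.
    return np.maximum(1.0 - y * fx, 0.0)

def squared(y, fx):
    # Least squares surrogate (1 - y f(x))^2 = (y - f(x))^2 for y in {-1, +1}.
    return (1.0 - y * fx) ** 2

def exponential(y, fx):
    # Exponential surrogate exp(-y f(x)), as used in boosting.
    return np.exp(-y * fx)

y = np.array([1, 1, -1, -1])
fx = np.array([2.0, -0.5, -1.5, 0.3])   # hypothetical predictions f(x_i)
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("squared", squared), ("exp", exponential)]:
    print(f"{name:8s}", np.round(loss(y, fx), 3))
```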
Among them, the hinge loss is the most popular error metric for the classification problem due to its nice theoretical properties. In this paper, following [5, 30], we use the hinge loss

$$\ell(y, f(x)) = (1 - yf(x))_+ = \max\{1 - yf(x), 0\}$$

to measure the misclassification cost. The expected and empirical risks associated with the hinge loss are defined respectively as

$$\mathcal{E}(f) = \int_{\mathcal{Z}} (1 - yf(x))_+\, d\rho(x, y), \qquad \mathcal{E}_{\mathbf{z}}(f) = \frac{1}{n}\sum_{i=1}^{n} (1 - y_i f(x_i))_+ .$$

In theory, the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c)$ can be bounded by the excess convex risk $\mathcal{E}(f) - \mathcal{E}(f_c)$ [29, 1, 7]. Therefore, classification algorithms are usually constructed under the structural risk minimization principle [22] associated with $\mathcal{E}_{\mathbf{z}}(f)$.

In this paper, we propose a novel group sparse additive machine (GroupSAM) for nonlinear classification. Let $\{1, \dots, p\}$ be partitioned into $d$ groups. For each $j \in \{1, \dots, d\}$, we denote by $\mathcal{X}^{(j)}$ the grouped input space and by $f^{(j)} : \mathcal{X}^{(j)} \to \mathbb{R}$ the corresponding component function. Usually, the groups can be obtained from prior knowledge [26] or explored by considering combinations of input variables [11]. Let each $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ be a Mercer kernel and let $\mathcal{H}_{K^{(j)}}$ be the corresponding RKHS with norm $\|\cdot\|_{K^{(j)}}$. It has been proved in [5] that

$$\mathcal{H} = \Big\{ \sum_{j=1}^{d} f^{(j)} : f^{(j)} \in \mathcal{H}_{K^{(j)}},\ 1 \le j \le d \Big\}, \qquad \|f\|_K^2 = \inf\Big\{ \sum_{j=1}^{d} \|f^{(j)}\|_{K^{(j)}}^2 : f = \sum_{j=1}^{d} f^{(j)} \Big\},$$

is an RKHS associated with the additive kernel $K = \sum_{j=1}^{d} K^{(j)}$.

For any given training set $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{n}$, the additive model in $\mathcal{H}$ can be formulated as

$$\tilde{f}_{\mathbf{z}} = \arg\min_{f = \sum_{j=1}^{d} f^{(j)} \in \mathcal{H}} \Big\{ \mathcal{E}_{\mathbf{z}}(f) + \eta \sum_{j=1}^{d} \tau_j \|f^{(j)}\|_{K^{(j)}}^2 \Big\}, \qquad (3)$$

where $\eta = \eta(n)$ is a positive regularization parameter and $\{\tau_j\}$ are positive bounded weights for the different variable groups. The solution $\tilde{f}_{\mathbf{z}}$ of (3) has the representation

$$\tilde{f}_{\mathbf{z}}(x) = \sum_{j=1}^{d} \tilde{f}^{(j)}_{\mathbf{z}}(x^{(j)}) = \sum_{j=1}^{d} \sum_{i=1}^{n} \tilde{\alpha}^{(j)}_{\mathbf{z},i}\, y_i\, K^{(j)}(x^{(j)}_i, x^{(j)}), \qquad \tilde{\alpha}^{(j)}_{\mathbf{z},i} \in \mathbb{R},\ 1 \le i \le n,\ 1 \le j \le d.$$

Observe that $\tilde{f}^{(j)}_{\mathbf{z}} \equiv 0$ is equivalent to $\tilde{\alpha}^{(j)}_{\mathbf{z},i} = 0$ for all $i$. Hence, we expect $\|\tilde{\alpha}^{(j)}_{\mathbf{z}}\|_2 = 0$ for $\tilde{\alpha}^{(j)}_{\mathbf{z}} = (\tilde{\alpha}^{(j)}_{\mathbf{z},1}, \dots, \tilde{\alpha}^{(j)}_{\mathbf{z},n})^T \in \mathbb{R}^{n}$ if the $j$-th variable group is not truly informative. This motivates us to consider the sparsity-induced penalty

$$\Omega(f) = \inf\Big\{ \sum_{j=1}^{d} \tau_j \|\alpha^{(j)}\|_2 : f = \sum_{j=1}^{d} \sum_{i=1}^{n} \alpha^{(j)}_i\, y_i\, K^{(j)}(x^{(j)}_i, \cdot) \Big\}.$$

This group sparse penalty aims at variable selection [27] and was introduced into the additive regression model in [26]. Inspired by learning with data dependent hypothesis spaces [19], we introduce the following hypothesis space associated with the training samples $\mathbf{z}$:

$$\mathcal{H}_{\mathbf{z}} = \Big\{ \sum_{j=1}^{d} f^{(j)} : f^{(j)} \in \mathcal{H}^{(j)}_{\mathbf{z}} \Big\}, \qquad (4)$$

where

$$\mathcal{H}^{(j)}_{\mathbf{z}} = \Big\{ f^{(j)} = \sum_{i=1}^{n} \alpha^{(j)}_i K^{(j)}(x^{(j)}_i, \cdot) : \alpha^{(j)}_i \in \mathbb{R} \Big\}.$$

Under the group sparse penalty and the data dependent hypothesis space, the group sparse additive machine (GroupSAM) can be written as

$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_{\mathbf{z}}} \Big\{ \frac{1}{n}\sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \lambda\, \Omega(f) \Big\}, \qquad (5)$$

where $\lambda > 0$ is a regularization parameter. Let us denote $\alpha^{(j)} = (\alpha^{(j)}_1, \dots, \alpha^{(j)}_n)^T$ and $\mathbf{K}^{(j)}_i = (K^{(j)}(x^{(j)}_1, x^{(j)}_i), \dots, K^{(j)}(x^{(j)}_n, x^{(j)}_i))^T$. The GroupSAM estimator in (5) can be rewritten as $f_{\mathbf{z}} = \sum_{j=1}^{d} f^{(j)}_{\mathbf{z}}$ with $f^{(j)}_{\mathbf{z}} = \sum_{t=1}^{n} \alpha^{(j)}_{\mathbf{z},t} K^{(j)}(x^{(j)}_t, \cdot)$, where

$$\{\alpha^{(j)}_{\mathbf{z}}\} = \arg\min_{\alpha^{(j)} \in \mathbb{R}^{n},\ 1 \le j \le d} \Big\{ \frac{1}{n}\sum_{i=1}^{n} \Big(1 - y_i \sum_{j=1}^{d} (\mathbf{K}^{(j)}_i)^T \alpha^{(j)} \Big)_+ + \lambda \sum_{j=1}^{d} \tau_j \|\alpha^{(j)}\|_2 \Big\}. \qquad (6)$$

The formulation (6) transforms the function-based learning problem (5) into a coefficient-based learning problem in a finite dimensional vector space. The solution of (5) is spanned naturally by the kernel functions $\{K^{(j)}(\cdot, x^{(j)}_i)\}$, rather than by B-spline basis functions [30]. When $d = 1$, our GroupSAM model degenerates to a special case that combines the LPSVM loss with a sparsity regularization term.
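Problem (6) is a convex program in the coefficient blocks $\{\alpha^{(j)}\}$: an averaged hinge loss plus a group-wise $\ell_{2,1}$ penalty. As a rough illustration of how such a problem can be attacked (a minimal sketch under our own assumptions, not the authors' implementation), the two ingredients are a subgradient of the hinge term with respect to each $\alpha^{(j)}$ and the block soft-thresholding map, which is the proximal operator of $\lambda \sum_j \tau_j \|\alpha^{(j)}\|_2$:

```python
import numpy as np

def group_soft_threshold(alphas, thresholds):
    """Proximal map of sum_j t_j * ||alpha^(j)||_2 (block soft-thresholding)."""
    out = []
    for a, t in zip(alphas, thresholds):
        norm = np.linalg.norm(a)
        out.append(np.zeros_like(a) if norm <= t else (1.0 - t / norm) * a)
    return out

def hinge_subgradient(K_list, alphas, y):
    """Subgradient of (1/n) sum_i (1 - y_i f(x_i))_+ w.r.t. each alpha^(j),
    where f(x_i) = sum_j (K^(j) alpha^(j))_i and K^(j) is the n x n Gram matrix
    of the j-th group kernel evaluated on the training inputs."""
    n = len(y)
    fx = sum(K @ a for K, a in zip(K_list, alphas))
    active = ((1.0 - y * fx) > 0).astype(float)   # samples violating the margin
    return [-(K @ (active * y)) / n for K in K_list]

def prox_step(K_list, alphas, y, lam, tau, step=0.1):
    """One plain (non-accelerated) proximal subgradient step for problem (6)."""
    grads = hinge_subgradient(K_list, alphas, y)
    tentative = [a - step * g for a, g in zip(alphas, grads)]
    return group_soft_threshold(tentative, [step * lam * t for t in tau])
```

An accelerated variant adds the usual momentum extrapolation on top of such a step; the fixed step size and the plain subgradient treatment of the hinge loss here are simplifying assumptions made for illustration only.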
Compared with LPSVM [22, 24] and SVM with additive kernels [5], our GroupSAM model imposes sparsity on variable groups to improve the interpretability of the additive classification model. For given $\{\tau_j\}$, the optimization problem of GroupSAM can be solved efficiently via the accelerated proximal gradient descent algorithm developed in [30]. Due to space limitations, we do not restate the optimization algorithm here.

3 Generalization error bound

In this section, we derive an estimate of the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f_{\mathbf{z}})) - \mathcal{R}(f_c)$. Before providing the main theoretical result, we introduce some necessary assumptions for the learning theory analysis.

Assumption A. The intrinsic distribution $\rho$ on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ satisfies the Tsybakov noise condition with exponent $0 \le q \le \infty$. That is, for some $q \in [0, \infty)$ and $c_q > 0$,

$$\rho_{\mathcal{X}}\big(\{x \in \mathcal{X} : |\mathrm{Prob}(y = 1|x) - \mathrm{Prob}(y = -1|x)| \le c_q t\}\big) \le t^{q}, \qquad \forall\, t > 0. \qquad (7)$$

The Tsybakov noise condition was proposed in [21] and has been used extensively in the theoretical analysis of classification algorithms [24, 7, 23, 20]. Indeed, (7) holds with exponent $q = 0$ for any distribution and with $q = \infty$ for well-separated classes.

Now we introduce the empirical covering numbers [8] to measure the capacity of the hypothesis space.

Definition 1. Let $\mathcal{F}$ be a set of functions on $\mathcal{Z}$ and $\mathbf{u} = \{u_t\}_{t=1}^{k} \subset \mathcal{Z}$. Define the $\ell_2$-empirical metric as $\ell_{2,\mathbf{u}}(f, g) = \big(\frac{1}{k}\sum_{t=1}^{k} (f(u_t) - g(u_t))^2\big)^{1/2}$. The covering number of $\mathcal{F}$ with respect to the $\ell_2$-empirical metric is defined as $\mathcal{N}_2(\mathcal{F}, \varepsilon) = \sup_{k \in \mathbb{N}} \sup_{\mathbf{u} \in \mathcal{Z}^{k}} \mathcal{N}_{2,\mathbf{u}}(\mathcal{F}, \varepsilon)$, where

$$\mathcal{N}_{2,\mathbf{u}}(\mathcal{F}, \varepsilon) = \inf\Big\{ l \in \mathbb{N} : \exists\, \{f_i\}_{i=1}^{l} \subset \mathcal{F} \ \text{s.t.}\ \mathcal{F} = \bigcup_{i=1}^{l} \{f \in \mathcal{F} : \ell_{2,\mathbf{u}}(f, f_i) \le \varepsilon\} \Big\}.$$

Let $B_r = \{f \in \mathcal{H} : \|f\|_K \le r\}$ and $B^{(j)}_r = \{f^{(j)} \in \mathcal{H}_{K^{(j)}} : \|f^{(j)}\|_{K^{(j)}} \le r\}$.

Assumption B. Assume that $\kappa = \sum_{j=1}^{d} \sup_{x^{(j)}} \sqrt{K^{(j)}(x^{(j)}, x^{(j)})} < \infty$ and that, for some $s \in (0, 2)$ and $c_s > 0$,

$$\log \mathcal{N}_2(B^{(j)}_1, \varepsilon) \le c_s \varepsilon^{-s}, \qquad \forall\, \varepsilon > 0,\ j \in \{1, \dots, d\}.$$

It has been asserted in [6] that, under Assumption B, the following holds:

$$\log \mathcal{N}_2(B_1, \varepsilon) \le c_s d^{1+s} \varepsilon^{-s}, \qquad \forall\, \varepsilon > 0.$$

It is worth noticing that empirical covering numbers have been studied extensively in the learning theory literature [8, 20]. Detailed examples are provided in Theorem 2 of [19], Lemma 3 of [18], and Examples 1 and 2 of [9]. The capacity condition of the additive hypothesis space depends only on the dimensions of the subspaces $\mathcal{X}^{(j)}$. When $K^{(j)} \in C^{\nu}(\mathcal{X}^{(j)} \times \mathcal{X}^{(j)})$ for every $j \in \{1, \dots, d\}$, the theoretical analysis in [19] assures that Assumption B holds true with

$$s = \begin{cases} \dfrac{2d_0}{d_0 + 2\nu}, & \nu \in (0, 1], \\[4pt] \dfrac{2d_0}{d_0 + \nu}, & \nu \in [1, 1 + d_0/2], \\[4pt] \dfrac{d_0}{\nu}, & \nu \in (1 + d_0/2, \infty), \end{cases}$$

where $d_0$ denotes the maximum dimension among the $\{\mathcal{X}^{(j)}\}$.

With respect to (3), we introduce the data-free regularized function $f_\eta$ defined by

$$f_\eta = \arg\min_{f = \sum_{j=1}^{d} f^{(j)} \in \mathcal{H}} \Big\{ \mathcal{E}(f) + \eta \sum_{j=1}^{d} \tau_j \|f^{(j)}\|_{K^{(j)}}^2 \Big\}. \qquad (8)$$

Inspired by the analysis in [6], we define

$$\mathcal{D}(\eta) = \mathcal{E}(f_\eta) - \mathcal{E}(f_c) + \eta \sum_{j=1}^{d} \tau_j \|f^{(j)}_\eta\|_{K^{(j)}}^2 \qquad (9)$$

as the approximation error, which reflects the learning ability of the hypothesis space $\mathcal{H}$ under the Tikhonov regularization scheme. The following approximation condition has been studied and used extensively for classification problems, e.g., in [3, 7, 24, 23]. Please see Examples 3 and 4 in [3] for explicit versions for the reproducing kernel Hilbert spaces induced by Sobolev kernels and Gaussian kernels.

Assumption C. There exist an exponent $\beta \in (0, 1)$ and a positive constant $c_\beta$ such that $\mathcal{D}(\eta) \le c_\beta \eta^{\beta}$ for all $\eta > 0$.

Now we introduce our main theoretical result on the generalization bound as follows.

Theorem 1. Let $0 < \min_j \tau_j \le \max_j \tau_j \le c_0 < \infty$ and let Assumptions A–C hold true. Take $\lambda = n^{-\theta}$ in (5) with $0 < \theta \le \min\big\{\frac{2-s}{2s}, \frac{\beta}{2-2\beta}\big\}$.
For any $\delta \in (0, 1)$, there exists a constant $C$ independent of $n$ and $\delta$ such that

$$\mathcal{R}(\mathrm{sgn}(f_{\mathbf{z}})) - \mathcal{R}(f_c) \le C \log(3/\delta)\, n^{-\vartheta}$$

with confidence $1 - \delta$, where

$$\vartheta = \min\Big\{ \frac{q+1}{q+2},\ \frac{\beta(2\theta+1)}{2\beta+2},\ \frac{(q+1)(2 - s - 2s\theta)}{4 + 2q + sq},\ \frac{3 + 5\beta + 2\beta\theta - 2\theta}{8} \Big\}.$$

Theorem 1 demonstrates that GroupSAM in (5) achieves a convergence rate with polynomial decay under mild conditions on the hypothesis function space. When $q \to \infty$, $\beta \to 1$, and each $K^{(j)} \in C^{\infty}$, the error decay rate of GroupSAM can be arbitrarily close to $O(n^{-\min\{1, \frac{1+2\theta}{4}\}})$. Hence, the fast convergence rate $O(n^{-1})$ can be obtained under proper parameter selections. To verify optimality of this bound, one would need a matching lower bound for the excess misclassification error; this is beyond the main focus of this paper and is left for future study. Additionally, the consistency of GroupSAM is guaranteed as the number of training samples increases.

Corollary 1. Under the conditions of Theorem 1, $\mathcal{R}(\mathrm{sgn}(f_{\mathbf{z}})) - \mathcal{R}(f_c) \to 0$ as $n \to \infty$.

To better understand our theoretical result, we compare it with related works below.

1) Comparison with group sparse additive models. Although the asymptotic theory of group sparse additive models has been well studied in [15, 10, 26], all of these works only consider the regression task under the mean squared error criterion and basis function expansions. Due to the kernel-based component functions and the non-smooth hinge loss, the previous analysis cannot be extended to GroupSAM directly.

2) Comparison with classification with additive models. In [30], a convergence rate is presented for the sparse additive machine (SAM), where the input space $\mathcal{X}$ is divided into $p$ subspaces directly without considering the interactions among variables. Different from the sparsity on variable groups in this paper, SAM is based on a sparse representation over an orthonormal basis, similar to [15]. In [5], the consistency of SVM with additive kernels is established, where the kernel-norm regularizer is used. However, sparsity on variables and the learning rate are not investigated in those previous articles.

3) Comparison with related analysis techniques. While the analysis technique used here is inspired by [24, 23], it is the first exploration for additive classification models with group sparsity. In particular, the hypothesis error analysis extends the stepping-stone technique from the $\ell_1$-norm regularizer to the group sparse $\ell_{2,1}$-norm regularizer. Our analysis technique can also be applied to other additive models. For example, we can extend the shrunk additive regression model in [11] to the sparse classification setting and investigate its generalization bound with the current technique.

Proof sketch of Theorem 1. To obtain a tight error estimate, we introduce the clipping operator $\pi(f)(x) = \max\{-1, \min\{f(x), 1\}\}$, which has been widely used in the learning theory literature, e.g., [7, 20, 24, 23]. Since $\mathcal{R}(\mathrm{sgn}(f_{\mathbf{z}})) - \mathcal{R}(f_c)$ can be bounded by $\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_c)$, we focus on bounding the excess convex risk. Using $f_\eta$ as an intermediate function, we obtain the following error decomposition.

Proposition 1. For $f_{\mathbf{z}}$ defined in (5), there holds

$$\mathcal{R}(\mathrm{sgn}(f_{\mathbf{z}})) - \mathcal{R}(f_c) \le \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_c) \le \mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3 + \mathcal{D}(\eta),$$

where $\mathcal{D}(\eta)$ is defined in (9) and

$$\mathcal{E}_1 = \big[\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_c)\big] - \big[\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(f_c)\big],$$
$$\mathcal{E}_2 = \big[\mathcal{E}_{\mathbf{z}}(f_\eta) - \mathcal{E}_{\mathbf{z}}(f_c)\big] - \big[\mathcal{E}(f_\eta) - \mathcal{E}(f_c)\big],$$
$$\mathcal{E}_3 = \big[\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \lambda\Omega(f_{\mathbf{z}})\big] - \Big[\mathcal{E}_{\mathbf{z}}(f_\eta) + \eta \sum_{j=1}^{d} \tau_j \|f^{(j)}_\eta\|_{K^{(j)}}^2\Big].$$

In the learning theory literature, $\mathcal{E}_1 + \mathcal{E}_2$ is called the sample error and $\mathcal{E}_3$ the hypothesis error. Detailed proofs for these error terms are provided in the supplementary materials.
The upper bound of the hypothesis error demonstrates that the divergence induced by the regularization and the hypothesis space tends to zero as $n \to \infty$ under properly selected parameters. To estimate the hypothesis error $\mathcal{E}_3$, we choose $\tilde{f}_{\mathbf{z}}$ as the stepping-stone function to bridge $\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \lambda\Omega(f_{\mathbf{z}})$ and $\mathcal{E}_{\mathbf{z}}(f_\eta) + \eta \sum_{j=1}^{d} \tau_j \|f^{(j)}_\eta\|_{K^{(j)}}^2$. The proof is inspired by the stepping-stone technique for support vector machine classification [24]. Notice that our analysis is associated with the $\ell_{2,1}$-norm regularizer, while the previous analysis only focuses on $\ell_1$-norm regularization.

The error term $\mathcal{E}_1$ reflects the divergence between the expected excess risk $\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_c)$ and the empirical excess risk $\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(f_c)$. Since $f_{\mathbf{z}}$ depends on the given sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{n}$, we use the concentration inequality in [23] to bound $\mathcal{E}_1$. We bound the error term $\mathcal{E}_2$ by the one-sided Bernstein inequality [7].

4 Experiments

To evaluate the performance of our proposed GroupSAM model, we compare it with the following methods: SVM (linear SVM with $\ell_2$-norm regularization), L1SVM (linear SVM with $\ell_1$-norm regularization), Gaussian SVM (nonlinear SVM using the Gaussian kernel), SAM (sparse additive machine) [30], and GroupSpAM (group sparse additive models) [26], which we adapt to the classification setting.

Table 2: Classification accuracy comparison on the synthetic data. The upper half shows the results with 24 feature groups, while the lower half corresponds to the results with 300 feature groups. The table shows the average classification accuracy and the standard deviation under 2-fold cross validation.

| | SVM | Gaussian SVM | L1SVM | SAM | GroupSpAM | GroupSAM |
|---|---|---|---|---|---|---|
| 24 groups, σ = 0.8 | 0.943 ± 0.011 | 0.935 ± 0.028 | 0.925 ± 0.035 | 0.895 ± 0.021 | 0.880 ± 0.021 | 0.953 ± 0.018 |
| 24 groups, σ = 0.85 | 0.943 ± 0.004 | 0.938 ± 0.011 | 0.938 ± 0.004 | 0.783 ± 0.088 | 0.868 ± 0.178 | 0.945 ± 0.000 |
| 24 groups, σ = 0.9 | 0.935 ± 0.014 | 0.925 ± 0.007 | 0.938 ± 0.011 | 0.853 ± 0.117 | 0.883 ± 0.011 | 0.945 ± 0.007 |
| 300 groups, σ = 0.8 | 0.975 ± 0.035 | 0.975 ± 0.035 | 0.975 ± 0.035 | 0.700 ± 0.071 | 0.275 ± 0.106 | 1.000 ± 0.000 |
| 300 groups, σ = 0.85 | 0.975 ± 0.035 | 0.975 ± 0.035 | 0.975 ± 0.035 | 0.600 ± 0.141 | 0.953 ± 0.004 | 1.000 ± 0.000 |
| 300 groups, σ = 0.9 | 0.975 ± 0.035 | 0.975 ± 0.035 | 0.975 ± 0.035 | 0.525 ± 0.035 | 0.983 ± 0.004 | 1.000 ± 0.000 |

As the evaluation metric, we calculate the classification accuracy, i.e., the percentage of correctly labeled samples in the prediction. For comparison, we adopt 2-fold cross validation and report the average performance of each method. We implement SVM, L1SVM, and Gaussian SVM using the LIBSVM toolbox [2]. We determine the hyper-parameter of each model, i.e., the parameter C of SVM, L1SVM, and Gaussian SVM, the parameter λ of SAM, the parameter λ of GroupSpAM, and the parameter λ in Eq. (6) of GroupSAM, in the range $\{10^{-3}, 10^{-2}, \dots, 10^{3}\}$. We tune the hyper-parameters via 2-fold cross validation on the training data and report the best parameter w.r.t. classification accuracy for each method. In the accelerated proximal gradient descent algorithm for both SAM and GroupSAM, we set µ = 0.5 and the maximum number of iterations to 2000.

4.1 Performance comparison on synthetic data

We first examine the classification performance on synthetic data as a sanity check. Our synthetic data is randomly generated as a mixture of Gaussian distributions. In each class, data points are sampled i.i.d. from a multivariate Gaussian distribution with covariance σI, where I is the identity matrix; this setting yields independent covariates. We set the number of classes to 4, the number of samples to 400, and the number of dimensions to 24.
We set the value of σ to 0.8, 0.85, and 0.9, respectively. Following the experimental setup in [31], we make three replicates of each feature in the data to form 24 feature groups (each group has three replicated features). We randomly pick 6 feature groups to generate the data so that we can evaluate the capability of GroupSAM in identifying the truly useful feature groups. To make the classification task more challenging, we add random noise drawn from the uniform distribution U(0, θ), where θ is 0.8 times the maximum value in the data. In addition, we test a high-dimensional case by generating 300 feature groups (i.e., a total of 900 features) with 40 samples in a similar manner.

We summarize the classification performance comparison on the synthetic data in Table 2. From the experimental results we notice that GroupSAM outperforms the other approaches under all settings. This comparison verifies the validity of our method. We can see that GroupSAM significantly improves upon SAM, which shows that the incorporation of group information is indeed beneficial for classification. Moreover, we notice the superiority of GroupSAM over GroupSpAM, which illustrates that our GroupSAM model is more suitable for classification.

We also present the comparison of feature groups in Table 3. For illustration purposes, we use the case with 24 feature groups as an example. Table 3 shows that the feature groups identified by GroupSAM are exactly the same as the ground-truth feature groups used for synthetic data generation. Such results further demonstrate the effectiveness of the GroupSAM method, which is able to select the truly informative feature groups and thus improve the classification performance.

Table 3: Comparison between the true feature group IDs (used for data generation) and the feature group IDs selected by our GroupSAM method on the synthetic data. The order of the true feature group IDs does not represent the order of importance.

| | True Feature Group IDs | Selected Feature Group IDs via GroupSAM |
|---|---|---|
| σ = 0.8 | 2, 3, 4, 8, 10, 17 | 3, 10, 17, 8, 2, 4 |
| σ = 0.85 | 1, 5, 10, 12, 17, 21 | 5, 12, 17, 21, 1, 10 |
| σ = 0.9 | 2, 6, 7, 9, 12, 22 | 6, 22, 7, 9, 2, 12 |

4.2 Performance comparison on benchmark data

In this subsection, we use seven benchmark datasets from the UCI repository [12] to compare the classification performance of the different methods: Ecoli, Indians Diabetes, Breast Cancer, Stock, Balance Scale, Contraceptive Method Choice (CMC), and Fertility. Similar to the setting for the synthetic data, we construct feature groups by replicating each feature three times. In each feature group, we add random noise drawn from the uniform distribution U(0, θ), where θ is 0.3 times the maximum value in each dataset.

Table 4: Classification accuracy comparison on the benchmark data. The table shows the average classification accuracy and the standard deviation under 2-fold cross validation.

| Dataset | SVM | Gaussian SVM | L1SVM | SAM | GroupSpAM | GroupSAM |
|---|---|---|---|---|---|---|
| Ecoli | 0.815 ± 0.054 | 0.818 ± 0.049 | 0.711 ± 0.051 | 0.816 ± 0.039 | 0.771 ± 0.009 | 0.839 ± 0.028 |
| Indians Diabetes | 0.651 ± 0.000 | 0.652 ± 0.002 | 0.638 ± 0.018 | 0.652 ± 0.000 | 0.643 ± 0.004 | 0.660 ± 0.013 |
| Breast Cancer | 0.968 ± 0.017 | 0.965 ± 0.017 | 0.833 ± 0.008 | 0.833 ± 0.224 | 0.958 ± 0.027 | 0.966 ± 0.014 |
| Stock | 0.913 ± 0.001 | 0.911 ± 0.002 | 0.873 ± 0.001 | 0.617 ± 0.005 | 0.875 ± 0.005 | 0.917 ± 0.005 |
| Balance Scale | 0.864 ± 0.003 | 0.869 ± 0.004 | 0.870 ± 0.003 | 0.763 ± 0.194 | 0.848 ± 0.003 | 0.893 ± 0.003 |
| CMC | 0.420 ± 0.011 | 0.445 ± 0.015 | 0.437 ± 0.014 | 0.427 ± 0.000 | 0.433 ± 0.003 | 0.456 ± 0.003 |
| Fertility | 0.880 ± 0.000 | 0.880 ± 0.000 | 0.750 ± 0.184 | 0.860 ± 0.028 | 0.780 ± 0.000 | 0.880 ± 0.000 |
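Before presenting the benchmark comparison results, the following short Python sketch illustrates the grouped-data construction used in both experiments (our own reconstruction for illustration; the exact class means, random seeds, and the mechanism by which the six informative feature groups enter the synthetic data are not specified above and are therefore assumptions):

```python
import numpy as np

def build_feature_groups(X, replicates=3, noise_scale=0.8, rng=None):
    """Replicate each original feature `replicates` times to form feature groups,
    then add uniform noise U(0, theta) with theta = noise_scale * max(X),
    as in the synthetic (noise_scale = 0.8) and benchmark (noise_scale = 0.3) setups."""
    rng = np.random.default_rng() if rng is None else rng
    X_rep = np.repeat(X, replicates, axis=1)     # p features -> p groups of `replicates` copies
    theta = noise_scale * X.max()
    groups = [list(range(j * replicates, (j + 1) * replicates)) for j in range(X.shape[1])]
    return X_rep + rng.uniform(0.0, theta, size=X_rep.shape), groups

# Synthetic setting: 4 Gaussian classes over p = 24 dimensions with covariance sigma * I,
# 400 samples in total; only 6 randomly chosen features (hence feature groups) carry class signal.
rng = np.random.default_rng(0)
sigma, p, n_per_class = 0.8, 24, 100
informative = rng.choice(p, size=6, replace=False)
means = np.zeros((4, p))
means[:, informative] = rng.normal(size=(4, 6))  # assumed form of the class means
X = np.vstack([rng.multivariate_normal(m, sigma * np.eye(p), size=n_per_class) for m in means])
y = np.repeat(np.arange(4), n_per_class)
X_grouped, groups = build_feature_groups(X, replicates=3, noise_scale=0.8, rng=rng)
```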
We display the comparison results in Table 4. We find that GroupSAM performs equal to or better than the compared methods on all benchmark datasets. Compared with SVM and L1SVM, our method uses an additive model to incorporate nonlinearity and is thus better suited to finding complex decision boundaries. Moreover, the comparison with Gaussian SVM and SAM illustrates that, by involving group information in classification, GroupSAM makes better use of the structure information among features such that its classification ability is enhanced. Compared with GroupSpAM, our GroupSAM model is formulated in data dependent hypothesis spaces and employs the hinge loss in the objective, and is thus more suitable for classification.

5 Conclusion

In this paper, we proposed a novel group sparse additive machine (GroupSAM) by incorporating group sparsity into the additive classification model in reproducing kernel Hilbert spaces. By developing an error analysis technique for data dependent hypothesis spaces, we obtained the generalization error bound of the proposed GroupSAM, which demonstrates that our model can achieve a satisfactory learning rate under mild conditions. Experimental results on both synthetic and real-world benchmark datasets validate the algorithmic effectiveness and support our learning theory analysis. In the future, it will be interesting to investigate the learning performance of robust group sparse additive machines with loss functions induced by quantile regression [6, 14].

Acknowledgments

This work was partially supported by U.S. NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, and NIH AG049371. Hong Chen was partially supported by the National Natural Science Foundation of China (NSFC) 11671161. We are grateful to the anonymous NIPS reviewers for their insightful comments.

References

[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. J. Amer. Statist. Assoc., 101(473):138–156, 2006.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(27):1–27, 2011.
[3] D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. J. Mach. Learn. Res., 5:1143–1175, 2004.
[4] H. Chen, Z. Pan, L. Li, and Y. Tang. Learning rates of coefficient-based regularized classifier for density level detection. Neural Comput., 25(4):1107–1121, 2013.
[5] A. Christmann and R. Hable. Consistency of support vector machines using additive kernels for additive models. Comput. Stat. Data Anal., 56:854–873, 2012.
[6] A. Christmann and D. X. Zhou. Learning rates for the risk of kernel-based quantile regression estimators in additive models. Anal. Appl., 14(3):449–477, 2016.
[7] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Univ. Press, Cambridge, U.K., 2007.
[8] D. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge Univ. Press, Cambridge, U.K., 1996.
[9] Z. Guo and D. X. Zhou. Concentration estimates for learning with unbounded sampling. Adv. Comput. Math., 38(1):207–223, 2013.
[10] J. Huang, J. Horowitz, and F. Wei. Variable selection in nonparametric additive models. Ann. Statist., 38(4):2282–2313, 2010.
[11] K. Kandasamy and Y. Yu. Additive approximation in high dimensional nonparametric regression via the SALSA. In ICML, 2016.
[12] M. Lichman. UCI machine learning repository, 2013.
[13] Y. Lin and H. H. Zhang. Component selection and smoothing in smoothing spline analysis of variance models. Ann. Statist., 34(5):2272–2297, 2006.
[14] S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. Ann. Statist., preprint, 2017.
[15] L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. Ann. Statist., 37(6B):3779–3821, 2009.
[16] G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13:389–427, 2012.
[17] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. J. Royal Statist. Soc. B, 71:1009–1030, 2009.
[18] L. Shi. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal., 34(2):252–265, 2013.
[19] L. Shi, Y. Feng, and D. X. Zhou. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal., 31(2):286–302, 2011.
[20] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[21] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135–166, 2004.
[22] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
[23] Q. Wu, Y. Ying, and D. X. Zhou. Multi-kernel regularized classifiers. J. Complexity, 23:108–134, 2007.
[24] Q. Wu and D. X. Zhou. SVM soft margin classifiers: linear programming versus quadratic programming. Neural Comput., 17:1160–1187, 2005.
[25] L. Yang, S. Lv, and J. Wang. Model-free variable selection in reproducing kernel Hilbert space. J. Mach. Learn. Res., 17:1–24, 2016.
[26] J. Yin, X. Chen, and E. Xing. Group sparse additive models. In ICML, 2012.
[27] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Royal Statist. Soc. B, 68(1):49–67, 2006.
[28] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. Ann. Statist., 44(6):2564–2593, 2016.
[29] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32:56–85, 2004.
[30] T. Zhao and H. Liu. Sparse additive machine. In AISTATS, 2012.
[31] L. W. Zhong and J. T. Kwok. Efficient sparse modeling with automatic feature grouping. In ICML, 2011.