# Sparse Shrunk Additive Models

Guodong Liu *1, Hong Chen *2, Heng Huang 1,3

*Equal contribution. 1Department of Electrical and Computer Engineering, University of Pittsburgh, PA, USA. 2Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan, China. 3JD Finance America Corporation, Mountain View, CA, USA. Correspondence to: Hong Chen, Guodong Liu.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Most existing feature selection methods in the literature are linear models, so the nonlinear relations between features and response variables are not considered. Meanwhile, in these feature selection models, the interactions between features are often ignored or only discussed under prior structure information. To address these challenging issues, we consider the problem of sparse additive models for high-dimensional nonparametric regression while allowing flexible interactions between features. A new method, called sparse shrunk additive models (SSAM), is proposed to explore the structure information among features. This method bridges sparse kernel regression and sparse feature selection. Theoretical results on the convergence rate and sparsity characteristics of SSAM are established using novel analysis techniques based on integral operators and concentration estimates. In particular, our algorithm and theoretical analysis only require the component functions to be continuous and bounded; they need not lie in reproducing kernel Hilbert spaces. Experiments on both synthetic and real-world data demonstrate the effectiveness of the proposed approach.

## 1. Introduction

Sparse feature selection has attracted much attention in the machine learning community for learning tasks with high-dimensional data, and is especially useful in bioinformatics-related applications. Linear models with $\ell_1$-norm regularization, such as Lasso (Tibshirani, 1996) and the Dantzig selector (Candes & Tao, 2007), have been well studied for their theoretical properties and extensively used in feature selection applications. However, in many applications, the linear assumption can be too restrictive to select the optimal features, because the relations between features and response variables may be nonlinear. Because of the difficulties in both computational algorithms and learning theory analysis, only a few existing feature selection methods in the literature focus on nonlinear feature selection.

To enhance the ability of feature selection models by considering nonlinear relationships between features and response variables, several sparse learning based additive models were proposed for regression (Ravikumar et al., 2009; Huang et al., 2010; Raskutti et al., 2012; Yuan & Zhou, 2016; Yin et al., 2012; Chen et al., 2020, in press) and classification (Zhao & Liu, 2012; Chen et al., 2017), which are extensions of the original additive models (Hastie & Tibshirani, 1990). Note that, in these additive models, each component function is either a univariate smooth function (Ravikumar et al., 2009; Huang et al., 2010; Raskutti et al., 2012; Yuan & Zhou, 2016; Zhao & Liu, 2012) or is defined on grouped features with prior structure information (Chen et al., 2017; Yin et al., 2012). Although these sparse additive models can conduct nonlinear feature selection, none of them explores the important feature interactions without prior structure information.
Recently, the shrunk additive least squares approximation (SALSA) (Kandasamy & Yu, 2016) method was introduced to utilize feature interactions, but without a feature selection mechanism. On the other hand, sparse sample selection arises in learning tasks with large-scale data. The generalized Lasso was proposed in (Roth, 2004) to handle the regression problem while addressing sample sparsity, and its learning theory has been studied in (Shi et al., 2011). Recently, Nyström approximation has been used for selecting important samples (landmark points) in kernel methods, which shows that the predictor can be derived efficiently from data dependent hypothesis spaces associated with subsamples (Kumar et al., 2012; Alaoui & Mahoney, 2015; Rudi et al., 2015). While some fast algorithms have been developed for sparse kernel regression, none of them is capable of feature selection or provides interpretability of the prediction.

To address the above challenges, in this paper we propose a novel sparse shrunk additive model (SSAM) for jointly selecting features and samples while learning the feature interactions and mining the structure information among features. Different from previous models, our new method simultaneously conducts sparse feature selection, sparse sample selection, and feature interaction learning. Our SSAM can utilize component functions from a general continuous and bounded function space (Sun & Wu, 2011; Chen et al., 2016) and can be implemented efficiently via the optimization technique in (Nesterov, 2013). More importantly, to better understand the learning theory properties of SSAM, we investigate its convergence rate and sparsity. The proposed SSAM involves a shrunk structure on features and $\ell_1$-norm regularization on data dependent hypothesis spaces. While these ingredients provide superior flexibility and adaptivity for SSAM, they also create new technical difficulties in characterizing its theoretical properties. To address these difficulties, we introduce a novel decomposition of the excess generalization error, and develop recent approximation techniques based on integral operators and concentration estimates with empirical covering numbers.

Our main contributions in this paper include:

- A sparse shrunk additive algorithm is proposed to improve the feature selection ability of nonlinear models. It is a unified framework bridging sparse feature selection, sparse sample selection, and feature interaction structure learning. SSAM can be implemented efficiently and its effectiveness is supported by empirical studies.
- A generalization bound on the excess risk is provided for SSAM under mild conditions, which implies that a fast convergence rate can be achieved. Additionally, a necessary and sufficient condition is derived to characterize the sparsity of SSAM.

## 2. Sparse Shrunk Additive Models

Let $\mathcal{X} \subset \mathbb{R}^n$ be an explanatory feature space and let $\mathcal{Y} \subseteq [-1, 1]$ be the response set. Let $\mathbf{z} := \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$ be independent copies of a random sample $(x, y)$ following an unknown intrinsic distribution $\rho$ on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$. Denote the marginal distribution of $\rho$ on $\mathcal{X}$ as $\rho_X$ and the conditional distribution for given $x \in \mathcal{X}$ as $\rho(\cdot|x)$. Given $\mathbf{z}$, the main goal of regression learning is to infer a functional relation between the input $x \in \mathcal{X}$ and the corresponding output $y \in \mathcal{Y}$. Usually, the expected risk associated with the least squares loss is used to evaluate the prediction performance, denoted by
$$\mathcal{E}(f) = \int_{\mathcal{Z}} (y - f(x))^2\, d\rho(x, y).$$
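For the least squares loss, a standard identity (a routine computation, recorded here because it is used implicitly throughout the error analysis) relates the excess risk to an $L^2$ distance: for any $f \in L^2_{\rho_X}$,
$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \int_{\mathcal{X}} \big(f(x) - f_\rho(x)\big)^2\, d\rho_X(x) = \|f - f_\rho\|^2_{L^2_{\rho_X}},$$
where $f_\rho$ denotes the regression function introduced next. In particular, $f_\rho$ minimizes $\mathcal{E}$, and the excess risk bounds below control the $L^2_{\rho_X}$ distance to it.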
In theory, the minimizer of $\mathcal{E}(f)$ over all measurable functions is the regression function $f_\rho(x) = \int_{\mathcal{Y}} y\, d\rho(y|x)$.

### 2.1. Sparse additive models

Additive models (Hastie & Tibshirani, 1990) aim to find the predictor in the special hypothesis space
$$\mathcal{F} = \Big\{ f : f(X) = \sum_{j=1}^n f_j(X_j),\ X = (X_1, ..., X_n) \in \mathcal{X} \Big\}.$$
Here, each $f_j \in \mathcal{F}_j$ is a one-dimensional smooth function; typical examples include the spline function and the Gaussian function. The optimization framework of the standard additive model is
$$\min_{\{f_j\}} \frac{1}{m}\sum_{i=1}^m \Big( y_i - \sum_{j=1}^n f_j(x_{ij}) \Big)^2. \tag{1}$$
Theoretical analysis of (1) shows that the good performance of additive models relies on the condition that the number of features $n$ is not large relative to the sample size $m$. The sparse additive models (SpAM) algorithm (Ravikumar et al., 2009) was proposed to address feature selection in the high-dimensional setting, and can be formulated as the following regularized framework:
$$\min_{\{f_j\}} \Big\{ \frac{1}{m}\sum_{i=1}^m \Big( y_i - \sum_{j=1}^n f_j(x_{ij}) \Big)^2 + \lambda \sum_{j=1}^n \|f_j\| \Big\}, \tag{2}$$
where $\lambda > 0$ is a regularization parameter and $\sum_{j=1}^n \|f_j\|$ behaves like an $\ell_1$ penalty across the different components to encourage functional sparsity (Ravikumar et al., 2009; Yin et al., 2012). SpAM (2) can be solved efficiently via the back-fitting algorithm (Hastie et al., 2001), and has been extended to group sparse additive regression (Huang et al., 2010; Raskutti et al., 2012; Yin et al., 2012).

### 2.2. Shrunk additive models

Although SpAM (2) has nice properties, it ignores the interactions between features. Recently, a novel method, called shrunk additive least squares approximation (SALSA), was proposed in (Kandasamy & Yu, 2016) and has shown satisfactory prediction performance. For any given $1 \le k \le n$, we denote by $d = \binom{n}{k}$ the number of index subsets of $\{1, 2, ..., n\}$ with $k$ elements. It is easy to see that $d = n$ when $k = 1$ and $d = \frac{n(n-1)}{2}$ when $k = 2$. Let $x^{(j)} \in \mathbb{R}^k$ be a subset of $x$ with $k$ features and denote its corresponding space as $\mathcal{X}^{(j)}$. Denote by $\mathcal{H}_{K^{(j)}}$ a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Schölkopf & Smola, 2001; Shawe-Taylor & Cristianini, 2004) associated with a symmetric and positive definite kernel $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$, $j \in \{1, ..., d\}$.

SALSA relies on the hypothesis space with additive kernels, defined by
$$\mathcal{H} = \Big\{ f = \sum_{j=1}^d f^{(j)} : f^{(j)} \in \mathcal{H}_{K^{(j)}},\ j = 1, 2, ..., d \Big\}.$$
Indeed, $(\mathcal{H}, \|\cdot\|_K)$ is also an RKHS for $K = \sum_{j=1}^d K^{(j)}$, where $\|f\|^2_K = \inf\{ \sum_{j=1}^d \|f^{(j)}\|^2_{K^{(j)}} : f = \sum_{j=1}^d f^{(j)} \}$ (Raskutti et al., 2012; Christmann & Zhou, 2016; Yuan & Zhou, 2016). Given training samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$, SALSA (Kandasamy & Yu, 2016) can be formulated as the following optimization problem:
$$f_{\mathbf{z}} = \arg\min_{f = \sum_{j=1}^d f^{(j)} \in \mathcal{H}} \Big\{ \frac{1}{m}\sum_{i=1}^m \Big( y_i - \sum_{j=1}^d f^{(j)}(x_i^{(j)}) \Big)^2 + \eta \sum_{j=1}^d \|f^{(j)}\|^2_{K^{(j)}} \Big\}, \tag{3}$$
where $\eta > 0$ is a regularization parameter. Remark 6 in (Kandasamy & Yu, 2016) tells us that the predictor of SALSA can be expressed as
$$f_{\mathbf{z}} = \sum_{j=1}^d f^{(j)}_{\mathbf{z}}, \qquad f^{(j)}_{\mathbf{z}} = \sum_{i=1}^m w_i K^{(j)}(x_i^{(j)}, \cdot), \quad w_i \in \mathbb{R}.$$
It has also been demonstrated that SALSA in (3) can be considered as kernel ridge regression with shrunk features and additive kernels (Kandasamy & Yu, 2016). Despite nice theoretical and empirical analysis, SALSA does not address the sparsity of shrunk features. For high dimensional data, sparsity on the shrunk features is usually beneficial for exploring the structure information among features, which improves the interpretability of the learning model.

### 2.3. New sparse shrunk additive models

To improve the sparsity of SALSA, we propose a new algorithm, named sparse shrunk additive models (SSAM).
Some sparse methods (e.g., Lasso (Tibshirani, 1996) and the kernelized Lasso (Roth, 2004)) can be considered as special cases of our new model. Interestingly, SSAM is also a natural but nontrivial extension of sparse regularized regression in data dependent hypothesis spaces (Shi et al., 2011; Sun & Wu, 2011; Feng et al., 2016). For any given training samples $\mathbf{z}$, we introduce the following data dependent hypothesis space:
$$\mathcal{H}_{\mathbf{z}} = \Big\{ f : f(x) = \sum_{j=1}^d f^{(j)}(x^{(j)}),\ f^{(j)} \in \mathcal{H}^{(j)}_{\mathbf{z}} \Big\}, \tag{4}$$
where $\mathcal{H}^{(j)}_{\mathbf{z}} = \{ f^{(j)} = \sum_{i=1}^m \alpha^{(j)}_i K^{(j)}(x^{(j)}_i, \cdot) : \alpha^{(j)}_i \in \mathbb{R} \}$ and $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ is a continuous function satisfying $\|K^{(j)}\|_\infty < +\infty$. Without loss of generality, this paper assumes $\|K^{(j)}\|_\infty \le 1$ for each $1 \le j \le d$. The predictor of SSAM can be expressed as
$$f_{\mathbf{z}} = \sum_{j=1}^d f^{(j)}_{\mathbf{z}} = \sum_{j=1}^d \sum_{t=1}^m \hat{\alpha}^{(j)}_t K^{(j)}(x^{(j)}_t, \cdot),$$
where, for $1 \le t \le m$ and $1 \le j \le d$,
$$\{\hat{\alpha}^{(j)}_t\} = \arg\min_{\alpha^{(j)}_t \in \mathbb{R},\, \forall t, j} \Big\{ \frac{1}{m}\sum_{i=1}^m \Big( y_i - \sum_{j=1}^d \sum_{t=1}^m \alpha^{(j)}_t K^{(j)}(x^{(j)}_t, x^{(j)}_i) \Big)^2 + \lambda \sum_{j=1}^d \sum_{t=1}^m |\alpha^{(j)}_t| \Big\}. \tag{5}$$
Let $\alpha^{(j)} = (\alpha^{(j)}_1, ..., \alpha^{(j)}_m)^T \in \mathbb{R}^m$ and $K^{(j)}_i = (K^{(j)}(x^{(j)}_1, x^{(j)}_i), ..., K^{(j)}(x^{(j)}_m, x^{(j)}_i))^T \in \mathbb{R}^m$. Denoting $K_i = ((K^{(1)}_i)^T, ..., (K^{(d)}_i)^T)^T \in \mathbb{R}^{md}$ and $\alpha = ((\alpha^{(1)})^T, ..., (\alpha^{(d)})^T)^T \in \mathbb{R}^{md}$, we can see that $\sum_{j=1}^d (K^{(j)}_i)^T \alpha^{(j)} = K_i^T \alpha$. Moreover, by denoting $Y = (y_1, y_2, ..., y_m)^T \in \mathbb{R}^m$ and $\mathbf{K} = (K_1, ..., K_m)^T \in \mathbb{R}^{m \times md}$, we have
$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^{md}} \Big\{ \frac{1}{m}\|Y - \mathbf{K}\alpha\|_2^2 + \lambda \|\alpha\|_1 \Big\}. \tag{6}$$
Moreover, for $j \in \{1, ..., d\}$ and $q \in \{1, 2\}$, define
$$\|f^{(j)}\|^q_{\ell_q} = \inf\Big\{ m^{q-1}\sum_{t=1}^m |\alpha^{(j)}_t|^q : f^{(j)} = \sum_{t=1}^m \alpha^{(j)}_t K^{(j)}(x^{(j)}_t, \cdot) \Big\}$$
and $\|f\|^q_{\ell_q} := \sum_{j=1}^d \|f^{(j)}\|^q_{\ell_q}$ for $f = \sum_{j=1}^d f^{(j)}$. Then we can formulate SSAM from the viewpoint of function approximation as
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_{\mathbf{z}}} \Big\{ \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2 + \lambda \|f\|_{\ell_1} \Big\}. \tag{7}$$
Apart from the additive structure on $\mathcal{H}_{\mathbf{z}}$, (7) is consistent with the sparse kernel machine in data dependent hypothesis spaces (Roth, 2004; Shi et al., 2011). SSAM can be transformed into other methods by explicit choices of $k$ and $K^{(j)}$. When $k = 1$ and $K^{(j)}$ is the linear kernel on a single feature, our model is equivalent to Lasso (Tibshirani, 1996). When $k = n$ and $K^{(j)}(x^{(j)}, \tilde{x}^{(j)}) = K(x, \tilde{x})$, SSAM can be considered as the kernelized Lasso (Roth, 2004).

Different from SALSA (Kandasamy & Yu, 2016), our SSAM is based on general kernels, which need not be Mercer kernels. Moreover, our SSAM not only handles regression prediction by using the interactions between features, but also explores the structure of shrunk features for model selection. The previous SALSA only works for the prediction task.

### 2.4. Comparisons with related methods

Now we provide some comparisons of SSAM in (5) with related regularized methods, including kernel ridge regression (KRR), the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996), the kernelized Lasso (KLasso) (Roth, 2004; Sun & Wu, 2015), the additive model with kernel regularization (KAM) (Christmann & Zhou, 2016), sparse additive models (SpAM) (Ravikumar et al., 2009), the component selection and smoothing operator (COSSO) (Lin & Zhang, 2006), and shrunk additive least squares approximation (SALSA) (Kandasamy & Yu, 2016). A brief summary is presented in Table 1, covering the component function, the regularizer on each component, sample/feature sparsity, feature interaction, and the number of additive components. From Table 1, we see that SSAM bridges sparse kernel regression and sparse additive models. In theory, SSAM not only exploits the interactions among features for prediction, but also handles the selection of features and samples simultaneously.
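To make the matrix form (6) concrete, the following minimal sketch builds the stacked design matrix $\mathbf{K} \in \mathbb{R}^{m \times md}$ from Gaussian kernels on all $k$-subsets of features and solves the $\ell_1$ problem. It uses scikit-learn's coordinate-descent Lasso as a stand-in for the accelerated proximal gradient solver employed in the paper (its objective matches (6) up to a constant rescaling of $\lambda$); the kernel choice, bandwidth, and function names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of (6): stack the per-subset kernel blocks into a design
# matrix of size m x (m*d) and solve the l1-regularized least squares problem.
import itertools
import numpy as np
from sklearn.linear_model import Lasso


def gaussian_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def ssam_fit(X, y, k=2, lam=0.1, bandwidth=1.0):
    """Fit SSAM-style coefficients alpha^(j)_t over all k-subsets of features."""
    m, n = X.shape
    subsets = [list(s) for s in itertools.combinations(range(n), k)]   # d = C(n, k) shrunk groups
    blocks = [gaussian_kernel(X[:, s], X[:, s], bandwidth) for s in subsets]  # each K^(j) is m x m
    K = np.hstack(blocks)                                              # stacked design, shape (m, m*d)
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=50000)      # stand-in l1 solver
    model.fit(K, y)
    alpha = model.coef_.reshape(len(subsets), m)                       # row j holds alpha^(j)
    return subsets, alpha


def ssam_predict(X_train, X_test, subsets, alpha, bandwidth=1.0):
    """Evaluate f(x) = sum_j sum_t alpha^(j)_t K^(j)(x^(j)_t, x^(j))."""
    pred = np.zeros(X_test.shape[0])
    for s, a in zip(subsets, alpha):
        pred += gaussian_kernel(X_test[:, s], X_train[:, s], bandwidth) @ a
    return pred
```

With $k = 1$ and a linear kernel the same construction collapses to a Lasso-type problem, and the per-group sums $\sum_t |\hat{\alpha}^{(j)}_t|$ read off from `alpha` are the quantities used for feature selection in Section 5.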
In particular, the selection of shrunk features can be used to characterize the structure among features, which is essentially different from grouping features under prior knowledge (Yin et al., 2012). By introducing shrunk features, the proposed SSAM encourages grouped features to be selected simultaneously, while previous sparse additive models (Meier et al., 2009; Huang et al., 2010) usually select features individually. Indeed, as shown in (Bach, 2008), the nonparametric group Lasso can be seen as a variable selection method in a generalized additive model, and can also be seen as equivalent to learning a convex combination of kernels, a framework referred to as multiple kernel learning (MKL). The link between the group Lasso and MKL is established in (Bach, 2008) based on the works in (Bach et al., 2004; Lanckriet et al., 2004). However, there are key differences between our SSAM and MKL (or the group Lasso in (Bach, 2008)):

1) Hypothesis space (continuous and bounded function space vs. RKHS). The proposed SSAM only requires the component functions to be continuous and bounded; they need not lie in reproducing kernel Hilbert spaces (RKHS). That is, we consider the generalized kernel-based hypothesis space (Shi et al., 2011; Sun & Wu, 2011; Chen et al., 2016), which need not be associated with the positive definite kernels used in (Bach, 2008).

2) Regularization ($\ell_1$-norm with data-dependent hypothesis space vs. Hilbert norm with data-independent RKHS). We use the $\ell_1$-norm on coefficients, which is different from the Hilbert norm used in the nonparametric group Lasso (Bach, 2008). From the function approximation point of view, we find the prediction function in data dependent hypothesis spaces (Shi et al., 2011; Sun & Wu, 2011; Chen et al., 2016; Feng et al., 2016) with sparsity restrictions on samples and features simultaneously (via the $\ell_1$-norm). In contrast, the nonparametric group Lasso (Bach, 2008) is associated with a data independent RKHS and only addresses feature sparsity. In addition, the kernelized Lasso (Roth, 2004) only focuses on sample sparsity since it does not consider the decomposition of the input variables.

3) Learning theory (error bound based on integral operator approximation and concentration estimates with empirical covering numbers vs. consistency based on covariance operator analysis). According to 1) and 2), the theoretical analysis for MKL (e.g., (Bach et al., 2004; Lanckriet et al., 2004)) or the group Lasso (Bach, 2008) does not carry over to our approach under our mild restrictions on the component functions. As studied in (Shi et al., 2011; Sun & Wu, 2011; Chen et al., 2016; Feng et al., 2016), learning theory analysis is much more difficult for data-dependent hypothesis spaces with generalized kernels. In this paper, we overcome this difficulty by developing and integrating the integral operator approximation (Smale & Zhou, 2007; Sun & Wu, 2011; Shi, 2013) and concentration estimation with empirical covering numbers (Wu et al., 2007; Shi et al., 2011).

## 3. Theoretical Analysis

We begin this section with some necessary definitions and assumptions used in our analysis. Let $L^2_{\rho_{X^{(j)}}}$ be the space of square-integrable functions on $\mathcal{X}^{(j)}$ with respect to the marginal distribution $\rho_{X^{(j)}}$. For each $j \in \{1, 2, ..., d\}$ and $f \in L^2_{\rho_{X^{(j)}}}$, define the integral operator $L_{K^{(j)}} : L^2_{\rho_{X^{(j)}}} \to L^2_{\rho_{X^{(j)}}}$ as
$$L_{K^{(j)}}(f)(x^{(j)}) = \int_{\mathcal{X}^{(j)}} K^{(j)}(x^{(j)}, u)\, f(u)\, d\rho_{X^{(j)}}(u).$$
Define
$$\tilde{K}^{(j)}(x^{(j)}, \tilde{x}^{(j)}) = \int_{\mathcal{X}^{(j)}} K^{(j)}(x^{(j)}, u)\, K^{(j)}(\tilde{x}^{(j)}, u)\, d\rho_{X^{(j)}}(u).$$
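Symmetry of $\tilde{K}^{(j)}$ is immediate from this definition, and its positive semi-definiteness follows from a one-line calculation (consistent with the verification cited next): for any points $x^{(j)}_1, ..., x^{(j)}_N \in \mathcal{X}^{(j)}$ and coefficients $c_1, ..., c_N \in \mathbb{R}$,
$$\sum_{s,t=1}^N c_s c_t\, \tilde{K}^{(j)}(x^{(j)}_s, x^{(j)}_t) = \int_{\mathcal{X}^{(j)}} \Big( \sum_{s=1}^N c_s K^{(j)}(x^{(j)}_s, u) \Big)^2 d\rho_{X^{(j)}}(u) \ge 0.$$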
It has been verified in (Sun & Wu, 2011) that $\tilde{K}^{(j)}$ is a Mercer kernel and that $L_{\tilde{K}^{(j)}} = L_{K^{(j)}} L^T_{K^{(j)}} : L^2_{\rho_{X^{(j)}}} \to L^2_{\rho_{X^{(j)}}}$ is a self-adjoint positive operator with decreasing eigenvalues $\{\lambda^{(j)}_t\}_{t=1}^\infty$ and eigenfunctions $\{\psi^{(j)}_t\}_{t=1}^\infty$, where $\{\psi^{(j)}_t\}_{t=1}^\infty$ form an orthonormal basis of $L^2_{\rho_{X^{(j)}}}$. For given $r > 0$, define the $r$-th power $L^r_{\tilde{K}^{(j)}}$ of $L_{\tilde{K}^{(j)}}$ by
$$L^r_{\tilde{K}^{(j)}}\Big( \sum_t c_{j,t}\, \psi^{(j)}_t \Big) = \sum_t c_{j,t}\, (\lambda^{(j)}_t)^r\, \psi^{(j)}_t, \qquad (c_{j,t})_{t \in \mathbb{N}} \in \ell^2.$$

Table 1. Properties of kernel methods and additive models (a check mark means the given formulation or information is used; a cross means it is not available).

| Property | KRR | KLasso | Lasso | KAM | SpAM | COSSO | SALSA | SSAM |
|---|---|---|---|---|---|---|---|---|
| Component function | RKHS | continuous | linear | RKHS | Hilbert | spline | RKHS | continuous |
| Regularization | K-norm | $\ell_1$-norm | $\ell_1$-norm | K-norm | $\ell_{2,1}$-norm | $\ell_{2,1}$-norm | K-norm | $\ell_1$-norm |
| Sparsity (sample) | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Sparsity (feature) | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |
| Feature interaction | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ |
| Component number | 1 | 1 | $n$ | $n$ | $n$ | $\sum_{k=1}^{d}\binom{n}{k}$ | $\binom{n}{k}$* | $\binom{n}{k}$* |

K-norm := kernel norm. *The number can be reduced largely by incorporating prior information about the features.

Assumption 1. Assume that $f_\rho = \sum_{j=1}^d f^{(j)}_\rho$, where for each $j \in \{1, 2, ..., d\}$, $f^{(j)}_\rho : \mathcal{X}^{(j)} \to \mathbb{R}$ is a function of the form $f^{(j)}_\rho = L^r_{\tilde{K}^{(j)}}(g^{(j)}_\rho)$ with some $r > 0$ and $g^{(j)}_\rho \in L^2_{\rho_{X^{(j)}}}$.

This regularity condition on $f_\rho$ has been studied for coefficient-based regularized regression with general kernels (Sun & Wu, 2011; Shi, 2013). For additive models with Mercer kernels, a similar assumption has been introduced in (Christmann & Zhou, 2016). We also need a Lipschitz-type continuity condition on each kernel $K^{(j)}$. This condition has been studied extensively in the learning theory of kernel methods, e.g., (Shi et al., 2011; Shi, 2013). In particular, the Gaussian kernel satisfies it.

Assumption 2. For each $j \in \{1, 2, ..., d\}$, the kernel function $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ is $C^s$ with some $s > 0$, satisfying
$$|K^{(j)}(u, v) - K^{(j)}(u, v')| \le c_s \|v - v'\|^s_2, \qquad \forall\, u, v, v' \in \mathcal{X}^{(j)}$$
for some positive constant $c_s$.

From the definition of $f_\rho$ and $\mathcal{Y} \subseteq [-1, 1]$, we know that $|f_\rho(x)| \le 1$ for any $x \in \mathcal{X}$. Thus, we can utilize the following projection operator to obtain tight error estimates, a standard technique in error analysis (Cucker & Zhou, 2007; Steinwart et al., 2009).

Definition 1. The projection operator $\pi$ is defined on the space of measurable functions $f : \mathcal{X} \to \mathbb{R}$ as $\pi(f)(x) = \max\{-1, \min\{f(x), 1\}\}$.

Based on the kernel smoothness index $s$ in Assumption 2 and the subset size $k$, define
$$p = \begin{cases} 2k/(k + 2s), & s \in (0, 1];\\ 2k/(k + 2), & s \in (1, 1 + k/2];\\ k/s, & s \in (1 + k/2, \infty). \end{cases} \tag{8}$$

Our first theoretical result is an upper bound on $\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho)$.

Theorem 1. Let Assumptions 1 and 2 hold. For any $0 < \delta < 1$, with confidence $1 - \delta$, there exists a positive constant $c_1$ independent of $m, \delta$ such that:
(1) If $r \in (0, \frac{1}{2})$ in Assumption 1, setting $\lambda = m^{-\theta_1}$ with $\theta_1 \in (0, \frac{2}{2+p})$,
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le c_1 \log(8/\delta)\, m^{-\gamma_1},$$
where $\gamma_1 = \min\big\{ 2r\theta_1,\ \frac{1 - \theta_1 + 2r\theta_1}{2},\ \frac{2}{2+p} - (2 - 2r)\theta_1,\ \frac{2(1 - p\theta_1)}{2+p} \big\}$.
(2) If $r \ge \frac{1}{2}$ in Assumption 1, taking $\lambda = m^{-\theta_2}$ with some $\theta_2 \in (0, \frac{2}{2+p})$,
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le c_1 \log(8/\delta)\, m^{-\gamma_2},$$
where $\gamma_2 = \min\big\{ \theta_2,\ \frac{1}{2},\ \frac{2}{2+p} - \theta_2 \big\}$.

Theorem 1 provides an upper bound on the generalization error of SSAM with Lipschitz continuous kernels. For $r \in (0, \frac{1}{2})$, as $s \to \infty$ (so that $p \to 0$), we have $\gamma_1 \to \min\{ 2r\theta_1,\ \frac{1 - \theta_1 + 2r\theta_1}{2},\ 1 - 2\theta_1 + 2r\theta_1 \}$. Moreover, when $r \ge \frac{1}{2}$, taking $\theta_2 = \frac{1}{2+p}$ yields the convergence rate $O(m^{-\frac{1}{2+p}})$, so the rate $O(m^{-\frac{1}{2}})$ can be reached as $s \to \infty$.

The following result concerns the special case where $f^{(j)}_\rho \in \mathcal{H}^{(j)}$.

Theorem 2. Assume that $f^{(j)}_\rho \in \mathcal{H}^{(j)}$ for each $1 \le j \le d$. Take $\lambda = m^{-\frac{2}{2+3p}}$ in (5). For any $0 < \delta < 1$, with confidence $1 - \delta$ we have
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le c_2 \log(1/\delta)\, m^{-\frac{2}{2+3p}},$$
where $c_2$ is a positive constant independent of $m, \delta$, and $p$ is defined in (8).
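As a quick check of how the rate in Theorem 2 behaves for very smooth kernels: a $C^\infty$ kernel satisfies Assumption 2 for every $s > 1 + k/2$, so by (8) $p = k/s \to 0$ as $s \to \infty$, and hence
$$m^{-\frac{2}{2+3p}} \longrightarrow m^{-1}.$$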
Under this stronger condition on $f_\rho$, the convergence rate can therefore be made arbitrarily close to $O(m^{-1})$ as $s \to \infty$.

Now we summarize comparisons with related convergence analyses of additive models with feature interactions. For SALSA (Kandasamy & Yu, 2016), a convergence rate with polynomial decay is also obtained under a mild condition on $f_\rho$; different from our work, that analysis is limited to Mercer kernels and the error is expressed in expectation. For the generalized SpAM in (Tyagi et al., 2016), the theoretical analysis demonstrates its effectiveness in estimating the underlying component functions, which provides stronger guarantees than a generalization bound; however, the condition on $\mathcal{H}^{(j)}$ is much more restrictive than for SSAM. For the fixed design setting, the COSSO estimator (Lin & Zhang, 2006) has a convergence rate $O(m^{-\frac{s}{2s+1}})$, where $s$ is the order of smoothness of the components in a Sobolev space. It can be seen from Theorem 2 that the faster learning rate of SSAM is reached when $K \in C^\infty$. In the future, it is natural to extend the current result from uniform boundedness to unbounded sampling using the analysis techniques in (Steinwart et al., 2009; Wang & Zhou, 2011; Guo & Zhou, 2013).

Besides its generalization ability, SSAM also promotes sparsity on features and samples by employing the $\ell_1$ regularization. The sparsity of SSAM can be characterized as follows.

Theorem 3. For $t \in \{1, 2, ..., m\}$ and $j \in \{1, 2, ..., d\}$, $\hat{\alpha}^{(j)}_t = 0$ if and only if
$$\Big| \frac{2}{m}\sum_{i=1}^m (y_i - f_{\mathbf{z}}(x_i))\, K^{(j)}(x^{(j)}_t, x^{(j)}_i) \Big| < \lambda.$$

Theorem 3 provides a necessary and sufficient condition for the zero pattern of $\hat{\alpha}$. Following the discussion in (Shi et al., 2011), Theorem 3 also implies a probabilistic confidence bound ensuring the sparsity of $\hat{\alpha}^{(j)}_t$ in (5). In particular, for the fixed design setting, sparsity recovery may be achieved by adding further conditions (Feng et al., 2016; Yang et al., 2016). We leave this for future study.

## 4. Proof Sketches

The proofs of Theorems 1 and 2 involve an integration of techniques for error analysis with integral operator approximation (Smale & Zhou, 2007; Sun & Wu, 2011; Shi, 2013; Nie & Wang, 2015) and the empirical process theory for analyzing kernel methods (Pinelis, 1994; Wu et al., 2007; Christmann & Zhou, 2016). The proof of Theorem 3 follows the analysis technique for sparse characterization (Shi et al., 2011; Sun & Wu, 2015).

The key to bounding $\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho)$ is a novel error decomposition in which some intermediate functions are constructed as stepping stones. We then bound the decomposed terms in terms of operator approximation and concentration inequalities for empirical processes. From Proposition 1 in (Shi, 2013), we know that $L^T_{K^{(j)}} = U L^{\frac{1}{2}}_{\tilde{K}^{(j)}}$ and $L_{K^{(j)}} = L^{\frac{1}{2}}_{\tilde{K}^{(j)}} U^T$ for each $j \in \{1, 2, ..., d\}$, where $U$ is a partial isometry on $L^2_{\rho_{X^{(j)}}}$ with $U^T U$ being the orthogonal projection onto the RKHS $\mathcal{H}_{\tilde{K}^{(j)}}$. For any $j \in \{1, 2, ..., d\}$, define the intermediate function $f^{(j)}_\lambda$ by
$$f^{(j)}_\lambda = \arg\min_{f^{(j)} \in L^2_{\rho_{X^{(j)}}}} \Big\{ \|L_{K^{(j)}} f^{(j)} - f^{(j)}_\rho\|^2_{L^2_{\rho_{X^{(j)}}}} + \lambda \|U^T f^{(j)}\|^2_{L^2_{\rho_{X^{(j)}}}} \Big\}.$$
Denote $f_\lambda = \sum_{j=1}^d f^{(j)}_\lambda$ and $g_\lambda = \sum_{j=1}^d g^{(j)}_\lambda$ with $g^{(j)}_\lambda = L_{K^{(j)}} f^{(j)}_\lambda$. Define the empirical version of $g_\lambda$ as
$$\hat{g}_\lambda(x) = \frac{1}{m}\sum_{i=1}^m \sum_{j=1}^d f^{(j)}_\lambda(x^{(j)}_i)\, K^{(j)}(x^{(j)}_i, x^{(j)}), \qquad x \in \mathcal{X}. \tag{10}$$
Now we give the following error decomposition.

Proposition 1. For $f_{\mathbf{z}}$ and $\hat{g}_\lambda$ defined in (5) and (10), respectively, there holds
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3,$$
where
$$\mathcal{E}_1 = \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \mathcal{E}_{\mathbf{z}}(\hat{g}_\lambda) - \mathcal{E}(\hat{g}_\lambda),$$
$$\mathcal{E}_2 = \mathcal{E}(\hat{g}_\lambda) - \mathcal{E}(g_\lambda) + \lambda \|\hat{g}_\lambda\|_{\ell_1},$$
$$\mathcal{E}_3 = \mathcal{E}(g_\lambda) - \mathcal{E}(f_\rho).$$

Proof.
According to the definition of $f_{\mathbf{z}}$, we have
$$\begin{aligned} \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) &\le \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \mathcal{E}_{\mathbf{z}}(\hat{g}_\lambda) - \mathcal{E}(f_\rho) + \lambda \|\hat{g}_\lambda\|_{\ell_1} \\ &\quad + \Big\{ \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) + \lambda \|f_{\mathbf{z}}\|_{\ell_1} - \big( \mathcal{E}_{\mathbf{z}}(\hat{g}_\lambda) + \lambda \|\hat{g}_\lambda\|_{\ell_1} \big) \Big\} \\ &\le \mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z}})) + \mathcal{E}_{\mathbf{z}}(\hat{g}_\lambda) - \mathcal{E}(f_\rho) + \lambda \|\hat{g}_\lambda\|_{\ell_1}. \end{aligned} \tag{11}$$
Moreover,
$$\mathcal{E}_{\mathbf{z}}(\hat{g}_\lambda) - \mathcal{E}(f_\rho) = \big( \mathcal{E}(\hat{g}_\lambda) - \mathcal{E}(g_\lambda) \big) + \mathcal{E}(g_\lambda) - \mathcal{E}(f_\rho) + \mathcal{E}_{\mathbf{z}}(\hat{g}_\lambda) - \mathcal{E}(\hat{g}_\lambda). \tag{12}$$
Combining (11) and (12) gives the desired decomposition.

The error term $\mathcal{E}_1$ measures the divergence between the empirical risk and the corresponding expected risk, which is usually called the sample error in learning theory. In terms of recent theoretical progress on learning with data dependent hypothesis spaces (Shi et al., 2011; Shi, 2013; Feng et al., 2016), we can bound the sample error $\mathcal{E}_1$ via concentration inequalities associated with empirical covering numbers (Wu et al., 2007; Christmann & Zhou, 2016). The error term $\mathcal{E}_2$ reflects the drift risk of learning with the hypothesis spaces $\mathcal{H}_{\mathbf{z}}$ and $\mathcal{H}$, and hence is called the hypothesis error. By relating $\mathcal{E}(\hat{g}_\lambda) - \mathcal{E}(g_\lambda)$ to $\sum_{j=1}^d \|\hat{g}^{(j)}_\lambda - g^{(j)}_\lambda\|_{L^2_{\rho_{X^{(j)}}}}$, we can estimate this hypothesis error through a concentration inequality in Hilbert spaces (Pinelis, 1994; Smale & Zhou, 2007). The error term $\mathcal{E}_3$ is called the approximation error, which describes the approximation ability of the regularized scheme. Following the approximation analysis with integral operators in (Smale & Zhou, 2007; Shi, 2013; Nie & Wang, 2015), we derive the upper bound of $\mathcal{E}_3$ based on the properties of $L_{\tilde{K}^{(j)}}$, $1 \le j \le d$. Detailed technical proofs are provided in the supplementary materials.

## 5. Experimental Results

This section presents the empirical evaluation of SSAM. We first introduce the experimental setups, following (Kandasamy & Yu, 2016), and then validate SSAM's ability for feature selection and regression prediction. We consider SSAM in the pairwise interaction setting and set $k = 2$, $d = \binom{n}{2}$. Similar to (Kandasamy & Yu, 2016), each kernel on $\mathcal{X}^{(j)}$ is generated from the Gaussian kernel. For example, when $x^{(j)}_s = (x_{s1}, x_{s2})$ and $x^{(j)}_t = (x_{t1}, x_{t2})$, the shrunk kernel is
$$K^{(j)}(x^{(j)}_s, x^{(j)}_t) = \exp\Big\{ -\frac{(x_{s1} - x_{t1})^2}{2\mu_1^2} \Big\} \exp\Big\{ -\frac{(x_{s2} - x_{t2})^2}{2\mu_2^2} \Big\},$$
where $\mu_i = 4.5\,\sigma_i\, m^{-\frac{1}{10}}$ and $\sigma_i$ is the standard deviation of the $i$-th coordinate. The regularization parameter $\lambda$ is chosen via five-fold cross validation with respect to the mean squared error (MSE). We implement SSAM via accelerated proximal gradient methods (Nesterov, 2013) to obtain the coefficient vector $\hat{\alpha}$. For sparse representation and feature selection, we compute $\sum_{t=1}^m |\hat{\alpha}^{(j)}_t|$ for the $j$-th pairwise feature group and then select the informative shrunk features (see the sketch below). For synthetic data, we compare our model with COSSO (Lin & Zhang, 2006) to validate our motivation for feature selection. For real-world benchmark data, we compare the MSE of SSAM with SALSA (Kandasamy & Yu, 2016), COSSO (Lin & Zhang, 2006), SpAM (Ravikumar et al., 2009), and Lasso (Tibshirani, 1996).

### 5.1. Experiments with synthetic data

Following the ideas in (Lin & Zhang, 2006; Yin et al., 2012), we use two different types of data to evaluate the model selection ability of SSAM. The first type of synthetic data has at most one informative pairwise feature, and the second has at least two. Since SALSA does not address the selection of shrunk features, we compare the performance of SSAM with COSSO (Lin & Zhang, 2006). As shown in Table 1, COSSO is based on component functions on both single and pairwise input features.
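The following minimal sketch collects the experiment-specific pieces described above for $k = 2$: the per-coordinate bandwidth rule, the pairwise product Gaussian kernel, and the per-group score $\sum_t |\hat{\alpha}^{(j)}_t|$ used to rank shrunk features. The bandwidth formula mirrors $\mu_i = 4.5\,\sigma_i\, m^{-1/10}$ as stated in the text; `alpha` is assumed to be the $(d, m)$ coefficient array returned by a solver for (6), and all helper names are illustrative rather than the authors' code.

```python
# Pairwise product Gaussian kernel, bandwidth rule, and the per-group
# feature score used for selecting shrunk features in the experiments.
import numpy as np


def bandwidths(X):
    """Per-coordinate bandwidths mu_i = 4.5 * sigma_i * m^(-1/10)."""
    m = X.shape[0]
    return 4.5 * X.std(axis=0) * m ** (-0.1)


def pairwise_gaussian_kernel(Xa, Xb, mu):
    """Product of per-coordinate Gaussian kernels on a feature group."""
    K = np.ones((Xa.shape[0], Xb.shape[0]))
    for c in range(Xa.shape[1]):
        diff = Xa[:, c][:, None] - Xb[:, c][None, :]
        K *= np.exp(-diff ** 2 / (2.0 * mu[c] ** 2))
    return K


def top_tau_groups(alpha, subsets, tau):
    """Rank shrunk feature groups by sum_t |alpha^(j)_t| and keep the top tau."""
    scores = np.abs(alpha).sum(axis=1)        # one score per pairwise group j
    order = np.argsort(scores)[::-1]
    return [subsets[j] for j in order[:tau]]
```

Precision@$\tau$ in the tables below then counts how many truly informative features fall in the selected groups.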
Generating synthetic data: We generate the $n$-dimensional inputs $x_i = (x_{i1}, x_{i2}, ..., x_{in})^T$ with $x_{ij} = \frac{W_{ij} + \eta U_i}{1 + \eta}$ and $n = 10$, where $W$ and $U$ are sampled from independent uniform distributions on $[-0.5, 0.5]$. The parameter $\eta$ controls the magnitude of the correlation: the inputs are independent if $\eta = 0$ and correlated if $\eta = 1$.

Example set I: We apply SSAM with 100 training samples to three underlying functions (a: simple additive model, b: simple pairwise interaction model, c: multi-way interaction model). For $x_t = (x_{t1}, x_{t2}, ..., x_{tn})^T$,

a. $f^*(x_t) = x_{t1} + x_{t2} + x_{t3} + \exp(-x_{t4})$

b. $f^*(x_t) = (2x_{t1} - 1)(2x_{t2} - 1)$

c. $f^*(x_t) = (2\sin(x_{t1}) - 1)(2\sin(x_{t2}) - 1)(2\sin(x_{t3}) - 1)(2\sin(x_{t4}) - 1)$

Example set II: We also apply SSAM with 100 training samples to more complicated interaction models (e: overlapped pairwise interaction, f: independent pairwise interaction, g: circle-related pairwise interaction):

e. $f^*(x_t) = (2\sin(x_{t1}) - 1)(2\sin(x_{t2}) - 1) + \sin(x_{t1})\sin(x_{t3})$

f. $f^*(x_t) = 2\exp(x_{t1} + x_{t2} + 0.2) + 2\exp^{-1}(x_{t3} + x_{t4})$

g. $f^*(x_t) = (2x_{t1} - 1)(2x_{t2} - 1) + (2x_{t2} - 1)(2x_{t3} - 1) + (2x_{t1} - 1)(2x_{t3} - 1)$

The final output is $y = f^*(x) + \epsilon$, where $\epsilon \sim N(0, 0.25)$. For each example, we perform feature selection according to the values of $\sum_{t=1}^{100} |\hat{\alpha}^{(j)}_t|$ for $j \in \{1, ..., 45\}$. Precision@$\tau$, the number of truly informative features among the top-$\tau$ selected results, is used to measure feature selection performance.

Table 2. Precision@$\tau$ for feature selection.

(a) Synthetic data I

| $f^*$ | $(m, n, \eta)$ | $\tau$ | SSAM | COSSO |
|---|---|---|---|---|
| a | (100, 10, 0) | 4 | 3.88 | 3.69 |
| a | (100, 10, 0) | 5 | 3.92 | 3.81 |
| a | (100, 10, 0) | 6 | 3.93 | 3.85 |
| a | (100, 10, 1) | 4 | 3.37 | 2.58 |
| a | (100, 10, 1) | 5 | 3.68 | 2.80 |
| a | (100, 10, 1) | 6 | 3.82 | 2.91 |
| b | (100, 10, 0) | 1 | 0.97 | 1 |
| b | (100, 10, 0) | 2 | 0.97 | 1 |
| b | (100, 10, 0) | 3 | 0.97 | 1 |
| b | (100, 10, 1) | 1 | 0.95 | 0.62 |
| b | (100, 10, 1) | 2 | 0.95 | 0.65 |
| b | (100, 10, 1) | 3 | 0.98 | 0.68 |
| c | (100, 10, 0) | 4 | 3.94 | 0.63 |
| c | (100, 10, 0) | 5 | 3.97 | 0.68 |
| c | (100, 10, 0) | 6 | 3.97 | 0.75 |
| c | (100, 10, 1) | 4 | 3.69 | 0.84 |
| c | (100, 10, 1) | 5 | 3.87 | 0.91 |
| c | (100, 10, 1) | 6 | 3.92 | 0.94 |

(b) Synthetic data II

| $f^*$ | $(m, n, \eta)$ | $\tau$ | SSAM | COSSO |
|---|---|---|---|---|
| e | (100, 10, 0) | 2 | 1.05 | 0.73 |
| e | (100, 10, 0) | 3 | 1.13 | 0.90 |
| e | (100, 10, 0) | 4 | 1.20 | 0.90 |
| e | (100, 10, 1) | 2 | 1.04 | 0.13 |
| e | (100, 10, 1) | 3 | 1.10 | 0.16 |
| e | (100, 10, 1) | 4 | 1.12 | 0.20 |
| f | (100, 10, 0) | 2 | 0.72 | 0.88 |
| f | (100, 10, 0) | 3 | 0.93 | 1 |
| f | (100, 10, 0) | 4 | 1.23 | 1 |
| f | (100, 10, 1) | 2 | 1.90 | 0.94 |
| f | (100, 10, 1) | 3 | 1.94 | 0.94 |
| f | (100, 10, 1) | 4 | 1.95 | 0.97 |
| g | (100, 10, 0) | 3 | 2.94 | 2.98 |
| g | (100, 10, 0) | 4 | 2.94 | 2.98 |
| g | (100, 10, 0) | 5 | 2.94 | 3 |
| g | (100, 10, 1) | 3 | 2.85 | 2.14 |
| g | (100, 10, 1) | 4 | 2.85 | 2.40 |
| g | (100, 10, 1) | 5 | 2.85 | 2.49 |

Tables 2(a) and 2(b) report the average Precision@$\tau$ over 100 repetitions. In most cases, SSAM performs better than COSSO for feature selection. In particular, SSAM behaves more stably than COSSO on the complicated models (e.g., c and g) and on dependent features.

### 5.2. Experiments with real-world benchmark data

We compare the prediction performance of SSAM with the most closely related additive models on eight data sets, under the same experimental setups as (Kandasamy & Yu, 2016). The data sets are from the UCI repository (http://archive.ics.uci.edu/ml) and (Tu et al., 2012): Insulin ($n = 50$, $m = 256$), Skillcraft ($n = 18$, $m = 1700$), Airfoil ($n = 40$, $m = 750$), Forestfire ($n = 10$, $m = 211$), Housing ($n = 12$, $m = 256$), CCPP ($n = 59$, $m = 2000$), Music ($n = 90$, $m = 1000$), and Telemonit ($n = 19$, $m = 1000$).

Table 3. Average MSE on real data.

| | SSAM | SALSA | COSSO | SpAM | Lasso |
|---|---|---|---|---|---|
| Insulin | 1.0146 | 1.0206 | 1.1379 | 1.2035 | 1.1103 |
| Skillcraft | 0.5432 | 0.5470 | 0.5551 | 0.90545 | 0.6650 |
| Airfoil | 0.4866 | 0.5176 | 0.5178 | 0.9623 | 0.5199 |
| Forestfire | 0.3477 | 0.3530 | 0.3753 | 0.9694 | 0.5193 |
| Housing | 0.3787 | 0.2642 | 1.3097 | 0.8165 | 0.4452 |
| CCPP | 0.0694 | 0.0678 | 0.9684 | 0.0647 | 0.0740 |
| Music | 0.6295 | 0.6251 | 0.7982 | 0.7683 | 0.6349 |
| Telemonit | 0.0689 | 0.0347 | 5.7192 | 0.8643 | 0.0863 |

As shown in Table 3, over the eight benchmark datasets our SSAM obtains the best result on four, the second best on three, and the third best on the remaining one.
The experimental results show that SSAM has performance comparable to SALSA, even though only pairwise interaction features are used. As shown in (Kandasamy & Yu, 2016), SALSA is competitive with many nonparametric and parametric models (but it cannot perform feature selection). Therefore, SSAM is effective for regression prediction in addition to its capacity for sparse feature selection.

### 5.3. More experimental results

In response to reviewer comments on scalability, we ran additional experiments on simulated data in a larger-scale setting (20,000 samples, with the other settings unchanged). The average results (over 20 repetitions) in Table 4 demonstrate that SSAM scales well in this setting. One reviewer suggested comparing with more methods besides COSSO; we added new comparison results on simulated data with SpAM and another recent method, RMR (Wang et al., 2017), in Table 6. The new results also verify the effectiveness of the proposed method. In addition, we added new experimental results on real data with RMR in Table 5; these results also show that the proposed method performs better.

Table 4. Precision@$\tau$ for feature selection.

| $(m, n, \eta)$ | $\tau$ | SSAM |
|---|---|---|
| (20000, 10, 0) | 4 | 3.95 |
| (20000, 10, 0) | 5 | 4.00 |
| (20000, 10, 0) | 6 | 4.00 |
| (20000, 10, 1) | 4 | 3.90 |
| (20000, 10, 1) | 5 | 3.90 |
| (20000, 10, 1) | 6 | 4.00 |
| (20000, 10, 0) | 4 | 3.90 |
| (20000, 10, 0) | 5 | 4.00 |
| (20000, 10, 0) | 6 | 4.00 |
| (20000, 10, 1) | 4 | 3.70 |
| (20000, 10, 1) | 5 | 4.00 |
| (20000, 10, 1) | 6 | 4.00 |
| (20000, 10, 0) | 3 | 2.90 |
| (20000, 10, 0) | 4 | 2.95 |
| (20000, 10, 0) | 5 | 3.00 |
| (20000, 10, 1) | 3 | 2.85 |
| (20000, 10, 1) | 4 | 2.85 |

Table 5. Average MSE on real data.

| | SSAM | RMR |
|---|---|---|
| Insulin | 1.0146 | 1.0198 |
| Skillcraft | 0.5432 | 0.6486 |
| Airfoil | 0.4866 | 0.5314 |
| Forestfire | 0.3477 | 0.3765 |
| Housing | 0.3787 | 0.4375 |
| CCPP | 0.0694 | 0.0667 |
| Music | 0.6295 | 0.6210 |
| Telemonit | 0.0689 | 0.0824 |

Table 6. Precision@$\tau$ for feature selection.

(a) Synthetic data I

| $f^*$ | $(m, n, \eta)$ | $\tau$ | SSAM | SpAM | RMR |
|---|---|---|---|---|---|
| a | (100, 10, 0) | 4 | 3.88 | 3.85 | 3.81 |
| a | (100, 10, 0) | 5 | 3.92 | 3.90 | 3.95 |
| a | (100, 10, 0) | 6 | 3.93 | 3.97 | 3.97 |
| a | (100, 10, 1) | 4 | 3.37 | 3.50 | 2.64 |
| a | (100, 10, 1) | 5 | 3.68 | 3.61 | 3.02 |
| a | (100, 10, 1) | 6 | 3.82 | 4.00 | 3.06 |
| b | (100, 10, 0) | 1 | 0.97 | 0.94 | 1.00 |
| b | (100, 10, 0) | 2 | 0.97 | 1.00 | 2.00 |
| b | (100, 10, 0) | 3 | 0.97 | 1.00 | 2.00 |
| b | (100, 10, 1) | 1 | 0.95 | 0.94 | 0.91 |
| b | (100, 10, 1) | 2 | 0.95 | 1.00 | 0.91 |
| b | (100, 10, 1) | 3 | 0.98 | 1.00 | 0.98 |
| c | (100, 10, 0) | 4 | 3.94 | 3.93 | 3.55 |
| c | (100, 10, 0) | 5 | 3.97 | 3.96 | 3.71 |
| c | (100, 10, 0) | 6 | 3.97 | 3.98 | 3.81 |
| c | (100, 10, 1) | 4 | 3.69 | 3.54 | 2.82 |
| c | (100, 10, 1) | 5 | 3.87 | 3.83 | 3.20 |
| c | (100, 10, 1) | 6 | 3.92 | 3.40 | 3.46 |

(b) Synthetic data II

| $f^*$ | $(m, n, \eta)$ | $\tau$ | SSAM | SpAM | RMR |
|---|---|---|---|---|---|
| e | (100, 10, 0) | 2 | 1.05 | 1.00 | 1.67 |
| e | (100, 10, 0) | 3 | 1.13 | 1.17 | 1.96 |
| e | (100, 10, 0) | 4 | 1.20 | 1.19 | 2.13 |
| e | (100, 10, 1) | 2 | 1.04 | 1.00 | 1.26 |
| e | (100, 10, 1) | 3 | 1.10 | 1.13 | 1.64 |
| e | (100, 10, 1) | 4 | 1.12 | 1.30 | 2.97 |
| f | (100, 10, 0) | 2 | 0.72 | 0.89 | 1.83 |
| f | (100, 10, 0) | 3 | 0.93 | 1.00 | 2.37 |
| f | (100, 10, 0) | 4 | 1.23 | 1.00 | 2.97 |
| f | (100, 10, 1) | 2 | 1.90 | 2.00 | 1.13 |
| f | (100, 10, 1) | 3 | 1.94 | 2.00 | 1.66 |
| f | (100, 10, 1) | 4 | 1.95 | 2.00 | 1.93 |
| g | (100, 10, 0) | 3 | 2.94 | 2.90 | 3.00 |
| g | (100, 10, 0) | 4 | 2.94 | 2.98 | 3.00 |
| g | (100, 10, 0) | 5 | 2.94 | 3.00 | 3.00 |
| g | (100, 10, 1) | 3 | 2.85 | 2.80 | 2.50 |
| g | (100, 10, 1) | 4 | 2.85 | 2.82 | 2.72 |
| g | (100, 10, 1) | 5 | 2.85 | 3.00 | 2.84 |

## 6. Conclusion

In this paper, we proposed a unified scheme for nonlinear feature and sample selection under additive models. Learning theory analysis has been provided to demonstrate the convergence and sparsity properties of SSAM, involving novel analysis techniques with integral operators and concentration estimation. Experiments on both synthetic and real-world datasets support the effectiveness of our new model.

## Acknowledgements

We thank the anonymous reviewers for their helpful comments. G.L. and H.H. were partially supported by U.S. NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, and IIS 1837956. H.C. was partially supported by the National Natural Science Foundation of China (NSFC) under Grant 11671161.

## References

Alaoui, A. and Mahoney, M. W. Fast randomized kernel ridge regression with statistical guarantees. In NIPS, 2015.

Aronszajn, N. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

Bach, F. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179-1225, 2008.
Bach, F., Lanckriet, G. R. G., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.

Candes, E. and Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313-2351, 2007.

Chen, H., Xia, H., Cai, W., and Huang, H. Error analysis of generalized Nyström kernel regression. In NIPS, 2016.

Chen, H., Wang, X., Deng, C., and Huang, H. Group sparse additive machine. In NIPS, 2017.

Chen, H., Wang, Y., Zheng, F., Deng, C., and Huang, H. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, 2020, in press.

Christmann, A. and Zhou, D. X. Learning rates for the risk of kernel-based quantile regression estimators in additive models. Analysis and Applications, 14(3):449-477, 2016.

Cucker, F. and Zhou, D. X. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.

Feng, Y., Lv, S., Hang, H., and Suykens, J. A. Kernelized elastic net regularization: Generalization bounds, and sparse recovery. Neural Comput., 28(3):525-562, 2016.

Guo, Z. and Zhou, D.-X. Concentration estimates for learning with unbounded sampling. Adv. Comput. Math., 38(1):207-223, 2013.

Hastie, T. and Tibshirani, R. Generalized Additive Models, volume 43. Chapman and Hall, London, 1990.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.

Huang, J., Horowitz, J. L., and Wei, F. Variable selection in nonparametric additive models. Ann. Statist., 38(4):2282-2313, 2010.

Kandasamy, K. and Yu, Y. Additive approximation in high dimensional nonparametric regression via the SALSA. In ICML, 2016.

Kumar, S., Mohri, M., and Talwalkar, A. Sampling methods for the Nyström method. J. Mach. Learn. Res., 13:981-1006, 2012.

Lanckriet, G. R. G., Cristianini, N., El Ghaoui, L., Bartlett, P., and Jordan, M. I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004.

Lin, Y. and Zhang, H. H. Component selection and smoothing in smoothing spline analysis of variance models. Ann. Statist., 34(5):2272-2297, 2006.

Meier, L., van de Geer, S., and Bühlmann, P. High-dimensional additive modeling. Ann. Statist., 37(6B):3779-3821, 2009.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

Nie, W. and Wang, C. Constructive analysis for coefficient regularization regression algorithms. Journal of Mathematical Analysis and Applications, 431(2):1153-1171, 2015.

Pinelis, I. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pp. 1679-1706, 1994.

Raskutti, G., Wainwright, M. J., and Yu, B. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13:389-427, 2012.

Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models. J. Royal Statist. Soc. B, 71(5):1009-1030, 2009.

Roth, V. The generalized Lasso. IEEE Trans. Neural Networks, 15(1):16-28, 2004.

Rudi, A., Camoriano, R., and Rosasco, L. Less is more: Nyström computational regularization. In NIPS, pp. 1657-1665, 2015.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Shi, L. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal., 34(2):252-265, 2013.
Shi, L., Feng, Y., and Zhou, D. X. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal., 31(2):286-302, 2011.

Smale, S. and Zhou, D. X. Learning theory estimates via integral operators and their approximations. Constr. Approx., 26(2):153-172, 2007.

Steinwart, I., Hush, D., and Scovel, C. Optimal rates for regularized least squares regression. In COLT, 2009.

Sun, H. and Wu, Q. Least square regression with indefinite kernels and coefficient regularization. Appl. Comput. Harmon. Anal., 30(1):96-109, 2011.

Sun, H. and Wu, Q. Sparse representation in kernel machines. IEEE Transactions on Neural Networks and Learning Systems, 26(10):2576-2582, 2015.

Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267-288, 1996.

Tu, Z. et al. Integrative analysis of a cross-loci regulation network identifies App as a gene regulating insulin secretion from pancreatic islets. PLoS Genetics, 8(12):e1003107, 2012.

Tyagi, H., Kyrillidis, A., Gärtner, B., and Krause, A. Learning sparse additive models with interactions in high dimensions. In AISTATS, 2016.

Wang, C. and Zhou, D.-X. Optimal learning rates for least squares regularized regression with unbounded sampling. J. Complexity, 27(1):55-67, 2011.

Wang, X., Chen, H., Cai, W., Shen, D., and Huang, H. Regularized modal regression with applications in cognitive impairment prediction. In NIPS, pp. 1448-1458, 2017.

Wu, Q., Ying, Y., and Zhou, D.-X. Multi-kernel regularized classifiers. J. Complexity, 23(1):108-134, 2007.

Yang, L., Lv, S., and Wang, J. Model-free variable selection in reproducing kernel Hilbert space. J. Mach. Learn. Res., 17:1-24, 2016.

Yin, J., Chen, X., and Xing, E. P. Group sparse additive models. In ICML, 2012.

Yuan, M. and Zhou, D. X. Minimax optimal rates of estimation in high dimensional additive models. Ann. Statist., 44(6):2564-2593, 2016.

Zhao, T. and Liu, H. Sparse additive machine. In AISTATS, pp. 1435-1443, 2012.