Journal of Machine Learning Research 20 (2019) 1-44. Submitted 3/13; Revised 8/18; Published 10/19.

Sparse Kernel Regression with Coefficient-based ℓq Regularization

Lei Shi (mastone1983@gmail.com), Shanghai Key Laboratory for Contemporary Applied Mathematics, School of Mathematical Sciences, Fudan University, Shanghai, P. R. China

Xiaolin Huang (xiaolinhuang@sjtu.edu.cn), Institute of Image Processing and Pattern Recognition, Institute of Medical Robotics, Shanghai Jiao Tong University, MOE Key Laboratory of System Control and Information Processing, Shanghai, P. R. China

Yunlong Feng (ylfeng@albany.edu), Department of Mathematics and Statistics, State University of New York at Albany, New York, USA

Johan A.K. Suykens (johan.suykens@esat.kuleuven.be), Department of Electrical Engineering, ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, Leuven, B-3001, Belgium

Editor: Mikhail Belkin

Abstract

In this paper, we consider ℓq regularized kernel regression with 0 < q ≤ 1. In form, the algorithm minimizes a least-square loss functional plus a coefficient-based ℓq penalty term over the linear span of features generated by a kernel function. We study the asymptotic behavior of the algorithm under the framework of learning theory. The contribution of this paper is two-fold. First, we derive a tight bound on the ℓ2 empirical covering numbers of the function space involved in the error analysis. Based on this result, we obtain convergence rates for ℓ1 regularized kernel regression that are the best so far. Second, for the case 0 < q < 1, we show that the regularization parameter plays the role of a trade-off between sparsity and convergence rates. Under some mild conditions, the fraction of non-zero coefficients in a local minimizer of the algorithm tends to 0 at a polynomial decay rate as the sample size m becomes large. As the concerned algorithm is non-convex, we also discuss how to generate a minimizing sequence iteratively, which can help us search for a local minimizer around any initial point.

Keywords: Learning Theory, Kernel Regression, Coefficient-based ℓq-regularization (0 < q ≤ 1), Sparsity, ℓ2-empirical Covering Number

©2019 Lei Shi, Xiaolin Huang, Yunlong Feng, and Johan A.K. Suykens. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/13-124.html.

1. Introduction and Main Results

The regression problem aims at estimating functional relations from random samples and occurs in various statistical inference applications. An output estimator of a regression algorithm is usually expressed as a linear combination of features, i.e., of a collection of candidate functions. As an important issue in learning theory and methodologies, sparsity focuses on studying the sparse representations of such linear combinations resulting from the algorithms. It is widely known that an ideal way to obtain the sparsest representation is to penalize the combination coefficients by the ℓ0 norm. However, algorithms based on the ℓ0 norm often lead to an NP-hard discrete optimization problem (see e.g., Natarajan (1995)), which motivates researchers to consider the ℓq norm (0 < q ≤ 1) as a substitute. In particular, ℓ1 norm constrained or penalized algorithms have achieved great success in a wide range of areas, from signal recovery (Candès et al. (2006)) to variable selection in statistics (Tibshirani (1996)).
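As a quick illustration of why an ℓq penalty with q < 1 promotes sparsity more aggressively than the ℓ1 norm (a worked example of ours, not taken from the paper), compare the sparse vector w = (1, 0, 0) with the dense vector w' = (1/3, 1/3, 1/3). Both have the same ℓ1 norm, ‖w‖_1 = ‖w'‖_1 = 1, so an ℓ1 penalty does not distinguish them. For q = 1/2, however,

\sum_n |w_n|^{1/2} = 1 \quad \text{while} \quad \sum_n |w'_n|^{1/2} = 3 \cdot (1/3)^{1/2} = \sqrt{3} \approx 1.73,

so the ℓq penalty strictly prefers the sparse representation; as q decreases toward 0 this preference approaches that of the ℓ0 penalty.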
Recently, several theoretical and experimental results (see e.g., Candès et al. (2008); Chartrand (2007); Fan and Li (2001); Saab and Yılmaz (2010); Rakotomamonjy et al. (2011); Xu et al. (2012)) suggest that the ℓq norm with q ∈ (0, 1) yields sparser solutions than the ℓ1 norm while still producing accurate estimates. Due to the intensive study of compressed sensing (see e.g., Donoho (2006)), algorithms involving the ℓq norm (0 < q ≤ 1) have drawn much attention in the last few years and have been used in various applications, including image denoising, medical reconstruction and database updating.

In this paper, we focus on ℓq regularized kernel regression. In form, the algorithm minimizes a least-square loss functional plus a coefficient-based ℓq penalty term over the linear span of features generated by a kernel function. We shall establish a rigorous mathematical analysis of the asymptotic behavior of the algorithm under the framework of learning theory.

Let X be a compact subset of R^d, Y ⊆ R, and let ρ be a Borel probability distribution on Z = X × Y. For f : X → Y and (x, y) ∈ Z, the least-square loss (f(x) − y)^2 gives the error incurred when f is used as a model for the process producing y from x. The resulting target function is called the regression function and satisfies

f_\rho = \arg\min \Big\{ \int_Z (f(x) - y)^2 \, d\rho \; : \; f : X \to Y \ \text{measurable} \Big\}.

From Proposition 1.8 in Cucker and Zhou (2007), the regression function can be given explicitly by

f_\rho(x) = \int_Y y \, d\rho(y|x), \qquad x \in X,   (1)

where ρ(·|x) is the conditional probability measure induced by ρ at x. In the supervised learning framework, ρ is unknown and one estimates f_ρ based on a set of observations z = {(x_i, y_i)}_{i=1}^m ∈ Z^m which is assumed to be drawn independently according to ρ. We additionally suppose that ρ(·|x) is supported on [−M, M] for some M ≥ 1 and each x ∈ X. This uniform boundedness assumption on the output is standard in most of the learning theory literature (see e.g., Zhang (2003); Smale and Zhou (2007); Mendelson and Neeman (2010); Wu et al. (2006)). Throughout the paper, we will use these assumptions without further reference.

Usually one obtains an estimator of f_ρ by minimizing the empirical loss functional (1/m) \sum_{i=1}^m (f(x_i) - y_i)^2 over a hypothesis space, i.e., a pre-selected function set on X. In kernel regression, the hypothesis space is generated by a kernel function K : X × X → R. Recall that {x_i}_{i=1}^m is the input data of the observations. The hypothesis space considered here is taken to be the linear span of the set {K_{x_i}}_{i=1}^m, where for t ∈ X we denote by K_t the function K_t : X → R, x ↦ K(x, t). Let 0 < q ≤ 1. The output estimator of ℓq regularized kernel regression is given by \hat f_q = \sum_{i=1}^m c^z_{q,i} K_{x_i}, where its coefficient sequence c^z_q = (c^z_{q,i})_{i=1}^m satisfies

c^z_q = \arg\min_{c \in R^m} \Big\{ \frac{1}{m} \sum_{j=1}^m \Big( \sum_{i=1}^m c_i K(x_j, x_i) - y_j \Big)^2 + \gamma \|c\|_q^q \Big\}.   (2)

Here γ > 0 is called a regularization parameter and ‖c‖_q denotes the ℓq norm of c. Recall that for any 0 < q ≤ 1 and any sequence w = (w_n)_{n=1}^∞, the ℓ0 norm and the ℓq norm are defined respectively as

\|w\|_0 = \sum_{n=1}^\infty I(w_n \neq 0) \quad \text{and} \quad \|w\|_q = \Big( \sum_{n \in \mathrm{supp}(w)} |w_n|^q \Big)^{1/q},

where I(·) is the indicator function and supp(w) := {n ∈ N : w_n ≠ 0} denotes the support set of w. Strictly speaking, ‖·‖_0 is not a real norm and ‖·‖_q merely defines a quasi-norm when 0 < q < 1 (e.g., see Conway (2000)). From an approximation viewpoint, we are in fact seeking a function that approximates f_ρ from the function set spanned by the kernelized dictionary {K_{x_i}}_{i=1}^m. The kernelized dictionary, together with its induced learning models, has been considered previously in the literature.
In a supervised regression setting, to pursue a sparse nonlinear regression machine, Roth (2004) proposed the ℓ1-norm regularized learning model induced by the kernelized dictionary, namely, the kernelized Lasso. It is indeed a basis pursuit method, the idea of which can be dated back to Chen and Donoho (1994); Girosi (1998). It was in Wu and Zhou (2008) that a framework for analyzing the generalization bounds of learning models induced by the kernelized dictionary was proposed. The idea behind it is to control the complexity of the hypothesis space and then investigate the approximation ability as well as the data-fitting risk of functions in this hypothesis space via approximation and concentration techniques, which is a typical learning theory approach. Following this line, a series of interesting studies have been carried out for various learning models induced by the kernelized dictionary. For example, probabilistic generalization bounds for different models were derived in Shi et al. (2011); Wang et al. (2012); Shi (2013); Lin et al. (2014); Feng et al. (2016) and many others. However, it is worth pointing out that one should not simply treat the kernelized dictionary as a commonly used dictionary, because learning models induced by the kernelized dictionary may possess more flexibility. For the kernelized dictionary, the positive semi-definite constraint on the kernel function is removed. Removing the positive semi-definite constraint allows us to utilize specific indefinite kernels to cope with real-world applications, see e.g., Schleif and Tino (2015). Moreover, {K_{x_i}}_{i=1}^m is a data-dependent dictionary. In a nonlinear regression setting, compared with models induced by basis functions, the data-dependent dictionary can provide adequate information once enough observations have been seen. Consequently, the local information of the regression function can be captured with this redundant dictionary. An illustrative example of this observation can be found in Ando et al. (2008). Recently, learning with indefinite kernels has drawn much attention. Most of the work has focused on the algorithmic side, e.g., Loosli et al. (2016); Huang et al. (2017). The algorithm under consideration here provides a simple scenario for regularized indefinite kernel regression. However, theoretical work on this aspect is still limited so far.

Compared with algorithms involving the ℓ0 penalty, ℓ1 regularized algorithms can be solved efficiently by convex programming methods. When 0 < q < 1, the problem (2) is a non-convex optimization problem. Many efficient approaches have been developed to solve ℓq minimization problems of this type, e.g., Candès et al. (2008); Chen et al. (2010); Huang et al. (2008); Lai and Wang (2011); Xu et al. (2012); but no approach is guaranteed to find a global minimizer. Most proposed approaches are descent-iterative in nature. To illustrate the principle of the minimization process, we define the objective functional of algorithm (2) as

T_{\gamma,q}(c) = \|y - K c\|_2^2 + \gamma \|c\|_q^q, \qquad c \in R^m,   (3)

where y = (y_1/\sqrt{m}, \ldots, y_m/\sqrt{m})^T \in R^m and K \in R^{m \times m} with entries K_{i,j} = K(x_i, x_j)/\sqrt{m}, 1 \le i, j \le m. Given γ > 0 and an initial point c^0, a descent-iterative minimization approach generates a minimizing sequence {c^k}_{k=1}^∞ such that T_{γ,q}(c^k) is strictly decreasing along the sequence. Thus any local minimizer, including the global minimizer, that a descent approach may find must lie in the level set {c ∈ R^m : T_{γ,q}(c) < T_{γ,q}(c^0)}.
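To make the objective (3) and its relation to (2) concrete, here is a minimal numerical sketch (our own illustration; the Gaussian kernel and the helper names build_matrix_form and T are assumptions for the example, not notation from the paper):

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # One possible admissible kernel; any continuous kernel K(x, t) could be used here.
    return np.exp(-np.sum((np.atleast_1d(s) - np.atleast_1d(t)) ** 2) / (2.0 * sigma ** 2))

def build_matrix_form(x, y_raw, kernel=gaussian_kernel):
    # K[i, j] = K(x_i, x_j) / sqrt(m) and y[j] = y_j / sqrt(m), as in (3),
    # so that ||y - K c||_2^2 equals the empirical least-square term of (2).
    m = len(x)
    K = np.array([[kernel(x[i], x[j]) for j in range(m)] for i in range(m)]) / np.sqrt(m)
    y = np.asarray(y_raw, dtype=float) / np.sqrt(m)
    return K, y

def T(c, K, y, gamma, q):
    # Objective functional T_{gamma,q}(c) of (3); the penalty sums |c_i|^q over supp(c).
    c = np.asarray(c, dtype=float)
    return np.sum((y - K @ c) ** 2) + gamma * np.sum(np.abs(c[c != 0]) ** q)
```

Any descent method applied to T_{γ,q} can only decrease this value, so it never leaves the level set determined by its starting point, as noted above.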
Therefore, in both theory and practice, one may be only interested in the local minimizer around some pre-given initial point. Specifically, for our problem, a reasonable choice of the initial point would be the solution in the case q = 1. Assumption 1 For γ > 0 and 0 < q < 1, we assume that the coefficient sequence cz q of the estimator ˆfq is a local minimizer of the problem (2) and satisfies Tγ,q(cz q) < Tγ,q(cz 1), where cz 1 is the global minimizer of the problem (2) at q = 1. In Section 2.1, we shall present a scheme for searching a local minimizer of problem (2) by constructing a descent-iterative minimization process. The previous theoretical analysis about least-square regression with a coefficient-based penalty term is valid only for a convex learning model, e.g., the ℓq regularized regression with q 1. To enhance the sparsity, one expects to use a non-convex ℓq penalty, i.e., 0 < q < 1, but no optimization approach guarantees to find a global minimizer of the induced optimization problem. There is still a gap between the existing theoretical analysis and the optimization process: the estimator needs to be globally optimal in the theoretical analysis while the optimization method can not ensure the global optimality of its solutions. Up to our knowledge, due to the nonconvexity of ℓq term, there still lacks a rigorous theoretical demonstration to support its efficiency in non-parametric regression. In this paper, we aim to fill this gap by developing an elegant theoretical analysis on the asymptotic performances of estimators ˆfq satisfying Assumption 1, where these estimators can be generated by the scheme in Section 2.1 or another descent-iterative minimization process. Here we would like to point out that the established convergence analysis of ˆfq only requires ˆfq to be a stationary point around ˆf1. One aim of this work is to discuss the sparseness of the algorithms (2), which is characterized by the upper bounds on the fraction of the non-zero coefficients in the expression Sparse Kernel Regression ˆfq = Pm i=1 cz q,i Kxi. In general, the total number of coefficients is as large as the sample size m. But for some kind of kernels such as polynomial kernels, the representation of ˆfq is not unique whenever the sample size becomes large. For the sake of simplicity, we restrict our discussion on a special class of kernels. Definition 1 A function K : X X R is called an admissible kernel if it is continuous and for any k N, (c1, , ck) Rk and distinct set {t1, , tk} X, Pk j=1 cj Ktj = 0 for all x X implies cj = 0, j = 1, , k. It should be noticed that an admissible kernel here is not necessarily symmetric or positive semi-definite. Several widely used kernel functions from multi-variate approximation satisfy the condition of the admissible kernels, e.g., exponential kernels, Gaussian kernels, inverse multi-quadric kernels, B-spline kernels and compact supported radial basis function kernels including Wu s functions (Wu (1995)) and Wendland s functions (Wendland (1995)). One may see Wendland (2005) and references therein for more details of these kernels. For the same reason, the kernel function is required to be universal in Steinwart (2003) when discussing the sparseness of support vector machines. It is noticed that most universal kernels (see Micchelli et al. (2006)) are also admissible kernels. If K is admissible, as long as the input data are mutually different, the representation of ˆfq is unique and the number of non-zero coefficients is given by cz q 0. 
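Definition 1 can be probed numerically for a specific kernel. The sketch below (our own illustration, not from the paper) uses the fact that if \sum_j c_j K_{t_j} vanishes on all of X, it vanishes in particular at the points t_i, so a non-singular matrix (K(t_i, t_j))_{i,j} at a set of distinct points already forces c = 0 for that point set; for the Gaussian kernel such matrices are non-singular at any distinct points. This is only a spot check on finitely many random point sets, not a proof of admissibility.

```python
import numpy as np

def gram(points, kernel):
    return np.array([[kernel(s, t) for t in points] for s in points])

def spot_check_admissibility(kernel, d=2, k=8, trials=100, seed=0):
    # Probes the condition of Definition 1 at random distinct point sets:
    # if the matrix (K(t_i, t_j)) is non-singular, then sum_j c_j K_{t_j} = 0 on X
    # (hence at every t_i) forces c = 0 for this particular point set.
    rng = np.random.default_rng(seed)
    min_sv = np.inf
    for _ in range(trials):
        pts = rng.uniform(size=(k, d))
        min_sv = min(min_sv, np.linalg.svd(gram(pts, kernel), compute_uv=False)[-1])
    return min_sv  # strictly positive in every trial for, e.g., the Gaussian kernel

gaussian = lambda s, t: np.exp(-np.sum((s - t) ** 2))
print(spot_check_admissibility(gaussian))
```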
In the following, when we refer to a linear combination of {K_{x_i}}_{i=1}^m, we always suppose that the points {x_i}_{i=1}^m are pairwise distinct. In our setting, this assumption is almost surely satisfied if the data generating distribution ρ is continuous.

Besides sparsity, another purpose of this paper is to investigate how the estimator \hat f_q given by (2) approximates the regression function f_ρ. Let ρ_X be the marginal distribution of ρ on X. With a suitable choice of γ depending on the sample size m, we show that the estimator \hat f_q (or π_M(\hat f_q), see Definition 2) converges to f_ρ in the function space L^2_{ρ_X}(X) as m tends to infinity. Here, for a Borel measure Q on X, the space L^2_Q(X) consists of all square-integrable functions with respect to Q and the norm is given by ‖f‖_{L^2_Q} = (∫_X |f(x)|^2 dQ)^{1/2}.

In order to state our results, we further recall some notation used in this paper. We say that K is a Mercer kernel if it is continuous, symmetric and positive semi-definite on X × X. Such a kernel generates a reproducing kernel Hilbert space (RKHS) H_K (e.g., see Aronszajn (1950)). For a continuous kernel function K, define

\widetilde K(u, v) = \int_X K(u, x) K(v, x) \, d\rho_X(x).   (4)

Then one can verify that \widetilde K is a Mercer kernel.

Being an important convex approach for pursuing sparsity, regularization with the ℓ1 norm deserves special attention in its own right. The following theorem illustrates our general error analysis for ℓ1 regularized kernel regression. The result is stated in terms of properties of the input space X, the measure ρ and the kernel K.

Theorem 1 Assume that X is a compact convex subset of R^d with Lipschitz boundary, K ∈ C^s(X × X) with s > 0 is an admissible kernel and f_ρ ∈ H_{\widetilde K} with \widetilde K defined by (4). Let 0 < δ < 1 and

\Theta = \begin{cases} \dfrac{d+2s}{2d+2s}, & \text{if } 0 < s < 1, \\[1mm] \dfrac{d+2\lfloor s \rfloor}{2d+2\lfloor s \rfloor}, & \text{otherwise}, \end{cases}   (5)

where ⌊s⌋ denotes the integral part of s. Take γ = m^{ϵ−Θ} with 0 < ϵ ≤ Θ − 1/2. Then with confidence 1 − δ, there holds

\|\hat f_1 - f_\rho\|_{L^2_{\rho_X}}^2 \le C_\epsilon \log(6/\delta)\, (\log(2/\delta) + 1)^6\, m^{\epsilon - \Theta},   (6)

where C_ϵ > 0 is a constant independent of m or δ.

We shall prove Theorem 1 in Section 4.3 with the constant C_ϵ given explicitly. The convergence rate presented here improves the existing ones obtained in Wang et al. (2012); Shi et al. (2011); Guo and Shi (2012). In particular, for a C^∞ kernel (such as the Gaussian kernel), the rate can be arbitrarily close to m^{−1}. To see the improvement, recall that the best convergence rates so far were given by Guo and Shi (2012): it was proved there that if f_ρ ∈ H_{\widetilde K} and K ∈ C^∞, then with confidence 1 − δ, ‖\hat f_1 − f_ρ‖^2_{L^2_{ρ_X}} decays essentially at the rate m^{−1/2}, up to a factor involving log(24/δ) and log log_2 m. We improve this result in Theorem 1, as the convergence rate (6) is always faster than m^{−1/2}, even for Lipschitz continuous kernels.

Next, we further illustrate the optimality of the convergence rates obtained in Theorem 1 when f_ρ belongs to some specific function space. Let X = [0, 1]^d, let ρ_X be the uniform distribution on [0, 1]^d, and let K ∈ C^s(X × X) with s* := s + d/2 > 0 being an integer. Then the RKHS H_{\widetilde K} ⊆ C^s([0, 1]^d), and the Sobolev space W^{s*}_2([0, 1]^d) consists of functions belonging to the Hölder space C^{s−1,α}([0, 1]^d) with an arbitrary α ∈ (0, 1) (see e.g., Adams and Fournier (2003)). If f_ρ ∈ H_{\widetilde K} ∩ W^{s*}_2([0, 1]^d), the claimed rate in (6) can be arbitrarily close to O(m^{−2s*/(2s*+d)}), which is proven to be minimax optimal in Fischer and Steinwart (2017).
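To give a feel for the exponent in (5) (a numerical illustration of ours, not an example from the paper): for d = 3 and a kernel with ⌊s⌋ = 10,

\Theta = \frac{d + 2\lfloor s \rfloor}{2d + 2\lfloor s \rfloor} = \frac{3 + 20}{6 + 20} = \frac{23}{26} \approx 0.885,

so taking ϵ = 0.01 in Theorem 1 yields a rate of order m^{−(Θ−ϵ)} ≈ m^{−0.87}; letting ⌊s⌋ → ∞ gives Θ → 1, which is the sense in which the rate for C^∞ kernels can be made arbitrarily close to m^{−1}.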
Our refined result is mainly due to the following reasons. Firstly, when K ∈ C^s with s ≥ 2 and the input space X satisfies some regularity condition, we obtain a tight upper bound on the empirical covering numbers of the related hypothesis space (see Theorem 11). Secondly, we apply the projection operator in the error analysis to get better estimates.

Definition 2 For M ≥ 1, the projection operator π_M on R is defined as

\pi_M(t) = \begin{cases} -M, & \text{if } t < -M, \\ t, & \text{if } -M \le t \le M, \\ M, & \text{if } t > M. \end{cases}

The projection of a function f : X → R is given by π_M(f)(x) = π_M(f(x)), x ∈ X.

The projection operator was introduced in Chen et al. (2004); Steinwart and Christmann (2008). It helps to improve the bounds in the convergence analysis, which is critical for sharp estimation. In fact, under the uniform boundedness assumption, the performance of the algorithm (2) can be measured by the error ‖π_M(\hat f) − f_ρ‖_{L^2_{ρ_X}}, where \hat f is a resulting estimator. To explain the details, we recall the definition of the regression function f_ρ given by (1). As the conditional distribution ρ(·|x) is supported on [−M, M] for every x ∈ X, the target function f_ρ takes values in [−M, M] on X. So to see how an estimator \hat f approximates f_ρ, it is natural to project the values of \hat f onto the same interval by the projection operator π_M(·). Due to the analysis in this paper, one can always expect better estimates by projecting the output estimator onto the interval [−M, M]. So if we consider the estimator π_M(\hat f_1) in Theorem 1, the obtained result can be further improved; however, in order to make comparisons with previous results, we only give the error analysis for \hat f_1. For the case 0 < q < 1, we shall only consider the error ‖π_M(\hat f_q) − f_ρ‖_{L^2_{ρ_X}}. To illustrate the sparseness of the algorithm, we also derive an upper bound on the quantity ‖c^z_q‖_0/m, where c^z_q denotes the coefficient sequence of \hat f_q.

Theorem 3 Assume that X is a compact convex subset of R^d with Lipschitz boundary, K ∈ C^∞(X × X) is an admissible kernel and f_ρ ∈ H_{\widetilde K} with \widetilde K defined by (4). For 0 < q < 1, the estimator \hat f_q is given by algorithm (2) and satisfies Assumption 1. Let 0 < δ < 1 and γ = m^{−τ} with 1 − q < τ < 1. With confidence 1 − δ, there hold

\|\pi_M(\hat f_q) - f_\rho\|_{L^2_{\rho_X}}^2 \le \widetilde C \Big( \log(18/\delta) + \log\log\frac{8}{q(1-\tau)} \Big)^3 m^{-(\tau-(1-q))},   (7)

and

\frac{\|c^z_q\|_0}{m} \le \widetilde C'\, (q(1-q))^{-q/(2-q)} \Big( \log(2/\delta) + \log\log\frac{8}{q(1-\tau)} \Big)^6 m^{-q\left(1-\frac{\tau}{2-q}\right)},   (8)

where \widetilde C and \widetilde C' are positive constants independent of m or δ.

From Theorem 3, one can see that under the restrictions on τ and q, the quantity ‖c^z_q‖_0/m converges to 0 at a polynomial rate when the sample size m becomes large. The regularization parameter γ plays an important role as a trade-off between sparsity and convergence rates: one can obtain a sparser solution at the price of lower estimation accuracy. Due to Theorem 3, when (3−√5)/2 < q < 1, we may take τ = (2−q)(1−q); then the quantity ‖c^z_q‖_0/m behaves like O(m^{−q²}) and the corresponding convergence rate is O(m^{−(1−q)²}). In our sparsity analysis (see Section 5), the regularization parameter γ also plays the role of a threshold for the values of the non-zero coefficients in \hat f_q = \sum_{i=1}^m c^z_{q,i} K_{x_i}. Due to our analysis, a lower bound on the magnitude of the non-zero coefficients is of order γ^{1/(2−q)}, which implies that a smaller q will lead to more zero coefficients in the kernel expansion for a fixed γ < 1. It should be mentioned that our sparsity analysis is only valid for 0 < q < 1. For RKHS-based regularization algorithms, it is well known that a classical way to obtain sparsity is to introduce an ϵ-insensitive zone in the loss function.
A theoretical result for this approach in Steinwart and Christmann (2009) shows that the fraction of non-zero coefficients in the kernel expansion is asymptotically lower and upper bounded by constants. From this point of view, regularizing the combinatorial coefficients by the ℓq norm is a more powerful way to produce sparse solutions. As our theoretical analysis only gives results for the worst case situations, one can expect better performance of the ℓq regularized kernel regression in practice. Shi, Huang, Feng and Suykens At the end of this section, we point out that the mathematical analysis for the case 0 < q < 1 is far from optimal. It is mainly because of our analysis is based on Assumption 1. Under this assumption, one needs to bound cz 1 q q which is critical in the error analysis, where cz 1 denotes the coefficient sequence of the estimator ˆf1. In fact, due to the discussion in Section 2.1, we can construct a minimizing sequence from any point. Besides the solution of ℓ1 regularized kernel regression, we may consider some other choices of the starting point, e.g, the solution of RKHS-based regularized least-square regression. We believe that how to bound the q norm of these initial vectors is still a problem when one considers other possible starting points. In this paper, we use the reverse H older inequality to handle this term and the bound is too loose especially when q is small. Actually, even if ˆfq is a global minimizer of problem (2), we can not give an effective approach to conduct the error analysis. Additionally, we do not assume any sparsity condition on the target function. One possible condition that one may consider is that the regression function belongs to the closure of the linear span of {Kx|x X} under the ℓq constraint. Compared with the hard sparsity introduced by the ℓ0 norm, such kind of sparsity assumption is referred as to soft sparsity (see Raskutti et al. (2011)), which is based on imposing a certain decay rate on the entries of the coefficient sequence. Developing the corresponding mathematical analysis under the soft sparsity assumption will help us to understand the role of the ℓq regularization in feature selections in an infinite-dimensional hypothesis space. We shall consider this topic in future work. However, the sparsity analysis in this paper is still valid to derive the asymptotical bound for cz q 0 m and will lead to better estimates if a more elaborate error analysis can be given. The paper is organized as follows. The next section presents a descent-iterative minimization process for algorithm (2) and establishes the framework of error analysis. In Section 3, we derive a tight bound on empirical covering numbers of the hypothesis space under the ℓ1 constraint. In Section 4 and Section 5, we derive the related results on the error analysis and sparseness of ℓq regularized kernel regression. 2. Preliminaries This section is devoted to generating the minimizing sequences and establishing the framework of mathematical analysis for ℓq regularized kernel regression. 2.1. Minimizing sequences for ℓq regularized kernel regression In this part, we present a descent-iterative minimization process for algorithm (2), which can probably search a local minimizer starting from any initial point. Motivated by recent work on ℓ1/2 regularization in Xu et al. (2012), we generalize their strategy to the case 0 < q < 1. Let sgn(x) be given by sgn(x) = 1 for x 0 and sgn(x) = 1 for x < 0. 
Define a function ψ_{η,q} for η > 0 and 0 < q < 1 as

\psi_{\eta,q}(x) = \begin{cases} \mathrm{sgn}(x)\, t_{\eta,q}(|x|), & |x| > a_q \eta^{1/(2-q)}, \\ 0, & \text{otherwise}, \end{cases}   (9)

where a_q = \frac{2-q}{2}(1-q)^{\frac{q-1}{2-q}} and t_{η,q}(|x|) denotes the solution of the equation

2t + \eta q t^{q-1} - 2|x| = 0   (10)

on the interval [(q(1-q)\eta/2)^{1/(2-q)}, \infty) with respect to the variable t. We further define a map Ψ_{η,q} : R^m → R^m, which is given by

\Psi_{\eta,q}(d) = (\psi_{\eta,q}(d_1), \ldots, \psi_{\eta,q}(d_m)), \qquad d = (d_1, \ldots, d_m) \in R^m.   (11)

Then we have the following important lemma.

Lemma 4 For any η > 0, 0 < q < 1 and d = (d_1, ..., d_m) ∈ R^m, the map Ψ_{η,q} : R^m → R^m given by (11) is well-defined and Ψ_{η,q}(d) is a global minimizer of the problem

\min_{c \in R^m} \big\{ \|c - d\|_2^2 + \eta \|c\|_q^q \big\}.   (12)

We shall leave the proof to the Appendix. The function ψ_{η,q} defines the ℓq thresholding function for 0 < q < 1. According to the proof of Lemma 4, given x ∈ R and η > 0, the value of ψ_{η,q}(x) in equation (9) is essentially a global minimizer of the problem min_{t∈R} {|t − x|^2 + η|t|^q}. When q = 1/2, the function ψ_{η,1/2} is exactly the half thresholding function obtained in Xu et al. (2012). We also observe that, though the analysis in Lemma 4 is based on the fact that 0 < q < 1, the expression of ψ_{η,q} is coherent for q ∈ [0, 1]. Concretely, as lim_{q→1} a_q = 1/2, by letting q → 1 in the definition of ψ_{η,q} one obtains the soft thresholding function for ℓ1 regularization, which is given by (e.g., see Daubechies et al. (2004))

\psi_{\eta,1}(x) = \begin{cases} x - \mathrm{sgn}(x)\,\eta/2, & |x| > \eta/2, \\ 0, & \text{otherwise}. \end{cases}

Similarly, taking q = 0 in the expression of ψ_{η,q}, one may also derive the hard thresholding function for ℓ0 regularization, which is defined as (e.g., see Blumensath and Davies (2008))

\psi_{\eta,0}(x) = \begin{cases} x, & |x| > \eta^{1/2}, \\ 0, & \text{otherwise}. \end{cases}

The expression of ψ_{η,q} is very useful, as we can establish a descent-iterative minimization process for algorithm (2) based on the idea in Daubechies et al. (2004). Recall the definitions of K and y in the objective functional T_{γ,q} given by (3). For any (λ, γ) ∈ (0, ∞)^2 and c^0 ∈ R^m, we iteratively define a sequence {c^n}_{n=1}^∞ as

c^{n+1} = \Psi_{\lambda\gamma,q}\big( c^n + \lambda K^T (y - K c^n) \big).   (13)

Then {c^n}_{n=1}^∞ is a minimizing sequence for a suitably chosen λ > 0.

Proposition 5 Let 0 < q < 1, γ > 0 and 0 < λ ≤ q/(2‖K‖_2^2), where ‖·‖_2 denotes the spectral norm of the matrix. If the sequence {c^n}_{n=0}^∞ is generated by the iteration process (13), then the following statements are true.

(i) If c* ∈ R^m is a local minimizer of the objective functional T_{γ,q}(c), then c* is a stationary point of the iteration process (13), i.e., c* = Ψ_{λγ,q}(c* + λK^T(y − Kc*)).

(ii) The sequence {c^n}_{n=0}^∞ is a minimizing sequence such that the sequence {T_{γ,q}(c^n)}_{n=0}^∞ is monotonically decreasing.

(iii) The sequence {c^n}_{n=0}^∞ converges to a stationary point of the iteration process (13) whenever λ is sufficiently small.

We also prove this proposition in the Appendix. The properties of ψ_{η,q} play an important role in the proof. When q = 1/2 and q = 2/3, the equation 2t + ηqt^{q−1} − 2|x| = 0 can be solved analytically, i.e., the corresponding thresholding function can be expressed explicitly as an analytical function. This has motivated the development of efficient algorithms based on (13) for these two special cases. In particular, the ℓ_{1/2} regularization problem has been intensively studied in the literature (see Xu et al. (2012) and references therein). Since a general formula for the ℓq thresholding function is given by (9), it is also interesting to develop corresponding iterative algorithms and compare their empirical performances for different values of q.
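The following sketch (our own implementation outline of (9), (10) and (13), with a_q as above; not code released with the paper) computes t_{η,q} by bisection on the stated interval, since a closed form is only available for special values such as q = 1/2 and q = 2/3.

```python
import numpy as np

def lq_threshold(x, eta, q, iters=80):
    """psi_{eta,q}(x) of (9): 0 below the threshold a_q * eta^{1/(2-q)}, otherwise
    sgn(x) * t_{eta,q}(|x|) with t solving 2t + eta*q*t^(q-1) = 2|x| on
    [(q(1-q)eta/2)^{1/(2-q)}, infinity), found here by bisection."""
    a_q = 0.5 * (2.0 - q) * (1.0 - q) ** ((q - 1.0) / (2.0 - q))
    if abs(x) <= a_q * eta ** (1.0 / (2.0 - q)):
        return 0.0
    lo = (q * (1.0 - q) * eta / 2.0) ** (1.0 / (2.0 - q))
    hi = abs(x)                      # 2t + eta*q*t^(q-1) >= 2t, so the root satisfies t <= |x|
    g = lambda t: 2.0 * t + eta * q * t ** (q - 1.0) - 2.0 * abs(x)
    for _ in range(iters):           # g is increasing on [lo, hi] with g(lo) <= 0 <= g(hi)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return np.sign(x) * 0.5 * (lo + hi)

def iterate(K, y, gamma, q, c0, lam, n_steps=500):
    """Descent-iterative scheme (13): c^{n+1} = Psi_{lam*gamma, q}(c^n + lam*K^T(y - K c^n)).
    lam is the step size; Proposition 5 requires it to be sufficiently small."""
    c = np.array(c0, dtype=float)
    for _ in range(n_steps):
        d = c + lam * K.T @ (y - K @ c)
        c = np.array([lq_threshold(di, lam * gamma, q) for di in d])
    return c
```

For q = 1/2 this reproduces the half thresholding function of Xu et al. (2012) up to numerical tolerance, which gives a quick sanity check of the bisection step.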
2.2. Framework of error analysis

In this subsection, we establish the framework of convergence analysis. Because of the least-square nature of the problem, one can see from Cucker and Zhou (2007) that

\|f - f_\rho\|_{L^2_{\rho_X}}^2 = E(f) - E(f_\rho), \qquad f : X \to R,

where E(f) = \int_{X \times Y} (f(x) - y)^2 d\rho. Let \hat f be the estimator produced by algorithm (2); in particular, the estimator \hat f under our consideration is \hat f_1 or π_M(\hat f_q) with 0 < q ≤ 1. To estimate ‖\hat f − f_ρ‖^2_{L^2_{ρ_X}}, we only need to bound E(\hat f) − E(f_ρ). This will be done by applying the error decomposition approach which has been developed in the literature for regularization schemes (e.g., see Cucker and Zhou (2007); Steinwart and Christmann (2008)). In this paper, we establish the error decomposition formula based on the first author's previous work (Guo and Shi (2012)). To this end, we still need to introduce some notation. For any continuous function K : X × X → R, define an integral operator on L^2_{ρ_X}(X) as

(L_K f)(x) = \int_X K(x, t) f(t) \, d\rho_X(t), \qquad x \in X.

Since X is compact and K is continuous, L_K and its adjoint L_K^* are both compact operators. If K is a Mercer kernel, the corresponding integral operator L_K is a self-adjoint positive operator on L^2_{ρ_X}, and its r-th power L_K^r is well-defined for any r > 0. From Cucker and Zhou (2007), we know that the RKHS H_K is in the range of L_K^{1/2}. Recalling the Mercer kernel \widetilde K defined by (4) for a continuous kernel K, it is easy to check that L_{\widetilde K} = L_K L_K^*.

Following the same idea as in Guo and Shi (2012), we use the RKHS H_{\widetilde K}, with norm denoted by ‖·‖_{\widetilde K}, to approximate f_ρ. The approximation behavior is characterized by the regularization error

D(\gamma) = \min_{f \in H_{\widetilde K}} \big\{ E(f) - E(f_\rho) + \gamma \|f\|_{\widetilde K}^2 \big\}.

The following assumption is standard in the literature of learning theory (e.g., see Cucker and Zhou (2007); Steinwart and Christmann (2008)).

Assumption 2 For some 0 < β ≤ 1 and c_β > 0, the regularization error satisfies

D(\gamma) \le c_\beta \gamma^\beta, \qquad \forall \gamma > 0.   (14)

The decay of D(γ) as γ → 0 measures the approximation ability of the function space H_{\widetilde K}. Next, for γ > 0, we define the regularizing function as

f_\gamma = \arg\min_{f \in H_{\widetilde K}} \big\{ E(f) - E(f_\rho) + \gamma \|f\|_{\widetilde K}^2 \big\}.   (15)

The regularizing function exists uniquely and is given by f_γ = (γI + L_{\widetilde K})^{-1} L_{\widetilde K} f_ρ (e.g., see Proposition 8.6 in Cucker and Zhou (2007)), where I denotes the identity operator on H_{\widetilde K}. Now we are in a position to establish the error decomposition for algorithm (2). Recall that for 0 < q ≤ 1, c^z_q denotes the coefficient sequence of the estimator \hat f_q and z = {(x_i, y_i)}_{i=1}^m ∈ Z^m is the sample set. The empirical loss functional E_z(f) is defined for f : X → R as

E_z(f) = \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2.

Proposition 6 For γ > 0, let the regularizing function f_γ be given by (15) and let f_{z,γ} = (1/m) \sum_{i=1}^m K_{x_i} g_γ(x_i) with g_γ = L_K^* (γI + L_{\widetilde K})^{-1} f_ρ. Let \hat f be an estimator under consideration, and define

S_1 = \{E(\hat f) - E_z(\hat f)\} + \{E_z(f_{z,\gamma}) - E(f_{z,\gamma})\},

S_2 = \{E(f_{z,\gamma}) - E(f_\gamma)\} + \gamma \Big\{ \frac{1}{m}\sum_{i=1}^m |g_\gamma(x_i)| - \|g_\gamma\|_{L^1_{\rho_X}} \Big\},

S_3 = E(f_\gamma) - E(f_\rho) + \gamma \|g_\gamma\|_{L^2_{\rho_X}}.

If \hat f_q satisfies Assumption 1 with 0 < q < 1, then for the estimator \hat f = π_M(\hat f_q) there holds

E(\hat f) - E(f_\rho) + \gamma \|c^z_q\|_q^q \le S_1 + S_2 + S_3 + \gamma m^{1-q} \|c^z_1\|_1^q.   (16)

When q = 1, for \hat f = \hat f_1 or π_M(\hat f_1), there holds

E(\hat f) - E(f_\rho) + \gamma \|c^z_1\|_1 \le S_1 + S_2 + S_3.   (17)

To save space, we leave the proof to the Appendix. In fact, Proposition 6 presents three error decomposition formulas. When \hat f = \hat f_1, the inequality (17) is exactly the error decomposition introduced in Guo and Shi (2012) for ℓ1 regularized kernel regression. Note that the bound (16) involves an additional term γm^{1−q}‖c^z_1‖_1^q.
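This extra term comes from comparing the ℓq and ℓ1 penalties of the same coefficient vector. By Hölder's inequality with exponents 1/q and 1/(1−q) (the "reverse Hölder" step mentioned in Section 1; the one-line derivation below is ours),

\|c\|_q^q = \sum_{i=1}^m |c_i|^q \cdot 1 \;\le\; \Big( \sum_{i=1}^m |c_i| \Big)^{q} m^{1-q} = m^{1-q}\|c\|_1^q, \qquad c \in R^m,

so under Assumption 1 the ℓq penalty of c^z_1 appearing in the comparison T_{γ,q}(c^z_q) ≤ T_{γ,q}(c^z_1) is controlled by γ m^{1−q} ‖c^z_1‖_1^q.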
Therefore, the asymptotic behavior of the global estimator ˆf1 plays a significant part in the convergence analysis of ˆfq with 0 < q < 1. With the help of Proposition 6, one can estimate the total error by bounding Si (i = 1, 2, 3) and γm1 q cz 1 q 1 respectively. The terms S2 and S3 are well estimated in Guo and Shi (2012) by fully utilizing the structure of the functions fz,γ and gγ. Here we directly quote the following bound for S2 + S3. One may see Guo and Shi (2012) for the detailed proof. Lemma 7 For any (γ, δ) (0, 1)2, with confidence 1 δ, there holds S2 + S3 8κ2 2κ2 + 1 log2(4/δ) D(γ) γ2m2 + D(γ) +(2κ + 1) p D(γ) log (4/δ) γD(γ) + 2D(γ), (18) where κ = K C (X X). Therefore, our error analysis mainly focuses on bounding S1 and γm1 q cz 1 q 1. The first term S1 can be estimated by uniform concentration equalities. These inequalities quantitatively characterize the convergence behavior of the empirical processes over a function set by various capacity measures, such as VC dimension, covering number, entropy integral and so on. For more details, one may refer to Van der Vaart and Wellner (1996) and references therein. In this paper, we apply the concentration technique involving the ℓ2 empirical covering numbers to obtain bounds on S1. The ℓ2 empirical covering number is defined by means of the normalized ℓ2-metric d2 on the Euclidian space Rl given by i=1 |ai bi|2 !1/2 , a = (ai)l i=1, b = (bi)l i=1 Rl. Definition 8 Let (M, d M) be a pseudo-metric space and S M a subset. For every ϵ > 0, the covering number N (S, ϵ, d M) of S with respect to ϵ and the pseudo-metric d M is defined to be the minimal number of balls of radius ϵ whose union covers S, that is, N (S, ϵ, d) = min j=1 B(sj, ϵ) for some {sj}ι j=1 M where B(sj, ϵ) = {s M : d M(s, sj) ϵ} is a ball in M. For a set F of functions on X and ϵ > 0, the ℓ2-empirical covering number of F is given by N2(F, ϵ) = sup l N sup u Xl N (F|u, ϵ, d2), where for l N and u = (ui)l i=1 Xl, we denote the covering number of the subset F|u = {(f(ui))l i=1 : f F} of the metric space (Rl, d2) as N (F|u, ε, d2). Sparse Kernel Regression As for the last term, we will derive a tighter bound for cz 1 1 by an iteration technique based on the convergence analysis for the estimator πM( ˆf1). The application of the projection operator will lead to better estimates. 3. Capacity of the hypothesis space under ℓ1 constraint In this section, by means of the ℓ2-empirical covering numbers, we study the capacity of the hypothesis space generated by the kernel function. For R > 0, define BR to be the linear combination of functions {Kx|x X} under the ℓ1 constraint i=1 µi Kui : k N, ui X, µi R and In this paper, we assume that the following capacity assumption for B1 comes into existence. Assumption 3 For a kernel function K, there exists an exponent p with 0 < p < 2 and a constant c K,p > 0 such that log2 N2(B1, ϵ) c K,p p , 0 < ϵ 1. (20) It is strictly proved in Shi et al. (2011) that, for a compact subset X of Rd and K C s with some s > 0, the power index p can be given by 2d/(d + 2s), when 0 < s 1, 2d/(d + 2), when 1 < s 1 + d/2, d/s, when s > 1 + d/2. (21) We will present a much tighter bound on the logarithmic ℓ2-empirical covering numbers of B1. This bound holds for a general class of input space satisfying an interior cone condition. 
Definition 9 A subset X of Rd is said to satisfy an interior cone condition if there exist an angle θ (0, π/2), a radius RX > 0, and a unit vector ξ(x) for every x X such that the cone C(x, ξ(x), θ, RX) = n x + ty : y Rd, |y| = 1, y T ξ(x) cos θ, 0 t RX o is contained in X. Remark 10 The interior cone condition excludes those sets X with cusps. It is valid for any convex subset of Rd with Lipschitz boundary (see e.g., Adams and Fournier (2003)). Now we are in a position to give our refined result on the capacity of B1. Shi, Huang, Feng and Suykens Theorem 11 Let X be a compact subset of Rd. Suppose that X satisfies an interior cone condition and K C s(X X) with s 2 is an admissible kernel. Then there exists a constant CX,K that depends on X and K only, such that log2 N2(B1, ϵ) CX,Kϵ 2d d+2 s log2 , 0 < ϵ 1, (22) where s denotes the integral part of s. Recall the asymptotical bound obtained in Shi et al. (2011) with p given by (21). It asserts that for K C s with s 2, the quantity log2 N2(B1, ϵ) grows at most of the order ϵ min{ 2d s}. Our stated result in Theorem 11 improves the previous bound a lot, as log2 N2(B1, ϵ) can be bounded by O (ϵ s1) for any s1 > 2d d+2 s . Besides the ℓ2 empirical covering number, another way to measure the capacity is the uniform covering number N (B1, ϵ, ) of B1 as a subset of the metric space (C (X), ) of bounded continuous functions on X. From a classical result in function spaces (see Edmunds and Triebel (1996)), for X = [0, 1]d and K C s(X X), the unit ball B1 satisfies d/s log N (B1, ϵ, ) c s d/s , ϵ > 0. When d is large or s is small, this estimate is rough. Moreover, the estimate above is asymptotic optimal and cannot be improved, which implies that the uniform covering number is not a suitable measurement for the capacity of B1. The ℓ2 emprical covering number was first investigated in the field of empirical process (e.g., see Dudley (1987); Van der Vaart and Wellner (1996) and references therein). One usually assumes that, there exist 0 < p < 2 and cp > 0 such that log N2(F, ϵ) cpϵ p, ϵ > 0, (23) which guarantees the convergence of Dudley s entropy integral, i.e., R 1 0 p log N2(F, ϵ)dϵ < . This fact is very important, since function class F with bounded entropy integrals satisfy the uniform central limit Theorem (see Dudley (1987) for more details). The classical result on ℓ2 empirical covering number only asserts that if N2(F, ϵ) = O(ϵ p ) for some p > 0, then the convex hull of F satisfies (23) with p = 2p p +2 < 2. In this paper, we further clarify the relation between the smoothness of the kernel and the capacity of the hypothesis space. That is, we establish a more elaborate estimate for the power index p in Assumption 3 by using the prior information on the smoothness of the kernel. It should be pointed out that, from the relation N2(F, ϵ) N (F, ϵ, ), capacity assumption (23) for an RKHS generated by a positive semi-definite kernel K can be verified by bounding the uniform covering number. It is proved in Zhou (2003) that, when F is taken to be the unit ball in the RKHS HK, there holds log N (F, ϵ, ) = O(ϵ 2d/s) provided that X Rd and K C s(X X). Therefore, a sufficiently smooth kernel with s > d can guarantee that capacity assumption (23) comes into existence. When X is a Euclidean ball in Rd, the Sobolev space W s 2 (X) with s > d/2 is an RKHS. Birman and Solomyak (1967) proved that log N2(W s 2 (X), ϵ) is upper and lower bounded by O(ϵ p) with p = d/s < 2. 
However, up to our best knowledge, how to demonstrate capacity assumption (23) for a general RKHS is still widely open. From Theorem 11, one can see that even for positive semi-definite kernels, Sparse Kernel Regression the function space (19) is more suitable to be the hypothesis space than the classical RKHS, as its capacity can be well-estimated by the ℓ2 empirical covering number. We will prove Theorem 11 after a few lemmas. The improvement is mainly due to a local polynomial reproduction formula from the literature of multivariate approximation (see Wendland (2001); Jetter et al. (1999)). A point set Ω= {ω1, , ωn} X is said to be -dense if sup x X min ωj Ω|x ωj| , where | | is the standard Euclidean norm on Rd. Denote the space of polynomials of degree at most s on Rd as Pd s . The following lemma is a formula for local polynomial reproduction (see Theorem 3.10 in Wendland (2001)). Lemma 12 Suppose X satisfies an interior cone condition with radius RX > 0 and angle θ (0, π/2). Fix s N with s 2. There exists a constant c0 depending on θ, d and s such that for any -dense point set Ω= {ω1, , ωn} in X with RX c0 and every u X, we can find real numbers bi(u), 1 i n, satisfying (1) Pn i=1 bi(u)p(ωi) = p(u) p Pd s , (2) Pn i=1 |bi(u)| 2, (3) bi(u) = 0 provided that |u ωi| > c0 . This formula was first introduced by Wang et al. (2012) to learning theory for investigating the approximation property of the kernel-based hypothesis space. For any function set F on X, we use absconv F to denote the absolutely convex hull of F, which is given by absconv F = i=1 λifi : k N, fi F and In order to prove our result, we also need the following lemma. Lemma 13 Let Q be a probability measure on X and F be a class of n measurable functions of finite L2 Q diameter diam F. Then for every ϵ > 0, N (absconv F, ϵdiam F, d2,Q) e + e(2n + 1)ϵ2 2/ϵ2 , (24) where d2,Q is the metric induced by the norm L2 Q. This lemma can be proved following the same idea of Lemma 2.6.11 in Van der Vaart and Wellner (1996). Now we can concentrate our efforts on deriving our conclusion in Theorem 11. Proof [Proof of Theorem 11]. A set is called separated if the distance between any two elements of the set is larger than . We take { n} n=1 to be a positive sequence decreasing to 0 with n explicitly given later. Let the set Xn = v1, v2, , v|Xn| be an increasing family of finite sets, where |Xn| denotes the cardinality of Xn. For every n, Xn is a maximal n separated set in X with respect to inclusion, i.e., each Xn is Shi, Huang, Feng and Suykens n separated and if Xn W X then W is not n separated. Note that, if Xn is a maximal n separated set in X, then it is n dense in X. Based on {Xn} n=1, we create a family of sets An = {Kv|v Xn}. Similarly, let A = {Kv|v X}, then A1 A2 A . We first limit our discussion to the case that s 2 is an integer. Motivated by the proof of Proposition 1 in Wang et al. (2012), we will show that for any t0 X, the function Kt0 can be approximated by a linear combination of An whenever n is sufficiently small. For given x X, we let gx(t) = K(x, t). Then gx C s(X). We consider the Taylor expansion of gx at a fixed point t0 of degree less than s, which is given by Px(t) = K(x, t0) + α! (t t0)α. Here α = (α1, , αd) Zd + is a multi-index with |α| = Pd j=1 |αj|, α! = Qd j=1(αj)!, and Dαgx(t0) denotes the partial derivatives of gx at the point t0. Due to Lemma 12, if X satisfies an interior cone condition, we take a constant X,s := RX c0 depending only on X and s. 
Then for any dense set given by {ω1, , ωn} with X,s, we have i I(t0) bi(t0)Px(ωi), where P i I(t0) |bi(t0)| 2 and I(t0) = {i {1, , n} : |ωi t0| c0 }. Note that K(x, t0) = Px(t0) and maxi I(t0) |K(x, ωi) Px(ωi)| cs 0 K C s s. It follows that K(x, t0) X i I(t0) bi(t0)K(x, ωi) i I(t0) bi(t0) (Px(ωi) K(x, ωi)) 2cs 0 K C s s. (25) It is important that the right hand side of the above inequality is independent of x or t0. For any function f belonging to B1, there exist k N and {ui}k i=1 X, such that f(x) = Pk i=1 µi K(x, ui) with Pk i=1 |µi| 1. Recall that Xn = v1, v2, , v|Xn| is n dense in X. If n X,s, we set j In(ui) bj(ui)K(x, vj), where the index set In(u) is defined for any n N and u X as In(u) = {j {1, , |Xn|} : |vj u| c0 n} . Hence, we have f fn = sup x X j In(ui) bj(ui)K(x, vj) 2cs 0 K C s s n, (26) Sparse Kernel Regression where the last inequality is from (25). Obviously, we can rewrite fn as fn(x) = P|Xn| j=1 νj K(x, vj) where |Xn| X j In(ui) |bj(ui)| 2. Therefore, we have fn 2absconv An. For a compact subset X of Rd, there exists a constant c X > 0 depending only on X, such that N (X, ϵ, d2) c X d , 0 < ϵ 1, where d2 denotes the metric induced by the standard Euclidean norm on Rd (see Theorem 5.3 in Cucker and Zhou (2007)). Now we let n = 2c d . Recall that |Xn| is the maximal cardinality of an n separated set in X. Thus we have |Xn| N (X, n/2, d2) n. For any given ϵ > 0, we choose N := N(ϵ) be to the smallest integer which is larger than C X,K 1 s with C X,K = 2(s+2)d/scd 0 K d/s C s c X. Then the set 2absconv AN is ϵ 2 dense in B1 due to (26). Recall that for any probability measure Q on X, d2,Q denotes the metric induced by the norm L2 Q. Therefore, we have N (B1, ϵ, d2,Q) N (2absconv AN, ϵ/2, d2,Q) = N (absconv AN, ϵ/4, d2,Q). To bound the latter, let N1 := N1(ϵ) be the integer part of ϵ 2d d+2s . We choose a sufficiently small ϵ > 0 satisfying (C X,K)s(d+2s)/d2, 2 1c 1/d X X,s s+ d := ϵX,K, (27) then XN1 XN and N1 X,s. For any g absconv AN, there exists {νi}|XN| i=1 R|XN| with P|XN| i=1 |νi| 1, such that i=1 νi K(x, vi) := g1(x) + g2(x), i=1 νi K(x, vi) + i=|XN1|+1 νi j IN1(vi) bj(vi)K(x, vj) i=|XN1|+1 νi j IN1(vi) bj(vi)K(x, vj) From the expression of g1, we see that it is a linear combination of {K(x, vi)|vi XN1}. Similarly, one can check that the summation of the absolute value of the combinational Shi, Huang, Feng and Suykens coefficients is still bounded by 2. Hence, we have g1 2absconv AN1 and g2 absconv KN1,N with j IN1(vi) bj(vi)K(x, vj) Therefore, absconv AN 2absconv AN1 + absconv KN1,N. And it follows that N (absconv AN, ϵ/4, d2,Q) N (absconv AN1, ϵ/16, d2,Q) N (absconv KN1,N, ϵ/8, d2,Q) . (28) To estimate the first term, we choose some suitable ϵ1 > 0 and N2 := N2(ϵ1), such that the function set {fj}N2 j=1 is the maximal ϵ1 separated set in absconv AN1 with respect to the distance d2,Q. Hence, for any f absconv AN1, f must belong to one of the N2 balls of radius ϵ1 centered at fj. We denote these balls by B(fj, ϵ1), j = 1, , N2. Moreover, as K is an admissible kernel, any function in absconv AN1 has an unique expression as f(x) = P|XN1| i=1 νf i K(x, vi) with P|XN1| i=1 |νf i | 1. We define a mapping by Φ(f) = (νf 1 , , νf |XN1|) R|XN1|. Under this mapping, the image of B(fj, ϵ1) is given by Im(B(fj, ϵ1)) = {νi} |XN1| i=1 i=1 |νi| 1 and i=1 (νi νfj i )2 ϵ1 BR|XN1 |(νfj, ϵ1), (29) where BR|XN1 |(νfj, ϵ1) denotes the ball in R|XN1| of radius ϵ1 centered at νfj = (νfj 1 , , νfj |XN1|). 
Furthermore, for f, g absconv AN1, there holds i=1 (νf i νg i )2, (30) where κ = K C (X X). Next for any ϵ2 > 0, from (30) and (29), we obtain N (B(fj, ϵ1), ϵ2, d2,Q) N (Im(B(fj, ϵ1)), ϵ2/κ, d2) N (BR|XN1 |(νfj, ϵ1), ϵ2/κ, d2) = N (ϵ1BR|XN1 |, ϵ2/κ, d2), (31) where BR|XN1 | denotes the unit ball in R|XN1|. Recall that N2 is the cardinality of the maximal ϵ1 separated set in absconv AN1. Then N2 N (absconv AN1, ϵ1/2, d2,Q). Hence, Sparse Kernel Regression for any positive ϵ1 and ϵ2, there holds N (absconv AN1, ϵ2, d2,Q) j=1 N (B(fj, ϵ1), ϵ2, d2,Q) N (ϵ1BR|XN1 |, ϵ2/κ, d2) N (absconv AN1, ϵ1/2, d2,Q) , (32) where the second inequality is from (31). Now we set ϵ2 = ϵ/16 and ϵ1 = 1/(16κ N1), then by equation (32), there holds N (absconv AN1, ϵ/16, d2,Q) N (BR|XN1 |, ϵ p N1, d2) N absconv AN1, 1/(32κ p N1), d2,Q . And from equation (28), we further find that N (absconv AN, ϵ/4, d2,Q) N (BR|XN1 |, ϵ p N absconv AN1, 1/(32κ p N1), d2,Q N (absconv KN1,N, ϵ/8, d2,Q) . (33) It is noticed that R|XN1| is a finite dimensional space and |XN1| N1, then N (BR|XN1 |, ϵ p N1, d2) (1 + 2N 1/2 1 ϵ 1)|XN1| 3N1(N1) N1/2ϵ N1. (34) Next, we use Lemma 13 to estimate the rest two terms. For the function set AN1, it is easy to check that diam AN1 2κ. Then by (24), there holds N absconv AN1, 1/(32κ p N1), d2,Q e + e 1368κ2 8192κ2N1 . (35) Recall that N1 X,s, then we have Diam KN1,N 4cs 0 K C s s N1 and the cardinality of KN1,N satisfies that |KN1,N| = |XN| |XN1| N 1. Also by (24), there holds N (absconv KN1,N, ϵ/8, d2,Q) = N absconv KN1,N, ϵ 32cs 0 K C s s N1 4cs 0 K C s s N1, L2 Q e + e(2N 1) ϵ2 1024c2s 0 K 2 C s 2s N1 ! 2048c2s 0 K 2 Cs 2s N1 ϵ2 . Due to the choice of N1, we have ϵ 2d d+2s 1 < N1 ϵ 2d d+2s . Combining the restric- tion (27) for ϵ, we have 2s N1 = 22sc d X (N1) 2s d X ϵ 4s d+2s . It follows that 2s N1ϵ 2 d X ϵ 2d d+2s . On the other hand, one can bound ϵ2 2s N1 by 2 2sc 2s d X ϵ 2d d+2s . Recalling that s + 1, we then have N (absconv KN1,N, ϵ/8, d2,Q) e + e C X,s 1 ϵ d2 s(d+2s) 22s+8c2s/d X c2s 0 K 2 C s 24s+11c2s/d X c2s 0 K 2 Csϵ 2d d+2s Shi, Huang, Feng and Suykens When 0 < ϵ ϵX,K, substituting the bounds (34), (35) and (36) into the inequality (33), we obtain log2 N (B1, ϵ, d2,Q) C X,sϵ 2d d+2s log2 C X,K = 2 + d 1 + 8192κ2 log2 e + e 1368κ2 +24s+11c2s/d X ds 1c2s 0 K 2 C s log2 e + e C X,s 22s+8c2s/d X c2s 0 K 2 C s Note that B1 can be considered as a subset of C (X). We use d to denote the metric induced by . Then for ϵX,s < ϵ 1, we find that log2 N (B1, ϵ, d2,Q) log2 N (B1, ϵX,s/κ, d ) log2 N (B1, ϵX,s/κ, d )ϵ 2d d+2s log2 We take CX,K = max{C X,K, log2 N (B1, ϵX,s/κ, d )}, then log2 N (B1, ϵ, d2,Q) CX,Kϵ 2d d+2s log2 Notice that the above bound is independent of the distribution Q. Then for any sample x = {xi}ℓ i=1 Xℓwith ℓ N, the above estimate holds true for Q = 1 ℓ Pℓ i=1 δxi. Consequently, we prove our conclusion if s is an integer. When s is not an integer, one can check that the proof is also valid if we replace s by its integral part s . Thus we complete the proof of Theorem 11. 4. Convergence analysis In this section, we investigate the convergence behavior of ℓq kernel regression based on concentration techniques involving ℓ2 empirical covering numbers. Our error analysis presented in this part is under the capacity assumption (20). For an admissible kernel K C s with s > 0, recall the previous result obtained in Shi et al. (2011) for 0 < s < 2. Then under the assumption of Theorem 11, one can check that the assumption (20) can be satisfied with 0 < p < 2. 
Concretely, when 0 < s < 2, p = 2d d+2 min{1,s}; when s 2, p can be chosen to be any constant satisfying p > 2d d+2 s . 4.1. Deriving the estimator for the total error Recall the quantity S1 defined for an estimator ˆf in Proposition 6, we can rewrite it as S1 = Sz( ˆf) Sz(fz,γ), where the quantity Sz is defined for f C (X) by Sz(f) = {E (f) E (fρ)} {Ez(f) Ez(fρ)} . We use the following proposition to estimate S1. Recall that BR is defined as (19). Sparse Kernel Regression Proposition 14 If B1 satisfies the capacity condition (20) with 0 < p < 2, then for any R 1 and 0 < δ < 1, with confidence 1 δ, there hold Sz(πM(f)) 1 2 {E (πM(f)) E (fρ)} + 176M2 log(1/δ) + c K,p M2m 2 2+p R 2p 2+p , f BR, (37) 2 {E (f) E (fρ)} + 20(3M + κ)2R2 log(1/δ) + c K,p(3M + κ)2m 2 2+p R2, f BR. (38) Here c K,p > 0 is a constant depending only on p and the kernel function K and κ = K C (X X). The above bounds also hold true for Sz(πM(f)) and Sz(f). We shall prove Proposition 14 in the Appendix by using the entropy integral based on ℓ2 empirical covering numbers. Now we can derive the estimator for the total error. For R 1 and 0 < q 1, denote Wq(R) = z Zm : cz q q q R . (39) Proposition 15 Assume that approximation assumption (14) with 0 < β 1 and capacity condition (20) with 0 < p < 2 are valid, and the estimator ˆfq satisfies Assumption 1 with 0 < q < 1. If 0 < γ 1, 0 < δ < 1 and R 1, then there is a subset V (q) R of Zm with measure at most δ such that E (πM( ˆfq)) E (fρ) + γ cz q q q 2 c K,p M2m 2 2+p R 2p q(2+p) + C1 log3(18/δ) max n γβ 2m 2, γβ 1m 2 2+p o +(3 cβ + 6cβ)γβ + 2γm1 q cz 1 q 1, z Wq(R)\V (q) R . (40) Moreover, when q = 1, there are subsets V (1) R and e V (1) R of Zm with each measure at most δ such that E (πM( ˆf1)) E (fρ) + γ cz 1 1 2 c K,p M2m 2 2+p R 2p 2+p + C1 log3(18/δ) max n γβ 2m 2, γβ 1m 2 2+p o +(3 cβ + 6cβ)γβ, z W1(R)\V (1) R , (41) E ( ˆf1) E (fρ) + γ cz 1 1 2(20 + c K,p)(3M + κ)2 log(3/δ)m 2 2+p R2 + e C1 log3(18/δ) max n γβ 2m 2, γβ 1m 2 2+p , γβo , z W1(R)\e V (1) R . (42) Here C1 and e C1 are positive constants depending on κ, c K,p, M and cβ. Shi, Huang, Feng and Suykens Proof Let ˆf be the estimator under consideration. We estimate S1 by bounding Sz( ˆf) and Sz(fz,γ) respectively. Due to Proposition 14, the quantities Sz(fz,γ) and Sz(fz,γ) have the same upper bound. We thus turn to estimate Sz(fz,γ). For (γ, δ) (0, 1)2, let R(γ, δ) := 2κ p D(γ) log (6/δ) 2D(γ) log (6/δ) From the proof of Proposition 5 in Guo and Shi (2012), there exist two subsets of Zm denoted by U1 and U2 with ρ(Ui) δ/3, i = 1, 2, such that fz,γ BR(γ,δ), z Zm\U1, (43) E (fz,γ) E (fρ) 8κ2 2κ2 + 1 log2(6/δ) D(γ) γ2m2 + D(γ) + 2D(γ), z Zm\U2. (44) We use inequality (38) to bound Sz(fz,γ). Now (43) and (44) are both valid for fz,γ with z Zm\(U1 U2). Then there exists a subset U3 of Zm with measure at most δ/3 such that Sz(fz,γ) 4κ2 2κ2 + 1 log2(6/δ) D(γ) γ2m2 + D(γ) +D(γ) + 20(3M + κ)2(R(γ, δ))2 log(3/δ) + c K,p(3M + κ)2m 2 2+p (R(γ, δ))2, z Zm\(U1 U2 U3). Note that ρ(U1 U2 U3) δ and (R(γ, δ))2 (8κ2 + 4) log2(6/δ) D(γ) γ2m2 + D(γ) Therefore, with confidence 1 δ, there holds Sz(fz,γ) Cκ log2(6/δ) max D(γ) γ2m2 , D(γ) +(20 + c K,p)Cκ,M log3(6/δ)m 2 2+p max D(γ) γ2m2 , D(γ) where Cκ = 8κ2 2κ2 + 1 and Cκ,M = 3(3M + κ)2(8κ2 + 4). Next, for ˆf = ˆf1 or πM( ˆfq) with 0 < q 1, we consider bounding the quantity Sz( ˆf) with z W(R). It is easy if we notice that cz q q q R implies the corresponding estimator ˆfq BR1/q. Thus we can bound Sz( ˆf) by directly applying Proposition 14. 
When ˆf = πM( ˆfq) with 0 < q < 1, combining the bounds on S1 + S2 given in Lemma 7, we are able to derive the estimator for the total error bound. Due to (37), there exist subsets V (q) 1 with measure at most δ/3 such that Sz(πM( ˆfq)) 1 n E (πM( ˆfq)) E (fρ) o + 176M2 log(3/δ) + c K,p M2m 2 2+p R 2p q(2+p) , z Wq(R)\V (q) 1 . Sparse Kernel Regression Similarly, from (45) and Lemma 7, we can find subsets V2 and V3 such that Sz(fz,γ) (Cκ + CK,p) log3(18/δ) max D(γ) γ2m2 , D(γ) γm2/(2+p) + D(γ), z Zm\V2, S2 + S3 16κ2 2κ2 + 1 log2(12/δ) max D(γ) γ2m2 , D(γ) +(2κ + 1) p D(γ) log (12/δ) γD(γ) + 2D(γ), z Zm\V3, where CK,p = (20 + c K,p)Cκ,M and ρ(Vi) δ/3 for i = 2, 3. Let V (q) R = V (q) 1 V2 V3, then ρ(V (q) R ) δ. Recall the error decomposition formula (16). We combine the above three bounds and obtain E (πM( ˆfq)) E (fρ) + γ cz q q q n E (πM( ˆfq)) E (fρ) o + (176M2 + 16κ2(2κ2 + 1)) log2(12/δ) max 1 + c K,p M2m 2 2+p R 2p q(2+p) + (Cκ + CK,p + 16κ2 2κ2 + 1 ) log3(18/δ) max D(γ) γ2m2 , D(γ) γm2/(2+p) +(2κ + 1) p D(γ) log (12/δ) γD(γ) + 3D(γ) + γm1 q cz 1 q 1, z Wq(R)\V (q) R . Due to this inequality, we use Approximation condition (14) and find that the inequality (40) holds true with C1 = 2((20 + c K,p)Cκ,M + 176M2 + 40κ2 2κ2 + 1 )cβ + (4κ + 2) cβ. Following the same method, when ˆf = πM( ˆf1), we can prove inequality (41) by using the error decomposition formula (17) . We next focus on the case ˆf = ˆf1. Due to inequality (38), for R 1, with confidence 1 δ/3, there holds n E ( ˆf1) E (fρ) o + (20 + c K,p)(3M + κ)2 log(3/δ)m 2 2+p R2, z W1(R). Also by the same analysis above, we obtain inequality (42) with e C1 = 6((20 + c K,p)(3M + κ)2(8κ2 + 4) + 8κ2 2κ2 + 1 + 1)cβ + (4κ + 5) cβ. Thus we complete our proof. 4.2. Bounding cz q q by iteration To apply Proposition 15 for error analysis, we need to determine some R 1 for the set Wq(R) given by (39). To this end, we shall apply an iteration technique to obtain a tight Shi, Huang, Feng and Suykens bound for cz q q q. This technique can be found in Steinwart and Scovel (2007); Wu et al. (2006). Take the case q = 1 for example. Recall that cz 1 is the global minimizer of the target functional Tγ,1 defined by (3). Hence Tγ,1(cz 1) Tγ,1(0). This inequality leads to a trivial bound on cz 1 1, which is given by γ , z Zm. (46) By applying inequality (41) iteratively, we are able to improve the above bound to the order O(γβ 1) with some suitable choice of γ = γ(m). Similarly, when 0 < q < 1, a better estimates for cz q q q can be derived based on the inequality (40). The following proposition illustrates the detailed process of iteration. Proposition 16 Under the assumptions of Proposition 15, let (δ, τ) (0, 1)2 and J(τ, p, q) be a constant given by J(τ, p, q) = max 2, log (2 (p+2)τ)p (q pτ)(p+2) log 2p q(p+2) For q = 1, take γ = m τ with 0 < τ < 2 2+p. Then with confidence 1 δ, there holds cz 1 1 C2 (log(1/δ) + log(J(τ, p, 1)))3 m(1 β)τ. (48) For 0 < q < 1 satisfying q > max n 2p p+2, p p+2β o , take γ = m τ with 1 q 1 q(1 β) < τ < 2 2+p. Then with confidence 1 δ, there holds cz q q q e C2 (log J(τ, p, 1))3q (log(1/δ) + log(J(τ, p, q) + 1))3 m1 q+q(1 β)τ. (49) Here C2 and e C2 are positive constants depending only on κ, c K,p, M and cβ. Proof Take 0 < δ < 1 and γ = m τ under the restriction 0 < τ < 2 p+2. 
For 0 < q 1, we shall verify the following inequality derived from Proposition 15, that is cz q q q max{am Rθ, bm}, z Wq(R) \ V (q) R , (50) where V (q) R is a subset of Zm with measure at most δ, θ = 2 q(p+2), am = 4 c K,p M2mτ 2 2+p and bm will be given explicitly in accordance with specific situations. This inequality ensures that Wq(R) Wq max{am Rθ, bm} V (q) R (51) with ρ(V (q) R ) δ. We then apply the inclusion (51) for a sequence of radiuses {R(j)}j N defined by R(0) = Mδmτ with a suitable chosen Mδ > 0 and R(j) = max n am(R(j 1))θ, bm o , j N. (52) Sparse Kernel Regression Then (51) holds for each R(j), which implies Wq(R(j 1)) Wq(R(j)) V (q) R(j 1) with ρ(V (q) R(j 1)) δ. Applying this inclusion for j = 1, 2, . . . , J with J to be determined later, we see that Wq(R(0)) Wq(R(1)) V (q) R(0) Wq(R(J)) J 1 j=0 V (q) R(j) . By the definition of the sequence {R(j)}j, we thus have R(J) = max a1+θ+θ2+...+θJ 1 m (R(0))θJ, a1+θ+θ2+...+θJ 2 m bθJ 1 m , . . . , ambθ m, bm Recall that R(0) = Mδmτ. When θ = 1, the first term on the right-hand side of equation (53) can be bounded as a1+θ+θ2+...+θJ 1 m (R(0))θJ = a(θJ 1)/(θ 1) m (R(0))θJ (4 c K,p M2)(θJ 1)/(θ 1)MθJ δ m θ 1 2(θJ 1) (p+2)(θ 1) . For the remaining terms, we find that max n a1+θ+θ2+...+θJ 2 m bθJ 1 m , . . . , ambθ m, bm o = max n (am)(θJ 1 1)/(θ 1)(bm)θJ 1, . . . , ambθ m, bm o = max n a(θJ 1 1)/(θ 1) m bθJ 1 m , bm o . Therefore, we obtain R(J) max Aθ,JMθJ δ m θ 1 2(θJ 1) (p+2)(θ 1) , Aθ,J 1m(τ 2 2+p)(θJ 1 1)/(θ 1)bθJ 1 m , bm where Aθ,J is defined as Aθ,J = (4 c K,p M2)(θJ 1)/(θ 1) for any J N and θ = 1. Now we go back to our concrete examples. Due to the above analysis, we need to determine bm, Mδ and J depending on the concerned cases. When q = 1, then θ = 2p p+2 < 1 and we can take bm = C 1 log3(18/δ)m(1 β)τ with C 1 = 2C1 +6 cβ +12cβ. One can check that the inequality (50) is valid due to (41). Recall the trivial bound (46) on cz 1 1, which implies cz 1 1 M2/γ = M2mτ for z Zm. We thus take Mδ = M2 and R(0) = M2mτ. From (54), we further find that R(J) max 4 c K,p M2 2+p 2 p M2, C 1 log3(18/δ) 4 c K,p M2 2+p 2 p , C 1 log3(18/δ) mθ1, where the power index θ1 is given by θ1 = max 2p p 2 (1 β)τ, (1 β)τ + ! (1 β)τ + (p + 2)τ 2 Shi, Huang, Feng and Suykens Let J be the smallest integer such that 2p p 2 i.e., J is the integer satisfying 1, log 2 (p+2)τ 2 2pτ log 2p p+2 2, log (2 (p+2)τ)p (1 pτ)(p+2) log 2p p+2 Then θ1 = (1 β)τ and R(J) 4 c K,p M2 2+p 2 p M2 + C 1 log3(18/δ)m(1 β)τ. Since Zm = W1(R(0)) and ρ J 1 j=0 V (1) R(j) Jδ, the measure of the set W1(R(J)) is at least 1 Jδ. By scaling Jδ to δ, we derive the bound (48) with C2 = 64 4 c K,p M2 2+p 2 p M2 + C 1 Next we turn to the case 0 < q < 1 with q > max n 2p p+2, p p+2β o . The estimation in this case is more involved. In order to determine bm and R(0) in this case, we need to use the result just obtained for cz 1 1. In fact, for any 0 < δ < 1 and 0 < τ < 2 p+2, with confidence 1 δ/2 there holds cz 1 1 C2 (log J(τ, p, 1) + log(2/δ))3 m(1 β)τ, (55) where J(τ, p, 1) is given by (47) at q = 1. The inequality (40) asserts that, with confidence 1 δ/2, there holds E (πM( ˆfq)) E (fρ) + γ cz q q q 2 c K,p M2m 2 2+p R 2C 1 log3(36/δ)γβ + 2γm1 q cz 1 q 1, z Wq(R). (56) It is easy to check under the restriction q > max n 2p p+2, p p+2β o , there holds 1 q 1 q(1 β) < 2 2+p and 2p q(2+p) < 1. Then we can take γ = m τ with 1 q 1 q(1 β) < τ < 2 2+p. 
Combining (55) and (56), we find that (50) is satisfied with bm = Bδm α, where Bδ = C 1 + 4Cq 2 (log J(τ, p, 1))3q log3(36/δ) and α = 1 q + q(1 β)τ. We further define R(0) = Mδmτ and Mδ = M2 + Cq 2 (log J(τ, p, 1) + log(1/δ))3q . (57) Recall that the estimator ˆfq satisfies Assumption 1. Then we have γ cz q q q Tγ,q(cz q) Tγ,q(cz 1) Tγ,1(cz 1) + γm1 q cz 1 q 1 Tγ,1(0) + γm1 q cz 1 q 1. Sparse Kernel Regression Due to this inequality and the bound on cz 1 1, one can easily check that with confidence 1 δ, there holds cz q q q M2mτ + m1 q cz 1 q 1 Mδmτ, where the last inequality is due to τ > 1 q 1 q(1 β). Hence the measure of the set Wq(R(0)) is at least 1 δ. We thus can iteratively define R(j) as (52) with R(0) given by (57). Now we can follow the same fashion to derive our desire bounds. Since θ = 2p q(p+2) < 1, due to the inequality (54), we have R(J) max Aθ,JMθJ δ , Aθ,J 1BθJ 1 δ , Bδ where θ2 is given by max 2p 2p q(p + 2) 2p q(p + 2) J q(p + 2) 2p q(p + 2) τ 2q 2p q(p + 2) 2p q(p + 2) 2p q(p + 2) ! α + q(p + 2) 2p q(p + 2)τ 2q 2p q(p + 2) We chose J to be the smallest integer such that 1, log 2q q(p+2)τ 2q 2pτ log 2p q(p+2) 2, log (2 (p+2)τ)p (q pτ)(p+2) log 2p q(p+2) Under the choice of J, we have θ2 = α, R(J) (M2 + C 1 + 4Cq 2) 4 c K,p M2 2+p 2 p (log J(τ, p, 1))3q log3(36/δ)m α and the measure of the set W1(R(J)) is at least 1 (J + 1)δ. We thus can derive the inequality (49) with e C2 = 64(M2 + C 1 + 4Cq 2) 4 c K,p M2 2+p by scaling (J + 1)δ to δ. 4.3. Deriving the convergence rates In this subsection, we estimate the total error for algorithm (2) and derive the explicit convergence rates. Recall that for 0 < q 1, ˆfq denotes the estimator given by the algorithm (2). Theorem 17 Assume that approximation assumption (14) with 0 < β 1 and capacity condition (20) with 0 < p < 2 are valid. Let (δ, τ, q) (0, 1]3 and J(τ, p, q) be a constant Shi, Huang, Feng and Suykens given by (47). When q = 1, take γ = m τ with 0 < τ < 2 2+p. Then with confidence 1 δ, there holds ˆf1 fρ 2 L2ρX C3 log(6/δ) (log(2/δ) + log(J(τ, p, 1)))6 m eΘ, (58) where eΘ = min 2 2 + p 2(1 β)τ, βτ . (59) Additionally, if an estimator ˆfq satisfies Assumption 1 with max n 2p p+2, p p+2β o < q < 1. Take γ = m τ with 1 q 1 q(1 β) < τ < 2 2+p. Then with confidence 1 δ, there holds πM( ˆfq) fρ 2 L2ρX e C3 (log(9/δ) + log(J(τ, p, 1)) + log(J(τ, p, q) + 1))3 m (τ α), (60) where α = 1 q + q(1 β)τ. Here C3 and e C3 are positive constants independent of m or δ. Proof Recall that for any estimator ˆf under consideration, there holds ˆf fρ 2 L2ρX = E ( ˆf) E (fρ). Therefore, due to inequality (42) in Proposition 15, there is a subset e VR of Zm with measure at most δ such that ˆf fρ 2 L2ρX 2(20 + c K,p)(3M + κ)2 log(3/δ)m 2 2+p R2 + e C1 log3(18/δ) max n γβ 2m 2, γβ 1m 2 2+p , γβo , z W1(R)\e VR. Now we choose R to be the right-hand side of inequality (48), i.e., R = C2 (log(1/δ) + log(J(τ, p, 1)))3 m(1 β)τ. Then the measure of the set W1(R)\ e VR is at least 1 2δ. So with confidence at least 1 2δ, there holds ˆf1 fρ 2 L2ρX C3 max n log3(3/δ), log(3/δ) (log(1/δ) + log(J(τ, p, 1)))6o m eΘ, where eΘ is given by (59) and C3 = 54(20 + c K,p)(3M + κ)2C2 2 + 27 e C1. Then we obtain the error bound (58) by scaling 2δ to δ. Next we prove the error bound (60). 
Recalling the total error bound (40), for some R 1, there is a subset V (q) R with measure at most δ such that πM( ˆfq) fρ 2 L2ρX 2 c K,p M2m 2 2+p R + e C 1 log3(18/δ) max n γβ 2m 2, γβ 1m 2 2+p , γβo + 2γm1 q cz 1 q 1, z Wq(R)\V (q) R , Sparse Kernel Regression where e C 1 = C1 + 3 cβ + 6cβ. Using the same argument, the inequality (48) ensures the existence of a subset V 1 with measure at most δ such that cz 1 1 C2 (log(1/δ) + log(J(τ, p, 1)))3 m(1 β)τ, z Zm\V 1. If we take R to be the right-hand side of the inequality (49), i.e., R = e C2 (log J(τ, p, 1))3q (log(1/δ) + log(1 + J(τ, p, q)))3 m α with α = 1 q +q(1 β)τ. Then the measure of the set Wq(R) is at least 1 δ. Combining these bounds, for z Wq(R)\(V (q) R V 1), there holds πM( ˆfq) fρ 2 L2ρX e C3 log3(18/δ) (log(1/δ) + log(J(τ, p, 1)) + log(1 + J(τ, p, q)))max n 6p q(2+p),3q o m eΘ1, where e C3 = 2 c K,p M2 e C 2p q(2+p) 2 + e C 1 + 2Cq 2 and eΘ1 = min 2 2 + p 2p q(p + 2) α, 2 2 + p (1 β)τ, βτ, τ α Note that the measure of the set Wq(R)\(V (q) R V 1) is at least 1 3δ. And under the restrictions on τ, p and q, we have eΘ1 = τ α and max n 6p q(2+p), 3q o 3. Then the error bound (60) follows by scaling 3δ to δ. Thus we complete the proof. Now we can prove Theorem 1 based on the error bound (58). Proof [Proof of Theorem 1]. Recall that the function space H e K is in the range of the integral operator L1/2 e K . Hence the assumption fρ H e K implies the approximation condition (14) is valid with β = 1 (see Proposition 8.5 in Cucker and Zhou (2007)). The assumption on the input space X ensures that X satisfies an interior cone condition (see Definition 9). Then for an admissible kernel K C s with s > 0, due to the previous result obtained in Shi et al. (2011) and Theorem 11, one can check that the capacity assumption (20) is achieved. Concretely, when 0 < s < 2, p = 2d d+2 min{1,s}; when s 2, p can be chosen to be any constant satisfying p > 2d d+2 s . We just focus on the case s 2. For any 0 < ϵ Θ 1 2 with Θ given by (5), the capacity assumption is achieved for p > 2d d+2 s satisfying 2 2+p = Θ ϵ 2. We thus set τ = Θ ϵ. Next we need to bound the quantity J(τ, p, 1). In fact, as (2 (p+2)τ)p (1 pτ)(p+2) = 1 p+2 2 τ 1 pτ 2p p+2, we further have log (2 (p+2)τ)p (1 pτ)(p+2) log 2p p+2 = log 1 pτ 1 p+2 2 τ log p+2 Then we substitute p = 2 Θ ϵ/2 2 and τ = Θ ϵ into the above equation. Combining the restrictions for Θ and ϵ, we obtain log (2 (p+2)τ)p (1 pτ)(p+2) log 2p p+2 log 4 ϵ(1 ϵ) log 1 1 ϵ Shi, Huang, Feng and Suykens J(τ, p, 1) = max 2, log (2 (p+2)τ)p (1 pτ)(p+2) log 2p p+2 log 4 ϵ(1 ϵ) log 1 1 ϵ := Jϵ. Now recall the general bound (58). Since β = 1, one can check that eΘ = τ = Θ ϵ and the bound (6) is valid with Cϵ = C3(Jϵ + 1)6. As the error bound for the case 0 < s < 2 can be derived following the same way, we thus complete the proof of Theorem 1. 5. Sparsity analysis In this section, we shall derive an asymptotical upper bound on cz q 0 m , where cz q denotes the coefficient sequence of the estimator ˆfq with 0 < q < 1. In order to do so, we need a lower bound on the value of the non-zero elements in cz q. Recall that for any vector c = (c1, , cm) Rm, the support set of c is given by supp(c) := {j {1, , m} : cj = 0}. Proposition 18 Let I {1, 2, , m} be a non-empty index set, 0 < q < 1 and c q = (c q,1, , c q,m) be a local minimizer of the following optimization problem min c Rm,supp(c) I i=1 ci K(xj, xi) !2 + γ c q q Then for i supp(c q), there holds |c q,i| > q(1 q) 1/(2 q) γ1/(2 q), (62) where κ = K C (X X). 
Moreover, if $c^*_q$ is a global minimizer of the optimization problem (61), the above bound can be improved to
$$|c^*_{q,i}| \;\ge\; \Big(\frac{1-q}{\kappa^2}\Big)^{1/(2-q)}\,\gamma^{1/(2-q)}. \qquad (63)$$
Proof Recall the target functional (3) to be optimized over $c = (c_1, \ldots, c_m) \in \mathbb{R}^m$, which can be expressed as
$$\mathcal T_{\gamma,q}(c) = \frac{1}{m}\sum_{j=1}^m\Big(\sum_{i=1}^m c_i K(x_j, x_i) - y_j\Big)^2 + \gamma\|c\|_q^q.$$
For $i \in \mathrm{supp}(c^*_q)$, we define a univariate function
$$h_i(t) = \mathcal T_{\gamma,q}\big(c^{*\backslash i}_q(t)\big), \qquad t \in \mathbb{R}, \qquad \text{where } c^{*\backslash i}_q(t) = (c^*_{q,1}, \ldots, c^*_{q,i-1}, t, c^*_{q,i+1}, \ldots, c^*_{q,m}).$$
As $c^*_q$ is a local minimizer of the optimization problem (61), $c^*_{q,i}$ is a local minimizer of $h_i(t)$. We compute the first and second derivatives of $h_i(t)$ on the interval $(c^*_{q,i} - \varrho,\, c^*_{q,i} + \varrho)$ for some $\varrho > 0$. Since $c^*_{q,i} \neq 0$, we can choose $\varrho$ small enough that $0$ is not contained in the interval $(c^*_{q,i} - \varrho,\, c^*_{q,i} + \varrho)$. Then we obtain
$$h_i'(t) = \frac{2}{m}\sum_{j=1}^m\Big(\sum_{k \neq i} c^*_{q,k} K(x_j, x_k) + t K(x_j, x_i) - y_j\Big)K(x_j, x_i) + \gamma q\,\mathrm{sgn}(t)|t|^{q-1},$$
$$h_i''(t) = \frac{2}{m}\sum_{j=1}^m K^2(x_j, x_i) - \gamma q(1-q)|t|^{q-2}.$$
Recall that $c^*_{q,i}$ is a local minimizer of $h_i(t)$. Due to the optimality conditions, we must have $h_i'(c^*_{q,i}) = 0$ and $h_i''(c^*_{q,i}) > 0$, i.e.,
$$\frac{2}{m}\sum_{j=1}^m\Big(\sum_{k=1}^m c^*_{q,k} K(x_j, x_k) - y_j\Big)K(x_j, x_i) + \gamma q\,\mathrm{sgn}(c^*_{q,i})|c^*_{q,i}|^{q-1} = 0 \qquad (64)$$
$$\frac{2}{m}\sum_{j=1}^m K^2(x_j, x_i) - \gamma q(1-q)|c^*_{q,i}|^{q-2} > 0. \qquad (65)$$
From the inequality (65), we have
$$\gamma q(1-q)|c^*_{q,i}|^{q-2} < \frac{2}{m}\sum_{j=1}^m K^2(x_j, x_i) \le 2\kappa^2,$$
which leads to bound (62). If $c^*_q$ is the global minimizer of the optimization problem (61), the global optimality of $c^*_q$ implies that $c^*_{q,i}$ is a global minimizer of $h_i(t)$. We thus have $h_i'(c^*_{q,i}) = 0$. Due to equation (64), there holds
$$-\frac{2}{m}\sum_{j=1}^m\Big(\sum_{k=1}^m c^*_{q,k} K(x_j, x_k) - y_j\Big)K(x_j, x_i) = \gamma q\,\mathrm{sgn}(c^*_{q,i})|c^*_{q,i}|^{q-1}.$$
We multiply both sides of the above equation by $c^*_{q,i}$ and find that
$$-\frac{2}{m}\sum_{j=1}^m\Big(\sum_{k=1}^m c^*_{q,k} K(x_j, x_k) - y_j\Big)K(x_j, x_i)\,c^*_{q,i} = \gamma q|c^*_{q,i}|^q. \qquad (66)$$
Now we consider the new vector
$$c^{*\backslash i}_q(0) = (c^*_{q,1}, \ldots, c^*_{q,i-1}, 0, c^*_{q,i+1}, \ldots, c^*_{q,m}).$$
We compare $\mathcal T_{\gamma,q}(c^{*\backslash i}_q(0))$ with $\mathcal T_{\gamma,q}(c^*_q)$ and find that
$$\mathcal T_{\gamma,q}(c^*_q) - \mathcal T_{\gamma,q}(c^{*\backslash i}_q(0)) = \frac{2}{m}\sum_{j=1}^m\Big(\sum_{k=1}^m c^*_{q,k}K(x_j,x_k) - y_j\Big)K(x_j,x_i)\,c^*_{q,i} - \frac{1}{m}\sum_{j=1}^m K^2(x_j,x_i)(c^*_{q,i})^2 + \gamma|c^*_{q,i}|^q.$$
From the equality (66), it follows that
$$\mathcal T_{\gamma,q}(c^*_q) - \mathcal T_{\gamma,q}(c^{*\backslash i}_q(0)) = -\gamma q|c^*_{q,i}|^q - \frac{1}{m}\sum_{j=1}^m K^2(x_j,x_i)(c^*_{q,i})^2 + \gamma|c^*_{q,i}|^q \;\ge\; \gamma(1-q)|c^*_{q,i}|^q - \kappa^2(c^*_{q,i})^2.$$
Recall that $c^*_q$ is a global minimizer of problem (61), so the left-hand side is at most $0$. The above inequality therefore implies
$$\gamma(1-q)|c^*_{q,i}|^q - \kappa^2(c^*_{q,i})^2 \le 0,$$
which leads to the lower bound (63). Thus we complete our proof.
Remark 19 We use the second-order optimality condition to derive the lower bound on the non-zero coefficients of a local minimizer of (61). Based on the same idea, a lower bound similar to (62) was derived in Chen et al. (2010). Our analysis, however, gives a further lower bound when the solution is a global minimizer, which indicates that global optimality helps to improve the sparsity of the solutions. It should be noticed that, for a given kernel function K, the lower bounds presented in Proposition 18 depend only on $\gamma$ and $q$. Thus we can use these bounds to obtain some universal results. When $c^z_q$ is a local minimizer of the optimization problem (2) satisfying Assumption 1, it satisfies the conditions of Proposition 18 with $I = \{1, \ldots, m\}$, and we can derive an upper bound on $\|c^z_q\|_0/m$ from the lower bound (62).
Theorem 20 Assume that the approximation condition (14) with $0 < \beta \le 1$ and the capacity condition (20) with $0 < p < 2$ are valid, and that the estimator $\hat f_q$ satisfies Assumption 1 with $\max\big\{\tfrac{2p}{p+2}, \tfrac{p}{p+2\beta}\big\} < q < 1$. Let $(\delta, \tau) \in (0,1)^2$ and let $J(\tau, p, q)$ be the constant given by (47). Take $\gamma = m^{-\tau}$ with $\tfrac{1-q}{1-q(1-\beta)} < \tau < \tfrac{2}{2+p}$.
Then with confidence $1 - \delta$, there holds
$$\frac{\|c^z_q\|_0}{m} \;\le\; C_q\,\big(\log J(\tau, p, 1)\big)^{3q}\,\big(\log(1/\delta) + \log(1 + J(\tau, p, q))\big)^{3}\, m^{q\left(\frac{\tau}{2-q} - 1 + (1-\beta)\tau\right)}, \qquad (67)$$
where $c^z_q$ denotes the coefficient sequence of the estimator $\hat f_q$ and $C_q$ is a positive constant depending only on $q$, $\kappa$, $c_{K,p}$, $M$ and $c_\beta$.
Proof From Proposition 18, if $c^z_q$ is a local minimizer of problem (2), then for $i \in \mathrm{supp}(c^z_q)$ there holds
$$|c^z_{q,i}| > \Big(\frac{q(1-q)}{2\kappa^2}\Big)^{1/(2-q)}\gamma^{1/(2-q)}.$$
Equivalently, we have
$$1 < \Big(\frac{q(1-q)}{2\kappa^2}\Big)^{-q/(2-q)}\gamma^{-q/(2-q)}|c^z_{q,i}|^q.$$
Note that the above inequality holds for every $c^z_{q,i}$ with $i \in \mathrm{supp}(c^z_q)$. Therefore, summing both sides over $i \in \mathrm{supp}(c^z_q)$ yields
$$\|c^z_q\|_0 < \Big(\frac{q(1-q)}{2\kappa^2}\Big)^{-q/(2-q)}\gamma^{-q/(2-q)}\|c^z_q\|_q^q.$$
Now taking $\gamma = m^{-\tau}$ with $\tfrac{1-q}{1-q(1-\beta)} < \tau < \tfrac{2}{2+p}$ and recalling the bound on $\|c^z_q\|_q^q$ given by (49), the desired result is obtained with $C_q = \widetilde C_2\big(\tfrac{q(1-q)}{2\kappa^2}\big)^{-q/(2-q)}$. Thus we complete the proof.
If $c^z_q$ is a global minimizer of algorithm (2), one can derive an upper bound on $\|c^z_q\|_0/m$ from (63) by the same method. From the upper bound (67), we see that if $\tau < \tfrac{2-q}{1+(2-q)(1-\beta)}$, the quantity $\|c^z_q\|_0/m$ converges to $0$ at a polynomial rate as $m$ tends to infinity.
At the end of this section, we give the proof of Theorem 3.
Proof [Proof of Theorem 3]. From the proof of Theorem 1, we know that under the assumptions of Theorem 3 the approximation condition (14) is valid with $\beta = 1$ and the capacity condition (20) can be satisfied with an arbitrarily small $p > 0$. We derive our conclusions from Theorem 17 and Theorem 20. We first check the restrictions on $q$ and $\tau$. As $\beta = 1$, they read $1 - q < \tau < \tfrac{2}{p+2}$ and $\tfrac{2p}{2+p} < q < 1$. Note that $p$ can be taken arbitrarily close to $0$. Therefore, we can take $\tau$ and $q$ to be any values in the interval $(0, 1)$ by choosing $p$ small enough. By the same argument as in the proof of Theorem 1, we can bound $J(\tau, p, q)$ by $\log\tfrac{4}{q(1-\tau)}$ for $1 - q < \tau < 1$ and $0 < q \le 1$. Therefore, from the bound (60), the inequality (7) holds with confidence $1 - \delta$ and constant $8\widetilde C_3$. Similarly, as a direct corollary of the bound (67), the inequality (8) holds with confidence $1 - \delta$ and the corresponding constant equal to $\widetilde C_2(1 + 2\kappa^2)$. Finally, for the two bounds to hold simultaneously we scale $2\delta$ to $\delta$. Thus we complete the proof.

6. Conclusion

In this paper, we investigate sparse kernel regression with ℓq regularization, where $0 < q \le 1$. The data-dependent nature of the kernel-based hypothesis space provides flexibility for this algorithm. The regularization scheme is essentially different from the standard one in an RKHS: the kernel is not necessarily symmetric or positive semi-definite, and the regularizer is the ℓq norm of the coefficients of a function expansion involving the samples. When the underlying hypothesis space is finite-dimensional, ℓq regularization with $0 < q \le 1$ is well understood in theory and widely used in various applications such as compressed sensing and sparse recovery. However, its role in non-parametric regression within an infinite-dimensional hypothesis space has not yet been fully understood. In this paper, we develop a mathematical analysis of the asymptotic convergence and sparsity of ℓq regularized kernel regression. We first present a tight bound on the ℓ2 empirical covering numbers of the kernel-based hypothesis space under an ℓ1 constraint, which is of interest in its own right. We demonstrate that, compared with a classical RKHS, the hypothesis space involved in the error analysis, induced by a possibly non-symmetric kernel, has nice behavior in terms of the ℓ2-empirical covering numbers of its unit ball.
Moreover, the empirical covering number estimates developed in this paper can also be applied to obtain distribution-free error analyses for other sparse approximation schemes, for example Guo, Fan and Zhou (2016); Guo et al. (2017). Our theoretical analysis is based on concentration estimates with ℓ2-empirical covering numbers, a refined iteration technique for ℓq regularization, and a descent-type iterative minimization process which can be realized by the ℓq thresholding function. Based on our analysis, we show that the ℓq regularization term plays the role of a trade-off between sparsity and convergence rates. We also prove that regularizing the combination coefficients by the ℓq norm can produce strongly sparse solutions, i.e., the fraction of non-zero coefficients converges to 0 at a polynomial rate as the sample size m becomes large. The mathematical analysis established in this paper sheds some light on the role of ℓq regularization for feature selection in an infinite-dimensional hypothesis space.

Acknowledgments

We thank the anonymous referees for their constructive suggestions. The work described in this paper is supported in part by the National Natural Science Foundation of China (Grants No. 11571078, 11631015, 61603248, 61977046); the Joint Research Fund by the National Natural Science Foundation of China and the Research Grants Council of Hong Kong (Project No. 11461161006 and Project No. CityU 104012); and the Program of Shanghai Subject Chief Scientist (Project No. 18XD1400700). The research leading to these results has also received funding from the European Research Council ERC AdG A-DATADRIVE-B (290923) and ERC AdG E-DUALITY (787960) under the European Union's Horizon 2020 research and innovation programme. The corresponding author is Xiaolin Huang.

Appendix A.

This appendix includes some detailed proofs.

A.1. Proof of Lemma 4

In this subsection, we shall prove Lemma 4.
Proof [Proof of Lemma 4]. It is easy to verify that the vector $\Psi_{\eta,q}(d)$ is a global minimizer of problem (12) if and only if $(\Psi_{\eta,q}(d))_i$ is a global minimizer of $\min_{c \in \mathbb{R}}\{|c - d_i|^2 + \eta|c|^q\}$, where $(\Psi_{\eta,q}(d))_i$ denotes the $i$-th coordinate of the vector $\Psi_{\eta,q}(d)$. In the following proof, we use $x$ to denote a given constant. In order to prove our conclusion, we need to find the global minimizer of the univariate function $h(t) = t^2 - 2xt + \eta|t|^q$, i.e., the solution of the minimization problem $\min_{t \in \mathbb{R}}\{|t - x|^2 + \eta|t|^q\}$. Additionally, to ensure the well-definedness of the map $\Psi_{\eta,q}$, we also need to show that, under the restriction $|x| > a_q\eta^{1/(2-q)}$, equation (10) has only one solution on the interval $[(q(1-q)\eta/2)^{1/(2-q)}, \infty)$.
When $x = 0$, since $\eta > 0$, the function $h(t)$ achieves its minimum at $t = 0$. We first limit our discussion to the case $x > 0$. In this case, the function $h$ is strictly decreasing on $(-\infty, 0]$, hence all its possible minimizers lie in $[0, \infty)$. Consider the difference between $h(t)$ and $h(0)$ for $t > 0$, i.e.,
$$h(t) - h(0) = t\,(t - 2x + \eta t^{q-1}).$$
The function $g(t) = t - 2x + \eta t^{q-1}$ is continuously differentiable on $(0, \infty)$ and $\lim_{t \to 0^+} g(t) = \lim_{t \to \infty} g(t) = \infty$. Taking its derivative on $(0, \infty)$, we find that its unique minimizer is $t_1 = ((1-q)\eta)^{1/(2-q)}$. It follows that
$$g(t) \;\ge\; g(t_1) = 2a_q\eta^{\frac{1}{2-q}} - 2x \qquad \text{for all } t > 0.$$
When $0 < x \le a_q\eta^{1/(2-q)}$, we have $\min_{t>0} g(t) = g(t_1) \ge 0$, which implies $h(t) - h(0) = t\,g(t) \ge 0$ for $t > 0$. Hence, $t = 0$ is a global minimizer of $h$ in this case.
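As an aside, the map $\psi_{\eta,q}$ defined by (9) and analyzed in this proof is straightforward to evaluate numerically. The following sketch is only an illustration of the construction above, not part of the proof; it uses plain bisection, only the Python standard library, and the hypothetical helper name psi_eta_q, with the threshold constant $a_q$ taken as $\frac{2-q}{2}(1-q)^{-\frac{1-q}{2-q}}$, the value read off from the identity $g(t_1) = 2a_q\eta^{1/(2-q)} - 2x$ displayed above.

import math

def psi_eta_q(x, eta, q, tol=1e-12, max_iter=200):
    # Global minimizer of t -> (t - x)^2 + eta * |t|^q for 0 < q < 1 (cf. Lemma 4).
    # Threshold constant a_q, so that min_{t>0} g(t) = 2 a_q eta^(1/(2-q)) - 2|x|.
    a_q = (2.0 - q) / 2.0 * (1.0 - q) ** (-(1.0 - q) / (2.0 - q))
    if abs(x) <= a_q * eta ** (1.0 / (2.0 - q)):
        return 0.0
    # Otherwise solve h'(t) = 2 t + eta q t^(q-1) - 2|x| = 0 by bisection on
    # [t_low, |x|]: h' is negative at t_low, positive at |x|, and increasing there.
    t_low = (q * (1.0 - q) * eta / 2.0) ** (1.0 / (2.0 - q))
    lo, hi = t_low, abs(x)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if 2.0 * mid + eta * q * mid ** (q - 1.0) - 2.0 * abs(x) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return math.copysign(0.5 * (lo + hi), x)

For $q = 1$ the same one-dimensional minimization reduces to the familiar soft-thresholding rule $\psi_{\eta,1}(x) = \mathrm{sgn}(x)\max(|x| - \eta/2, 0)$.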
We next turn to the case $x > a_q\eta^{1/(2-q)}$. In this case $t = 0$ is not a global minimizer of $h$, since $g(t_1) < 0$, which implies $h(t_1) < h(0)$. In addition, $\lim_{t \to \infty} h(t) = \infty$, so the function $h(t)$ attains its minimum on $(0, \infty)$. We claim that this minimizer is given by $t_2$, the solution of the equation $h'(t) = 0$ on the interval $[(q(1-q)\eta/2)^{1/(2-q)}, \infty)$. Notice that $h'(t) = 0$ is exactly the equation given by (10) with $x > 0$. We consider the second derivative of $h$, given by $h''(t) = 2 + \eta q(q-1)t^{q-2}$. We first prove the existence and uniqueness of $t_2$. In fact, a direct computation shows that
$$h'\big((q(1-q)\eta/2)^{1/(2-q)}\big) = (2-q)\Big(\frac{1-q}{2}\Big)^{-\frac{1-q}{2-q}}(\eta q)^{\frac{1}{2-q}} - 2x \;\le\; 2\big(2^{1-q}q\big)^{\frac{1}{2-q}}x - 2x \;<\; 0,$$
where the first inequality follows from $x > a_q\eta^{1/(2-q)}$ and the last one holds since $q < 2^{q-1}$ for $0 < q < 1$. We also observe that $h''(t) \ge 0$ on $[(q(1-q)\eta/2)^{1/(2-q)}, \infty)$, which implies that $h'(t)$ is increasing on this interval. Since $h'(t)$ is continuous on $(0, \infty)$ and $\lim_{t \to \infty} h'(t) = \infty$, the equation $h'(t) = 0$ has a unique solution $t_2$ on $[(q(1-q)\eta/2)^{1/(2-q)}, \infty)$. Because $h'(t_2) = 0$ and $h''(t_2) > 0$, we also conclude that $t_2$ is the only minimizer of $h(t)$ on $[(q(1-q)\eta/2)^{1/(2-q)}, \infty)$. We further prove that $t_2$ is in fact the minimizer of $h(t)$ on $(0, \infty)$: it suffices to show that $h(t)$ has no local minimizer on $(0, (q(1-q)\eta/2)^{1/(2-q)})$, which follows from the fact that $h''(t) < 0$ on $(0, (q(1-q)\eta/2)^{1/(2-q)})$.
When $x < 0$, one easily finds that $h(t)$ attains its minimum on $(-\infty, 0]$. For $t \in (-\infty, 0]$ we can rewrite $h(t)$ as $(-t)^2 + 2x(-t) + \eta(-t)^q$. By the same analysis as above, when $-a_q\eta^{1/(2-q)} \le x < 0$ the global minimizer of $h(t)$ is $t = 0$; when $x < -a_q\eta^{1/(2-q)}$, the minimizer of $h(t)$ is given by the unique solution of the equation $2(-t) + \eta q(-t)^{q-1} + 2x = 0$ on the interval $(-\infty, -(q(1-q)\eta/2)^{1/(2-q)}]$.
Finally, we combine all the cases and find that, for a given $x \in \mathbb{R}$, a global minimizer of $h(t)$ is $\psi_{\eta,q}(x)$ given by (9). Hence, according to our analysis, if we evaluate $(\Psi_{\eta,q}(d))_i$ as in (11), the vector $\Psi_{\eta,q}(d)$ is a global minimizer of the optimization problem (12). It is also easy to check that, when $|x| = a_q\eta^{1/(2-q)}$, the univariate function $h(t)$ has two global minimizers, given by $0$ and $\frac{2-2q}{2-q}|x|$ respectively. Thus the vector $\Psi_{\eta,q}(d)$ indeed gives one global minimizer of problem (12). This completes the proof.
Remark 21 A similar discussion of the global minimizer of $h(t)$ can be found in Knight and Fu (2000).

A.2. Proof of Proposition 5

In this subsection, we shall prove Proposition 5.
Proof [Proof of Proposition 5]. We first prove conclusion (i). Recall that $c^* = (c^*_1, \ldots, c^*_m) \in \mathbb{R}^m$ is a local minimizer of the objective functional $\mathcal T_{\gamma,q}(c)$ defined by (3) and $0 < \lambda \le \frac{q}{2\|K\|_2^2}$. It suffices to prove that for $i \in \mathrm{supp}(c^*)$, i.e., $c^*_i \neq 0$, there holds
$$\big|\big(c^* + \lambda K^T(y - Kc^*)\big)_i\big| > a_q(\lambda\gamma)^{1/(2-q)}$$
and $c^*_i = \mathrm{sgn}\big(\big(c^* + \lambda K^T(y - Kc^*)\big)_i\big)\,t^*_i$, where $t^*_i$ is the solution of the equation
$$2t + \lambda\gamma q t^{q-1} - 2\big|\big(c^* + \lambda K^T(y - Kc^*)\big)_i\big| = 0$$
on the interval $[(q(1-q)\lambda\gamma/2)^{1/(2-q)}, \infty)$. Here $(\cdot)_i$ denotes the $i$-th coordinate of a vector. By the same argument as in the proof of Proposition 18, for $i \in \mathrm{supp}(c^*)$ we have
$$2\big(K^T(Kc^* - y)\big)_i + \gamma q\,\mathrm{sgn}(c^*_i)|c^*_i|^{q-1} = 0 \qquad (68)$$
$$2\sum_{j=1}^m K_{j,i}^2 > \gamma q(1-q)|c^*_i|^{q-2}, \qquad (69)$$
where $K_{j,i}$ denotes the $(j,i)$-th entry of the matrix $K$. We multiply both sides of inequality (69) by the positive number $\lambda$. Since $\lambda \le \frac{q}{2\|K\|_2^2}$, we obtain
$$\lambda\gamma q(1-q)|c^*_i|^{q-2} < 2\lambda\sum_{j=1}^m K_{j,i}^2 \le q,$$
which implies $|c^*_i| > ((1-q)\lambda\gamma)^{1/(2-q)}$. We consider the case $c^*_i > 0$.
By equation (68), for c i > 0, there holds 2c i + λγq(c i )q 1 = 2 c + λKT (y Kc ) which verifies that c + λKT (y Kc ) i is positive and c i is the zero point of f(t) = 2t + λγqtq 1 2 c + λKT (y Kc ) Sparse Kernel Regression Note that c i > ((1 q)λγ)1/(2 q) > (q(1 q)λγ/2)1/(2 q). Recalling the analysis in Lemma 4, we know that f(t) is monotonically increasing on the interval [(q(1 q)λγ/2)1/(2 q), ). We thus have f((λγ(1 q))1/(2 q)) < 0 which implies c + λKT (y Kc ) i > aq(λγ)1/(2 q). Therefore, we verify conclusion (i) for c i > 0. When c i < 0, we can prove it following the same method. Next, we shall prove the function ψη,q(x) defined by (9) is Lipschitz continuous and strictly increasing for any |x| > aqη1/(2 q). Then one can check that the proof of the rest two conclusions are almost the same as the proof of Theorem 3 in Xu et al. (2012). Since ψη,q(x) is an odd function, we only need to prove our desired result for x > aqη1/(2 q). From (9), when x > aqη1/(2 q), ψη,q(x) is defined to be the solution of the equation 2t+ηqtq 1 2x = 0 on the interval [(q(1 q)η/2)1/(2 q), ). Since the bivariate function F(t, x) = 2t+ηqtq 1 2x is continuously differentiable inside [(q(1 q)η/2)1/(2 q), ) (aqη1/(2 q), ), we have ψ η,q(x) = 2 2 ηq(1 q)(ψη,q(x))q 2 due to the implicit function theorem. Obviously ψ η,q(x) > 1 which implies ψη,q(x) is strictly increasing. In order to verify the Lipschitz continuity, we still need an upper bound for ψ η,q(x). To this end, we show that ψη,q(x) > (q(1 q)η)1/(2 q) for x > aqη1/(2 q). Also following the analysis of Lemma 4, it is equivalent to check that when x > aqη1/(2 q), there holds 2 (q(1 q)η) 1 2 q + ηq (q(1 q)η) q 1 2 q 2x < 0. Since when x > aqη1/(2 q), a simple calculation show that 2 (q(1 q)η) 1 2 q + ηq (q(1 q)η) q 1 2 q 2x < 2aqη1/(2 q) 3 2q 2 q q1/(2 q) 1 . The left hand side of the above inequality can be further bounded by 0 as 3 2q 2 q q1/(2 q) < 1 for 0 < q < 1. Hence ψη,q(x) > (q(1 q)η)1/(2 q) for x > aqη1/(2 q), which implies ψ η,q(x) < 2. We thus verify the Lipschitz continuity of ψη,q(x) for x > aqη1/(2 q) and the proof is completed. A.3. Proof of Proposition 6 In this subsection, we shall prove Proposition 6. Proof [Proof of Proposition 6]. We just prove the inequality (16), the bound (17) can be derived by the same approach. Recall the definition of the projection operator πM (see Definition 2). As ρ( |x) is supported on [ M, M] at every x X, it obviously implies |yi| M for i = 1, , m and Ez(πM( ˆfq)) Ez( ˆfq). When 0 < q < 1, since the estimator ˆfq satisfies Assumption 1, we have Ez( ˆfq) + γ cz q q q Ez( ˆf1) + γ cz 1 q q. Ez(πM( ˆfq)) + γ cz q q q Ez( ˆfq) + γ cz q q q Ez( ˆf1) + γ cz 1 q q Ez( ˆf1) + γ cz 1 1 + γ cz 1 q q. (70) Shi, Huang, Feng and Suykens Note that the coefficient sequence of fz,γ is given by { 1 mgγ(xi)}m i=1 and ˆf1 is the global minimizer of problem (2) at q = 1. Hence, there holds Ez( ˆf1) + γ cz 1 1 Ez(fz,γ) + γm 1 m X i=1 |gγ(xi)|. (71) A direct computation shows that E (πM( ˆfq)) E (fρ) + γ cz q q q = {E (πM( ˆfq)) Ez(πM( ˆfq))} + {Ez(πM( ˆfq)) + γ cz q q q Ez(fz,γ) γm 1 m X i=1 |gγ(xi)|} +{Ez(fz,γ) E (fz,γ)} + {γm 1 m X i=1 |gγ(xi)| γ gγ L1ρX } + {γ gγ L1ρX γ gγ L2ρX } +{E (fz,γ) E (fγ)} + {E (fγ) E (fρ) + γ gγ L2ρX }. From inequalities (70) and (71), the second term on the right hand side of the above equation is at most γ cz 1 q q, which can be further bounded by γm1 q cz 1 q 1 due to the reverse H older inequality. The inequality gγ L1ρX gγ L2ρX implies the fifth term is at most zero, thus we obtain bound (16). 
Then the proof is completed. A.4. Proof of Proposition 14 Now we concentrate our efforts on the proof of Proposition 14. Definition 22 A function ψ : R+ R+ is sub-root if it is non-negative, non-decreasing, and if ψ(r)/ r is non-increasing. It is easy to see for a sub-root function ψ and any D > 0, the equation ψ(r) = r/D has unique positive solution. Proposition 14 can be proved based on the following lemma, which is given in Blanchard et al. (2008). Lemma 23 Let F be a class of measurable, square-integrable functions such that E(f) f b for all f F. Let ψ be a sub-root function, D be some positive constant and r be the unique positive solution of ψ(r) = r/D. Assume that 0, sup f F,Ef2 r Then for all x > 0 and all T > D/7, with probability at least 1 e x there holds i=1 f(Xi) Ef2 D2 r + (T + 9b)x m , f F. (73) Proof [Proof of Proposition 14]. We assume that the function g in the concerned function set satisfies g B and Eg2 c Eg for some c, B > 0. In order to prove the conclusion, we define the function set for R 1 as GR = g(z) = (πM(f)(x) y)2 (fρ(x) y)2 : f BR . Sparse Kernel Regression We will apply Lemma 23 to GR and find the sub-root function ψ in our setting. To this end, let σ1, , σn be the independent Rademacher random variables, that is, independent random variables for which Prob(σi = 1) = Prob(σi = 1) = 1/2. One can see Van der Vaart and Wellner (1996) that sup g GR,Eg2 r sup g GR,Eg2 r i=1 σig(zi) Next, we will bound the right-hand side of the above inequality by using the ℓ2 empirical covering numbers and entropy integral. Due to Van der Vaart and Wellner (1996), there is an universal absolute constant C such that 1 m Eσ sup g GR,Eg2 r i=1 σig(zi) 1 2 2 N2(GR, ν)dν, (75) where V = supg GR,Eg2 r 1 m Pm i=1 g2(zi) and Eσ denotes the expectation with respect to the random variables {σ1, , σm} conditioned on all of the other random variables. Now we use the capacity condition (20) for B1. It asserts that log2 N2(B1, ϵ) c K,pϵ p, 0 < ϵ 1. For g1, g2 GR, we have |g1(z) g2(z)| = | (y πM(f1)(x))2 (y πM(f2)(x))2 | 4M|f1(x) f2(x)|. Therefore, it follows that N2 (GR, ϵ) N2 BR, ϵ 4M N2 B1, ϵ 4MR where R1 = max{B, 4MR}. Then log2 N2 (GR, ϵ) c K,p Rp 1ϵ p, 0 < ϵ R1. Using equation (75), since V = supg GR,Eg2 r 1 m Pm i=1 g2(zi) and V B R1, one easily gets 1 m Eσ sup g GR,Eg2 r i=1 σig(zi) Cc1/2 K,p Rp/2 1 = c K,p Rp/2 1 sup g GR,Eg2 r where c K,p = 2Cc1/2 K,p(2 p) 1 depending on p and the kernel function K. For simplicity, the constant c K,p may change from line to line in the following derivations. Due to Talagrand (1994), one see that E sup g GR,Eg2 r i=1 g2(zi) 8B m EEσ sup g GR,Eg2 r i=1 σig(zi) Shi, Huang, Feng and Suykens Therefore, from (76) and (77), we have Rm := 1 m EEσ sup g GR,Eg2 r i=1 σig(zi) Rm B m + r 1 Solving the above inequality for Rm and substituting it to the equation (74), we have sup g GR,Eg2 r c K,p max B 2 p 2+p m 2 2+p R 2p 2+p 1 , r 1 2 p 2p 2+p 1 max n B 2 p 2+p m 2 2+p , r 1 2 p where the last inequality is due to R1 1 and 0 < p < 2. According to Lemma 23, one may take ψ(r) to be the right-hand side of the above inequality. Then the solution r to the equation ψ(r) = r/D satisfies r c K,p max n D 4 2+p , DB 2 p 2+p o m 2 2+p R Recalling that Eg2 c Eg, now we apply Lemma 23 to the function set GR by let T = D = 2c, then with probability 1 e x, there holds i=1 g(zi) 1 2Eg + (2c + 9b)x + c K,p max n c 2 p 2+p , B 2 p 2+p o m 2 2+p R 2p 2+p 1 , g GR. For g GR, it is easy to verify that b = c = 16M2, B = 8M2 and R1 = max{4MR, 8M2} 8M2R. 
From the above inequality, we obtain
$$Eg - \frac{1}{m}\sum_{i=1}^m g(z_i) \;\le\; \frac{1}{2}Eg + \frac{176M^2x}{m} + \tilde c_{K,p}\,M^2 m^{-\frac{2}{2+p}} R^{\frac{2p}{2+p}}, \qquad g \in \mathcal G_R,$$
which is exactly the inequality given by (37). Next, we consider the function set
$$\mathcal G'_R = \big\{g(z) = (f(x) - y)^2 - (f_\rho(x) - y)^2 : f \in \mathcal B_R\big\}.$$
Note that $\mathcal B_R$ is a subset of $C(X)$ and $\|f\|_{C(X)} \le \kappa R$ for $f \in \mathcal B_R$, where $\kappa = \|K\|_{C(X\times X)}$. Therefore, we have $B = c = (3M + \kappa R)^2$ and $b = 2B$ for $g \in \mathcal G'_R$. Moreover, for $g_1, g_2 \in \mathcal G'_R$, there holds
$$|g_1(z) - g_2(z)| = \big|(y - f_1(x))^2 - (y - f_2(x))^2\big| \le |2y - f_1(x) - f_2(x)|\,|f_1(x) - f_2(x)| \le 2(M + \kappa R)|f_1(x) - f_2(x)|.$$
Following the same analysis as above, we find that with probability $1 - e^{-x}$,
$$Eg - \frac{1}{m}\sum_{i=1}^m g(z_i) \;\le\; \frac{1}{2}Eg + \frac{20(3M + \kappa R)^2 x}{m} + \tilde c_{K,p}\,(3M + \kappa R)^{\frac{2(2-p)}{2+p}} m^{-\frac{2}{2+p}} R_2^{\frac{2p}{2+p}}, \qquad g \in \mathcal G'_R,$$
where $R_2 = \max\{(3M + \kappa R)^2,\, 2(M + \kappa R)R\} \le (3M + \kappa)^2 R^2$. Hence, we obtain
$$Eg - \frac{1}{m}\sum_{i=1}^m g(z_i) \;\le\; \frac{1}{2}Eg + \frac{20(3M + \kappa R)^2 x}{m} + \tilde c_{K,p}\,(3M + \kappa)^2 m^{-\frac{2}{2+p}} R^2, \qquad g \in \mathcal G'_R,$$
which leads to the inequality (38). Finally, the same bounds for $S(\pi_M(f))$ and $S(f)$ can be derived by considering the function sets $\mathcal G_R$ and $\mathcal G'_R$ in the same way. Thus we complete the proof.

References

R. Adams and J. Fournier. Sobolev Spaces. Academic Press, 2003.
T. Ando, S. Konishi, and S. Imoto. Nonlinear regression modeling via regularized radial basis function networks. Journal of Statistical Planning and Inference, 138:3616–3633, 2008.
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Annals of Statistics, 36:489–531, 2008.
T. Blumensath and M. Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14:629–654, 2008.
M. S. Birman and M. Z. Solomyak. Piecewise-polynomial approximations of functions of the classes $W_p^\alpha$ (Russian). Matematicheskii Sbornik, 73:331–355, 1967.
E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59:1207–1223, 2006.
E. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14:877–905, 2008.
R. Chartrand. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14:707–710, 2007.
D. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143–1175, 2004.
S. Chen and D. Donoho. Basis pursuit. In Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, volume 1, pages 41–44. IEEE, 1994.
X. Chen, F. Xu, and Y. Ye. Lower bound theory of nonzero entries in solutions of ℓ2–ℓp minimization. SIAM Journal on Scientific Computing, 32:2832–2852, 2010.
J. B. Conway. A Course in Operator Theory. American Mathematical Society, 2000.
F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413–1457, 2004.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306, 2006.
R. M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 15:1306–1326, 1987.
D. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, 1996.
J. Fan and R. Li.
Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.
Y. Feng, S. G. Lv, H. Hang, and J. A. K. Suykens. Kernelized elastic net regularization: generalization bounds, and sparse recovery. Neural Computation, 28:525–562, 2016.
S. Fischer and I. Steinwart. Sobolev norm learning rates for regularized least-squares algorithm. arXiv preprint arXiv:1702.07254, 2017.
F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10:1455–1480, 1998.
X. Guo, J. Fan, and D. X. Zhou. Sparsity and error analysis of empirical feature-based regularization schemes. Journal of Machine Learning Research, 17:3058–3091, 2017.
Z. C. Guo and L. Shi. Learning with coefficient-based regularization and ℓ1 penalty. Advances in Computational Mathematics, 39:493–510, 2013.
Z. C. Guo, D. H. Xiang, X. Guo, and D. X. Zhou. Thresholded spectral algorithms for sparse approximations. Analysis and Applications, 15:433–455, 2017.
J. Huang, J. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Annals of Statistics, 36:587–613, 2008.
X. Huang, A. Maier, J. Hornegger, and J. A. K. Suykens. Indefinite kernels in least squares support vector machines and principal component analysis. Applied and Computational Harmonic Analysis, 43:162–172, 2017.
K. Jetter, J. Stöckler, and J. D. Ward. Error estimates for scattered data interpolation on spheres. Mathematics of Computation, 68:733–747, 1999.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28:1356–1378, 2000.
M. J. Lai and J. Wang. An unconstrained ℓq minimization with 0 < q ≤ 1 for sparse solution of underdetermined linear systems. SIAM Journal on Optimization, 21:82–101, 2011.
S. B. Lin, J. Zeng, J. Fang, and Z. Xu. Learning rates of ℓq-coefficient regularization learning with Gaussian kernel. Neural Computation, 26:2350–2378, 2014.
G. Loosli, S. Canu, and C. S. Ong. Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:1204–1216, 2016.
S. Mendelson and J. Neeman. Regularization in kernel learning. Annals of Statistics, 38:526–565, 2010.
C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.
B. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.
A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu. ℓp–ℓq penalty for sparse linear and sparse multiple kernel multi-task learning. IEEE Transactions on Neural Networks, 22:1307–1320, 2011.
G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq balls. IEEE Transactions on Information Theory, 57:6976–6994, 2011.
V. Roth. The generalized Lasso. IEEE Transactions on Neural Networks, 15:16–28, 2004.
R. Saad and O. Yılmaz. Sparse recovery by non-convex optimization – instance optimality. Applied and Computational Harmonic Analysis, 29:30–48, 2010.
F. Schleif and P. Tino. Indefinite proximity learning: a review. Neural Computation, 27:2039–2096, 2015.
L. Shi, Y. L. Feng, and D. X. Zhou. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31:286–302, 2011.
L. Shi. Learning theory estimates for coefficient-based regularized regression. Applied and Computational Harmonic Analysis, 34:252–265, 2013.
S. Smale and D. X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.
I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
I. Steinwart and A. Christmann. Sparsity of SVMs that use the ϵ-insensitive loss. In Advances in Neural Information Processing Systems 21 (D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds.), pages 1569–1576, 2009.
I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575–607, 2007.
M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76, 1994.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
H. Y. Wang, Q. W. Xiao, and D. X. Zhou. An approximation theory approach to learning with ℓ1 regularization. Journal of Approximation Theory, 167:240–258, 2013.
H. Wendland. Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Advances in Computational Mathematics, 4:389–396, 1995.
H. Wendland. Local polynomial reproduction and moving least squares approximation. IMA Journal of Numerical Analysis, 21:285–300, 2001.
H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, 2005.
Q. Wu, Y. Ying, and D. X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6:171–192, 2006.
Q. Wu and D. X. Zhou. Learning with sample dependent hypothesis spaces. Computers & Mathematics with Applications, 56:2896–2907, 2008.
Z. Wu. Compactly supported positive definite radial functions. Advances in Computational Mathematics, 4:283–292, 1995.
A. Van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
Z. Xu, X. Chang, F. Xu, and H. Zhang. L1/2 regularization: a thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems, 23:1013–1027, 2012.
T. Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15:1397–1437, 2003.
D. X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743–1752, 2003.