# tilted_sparse_additive_models__b4bedf05.pdf

Tilted Sparse Additive Models

Yingjie Wang 1 2 Hong Chen 2 3 Weifeng Liu 1 Fengxiang He 4 5 Tieliang Gong 6 Youcheng Fu 2 Dacheng Tao 7

Additive models have been burgeoning in data analysis due to their flexible representation and desirable interpretability. However, most existing approaches are constructed under empirical risk minimization (ERM), and thus perform poorly in situations where average performance is not a suitable criterion for the problems of interest, e.g., data with complex non-Gaussian noise, imbalanced labels or both of them. In this paper, a novel class of sparse additive models is proposed under tilted empirical risk minimization (TERM), which addresses the deficiencies in ERM by imposing tilted impact on individual losses, and is flexibly capable of achieving a variety of learning objectives, e.g., variable selection, robust estimation, imbalanced classification and multiobjective learning. On the theoretical side, a learning theory analysis which is centered around the generalization bound and function approximation error bound (under some specific data distributions) is conducted rigorously. On the practical side, an accelerated optimization algorithm is designed by integrating Prox-SVRG and random Fourier acceleration technique. The empirical assessments verify the competitive performance of our approach on both synthetic and real data.

1College of Control Science and Engineering, China University of Petroleum (East China),Qingdao,China 2College of Informatics, Huazhong Agricultural University, China 3Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China 4JD Explore Academy, JD.com, Inc., Beijing, China 5Artificial Intelligence and its Applications Institute, School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom 6School of Computer Science and Technology, Xi an Jiaotong University, China 7The University of Sydney, Sydney, Australia. Correspondence to: Hong Chen <chenh@mail.hzau.edu.cn>, Weifeng Liu <liuwf@upc.edu.cn>, Fengxiang He <F.He@ed.ac.uk>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

Additive models (Hastie & Tibshirani, 1990; Stone, 1985), as natural extensions of linear models, have drawn much attention in high-dimensional data analysis due to their flexible representation and desirable interpretability. Let X be the explanatory variable that takes values in a p-dimensional metric space X and let Y be the corresponding real-valued response in an output set Y. The most typical additive models were formulated by dividing the input space into p parts directly, i.e., X = (X1, ..., Xp), Xj R, j {1, ..., p}. In this setting, the hypothesis space can be stated as

H = f : f(X) =

j=1 fj(Xj), fj Hj ,

where X = (X1, ..., Xp)T Rp, Xj Xj and Hj is the component function space on Xj. Usually, the component hypothesis spaces Hj, j = 1, ..., p, can be reproducing kernel Hilbert space (RKHS) (Chen et al., 2017; Kandasamy & Yu, 2016; Raskutti et al., 2012; Christmann & Zhou, 2016), the space spanned by the orthogonal basis (Ravikumar et al., 2009; Meier et al., 2009; Huang et al., 2010; Yin et al., 2012), and the space formed by neural networks (Agarwal et al., 2020; Yang et al., 2020). Over the past decades, many studies concerning the theoretical as well as practical investigations of additive models have been conducted under the empirical risk minimization (ERM) minf H Pn i=1 ℓ(f(xi), yi) (Hastie & Tibshirani, 1990; Stone, 1985; Ravikumar et al., 2009; Kandasamy & Yu, 2016; Raskutti et al., 2012), where ℓ( , ) : R R R is the loss function. The additive models are versatile by assigning different loss functions, where the several usual ones include the squared loss ℓ(f(x), y) = (f(x) y)2

for regression (Ravikumar et al., 2009; Meier et al., 2009; Huang et al., 2010; Raskutti et al., 2012; Yin et al., 2012; Tan & Zhang, 2019), and the logistic loss (or hinge loss) for classification (Ravikumar et al., 2009; Chen et al., 2017).

However, most existing approaches are constructed under empirical risk minimization (ERM), and perform poorly in situations where average performance is not a suitable criterion for the problems of interest, e.g., data with complex non-Gaussian noise, imbalanced labels or both of them. Although several works in relation to additive models have been proposed to address the these problems by introducing

Tilted Sparse Additive Models

some robust metrics, e.g., quantile loss (Lv et al., 2018), modal risk (Chen et al., 2020) and huber loss (Wang et al., 2022), these methods are specially designed and can only solve one of the above problems independently.

Recently, (Li et al., 2021a) introduced a unified learning framework called tilted empirical risk minimization (TERM) as a flexible alternative to ERM, which have the ability to handle various known issues occurred in ERM, such as robustness to noise in regression/classification and class imbalance. In fact, the flexibly of TERM comes from the fact that we can tune the impact of individual losses using a scale parameter called the tilt, and thus increase or decrease the influence of outliers (or minor class), respectively. For instance, negative scale parameter is used for robust estimation/classification and positive scale parameter is used for imbalanced classification. However, in the face of some common data distributions (e.g., Gaussian distribution, skewed distribution or heavy-tailed distribution), the theoretical explorations on the approximation ability of TERM-based estimator with proper scale parameter selection strategies to the ground truth function are still insufficient.

In this paper, we try to pave a way for sparse additive models under TERM with computational feasibility and theoretical guarantees. To the best of our knowledge, this is the first work that investigates the sparse additive models under TERM. The main contributions made in this study can be summarized as follows:

Algorithm design: Following the principle of TERM (Li et al., 2021a), we propose a class of new tilted sparse additive models (T-Sp AM) based on Tikhonv regularization scheme with the data dependent hypothesis space and the sparsity-induced ℓ2,1-norm regularizer. The proposed method can address the deficiencies in additive models under ERM by imposing tilted impact on individual losses, and is capable of achieving a variety of learning tasks, e.g., variable selection, robust estimation, imbalanced classification or multiobjective learning (e.g., achieving robust estimation and imbalanced classification simultaneously).

Theoretical guarantees. Theoretical foundations are established for T-Sp AM from three aspects: 1) Generalization consistency verifies the T-Sp AM can generalize well to unseen out-of-sample under some weak conditions; 2) Function approximation analysis demonstrates the estimator of T-Sp AM can approach f with proper hyper-parameter settings under three specific data distributions (e.g., student s t distribution, skewed normal distribution and normal distribution); 3) Variable selection analysis supports that T-Sp AM can identify truly informative variables under proper parametric conditions.

Computing Feasibility. The proposed T-Sp AM can be computationally effective by employing Prox-SVRG (Reddi et al., 2016) for non-convex objective and utilizing random Fourier features technique for speeding up the matrix calculation (Rahimi & Recht, 2007). Moreover, empirical evaluations on both synthetic and realworld data verify the effectiveness of our T-Sp AM.

Related works: The principle of TERM was originally introduced in (Howard & Matheson, 1972) and revisited in reinforcement learning (Borkar, 2002; Osogami, 2012). Recently, in (Li et al., 2021a; Lee et al., 2020), TERM has shown its appealing robustness and scalability to a variety of learning tasks. Although rich empirical evaluations are stated respectively in (Li et al., 2021a), there is no enough attention on the statistical learning theory of TERM. This paper tries to establish the learning theory foundations of TSp AM on generalization bound, function approximation and variable selection consistency. To better highlight the novelty of T-Sp AM, its algorithmic properties are summarized in Table 1, where the competitors include Sp AM (Sparse additive models (Ravikumar et al., 2009)), SAQR (Sparse Additive Quantile Models (Lv et al., 2018)), Sp MAM (Sparse Modal Additive Models (Chen et al., 2020)) and TERM (Ravikumar et al., 2009).

2. Tilted Sparse Additive Models

This paper chooses a RKHS to form the additive hypothesis space. Let Kj : Xj Xj R be a symmetric and positive definite kernel on Xj Xj, and HKj be the corresponding RKHS on Xj with norm Kj, j {1, ..., p}. The RKHS with additive structure is given by

j=1 fj : fj HKj, j = 1, ..., p}

f 2 K = inf{

j=1 fj 2 Kj : f =

To formulate T-Sp AM, we introduce the TERM scheme (Li et al., 2021a; Lee et al., 2020). Definition 2.1. Given z := {(xi, yi)}n i=1 drawn independently from an intrinsic distribution ρ, the TERM is defined as

Rz(t, f) := 1

i=1 etℓ(f(xi),yi)), (1)

and its population version is

R(t, f) := 1

Z etℓ(f(x),y)dρ(x, y),

where t ( , 0) (0, + ) is the scale parameter.

Tilted Sparse Additive Models

Table 1. Algorithmic properties ( -has the given information, -hasn t the given information)

Sp AM SAQR Sp MAM TERM T-Sp AM (ours) Variable Selection Robust Estimation Imbalanced Classification Multi-objective Learning Generalization Analysis Variable Selection Analysis Function Approximation Analysis Optimization Acceleration

It has been verified that TERM can be tuned to magnify (by taking t (0, + )) or suppress (by taking t ( , 0)) the influence of outliers (Li et al., 2021a). Then, the TERMbased regularized additive models can be formulated as

fη,t = arg min fj HKj f=Pp j=1 fj

{Rz(t, f) + η

j=1 τj fj 2 Kj}, (2)

where η > 0 is a regularization parameter and {τj}p j=1 are positive weights. The properties of RKHS assure the minimizer of (2) can be represented as

i=1 αη ji Kj(xij, ), αη ji R.

In view of the above representation, we consider sparse learning in the data dependent hypothesis space

i=1 αji Kj(xij, ) : αji R}

with a coefficient-induced penalty

Ωz(f) = inf{

j=1 τj αj 2 : f =

i=1 αji Kj(xij, )},

where α = (αT 1 , ..., αT p )T Rnp and αj = (αj1, ..., αjn)T Rn. The T-Sp AM can be formulated as fz,t = arg min f Hz{Rz(t, f) + λΩz(f)}.

Denote Kji = (Kj(xij, x1j), ..., Kj(xij, xnj))T Rn

and Ki = (KT 1i, ..., KT pi)T Rnp. The estimator of TSp AM can be rewritten as

j=1 fz,t,j =

i=1 αz,t ji Kj(xij, ) (3)

αz,t = arg min α Rnp{1

i=1 etℓ(KT i α,yi))

j=1 τj αj 2}.(4)

Remark 2.2. To circumvent the large-scale kernel matrix calculation in (4), we introduce the random Fourier features technique (Rahimi & Recht, 2007). Denote ψj : R Rd

as a random Fourier feature map associated with Kj. We know that, for any j = 1, ..., p,

Kj(xij, xtj) ψj(xij)T ψj(xtj), xij, xtj Xj.

Then, (3) can be approximated by

j=1 ˆw T j ψj(xtj) (5)

{ ˆwj}p j=1 = arg min wj Rd,j=1,...,p {1

i=1 etℓ(Pp j=1 w T j ψj(xij),yi))

j=1 τj wj 2}.

Lemmas 3-4 in (Li et al., 2021a) show that the objective (13) is non-convex as t ( , 0) and smooth when t ( , 0) (0, + ). As a result, we employ the prox-SVRG (Reddi et al., 2016) to compute (13) and provide the detailed steps in Appendix F.

3. Theoretical Analysis

This section provides some necessary assumptions firstly, and then gives the main theoretical results. All proofs are provided in Appendixes B-E.

3.1. Generalization Bound of T-Sp AM

To conduct the theoretical analysis, we firstly introduce a projection operator.

Definition 3.1. The projection operator is defined by a hard threshold function P(f)(x) := max{ M, min{f(x), M}}, f HK for regression estimation, and by the sigmoid function for binary classification task, i.e., P(f)(x) := 1 1+e f(x) , f HK.

Tilted Sparse Additive Models

The projection operator can ensure better generalization, which has been used extensively in learning theory literatures, e.g., (Steinwart et al., 2009; Wu et al., 2007; Liu et al., 2020).

Assumption 3.2. The output Y is bounded by the interval [ M, M] with a positive constant M < + . The kernel is bounded, i.e., κ = supj,u X p

Kj(u, u) < .

The above assumptions have been used for theoretical analysis of additive models (Lv et al., 2018; Chen et al., 2020).

The generalization error bound measures the difference between R(t, P(fz,t)) and Rz(t, P(fz,t)), which assesses the out-of-sample prediction ability of the estimator P(fz,t).

Theorem 3.3. Let Assumption 3.2 be true and Kj Cν for any j = 1, ..., p. For any loss function with ℓ(Pf(X), Y ) + and its derivative ℓ (Pf(X), Y ) + . By taking λ = n ζ, for any fixed t ( , 0) (0, + ) and 0 < δ < 1, there holds

R(t, P(fz,t)) Rz(t, P(fz,t)) nΦ(s,ζ) log(1/δ)

with confidence at least 1 δ, where Φ(s, ζ) = max{ s 2+2sζ

2 1+2ν , ν (0, 1]; 2 1+ν , ν (1, 3/2]; 1 ν , ν (3/2, ).

Remark 3.4. Figure 1 summaries the convergence rates of generalization error bound deduced in Theorem 3.3 by taking different s and ζ. Note that the generalization error of T-Sp AM will not converge when ζ and v are both located in the white area. Moreover, the larger the value of v, the larger the selection range of regularization parameter λ results in faster convergence rate. For any t ( , 0) (0, + ), R(t, P(fz,t)) Rz(t, P(fz,t)) 0 when ζ ( , 2 s

2s ) and n + . By taking ζ 1

2, we can get the learning rate with polynomial decay O(n 1

3.2. Function Approximation Bounds under Specific Data Distribution Assumptions

The definition of fz,t in (3) indicates that the approximation ability of estimator P(fz,t) is closely related to the scale parameter t. Thus, in the presence of different data distributions, it is important to explore how to select t to achieve good approximation performance. In this section, we pursue the L2 ρX -distance between P(fz,t) and f under three specific types of noise distributions ((Noise assumptions A-C)), where ρX is the marginal distribution of ρ on X and L2 ρX is the square-integrable function space.

This section considers a common data-generating model as Y = f (X)+ϵ, where (X, Y ) X Y, ϵ is a random noise and f is an unknown ground truth function. Moreover, the

Figure 1. The convergence rates of generalization error by taking different ζ and v.

loss function is selected as the squared loss ℓ(f(x), y) = (f(x) y)2. Here, we consider an commonly used case in learning theory (Guo & Zhou, 2013; Zou et al., 2009; Shi et al., 2011), where f is bounded and belongs to the RKHS, i.e., f = Pp j=1 f j , f < + and f j HKj, j {1, ..., p}.

Assumption 3.5. (Noise assumption A) The noise variable satisfies E(ϵ|X = x) = 0, x X. Theorem 3.6. Let Assumptions 3.2, 3.5 be true and fη,t in (2) be bounded. By taking λ = n ζ with ζ ( 1

2s ) and η = n 1+2ζ

4 , for any 0 < δ < 1, there holds

P(fz,t) f 2 L2ρX

h |t| 1nmax{ s 2+2sζ

2 } + n 1 2ζ

4 + |t| i log(1/δ)

with confidence at least 1 δ.

Corollary 3.7. Let the conditions of Theorem 3.6 be true. By taking t = n β with β (0, min{ 1

4 }), the estimator P(fz,t) can approach f with an sample-dependent error of less than

O(nmax{ 4β+s 2+2sζ

Assumption 3.5 admits a class of zero-mean noise distribution, e.g., student s t distribution, skewed Gaussian distribution and Gaussian distribution. Theorem 3.6 and Corollary 3.7 indicate that a proper t value ensures that our estimator approximate f , which verify the robustness of our T-Sp AM. It should be noticed that the previous theoretical studies of additive models are usually limited to ERM (or SRM) under Gaussian noise, see e.g., (Huang et al., 2010; Meier et al., 2009; Ravikumar et al., 2009; Yuan & Zhou, 2016).

Tilted Sparse Additive Models

We turn now to analyze the approximation performance on a class of symmetric zero-mean noise distribution.

Assumption 3.8. (Noise assumption B) The noise ϵ satisfies E(ϵ|X = x) = 0, pϵ|X(m|X = x) = pϵ|X( m|X = x), m R, x X and pϵ|X < + , where pϵ|X denotes the conditional density of noise ϵ.

Theorem 3.9. Let Assumptions 3.2, 3.8 be true and fη,t be bounded. For fixed t ( 1 2(M+ f )2 , 0) and any 0 < δ < 1, with confidence at least 1 δ, there holds

P(fz,t) f 2 L2ρX nmax{ 2sζ+s 2

4 } log(1/δ).

Corollary 3.10. Let the conditions of Theorem 3.9 be true and Kj C for any j = 1, ..., p. Setting ζ > 1

P(fz,t) f 2 L2ρX = O(n 1

To introduce the third type of noise assumption, we first define strongly m-concave function (Doss & Wellner, 2013; Feng et al., 2020).

Definition 3.11. A function g is strongly m-concave if it exhibits one of the following forms:

(i) g = max{ϕ 1 m , 0} for some strongly concave function ϕ if m > 0;

(ii) g = exp(ϕ) for some strongly concave function ϕ if m = 0;

(iii) g = max{ϕ 1 m , 0} for some strongly convex function ϕ if m < 0.

Assumption 3.12. (Noise assumption C) The conditional density pϵ|X satisfies following conditions:

(i) For any x X, the conditional density pϵ|X satisfies the conditions

(i-a) supt R pϵ|X(t|X = x) < + ;

(i-b) inft ( ,+ ) pϵ|X(t|X = x) > 0;

(i-c) pϵ|X(t|X = x) pϵ|X(0|X = x), t R.

(ii) supx X p ϵ|X( |X = x) < + ;

(iii) The conditional density pϵ|X=x, x X satisfies strongly m-concave condition.

Assumption 3.12 is widely used in modal regression (Feng et al., 2020; Wang et al., 2017; Chen et al., 2020). Condition (i) holds for common continuous densities with a unique global mode. Condition (ii) is key for subsequent theoretical analysis. Condition (iii) is typical from a statistical viewpoint (Doss & Wellner, 2013; Feng et al., 2020) as it holds for common symmetric and skewed noise distributions, e.g., chi-square distribution, student s t distribution, skewed normal distribution and normal distribution.

Theorem 3.13. Let Assumptions 3.2, 3.12 be true and fη,t be bounded. For any t ( , 0) (0, + ), we take t = log n β, β (0, min{ 1 8M 2 , 2 s 2sζ

16M 2 }), λ = n ζ, ζ ( 1

2s ) and η = n 1+2ζ

4 . For any 0 < δ < 1, with confidence at least 1 δ, there holds

P(fz,t) f 2 L2ρX log 1 n log(1/δ).

Corollary 3.14. Under the zero-mode noise assumption, the ground truth function f is a conditional mode function (Sager & Thisted, 1982; Yao & Li, 2013; Chen et al., 2016; Feng et al., 2020), i.e., f (x) = arg maxt p Y |X(t|X = x), where p Y |X is the conditional density of output Y . Theorem 3.13 shows the estimator P(fz,t) can approach f with an sample-dependent error of less than O(log 1 n), which verifies the robustness of our T-Sp AM to the outliers in that T-Sp AM would focus on the high conditional density region. Recall the approximation rate O(n 1/2) derived in Corollary 3.10, the slower approximation rate O(log 1 n) we obtained in Theorem 3.13 indicates the sacrifice for dealing with more complex zero-mode noise distributions, e.g., heavy-tailed noise and skewed noise.

With proper parameter selection strategies, Theorems 3.63.13 together ensure the approximation performance of TSp AM under three specific different noise assumptions. (Li et al., 2021a) stated that TERM become robust when taking t < 0. Indeed, we here strengthen the theoretical guarantees of TERM from the perspective of learning theory, in the sense that giving more specific parameter selection suggestions in the face of specific data analysis.

In addition, we also investigate the variable selection consistency of our method in Theorem E.2 (provided in Appendix E). Theorem E.2 illustrates that T-Sp AM can identify the desired variables by taking properly λ and weight τj, j = 1, ..., p. Indeed, the current analysis extends Theorem 4 in (Wang et al., 2017) from a linear regularized modal regression to the nonlinear T-Sp AM. Moreover, it is interesting to further explore variable selection analysis by replacing the parameter conditions here with the incoherence assumptions (e.g. Assumption 4 in (Lv et al., 2018)).

4. Extensions of T-Sp AM: Classification and Multi-objective Learning

Inspired by the idea in (Ravikumar et al., 2009), the TSp AM can be extended to tilted additive logistic regression for classification problem. Assume that input X X and output Y {0, 1}. Given observations z = {(xi, yi)}n i=1, the T-Sp AM induced by logistic loss can be formulated as

P(Y = 1|X = x) = exp(Pp j=1 Pn i=1 αz,t ji Kj(xij, x))

1 + exp(Pp j=1 Pn i=1 αz,t ji Kj(xij, x))

Tilted Sparse Additive Models

αz,t = arg min α Rnp{1

i=1 et(yi KT i α log(1+e KT i α)))

j=1 τj αj 2}. (8)

Remark 4.1. In view of Lemma 1 in (Li et al., 2021a), the tilted additive logistic regression can be tuned to achieve robust classification (by taking t ( , 0)) and imbalanced classification (by taking t (0, + )).

In real-world application, a multi-objective extension of T-Sp AM can be formulated to tackle complicated learning problems, e.g., mitigating label noise and addressing imbalance data simultaneously. By classifying the samples into different groups g [G] (e.g., the groups can be associated with the classes in classification task), a multi-objective TERM can be formulated as

Jz(t, γ, f) := 1

g G et Rz,g(γ,f))

Rz,g(γ, f) = 1

zi g eγℓ(f(xi),yi)).

Then, the T-Sp AM for two-objective learning problem can be represented as

j=1 fz,t,γ,j =

i=1 Kji(xi, )αz,t,γ ji

= arg min α Rnp{1

zi g eγℓ(KT i α,yi)))

j=1 τj αj 2}.

Here, we can select ℓ(f(x), y) as the squared loss for regression and as the logistic loss for classification. Remark 4.2. Lemma 7 in (Li et al., 2021a) demonstrates that the multi-objective T-Sp AM is equivalent to T-Sp AM when γ = t. Indeed, the scale parameter t controls the tiled level between groups g [G], and the weight τ impacts the tilted level between samples in group g [G]. Detailed optimization procedures of T-Sp AM with logistic loss and multi-objective loss are provided in Appendix F.

5. Experiments

This section conducts the empirical evaluations on synthetic data and real-world data to verify the effectiveness of our T-Sp AM in terms of robust regression prediction, robust classification, accurate imbalanced classification and the ability to handle multi-objective learning problem.

In all synthetic experiments, we independently generate training dataset, validation dataset and test dataset, where the hyper-parameters t and λ are tuned in grids { 0.1, 0.5, 1, 2} and {10 6, 10 5, 10 4, 10 3, 10 2, 10 1, 1} on validation dataset. For regression task, we use the average squared error (ASE) to describe the divergence between the f (x) and the prediction ˆf(x), i.e., ASE = 1

n Pn i=1( ˆf(x) f (x))2. For simplicity, T-Sp AM(K) and T-Sp AM(F) refer to the T-Sp AM by kernel matrix calculation (see (4)) and by random Fourier feature acceleration (see (13)), respectively. Inspired by (Feng et al., 2016), variable selection results are measured in terms of the average selection percentage (ASP), which refers to the average probability of variables selected correctly.

5.1. Synthetic data experiments

For regression task, the datasets are generated by the data generating model Y = f (X) + ϵ. Following the experimental design in (Chen et al., 2020), we consider f (X) = P8 j=1 fj(Xj) with

f1(u) = 2 sin(2u), f2(u) = 8u2, f3(u) = 7 sin u 2 sin u,

f4(u) = 6e u, f5(u) = u3 + 3

2(u 1)2, f6(u) = 5u,

f7(u) = 10 sin(e u/2), f8(u) = 10eϕ(u, 1

where eϕ is the normal cumulative distribution with mean 1 2 and standard deviation 4

5. Let the sample size n = 200 and dimension p = 100 (including 92 irrelevant variables). Each input Xj, j = 1, ..., 100, is generated from U( 1, 1). For training data and validation data, three types of noises are considered here and named ϵA, ϵB and ϵC, respectively. Here, the ϵA is drawn from a skewed zero-mean distributions {0.8N( 2, 1) + 0.2N(8, 1)}, the ϵB follows the skewed zero-mode distribution {0.8N(0, 1)+0.2N(20, 1)} and the ϵC is generated from heavy-tailed Student s t distribution. To obtain ASE, the test data are generated by Y = f (X).

We compare our proposed approch (T-Sp AM(F) and TSp AM(K)) with several related methods, e.g., Lasso (Tibshirani, 1994), Sp AM (Ravikumar et al., 2009) and RMR (Wang et al., 2017). We repeat each experiment for 50 times and report the average results under three types of noise distributions. Table 2 shows that, with a small |t|, T-Sp AM(F),

Tilted Sparse Additive Models

(a) The ASE under noise ϵA

(b) The ASE under noise ϵB

(c) The ASE under noise ϵC

Figure 2. The ASE of T-Sp AM(F) under different t and sample size n

Table 2. The averaged performance on synthetic data under noises ϵA (Top), ϵB (Middle) and ϵC (Bottom).

Method ASP ASE Method ASP ASE

t = 0.1 T-Sp AM (F) 0.923 0.173 0.042 t = 0.1 T-Sp AM (F) 0.919 0.173 0.037 T-Sp AM (K) 0.925 0.184 0.048 T-Sp AM (K) 0.918 0.237 0.050

t = 0.5 T-Sp AM (F) 0.908 0.171 0.041 t = 0.5 T-Sp AM (F) 0.906 0.222 0.056 T-Sp AM (K) 0.875 0.244 0.055 T-Sp AM (K) 0.905 0.268 0.056

t = 1.0 T-Sp AM (F) 0.925 0.157 0.046 t = 1.0 T-Sp AM (F) 0.913 0.265 0.626 T-Sp AM (K) 0.865 0.289 0.065 T-Sp AM (K) 0.905 0.269 0.055

t = 2.0 T-Sp AM (F) 0.810 0.278 0.097 t = 2.0 T-Sp AM (F) 0.881 0.296 0.068 T-Sp AM (K) 0.700 0.416 0.087 T-Sp AM (K) 0.905 0.277 0.054

Lasso 0.830 1.037 0.119 Sp AM 0.918 0.204 0.049 RMR 0.380 0.597 0.141

Method ASP ASE Method ASP ASE

t = 0.1 T-Sp AM (F) 0.505 0.418 0.077 t = 0.5 T-Sp AM (F) 0.573 0.358 0.099 T-Sp AM (K) 0.675 0.272 0.067 T-Sp AM (K) 0.715 0.196 0.043

t = 1.0 T-Sp AM (F) 0.938 0.068 0.077 t = 2.0 T-Sp AM (F) 1.000 0.019 0.005 T-Sp AM (K) 0.878 0.144 0.042 T-Sp AM (K) 0.955 0.110 0.025

Lasso 0.655 0.713 0.057 Sp AM 0.588 0.350 0.078 RMR 0.610 0.425 0.112 Method ASP ASE Method ASP ASE

t = 0.1 T-Sp AM (F) 1.000 0.041 0.013 t = 0.5 T-Sp AM (F) 1.000 0.037 0.012 T-Sp AM (K) 0.993 0.152 0.041 T-Sp AM (K) 0.993 0.158 0.042

t = 1.0 T-Sp AM (F) 1.000 0.042 0.013 t = 2.0 T-Sp AM (F) 1.000 0.084 0.037 T-Sp AM (K) 0.990 0.193 0.054 T-Sp AM (K) 0.885 0.316 0.093

Lasso 0.865 1.546 0.269 Sp AM 0.988 0.094 0.060 RMR 0.843 0.247 0.062

T-Sp AM(K) and Sp AM have a satisfactory performance under zero-mean noise ϵA, while Lasso and RMR produce a large ASE value as they can only fit functions linearly. Moreover, when t tends to be small, T-Sp AM(F) and T-Sp AM(K) outperforms other competitors under zero-mode noise ϵB

in the sense that it can select all variables correctly and obtain the smallest ASE. This result implies the robustness of T-Sp AM(F) to skewed zero-mode noise and supports the findings in Theorem 3.13. Furthermore, we investigate the impact of t and n on the performance of T-Sp AM(F) in

Tilted Sparse Additive Models

200 300 400 500 600 700 800 Sample Size n

Executive time (s)

T-Sp AM(F) T-Sp AM(K)

(a) Executive time under noise ϵA

200 300 400 500 600 700 800 Sample size n

Executive time (s)

T-Sp AM(F) T-Sp AM(K)

(b) Executive time under noise ϵB

Figure 3. The executive time for T-Sp AM(K) and T-Sp AM(F) when taking t = 0.5.

Table 3. The accuracy on contaminated data (left) and imbalanced data (right) respectively. Robust Classification Imbalanced Classification

Method ASP Accuracy Method ASP Accuracy

T-Sp AM(F) 1.000 0.887 0.028

T-Sp AM(F) 1.000 0.832 0.042 ℓ1-SVM 0.430 0.583 0.043 ℓ1-SVM 1.000 0.500 0.000 SAM 0.980 0.761 0.048 SAM 1.000 0.800 0.046 Sp AM 0.940 0.791 0.058 Sp AM 1.000 0.715 0.096

T-Sp AM(F) 0.530 0.702 0.066

T-Sp AM(F) 1.000 0.880 0.031 ℓ1-SVM 0.530 0.572 0.054 ℓ1-SVM 1.000 0.500 0.000 SAM 0.380 0.598 0.042 SAM 1.000 0.860 0.037 Sp AM 0.410 0.632 0.047 Sp AM 1.000 0.819 0.058

T-Sp AM(F) 0.270 0.554 0.066

T-Sp AM(F) 1.000 0.910 0.025 ℓ1-SVM 0.230 0.534 0.067 ℓ1-SVM 1.000 0.500 0.000 SAM 0.030 0.496 0.042 SAM 1.000 0.885 0.032 Sp AM 0.010 0.499 0.054 Sp AM 1.000 0.863 0.052

Table 4. The accuracy on contaminated and imbalanced classification data.

Method r1 = 0.3, r2 = 0.1 r1 = 0.3, r2 = 0.5 r1 = 0.1, r2 = 0.1 r1 = 0.1, r2 = 0.15

T-Sp AM(F) 0.655 0.078 0.681 0.071 0.782 0.042 0.828 0.037 Sp AM 0.616 0.106 0.680 0.089 0.705 0.080 0.791 0.064 SAM 0.603 0.074 0.639 0.053 0.760 0.074 0.800 0.048 ℓ1-SVM 0.500 0.007 0.501 0.015 0.500 0.002 0.500 0.002

Figure 2. Clearly, with sample size n increasing, the ASE has a downward trend for all t, which verifies the findings in Theorems 3.3, 3.6 and 3.13. Besides, we can see that the ASE becomes smaller as |t| tends to 0 for noise ϵA, which is consistent with the result in Theorem 3.6. Moreover, the decreasing of ASE as t decreases verifies Theorem 3.13. Figure 3 reports the executive times of T-Sp AM(F)

and T-Sp AM(K) under noises ϵA and ϵB. Together with the results from Table 2, we can see that T-Sp AM(F) performs as well as T-Sp AM(K) but runs faster than T-Sp AM(K), which verifies the effectiveness of random Fourier features technique.

Robust classification: For classification, we consider the discriminant function used in (Zhao & Liu, 2012). The

Tilted Sparse Additive Models

Figure 4. AMSE and the weights of variables.

discriminant function is additive and formulated as follows

f (xi) = (xi1 0.5)2 + (xi2 0.5)2 0.08, (9)

where xij = (Wij + Ui)/2 and Wij, Ui are independently from U(0, 1) for i = 1, . . . , 200, j = 1, . . . , 100. The label yi is 0 when f(xi) 0 and 1 otherwise. We add the noise into data by randomly flipping the label with some certain ratios r1 (see Table 3). Here, the competitors include Sp AM (induced by logistic loss) (Ravikumar et al., 2009), SAM (Zhao & Liu, 2012) and linear ℓ1-SVM (Zhu et al., 2003). The variable selection results and classification accuracy on test set are presented in Table 3, which entails the TSp AM(F) behaves better in robust classification.

Imbalanced classification: Now we check the property of T-Sp AM to handle imbalanced classification issue. The synthetic data are constructed from (9) with N = 200, p = 10. We compare our method with Sp AM(induced by logistic loss)(Ravikumar et al., 2009), SAM(Zhao & Liu, 2012), and ℓ1-SVM(Zhu et al., 2003) under some extreme ratios r2 of negative class in the population. Table 3 empirically verifies that the T-Sp AM(F) has an advantage in dealing with imbalanced data.

Multi-objective learning: The effectiveness of multiobjective T-Sp AM is checked under imbalanced and robust classification setting. For simplicity, we only consider the combinations of r1 {0.1, 0.3} and r2 {0.10, 0.15}. Based on (9), the synthetic data are generated with n = 200, p = 10. Table 4 implies the T-Sp AM(F) is slightly superior to the other methods.

5.2. Real-world data experiment

In this section, we evaluate the performance of T-Sp AM on Coronal Mass Ejections (CME) data. The CME data (https://cdaw.gsfc.nasa.gov/CME list/) consists of a output variable (arrival time of coronal mass ejections) and 20 input variables, including center projection angle (CPA), angle width (AW), linear speed (LS), SND speed final (SSF), SND speed 20RS (SSRS), MASS, kinetic energy (KE), mea-

surement position angle (MPA), field magnitude average (FMA), BX, BY, BZ, Speed (S), VX, VY, VZ, proten density (PD), Temperature (T), flow pressure (FP), plasma beta (PB). We randomly split the CME data into three parts: 86 observations for training, 22 observations for validation, and 27 observations for test. To evaluate model performance, the average mean squared error (AMSE) is obtained by repeating the experiment 50 times. To the end, our TSp AM(F) enjoys the smallest AMSE (0.615 0.165 by taking t = 0.1) compared with Sp AM (1.037 0.258), RMR (0.973 0.795) and Lasso (1.868 0.909). Moreover, Figure 4 shows that LS, SSF, FMA and BZ are significant in arrival time prediction, which has been verified in (Liu et al., 2018).

6. Conclusion

In this paper, we propose a novel tilted sparse additive models (T-Sp AM) that can be capable of a variety of learning tasks, e.g., robust estimation, robust classification, imbalanced classification, and multi-objective learning. Under some common skewed or symmetric noise assumptions, theoretical guarantees on generalization bound, function approximation, and variable selection consistency are established for T-Sp AM. In practice, empirical evaluations support the advanced performance of our approach.

Acknowledgements

This work was supported by Yunnan provincial major science and technology special plan projects under Grant 202202AD080003, National Natural Science Foundation of China under Grant Nos. 12071166, 61671480, HZAU-AGIS Cooperation Fund No. SZYJY2023010, the Fundamental Research Funds for the Central Universities of China under Grant No. 2662023LXPY005, the Major Scientific and Technological Projects of CNPC under Grant ZD2019-183008. Prof. Dacheng Tao is partially supported by Australian Research Council Project FL-170100117.

Tilted Sparse Additive Models

Agarwal, R., Frosst, N., Zhang, X., Caruana, R., and Hinton, G. E. Neural additive models: Interpretable machine learning with neural nets. ar Xiv:2004.13912v1, 2020.

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Borkar, V. S. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294 311, 2002.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Caponnetto, A. and Vito, E. D. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331 368, 2007.

Chen, H., Wang, X., Deng, C., and Huang, H. Group sparse additive machine. In Advances in Neural Information Processing Systems (NIPS), pp. 198 208. 2017.

Chen, H., Wang, Y., Zheng, F., Deng, C., and Huang, H. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2020.3005144, 2020.

Chen, Y. C., Genovese, C. R., Tibshirani, R. J., and Wasserman, L. Nonparametric modal regression. The Annals of Statistics, 44(2):489 514, 2016.

Christmann, A. and Zhou, D. X. Learning rates for the risk of kernel based quantile regression estimators in additive models. Analysis and Applications, 14(3):449 477, 2016.

Cucker, F. and Zhou, D. X. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.

Doss, C. R. and Wellner, J. A. Global rates of convergence of the mles of log-concave and s-concave densities. The Annals of Statistics, 44(3):954 981, 2013.

Feng, Y., Huang, X., Shi, L., Yang, Y., and Suykens, J. A. Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research, 16:993 1034, 2015.

Feng, Y., Yang, Y., and Suykens, J. A. K. Robust gradient learning with applications. IEEE Transactions on Neural Networks and Learning Systems, 27(4):822 835, 2016.

Feng, Y., Fan, J., and Suykens, J. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1 35, 2020.

Guo, Z. C. and Zhou, D. X. Concentration estimates for learning with unbounded sampling. Advances in Computational Mathematics, 38(1):207 223, 2013.

Hastie, T. J. and Tibshirani, R. J. Generalized additive models. London: Chapman and Hall, 1990.

Howard, R. A. and Matheson, J. E. Risk-sensitive markov decision processes. Management Science, 18(7):356 369, 1972.

Huang, J., Horowitz, J. L., and Wei, F. Variable selection in nonparametric additive models. The Annals of Statistics, 38(4):2282 2313, 2010.

J. Reddi, S., Sra, S., Poczos, B., and Smola, A. Fast stochastic methods for nonsmooth nonconvex optimization. 05 2016.

Kandasamy, K. and Yu, Y. Additive approximations in high dimensional nonparametric regression via the SALSA. In International Conference on Machine Learning (ICML), pp. 69 78, 2016.

Kowalski, M. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303 324, 2009. ISSN 1063-5203.

Lee, J., Park, S., and Shin, J. Learning bounds for risksensitive learning. In Advances in Neural Information Processing Systems (NIPS), 2020.

Li, T., Beirami, A., Sanjabi, M., and Smith, V. Tilted empirical risk minimization. In International Conference on Learning Representations (ICLR), 2021a.

Li, Z., Ton, J.-F., Oglic, D., and Sejdinovic, D. Towards a unified analysis of random fourier features. Journal of Machine Learning Research, 22(108):1 51, 2021b.

Liu, G., Chen, H., and Huang, H. Sparse shrunk additive models. In International Conference on Machine Learning (ICML), 2020.

Liu, J., Ye, Y., Shen, C., Wang, Y., and Erdelyi, R. A new tool for cme arrival time prediction using machine learning algorithms: Cat-puma. The Astrophysical Journal, 855(2):109 118, 2018.

Lv, S., Lin, H., Lian, H., and Huang, J. Oracle inequalities for sparse additive quantile regression in reproducing kernel hilbert space. The Annals of Statistics, 46(2):781 813, 2018.

Meier, L., Geer, S. V. D., and Buhlmann, P. Highdimensional additive modeling. The Annals of Statistics, 37(6B):3779 3821, 2009.

Tilted Sparse Additive Models

Osogami, T. Robustness and risk-sensitivity in markov decision processes. pp. 233 241, 2012.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pp. 1177 1184, 2007.

Raskutti, G., J. Wainwright, M., and Yu, B. Minimaxoptimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(2):389 427, 2012.

Ravikumar, P., Liu, H., Lafferty, J., and Wasserman, L. Sp AM: sparse additive models. Journal of the Royal Statistical Society: Series B, 71:1009 1030, 2009.

Reddi, S. J., Sra, S., Poczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 1145 1153. 2016.

Sager, T. W. and Thisted, R. A. Maximum likelihood estimation of isotonic modal regression. The Annals of Statistics, 10(3):690 707, 1982.

Shi, L. Learning theory estimates for coefficient-based regularized regression. Applied and Computational Harmonic Analysis, 34(2):252 265, 2013.

Shi, L., Feng, Y. L., and Zhou, D. X. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31(2):286 302, 2011.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science and Business Media, 2008.

Steinwart, I., Hush, D., and Scovel, C. Optimal rates for regularized least squares regression. In Annual Conference on Learning Theory (COLT), 2009.

Stone, C. J. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689 705, 1985.

Tan, Z. and Zhang, C.-H. Doubly penalized estimation in additive regression with high-dimensional data. The Annals of Statistics, 47(5):2567 2600, 2019.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 73(3):267 288, 1994.

Wang, C. and Zhou, D.-X. Optimal learning rates for least squares regularized regression with unbounded sampling. Journal of Complexity, 27(1):55 67, 2011.

Wang, X., Chen, H., Cai, W., Shen, D., and Huang, H. Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems (NIPS), pp. 1448 1458, 2017.

Wang, Y., Zhong, X., He, F., Chen, H., and Tao, D. Huber additive models for non-stationary time series analysis. In International Conference on Learning Representations (ICLR), 2022.

Wu, Q., Ying, Y., and Zhou, D. X. Learning rates of leastsquare regularized regression. Foundations of Computational Mathematics, 6(2):171 192, 2006.

Wu, Q., Ying, Y., and Zhou, D. X. Multi-kernel regularized classifiers. Journal of Complexity, 23(1):108 134, 2007.

Yang, L., Lv, S., and Wang, J. Model-free variable selection in reproducing kernel hilbert space. Journal of Machine Learning Research, 17(1):2885 2908, 2016.

Yang, Z., Zhang, A., and Sudjianto, A. Gami-net: An explainable neural network based on generalized additive models with structured interactions. ar Xiv:2003.07132, 2020.

Yao, W. and Li, L. A new regression model: modal linear regression. Scandinavian Journal of Statistics, 41(3): 656 671, 2013.

Yin, J., Chen, X., and Xing, E. P. Group sparse additive models. In International Conference on Machine Learning (ICML), pp. 871 878, 2012.

Yuan, M. and Zhou, D. X. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564 2593, 2016.

Zhao, T. and Liu, H. Sparse additive machine. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 1435 1443. PMLR, 21 23 Apr 2012.

Zhou, D.-X. The covering number in learning theory. Journal of Complexity, 18(3):739 767, 2002.

Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm support vector machines. pp. 49 56. MIT Press, 2003.

Zou, B., Li, L., and Xu, Z. The generalization performance of erm algorithm with strongly mixing observations. Machine Learning, 75(3):275 295, 2009.

Tilted Sparse Additive Models

Appendix to Tilted Sparse Additive Models

A. Notations

Some used notations are summarized in Table 5.

Table 5. Notations Notations Descriptions X, Y the input space and the output space, respectively X, Y random variables taking values in X and Y, respectively x, y realizations of X and Y , respectively ϵ the noise variable specified by the residual Y f (X) p the dimension of input variable X n the sample size z a set of n-size realizations of X, Y , i.e., z = {(xi, yi)}n i=1 ρ the joint probability distribution of X Y ρX the marginal distribution of X p Y |X the conditional density of Y conditioned on X pϵ|X the conditional density of ϵ conditioned on X L2 ρ2 X the function space of square-integrable functions with respect to ρX HK the data independent reproducing kernel Hilbert space Hz the data dependent hypothesis space f the unknown ground truth function R(t, f) the population version of tilted empirical risk minimization framework Rz(t, f) the tilted empirical risk minimization framework G(t, f) the population version of stepping-stone objective function Gz(t, f) the empirical version of stepping-stone objective function Eρ(f) the objective function for conditional mean function EM(f) the objective function for conditional mode function

B. Proof of Theorem 1

Proof sketch: Usually, the generalization analysis of traditional ERM-based algorithm can be conducted by considering the expected risk and empirical risk as the expectation of random error variable and the average of random independent error observations (Cucker & Zhou, 2007; Steinwart & Christmann, 2008). However, R(t, P(fz,t)) and Rz(t, P(fz,t)) associated with TERM do not enjoy this property due to the additional logarithmic function. Hence, the concentration inequalities (e.g., used in (Wu et al., 2007; Steinwart & Christmann, 2008)) can not be used to bound the generalization of T-Sp AM directly. In this part, we overcome this difficulty by introducing the stepping-stone objectives

G(t, f) := 1

Z et(f(x) y)2dρ(x, y)

Gz(t, f) := 1

i=1 et(f(xi) yi)2.

To the end, the main building blocks in generalization consistency analysis are presented in Figure 5.

We assume ℓ(Pf(X), Y ) Mℓand ℓ (Pf(X), Y ) Mℓ for any f(X) and Y . To bound G(t, P(fz,t)) Gz,t(t, P(fz,t)), we first give the upper bound of fz,t K.

Lemma B.1. Under Assumption 1, there hold

fz,t K M 2 ℓ κn 1 2 λ 1 τ 1, t ( , 0) (0, + )

where τ = minj=1,...,p τj is a positive constant.

Tilted Sparse Additive Models

Theorem 1 Generalization error bound ℛ(𝑡, 𝒫(𝑓𝐳,#)) ℛ𝐳(𝑡, 𝒫(𝑓𝐳,#))

Bounding 𝒢(𝑡, 𝒫(𝑓𝐳,#)) 𝒢𝐳(𝑡, 𝒫(𝑓𝐳,#))

Figure 5. An illustration of the main building blocks in generalization consistency analysis. We first establish the upper bound of G(t, P(fz,t)) Gz(t, P(fz,t)) with the help of Lemmas B.1-B.3. Lemma B.4 investigates the relation between R(t, fz,t) Rz(t, fz,t) and G(t, fz,t) Gz(t, fz,t). Finally, Theorem 1 can be obtained by combining Lemma B.4 and the bound of the generalization error G(t, P(fz,t)) Gz(t, P(fz,t)).

Proof. From the definition of fz,t, we have

Rz(t, fz,t) + λΩz(fz,t) Rz(t, 0) + λΩz(0).

Under Assumption 1, we come to the following two conclusions, i.e.,

λΩ(fz,t) = λ

j=1 τj αz,t j 2 Rz(t, 0) Rz(t, fz,t) 1

t log( Pn i=1 etℓ(0,yi) Pn i=1 etℓ(fz,t(xi),yi) )

Then, if t ( , 0), we have

t log( Pn i=1 etℓ(0,yi) Pn i=1 etℓ(fz,t(xi),yi) ) 1

|t| log e |t|M 2 ℓ= M 2 ℓ.

Also, if t (0, + ), the similar result can be obtained, i.e.,

t log( Pn i=1 etℓ(0,yi) Pn i=1 etℓ(fz,t(xi),yi) ) 1

t log et M2 ℓ= M 2 ℓ.

Consequently, we see that

j=1 αz,t j 2 M 2 ℓ λ minj=1,...,p τj , z = (x, y) Z, t ( , 0) (0, + )

In connection with

i=1 |αz,t ji | κn 1 2

j=1 αz,t j 2,

we get the desired result of Lemma B.1.

Lemma B.1 illustrates a ball Br = {f HK : f K r}

that covers the estimator fz,t, where

r = κn 1 2 M 2 ℓλ 1 τ 1. (10)

Next, we use the ℓ2-empirical covering number (e.g, used in (Zhou, 2002; Anthony & Bartlett, 1999; Chen et al., 2020; Feng et al., 2020)) to measure the capacity of Br.

Tilted Sparse Additive Models

Definition B.2. Let F be a set of measurable functions on X. Given input samples x = {xi}n i=1, we define the ℓ2-empirical metric as

d2,x(f1, f2) = ( 1

i=1 (f1(xi) f2(xi))2) 1 2 .

Then the ℓ2-empirical covering number of function space F is defined as

N2(F, ϵ) = sup n sup x inf m {m N : {fj}m j=1 F, s.t., F m j=1{f F : d2,x(f, fj) < ϵ}}, ϵ > 0.

Indeed, the empirical covering number of Br has been investigated extensively in learning theory literatures (Steinwart & Christmann, 2008). There are some detailed examples, e.g., Examples 1-2 in (Guo & Zhou, 2013), Theorem 2 in (Shi et al., 2011) and Lemma 3 in (Shi, 2013).

The following uniform concentration inequality established in (Wu et al., 2007; Steinwart & Christmann, 2008) is used for bounding G(t, P(fz,t)) Gz,t(t, P(fz,t)). Lemma B.3. Let T be a measurable function set on Z. Suppose that there are some constants B, c and ϑ [0, 1] such that h B, Eh2 c(Eh)ϑ for each h T . If for 0 < s < 2, log N2(T , ϵ) aϵ s, ϵ > 0, a > 0, then for any δ (0, 1) and given sample set z = {zi}n i=1 = {(xi, yi)}n i=1 Zn, there holds

i=1 h(zi) ω1 ϑ(Eh)ϑ

2 + csω + 2(c log 1

δ n ) 1 2 ϑ + 18B log 1

with confidence at least 1 δ, where

ω = max{c 2 s 4 2ϑ+sϑ ( a

n) 2 4 2ϑ+sϑ , B 2 s 2+s ( a

and cs is a constant depending on s.

We then present the concentration estimation for G(t, P(fz,t)) Gz(t, P(fz,t)). For this purpose, we first define a functionbased random variable set as

V = {h(z) := hf(z) = 1

t etℓ(P(f)(x),y) : f Br, z Z}, t ( , 0) (0, + ),

where r is defined in (10).

Under Assumption 1, for any f1, f2 Br, there holds

|hf1(z) hf2(z)| = |1

t etℓ(P(f1)(x),y) 1

t etℓ(P(f2)(x),y)| f M|f1(x) f2(x)|, (11)

where f M = Mℓ e4t M2 ℓ, t (0, + ) and f M = Mℓ , t ( , 0).

Then, we get log N2(V, ϵ) log N2(Br, ϵf M 1) log N2(B1, ϵf M 1r 1) cs f M srsp1+sϵ s,

where the s value is given in Theorem 1 and this inequality comes from the covering number bounds for HKj with Kj Cν

(see Theorem 2 in (Shi et al., 2011) and Lemma 3 in (Shi, 2013) for more details). It is trivial to verify that

t et M 2 ℓ1{0<t<+ } + 1

|t|1{ <t<0} := e V , (12)

Eh2 E h 2 = e V 2(Eh)0.

Now applying Lemma B.3 to the function-based random variable set V with B = e V , a = f M srsp1+s, ϑ = 0 and c = e V 2, we have, for any f Br and 0 < δ < 1,

G(t, P(f)) Gz(t, P(f))

2 + cs) max{e V 2 s

2 ( f M srsp1+s

n ) 1 2 , e V 2 s 2+s ( f M srsp1+s

n ) 2 2+s } + 2( e V 2 log 1

δ n ) 1 2 + 18e V log 1

Tilted Sparse Additive Models

with confidence at least 1 δ.

For any t ( , 0), we get e V = |t| 1, r = M 2 ℓ κ τ 1n 1 2 λ 1 and f M = Mℓ . For any t (0, + ), we have r = M 2 ℓ κ τ 1n 1 2 λ 1, e V = 1

t e4t M 2 ℓand f M = Mℓ et M 2 ℓ.

By taking λ = n ζ, Lemma B.3 gives that

G(t, P(fz,t)) Gz(t, P(fz,t))

|t| s 2 2+s nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( , 1]

|t| 1nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( 1, 0)

|t| 1et M 2 ℓnmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t (0, 1)

|t| s 2 2+s et M 2 ℓnmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t [1, + )

with confidence at least 1 δ, where a b means there exists a positive constant c such that a cb, a, b R.

To the end, we get back to the concerned point through following lemma.

Lemma B.4. Let Assumption 1 be true. There holds

R(t, P(fz,t)) Rz(t, P(fz,t)) e M 2 ℓ|t||G(t, P(fz,t)) Gz(t, P(fz,t))|, t ( , 0),

R(t, P(fz,t)) Rz(t, P(fz,t)) |G(t, P(fz,t)) Gz(t, P(fz,t))|, t (0, + ).

Proof. From the mean value theorem, we have

R(t, P(fz,t)) Rz(t, P(fz,t)) = 1

Z etℓ(P(fz,t)(x),y) 1

i=1 etℓ(P(fz,t)(xi),yi)

Z etℓ(P(fz,t)(x),y) 1

i=1 etℓ(P(fz,t)(xi),yi))

G(t, P(fz,t)) Gz(t, P(fz,t)) ,

where ξ is between R

Z etℓ(P(fz,t)(xi),yi) and 1

n Pn i=1 etℓ(P(fz,t)(xi),yi). It is trivial to see that

G(t, P(fz,t)) Gz(t, P(fz,t)) e4|t|M 2 ℓ|G(t, P(fz,t)) Gz(t, P(fz,t))|, t ( , 0),

G(t, P(fz,t)) Gz(t, P(fz,t)) |G(t, P(fz,t)) Gz(t, P(fz,t))|, t (0, + ).

We complete the proof of lemma B.4.

In connection with Lemmas B.1-B.3, for any 0 < δ < 1 and any fixed t ( , 0) (0, + ), there holds

R(t, P(fz,t)) Rz(t, P(fz,t)) nmax{ s 2+2sζ

2 } log(1/δ).

with confidence at least 1 δ. This completes the proof of Theorem 1.

C. Proof of Theorem 2

Proof sketch: In what follows, we shall make efforts to conduct the function approximation analysis. Clearly, the datagenerating model Y = f (X) + ϵ and Assumption 3 together ensure that the ground truth function f is a conditional mean function (Wu et al., 2006; Caponnetto & Vito, 2007; Wang & Zhou, 2011; Feng et al., 2015), i.e.,

Y ydρ(y|x),

Tilted Sparse Additive Models

Theorem 2 Function Approximation Error

𝒫(𝑓𝐳,#) 𝑓 ℒ!𝒳

(Proposition 2)

Sample Error (Proposition 1) Error Decomposition

(Proposition 3) Hypothesis Error

Figure 6. An illustration of the main building blocks in function approximation analysis. The error decomposition in Proposition C.2

supports Propositions C.4 and C.6. The strategy for bounding the sample error (Proposition C.4) is similar to that for generalization error analysis (Section B). The upper bound of αη j 2, j = 1, .., p in Lemma 5 is key to bound the hypothesis error in Proposition C.6. Finally, Theorem 2 is obtained by combining the results in Propositions C.2-C.6.

where f = arg min f HK Eρ(f) and Eρ(f) := Z

Z (f(x) y)2dρ(x, y).

In view of the above discussions, Figure 6 presents the main building blocks in function approximation analysis.

Firstly, we measure the deviations between the excess risk terms Eρ(P(fz,t)) Eρ(f ) and R(t, P(fz,t)) R(t, f ).

Lemma C.1. Let Assumptions 1-3 be true. There holds

Eρ(P(fz,t)) Eρ(f ) R(t, P(fz,t)) R(t, f ) + Mt,f ,

where Mt,f = 16M 4|t|e|t|(M+ f )2 for any t ( , 0) and |t|(M + f )4 for any t (0, + ).

Proof. From the definitions of Eρ(f) and R(t, f), we have

Eρ(P(fz,t)) Eρ(f ) (R(t, P(fz,t)) R(t, f ))

Z [y P(fz,t)(x)]2 (y f (x))2dρ(x, y) 1

Z et[y P(fz,t)(x)]2 log Z

Z et(y f (x))2dρ(x, y) i

Z [y P(fz,t)(x)]2 (y f (x))2dρ(x, y) 1

t log h 1 +

Z et[y P(fz,t)(x)]2dρ(x, y) R

Z et(y f (x))2dρ(x, y) 1 i .

Taking t ( , 0) gives

Eρ(P(fz,t)) Eρ(f ) (R(t, P(fz,t)) R(t, f ))

Z [y P(fz,t)(x)]2 (y f (x))2dρ(x, y)

Z e |t|(y f (x))2dρ(x, y)

Z e |t|[y P(fz,t)(x)]2dρ(x, y) Z

Z e |t|(y f (x))2dρ(x, y) i

e|t|(M+ f )2h Z

Z [y P(fz,t)(x)]2 (y f (x))2dρ(x, y)

Z e |t|[y P(fz,t)(x)]2dρ(x, y) Z

Z e |t|(y f (x))2dρ(x, y) i .

Tilted Sparse Additive Models

By applying Taylor s Theorem to both G(t, P(fz,t)) and G(t, f ), we get

Eρ(P(fz,t)) Eρ(f ) (R(t, P(fz,t)) R(t, f ))

|t|e|t|(M+ f )2h Z

Z [y P(fz,t)(x)]4eξ1 (y f (x))4eξ2dρ(x, y) i ,

16M 4|t|e|t|(M+ f )2,

where ξ1 is between 0 and t[y P(fz,t)(x)]2, and ξ2 is between 0 and t(y f (x))2.

Moreover, if t (0, + ), we can deduce

Eρ(P(fz,t)) Eρ(f ) (R(t, P(fz,t)) R(t, f ))

Z [y P(fz,t)(x)]2 (y f (x))2dρ(x, y)

Z et(y f (x))2dρ(x, y)

Z et(y f (x))2dρ(x, y) Z

Z et[y P(fz,t)(x)]2dρ(x, y) i

Z [y P(fz,t)(x)]2 (y f (x))2dρ(x, y)

Z et(y f (x))2dρ(x, y) Z

Z et[y P(fz,t)(x)]2dρ(x, y)) i

|t|(M + f )4.

This completes the proof.

In the following, we make a direct error decomposition.

Proposition C.2. (Error Decomposition) Under Assumptions 1-3, there holds

P(fz,t) f 2 Lρ2 X E1 + E2 + η f 2 K + Mt,f ,

where E1 = R(t, P(fz,t)) Rz(t, P(fz,t)) + Rz(t, f ) R(t, f )

and E2 = Rz(t, fz,t) + Ωz(fz,t) Rz(t, fη,t) η fη,t 2 K.

Proof. The definition of fη,t implies that

Rz(t, fη,t) + η fη,t 2 K η f 2 K Rz(t, f ).

Based on Lemma C.1, we can deduce that

P(fz,t) f 2 Lρ2 X Eρ(P(fz,t)) Eρ(f )

R(t, P(fz,t)) R(t, f ) + Mt,f

R(t, P(fz,t)) Rz(t, P(fz,t)) + Rz(t, P(fz,t)) Rz(t, f ) + Rz(t, f ) R(t, f ) + Mt,f

E1 + Rz(t, fz,t) + λΩz(fz,t) Rz(t, f ) + Mt,f

E1 + Rz(t, fz,t) + λΩz(fz,t) (Rz(t, fη,t) + η fη,t 2 K η f 2 K) + Mt,f

E1 + E2 + η f 2 K + Mt,f .

In learning theory literatures, we call E1 , E2 as the sample error and the hypothesis error, respectively. The sample error E1 describes the divergence between the generalization error terms Rz(t, P(fz,t)) R(t, P(fz,t)) and Rz(t, f ) R(t, f ). The hypothesis error E2 characterizes the difference between the regularized empirical formulations in HK and Hz.

Tilted Sparse Additive Models

Lemma C.3. Let ξ be a random variable on a probability space Z with variance σ satisfying |ξ Eξ| Mξ almost surely for some constant Mξ and for all z Z. Then for any 0 < δ < 1, with confidence 1 δ, we have

i=1 ξ(zi) Eξ 2Mξ log(1/δ)

2σ2 log(1/δ)

Proposition C.4. (Bounding sample error E1) Let Assumption 1 be true and each Kj Cv, j = 1, ..., p. For any δ (0, 1) and f HK, there holds

|t| s 2 2+s e4|t|M 2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( , 1]

|t| 1e4|t|M 2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( 1, 0)

|t| 1e8t M 2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t (0, 1)

|t| s 2 2+s e8t M 2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t [1, + )

with confidence at least 1 δ, where r, f M, e V are defined in (10), (11) and (12), respectively.

Proof. The proof proceeds similarly to the generalization consistency analysis in Section B. After defining a function-based variable set as

T = {hf(z) = 1

t et[P(f)(x) y]2 1

t et[f (x) y]2 : f Br, g HK},

Lemma B.3 holds for the function set T with B = e V , a = f M srsp1+s, ϑ = 0 and c = e V 2, where f M and e V are defined in (11) and (12), respectively. Then we have

G(t, P(fz,t)) Gz(t, P(fz,t)) (G(t, f ) Gz(t, f ))

|t| s 2 2+s nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( , 1]

|t| 1nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( 1, 0)

|t| 1e4t M2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t (0, 1)

|t| s 2 2+s e4t M2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t [1, + )

According to Lemma B.4, for any f HK, we have

R(t, f) Rz(t, f) = 1

ξ (G(t, f) Gz(t, f)),

where ξ is between R

Z et(f(x) y)2dρ and 1

n Pn i=1 et(f(xi) yi)2. Then we have

R(t, P(fz,t)) Rz(t, P(fz,t)) R(t, f ) Rz(t, f )

n G(t, P(fz,t)) Gz(t, P(fz,t)) (G(t, f ) Gz(t, f )) o

ξ1 1) G(t, P(fz,t)) Gz(t, P(fz,t)) + (1 1

ξ2 ) G(t, f ) Gz(t, f )

where ξ1 is between R

Z et(P(fz,t)(x) y)2dρ and 1 n Pn i=1 et(P(fz,t)(xi) yi)2 and ξ2 is between R

Z et(f (x) y)2dρ and 1 n Pn i=1 et(f (xi) yi)2. If t ( , 0), we further have

R(t, P(fz,t)) Rz(t, P(fz,t)) R(t, f ) Rz(t, f )

n G(t, P(fz,t)) Gz(t, P(fz,t)) (G(t, f ) Gz(t, f )) o

= e4|t|M 2n G(t, P(fz,t)) Gz(t, P(fz,t)) + Gz(t, f ) G(t, f ) o

Tilted Sparse Additive Models

According to Lemma B.3, we have

G(t, P(fz,t)) Gz(t, P(fz,t))

( |t| s 2 2+s nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( , 1]

|t| 1nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t ( 1, 0)

We define the following function set

T = {hf(z) = 1

t et[f(x) y]2 : f HK}.

For any hf T , we have |hf Ehf| 2 hf e V and σ2 = Eh2 f (Ehf)2 hf 2 e V 2. Then, from Lemma C.3, we have

Gz(t, f ) G(t, f ) 4 log(1/δ)

t2n |t| 1n 1

Moreover, for t (0, + ), we have R(t, P(fz,t)) Rz(t, P(fz,t)) R(t, f ) Rz(t, f )

n G(t, P(fz,t)) Gz(t, P(fz,t)) (G(t, f ) Gz(t, f )) o

ξ1 ) Gz(t, P(fz,t)) G(t, P(fz,t)) + (1 1

ξ2 ) G(t, f ) Gz(t, f )

We also have

Gz(t, P(fz,t)) G(t, P(fz,t)) 4e4t M 2 log(1/δ)

2e4t M2 log(1/δ)

t2n t 1e4t M2n 1

G(t, f ) Gz(t, f )

( t 1e4t M 2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t (0, 1)

t s 2 2+s e4t M 2nmax{ 1

2 , s 2+2sζ

2+s log(1/δ), t [1, + ).

We get the desired result by combining the above inequalities.

To bound the hypothesis error E2, we first illustrate a key property for the coefficient of fη,t.

Lemma C.5. Under Assumption 2, we have

τj αη j 2 Mt,n,η, j = 1, ..., p,

where Mt,n,η = 2(M+ fη,t )et(M+ fη,t )

n 1 2 η for any t ( , 0) (0, + ).

Proof. From the definition of fη,t, we can deduce that

Rz(t, fη,t) + η

j=1 τj fη,t,j 2 K = 0.

Based on fη,t(xi) = KT i αη, simple computations show that

Pn i=1 2et(fη,t(xi) yi)2(fη,t(xi) yi)Kji Pn i=1 et(fη,t(xi) yi)2 + ητj(αη j )T e Kj = 0, j = 1, ..., p,

where e Kj = (Kj(xij, xsj))n i,s=1 Rn n, j = 1, ..., p. Consequently, for any j = 1, ..., p,, we have the following equivalent form

2et(fη,t(x1) y1)2(fη,t(x1) y1) Pn i=1 et(fη,t(xi) yi)2 , ..., 2et(fη,t(xn) yn)2(fη,t(xn) yn) Pn i=1 et(fη,t(xi) yi)2 e Kj = ητj(αη j )T e Kj,

Tilted Sparse Additive Models

Since e Kj, j = 1, ..., p is positive-definite, we further get

2et(fη,t(x1) y1)2(fη,t(x1) y1) Pn i=1 et(fη,t(xi) yi)2 , ..., 2et(fη,t(xn) yn)2(fη,t(xn) yn) Pn i=1 et(fη,t(xi) yi)2 T = ητjαη j , j = 1, ..., p.

For any j = 1, ..., p, it follows that

τj αη j 2 = 1

s Pn i=1[2et(fη,t(xi) yi)2(fη,t(xi) yi)]2

(Pn i=1 et(fη,t(xi) yi)2)2 2 e|t|(fη,t(x) y)2(fη,t(x) y)

n 1 2 η Mt,n,η,

where Mt,n,η = 2(M+ fη,t )e|t|(M+ fη,t )2

n 1 2 η , t ( , 0) (0, + ). This completes the proof.

Proposition C.6. (Bounding hypothesis error E2) Under Assumption 2, the hypothesis error E2 satisfies

j=1 τj αη j 2 2λp Mt,n,η.

Proof. From the definition of fz,t, we know that

Rz(t, fz,t) + λΩz(fz,t) Rz(t, fη,t) + λΩz(fη,t).

E2 = Rz(t, fz,t) + λΩz(fz,t) (Rz(fη,t) + λΩz(fη,t)) + λΩz(fη,t) η fη,t 2 K λΩz(fη,t) η fη,t 2 K λΩz(fη,t)

The desired result follows by combining the above inequality with Lemma C.5

Set η = n 1

4 λ 1 2 , λ = n ζ, and t = n β. Combining Propositions C.2-C.6, we have

P(fz,t) f 2 L2ρX nmax{ 4β+s 2+2sζ

4 , β} log(1/δ)

with confidence at least 1 δ. This completes the proof of Theorem 2.

D. Proof of Theorem 3

Proof sketch: The strategy for Theorem 3 is similar to that for Theorem 2 except that we need to reestablish the relation between Eρ(P(fz,t)) Eρ(f ) and G(t, P(fz,t)) G(t, f ).

Lemma D.1. Let Assumptions 1-2 and 4 be true. When t value is fixed and t ( 1 2(2Mf +M0)2 , 0), there holds

P(fz,t) f 2 L2ρX = Eρ(P(fz,t)) Eρ(f ) G(t, P(fz,t)) G(t, f ).

Proof. For any f HK, we can see that

R(t, f) = 1

Z et(f(x) y)2dρ(x, y) = 1

R et(f(x) u f (x))2pϵ|X(u)dudρX (x)

X E(f(x) f (x))dρX (x),

where E(m) = R

R et(u m)2pϵ|X(u)du.

Tilted Sparse Additive Models

By computing the derivative of E(f(x) f (x)) w.r.t f(x) f (x), we get

X E (f(x) f (x))dρX (x) t R

X E(f(x) f (x))dρX (x)

R et[u (f(x) f (x))]2[u (f(x) f (x))]pϵ|X=x(u)dudρX (x) R

X E(f(x) f (x))dρX (x)

R metm2pϵ|X=x(f(x) f (x) + m)dmdρX (x) R

X E(f(x) f (x))dρX (x) ,

where m = u (f(x) f (x)). Under noise Assumption 3, there holds R

R me |t|m2pϵ|X(m)dm = 0. Thus, we can deduce that R (t, f) = 0 holds only when f(x) = f (x) for any x X. Furthermore,

X E (f(x) f (x))dρX (x) t R

X E(f(x) f (x))dρX (x)

R e |t|m2(2 4|t|(u m)2)pϵ|X(m + f(x) f (x))dmdρX (x) R

X E(f(x) f (x))dρX (x) ,

where m = u (f(x) f (x)). Then, when 2 4|t|(u m)2 > 0, i.e., t ( 1 2(M+ f )2 , 0) (0, 1 2(M+ f )2 ), we have R (t, f) > 0. The above-discussed results show that f = f is the unique minimizer of R(t, f). Then we can deduce that

R(t, P(fz,t)) R(t, f )

X E(0)dρX (x) log Z

X E(P(fz,t)(x) f (x))dρX (x) i

X E(0) E(P(fz,t)(x) f (x))dρX (x)

X [ E (0)(P(fz,t)(x) f (x)) E (ξ)

2 (P(fz,t)(x) f (x))2]dρX (x)

2 (P(fz,t)(x) f (x))2dρX (x),

where ξ1 is between R

X E(P(fz,t)(x) f (x))dρX (x) and R

X E(0)dρX (x), and ξ2 is between 0 and P(fz,t)(x) f (x).

Under the condition t ( 1 2(M+ f )2 , 0), the inequality m2 (M + f + pϵ|X )2 yields

G(t, P(fz,t)) G(t, f ) = 1

2 (P(fz,t)(x) f (x))2dρX (x) e C[Eρ(P(fz,t)) Eρ(f )],

where e C = [1 2|t|(M + f )2]e |t|(M+ pϵ|X + f )2 Z

R pϵ|X(u)du

is a positive constant. This completes the proof.

Lemma D.1 in connection with the idea of error decomposition in Proposition C.2 yield that

P(fz,t) f 2 L2ρX E(P(fz,t)) E(f ) e C 1(G(t, P(fz,t)) G(t, f ))

e C 1(E1 + E2 + η f 2 K) E1 + E2 + η f 2 K.

Let t be fixed and satisfy t ( 1 2(M+ f + pϵ|X )2 , 0). Combining the above decomposition with Propositions C.4-C.6, we have

P(fz,t) f 2 L2ρX (n s 2

2 η 1λ + η) log(1/δ)

with confidence at least 1 δ. Moreover, the desired result in Theorem 3 follows by setting n 1

2 η 1λ = η and λ = n ζ.

Tilted Sparse Additive Models

E. Proof of Theorem 4

The data-generating model Y = f (X)+ϵ and Assumption 5 together ensure that the ground truth function f is conditional mode function (Sager & Thisted, 1982; Yao & Li, 2013; Feng et al., 2020; Wang et al., 2017), i.e.,

f = arg max t p Y |X(t|X = x),

with f = arg min f HK EM(f), where EM(f) := Z

Z p Y |X(f(x)|X = x)dρX (x).

The data-generating model Y = f (X) + ϵ and Assumption 5 ensure that f = f M is true. The proof of Theorem 4 proceeds similarly to Theorem 2 except that we need to consider the difference between the two excess risk terms EM(P(fz,t)) EM(f ) and R(t, P(fz,t)) R(t, f ) under Assumption 5. Lemma D.2. Under Assumptions 5, we can deduce that

EM(P(fz,t)) EM(f ) π 1

2 |t|3/2[R(t, P(fz,t)) R(t, f )] + |t| 1Mπ,ϵ, t ( , 0),

EM(P(fz,t)) EM(f ) π 1

2 t3/2[R(t, P(fz,t)) R(t, f )] + 2( pϵ|X +

π ), t (0, + ),

where Mπ,ϵ = π 1

2 p ϵ|X( |X = x) R

R u2e u2du is a positive constant.

Proof. From the model assumption ϵ = Y f (X), we have Z

Z p Y |X(f(x)|X = x)dρX (x) = Z

Z pϵ|X(f(x) f (x)|X = x)dρX (x).

For any f HK, direct computations show

EM(f) |t|3/2

Z p Y |X(f(x)|X = x)dρX (x) |t|3/2

Z et(f(x) y)2

Z pϵ|X(f(x) f (x)|X = x)dρX (x) |t|3/2

R et(v f(x)+f (x))2pϵ|X(v|X = x)dvρX (x)

Z pϵ|X(f(x) f (x)|X = x)dρX (x) |t|3/2

R etu2pϵ|X(u + f(x) f (x)|X = x)duρX (x).

By applying Taylor s Theorem to density function pϵ|X( ), for any t ( , 0), we get

R etu2pϵ|X(u + f(x) f (x)|X = x)duρX (x)

R e |t|u2pϵ|X(u + f(x) f (x)|X = x)duρX (x)

R e |t|u2[pϵ|X(f(x) f (x)|X = x) + up ϵ|X(f(x) f (x)|X = x)

2 p ϵ|X(ξ|X = x)]duρX (x),

where ξ is between f(x) f (x) and f(x) f (x) + u. We then have R

R ue |t|u2du = 0 and R

R e |t|u2du = π

any t ( , 0). Therefore,

|EM(f) |t|3/2

π R(t, f)| = Z

X p Y |X(f(x)|X = x)dρX (x) |t|1/2

Z e |t|(f(x) y)2

R eu2u2p ϵ|X(ξ|X = x)dudρX (x) p ϵ|X( |X = x)

R u2e u2du.

Tilted Sparse Additive Models

For any t (0, + ), there holds

π R(t, f) = [ Z

X p Y |X(f(x)|X = x)dρX (x) +

Z et(f(x) y)2]

( pϵ|X + 4t3/2M 2

For any t ( , 0), we get

|EM(P(fz,t)) EM(f ) |t|3/2

π [R(t, P(fz,t)) R(t, f )]| p ϵ|X( |X = x) π|t|

R u2e u2du.

We complete the proof by denoting Mπ,ϵ = p ϵ|X( |X=x) π R

R u2e u2du.

From Lemma D.2, the difference between EM(f) and t3/2

π R(t, f) always exists for any t (0, + ). Therefore, we only focus on the function approximation performance when t ( , 0). Lemma D.3. For any t ( , 0), there holds

P(fz,t) f 2 L2 ρX |t|3/2

π [E1 + E2 + η f 2 K] + Mπ,ϵ|t| 1,

where E1 and E2 are bounded respectively in Propositions C.4-C.6.

Proof. Similar with the error decomposition in Proposition C.2 and D.2, we get

EM(P(fz,t)) EM(f ) |t|3/2

π (R(t, P(fz,t)) R(t, f M)) + Mπ,ϵ|t| 1

π [E1 + E2 + η f 2 K] + Mπ,ϵ|t| 1.

Following the Theorem 19 in (Feng et al., 2020), under Assumption 5, one can conclude that

P(fz,t) f 2 L2ρX EM(P(fz,t)) EM(f ).

This completes the proof.

For any t ( , 0), we get e V = |t| 1, r = κ τ 1n 1 2 |t| 1λ 1 and f M = 4M. Setting |t| 3 2 n 1

2 η 1λ = |t| 3 2 η, λ = n ζ

and t = log n β, based on Propositions C.4-C.6 and Lemma D.3, we get

P(fz,t) f 2 L2ρX [ O(nmax{ 16βM2+s 2+2sζ

4 ,β}) + O(log 1 n)] log(1/δ).

When 0 < β min{ 1 8M 2 , 2 s 2sζ

16M 2 } and 2 s

P(fz,t) f 2 L2ρX log 1 n log(1/δ).

This completes the proof of Theorem 4.

E. Variable Selection Consistency

In this section, we aim to investigate the variable selection consistency of T-Sp AM.

Denote S = {1, ..., p } with p p as the set of truly informative variables. Theoretically, we expect the ℓ2-norm αz j 2 = 0 for any j {p + 1, ..., p}. In practice, we screen out the informative variables through Sz = {j : αz j 2 v, j = 1, ..., p} with a positive threshold value v. It is meaningful to investigate the relation between Sz and S .

Similar with the assumptions in (Yang et al., 2016; Wang et al., 2017; Chen et al., 2020), we require some parametric conditions for our analysis.

Tilted Sparse Additive Models

Assumption E.1. There is a positive constant cτ such that maxj S τj cτ minl/ S τl.

Theorem E.2. Under Assumptions 3.2-3.5, we have Sz S by taking 2c 1

2 τ κ 1 2 M 1 2 C

2 |t| 1 2 MCn,t λ Cn,t, where Cn,t = κ2n 1 2 |t| 1 τ 1M 1e|t|M 2 and τ = minj=1,...,p τj.

Theorem E.2 illustrates that T-Sp AM can identify the truly informative variables by taking properly λ and weight τj, j = 1, ..., p. Indeed, the current analysis extends Theorem 4 in (Wang et al., 2017) from a linear regularized modal regression to the nonlinear T-Sp AM. Moreover, it is interesting to further explore variable selection analysis by replacing the parameter conditions here with the incoherence assumptions (e.g. Assumption 4 in (Lv et al., 2018)).

Proof. From the definition of αz,t, we can deduce that

n Pn i=1 et(fz,t(xi) yi)2)

αj + λτj αj αj 2 = 0

for any αz,t j , j {1, ..., p} satisfying αz,t j 2 = 0.

Direct computation shows that

i=1 et((fz,t(xi) yi)2 Rz(t,fz,t))(fz,t(xi) yi)Kji = λτj αz,t j αz,t j 2 .

Taking ℓ2-norm on the both sides, we derive that

i=1 et((fz,t(xi) yi)2 Rz(t,fz,t))(fz,t(xi) yi)Kji 2 = λτj.

Suppose that αz,t j 2 = 0 for j-th informative variable. Under Assumption 1, we get

i=1 et((fz,t(xi) yi)2 Rz(t,fz,t))(fz,t(xi) yi)Kji 2

i=1 |et((fz,t(xi) yi)2 Rz(t,fz,t))(fz,t(xi) yi)|

The reproducing property assures that fz,t κ fz,t K. Moreover, Lemma B.1 shows that

fz,t K κn 1 2 |t| 1λ 1 τ 1e|t|M 2, t ( , 0) (0, + ),

where τ = minj=1,...,p τj is positive constant. Under Assumption 1, we can further deduce that

i=1 |et((fz,t(xi) yi)2 Rz(t,fz,t))(fz,t(xi) yi)|

i=1 e|t|(( fz,t +M)2+ Rz(t,fz,t) )( fz,t + M)

2 κe2|t|( fz,t +M)2( fz,t + M)n 3 2

By setting λ M 1 κ2n 1 2 |t| 1 τ 1e|t|M 2, we have

2 κe2|t|( fz,t +M)2( fz,t + M)n 3 2 4 κ3n2|t| 1 τ 1λ 1e8 κ4n|t| 1 τ 2λ 2.

Then λ2 4 κ3n2|t| 1 τ 1τ 1 j e|t|M 2e8 κ4n|t| 1 τ 2λ 2e2|t|M2 .

Tilted Sparse Additive Models

By taking logarithmic function on both side, we get

C 8 κ4n|t| 1 τ 2e2|t|M 2,

where C = 4 κ3n2|t| 1 τ 1τ 1 j e|t|M 2. By setting λ 2 κ3/2n|t| 1

2 j e |t|M2

2 , we derive

2κ2n 1 2 |t| 1

2 τ 1e|t|M2

2 + 2 κ3/2n|t| 1

Under Assumption 5, for any j / S , the above inequality guarantees that

2κ2n 1 2 |t| 1

2 τ 1e|t|M2

2 + 2 κ3/2n|t| 1

This inequality contradicts with the parameter condition in Theorem 5. Therefore we have αz j 2 = 0 for any j / S .

F. Optimization

In spite of the rich representation power of kernel-based algorithm, it suffers from the high computational cost with large-scale data. Random Fourier features have shown potential for accelerating the training associated with kernel methods and may achieve even better results (Rahimi & Recht, 2007; Li et al., 2021b). The main idea of random Fourier acceleration is to approximate kernel evaluation Kj( , ), j = 1, ..., p by

Kj(xij, xtj) ψj(xij)T ψ(xtj), i, t {1, ..., n},

where ψj : R Rd is a random Fourier feature map constructed by the Algorithm 1 in (Rahimi & Recht, 2007).

Denote ψ(xi) = (ψ1(xi1)T , ψ2(xi2)T , . . . , ψp(xip)T )T Rpd. Recalling the optimization objectives in (6) and (8)

ˆW = arg min W =(w T 1 ,w T 2 ,...,w T p )T {Rz,ψ(t, W) + λΩz(W)} , (13)

Rz,ψ(t, W) := 1

i=1 etℓ(W T ψ(xi),yi)) and Ωz(W) =

j=1 τj wj 2.

The loss function ℓ( , ) can be selected as least squared loss for regression task and logistic loss for classification task. Simple computation shows that

W Rz,ψ(t, W) = 1

etℓ(W T ψ(xi),yi) Pn j=1 etℓ(W T ψ(xj),yj) W ℓ(W T ψ(xi), yi),

where W ℓ(W T ψ(xi), yi) = (W T ψ(xi) yi)ψ(xi) for least squared loss and W ℓ(W T ψ(xi), yi) =

e W T ψ(xi)

1+e W T ψ(xi) ψ(xi) yiψ(xi) for logistic loss.

Let M be the number of inner loop, S be the number of outer loop, and η0 be the step size. We suppose that Ik with |Ik| = b is a set randomly picked from {1, . . . , n}. Denote RIk z,ψ as the gradient related to Ik samples. To overcome the non-smoothness property of the regularizer Ωz, we introduce the following the proximal operator as (Kowalski, 2009; Boyd & Vandenberghe, 2004)

proxη0,λΩz(W) := (Sλη0(w1)T , Sλη0(w2)T , . . . , Sλη0(wp)T )T ,

Sλη0 (wj) =

( 0, if wj 2 λη0 wj 2 λη0

wj 2 wj, otherwise. for j = 1, . . . , p.

Following the non-convex optimization algorithm Prox SVRG in(J. Reddi et al., 2016), the detailed steps for solving (13) are summarized in Algorithm 1.

Tilted Sparse Additive Models

Algorithm 1 Prox SVRG for (13)

Input: Observations z = {(xi, yi)}n i=1, f W 0 = W 0 M = W 0 Rpd, t ( , 0) (0, + ) for s = 0; s < S; s = s + 1 do

W s+1 0 = W s M gs+1 = W Rz,ψ(t, f W s) for k = 0; k < M; k = k + 1 do

Uniformly randomly pick Ik {1, . . . , n} such that |Ik| = b vs+1 k = W RIk z,ψ(t, W s+1 k ) W RIk z,ψ(f W s) + gs+1

W s+1 k+1 = proxη0,λΩz(W s+1 k η0vs+1 k ) end f W s+1 = W s+1 M end Output: ˆW = W S M

Moreover, the optimization for multi-objective learning can also be achieved through replacing the gradient W Rz,ψ(t, W) in Algorithm 1 with

W Rz,ψ(t, γ, W) = X

z g eγℓ(W T ψ(x),y) t γ 1

g G 1 |g | P

z g eγℓ(W T ψ(x),y) t

γ eγℓ(W T ψ(x),y) W ℓ(W T ψ(x), y).