# generalized_sparse_additive_models__c217aa07.pdf Journal of Machine Learning Research 23 (2022) 1-56 Submitted 2/20; Revised 4/21; Published 05/22 Generalized Sparse Additive Models Asad Haris asad.haris@ubc.ca Department of Earth, Ocean and Atmospheric Sciences University of British Columbia 2020 2207 Main Mall Vancouver, BC, Canada V6T 1Z4 Noah Simon nrsimon@uw.edu Ali Shojaie ashojaie@uw.edu Department of Biostatistics University of Washington Seattle, WA 98195-7232, USA Editor: Garvesh Raskutti We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework. Keywords: Generalized Additive Models, Sparsity, Minimax, High-Dimensional, Penalized Regression 1. Introduction In this paper, we model a response variable as an additive function of a potentially large number of covariates. The problem can be formulated as follows: we are given n observations with response yi R and covariates xi Rp for i = 1, . . . , n. The goal is to fit the model g (E (yi|xi)) = β + j=1 fj (xij) , i = 1, . . . , n, for a prespecified link function g, unknown intercept β and, unknown component functions f1, . . . , fp. The link function, g, is generally based on the outcome data-type, for example, g(x) = x or g(x) = log(x) for continuous or count response data, respectively. The estimands, f1, . . . , fp, give the conditional relationships between each feature xij and the outcome yi for all i and j. For identifiability, we assume Pn i=1 fj(xij) = 0 for all j = 1, . . . , p. This model is known as a generalized additive model (GAM) (Hastie and Tibshirani, 1990). 2022 Asad Haris, Noah Simon and Ali Shojaie. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-108.html. Haris, Simon and Shojaie It extends the generalized linear model (GLM) where each fj is linear, and is a popular choice for modeling different types of response variables as a function of covariates. GAMs are popular because they extend GLMs to model non-linear conditional relationships while retaining some interpretability (we can examine the effect of each covariate xij individually on yi while holding all other variables fixed); they also do not suffer from the curse of dimensionality. While there are a number of proposals for estimating GAMs, a popular approach is to encode the estimation in the following convex optimization problem: bβ, bf1, . . . , bfp argmin β R,f1,...,fp F n 1 n X i=1 ℓ yi, β + j=1 fj (xij) + λst j=1 Pst (fj) . (1) Here F is some suitable function class; ℓ(yi, θ) is the log-likelihood of yi under parameter θ; Pst is a structure-inducing penalty to control the wildness of the estimated functions, bfj; and λst > 0 is a penalty parameter which modulates the trade-offbetween goodness-of-fit and structure/smoothness of estimates. The class F is a general convex space, for example, F = L2[0, 1]. Functions ℓ(yi, θ) and Pst(fj) are convex in θ and fj, respectively. The objective function in (1) is convex and for small dimension, p, can be solved via a generalpurpose convex solver. However, many modern data sets are high-dimensional, often with more features than observations, that is, p > n. Fitting even GLMs is challenging in such settings as conventional methods are known to overfit the data. A common assumption in the high-dimensional setting is sparsity, which states that, only a small (but unknown) subset of features is informative for the outcome. In this case, it is desirable to apply feature selection: to build a model for which only a small subset of bfj 0. A number of estimators have been proposed for fitting GAMs with sparsity. These estimators are generally solutions to a convex optimization problem. Though they differ in details, we show that most of these optimization problems can be written as: bβ, bf1, . . . , bfp argmin β R,f1,...,fp F n 1 n X i=1 ℓ yi, β + j=1 fj (xij) + λst j=1 Pst (fj) + λsp j=1 fj n , (2) where fj 2 n = n 1 Pn i=1{fj(xij)}2 is a group lasso-type penalty (Yuan and Lin, 2006) for feature-wise sparsity, and λsp a sparsity-related tuning parameter (Ravikumar et al., 2009; Meier et al., 2009; Koltchinskii and Yuan, 2010; Raskutti et al., 2012; Yuan and Zhou, 2016; Lou et al., 2016; Petersen et al., 2016; Sadhanala and Tibshirani, 2019; Tan and Zhang, 2019). However, previous proposals consists of gaps around efficient computation (Koltchinskii and Yuan, 2010; Raskutti et al., 2012; Yuan and Zhou, 2016; Tan and Zhang, 2019) and/or optimal statistical convergence properties (Ravikumar et al., 2009; Lou et al., 2016; Petersen et al., 2016; Sadhanala and Tibshirani, 2019). General-purpose convex solvers have also been suggested (Koltchinskii and Yuan, 2010; Raskutti et al., 2012; Yuan and Zhou, 2016) as an alternative for solving problem (2), but they roughly scale as O(n3p3) and are hence inefficient. This manuscript aims to bridge these gaps. We present a general framework for sparse GAMs with two major contributions, a general algorithm for computing (2) and a theorem for establishing convergence rates. Briefly, our algorithm is based on accelerated proximal gradient descent. This reduces (2) to repeatedly solving a univariate penalized least squares problem. In many cases, this algorithm Generalized Sparse Additive Models has a per-iteration complexity of O(np) precisely that of state-of-the-art algorithms for the lasso (Friedman et al., 2010; Beck and Teboulle, 2009b). Our main theorem establishes fast convergence rates of the form max(s log p/n, sξn), where s is the number of signal variables and ξn is the minimax rate of the univariate regression problem, that is, problem (1) with p = 1. Nonparametric rates are established for a wide class of structural penalties Pst with ξn = n 2m/(2m+1), popular choices of Pst include m-th order Sobolev and H older norms, total variation norm of the m-th derivative and, norms of Reproducing Kernel Hilbert Spaces (RKHS). Parametric rates are also established with ξn = Tn/n via a truncation-penalty; the number of parameters, Tn, can be fixed or allowed to grow with sample size. The highlight of this paper is the generality of the proposed framework: not only does it encompass many existing estimators for high-dimensional GAMs, but also estimators for low-dimensional fully nonparametric models and, parametric models in low or high dimensions. Brief examples of our framework s generalizability include: establishing minimax convergence rates (under well-studied assumptions) where only sub-optimal rates existed (Lou et al., 2016; Ravikumar et al., 2009; Meier et al., 2009; van de Geer, 2010); recovering minimax rates for a general class of loss functions as opposed to only least squares (Raskutti et al., 2012; Yuan and Zhou, 2016; Tan and Zhang, 2019); establishing consistency (albeit at a slower rate) while relaxing strong assumptions on the design matrix and function class (Raskutti et al., 2012; Yuan and Zhou, 2016; Tan and Zhang, 2019). Extending GAMs (1), to GSAMs (2), appears to simply be a matter of adding a sparsity penalty. However, our manuscript proves a surprising result: sparsity in GAMs can only be achieved if Pst is a semi-norm penalty, as opposed to a squared semi-norm. Thus, the originally proposed GAMs (Hastie and Tibshirani, 1990) cannot be extended to highdimensions by simply adding a sparsity penalty. Finally, as a byproduct of our general theorem, we also determine that λst = λ2 sp in (2), results in optimal convergence rates, reducing the problem to a single tuning parameter. Empirical studies showed a single tuning parameter to yield comparable or better performance compared to finding two tuning parameters over a grid. The rest of the paper is organized as follows. In Section 2, we detail our framework and discuss various choices of structural penalties, Pst, illustrating that our framework encompasses many existing proposals. In Section 3 we present an algorithm for solving the optimization problem (2) for a broad class of Pst penalties, and establish their theoretical convergence rates in Section 4. We explore the empirical performance of various choices of Pst in simulation in Section 5, and in an application to the Boston housing data set and gene expression data sets in Section 6. Concluding remarks are in Section 7. 2. General Framework for Additive Models In this section, we present our general framework for estimating sparse GAMs, discuss its salient features, and review some existing methods as special cases. Before presenting our framework, we introduce some notation. For any function f and response/covariate pair, (y, x), let ℓ(f) ℓ(y, f(x)) denote a loss function; given data (y1, x1), . . . , (yn, xn), let Pnℓ(f) n 1 Pn i=1 ℓ(yi, f(xi)) denote an empirical average; and f 2 n n 1 Pn i=1 f(xi)2 denote the empirical norm. With some abuse of notation, we will use the shorthand fj to denote the function fj πj where πj(x) = xj for x Rp. Haris, Simon and Shojaie Our general framework for obtaining a Generalized Sparse Additive Model (GSAM) encompasses estimators that can be obtained by solving the following problem: bβ, bf1, . . . , bfp argmin β R,f1,...,fp F Pnℓ | {z } Goodness-of-fit j=1 Pst (fj) | {z } structure-inducing j=1 fj n | {z } sparsity-inducing This optimization problem balances three terms. The first is a loss function based on goodness-of-fit to the observed data; the least squares loss, ℓ(f) = (y f(x))2, is commonly used for continuous response. Our general framework requires only convexity and differentiability of ℓ(y, θ), with respect to θ. Later we consider loss functions given by the negative log-likelihood of exponential family distributions. The second piece is a penalty to induce smoothness/structure of the function estimates. Our framework requires Pst to be a semi-norm on F. This choice is motivated by both statistical theory and computational efficiency; we discuss this along with possible choices of Pst in the following sub-sections. The final piece is a sparsity penalty n, which encourages models with bfj 0 for many j. For some choices of Pst the smoothness and sparsity pattern of fj are intrinsically linked (for example, Ravikumar et al., 2009; Lou et al., 2016); for other structure-inducing penalties the formulation (3) appears to decouple structure and sparsity. However, this manuscript highlights the surprising role of Pst in obtaining an appropriate sparsity pattern. Briefly, if Pst is a squared norm then either all bfj 0 or all bfj 0. We detail this result and its extension to semi-norms in Section 2.2. Throughout this manuscript, we require the function class F to be a convex cone, for example, L2(R). Later for some specific results, we will additionally require F to be a linear space. As noted before, the tuning parameters for structure (λ2) and sparsity (λ) are coupled in our framework. The theoretical consequence of this is that, for properly chosen λ, we get rate-optimal estimates, up to a constant (details in Section 4). The practical consequence is that we have a single tuning parameter. While fine tuning an estimator with two tuning parameters can lead to improved prediction performance, a single tuning parameter is adequate for most choices of Pst as seen in our empirical experiments of Section 5. Furthermore, our framework relaxes the usual distributional requirements of i.i.d. response from an exponential family; we require only yi independent and E{yi E(yi)} to be sub-Gaussian (or sub-Exponential). This demonstrates the generality of our framework and highlights our main innovation: the efficient algorithm of Section 3 and theoretical results of Section 4 apply to a very broad class of estimators, fill in the gaps of existing work and, can easily be applied for the development of future estimators. 2.1 Structure Inducing Penalties We now present some possible choices of the structural penalty Pst followed by a discussion of the conditions on Pst that lead to desirable estimation and computation. The main requirement is that Pst is a semi-norm: a functional that obeys all the rules of a norm except one for nonzero f we may have Pst(f) = 0. Some potential choices for smoothing semi-norms are: 1. k-th order Sobolev Pst Psobolev(f(k)) = q R x f(k)(x) 2 dx; Generalized Sparse Additive Models 2. k-th order total variation Pst TV (f(k)); 3. k-th order H older Pst Pholder(f(k)) = supx f(k)(x) ; 4. k-th order monotonicity Pst Pmon(f(k)) I f; {f : f(k+1) 0} ; 5. M-th dimensional linear subspace Pst P M lin(f) = I (f; span {g1, . . . , g M}); here TV ( ) is the total variation norm, TV (f) = sup{Po i=1 |f(zi+1) f(zi)| : z1 < . . . < zo is a partition of [0, 1]}, and I is a convex indicator function defined as I(f; A) = 0 if f A and I(f; A) = if f A. As implied by the name, Pst imposes smoothness or structure on individual components bfj. For instance, Psobolev(f ) is a common measure of smoothness; small λ values leads to wiggly fitted functions bfj; on the other hand, sufficiently large λ values would lead to each component being a linear function. The convex indicator function, I( ), can impose specific structural properties on bfj; for example, Pmon(f) fits a model with each bfj a non-decreasing function. The semi-norm requirement for Pst is important because: (a) it implies convexity leading to a convex objective function, (b) the first order absolute homogeneity, Pst(αf) = |α|Pst(f), is needed for the algorithm of Section 3 and, (c) the triangle inequality is used throughout the proof of our theoretical results of Section 4. For our context, we consider convex indicators of cones as a semi-norm because, the first order homogeneity condition can be relaxed. For our algorithm, we only require Pst(αf) = αPst(f) for α > 0; for our theoretical results we treat convex indicators of cones as a special case and discuss them at the end of Section 4.2. For non-sparse GAMs of the form (1), the existing literature does not necessarily use a semi-norm penalty; a common choice of smoothing penalty is Pst(f) = P 2 sobolev(f ). In the following subsection, we discuss the issues with using squared semi-norm penalties in high dimensions, particularly their impact on the sparsity of estimated component functions. 2.2 Using a Squared Smoothness Penalty Given a semi-norm Psemi, using Pst = P 2 semi in (3) may give poor theoretical performance (as noted in Meier et al., 2009, for Psemi = Psobolev) and, can also be computationally expensive (as disscussed in Section 3). In this subsection, we show a surprising result: using a squared semi-norm penalty results in fitting models that are either not sparse (all bfj = 0) or not flexible (all fj belong to some restrictive parametric class). In other words, the original GAMs (Hastie and Tibshirani, 1990) cannot be extended to high dimensions/sparsity by simply adding a sparsity penalty. This highlights a key contribution of this manuscript: not only do we present a framework for fitting GSAM but also, prove that na ıve GAM extensions are not feasible. In greater detail, using Pst = P 2 norm where Pnorm is a norm, leads to an active set, S = {j : bfj 0}, for which either |S| = 0 or |S| = p. If Pst = P 2 semi, for a semi-norm Psemi, we can get 0 < |S| < p; however, now all bfj F0, where F0 is a finite-dimensional function class. In contrast, using Pst = Psemi can give active sets such that 0 < |S| < p and each bfj can be modeled nonparametrically. Before presenting our main results, we note that throughout this section we deal with finite valued semi-norms, this excludes convex indicator penalties. It should be noted that many convex indicator penalties impose a parametric structure, thus excluding them from Haris, Simon and Shojaie the discussion of this section. Other convex indicator penalties are challenging to deal with in this context and other settings (see for example, Remark 18). We now present our result in Lemma 1 for a squared norm penalty followed by the extension to squared semi-norms in Corollary 2 (proofs in Appendix D). Lemma 1 Let F be a nonparametric function class in the following sense: for any covariateresponse pair (y, x) Rn 2 there exists some f F such that f(xi) = yi for all i. Consider the optimization problem bf1, . . . , bfp argmin f1,...,fp F j=1 fj (xij) λst P 2 st (fj) + λsp fj n , (4) for a norm Pst on F. Then, for any λsp, either bfj 0 for all j or bfj 0 for all j. Corollary 2 For a semi-norm Pst in (4), we define its null set as F0 F such that Pst(f0) = 0 for all f0 F0 (note that F0 contains the zero function). Consider an arbitrary, non-empty, index subset I {1, . . . , p}. If bfj = 0 for all j I, then for all j Ic, bfj F0. The above results imply that using a squared semi-norm means sacrificing either sparsity or flexibility in our modeling approach. In most cases the set F0 is parametric, e.g, a commonly used penalty Pst = Psobolev(f ), leads to F0 as the set of linear functions. Thus we can either fit sparse, parametric models or non-sparse, nonparametric models but not both. In other words, for parametric regression we can simply add a sparsity-inducing penalty; for example, the elastic net (Zou and Hastie, 2005), which adds a sparsity penalty to ridge regression, leading to sparse linear models. In contrast, simply adding a sparsity penalty to traditional GAMs (Hastie and Tibshirani, 1990) is not sufficient because, fitted models will not be sparse GAMs. 2.3 Relationship of Existing Methods to GSAM We now discuss some of the existing methods for sparse additive models in greater detail, and demonstrate that many existing proposals are special cases of our GSAM framework. One of the first proposals for sparse additive models, Sp AM (Ravikumar et al., 2009), uses a basis expansion and solves argmin β1,...,βj RM m=1 βjmψjm 2 m=1 βjmψjm n, where ψjm = [ψm(x1j), . . . , ψm(xnj)] Rn for basis functions ψ1, . . . , ψM. This is a GSAM with Pst = I (f; span {ψ1, . . . , ψM}). The Sp AM proposal is extended to partially linear models in SPLAM (Lou et al., 2016). There, a similar basis expansion is used, though with the particular choice ψ1(x) = x. The SPLAM estimator solves argmin β1,...,βj RM m=1 βjmψjm 2 m=1 βjmψjm n + λ2 m=2 βjmψm n, Generalized Sparse Additive Models and is also a GSAM with Pst = I (f; span {ψ1, . . . , ψM}) + Projspan(ψ2,...,ψM) (f) n, where Proj A is the projection operator onto the set A. The recently proposed extensions of trend filtering to additive models are other examples (Petersen et al., 2016; Sadhanala and Tibshirani, 2019); these methods can be written in our GSAM framework with Pst(f) = TV (f). Meier et al. (2009) give two proposals: the first solves the optimization problem argmin f1,...,fp F fj 2 n + λst P 2 st (fj), and is not a GSAM; they note that this proposal gives a suboptimal rate of convergence. The second is a GSAM of the form (3) with Pst(f) = Psobolev(f ). At the time, Meier et al. (2009) focused on the first proposal as no computationally efficient method for solving the second one was known to them. In a follow-up paper, van de Geer (2010) studied the theoretical properties of a GSAM with an alternative, diagonalized smoothness structural penalty. The diagnolized smoothness penalty for a function with basis expansion fβ(x) = Pn j=1 ψj(x)βj, is defined as Pst(fβ) = n X j=1 j2mβ2 j 1/2 , for a smoothness parameter m. Koltchinskii and Yuan (2010), Raskutti et al. (2012) and Yuan and Zhou (2016) discuss a similar framework to GSAMs; however, they only consider structural penalties Pst, which are norms of Reproducing Kernel Hilbert Spaces (RKHS). Furthermore, they do not discuss efficient algorithms for solving the convex optimization problem. Using properties of RKHS, they note that their estimator is the minimum of a d = np dimensional second order cone program (SOCP). The computation for general-purpose SOCP solvers scales roughly as d3. Thus for even moderate p and n, these problems quickly become intractable. Recently, Tan and Zhang (2019) studied GSAMs beyond the RKHS framework similar to this manuscript, however, there are important differences. Firstly, Tan and Zhang (2019) consider only the least squares loss. Secondly, the authors do not present an algorithm for estimation; instead they prove that under a least squares loss and special smoothness penalties the solution to the optimization problem is finite dimensional. Thirdly, their results (like the minimax rates proved by Yuan and Zhou, 2016) include the more general notion of weak sparsity (van de Geer, 2016); however, extending this notion beyond the least squares loss is left for future research. All of the above mentioned proposals either fail to provide an efficient computational algorithm or have sub-optimal convergence rates. There are also a number of other proposals that do not quite fall in the GSAM framework (Chouldechova and Hastie, 2015; Fan et al., 2012; Yin et al., 2012). Haris, Simon and Shojaie 3. General-Purpose Algorithm Here we give a general algorithm for fitting GSAMs based on proximal gradient descent (Parikh and Boyd, 2014). We begin with some notation. We denote by ℓ(y, θ) and ℓ(y, θ) the first and second derivatives of ℓwith respect to θ. For functions f, g : Rp R, let f, ℓ(g) n n 1 Pn i=1 f(xi){ ℓ(yi, g(xi))}, Pn ℓ(g) n 1 Pn i=1 ℓ(yi, g(xi)) and, f + ℓ(g) 2 n n 1 Pn i=1{f(xi) + ℓ(yi, g(xi))}2. We begin with a second order Taylor expansion of the loss at some arbitrary point β0 + Pp j=1 f0 j . For this, we first apply Taylor s theorem to ℓ(yi, β + θi1 + . . . + θip) as a (p + 1) variate function of (β, θi1, . . . , θip). Note that for | ℓ(y, θ)| L, the Hessian matrix, Hp+1, of ℓ(yi, β + θi1 + . . . + θip) obeys the inequality a T Hp+1a (p + 1)L a 2 2 for all a Rp+1 (Zhan, 2005). This gives us the following bound: j=1 fj Pnℓ β0 + (β β0)Pn ℓ β0 + D fj f0 j , ℓ β0 + 2 (β β0)2 + fj f0 j 2 n , which leads to the following majorizing inequality j=1 fj (p + 1)L h β n β0 + 1 (p + 1)LPn ℓ β0 + j=1 f0 j oi2 fj n f0 j + 1 (p + 1)L ℓ β0 + j=1 f0 j o 2 where W is not a function of β or fj for any j. Instead of minimizing the original problem (3), we minimize the majorizing surrogate h β n β0 + t Pn ℓ β0 + j=1 f0 j oi2 + 1 fj n f0 j + t ℓ β0 + j=1 f0 j o 2 j=1 Pst (fj) + tλ j=1 fj n , (6) where t = {(p + 1)L} 1. Minimizing (6) and re-centering our Taylor series at the current iteration, is precisely the proximal gradient recipe. Updating the intercept β, is simply bβ β0 + t Pn ℓ β0 + Pp j=1 f0 j . Components f1, . . . , fp, can be updated in parallel by solving the univariate problems: bfj argmin f F n f0 j + t ℓ β0 + j=1 f0 j o f 2 n + tλ2Pst (f) + tλ f n . (7) Generalized Sparse Additive Models At first, this problem still appears difficult due to the combination of structure and sparsity penalties. However, the following Lemma shows that things greatly simplify. Lemma 3 Suppose Pst is a semi-norm, and r is an n-vector. Consider the optimization problems 1 2 r f 2 n + λ1Pst (f) + λ2 f n, (8) 1 2 r f 2 n + λ1Pst (f) . (9) If ef is a solution to (9); then bf is a solution to (8) where bf is defined as bf = 1 λ2/ ef n with (z)+ = max(z, 0). Additionally, if λ2 r n, then bf 0. The proof is given in Appendix E. Using Lemma 3, we can get the solution to (7) by solving a problem in the form of (9), a classical univariate smoothing problem, and then applying (10), the simple soft-scaling operator. Putting this together, leads to Algorithm 1 below for solving (3). The general recipe used to derive Algorithm 1, is the well known proximal gradient descent algorithm. Thus, well established convergence results in the literature can be used (see for example, Beck and Teboulle, 2009b), under mild conditions: we require a convex ℓwith Lipschitz first derivative. These conditions hold for many loss functions particularly the negative log-likelihood of exponential family distributions. Convergence of the infinite-dimensional optimization over F follows from the finite-dimensional analog of problem (3); we prove this in Appendix E. Algorithm 1 is simple and can be quite fast: the time complexity is largely determined by the difficulty of solving the univariate smoothing problem of step 5. In many cases this takes O(n) operations, allowing an iteration of proximal gradient descent to run in O(np) operations. Complexity order O(np) is the per-iteration time complexity of state-of-the-art algorithms for the lasso (Friedman et al., 2010; Beck and Teboulle, 2009a). Any step-size t can be used in Algorithm 1 so long as inequality (5) holds for f0 j fk 1 j and fj fk j when (p + 1)L is replaced by t 1. Note that if t {L(p + 1)} 1 this will always hold. While a fixed step size t ensures theoretical convergence, in practice we can achieve a substantial speedup by adaptively selecting t for each iteration. We could use tk = {L(pk active + 1)} 1, where pk active is the number of j for which either of fk 1 j or fk j is non-zero. Since we are interested in sparse models, generally pk active p, leading to a substantial efficiency gain. If we do not have a suitable bound for L, we could use a datadependent scheme such as the backtracking line search (Beck and Teboulle, 2009b). The algorithm can also take advantage of Nesterov-style acceleration (Nesterov, 2007), which improves the worst-case convergence rate after k steps from O k 1 to O k 2 . An important special case is the least squares loss ℓ(y, θ) = (y θ)2. In this case, we can use a block coordinate descent algorithm which can be more efficient than Algorithm 1, and does not require a step-size calculation. We present the full details of the algorithm in Appendix A. Haris, Simon and Shojaie Algorithm 1 General Proximal Gradient Algorithm for (3) 1: Initialize f0 1 , . . . f0 p 0, β0 0, k 1; choose a step-size t 2: while k max iter and not converged do 3: For each i = 1, . . . , n, set j=1 fk 1 j (xij) , ri ℓ(yi, θi) . 4: Update βk βk 1 t Pn i=1 ri. 5: for j = 1, . . . , p do 6: Set finter j argmin f F fk 1 j tr f 2 n + tλ2Pst (f) . (11) 7: Update fk j 1 tλ/ finter j n +finter j . 9: end while 10: return βk, fk 1 , . . . , fk p Algorithm 1 is developed for a given λ value. Alternatively, we recommend using a decreasing sequence of λ values, linear on the log scale starting at λmax = y n. Another computational consideration is the method of tuning parameter selection: our numerical experiments suggest K-fold cross validation as a suitable choice; however, we note that other tuning parameter selection techniques such as generalized cross validation, AIC and BIC can be used. Finding a theoretically optimal tuning parameter method, or proving the estimator selected by cross validation obtains the same fast convergence rate, is a challenging problem which we defer to future work. As noted above, the main computational hurdle in Algorithm 1 is solving the univariate problem (9). In the following subsection, we discuss this step in greater detail for various smoothness penalties. 3.1 Solving the Univariate Subproblem For many semi-norm smoothers there are already efficient solvers for solving (9): with the k-th order total variation penalty, (9) can be solved exactly in 2n operations for k = 0 (Johnson, 2013), or iteratively in roughly O((k + 1)n) operations for k 1 (Ramdas and Tibshirani, 2015); with the convex indicator of an M-dimensional linear subspace, (9) can be solved in O(M2n) operations using linear regression; using a monotonicity indicator, (9) can be solved with the pool adjacent violators algorithm in O(n) operations (Ayer et al., 1955). Generalized Sparse Additive Models For many other choices of Pst, we do not have efficient algorithms for solving (9); however, we might have fast algorithms for the slightly different optimization problem: efeλ argmin f F 1 2 r f 2 n + eλP τ st (f) , (12) for τ > 1. For example, the k-th order Sobolev penalty (Wahba, 1990) can be solved exactly in O(kn) operations for τ = 2. In the following Lemma, we show that the solution to (12) can be leveraged to solve the harder problem (9). Lemma 4 Given an n-vector r, a convex linear space F over the field R, and real τ > 1, consider the optimization problems: bfλ argmin f F 1 2 r f 2 n + λPst (f) ; efλ argmin f F 1 2 r f 2 n + λP τ st (f) ; fnull argmin f F 1 2 r f 2 n + I (f F : Pst(f) = 0) ; finterp argmin f F P τ st (f) + I (ri = f(xi) for all i) , where Pst( ) is a semi-norm on F. Assume that the directional derivative h P τ st(f) = lim ε 0 P τ st(f + εh) P τ st(f) ε , exists for all h F. If Pst( bfλ) = 0 and τeλP τ 1 st ( efeλ) = λ, then bfλ = efeλ. To determine if Pst( bf) = 0, let F = F1 F2, where is such that, for all f F we have f = f0 + f where f0, f n = 0 and Pst(f) = Pst(f ). Furthermore, let P st be the dual norm over F2, given by P st(f ) = sup n | f , f n| : Pst(f ) 1, f F2 o . (13) Then finterp fnull F2 and bfλ = fnull if λ P st(finterp fnull). The proof is given in Appendix E. This lemma allows us to first check if we should shrink entirely to a null fit with Pst( bf) = 0 (usually a finite dimensional function), based on the dual semi-norm of the interpolating function finterp. If we do not shrink to Pst( bf) = 0, then there is an equivalence between bf and ef; and the problem is reduced to finding eλ with τeλP τ 1 st ( efeλ) = λ for the originally specified λ. This can be done in a number of ways; most simply by a combination of grid search and then local bisection noting that a) we need not try any eλ-values above λmax finterp n (by Lemma 3), and b) eλPst( efeλ) is a smooth function of eλ. In fact, the grid search will often be unnecessary as we will generally have a good guess from the previous iterate of the proximal gradient algorithm, and can leverage the fact that Pst( efeλ) and Pst( bfλ) are both smooth functions of r. To assess the computational Haris, Simon and Shojaie impact of an added grid search, we looked at the run-time for the proximal problem with Pst = Psobolev (which requires a grid search) and with Pst {TV (f(0)), TV (f(1)), TV (f(2))} (which does not use Lemma 4). For 100 replications of the proximal problem on a quadcore Intel Core , i7-10510U CPU @1.80GHz, the median run-time with n = 500 for Pst = Psobolev was 693.20 µs. In contrast, the median run-time for Pst(f) = TV (f(k)) for k = 0, 1, 2 was 514.15, 2968.30, 4884.90 µs, respectively. These median run-times were calculated via a small simulation study; details of this experiment along with detailed timing results are presented in Appendix B. To complete the discussion, we give the explicit form of the dual norm (13) for the case where Pst(f) = D f q for a matrix D RM n, a vector f = [f(x1), . . . , f(xn)] Rn, and q 1. Such penalties are common in the literature, for example, when Pst is the Sobolev semi-norm, total variation norm, or any RKHS norm. For Pst(f) = D f q, the dual norm is given by P st(f) = D(D D) f eq, where (D D) is the Moore-Penrose pseudo inverse of D D and eq satisfies 1/q + 1/eq = 1. 4. Theoretical Results Here we prove rates of convergence for GSAMs, estimators that fall within our framework (3). We first present the so-called slow rates, which require few assumptions, followed by fast rates, which require compatibility and margin conditions (defined and discussed below). Our fast rates match the minimax rates under Gaussian data with a least squares loss (Raskutti et al., 2009) and, our slow rates can be seen as an additive generalization of the lasso slow rates (Dalalyan et al., 2017). For both slow and fast rates, we first present a deterministic result; this result simply states that if we are within a special set, T , then the convergence rates hold. We then show that under suitable conditions (stated and discussed below) on the loss function, smoothness penalty, and data, we lie in T with high probability. Throughout, we also allow for mean model misspecification with an additional approximation error term in the convergence rates; if the true mean model is additive, then this term disappears. To the best of our knowledge, the closest results to our work were established by Koltchinskii and Yuan (2010). However, they consider a more restrictive setting of Reproducing Kernel Hilbert Spaces (RKHS); where each additive component fj belongs to a RKHS Hj, and Pst is the norm on Hj. Our work gives these rates for all semi-norm penalties and function classes F, associated with certain non-restrictive entropy conditions. Before presenting the main results, we present some notation and definitions which will be used throughout the section. 4.1 Definitions and Notation We consider here properties of the solution to bβ, bf1 . . . , bfp arg min β R,{fj}p j=1 F Pnℓ β + fj n + λPst (fj) , (14) Generalized Sparse Additive Models where R R and F is some univariate function class. Note that in (14) we optimize β over R; this is because we need R to be a bounded for proving the slow rates, the stronger compatibility condition allows us to take R = R for proving fast rates. For a function f(x) = β + Pp j=1 fj(xj) we use the shorthand notation fj n + λPst (fj) , which defines a semi-norm on the function f. Furthermore, for any index set S {1, . . . , p} we define IS(f) as IS(f) = P j S { fj n + λPst(fj)} . We denote the target function by f0 where f0 arg min f F0 Pℓ(f) , for some function class F0 and, where Pℓ(f) = n 1 Pn i=1 E {ℓ(yi, f(xi))}. We say the target function belongs to some class F0 to signify that f0 does not need to belong to F. We require no assumptions on the class F0 for the slow-rates of Theorem 7; we can take F0 to be the class of all measurable functions. For the fast rates we will require the margin condition on a subset of F0. We define the excess risk for a function f as E(f) = P ℓ(f0) ℓ(f) , and we denote by νn( ) the empirical process term, which is defined as νn(f) = (Pn P) { ℓ(f)} = 1 i=1 {ℓ(yi, f(xi)) Eℓ(yi, f(xi))} . Define the δ-covering number, N(δ, F, Q), as the size of the smallest δ-cover of F with respect to the norm Q induced by measure Q. We denote the δ-entropy of F by H(δ, F, Q) log N(δ, F, Q). Given fixed covariates x1, . . . , xn Rp, we denote the empirical measure by Qn where Qn = n 1 Pn i=1 δxi, and for covariate j; we denote by Qj,n the empirical measure of (x1,j, . . . , xn,j). We define two different types of entropy bounds for a function class F. Definition 5 (Logarithmic Entropy) A univariate function class, F, is said to have a logarithmic entropy bound if, for all j = 1, . . . , p, and γ > 0, we have H(δ, {fj F : fj n + γPst(fj) 1} , Qj,n) A0Tn log (1/δ + 1) , (15) for some constant A0, and parameter Tn. Definition 6 (Polynomial Entropy with Smoothness) A univariate function class, F, is said to have a polynomial entropy bound with smoothness if, for all j = 1, . . . , p and γ > 0, we have H(δ, {fj F : fj n + γPst(fj) 1} , Qj,n) A0(δγ) α, (16) for some constant A0, parameter α (0, 2). Haris, Simon and Shojaie The concept of entropy is commonly used in the literature, particularly in nonparametric statistics and empirical processes, to quantify the size of function classes. The logarithmic entropy bound (15) holds for most finite dimensional classes of dimension Tn. For instance, it holds for F = L2(R) with Pst(fj) = I(fj; span{x, x2, . . . , x Tn}). The bound (16) commonly holds for broader function classes, for example, for F = L2([0, 1]) with Pst(fj) = Psobolev(f(k)) and α = 1/k. To simplify our presentation of bounds on the convergence rate, we use A B to denote A c B for some constant c > 0. We write A B if A B and B A. 4.2 Main Results We now present our main results: upper bounds for the excess risk of GSAMs, specifically, bounds for E(bβ + Pp j=1 bfj). The following theorem shows that E(bβ + Pp j=1 bfj) λ over a special set T . In the corollary that follows, we show that for appropriate λ values, and certain type of loss functions, we are within T with high probability. Theorem 7 (Slow Rates for GSAM) Let bf = bβ + Pp j=1 bfj be as defined in (14), and let f = β + Pp j=1 f j be an arbitrary additive function with Pn i=1 f j (xij) = 0 and β R. Assume that ℓ( ) and Pst are convex and that supβ R |β| < R. Define M such that ρM = E(f ) + 2λI(f ) + 2Rρ, where λ 4ρ. Furthermore, define the set T as follows T = {ZM ρ(M + 2R)} , where ZM = sup I(f f ) M |νn(f) νn(f )| . Then, on the set T , E( bf) + λI( bf f ) ρM + ρ(2R) + 2λI(f ) + E(f ). Corollary 8 Let bf, f and R be as defined in Theorem 7. Assume that for any function f the loss ℓ( ) is such that ℓ(f) = ℓ(yi, f(xi)) = ayif(xi) + b(f(xi)), for some a R\{0} and function b : R R. Further assume that for i = 1, . . . , n, yi E(yi) are independent, uniformly sub-Gaussian: max i=1,...,n K2 E exp {yi E(yi)}2/K2 1 σ2 0. Finally, suppose E(f ) = O(λ) and I(f ) = O(1). Then, with probability at-least 1 2 exp C1nρ2 C exp C2nρ2 , we have the following cases: 1. If F has a logarithmic entropy bound, then for λ ρ κ max q E( bf) + λI( bf f ) max with constants κ = κ(a, K, σ0, A0), C1 = C1(K, σ0), C = C(K, σ0) and C2 = C2(C, κ). Generalized Sparse Additive Models 2. If F has a polynomial entropy with smoothness, then for λ ρ κ max n 1 2+α , q E( bf) + λI( bf f ) max with constants κ = κ(a, K, σ0, A0, α), C1 = C1(K, σ0), C = C(K, σ0) and C2 = C2(C, κ). In the above corollary, the assumption I(f ) = O(1) is often reasonable in high-dimensions; if omitted, with the same high probability, the above rates will be multiplied by the term Pp j=1 f j n. Now for high-dimensions, we commonly assume sparsity, f = P j S f j , where |S| is small. The dependence of the rate on sparsity can be directly expressed by the inequality P j S f j n |S|maxj S f j n. Another possible assumption for high dimensions is weak sparsity, which states that, the effect size of most component functions is very small. In this case, the preceding inequality would not be tight but we essentially have Pp j=1 f j n = O(1). We now proceed to show the fast rates of convergence. To establish these rates, we require the compatibility and margin conditions. The compatibility condition, is based on the idea that I(f) and f are somehow compatible for some norm . This condition is common in the high-dimensional literature for proving fast rates (see van de Geer and B uhlmann, 2009, for a discussion of compatibility and related conditions for the lasso). The margin condition, is based the idea that if E(f) is small then f f0 should also be small. This is another common condition in the literature for handling general convex loss functions (see for example, Negahban et al., 2011; van de Geer, 2008). Definition 9 (Compatibility Condition) The compatibility condition is said to hold for an index set S {1, 2, . . . , p}, with compatibility constant φ(S) > 0, if for all γ > 0 and all functions f of the form f(x) = β+Pp j=1 fj(xj) that satisfy P j Sc fj n+γ Pp j=1 Pst(fj) |β| + 3 P j S fj n, it holds that j S fj n f p for some norm . Definition 10 (Margin Condition) The margin condition holds if there is strictly convex function G such that G(0) = 0 and for all f F0 local F0 we have E(f) G( f f0 ), for some norm on F0; here F0 local is a neighborhood of f0 (for example, F0 local = {f : f f0 η}). In typical cases, the margin condition holds with G(u) = cu2, for a positive constant c. We refer to this special case as the quadratic margin condition. Haris, Simon and Shojaie The norm used in the definitions above is most often the empirical norm, n. Our proof is the same for any norm , as long as the same norm is used for both conditions. Note that the margin condition is strictly a condition on the loss function ℓ( ), implying that it is not dependent on the class, F, or dimension, p. While the margin condition is established for well-known choices of ℓ( ) (see for example, van de Geer, 2016), in Appendix H, we present a framework for verifying the quadratic margin condition for loss functions of the form: ℓ(f) = ayif(xi) + b(f(xi)). While the compatibility condition is difficult to prove, the theoretical compatibility condition (defined below) can be verified under suitable conditions. In Appendix H, we prove that (under mild conditions), the theoretical compatibility condition implies the original compatibility condition with high probability. Definition 11 (Theoretical Compatibility Condition) The theoretical compatibility condition is said to hold for an index set S {1, 2, . . . , p}, for a compatibility constant eφ(S), if for some η (0, 1/5), all λ > 0, and all functions of the form f(x) = β + Pp j=1 fj(xj) that satisfy P j Sc fj + 1 5η 1 η Pp j=1 λPst(fj) |β| + 3(1+η) 1 η P j S fj , it holds that |S| f eφ(S) , where f 2 = R [f(x)]2d Q(x) is the population level L2 norm. The theoretical compatibility condition holds trivially when we have independent covariates. In general, establishing verifying it depends on the smoothness penalty Pst; for example, for the Sobolev norm, Meier et al. (2009) established sufficient conditions for the compatibility condition to hold. An important special case, is when Pst(f) projects component functions to a finite dimensional space (for example, Ravikumar et al., 2009; Lou et al., 2016). In this case, our condition reduces to the well-known, group lasso compatibility condition, for which sufficient conditions are well established in the literature (for example, B uhlmann and van de Geer, 2011). We now present our second theorem which establishes the bound E(bβ +Pp j=1 bfj) sλ2, where λ is the slow rate of Theorem 7, and s is the number of non-zero components of f = β + Pp j=1 f j , a sparse additive approximation of f0. As in Theorem 7, the bound holds over a set T ; Corollary 13 following the theorem shows that we lie in T with high probability. Theorem 12 (Fast Rates for GSAM) Suppose ℓ( ) and Pst are convex functions and with bf and f as defined in Theorem 7. Assume that f is sparse with |S | = s where S = {j : f j = 0}, and that the compatibility condition holds for S . Further assume the quadratic margin condition holds with constant c, and that for a function f(x) = β + Pp j=1 fj(xj), f F0 local if and only if |β β | + I(f f ) M . The constant M is defined as ρM = E(f ) + 16sλ2 cφ2(S ) + 2λ2 X j S Pst(f j ), Generalized Sparse Additive Models and ρ is such that λ 8ρ. Furthermore, define the set T as T = {ZM ρM } , where ZM = sup |β β |+I(f f ) M |νn(f) νn(f )| . Then, on the set T , E( bf) + λI( bf f ) 4ρM = 4E(f ) + 64sλ2 cφ2(S ) + 8λ2 X j S Pst(f j ). Corollary 13 Let bf and f be as defined in Theorem 7 and assume the conditions of Theorem 12. Furthermore, for any function f assume the loss ℓ( ) is such that ℓ(f) = ℓ(yi, f(xi)) = ayif(xi) + b(f(xi)), for some a R\{0} and function b : R R. Further assume that for i = 1, . . . , n, yi Eyi are independent, uniformly sub-Gaussian: max i=1,...,n K2 E exp (yi Eyi)2/K2 1 σ2 0. Finally suppose E(f ) = O(sλ2/φ2(S )) and s 1 P j S Pst(f j ) = O(1). Then, with probability at-least 1 2 exp C1nρ2 C exp C2nρ2 , we have the following cases: 1. If F has a logarithmic entropy bound, for λ ρ κ max q E( bf) + λI( bf f ) max s Tn with constants κ = κ(a, K, σ0, A0), C1 = C1(K, σ0), C = C(K, σ0) and C2 = C2(C, κ). 2. If F has a polynomial entropy bound with smoothness, then for λ ρ κ max n 1 2+α , q E( bf) + λI( bf f ) max sn 2 2+α , slog p with constants κ = κ(a, K, σ0, A0, α), C1 = C1(K, σ0), C = C(K, σ0) and C2 = C2(C, κ). We will discuss the significance of our theoretical results in the next subsection by specializing them to some well-studied special cases. Before discussing these specializations, we conclude this section by further generalizing Theorem 12. We will now assume a more general margin condition, for which we need to define the additional notion of a convex conjugate. Haris, Simon and Shojaie Definition 14 (Convex Conjugate) Let G be a strictly convex function on [0, ) with G(0) = 0. The convex conjugate of G, denoted by H, is defined as H(v) = sup u {uv G(u)} , v 0. For the special case of G(u) = cu2, one has H(v) = v2/(4c). Theorem 15 (Fast Rates) Assume the conditions of Theorem 12 and define M as ρM = E(f ) + H 8λ s j S Pst(f j ), where H( ) is the convex conjugate of G. Then, on the set T , E( bf) + λI( bf f ) 4ρM . Remark 16 (Additional tuning parameters) Note that our convergence rates include the term P j S Pst(f j ), or constants which depend on it. For some choices of Pst this can lead to poor finite sample performance. In such cases, prediction performance can be improved by solving instead bβ, bf1, . . . , bfp argmin β R,f1,...,fp F Pnℓ j=1 Pst (fj) + ζλ j=1 fj n , (18) where ζ [0, 1] is an additional tuning parameter. In theory, using two tuning parameters should lead to improved prediction, however in practice, tuning parameter selection over a discrete grid can become computationally cumbersome. A moderately-sized search grid might not yield a lower MSE and in fact, can lead to substantially higher MSE, particularly for large n. We illustrate this phenomenon via a small simulation study in Appendix C: to the simulation study of Section 5, specifically Scenarios 3 and 4, we additionally fit GSAMs by solving (18). Using a grid of ten ζ values, we observe improved prediction in some cases however, a ten-grid is too coarse to exhibit uniformly lower MSE. Remark 17 (Constants in convergence rates) Our convergence rates are presented up to constants. To illustrate this fact, we consider the problem bβ, bf1, . . . , bfp argmin β R,f1,...,fp F Pnℓ j=1 Pst (fj) + λ j=1 fj n , (19) for a constant Θ that does not depend on n or p. While (19) will have a different convergence rate than those presented in Theorems 7 15, the two rates will only differ by a constant that depends on Θ. Optimizing these constants is an interesting open problem that is beyond the scope of this manuscript. Remark 18 (Convex indicator penalties) The above results do not directly extend to some convex indicator penalties. For some convex indicator penalties, such as Pst(f) = I(f; {f : f 0}), we require a third type of entropy condition: Generalized Sparse Additive Models Definition 19 (Polynomial Entropy without Smoothness) The univariate function class, F, is said to have a polynomial entropy without smoothness bound if for all j = 1, . . . , p we have H(δ, {fj F : fj n + γPst(fj) 1} , Qj,n) A0δ α, for some constant A0, parameter α (0, 2) and all γ > 0. Our results do not extend to convex indicator penalties because our proof relies on the fact that fj f j F for fj, f j F; function classes with polynomial entropy without smoothness do not usually have this property. We defer the extension to convex indicator structural penalties to future work. Remark 20 (Sub-Exponential residuals) In Corollaries 8 and 13 we can replace the requirement of uniformly sub-Gaussian residuals by the weaker condition of uniformly sub Exponential residuals. To be precise, we would require max i=1,...,n K2 E exp {yi E(yi)}2/K2 1 |yi E(yi)|/K σ2 0. However, sub-Exponential residuals would firstly require bounds for the δ-entropy with bracketing. The δ-entropy with bracketing is a stronger notion than δ-entropy without bracketing (the δ-entropy with bracketing is always larger than the δ-entropy without bracketing). Secondly, we would also require uniform bounds for each univariate function, specifically, we need max j=1,...,p sup x |fj(x)/ fj n| R. Remark 21 (Comparison of rates to existing work) Here, we highlight the key differences between our theoretical result and existing work. Koltchinskii and Yuan (2010) and Raskutti et al. (2012) establish same rates of convergence as those in Corollaries 8 and 13. However, their work requires stronger assumptions. Both papers are restricted to the setting reproducing of kernel hilbert spaces (RKHS) and only allow an RKHS norm as the choice of smoothness penalty. They also assume known bounds on the additive functions or individual components. Additionally, Raskutti et al. (2012) assumes independence of covariates as opposed to a more general compatibility condition. The work of Yuan and Zhou (2016), extends the RKHS framework with results capturing the notion of weak sparsity; assuming bounds on the term Pp j=1 f j q n for 0 q < 1 they present rates (up to a constant) of the form n 2 2+α + log p 1 q/2 . (20) Similarly, Tan and Zhang (2019) present rates of the form n 1 2+α(1 q) + !2 q . (21) Both (20) and (21) match our established rates for the limiting case of q = 0. While Tan and Zhang (2019) relax some of the strong assumptions of Yuan and Zhou (2016), both Haris, Simon and Shojaie papers deal exclusively with the least squares loss function. It is not clear if results like (20) and (21) can be established for general loss functions. Tan and Zhang (2019) also present convergence rates in greater generality, namely, in terms of the integral of the entropy number. However the only special case they consider are function classes with polynomial entropy with smoothness. For this special case, their convergence rates match ours under a least squares loss; it is not clear if their results extend to GAMs nor is it clear what their rates are for other commonly used function classes. 4.3 Special Cases of GSAM In this subsection, we illustrate the main strength of our framework, namely its generalizability. We specialize our theoretical results to, various existing GSAMs, fully non-parametric regression, and also to (sparse-)GLMs. As per a reviewer s suggestion, in Table A.1 in Appendix A, we summarize existing GSAMs and their limitations which our framework overcomes. Firstly, the proposals of Ravikumar et al. (2009) and Lou et al. (2016), established convergence rates substantially slower than the minimax rates and only for the least squares loss. Our general framework, establishes the following convergence rates for both methods: E( bf) max (s M/n, s log p/n) + E(f ), where M is the order of the basis expansion used for each bfj. For an additive f0, E(f ) is decreasing in M; and we require a value of M which balances the two terms in the rate. For function classes with polynomial entropy with smoothness, we recover rates (17) with M n α 2+α . As noted in Section 4.2, the margin condition holds for a large class of loss functions; for both methods, the compatibility condition reduces to the well-studied, group lasso compatibility condition. Next we consider the proposal of Meier et al. (2009) (see also B uhlmann and van de Geer, 2011; van de Geer, 2010); their theoretical results were limited to the least squares loss and the resulting convergence rate takes the form E( bf) (s log p/n)2/(2+α) . This rate is suboptimal compared to our fast rate (17). Established rates for the diagnolized smoothness penalty of van de Geer (2010), were also sub-optimal and of the order s(log p)n 2/(2+α). Our work bridges the following gaps in the theoretical work of Meier et al. (2009) and van de Geer (2010): (a) we establish minimax rates under identical compatibility conditions, (b) we extend their result beyond least squares loss functions and, (c) we establish slow rates under virtually no assumptions. As another example, we consider our own previous work (Haris et al., 2018), a GSAM which uses wavelet basis functions. Once the univariate problem (p = 1) was solved for the wavelet bases, extending it to GSAM was trivially achieved using the results of this manuscript. The above examples demonstrate that not only do our theoretical results and proximal gradient descent algorithm improve existing results in the literature, but also, can be applied to fully develop any GSAM as long as we can solve the uni-variate problem of the form (12). Next we show how some seemingly unrelated problems can also be treated as special cases of our framework. Firstly, we recover the special case of univariate nonparametric regression, that is, with p = 1: the compatibility condition trivially holds leading to the usual rates E( bf) n 2/(2+α). Next, we recover the multivariate nonparametric regression problem: suppose we have a single (but multivariate) component function f1 : Rp R. For various choices of Pst, the bound (16) holds with α = p/m for some smoothness parameter Generalized Sparse Additive Models 2 1 0 1 2 x Scenario 1: All Step Functions 2 1 0 1 2 x Scenario 2: All Piecewise Linear Functions 2 1 0 1 2 x Scenario 3: All Smooth Functions 2 1 0 1 2 x Scenario 4: Mixture of Functions 2 1 0 1 2 x Scenario 5: Hills Type Functions Figure 1: Plot of the 4 signal functions for each of the five simulation settings. m. Again, the compatibility condition holds trivially, leading to the usual nonparametric rate n 2m/(2m+p). Finally, parametric regression models are also special cases of GSAM. Using a convex indicator for Pst, we can constrain each fj to be a linear function leading to GLMs. For low-dimensional GLMs, Corollary 13 gives the usual parametric rate, p/n. For high-dimensional GLMs, not only does our theorem recover the lasso rate, but our compatibility condition also matches that of lasso (B uhlmann and van de Geer, 2011). 5. Simulation Study In this section, to complement our theoretical results, we conduct a simulation study to study the finite sample performance of various GSAMs as a function of n. The GSAMs we study are existing techniques in the literature obtained by various choices of the smoothness penalty, Pst( ). Our aim is to study the convergence of various methods with increasing n. For a more detailed simulation study we refer the reader to the original papers for each method (cited below). We consider the following choices for Pst( ): 1. Sp AM (Ravikumar et al., 2009). Pst(f) = I(f; span{ψ1, . . . , ψM}) for M {3, 6, 10, 20, 30, 50, 80}. We use the SAM R-package (Zhao et al., 2014). 2. SSP (Meier et al., 2009). Pst(f) = q R x(f(2)(x))2 dx, the Sobolev smoothness penalty (SSP). Given the lack of efficient software for this method, we implemented it using the algorithm and results of Section 3. 3. TF (Sadhanala and Tibshirani, 2019). Pst(f) = R x |f(k+1)(x)| dx for k {0, 1, 2}, trend filtering for additive models. We implemented this method using the algorithm of Section 3 where the univariate sub-problem (11) was solved using the R package glmgen (Arnold et al., 2014). Haris, Simon and Shojaie 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error Figure 2: Plot of MSEs versus sample size for each of five scenarios for p = 6, averaged over 500 replications. The dashed lines correspond to Sp AM with small ( ), moderate ( ) and high ( ) number of basis functions. The solid lines correspond to trend filtering of order k = 0 ( ). SSP is represented by the dotted line ( We simulate data for each of five simulation scenarios as follows: Given a sample size n and a number of covariates p, we draw 50 different n p training design matrices X where each element is drawn from U( 2.5, 2.5). We replicate each of the 50 design matrices 10 times leading to a total of 500 design matrices. The response is generated as yi = f1(xi1) + f2(xi2) + f3(xi3) + f4(xi4) + εi where εi N(0, 1). The remaining covariates are noise variables. We also generate an independent test set for each replicate with sample size n/2. We vary the sample size, n {100, 200, . . . , 800} and consider both, a low-dimensional (p = 6) and high-dimensional (p = 100) settings. We consider five different choices of the signal functions as shown in Figure 1. We fit each method over a sequence of 50 λ values on the training set, and select the tuning parameter λ which minimizes the test error ( ytest by 2 n). For the estimated model bfλ , we report the mean square error (MSE; bfλ f0 2 n) as a function of n. In Figures 2 and 3, we plot the MSE as a function of n for the low and high-dimensional setting, respectively. For each simulation scenario, we plot the performance of Sp AM for three different choices of M: low, moderate and high number of basis functions, M. The exact value of M presented varies by scenario, for example, in Scenario 4, low, moderate and high values of M correspond to M = 3, 10 and 30, respectively. In both lowand highdimensional settings, we observe similar relative performances between the methods, with more variability in results for the high-dimensional setting. While there is no uniformly Generalized Sparse Additive Models 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error 200 400 600 800 Sample size Mean square error Figure 3: Plots of MSE versus sample size for each of five scenarios for p = 100, averaged over 500 replications. The line types and colors are the same as in Figure 2. superior method, for all, except Scenario 1, the Sobolev smoothness penalty and trend filtering of orders 1 and 2 had comparably good performances. Unsurprisingly, trend filtering of order 0 exhibits superior performance in Scenario 1, where each component is piecewise constant. In each scenario, the bias-variance trade-offof Sp AM depends on the choice of M: too small or large values of M lead to high prediction error compared to other methods. In Appendix A, we plot examples of fitted functions for the various methods. The dependence on M for Sp AM, is further illustrated in Figure A.1, where we plot functions estimated by Sp AM for high-dimensional Scenario 4 with n = 500. We observe large bias for M = 3 (especially for the piecewise constant and linear functions) and high variance for M = 30. In the same figure, we also plot functions estimated by the SSP; SSP estimates exhibit a similar bias to that of Sp AM with M = 10, but with a substantially smaller variance. Figure A.2 similarly plots fitted example functions for trend filtering. Trend filtering with k = 0 estimates the piecewise constant function well, but estimating the other fj s by piecewise constant functions incurs additional variance. Trend filtering with k = 1 and 2 estimates all other signal functions well. 6. Data Analysis 6.1 Boston Housing Data We use the methods of Section 5 to predict the value of owner-occupied homes in the suburbs of Boston using census data from 1970. The data consists of n = 506 measurements and 10 covariates, and has been studied in the additive models literature (Ravikumar et al., 2009; Haris, Simon and Shojaie Lasso TPR: 0.46 FPR: 0.04 TPR: 0.61 FPR: 0.02 TPR: 0.66 FPR: 0.04 SSP TPR: 0.71 FPR: 0.31 TF, k=0 TPR: 0.74 FPR: 0.08 TF, k=1 TPR: 0.73 FPR: 0.32 TF, k=2 TPR: 0.68 FPR: 0.24 Figure 4: Box-plot of test errors for 100 different train/test splits of the data for each method. The average TPR and FPR was calculated using the original 10 covariates as signal variables and remaining 20 as noise variables. Lin and Zhang, 2006). As done in the data analysis by Ravikumar et al. (2009), we add 10 noise covariates uniformly generated on the unit interval and 10 additional noise covariates obtained by randomly permuting the original covariates. We fit SSP, Sp AM with M = 2 and 3 basis functions, and TF with orders k = 0, 1, 2; we also fit the lasso (Tibshirani, 1996). Approximately 75% of the observations are used as training set, and the mean square prediction error on the test set is reported. The final model is selected using 5-fold cross validation using the 1 standard error rule . Results are presented for 100 splits of the data into training and test sets. The box-plots of test error in the test set are shown in Figure 4. Since we added noise variables for the purpose of this analysis, we also state the average true positive rate (TPR) and false positive rate (FPR) in Figure 4. The box-plots demonstrate superior performance of TF of order k = 0 over other methods in terms of lowest prediction error and highest TPR. The FPR of TF with k = 0 is also low (under 10%). In Figure A.3 of Appendix A, we plot fitted functions for one split of the data for lasso, Sp AM with M = 3, SSP and, TF with k = 0 for the 10 covariates of the original data set. A striking feature of TF fits is that many component functions are constant for extreme values of the covariates. 6.2 Gene Expression Data We now fit GSAMs for classification of gene expression data. We used the Curated Microarray Database (Cu Mi Da) (Feltes et al., 2019): a repository of gene-expression data sets curated from the Gene Expression Omnibus (GEO). Using gene expression measurements, we aimed to classify observations as either cancer cells or normal cells. We consider the following data sets/cancer types: 1. Lung: 54, 675 gene expression measurements from 114 lung tissue samples from nonsmoking women with non-small cell lung carcinoma; data set consists of 56 tumor, and 58 normal tissue samples. Available on Cu Mi Da with accession number GSE19804. Generalized Sparse Additive Models Cancer type Method Lung Prostate Breast Oral cavity n = 114; p = 54, 675 n = 124; p = 12, 620 n = 289; p = 35, 980 n = 103; p = 54, 675 Lasso 0.985 (3.46) 0.713 (14.07) 0.951 (3.93) 0.930 (10.43) Sp AM, M = 2 0.982 (3.49) 0.726 (13.56) 0.948 (3.89) 0.923 (14.14) Sp AM, M = 3 0.984 (3.29) 0.763 (12.94) 0.955 (3.50) 0.918 (13.06) Sp AM, M = 10 0.970 (5.48) 0.727 (13.46) 0.946 (4.08) 0.940 (8.17) SSP 0.987 (2.72) 0.765 (11.71) 0.934 (4.69) 0.950 (7.08) TF, k = 0 0.980 (4.07) 0.761 (13.04) 0.935 (4.66) 0.953 (7.32) TF, k = 1 0.988 (2.65) 0.771 (12.50) 0.936 (4.22) 0.947 (9.14) Table 1: Table results for the analysis of gene expression data. For 100 different splits of the data into a training, testing and validation set, we report mean AUC along with 103 mean SE, on the validation set. The method with the highest mean AUC is highlighted for each cancer type. 2. Prostate: 12,620 gene expression measurements from 124 prostate tissue samples; data set consists of 64 primary prostate tumor, and 60 normal tissue samples. Available on Cu Mi Da with accession number GSE6919 U95B. 3. Breast: 35,980 gene expression measurements from 289 breast tissue samples; data set consists of 143 breast adenocarcinoma, and 143 normal tissue samples. Available on Cu Mi Da with accession number GSE70947. 4. Oral cavity: 54,675 gene expression measurements from 103 mucosa cell samples; data set consists of 74 samples with oral cavity cancer, and 29 normal cell samples. Available on Cu Mi Da with accession number GSE42743. Our goal is to correctly classify samples as either normal or cancer samples. We split the data as follows: 60% as training, 20% as validation and 20% as test data. On the training data we fit the lasso, Sp AM with M {2, 3, 10}, SSP and, TF with k {0, 1}. TF with k = 2 was excluded due to numerical instability of the current implementation of the algorithm in glmgen; Sp AM with other values of M yielded similar performance and thus the results are omitted here. All methods were fit for a sequence of λ values, using (λsp, λst) = (λ, λ2) for GSAMs. The λ value with the smallest area under the curve (AUC) for the ROC curve on the validation set was selected, and the corresponding model was used to classify samples in the test set. The experiment was repeated for 50 splits of the data into training, validation and testing. In Table 1, we report the mean AUC on the test set and the estimated standard error based on 50 replications of the experiment. For the lung and breast cancer data sets, we observe similar performance between the lasso and other additive models; this suggests a low signal in the data to detect non-linearities. However, for prostate and oral cavity cancer, we observe a substantial gain (mean AUC beyond one SE) when using a GSAM instead of a linear model. In summary, this data analysis validates our intuition and theoretical results: using a GSAM will lead to comparable or better performance than using a linear model. Haris, Simon and Shojaie 7. Conclusion In this paper, we introduced a general framework for non-parametric high-dimensional sparse additive models. We show that many existing proposals, such as Sp AM (Ravikumar et al., 2009), SPLAM (Lou et al., 2016), Sobolev smoothness (Meier et al., 2009), and trend filtering additive models (Sadhanala and Tibshirani, 2019; Petersen et al., 2016), fall within our framework. We established a proximal gradient descent algorithm which has a lasso-like per-iteration complexity for certain choices of the structural penalty. The computational framework presented in this paper, effectively reduce the problem of fitting high-dimensional GSAMs to fitting a univariate regression model with the relevant smoothness penalty. While algorithms and theoretical results for specific GSAMs, as well as some theoretical results for certain types of general GSAMs, exists, to the best of our knowledge, the general algorithm for GSAMs is a key novel contribution in this paper. Our theoretical analyses in Section 4 showed both fast rates, which match minimax rates under Gaussian noise, as well as slow rates, which only require a few weak assumptions. The R package GSAM, available on https://github.com/asadharis/GSAM, implements the methods described in this paper. Acknowledgments The authors gratefully acknowledge the support for this project from the National Science Foundation (NSF grant DMS-1915855) and the National Institute of Health (NIH grant R01GM114029). The authors would like to thank the anonymous reviewers for their insightful and helpful comments. Appendix A. Additional Figures, Table and Algorithm Algorithm A.1 is the block coordinate descent algorithm that can be used to estimate GSAMs. Figure A.1 shows estimated functions for Sp AM with 3,10 and 30 basis functions, and for SSP. Figure A.2 shows estimated functions for trend filtering of orders k = 0, 1, 2. Figure A.3 shows estimated functions for the various methods for the analysis of Boston housing data. Table A.1 summary of existing GSAMs in the literature and their shortcomings (lack of efficient algorithms and/or limited theoretical results). Generalized Sparse Additive Models Algorithm A.1 Block Coordinate Descent for Least Squares Loss 1: Initialize f0 1 , . . . f0 p 0, β0 0, r y, k 1 2: while k max iter and not converged do i=1 ri, r r βk1. 4: for j = 1, . . . , p do 5: Set r j as r j,i = ri + fk 1 j (xij). 6: Update fk j 1 tλ/ finter j n +finter j , where finter j argmin f F n + tλ2Pst (f) . (A.1) 7: Update r to ri r j,i + fk j (xij). 9: end while 10: return βk, fk 1 , . . . , fk p Haris, Simon and Shojaie Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Sp AM, M = 3 Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Sp AM, M = 10 Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Sp AM, M = 30 Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Figure A.1: Examples of estimated signal functions by Sp AM (Ravikumar et al., 2009) and Sobolev Smoothness Penalty (Meier et al., 2009) for Scenario 4. Generalized Sparse Additive Models Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Function 1 Function 2 Function 3 Function 4 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 Figure A.2: Examples of estimated signal functions by Trend Filtering (Sadhanala and Tibshirani, 2019) for Scenario 4. Haris, Simon and Shojaie 0.0 0.2 0.4 0.6 0.8 1.0 3 2 1 0 1 2 3 Per capita crime rate Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 1.0 0.5 0.0 0.5 1.0 % non retail bussiness acres Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 Nitric oxide concentration Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 Rooms per dwelling Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 1.0 0.5 0.0 0.5 1.0 % of homes built before 1940 Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 4 2 0 2 4 6 Weighted distance to employment centers Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 0.5 0.0 0.5 1.0 1.5 Property tax rate Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 3 2 1 0 1 2 3 Pupil teacher ratio Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 % of blacks Median value of homes 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 15 % of lower status of population Median value of homes Figure A.3: Plots of fitted functions for the original 10 covariates for a single split of the data into training and test sets for lasso ( ) with M = 3 basis functions, SSP ( ) and, TF ( ) of order k = 0. Generalized Sparse Additive Models Method Function class Loss/link Smoothness Penalty F ℓ( ) Pst( ) Limited Theoretical Results Ravikumar et al. (2009) Sobolev space Least squares I (f; span {ψ1, . . . , ψM}) and logistic Lou et al. (2016) General class Convex loss I (f; span {ψ1, . . . , ψM}) + of functions Pp j=1 Projspan(ψ2,...,ψM) (f) n Petersen et al. (2016) General class Convex loss TV(f) of functions Sadhanala and Tibshirani (2019) Functions with Least squares TV(f(k)) finite k-th order total variation No Efficient Algorithm Meier et al. (2009) General class of Least squares q R x f(2)(x) 2 dx smooth functions and logistic Raskutti et al. (2012) RKHS Least squares Norm of RKHS space Yuan and Zhou (2016) RKHS Least squares Quasi-norm of RKHS space Tan and Zhang (2019) General class Least squares General smoothness of functions semi-norms Koltchinskii and Yuan (2010) RKHS Convex loss Norm of RKHS space : Theoretical results only available for least squares loss/identity link function. : The link function is generally incorporated into the loss function, for example, least squares loss corresponds to the identity link and logistic loss to the logit link. Table A.1: Summary of methods generalized by our GSAM framework. The various methods in the literature are grouped according to (a) limited theoretical results (only for least squares or sub-optimal rates) and (b) absence of an efficient algorithm. Haris, Simon and Shojaie 300 500 1000 Time [microseconds] Sample size: 100 1000 3000 10000 30000 Time [microseconds] Sample size: 500 1000 3000 10000 Time [microseconds] Sample size: 1000 3 10 30 Time [milliseconds] Sample size: 5000 Figure B.1: Timing results for 100 implementations of the proximal problem for SSP and trend filtering. Appendix B. Numerical Experiments for Comparing Run-Times In this appendix, we present a detailed comparison of the run-times for solving the univariate proximal problem for various choices of the smoothness penalty, Pst. In greater detail, we study the proximal problem: min f F 1 2 r f 2 n + λ f n + λ2Pst(f). We generated data as xi Unif[ 1, 1] and ri = sin(1.5πxi)/2 + εi where εi N(0, 0.52) for i = 1, . . . , n. We considered sample sizes n = 100, 500, 1000 and 5000. In Table B.1 we present the timing results for repeatedly implementing the proximal solver 100 times for n = 500. Unsurprisingly, Sp AM is the fastest method as it can be viewed as a standardized group lasso problem (Simon and Tibshirani, 2012) where each proximal problem has a closed-form solution. The SSP penalty is slightly slower than trend filtering for order k = 0 but still orders of magnitude faster than trend filtering for k = 1, 2. In Figure B.1, we present violin plots for the timing results comparing SSP to trend filtering. This figure validates our previous observations: despite the added grid search, SSP is still much faster than trend filtering of order k = 1, 2. Another interesting feature is that SSP is the fastest method for n = 5000. While it is encouraging to see competitive computation time despite an added grid search, we must highlight one limitation of this experiment: each method was implemented based on existing code in R packages, and other publicly available sources. Therefore, there are bound to be differences in the efficacy of the code written including the choice of programming language used. For instance the SSP proximal problem is solved using Generalized Sparse Additive Models Min Lower Mean Median Upper Max Quantile Quantile Sp AM, M = 3 23.70 26.10 36.61 30.30 42.95 138.20 Sp AM, M = 5 25.30 28.90 40.15 32.90 42.20 143.60 Sp AM, M = 10 31.80 35.05 44.20 39.15 49.25 102.60 Sp AM, M = 15 38.00 42.95 55.48 46.85 56.90 202.00 Sp AM, M = 20 42.50 47.55 57.31 51.60 61.75 163.70 Sp AM, M = 30 55.00 61.00 75.79 69.25 77.10 248.20 Sp AM, M = 50 77.00 85.10 100.75 92.35 102.60 225.50 SSP 633.00 670.20 733.72 693.20 761.10 1107.80 TF, k = 0 452.80 489.80 571.64 514.15 579.05 1390.20 TF, k = 1 2681.40 2815.95 3495.12 2968.30 3134.55 50508.80 TF, k = 2 4436.40 4710.05 4921.01 4884.90 5008.40 5759.70 Table B.1: Summary of timing results for 100 implementations of proximal solver for n = 500. FORTRAN, whereas others use C++; additionally, speed is impacted by other functions which might be written in R, for example, using for loops to construct the matrix of basis functions. Appendix C. Numerical Experiments for Additional Tuning Parameters In this appendix, we empirically investigate the impact of decoulping the tuning parameters for sparsity and smoothness. Recall our decoupled GSAM: bβ, bf1, . . . , bfp argmin β R,f1,...,fp F Pnℓ j=1 Pst (fj) + ζλ j=1 fj n , (C.1) for a second tuning parameter ζ [0, 1]. To demonstrate the performance of decoupling tuning parameters, we implemented (C.1) using some of the data from our simulation study in Section 5. In greater detail, we consider Scenario 3 (all smooth functions) and Scenario 4 (mixture of functions) from the simulation study in Section 5. We generated data with n {100, 300, 500, 700, 900, 1000} and p {6, 100}. We implement both versions of GSAM, with the smoothness penalties, SSP and TF with k = 0, 1, 2. For the originally proposed GSAM (with coupled tuning parameters), we use a sequence of fifty λ values and, for (C.1), we additionally used a sequence of ten ζ [10 3, 1 10 5] values (resulting in a 10 50 grid of tuning parameters). We report the oracle mean square error (MSE): MSEcoupled = min λ bfλ f0 2 n, MSEdecoupled = min ζ,λ bfζ,λ f0 2 n. All results were averaged over 100 replications of the data. Haris, Simon and Shojaie TF, k=1 TF, k=2 SSP TF, k=0 250 500 750 1000 250 500 750 1000 Sample size (n) Mean square error Scenario 3; p = 6 TF, k=1 TF, k=2 SSP TF, k=0 250 500 750 1000 250 500 750 1000 Sample size (n) Mean square error Scenario 4; p = 6 TF, k=1 TF, k=2 SSP TF, k=0 250 500 750 1000 250 500 750 1000 Sample size (n) Mean square error Scenario 3; p = 100 TF, k=1 TF, k=2 SSP TF, k=0 250 500 750 1000 250 500 750 1000 Sample size (n) Mean square error Scenario 4; p = 100 Figure C.1: Plot of oracle MSEs versus sample size for each of scenarios 3 and 4 for p = 6 and 100, averaged over 100 replications. The lines correspond to GSAM with coupled ( ) and decoupled ( ) tuning parameters. For Scenario 3 with p = 100, we also consider a finer grid of ζ values for decoupled tuning parameters ( In Figure C.1 we plot the oracle MSE as a function of the sample size. For lowdimensional data we observe that decoupling the tuning parameters can lead to a lower MSE in some cases. For example, in Scenario 3, with p = 6 we observe the decoupled GSAM to have lower MSE for all methods for almost all sample sizes. However, even with p = 6 we note decoupled tuning parameters lead to a high MSE for large n whereas coupled tuning parameters have a monotone decreasing MSE curve. This phenomenon is exacerbated for p = 100, where we observe the coupled tuning parameters GSAM to uniformly beat the decoupled version in terms of oracle MSE. This behavior is likely due to the precision of our tuning parameters grid: using a finer grid of ζ values could lead to a lower MSE but at a high computational cost. We study the use of a finer grid for Scenario 3 with p = 100: the third line in the panel is MSEdecoupled for a sequence of thirty ζ [0.700, 0.999] values. We clearly observe a substantial reduction of oracle MSE from the original decoupled GSAM; additionally, for most methods, for some n, the decoupled GSAM outperforms the coupled GSAM. Generalized Sparse Additive Models Appendix D. Proof of Results in Section 2.2 Proof [Proof of Lemma 1] For brevity, we write our optimization problem as bf1, . . . , bfp argmin f1,...,fp F L(f1, . . . , fp), where L( ) is the objective function. The proof follows by contradiction: we assume some bfj 0 while others are non-zero, and looking at the sub-gradient conditions we arrive at a contradiction. In greater detail, assume without loss of generality, that for some k < p, bf1, . . . , bfk = 0 and bfk+1, . . . , bfp = 0. Define the paths bfj,εj = bfj + εjhj for any direction hj F for j = 1, . . . , p. The sub-gradient conditions state, that for any direction hj F, we must have 0 εj L( bf1,ε1, . . . , bfp,εp) ε1=...=εp=0 , for all j {1, . . . , p}, where εj denotes the sub-gradient set with respect to εj. For the non-zero functions, j {1, . . . , k}, the sub-gradient conditions are: y bf1 bf2 . . . bfk, hj n + λ1 εj P 2 st( bfj,εj) + λ2 bfj, hj n where a, b n = n 1 Pn i=1 a(xi)b(xi). Since the sub-gradient conditions must be met for all hj F, we set hj = bfj. This implies that εj P 2 st( bfj,εj) εj=0 = εj P 2 st{(1 + εj) bfj} εj=0 = εj(1 + εj)2 εj=0 P 2 st{ bfj} = 2P 2 st( bfj), where the second equality follows from properties of a norm. Thus, for j {1, . . . , k}, we must have y bf1 bf2 . . . bfk, bfj n + 2λ1P 2( bfj) + λ2 bfj n = 0, y bf1 bf2 . . . bfk, bfj n λ2 bfj n = 1 + 2λ1P 2( bfj) λ2 bfj n . (D.1) On the other hand for j {k + 1, . . . , p}, εj P 2 st( bfj ,εj ) εj =0 = εj P 2 st(εj hj ) εj =0 = εj ε2 j εj =0 P 2 st(hj ) = 0. εj bfj ,εj n εj =0 = εj εj hj n εj =0 = εj |εj | εj =0 hj n, Haris, Simon and Shojaie By definition of a sub-gradient we know that εj |εj | εj =0 = [ 1, 1]. Therefore, the sub- gradient conditions for j {k + 1, . . . , p} imply that y bf1 bf2 . . . bfk, hj n + λ2Uj hj n = 0, (D.2) for some Uj [ 1, 1]. Now with (D.1) and (D.2), and a clever choice of hj we arrive at a contradiction: setting hj = hj = bfj = 0 for any j k, Uj = y bf1 bf2 . . . bfk, bfj n λ2 bfj n by (D.2) = 1 + 2λ1P 2( bfj) λ2 bfj n by (D.1) because bfj = 0, which leads to a contradiction. Proof [Proof of Corollary 2] The proof is essentially identical to that of Lemma 1. Assume without loss of generality that I = {k + 1, . . . , p}. Assume for contradiction that, there is some j {1, . . . , k} such that bfj F\F0. Then as in the proof of Lemma 1, we can arrive at a contradiction showing Uj > 1 for all j I. Appendix E. Proof of Results in Section 3 Proof [Proof of Lemma 3] If ef = 0, then bf = 0 is trivially the solution to (8). Thus, throughout this proof, we consider ef = 0. Case 1: ef n λ2. In this case c bf = ef where c = (1 λ2/ ef n) 1. Let f T F be some arbitrary function and define the function h = f T bf. We will show that along the path bf + εh for all ε [0, 1], the objective r ( bf + εh) 2 n + λ1Pst bf + εh + λ2 bf + εh n (E.1) is minimized at ε = 0. We begin by noting that r ( ef + εch) 2 n + λ1Pst ef + εch , is minimized at ε = 0 because ef + εch = ef + εcf T εc bf = (1 ε) ef + εcf T F, for all ε [0, 1] since F is a convex cone. By the sub-gradient condition, we have r ef, ch n + λ1ϑ1 = 0, Generalized Sparse Additive Models for some ϑ1 Pst ef + εch ε=0, or equivalently c h r c bf, h n + λ1ϑ2 i = 0, for some ϑ2 Pst bf + εh ε=0. At bf + εh, one possible sub-gradient of the objective (E.1) is r bf, h n + λ1ϑ2 + λ2 bf, h n By the definition of c, we have that λ2/ bf n = cλ2/ ef n = c(1 1/c) = c 1, and thus the above sub-gradient is r bf, h n + λ1ϑ2 + (c 1) bf, h n = r c bf, h n + λ1ϑ2 = 0. Thus we have shown that the objective function (E.1) is minimized at ε = 0. Since f T was an arbitrary function, we conclude that bf is the solution of (8). Case 2: ef n < λ2. In this case we will show that bf 0. For this, we consider the path εf T for ε [0, 1] for an arbitrary f T F. We will show that the function 1 2 r εf T 2 n + λ1Pst (εf T ) + λ2 εf T n, (E.2) is minimized at ε = 0 and since f T is arbitrary that will complete the proof. As in the previous case, we begin by looking at the sub-gradient conditions for ef. The expression r ( ef + εf T ) 2 n + λ1Pst ef + εf T , is minimized at ε = 0 by definition of ef. This leads us to the sub-gradient condition r ef, f T n + λ1ϑ1 = 0 ϑ1 = r ef, f T n Now we describe the the sub-gradient conditions for (E.2). All sub-gradients of (E.2) at ε = 0 are given by r, f T n + ν1λ1Pst(f T ) + ν2λ2 f T n, (E.3) for real values (ν1, ν2) [ 1, 1]2. To complete the proof we need to find ν1 and ν2 such that (E.3) is 0 and, (ν1, ν2) [ 1, 1]2. Setting ν1 = ϑ1/Pst(f T ) and ν2 = ef, f T /(λ2 f T n) clearly makes (E.3) 0 and so we need only prove that our choice of ν1 and ν2 lie within the interval [ 1, 1]. Showing |ν1| 1 is equivalent to |ϑ1| Pst(f T ). ϑ1 is a member of the sub-gradient set Pst ef + εf T ε=0 = n u 0 : Pst( ef + ηf T ) P( ef) u η η 0 o . Haris, Simon and Shojaie Thus ϑ1 must satisfy the inequality ϑ1η Pst( ef + ηf T ) Pst( ef) Pst( ef) + ηPst(f T ) Pst( ef) = ηPst(f T ), where the second inequality holds because Pst is a semi-norm. This proves that |ϑ1| Pst(f T ). Showing |ν2| 1 is easier and follows by definition: |ν2| = | ef, f T | λ2 f T n ef n f T n λ2 f T n = ef n which is less than 1 since ef n < λ2. Sufficient conditions for bfj 0: for the second part of this Lemma, we proceed from the sub-gradient condition (E.3). If we set ν1 = 0, then bf 0 if for every direction f T there exists ν2 [ 1, 1] such that ν2λ2 f T n = r, f T n. Which is equivalent to r, f T n If λ2 r n, then (E.4) is satisfied for all f T because by the Cauchy-Schwarz inequality r, f T n Proof [Prof of Lemma 4] Consider an arbitrary direction bfλ + εh for some function h and ε in an open interval. We first consider the case Pst( bfλ) = 0. In this case if the directional derivative h P ν st( bf) exists then so does the directional derivative of Pst( bf) and is given by h Pst( bfλ) = h P ν st( bfλ) νP ν 1 st ( bfλ) . This follows from standard arguments for derivative of power functions. Here we present the simple case of integer valued ν > 1. h Pst( bfλ) = lim ε 0 Pst( bfλ + εh) Pst( bfλ) = lim ε 0 Pst( bfλ + εh) Pst( bfλ) ε Pν l=1{Pst( bfλ + εh)}ν l{Pst( bfλ)}l 1 Pν l=1{Pst( bfλ + εh)}ν l{Pst( bfλ)}l 1 = lim ε 0 P ν st( bfλ + εh) P ν st( bfλ) ε lim ε 0 1 Pν l=1{Pst( bfλ + εh)}ν l{Pst( bfλ)}l 1 = h P ν st( bfλ) 1 Pν l=1{Pst( bfλ)}ν 1 = h P ν st( bfλ) νP ν 1 st ( bfλ) . Generalized Sparse Additive Models Now, by the gradient condition, for fε = bfλ + εh, 2 r fε 2 n + λPst (fε) ϵ=0 = r bfλ, h n + λ h P ν st( bfλ) νP ν 1 st ( bfλ) = 0, (E.5) Similarly, for the path feε = efeλ + eεh, by the gradient condition, 2 r feε 2 n + eλP ν st (feε) eε=0 = r efeλ, h n + eλ h P ν st efeλ = 0. This is exactly the optimality condition (E.5) with eλ = λ(νP ν 1 st ( bfλ)) 1. Thus, if νeλP ν 1 st ( efeλ) = λ then bfλ = efeλ. Now to show the case Pst( bf) = 0, we need to find conditions for which the objective 1 2 finterp f 2 n + λPst(f) is minimized by bf = fnull. We consider the functions fh,ε = (1 ε)fnull + εh for ε [0, 1] and show that for all h F, the objective 1 2 finterp fh,ε 2 n + λPst(fh,ε) (E.6) is minimized at ε = 0, if and only if λ P st(finterp fnull). To see this, note that all subgradients of (E.6) at ε = 0, are of the form finterp fnull, fnull h n + λκPst(h), for κ [ 1, 1]. For 0 to be a sub-gradient of (E.6) we need to have λκPst(h) = finterp fnull, h fnull n. Consequently, (E.6) is minimized at ε = 0 if and only if for all h F finterp fnull, h fnull n λPst(h). Using the decomposition h = h0 + h F1 F2, the above condition becomes finterp fnull, h0 fnull n Pst(h ) + finterp fnull, h n Now if finterp fnull F2, then the first part of the LHS is 0 and the second part is bounded above by P st(finterp fnull). To complete the proof we show that finterp fnull is, infact, a member of F2. For this, it suffices to show that finterp fnull, fnull hnull n = 0 for all hnull F1. We know that fnull is the solution to the problem minimize f F1 1 2 finterp f 2 n; Haris, Simon and Shojaie in other words, for all hnull F1, the expression 1 2 finterp (1 ε)fnull εhnull 2 n, is minimized by ε = 0. Equivalently (by the gradient condition), for all hnull F1, finterp fnull, fnull hnull n = 0. Proof [Proof of Convergence of Infinite-dimensional problem] We begin by re-writing our optimization problem as follows: bβ, bη1, . . . , bηp argmin β R,η1,...,ηp Rn Pnℓ j=1 Pst (ηj) + λ j=1 ηj n, (E.7) Pst(ηj) = min f F Pst(f) + I(ηji = f(xji) for all i), in which case, the term Pst(ηj), in (E.7), is a semi-norm over Rn. We recover our estimate by bfj argmin f F Pst(f) + I bηji = f(xji) for all i . (E.8) Thus a proximal gradient descent algorithm will solve the finite-dimensional problem (E.7) by standard convergence guarantees. We then just need to solve (E.8) for a given bηj. Our Algorithm 1, does exactly that. To see this, re-write our main proximal problem, (11): ηinter j argmin η Rn 1 2 ηk 1 j tr η 2 n + tλ2Pst (η) , (E.9) finter j argmin f F Pst(f) + I(ηinter ji = f(xji) for all i). (E.10) Thus, we see that (E.9) generates a proximal gradient descent algorithm which solves (E.7) and (E.10) is exactly the problem (E.8). This completes the proof. Appendix F. Proofs of Results in Section 4.2 In this section we present the proof Theorem 7 and 15 for the sake of completeness. The arguments presented here are only a slight modification to those of B uhlmann and van de Geer (2011) for proving LASSO rates. One notable difference is that we explicitly handle an unpenalized intercept term; another is our handling of the structural penalty, Pst( ). Throughout the proofs, we will use the so-called basic inequalities. Hence, for the sake of convenience, we state and prove these basic inequalities as a separate lemma. Generalized Sparse Additive Models Lemma F.1 (Basic Inequality) Let bf(x) = bβ + Pp j=1 bfj(xj) be as defined in (14), and let f (x) = β + Pp j=1 f j (xj) be an arbitrary additive function with β R and f j F. Then we have the following basic inequality E( bf) + λI( bf) h νn( bf) νn(f ) i + λI(f ) + E(f ). If we further assume that ℓ( ) and Pst( ) are convex, then for all t (0, 1) and ef = t bf + (1 t)f we have the following basic inequality E( ef) + λI( ef) h νn( ef) νn(f ) i + λI(f ) + E(f ). Proof For the first inequality, note that Pnℓ bf + λI( bf) Pnℓ(f ) + λI(f ), which is equivalent to λI( bf) Pnℓ bf Pnℓ(f ) + λI(f ) P(ℓ(f )) P(ℓ( bf)) + λI( bf) Pnℓ bf P(ℓ( bf)) Pnℓ(f ) + P(ℓ(f )) + λI(f ) P(ℓ(f )) P(ℓ( bf)) + λI( bf) h (Pn P)( ℓ( bf)) (Pn P)( ℓ(f )) i + λI(f ) P(ℓ(f )) P(ℓ( bf)) + λI( bf) h νn( bf) νn(f ) i + λI(f ) P(ℓ(f )) P(ℓ(f0)) + P(ℓ(f0)) P(ℓ( bf)) + λI( bf) h νn( bf) νn(f ) i + λI(f ) E(f ) + E( bf) + λI( bf) h νn( bf) νn(f ) i + λI(f ). For the second inequality we have by convexity Pnℓ ef + λI( ef) t h Pnℓ bf + λI( bf) i + (1 t) [ Pnℓ(f ) + λI(f )] Pnℓ(f ) + λI(f ), after which we simply need to repeat the arguments for the previous basic inequality with bf replaced by ef. Proof [Proof of Theorem 7] Define M + I( bf f ) , (F.1) and ef = t bf + (1 t)f . Then (F.1) implies I( ef f ) M and by Lemma F.1, we obtain E( ef) + λI( ef) ZM + λI(f ) + E(f ). Haris, Simon and Shojaie By applying the triangle inequality I( ef f +f ) I( ef f ) I(f ), to the left hand side we obtain on the set T (where ZM ρM + 2Rρ) E( ef) + λI( ef f ) ρM + 2Rρ + 2λI(f ) + E(f ). Recall the definition of M , given by ρM = E(f ) + 2λI(f ) + 2Rρ, from which we obtain E( ef) + λI( ef f ) 2ρM 2λ 4 M I( ef f ) M Now by the definition of ef we have I( bf f ) = I( ef f ) t = I( ef f ) 1 + I( bf f ) 2 + I( bf f ) which implies that I( bf f ) M . Now we can repeat the above arguments with ef replaced by bf which gives us E( bf) + λI( bf f ) ρM + ρ(2R) + 2λI(f ) + E(f ). Proof [Proof of Theorem 15] As in the proof of Theorem 7, we begin by defining t as M + |bβ β | + I( bf f ) , and ef = t bf + (1 t)f which gives us |eβ β | + I( ef f ) M implying that, ef F0 local. Lemma F.1 implies that on the set T (where ZM ρM ) we have E( ef) + λI( ef) ZM + E + λI(f ) ρM + E(f ) + λI(f ). (F.2) Be definition of S and I( ), we have by the triangle inequality λI(f ) = λ X n f j n + λPst(f j ) o λ X n f j efj n + efj n + λPst(f j ) o , and by the reverse triangle inequality we have λI( ef) = λ X n efj n + λPst( efj) o + λ X n efj n + λPst( efj) o n efj n + λPst( efj f j ) λPst(f j ) o + λ X n efj n + λPst( efj) o n efj n + λPst( efj f j ) λPst(f j ) o + λ X n efj f j n + λPst( efj f j ) o , Generalized Sparse Additive Models where the last equality follows from the fact that f j = 0 for all j Sc . With the above two inequalities combined with (F.2) we get n efj n + λPst( efj f j ) λPst(f j ) o + λ X n efj n + λPst( efj) o n f j efj n + efj n + λPst(f j ) o + ρM + E(f ), which simplifies to E( ef) + λ X j Sc efj f j n + λ j=1 λPst( efj f j ) (F.3) j S f j efj n + 2λ2 X j S Pst(f j ) + ρM + E(f ). Now we add λ|eβ β | + λ P j S efj f j n to both sides of (F.3) to obtain E( ef) + λ n |eβ β | + I( ef f ) o 2λ X j S f j efj n + λ|eβ β |+ ρM + E(f ) + 2λ2 X j S Pst(f j ). (F.4) j S f j efj n + λ|eβ β | ρM + E(f ) + 2λ2 X j S Pst(f j ), then (F.4) simplifies to E( ef) + λ n |eβ β | + I( ef f ) o 2ρM + 2E(f ) + 4λ2 X j S Pst(f j ) 8 M = λM /2, which indicates that |eβ β |+I( ef f ) M /2 which implies that |bβ β |+I( bf f ) M and hence we can redo the above arguments and replace ef by bf. Case II. If instead j S f j efj n + λ|eβ β | ρM + E(f ) + 2λ2 X j S Pst(f j ), then we have E( ef) + λ n |eβ β | + I( ef f ) o 4λ X j S f j efj n + 2λ|eβ β |. (F.5) Haris, Simon and Shojaie This is equivalent to E( ef) + λ n X j Sc efj f j n + λ j=1 Pst( efj f j ) o 3λ X j S f j efj n + λ|eβ β |, which means we have that j Sc efj f j n + λ j=1 Pst( efj f j ) 3 X j S f j efj n + |eβ β |, and hence by the compatibility condition (F.5) reduces to E( ef) + λ n |eβ β | + I( ef f ) o 4λ ef f s /φ(S ). Since ef and f are in F0 local, we invoke the inequality uv H(v) + G(u) to obtain By the convexity of G and the margin condition we obtain φ(S ) H 8λ s Hence we have 2 + λ n |eβ β | + I( ef f ) o H 8λ s 2 ρM λM /8, which implies that |eβ β |+I( ef f ) M /2 which in turn implies |bβ β |+I( bf f ) M and hence we can redo the above arguments and replace ef by bf. Thus we have shown that E( bf) + λ n |bβ β | + I( bf f ) o 4ρM . (F.6) Appendix G. The Set T Theorems 7 and 12 show inequalities holding over the set T . In this section we will show that T occurs with high probability. This will be shown for the two different types of entropy bounds considered in Section 4. We consider the special case of loss functions linear in Yi as in Corollaries 8 and 13 and bound the term νn(f) νn(f ) in the following theorem. Generalized Sparse Additive Models Theorem G.1 Let xi Rp and Yi R denote the fixed covariates and response, respectively, for i = 1, . . . , n. Assume that for any function f the loss ℓ( ) is such that ℓ(f) = ℓ(f, xi, Yi) = a Yif(xi) + b(f(xi)), for some a R\{0} and function b : R R. Further assume that Yi EYi = Yi µi are uniformly sub-Gaussian: max i=1,...,n K2 Ee(Yi µi)2/K2 1 σ2 0, (G.1) then with probability at-least 1 2 exp nρ2C1 C exp nρ2C2 the following inequality holds νn(f) νn(f ) ρ j=1 fj f j n + λPst(fj f j ) for variables ρ, λ and positive constants C, C1, C2 which we specify in the following 3 cases. Case 1. If F has a logarithmic entropy bound, then the inequality (G.2) holds with ρ = and λ = 1 for constants κ = κ(a, K, σ0, A0), C1 = C1(K, σ0), C = C(K, σ0) and C2 = C2(C, κ). Case 2. If F has a polynomial entropy bound with smoothness, then the inequality (G.2) holds with ρ = κ max n 1 2+α , q for constants κ = κ(a, K, σ0, A0, α), C1 = C1(K, σ0), C = C(K, σ0) and C2 = C2(C, κ). The parameter satisfies λ ρ and λ 8ρ. In light of the above theorem, for the case of Theorem 7 where ZM = sup I(f f ) M |νn(f) νn(f )|, and β, β R where R is uniformly bounded by R then we have with probability at-least 1 2 exp nρ2C1 C exp nρ2C2 , ZM ρ (M + 2R) . In the case of Theorem 15 where ZM = sup |β β |+I(f f ) M |νn(f) νn(f )|, we have with probability at-least 1 2 exp nρ2C1 C exp nρ2C2 , To prove Theorem G.1, we use a few technical lemmas from van de Geer (2000) namely Lemma 8.2 and Corollary 8.3; for the sake of completeness we state these lemmas in Appendix I. Haris, Simon and Shojaie Proof [Proof of Theorem G.1] We begin by noting that for any arbitrary function we have νn(f) = (Pn P)( ℓ(f)) = 1 i=1 [a Yif(xi) + b(f(xi))] E i=1 a Yif(xi) + b(f(xi)) Since we assume the covariates x1, . . . , xn are fixed we obtain i=1 a(Yi µi)f(xi) a Y µ, f n, where µi = EYi. Thus for additive functions f and f we obtain νn(f) νn(f ) = a Y µ, f f n = 1 i=1 a(Yi µi) j=1 fj(xij) f j (xij) = a(β β ) 1 i=1 (Yi µi) + i=1 a(Yi µi)(fj(xij) f j (xij)) = a(β β )(Y µ) + j=1 a Y µ, fj f j n. From now on we will assume, without loss of generality, that |a| = 1 since this constant is absorbed into a constant κ which we define later. To control the first term, (β β )(Y µ), we simply apply Lemma I.1. For the second part, we consider 2 cases. Case 1: Logarithmic Entropy. We first note that if the entropy bound holds, then the same bound holds (upto a constant) for the class ( fj f j fj f j n + λPst(fj f j ) : fj F for some f j F for all j = 1, . . . , p. Now we apply Lemma I.2 to the above class by first noting that R 1 and then using the bound for Dudley s integral A1/2 0 T 1/2 n u + 1 du e A0T 1/2 n , we have for all δ that satisfy where the constant C depends only on K and σ0, we have 1 n Pn i=1(Yi µi) fj(xij) f j (xij) fj f j n + λPst(fj f j ) Generalized Sparse Additive Models We can now take δ = ρ 2C e A0 max q n which holds for all κ 2C e A0. Applying the above result with a union bound gives us max j=1,...,p sup fj F Y µ, fj f j n fj f j n + λPst(fj f j ) ρ p C exp nρ2 = C exp nρ2 4C2 + log p = C exp nρ2 1 C exp nρ2C2 , for a constant C2 > 0 that depends on C and e A0. To see this, note that 1 4C2 log p nρ2 = 1 4C2 1 κ max n Tn log p, 1 o 1 4C2 1 which is positive if κ > 4C2. Thus we can take the constant κ such that κ > max n 4C2, 2C e A0 o . Hence κ depends on C(K, σ0) and e A0(A0). Case 2: Polynomial Entropy with Smoothness. Now we note that same entropy bound holds for the class ( fj f j fj f j n + λPst(fj f j ) : fj F and we can now apply Lemma I.2 by noting that 0 H1/2(u, e F, Qn) du e A0λ α/2, for some constant e A0 = e A0(A0). For some C = C(K, σ0) and all δ 2C e A0λ α/2n 1/2 we have Y µ, fj f j n fj f j n + λPst(fj f j ) δ Since λ ρ we note that 2C e A0λ α/2n 1/2 2C e A0ρ α/2n 1/2 and that 2C e A0ρ α/2n 1/2 ρ ρ 2C e A0 2 2+α n 1 2+α . Which holds by definition since ρ = κ max q n , n 1 2+α κn 1 2+α and κ is sufficiently large (any κ 2C e A0 2 2+α would suffice). Therefore, we can take δ = ρ in (G.3) along Haris, Simon and Shojaie with a union bound to obtain max j=1,...,p sup fj F Y µ, fj f j n fj f j n + λPst(fj f j ) ρ p C exp nρ2 = C exp nρ2 1 C exp nρ2C2 , for some positive constant C2 = C2(C, e A0) exactly as in Case 1. Appendix H. On the Margin and Compatibility Coniditions In this appendix, we present a detailed discussion of the margin and compatibility conditions. These conditions are the main assumptions made for our theoretical results and in this appendix we discuss suitable conditions under which the conditions hold. H.1 Margin Condition We now present a general framework for proving the quadratic margin condition for loss functions given by the negative log-likelihood of exponential family distributions. Specifically, for loss functions of the type ℓ(f) = ℓ(yi; f(xi)) = ayif(xi) + b(f(xi)). (H.1) The excess risk is E(f) = P{ℓ(f0) ℓ(f)} i=1 { a E(yi)}{f0(xi) f(xi)} [b{f0(xi)} b{f(xi)}] i=1 [b {f0(xi)}]{f0(xi) f(xi)} [b{f0(xi)} b{f(xi)}], where the last equality holds because the expectation of the score function is zero. Now by Taylor s theorem for b( ) and the mean value theorem for b ( ), the above simplifies to E(f) = n 1 n X i=1 b (ζ1i){f0(xi) fi(xi)}2 b (ζ2i) 2 {f0(xi) f(xi)}2 b (ζ1i) b (ζ2i) Generalized Sparse Additive Models where ζ1i, ζ2i lie between f(xi), f0(xi). Thus for loss functions of the type (H.1), we only need to find a constant C in a neighborhood of f0. As an example, consider the least squares loss where b(θ) = θ2/2, b (θ) = 1 C = 1/2. Similarly, for logistic regression we have b(θ) = log {exp(θ) + 1} , b (θ) = exp(θ) 1 + exp(θ). Now on the set F0 local = {f : f f0 η}, we need only look at the interval (f0(xi) η, f0(xi) + η). As a simple case, say f0(xi) = 0, then b (ζ1i) b (ζ2i) 2 exp{η} [1 + exp{η}]2 exp{0} 2[1 + exp{0}]2 > 0, for η < log(3 + 2 2). While tedious, for any f0(xi), we can find an η such that b (ζ1i) b (ζ2i)/2 > 0 for all ζ1i, ζ2i. The above examples indicate that the margin condition is satisfied in most of the relevant problems. H.2 Compatibility Condition We now show that under suitable conditions, the theoretical compatibility condition implies the compatibility condition. Recall first our notation, f 2 = R [f(x)]2d Q(x) where Q is the probability measure of our observed data. We denote by Qn, the empirical measure associated with Q, and f 2 n = n 1 Pn i=1 f2(xi). Lemma H.1 On the set f n f Pp j=1 fj + λP(fj) η with η (0, 1/5), suppose we have c1 = η 2(2 + η) 1 η + 3(1 + η) |S| eφ(S) < 1, (H.3) if the theoretical compatibility condition holds with η as in (H.2). Then, the empirical compatibility condition holds with constant {φ(S)} 1 = (1 + η) + 3η(1 + η) (1 c1)eφ(S) . Remark H.2 The event S, ensures that our empirical norm is relatively close to the theoretical norm; existing literature shows that S holds with high probability under suitable entropy conditions for small η (see for example, Section 5 of Tan and Zhang, 2019). The condition (H.3), simply forces our index set S to not be too large; thus, highlighting the role of sparsity in high dimensions. Haris, Simon and Shojaie To prove Lemma H.1, we will state and prove a sequence of smaller lemmas, the proof will then follow immediately. For ease of exposition/reference, we restate our theoretical and (empirical) compatibility conditions. For simplicity, we do not include the intercept term; explicitly including an intercept term does not change the proof. Definition H.3 (Empirical Compatibility Condition) The compatibility condition is said to hold for an index set S {1, 2, . . . , p} with s = |S|, and with compatibility constant φ(S) > 0, if for all λ > 0 and all functions f of the form f(x) = Pp j=1 fj(xj) that satisfy j Sc fj n + j=1 λPst(fj) 3 X j S fj n, (Empirical-A) it holds that X j S fj n f n s/φ(S). (Empirical-B) Definition H.4 (Theoretical Compatibility Condition) The theoretical compatibility condition is said to hold for an index set S {1, 2, . . . , p} with s = |S|, and for a compatibility constant eφ(S) > 0, if for some η (0, 1/5), all λ > 0, and all functions of the form f(x) = Pp j=1 fj(xj) that satisfy j Sc fj + 1 5η j=1 λPst(fj) 3(1 + η) j S fj , , (Theoretical-A) it holds that X j S fj f s/eφ(S). (Theoretical-B) Proof [Proof of Lemma H.1] We will show that on the set S given in (H.2), we have the following sequence of implications. (Empirical-A) 1 (Theoretical-A) 2 (Theoretical-B) 3 (Empirical-B) 1. The result of Lemma H.5. 2. By assuming the theoretical compatility condition. 3. Follows immediately from Lemmas H.6 and H.7. Lemma H.5 On the set S, (Empirical-A) (Theoretical-A). Generalized Sparse Additive Models Proof We have fj n fj η ( fj + λP(fj)) , which is equivalent to (1 η) fj ηλP(fj) fj n (1 + η) fj + ηλP(fj). (H.4) Thus we have the following: j Sc fj n + η 1 η j Sc λP(fj) + j=1 λP(fj) by (H.4) j Sc fj n + 1 1 η j Sc λP(fj) + X j Sc fj n + j S fj n + η 1 η j S λP(fj) by (Empirical-A) j S (1 + η) fj + ηλP(fj) j S λP(fj) by (H.4) j S fj + 4η 1 η j S λP(fj). Taking the smoothness norm term to the left hand side completes the proof. Lemma H.6 On the set S, assuming the theoretical compatibility condition, if c1 = η 2(2 + η) 1 η + 3(1 + η) f 1 1 c1 f n. Proof On the set S we have j=1 fj + λP(fj) by definition Haris, Simon and Shojaie For terms inside the brackets we have: p X j S fj + 3(1 + η) j S fj by (Theoretical-A) j=1 λP(fj) 3(1 + η) j S fj by (Theoretical-A). So now we have: j S fj + 3(1 + η) f n + η 2(2 + η) 1 η + 3(1 + η) eφ(S) . by (Theoretical-B) By definition of c1 we get (1 c1) f f n, which completes the proof. Lemma H.7 Under the conditions of Lemma H.6, (Theoretical-B) (Empirical-B) with {φ(S)} 1 = (1 + η) + 3η(1 + η) (1 c1)eφ(S) . Proof We have that X j S fj n (1 + η) X j S fj + η X j S λP(fj) by (H.4) j S fj + 3η(1 + η) j S fj by (Theoretical-A) (1 + η) + 3η(1 + η) eφ(S) by (Theoretical-B) (1 + η) + 3η(1 + η) f n 1 c1 , by Lemma H.6 f n s/φ(S). Generalized Sparse Additive Models Appendix I. Some Results from van de Geer (2000) Lemma I.1 (Lemma 8.2 of van de Geer (2000)) Suppose that Y1 µ1, . . . , Yn µn are mean zero sub-Gaussian random variables, satisfying (G.1). Then for all γ Rn and ρ > 0, i=1 (Yi µi)γi 8(K2 + σ2 0) Pn i=1 γ2 i in particular if γi = 1/n then we have 8(K2 + σ2 0) Lemma I.2 (Corollary 8.3 of van de Geer (2000)) Suppose that supfj F fj n R for a univariate function class F and that Y1 µ1, . . . , Yn µn are mean zero sub-Gaussian random variables, satisfying (G.1). Then for some constant C = C(K, σ0), and for all δ > 0 satisfying nδ 2C Z R 0 H1/2(u, F, Qn) du R , i=1 (Yi µi)fj(xij) Taylor Arnold, Veeranjaneyulu Sadhanala, and Ryan Tibshirani. glmgen: Fast algorithms for generalized lasso problems, 2014. R package version 0.0.3. Miriam Ayer, Hugh D. Brunk, George M. Ewing, William T. Reid, and Edward Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4):641 647, 1955. Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009a. Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, pages 42 88, 2009b. Peter B uhlmann and Sara van de Geer. Statistics for High-dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011. Alexandra Chouldechova and Trevor J. Hastie. Generalized additive model selection. Ar Xiv Preprint ar Xiv:1506.03850, 2015. Arnak S. Dalalyan, Mohamed Hebiri, and Johannes Lederer. On the prediction performance of the lasso. Bernoulli, 23(1):552 581, 2017. Haris, Simon and Shojaie Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 2012. Bruno C. Feltes, Eduardo B. Chandelier, Bruno I. Grisci, and M arcio Dorn. Cu Mi Da: An extensively curated microarray database for benchmarking and testing of machine learning approaches in cancer research. Journal of Computational Biology, 26(4):376 386, 2019. doi: 10.1089/cmb.2018.0238. URL https://doi.org/10.1089/cmb.2018.0238. PMID: 30789283. Jerome H. Friedman, Trevor J. Hastie, and Robert J. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1): 1 22, 2010. Asad Haris, Ali Shojaie, and Noah Simon. Wavelet regression and additive models for irregularly spaced data. In Advances in Neural Information Processing Systems, pages 8973 8983, 2018. Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models, volume 43. CRC Press, 1990. Nicholas A. Johnson. A dynamic programming algorithm for the fused lasso and l0segmentation. Journal of Computational and Graphical Statistics, 22(2):246 260, 2013. Vladimir Koltchinskii and Ming Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660 3695, 12 2010. doi: 10.1214/10-AOS825. URL http://dx.doi. org/10.1214/10-AOS825. Yi Lin and Hao H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272 2297, 2006. Yin Lou, Jacob Bien, Rich Caruana, and Johannes Gehrke. Sparse partially linear additive models. Journal of Computational and Graphical Statistics, 25(4):1126 1140, 2016. doi: 10.1080/10618600.2015.1089775. URL http://dx.doi.org/10.1080/10618600.2015. 1089775. Lukas Meier, Sara van de Geer, and Peter B uhlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779 3821, 2009. Sahand Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Manuscript, University of California, Berkeley, Dept. of Statistics and EECS, 2011. Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, UCL, 2007. Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127 239, 2014. Generalized Sparse Additive Models Ashley Petersen, Daniela Witten, and Noah Simon. Fused lasso additive model. Journal of Computational and Graphical Statistics, 25(4):1005 1025, 2016. Aaditya Ramdas and Ryan J. Tibshirani. Fast and flexible ADMM algorithms for trend filtering. Journal of Computational and Graphical Statistics, 25(3):839 858, 2015. Garvesh Raskutti, Bin Yu, and Martin J. Wainwright. Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness. In Advances in Neural Information Processing Systems, pages 1563 1570, 2009. Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research, 13(1):389 427, 2012. Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009 1030, 2009. Veeranjaneyulu Sadhanala and Ryan J. Tibshirani. Additive models with trend filtering. The Annals of Statistics, 47(6):3032 3068, 2019. Noah Simon and Robert Tibshirani. Standardization and the group lasso penalty. Statistica Sinica, 22(3):983, 2012. Zhiqiang Tan and Cun-Hui Zhang. Doubly penalized estimation in additive regression with high-dimensional data. The Annals of Statistics, 47(5):2567 2600, 2019. doi: 10.1214/18-AOS1757. URL https://doi.org/10.1214/18-AOS1757. Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267 288, 1996. Sara van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000. Sara van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, pages 614 645, 2008. Sara van de Geer. The lasso with within group structure. In J. Antoch, M. Huˇskov a, and P. K. Sen, editors, Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureˇckov a, volume 7 of IMS Collections, pages 235 244. Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2010. doi: 10.1214/10-IMSCOLL723. URL https://doi.org/10.1214/ 10-IMSCOLL723. Sara van de Geer. Estimation and Testing Under Sparsity: Ecole d Et e de Probabilit es De Saint-Flour XLV - 2015. Springer Publishing Company, Incorporated, 1st edition, 2016. ISBN 3319327739, 9783319327730. Sara van de Geer and Peter B uhlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360 1392, 2009. Haris, Simon and Shojaie Grace Wahba. Spline Models for Observational Data. SIAM, 1990. Junming Yin, Xi Chen, and Eric P. Xing. Group sparse additive models. In Proceedings of the International Conference on Machine Learning, volume 2012, page 871. NIH Public Access, 2012. Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49 67, 2006. Ming Yuan and Ding-Xuan Zhou. Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6):2564 2593, 2016. doi: 10.1214/15-AOS1422. URL https://doi.org/10.1214/15-AOS1422. Xingzhi Zhan. Extremal eigenvalues of real symmetric matrices with entries in an interval. SIAM Journal on Matrix Analysis and Applications, 27(3):851 860, 2005. Tuo Zhao, Xingguo Li, Han Liu, and Kathryn Roeder. SAM: Sparse Additive Modelling, 2014. URL http://CRAN.R-project.org/package=SAM. R package version 1.0.5. Hui Zou and Trevor J. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301 320, 2005.