# scalable_approximations_for_generalized_linear_problems__c4fca247.pdf Journal of Machine Learning Research 20 (2019) 1-45 Submitted 5/17; Revised 8/18; Published 2/19 Scalable Approximations for Generalized Linear Problems Murat A. Erdogdu erdogdu@cs.toronto.edu Department of Computer Science Department of Statistical Sciences University of Toronto Toronto, ON M5S 3G3, Canada Mohsen Bayati bayati@stanford.edu Graduate School of Business Stanford University Stanford, CA 94305, USA Lee H. Dicker ldicker@stat.rutgers.edu Department of Statistics and Biostatistics Rutgers University Piscataway, NJ 08854, USA Editor: Qiang Liu In stochastic optimization, the population risk is generally approximated by the empirical risk which is in turn minimized by an iterative algorithm. However, in the large-scale setting, empirical risk minimization may be computationally restrictive. In this paper, we design an efficient algorithm to approximate the population risk minimizer in generalized linear problems such as binary classification with surrogate losses and generalized linear regression models. We focus on large-scale problems where the iterative minimization of the empirical risk is computationally intractable, i.e., the number of observations n is much larger than the dimension of the parameter p (n p 1). We show that under random sub-Gaussian design, the true minimizer of the population risk is approximately proportional to the corresponding ordinary least squares (OLS) estimator. Using this relation, we design an algorithm that achieves the same accuracy as the empirical risk minimizer through iterations that attain up to a quadratic convergence rate, and that are computationally cheaper than any batch optimization algorithm by at least a factor of O(p). We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions. Finally, we demonstrate the performance of our algorithm on well-known classification and regression problems, through extensive numerical studies on large-scale datasets, and show that it achieves the highest performance compared to several other widely used optimization algorithms. Keywords: Generalized Linear Problems, Stochastic optimization, Subsampling, Dimension reduction in optimization. 1. Introduction We consider the following stochastic optimization problem minimize β Rp R(β) := E Ψ ( x, β ) y x, β , (1) c 2019 Murat A. Erdogdu, Mohsen Bayati, Lee H. Dicker. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-279.html. Erdogdu, Bayati, Dicker where Ψ : R R is a non-linear function, y Y R denotes the response variable, x X Rp denotes the predictor (or covariate), and the expectation is over the joint distribution of (y, x). The above minimization is called a generalized linear problem in its canonical representation, and it is commonly encountered in statistical learning. Celebrated examples include binary classification with smooth surrogate losses (Buja et al., 2005; Reid and Williamson, 2010), and generalized linear models (GLMs) such as Poisson regression, logistic regression, ordinary least squares, multinomial regression with many applications involving graphical models (Nelder and Baker, 1972; Mc Cullagh and Nelder, 1989; Wainwright and Jordan, 2008; Koller and Friedman, 2009). These methods play a crucial role in numerous machine learning and statistics problems, and they provide a miscellaneous framework for many regression and classification tasks. The exact minimization of the stochastic optimization problem (1) requires the knowledge of the underlying distribution of the variables (y, x). In practice, however, the joint distribution of the pair is not available; therefore, after observing n independent data points (yi, xi), the standard approach is to minimize the following surrogate of (1), often referred to as empirical risk approximation minimize β Rp b R(β) := 1 i=1 Ψ ( xi, β ) yi xi, β . (2) In the case of GLMs, the empirical risk minimization given in (2) is equivalent to the maximum likelihood estimation (MLE), whereas in the case of binary classification, it is generally referred to as surrogate loss minimization. Due to the non-linear structure of the optimization task given in (2), minimizing the empirical risk requires iterative methods. The first-order approximation of the non-linear risk yields the gradient descent algorithm, which attains a (local) linear convergence rate under certain conditions with O(np) per-iteration cost. Although its convergence rate is slow compared to that of the second-order methods, its modest per-iteration cost makes it practical for large-scale problems. Regardless of the problem formulation, the most commonly used optimization method for computing the MLE is the Newton-Raphson method, which may be viewed as a reweighted least squares algorithm (Mc Cullagh and Nelder, 1989; Buja et al., 2005). This method uses a second-order approximation to benefit from the curvature of the log-likelihood and achieves locally quadratic convergence. A drawback of this approach is its excessive per-iteration cost of O(np2). To remedy this, Hessian-free Krylov sub-space based methods such as conjugate gradient can be used, but the resulting direction is imprecise (Hestenes and Stiefel, 1952; Paige and Saunders, 1975; Martens, 2010). In the regime n p, another popular optimization technique is the class of Quasi-Newton methods (Bishop, 1995; Nesterov, 2004), which can attain a per-iteration cost of O(np), and the convergence rate is locally superlinear; a well-known member of this class of methods is the BFGS algorithm (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970). In this paper, we take an alternative approach for minimizing (1), based on an identity that is well-known in some areas of statistics, but appears to have received relatively little attention for its computational implications in large-scale problems. Let βpop denote the true minimizer of the population risk given in (1), and let βols denote the corresponding ordinary least squares (OLS) coefficients defined as βols = E xx T 1 E [xy]. Then, under Scalable Approximations for Generalized Linear Problems certain random predictor (design) models, βpop βols. (3) For logistic regression with Gaussian design (which is equivalent to Fisher s discriminant analysis), (3) was noted by Fisher in the 1930s (Fisher, 1936); a more general formulation for models with Gaussian design is given in Brillinger (1982). The relationship (3) suggests that if the constant of proportionality is known, then βpop can be estimated by computing the OLS estimator, which may be substantially simpler than minimizing the empirical risk. In fact, in some applications like binary classification, it may not be necessary to find the constant of proportionality in (3). Our work in this paper builds on this idea. Our contributions can be summarized as follows. 1. We show that βpop is approximately proportional to βols in the random design setting, regardless of the covariate (predictor) distribution. That is, we prove βpop cΨ βols c βpop p , for some cΨ R which depends on the non-linearity Ψ. We note that above rate is the relative decay, and it is obtained under the assumption that E[ x, βpop 2] c2. Our generalization uses zero-bias transformations (Goldstein and Reinert, 1997). We also show that the relation (3) still holds under certain types of regularization. 2. We design a computationally efficient estimator for βpop by first estimating the OLS coefficients, and then estimating the proportionality constant cΨ via line search. We refer to the resulting estimator as the Scaled Least Squares (SLS) estimator and denote it by ˆβ sls. After estimating the OLS coefficients, the second step of our algorithm involves finding a root of a real valued function; this can be accomplished using iterative methods with up to a quadratic convergence rate and only O(n) per-iteration cost. This is computationally cheaper than the classical batch methods such as gradient descent by at least a factor of O(p). 3. For random design with sub-Gaussian predictors and bounded Ψ(2), we show that ˆβ sls βpop βpop p + r p This bound characterizes the performance of the proposed estimator in terms of data dimensions, and justifies the use of the algorithm in the large-scale setting where n p 1. We also provide theoretical guarantees when subsampling is utilized in the algorithm for further efficiency. 4. We demonstrate how to transform a binary classification problem with smooth surrogate loss into a generalized linear problem, and how our methods can be applied to obtain a computationally efficient optimization scheme. Erdogdu, Bayati, Dicker 5. We propose a scalable optimization method for converting one generalized linear problem to another by exploiting the proportionality relation (3). The proposed algorithm requires only O (n) computations per each iteration, with no additional one-time cost. We further discuss the canonicalization of the square loss, which may be of independent interest for non-convex optimization. 6. We empirically study the statistical and computational performance of ˆβ sls, and compare it to that of the empirical risk minimizer (using several well-known implementations), on a variety of large-scale datasets. The rest of the paper is organized as follows: Section 1.1 surveys the related work and Section 2 introduces the required background and the notation. In Section 3, we provide the intuition behind the relationship (3), which are based on exact calculations for the Gaussian design setting. In Section 4, we propose our algorithm and discuss its computational properties. Theoretical results are given in Section 5. In Section 6, we propose an algorithm to convert one GLM type to another. We discuss how a binary classification problem can be cast as a generalized linear problem in Section 7, and in Section 7.1 we propose a method to canonicalize the square loss. Section 8 provides a thorough comparison between the proposed algorithm and other existing methods. Finally, we conclude with a brief discussion in Section 9. 1.1. Related work As mentioned in Section 1, the relationship (3) is well-known in several forms in statistics literature. Brillinger (1982) derived (3) for models with Gaussian predictors using Stein s lemma. Li and Duan (1989) studied model misspecification problems in statistics and derived (3) when the predictor distribution has linear conditional means (this is a slight generalization of Gaussian predictors). The relation (3) has led to various techniques for dimension reduction (Li, 1991; Li and Dong, 2009), and more recently, it has been studied by Plan and Vershynin (2016); Thrampoulidis et al. (2015) in the context of compressed sensing. It has been shown that the standard lasso estimator may be very effective when used in models where the relationship between the expected response and the signal is nonlinear, and the predictors (i.e. the design or sensing matrix) are Gaussian. A common theme for all of this previous work is that it focuses solely on settings where (3) holds exactly and the predictors (covariates) are Gaussian (or, in the case of Li and Duan (1989), very nearly Gaussian). The proportionality relation is solely based on Stein s lemma and its variants. There are recent studies using Stein s lemma in various machine learning and optimization tasks such as second-order optimization (Erdogdu, 2015, 2016, 2017), Bayesian inference (Liu and Wang, 2016), measuring sample quality and model evaluation tasks (Gorham and Mackey, 2015; Liu et al., 2016). Two key novelties of the present paper are (i) our focus on the computational benefits following from (3) for large scale problems with n p 1; and (ii) our rigorous finite sample analysis of models with non-Gaussian predictors, where (3) is shown to be approximately valid. To the best of our knowledge, the present paper and its earlier version Erdogdu et al. (2016) are the first to consider the relation (3) in the context of optimization. Scalable Approximations for Generalized Linear Problems 2. Preliminaries and Notation We assume a random design setting, where the observed data consists of n random iid pairs (y1, x1), (y2, x2), . . ., (yn, xn); yi Y R is the response variable and xi = (xi1, , xip)T X Rp is the vector of predictors or covariates. We focus on problems where the minimization (1) is desirable, but we allow model misspecification, and do not need to assume that (yi, xi) are actually drawn from a particular parametric model. In other words, the joint distribution of (yi, xi) can be arbitrary. βpop = argmin β Rp E Ψ( xi, β ) yi xi, β . (4) While we make no assumptions on Ψ beyond smoothness, note that when the optimization problem is GLM, and Ψ is the cumulant generating function for yi | xi, then the problem reduces to the standard GLM with canonical link and regression parameters βpop (Mc Cullagh and Nelder, 1989). Examples of GLMs in this form include logistic regression with Ψ(w) = log{1 + ew}, Poisson regression with Ψ(w) = ew, and linear regression (least squares) with Ψ(w) = w2/2. Our objective is to find a computationally efficient estimator for βpop. The alternative estimator for βpop proposed in this paper is related to the OLS coefficient vector, which is defined by βols := E[xix T i ] 1E [xiyi]; the corresponding OLS estimator is ˆβols := (XT X) 1XT y, where X = (x1, . . . , xn)T is the n p design matrix and y = (y1, . . . , yn)T Rn. Additionally, throughout the text we let [m]={1, 2, ..., m}, for positive integers m, and we denote the size of a set S by |S|. The m-th derivative of a function g : R R is denoted by g(m). For a vector u Rp and a n p matrix U, we let u q and U q denote the ℓq-vector and -operator norms, respectively. If S [n], let US denote the |S| p matrix obtained from U by extracting the rows that are indexed by S. For a symmetric matrix M Rp p, λmax(M) and λmin(M) denote the maximum and minimum eigenvalues, respectively, and ρq(M) for q {1, 2, } denotes the condition number of M with respect to ℓq-norm. We denote by Np the p-variate normal distribution, and all expectations are over all randomness inside the brackets. Finally, we use a b and a O (b) interchangeably, whichever is convenient (where O ( ) refers to the big O notation). 3. From OLS to the True Minimizer: Gaussian Case To motivate our methodology, we assume in this section that the covariates are multivariate normal, as in Brillinger (1982). These distributional assumptions will be relaxed to a certain extent in Section 5. Proposition 1 Assume that the covariates are multivariate normal with mean 0 and covariance matrix Σ, i.e. xi Np(0, Σ). Then βpop can be written as βpop = cΨ βols, (5) where cΨ R is the fixed point of the mapping z E h Ψ(2)( xi, βols z) i 1 . (6) Erdogdu, Bayati, Dicker Proof [of Proposition 1] The stationary point of the minimization problem in (4) satisfies the following normal equations, E [yixi] = E h xiΨ(1)( xi, β ) i for i = 1, 2, ..., p. (7) Now, denote by φ(x | Σ) the multivariate normal density with mean 0 and covariance matrix Σ. We recall the well-known property of Gaussian density dφ(x | Σ)/dx = Σ 1xφ(x | Σ). Using this and the integration by parts on the right hand side of the above equation, we obtain for i = 1, 2, ..., p E h xiΨ(1)( xi, β ) i = Z xΨ(1)( x, β )φ(x | Σ) dx, (8) =Σβ E Ψ(2)( xi, β ) which is basically the Stein s lemma. Combining this with the normal equations (7) and multiplying both side with Σ 1, we obtain the desired result. Proposition 1 and its proof provide the main intuition behind our proposed method. Observe that in our derivation, we only worked with the right hand side of the normal equations (7) which does not depend on the response variable yi. Therefore, the equivalence will hold regardless of the joint distribution of (yi, xi). This is the main difference from the proof of Brillinger (1982) where yi is assumed to follow a single index model. In Section 5, where we extend the method to non-Gaussian predictors, the identity (8) is generalized via the zero-bias transformations from Goldstein and Reinert (1997). 3.1. Regularization A version of Proposition 1 incorporating regularization an important tool for datasets where p is large relative to n or the predictors are highly collinear is also possible, as outlined briefly in this section. We focus on ℓ2-regularization (ridge regression) in this section; some connections with lasso (ℓ1-regularization) are discussed in Section 5 and Corollary 6. For λ 0, define the ℓ2-regularized empirical risk minimizer, βpop λ = argmin β Rp E [Ψ( xi, β ) yi xi, β ] + λ 2 β 2 2 (9) and the corresponding ℓ2-regularized OLS coefficients βols λ = E xix T i + λI 1 E [xiyi] (so βpop = βpop 0 and βols = βols 0 ). The same argument as above implies that βpop λ = cΨ βols γ , where γ = λcΨ. (10) This suggests that the ordinary ridge regression for the linear model can be used to estimate the ℓ2-regularized empirical risk minimizer βpop λ . Further pursuing these ideas for problems where regularization is a critical issue may be an interesting area for future research. Scalable Approximations for Generalized Linear Problems Algorithm 1 SLS: Scaled Least Squares Estimator Input: Data (yi, xi)n i=1 Step 1. Compute the least squares estimator: ˆβols and ˆy = Xˆβols. For a sub-sampling based OLS estimator, let S [n] be a random subset and take ˆβols = |S| n (XT SXS) 1XT y. Step 2. Solve the following equation for c R: 1 = c n Pn i=1 Ψ(2)(c ˆyi). Use Newton s root-finding method: Initialize c; Repeat until convergence: n Pn i=1 Ψ(2)(c ˆyi) 1 1 n Pn i=1 Ψ(2)(c ˆyi) + c ˆyiΨ(3)(c ˆyi) . Output: ˆβ sls = c ˆβols. 4. SLS: Scaled Least Squares Estimator Motivated by the results in the previous section, we design a computationally efficient algorithm that approximates the stochastic optimization problem (1) that is as simple as solving the least squares problem; it is described in Algorithm 1. The algorithm has two basic steps. First, we estimate the OLS coefficients, and then in the second step we estimate the proportionality constant via a simple root-finding algorithm. There are numerous fast optimization methods to solve the least squares problem, and even a superficial review of these could go beyond the page limits of this paper. We emphasize that this step (finding the OLS estimator) does not have to be iterative and it is the main computational cost of the proposed algorithm. We suggest using a sub-sampling based estimator for βols, where we only use a subset of the observations to estimate the covariance matrix. Let S [n] be a random sub-sample and denote by XS the sub-matrix formed by the rows of X in S. Then the sub-sampled OLS estimator is given as ˆβols = 1 |S|XT SXS 1 1 n XT y. Properties of sub-sampling and sketching based estimators have been well-studied (Vershynin, 2010; Dhillon et al., 2013; Erdogdu and Montanari, 2015; Pilanci and Wainwright, 2015; Roosta-Khorasani and Mahoney, 2016a,b). For sub-Gaussian covariates, it suffices to use a sub-sample size of O (p log(p)) (Vershynin, 2010). Hence, this step requires a single time computational cost of O |S|p2 + p3 + np O p max{p2 log(p), n} . For other approaches, we refer reader to Rokhlin and Tygert (2008); Drineas et al. (2011); Dhillon et al. (2013); Erdogdu and Montanari (2015) and the references therein. The second step of Algorithm 1 involves solving a simple root-finding problem. As with the first step, there are numerous methods available for completing this task. Newton s root-finding method with quadratic convergence or Halley s method with cubic convergence may be appropriate choices. We highlight that this step costs only O (n) per-iteration and that we can attain fast convergence rates of higher order algorithms. The resulting periteration cost is cheaper than commonly used batch algorithms by at least a factor of O (p) indeed, the cost of computing the gradient is O (np). For simplicity, we use Newton s root-finding method. We also note that in certain classification problems, only the direction of βpop is needed to make a prediction, so there is no need to compute the scaling constant. Erdogdu, Bayati, Dicker 4 5 6 log10(n) SLS vs MLE : Computation 4 5 6 log10(n) SLS vs MLE : Accuracy 4 5 6 log10(n) log10(|β β|2) SLS vs MLE : Log Accuracy Figure 1: Logistic regression with iid standard Gaussian design. The left plot shows the computational cost (time) for finding the MLE and SLS as n grows and p = 200. The middle and the right plots depict the accuracy of the estimators in standard and log scales, respectively. In the regime where the MLE is expensive to compute, the SLS is found much more rapidly and has the same accuracy. R s built-in functions are used to find the MLE. Correct initialization of the scaling constant c depends on the optimization problem. For example, in the case of GLM problems, assuming that the GLM is a good approximation to the true conditional distribution, by the law of total variance and basic properties of GLMs, we have Var (yi) = E [Var (yi | xi)] + Var (E [yi | xi]) c 1 Ψ + Var Ψ(1)( xi, β ) . (11) It follows that the initialization c = 2/Var (yi) is reasonable as long as we have c 1 Ψ E [Var (yi | xi)] not much smaller than Var Ψ(1)( xi, β ) . Our experiments show that SLS is very robust to initialization. In Figure 1, we compare the performance of our SLS estimator to that of the MLE in a GLM optimization problem, when both are used to analyze synthetic data generated from a logistic regression model under general Gaussian design with randomly generated covariance matrix. The left plot shows the computational cost of obtaining both estimators as n increases for fixed p = 200. The middle and the right plots show the accuracy of the estimators where the latter is in log-scale. In the regime n p 1 where the MLE is hard to compute the MLE and the SLS achieve the same accuracy, yet SLS has significantly smaller computation time. We refer the reader to Section 5 for theoretical results characterizing the finite sample behavior of the SLS. 5. Theoretical Results In this section, we use the zero-bias transformations (Goldstein and Reinert, 1997) to generalize the equivalence relation given in the previous section to the settings where the covariates are non-Gaussian. Definition 2 Let z be a random variable with mean 0 and variance σ2. Then, there exists a random variable z that satisfies E [zf(z)] = σ2E[f(1)(z )], for all differentiable functions f. The distribution of z is said to be the z-zero-bias distribution. The existence of z in Definition 2 is a consequence of Riesz representation theorem (Goldstein and Reinert, 1997). The normal distribution is the unique distribution whose zero-bias Scalable Approximations for Generalized Linear Problems transformation is itself (i.e. the normal distribution is a fixed point of the operation mapping the distribution of z to that of z which is basically Stein s lemma). To provide some intuition behind the usefulness of the zero-bias transformation, we refer back to the proof of Proposition 1. For simplicity, assume that the covariate vector xi has iid entries with mean 0, and variance 1. Then the zero-bias transformation applied to the j-th normal equation in (7) yields E [yixij] = E h xijΨ(1) xijβj + Σk =jxikβk i | {z } j-th normal equation = βj E h Ψ(2) x ijβj + Σk =jxikβik i | {z } Zero-bias transformation The distribution of x ij is the xij-zero-bias distribution and is entirely determined by the distribution of xij; general properties of x ij can be found, for example, in Chen et al. (2010). If β is well spread, it turns out that taken together, with j = 1, . . . , p, the far right-hand side in (12) behaves similar to the right side of (8), with Σ = I; that is, the behavior is similar to the Gaussian case, where the proportionality relationship given in Proposition 1 holds. This argument leads to an approximate proportionality relationship for problems with non-Gaussian predictors, which, when carried out rigorously, yields the following result. Theorem 3 Suppose that the whitened covariates wi = Σ 1/2xi are independent random vectors with mean 0, covariance I, and sub-Gaussian norm κx. Furthermore, wi s have constant first and second conditional moments, i.e., j [p] and β = Σ 1/2βpop, E[wij Σk =j βkwik] and E[w2 ij Σk =j βkwik] are deterministic, and the function Ψ(2) is Lipschitz continuous with constant k. Then, for cΨ = 1/E Ψ(2)( xi, βpop ) , we have 1 cΨ βpop βols η Σ1/2βpop βpop , where η = 8kκ3 xρ . (13) Theorem 3 is proved in Section 10 of the Appendix. It implies that the population parameters βols and βpop are approximately equivalent up to a scaling factor, with a relative error bound of O( Σ1/2βpop ). For the above analysis to provide a decreasing bound in the dimension p, one needs a tractability assumption on either one of the terms in the error bound. The most common way to achieve this is to introduce a structure on the inner product between x and βpop, e.g. E[ x, βpop 2]1/2 = O p suffices to get a decay in (13). We emphasize that it is common practice to standardize the response and the features before training any model; therefore this preliminary procedure can provide the required order and consequently the desired decay in p. Below, we discuss two commonly encountered settings in the literature as two corollaries. Corollary 4 Let the assumptions of Theorem 3 hold. If we further assume that the covariates xi are supported on a ball of radius r, and Σ1/2 is diagonally dominant, we have 1 cΨ βpop βols η βpop 2 p , where η = 16krκ3 x ρ2ρ . (14) Note that we do not require an underlying model assumption between y and x. Therefore, one can always perform under the finite support assumption by first scaling the design Erdogdu, Bayati, Dicker matrix appropriately, and then using the proposed estimator. This assumption is common in many fields, yet can appear in different forms. For example in the machine learning literature, the standard convention (the same setting in Corollary 4) is to assume that xi 2 r, and that βpop 2 = O p (Kalai and Sastry, 2009; Shalev-Shwartz et al., 2009; Kakade et al., 2011), or the inner product is supported on an interval | x, βpop | b almost surely (Alquier and Biau, 2013), where the latter is a more strict assumption. In the compressed sensing and other areas in statistics, a similar assumption is imposed on the covariance matrix E xx T = Σ/p which provides us with E[ x 2 2] = O (1) (Donoho et al., 2009; Bayati and Montanari, 2012; Bayati et al., 2013; Su and Candes, 2016). On the other hand, the standard convention in the dimension reduction techniques is to assume that the covariates have norm of order p, i.e. E [ x 2] = O p , and the coefficients have norm of order 1, i.e. β 2 = τ = O (1) (Duan and Li, 1991; Hall and Li, 1993; Hristache et al., 2001). This setting again yields that the inner product between those two vectors are tractable and provides us with a decay in the latter term, i.e. βpop = O 1/ p . In this setting, Theorem 3 yields the following corollary. Corollary 5 Let the assumptions of Theorem 3 hold. If we further assume that βpop is r-well-spread in the sense that for βpop 2 = τ, we have τ/ βpop = r p for some r 1, 1 cΨ βpop βols η βpop p , where η = 8kκ3 xρ (τ/r) Σ1/2 . (15) In the above bound, the decay is in fact 1/p, but we chose to state the relative decay. We emphasize that both settings in Corollaries 4 and 5 are equivalent and both assumptions lead a tractable inner product x, βpop . The constants κx, ρk are invariant to the scalings considered in both settings. The assumption that βpop is well-spread can be relaxed with minor modifications. For example, if we have a sparse coefficient vector, where supp(βpop) = {j; βpop j = 0} is the support set of βpop, then Corollary 5 holds with p replaced by the size of the support set. The assumptions on the conditional moments are the relaxed versions of assumptions that are commonly encountered in dimension reduction techniques. For example, sliced inverse regression methods assume that the first conditional moment E x x, β is linear in x for all β (Li and Duan, 1989; Li, 1991), which is satisfied by elliptically distributed random vectors. An important case that is not covered by these methods is the independent coordinate case, i.e., when the whitened covariates have independent, but not necessarily identical entries. This is made possible by introducing additional approximation error (i.e. suboptimality due to non-Gaussian design). We refer reader to Li and Dong (2009), for a good review of dimension reduction techniques and their corresponding assumptions. We also highlight that our moment assumptions can be relaxed further, at the expense of introducing some additional complexity into the results. An interesting consequence of Theorem 3 and the corollaries following the theorem is that whenever an entry of βpop is zero, the corresponding entry of βols has to be small, and conversely. For λ 0, define the lasso coefficients βlasso λ = argmin β Rp 1 2E (yi xi, β )2 + λ β 1 . (16) Scalable Approximations for Generalized Linear Problems Corollary 6 Let supp(β) denote the support of β, i.e., the set {i [p] : βi = 0}. Given 1 cΨ βpop βols δ, and for any λ δ, if E [xi] = 0 and E xix T i = I, we have supp(βlasso) supp(βpop). Further, if λ and βpop also satisfy that j supp(βpop), |βpop j | > cΨ (λ + δ), then we have supp(βlasso) = supp(βpop). We note that δ above can be obtained using Theorem 3 or its corollaries. So far in this section, we have only discussed properties of the population parameters, such as βpop and βols. In the remainder of this section, we turn our attention to the results on the estimators that are the main focus of this paper; these results ultimately build on our earlier results, i.e. Theorem 3. We will be considering the setting in Corollary 5 in the rest of this section, i.e., x 2 = O p and βpop 2 = O (1). Results presented below can be easily generalized to the setting in Corollary 4 as well. In order to precisely describe the performance of ˆβ sls, we first need bounds on the OLS estimator. The OLS estimator has been studied extensively in the literature; however, for our purposes, we find it convenient to derive a new bound on its accuracy. While we have not seen this exact bound elsewhere, it is very similar to Theorem 5 of Dhillon et al. (2013). Proposition 7 Assume that E [xi] = 0, E xix T i = Σ, and that Σ 1/2xi and yi are sub Gaussian with norms κ and γ, respectively. For λmin denoting the smallest eigenvalue of Σ, and |S| > ηp, ˆβols βols 2 ηλ 1/2 with probability at least 1 3e p, where η depends only on γ and κ. Proposition 7 is proved in Section 10. Our main result on the performance of ˆβ sls is given next. Theorem 8 Let the assumptions of Theorem 3, Corollary 5, and Proposition 7 hold. Further assume that Ψ(2) b and for some c and ϵ the function f(z) = z E Ψ(2)( x, βols z) satisfies f( c) > 1 +ϵ, and the derivative of f in the interval [0, c] does not change sign, i.e., its absolute value is lower bounded by υ > 0. Then, for n and |S| sufficiently large, with probability at least 1 4e p, we have ˆβ sls βpop η1 βpop p + η2 r p where the constants η1 and η2 are defined by η1 =ηk cρ (τ/r) Σ1/2 (19) η2 =η c λ 1/2 min + kυ 1 max n ( Σ1/2βols 2 + 1) Σ 2 + b/k, c o , (20) and η > 0 is a constant depending on κ and γ. Erdogdu, Bayati, Dicker Note that the convergence rate of the upper bound in (18) depends on the sum of the two terms, both of which are functions of the data dimensions n and p. The first term on the right in (18) comes from Theorem 3 and Corollary 5, which bounds the discrepancy between cΨ βols and βpop. This term is small when p is large, and it does not depend on the number of observations n. If n and p grow together, then p has to grow at a smaller rate to achieve convergence. We note that the actual decay of this term is O (1/p) (since βpop = O 1/ p ), but we state the relative decay which is O 1/ p . The second term in the upper bound (18) comes from estimating βols and cΨ. This term is increasing in p, which reflects the fact that estimating βpop is more challenging when p is large. As expected, this term is decreasing in n and |S|, i.e. larger sample size yields better estimates. When the full OLS solution is used (|S| = n), the second term becomes O( p p/n), which suggests that the sample size should be at least of order p for good performance. We note that in the case of Gaussian covariates, n-consistency follows since we don not have a bias term. In certain loss functions, the uniform boundedness assumption on Ψ(2) may be too strict. However, this assumption can be removed by making an additional assumption that the covariates have bounded support. This yields the same decay rate O p p/n in the second term. However, if one needs to preserve the sub-Gaussian assumption on the covariates, then the above analysis when carried out without the uniform bound assumption on Ψ(2) yields the following result. We note that we still require the uniform Lipschitz assumption but one can always utilize the local Lipschitz properties of the function Ψ(2). Theorem 9 Let the assumptions of Theorem 3, Corollary 5, and Proposition 7 hold with E[ Σ 1/2x 2] = µ p. Further assume that supβ: β Σ1/2βols 2 1 Ψ(2)( xi, β ) ψ2 κg and for some c and ϵ the function f(z) = z E Ψ(2)( x, βols z) satisfies f( c) > 1 + ϵ, and the derivative of f in the interval [0, c] does not change sign, i.e., its absolute value is lower bounded by υ > 0. Then, for n and |S| sufficiently large, with probability at least 1 5e p, we have ˆβ sls βpop η1 βpop p + η2 r p min {n/ log(n), |S|}, (21) where the constants η1 and η2 are defined by η1 =ηk cρ (τ/r) Σ1/2 (22) η2 =η c λ1/2 min + υ 1 βols max {(κg + k/ µ), k cκx} , (23) and η > 0 is a constant depending on κx and γ. We observe that this time when the full OLS solution is used (|S| = n), the second term becomes O( p p log(n)/n), which is worse than the rate given in Theorem 8 by a logarithmic factor. In this setting, we require that n/ log(n) should be at least of order p for good performance. Also, note that there is a theoretical threshold for the sub-sampling size |S|, namely O (n/ log(n)), beyond which further sub-sampling provides no improvement. When the covariates are Gaussian, the Lipschitz assumption on the function Ψ(2) is enough the guarantee the sub-Gaussianity of Ψ(2)( xi, β ). However, we note that the condition given in the Theorem is more general. Scalable Approximations for Generalized Linear Problems 5.1. Convergence in ℓ2-norm In the sequel, we assume that the full OLS solution is used (|S| = n). Using Proposition 7 and Lemma 16, we can show that the ℓ2-distance between the estimator ˆβ sls and the population parameter cΨβols converges to 0 with rate n when p is fixed. It is straightforward to see that under Gaussian design, this leads to consistency in ℓ2-norm with convergence rate n. However, the quantity cΨβols is in general different from βpop under general sub Gaussian design; thus, in such cases consistency cannot be achieved in the classical setting where p is fixed (see i.e. Claim 1 in Chen and Banerjee (2017)). In such a scenario, Theorem 3 can be used to obtain an upper bound on the suboptimality resulting from general sub-Gaussian design. In order to talk about consistency when n, p under ℓ2-norm, a normalization by one of the growing dimensions is required, see e.g. Bayati and Montanari (2012); Bayati et al. (2013). Hence, in this regime, we focus on the quantity 1 p ˆβ sls βpop 2 under the assumptions of Corollary 4 with Σ = I. Since we have n-convergence of our estimator to the population parameter cΨβols, consistency reduces to showing that the suboptimality term 1 p cΨβols βpop 2 converges to 0. This term can be controlled with additional sparsity-type structural assumptions on the problem. Below, we discuss two scenarios that may lead to such a setting. Sparsity. Assume that βpop has s non-zero terms, i.e., βpop 0 = s. Following the same proof technique of Theorem 3, we can show that cΨβols βpop 2 C rs p βpop , (24) for some constant C > 0. As long as the non-sparsity level satisfies s = O (p), we can attain consistency in ℓ2-norm. The rate of convergence will be determined by the growing rates of n, p(n) and s(n). For example, to attain n-convergence, one needs s = O (1) and p = O (n), but milder growth conditions can still lead to consistency at the expense of slower convergence. Normality. Similarly, one can assume that a subset of features with size s follow a non-Gaussian distribution, and the remaining p s features are normally distributed. The same argument above can also provides consistency in ℓ2-norm again in this case and the convergence rate will be determined by the ratio p 5.2. Verifying the key assumptions The assumption on f(z) = z E Ψ(2)( x, βols , z) appears in both theorems and depends both on the function Ψ and the underlying covariate distribution. Therefore, in order to verify this assumption rigorously, one needs to know the distribution of the covariates which is not always available. However, a straightforward way to verify this assumption is to use the sample mean estimate ˆf(z) = z 1 n Pn i=1 Ψ(2)( xi, ˆβols z) using the dataset at hand to estimate f(z). In Figure 2, we use this technique to verify that our assumptions are valid for two commonly used loss functions, namely, boosting and logistic regression when the covariates are normally distributed x N(0, 1 p I) and βols 2 = p/4. We observe through the figure that for both loss functions, one can easily verify that the conditions are satisfied. Erdogdu, Bayati, Dicker 0.0 2.5 5.0 7.5 10.0 Boosting Logistic 0.0 2.5 5.0 7.5 10.0 Boosting Logistic Figure 2: We demonstrate how to validate our assumption on f(z) = z E Ψ(2)( x, βols z) on two commonly used loss functions. Left plot shows f(z), whereas right plots shows its derivative. In these plots, covariates are generated from x N(0, I/p), and βols 2 = p/4. The assumption on f(z) can be verified by a more rigorous treatment. For simplicity, we again consider the compressed sensing setting where we assume x N(0, 1 p I), and this time we let βpop 2 = cΨ βols 2 p. In order to obtain a scaling constant cΨ 20, assume that we have βols 2 = p/20. Boosting loss: In this case, we have Ψ(z) = z/2 + p 1 + z2/4 which yields Ψ(2)(z) = 1 4(1 + z2/4) 3/2 and Ψ(4)(z) = 3 64 5z2(1 + z2/4) 2 4 (1 + z2/4)5/2 . (25) We notice that both of these functions are even, i.e., Ψ(2)(z) = Ψ(2)( z), and Ψ(4)(z) = Ψ(4)( z). We use the local convexity for z 0 around z = 2, and obtain Ψ(2)(z) a bz where a = Ψ(2)(2) 2Ψ(3)(2) and b = Ψ(3)(2). For W N(0, 1) and φ denoting the standard normal density, we write f(z) =z E h Ψ(2)( x, βols z) i = z E h Ψ(2)(Wz/20) i , (26) 0 Ψ(2)(wz/20)φ(w)dw 2z Z 20a/bz 0 (a bwz/20)φ(w)dw, = 2z aΦ(20a/bz) a 2π(1 e 200a2/b2z2) . Using the above bound, we observe that for c = 6 and δ = .23, we have f( c) > 1 + δ. Next, we verify the derivative condition on f, that is, f(1) does not change sign in the interval [0, c]. By the Stein s lemma, we write f(1)(z) = E h Ψ(2)(Wz/20) i + (z/20)2E h Ψ(4)(Wz/20) i . (27) Using the previous inequalities together with the above expression, we have f(1)(z) E h Ψ(2)(Wz/20) i 9 100 2 aΦ(20a/bz) a 2π(1 e 200a2/b2z2) 27 1600 0.188, Scalable Approximations for Generalized Linear Problems which verifies the derivative assumption on f(z). Logistic regression: In this case, we have Ψ(z) = log(1 + ez) which yields Ψ(2)(z) = ez (1 + ez)2 and Ψ(4)(z) = ez(1 4ez + e2z) (1 + ez)4 . (29) As before, both Ψ(2) and Ψ(4) are even functions. Again, appealing to the local convexity properties, for z 0 around z = 2.5, we obtain Ψ(2)(z) a bz where a = Ψ(2)(2.5) 2.5Ψ(3)(2.5) and b = Ψ(3)(2.5). Using (26), we obtain f(z) 2z aΦ(20a/bz) a 2π(1 e 200a2/b2z2) . (30) Using the above bound, we see that for c = 6, and δ = .22 we have f( c) > 1 + δ. For the derivative bound, we again use (27), and obtain that for z [0, c], f(1)(z) E h Ψ(2)(Wz/20) i 9 100 Ψ(4) E h Ψ(2)(Wz/20) i 9 800 0.19, (31) which verifies the assumption. 6. Converting One GLM to Another In this section, we assume that one has already obtained a set of coefficients by minimizing a certain GLM loss, but wants to minimize another GLM loss using the same dataset. It is often the case that a practitioner would like to change the loss function (or equivalently the model) based on its performance. However, when the dataset is large, training a new model from scratch is extremely time consuming. In the following, we will use the proportionality relation to transition between different loss functions. Assume that a practitioner fitted a GLM using the loss function (or cumulant generating function) Ψ1, but they would like to train a new model based on the second loss function Ψ2. Instead of minimizing the new loss based on Ψ2, one can exploit the proportionality relation and obtain the coefficients for the new GLM problem. Denote by βpop 1 and βpop 2 the GLM coefficients corresponding to the loss functions Ψ1 and Ψ2, respectively. Using the results in the previous sections, we have 1 cΨ1 βpop 1 = 1 cΨ2 βpop 2 = βols, for the normal case, that is, both coefficients are proportional to the OLS coefficients which does not depend on the loss function. Therefore, these coefficients βpop 1 and βpop 2 are also proportional to each other and we can write βpop 2 = cΨ2 cΨ1 βpop 1 := ρ βpop 1 , (32) where the proportionality constant between two different GLM types turns out to be the ratio between cΨ1 and cΨ2, i.e. ρ = cΨ2/cΨ1. Using the definition of cΨ2, we write 1 = cΨ2 E h Ψ(2) 2 ( x, βpop 2 ) i , = cΨ1ρ E h Ψ(2) 2 ( x, βpop 1 ρ) i . Erdogdu, Bayati, Dicker Algorithm 2 Conversion from one GLM to another Input: Data (yi, xi)n i=1, and ˆβglm 1 Step 1. Compute ˆy = Xˆβglm 1 , and κ = 1 n Pn i=1 Ψ(2) 1 (ˆyi). Step 2. Solve the following equation for ρ R: κ = ρ n Pn i=1 Ψ(2) 2 (ˆyiρ) Use Newton s root-finding method: Initialize ρ = 1; Repeat until convergence: n Pn i=1 Ψ(2) 2 (ρ ˆyi) κ 1 n Pn i=1 n Ψ(2) 2 (ρ ˆyi) + ρ ˆyiΨ(3) 2 (ρ ˆyi) o. Output: ˆβglm 2 = ρ ˆβglm 1 . Dividing the both sides by cΨ1 and using the equality c 1 Ψ1 = E h Ψ(2) 1 ( x, βpop 1 ) i , we obtain E h Ψ(2) 1 ( x, βpop 1 ) i = ρ E h Ψ(2) 2 ( x, βpop 1 ρ) i . The above equation only involves βpop 1 as the coefficients (which is assumed to be already known/obtained by the practitioner). Therefore, if we solve it for the ratio ρ, we can estimate βpop 2 by simply using the proportionality relation given in (32). The procedure described above is summarized as Algorithm 2. We emphasize that this procedure does not require the computation of the OLS estimator which was the main cost of SLS, and it only requires a per-iteration cost of O (n). In other words, conversion from one GLM type to another is easier compared to obtaining the GLM coefficients from scratch. 7. Binary Classification with Proper Scoring Rules In this section, we assume that for i [n], the response is binary yi {0, 1}. The binary classification problem can be described by the following minimization of an empirical risk minimize β Rp 1 n i=1 ℓ(yi; q( xi, β )), (33) where ℓand q are referred to as the loss and the link functions, respectively. There are various loss functions that are used in practice. Examples include log-loss, boosting loss, square loss etc (See Table 1). As before, we constrain our analysis on the canonical links. The concept of canonical links for binary classification is introduced by Buja et al. (2005), and it is quite similar to the generalized linear problems. For any given loss function, we define the partial losses ℓk( ) = ℓ(y = k; ) for k {0, 1}. Since we have a binary response variable, we can write any loss in the following format ℓ(y; q) =yℓ1(q) + (1 y)ℓ0(q), =y (ℓ1(q) ℓ0(q)) + ℓ0(q). Scalable Approximations for Generalized Linear Problems Table 1: Common loss functions and their inverse canonical links Name Loss function: ℓ(y; q) Weight: w(q) Inverse Canonical link: q Log-loss y log(q) (1 y) log(1 q) 1 q(1 q) 1 1+exp( z) Boosting loss y(q 1 1)1/2 + (1 y)(q 1 1) 1/2 1 [q(1 q)]3/2 1 2 + z/4 2(z2/4+1)1/2 Square loss y(1 q)2 + (1 y)q2 1 1+z The above formulation is of the form of a generalized linear problem. Before moving forward, we recall the concept of proper scoring in binary classification, which is sometimes referred to as Fisher consistency. Definition 10 (Proper scoring rules) Assume that y Bernoulli(η). If the expected loss E [ℓ(y, q)] is minimized by q = η for all η (0, 1), we call the loss function a proper scoring rule. The following theorem by Schervish (1989) provides a methodology for constructing a loss function for the proper scoring rules. Theorem 11 (Schervish (1989)) Let w(dt) be a positive measure on (0, 1) that is finite on interval (ϵ, 1 ϵ) ϵ > 0. Then the following defines a proper scoring rule ℓ1(q) = Z 1 q (1 t)w(dt), and ℓ0(q) = Z q The measure w(dt) uniquely defines the loss function (generally referred to as the weight function, since all losses can be written as weighted average of cost weighted misclassification error (Buja et al., 2005; Reid and Williamson, 2010)). Examples of weight functions is given in Table 1. The above theorem has many interesting interpretations; one that is most useful to us is that ℓ(1) 0 (q) = qw(q). The notion of canonical links for proper scoring rules are introduced by Buja et al. (2005), which corresponds to the notion of matching loss (Helmbold et al., 1999; Reid and Williamson, 2010). The derivation of canonical links stems from the Hessian of the above minimization, which remedies two potential problems: non-convexity and asymptotic variance inflation. It turns out that by setting w(q)q(1) as constant, one can remedy both problems (Buja et al., 2005). We will skip the derivation and, without loss of generality, assume that the canonical link-loss pair satisfies w(q)q(1) = 1. Note that any loss function has a natural canonical link. The following Theorem summarizes this concept. Theorem 12 (Buja et al. (2005)) For proper scoring rules with w > 0, there exists a canonical link function which is unique up to addition and multiplication by constants. Conversely, any link function is canonical for a unique proper scoring rule. The canonical link for a given loss function can be explicitly derived from the equation w(q)q(1) = 1. We have provided some examples in Table 1. Using the definition of canonical Erdogdu, Bayati, Dicker link for proper scoring rules, we write the normal equations d dβE [ℓ(y, q( x, β ))] = 0 as E h xq(1)( x, β )ℓ(1) 0 (q( x, β )) i = E h yxq(1)( x, β ) ℓ(1) 0 (q( x, β )) ℓ(1) 1 (q( x, β )) i , = E h xq(1)( x, β )q( x, β )w(q( x, β )) i = E h yxq(1)( x, β )w(q( x, β )) i , = E [xq( x, β )] = E [yx] , = ΣβE h q(1)( x, β ) i = E [yx] . The last equation provides us with the analog of the proportionality relation we observed in generalized linear problems. In this case, we observe that the proportionality constant becomes 1/E q(1)( x, β ) . Therefore, our algorithm can be used to obtain a fast training procedure for the binary classification problems under canonical links as well. 7.1. Canonicalization of the Square Loss Consider a minimization problem of the following form minimize β 1 n i=1 [yi f( xi, β )]2. (34) The above problem is commonly encountered in many machine learning tasks specifically, in the context of neural networks, the function f is called the activation function. Here, we consider a toy example to demonstrate how our methodology can be useful in a minimization problem of the above form. We first use Taylor series expansion around a point θ (which should be close to x, β ), in order to approximate the function f(z) with a linear function around f(θ). We write min β E (y f( x, β ))2 min β E f( x, β )2 2y x, β f (θ) . (35) The above approximation yields Ψ(z) = f(z)2 2f (θ), (36) and the proportionality relation given in previous sections would hold for the approximation, which will be accurate when the activation function is smooth around the user-specified point θ. This method can be used to derive proportionality relations for GLMs with noncanonical links (conditional on the link being nice), and also may be of interest in non-convex optimization, e.g., reliable initialization. 8. Experiments 8.1. Numerical Analysis of Convergence This section contains the results of numerical studies that demonstrate the convergence results presented in Section 5. We first demonstrate the convergence of the population Scalable Approximations for Generalized Linear Problems 25 50 75 100 p |βpop cΨβols| |βpop| Convergence of Population Parameters 4 5 6 log10(n) SLS vs MLE : Computation 4 5 6 log10(n) SLS vs MLE : Log Accuracy Figure 3: Logistic regression with iid Bernoulli 1 design. The left plot shows the relative error between the population parameters. The middle plot shows the computational cost (time) for finding the MLE and SLS as n grows and p = 200. The right plot depict the accuracy of the estimators in log scale. In the regime where the MLE is expensive to compute, the SLS is found much more rapidly and has the same accuracy. R s built-in functions are used to find the MLE. parameters with respect to the dimension. In this experiment, we estimate the relative error βpop cΨβols / βpop using a Monte-Carlo simulation. The dimension takes values from p {5, 10, 15, ..., 100}, and for each fixed p, we choose the sample size n(p) = Kp where K = 10, 000. βpop = (1, 1, ..., 1)/ p is fixed throughout this experiment. Covariates are generated from a Bernoulli distribution P (xij = 1) = 0.5, and the response is generated from the logistic model. We estimate βols using coordinatewise average of the ordinary least squares estimator over the whole dataset. The scaling constant is estimated by the inverse of 1 n Pn i=1 Ψ(2)( xi, βpop ). We repeat this experiment 100 times for each p, and the results are reported in left plot in Figure 3. We observe that the relative error decreases with increasing dimension. The second experiment demonstrates the effect of sample size. In this experiment, we fix p = 200, and the sample size varies n 103 {2, 3, 5, 15, 25, 50, 100, 200, 500, 1000}. Covariates are again generated from Bernoulli distribution P (xij = 1) = 0.5, and the response is generated from the logistic model. Maximum likelihood estimator is obtained through R s built-in function glm. We report the computation time of these estimators as well as their estimation error in the middle and the right plots in Figure 3, respectively. We observe that MLE performs consistently better in terms of test error, yet it is significantly expensive to compute when n gets larger. In the large-scale regime, the difference between the achieved test errors becomes negligible as seen in the right plot. This is the regime that SLS provides significant advantage. We also emphasize that MLE and SLS has the same convergence rate as can be observed in the right plot. 8.2. Comparisons With Other Optimization Methods This section contains the results of a variety of numerical studies, which show that the Scaled Least Squares estimator reaches its minimum achievable test error substantially faster than commonly used batch algorithms for finding the MLE. Both logistic and Poisson regression models (two types of GLMs) are utilized in our analyses, which are based on several synthetic and real datasets. Below, we briefly describe the optimization algorithms for the MLE that were used in the experiments. Erdogdu, Bayati, Dicker Random start OLS start Logis0c Regression 0 10 20 30 40 50 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Covariates ~ Σ x {Exp(1) 1} 0 10 20 30 40 50 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Higgs dataset 0 10 20 30 40 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Higgs dataset 0 5 10 15 20 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Covariates ~ Σ x {Exp(1) 1} Figure 4: We compared the performance of SLS to that of MLE for the logistic regression problem on several datasets. MLE optimization is solved by various optimization algorithms. SLS is represented with red straight line. The details are provided in Table 2. 1. Newton-Raphson (NR) achieves locally quadratic convergence by scaling the gradient by the inverse of the Hessian evaluated at the current iterate. Computing the Hessian has a per-iteration cost of O np2 , which makes it impractical for large-scale datasets. 2. Newton-Stein (NS) is a recently proposed second-order batch algorithm specifically designed for GLMs (Erdogdu, 2015, 2016). The algorithm uses Stein s lemma and sub-sampling to efficiently estimate the Hessian with a cost of O (np) per-iteration, achieving near quadratic rates. 3. Broyden-Fletcher-Goldfarb-Shanno (BFGS) is the most popular and stable quasi-Newton method (Nesterov, 2004). At each iteration, the gradient is scaled by a matrix that is formed by accumulating information from previous iterations and gradient computations. The convergence is locally super-linear with a per-iteration cost of O (np). Scalable Approximations for Generalized Linear Problems Random start OLS start Poisson Regression 0 10 20 30 40 Time (sec) log(Test Error) SLS NR NS BFGS LBFGS GD AGD Poi Reg / Covariates ~ Σ x Ber( 1) 0 10 20 30 40 Time (sec) log(Test Error) SLS NR NS BFGS LBFGS GD AGD Poi Reg / Covariates ~ Σ x Ber( 1) 0.0 2.5 5.0 7.5 10.0 Time (sec) log(Test Error) SLS NR NS BFGS LBFGS GD AGD Poi Reg / Covertype dataset 0.0 2.5 5.0 7.5 10.0 Time (sec) log(Test Error) SLS NR NS BFGS LBFGS GD AGD Poi Reg / Covertype dataset Figure 5: We compared the performance of SLS to that of MLE for the Poisson regression problem on several datasets. MLE optimization is solved by various optimization algorithms. SLS is represented with red straight line. Details are given in Table 2. 4. Limited memory BFGS (LBFGS) is a variant of BFGS, which uses only the recent iterates and gradients to approximate the Hessian, providing significant improvement in terms of memory usage. LBFGS has many variants; we use the formulation given in Bishop (1995). 5. Gradient descent (GD) takes a step in the opposite direction of the gradient, evaluated at the current iterate. Its performance strongly depends on the condition number of the design matrix. Under certain assumptions, the convergence is linear with O (np) per-iteration cost. 6. Accelerated gradient descent (AGD) is a modified version of gradient descent with an additional momentum term (Nesterov, 1983). Its per iteration cost is O (np) and its performance strongly depends on the smoothness of the objective function. For all the algorithms for computing the MLE, the step size at each iteration is chosen via the backtracking line search (Boyd and Vandenberghe, 2004). Erdogdu, Bayati, Dicker Table 2: Details of the experiments shown in Figures 4 and 5 with simulated and real datasets HIGGS (Baldi et al., 2014), and Covertype (Blackard and Dean, 1999) Model Logistic regression Poisson regression Dataset Σ {Exp(1)-1} Higgs Σ Ber( 1) Covertype Size n = 6.0 105, p = 300 n = 1.1 107, p = 29 n = 6.0 105, p = 300 n = 5.8 105, p = 53 Initial Rnd Ols Rnd Ols Rnd Ols Rnd Ols Plot (a) (b) (c) (d) (e) (f) (g) (h) Method Time in seconds / number of iterations (to reach min test error) Sls 8.34/4 2.94/3 13.18/3 9.57/3 5.42/5 3.96/5 2.71/6 1.66/20 Nr 301.06/6 82.57/3 37.77/3 36.37/3 170.28/5 130.1/4 16.7/8 32.48/18 Ns 51.69/8 7.8/3 27.11/4 26.69/4 32.71/5 36.82/4 21.17/10 282.1/216 Bfgs 148.43/31 24.79/8 660.92/68 701.9/68 67.24/29 72.42/26 5.12/7 22.74/59 Lbfgs 125.33/39 24.61/8 6368.1/651 6946.1/670 224.6/106 357.1/88 10.01/14 10.05/17 Gd 669/138 134.91/25 100871/10101 141736/13808 1711/513 1364/374 14.35/25 33.58/87 Agd 218.1/61 35.97/12 2405.5/251 2879.69/277 103.3/51 102.74/40 11.28/15 11.95/25 Recall that the proposed Algorithm 1 is composed of two steps; the first finds an estimate of the OLS coefficients. This up-front computation is not needed for any of the MLE algorithms described above. On the other hand, each of the MLE algorithms requires some initial value for β, but no such initialization is needed to find the OLS estimator in Algorithm 1. This raises the question of how the MLE algorithms should be initialized, in order to compare them fairly with the proposed method. We consider two scenarios in our experiments: first, we use the OLS estimator computed for Algorithm 1 to initialize the MLE algorithms; second, we use a random initial value. On each dataset, the main criterion for assessing the performance of the estimators is how rapidly the minimum test error is achieved. The test error is measured as the mean squared error of the estimated mean using the current parameters at each iteration on a test dataset, which is a randomly selected (and set-aside) 10% portion of the entire dataset. As noted previously, the MLE is more accurate for small n (see Figure 1). However, in the regime considered here (n p 1), the MLE and the SLS perform very similarly in terms of their error rates; for instance, on the Higgs dataset, the SLS and MLE have test error rates of 22.40% and 22.38%, respectively. For each dataset, the minimum achievable test error is set to be the maximum of the final test errors, where the maximum is taken over all of the estimation methods. Let Σ(1) and Σ(2) be two randomly generated covariance matrices. The datasets we analyzed were: (i) a synthetic dataset generated from a logistic regression model with iid {exponential(1) 1} predictors scaled by Σ(1); (ii) the Higgs dataset (logistic regression) Baldi et al. (2014); (iii) a synthetic dataset generated from a Poisson regression model with iid binary( 1) predictors scaled by Σ(2); (iv) the Covertype dataset (Poisson regression) Blackard and Dean (1999). In all cases, the SLS outperformed the alternative algorithms for finding the MLE by a large margin, in terms of computation. Detailed results may be found in Figures 4 and 5, and Table 2. We provide additional experiments with different datasets in Section 12. Scalable Approximations for Generalized Linear Problems 9. Discussion In this paper, we showed that the true minimizer of a generalized linear problem and the OLS estimator are approximately proportional under the general random design setting. Using this relation, we proposed a computationally efficient algorithm for large-scale problems that achieves the same accuracy as the empirical risk minimizer by first estimating the OLS coefficients and then estimating the proportionality constant through iterations that can attain quadratic or cubic convergence rate, with only O (n) per-iteration cost. We briefly mentioned that the proportionality between the coefficients holds even when there is regularization in Section 3.1. Further pursuing this idea may be interesting for large-scale problems where regularization is crucial. Another interesting line of research is to find similar proportionality relations between the parameters in other large-scale optimization problems such as support vector machines. Such relations may reduce the problem complexity significantly. 10. Proof of Main Results In this section, we provide the details and the proofs of our technical results. For convenience, we briefly state the following definitions. Definition 13 (Sub-Gaussian) For a given constant κ, a random variable x R is said to be sub-Gaussian if it satisfies supm 1 m 1/2E [|x|m]1/m κ. Smallest such κ is the sub Gaussian norm of x and it is denoted by x ψ2. Similarly, a random vector y Rp is a sub-Gaussian vector if there exists a constant κ such that supv Sp 1 y, v ψ2 κ . Definition 14 (Sub-exponential) For a given constant κ, a random variable x R is called sub-exponential if it satisfies supm 1 m 1E [|x|m]1/m κ. Smallest such κ is the sub-exponential norm of x and it is denoted by x ψ1. Similarly, a random vector y Rp is a sub-exponential vector if there exists a constant κ such that supv Sp 1 y, v ψ1 κ . The proof of Theorem 3 is given next. Proof [of Theorem 3] For simplicity, we denote the whitened covariate by w = Σ 1/2x. Since w is sub-Gaussian with norm κ, its j-th entry wj has bounded third moment. That is, κ = sup u 2=1 u, w ψ2 wj ψ2 = sup m 1 m 1/2E [|wj|m]1/m 1 3E |wj|3 1/3 , (37) where in the first step, we used u = ej, the j-th standard basis vector. Hence, we obtain a bound on the third moment, i.e, max j E |wj|3 33/2κ3. (38) Using the normal equations, we write E [yx] = E h xΨ(1)( x, β ) i =Σ1/2E h wΨ(1)( w, Σ1/2β ) i , (39) =Σ1/2E h wΨ(1)( w, β ) i , Erdogdu, Bayati, Dicker where we defined β = Σ1/2β. By multiplying both sides with Σ 1, we obtain βols = Σ 1/2E h wΨ(1)( w, β ) i . (40) Now we define the partial sums W i = P j =i βjwj = β, w βiwi. We will focus on the i-th entry of the above expectation given in (40). Denoting the zero biased transformation of wi conditioned on W i by w i , we have E h wiΨ(1)( w, β ) i =E h E h wiΨ(1) βiwi + W i W i ii , (41) = βi E h Ψ(2)( βiw i + W i) i , = βi E h Ψ(2)( βi(w i wi) + w, β ) i , where in the second step, we used the assumption on conditional moments. Let D be a diagonal matrix with diagonal entries Dii = E h Ψ(2)( βi(w i wi) + w, β ) i . Using (40) together with (41), we obtain the equality βols =Σ 1/2D β = Σ 1/2DΣ1/2β. (42) Now, using the Lipschitz continuity assumption of the variance function, we have E h Ψ(2)( βi(w i wi) + w, β ) i E h Ψ(2)( w, β ) i k| βi|E [|w i wi|] . (43) In the following, we will use the properties of zero-biased transformations. Consider the quantity r = sup E |w i wi| W i E |wi|3 W i (44) where w i has wi-zero biased distribution (conditioned on W i) and the supremum is taken with respect to all random variables with mean 0, standard deviation 1 and finite third moment, and w i is achieving the minimal ℓ1 coupling to wi conditioned on W i. It is shown in Goldstein (2007) that the above bound holds for r = 1.5 for the unconditional zero-bias transformations. Here, we take a similar approach to show that the same bound holds for the conditional case as well. By using the triangle inequality, we have E |w i wi| W i E |w i | W i + E |wi| W i (45) 2E |wi|3 W i + E |wi|3 W i 1/3 . Since E |wi|2 W i is constant, it is equal to E |wi|2 = 1. This yields that the second term in the last line is upper bounded by E |wi|3 W i . Consequently, by taking expectations over both hand sides we obtain that E [|w i wi|] 1.5 E |wi|3 . (46) Scalable Approximations for Generalized Linear Problems Then the right hand side of (43) can be upper bounded by k| βi|E [|w i wi|] rk max i n | βi|E |wi|3 o 8kκ3 Σ1/2β , (47) where in the second step we used the bound on the third moment given in (38). The last inequality provides us with the following result, 8kκ3 Σ1/2β . (48) Finally, combining this with (40) and (42), we obtain βols 1 cΨ β = Σ 1/2DΣ1/2β 1 = Σ 1/2 D 1 cΨ I Σ1/2β , 8kκ3ρ Σ1/2 βpop 2 , which completes the proof. Proof [of Corollary 4] Due to diagonal dominance property, we have Σ1/2 = max i Σ1/2 ij 2 max i Σ1/2 ii 2 Σ 1/2 2 . (50) Since we have x 2 r, we write r2 E h x 2 2 i = Tr (Σ) p Σ 2 /ρ2. (51) Combining this with the previous inequality we obtain Finally, the result follows from using the last bound in Theorem 3. Proof [of Corollary 5] The result directly follows from using the r-well-spreadness assumption in Theorem 3. Proof [of Proposition 7] For convenience, we denote the whitened covariates by wi = Σ 1/2xi. We have E [wi] = 0, E wiw T i = I, and wi ψ2 κ. Also denote the sub-sampled covariance matrix with bΣ = 1 |S| P i S xix T i , and its whitened version as eΣ = 1 |S| P i S wiw T i . Further, define ˆζ = 1 n Pn i=1 wiyi and ζ = E [wy]. Then, we have ˆβols = bΣ 1Σ1/2ˆζ and βols = Σ 1/2ζ. Erdogdu, Bayati, Dicker For now, we work on the event that bΣ is invertible. We will see that this event holds with very high probability. We write Σ1/2(ˆβols βols) 2 = Σ1/2 bΣ 1Σ1/2ˆζ Σ 1/2ζ 2 , (53) = eΣ 1 n ˆζ ζ + I Σ 1/2 bΣΣ 1/2 ζ o 2 , n ˆζ ζ 2 + I eΣ 2 ζ 2 o , where we used the triangle inequality and the properties of the operator norm. For the first term on the right hand side of (53), we write λmin(eΣ) 1 1 δ, where we assumed that such a δ > 0 exists. In fact, when δ < 0.5, we obtain a bound of 2 on the right hand side, which also justifies the invertibility assumption of bΣ. By Corollary 5.50 of Vershynin (2010) and the following remark, we have with probability at least 1 2 exp { p}, eΣ I 2 c r p where c is a constant depending only on κ. When |S| > 4c2p, we obtain λmin(eΣ) 1 eΣ I 2 0.5, where the first inequality follows from the Lipschitz property of the eigenvalues. Next, we bound the difference between ˆζ and its expectation ζ. We write the bounds on the sub-exponential norm wy ψ1 = sup v 2=1 sup m 1 m 1E [| v, w y|m]1/m , (54) sup v 2=1 sup m 1 m 1E | v, w |2m 1/2m E |y|2m 1/2m , sup v 2=1 sup m 1 m 1/2E | v, w |2m 1/2m sup m 1 m 1/2E |y|2m 1/2m , 2 w ψ2 y ψ2 = 2γκ. Hence, we have maxi wiyi E [wiyi] ψ1 4γκ. Further, let ej denote the j-th standard basis, and notice that each entry of w is also sub-Gaussian with norm upper bounded by κ, i.e., κ = w ψ2 = sup u 2=1 u, w ψ2 ej, w ψ2 = wj ψ2 . Scalable Approximations for Generalized Linear Problems Also, we can write 2γκ wy ψ1 = sup u 2=1 sup m 1 m 1E [| u, w y|m]1/m , (55) sup u 2=1 E [| u, w y|] , sup u 2=1 E [ u, w y] = sup u 2=1 u, ζ = ζ 2 , where in the last step, we used the fact that dual norm of ℓ2 norm is itself. Next, we apply Lemma 17 to ˆζ ζ, and obtain with probability at least 1 exp { p} ˆζ ζ 2 cγκ r p whenever n > c2p for an absolute constant c. Combining the above results in (53), we obtain with probability at least 1 3 exp { p} Σ1/2(ˆβols βols) 2 2 c1γκ r p n + c2γκ r p where η depends only on κ and γ, and |S| > ηp. Finally, we write ˆβols βols 2 λ 1/2 min Σ1/2(ˆβols βols) 2 , with probability at least 1 3 exp { p}, whenever |S| > ηp. The following lemma combined with the Proposition 7 provides the necessary tools to prove Theorem 9. Lemma 15 For a given function Ψ(2) that is Lipschitz continuous with constant k, we define the function f : R Rp R as f(c, β) = c E Ψ(2)( x, β c) , and its empirical counterpart as ˆf(c, β) = c 1 i=1 Ψ(2)( xi, β c). For the ball centered around βols with radius δ, Bδ( βols) = n β : β βols 2 δ o , with βols = Σ1/2βols, assume that supβ Bδ( βols) Ψ(2)( x, β ) ψ2 κg, and Σ 1/2xi ψ2 κx. Further, for some ϵ, c > 0, assume that f( c, βols) 1+ϵ. Then, cΨ > 0 satisfying the equation 1 = f(cΨ, βols). Erdogdu, Bayati, Dicker Further, assume that for some ϵ > 0, we have ϵ = ϵ p, and n and |S| sufficiently large, i.e., min n log(n), |S| > K2/ ϵ2 for K = η c max {b + κ/ µ, k cκ}. Then, with probability 1 5 exp { p}, there exists a constant ˆcΨ (0, c) satisfying the equation 1 = ˆcΨ 1 n i=1 Ψ(2)( xi, ˆβols ˆcΨ). Moreover, if the derivative of z f(z, βols) is bounded below in absolute value (i.e. does not change sign) by υ > 0 in the interval z [0, c], then with probability 1 5 exp { p}, we have |ˆcΨ cΨ| C r p min {n/ log (n) , |S|}, where C = K/υ. Proof [of Lemma 15] First statement is obvious. We notice that f(c, βols) is a continuous function in its first argument with f(0, βols) = 0 and f( c, βols) 1 + δ. Hence, there exists cΨ > 0 such that f(cΨ, βols) = 1. If there are many solutions to the above equation, we choose the one that is closest to zero. The condition on the derivative will guarantee the uniqueness of the solution. Next, we will show the existence of ˆcΨ using a uniform concentration given by Lemma 18. Define the event E that ˆβols falls into Bδ Σ(βols), i.e., E = n ˆβols Bδ Σ(βols) o . By Proposition 7 and the inequality given in (56), whenever |S| > ηp, we obtain P EC 3 exp { p} , where EC denotes the complement of the event E, and η is a constant depending only on κx and γ, and δ = η p p/|S|. For any c [0, c], on the event E, we have ˆf(c, ˆβols) f(c, ˆβols) sup β Bδ Σ(βols) ˆf(c, β) f(c, β) . Hence, we obtain the following inequality sup c [0, c] ˆf(c, ˆβols) f(c, ˆβols) > ϵ sup c [0, c] ˆf(c, ˆβols) f(c, ˆβols) > ϵ; E sup c [0, c] sup β Bδ Σ(βols) ˆf(c, β) f(c, β) > ϵ + 3 exp { p} . Scalable Approximations for Generalized Linear Problems In the following, we will use Lemma 18 for the first term in the last line above. Denoting by w, the whitened covariates, we have x, β = w, Σ1/2β . Therefore, sup c [0, c] sup β Bδ Σ(βols) ˆf(c, β) f(c, β) c sup c [0, c] sup β Bδ Σ(βols) i=1 Ψ(2)( wi, Σ1/2β c) E h Ψ(2)( w, Σ1/2β c) i . Next, define the ball centered around βols = Σ1/2βols, with radius δ as Bδ( βols) = Σ1/2Bδ Σ(βols). We have β Bδ Σ(βols) if and only if Σ1/2β Bδ( βols). Then, the right hand side of the above inequality can be written as c sup c [0, c] sup β Bδ( βols) i=1 Ψ(2)( wi, β c) E h Ψ(2)( w, β c) i , = c sup β B cδ( βols) i=1 Ψ(2)( wi, β ) E h Ψ(2)( w, β ) i . Then, by Lemma 18, we obtain sup c [0, c] ˆf(c, ˆβols) f(c, ˆβols) > c c(κg + κx/ µ) r p n/ log (n) 5 exp { p} (57) whenever np > 51 max χ, χ 1 where χ = (κg + κx/ µ)2/(c δ2k2 c2 µ2). Also, by the Lipschitz condition for Ψ(2), we have for any c [0, c], and β1, β2, |f(c, β1) f(c, β2)| kc2E h w, Σ1/2(β1 β2) i k c2κx Σ1/2(β1 β2) 2 . Applying the above bound for β1 = ˆβols and β2 = βols, we obtain with probability 1 3 exp { p} f(c, ˆβols) f(c, βols) ηk c2κx r p where the last step follows from Proposition 7 and the inequality given in (56). Combining this with the previous bound, and taking into account that µ = µ p, for any c [0, c], with probability 1 5 exp { p}, we obtain ˆf(c, ˆβols) f(c, βols) c c(κg + κx/ µ) r p n/ log (n) + ηk c2κx r p K r p min {n/ log (n) , |S|} where K = η c max {κg + κx/ µ, k cκx}. Here, η depends only on κx and γ. Erdogdu, Bayati, Dicker In particular, for c = c we observe that ˆf( c, ˆβols) f( c, βols) K r p min {n/ log (n) , |S|} 1 + ϵ K r p min {n/ log (n) , |S|}. Therefore, for sufficiently large n and |S| satisfying min n n log(n), |S| o > K2/ ϵ2 we obtain ˆf( c, ˆβols) > 1. Since this function is continuous and ˆf(0, ˆβols) = 0, we obtain the existence of ˆcΨ [0, c] with probability at least 1 5 exp { p}. Now, since ˆcΨ and cΨ satisfy the equations ˆf(ˆcΨ, ˆβols) = f(cΨ, βols) = 1 (with high probability), by the inequality given in (57), with probability at least 1 5 exp { p}, we obtain 1 f(ˆcΨ, ˆβols) = ˆf(ˆcΨ, ˆβols) f(ˆcΨ, ˆβols) c c(κg + κx/ µ) r p n/ log(n). Also, by the same argument in (60), and Proposition 7, we get f(ˆcΨ, ˆβols) f(ˆcΨ, βols) k c2κ Σ(ˆβols βols) 2 ηk c2κx r p Now, using the Taylor s series expansion of c f(c, βols) around cΨ, and the assumption on the derivative of f with respect to its first argument, we obtain υ |ˆcΨ cΨ| f(ˆcΨ, βols) f(cΨ, βols) f(ˆcΨ, βols) f(ˆcΨ, ˆβols) + f(ˆcΨ, ˆβols) 1 ηk c2κx r p |S| + c c(κg + κx/ µ) r p n/ log(n) K r p min {n/ log (n) , |S|} with probability at least 1 5 exp { p}. Here, the constant K is the same as before K = η c max {κg + κx/ µ, k cκx} . Lemma 16 For a given function Ψ(2) that is Lipschitz continuous with constant k, and uniformly bounded by b, we define the functions f, ˆf, and the constants ϵ, cΨ, c as in Lemma 15. Assume that xi s are i.i.d. sub-Gaussian with xi ψ2 κx. Then, with probability 1 4 exp { p}, there exists a constant ˆcΨ (0, c) satisfying the equation 1 = ˆcΨ 1 n i=1 Ψ(2)( xi, ˆβols ˆcΨ). Scalable Approximations for Generalized Linear Problems Moreover, if the derivative of z f(z, βols) is bounded below in absolute value (i.e. does not change sign) by υ > 0 in the interval z [0, c], then with probability 1 4 exp { p}, we have |ˆcΨ cΨ| K r p where K = ηυ 1 ck max n (1 + β 2) Σ 2 + b/k, c o , η depends only on γ and κx, and |S| is sufficiently large. Proof [of Lemma 16] The proof follows from the same steps in Lemma 15. First statement is obvious. Next, we will show the existence of ˆcΨ using a uniform concentration given by Lemma 19. Let the ellipsoids Bδ Σ(βols) and Bδ( βols), and the event E be as in Lemma 15. By Proposition 7 and the inequality given in (56), whenever |S| > ηp, we obtain P EC 3 exp { p} , where EC denotes the complement of the event E, and η is a constant depending only on κx and γ. For any c [0, c], on the event E, we have ˆf(c, ˆβols) f(c, ˆβols) sup β Bδ Σ(βols) ˆf(c, β) f(c, β) . Using the same arguments as in the previous lemma that led to (57), we obtain the following inequality sup c [0, c] ˆf(c, ˆβols) f(c, ˆβols) > ϵ sup c [0, c] sup β Bδ Σ(βols) ˆf(c, β) f(c, β) > ϵ + 3 exp { p} , sup β B cδ( βols) i=1 Ψ(2)( wi, β ) E h Ψ(2)( w, β ) i > ϵ/ c + 3 exp { p} where B cδ( βols) is the ball centered around βols = Σ1/2βols with radius δ = η p p/|S|. Then, using Lemma 19, we obtain sup c [0, c] ˆf(c, ˆβols) f(c, ˆβols) > 2 c k β 2 + η r p Σ 2 + b r p 4 exp { p} . By the Lipschitz property of Ψ(2) as well as sub-Gaussian property of x, we have for any c [0, c], and β1, β2, |f(c, β1) f(c, β2)| kc2E h w, Σ1/2(β1 β2) i k c2κx Σ1/2(β1 β2) 2 . Erdogdu, Bayati, Dicker Applying the above bound for β1 = ˆβols and β2 = βols, we obtain with probability 1 3 exp { p} f(c, ˆβols) f(c, βols) ηk c2κx r p where the last step follows from Proposition 7 and the inequality given in (56). Combining this with the previous bound, for any c [0, c], with probability 1 4 exp { p}, we obtain ˆf(c, ˆβols) f(c, βols) 2 c k β 2 Σ 2 + ηk Σ 2 |S| + b r p n + ηk c2κx r p n + η Σ 2 p p |S|n + b r p n + η cκx r p η ck max n (1 + β 2) Σ 2 + b/k, c o r p |S| = K r p where K = η ck max n (1 + β 2) Σ 2 + b/k, c o , and |S| > ηp. Here, η depends only on the constants κx and γ. In particular, for c = c we observe that ˆf( c, ˆβols) f( c, βols) K r p 1 + ϵ K r p Therefore, for sufficiently large |S| satisfying |S| > K2p, we obtain ˆf( c, ˆβols) > 1. Since this function is continuous and ˆf(0, ˆβols) = 0, we obtain the existence of ˆcΨ [0, c] with probability at least 1 4 exp { p}. Now, since ˆcΨ and cΨ satisfy the equations ˆf(ˆcΨ, ˆβols) = f(cΨ, βols) = 1 (with high probability), by the inequality given in (59), with probability at least 1 4 exp { p}, we obtain 1 f(ˆcΨ, ˆβols) = ˆf(ˆcΨ, ˆβols) f(ˆcΨ, ˆβols) 2 c k β 2 + η r p Σ 2 + b r p Now, using the Taylor s series expansion of c f(c, βols) around cΨ, and the assumption on the derivative of f with respect to its first argument, we obtain υ |ˆcΨ cΨ| f(ˆcΨ, βols) f(cΨ, βols) f(ˆcΨ, βols) f(ˆcΨ, ˆβols) + f(ˆcΨ, ˆβols) 1 ηk c2κx r p |S| + 2 c k β 2 + η r p Σ 2 + b r p Scalable Approximations for Generalized Linear Problems with probability at least 1 4 exp { p}. Here, the constant K is the same as before K = η ck max n (1 + β 2) Σ 2 + b/k, c o . Proof [of Theorem 8] We have ˆβ sls βpop cΨβols βpop + ˆcΨ ˆβols cΨβols , (61) where we used the triangle inequality for the ℓ norm. The first term on the right hand side can be bounded using Theorem 3 and Corollary 5. We write cΨβols βpop η1 βpop p , (62) for η1 = 8k cκ3ρ Σ1/2 (τ/r). For the second term, we write ˆcΨ ˆβols cΨβols = ˆcΨ ˆβols ˆcΨβols cΨβols , (63) |ˆcΨ| ˆβols βols + |ˆcΨ cΨ| βols , where the first step follows from triangle inequality. By Lemma 16, for sufficiently large n and |S|, with probability 1 4 exp { p}, the constant ˆcΨ exists and it is in the interval (0, c]. By the same lemma, with probability 1 4 exp { p}, we have |ˆcΨ cΨ| η4 r p where η4 = η υ 1 ck max ( Σ1/2βols 2 + 1) Σ 2 + b/k, c for some constant η depending on the sub-Gaussian norms κ and γ. Also, by the norm equivalence and Proposition 7, we have with probability 1 4 exp { p} ˆβols βols η3 r p for η3 = η λ 1/2 min , where η is constant depending only on γ and κ. Finally, combining all these inequalities with the last line of (61), we have with probability 1 4 exp { p}, ˆβ sls βpop η1 βpop p + η3 c r p |S| + η4 βols η1 βpop p + η3 c + η4 βols = η1 βpop p + η2 r p Erdogdu, Bayati, Dicker η1 =ηk cρ (τ/r) Σ1/2 (67) η2 =η3 c + η4 βols = η c λ 1/2 min + kυ 1 max n ( Σ1/2βols 2 + 1) Σ 2 + b/k, c o Proof [of Theorem 9] Proof is essentially the same as that of Theorem 8, where this time we use Lemma 16 instead of Lemma 15. Proof [of Corollary 6] The normal equations for the lasso minimization yields E xx T βlasso λ βols + λs = 0, where s βlasso λ 1. It is well-known that under the orthogonal design where the covariates have i.i.d. entries, the above equation reduces to soft(βols; λ) = βlasso λ , where soft( ; λ) denotes the soft thresholding operator at level λ. For any β Rp, let supp(β) denote the support of β, i.e., the set {i [p] : βi = 0}. We have supp(βlasso λ ) = {i [p] : βlasso λ,i = 0} = {i [p] : |βols i | > λ} By Theorem 3, we have |βols i | 1 cΨ |βpop i | + δ, which implies that supp(βlasso λ ) i [p] : 1 cΨ |βpop i | + δ λ . Hence, whenever λ δ, we have supp(βlasso λ ) supp(βpop). Further, we have by Theorem 3 1 cΨ |βpop i | |βols i | + δ. Hence, whenever |βpop i | > cΨ (λ + δ), we get |βols i | > λ. If this condition is satisfied for any entry in the support of βpop, the corresponding lasso coefficient will be non-zero. Therefore, we get supp(βpop) supp(βlasso λ ) under this assumption. Combining this with the previous result, we conclude the proof. 11. Auxiliary Lemmas Lemma 17 (Sub-exponential vector concentration) Let x1, x2, ..., xn be independent centered sub-exponential random vectors with maxi xi ψ1 = κ. Then we have exp { p} . (68) whenever n > 4c2p for an absolute constant c. Scalable Approximations for Generalized Linear Problems Proof [of Lemma 17] For a vector z Rp, we have z 2 = sup u 2=1 u, z since the dual of ℓ2 norm is itself. Therefore, we write i=1 u, xi > t Now, let Nϵ be an ϵ-net over Sp 1 = {u Rp : u 2 = 1}, and observe that max u Nϵ u, x (1 ϵ) sup u 2=1 u, x = (1 ϵ) x 2, with |Nϵ| (1 + 2/ϵ)p. Hence, we may write i=1 u, xi > t max u Nϵ 1 n i=1 u, xi > t(1 ϵ) i=1 u, xi > t(1 ϵ) For any u Sp 1, we have u, xi ψ1 κ. Then, by the Bernstein-type inequality for sub-exponential random variables (Vershynin, 2010), we have i=1 u, xi > t(1 ϵ) exp cn min t2(1 ϵ)2 κ2 , t(1 ϵ) for an absolute constant c. Therefore, the probability on the left hand side of (68) can be bounded by 1 + 2 p exp cnt2(1 ϵ)2 = exp cnt2(1 ϵ)2 κ2 + p log 1 + 2 whenever t < κ/(1 ϵ). Choosing ϵ = 0.5 and for an absolute constant c > 3.24/c and letting t = c κ q p n, we conclude the proof. Lemma 18 Let B( β) denote the ball centered around β with radius δ, i.e., B( β) = n β : β β 2 δ o . For i = 1, ..., n, let xi Rp be i.i.d. sub-Gaussian random vectors with xi ψ2 κx. Given a function g : R R that is Lipschitz continuous with k, and satisfying supβ B( β) g( xi, β ) ψ2 κg, we have i=1 g( xi, β ) E [g( x, β )] > c(κg + κx/ µ) r p n/ log(n) 2 exp { p} , whenever np > 51 max{χ, χ 1} for χ = (κg + κx/ µ)2/(cδ2k2 µ2). Above, c is an absolute constant. Erdogdu, Bayati, Dicker Proof [of Lemma 18] Let E [ x 2] = µ = µ p and for ϵ > 0, β B( β) and w Rp define the bounding functions lβ(w) = g( w, β ) ϵ w 2/4µ, uβ(w) = g( w, β ) + ϵ w 2/4µ. Let N be a net over B( β) in the sense that for any β1 B( β), β2 N such that β1 β2 2 . We fix = ϵ/(4kµ) and write β1 B, β2 N , 1. an upper bound of the form: g( w, β1 ) g( w, β2 ) + k | w, β1 β2 | , g( w, β2 ) + k w 2 = uβ2(w), 2. and a lower bound of the form: g( w, β1 ) g( w, β2 ) k | w, β1 β2 | , g( w, β2 ) k w 2 = lβ2(w), where the second steps in the above inequalities follow from the Cauchy-Schwarz inequality. These functions are called bracketing functions in the context of empirical process theory. Hence, we can write that β1 B( β), β2 N such that i=1 lβ2(xi) E [lβ2(x)] ϵ/2 1 i=1 g( xi, β1 ) E [g( x, β1 )] , i=1 uβ2(xi) E [uβ2(x)] + ϵ/2. The above inequalities translate to the following conclusion: Whenever the following event happens, ( 1 n i=1 g( xi, β1 ) E [g( x, β1 )] at least one of the following events happens ( 1 n i=1 uβ2(xi) E [uβ2(x)] > ϵ/2 i=1 lβ2(xi) E [lβ2(x)] < ϵ/2 Therefore, using the union bound on the above events, we may obtain sup β B( β) i=1 g( xi, β ) E [g( x, β )] i=1 uβ(xi) E [uβ(x)] > ϵ/2 i=1 lβ(xi) E [lβ(x)] < ϵ/2 Scalable Approximations for Generalized Linear Problems Note that the right hand side of the above inequality has two terms both of which are of the same form. For simplicity, we bound only the first one. The bound for the second one follows from the exact same steps. The relation between sub-Gaussian and sub-exponential norms (Vershynin, 2010) allows us to write x 2 2 ψ2 x 2 2 ψ1 i=1 x2 i ψ1 2 i=1 xi 2 ψ2 2κ2 xp, (70) where the second step follows from the triangle inequality. Hence, we conclude that x 2 E [ x 2] is a centered sub-Gaussian random variable with norm upper bounded by 3κx p. For ϵ < 4/3, we notice that the random variable uβ(x) = g( x, β ) + ϵ x 2/4µ is also sub-Gaussian with norm uβ(x) ψ2 κg + ϵ 4 µ3κx κg + κx/ µ, and consequently, the centered random variable uβ(x) E [uβ(x)] has the sub-Gaussian norm upper bounded by 2κg + 2κx/ µ. Then, by the Hoeffding-type inequality for the sub-Gaussian random variables, we obtain i=1 uβ(xi) E [uβ(x)] > ϵ/2 (κg + κx/ µ)2 for an absolute constant c > 0. By the same argument above, one can obtain the same result for the function lβ(x). Using Hoeffding bounds in (69) along with the union bound over the net, we immediately obtain sup β B( β) i=1 g( xi, β ) E [g( x, β )] 2 |N | exp cn ϵ2 (κg + κx/ µ)2 for some absolute constant c. Using a standard covering argument over the net N as given in Erdogdu (2015), we have Combining this with the previous bound, and choosing n (κg + κx/ µ)2 2c log 32cδ2k2 µ2pn (κg + κx/ µ)2 p exp cn ϵ2 (κg + κx/ µ)2 2 log log 32cδ2k2 µ2pn (κg + κx/ µ)2 2 exp { p} , Erdogdu, Bayati, Dicker whenever np > 51 max{χ, χ 1} for χ = (κg + κx/ µ)2/(cδ2k2 µ2). Lemma 19 Let B( β) denote the ball centered around β with radius δ, i.e., B( β) = n β : β β 2 δ o . For i = 1, ..., n, let xi Rp be i.i.d. random vectors with a covariance matrix Σ. Given a function g : R R that is uniformly bounded by b > 0, and Lipschitz continuous with k, sup β B( β) i=1 Ψ(2)( xi, β ) E h Ψ(2)( x, β ) i 2 n k( β 2 + δ) Σ 2 + b o r p Proof [of Lemma 19] This is a standard result in the theory of Rademacher complexity see for example Bartlett and Mendelson (2002). Define the following empirical process G(x1, ..., xn) = sup β B( β) i=1 g( xi, β ) E [g( x, β )] We notice that for any i, we have G(.., xi, ..) G(.., x i, ..) 2b due to boundedness of the function g. By the Mc Diarmid s inequality (Azuma), we have P (G > ϵ) exp (ϵ E [G])2n The above inequality when stated differently reads that with probability at least 1 exp { p}, we have G < E [G] + b For σ1, ..., σn i.i.d. random variables with P (σi = 1) = 0.5, we have sup β B( β) i=1 σig( xi, β ) = R(g B( β)), (76) where R denoting the Rademacher complexity. Since g is Lipschitz continuous with constant k, we have R(g B( β)) k R(B( β)) by the Ledoux-Talagrand contraction inequality. We Scalable Approximations for Generalized Linear Problems R(B( β)) = 2 sup β B( β) i=1 σi xi, β sup β B( β) ( β 2 + δ) 2 #1/2 = ( β 2 + δ) 2 i=1 E xi 2 2 !1/2 , 2( β 2 + δ) Σ 2 Combining this with the previous results, we obtain sup β B( β) i=1 Ψ(2)( xi, β ) E h Ψ(2)( x, β ) i 2 n k( β 2 + δ) Σ 2 + b o r p 12. Additional Experiments In this section, we provide additional experiments. The overall setting is the same as Section 8. The only difference is that we change the sampling distribution of the datasets, which are stated in the title of each plot. As in Section 8, SLS estimator outperforms its competitors by a large margin in terms of the computation time. The results are provided in Figures 7 and 6, and Table 3. Erdogdu, Bayati, Dicker Random start OLS start Logis0c Regression 0 10 20 30 40 50 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Covariates ~ Σ x Ber( 1) 0 10 20 30 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Covariates ~ Σ x Ber( 1) 0 10 20 30 40 50 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Covariates ~ Σ x Norm(0,1) 0 10 20 30 40 50 Time (sec) SLS NR NS BFGS LBFGS GD AGD Log Reg / Covariates ~ Σ x Norm(0,1) Figure 6: Additional experiments comparing the performance of SLS to that of MLE obtained with various optimization algorithms on several datasets. SLS is represented with red straight line. The details are provided in Table 3 Scalable Approximations for Generalized Linear Problems Table 3: Details of the experiments shown in Figures 7 and 6. Model Logistic Regression Poisson Regression Dataset Σ Ber( 1) Σ Norm(0,1) Σ {Exp(1)-1} Σ Norm(0,1) Size n = 6.0 105, p=300 n = 6.0 105, p=300 n = 6.0 105, p=300 n = 6.0 105, p=300 Init Rnd Ols Rnd Ols Rnd Ols Rnd Ols Plot (a) (b) (c) (d) (e) (f) (g) (h) Method Time in Seconds / Number of Iterations (to reach min test error) Sls 6.61/3 2.97/3 9.38/5 4.25/4 14.68/4 2.99/4 6.66/10 4.13/10 Nr 222.21/6 84.08/3 186.33/6 115.76/4 218.1/6 218.9/4 364.63/9 363.4/9 Ns 40.68/10 11.57/3 53.06/9 19.52/4 39.22/6 59.61/4 51.48/10 39.8/10 Bfgs 125.83/33 35.41/9 155.3/48 24.78/8 46.61/20 48.71/12 92.84/36 74.22/38 LBfgs 142.09/38 44.41/12 444.62/143 21.79/7 96.53/39 50.56/12 296.4/111 228.1/117 Gd 409.9/134 79.45/22 1773.1/509 135.62/44 569.1/211 124.31/48 792.3/344 1041.1/366 Agd 177.3/159 43.76/12 359.56/95 53.73/18 157.9/57 63.16/16 74.74/32 62.21/32 Poisson Regression 0 10 20 30 40 50 Time (sec) log(Test Error) PEG NR NS BFGS LBFGS GD AGD Poi Reg / Covariates ~ Exp{1} SLS Log Reg / Covariates ~ Σ x {Exp(1) 1} Poi Reg 0 10 20 30 40 50 Time (sec) log(Test Error) PEG NR NS BFGS LBFGS GD AGD Poi Reg / Covariates ~ Exp{1} SLS Log Reg / Covariates ~ Σ x {Exp(1) 1} Poi Reg 0 10 20 30 40 50 Time (sec) log(Test Error) PEG NR NS BFGS LBFGS GD AGD Poi Reg / Covariates ~ Norm(0,1) SLS Log Reg / Covariates ~ Σ x {Exp(1) 1} Poi Reg Σ x Norm(0,1) 0 10 20 30 40 50 Time (sec) log(Test Error) PEG NR NS BFGS LBFGS GD AGD Poi Reg / Covariates ~ Norm(0,1) SLS Log Reg / Covariates ~ Σ x {Exp(1) 1} Poi Reg Σ x Norm(0,1) Figure 7: Additional experiments comparing the performance of SLS to that of MLE obtained with various optimization algorithms on several datasets. SLS is represented with red straight line. The details are provided in Table 3 Erdogdu, Bayati, Dicker Pierre Alquier and G erard Biau. Sparse single-index model. Journal of Machine Learning Research, 14(Jan):243 280, 2013. Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in highenergy physics with deep learning. Nature communications, 5, 2014. Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463 482, 2002. Mohsen Bayati and Andrea Montanari. The lasso risk for gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997 2017, 2012. Mohsen Bayati, Murat A Erdogdu, and Andrea Montanari. Estimating lasso risk and noise level. In Advances in Neural Information Processing Systems, pages 944 952, 2013. Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agr., 24:131 151, 1999. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. David R Brillinger. A generalized linear model with Gaussian regressor variables. In A Festschrift For Erich L. Lehmann, pages 97 114. CRC Press, 1982. Charles G Broyden. The convergence of a class of double-rank minimization algorithms 2. the new algorithm. IMA Journal of Applied Mathematics, 6(3):222 231, 1970. Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. 2005. Louis H. Y. Chen, Larry Goldstein, and Qi-Man Shao. Normal approximation by Stein s method. Springer, 2010. Sheng Chen and Arindam Banerjee. Robust structured estimation with single-index models. In International Conference on Machine Learning, pages 712 721, 2017. Paramveer Dhillon, Yichao Lu, Dean P Foster, and Lyle Ungar. New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, pages 360 368, 2013. David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914 18919, 2009. Scalable Approximations for Generalized Linear Problems Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tam as Sarl os. Faster least squares approximation. Numerische Mathematik, 117(2):219 249, 2011. Naihua Duan and Ker-Chau Li. Slicing regression: a link-free regression method. The Annals of Statistics, pages 505 530, 1991. Murat A Erdogdu. Newton-stein method: A second order method for glms via stein s lemma. In Advances in Neural Information Processing Systems, pages 1216 1224, 2015. Murat A. Erdogdu. Newton-stein method: an optimization method for glms via stein s lemma. The Journal of Machine Learning Research, 17(1):7565 7616, 2016. Murat A Erdogdu. Stein s lemma and subsampling in large-scale optimization. Ph D thesis, Stanford University, 2017. Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled newton methods. In Advances in Neural Information Processing Systems, pages 3052 3060, 2015. Murat A Erdogdu, Mohsen Bayati, and Lee H Dicker. Scaled least squares estimator for glms in large-scale problems. In Advances in Neural Information Processing Systems, 2016. Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals Eugenic, 7:179 188, 1936. Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13 (3):317 322, 1970. Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23 26, 1970. Larry Goldstein. l1 bounds in normal approximation. Annals Probability, 35:1888 1930, 2007. Larry Goldstein and Gesine Reinert. Stein s method and the zero bias transformation with application to simple random sampling. Annals of Applied Probability, 7:935 952, 1997. Jackson Gorham and Lester Mackey. Measuring sample quality with stein s method. In Advances in Neural Information Processing Systems, pages 226 234, 2015. Peter Hall and Ker-Chau Li. On almost linearity of low dimensional projections from high dimensional data. The annals of Statistics, pages 867 889, 1993. David P Helmbold, Jyrki Kivinen, and Manfred K Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291 1304, 1999. Magnus R Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems . Journal of Research of the National Bureau of Standards, 49(6), 1952. Marian Hristache, Anatoli Juditsky, and Vladimir Spokoiny. Direct estimation of the index coefficient in a single-index model. Annals of Statistics, pages 595 623, 2001. Erdogdu, Bayati, Dicker Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927 935, 2011. Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009. Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT press, 2009. Bing Li and Yuexiao Dong. Dimension reduction for nonelliptically distributed predictors. The Annals of Statistics, pages 1272 1298, 2009. Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316 327, 1991. Ker-Chau Li and Naihua Duan. Regression analysis under link violation. Annals of Statistics, 17:1009 1052, 1989. Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2370 2378, 2016. Qiang Liu, Jason D Lee, and Michael I Jordan. A kernelized stein discrepancy for goodnessof-fit tests and model evaluation. ar Xiv preprint ar Xiv:1602.03253, 2016. James Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735 742, 2010. Peter Mc Cullagh and John A. Nelder. Generalized Linear Models. Chapman and Hall, 2nd edition, 1989. John A Nelder and R. Jacob Baker. Generalized linear models. Wiley Online Library, 1972. Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2). Soviet Math. Dokl., 27:372 376, 1983. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004. Christopher C Paige and Michael A Saunders. Solution of sparse indefinite systems of linear equations. SIAM journal on numerical analysis, 12(4):617 629, 1975. Mert Pilanci and Martin J Wainwright. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. ar Xiv preprint ar Xiv:1505.02250, 2015. Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528 1537, 2016. Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11(Sep):2387 2422, 2010. Scalable Approximations for Generalized Linear Problems Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212 13217, 2008. Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled newton methods i: globally convergent algorithms. ar Xiv preprint ar Xiv:1601.04737, 2016a. Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled newton methods ii: Local convergence rates. ar Xiv preprint ar Xiv:1601.04738, 2016b. Mark J Schervish. A general method for comparing probability assessors. The Annals of Statistics, pages 1856 1879, 1989. Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Conference on Learning Theory (COLT), 2009. David F Shanno. Conditioning of quasi-newton methods for function minimization. Mathematics of computation, 24(111):647 656, 1970. Weijie Su and Emmanuel Candes. Slope is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3):1038 1068, 2016. Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Lasso with non-linear measurements is equivalent to one with linear measurements. In Advances in Neural Information Processing Systems, pages 3420 3428, 2015. Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, 2010. ar Xiv:1011.3027. Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1 305, 2008.