# estimating_stochastic_linear_combination_of_nonlinear_regressions__6ccf8594.pdf

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Estimating Stochastic Linear Combination of Non-Linear Regressions

Di Wang, Xiangyu Guo, Chaowen Guan, Shi Li, Jinhui Xu
Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260

The first two authors contributed equally. Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper we study the problem of estimating a stochastic linear combination of non-linear regressions, which has a close connection with many machine learning and statistical models such as non-linear regressions, the Single Index, Multi-index and Varying Coefficient Index Models, and Two-layer Neural Networks. Specifically, we first show that, under some mild assumptions, if the variate vector x is multivariate Gaussian, then there is an algorithm whose output vectors have ℓ2-norm estimation errors of O(√(p/n)) with high probability, where p is the dimension of x and n is the number of samples. We then extend our result to the case where x is sub-Gaussian using the zero-bias transformation, which can be seen as a generalization of the classic Stein's lemma. We also show that with some additional assumptions there is an algorithm whose output vectors have ℓ∞-norm estimation errors of O(1/√p + √(p/n)) with high probability. Finally, for both the Gaussian and sub-Gaussian cases we propose a faster sub-sampling based algorithm and show that when the sub-sample size is large enough, the estimation error is not sacrificed by too much. Experiments for both cases support our theoretical results. To the best of our knowledge, this is the first work that studies and provides theoretical guarantees for the stochastic linear combination of non-linear regressions model.

Introduction

We study the problem of estimating a stochastic linear combination of non-linear regressions. The model can be formally defined as follows.

Definition 1 (Stochastic Linear Combination of Non-linear Regressions). Given variates x ∈ R^p and z_1, ..., z_k ∈ R such that E[x] = 0 and the z_i's for all i ∈ [k] are i.i.d. random variables independent of x with E[z_i] = 0 and Var(z_i) = 1, the response y is given by

y = Σ_{i=1}^k z_i f_i(⟨β*_i, x⟩) + ϵ, (1)

where β*_1, β*_2, ..., β*_k ∈ R^p are unknown parameters, the f_i's for all i ∈ [k] are known (but possibly non-convex) link functions, and ϵ is some random noise (from an unknown distribution) satisfying E[ϵ] = 0 that is independent of x and the z_i's. The goal is to estimate the parameters β*_j for all j ∈ [k] from n observations (x_1, y_1, {z_{1,i}}^k_{i=1}), (x_2, y_2, {z_{2,i}}^k_{i=1}), ..., (x_n, y_n, {z_{n,i}}^k_{i=1}).

This model has a close connection with many models in Statistics, Machine Learning, Signal Processing and Information Theory: (1) when k = 1, the model reduces to the non-linear regression estimation problem, which has been studied in (Zhang, Yang, and Wang 2018; Beck and Eldar 2013; Yang et al. 2016; Cook and Lee 1999) and is related to compressed sensing and image recovery as well; (2) when k = 1 but the link function f_1 is unknown, it becomes the Single Index Model, which is one of the most fundamental models in statistics and has been studied for many years (Ichimura 1993; Horowitz 2009; Kakade et al.
2011; Yang, Balasubramanian, and Liu 2017; Radchenko 2015); (3) when k ≥ 1, the z_i's are deterministic but the f_i's are unknown, the model becomes a special case of the Multi-index Model, which has been studied in (Li 1991; 1992; Li, Duan, and others 1989; Yang et al. 2017); (4) when k ≥ 1, the z_i's are stochastic but the f_i's are unknown, it becomes the Varying Coefficient Index Model, which was introduced by (Ma and Song 2015) and has wide applications in economics and medical science (Fan and Zhang 2008); (5) when all the f_i's are the same, the model can be viewed as a Two-layer Neural Network with k hidden nodes and random hidden-output layer weights.

To estimate the parameters in Model (1), the main challenge is that, without assuming that the f_i's are convex or similar to one another, it is hard to establish an objective function that can be efficiently optimized with methods such as (Stochastic) Gradient Descent. Recently, several works, including (Yang, Balasubramanian, and Liu 2017; Na et al. 2018; Yang et al. 2017), studied and proposed efficient algorithms for the Single Index, Multi-index and Varying Coefficient Index models using Stein's Lemma. Their theoretical guarantees are measured in terms of ‖β̂_j − cβ*_j‖_2, j ∈ [k], where β̂_j is the estimator of β*_j and c is a constant depending on many parameters of the model (such as the f_i's, the β*_j's and the distribution of x). However, there is a common issue with the constant c in these results: they did not provide a method to compute or even estimate c. To address this issue, we measure the error in terms of ‖β̂_j − β*_j‖ for all j ∈ [k]; that is, we do not introduce the constant c. The key question this paper tries to answer is: Is there an efficient method whose output vectors β̂_1, β̂_2, ..., β̂_k have small errors compared to β*_1, β*_2, ..., β*_k?

In this paper, we answer the question in the affirmative under some mild assumptions on the model. Specifically, our contributions can be summarized as follows.

1. We first consider the case where x is multivariate Gaussian. In this case, we show that there is a special structure for each β*_j, j ∈ [k]: β*_j = c_j β^{ols}_j, where c_j is a constant depending on the link function f_j and x, and β^{ols}_j is the Ordinary Least Squares estimator w.r.t. yz_j and x, i.e., β^{ols}_j = (E[xx^T])^{-1} E[z_j y x]. Based on this key observation, we propose an algorithm which estimates the c_j's and β^{ols}_j's, and outputs {β̂_j}^k_{j=1} satisfying ‖β̂_j − β*_j‖_2 ≤ O(√(p/n)) for each j ∈ [k] with high probability. Moreover, in order to make our algorithm faster, instead of using the full linear regression estimator to approximate β^{ols}_j, we use the sub-sampled covariance linear regression estimator (Dhillon et al. 2013). We show that if the sub-sample size is large enough, the error bound is almost the same as the previous one.

2. We then extend our result to the case when x is (bounded) sub-Gaussian. The challenge is that the result for the Gaussian case depends on some properties of the Gaussian distribution which do not hold in the sub-Gaussian case. To overcome this, we use the zero-bias transformation (Goldstein, Reinert, and others 1997), which can be seen as a generalization of Stein's lemma (Brillinger 1982). In particular, we show that instead of the equality β*_j = c_jβ^{ols}_j, we have the ℓ∞-norm approximation error ‖β*_j − c_jβ^{ols}_j‖_∞ ≤ O(1/√p) under some additional mild assumptions. Based on this and the same idea as in the Gaussian case, we show that there exists an algorithm whose output vectors {β̂_j}^k_{j=1} satisfy ‖β̂_j − β*_j‖_∞ ≤ O(1/√p + √(p/n)) with high probability.
Similarly, we also propose a sub-sampled version of our algorithm as in the Gaussian case.

3. Finally, we show experimental results for both the Gaussian and sub-Gaussian cases with single/mixed types of link functions, and these results support our theoretical findings above.

To the best of our knowledge, this is the first paper studying and providing estimation error bounds for Model (1) in both the Gaussian and sub-Gaussian cases. Due to the space limit, omitted proofs and background are included in the full version of the paper. The source code of the experiments can be found at github.com/anonymizepaper/SLSE.

Related Work

As mentioned above, there is no previous work on Model (1) with guarantees on the ℓ2 or ℓ∞ norm of the errors β̂_j − β*_j. Hence, below we compare with the results that are closest to ours.

When the link functions f_j are unknown, Model (1) is just the Varying Coefficient Index Model. (Na et al. 2018) provided the first efficient algorithm for this model. Although they considered the high-dimensional sparse case, their method requires the underlying distribution of x to be known, an unrealistic assumption for most applications. Moreover, their estimation errors are measured by the differences between the β̂_j's and cβ*_j's for an unknown c, while in our results we have fixed c = 1.

When the link functions f_j are all the same, our model reduces to a two-layer neural network with random hidden-output layer weights. Previous convergence results all focused on gradient descent type methods, such as those in (Zhang et al. 2019; Zou et al. 2018; Nitanda and Suzuki 2019). Our method, in contrast, is based on Stein's lemma and its generalization. Compared with gradient descent type methods, our algorithm is non-interactive (that is, we do not need to update the estimators in each iteration) and parameter-free (that is, we do not need to tune a step size). Moreover, our method extends to the case where the link functions f_j are different.

Our method is motivated by Stein's lemma (Brillinger 1982) and its generalization, the zero-bias transformation. Several previous studies have used Stein's lemma in various machine learning problems. For example, (Erdogdu, Dicker, and Bayati 2016; Erdogdu 2016) used it to accelerate some optimization procedures, (Liu and Wang 2016) applied it to Bayesian inference, and (Yang, Balasubramanian, and Liu 2017; Yang et al. 2016; Na et al. 2018; Wei, Yang, and Wang 2019) used it and its generalizations in the Single Index, Multi-index, Varying Coefficient Index and Generative models, respectively. The zero-bias transformation has also been used in (Erdogdu, Bayati, and Dicker 2019) for estimating the Generalized Linear Model. However, due to the differences between the models, these algorithms cannot be applied to our problem.

Preliminaries

In this section, we review some necessary definitions and lemmas.

Definition 2 (Sub-Gaussian). For a given constant κ, a random variable x ∈ R is said to be sub-Gaussian if it satisfies sup_{m≥1} m^{-1/2} (E[|x|^m])^{1/m} ≤ κ. The smallest such κ is the sub-Gaussian norm of x, denoted by ‖x‖_{ψ2}. Similarly, a random vector x ∈ R^p is called a sub-Gaussian vector if there exists a constant κ such that sup_{v∈S^{p-1}} ‖⟨x, v⟩‖_{ψ2} ≤ κ, where S^{p-1} is the set of all p-dimensional unit vectors.

In order to extend our results to the sub-Gaussian case, we will use the zero-bias transformation proposed by (Goldstein, Reinert, and others 1997). It is a generalization of the classic Stein's lemma in (Brillinger 1982).

Definition 3. Let z be a random variable with mean 0 and variance σ². Then there exists a random variable z* such that for all differentiable functions f we have E[z f(z)] = σ² E[f′(z*)]. The distribution of z* is said to be the z-zero-bias distribution. The standard Gaussian distribution is the unique distribution whose zero-bias distribution is itself; this is just the basic Stein's lemma.
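As a quick numerical illustration of this last fact (our own sanity check, not part of the original analysis), the basic Stein identity E[z f(z)] = E[f′(z)] for a standard Gaussian z can be verified by Monte Carlo simulation; the test function f(z) = z³ below is an arbitrary choice.

```python
import numpy as np

# Monte-Carlo check of Stein's identity E[z f(z)] = E[f'(z)] for z ~ N(0, 1),
# i.e., the zero-bias distribution of the standard Gaussian is itself.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
f = lambda t: t ** 3           # arbitrary differentiable test function
f_prime = lambda t: 3 * t ** 2
print(np.mean(z * f(z)))       # approximately E[z^4] = 3
print(np.mean(f_prime(z)))     # approximately 3 as well
```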
Lemma 1. (Dhillon et al. 2013) Assume that E[x] = 0, E[x_i x_i^T] = Σ ∈ R^{p×p}, and that Σ^{-1/2}x and y are sub-Gaussian with norms κ_x and γ, respectively. If n ≥ Ω(γκ_x p), then with probability at least 1 − 3 exp(−p) we have

‖Σ^{1/2}(β̂^{ols} − β^{ols})‖_2 ≤ C_1 κ_x γ √(p/n),

where β^{ols} = Σ^{-1}E[yx] is the OLS estimator w.r.t. y and x, β̂^{ols} = (X^T X)^{-1}X^T Y is its empirical counterpart, and C_1 > 0 is a universal constant.

Lemma 2. (Erdogdu, Bayati, and Dicker 2019) Let B_δ(β̄) denote the ball centered at β̄ with radius δ. For i = 1, 2, ..., n, let x_i ∈ R^p be i.i.d. random vectors with covariance matrix Σ. Given a function g that is uniformly bounded by L and G-Lipschitz, with probability at least 1 − exp(−p) we have

sup_{β ∈ B_δ(β̄)} | (1/n) Σ_{i=1}^n g(⟨x_i, β⟩) − E[g(⟨x, β⟩)] | ≤ 2(G(‖β̄‖_2 + δ)‖Σ‖_2 + L) √(p/n).

Assumption 1. We assume that for each j ∈ [k] the random variable yz_j is sub-Gaussian with sub-Gaussian norm ‖yz_j‖_{ψ2} = γ. Note that this assumption holds if y is bounded and z_j is sub-Gaussian, or if z_j is bounded and y is sub-Gaussian.

Assumption 2. We assume that there exist constants G, L > 0 such that for each j ∈ [k], f′_j is G-Lipschitz and bounded by L. Also, for each j ∈ [k] we require E[f′_j(⟨x, β*_j⟩)] ≠ 0.

Notations. For a positive semi-definite matrix M ∈ R^{p×p}, we define the M-norm of a vector w as ‖w‖²_M = w^T M w. We also denote by B^M_δ(β̄) the ball around β̄ with radius δ under the M-norm, i.e., B^M_δ(β̄) = {β : ‖M^{1/2}(β − β̄)‖_2 ≤ δ}. λ_min(M) is the minimal singular value of the matrix M. For a positive semi-definite matrix M ∈ R^{p×p} with SVD M = U^T Σ U, where Σ = diag(λ_1, ..., λ_p), M^{1/2} is defined as M^{1/2} = U^T Σ^{1/2} U with Σ^{1/2} = diag(√λ_1, ..., √λ_p).

Gaussian Case

In this section we consider the case where x is sampled from a multivariate Gaussian distribution; we will extend the idea to the sub-Gaussian case in the next section. Our algorithm is based on the following key observation, which uses some properties of the multivariate Gaussian distribution.

Theorem 1. Consider Model (1) in Definition 1 under Assumptions 1 and 2. Moreover, assume that the observations {x_i}^n_{i=1} are i.i.d. samples from N(0, Σ). Then each β*_j, j ∈ [k], can be written as

β*_j = c_j β^{ols}_j, (3)

where β^{ols}_j = Σ^{-1}E[z_j y x] and c_j is the root of the function l_j(c) − 1, where

l_j(c) = c · E[f′_j(⟨x, β^{ols}_j⟩ c)]. (4)

From Theorem 1 we can see that, in order to estimate β*_j, it is sufficient to estimate the terms β^{ols}_j = Σ^{-1}E[z_j y x] and c_j. If we treat z_j y as the response and x as the variate, then the term β^{ols}_j is just the Ordinary Least Squares (OLS) estimator. Thus we can use its empirical form

β̂^{ols}_j = (Σ_{i=1}^n x_i x_i^T)^{-1} Σ_{i=1}^n z_{i,j} y_i x_i = (X^T X)^{-1} X^T Y_j

as an estimator, where X = [x_1^T; x_2^T; ...; x_n^T] ∈ R^{n×p} is the data matrix and Y_j = [z_{1,j}y_1, ..., z_{n,j}y_n]^T is the corresponding response vector. After obtaining the estimator β̂^{ols}_j, we use it to approximate c_j. That is, we find the root ĉ_j of the empirical version of l_j(c) − 1, i.e., l̂_j(c) − 1, where

l̂_j(c) = (c/n) Σ_{i=1}^n f′_j(⟨x_i, β̂^{ols}_j⟩ c).
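To make the procedure concrete, here is a minimal NumPy sketch of this scaled least squares idea for a single index j; it mirrors Option 1 of Algorithm 1 below. The function names and the use of a plain Newton iteration are our own choices; f_prime and f_double_prime stand for the known derivatives f′_j and f″_j.

```python
import numpy as np

def scaled_least_squares(X, y, z_j, f_prime, f_double_prime, n_iter=50, c0=1.0):
    """Sketch of the estimator beta_hat_nlr_j = c_hat_j * beta_hat_ols_j.

    X: (n, p) data matrix, y: (n,) responses, z_j: (n,) coefficients z_{i,j};
    f_prime / f_double_prime: first and second derivatives of the link f_j.
    """
    # OLS estimator w.r.t. the response z_j * y and the variate x
    Yj = z_j * y
    beta_ols = np.linalg.solve(X.T @ X, X.T @ Yj)
    # Newton's method on c -> (c/n) * sum_i f'_j(<x_i, beta_ols> c) - 1
    proj = X @ beta_ols
    c = c0
    for _ in range(n_iter):
        g = c * np.mean(f_prime(c * proj)) - 1.0
        dg = np.mean(f_prime(c * proj) + c * proj * f_double_prime(c * proj))
        c = c - g / dg
    return c * beta_ols

# Example with the cubic link f(t) = t^3, so f'(t) = 3 t^2 and f''(t) = 6 t:
# beta_hat = scaled_least_squares(X, y, Z[:, j], lambda t: 3*t**2, lambda t: 6*t)
```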
Note that there are numerous methods available for finding the root of a function, such as Newton's root-finding method with quadratic convergence and Halley's method with cubic convergence. We also note that this step only costs O(n) time per iteration. After that, we estimate each β*_j by β̂^{nlr}_j = ĉ_j β̂^{ols}_j. Putting everything together, we obtain Algorithm 1.

Algorithm 1 SLS: Scaled Least Squares Estimators
Input: Data {(x_i, y_i, {z_{i,j}}^k_{j=1})}^n_{i=1}, link functions {f_j}_{j∈[k]}.
1: Option 1: Let X = [x_1, x_2, ..., x_n]^T ∈ R^{n×p} and compute Σ̂^{-1} = (X^T X)^{-1}.
2: Option 2: Construct a sub-sampling based OLS estimator: let S ⊆ [n] be a random subset and take Σ̂_S^{-1} = (|S|/n)(X_S^T X_S)^{-1}, where X_S ∈ R^{|S|×p} is the data matrix restricted to the indices in S.
3: for j = 1, 2, ..., k do
4:   Let Y_j = [z_{1,j}y_1, ..., z_{n,j}y_n]^T ∈ R^n. For Option 1, compute the ordinary least squares estimator β̂^{ols}_j = Σ̂^{-1} X^T Y_j. For Option 2, take β̂^{ols}_j = Σ̂_S^{-1} X^T Y_j.
5:   Denote ŷ_j = X β̂^{ols}_j. Then apply Newton's root-finding method to the function c ↦ (c/n) Σ_{i=1}^n f′_j(ŷ_{j,i} c) − 1 and denote the root by ĉ_j:
6:   for t = 1, 2, ... until convergence do
7:     c ← c − [c · (1/n) Σ_{i=1}^n f′_j(c ŷ_{j,i}) − 1] / [(1/n) Σ_{i=1}^n {f′_j(c ŷ_{j,i}) + c ŷ_{j,i} f″_j(c ŷ_{j,i})}].
8:   β̂^{nlr}_j = ĉ_j β̂^{ols}_j.
9: return {β̂^{nlr}_j}^k_{j=1}

The following theorem shows that the convergence rate of the estimation error ‖β̂^{nlr}_j − β*_j‖_2 is O(√(p/n)) under some additional mild assumptions on the link functions {f_j}^k_{j=1}.

Theorem 2. Consider Option 1 in Algorithm 1. Under Assumptions 1 and 2 and the assumptions in Theorem 1, for each j ∈ [k] define the function ℓ_j(c, β) = c · E[f′_j(⟨x, β⟩ c)] and its empirical counterpart ℓ̂_j(c, β) = (c/n) Σ_{i=1}^n f′_j(⟨x_i, β⟩ c). Assume that there exist constants η, c̄_j such that ℓ_j(c̄_j, β^{ols}_j) > 1 + η. Then there exists c_j > 0 satisfying the equation 1 = ℓ_j(c_j, β^{ols}_j) for each j ∈ [k]. Further, assume that n is sufficiently large:

n ≥ Ω(p ‖Σ‖_2 ‖β*_j‖²_2).

Then, with probability at least 1 − k exp(−p), there exist constants ĉ_j ∈ (0, c̄_j) satisfying the equations

1 = (ĉ_j/n) Σ_{i=1}^n f′_j(⟨x_i, β̂^{ols}_j⟩ ĉ_j).

Moreover, if for all j ∈ [k] the derivative of z ↦ ℓ_j(z, β^{ols}_j) is bounded below in absolute value (and does not change sign) by M > 0 on the interval z ∈ [0, c̄_j], then with probability at least 1 − 4k exp(−p) the outputs {β̂^{nlr}_j}^k_{j=1} satisfy, for each j ∈ [k],

‖β̂^{nlr}_j − β*_j‖_2 ≤ O(max{1, ‖β*_j‖_2} λ^{-1}_{min}(Σ) √(p/n)),

where G, L, γ, M, c̄_j, η are assumed to be Θ(1) and are thus omitted in the Big-O and Ω notations (see the Appendix for the explicit forms).

Note that Theorem 2 does not require the link functions f_j to be convex. Hence it is quite useful in non-convex learning models.

Time Complexity Analysis. Under Option 1 of Algorithm 1, the first step takes O(np² + p³) time, calculating β̂^{ols}_j for all j ∈ [k] takes O(k(np + p²)) time, and each iteration of finding ĉ_j takes O(n) time. Thus, if k, the number of link functions f_j, is a constant, the total time complexity is O(np² + p³ + nT), where T is the number of iterations for finding ĉ_j. However, the term np² is prohibitive in the large-scale setting where n and p are huge (see (Wang and Xu 2018) for details). To further reduce the time complexity, we propose another estimator based on sub-sampling. Note that the term O(np²) comes from calculating the empirical covariance matrix X^T X. Thus, instead of computing the covariance on the whole dataset, we use a sub-sampled covariance matrix. More precisely, we first randomly sample a set of indices S ⊆ [n], whose size |S| will be specified later. Then we calculate (|S|/n)(X_S^T X_S)^{-1} to estimate (X^T X)^{-1}, where X_S ∈ R^{|S|×p} is the data matrix restricted to the indices in S. The time complexity in this case is only O(|S|p² + p³).
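As a rough illustration (our own sketch, with hypothetical function names), the Option 2 estimator of steps 2 and 4 of Algorithm 1 can be written as follows; only the p×p covariance factor is computed on the sub-sample, while the p-dimensional moment X^T Y_j still uses all n points.

```python
import numpy as np

def subsampled_ols(X, Yj, m, seed=0):
    """Option 2 sketch: OLS with a sub-sampled covariance matrix.

    Uses (|S|/n) * (X_S^T X_S)^{-1} in place of (X^T X)^{-1}, where S is a
    random subset of m row indices, reducing the O(n p^2) covariance cost
    to O(|S| p^2 + p^3).
    """
    n, p = X.shape
    rng = np.random.default_rng(seed)
    S = rng.choice(n, size=m, replace=False)
    XS = X[S]
    # beta_ols_S = (|S|/n) * (X_S^T X_S)^{-1} X^T Y_j
    return (m / n) * np.linalg.solve(XS.T @ XS, X.T @ Yj)
```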
The following lemma, given by (Dhillon et al. 2013; Erdogdu, Dicker, and Bayati 2016), shows the convergence rate of the OLS estimator based on the sub-sampled covariance matrix; it is a generalization of Lemma 1.

Lemma 3. Under the same assumptions as in Lemma 1, if |S| ≥ Ω(γκ_x p), then with probability at least 1 − 3 exp(−p) the sub-sampled covariance OLS estimator β̃^{ols} = (|S|/n)(X_S^T X_S)^{-1} X^T Y satisfies

‖β̃^{ols} − β^{ols}‖_2 ≤ C_2 κ_x γ √(p/|S|).

We have the following approximation error based on the sub-sampled covariance OLS estimator:

Theorem 3. Under the same assumptions as in Theorem 2, if we use Option 2 in Algorithm 1 with |S| ≥ Ω(γκ_x p), then with probability at least 1 − 4k exp(−p) the outputs {β̂^{nlr}_j}^k_{j=1} satisfy, for each j ∈ [k],

‖β̂^{nlr}_j − β*_j‖_2 ≤ O(max{1, ‖β*_j‖_2} λ^{-1}_{min}(Σ) √(p/|S|)).

Moreover, it is also possible to accelerate the algorithm by using the sub-sampling method in step 5 (finding the root), and the estimation error remains the same as in Theorem 3 (by the proof of Theorem 3). Due to the space limit, we omit the details here.

Extension to Sub-Gaussian Case

Note that Theorem 2 is only suitable for the case where x is Gaussian, because the proof of Theorem 1 relies on some properties of the Gaussian distribution. In this section we first extend Theorem 1 to the sub-Gaussian case. Recall that the proof of Theorem 1 is based on the classic Stein's lemma. Thus, to generalize to the sub-Gaussian case, we use the zero-bias transformation in Definition 3, since it is a generalization of Stein's lemma. With some additional assumptions, we obtain the following theorem.

Theorem 4. Let x_1, ..., x_n ∈ R^p be i.i.d. realizations of a random vector x which is sub-Gaussian with zero mean, whose covariance matrix Σ has Σ^{1/2} diagonally dominant (a square matrix is diagonally dominant if, for every row, the magnitude of the diagonal entry is larger than or equal to the sum of the magnitudes of all the other, non-diagonal, entries in that row), and whose distribution is supported on an ℓ2-norm ball of radius r. Let v = Σ^{-1/2}x be the whitened version of x, with sub-Gaussian norm ‖v‖_{ψ2} = κ_x. Suppose that for all j ∈ [k], v has constant first and second conditional moments (i.e., for every s ∈ [p] and β̃_j = Σ^{1/2}β*_j, the quantities E[v_s | Σ_{t≠s} β̃_{j,t} v_t] and E[v_s² | Σ_{t≠s} β̃_{j,t} v_t] are deterministic), and that the link functions f_j satisfy Assumption 2. Then for c_j = 1/E[f′_j(⟨x, β*_j⟩)], the following holds for the model in (1) for all j ∈ [k]:

‖β*_j − c_j β^{ols}_j‖_∞ ≤ 16 G r κ³_x ρ_2 ρ_∞ ‖β*_j‖_2 / √p, (6)

where ρ_q for q ∈ {2, ∞} is the condition number of Σ in the ℓ_q norm and β^{ols}_j = Σ^{-1}E[x y z_j] is the OLS vector w.r.t. yz_j and x.

Remark 1. Note that, compared with the equality relationship between β*_j and c_jβ^{ols}_j in Theorem 1, Theorem 4 only gives a bound on the ℓ∞ norm of their difference. Also, here we need more assumptions on the distribution of x; these assumptions ensure that the approximation error decays at the rate O(1/√p).

Theorem 4 indicates that we can use the same idea as in the Gaussian case to estimate each β*_j. Note that the forms of c_j in Theorems 1 and 4 are different: in Theorem 1 each c_j is defined through β^{ols}_j, while in Theorem 4 it is defined through β*_j. However, due to the closeness of β*_j and c_jβ^{ols}_j in (6), we can still use c̄_j = 1/E[f′_j(⟨x, β^{ols}_j⟩ c̄_j)] to approximate c_j, where c̄_j is the root of c · E[f′_j(⟨x, β^{ols}_j⟩ c)] − 1.
Because of this similarity, we can still use Algorithm 1 for the sub-Gaussian case under the assumptions in Theorem 4, and we obtain the following estimation error bound.

Theorem 5. Consider Option 1 in Algorithm 1. Under Assumptions 1 and 2 and the assumptions in Theorem 4, for each j ∈ [k] define the function ℓ_j(c, β) = c · E[f′_j(⟨x, β⟩ c)] and its empirical counterpart ℓ̂_j(c, β) = (c/n) Σ_{i=1}^n f′_j(⟨x_i, β⟩ c). Assume that there exist constants η, c̃_j such that ℓ_j(c̃_j, β^{ols}_j) > 1 + η. Then there exists c̄_j > 0 satisfying the equation 1 = ℓ_j(c̄_j, β^{ols}_j) for each j ∈ [k]. Further, assume that n is sufficiently large:

n ≥ Ω(‖Σ‖_2 p² ρ_2 ρ²_∞ ‖β*_j‖²_2 max{1, ‖β*_j‖²_∞}).

Then, with probability at least 1 − k exp(−p), there exist constants ĉ_j ∈ (0, c̃_j) satisfying the equations

1 = (ĉ_j/n) Σ_{i=1}^n f′_j(⟨x_i, β̂^{ols}_j⟩ ĉ_j).

Moreover, if for all j ∈ [k] the derivative of z ↦ ℓ_j(z, β^{ols}_j) is bounded below in absolute value (and does not change sign) by M > 0 on the interval z ∈ [0, max{c̄_j, c_j}], then with probability at least 1 − 4k exp(−p) the outputs {β̂^{nlr}_j}^k_{j=1} satisfy, for each j ∈ [k],

‖β̂^{nlr}_j − β*_j‖_∞ ≤ O( ρ_2 ρ_∞ λ^{-1}_{min}(Σ) max{1, ‖β*_j‖_∞} √(p/n) + ρ_2 ρ²_∞ max{‖β*_j‖_2, 1} ‖β*_j‖_2 / √p ),

where η, G, L, γ, M, c̄_j, r, κ_x, c_j are assumed to be Θ(1) and are thus omitted in the Big-O and Ω notations (see the Appendix for the explicit forms).

Draft Proof. We first prove the following lemma (see the Appendix for its proof):

Lemma 4. Under the assumptions in Theorem 5, with probability at least 1 − k exp(−p), for all j ∈ [k],

|ĉ_j − c̄_j| ≤ O( M^{-1} G L c̄²_j κ²_x γ ‖Σ^{1/2}‖_2 ‖β^{ols}_j‖_2 √(p/n) ).

For convenience we will assume G, L, γ, M, c̄_j, r, κ_x, c_j = Θ(1). For each β̂^{nlr}_j we have

‖β̂^{nlr}_j − β*_j‖_∞ ≤ ‖ĉ_j β̂^{ols}_j − c_j β^{ols}_j‖_∞ + ‖c_j β^{ols}_j − β*_j‖_∞
≤ ‖ĉ_j β̂^{ols}_j − c̄_j β^{ols}_j‖_∞ + ‖c̄_j β^{ols}_j − c_j β^{ols}_j‖_∞ + ‖c_j β^{ols}_j − β*_j‖_∞. (7)

To bound the terms in (7) we first bound |c̄_j − c_j|. By definition we have c_j E[f′_j(⟨x, β*_j⟩)] = 1 and c̄_j E[f′_j(⟨x, β^{ols}_j⟩ c̄_j)] = 1, so we get

|ℓ_j(c̄_j, β^{ols}_j) − ℓ_j(c_j, β^{ols}_j)| = |1 − ℓ_j(c_j, β^{ols}_j)| = |c_j E[f′_j(⟨x, β*_j⟩)] − c_j E[f′_j(⟨x, β^{ols}_j⟩ c_j)]|
≤ G |c_j| E[|⟨x, β*_j − c_jβ^{ols}_j⟩|] ≤ G c_j ‖β*_j − c_jβ^{ols}_j‖_∞ E[‖x‖_1] ≤ G c_j κ_x ‖β*_j − c_jβ^{ols}_j‖_∞,

where the last inequality is by the definition of the sub-Gaussian norm (see Definition 2). Thus, by the assumption on the derivative of ℓ_j(c, β^{ols}_j) over [0, max{c̄_j, c_j}], we have

M |c̄_j − c_j| ≤ |ℓ_j(c̄_j, β^{ols}_j) − ℓ_j(c_j, β^{ols}_j)| ≤ G c_j κ_x ‖β*_j − c_jβ^{ols}_j‖_∞,

and further, by Theorem 4,

|c̄_j − c_j| ≤ O( ρ_2 ρ_∞ ‖β*_j‖_2 / √p ). (8)

For the second term in (7), by (8) we have

‖c̄_j β^{ols}_j − c_j β^{ols}_j‖_∞ ≤ O( ρ_2 ρ_∞ ‖β^{ols}_j‖_∞ ‖β*_j‖_2 / √p ). (9)

For the first term in (7) we have

‖ĉ_j β̂^{ols}_j − c̄_j β^{ols}_j‖_∞ ≤ ĉ_j ‖β̂^{ols}_j − β^{ols}_j‖_∞ + |ĉ_j − c̄_j| ‖β^{ols}_j‖_∞
≤ O( (c̄_j + ‖Σ^{1/2}‖_2 ‖β^{ols}_j‖_2 √(p/n)) λ^{-1}_{min}(Σ) √(p/n) + ‖Σ^{1/2}‖_2 ‖β^{ols}_j‖_2 √(p/n) ‖β^{ols}_j‖_∞ ) (10)
≤ O( √(p/n) max{1, ‖β^{ols}_j‖_∞} ),
(11)

where the last inequality is due to Lemma 4, Lemma 1 and the assumption on n. By Theorem 4, the third term of (7) is bounded as

‖c_j β^{ols}_j − β*_j‖_∞ ≤ O( ρ_2 ρ_∞ ‖β*_j‖_2 / √p ). (12)

Combining (9), (11) and (12) gives

‖β̂^{nlr}_j − β*_j‖_∞ ≤ O( ρ_2 ρ_∞ ‖β^{ols}_j‖_∞ ‖β*_j‖_2 / √p + √(p/n) max{1, ‖β^{ols}_j‖_∞} + ρ_2 ρ_∞ ‖β*_j‖_2 / √p ). (13)

By Theorem 4, we have ‖β^{ols}_j‖_∞ ≤ O( ‖β*_j‖_∞ + ρ_2 ρ_∞ ‖β*_j‖_2 / √p ). Plugging this into (13) completes the proof.

Remark 2. Compared with the O(√(p/n)) convergence rate in the ℓ2-norm in Theorem 2, Theorem 5 shows that in the sub-Gaussian case the convergence rate of the estimation error is O(1/√p + √(p/n)) in the ℓ∞-norm (if we omit the other terms). This is due to the approximation error in Theorem 4. Moreover, compared with the assumptions on the link functions in Theorem 2, Theorem 5 requires additional assumptions.

In order to reduce the time complexity and make the algorithm faster, we can also use the sub-sampled covariance OLS estimator, exactly as in the Gaussian case.

Theorem 6. Under the same assumptions as in Theorem 5, if we use Option 2 in Algorithm 1, then with probability at least 1 − 4k exp(−p) the outputs {β̂^{nlr}_j}^k_{j=1} satisfy, for each j ∈ [k],

‖β̂^{nlr}_j − β*_j‖_∞ ≤ O( ρ_2 ρ_∞ λ^{-1}_{min}(Σ) max{1, ‖β*_j‖_∞} √(p/|S|) + ρ_2 ρ²_∞ max{‖β*_j‖_2, 1} ‖β*_j‖_2 / √p ).

Experiments

We conduct experiments on three types of link functions: polynomial, sigmoid and logistic, as well as arbitrary mixes of them. Formally, the polynomial link functions include f(x) = x, x³, x⁵; the sigmoid link function is f(x) = (1 + e^{−x})^{−1}; the logistic link function is f(x) = log(1 + e^x). Due to the statistical setting we focus on in this paper, we run our algorithm on synthetic data; the same experimental setting has been used in previous work such as (Na et al. 2018; Yang et al. 2017; Erdogdu, Dicker, and Bayati 2016).

[Figure 1: Single type of link function f(x) = x³. Panels (a) Gaussian data and (b) sub-Gaussian data show the relative error max_{i∈[k]} ‖β̂_i − β*_i‖/‖β*_i‖ (ℓ2 norm for Gaussian data, ℓ∞ for sub-Gaussian data) versus the sample size n (p = 20, |S| = n) and versus the sub-sampling ratio |S|/n (p = 20, n = 500000), for k = 5, 10, 15.]
[Figure 2: Single type of link function f(x) = (1 + e^{−x})^{−1}; same layout as Figure 1.]
[Figure 3: Single type of link function f(x) = log(1 + e^x); same layout as Figure 1.]
[Figure 4: Mixed types of link functions on sub-Gaussian data: (a) f(x) ∈ {x, x³, x⁵}; (b) f(x) ∈ {x³, (1 + e^{−x})^{−1}, log(1 + e^x)}.]

Experimental setting. We sample all coefficients z_{i,j} and the noise ϵ i.i.d. from the standard Gaussian distribution N(0, 1) in every experiment. Each β*_j is generated by sampling from N(1, 16 I_p). We consider two distributions for generating x: a Gaussian distribution and a uniform distribution (corresponding to the sub-Gaussian case). To satisfy the requirements of Theorem 6, in both cases the variance is scaled by 1/p, where p is the data dimension; the same scaling is used in (Erdogdu, Dicker, and Bayati 2016). Thus, in the Gaussian case x ∼ N(0, (1/p) I_p), while in the sub-Gaussian case x is sampled from a uniform distribution, i.e., x ∼ U([−1/p, 1/p])^p. Finally, given the list F = [f_1, ..., f_k] of link functions, the response y is computed via (1).
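For concreteness, a minimal sketch of this data-generating process (our own code, with hypothetical function and variable names) could look as follows; the scaling of the two distributions follows the description above.

```python
import numpy as np

def make_synthetic_data(n, p, k, links, gaussian=True, seed=0):
    """Generate (X, Z, Y, betas) following the experimental setting above."""
    rng = np.random.default_rng(seed)
    betas = rng.normal(1.0, 4.0, size=(k, p))              # beta*_j ~ N(1, 16 I_p)
    if gaussian:
        X = rng.normal(0.0, np.sqrt(1.0 / p), size=(n, p))  # x ~ N(0, (1/p) I_p)
    else:
        X = rng.uniform(-1.0 / p, 1.0 / p, size=(n, p))     # sub-Gaussian (uniform) case
    Z = rng.standard_normal((n, k))                         # z_{i,j} ~ N(0, 1)
    eps = rng.standard_normal(n)                            # noise ~ N(0, 1)
    Y = sum(Z[:, j] * links[j](X @ betas[j]) for j in range(k)) + eps  # Model (1)
    return X, Z, Y, betas

# e.g., k copies of the cubic link:
# X, Z, Y, betas = make_synthetic_data(100000, 20, 5, [lambda t: t**3] * 5)
```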
It is worth noting that experimental results with different numbers of link functions k are not directly comparable, since changing k also changes Model (1). The experiments are divided into two parts, examining how the sample size n and the size of the sub-sample set S affect the performance of the algorithm. In the first part we vary n from 100,000 to 500,000 with fixed p = 20 and |S| = n, while in the second part we vary |S| from 0.01n to n with fixed n = 500,000 and p = 20. In each part we test the algorithm against various data distribution/link function combinations. For each experiment, in order to match our theoretical analysis, we use the (maximal) relative error as the measurement: when x is Gaussian we use max_{i∈[k]} ‖β̂_i − β*_i‖_2/‖β*_i‖_2, and when x is sub-Gaussian we use max_{i∈[k]} ‖β̂_i − β*_i‖_∞/‖β*_i‖_∞. Each experiment is repeated 20 times and we report the average as the final output.

Experiment results. Each of Figures 1-3 illustrates the results for a single type of link function. We can see that the relative error decreases steadily as the sample size n grows, which matches the O(1/√n) convergence rate stated in our theorems. Also, the size of S does not affect the final relative error much once |S| is large enough: in all cases, choosing |S| ≥ 0.2n is sufficient to achieve a relative error roughly the same as when |S| = n, which is also what our theorems predict.

We further investigate the case where F contains different types of link functions. In Figure 4a, we let F contain polynomials of different degrees (x, x³, x⁵), with roughly k/3 functions of each degree. Similarly, in Figure 4b we mix polynomial links with the other two types of links, i.e., the sigmoid and logistic links, again with roughly k/3 functions of each type. Our algorithm achieves performance similar to that in the previous settings.

Conclusion

We studied a new model called stochastic linear combination of non-linear regressions and provided the first estimation error bounds for both the Gaussian and bounded sub-Gaussian cases. Our algorithm is based on Stein's lemma and its generalization, the zero-bias transformation. Moreover, we used sub-sampling of the covariance matrix to accelerate the algorithm. Finally, we conducted experiments whose results support our theoretical analysis.

Acknowledgements

The research of the first and last authors was supported in part by NSF through grants CCF-1716400 and IIS-1910492.

References

Beck, A., and Eldar, Y. C. 2013. Sparse signal recovery from nonlinear measurements. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 5464-5468. IEEE.
Brillinger, D. R. 1982. A generalized linear model with Gaussian regressor variables. A Festschrift for Erich L. Lehmann, 97.
Cook, R. D., and Lee, H. 1999. Dimension reduction in binary response regression. Journal of the American Statistical Association 94(448):1187-1200.
Dhillon, P.; Lu, Y.; Foster, D. P.; and Ungar, L. 2013. New subsampling algorithms for fast least squares regression.
In Advances in Neural Information Processing Systems, 360-368.
Erdogdu, M. A.; Bayati, M.; and Dicker, L. H. 2019. Scalable approximations for generalized linear problems. The Journal of Machine Learning Research 20(1):231-275.
Erdogdu, M. A.; Dicker, L. H.; and Bayati, M. 2016. Scaled least squares estimator for GLMs in large-scale problems. In Advances in Neural Information Processing Systems, 3324-3332.
Erdogdu, M. A. 2016. Newton-Stein method: an optimization method for GLMs via Stein's lemma. The Journal of Machine Learning Research 17(1):7565-7616.
Fan, J., and Zhang, W. 2008. Statistical methods with varying coefficient models. Statistics and its Interface 1(1):179.
Goldstein, L.; Reinert, G.; et al. 1997. Stein's method and the zero bias transformation with application to simple random sampling. The Annals of Applied Probability 7(4):935-952.
Horowitz, J. L. 2009. Semiparametric and Nonparametric Methods in Econometrics, volume 12. Springer.
Ichimura, H. 1993. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics 58(1-2):71-120.
Kakade, S. M.; Kanade, V.; Shamir, O.; and Kalai, A. 2011. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, 927-935.
Li, K.-C.; Duan, N.; et al. 1989. Regression analysis under link violation. The Annals of Statistics 17(3):1009-1052.
Li, K.-C. 1991. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86(414):316-327.
Li, K.-C. 1992. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association 87(420):1025-1039.
Liu, Q., and Wang, D. 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2378-2386.
Ma, S., and Song, P. X.-K. 2015. Varying index coefficient models. Journal of the American Statistical Association 110(509):341-356.
Na, S.; Yang, Z.; Wang, Z.; and Kolar, M. 2018. High-dimensional index volatility models via Stein's identity. arXiv preprint arXiv:1811.10790.
Nitanda, A., and Suzuki, T. 2019. Refined generalization analysis of gradient descent for over-parameterized two-layer neural networks with smooth activations on classification problems. arXiv preprint arXiv:1905.09870.
Radchenko, P. 2015. High dimensional single index models. Journal of Multivariate Analysis 139:266-282.
Wang, D., and Xu, J. 2018. Large scale constrained linear regression revisited: Faster algorithms via preconditioning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Wei, X.; Yang, Z.; and Wang, Z. 2019. On the statistical rate of nonlinear recovery in generative models with heavy-tailed data. In International Conference on Machine Learning, 6697-6706.
Yang, Z.; Balasubramanian, K.; and Liu, H. 2017. High-dimensional non-Gaussian single index models via thresholded score function estimation. In Proceedings of the 34th International Conference on Machine Learning, 3851-3860. JMLR.org.
Yang, Z.; Wang, Z.; Liu, H.; Eldar, Y.; and Zhang, T. 2016. Sparse nonlinear regression: parameter estimation under nonconvexity. In International Conference on Machine Learning, 2472-2481.
Yang, Z.; Balasubramanian, K.; Wang, Z.; and Liu, H. 2017. Learning non-Gaussian multi-index model via second-order Stein's method.
Advances in Neural Information Processing Systems 30:6097-6106.
Zhang, X.; Yu, Y.; Wang, L.; and Gu, Q. 2019. Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, 1524-1534.
Zhang, K.; Yang, Z.; and Wang, Z. 2018. Nonlinear structured signal estimation in high dimensions via iterative hard thresholding. In International Conference on Artificial Intelligence and Statistics, 258-268.
Zou, D.; Cao, Y.; Zhou, D.; and Gu, Q. 2018. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.