# Accelerated Sparse Linear Regression via Random Projection

Weizhong Zhang, Lijun Zhang, Rong Jin, Deng Cai, Xiaofei He
State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, China
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Alibaba Group, Seattle, USA
{zhangweizhongzju, dengcai, xiaofeihe}@gmail.com, zhanglj@lamda.nju.edu.cn, jinrong.jr@alibaba-inc.com

## Abstract

In this paper, we present an accelerated numerical method based on random projection for sparse linear regression. Previous studies have shown that, under appropriate conditions, gradient-based methods enjoy a geometric convergence rate when applied to this problem. However, the time complexity of evaluating the gradient is as large as $O(nd)$, where $n$ is the number of data points and $d$ is the dimensionality, making those methods inefficient for large-scale, high-dimensional datasets. To address this limitation, we first use random projection to find a rank-$k$ approximation of the data matrix, which reduces the cost of gradient evaluation to $O(nk + dk)$, a significant improvement when $k$ is much smaller than $d$ and $n$. We then solve the sparse linear regression problem with a proximal gradient method combined with a homotopy strategy that generates sparse intermediate solutions. Theoretical analysis shows that our method also achieves a global geometric convergence rate and that the sparsity of all intermediate solutions is well bounded over the iterations. Finally, we conduct experiments to demonstrate the efficiency of the proposed method.

## Introduction

In this study, we focus on developing an efficient algorithm based on random projection for sparse linear regression (Huang, Ma, and Zhang 2008), which is closely related to sparse recovery (Becker, Bobin, and Candès 2011) and compressed sensing (Candes and Tao 2006). Suppose we observe the data matrix $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ and the corresponding response vector $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$, where each $x_i \in \mathbb{R}^d$ is the vector representation of the $i$-th observation and $y_i$ is its response value. The goal of sparse linear regression is to find a sparse linear prediction function $f(x) = x^T\beta$ such that each $y_i$ is well approximated by $f(x_i)$, where $\beta \in \mathbb{R}^d$ is a sparse vector of model coefficients. In this work, we are interested in estimating $\beta$ in the special case where the dimension $d$ of $x_i$ is very high and can be much larger than the sample size, i.e. $d > n$, so the problem is under-determined.

One common approach is to learn $\beta$ by solving an $\ell_1$-regularized least-squares ($\ell_1$-LS) problem, known as Lasso in statistics (Tibshirani 1996),

$$\min_{\beta \in \mathbb{R}^d} \ \frac{1}{2n}\|y - X^T\beta\|_2^2 + \lambda\|\beta\|_1 \tag{1}$$

where $\lambda$ is the regularization parameter that controls the sparsity of the model. Here $\|\cdot\|_2$ denotes the standard $\ell_2$ norm of a vector, and $\|\beta\|_1 = \sum_{i=1}^d |\beta_i|$ is the $\ell_1$ norm.
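As a quick reference for the notation above, the following minimal NumPy sketch (ours, not part of the original paper) evaluates the $\ell_1$-LS objective (1) for a given $\beta$; the variable names `X`, `y`, `beta` and `lam` are placeholders matching the symbols in the text.

```python
import numpy as np

def l1_ls_objective(X, y, beta, lam):
    """Evaluate the l1-regularized least-squares objective (1).

    X    : (d, n) data matrix, columns are observations x_i
    y    : (n,)   response vector
    beta : (d,)   coefficient vector
    lam  : float  regularization parameter lambda
    """
    n = X.shape[1]
    residual = y - X.T @ beta                  # y - X^T beta
    return 0.5 / n * residual @ residual + lam * np.abs(beta).sum()

# toy usage with random data (illustration only)
rng = np.random.default_rng(0)
d, n = 100, 20
X = rng.standard_normal((d, n))
beta = np.zeros(d); beta[:5] = 1.0             # sparse ground truth
y = X.T @ beta + 0.01 * rng.standard_normal(n)
print(l1_ls_objective(X, y, beta, lam=0.1))
```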
The $\ell_1$-LS problem has been studied extensively due to its important applications in machine learning, signal processing and statistics, and many algorithms (Figueiredo, Nowak, and Wright 2007; Kim et al. 2007; Nesterov 2013; Xiao and Zhang 2013) have been developed for efficiently solving the optimization problem (1). For instance, in (Xiao and Zhang 2013), the authors propose a proximal gradient homotopy method that solves a sequence of optimization problems of the form (1), with the regularization parameter $\lambda$ decreasing along the sequence. They prove that, under some common assumptions, this approach achieves a global geometric convergence rate and that the intermediate solutions remain sparse over the iterations. However, despite the provable fast convergence rates, most of these algorithms need to compute $XX^T\beta_t$ for the gradient evaluation at each iteration, where $\beta_t$ is the intermediate solution at the $t$-th iteration. This costs $O(nd)$ flops for a generic dense matrix $X$, leading to a high computational cost when both $n$ and $d$ are large.

In this study, we adopt the key idea of randomized matrix algorithms (Mahoney 2011; Zhang et al. 2013; 2014), which are highly efficient for data-intensive problems, to develop a novel efficient algorithm for sparse linear regression. A variety of randomized matrix algorithms (Drineas et al. 2004; Frieze, Kannan, and Vempala 2004; Sarlos 2006; Drineas, Kannan, and Mahoney 2006; Drineas, Mahoney, and Muthukrishnan 2008; Mahoney and Drineas 2009; Drineas et al. 2011; Lu et al. 2013) have been developed in recent years to accelerate large-scale data analysis. They typically construct a randomized sketch of the input matrix $X$ to reduce the scale of the original problem, and then solve the downsized problem to obtain an approximate solution. Based on how the sketch is constructed, they can be roughly divided into two categories: random sampling based methods and random projection based methods. In random sampling based methods (Drineas, Mahoney, and Muthukrishnan 2008; Mahoney and Drineas 2009; Drineas et al. 2011), the sketch consists of columns sampled from the columns of $X$ according to a specific probability distribution, while in random projection based methods the columns of the sketch are linear combinations of the columns of $X$. In our method, we adopt random projection instead of random sampling, since our problem is under-determined and sampling would make it worse. To the best of our knowledge, this is the first attempt to exploit randomization to speed up sparse linear regression.

To be precise, we develop an efficient two-phase algorithm based on random projection for sparse linear regression. In phase one, we approximate $X$ with a low-rank matrix $\tilde{X} = (QW)^T$ obtained by random projection, where $Q \in \mathbb{R}^{n \times k}$ and $W \in \mathbb{R}^{k \times d}$. Thus $XX^T\beta_t$ can be approximated by $W^TQ^TQW\beta_t$, and the computational cost is reduced from $O(nd)$ to $O(dk + nk)$. When $k \ll \min\{n, d\}$, this improvement in efficiency is significant. If the data matrix $X$ is low-rank or approximately low-rank, which is common in real applications, $\tilde{X}$ is a good approximation. In phase two, based on the approximate matrix $\tilde{X}$, we update the vector $\beta_t$ at each iteration with a proximal gradient method using a homotopy strategy, similar to the method in (Xiao and Zhang 2013).

After this rough sketch of our method, we summarize the main contributions of this work:

- We reduce the computational complexity of each iteration from $O(nd)$ to $O(dk + nk)$ while preserving the global geometric convergence rate (see the sketch following this list).
- Compared to the recent well-known method of (Xiao and Zhang 2013), we need to run the composite gradient mapping only once for each regularization parameter.
- The sparsity of the intermediate solutions in our approach is well bounded: the number of nonzero entries in $\beta_t$ is no more than twice that of $\beta^*$.
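To make the claimed cost reduction concrete, here is a minimal NumPy sketch (our illustration, not code from the paper) of how the gradient term can be evaluated through the factors $W$ and $Q$ instead of through $X$ itself. All variable names are placeholders for the symbols defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 2000, 1000, 50

# synthetic data that is exactly rank k (for illustration only)
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))
y = rng.standard_normal(n)
beta = rng.standard_normal(d)

# Phase-one factors: Q sketches the row space of X, W = (XQ)^T, so X_tilde = W^T Q^T
Z = rng.standard_normal((d, k))
Q, _ = np.linalg.qr(X.T @ Z)          # Q: (n, k), orthonormal columns
W = (X @ Q).T                          # W: (k, d)

# Exact gradient of f(beta) = (1/2n)||X^T beta - y||^2:  O(nd) per evaluation
grad_exact = X @ (X.T @ beta - y) / n

# Approximate gradient using X_tilde = W^T Q^T:          O(dk + nk) per evaluation
#   X_tilde (X_tilde^T beta - y) = W^T (W beta) - W^T (Q^T y)   since Q^T Q = I
grad_approx = (W.T @ (W @ beta) - W.T @ (Q.T @ y)) / n

print(np.linalg.norm(grad_exact - grad_approx) / np.linalg.norm(grad_exact))
```

Because the synthetic $X$ here is exactly rank $k$, the relative difference printed at the end is at the level of numerical round-off; for approximately low-rank data the gap is governed by the approximation bound given later (Theorem 1).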
## Related Work

In this section, we briefly review recent work on optimization methods for the $\ell_1$-LS problem, and on randomized matrix algorithms for the least-squares problem and low-rank matrix approximation.

### Previous Algorithms for the ℓ1-LS Problem

Traditional methods (Daubechies, Defrise, and De Mol 2003; Nesterov 2013; Hale, Yin, and Zhang 2008; Wright, Nowak, and Figueiredo 2009) developed over the past decades for problem (1) are usually based on the composite gradient mapping (Nesterov 2004), and we call them proximal gradient methods. Given the current solution $\beta_t$, they update the solution by the rule

$$\beta_{t+1} = \arg\min_{\beta}\Big\{f(\beta_t) + \nabla f(\beta_t)^T(\beta - \beta_t) + \frac{L_t}{2}\|\beta - \beta_t\|_2^2 + \lambda\|\beta\|_1\Big\}, \tag{2}$$

where $f(\beta_t) = \frac{1}{2n}\|X^T\beta_t - y\|_2^2$ and $L_t$ is a parameter chosen at each iteration. Their major computational effort goes into computing the gradient $\nabla f(\beta_t) = \frac{1}{n}X(X^T\beta_t - y)$, which costs $O(nd)$ flops for a generic dense matrix $X$. Since in the under-determined case the objective function is not strongly convex, the convergence rate of traditional proximal gradient methods is only $O(1/T)$, where $T$ is the number of iterations. Accelerated algorithms (Bioucas-Dias and Figueiredo 2007; Wright, Nowak, and Figueiredo 2009; Wen et al. 2010; Becker, Bobin, and Candès 2011) have been proposed to improve the convergence rate to $O(1/T^2)$ by generating several concurrent sequences of iterates.

Recently, a proximal gradient homotopy method (Xiao and Zhang 2013) was proposed to further improve the efficiency of existing methods. Its key idea is to solve a sequence of $\ell_1$-LS problems over stages, with the value of the regularization parameter $\lambda$ decreasing exponentially from stage to stage. The main benefit of the homotopy strategy is that all the solutions along the homotopy path are sparse, which makes the objective function strongly convex along this path, so the convergence rate can be improved to be geometric. The limitation of this method is that the number of times it updates the solution within each intermediate problem is not known in advance and is always more than one. (Zhang et al. 2015) adopt this idea for compressive sensing and, in their approach, update the solution only once for each regularization parameter. However, unlike in our problem, the sensing matrix $X$ in compressive sensing is pre-designed, e.g. a sub-Gaussian random matrix, so their theoretical analysis may not carry over to our case. More importantly, the computational complexity of each iteration is still $O(nd)$ in all these accelerated algorithms, which can be prohibitive when both $n$ and $d$ are large.

### Random Algorithms for the Least-Squares Problem and Low-Rank Matrix Approximation

Below we briefly introduce some randomized matrix algorithms for the least-squares problem and the low-rank approximation problem, since they are closely related to the key idea of our method and are at the center of recent developments. To accelerate existing algorithms for the least-squares problem

$$\min_{\beta \in \mathbb{R}^d} \ \frac{1}{2n}\|y - X^T\beta\|_2^2, \tag{3}$$

where $X \in \mathbb{R}^{d \times n}$, $y \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^d$, researchers (Drineas et al. 2011; Sarlos 2006) typically find an approximate solution by solving the following approximate problem
$$\tilde{\beta}_{opt} = \arg\min_{\beta \in \mathbb{R}^d} \ \frac{1}{2n}\|Z(y - X^T\beta)\|_2^2, \tag{4}$$

where $Z \in \mathbb{R}^{l \times n}$ is a sketching matrix and the parameter $l$ is usually much smaller than $n$. Since $l \ll n$, problem (4) is much smaller than problem (3). For random sampling based algorithms (Drineas et al. 2011), $Z$ is a data-dependent random sampling matrix constructed from an importance sampling distribution that captures the non-uniformity structure of $X$, defined by the statistical leverage scores (Mahoney and Drineas 2009). For random projection based algorithms (Drineas et al. 2011), $Z$ is a data-independent random projection used to construct the sketch, and its entries are the coefficients of the linear combinations. It has been shown that such randomized algorithms can find an approximate solution $\tilde{\beta}_{opt}$ in $O\!\left(nd\log(d/\epsilon) + d^3\log^2 n/\epsilon\right)$ time with error

$$\|\tilde{\beta}_{opt} - \beta_{opt}\|_2 \le \sqrt{\epsilon}\,\kappa(X)\sqrt{\gamma^{-2} - 1}\,\|\beta_{opt}\|_2,$$

where $\kappa(X)$ is the condition number of $X$ and $\gamma$ is a parameter measuring the fraction of the mass of $y$ that lies inside the column space of $X^T$. Compared with the $O(nd^2)$ cost of traditional methods for the least-squares problem (3), such as Cholesky decomposition, QR decomposition and the Singular Value Decomposition (SVD), they are indeed much more efficient. Unlike our problem, however, the least-squares problem they consider is over-constrained, i.e., $n > d$, and they do not care about the sparsity of the solution.

In low-rank matrix approximation problems, given a $d \times n$ matrix $X$ with $\mathrm{rank}(X) = r \ll \min\{d, n\}$, the goal is to find a good rank-$k$ approximation $\tilde{X}$ of $X$ with $k \le r$. As is well known, the best approximation with respect to the Frobenius norm can be obtained from the SVD. To be precise, if the SVD of $X$ is $X = U\Sigma V^T$, where $U = (U_k, U_{k,\perp}) = [u_1, u_2, \ldots, u_r] \in \mathbb{R}^{d \times r}$ and $V = (V_k, V_{k,\perp}) = [v_1, v_2, \ldots, v_r] \in \mathbb{R}^{n \times r}$ are column-orthonormal matrices and $\Sigma = \mathrm{diag}(\Sigma_k, \Sigma_{k,\perp}) = \mathrm{diag}(\sigma_1, \ldots, \sigma_r) \in \mathbb{R}^{r \times r}$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$, then $\tilde{X} = U_k\Sigma_k V_k^T$ is the best rank-$k$ approximation of $X$. Since the computational complexity of the SVD is typically $O(d^2 n + n^3)$, which is prohibitive in high-dimensional and data-intensive applications, randomized algorithms (Drineas et al. 2004; Frieze, Kannan, and Vempala 2004; Drineas, Kannan, and Mahoney 2006) have been developed to accelerate this procedure. These methods usually use random sampling (Drineas, Mahoney, and Muthukrishnan 2008; Mahoney and Drineas 2009) or random projection to identify a subspace $C_k$ that captures most of the action of the matrix $X$, and then project $X$ onto $C_k$ to obtain the desired low-rank approximation $\tilde{X}$. Their main steps are as follows:

- Generate a random sampling or random projection matrix $Z \in \mathbb{R}^{n \times (k+p)}$ with $p \ge 0$.
- Compute the sketch $C$ of $X$: $C = XZ$.
- Obtain the rank-$k$ approximation $\tilde{X}$ by projecting $X$ onto $C_k$: $\tilde{X} = P_{C_k}X$, where $C_k$ is the best rank-$k$ approximation of the matrix $C$ and $P$ is the projection operator.

In these randomized algorithms, the relative error bound $\|X - \tilde{X}\| \le (1+\epsilon)\|X - P_{U_k}X\|$ is often used to measure their precision, where $\|\cdot\|$ is the Frobenius or spectral norm of the matrix. Their computational complexity can be reduced to $O(dnk)$, or even $O(dn\log k)$ when $Z$ is a structured matrix, which is a quite significant improvement in practice. In addition, compared to traditional methods, they are much more robust and can be reorganized to exploit parallel computing architectures.
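The three steps above can be turned into a few lines of NumPy. The sketch below is our illustration of this generic recipe (not any specific paper's code); it uses a Gaussian test matrix and an SVD of the small sketch $C$ to form the rank-$k$ projection.

```python
import numpy as np

def randomized_low_rank(X, k, p=5, rng=None):
    """Rank-k approximation of X (d x n) via a random sketch.

    Generic recipe: sketch C = X Z, take the best rank-k subspace C_k of C,
    and project X onto it.
    """
    rng = np.random.default_rng(rng)
    d, n = X.shape
    Z = rng.standard_normal((n, k + p))        # random projection matrix, p = oversampling
    C = X @ Z                                  # sketch, d x (k+p)
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    Uk = U[:, :k]                              # orthonormal basis of the rank-k subspace C_k
    return Uk @ (Uk.T @ X)                     # projection P_{C_k} X

# quick check on an approximately low-rank matrix
rng = np.random.default_rng(0)
d, n, true_rank = 500, 300, 20
X = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, n))
X += 1e-3 * rng.standard_normal((d, n))        # small full-rank noise
X_tilde = randomized_low_rank(X, k=20)
print(np.linalg.norm(X - X_tilde) / np.linalg.norm(X))
```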
## Preliminaries and Notation

In this section, we present the mathematical setup and a few notions that are necessary for the theoretical analysis of our method and widely used in sparse linear regression, Lasso and compressed sensing.

First, we assume that there exists a sparse linear prediction function $f^*(x) = x^T\beta^*$, where $\beta^*$ is a sparse vector with $\|\beta^*\|_0 = s \ll d$, such that each $y_i$ is well approximated by $f^*(x_i)$. To this end, we define $\varepsilon$ through the mean squared regression error made by $f^*(\cdot)$ over the training examples, i.e.

$$\varepsilon^2 = \frac{1}{n}\sum_{i=1}^{n}\big(f^*(x_i) - y_i\big)^2.$$

Since $f^*(\cdot)$ is assumed to be an accurate prediction function, we assume that $\varepsilon$ is small. Our goal is to efficiently recover the sparse vector $\beta^*$ and consequently the sparse linear prediction function $f^*(\cdot)$.

Then, following (Koltchinskii 2011), we introduce a restricted eigenvalue condition and a restricted correlation condition (Bickel, Ritov, and Tsybakov 2009) for the feature matrix $X$.

**Definition 1 (Restricted Eigenvalue)** Given an integer $s > 0$, we say that a matrix $X$ satisfies the restricted eigenvalue condition at sparsity level $s$ if there exist positive constants $\gamma_\ell$ and $\gamma_u$ such that

$$\gamma_\ell := \inf\Big\{\tfrac{1}{\sqrt{n}}\|X^T u\|_2 : \|u\|_2 = 1,\ \|u\|_0 \le s\Big\},\qquad
\gamma_u := \sup\Big\{\tfrac{1}{\sqrt{n}}\|X^T u\|_2 : \|u\|_2 = 1,\ \|u\|_0 \le 2s\Big\}.$$

**Definition 2 (Restricted Correlation)** Let $S(\beta)$ denote the support set of $\beta$, i.e., the set of indices of its nonzero components. Given an integer $s > 0$, we define the restricted correlation of the data matrix $X$ as

$$\rho := \max\Big\{\frac{|\beta^T X X^T \beta'|}{\|X^T\beta\|_2\,\|X^T\beta'\|_2} : \|\beta\|_0 \le 2s,\ \|\beta'\|_0 \le s,\ S(\beta)\cap S(\beta') = \emptyset\Big\}.$$

In the sequel, we assume that our data matrix $X$ satisfies the restricted eigenvalue and restricted correlation conditions with parameters $\gamma_\ell$, $\gamma_u$ and $\rho$, respectively.

## Accelerated Sparse Linear Regression via Random Projection

As described in the introduction, our method consists of two steps. It first computes the approximation $\tilde{X}$ of the data matrix $X$. Then, using the homotopy strategy, it recovers the optimal sparse vector $\beta^*$ by solving a sequence of subproblems of the form (5). In this section, we discuss the key ideas and the detailed steps (Algorithm 1) of our approach.

In the first step, we use randomized matrix algorithms to obtain a low-rank approximation of $X$ efficiently. Since our problem is highly under-determined ($n < d$), random sampling would only make it worse, so we adopt a random projection rather than a random sampling based method to approximate $X$ with a low-rank matrix $\tilde{X}$ (see Algorithm 4.1 in (Halko, Martinsson, and Tropp 2011)). We first generate a Gaussian random matrix $Z \in \mathbb{R}^{d \times k}$, where $k \ll d$ and each entry $Z_{ij}$ is sampled independently from the Gaussian distribution $N(0, 1)$. We then compute the QR decomposition of $X^T Z$, i.e. $X^T Z = QR$, where $Q \in \mathbb{R}^{n \times k}$ is an orthonormal matrix of rank $k$ and $R \in \mathbb{R}^{k \times k}$ is an upper triangular matrix. In fact, the columns of $Q$ form an orthonormal basis for part of the column space of $X^T$. We finally approximate $X$ by projecting $X^T$ onto the space spanned by the columns of $Q$, i.e., $\tilde{X} = XQQ^T = W^T Q^T$, where $W = (XQ)^T \in \mathbb{R}^{k \times d}$.

**Algorithm 1** Accelerated Sparse Linear Regression via Random Projection (SLRviaRP)

1. Input: the data matrix $X$, $y$, $\lambda_0$, $\lambda_{\min}$, $\gamma$, $k$, $\eta$
2. // Compute the approximation $\tilde{X}$:
3. Sample a $d \times k$ random matrix $Z$ with $Z_{ij} \sim N(0, 1)$
4. Compute the QR decomposition of $X^T Z$, i.e., $X^T Z = QR$
5. Approximate $X$ by $\tilde{X} = (XQ)Q^T = W^T Q^T$
6. // Recover the sparse vector $\beta^*$:
7. Initialize $\beta_0 = 0$
8. for $t = 0$ to $T-1$ do
9. &nbsp;&nbsp;&nbsp;Update: $\beta_{t+1} = \arg\min_{\beta \in \mathbb{R}^d} -\frac{1}{n}(\beta - \beta_t)^T\tilde{X}(y - \tilde{X}^T\beta_t) + \lambda_t\|\beta\|_1 + \frac{\gamma}{2}\|\beta - \beta_t\|_2^2$, where $\lambda_t = \max(\lambda_{\min}, \lambda_0\eta^t)$
10. end for
11. Return: $\beta_T$
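The following NumPy sketch is our reading of Algorithm 1, not the authors' code: the hyper-parameter names mirror the algorithm, and the update in step 9 is carried out through its soft-thresholding closed form (derived in the remark after the Main Results subsection).

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding operator S_tau(v), the proximal map of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def slr_via_rp(X, y, lam0, lam_min, gamma, k, eta, T=100, rng=None):
    """Sketch of Algorithm 1 (SLRviaRP): random projection + homotopy proximal steps."""
    rng = np.random.default_rng(rng)
    d, n = X.shape

    # Phase 1: rank-k approximation X_tilde = W^T Q^T
    Z = rng.standard_normal((d, k))
    Q, _ = np.linalg.qr(X.T @ Z)             # Q: (n, k), orthonormal columns
    W = (X @ Q).T                            # W: (k, d)
    WQty = W.T @ (Q.T @ y)                   # precompute W^T Q^T y, cost O(nk + dk)

    # Phase 2: homotopy proximal gradient updates
    beta = np.zeros(d)
    for t in range(T):
        lam_t = max(lam_min, lam0 * eta ** t)
        grad = (W.T @ (W @ beta) - WQty) / n  # gradient of (1/2n)||X_tilde^T beta - y||^2
        # step 9 of Algorithm 1, solved in closed form by soft-thresholding
        beta = soft_threshold(beta - grad / gamma, lam_t / gamma)
    return beta
```

Since $Q^T y$ is precomputed once, each iteration only touches the $k \times d$ matrix $W$, so the per-iteration cost is $O(dk)$, consistent with the stated $O(dk + nk)$. With the hyper-parameters reported in the experiment section ($\gamma = \lambda_0 = 0.3$, $\eta = 0.94$, $\lambda_{\min} = 0.002$, $k = 500$), this routine mirrors the synthetic-data setting.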
Finally, we note that the QR decomposition of $X^T Z$ can be computed in about $2nk^2$ floating-point operations, which is acceptable when $k$ is relatively small. Moreover, in many real applications the data matrix $X$ is low-rank or approximately low-rank, so $\tilde{X}$ remains an accurate approximation of $X$ even when $k$ is set to a small integer. We quantify this in Theorem 1.

In the second step, given the approximate low-rank matrix $\tilde{X}$ returned by the first step, we recover the sparse linear model $\beta^*$ with a simple first-order method using a homotopy strategy. At iteration $t$, given the current solution $\beta_t$, we update it by solving the following optimization problem

$$\beta_{t+1} = \arg\min_{\beta \in \mathbb{R}^d}\Big\{-\frac{1}{n}(\beta - \beta_t)^T\tilde{X}(y - \tilde{X}^T\beta_t) + \lambda_t\|\beta\|_1 + \frac{\gamma}{2}\|\beta - \beta_t\|_2^2\Big\} \tag{5}$$

where $\gamma$ is a constant whose value will be discussed later, and $\lambda_t = \lambda_0\eta^t$ is an adaptive regularization parameter that usually decreases over the iterations, with $\eta \in (0, 1)$ controlling the shrinkage speed of $\lambda_t$. We will show that with an appropriate choice of the parameters we can accurately recover the true sparse solution $\beta^*$ and, furthermore, that the sparsity of the intermediate solutions is well bounded, which yields an additional computational saving on top of the low-rank matrix $\tilde{X}$.

### Main Results

In this subsection, we present theoretical results on the convergence rate and the sparsity bound of our method. Due to space limitations, we defer the detailed proofs to the supplementary material.

First, we need the following theorem from (Halko, Martinsson, and Tropp 2011) to bound the difference between the data matrix $X$ and its approximation $\tilde{X}$.

**Theorem 1 (Corollary 10.9 of (Halko, Martinsson, and Tropp 2011))** Assume $p = k - k_0 \ge 4$. Then, with failure probability no more than $3e^{-p}$,

$$\|X - \tilde{X}\| \le \Big(1 + 17\sqrt{1 + \frac{k_0}{p}}\Big)\sigma_{k_0+1} + \frac{8\sqrt{k_0+p}}{p+1}\Big(\sum_{j > k_0}\sigma_j^2\Big)^{1/2},$$

where $\|\cdot\|$ stands for the spectral norm of a matrix and $\sigma_1 \ge \sigma_2 \ge \ldots$ are the singular values of $X$.

The theorem above shows that if $X$ is low-rank or approximately low-rank, i.e., most of the $\sigma_i$ are zero or close to zero, then $\tilde{X}$ is a good approximation of $X$ even if we choose a small integer $k$. For convenience in the theoretical analysis below, we define $\delta = \|\tilde{X} - X\|/\sqrt{n}$. From Theorem 1 we know that, when the data matrix $X$ has a low-rank or approximately low-rank structure, which is very common in real applications, $\delta$ is always a small number, i.e., $\delta = o(1/\sqrt{n})$.

The following theorem characterizes the sparsity of $\beta_t$ over the iterations.

**Theorem 2** Assume $|S(\beta_t)\setminus S(\beta^*)| \le s$ and

$$\lambda_t \ge 2\sqrt{s}\,\max\big\{(\gamma_u+\delta)(\varepsilon+\delta\|\beta^*\|_2),\ \Lambda\|\beta^*-\beta_t\|_2\big\},$$

where $\Lambda = (2\gamma_u+\delta)\delta + \rho\gamma_u^2 + \max\big(|\gamma-\gamma_u^2|, |\gamma-\gamma_\ell^2|\big)$. Then $|S(\beta_{t+1})\setminus S(\beta^*)| \le s$.

In Theorem 2, the assumption $|S(\beta_t)\setminus S(\beta^*)| \le s$ holds at $t = 0$ because we initialize $\beta_0 = 0$. By induction, choosing appropriate $\lambda_t$ over the iterations gives $|S(\beta_{t+1})\setminus S(\beta^*)| \le s$ for all $t \in \mathbb{N}$. Hence the number of nonzero elements in the intermediate solution $\beta_t$ is no more than twice that of $\beta^*$, and the sparsity of $\beta_t$ is consequently well bounded.

Finally, we present the convergence rate of our method in the theorem below.

**Theorem 3** Suppose $\|\beta_t - \beta^*\|_2 \le \max(e_0, \theta_t)$ and $\eta = \frac{4\sqrt{s}}{\gamma}\Lambda < 1$, and let

$$\lambda_t = 2\max\big\{(\gamma_u+\delta)(\varepsilon+\delta\|\beta^*\|_2),\ \Lambda\max(e_0, \theta_t)\big\},$$

where $e_0 = 4\sqrt{s}(\gamma_u+\delta)(\varepsilon+\delta\|\beta^*\|_2)/\gamma$ and $\Lambda$ is as in Theorem 2. Then

$$\|\beta_{t+1} - \beta^*\|_2 \le \max(e_0, \theta_{t+1}) \quad\text{with}\quad \theta_{t+1} = \eta\theta_t.$$

Theorem 3 shows that by choosing the value of $\lambda_t$ according to this theorem, we obtain $\|\beta_t - \beta^*\|_2 \le \max(e_0, \theta_t) = \max(e_0, \eta^t\theta_0)$. Furthermore, from the definition of $\Lambda$, it is clear that we can make $0 < \eta < 1$ by choosing $\gamma$ appropriately. Thus a global geometric convergence rate is obtained until $\|\beta_{t+1} - \beta^*\|_2 \le e_0$. From the definition of $e_0$, since both $\delta$ and $\varepsilon$ are small numbers, we can recover $\beta^*$ accurately. Finally, note that $\lambda_t$ can be rewritten as $\lambda_t = 2\max\big\{\max\big((\gamma_u+\delta)(\varepsilon+\delta\|\beta^*\|_2), \Lambda e_0\big), \Lambda\theta_t\big\}$. Hence we can set $\lambda_t = \max(\lambda_{\min}, \lambda_0\eta^t)$ with $\lambda_{\min} = 2\max\big((\gamma_u+\delta)(\varepsilon+\delta\|\beta^*\|_2), \Lambda e_0\big)$, just as in Algorithm 1.
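Before summarizing the properties of the method, we add a short remark of our own (it is not stated in the paper, and it uses the sign convention of the composite gradient step as reconstructed in (5)): the per-iteration subproblem (5) separates across coordinates and has a closed-form solution by soft-thresholding,

$$
\beta_{t+1}
= \arg\min_{\beta}\ \frac{\gamma}{2}\Big\|\beta-\Big(\beta_t+\tfrac{1}{n\gamma}\tilde X\big(y-\tilde X^T\beta_t\big)\Big)\Big\|_2^2+\lambda_t\|\beta\|_1
= \mathcal{S}_{\lambda_t/\gamma}\Big(\beta_t+\tfrac{1}{n\gamma}\tilde X\big(y-\tilde X^T\beta_t\big)\Big),
$$

where $\mathcal{S}_\tau(v)_i = \operatorname{sign}(v_i)\max(|v_i|-\tau, 0)$ is the soft-thresholding operator. Each iteration therefore amounts to one approximate gradient evaluation through $W$ and $Q$ plus an elementwise shrinkage.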
At the end of this section, we summarize several properties of our approach. 1) We need only $O(dk + nk)$ flops per iteration to update $\beta_t$, instead of the $O(dn)$ of traditional methods, which is a significant improvement when both $d$ and $n$ are much larger than $k$. 2) A global geometric convergence rate is achieved. 3) The sparsity of the intermediate solution $\beta_t$ is well bounded: the number of nonzero entries in $\beta_t$ is no more than twice that of $\beta^*$. 4) Compared to (Xiao and Zhang 2013), we update $\beta_t$ only once for each $\lambda_t$.

## Experimental Study

Having analysed the proposed method from a theoretical perspective, we now turn to an empirical study of its behaviour. Following (Xiao and Zhang 2013), we evaluate the performance of our method mainly on two aspects: (i) how fast the learned vector $\beta_t$ converges to the optimal solution $\beta^*$, and (ii) the sparsity of the learned vector $\beta_T$. The two baseline algorithms below, together with our method, are used in our study.

- ADG ((Nesterov 2013), Algorithm 4.9): Nesterov's accelerated dual gradient method.
- PGH (Xiao and Zhang 2013): a state-of-the-art algorithm with a global geometric convergence rate.
- SLRviaRP: the method proposed in this study.

### Experiments on Synthetic Datasets

**Dataset** We generate the data matrix $X$, the response vector $y$ and the optimal solution $\beta^*$ as follows (a code sketch is given after the evaluation metrics below). 1) We generate the data matrix $X$ by adding Gaussian noise $e_X$ to a low-rank matrix $X_0$, i.e. $X = X_0 + e_X$, where $e_X \in \mathbb{R}^{d \times N}$ and $[e_X]_{ij} \sim N(0, \sigma_{e_X}^2)$. To generate $X_0$, we first construct a random matrix $U \in \mathbb{R}^{d \times N}$ with each $U_{ij}$ sampled from the uniform distribution $U[-1,1]$. We then compute a QR factorization of $U$, i.e. $U = QR$ with $Q \in \mathbb{R}^{d \times N}$ and $R \in \mathbb{R}^{N \times N}$, and extract the first $k$ columns of $Q$ to form a new matrix $\hat{Q}$. Finally, we obtain the low-rank matrix $X_0$ by projecting $U$ onto $\hat{Q}$, i.e., $X_0 = \hat{Q}\hat{Q}^T U$. 2) For the sparse vector $\beta^* \in \mathbb{R}^d$, we randomly select $s$ of its $d$ coordinates and set them to 1, setting all remaining components to 0. 3) The response vector is sampled from the prediction model $y = X^T\beta^* + e_y$, where $e_y \in \mathbb{R}^N$ is a random noise vector sampled from the normal distribution $N(0, \sigma_{e_y}^2 I_{N \times N})$.

**Parameter setting** For the training data, we set $d = 10000$, $N = 5000$, $k = 500$, $s = 25$ and vary the noise levels $e_X$, $e_y$ in $\{0.01, 0.02, \ldots, 0.05\}$. In ADG, $\Psi(x) = 0.001\|\beta\|_1$, $\gamma_\mu = \gamma_d = 1.5$, $\mu = 0$ and $L_0 = 0.001$. In PGH, we set $\lambda_{tgt} = 0.001$, $\epsilon = 10^{-6}$, $L_{\min} = 10^{-6}$, $\eta = 0.7$, $\delta = 0.2$ and $\gamma_{dec} = \gamma_{inc} = 2$. Finally, for our method SLRviaRP we set $\gamma = \lambda_0 = 0.3$, $\eta = 0.94$ and $\lambda_{\min} = 0.002$.

**Goal** To recover the optimal sparse vector $\beta^*$ from the training data $X$ and $y$ both accurately and efficiently, with solutions that are as sparse as possible.

**Evaluation metrics** To evaluate the learned $\beta_T$, we measure the error $\|\beta_t - \beta^*\|_2$ and the sparsity of the iterates over the iterations and over running time (in seconds). The sparsity is computed as density $= \frac{1}{d}\sum_{i=1}^d \mathbb{I}([\beta_T]_i \neq 0)$. We also measure the recovery of the support set of $\beta_T$ by $\mathrm{SSR}(\beta_T) = 2|S(\beta_T)\cap S(\beta^*)|/\big(|S(\beta_T)| + |S(\beta^*)|\big)$. We run each experiment 100 times and report the averaged results.
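The following NumPy sketch is our instantiation of the data-generation recipe above; the dimensions are scaled down from the paper's setting ($d=10000$, $N=5000$, $k=500$, $s=25$) so the example runs in seconds, and the noise levels are placeholders.

```python
import numpy as np

def make_synthetic(d=1000, N=500, k=50, s=25, sigma_x=0.01, sigma_y=0.01, rng=None):
    """Generate (X, y, beta_star) following the synthetic-data recipe above."""
    rng = np.random.default_rng(rng)

    # 1) low-rank X0 plus Gaussian noise
    U = rng.uniform(-1.0, 1.0, size=(d, N))
    Q, _ = np.linalg.qr(U)                    # Q: (d, N) with orthonormal columns
    Q_hat = Q[:, :k]                          # first k columns
    X0 = Q_hat @ (Q_hat.T @ U)                # project U onto span(Q_hat)
    X = X0 + sigma_x * rng.standard_normal((d, N))

    # 2) sparse ground truth with s entries equal to 1
    beta_star = np.zeros(d)
    beta_star[rng.choice(d, size=s, replace=False)] = 1.0

    # 3) responses y = X^T beta_star + noise
    y = X.T @ beta_star + sigma_y * rng.standard_normal(N)
    return X, y, beta_star

X, y, beta_star = make_synthetic(rng=0)
density = np.mean(beta_star != 0)             # the "density" metric from the text
print(X.shape, y.shape, density)
```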
**Experimental results** Figures 1 and 2 show the performance of the different algorithms at noise level $e_X = e_Y = 0.01$. From the left panels of Figures 1 and 2, we observe that our method converges to the optimal solution $\beta^*$ as fast as the recent algorithm PGH, so their convergence rates are comparable. The right panels show clearly that we improve run-time efficiency significantly thanks to the lower computational complexity of each iteration. In addition, they show that ADG does not handle such under-determined problems well, since its error $\|\beta_t - \beta^*\|_2$ decreases slowly over the iterations. This may result from the fact that its solution path is not sparse (Table 1) over the iterations (Xiao and Zhang 2013).

[Figure 1: The error $\log(\|\beta_t - \beta^*\|_2)$ over the iterations (left) and running time (right) with $e_X = e_Y = 0.05$, for SLRviaRP, PGH and ADG.]

Table 1 summarizes the evaluation of the final solutions output by the different algorithms after 100 iterations under different noise levels $e_X$ and $e_Y$. For PGH and SLRviaRP, we observe similar behaviour in terms of convergence and sparsity preservation, since the values of the metrics (residual, error, density and SSR) are all comparable. More importantly, the time cost of each method confirms that our method is much more efficient than the others (roughly 70 times faster in this case).

[Figure 2: The residual $\log(\|y - X\beta_t\|_2^2/2N)$ over the iterations (left) and running time (right) with $e_X = e_Y = 0.05$, for SLRviaRP, PGH and ADG.]

Figure 3 shows the performance of our method under different shrinkage rates $\eta \in \{0.91, 0.92, \ldots, 0.95\}$, reflecting the insensitivity of our method to this parameter.

[Figure 3: The error $\log(\|\beta_t - \beta^*\|_2)$ (left) and the residual $\log(\|y - X\beta_t\|_2^2/2N)$ (right) of our method over the iterations with different shrinkage rates $\eta$ and $e_X = e_Y = 0.05$.]

### Experiments on a Real-World Dataset

**Experiment design** To further demonstrate the efficiency of our method, we conduct a regression experiment on the well-known MNIST dataset, which consists of gray-scale images of scanned handwritten digits. It has roughly 60000 training samples and 10000 testing samples, each of dimension $28 \times 28$. We randomly sample 10,000 examples from the training set to obtain a data matrix $X \in \mathbb{R}^{10000 \times 784}$, and we obtain the response vector $y \in \mathbb{R}^{784}$ by sampling an image from the testing set. Our goal is to find a sparse vector $\beta$ such that $X^T\beta$ approximates $y$ accurately.

**Parameter setting** In ADG, $\Psi(x) = 0.0005\|\beta\|_1$, $\gamma_\mu = \gamma_d = 1.2$, $\mu = 0$ and $L_0 = 0.001$. In PGH, we set $\lambda_{tgt} = 0.005$, $\epsilon = 0.001$, $L_{\min} = 0.005$, $\eta = 0.7$, $\delta = 0.7$ and $\gamma_{dec} = \gamma_{inc} = 2$. For our method SLRviaRP, we set $k = 100$, $\gamma = 10$, $\lambda_0 = 0.2$, $\eta = 0.97$ and $\lambda_{\min} = 0.005$. Finally, we run 100 trials and report the averaged performance of each method after 1000 iterations.
**Evaluation metrics** Since the optimal solution $\beta^*$ is unknown, we use the residual, the sparsity and the running time to evaluate the performance of each method; the numerical results are summarized in Table 2.

Table 1: Results with $d = 10000$, $N = 5000$, $k = 500$.

| $e_X = e_Y$ | Method | Res | Error | density | SSR | time (s) |
|---|---|---|---|---|---|---|
| 0.01 | ADG | 0.026 | 24.566 | 0.7363 | 0.007 | 138.2 |
| 0.01 | PGH | 0.001 | 0.107 | 0.0025 | 0.996 | 134.7 |
| 0.01 | SLRviaRP | 0.001 | 0.111 | 0.0025 | 0.995 | 2.4 |
| 0.02 | ADG | 0.066 | 22.132 | 0.7400 | 0.007 | 143.7 |
| 0.02 | PGH | 0.001 | 0.108 | 0.0025 | 0.997 | 147.3 |
| 0.02 | SLRviaRP | 0.001 | 0.115 | 0.0025 | 0.994 | 2.3 |
| 0.03 | ADG | 0.091 | 20.718 | 0.7524 | 0.007 | 148.9 |
| 0.03 | PGH | 0.001 | 0.107 | 0.0025 | 0.993 | 157.5 |
| 0.03 | SLRviaRP | 0.001 | 0.119 | 0.0026 | 0.989 | 1.9 |
| 0.04 | ADG | 0.098 | 20.13 | 0.7700 | 0.006 | 146.2 |
| 0.04 | PGH | 0.001 | 0.106 | 0.0025 | 0.991 | 149.1 |
| 0.04 | SLRviaRP | 0.001 | 0.124 | 0.0027 | 0.971 | 1.9 |
| 0.05 | ADG | 0.104 | 20.478 | 0.7942 | 0.006 | 148 |
| 0.05 | PGH | 0.002 | 0.106 | 0.0026 | 0.988 | 153.8 |
| 0.05 | SLRviaRP | 0.002 | 0.127 | 0.0027 | 0.97 | 2.1 |

**Experimental result** According to Table 2, our method clearly has a great advantage in efficiency over the baseline algorithms, and compared to PGH it has comparable sparse-learning ability. The relatively large residual and poor sparsity of ADG show that it does not handle such under-determined problems well, which is consistent with the results of the previous experiment.

Table 2: Numerical results on MNIST.

| Method | Res | density | time (s) |
|---|---|---|---|
| ADG | 0.022 | 0.9982 | 48.3 |
| PGH | 0.004 | 0.0046 | 63.8 |
| SLRviaRP | 0.005 | 0.0036 | 0.7 |

## Conclusion

In this paper, we propose a novel sparse linear regression method based on random projection. Our method reduces the complexity of evaluating the gradient at each iteration from $O(nd)$ to $O(nk + dk)$ by using random projection, and then adopts a proximal gradient method with a homotopy strategy to solve the sparse linear regression problem. We verify, both theoretically and experimentally, that our method achieves a geometric convergence rate and can significantly improve the time efficiency of existing methods for sparse linear regression. In addition, the sparsity of our intermediate solutions is well bounded.

## Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2013CB336500, the National Natural Science Foundation of China under Grants 61125203 and 61233011, and the National Program for Special Support of Top-Notch Young Professionals.

## References

Becker, S.; Bobin, J.; and Candès, E. J. 2011. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences 4(1):1-39.

Bickel, P. J.; Ritov, Y.; and Tsybakov, A. B. 2009. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics 1705-1732.

Bioucas-Dias, J. M., and Figueiredo, M. A. 2007. A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing 16(12):2992-3004.

Candes, E. J., and Tao, T. 2006. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory 52(12):5406-5425.

Daubechies, I.; Defrise, M.; and De Mol, C. 2003. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. arXiv preprint math/0307152.

Drineas, P.; Frieze, A.; Kannan, R.; Vempala, S.; and Vinay, V. 2004. Clustering large graphs via the singular value decomposition. Machine Learning 56(1-3):9-33.
Drineas, P.; Mahoney, M. W.; Muthukrishnan, S.; and Sarlós, T. 2011. Faster least squares approximation. Numerische Mathematik 117(2):219-249.

Drineas, P.; Kannan, R.; and Mahoney, M. W. 2006. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing 36(1):158-183.

Drineas, P.; Mahoney, M. W.; and Muthukrishnan, S. 2008. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications 30(2):844-881.

Figueiredo, M. A.; Nowak, R. D.; and Wright, S. J. 2007. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing 1(4):586-597.

Frieze, A.; Kannan, R.; and Vempala, S. 2004. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM 51(6):1025-1041.

Hale, E. T.; Yin, W.; and Zhang, Y. 2008. Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM Journal on Optimization 19(3):1107-1130.

Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53(2):217-288.

Huang, J.; Ma, S.; and Zhang, C.-H. 2008. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica 18(4):1603.

Kim, S.-J.; Koh, K.; Lustig, M.; Boyd, S.; and Gorinevsky, D. 2007. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606-617.

Koltchinskii, V. 2011. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer.

Lu, Y.; Dhillon, P.; Foster, D. P.; and Ungar, L. 2013. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems, 369-377.

Mahoney, M. W., and Drineas, P. 2009. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences 106(3):697-702.

Mahoney, M. W. 2011. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3(2):123-224.

Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer.

Nesterov, Y. 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140(1):125-161.

Sarlos, T. 2006. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, 143-152. IEEE.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 267-288.

Wen, Z.; Yin, W.; Goldfarb, D.; and Zhang, Y. 2010. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM Journal on Scientific Computing 32(4):1832-1857.

Wright, S. J.; Nowak, R. D.; and Figueiredo, M. A. 2009. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57(7):2479-2493.

Xiao, L., and Zhang, T. 2013. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization 23(2):1062-1091.

Zhang, L.; Mahdavi, M.; Jin, R.; Yang, T.; and Zhu, S. 2013. Recovering the optimal solution by dual random projection. In Proceedings of the 26th Annual Conference on Learning Theory, 135-157.
Zhang, L.; Mahdavi, M.; Jin, R.; Yang, T.; and Zhu, S. 2014. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory 60(11):7300-7316.

Zhang, L.; Yang, T.; Jin, R.; and Zhou, Z.-H. 2015. A simple homotopy algorithm for compressive sensing. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 1116-1124.