# Efficient Stochastic Gradient Hard Thresholding

Pan Zhou, Xiao-Tong Yuan, Jiashi Feng
Learning & Vision Lab, National University of Singapore, Singapore
B-DAT Lab, Nanjing University of Information Science & Technology, Nanjing, China
pzhou@u.nus.edu, xtyuan@nuist.edu.sg, elefjia@nus.edu.sg

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Stochastic gradient hard thresholding methods have recently been shown to work favorably for solving large-scale empirical risk minimization problems under sparsity or rank constraints. Despite the improved iteration complexity over full gradient methods, the gradient evaluation and hard thresholding complexity of the existing stochastic algorithms usually scales linearly with the data size, which can still be expensive when the data is huge, and the hard thresholding step can be as expensive as a singular value decomposition in rank-constrained problems. To address these deficiencies, we propose an efficient hybrid stochastic gradient hard thresholding (HSG-HT) method that can be provably shown to have sample-size-independent gradient evaluation and hard thresholding complexity bounds. Specifically, we prove that the stochastic gradient evaluation complexity of HSG-HT scales linearly with the inverse of the sub-optimality, and that its hard thresholding complexity scales logarithmically. By applying the heavy ball acceleration technique, we further propose an accelerated variant of HSG-HT which can be shown to have improved factor dependence on the restricted condition number in the quadratic case. Numerical results confirm our theoretical analysis and demonstrate the computational efficiency of the proposed methods.

## 1 Introduction

We consider the following sparsity- or rank-constrained finite-sum minimization problem, which is widely applied in high-dimensional statistical estimation:

$$\min_{x}\; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \quad \text{s.t.}\quad \|x\|_0 \le k \ \text{ or }\ \mathrm{rank}(x) \le k, \tag{1}$$

where each individual loss $f_i(x)$ is associated with the $i$-th sample, $\|x\|_0$ denotes the number of nonzero entries of $x$ as a vector variable, $\mathrm{rank}(x)$ denotes the rank of $x$ as a matrix variable, and $k$ represents the sparsity/low-rankness level. Such a formulation encapsulates several important problems, including $\ell_0$-constrained linear/logistic regression [1, 2, 3], sparse graphical model learning [4], and low-rank multivariate and multi-task regression [5, 6, 7, 8, 9], to name a few.

We are particularly interested in gradient hard thresholding methods [10, 11, 12, 13], which are popular and effective for solving problem (1). The common theme of this class of methods is to iterate between gradient descent and hard thresholding so as to maintain the sparsity/low-rankness of the solution while minimizing the loss function. In our problem setting, a plain gradient hard thresholding iteration is given by $x_{t+1} = \Phi_k(x_t - \eta\nabla f(x_t))$, where $\Phi_k(\cdot)$, as defined in Section 2, denotes the hard thresholding operation that preserves the top $k$ entries (in magnitude) of a vector or produces an optimal rank-$k$ approximation to a matrix via singular value decomposition (SVD). When considering gradient hard thresholding methods, two main sources of computational complexity are at play: the gradient evaluation complexity and the hard thresholding complexity. As the per-iteration hard thresholding can be as expensive as an SVD in rank-constrained problems, our goal is to develop methods that iterate and converge quickly while using a minimal number of hard thresholding operations.
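For concreteness, here is a minimal NumPy sketch of the vector case of $\Phi_k$ and of one plain gradient hard thresholding step; `grad_f` is a placeholder for a routine returning the full gradient $\nabla f(x)$, not an interface defined in the paper.

```python
import numpy as np

def hard_threshold(x, k):
    """Phi_k for vectors: keep the k largest-magnitude entries and zero out the rest."""
    out = np.zeros_like(x)
    top_k = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    out[top_k] = x[top_k]
    return out

def fg_ht_step(x, grad_f, eta, k):
    """One plain gradient hard thresholding iteration: x_{t+1} = Phi_k(x_t - eta * grad f(x_t))."""
    return hard_threshold(x - eta * grad_f(x), k)
```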
Table 1: Comparison of different hard thresholding algorithms for the sparsity- and rank-constrained problem (1). Both the computational complexity and the statistical error are evaluated w.r.t. the estimation error $\|x - x^*\|$ between the $k$-sparse/rank-$k$ estimator $x$ and the $k^*$-sparse/rank-$k^*$ optimum $x^*$. Here $\kappa_s$ and $\kappa_{\hat{s}}$ denote the restricted condition numbers with $s = 2k + k^*$ and $\hat{s} = 3k + k^*$, and $\tilde{I} = \mathrm{supp}(\Phi_{2k}(\nabla f(x^*))) \cup \mathrm{supp}(x^*)$ and $\hat{I} = \mathrm{supp}(\Phi_{3k}(\nabla f(x^*))) \cup \mathrm{supp}(x^*)$ are two support sets. The results of AHSG-HT are established for quadratic loss functions.

| Algorithm | Restriction on $\kappa_s$ | Required value of $k$ | #IFO | #Hard Thresholding | Statistical error (sparsity-constrained problem¹) |
|---|---|---|---|---|---|
| FG-HT [12, 13] | none | $\Omega(\kappa_s^2 k^*)$ | $O(n\kappa_s\log\frac{1}{\epsilon})$ | $O(\kappa_s\log\frac{1}{\epsilon})$ | $O(\sqrt{k+k^*}\,\|\nabla f(x^*)\|_\infty)$ |
| SG-HT [20] | $\kappa_s \le \frac{4}{3}$ | $\Omega(\kappa_s^2 k^*)$ | $O(\kappa_s\log\frac{1}{\epsilon})$ | $O(\kappa_s\log\frac{1}{\epsilon})$ | $O(\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x^*)\|_2)$ |
| SVRG-HT [21] | none | $\Omega(\kappa_s^2 k^*)$ | $O((n+\kappa_s)\log\frac{1}{\epsilon})$ | $O((n+\kappa_s)\log\frac{1}{\epsilon})$ | $O(\sqrt{s}\,\|\nabla f(x^*)\|_\infty + \|\nabla_{\tilde{I}} f(x^*)\|_2)$ |
| HSG-HT | none | $\Omega(\kappa_s^2 k^*)$ | $O(\frac{\kappa_s}{\epsilon})$ | $O(\kappa_s\log\frac{1}{\epsilon})$ | $O(\|\nabla_{\tilde{I}} f(x^*)\|_2)$ |
| AHSG-HT | none | $\Omega(\kappa_{\hat{s}} k^*)$ | $O(\frac{\sqrt{\kappa_{\hat{s}}}}{\epsilon})$ | $O(\sqrt{\kappa_{\hat{s}}}\log\frac{1}{\epsilon})$ | $O(\|\nabla_{\hat{I}} f(x^*)\|_2)$ |

¹ For the general rank-constrained problem, the statistical error is not explicitly provided for FG-HT, SG-HT and SVRG-HT, while it is given in our Theorem 1 for HSG-HT and Theorem 3 for AHSG-HT.

Full gradient hard thresholding. The plain form of the full gradient hard thresholding (FG-HT) algorithm has been extensively studied in compressed sensing and sparse learning [10, 12, 13, 14]. At each iteration, FG-HT first updates the variable $x$ by a full gradient descent step and then performs hard thresholding on the updated variable. Theoretical results show that FG-HT converges linearly towards a proper nominal solution with high estimation accuracy [12, 13, 15]. Moreover, compared with algorithms adopting an $\ell_1$- or nuclear-norm convex relaxation (e.g., [16, 17, 18, 19]), directly solving problem (1) via FG-HT often enjoys a similar accuracy guarantee while being more computationally efficient. However, despite these desirable properties, FG-HT needs to compute the full gradient at each iteration, which can be expensive in large-scale problems. If the restricted condition number is $\kappa_s$, then $O(\kappa_s\log\frac{1}{\epsilon})$ iterations are needed to attain an $\epsilon$-suboptimal solution (up to a statistical error), and thus the sample-wise gradient evaluation complexity, measured by the incremental first-order oracle (IFO, see Definition 1), is $O(n\kappa_s\log\frac{1}{\epsilon})$, which scales linearly with $n\kappa_s$.

Stochastic gradient hard thresholding. To improve computational efficiency, stochastic hard thresholding algorithms [20, 21, 22] have recently been developed by leveraging the finite-sum structure of problem (1). For instance, Nguyen et al. [20] proposed a stochastic gradient hard thresholding (SG-HT) algorithm for solving problem (1). At each iteration, SG-HT only evaluates the gradient of one (or a mini-batch of) randomly selected sample(s) for the variable update and hard thresholding. It was shown that the IFO complexity and the hard thresholding complexity of SG-HT are both $O(\kappa_s\log\frac{1}{\epsilon})$, which is independent of $n$. However, SG-HT can only be shown to converge to a sub-optimal statistical estimation accuracy (see Table 1), which is inferior to that of the full-gradient methods. Another limitation of SG-HT is that it requires the restricted condition number $\kappa_s$ to be no larger than $4/3$, which is hard to meet in realistic high-dimensional sparse estimation problems such as sparse linear regression [13]. To overcome these issues, the stochastic variance reduced gradient hard thresholding (SVRG-HT) algorithm [21, 22] was developed as an adaptation of SVRG [23] to problem (1).
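To make the two stochastic estimators concrete, the following is a minimal NumPy sketch of the per-iteration gradient estimates used by SG-HT-style and SVRG-style methods as described above; `grad_fi(i, x)` is a hypothetical callable returning $\nabla f_i(x)$, and the snapshot maintenance of SVRG (periodic recomputation of the full gradient at `x_snap`) is omitted.

```python
import numpy as np

def sg_ht_gradient(x, grad_fi, n, rng):
    """SG-HT-style estimate: the gradient of a single uniformly sampled component."""
    i = rng.integers(n)
    return grad_fi(i, x)

def svrg_gradient(x, x_snap, full_grad_snap, grad_fi, n, rng):
    """SVRG-style variance-reduced estimate adapted by SVRG-HT:
    grad f_i(x) - grad f_i(x_snap) + full gradient at the snapshot x_snap."""
    i = rng.integers(n)
    return grad_fi(i, x) - grad_fi(i, x_snap) + full_grad_snap
```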
Benefiting from the variance reduction technique, SVRG-HT converges more stably and efficiently than SG-HT while achieving better estimation accuracy. Also unlike SG-HT, the convergence analysis of SVRG-HT allows an arbitrary bounded restricted condition number. As shown in Table 1, both the IFO complexity and the hard thresholding complexity of SVRG-HT are $O((n+\kappa_s)\log\frac{1}{\epsilon})$. Although the IFO complexity of SVRG-HT substantially improves over FG-HT, the overall complexity still scales linearly with the sample size $n$. Therefore, when the data scale is huge (e.g., $n \gg \kappa_s$) and the per-iteration hard thresholding operation is expensive, SVRG-HT can still be computationally inefficient in practice. Later, Chen et al. [24] proposed a stochastic variance-reduced block coordinate descent algorithm, but its overall complexity also scales linearly with the sample size, and thus it faces the same computational challenge as SVRG-HT.

Overview of our approach. The method we propose can be viewed as a simple yet efficient extension of the hybrid stochastic gradient descent (HSGD) method [25, 26, 27] from unconstrained finite-sum minimization to the cardinality-constrained finite-sum problem (1). The core idea of HSGD is to iteratively sample an evolving mini-batch of terms in the finite sum for gradient estimation. This style of incremental gradient method has been shown, both in theory and in practice, to smoothly bridge the gap between deterministic and stochastic gradient methods [26]. Inspired by the success of HSGD, we propose the hybrid stochastic gradient hard thresholding (HSG-HT) method, which has the following variable update form:

$$x_{t+1} = \Phi_k(x_t - \eta g_t), \quad \text{with}\quad g_t = \frac{1}{s_t}\sum_{i_t\in S_t}\nabla f_{i_t}(x_t),$$

where $\eta$ is the learning rate and $S_t$ is the set of $s_t$ selected samples. In the early stage of the iterations, HSG-HT selects a few samples to estimate the full gradient; as the iterations proceed, $s_t$ increases, giving a more accurate estimate of the full gradient. Such a mechanism allows it to enjoy the merits of both SG-HT and FG-HT, i.e., the low per-iteration complexity of SG-HT and the steady convergence rate of FG-HT with a constant learning rate $\eta$. Given a $k^*$-sparse/low-rank target solution $x^*$, for an objective function with restricted condition number $\kappa_s$ and $s = 2k + k^*$, we show that $O(\frac{\kappa_s}{\epsilon})$ IFO calls and $O(\kappa_s\log\frac{1}{\epsilon})$ hard thresholding steps are sufficient for HSG-HT to find an $x$ such that $\|x - x^*\|_2 \le \epsilon + O(\sqrt{s}\,\|\nabla f(x^*)\|_\infty)$. In this way, HSG-HT exhibits sample-size-independent IFO and hard thresholding complexity.

Another attractive feature of HSG-HT is that it can be readily accelerated by applying the heavy ball acceleration technique [28, 29, 30]. To this end, we modify the iteration of HSG-HT by adding a small momentum $\nu(x_t - x_{t-1})$, for some $\nu > 0$, to the gradient descent step:

$$x_{t+1} = \Phi_k\big(x_t - \eta g_t + \nu(x_t - x_{t-1})\big).$$

We call this modified version accelerated HSG-HT (AHSG-HT). For quadratic problems, we prove that such a simple momentum strategy improves the IFO complexity of HSG-HT to $O(\frac{\sqrt{\kappa_{\hat{s}}}}{\epsilon})$ and the hard thresholding complexity to $O(\sqrt{\kappa_{\hat{s}}}\log\frac{1}{\epsilon})$, where $\hat{s} = 3k + k^*$. To the best of our knowledge, AHSG-HT is the first momentum based algorithm that can be provably shown to have such an improved complexity for stochastic gradient hard thresholding.

Highlight of results and contribution. Table 1 summarizes our main results on the computational complexity and the statistical estimation accuracy of HSG-HT and AHSG-HT, along with the results of the above mentioned state-of-the-art gradient hard thresholding algorithms.
From this table we can observe that our methods have several theoretical advantages over the considered prior methods, which are highlighted in the next few paragraphs.

On the sparsity/low-rankness level condition. In the quadratic case, AHSG-HT substantially improves the condition on the sparsity/low-rankness level $k$: it only requires $k = \Omega(\kappa_{\hat{s}} k^*)$, while the other considered algorithms with optimal statistical estimation accuracy all require $k = \Omega(\kappa_s^2 k^*)$. Moreover, both HSG-HT and AHSG-HT remove the restrictive condition $\kappa_s \le 4/3$ required by SG-HT.

On the statistical estimation accuracy. For the sparsity-constrained problem, the statistical estimation accuracy of HSG-HT is comparable to that of FG-HT and better than those of SVRG-HT and SG-HT, as the term $\|\nabla_{\tilde{I}} f(x^*)\|_2$ in HSG-HT is usually smaller than the term $\sqrt{s}\,\|\nabla f(x^*)\|_\infty$ in SVRG-HT and much smaller than $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x^*)\|_2$ in SG-HT. AHSG-HT has an even smaller estimation error than HSG-HT since it allows a smaller sparsity/low-rankness level $k = \Omega(\kappa_{\hat{s}} k^*)$ and thus a smaller cardinality of the support set $\hat{I}$, especially when the restricted condition number is sensitive to $k$.

On the computational complexity. Both HSG-HT and AHSG-HT enjoy sample-size-independent IFO and hard thresholding complexity. In terms of IFO complexity, our methods are cheaper than FG-HT and SVRG-HT when $n$ dominates $\frac{1}{\epsilon}$. This suggests that HSG-HT and AHSG-HT are more suitable for handling large-scale data. SG-HT has the lowest IFO complexity, which however comes at the price of a severely deteriorated statistical estimation accuracy. In terms of hard thresholding complexity, AHSG-HT is the best, and HSG-HT matches FG-HT and SG-HT.

Last but not least, we highlight that AHSG-HT, to the best of our knowledge, for the first time provides improved convergence guarantees for momentum based stochastic gradient hard thresholding methods. While for convex problems momentum based methods such as the heavy ball and Nesterov's methods have long been known to work favorably for accelerating full/stochastic gradient methods [28, 31, 32, 33], it remains largely unknown whether gradient hard thresholding methods for the non-convex finite-sum problem (1) can be accelerated. There is a recent attempt at analyzing a Nesterov's momentum full gradient hard thresholding method [34]. Although it is shown to attain an improved rate of convergence under certain conditions, the iteration complexity bound established in [34] still does not exhibit a better dependence on the restricted condition number than plain FG-HT. In contrast, at least in the quadratic case, AHSG-HT can be shown to have a better dependence on the condition number than the existing gradient hard thresholding methods.

## 2 Preliminaries

Throughout this paper, we use $\|x\|$ to denote the Euclidean norm of a vector $x\in\mathbb{R}^d$ and the Frobenius norm of a matrix $x\in\mathbb{R}^{d_1\times d_2}$, and $\|x\|_\infty$ to denote the largest absolute entry of $x$. The hard thresholding operation $\Phi_k(x)$ preserves the $k$ largest entries of $x$ in magnitude when $x$ is a vector; when $x$ is a matrix it preserves only the top-$k$ singular values, namely $\Phi_k(x) = H_k\Sigma_k V_k^\top$, where $H_k$ and $V_k$ collect the top-$k$ left and right singular vectors of $x$, respectively, and $\Sigma_k$ is the diagonal matrix of the top-$k$ singular values of $x$. We use $\mathrm{supp}(x)$ to denote the support set of $x$. Specifically, for a vector $x$, $\mathrm{supp}(x)$ indexes its nonzero entries; for a matrix $x\in\mathbb{R}^{d_1\times d_2}$, it indexes the subspace $U$, i.e., a set of singular vectors spanning the column space of $x$.
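A minimal NumPy sketch of the matrix case of $\Phi_k$ defined above, computed via an SVD; for large matrices a genuinely truncated SVD solver would be preferable, but the full decomposition keeps the sketch short.

```python
import numpy as np

def hard_threshold_rank(X, k):
    """Phi_k for matrices: the best rank-k approximation H_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```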
For a vector variable $x$, $\nabla_I f(x)$ preserves the entries of $\nabla f(x)$ indexed by the support set $I$ and sets the remaining entries to zero; for a matrix variable $x$, $\nabla_I f(x)$ with $I = I_1\cup I_2$ projects $\nabla f(x)$ onto the subspace indexed by $I_1\cup I_2$, namely $\nabla_I f(x) = (U_1U_1^\top + U_2U_2^\top - U_1U_1^\top U_2U_2^\top)\nabla f(x)$, where $U_1$ and $U_2$ span the subspaces indexed by $I_1$ and $I_2$, respectively.

We assume the objective function in (1) to have restricted strong convexity (RSC) and restricted strong smoothness (RSS). For both sparsity- and rank-constrained problems, the RSC and RSS conditions are commonly used in analyzing hard thresholding algorithms [12, 13, 21, 22, 20].

Assumption 1 (Restricted strong convexity, RSC). A differentiable function $f(x)$ is restricted $\rho_s$-strongly convex with parameter $s$ if there exists a generic constant $\rho_s > 0$ such that for any $x, x'$ with $\|x - x'\|_0 \le s$ or $\mathrm{rank}(x - x') \le s$,
$$f(x) - f(x') - \langle\nabla f(x'),\, x - x'\rangle \;\ge\; \frac{\rho_s}{2}\,\|x - x'\|^2.$$

Assumption 2 (Restricted strong smoothness, RSS). Each $f_i(x)$ is said to be restricted $\ell_s$-strongly smooth with parameter $s$ if there exists a generic constant $\ell_s > 0$ such that for any $x, x'$ with $\|x - x'\|_0 \le s$ or $\mathrm{rank}(x - x') \le s$,
$$f_i(x) - f_i(x') - \langle\nabla f_i(x'),\, x - x'\rangle \;\le\; \frac{\ell_s}{2}\,\|x - x'\|^2.$$

We also impose the following boundedness assumption on the variance of the stochastic gradient.

Assumption 3 (Bounded stochastic gradient variance). For any $x$ and each loss $f_i(x)$, the distance between $\nabla f_i(x)$ and the full gradient $\nabla f(x)$ is upper bounded as $\max_i \|\nabla f_i(x) - \nabla f(x)\| \le B$.

Similar to [21, 23, 35, 36], the incremental first-order oracle (IFO) complexity is adopted as the computational complexity metric for solving the finite-sum problem (1). In high-dimensional sparse learning and low-rank matrix recovery problems, the per-iteration hard thresholding operation can be equally time-consuming or even more expensive than the gradient evaluation. For instance, in rank-constrained problems, the hard thresholding operation can be as expensive as a top-$k$ SVD of a matrix. Therefore we also take the computational complexity of hard thresholding into account.

Definition 1 (IFO and hard thresholding complexity). For $f(x)$ in problem (1), an IFO takes an index $i\in[n]$ and a point $x$, and returns the pair $(f_i(x), \nabla f_i(x))$. In a hard thresholding operation, we feed $x$ into $\Phi_k(\cdot)$ and obtain the output $\Phi_k(x)$.

The IFO and hard thresholding complexity, taken together, comprehensively reflect the overall computational performance of a first-order hard thresholding algorithm, since the objective value and gradient evaluations and the hard thresholding operation usually dominate the per-iteration computation.

## 3 Hybrid Stochastic Gradient Hard Thresholding

In this section, we first introduce the hybrid stochastic gradient hard thresholding (HSG-HT) algorithm and then analyze its convergence performance for sparsity- and rank-constrained problems.

### 3.1 The HSG-HT Algorithm

The HSG-HT algorithm is outlined in Algorithm 1. At the $t$-th iteration, it first uniformly randomly selects $s_t$ samples $S_t$ from all data and evaluates the approximate gradient $g_t = \frac{1}{s_t}\sum_{i_t\in S_t}\nabla f_{i_t}(x_t)$.

Algorithm 1: (Accelerated) Hybrid Stochastic Gradient Hard Thresholding
Input: initial point $x_0$, sample index set $S = \{1,\dots,n\}$, learning rate $\eta$, momentum strength $\nu$, mini-batch sizes $\{s_t\}$.
for $t = 1, 2, \dots, T-1$ do
  Uniformly randomly select $s_t$ samples $S_t$ from $S$
  Compute the approximate gradient $g_t = \frac{1}{s_t}\sum_{i_t\in S_t}\nabla f_{i_t}(x_t)$
  Update $x_{t+1}$ using either of the following two options:
    (O1) $x_{t+1} = \Phi_k(x_t - \eta g_t)$  /* for plain HSG-HT */
    (O2) $x_{t+1} = \Phi_k(x_t - \eta g_t + \nu(x_t - x_{t-1}))$  /* for accelerated HSG-HT */
end
Output: $x_T$.
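The loop below is a compact NumPy sketch of Algorithm 1 under mild assumptions: `threshold(x, k)` stands for $\Phi_k$ (top-$k$ entries or top-$k$ SVD), `grad_fi(i, x)` is a hypothetical per-sample gradient oracle, and the mini-batch size follows the geometric schedule $s_t = \tau\omega^{-t}$ used in our theorems, capped at $n$ (reflecting the remark in Section 5 that HSG-HT degenerates to FG-HT once the mini-batch covers all samples). Setting `nu = 0` gives option O1 (plain HSG-HT), while `nu > 0` gives option O2 (AHSG-HT).

```python
import numpy as np

def hsg_ht(x0, grad_fi, n, k, eta, nu, tau, omega, T, threshold, rng=None):
    """Sketch of Algorithm 1: (accelerated) hybrid stochastic gradient hard thresholding."""
    if rng is None:
        rng = np.random.default_rng()
    x_prev, x = x0.copy(), x0.copy()
    for t in range(1, T):
        s_t = min(int(np.ceil(tau * omega ** (-t))), n)   # evolving mini-batch size
        S_t = rng.choice(n, size=s_t, replace=False)      # uniformly sampled sample indices
        g_t = sum(grad_fi(i, x) for i in S_t) / s_t       # approximate gradient g_t
        x_next = threshold(x - eta * g_t + nu * (x - x_prev), k)   # O1 if nu == 0, O2 otherwise
        x_prev, x = x, x_next
    return x
```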
Then, there are two options for the variable update. The first option is to update $x_{t+1}$ with a standard local descent step along $g_t$ followed by a hard thresholding step, giving the plain update $x_{t+1} = \Phi_k(x_t - \eta g_t)$ in option O1. The other option, O2, updates $x_{t+1}$ based on the momentum formulation $x_{t+1} = \Phi_k(x_t - \eta g_t + \nu(x_t - x_{t-1}))$, leading to an accelerated variant of HSG-HT. The plain update O1 is a special case of the momentum based update O2 with strength $\nu = 0$. In the early stage of the iterations, when the mini-batch size $s_t$ is relatively small, HSG-HT behaves like SG-HT with low per-iteration gradient evaluation cost. As the iterations proceed, $s_t$ increases and HSG-HT behaves like a full gradient hard thresholding method. Next, we analyze the parameter estimation accuracy and the objective value convergence of HSG-HT. The analysis of the accelerated version is presented in Section 4.

### 3.2 Statistical Estimation Analysis

We first analyze the parameter estimation performance of HSG-HT by characterizing the distance between the output of Algorithm 1 and the optimum $x^*$. Such an analysis is helpful for understanding the convergence behavior and the statistical estimation accuracy of the computed solution. We summarize the main result for both sparsity- and rank-constrained problems in Theorem 1.

Theorem 1. Suppose the objective function $f(x)$ is $\rho_s$-strongly convex and each individual $f_i(x)$ is $\ell_s$-strongly smooth with parameter $s = 2k + k^*$. Let $\kappa_s = \frac{\ell_s}{\rho_s}$ and $\alpha = 1 + 2\sqrt{\frac{k^*}{k - k^*}}$. Assume the sparsity/low-rankness level $k \ge (1 + 712\kappa_s^2)k^*$. Set the learning rate $\eta = \frac{1}{6\ell_s}$ and the mini-batch size $s_t = \tau\omega^{-t}$ with $\omega = 1 - \frac{1}{480\kappa_s}$ and $\tau \ge \frac{40\alpha B}{3\rho_s\ell_s\|x_0 - x^*\|^2}$. Then the output $x_T$ of HSG-HT satisfies
$$\mathbb{E}\|x_T - x^*\| \;\le\; \Big(1 - \frac{1}{480\kappa_s}\Big)^{T/2}\|x_0 - x^*\| + \frac{\alpha}{12(1-\beta)}\,\|\nabla_{\tilde{I}} f(x^*)\|,$$
where $\beta = \alpha\big(1 - \frac{1}{12\kappa_s}\big)$, $\tilde{I} = \mathrm{supp}(\Phi_{2k}(\nabla f(x^*))) \cup \mathrm{supp}(x^*)$, and $T$ is the number of iterations.

A proof of Theorem 1 is given in Appendix B.1. Theorem 1 shows that for both the sparsity- and rank-constrained problems, if one uses sparsity/low-rankness level $k = \Omega(\kappa_s^2 k^*)$ and gradually expands the mini-batch size at the exponential rate $1/\omega$ with $\omega = 1 - \frac{1}{480\kappa_s}$, then in expectation the sequence $\{x_t\}$ generated by HSG-HT converges linearly towards $x^*$ at the rate $(1 - \frac{1}{480\kappa_s})^{1/2}$. This indicates that HSG-HT enjoys a fast and steady convergence rate similar to the deterministic FG-HT [13]. As the condition number $\kappa_s = \ell_s/\rho_s$ is usually large in realistic problems, the exponential rate $1/\omega$ is actually only slightly larger than one. This means that even a moderate-scale dataset allows HSG-HT to iterate sufficiently for decreasing the loss, as illustrated in Figures 1 and 2 in Section 5.

One can also observe that the estimation error $\mathbb{E}\|x_t - x^*\|$ is controlled by a multiple of $\|\nabla_{\tilde{I}} f(x^*)\|$, which usually represents the statistical error of the model. For the sparsity-constrained problem, such a statistical error bound matches that established for FG-HT [13], and is usually better than the error bound $O(\sqrt{s}\,\|\nabla f(x^*)\|_\infty + \|\nabla_{\tilde{I}} f(x^*)\|_2)$ with $s = 2k + k^*$ for SVRG-HT [21], since $\|\nabla_{\tilde{I}} f(x^*)\|_2 \le \sqrt{s}\,\|\nabla f(x^*)\|_\infty$. Compared with the error $O(\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x^*)\|_2)$ of SG-HT [20], the error term of HSG-HT is significantly smaller. This is because the magnitude $\|\nabla f(x^*)\|_2$ of the full gradient is usually small when the sample size is large, while an individual (or small mini-batch) gradient norm $\|\nabla f_i(x^*)\|_2$ can still be relatively large. For example, in sparse linear regression problems the difference can be as significant as $O(\sqrt{\log(d)/n})$ (for HSG-HT) versus $O(\sqrt{\log d})$ (for SG-HT).
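To make the earlier remark about the expansion rate $1/\omega$ concrete, here is a rough calculation (the values $\kappa_s = 100$ and $\tau = 1$ are arbitrary illustrative choices, not taken from the paper):

$$\omega^{-t} \;=\; \Big(1 - \frac{1}{480\kappa_s}\Big)^{-t} \;\approx\; e^{t/(480\kappa_s)},$$

so with $\kappa_s = 100$ the mini-batch size $s_t \approx \tau\, e^{t/48000}$ only doubles after roughly $480\kappa_s\ln 2 \approx 3.3\times 10^{4}$ iterations, which is why a moderate-scale dataset already supports many HSG-HT iterations before the mini-batch becomes large.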
Note that for the general rank-constrained problem, FG-HT, SG-HT and SVRG-HT do not explicitly provide a statistical error bound of the kind given for HSG-HT. Indeed, SG-HT and SVRG-HT only considered a low-rank matrix linear model, which is a special case of the general rank-constrained problem (1). Moreover, to guarantee convergence, SG-HT requires the restrictive condition $\kappa_s \le 4/3$, while our analysis removes this condition and allows an arbitrarily large $\kappa_s$, as long as it is bounded.

Based on Theorem 1, we derive the IFO and hard thresholding complexity of HSG-HT for problem (1) in Corollary 1, with proof in Appendix B.2. For fairness, we follow the convention in [13, 20, 21, 22] and use $\mathbb{E}\|x - x^*\| \le \epsilon + \text{statistical error}$ as the measure of $\epsilon$-suboptimality.

Corollary 1. Suppose the conditions in Theorem 1 hold. To achieve $\mathbb{E}\|x_T - x^*\| \le \epsilon + \frac{\alpha\|\nabla_{\tilde{I}} f(x^*)\|}{12(1-\beta)}$, the IFO complexity of HSG-HT in Algorithm 1 is $O(\frac{\kappa_s}{\epsilon})$ and the hard thresholding complexity is $O(\kappa_s\log\frac{1}{\epsilon})$.

Intuitively, each iteration performs exactly one hard thresholding, so the hard thresholding complexity equals the iteration count, while the total IFO count $\sum_{t<T} s_t \le \tau\omega^{-T}/(1-\omega)$ is dominated by the final mini-batch and therefore never depends on $n$. Compared with FG-HT [13] and SVRG-HT [21, 22], whose IFO complexities are $O(n\kappa_s\log\frac{1}{\epsilon})$ and $O((n+\kappa_s)\log\frac{1}{\epsilon})$ respectively, HSG-HT is more efficient in IFO when the sample size $n$ dominates $\frac{1}{\epsilon}$. This is usually the case when the data scale is huge while the desired accuracy $\epsilon$ is moderately small. Concerning the hard thresholding complexity, HSG-HT shares the same $O(\kappa_s\log\frac{1}{\epsilon})$ complexity with FG-HT, which is considerably cheaper than the $O((n+\kappa_s)\log\frac{1}{\epsilon})$ hard thresholding complexity of SVRG-HT when the data scale is large. Overall, HSG-HT achieves a better trade-off between IFO and hard thresholding complexity than FG-HT and SVRG-HT when $n$ is much larger than $\frac{1}{\epsilon}$ in large-scale learning problems.

### 3.3 Convergence Analysis

For the sparsity-constrained problem, we further investigate the convergence behavior of HSG-HT in terms of the objective value $f(x)$ towards the optimal loss $f(x^*)$. The main result is summarized in the following theorem, whose proof is deferred to Appendix B.3.

Theorem 2. Suppose $f(x)$ is $\rho_s$-strongly convex and each individual component $f_i(x)$ is $\ell_s$-strongly smooth with parameter $s = 2k + k^*$. Let $\kappa_s = \frac{\ell_s}{\rho_s}$ and the sparsity level $k \ge (1 + 64\kappa_s^2)k^*$. Set the learning rate $\eta = \frac{1}{2\ell_s}$ and the mini-batch size $s_t = \tau\omega^{-t}$ with $\omega = 1 - \frac{1}{16\kappa_s}$ and $\tau \ge \frac{148 B \kappa_s^2}{\rho_s[f(x_0) - f(x^*)]}$. Then, for the sparsity-constrained problem, the output $x_T$ of Algorithm 1 satisfies
$$\mathbb{E}[f(x_T) - f(x^*)] \;\le\; \Big(1 - \frac{1}{16\kappa_s}\Big)^{T}[f(x_0) - f(x^*)].$$

Theorem 2 shows that, for the sparsity-constrained problem, HSG-HT in expectation also enjoys linear convergence in terms of the objective value under a gradually (exponentially) expanding mini-batch size. The result also implies that the expected value of $f(x_t)$ can be arbitrarily close to the $k^*$-sparse target value $f(x^*)$ as long as the number of iterations is sufficiently large. This property is important, since in realistic problems, such as classification or regression, the closer $f(x)$ is to the optimum $f(x^*)$, the better the prediction performance tends to be. FG-HT [13] also enjoys this property. In contrast, for SVRG-HT [21], the known convergence bound is $\mathbb{E}[f(x_t) - f(x^*)] \le O(\zeta^t + \sqrt{s}\,\|\nabla f(x^*)\|_\infty)$ for some shrinkage rate $\zeta\in(0,1)$. That result is inferior to ours due to the presence of the non-vanishing statistical term $\sqrt{s}\,\|\nabla f(x^*)\|_\infty$.

## 4 Acceleration via Heavy-Ball Method

In this section, we show that HSG-HT can be effectively accelerated by applying the heavy ball technique [28, 29].
As proposed in option O2 of Algorithm 1, the idea is to combine the estimated gradient $g_t$ with a small momentum $\nu(x_t - x_{t-1})$ and modify the update to $x_{t+1} = \Phi_k(x_t - \eta g_t + \nu(x_t - x_{t-1}))$. The following result confirms that this accelerated variant, i.e., AHSG-HT, can significantly improve the efficiency of HSG-HT for quadratic loss functions. A proof of this result can be found in Appendix C.1.

Theorem 3. Suppose the objective function $f(x)$ is quadratic and satisfies the RSC and RSS conditions with parameter $\hat{s} = 3k + k^*$. Let $\kappa_{\hat{s}} = \frac{\ell_{\hat{s}}}{\rho_{\hat{s}}}$. Assume the sparsity/low-rankness level $k \ge (1 + 16\kappa_{\hat{s}})k^*$. Set the learning rate $\eta = \frac{4}{(\sqrt{\ell_{\hat{s}}} + \sqrt{\rho_{\hat{s}}})^2}$, the mini-batch size $s_t = \tau\omega^{-t}$ with $\omega = \big(1 - \frac{1}{18\sqrt{\kappa_{\hat{s}}}}\big)^2$ and $\tau \ge \frac{81 B \kappa_{\hat{s}}}{4(\sqrt{\rho_{\hat{s}}} + \sqrt{\ell_{\hat{s}}})^4\,\|x_0 - x^*\|^2}$, and the momentum parameter $\nu = \big(\frac{\sqrt{\kappa_{\hat{s}}} - 1}{\sqrt{\kappa_{\hat{s}}} + 1}\big)^2$. Then the output $x_T$ of AHSG-HT in Algorithm 1 satisfies
$$\mathbb{E}\|x_T - x^*\| \;\le\; 2\Big(1 - \frac{1}{18\sqrt{\kappa_{\hat{s}}}}\Big)^{T}\|x_0 - x^*\| + \frac{8\sqrt{\kappa_{\hat{s}}}}{(\sqrt{\rho_{\hat{s}}} + \sqrt{\ell_{\hat{s}}})^2}\,\|\nabla_{\hat{I}} f(x^*)\|,$$
where $\hat{I} = \mathrm{supp}(\Phi_{3k}(\nabla f(x^*))) \cup \mathrm{supp}(x^*)$ and $T$ is the number of iterations.

From this result, we observe that for both sparsity- and rank-constrained quadratic loss minimization problems, AHSG-HT has a faster convergence rate, $(1 - \frac{1}{18\sqrt{\kappa_{\hat{s}}}})$, than the rate $(1 - \frac{1}{480\kappa_s})^{1/2}$ of HSG-HT. This is because the restricted condition number $\kappa_{\hat{s}}$ is usually comparable to, or even smaller than, $\kappa_s$, since the factor $k$ in $\hat{s} = 3k + k^*$ is allowed to be smaller than that in $s = 2k + k^*$ (explained below). Moreover, the acceleration relaxes the restriction on the sparsity/low-rankness level $k$: AHSG-HT allows $k = \Omega(\kappa_{\hat{s}} k^*)$, which is considerably weaker than the condition $k = \Omega(\kappa_s^2 k^*)$ required in the analysis of the other hard thresholding algorithms, including HSG-HT, FG-HT and SVRG-HT. As a direct consequence, the statistical error bound $O(\|\nabla_{\hat{I}} f(x^*)\|)$ of AHSG-HT is improved in the sense that the cardinality $|\hat{I}| \le 3k + k^*$ has a better dependency on the restricted condition number $\kappa_s$.

To better illustrate the boosted efficiency, we establish the IFO and hard thresholding complexity of AHSG-HT in Corollary 2, whose proof is given in Appendix C.2.

Corollary 2. Suppose the conditions in Theorem 3 hold. To achieve $\mathbb{E}\|x_T - x^*\| \le \epsilon + \frac{8\sqrt{\kappa_{\hat{s}}}\,\|\nabla_{\hat{I}} f(x^*)\|}{(\sqrt{\rho_{\hat{s}}} + \sqrt{\ell_{\hat{s}}})^2}$, the IFO complexity of AHSG-HT in Algorithm 1 is $O(\frac{\sqrt{\kappa_{\hat{s}}}}{\epsilon})$ and the hard thresholding complexity is $O(\sqrt{\kappa_{\hat{s}}}\log\frac{1}{\epsilon})$.

Corollary 2 shows that, equipped with heavy ball acceleration, the IFO complexity of HSG-HT in the quadratic case is reduced from $O(\frac{\kappa_s}{\epsilon})$ to $O(\frac{\sqrt{\kappa_{\hat{s}}}}{\epsilon})$, and its hard thresholding complexity is reduced from $O(\kappa_s\log\frac{1}{\epsilon})$ to $O(\sqrt{\kappa_{\hat{s}}}\log\frac{1}{\epsilon})$. Such an improvement in the dependency on the restricted condition number is noteworthy in large-scale ill-conditioned learning problems.

[Figure 1: Single-epoch processing: comparison among hard thresholding algorithms for a single pass over data on sparse logistic regression with regularization parameter $\lambda = 10^{-5}$. Panels: rcv1 ($k = 200$) and real-sim ($k = 500$); each panel plots the objective distance $\log(f - f^*)$ against #IFO and against #Hard Thresholding for FG-HT, SG-HT, SVRG-HT, HSG-HT and AHSG-HT.]

## 5 Experiments

We now compare the numerical performance of HSG-HT and AHSG-HT with several state-of-the-art algorithms, including FG-HT [13], SG-HT [20] and SVRG-HT [21, 22]. We evaluate all the considered algorithms on two sets of learning tasks.
The first set contains two sparsity-constrained problems: logistic regression with $f_i(x) = \log(1 + \exp(-b_i a_i^\top x)) + \frac{\lambda}{2}\|x\|_2^2$, and multi-class softmax regression with $f_i(x) = \sum_{j=1}^{c}\big[\frac{\lambda}{2c}\|x_j\|_2^2 - \mathbb{1}\{b_i = j\}\log\frac{\exp(a_i^\top x_j)}{\sum_{l=1}^{c}\exp(a_i^\top x_l)}\big]$, where $b_i$ is the target output of $a_i$ and $c$ is the number of classes. The second is a rank-constrained linear regression problem,
$$\min_x\; \frac{1}{n}\sum_{i=1}^{n}\|b_i - \langle x, a_i\rangle\|_2^2 + \frac{\lambda}{2}\|x\|_F^2, \quad \text{s.t.}\quad \mathrm{rank}(x)\le k,$$
which has several important applications, including multi-class classification and multi-task regression for simultaneously learning shared characteristics of all classes/tasks [37], as well as high-dimensional image and financial data modeling [5, 8].

We run simulations on six datasets: rcv1, real-sim, mnist, news20, coil100 and caltech256. The details of these datasets are described in Appendix E. For HSG-HT and AHSG-HT, we follow our theory and expand the mini-batch size $s_t$ exponentially, but with a small exponential rate and $\tau = 1$. Since there is no ground truth on real data, we run FG-HT until $\|x_t - x_{t+1}\|/\|x_t\| \le 10^{-6}$ and use the resulting $f(x_t)$ as the approximate optimal value $f^*$ for sub-optimality estimation in Figures 1 and 2.

Single-epoch evaluation results. We first consider the sparse logistic regression problem with single-epoch processing. As demonstrated in Figure 1 (more experiments are given in Appendix E), HSG-HT and AHSG-HT converge significantly faster than the other considered algorithms within one pass over the data. This confirms the theoretical predictions in Corollaries 1 and 2 that HSG-HT and AHSG-HT are cheaper in IFO complexity than the sample-size-dependent algorithms when the desired accuracy is moderately small and the data scale is large. In terms of hard thresholding complexity, AHSG-HT and HSG-HT are comparable to FG-HT, and all three require far fewer hard thresholding operations than SG-HT and SVRG-HT to reach the same accuracy. This also aligns well with our theory: in the one-pass setting, roughly speaking, AHSG-HT and HSG-HT respectively need $O(\sqrt{\kappa_{\hat{s}}}\log\frac{n}{\sqrt{\kappa_{\hat{s}}}})$ and $O(\kappa_s\log\frac{n}{\kappa_s})$ hard thresholding steps, both of which are much smaller than the $O(n)$ complexity of SG-HT and SVRG-HT. From Figure 1, and from the magnified figures in Appendix E that better display the objective decrease along the hard thresholding iterations, one can observe that AHSG-HT has sharper convergence behavior than HSG-HT, which demonstrates its acceleration power.

Multi-epoch evaluation results. We further evaluate the considered algorithms on the sparsity-constrained softmax regression and rank-constrained linear regression problems, for which an approach usually needs multiple cycles of data processing to reach a high-accuracy solution. In our implementation, HSG-HT (and AHSG-HT) degenerates to plain (and accelerated) FG-HT when the mini-batch size exceeds the data size. This degenerate case, however, does not occur in our experiments with the specified small expansion rate. The corresponding results are illustrated in Figure 2, from which we can observe that HSG-HT and AHSG-HT exhibit much sharper convergence curves and lower hard thresholding complexity than the other considered hard thresholding algorithms.
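As a quick reference for the first task above, here is a sketch of the per-sample sparse logistic regression loss $f_i(x) = \log(1 + \exp(-b_i a_i^\top x)) + \frac{\lambda}{2}\|x\|_2^2$ and its gradient, i.e., what a `grad_fi` oracle would return when plugging this problem into Algorithm 1; the function names are ours and not from the paper's code.

```python
import numpy as np

def logistic_fi(x, a_i, b_i, lam):
    """Per-sample loss: log(1 + exp(-b_i * <a_i, x>)) + (lam / 2) * ||x||_2^2."""
    margin = b_i * a_i.dot(x)
    return np.logaddexp(0.0, -margin) + 0.5 * lam * x.dot(x)

def logistic_grad_fi(x, a_i, b_i, lam):
    """Gradient of the per-sample loss: -b_i * sigmoid(-b_i * <a_i, x>) * a_i + lam * x."""
    margin = b_i * a_i.dot(x)
    return -b_i * a_i / (1.0 + np.exp(margin)) + lam * x
```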
[Figure 2: Multi-epoch processing: comparison among hard thresholding algorithms with multiple passes over data, both with regularization parameter $\lambda = 10^{-5}$. (a) Sparsity-constrained softmax regression: mnist ($k = 200$) and news20 ($k = 2000$); (b) rank-constrained linear regression: coil100 ($k = 100$) and caltech256 ($k = 100$). Each panel plots the objective distance $\log(f - f^*)$ for FG-HT, SG-HT, SVRG-HT, HSG-HT and AHSG-HT against the number of passes over the data and against #Hard Thresholding/$n$.]

## 6 Conclusions

In this paper, we proposed HSG-HT, a hybrid stochastic gradient hard thresholding method for sparsity/rank-constrained empirical risk minimization problems. We proved that HSG-HT enjoys an $O(\kappa_s\log\frac{1}{\epsilon})$ hard thresholding complexity like full gradient methods, while possessing a sample-size-independent IFO complexity of $O(\frac{\kappa_s}{\epsilon})$. Compared with the existing variance-reduced hard thresholding algorithms, HSG-HT enjoys a lower overall computational cost when the sample size is large and the target accuracy is moderately small. Furthermore, we showed that HSG-HT can be effectively accelerated via the heavy ball acceleration technique, attaining an improved dependency on the restricted condition number. The provable efficiency of HSG-HT and its accelerated variant has been confirmed by extensive numerical evaluation against prior state-of-the-art algorithms.

## Acknowledgements

Jiashi Feng was partially supported by NUS startup R-263-000-C08-133, MOE Tier-I R-263-000-C21-112, NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112. Xiao-Tong Yuan was supported in part by the Natural Science Foundation of China (NSFC) under Grant 61522308 and Grant 61876090, and in part by Tencent AI Lab Rhino-Bird Joint Research Program No. JR201801.

## References

[1] D. Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289-1306, 2006.
[2] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Information Theory, 53(12):4655-4666, 2007.
[3] S. Bahmani, B. Raj, and P. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14:807-841, 2013.
[4] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. In Proc. Conf. Neural Information Processing Systems, pages 1935-1943, 2011.
[5] S. Negahban and M. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069-1097, 2011.
[6] P. Zhou and J. Feng. Outlier-robust tensor PCA. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1-9, 2017.
[7] Y. Wang, C. Xu, C. Xu, and D. Tao. Beyond RPCA: Flattening complex noise in the frequency domain. In AAAI Conf. Artificial Intelligence, 2017.
[8] A. Rohde and A. Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887-930, 2011.
[9] P. Zhou, C. Lu, Z. Lin, and C. Zhang. Tensor factorization for low-rank tensor completion. IEEE Transactions on Image Processing, 27(3):1152-1163, 2017.
[10] T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265-274, 2009.
[11] S. Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543-2563, 2011.
[12] X. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18(166):1-43, 2018.
[13] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Proc. Conf. Neural Information Processing Systems, pages 685-693, 2014.
[14] R. Garg and R. Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In Proc. Int'l Conf. Machine Learning, pages 337-344. ACM, 2009.
[15] X. Yuan, P. Li, and T. Zhang. Exact recovery of hard thresholding pursuit. In Proc. Conf. Neural Information Processing Systems, pages 3558-3566, 2016.
[16] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[17] S. Van de Geer. High-dimensional generalized linear models and the LASSO. The Annals of Statistics, 36(2):614-645, 2008.
[18] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[19] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Proc. Conf. Neural Information Processing Systems, pages 1329-1336, 2005.
[20] N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Trans. on Information Theory, 63(11):6869-6895, 2017.
[21] X. Li, R. Arora, H. Liu, J. Haupt, and T. Zhao. Nonconvex sparse learning via stochastic optimization with progressive variance reduction. In Proc. Int'l Conf. Machine Learning, 2016.
[22] J. Shen and P. Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1-42, 2018.
[23] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Information Processing Systems, pages 315-323, 2013.
[24] J. Chen and Q. Gu. Accelerated stochastic block coordinate gradient descent for sparsity constrained nonconvex optimization. In Proc. Conf. Uncertainty in Artificial Intelligence, 2016.
[25] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913-926, 1997.
[26] M. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380-A1405, 2012.
[27] P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity. In Proc. Conf. Neural Information Processing Systems, 2018.
[28] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
[29] A. Govan. Introduction to optimization. North Carolina State University, SAMSI NDHS Undergraduate Workshop, 2006.
[30] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145-151, 1999.
[31] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[32] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
[33] S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In Proc. Int'l Conf. Learning Representations, 2018.
[34] R. Khanna and A. Kyrillidis. IHT dies hard: Provable accelerated iterative hard thresholding. In Proc. Int'l Conf. Artificial Intelligence and Statistics, volume 84, pages 188-198, 2018.
[35] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proc. Int'l Conf. Machine Learning, pages 353-361, 2015.
[36] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Proc. Conf. Neural Information Processing Systems, pages 3059-3067, 2014.
[37] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proc. Int'l Conf. Machine Learning, pages 17-24. ACM, 2007.