# dual_iterative_hard_thresholding__b26923da.pdf Journal of Machine Learning Research 21 (2020) 1-50 Submitted 7/18; Revised 12/19; Published 6/20 Dual Iterative Hard Thresholding Xiao-Tong Yuan xtyuan@nuist.edu.cn B-DAT Lab and CICAEET, Nanjing University of Information Science and Technology Nanjing, 210044, China Bo Liu kfliubo@gmail.com JD Finance America Corporation Mountain View, CA 94043, USA Lezi Wang wanglezi.bupt@gmail.com Department of Computer Science, Rutgers University Piscataway, NJ 08854, USA Qingshan Liu qsliu@nuist.edu.cn B-DAT Lab and CICAEET, Nanjing University of Information Science and Technology Nanjing, 210044, China Dimitris N. Metaxas dnm@cs.rutgers.edu Department of Computer Science, Rutgers University Piscataway, NJ 08854, USA Editor: David Wipf Iterative Hard Thresholding (IHT) is a popular class of first-order greedy selection methods for loss minimization under cardinality constraint. The existing IHT-style algorithms, however, are proposed for minimizing the primal formulation. It is still an open issue to explore duality theory and algorithms for such a non-convex and NP-hard combinatorial optimization problem. To address this issue, we develop in this article a novel duality theory for ℓ2-regularized empirical risk minimization under cardinality constraint, along with an IHT-style algorithm for dual optimization. Our sparse duality theory establishes a set of sufficient and/or necessary conditions under which the original non-convex problem can be equivalently or approximately solved in a concave dual formulation. In view of this theory, we propose the Dual IHT (DIHT) algorithm as a super-gradient ascent method to solve the non-smooth dual problem with provable guarantees on primal-dual gap convergence and sparsity recovery. Numerical results confirm our theoretical predictions and demonstrate the superiority of DIHT to the state-of-the-art primal IHT-style algorithms in model estimation accuracy and computational efficiency.1 Keywords: Iterative hard thresholding, Duality theory, Sparsity recovery, Non-convex optimization 1. Introduction We consider the problem of learning sparse predictive models which has been extensively studied in high-dimensional statistical learning (Hastie et al., 2015; Bach et al., 2012). Given 1. A conference version of this article appeared at ICML 2017 (Liu et al., 2017). The first two authors contributed equally to this article. c 2020 Xiao-Tong Yuan, Bo Liu, Lezi Wang, Qingshan Liu and Dimitris N. Metaxas. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-487.html. Yuan, Liu, Wang, Liu, and Metaxas a set of training samples {(xi, yi)}N i=1 in which xi Rd is the feature representation and yi R the corresponding label, the following sparsity-constrained ℓ2-norm regularized loss minimization problem is often considered for learning the sparse representation of a linear predictive model (Bahmani et al., 2013): min w 0 k P(w) := 1 i=1 l(w xi, yi) + λ Here w Rd is the model parameter vector, l(w xi; yi) is a convex function that measures the linear regression/prediction loss of w at data point (xi, yi), and λ > 0 is the regularization modulus. For example, the squared loss l(u, v) = 1 2(u v)2 is used in linear regression and the hinge loss l(u, v) = max{0, 1 uv} in support vector machines. The cardinality constraint w 0 k is imposed for improving learnability and interpretability of model when d is potentially much larger than N. Due to the presence of such a constraint, the problem (1) is simultaneously non-convex and NP-hard even for the quadratic loss function (Natarajan, 1995), hence is challenging for optimization. An alternative way to address this challenge is to use proper convex relaxation, e.g., ℓ1-norm (Tibshirani, 1996) and k-support norm (Argyriou et al., 2012), as an alternative of the cardinality constraint. However, the convex relaxation based techniques tend to introduce bias for parameter estimation (Zhang and Huang, 2008). In this article, we are interested in algorithms that directly minimize the non-convex formulation (1). Early efforts mainly lie in compressed sensing for signal recovery, which is a special case of (1) with squared loss (Donoho, 2006; Pati et al., 1993). Among others, a family of the so called Iterative Hard Thresholding (IHT) methods have gained particular interests and they have been witnessed to offer the fastest and most scalable solutions in many cases (Blumensath and Davies, 2009; Foucart, 2011). In recent years, IHT-style methods have been generalized to handle more general convex loss functions (Beck and Eldar, 2013; Jain et al., 2014; Yuan et al., 2018) as well as structured sparsity constraints (Jain et al., 2016). The common theme of these methods is to iterate between gradient descent and hard thresholding to maintain sparsity of solution while minimizing the objective value. In our problem setting, a plain IHT iteration is given by w(t) = Hk w(t 1) η P(w(t 1)) , where Hk( ) is the truncation operator that preserves the top k (in magnitude, with ties broken arbitrarily) entries of input and sets the remaining to be zero, and η > 0 is the learning rate. In practice, IHT-style algorithms have found their applications in deep neural networks compression (Jin et al., 2016), sparse signal demixing from noisy observations (Soltani and Hegde, 2017), and fluorescence molecular lifetime tomography (Cai et al., 2016), to name a few. Although IHT-style methods have long been studied, so far this class of methods is only designed for optimizing the primal formulation (1). It still remains an open problem to investigate the feasibility of solving the original NP-hard/non-convex formulation in a dual space that might potentially generate new sparsity recovery theory and further improve computational efficiency. To fill this gap, inspired by the emerging success of dual Dual Iterative Hard Thresholding optimization methods in regularized learning problems (Shalev-Shwartz and Zhang, 2013b, 2016; Xiao, 2010; Tan et al., 2018), we establish in this article a sparse Lagrangian duality theory and propose an IHT-style algorithm along with its stochastic extension for efficient dual optimization. 1.1 Overview of Contribution The core contribution of this work is two-fold in theory and algorithm. As the theoretical aspect of our contribution, we have established a novel strong sparse Lagrangian duality theory for the NP-hard and non-convex combinatorial optimization problem (1) which to the best of our knowledge has not been reported elsewhere in literature. A fundamental result is showing that the following condition is sufficient and necessary to guarantee a zero primal-dual gap between the primal non-convex problem and the dual concave problem: where w is the k-sparse primal minimizer and α [ l1( w x1), ..., l N( w x N)]. The strong sparse duality theory suggests a natural way for finding the global minimum of the sparsityconstrained minimization problem (1) via equivalently maximizing its dual problem as given in (6) which is concave. Although can be partially verified for some special models such as sparse linear regression and logistic regression, the preceding condition could more often than not be violated in practice, leading to non-zero primal-dual gap. In order to address this issue, we further develop an approximate sparse duality theory to cover the setting where sparse strong duality does not hold exactly. On the algorithm side, we propose the Dual IHT (DIHT) algorithm as a super-gradient ascent method to maximize the non-smooth dual objective. In high level description, DIHT iterates between dual gradient ascent and primal hard thresholding pursuit until convergence. A stochastic variant of DIHT is further proposed to handle large-scale learning problems. For both algorithms, we provide non-asymptotic convergence analysis on dual estimation error, primal-dual gap, and sparsity recovery as well. In contrast to the existing analysis for primal IHT-style algorithms, our analysis is not explicitly relying on the Restricted Isometry Property (RIP) conditions and thus would be less restrictive in reallife high-dimensional estimation. Numerical results on synthetic and benchmark data sets demonstrate that DIHT and its stochastic extension significantly outperform the state-ofthe-art primal IHT-style algorithms in estimation accuracy and computational efficiency. The main contributions of this article are highlighted in below: Sparse Lagrangian duality theory: we introduce a sparse saddle point theorem (Theorem 2), a sparse mini-max theorem (Theorem 5) and a sparse strong duality theorem (Theorem 8). Moreover, we establish an approximate duality theory (Theorem 13) as a complement to the strong sparse duality theory. Dual iterative hard thresholding algorithm: we propose an IHT-style algorithm along with its stochastic extension for non-smooth dual maximization. These algorithms have been shown to converge at sub-linear rates when the individual loss functions Yuan, Liu, Wang, Liu, and Metaxas Notation Definition N number of samples d number of features k the sparsity level hyper-parameter λ the regularization strength hyper-parameter P(w) the primal objective function w the primal k-sparse minimizer given by w := arg min w 0 k P(w) D(α) the dual objective function F the feasible set of dual variables α the dual maximizer given by α := arg maxα F D(α) ϵPD(w, α) the sparse primal-dual gap defined by ϵPD(w, α) := P(w) D(α) X the data matrix of which the columns are N data samples HF (x) the truncation operator that restricts vector x to the index set F Hk(x) the truncation operator that restricts x to its top k (in magnitude) entries supp(x) the index set of non-zero entries of x x 0 number of non-zero entries of x, i.e., x 0 := |supp(x)| [x]i the i-th entry of x x the largest (in magnitude) element of x, i.e., x := maxi |[x]i| xmin the smallest non-zero element of x, i.e., xmin := mini supp(x) |[x]i| λmax(A) the largest eigenvalue of matrix A λmin(A) the smallest eigenvalue of A A the spectral norm (the largest singular value) of A AF the restriction of A with rows restricted to F Table 1: Table of notation. are Lipschitz smooth (see Theorem 15 and Theorem 19), and at linear rates if further assuming that the loss functions are strongly convex (see Theorem 17 and Theorem 22). 1.2 Notation and Organization Notation. The key quantities and notations that commonly used in our analysis are summarized in Table 1. Organization. The rest of this article is organized as follows: In Section 2 we briefly review the related literature. In Section 3 we develop a Lagrangian duality theory for sparsityconstrained minimization problems. The dual IHT algorithms along with convergence analysis are presented in Section 4. The numerical evaluation results are reported in Section 5. Finally, the concluding remarks are made in Section 6. All the technical proofs are deferred to the appendix sections. 2. Related Work For generic convex objective beyond quadratic loss, the rate of convergence and parameter estimation error of IHT-style methods were analyzed under proper RIP (or restricted strong condition number) bounding conditions (Blumensath, 2013; Bahmani et al., 2013; Dual Iterative Hard Thresholding Yuan et al., 2014). In the work of Jain et al. (2014), several sparsity-level-relaxed variants of IHT-style algorithms were presented for which the high-dimensional estimation consistency can be established without requiring the RIP conditions. The support recovery performance of IHT-style methods has been studied to understand when the algorithm can exactly recover the support of a sparse signal from its compressed measurements (Yuan et al., 2016; Shen and Li, 2017a,b). A Nesterov s momentum based hard thresholding method was proposed by Khanna and Kyrillidis (2018) to further improve the efficiency of IHT. In largescale settings where a full gradient evaluation on all data samples becomes a bottleneck, stochastic and variance reduction techniques have been adopted to improve the computational efficiency of IHT via leveraging the finite-sum structure of learning problem (Nguyen et al., 2017; Li et al., 2016; Chen and Gu, 2016; Shen and Li, 2018; Zhou et al., 2018). For distributed learning with sparsity, an approximate Newton-type extension of IHT was developed that takes advantage of the stochastic nature of problem to improve communication efficiency. The generalization performance of IHT has recently been studied from the perspective of algorithmic stability (Yuan and Li, 2020). Another related line of research is dual optimization which has gained considerable popularity in various machine learning tasks including kernel learning (Hsieh et al., 2008), online learning (Xiao, 2010), multi-task learning (Lapin et al., 2014) and graphical models learning (Mazumder and Hastie, 2012). In recent years, a number of stochastic dual coordinate ascent (SDCA) methods have been proposed for solving large-scale regularized loss minimization problems (Shalev-Shwartz and Zhang, 2013a,b, 2016). All these methods exhibit fast convergence rate in theory and highly competitive numerical performance in practice. Shalev-Shwartz (2016) also developed a dual free variant of SDCA that supports non-regularized objectives and non-convex individual loss functions. To further improve computational efficiency, some primal-dual methods are developed to alternately minimize the primal objective and maximize the dual objective. The successful examples of primaldual methods include learning total variation regularized model (Chambolle and Pock, 2011) and generalized Dantzig selector (Lee et al., 2016). For large-scale machine learning, several stochastic and distributed variants were developed to make the primal-dual algorithms more computationally efficient and scalable (Zhang and Xiao, 2017; Yu et al., 2015; Tan et al., 2018; Xiao et al., 2019). Our work lies at the intersection of the above two disciplines of research. Although dual optimization methods have long been studied in machine learning, it still remains largely unknown, in both theory and algorithm, how to apply dual methods to the non-convex and NP-hard sparse estimation problem (1) where the non-convexity arises from the cardinality constraint rather than the objective function. The main contribution of the present article is closing this gap by presenting a novel sparse Lagrangian duality theory and a dual IHT method with provable guarantees on sparsity recovery accuracy and efficiency. 3. A Sparse Lagrangian Duality Theory In this section, we establish weak and strong duality theory that guarantees the original nonconvex and NP-hard problem in (1) can be equivalently solved in a dual space. The results in this part lay a theoretical foundation for developing dual sparse estimation methods. Yuan, Liu, Wang, Liu, and Metaxas 3.1 Sparse Strong Duality Theory From here onward we abbreviate li(u) = l(u, yi). The convexity of l(w xi, yi) implies that li(u) is also convex. Let l i (αi) = max u {αiu li(u)} be the convex conjugate of li(u) and Fi R be the feasible set of αi. According to the standard expression of li(u) = maxαi Fi {αiu l i (αi)}, the problem (1) can be reformulated into the following mini-max formulation: min w 0 k 1 N i=1 max αi Fi{αiw xi l i (αi)} + λ The following defined Lagrangian form will be useful in analysis: L(w, α) = 1 αiw xi l i (αi) + λ where α = [α1, ..., αN] F := F1 FN RN is the vector of dual variables. We now introduce the following concept of sparse saddle point which is a restriction of the conventional saddle point to the setting of sparse optimization. Definition 1 (Sparse saddle point) Let k > 0 be an integer. A pair ( w, α) Rd F is said to be a k-sparse saddle point for L if w 0 k and the following holds for all w 0 k, α F: L( w, α) L( w, α) L(w, α). (3) Different from the conventional definition of saddle point, the k-sparse saddle point only requires that the inequality (3) holds for an arbitrary k-sparse vector w. The following result is a basic sparse saddle point theorem for L. Throughout the article, we will use l ( ) to denote a sub-gradient (or super-gradient) of a convex (or concave) function l( ), and use l( ) to denote its sub-differential (or super-differential). Theorem 2 (Sparse saddle point theorem) Let w Rd be a k-sparse primal vector and α F be a dual vector. Then the pair ( w, α) is a k-sparse saddle point for L if and only if the following conditions hold: (a) w solves the primal problem in (1); (b) α [ l1( w x1), ..., l N( w x N)]; (c) w = Hk 1 λN PN i=1 αixi . Remark 3 Theorem 2 shows that the conditions (a) (c) are sufficient and necessary to guarantee the existence of a sparse saddle point for the Lagrangian form L. The condition (c) can be regarded as a Sparsity Constraint Qualification condition to guarantee the existence of saddle point. Remark 4 Let us consider P ( w) = 1 N PN i=1 αixi + λ w P( w). Denote F = supp( w). It is straightforward to verify that the condition (c) in Theorem 2 is equivalent to H F (P ( w)) = 0, wmin 1 λ P ( w) . (4) Dual Iterative Hard Thresholding To gain some intuition of the above condition, let us consider a simple example where w RN, {xi = ei} are the standard basis of RN and P(w) = 1 2N PN i=1(yi wi)2 + λ is quadratic. In this case, it is trivial to see that the primal minimizer is given by w = Hk y 1+λN and P ( w) = (λ + 1 N y. Denote [|y|](j) the j-th largest entry of |y|, i.e., [|y|](1) [|y|](2) ... [|y|](N). Then the above condition (4) is characterized by [|y|](k) 1+λN [|y|](k+1) λN which basically requires λ [|y|](k+1) N([|y|](k) [|y|](k+1)). Obviously, we need [|y|](k) to be strictly larger than [|y|](k+1) to guarantee the existence of such a λ. The following sparse mini-max theorem guarantees that the min and max in formulation (2) can be safely switched if and only if there exists a sparse saddle point for L(w, α). Theorem 5 (Sparse mini-max theorem) The mini-max relationship max α F min w 0 k L(w, α) = min w 0 k max α F L(w, α) (5) holds if and only if there exists a sparse saddle point ( w, α) for L. The sparse mini-max result in Theorem 5 provides sufficient and necessary conditions under which one can safely exchange a min-max for a max-min, in the presence of non-convex cardinality constraint. The following corollary is a direct consequence of invoking Theorem 2 to Theorem 5. Corollary 6 The mini-max relationship max α F min w 0 k L(w, α) = min w 0 k max α F L(w, α) holds if and only if there exist a k-sparse primal vector w Rd and a dual vector α F such that the conditions (a) (c) in Theorem 2 are fulfilled. The mini-max result in Theorem 5 can be used as a basis for establishing sparse duality theory. Indeed, we have already shown the following: min w 0 k max α F L(w, α) = min w 0 k P(w). This is called the primal minimization problem and it is the min-max side of the sparse minimax theorem. The other side, the max-min problem, will be called as the dual maximization problem with dual objective function D(α) := min w 0 k L(w, α), i.e., max α F D(α) = max α F min w 0 k L(w, α). (6) The following Proposition 7 shows that the dual objective function D(α) is concave and explicitly gives the expression of its super-differential. Proposition 7 The dual objective function D(α) is given by i=1 l i (αi) λ Yuan, Liu, Wang, Liu, and Metaxas where w(α) = Hk 1 λN PN i=1 αixi . Moreover, D(α) is concave and its super-differential is given by N [w(α) x1 l 1(α1), ..., w(α) x N l N(αN)]. Particularly, if w(α) is unique at α with respect to the truncation operator Hk( ) and {l i }i=1,...,N are differentiable, then D(α) is unique and it is the super-gradient of D(α). In view of Theorem 2 and Theorem 5, we are in the position to establish a sparse strong duality theorem which gives the sufficient and necessary conditions under which the optimal values of the primal and dual problems coincide. Theorem 8 (Sparse strong duality theorem) Let w Rd be a k-sparse primal vector and α F be a dual vector. Then α solves the dual problem in (6), i.e., D( α) D(α), α F, and P( w) = D( α) if and only if the pair ( w, α) satisfies the conditions (a) (c) in Theorem 2. We define the sparse primal-dual gap ϵPD(w, α) := P(w) D(α). The main message conveyed by Theorem 8 is that the conditions (a) (c) in Theorem 2 are sufficient and necessary to guarantee a zero primal-dual gap at the sparse primal-dual pair ( w, α). 3.2 On Dual Sufficient Conditions for Sparse Strong Duality The previously established strong sparse duality theory relies on the sparsity constraint qualification condition (c) in Theorem 2. This key condition is essentially imposed on the underlying primal sparse minimizer w one would like to recover. To make the results more comprehensive, we further provide in the following theorem a sufficient condition imposed on the dual maximizer of D(α) to guarantee zero primal-dual gap. From now on we denote X = [x1, ..., x N] Rd N the data matrix which contains the N data samples as columns. Theorem 9 Assume that each l i is differentiable and smooth, and each dual feasible set Fi is convex. Let α = arg maxα D(α) be a dual maximizer. If w( α) = Hk 1 λN PN i=1 αixi is unique at α with respect to truncation operation, then (w( α), α) is a sparse saddle point and w( α) is a primal minimizer of P(w) satisfying P(w( α)) = D( α). Remark 10 The dual sufficient condition given in Theorem 9 basically shows that under mild conditions, if w( α) constructed from a dual maximizer α is unique with respect to the truncation operator Hk( ), then sparse strong duality holds. Such a uniqueness condition is computationally more verifiable than the condition (c) in Theorem 2 as maximizing the dual concave program is easier than minimizing the primal non-convex problem. To gain better intuition of Theorem 9, we discuss its implications for sparse linear regression and logistic regression models which are commonly used in statistical machine learning. Example I: Sparse strong duality for linear regression. Consider the special case of the primal problem (1) with least square loss l(w xi, yi) = 1 2(yi w xi)2. Let us write l(w xi, yi) = li(w xi) with li(a) = 1 2(a yi)2. It is standard to know that the convex conjugate of li(a) is l i (αi) = α2 i 2 + yiαi and Fi = R. Obviously, l i is differentiable and Dual Iterative Hard Thresholding Fi is convex. According to Theorem 9, if w( α) = Hk 1 λN PN i=1 αixi is unique at the dual maximizer α with respect to truncation operation, then w( α) is a primal minimizer of P(w). To illustrate this claim, let us consider the same example as presented in Remark 4, of which the dual objective function is expressed as α2 i 2 αiyi 1 2λN2 Hk(α) 2. Provided that λN[|y|](k) 1+λN > [|y|](k+1), it can be readily verified that the dual solution is [ α](i) = λN 1+λN [y](i) i {1, ..., k} [y](i) i {k + 1, ..., N} , and w( α) = Hk 1 λN α is then by definition unique with respect to the truncation operator. According to the discussion in Remark 4, w( α) is exactly the primal minimizer. This verifies the validness of Theorem 9 on the considered example. On the other side, to see an example in which the uniqueness condition on w( α) can be violated, let us consider a special case y = [2, 2, 1], λ = 1 and k = 1. From the discussion in Remark 4 we know that the primal sparse minimizer is w = [0.5, 0, 0] (or w = [0, 0.5, 0]) with P( w) = 4/3. In the meanwhile, it can be verified by exhaustive search that the maximal dual objective value is attained at α = [ 1.5, 1.5, 1] with D( α) = 7/6. Obviously, here w( α) = Hk 1 λN α is not unique and the primal-dual gap is non-zero which indicates that strong duality fails in this case. Example II: Sparse strong duality for logistic regression. In logistic regression model, given a k-sparse parameter vector w, the relation between the random feature vector x Rd and its associated random binary label y { 1, +1} is determined by the conditional probability P(y|x; w) = exp(2y w x)/(1 + exp(2y w x)). The logistic loss over a sample (xi, yi) is written by l(w xi, yi) = li(w xi) = log 1 + exp( yiw xi) , where li(a) = log (1 + exp( ayi)). In this case, we have l i (αi) = αiyi log( αiyi)+(1+αiyi) log(1+αiyi) with αiyi [ 1, 0]. Note that l i is differentiable and Fi is convex. Therefore Theorem 9 implies that if w( α) = Hk 1 λN PN i=1 αixi is unique at α with respect to truncation operation, then w( α) is a primal minimizer of P(w) satisfying P(w( α)) = D( α). 3.3 Approximate Sparse Duality The strong sparse duality theory developed in the previous subsection relies on certain sparse constraint qualification conditions imposed on the primal or dual optimizers as appeared in Theorem 2 and Theorem 9. Although these conditions can be partially verified for some special models such as sparse linear regression and logistic regression, they could still be restrictive and more often than not be violated in practice, leading to non-zero primal-dual gap at primal and dual optimizers. To cover the regime where sparse strong duality does not hold exactly, we further derive in the present subsection a set of primal-dual gap bounding results which only require the sparse duality holds in an approximate way. The starting point is to define the concept of approximate sparse saddle point which is fundamental to the subsequent analysis. Yuan, Liu, Wang, Liu, and Metaxas Definition 11 (Approximate sparse saddle point) Let k > 0 be an integer and ν 0 be a scalar. A pair ( w, α) Rd F is said to be a ν-approximate k-sparse saddle point for L(w, α) if w 0 k and the following is valid for all k-sparse vector w and α F: L( w, α) L(w, α) + ν. (7) Obviously, the above definition allows L( w, α) L( w, α) + ν for any α F, and L( w, α) L(w, α) + ν for any w with w 0 k. Specially when ν = 0, an approximate sparse saddle point reduces to an exact sparse saddle point. The approximate sparse saddle point can also be understood as an extension of the so called approximate saddle point in approximate duality theory (Scovel et al., 2007) to the setting of sparsity-constrained minimization. Based on the concept of approximate sparse saddle point, we can prove the following proposition of approximate sparse duality. Proposition 12 Let w Rd be a k-sparse primal vector and α F be a dual vector. Then for any ν 0, the primal-dual gap satisfies P( w) D( α) ν if and only if the pair ( w, α) admits a ν-approximate k-sparse saddle point for L(w, α). Now we are going to bound the primal-dual gap under the condition of approximate sparse duality. Before presenting the main result, we need to define some notations. We say a univariate differentiable function l(x) is µ-strongly convex and ℓ-smooth if x, y, 2 |x y|2 l(y) l(x) l (x), y x ℓ The following defined sparse largest (smallest) eigenvalue of the empirical covariance matrix ˆΣ = 1 N XX will be used in our analysis: γ+ s = max v Rd n v ˆΣv | v 0 s, v = 1 o , γ s = min v Rd n v ˆΣv | v 0 s, v = 1 o . Let us re-express the primal objective function as P(w) = f(w) + λ 2 w 2, where f(w) := 1 i=1 li(w xi). We show in the following theorem that the primal-dual optimizer pair ( w, α) admits an approximate sparse saddle point with approximation level controlled by the underlying statistical error of model. Theorem 13 Assume that the primal loss functions li are µ-strongly convex and ℓ-smooth. Let w be any k-sparse vector satisfying wmin > f( w) ℓγ+ k . Assume that λ µγ k k f( w) ℓγ+ k w P( w) D( α) + k 2 + ℓγ+ k µγ k + λ 2 f( w) 2 , where w is the primal minimizer of P(x) and α is the dual maximizer of D(α). Dual Iterative Hard Thresholding Remark 14 Specially, by setting λ = µγ k k f( w) ℓγ+ k w > 0, we get from the above theorem P( w) D( α) ℓγ+ k w 2 + ℓγ+ k µγ k This bound shows that the optimal primal-dual gap at ( w, α) is controlled by the approximation level ν = O( k f( w) ) which usually represents the statistical estimation error of a nominal vector w. The smaller f( w) is, the more accurate approximation will be. This theorem, however, does not provide any guarantee on the sub-optimality of the primal solution w( α) = Hk 1 λN PN i=1 αixi produced from the dual maximizer α. We leave this prima-dual connection issue as an open problem for future investigation. In any case, this prima-dual gap bound can be a useful tool for getting a rough idea of how well the dual formulation can capture the optimal primal objective value. In the following, we show the implications of Theorem 13 for the sparse linear regression and logistic regression models. Approximate sparse duality for linear regression. Given a k-sparse parameter vector w, assume the samples are generated according to the linear model y = w x + ε where ε is a zero-mean Gaussian random noise variable with parameter σ. Let li(w xi) = 1 2(yi w xi)2 be the least square loss over data sample (xi, yi). In this example, we have ℓ= µ = 1. Suppose xi are drawn from Gaussian distribution with covariance Σ. Then with overwhelming probability we have γ k λmin(Σ) O(k log(d)/N) and γ+ k maxi xi ; and f( w) = O σ p log(d)/N . Thus according to Theorem 13, by setting the regular- ization parameter λ = O σγ k γ+ k w p log(d)/N , the primal-dual gap can be upper bounded with high probability as P( w) D( α) = O σ p k log(d)/N . Approximate sparse duality for logistic regression. Suppose xi are sub-Gaussian with parameter σ in logistic regression model. It is known that f( w) = O σ p hold with high probability (Yuan et al., 2018). Also it is well known that the binary logistic loss li(u) = log (1 + exp( yiu)) is ℓ-smooth with ℓ= 1/4. By assuming without loss of generality that |u| r we can verify that it is also µ-strongly convex with µ = exp(r)/(1 + exp(r))2. Then according to the bound in Theorem 13, by setting the regularization parameter λ = O σ p log(d)/N , the primal-dual gap can be upper bounded with high probability as P( w) D( α) = O σ p k log(d)/N . We remark that the sparse duality theory developed in this section suggests a natural way for finding the global minimum of the sparsity-constrained minimization problem in (1) via maximizing its dual problem in (6). Particularly in the case when the strong sparse duality holds, once the dual maximizer α is estimated, the primal sparse minimizer w can then be recovered from it according to the prima-dual connection w = Hk 1 λN PN i=1 αixi . Since the dual objective function D(α) is shown to be concave, its global maximum can be estimated using off-the-shelf convex/concave optimization methods. In the next section, we present a simple projected super-gradient method to solve the dual maximization problem with strong guarantees on convergence and sparsity recovery. Yuan, Liu, Wang, Liu, and Metaxas 4. Algorithms Let us now consider the dual maximization problem (6) which can be expressed as max α F D(α) = 1 i=1 l i (αi) λ 2 w(α) 2, (8) where w(α) = Hk 1 λN PN i=1 αixi . Generally speaking, D(α) is a non-smooth function because: 1) the conjugate function l i of an arbitrary convex loss li is generally non-smooth and 2) the term w(α) 2 is non-smooth with respect to α due to the truncation operation involved in computing w(α). We propose to adopt projected sub-gradient-type methods to solve the constrained non-smooth dual maximization problem in (6). 4.1 The DIHT Algorithm The Dual Iterative Hard Thresholding (DIHT) algorithm, as outlined in Algorithm 1, is essentially a projected super-gradient method for maximizing D(α). Initialized with w(0) = 0 and α(0) = 0, the procedure generates a sequence of prima-dual pairs {(w(t), α(t))}t 1. At the t-th iteration, the dual update step S1 conducts the projected super-gradient ascent in (9) to update α(t) from α(t 1) and w(t 1). Then in the primal update step S2, the primal vector w(t) is constructed from α(t) based on a k-sparse primal-dual connection operator (10). Algorithm 1: Dual Iterative Hard Thresholding (DIHT) Input : Training set {xi, yi}N i=1. Regularization strength λ. Sparsity level k. Initialization w(0) = 0, α(0) 1 = ... = α(0) N = 0. for t = 1, 2, ..., T do /* Dual projected super-gradient ascent */ (S1) For all i {1, 2, ..., N}, update the dual variables α(t) i as α(t) i = PFi α(t 1) i + η(t 1)g(t 1) i , (9) where g(t 1) i = 1 N (x i w(t 1) l i (α(t 1) i )) is the super-gradient and PFi( ) is the Euclidian projection operator with respect to feasible set Fi. /* Primal hard thresholding */ (S2) Update the primal vector w(t) as: i=1 α(t) i xi end Output: w(T). In the following we analyze the non-asymptotic convergence behavior of DIHT. We denote w = arg min w 0 k P(w) and use the abbreviation ϵ(t) PD := ϵPD(w(t), α(t)). In order Dual Iterative Hard Thresholding to avoid technical complications, we will limit optimization to bounded dual feasible sets Fi and derivatives l i , i.e., we will let r = maxi,a Fi |a| and ρ = maxi,a Fi |l i (a)|. For example, such quantities exist when li and l i are Lipschitz continuous (Shalev-Shwartz and Zhang, 2013b). We assume without loss of generality that xi 1. In the following theorem, we show that DIHT converges sub-linearly in dual parameter estimation error and primal-dual gap, and exact sparsity recovery can be guaranteed after sufficient iteration. Theorem 15 Assume the primal loss functions li( ) are 1/µ-smooth and ϵ := wmin 1 λ P ( w) > 0. Set the step-size as η(t) = N µ(t+2). (a) Parameter estimation error and primal-dual gap. Let α = [l 1( w x1), ..., l N( w x N)]. Then the sequence {α(t)}t 1 generated by Algorithm 1 satisfies !2 1 t + 2. Moreover the primal-dual gap is bounded as ϵ(t) PD (r X + λ + 1 1 t + 2. (b) Sparsity recovery. The exact support recovery supp(w(t)) = supp( w) holds if & 4 X 2(r X + λ Remark 16 To gain some intuition of the bounds in the theorem, if conventionally choosing the regularization parameter λ 1 N α(t) α = O 1 , ϵ(t) PD = O 1 supp(w(t)) = supp( w) is guaranteed after O 1 ϵ2 steps of iteration. Theorem 15 also suggests a computationally tractable termination criterion for DIHT: the algorithm can be stopped when the primal-dual gap becomes sufficiently small and supp(w(t)) becomes stable. Next, we consider the case when li are simultaneously smooth and strongly convex, which can lead to improved linear rate of convergence as stated in the following theorem. Theorem 17 Suppose that the loss functions li( ) are 1/µ-smooth and 1/ℓ-strongly-convex. Assume that ϵ := wmin 1 λ P ( w) > 0. Let ℓD = N . Set the step-size as η(t) = µD (a) Parameter estimation error and primal-dual gap. Let α = [l 1( w x1), ..., l N( w x N)]. Then the sequence {α(t)}t 1 generated by Algorithm 1 satisfies α(t) α 2 α 2 1 µ2 D ℓ2 D Moreover, the primal-dual gap ϵ(t) PD is upper bounded as ϵ(t) PD ℓD α 2 X + 1 1 µ2 D ℓ2 D Yuan, Liu, Wang, Liu, and Metaxas (b) Sparsity recovery. The exact support recovery, i.e., supp(w(t)) = supp( w), occurs after t ℓ2 D µ2 D log 4 α 2 X 2 rounds of iteration. Remark 18 Consider setting the regularization parameter as λ 1 N . Then the contrac- tion factor µ2 D ℓ2 D is of the order O µ2 ( X +ℓ)2 , and supp(w(t)) = supp( w) can be guaranteed after O ( X +ℓ)2 ϵ steps of iteration using step-size η(t) = O Nµ ( X +ℓ)2 . For the special case of linear regression with li(u) = 1 2(u yi)2 and l i (αi) = α2 i 2 + yiαi, we have µ = ℓ= 1, hence the contraction factor is of the order O 1 X and support recovery can be guaranteed after t = O X log 1 ϵ rounds of iteration with η(t) = O N X . Regarding the primal sub-optimality ϵ(t) P := P(w(t)) P( w), since ϵ(t) P ϵ(t) PD always holds, the convergence rates in Theorem 15 and Theorem 17 are directly applicable to the primal sub-optimality. We comment that under the sparse strong duality conditions, our convergence results on ϵ(t) P are not explicitly relying on the Restricted Isometry Property (RIP) (or restricted strong condition number condition) which is required in most existing analysis of primal IHT-style algorithms (Blumensath and Davies, 2009; Bahmani et al., 2013; Yuan et al., 2014). It was shown by Jain et al. (2014) that with proper relaxation of sparsity level, the estimation consistency of IHT-style algorithms can be guaranteed without imposing RIP-type conditions. In contrast to these prior analysis, our RIP-free results in Theorem 15 and Theorem 17 do not require the sparsity level k to be relaxed. 4.2 Stochastic DIHT When a batch estimation of super-gradient D (α) becomes expensive in large-scale applications, it is optional to consider the stochastic implementation of DIHT, namely SDIHT, as outlined in Algorithm 2. Different from the batch computation in Algorithm 1, the dual update step S1 in Algorithm 2 randomly selects a block of samples (from a given block partition of samples) and update their corresponding dual variables according to (11). Then in the primal update step S2.1, we incrementally update an intermediate accumulation vector w(t) which records 1 λN PN i=1 α(t) i xi as a weighted sum of samples. In S2.2, the primal vector w(t) is updated by applying k-sparse truncation on w(t). The SDIHT is essentially a block-coordinate super-gradient method for the dual problem. Particularly, in the extreme case where m = 1, SDIHT reduces to the batch DIHT. At the opposite extreme end where m = N, i.e., each block contains one sample, SDIHT becomes a stochastic coordinate-wise super-gradient method. The dual update (11) in SDIHT is computationally more efficient than full DIHT as the former only needs to access a small subset of samples at a time. If the complexity of hard thresholding operation in primal update is not negligible as in high-dimensional settings, we suggest to use SDIHT with relatively smaller number of blocks so that the hard thresholding operation in S2.2 can be less frequently called. Dual Iterative Hard Thresholding Algorithm 2: Stochastic Dual Iterative Hard Thresholding (SDIHT) Input : Training set {xi, yi}N i=1. Regularization strength λ. Sparsity level k. A block disjoint partition {B1, ..., Bm} of the sample index set [N]. Initialization w(0) = w(0) = 0, α(0) B1 = ... = α(0) Bm = 0. for t = 1, 2, ..., T do /* Stochastic blockwise dual projected super-gradient ascent */ (S1) Uniformly randomly select a block index i(t) [m]. For all j Bi(t) update α(t) j as α(t) j = PFj α(t 1) j + η(t 1)g(t 1) j , (11) and set α(t) j = α(t 1) j , j / Bi(t). /* Primal hard thresholding */ (S2) Update the primal vector w(t) via the following operations: (S2.1) Update w(t) according to w(t) = w(t 1) 1 λN j Bi(t) (α(t) j α(t 1) j )xj. (12) (S2.2) Compute w(t) = Hk( w(t)). end Output: w(T). When the primal losses are Lipschitz continuous, we can similarly establish sub-linear convergence rate bounds for SDIHT, as summarized in the following theorem. Theorem 19 Assume that the primal loss functions li( ) are 1/µ-smooth and ϵ := wmin 1 λ P ( w) > 0. Set the step-size as η(t) = m N µ(t+2). (a) Parameter estimation error and primal-dual gap. Let α = [l 1( w x1), ..., l N( w x N)]. The sequence {α(t)}t 1 generated by Algorithm 2 satisfies E[ α(t) α 2] !2 m t + 2. Moreover, the primal-dual gap is upper bounded in expectation by E[ϵ(t) PD] (r X + λ + 1 m t + 2. (b) Support recovery. For any δ (0, 1), it holds with probability at least 1 δ that supp(w(t)) = supp( w) occurs after & 4m X 2(r X + λ δ2λ4µ2N2 ϵ2 rounds of iteration. Yuan, Liu, Wang, Liu, and Metaxas Remark 20 Theorem 19 indicates that up to scaling factors, the expected iteration complexity of SDIHT is identical to that of DIHT. The additional scaling factors m or m appeared in the bounds essentially reflect a trade-offbetween the decreased per-iteration computational cost and the increased iteration complexity. Remark 21 The part(b) of Theorem 19 states that supp(w(t)) = supp( w) occurs with high probability when t is sufficiently large. When this event occurs, SDITH (with m = N) reduces to a restricted version of SDCA (Shalev-Shwartz and Zhang, 2013b) over supp( w), and thus we are able to obtain improved primal-dual gap convergence rate by straightforwardly applying the analysis of SDCA over supp( w). However, we do not pursue further in that direction as the final stage convergence behavior of SDIHT after exact support recovery is not of primal interest of this work. When the primal loss functions are smooth and strongly convex, we can further generalize the linear convergence rates in Theorem 17 from batch DIHT to SDIHT, as formally summarized in the following theorem. Theorem 22 Assume that the loss functions li( ) are 1/µ-smooth and 1/ℓ-strongly-convex. Assume that ϵ := wmin 1 λ P ( w) > 0. Let ℓD = N . Set the step-size as η(t) = µD (a) Parameter estimation error and primal-dual gap. Let α = [l 1( w x1), ..., l N( w x N)]. The sequence {α(t)}t 1 generated by Algorithm 2 satisfies E[ α(t) α 2] α 2 1 µ2 D mℓ2 D Moreover, the primal-dual gap ϵ(t) PD is bounded in expectation as E[ϵ(t) PD] ℓD α 2 X + 1 1 µ2 D mℓ2 D (b) Support recovery. For any δ (0, 1), it holds with probability at least 1 δ that supp(w(t)) = supp( w) occurs after t mℓ2 D µ2 D log 4 α 2 X 2 rounds of iteration. Remark 23 Like in Theorem 19, the scaling factor m appeared in the contraction factors represents a trade-offbetween the reduced per-iteration computational cost and the increased iteration complexity. Dual Iterative Hard Thresholding 4.3 Comparison against Primal IHT Methods We now compare DIHT and SDIHT in theory with several representative primal IHTstyle algorithms for sparse estimation. Here, we use primal ϵ-sub-optimality as the metric of performance. Specifically, we compare the considered algorithms in terms of RIP-type condition, sparsity level relaxation condition, and incremental first-order oracle (IFO)2 complexity for achieving the primal sub-optimality P( w) P( w) ϵ, where w is the k-sparse estimator and w is the target k-sparse primal minimizer with k k. Table 2 summarizes the comparison results in the setting where the univariate loss functions li( ) are 1/µ-smooth and 1/ℓ-strongly-convex. In the following elaboration, we highlight the key observations that can be made from these results. DIHT versus primal full gradient IHT methods. Since DIHT is a full gradient hardthresholding method, we compare it with several popularly studied full gradient primal IHT algorithms including Gra SP (Bahmani et al., 2013), IHT (Jain et al., 2014) and Gra HTP (Yuan et al., 2018). The comparison results are summarized on the top panel of Table 2, from which we can observe that: 1) Different from the considered full gradient IHT algorithms which are either relying on RIP-type bounding conditions on κs or requiring relaxation k = Ω(κ2 s k) on sparsity level, DIHT is free of explicitly assuming RIP-type conditions and sparsity level relaxation; 2) In terms of IFO complexity, based on Theorem 17 we can verify that DIHT needs O Nℓ2 λ2 1 + 1 λN 2 log 1 ϵ IFO queries to achieve primal ϵ-sub-optimality. This should be superior to the considered primal algorithms when λ 1 N and κs ℓ2 µ2 which is expected to be the case in ill-conditioned problems where κs scales up quickly with data size and sparsity level while ℓ, µ typically do not. SDIHT versus primal stochastic gradient IHT methods. In order to improve computational efficiency, stochastic gradient IHT algorithms have recently been developed via leveraging the finite-sum structure of statistical learning problems. We further compare SDIHT against several state-of-the-art stochastic gradient IHT algorithms including Sto IHT (Nguyen et al., 2017), SVR-GHT (Li et al., 2016) and HSG-HT (Zhou et al., 2018). At each iteration, Sto IHT only evaluates gradient of one (or a mini-batch) randomly selected sample for variable update and hard thresholding. Although efficient in iteration, Sto IHT can only be shown to converge to a sub-optimal statistical estimation accuracy which is inferior to that of the full-gradient methods. Another limitation of Sto IHT is that it requires the restricted condition number κs to be not larger than 4/3 which is hard to meet in realistic high-dimensional sparse estimation problems such as sparse linear regression (Jain et al., 2014). The SVR-GHT algorithm (Li et al., 2016) was developed as an adaptation of SVRG (Johnson and Zhang, 2013) to the iterative hard thresholding methods. Benefiting from the variance reduced technique, SVR-GHT can converge more stably and efficiently, while allowing arbitrary bounded restricted condition number at the cost of sparsity level relaxation. Lately, Zhou et al. (2018) proposed HSG-HT as a stochastic-deterministic hybrid stochastic gradient hard thresholding method that can be provably shown to have samplesize-independent IFO complexity. From the bottom panel of Table 2 we can see that in 2. For the primal objective P(w) in problem (1), an IFO takes a data point (xi, yi) and returns the pair n li(w xi) + λ w 2 2 , l i(w xi)xi + λw o . For the dual objective D(α) in problem (8), an IFO takes a point (xi, yi) and returns the pair n l i (αi) + λ w(α) 2 2 , l i (αi) x i w(α) o . Yuan, Liu, Wang, Liu, and Metaxas Method RIP-Free Sparsity Gra SP k O Nκs log 1 IHT Ω κ2 s k O Nκs log 1 Gra HTP Ω κ2 s k O Nκs log 1 (this work) λ + 1 λ2N 2 log 1 SVR-GHT Ω κ2 s k O (N + κs) log 1 HSG-HT Ω κ2 s k (this work) λ + 1 λ2N 2 log 1 Table 2: Comparison of primal and dual IHT-style methods for achieving primal ϵ-suboptimality in full gradient (top panel) and stochastic gradient (bottom panel) optimization settings. We denote κs as the restricted condition number of P(w) with s = O(k + k). The loss functions li( ) are assumed to be 1/µ-smooth and 1/ℓ-strongly-convex. The mark indicates that the related result is unknown in the corresponding reference. comparison to the considered primal stochastic gradient IHT methods, SDIHT is the only one that is free of assuming RIP-type conditions while allowing the sparsity level to be unrelaxed for sparse estimation. From Theorem 22 we know that in expectation, SDIHT needs O Nℓ2 λ2 1 + 1 λN 2 log 1 ϵ IFO queries to achieve primal ϵ-sub-optimality, which is comparable to SVR-GHT when λ 1 N and superior to HSG-HT when the sample size N is dominated by κs ϵ . To summarize the above comparison, when strong sparse duality holds, DIHT and SDIHT have provable guarantees on convergence without assuming RIP-type conditions and sparsity level relaxation conditions. The IFO complexity bounds of DIHT and SDIHT are superior or comparable to the best known results for primal IHT-style algorithms. As always, there is no free lunch here: DIHT and SDIHT are customized for solving the ℓ2-norm regularized sparse learning problems while the primal IHT-style algorithms can be applied to more general sparse learning problems without needing to impose ℓ2-norm penalty on the empirical risk. 5. Experiments In this section we present numerical study for theory verification and algorithm evaluation. In the theory verification part, we conduct simulations on sparse linear regression problems to verify the strong/approximate sparse duality theorems established in Section 3. Then in the algorithm evaluation part, we run experiments on synthetic and real data sets to Dual Iterative Hard Thresholding 0 5 10 15 20 10-6 (a) d = 30, k = 5 0 5 10 15 20 10-6 (b) d = 500, k = 50 Figure 1: Verification of strong sparse duality theory on linear regression problem: optimal primal-dual gap evolving curves as functions of regularization strength λ under different values of sample size N. For the sake of semi-log curve plotting, we set the primal-dual gap as 10 6 when the gap is exactly zero. evaluate the numerical performance of DIHT and SDIHT when applied to sparse linear regression and hinge loss minimization tasks. 5.1 Theory Verification For theory verification, we consider the sparse ridge regression model with quadratic loss function l(yi, w xi) = 1 2(yi w xi)2. The feature points {xi}N i=1 are sampled from standard multivariate normal distribution. The responses {yi}N i=1 are generated according to a linear model yi = w xi + εi with a k-sparse parameter w Rd and random Gaussian noise εi N(0, σ2). For this simulation study, we test with two baseline dimensionality-sparsity configurations (d, k) {(30, 5), (500, 50)}. For each configuration, we fix the parameter vector w and study the effect of varying sample size N, regularization strength λ, and noise level σ on the optimal primal-dual gap between primal minimum and dual maximum. 5.1.1 Verification of Strong Sparse Duality Theory The strong sparse duality theory relies on the sparsity constraint qualification condition (c) in Theorem 2, which essentially requires wmin 1 λ P ( w) . In this group of simulation study, keeping all other quantities fixed, we test how the optimal primal-dual gap evolves under varying sample size N and regularization strength λ. To compute the optimal primaldual gap, we need to find ways to estimate the primal and dual optimal values. For the configuration (d, k) = (30, 5), the primal minimizer can be exactly determined via bruteforce search among the optimal values over all the feasible index sets of cardinality k, and the dual maximizer is estimated via running the proposed DIHT algorithm until convergence. For (d, k) = (500, 50), it becomes computationally prohibitive to compute the exact primal minimum. In this case, we just run DIHT on the dual problem until convergence and com- Yuan, Liu, Wang, Liu, and Metaxas 0 10 20 30 40 50 (a) d = 30, k = 5 0 10 20 30 40 50 (b) d = 500, k = 50 Figure 2: Verification of approximate sparse duality theory on linear regression problem: optimal primal-dual gap evolving curves as functions of noise level σ under different regularization strength λ. Here we fix N = d. pute the suboptimal primal-dual gap at the estimated dual maximizer. Figure 1 shows the (sub)optimal primal-dual gap evolving curves as functions of λ N {10 2, 10 1, 1, 10, 20} under different values of sample size with N/d {0.2, 0.5, 1}. From this group of curves we can make the following observations: For each curve with fixed N, the optimal primal-dual gap decreases as λ increases and the gap reaches zero when λ N is sufficiently large. This is as expected because the larger λ is, the easier the condition wmin 1 λ P ( w) can be fulfilled so as to guarantee strong sparse duality. The primal-dual gap evolving curves are relatively insensitive to sample size N. This observation combined with the previous one indicates that sample size tends to have limited impact on the validness of the sparsity constraint qualification condition wmin 1 5.1.2 Verification of Approximate Sparse Duality Theory We further verify the approximate sparse strong duality theory stated in Theorem 13, which basically suggests that when wmin is sufficiently large, by setting the regularization parameter λ = O σ p log(d)/N , the primal-dual gap can be upper bounded with high probability as ϵPD = O σ p k log(d)/N . To confirm this result, fixing the sample size N = d, we studied how the optimal primal-dual gap evolves under varying noise level σ [10 3, 50] and λ = λ0/ N with λ0 {10 2, 10 1, 1}. Figure 2 shows the optimal primal-dual gap evolving curves as functions of noise level σ under a variety of regularization strength λ. These results lead to the following observations: Dual Iterative Hard Thresholding For each curve with fixed λ, the optimal primal-dual gap increases as σ increases. This confirms the implication of Theorem 13 in linear regression models that the optimal prima-dual gap of sparse linear regression model is controlled by the quantity σ p k log d/N σ. For a fixed σ, it can be observed that the optimal primal-dual gap approaches zero as λ increases. This again matches the prediction of Theorem 13 that the prima-dual gap bound is scaled inversely in λ. 5.2 Algorithm Evaluation We now turn to evaluate the effectiveness and efficiency of DIHT and SDIHT for dual sparse optimization. We begin with a simulation study to confirm some theoretical properties of DIHT. Then we conduct a set of real-data experiments to demonstrate the computational efficiency of DIHT/SDIHT when applied to sparse hinge loss minimization problems. 5.2.1 Simulation Study The basic setting of this simulation study is identical to the one as described in the theory verification part. As we pointed out at the end of Section 4.1, an interesting theoretical property of DIHT is that its convergence is not relying on the RIP-type conditions which in contrast are usually required by primal IHT-style algorithms. To confirm this point, for each configuration (d, k), we studied the effect of varying regularization strength λ and condition number of design matrix on the optimal primal-dual gap achieved by DIHT, and make a comparison to some baseline primal IHT-style methods as well. Convergence of DIHT under varying condition number. In this simulation, when λ is fixed and given a desirable condition number κ > 1, we generate feature points {xi}N i=1 from multivariate Gaussian distribution N(0, Σ) of which the covariance matrix is carefully designed3 such that the condition number of Σ + λI equals to κ. In this way of data generation, the condition number of the primal Hessian matrix 1 N XX + λI is close to κ. Keeping all other quantities fixed, we test how the optimal primal-dual gap output by DIHT evolves under varying κ [1, 200] and regularization strength λ = λ0/ N for λ0 {0.1, 1, 10}. Figure 3 shows the corresponding optimal primal-dual gap evolving curves. From these curves we can observe that the optimal primal-dual gap curves are not sensitive to κ in most cases, especially in badly conditioned cases when κ 50. This numerical observation confirms our theoretical claim that the convergence behavior of DIHT is not relying on the condition number of problem. DIHT versus primal methods on ill-conditioned problems. We further run experiments to compare DIHT against primal IHT and HTP methods (Yuan et al., 2018; Jain et al., 2014) in high condition number setting. For this simulation study, we test with the dimensionalitysparsity configuration (d, k) = (500, 50). To make the problem badly conditioned, we follow a protocol introduced by Jain et al. (2014) to select k/2 random coordinates from the support of nominal parameter vector w and k/2 random coordinates outside its support and 3. We first generate a semi-positive definite matrix Σ 0 such that λmin(Σ ) = 0 and λmax(Σ ) = (2κ 1)λ, and then we set Σ = Σ + σI with σ = λκ κ 1. It is readily verifiable that the condition number of Σ + λI is given by λmax(Σ)+λ λmin(Σ)+λ = λmax(Σ )+σ+λ λmin(Σ )+σ+λ = 2κλ+κλ/(κ 1) κλ/(κ 1)+λ = κ. Yuan, Liu, Wang, Liu, and Metaxas 0 50 100 150 200 (a) d = 30, k = 5 0 50 100 150 200 (b) d = 500, k = 50 Figure 3: Convergence of DITH on linear regression problem under varying condition number: optimal primal-dual gap evolving curves as functions of condition number κ of the Hessian matrix 1 N XX + λI, under different regularization strength λ. constructed a covariance matrix with heavy correlations between these chosen coordinates. The condition number of the resulting matrix is around 50. Keeping all the other quantities fixed, we test how the primal objective value P(w) and ℓ2-norm parameter estimation error w w evolve under varying sample size N d and regularization strength λ = λ0/ N for λ0 {1, 10}. The resulting curves are plot in Figure 4. It can be seen from these curves that in most cases DIHT is able to achieve more optimal primal objective values and smaller parameter estimation errors than IHT and HTP in the considered ill-conditioned problems. We attribute such a numerical benefit of DIHT to its invariance to the condition number of problem. 5.2.2 Real-data Experiment: Computational Efficiency Evaluation For real data experiment, we mainly evaluate the computational efficiency of the proposed dual algorithms. We test with varying smoothed or non-smooth hinge loss functions which are commonly used by support vector machines. Two binary benchmark data sets from Lib SVM data repository, RCV1 (d = 47, 236) (Lewis et al., 2004) and News20 (d = 1, 355, 191) (Lang, 1995),4 are used for algorithm efficiency evaluation and comparison. For the RCV1 data set, we select N = 500, 000 (N d) samples for model training and the rest 197, 641 samples for testing. For the News20 data set, we use N = 15, 000 (d N) samples for training and the left 4, 996 samples are used as test data. 4. These data sets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ binary.html. Dual Iterative Hard Thresholding 100 200 300 400 500 100 100 200 300 400 500 0 100 200 300 400 500 700 100 200 300 400 500 20 (b) λ = 10/ Figure 4: DIHT versus primal IHT-style methods on badly conditioned linear regression problem: primal objective value (left panel) and parameter estimation error (right panel) evolving curves as functions of sample size N under regularization strength (a) λ = 1/ N and (b) λ = 10/ N, respectively. Experiment with smoothed hinge loss. We first consider the sparse learning model (1) with the following smoothed hinge loss function l(w xi, yi) = 0 yiw xi 1 1 yiw xi γ 2 yiw xi < 1 γ 1 2γ (1 yiw xi)2 otherwise . Its convex conjugate is given by l (αi) = yiαi + γ 2α2 i if yiαi [ 1, 0] + otherwise . We set γ = 0.25 throughout our experiment. The computational efficiency of DIHT and SDIHT is evaluated by comparing their wall-clock running time against three primal baseline Yuan, Liu, Wang, Liu, and Metaxas Running time (Sec.) 0 10 20 30 40 Primal loss IHT HTP SVR-GHT DIHT SDIHT (a) RCV1, k = 500, λ = 1/ Running time (Sec.) 0 10 20 30 40 Primal loss IHT HTP SVR-GHT DIHT SDIHT (b) RCV1, k = 103, λ = 1/ Running time (Sec.) 0 2 4 6 8 10 Primal loss IHT HTP SVR-GHT DIHT SDIHT (c) News20, k = 104, λ = 1/ Running time (Sec.) 0 2 4 6 8 10 Primal loss IHT HTP SVR-GHT DIHT SDIHT (d) News20, k = 5 104, λ = 1/ Figure 5: Real-data experiment with smooth hinge loss: Primal loss evolving curves as functions of running time (in second). algorithms: IHT, HTP, and SVR-GHT (Li et al., 2016) which is a stochastic variance reduced variant of IHT. The learning rates of all the considered algorithms are tuned via grid search. For the two stochastic algorithms SDIHT and SVR-GHT, the training data is uniformly randomly divided into mini-batches with batch size 10 . Figure 5 shows the primal loss evolving curves with respect to wall-clock running time under λ = 1/ N and varying sparsity level k. It can be seen from these results that under all the considered configurations of λ and k, DIHT and SDIHT outperform the considered primal IHT algorithms in minimizing the primal objective value. In the meanwhile, it can be seen that SDIHT is more efficient than DIHT which matches the consensus that stochastic dual coordinate methods often outperform their batch counterparts (Hsieh et al., 2008; Shalev-Shwartz and Zhang, 2013b). We further compare the computational efficiency of the considered methods in terms of the training time needed to reach comparable test accuracy. We set the desirable test error as 0.08 for RCV1 and 0.24 for News20. Figure 6 shows the time cost comparison Dual Iterative Hard Thresholding 0.4 0.8 1.2 1.6 2 0 (a) RCV1, k = 500 0.4 0.8 1.2 1.6 2 0 (b) RCV1, k = 103 0.4 0.8 1.2 1.6 2 0 (c) News20, k = 104 0.4 0.8 1.2 1.6 2 0 (d) News20, k = 5 104 Figure 6: Real-data experiment with smoothed hinge loss: Running time (in second) comparison of the considered algorithms to reach comparable test accuracy. under varying regularization parameter λ = λ0/ N and sparsity level k. From this group of curves we can observe that DIHT and SDIHT are significantly more efficient than the considered primal IHT algorithms to reach comparable generalization performance on the test set. Also, we can see that SDIHT is consistently more efficient than DIHT. Moreover, to evaluate the primal-dual convergence behavior of DIHT and SDIHT, we plot in Figure 7 their primal-dual gap evolving curves with respect to the number of epochs processing, under sparsity level k = 103 for RCV1 and k = 5 104 for News20. The regularization parameters are set to be λ = λ0/ N, λ0 = {0.4, 1.2, 2}, respectively. The results again showcase the superior efficiency of SDIHT over DIHT as the former uses much fewer epoches of processing to reach comparable primal-dual gaps to the latter. Yuan, Liu, Wang, Liu, and Metaxas 10 20 30 40 50 10 3 Number of epochs Primal-dual gap (a) DIHT on RCV1 2 4 6 8 10 10 4 Number of epochs Primal-dual gap (b) SDIHT on RCV1 20 40 60 80 100 10 4 Number of epochs Primal-dual gap (c) DIHT on News20 0 2 4 6 8 10 10 6 Number of epochs Primal-dual gap (d) SDIHT on News20 Figure 7: Real-data experiment with smoothed hinge loss: The primal-dual gap evolving curves of DIHT and SDIHT. We test with the sparsity level k = 103 for RCV1 and k = 5 104 for News20. Experiment with non-smooth hinge loss. Finally, we test the efficiency of the proposed algorithms when applied to the support vector machines with vanilla hinge loss function l(w xi, yi) = max(0, 1 yiw xi). It is standard to know that l (αi) = yiαi if yiαi [ 1, 0] + otherwise . We follow the same experiment protocol as in the previous smoothed case to compare the considered primal and dual IHT algorithms on the two benchmark data sets. In this nonsmooth case, we set the step-size in DIHT and SDIHT to be η(t) = c t+2, where c is a constant determined by grid search for optimal efficiency. In Figure 8, we plot the primal loss evolving curves with respect to running time under λ = 1/ N. The computational time curves of the considered algorithms to reach comparable test errors (0.074 for RCV1 and 0.23 for News20) are shown in Figure 9. These two groups of results demonstrate the remarkable efficiency advantage of DIHT and SDIHT over the considered primal IHT algorithms even Dual Iterative Hard Thresholding Running time (Sec.) 0 5 10 15 20 Primal loss IHT HTP SVR-GHT DIHT SDIHT (a) RCV1, k = 103, λ = 1/ Running time (Sec.) 0 2 4 6 8 10 Primal loss IHT HTP SVR-GHT DIHT SDIHT (b) News20, k = 5 104, λ = 1/ Figure 8: Real-data experiment with non-smooth hinge loss: Primal loss evolving curves as functions of running time (in second). 0.4 0.8 1.2 1.6 2 0 (a) RCV1, k = 103 0.4 0.8 1.2 1.6 2 0 (b) News20, k = 5 104 Figure 9: Real-data experiment with non-smooth hinge loss: Running time (in second) comparison of the considered algorithms to reach comparable test accuracy. when the loss function is non-smooth. The prima-dual gap evolving curves of DIHT and SDIHT under a variety of λ = λ0/ N are illustrated in Figure 10, from which we can observe that when using non-smooth hinge loss function, SDIHT is still more efficient than DIHT in closing the primal-dual gap. Yuan, Liu, Wang, Liu, and Metaxas 10 20 30 40 50 Number of epochs Primal-dual gap (a) DIHT on RCV1 0 2 4 6 8 10 Number of epochs Primal-dual gap (b) SDIHT on RCV1 10 20 30 40 50 10 5 10 5 Number of epochs Primal-dual gap (c) DIHT on News20 2 4 6 8 10 10 4 Number of epochs Primal-dual gap (d) SDIHT on News20 Figure 10: Real-data experiment with non-smooth hinge loss: The primal-dual gap evolving curves of DIHT and SDIHT. We test with the sparsity level k = 103 for RCV1 and k = 5 104 for News20. 6. Conclusion and Future Work In this article, we investigated duality theory and optimization algorithms for solving the sparsity-constrained empirical risk minimization problem which has been widely applied in sparse learning. As a core theoretical contribution, we established a sparse Lagrangian duality theory which guarantees strong duality in sparse settings under certain sufficient and necessary conditions. For the scenarios where sparse strong duality would be violated, we further developed an approximate sparse duality theory that upper bounds the primadual gap at the level of statistical estimation error of model. Our theory opens the gate to solve the original NP-hard and non-convex problem equivalently in a dual formulation. We then propose DIHT as a first-order method to maximize the non-smooth dual concave formulation. The algorithm is characterized by dual super-gradient ascent and primal hard thresholding. To further improve iteration efficiency in large-scale settings, we propose SDIHT as a block-coordinate stochastic variant of DIHT. For both algorithms we have proved sub-linear primal-dual gap convergence rates when the loss is smooth, and improved Dual Iterative Hard Thresholding linear rates of convergence when the loss is also strongly convex. Based on our theoretical findings and numerical results, we conclude that DIHT and SDIHT are theoretically sound and computationally attractive alternatives to the conventional primal IHT algorithms, especially when the sample size is smaller than feature dimensionality. Our work leaves several open issues for future exploration. First, it remains an open question on how to verify the key condition (c) in Theorem 2 for generic sparse learning models. It will be interesting to provide some more intuitive ways to understand this condition in popular statistical learning models such as linear regression and logistic regression. Second, our approximate duality theory (Theorem 13) only gives a duality gap bound between the (unknown) primal minimizer w and the dual maximizer α. From the perspective of primal solution quality certification, it would be more informative to have results on the duality gap between α and the primal vector w( α) produced from α. Or third, our convergence results in Theorem 19 and 22 merely indicate that SDIHT is not worse than DIHT in convergence rate, but without showing that its dependence of scaling factors on sample size N and regularization strength λ can be significantly improved as what has been achieved by SDCA for unconstrained regularized learning (Shalev-Shwartz and Zhang, 2013b). In our opinion, a main challenge here we are facing with is the non-smoothness of the dual objective D(α), which prevents us from directly extending the analysis of SDCA to SDIHT. We need to develop new proof approaches to justify why SDIHT often outperforms DIHT in practice. Finally, it would be an interesting future work to apply our duality theory and algorithms to communication-efficient distributed sparse learning problems which have recently gained considerable attention in distributed machine learning (Jaggi et al., 2014; Wang et al., 2017; Liu et al., 2019). Acknowledgements The authors sincerely thank the anonymous reviewers for their constructive comments on this work. Xiao-Tong Yuan is supported in part by National Major Project of China for New Generation of AI under Grant No.2018AAA0100400 and in part by Natural Science Foundation of China (NSFC) under Grant No.61876090 and No.61936005. Qingshan Liu is supported by NSFC under Grant No.61532009 and No.61825601. Yuan, Liu, Wang, Liu, and Metaxas Appendix A. Proofs of Results in Section 3 In this section, we present the proofs of the main results stated in Section 3. A.1 Proof of Theorem 2 Proof The direction: If the pair ( w, α) is a sparse saddle point for L, then from the definition of conjugate convexity and inequality (3) we have P( w) = max α F L( w, α) L( w, α) min w 0 k L(w, α). On the other hand, we know that for any w 0 k and α F L(w, α) max α F L(w, α ) = P(w). combining the preceding two inequalities yields P( w) min w 0 k L(w, α) min w 0 k P(w) P( w). Therefore P( w) = min w 0 k P(w), i.e., w solves the problem in (1), which proves the necessary condition (a). Moreover, the above arguments lead to P( w) = max α F L( w, α) = L( w, α). Then from the maximizing argument property of convex conjugate we know that αi li( w xi). Thus the necessary condition (b) holds. Note that L(w, α) = λ i=1 l i ( αi) 1 2λN2 Let F = supp( w). Since the above analysis implies L( w, α) = min w 0 k L(w, α), it must hold that This validates the necessary condition (c). The direction: Conversely, let us assume that w is a k-sparse solution to the problem (1) (i.e., conditio(a)) and let αi li( w xi) (i.e., condition (b)). Again from the maximizing argument property of convex conjugate we know that li( w xi) = αi w xi l i ( αi). This leads to the following: L( w, α) P( w) = max α F L( w, α) = L( w, α). (14) The sufficient condition (c) guarantees that F contains the top k (in absolute value) entries of 1 λN PN i=1 αixi. Then based on the expression in (13) we can see that the following holds for any k-sparse vector w L( w, α) L(w, α). (15) Dual Iterative Hard Thresholding By combining the inequalities (14) and (15) we obtain that for any w 0 k and α F, L( w, α) L( w, α) L(w, α). This shows that ( w, α) is a sparse saddle point of the Lagrangian L. A.2 Proof of Theorem 5 Proof The direction: Let ( w, α) be a saddle point for L. On one hand, note that the following holds for any k-sparse w and α F min w 0 k L(w, α ) L(w , α ) max α F L(w , α), which implies max α F min w 0 k L(w, α) min w 0 k max α F L(w, α). (16) On the other hand, since ( w, α) is a saddle point for L, the following is true: min w 0 k max α F L(w, α) max α F L( w, α) L( w, α) min w 0 k L(w, α) max α F min w 0 k L(w, α). (17) In view of (16) and (17) we have that the equality in (5) must hold. The direction: Assume that the equality in (5) holds. Let us define w and α such that max α F L( w, α) = min w 0 k max α F L(w, α) min w 0 k L(w, α) = max α F min w 0 k L(w, α) . Then we can see that for any α F, L( w, α) min w 0 k L(w, α) = max α F L( w, α ) L( w, α), where the = is due to (5). In the meantime, for any w 0 k, L( w, α) max α F L( w, α) = min w 0 k L(w , α) L(w, α). This shows that ( w, α) is a sparse saddle point for L. A.3 Proof of Proposition 7 Proof Recall that L(w, α) = 1 αiw xi l i (αi) + λ i=1 l i (αi) 1 2λN2 Yuan, Liu, Wang, Liu, and Metaxas Then for any fixed α F, it is straightforward to verify that the k-sparse minimum of L(w, α) with respect to w is attained at the following point: w(α) = arg min w 0 k L(w, α) = Hk Thus we have D(α) = min w 0 k L(w, α) = L(w(α), α) αiw(α) xi l i (αi) + λ i=1 l i (αi) λ where ζ1 follows from the above definition of w(α). Now let us consider two arbitrary dual variables α , α F and any g(α ) 1 N [w(α ) x1 l 1(α 1), ..., w(α ) x N l N(α N)]. From the definition of D(α) and the fact that L(w, α) is concave with respect to α at any fixed w we can derive that D(α ) = L(w(α ), α ) L(w(α ), α ) L(w(α ), α ) + g(α ), α α . This implies that D(α) is a concave function and its super-differential is given by N [w(α) x1 l 1(α1), ..., w(α) x N l N(αN)]. If we further assume that w(α) is unique and {l i }i=1,...,N are differentiable at any α, then D(α) = 1 N [w(α) x1 l 1(α1), ..., w(α) x N l N(αN)] becomes unique, which implies that D(α) is the unique super-gradient of D(α). A.4 Proof of Theorem 8 Proof The direction: Given the conditions in the theorem, it can be known from Theorem 2 that the pair ( w, α) forms a sparse saddle point of L. Thus based on the definitions of sparse saddle point and dual function D(α) we can show that D( α) = min w 0 k L(w, α) L( w, α) L( w, α) D(α). This implies that α solves the dual problem in (6). Furthermore, Theorem 5 guarantees the following D( α) = max α F min w 0 k L(w, α) = min w 0 k max α F L(w, α) = P( w), which indicates that the primal and dual optimal values are equal to each other. The direction: Assume that α solves the dual problem in (6) and D( α) = P( w). Since D( α) P(w) holds for any w 0 k, w must be the sparse minimizer of P(w). It follows that max α F min w 0 k L(w, α) = D( α) = P( w) = min w 0 k max α F L(w, α). Dual Iterative Hard Thresholding From the argument for in the proof of Theorem 5 and Corollary 6 we get that the conditions (a) (c) in Theorem 2 should be satisfied for ( w, α). A.5 Proof of Theorem 9 Proof Recall the dual objective function is D(α) = 1 N PN i=1 l i (αi) λ 2 w(α) 2. Since w( α) is unique and each l i is differentiable, according to Proposition 7 it is true that the super-gradient of D(α) at α is given by D ( α) = 1 N [w( α) x1 l 1 ( α1), ..., w( α) x N l N( αN)]. Under the conditions in the theorem, we are going to show that for sufficiently small η, the following must hold: αi = PFi ( αi + η gi) , (18) where gi = 1 N (w( α) xi l i ( αi)) and PFi( ) is the Euclidian projection operator with respect to feasible set Fi. Before proving this, we need to present a few preliminaries. For any α F, let us define w(α) = 1 λN PN i=1 αixi. For a vector x Rd, denote [x](j) the j-th largest entry (in absolute value) of x, i.e., |[x](1)| |[x](2)| ... |[x](d)|. Since w( α) is unique, or equivalently, the top k entries of w( α) is unique, we must have ϵ := [ w( α)](k) [ w( α)](k+1) > 0. Let F = supp(w( α)) and define B( α) = n α RN : α α λN ϵ 2 X o . We prove the equation (18) by contradiction. Note that for any α B( α) we have w(α) w( α) = 1 λN X(α α) 1 λN X(α α) X This indicates that supp(w(α)) = F = supp(w( α)). That is, F still contains the (unique) top k entries of w(α) for all α B( α). Consider the vector β with βi = αi + η gi with a sufficiently small step-size η > 0 such that β B( α). Let α be a vector such that α i = PFi (βi). From the non-expanding property of Euclidian projection we know that and α α β α and thus α B( α). Therefore supp(w(α )) = F. Let us assume α = α. Since l i is ℓ-smooth, we have i=1 l i (α i) λ 2 w(α ) 2 = 1 i=1 l i (α i) λ l i ( αi) l i (αi)(α i αi) ℓ 2(α i αi)2 λ l i ( αi) l i (αi)(α i αi) ℓ 2(α i αi)2 λ i=1 x i w( α)(α i αi) 1 2λN2 (α α) X F X F (α α) ζ1 D( α) + D ( α), α α λNℓ+ X 2 ζ2 D( α) + 1 2η λNℓ+ X 2 Yuan, Liu, Wang, Liu, and Metaxas where in ζ1 we have used (α α) X F X F (α α) X 2 α α 2, ζ2 is due to the fact that α β 2 α β 2 which then implies α α 2 2η α α, D ( α) 0. Since we have assumed α α = 0, by choosing sufficiently small η < λN2 λNℓ+ X 2 , we can always find D(α ) > D( α), which contradicts the optimality of α. Therefore α = α, i.e., the equation (18) must hold for sufficiently small η. Next we prove that (w( α), α) forms a sparse saddle point of the Lagrangian of the form: L(w, α) = 1 αiw xi l i (αi) + λ Since (18) holds and Fi is convex, it must hold that either gi = 1 N (x i w( α) l i ( αi)) = 0 for αi lies in the interior of Fi, or αi lies on the boundary of Fi (if it is closed) and it maximizes the function 1 N αiw xi l i (αi) . In any case, we always have that αi is a maximizer of 1 N αix i w( α) l i (αi) over the feasible set Fi, which implies L(w( α), α) L(w( α), α) holds for any α F. From the definition of w( α) we know that L(w( α), α) L(w, α) is valid for all k-sparse primal vector w. Therefore (w( α), α) is a sparse saddle point, and consequently according to Theorem 8 that w( α) admits a primal k-sparse minimizer. A.6 Proof of Proposition 12 Proof The direction: If the pair ( w, α) is a ν-approximate k-sparse saddle point for L(w, α), then from Definition 11 we can derive P( w) = max α F L( w, α) min w 0 k L(w, α) + ν = D( α) + ν. The direction: Conversely, let us assume that P( w) D( α) ν. Then max α F L( w, α) = P( w) D( α) + ν = min w 0 k L(w, α) + ν, which implies that for any w 0 k and α F, L( w, α) L(w, α) + ν. Then by definition ( w, α) is a ν-approximate k-approximate saddle point of the Lagrangian L(w, α). This concludes the proof. A.7 Proof of Theorem 13 We first introduce the following key lemma which bounds the approximation level of certain approximate sparse saddle point ( w, α) with the primal vector w being optimal on its own supporting set. Lemma 24 Assume that the primal loss functions li( ) are differentiable. Let w Rd be a k-sparse primal vector and α = [l 1( w x1), ..., l N( w x N)]. Let F = supp( w). Assume that Dual Iterative Hard Thresholding H F ( P( w)) = 0. Then ( w, α) is a ν-approximate k-sparse saddle point of the Lagrangian L(w, α) with approximation level λ Hk( f( w)) 2. Proof From the definition of α we know that L( w, α) L( w, α). Recall the following formulation of L(w, α): L(w, α) = λ where the term C is not dependent on w. Since H F ( P( w)) = 0, we must have w = H F ( 1 λ f( w)) which implies L( w, α) = 1 2λ H F c( f( w)) 2+C. Let w = arg min w 0 k L(w, α) and F = supp( w). Then from the first-order optimality of w we must have L( w, α) = 1 2λ H F c( f( w)) 2 + C. Therefore L( w, α) L(w, α) L( w, α) L( w, α) H F ( f( w)) 2 + H F ( f( w)) 2 1 λ Hk( f( w)) 2. By combining the above arguments we get L( w, α) L( w, α) L(w, α)+ 1 λ Hk( f( w)) 2, which by definition indicates that ( w, α) admits a ν-approximate sparse saddle point with ν = 1 λ Hk( f( w)) 2. Next we introduce the concepts of smoothness and restricted strong convexity which are conventionally used in IHT-style methods (Jain et al., 2014; Yuan et al., 2018). Definition 25 (Restricted strong convexity and smoothness) For any integer s > 0, we say a function f(w) : Rd 7 R is restricted µs-strongly convex for some ms > 0 if f(w) f(w ) f(w ), w w µs 2 w w 2, w w 0 s. (19) Moreover, we say f(w) is ℓ-smooth for some ℓ> 0 if f(w) f(w ) f(w ), w w ℓ 2 w w 2, w, w Rd. The ratio number ℓ/µs, which measures the curvature of the loss function over sparse subspaces, will be referred to as restricted condition number. The following is a simple lemma that summarizes some standard properties of smoothness and restricted strong convexity. Lemma 26 Assume that f(w) is µs-strongly convex and ℓ-smooth. For any index set F with cardinality |F| s and any x, y with supp(w) supp(w ) F, it holds that HF ( f(w)) HF ( f(w )) µs w w , f(w) f(w ) ℓ w w . Yuan, Liu, Wang, Liu, and Metaxas Proof By adding two copies of the inequality (19) with w and w interchanged we get (w w ) ( f(w) f(w )) µs w w 2, which in turn according to Cauchy-Schwartz inequality leads to HF ( f(w)) HF ( f(w )) µs w w . The inequality f(w) f(w ) ℓ w w is standard (see, e.g., Nesterov, 2004, Theorem 2.1.5). The following is another key lemma to the proof of Theorem 13. Lemma 27 Assume that the loss functions li are µ-strongly convex and ℓ-smooth. Let w be an arbitrary k-sparse vector and F = supp( w). Let w = arg minsupp(w)= F P(w). Then Hk( f( w )) 1 + ℓγ+ k µγ k + λ Hk( f( w)) + λℓγ+ k µγ k + λ w . Proof It can be verified that f(w) is µγ k -strongly convex and thus P(w) is (µγ k + λ)- strongly convex. From Lemma 26 we know that w w 1 µγ k + λ H F ( P( w)) H F ( P( w )) ζ1= 1 µγ k + λ H F ( P( w)) 1 µγ k + λ H F ( f( w)) + λ µγ k + λ w , where ζ1 is due to the first-order optimality condition of w over F. Also, it can be verified that f(w) is ℓγ+ k -smooth. Then it follows from Lemma 26 and the above inequality that f( w) f( w ) ℓγ+ k w w ℓγ+ k µγ k + λ H F ( f( w)) + λℓγ+ k µγ k + λ w . Hk( f( w )) f( w) f( w ) + Hk( f( w)) ℓγ+ k µγ k + λ H F ( f( w)) + λℓγ+ k µγ k + λ w + Hk( f( w)) 1 + ℓγ+ k µγ k + λ Hk( f( w)) + λℓγ+ k µγ k + λ w . This proves the desired bound. We are now in the position to prove the main result in Theorem 13. Proof [of Theorem 13] Let us consider w = arg min supp(w)=supp( w) P(w), α = [l 1(x 1 w ), ..., l N(x N w )]. Dual Iterative Hard Thresholding By applying Lemma 24 we can show that ( w , α ) is a ν -approximate sparse saddle point with relaxation level λ Hk( f( w )) 2. We now bound the quantity Hk( f( w )) in ν using Hk( f( w)) . Since wmin > ℓγ+ k and λ µγ k k f( w) ℓγ+ k w k f( w) , according to Lemma 27 we can show that Hk( f( w )) 1 + ℓγ+ k µγ k + λ Hk( f( w)) + λℓγ+ k µγ k + λ w 1 + ℓγ+ k µγ k + λ k f( w) + λℓγ+ k µγ k + λ w 2 + ℓγ+ k µγ k + λ 2 + ℓγ+ k µγ k + λ 2 f( w) 2 . Finally, from Proposition 12 we obtain P( w ) D( α ) ν k 2 + ℓγ+ k µγ k + λ 2 f( w) 2 . The desired bound then follows immediately from the fact P( w) D( α) P( w ) D( α ). Appendix B. Proofs of Results in Section 4 In this section, we present the technical proofs of the main results stated in Section 4. B.1 Proof of Theorem 15 We need a series of technical lemmas to prove this theorem. The following lemma bounds the estimation error α α 2 = O( D (α) D ( α), α α ) when the primal losses {li}N i=1 are Lipschitz smooth. Lemma 28 Assume that the primal loss functions {li( )}N i=1 are 1/µ-smooth. Then the following inequality holds for any α, α F and g(α ) D(α ), g(α ) D(α ): µ g(α ) g(α ), α α . Proof Recall that D(α) = 1 N PN i=1 l i (αi) λ 2 w(α) 2. Let us consider two arbitrary dual variables α , α F. The assumption of li being 1/µ-smooth implies that its convex Yuan, Liu, Wang, Liu, and Metaxas conjugate function l i is µ-strongly-convex. Let F = supp(w(α )). Then i=1 l i (α i) λ i=1 l i (α i) λ l i (α i ) l i (α i )(α i α i ) µ 2 (α i α i )2 λ l i (α i ) l i (α i )(α i α i ) µ 2 (α i α i )2 λ i=1 x i w(α )(α i α i ) 1 2λN2 (α α ) X F XF (α α ) D(α ) + g(α ), α α µ By adding two copies of the above inequality with α and α interchanged we arrive at µ N α α 2 g(α ) g(α ), α α , which leads to the desired inequality in the lemma. The following lemma gives a simple expression of the gap for properly connected primaldual pairs. Lemma 29 For any dual vector α F and the related primal vector the primal-dual gap ϵPD(w, α) can be expressed as: ϵPD(w, α) = 1 li(w xi) + l i (αi) αiw xi . Proof It can be directly verified according to the definitions of P(w) and D(α) that P(w) D(α) = 1 i=1 li(w xi) + λ αiw xi l i (αi) + λ li(w xi) + l i (αi) αiw xi , which is the desired expression. Dual Iterative Hard Thresholding Based on Lemma 29, we can further derive the following lemma which establishes a bound on the primal-dual gap. Lemma 30 Consider a primal-dual pair (w, α) satisfying w = Hk 1 λN PN i=1 αixi . Then the following inequality holds for any g(α) D(α) and β [ l1(w x1), ..., l N(w x N)]: P(w) D(α) g(α), β α . Proof For any i [1, ..., N], from the maximizing argument property of convex conjugate we have li(w xi) = w xil i(w xi) l i (l i(w xi)), and l i (αi) = αil i (αi) li(l i (αi)). By summing both sides of above two equalities we get li(w xi) + l i (αi) =w xil i(w xi) + αil i (αi) (li(l i (αi)) + l i (l i(w xi))) ζ1 w xil i(w xi) + αil i (αi) l i (αi)l i(w xi), (20) where ζ1 follows from Fenchel-Young inequality. Therefore g(α), β α = 1 i=1 (w xi l i (αi))(l i(w xi) αi) w xil i(w xi) l i (αi)l i(w xi) αiw xi + αil i (αi) i=1 (li(w xi) + l i (αi) αiw xi) ζ3= P(w) D(α), where ζ2 follows from (20) and ζ3 follows from Lemma 29. This proves the desired bound. The following lemma shows that under proper conditions, w(α) is locally smooth around w = w( α). Lemma 31 Assume that {li}i=1,...,N are differentiable and ϵ := wmin 1 λ P ( w) > 0. Let α = [l 1( w x1), ..., l N( w x N)]. (a) If α α λN ϵ 2 X , then supp(w(α)) = supp( w) and (b) If α α > λN ϵ 2 X , then Yuan, Liu, Wang, Liu, and Metaxas Proof Part(a): For any α F, let us define Consider F = supp( w). Given ϵ > 0, it is known from Theorem 8 that w = H F ( w( α)) and P ( w) λ = H F c ( w( α)). Then ϵ > 0 implies F is unique, i.e., the top k entries of w( α) is unique, and w = w( α). Given that α α λN ϵ 2 X , we can show that w(α) w( α) = 1 λN X(α α) X This indicates that F still contains the (unique) top k entries of w(α). Therefore, supp(w(α)) = F = supp( w). Consequently we have w(α) w( α) = H F ( w(α)) H F ( w( α)) w(α) w( α) = 1 λN X(α α) X This proves the desired bound in Part(a). Part(b): Next let us consider the case α α > λN ϵ 2 X . From the expression of w(α) we can verify that w(α) 1 λN Xα 1 λN X α . Then we have w(α) w( α) X λN ( α + α ) λN ( α α + 2 α ) λN ϵ α α α . This completes the proof. We are now ready to prove the main result in Theorem 15. Proof [of Theorem 15] Part(a): Let us consider g(t) D(α(t)) with g(t) i = 1 N (x i w(t) l i (α(t) i )). From the expression of w(t) we can verify w(t) 1 λN Xα(t) X α(t) Since xi 1, it can be verified that g(t) r X + λ Nρ λN . (21) Dual Iterative Hard Thresholding Let g D( α) with gi = 1 N (x i w( α) l i ( αi)). We will now claim g = 0. Indeed, Since ϵ = wmin 1 λ P ( w) > 0, from the strong sparse duality theory we can show that w = w( α). Then, according to the fact l (l (a)) = a we can derive g(t) i = 1 N (x i w l i (l i(x i w))) = 1 N (x i w x i w) = 0, and thus g = 0. Let h(t) = α(t) α and v(t) = g(t) g, α α(t) . From Lemma 28 we know that (h(t))2 Nv(t)/µ. Then (h(t))2 = PF α(t 1) + η(t 1)g(t 1) α 2 α(t 1) + η(t 1)g(t 1) α 2 =(h(t 1))2 2η(t 1) g(t 1), α α(t 1) + (η(t 1))2 g(t 1) 2 =(h(t 1))2 2η(t 1) g(t 1) g, α α(t 1) + (η(t 1))2 g(t 1) 2 (h(t 1))2 η(t 1) 2µ N (h(t 1))2 + (η(t 1))2 (r X + λ where the first inequality is permitted by the non-expansion property of convex projection operator. Let η(t) = N µ(t+2). Then we obtain (h(t))2 1 2 t + 1 (h(t 1))2 + (r X + λ λ2µ2(t + 1)2 . (22) We will now use induction over t 1 to prove our claimed bound, i.e., for all t 1, (h(t))2 c0 t + 2. where c0 = (r X +λ λ2µ2 . The base-case t = 1 follows immediately from (22). Now considering t 2, the bound in (22) reads as (h(t))2 1 2 t + 1 (h(t 1))2 + c0 (t + 1)2 c0 t + 1 + c0 (t + 1)2 = 1 1 t + 1 c0 t + 1 c0 t + 2, which is our claimed estimation error bound when t 2. To prove the convergence of primal-dual gap, we consider β(t) := [l 1(x 1 w(t)), ..., l N(x Nw(t))]. According to Lemma 30 we have ϵ(t) PD = P(w(t)) D(α(t)) g(t), β(t) α(t) g(t) ( β(t) α + α α(t) ). From the smoothness of li and Lemma 31 we get N µ w(t) w X λµ Yuan, Liu, Wang, Liu, and Metaxas where in the first we have used the assumption xi 1. By combining the above with the bound in (21) we obtain ϵ(t) PD g(t) ( β(t) α + α α(t) ) + 1 1 t + 2 This completes the proof of Part(a). Part(b): Let us consider ϵ0 = λN ϵ 2 X . From Part(a) we obtain α(t) α ϵ0 after ϵ2 0 . In this case, it is known from Lemma 31 that supp(w(t)) = supp( w). This proves the claim of Part(b). B.2 Proof of Theorem 17 Proof Part(a): Let us consider g(t) D(α(t)) with g(t) i = 1 N (x i w(t) l i (α(t) i )), and g D( α) with gi = 1 N (x i w( α) l i ( αi)). Since ϵ = wmin 1 λ P ( w) > 0, based on the proof of the part(a) of Theorem 15 we can verify that w = w( α) and g = 0. The 1/ℓ-strong-convexity of li implies the ℓ-smoothness of l i . Then we can show that x i (w(t) w( α)) l i (α(t) i ) + l i ( αi) 2 x i (w(t) w( α)) 2 + l i (α(t) i ) l i ( αi) 2 x i (w(t) w( α)) 2 + l i (α(t) i ) l i ( αi) 2 2 N w(t) w( α) + 2ℓ N α(t) α = ℓD α(t) α , where in ζ1 we have used xi 1 and |l i (α(t) i ) l i ( αi)| ℓ|α(t) i αi|, ζ2 follows from Lemma 31. Now let h(t) = α(t) α and v(t) = g(t) g, α α(t) . From Lemma 28 we Dual Iterative Hard Thresholding know that (h(t))2 Nv(t)/µ = v(t)/µD. Then (h(t))2 = PF α(t 1) + η(t 1)g(t 1) α 2 α(t 1) + η(t 1)g(t 1) α 2 = α(t 1) + η(t 1)(g(t 1) g) α 2 =(h(t 1))2 2η(t 1)v(t 1) + (η(t 1))2 g(t 1) g 2 (h(t 1))2 2η(t 1)µD(h(t 1))2 + (η(t 1))2ℓ2 D(h(t 1))2 = 1 2η(t 1)µD + (η(t 1))2ℓ2 D (h(t 1))2, where the first inequality is permitted by the non-expansion property of convex projection operator. Let η(t) µD ℓ2 D . Then we obtain (h(t))2 1 µ2 D ℓ2 D By recursively applying the above inequality we obtain that (h(t))2 1 µ2 D ℓ2 D t (h(0))2 = 1 µ2 D ℓ2 D where we have used α(0) = 0. This proves the claim in Part(a). Now let β(t) := [l 1(x 1 w(t)), ..., l N(x Nw(t))]. According to Lemma 30 we have ϵ(t) PD = P(w(t)) D(α(t)) g(t), β(t) α(t) g(t) ( β(t) α + α α(t) ) = g(t) g ( β(t) α + α α(t) ), where we have used g = 0. From the smoothness of li and Lemma 31 we obtain N µ w(t) w X λµ Since we have already shown in Part(a) that g(t) g ℓD α(t) α , the following is valid immediately: + 1 α(t) α 2, which then implies the desired bound on primal-dual gap. This concludes the proof of Part(a). Part(b): Let us consider ϵ0 = λN ϵ 2 X . Based on the fact (1 a)t exp( at) for a (0, 1) and the result in Part(a) we can show after t ℓ2 D µ2 D log 4 α 2 X 2 λ2N2 ϵ2 . In this case, it is known from Lemma 31 that supp(w(t)) = supp( w). This proves the claim of Part(b). Yuan, Liu, Wang, Liu, and Metaxas B.3 Proof of Theorem 19 Proof Part(a): The proof argument largely mimics that of Theorem 15. Here we still provide a detailed proof for the sake of completeness. Let h(t) = α(t) α and v(t) = g(t) g, α α(t) . From Lemma 28 we know that (h(t))2 Nv(t)/µ. For an index set B, denote g(t) B := HB(g(t)) and v(t) B := g(t) B g B, α α(t) . Then from the non-expansion property of convex projection operator and the fact of g = 0 we can show (h(t))2 = PF α(t 1) + η(t 1)g(t 1) Bi(t 1) α(t 1) + η(t 1)g(t 1) Bi(t 1) α 2 =(h(t 1))2 2η(t 1)v(t 1) Bi(t 1) + (η(t 1))2 g(t 1) Bi(t 1) 2. By taking conditional expectation (with respect to uniform random block selection, conditioned on α(t 1)) on both sides of the above inequality we get E[(h(t))2 | α(t 1)] (h(t 1))2 1 i=1 2η(t 1)v(t 1) Bi + 1 i=1 (η(t 1))2 g(t 1) Bi 2 =(h(t 1))2 2η(t 1) m v(t 1) + (η(t 1))2 (h(t 1))2 2η(t 1)µ m N (h(t 1))2 + (η(t 1))2 (r X + λ Let us choose η(t) = m N µ(t+2). Then we obtain E[(h(t))2 | α(t 1)] 1 2 t + 1 (h(t 1))2 + m(r X + λ λ2µ2(t + 1)2 . By taking expectation on both sides of the above over α(t 1), we further get E[(h(t))2] 1 2 t + 1 E[(h(t 1))2] + m(r X + λ λ2µ2(t + 1)2 . By induction, this recursive inequality leads to E[(h(t))2] m(r X + λ Moreover, similar to the argument in the proof of Theorem 15 we obtain E[ϵ(t) PD] E[ g(t) ( β(t) α + α α(t) )] + 1 E[ α(t) α ] + 1 1 t + 2 Dual Iterative Hard Thresholding where in the first inequality we have used E[ α(t) α ] p E[ α(t) α 2] This proves the results in Part(a). Part(b): Let us consider ϵ0 = λN ϵ 2 X . From Part(a) we obtain E α(t) α δϵ0 after t t1 = l m(r X +λ m . Then from the Markov inequality we know that α(t) α E[ α(t) α ]/δ ϵ0 holds with probability at least 1 δ. Lemma 31 shows that α(t) α ϵ0 implies supp(w(t)) = supp( w). Therefore when t t1, the event supp(w(t)) = supp( w) occurs with probability at least 1 δ. This proves the desired result in Part(b). B.4 Proof of Theorem 22 Proof Part(a): Based on the proof of Theorem 17 we know that g = 0 and g(t) g ℓD α(t) α . Let h(t) = α(t) α and v(t) = g(t) g, α α(t) . From Lemma 28 we know that (h(t))2 v(t)/µD. For an index set B, denote g(t) B := HB(g(t)) and v(t) B := g(t) B g B, α α(t) . Then from the non-expansion property of convex projection operator and the fact of g = 0 we have (h(t))2 = PF α(t 1) + η(t 1)g(t 1) Bi(t 1) α(t 1) + η(t 1)(g(t 1) Bi(t 1) g Bi(t 1)) α 2 =(h(t 1))2 2η(t 1)v(t 1) Bi(t 1) + (η(t 1))2 g(t 1) Bi(t 1) g Bi(t 1) 2. By taking conditional expectation (with respect to uniform random block selection, conditioned on α(t 1)) on both sides of the above inequality we get E h (h(t))2 | α(t 1)i (h(t 1))2 1 i=1 2η(t 1)v(t 1) Bi + 1 i=1 (η(t 1))2 g(t 1) Bi g Bi 2 =(h(t 1))2 2η(t 1) m v(t 1) + (η(t 1))2 m g(t 1) g 2 (h(t 1))2 2η(t 1)µD m (h(t 1))2 + (η(t 1))2ℓ2 D m (h(t 1))2 1 2η(t 1)µD m + (η(t 1))2ℓ2 D m Let η(t) = µD ℓ2 D . Then we obtain E h (h(t))2 | α(t 1)i 1 µ2 D mℓ2 D By taking expectation on both sides of the above over α(t 1), we further get E h (h(t))2i 1 µ2 D mℓ2 D E h (h(t 1))2i . Yuan, Liu, Wang, Liu, and Metaxas This recursive inequality leads to E h (h(t))2i 1 µ2 D mℓ2 D where we have used α(0) = 0. Following the similar arguments in the proof of Part(a) of Theorem 17 we can show that E[ϵ(t) PD] ℓD + 1 E[ α(t) α 2] + 1 1 µ2 D mℓ2 D which is the desired bound in Part(a). Part(b): Let us consider ϵ0 = λN ϵ 2 X . From Part(a) we obtain E α(t) α δϵ0 after t mℓ2 D µ2 D log 4 α 2 X 2 δ2λ2N2 ϵ2 . Then from the Markov inequality we know that α(t) α E α(t) α /δ ϵ0 holds with probability at least 1 δ. Lemma 31 shows that α(t) α ϵ0 implies supp(w(t)) = supp( w). Therefore in this case, the event supp(w(t)) = supp( w) occurs with probability at least 1 δ. Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pages 1457 1465, 2012. Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends R in Machine Learning, 4(1): 1 106, 2012. Sohail Bahmani, Bhiksha Raj, and Petros T Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14:807 841, 2013. Amir Beck and Yonina C Eldar. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480 1509, 2013. Thomas Blumensath. Compressed sensing with nonlinear observations and related nonlinear optimization problems. IEEE Transactions on Information Theory, 59(6):3466 3474, 2013. Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265 274, 2009. Chuangjian Cai, Lin Zhang, Wenjuan Cai, Dong Zhang, Yanlu Lv, and Jianwen Luo. Nonlinear greedy sparsity-constrained algorithm for direct reconstruction of fluorescence molecular lifetime tomography. Biomedical optics express, 7(4):1210 1226, 2016. Dual Iterative Hard Thresholding Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40 (1):120 145, 2011. Jinghui Chen and Quanquan Gu. Accelerated stochastic block coordinate gradient descent for sparsity constrained nonconvex optimization. In Conference on Uncertainty in Artificial Intelligence, 2016. David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4): 1289 1306, 2006. Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543 2563, 2011. Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC press, 2015. Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear svm. In International Conference on Machine Learning, pages 408 415, 2008. Martin Jaggi, Virginia Smith, Martin Tak ac, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068 3076, 2014. Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685 693, 2014. Prateek Jain, Nikhil Rao, and Inderjit S Dhillon. Structured sparse regression via greedy hard thresholding. In Advances in Neural Information Processing Systems, pages 1516 1524, 2016. Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods. ar Xiv preprint ar Xiv:1607.05423, 2016. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315 323, 2013. Rajiv Khanna and Anastasios Kyrillidis. IHT dies hard: Provable accelerated iterative hard thresholding. In International Conference on Artificial Intelligence and Statistics, pages 188 198, 2018. Ken Lang. Newsweeder: Learning to filter netnews. In International Conference on Machine Learning, pages 331 339, 1995. Maksim Lapin, Bernt Schiele, and Matthias Hein. Scalable multitask representation learning for scene classification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1434 1441, 2014. Yuan, Liu, Wang, Liu, and Metaxas Sangkyun Lee, Damian Brzyski, and Malgorzata Bogdan. Fast saddle-point algorithm for generalized dantzig selector and fdr control with ordered L1-norm. In International Conference on Artificial Intelligence and Statistics, pages 780 789, 2016. David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361 397, 2004. Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In International Conference on Machine Learning, pages 917 925, 2016. Bo Liu, Xiao-Tong Yuan, Lezi Wang, Qingshan Liu, and Dimitris N Metaxas. Dual iterative hard thresholding: From non-convex sparse minimization to non-smooth concave maximization. In International Conference on Machine Learning, pages 2179 2187, 2017. Bo Liu, Xiao-Tong Yuan, Lezi Wang, Qingshan Liu, Junzhou Huang, and Dimitris N Metaxas. Distributed inexact newton-type pursuit for non-convex sparse learning. In International Conference on Artificial Intelligence and Statistics, pages 343 352, 2019. Rahul Mazumder and Trevor Hastie. The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125, 2012. Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227 234, 1995. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. Nam Nguyen, Deanna Needell, and Tina Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869 6895, 2017. Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Annual Asilomar Conference on Signals, Systems, and Computers, pages 40 44, 1993. Clint Scovel, Don Hush, and Ingo Steinwart. Approximate duality. Journal of Optimization Theory and Applications, 135(3):429 443, 2007. Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747 754, 2016. Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 378 385, 2013a. Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567 599, 2013b. Dual Iterative Hard Thresholding Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105 145, 2016. Jie Shen and Ping Li. On the iteration complexity of support recovery via hard thresholding pursuit. In International Conference on Machine Learning, pages 3115 3124, 2017a. Jie Shen and Ping Li. Partial hard thresholding: Towards a principled analysis of support recovery. In Advances in Neural Information Processing Systems, pages 3127 3137, 2017b. Jie Shen and Ping Li. A tight bound of hard thresholding. Journal of Machine Learning Research, 18(208):1 42, 2018. Mohammadreza Soltani and Chinmay Hegde. Fast algorithms for demixing sparse signals from nonlinear observations. IEEE Transactions on Signal Processing, 65(16):4209 4222, 2017. Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empirical risk minimization with o (1) per-iteration complexity. In Advances in Neural Information Processing Systems, pages 8376 8385, 2018. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267 288, 1996. Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In International Conference on Machine Learning, pages 3636 3645, 2017. Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543 2596, 2010. Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. Dscovr: Randomized primaldual block coordinate algorithms for asynchronous distributed optimization. Journal of Machine Learning Research, 20:1 58, 2019. Adams Wei Yu, Qihang Lin, and Tianbao Yang. Doubly stochastic primal-dual coordinate method for empirical risk minimization and bilinear saddle-point problem. ar Xiv preprint ar Xiv:1508.03390, 2015. Xiao-Tong Yuan and Ping Li. Generalization bounds for high-dimensional m-estimation under sparsity constraint. ar Xiv preprint ar Xiv:2001.07212, 2020. Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsityconstrained optimization. In International Conference on Machine Learning, pages 127 135, 2014. Xiao-Tong Yuan, Ping Li, and Tong Zhang. Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems, pages 3558 3566, 2016. Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18(166):1 43, 2018. Yuan, Liu, Wang, Liu, and Metaxas Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in highdimensional linear regression. The Annals of Statistics, pages 1567 1594, 2008. Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Journal of Machine Learning Research, 18(1):2939 2980, 2017. Pan Zhou, Xiao-Tong Yuan, and Jiashi Feng. Efficient stochastic gradient hard thresholding. In Advances in Neural Information Processing Systems, pages 1988 1997, 2018.