# Primal-Dual Block Generalized Frank-Wolfe

Qi Lei, Jiacheng Zhuo, Constantine Caramanis, Inderjit S. Dhillon, and Alexandros G. Dimakis
UT Austin; Amazon
{leiqi@oden., jzhuo@, constantine@, inderjit@cs., dimakis@austin.}utexas.edu

*Both authors contributed equally. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.*

We propose a generalized variant of the Frank-Wolfe algorithm for solving a class of sparse/low-rank optimization problems. Our formulation includes Elastic Net, regularized SVMs and phase retrieval as special cases. The proposed Primal-Dual Block Generalized Frank-Wolfe algorithm reduces the per-iteration cost while maintaining a linear convergence rate. The per-iteration cost of our method depends on the structural complexity of the solution (i.e., sparsity/low rank) instead of the ambient dimension. We empirically show that our algorithm outperforms the state-of-the-art methods on (multi-class) classification tasks.

## 1 Introduction

We consider optimization problems of the form

$$\min_{x \in \mathcal{C}}\ \sum_{i} f_i(a_i^\top x) + g(x),$$

directly motivated by regularized and constrained Empirical Risk Minimization (ERM). In particular, we are interested in problems whose solution has a simple structure such as low rank or sparsity. The sparsity constraint applies to large-scale multi-class/multi-label classification, low-degree polynomial data mappings [5], random feature kernel machines [32], and Elastic Net [39]. Motivated by recent applications in low-rank multi-class SVM, phase retrieval, matrix completion, affine rank minimization and other problems (e.g., [9, 31, 2, 3]), we also consider settings where the constraint $x \in \mathcal{C}$ (e.g., the trace norm ball), while convex, may be difficult to project onto.

A wish-list for this class of problems would include an algorithm that (1) exploits the finite-sum form of the objective and the simple structure of the solution, (2) achieves linear convergence for smooth and strongly convex problems, and (3) does not pay a heavy price for the projection step.

We propose a Frank-Wolfe (FW) type method that attains these three goals. This does not come without challenges: although it is well appreciated that FW type algorithms avoid the cost of projection [14, 1], the benefits are limited to constraints that are hard to project onto, like the trace norm ball. For problems like phase retrieval and ERM for multi-label multi-class classification, the gradient computation requires large matrix multiplications. This dominates the per-iteration cost, and existing FW type methods do not asymptotically reduce the time complexity per iteration, even though they avoid the expensive projection step. Meanwhile, for simpler constraints like the ℓ1 norm ball or the simplex, it is unclear whether FW can offer any benefit compared to other methods. Moreover, as is generally known, FW suffers from a sub-linear convergence rate even for well-conditioned problems that enjoy strong convexity and smoothness.

**Our contributions.** In this paper we tackle these challenges by exploiting the special structure induced by the constraints and the FW steps. We propose a generalized variant of FW that we call Primal-Dual Block Generalized Frank-Wolfe. The main advantage is that its computational complexity depends only on the sparsity of the solution, rather than the ambient dimension, i.e., it is dimension free.
This is achieved by conducting partial updates in each iteration, i.e., sparse updates for the ℓ1 ball and low-rank updates for the trace norm ball. While the benefit of partial updates is unclear for the original problem, we show in this work how they significantly help in a primal-dual reformulation. This reduces the per-iteration cost by a factor of roughly $s/d$ compared to naive Frank-Wolfe, where $s$ is the sparsity (or rank) of the optimal solution and $d$ is the feature dimension. Meanwhile, the per-iteration progress of our proposal is comparable to a full gradient descent step, thus retaining the linear convergence rate.

For strongly convex and smooth $f$ and $g$ we show that our algorithm achieves linear convergence with per-iteration cost $sn$ over the ℓ1-norm ball, where $s$ upper bounds the sparsity of the primal optimum. Specifically, for sparse ERM with smooth hinge loss or quadratic loss with an ℓ2 regularizer, our algorithm yields an overall $O(s(n + \kappa)\log\frac{1}{\epsilon})$ time complexity to reach an $\epsilon$ duality gap, where $\kappa$ is the condition number (smoothness divided by strong convexity). Our theory has minimal requirements on the data matrix $A$. Experimentally we observe that our method yields significantly better performance than prior work, especially when the data dimension is large and the solution is sparse. We therefore achieve state-of-the-art performance both in time complexity and in practice, as measured by CPU time, for regularized ERM with smooth hinge loss and for matrix sensing problems.

## 2 Related Work

We review relevant algorithms that improve the overall performance of Frank-Wolfe type methods. Such improvements come roughly from two sources: enhancing the convergence speed and reducing the per-iteration cost. Very few prior works benefit from both.

Nesterov's acceleration has proven effective, as in Stochastic Conditional Gradient Sliding (SCGS) [23] and other variants [36, 26, 10]. Restarting techniques dynamically adapt to the geometric properties of the function and fill in the gap between sublinear and linear convergence for the FW method [18]. Some variance-reduced algorithms obtain linear convergence, as in [13]; however, the number of inner loops grows significantly and hence the method is not computationally efficient. Linear convergence has been obtained specifically for polytope constraints, as in [27, 20], and in [21, 11], which use Away-step Frank-Wolfe and Pairwise Frank-Wolfe together with their stochastic variants. One recent work [1] focuses on trace norm constraints and proposes a FW-type algorithm whose per-iteration progress is similar to projected gradient descent while being almost projection free. However, in many applications where the gradient computation dominates the iteration complexity, reducing the projection step does not necessarily produce asymptotically better iteration costs.

The sparse update introduced by FW steps was also exploited by [22], who conducted dual updates with a focus on SVMs with polytope constraints. Their algorithm yields low iteration costs but still suffers from sub-linear convergence. On the other hand, the recently popularized primal-dual formulation $\min_x \max_y \{g(x) + y^\top A x - f^*(y)\}$ has proven useful for different machine learning tasks such as reinforcement learning, ERM, and robust optimization [8].
Especially for ERM-related problems, the primal-dual formulation still inherits the finite-sum structure of the primal form, and can be used to reduce variance [38, 35] or to reduce communication complexity in the distributed setting [37]. One issue with this line of prior work is that it does not achieve better performance than the primal formulation. A notable exception is [24], which also attempts to exploit sparsity of the primal variables within the primal-dual formulation. However, that work targets unconstrained problems, so it is not directly comparable to ours. Moreover, the analysis of [24] relies on the sparsity of the whole iterate trajectory, which has no obvious guarantee of being small, whereas our analysis only depends on the sparsity or rank of the primal optimum, which is enforced by the ℓ1 or nuclear norm constraint.

**Notation.** We briefly introduce the notation used throughout the paper. We use bold lower-case letters for vectors and capital letters for matrices. $\|\cdot\|$ denotes the ℓ2 norm for vectors and the Frobenius norm for matrices unless specified otherwise; $\|\cdot\|_*$ denotes the trace norm of a matrix. We say a function $f$ is $\alpha$-strongly convex if $f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\alpha}{2}\|y - x\|^2$, where $g \in \partial f(x)$ is any sub-gradient of $f$. Similarly, $f$ is $\beta$-smooth when $f(y) \le f(x) + \langle g, y - x\rangle + \frac{\beta}{2}\|y - x\|^2$. We use $f^*$ to denote the convex conjugate of $f$, i.e., $f^*(y) := \max_x \langle x, y\rangle - f(x)$. Some further parameters are problem-specific and are defined when needed.

### 3.1 A Theoretical Vignette

To explain the techniques we use to obtain linear convergence for our Frank-Wolfe type algorithm, we consider the ℓ1-norm constrained problem as an illustrative example:

$$\arg\min_{x \in \mathbb{R}^d,\ \|x\|_1 \le \tau}\ f(x), \tag{1}$$

where $f$ is $L$-smooth and $\mu$-strongly convex. If we invoke the Frank-Wolfe algorithm, we compute

$$x^{(t)} \leftarrow (1-\eta)x^{(t-1)} + \eta\,\tilde{x}, \quad \text{where } \tilde{x} \leftarrow \arg\min_{\|x\|_1 \le \tau}\ \langle \nabla f(x^{(t-1)}), x\rangle. \tag{2}$$

Even when the function $f$ is smooth and strongly convex, (2) converges sublinearly. Inspired by [1], if we assume the optimal solution is $s$-sparse, we can enforce a sparse update while maintaining linear convergence through a mild modification of (2):

$$x^{(t)} \leftarrow (1-\eta)x^{(t-1)} + \eta\,\tilde{x}, \quad \text{where } \tilde{x} \leftarrow \arg\min_{\|x\|_1 \le \tau,\ \|x\|_0 \le s}\ \Big\{\langle \nabla f(x^{(t-1)}), x\rangle + \tfrac{L}{2}\eta\,\|x^{(t-1)} - x\|_2^2\Big\}. \tag{3}$$

We also call this new practice block Frank-Wolfe, as in [1]. The proof of convergence can be completed within three lines. Let $x^\star$ denote the ($s$-sparse) optimum of (1), $f^\star = f(x^\star)$, and $h_t = f(x^{(t)}) - f^\star$. Then

$$
\begin{aligned}
h_t &= f\big(x^{(t-1)} + \eta(\tilde{x} - x^{(t-1)})\big) - f^\star\\
&\le h_{t-1} + \eta\,\big\langle \nabla f(x^{(t-1)}),\ \tilde{x} - x^{(t-1)}\big\rangle + \tfrac{L}{2}\eta^2\|\tilde{x} - x^{(t-1)}\|^2 && \text{(smoothness of } f)\\
&\le h_{t-1} + \eta\,\big\langle \nabla f(x^{(t-1)}),\ x^\star - x^{(t-1)}\big\rangle + \tfrac{L}{2}\eta^2\|x^\star - x^{(t-1)}\|^2 && \text{(definition of } \tilde{x})\\
&\le \big(1 - \eta + \tfrac{L}{\mu}\eta^2\big)\,h_{t-1} && \text{(convexity and } \mu\text{-strong convexity of } f). \qquad (4)
\end{aligned}
$$

Therefore, when $\eta = \frac{\mu}{2L}$, $h_{t+1} \le (1 - \frac{\mu}{4L})^t h_1$ and the iteration complexity is $O(\frac{L}{\mu}\log\frac{1}{\epsilon})$ to achieve $\epsilon$ error.

Although we achieve linear convergence, no advantage in overall complexity over classical methods (e.g., Projected Gradient Descent (PGD)) has been shown yet. Luckily, with the update $\tilde{x}$ being sparse, it is possible to improve the per-iteration complexity while maintaining the linear convergence rate. To distinguish it, we refer to the sparse-update nature of (3) as a partial update. Next we describe situations where one benefits from partial updates. Consider a quadratic function $f(x) = \frac{1}{2}x^\top A x$, whose gradient is $Ax$ for symmetric $A$. As $\tilde{x}$ is sparse, one can maintain the value of the gradient efficiently: $Ax^{(t)} \leftarrow (1-\eta)Ax^{(t-1)} + \eta A_{:,I}\,\tilde{x}_I$, where $I$ is the support set of $\tilde{x}$. We therefore reduce the complexity of one iteration to $O(sd)$, compared to $O(d^2)$ for PGD. Similar benefits hold when we replace $x$ by a matrix $X$ and conduct a low-rank update on $X$.
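A minimal NumPy sketch of this idea is given below for the quadratic example. The direction-selection rule is a simplified heuristic stand-in for the exact subproblem (3) (it drops the proximal term and spreads the ℓ1 mass uniformly), and the function name is illustrative only; the point of the sketch is that maintaining $Ax$ under an $s$-sparse update costs $O(sd)$ rather than $O(d^2)$.

```python
import numpy as np

def block_fw_quadratic(A, x0, tau, s, eta, T):
    """Block Frank-Wolfe sketch for f(x) = 0.5 * x^T A x over {||x||_1 <= tau},
    with an s-sparse partial update per iteration (A symmetric).

    The key point is the bookkeeping: because the update direction x_tilde is
    s-sparse, the gradient A @ x is maintained in O(s d) time per iteration
    instead of being recomputed in O(d^2).
    """
    d = A.shape[0]
    x = x0.copy()
    grad = A @ x                       # one full gradient up front; O(s d) afterwards

    for _ in range(T):
        # Heuristic stand-in for the subproblem in Eq. (3): support = the s
        # coordinates with the largest |gradient|, direction = -sign(gradient),
        # with total l1 mass tau.  (The exact subproblem also carries the
        # proximal term (L/2) * eta * ||x - x_prev||^2.)
        I = np.argpartition(np.abs(grad), -s)[-s:]
        x_tilde = np.zeros(d)
        x_tilde[I] = -np.sign(grad[I]) * (tau / s)   # ||x_tilde||_1 <= tau

        # Partial update of the iterate.
        x = (1.0 - eta) * x + eta * x_tilde

        # Maintain grad = A @ x, touching only the s columns indexed by I.
        grad = (1.0 - eta) * grad + eta * (A[:, I] @ x_tilde[I])

    return x

# Tiny usage example with a random PSD matrix:
# rng = np.random.default_rng(0)
# M = rng.standard_normal((50, 200)); A = M.T @ M / 50
# x = block_fw_quadratic(A, 0.01 * rng.standard_normal(200), tau=5.0, s=10, eta=0.1, T=100)
```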
The benefit of partial updates is not limited to quadratic functions. Next we show that for a class of composite functions we can take full advantage of partial updates by means of a primal-dual reformulation.

## 4 Methodology

**Primal-Dual Formulation.** The problem we are tackling is

$$\min_{x \in \mathcal{C}}\ \frac{1}{n}\sum_{i=1}^n f_i(a_i^\top x) + g(x). \tag{5}$$

We first focus on the setting where $x \in \mathbb{R}^d$ is a vector and $\mathcal{C}$ is the ℓ1-norm ball. This form covers general classification and regression tasks with $f_i$ being a loss function and $g$ a regularizer. The extension to matrix optimization over a trace norm ball is introduced in Section 4.3. Even with the constraint, we can reformulate (5) as a primal-dual convex-concave saddle point problem,

$$\min_{x \in \mathcal{C}}\ \max_{y \in \mathbb{R}^n}\ \mathcal{L}(x, y) := g(x) + \frac{1}{n}\sum_{i=1}^n\big(y_i\, a_i^\top x - f_i^*(y_i)\big), \tag{6}$$

or consider its dual formulation

$$D(y) := \min_{x \in \mathcal{C}}\ \mathcal{L}(x, y). \tag{7}$$

Notice that (7) is not guaranteed to have an explicit form. Therefore some existing FW variants, such as [22], that optimize over (7) may not apply. Instead, we directly solve the convex-concave problem (6) and can therefore handle more general problems, including complicated constraints like the trace norm. Since the computational cost of the gradients $\nabla_x\mathcal{L}$ and $\nabla_y\mathcal{L}$ is dominated by computing $A^\top y$ and $Ax$ respectively, sparse updates can reduce the computational cost of updating $x$ and $y$ by a factor of roughly $O(d/s)$ while achieving good progress.

### 4.1 Primal-Dual Block Generalized Frank-Wolfe

With the primal-dual formulation, we are ready to introduce our algorithm. The idea is simple: since the primal variable $x$ is constrained to the ℓ1 norm ball, we conduct a block Frank-Wolfe step, which yields an $s$-sparse update. Meanwhile, for the dual variable $y$ we conduct a greedy coordinate ascent step that selects and updates $k$ coordinates ($k$ is determined later). We select the coordinates that allow the largest step, a choice usually referred to as the Gauss-Southwell rule GS-r [30]. Our algorithm is formally presented in Algorithm 1.

We have the following assumptions on $f$ and $g$:

**Assumption 4.1.** We assume the functions satisfy the following properties:
- Each loss function $f_i$ is convex and $\beta$-smooth, is $\alpha$-strongly convex over some convex set (which could be all of $\mathbb{R}$), and is linear outside of it.
- $\max_i \|a_i\|_2^2 \le R$. Therefore $\frac{1}{n}\sum_{i=1}^n f_i(a_i^\top x)$ is $\beta R$-smooth.
- $g$ is $\mu$-strongly convex and $L$-smooth.

Suitable loss functions $f_i$ include the smooth hinge loss [33] and the quadratic loss. Relevant applications covered include Support Vector Machines (SVM) with smooth hinge loss, Elastic Net [39], and linear regression with quadratic loss.

**Algorithm 1** Primal-Dual Block Generalized Frank-Wolfe Method for the ℓ1 Norm Ball

1. **Input:** training data $A \in \mathbb{R}^{n\times d}$, primal and dual step sizes $\eta, \delta > 0$.
2. **Initialize:** $x^{(0)} \leftarrow 0 \in \mathbb{R}^d$, $y^{(0)} \leftarrow 0 \in \mathbb{R}^n$, $w^{(0)} \leftarrow Ax^{(0)} = 0 \in \mathbb{R}^n$, $z^{(0)} \leftarrow A^\top y^{(0)} = 0 \in \mathbb{R}^d$.
3. **for** $t = 1, 2, \dots, T$ **do**
4. Use block Frank-Wolfe to update the primal variable: $\tilde{x} \leftarrow \arg\min_{\|x\|_1\le\lambda,\,\|x\|_0\le s}\big\{\langle \frac{1}{n}z^{(t-1)} + \nabla g(x^{(t-1)}),\,x\rangle + \frac{L}{2}\eta\,\|x - x^{(t-1)}\|^2\big\}$, and set $x^{(t)} \leftarrow (1-\eta)x^{(t-1)} + \eta\tilde{x}$. (8)
5. Update $w$ to maintain the value of $Ax$: $w^{(t)} \leftarrow (1-\eta)w^{(t-1)} + \eta A\tilde{x}$. (9)
6. Consider the potential dual update $\bar{y} \leftarrow \arg\max_{y\in\mathbb{R}^n}\big\{\frac{1}{n}\langle w^{(t)}, y\rangle - \frac{1}{n}\sum_{i=1}^n f_i^*(y_i) - \frac{1}{2\delta}\|y - y^{(t-1)}\|^2\big\}$. (10)
7. Greedily choose the dual coordinates to update: let $I^{(t)}$ be the top $k$ coordinates $i\in[n]$ maximizing $|\bar{y}_i - y^{(t-1)}_i|$, and set $y^{(t)}_i \leftarrow \bar{y}_i$ if $i\in I^{(t)}$, and $y^{(t)}_i \leftarrow y^{(t-1)}_i$ otherwise. (11)
8. Update $z$ to maintain the value of $A^\top y$: $z^{(t)} \leftarrow z^{(t-1)} + A_{I^{(t)},:}^\top\big(y^{(t)} - y^{(t-1)}\big)_{I^{(t)}}$. (12)
9. **end for**
10. **Output:** $x^{(T)}, y^{(T)}$.

As $\mathcal{L}(x, y)$ is $\mu$-strongly convex and $L$-smooth with respect to $x$, we set the primal learning rate $\eta = \frac{\mu}{2L}$, following Section 3.1. Meanwhile, the dual learning rate $\delta$ is set to balance its effect on the dual progress as well as on the primal progress; we specify it in the theoretical analysis.
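The following is a minimal NumPy sketch of Algorithm 1 under two simplifying assumptions: the loss is quadratic, $f_i(z) = \frac12(z - b_i)^2$ (so the dual step (10) has a closed form), and the regularizer is $g(x) = \frac{\mu}{2}\|x\|^2$. The exact primal subproblem (8) is replaced by a simple heuristic (proximal point, top-$s$ support, ℓ1 projection), the placement of the $1/n$ factors follows the statement of (10) above, and all function names are illustrative; this is a sketch, not the reference C++/Eigen implementation.

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def pd_block_fw_quadratic_loss(A, b, lam, s, k, mu, eta, delta, T):
    """Sketch of Algorithm 1 for f_i(z) = 0.5*(z - b_i)^2 and g(x) = 0.5*mu*||x||^2.

    The exact l1+sparsity subproblem (8) is replaced by a heuristic; the
    bookkeeping of w = A x and z = A^T y and the GS-r top-k dual update follow
    the paper's structure.
    """
    n, d = A.shape
    L = mu                              # g is mu-strongly convex and mu-smooth here
    x = np.zeros(d); y = np.zeros(n)
    w = np.zeros(n)                     # maintains A @ x
    z = np.zeros(d)                     # maintains A.T @ y

    for _ in range(T):
        # Primal: s-sparse block step (heuristic stand-in for Eq. (8)).
        c = z / n + mu * x              # gradient of L(., y) at x
        v = x - c / (L * eta)           # unconstrained proximal point
        I = np.argpartition(np.abs(v), -s)[-s:]
        x_tilde = np.zeros(d)
        x_tilde[I] = project_l1(v[I], lam)
        x = (1.0 - eta) * x + eta * x_tilde

        # Maintain w = A @ x using only the s touched columns (Eq. (9)).
        w = (1.0 - eta) * w + eta * (A[:, I] @ x_tilde[I])

        # Dual: closed-form prox-ascent candidate for quadratic loss (Eq. (10)),
        # using f_i^*(y) = 0.5*y^2 + b_i*y.
        y_bar = (delta * (w - b) + n * y) / (n + delta)

        # GS-r: update only the k coordinates that move the most (Eq. (11)).
        J = np.argpartition(np.abs(y_bar - y), -k)[-k:]
        dy = np.zeros(n); dy[J] = y_bar[J] - y[J]
        y = y + dy

        # Maintain z = A.T @ y using only the k touched rows (Eq. (12)).
        z = z + A[J, :].T @ dy[J]

    return x, y
```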
The computational complexity of each iteration of Algorithm 1 is $O(ns)$. Both the primal and the dual update can be viewed as roughly three steps: coordinate selection, variable update, and maintaining $A^\top y$ or $Ax$. The coordinate selection, i.e., Eqn. (8) for the primal and the choice of $I^{(t)}$ for the dual variable, takes $O(d)$ and $O(n)$ on average, respectively, if implemented with the quickselect algorithm. The variable updates cost $O(d)$ and $O(n)$. The dominating costs are maintaining $Ax$ as in Eqn. (9), which takes $O(ns)$, and maintaining $A^\top y$ as in Eqn. (12), which takes $O(dk)$. To balance the time budget between the primal and the dual step, we set $k = ns/d$ and obtain an overall complexity of $O(ns)$ per iteration.

### 4.2 Theoretical Analysis

We derive the convergence analysis under Assumption 4.1. The derivation consists of the analysis of the primal progress, the balance of the dual progress, and their overall effect. Define the primal gap as $\Delta_p^{(t)} := \mathcal{L}(x^{(t+1)}, y^{(t)}) - \mathcal{L}(\bar{x}^{(t)}, y^{(t)})$, where $\bar{x}^{(t)}$ is the primal optimal solution satisfying $D(y^{(t)}) = \mathcal{L}(\bar{x}^{(t)}, y^{(t)})$, and is sparse, as enforced by the ℓ1 constraint. The dual gap is $\Delta_d^{(t)} := D^\star - D(y^{(t)})$. We analyze the convergence rate of the duality gap $\Delta^{(t)} := \max\{1, \beta/\alpha - 1\}\,\Delta_p^{(t)} + \Delta_d^{(t)}$.

**Primal progress.** First, similarly to the analysis in Section 3.1, the primal update yields sufficient descent: $\mathcal{L}(x^{(t+1)}, y^{(t)}) - \mathcal{L}(x^{(t)}, y^{(t)})$ decreases by at least an $\Omega(\eta)$ fraction of the primal gap $\Delta_p^{(t)}$ (Lemma A.2).

**Dual progress.** With the GS-r rule carefully selecting and updating the $k$ most important coordinates of the dual variable in (10), the dual step captures a $\frac{k\delta}{n\beta}$ fraction of the dual gap $\Delta_d^{(t)}$ through $\|y^{(t)} - y^{(t-1)}\|^2$, up to an error term of order $\frac{k\delta}{n^2}R\,\|x^{(t)} - \bar{x}^{(t)}\|_2^2$; see Lemma A.5 for the precise statement.

**Primal-dual progress.** The overall progress combines the primal progress $\mathcal{L}(x^{(t+1)}, y^{(t)}) - \mathcal{L}(x^{(t)}, y^{(t)})$, the dual progress measured by $\|y^{(t)} - y^{(t-1)}\|^2$, and a primal-hindrance term $3\delta R k\,\|x^{(t)} - \bar{x}^{(t)}\|^2$. In this way we relate the per-iteration progress on the duality gap to a constant fraction of its value, and obtain linear convergence:

**Theorem 4.1.** Let $P(x) = \frac{1}{n}\sum_{i=1}^n f_i(a_i^\top x) + g(x)$ satisfy Assumption 4.1. Set $s$ to upper bound the sparsity of the primal optimum $\bar{x}^{(t)}$, and set the learning rates $\eta = \frac{\mu}{2L}$ and $\delta = \frac{1}{k}\big(\frac{L}{\mu n\beta} + \frac{5\beta R}{2\alpha\mu n^2}(1 + \frac{4L}{\mu})\big)^{-1}$. Then the duality gap $\Delta^{(t)} = \max\{1, \frac{\beta}{\alpha} - 1\}\Delta_p^{(t)} + \Delta_d^{(t)}$ generated by Algorithm 1 converges linearly, reaching $\epsilon$ error within $O(\log\frac{1}{\epsilon})$ iterations (with a problem-dependent constant); the overall complexity follows from the $O(ns)$ per-iteration cost.

For our target applications, such as Elastic Net or ERM with smooth hinge loss, we can relate the time complexity to the condition number of the primal form.

**Corollary 4.1.1.** Given a smooth hinge loss or quadratic loss $f_i$ that is $\beta$-smooth, and the ℓ2 regularizer $g = \frac{\mu}{2}\|x\|^2$, define the condition number $\kappa = \frac{\beta R}{\mu}$. With $s$ upper bounding the sparsity of the primal optimum $\bar{x}^{(t)}$ and the learning rates chosen as in Theorem 4.1, the duality gap $\Delta^{(t)}$ reaches $\epsilon$ error within $O\big((1 + \frac{\kappa}{n})\log\frac{1}{\epsilon}\big)$ iterations. The overall complexity is $O\big(s(n + \kappa)\log\frac{1}{\epsilon}\big)$.

Our derivation of the overall complexity implicitly requires $ns \ge d$, since we set $k = ns/d \ge 1$. This holds for our considered applications, such as SVM. Otherwise we choose $k = 1$, and the complexity becomes $O\big(\max\{d, ns\}(1 + \frac{\kappa}{n})\log\frac{1}{\epsilon}\big)$.
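To spell out the accounting, the overall bound of Corollary 4.1.1 is just the product of the balanced per-iteration cost and the iteration count above (a worked restatement of the quantities already stated, not a new result):

$$
\underbrace{O(ns)}_{\text{per-iteration cost},\ k = ns/d}\ \times\ \underbrace{O\!\Big(\big(1 + \tfrac{\kappa}{n}\big)\log\tfrac{1}{\epsilon}\Big)}_{\text{iterations}}
\;=\; O\!\Big(\big(ns + s\kappa\big)\log\tfrac{1}{\epsilon}\Big)
\;=\; O\!\Big(s(n+\kappa)\log\tfrac{1}{\epsilon}\Big).
$$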
In Table 1, we briefly compare the time complexity of our algorithm with several benchmark algorithms: (1) Accelerated Projected Gradient Descent (PGD), (2) the Frank-Wolfe algorithm (FW), (3) Stochastic Variance Reduced Gradient (SVRG) [15], (4) Stochastic Conditional Gradient Sliding (SCGS) [23], and (5) Stochastic Variance-Reduced Conditional Gradient Sliding (STORC) [13]. The comparison is not exhaustive; it selects constrained-optimization methods that improve the overall complexity from different perspectives. Among them, accelerated PGD improves the conditioning of the problem, while SCGS and STORC reduce the dependence on the number of samples. In the experimental section we show that our proposal outperforms the listed algorithms under various conditions.

| Algorithm | Per-Iteration Cost | Iteration Complexity |
| --- | --- | --- |
| Frank-Wolfe | $O(nd)$ | $O(\frac{1}{\epsilon})$ |
| Accelerated PGD [29] | $O(nd)$ | $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ |
| SVRG [15] | $O(nd)$ | $O((1 + \kappa/n)\log\frac{1}{\epsilon})$ |
| SCGS [23] | $O(\kappa^2\,\#\mathrm{iter}^3)$ | $O(\frac{1}{\epsilon})$ |
| STORC [13] | $O(\kappa^2 d + nd)$ | $O(\log\frac{1}{\epsilon})$ |
| Primal-Dual FW (ours) | $O(ns)$ | $O((1 + \kappa/n)\log\frac{1}{\epsilon})$ |

Table 1: Time complexity comparisons in the setting of Corollary 4.1.1. For a clear comparison, the per-iteration cost refers to the time complexity of the outer iterations.

### 4.3 Extension to the Trace Norm Ball

We also extend our algorithm to matrix optimization over trace norm constraints:

$$\min_{\|X\|_* \le \lambda,\ X\in\mathbb{R}^{d\times c}}\ \frac{1}{n}\sum_{i=1}^n f_i(a_i^\top X) + g(X). \tag{17}$$

This formulation covers multi-label multi-class problems, matrix completion, affine rank minimization, and phase retrieval problems (see the references in [3, 1]). Equivalently, we solve the following primal-dual problem:

$$\min_{\|X\|_* \le \lambda,\ X\in\mathbb{R}^{d\times c}}\ \max_{Y\in\mathbb{R}^{n\times c}}\ \mathcal{L}(X, Y) := g(X) + \frac{1}{n}\sum_{i=1}^n\big(\langle y_i,\,a_i^\top X\rangle - f_i^*(y_i)\big).$$

Here $y_i$ is the $i$-th row of the dual matrix $Y$. For this problem, the partial update we enforce on the primal matrix keeps the update matrix low rank:

$$\widetilde{X} \leftarrow \arg\min_{\|X\|_*\le\lambda,\ \mathrm{rank}(X)\le s}\ \Big\{\big\langle \tfrac{1}{n}Z + \nabla g(X^{(t-1)}),\ X\big\rangle + \tfrac{L}{2}\eta\,\|X - X^{(t-1)}\|^2\Big\},\qquad Z \equiv A^\top Y^{(t-1)}. \tag{18}$$

However, an exact solution to (18) requires computing the top $s$ left and right singular vectors of the matrix $X^{(t-1)} - \frac{1}{\eta L}\big(\frac{1}{n}Z + \nabla g(X^{(t-1)})\big) \in \mathbb{R}^{d\times c}$. Therefore we loosely compute a $(\frac{1}{2}, \epsilon/2)$-approximation, where $\epsilon$ is the target accuracy, based on the following definition:

**Definition 4.2** (restated from Definition 3.2 in [1]). Let $l_t(V) = \big\langle \nabla_X\mathcal{L}(X^{(t)}, Y^{(t)}),\ V - X^{(t)}\big\rangle + \frac{L}{2}\eta\,\|V - X^{(t)}\|_F^2$ be the objective function in (18), and let $l_t^\star = l_t(\widetilde{X}^{(t)})$. Given parameters $\gamma \ge 0$ and $\epsilon \ge 0$, a feasible solution $V$ to (18) is called $(\gamma, \epsilon)$-approximate if it satisfies $l_t(V) \le (1-\gamma)\,l_t^\star + \epsilon$.

**Algorithm 2** Primal-Dual Block Generalized Frank-Wolfe Method for the Trace Norm Ball

1. **Input:** training data $A\in\mathbb{R}^{n\times d}$, primal and dual step sizes $\eta, \delta > 0$, target accuracy $\epsilon$.
2. **Initialize:** $X^{(0)} \leftarrow 0 \in \mathbb{R}^{d\times c}$, $Y^{(0)} \leftarrow 0 \in \mathbb{R}^{n\times c}$, $W^{(0)} \leftarrow AX^{(0)} = 0 \in \mathbb{R}^{n\times c}$, $Z^{(0)} \leftarrow A^\top Y^{(0)} = 0 \in \mathbb{R}^{d\times c}$.
3. **for** $t = 1, 2, \dots, T$ **do**
4. Use Frank-Wolfe to update the primal variable: $X^{(t)} \leftarrow (1-\eta)X^{(t-1)} + \eta\widetilde{X}$, where $\widetilde{X}$ is a $(\frac{1}{8})$-approximation of Eqn. (18).
5. Update $W$ to maintain the value of $AX$: $W^{(t)} \leftarrow (1-\eta)W^{(t-1)} + \eta A\widetilde{X}$. (13)
6. Consider the potential dual update $\bar{Y} \leftarrow \arg\max_{Y\in\mathbb{R}^{n\times c}}\big\{\frac{1}{n}\langle W^{(t)}, Y\rangle - \frac{1}{n}\sum_{i=1}^n f_i^*(y_i) - \frac{1}{2\delta}\|Y - Y^{(t-1)}\|^2\big\}$. (14)
7. Greedily choose the rows of the dual variable to update: let $I^{(t)}$ be the top $k$ indices $i\in[n]$ maximizing $\|\bar{Y}_{i,:} - Y^{(t-1)}_{i,:}\|_2$, and set $Y^{(t)}_{i,:} \leftarrow \bar{Y}_{i,:}$ if $i\in I^{(t)}$, and $Y^{(t)}_{i,:} \leftarrow Y^{(t-1)}_{i,:}$ otherwise. (15)
8. Update $Z$ to maintain the value of $A^\top Y$: $Z^{(t)} \leftarrow Z^{(t-1)} + A^\top\big(Y^{(t)} - Y^{(t-1)}\big)$. (16)
9. **end for**
10. **Output:** $X^{(T)}, Y^{(T)}$.

The time dependence of this approximation step on the data sizes $n, c, d, s$ is $ncs + s^2(n + c)$ [1], and is again independent of $d$. Meanwhile, the procedure that keeps track of $W^{(t)} = AX^{(t)}$ requires $nds + ncs$ operations, while updating $Y^{(t)}$ requires $dck$ operations. Therefore, by setting $k \approx ns(1/c + 1/d)$, the per-iteration dependence on the data size becomes $O(n(d + c)s)$ operations, instead of the $O(ndc)$ needed for a full projected gradient step. Recall that $s$ upper bounds the rank of $\bar{X}^{(t)}$, which is at most $\min\{d, c\}$.
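To make the bookkeeping concrete, a minimal NumPy sketch of the low-rank primal step and the factored maintenance of $W = AX$ follows. Two simplifications are ours: an exact truncated SVD (`numpy.linalg.svd`) is used where Algorithm 2 only requires a $(\frac12, \epsilon/2)$-approximation, and feasibility in the trace-norm ball is enforced by a simple rescaling rather than the exact projection; function and argument names are illustrative.

```python
import numpy as np

def low_rank_partial_update(A, X, W, Z, grad_g, lam, s, eta, L, n):
    """One primal step of the trace-norm variant (illustrative sketch).

    Approximates the rank-s subproblem in Eq. (18) via a truncated SVD of the
    proximal point, then maintains W = A @ X in factored form so the update
    costs O(nds + ncs) instead of the O(ndc) of a full matrix product.
    """
    # Proximal point of the linearized objective in Eq. (18).
    P = X - (Z / n + grad_g) / (eta * L)

    U, sig, Vt = np.linalg.svd(P, full_matrices=False)
    U, sig, Vt = U[:, :s], sig[:s], Vt[:s, :]          # keep a rank-s factor
    sig *= min(1.0, lam / max(sig.sum(), 1e-12))       # stay inside ||.||_* <= lam
    X_tilde = (U * sig) @ Vt                           # d x c, rank <= s

    X_new = (1.0 - eta) * X + eta * X_tilde
    # Maintain W = A @ X without forming A @ X_new from scratch:
    # A @ X_tilde = ((A @ U) * sig) @ Vt costs about n*d*s + n*c*s operations.
    W_new = (1.0 - eta) * W + eta * ((A @ U) * sig) @ Vt
    return X_new, W_new
```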
The trace norm version mostly inherits the convergence guarantees of the vector case; we refer to the Appendix for details.

**Assumption 4.2.** We assume the following properties of the primal form (17):
- $f_i$ is convex and $\beta$-smooth. Its convex conjugate $f_i^*$ exists, is $\frac{1}{\alpha}$-smooth on some convex set (which could be all of $\mathbb{R}^c$), and takes the value infinity otherwise.
- The data matrix $A$ satisfies $R = \max_{|I|\le k,\ I\subseteq[n]} \sigma^2_{\max}(A_{I,:})\ (\le \|A\|_2^2)$, where $\sigma_{\max}(X)$ denotes the largest singular value of $X$.
- $g$ is $\mu$-strongly convex and $L$-smooth.

These assumptions again cover the smooth hinge loss as well as the quadratic loss. Under them, the convergence analysis for Algorithm 2 is almost the same as for Algorithm 1. The only difference comes from the primal step, where the approximate update introduces some error.

**Primal progress.** With the primal update rule in Algorithm 2, we have $\mathcal{L}(X^{(t+1)}, Y^{(t)}) - \mathcal{L}(X^{(t)}, Y^{(t)}) \le -\frac{\mu}{8L}\,\Delta_p^{(t)} + \frac{\epsilon}{16}$ (see Lemma A.7).

With little modification of the proof, we derive similar convergence guarantees for the trace norm ball.

**Theorem 4.3.** Let $\frac{1}{n}\sum_{i=1}^n f_i(a_i^\top X) + g(X)$ satisfy Assumption 4.2. Setting $s \ge \mathrm{rank}(\bar{X}^{(t)})$, the learning rate $\eta = \frac{\mu}{2L}$, and $\delta \le \frac{1}{k}\big(\frac{L}{\mu n\beta} + \frac{5\beta R}{2\alpha\mu n^2}(1 + \frac{8L}{\mu})\big)^{-1}$, the duality gap $\Delta^{(t)}$ generated by Algorithm 2 satisfies $\Delta^{(t)} \le \big(1 - \frac{k\delta}{k\delta + 8\beta n}\big)\Delta^{(t-1)} + \frac{\epsilon}{16}$. Therefore it reaches $\epsilon$ error within $O(\log\frac{1}{\epsilon})$ iterations (up to a problem-dependent factor).

We also provide a brief analysis of the difficulty of extending our algorithm to polytope-type constraints in Appendix A.9.

## 5 Experiments

We evaluate the Primal-Dual Block Generalized Frank-Wolfe algorithm on binary classification with the smoothed hinge loss; we refer the readers to Appendix A.7 for details about the smoothed hinge loss. We compare the proposed algorithm against five benchmark algorithms: (1) Accelerated Projected Gradient Descent (Acc PG), (2) the Frank-Wolfe algorithm (FW), (3) Stochastic Variance Reduced Gradient (SVRG) [15], (4) Stochastic Conditional Gradient Sliding (SCGS) [23], and (5) Stochastic Variance-Reduced Conditional Gradient Sliding (STORC) [13]. The time complexity of each algorithm is given in Table 1. Three of them (FW, SCGS, STORC) are projection-free algorithms, and the other two (Acc PG, SVRG) are projection-based. The algorithms are implemented in C++ with the Eigen linear algebra library [12]. The code to reproduce our results can be found at https://github.com/CarlsonZhuo/primal_dual_frank_wolfe.

The six datasets used here are summarized in Table 2; all of them can be found in the LIBSVM datasets [4]. We augment the features of MNIST, ijcnn, and cod-rna by random binning [32], a standard technique for kernel approximation. The data are normalized. We set the ℓ1 constraint to 300 and the ℓ2 regularization parameter to 10/n to achieve reasonable prediction accuracy. We refer the readers to Appendix C.1 for results with other parameter choices.
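For concreteness, the following sketch gives the standard smoothed hinge loss of [33] with smoothing parameter $\gamma = 1$, together with its convex conjugate, which is what $f_i^*$ would be in the dual step (10). The exact variant and parameter choice in Appendix A.7 may differ, so treat this as an assumption rather than the paper's definition.

```python
import numpy as np

def smoothed_hinge(z, gamma=1.0):
    """Standard smoothed hinge loss (as in [33]); with gamma = 1 it is
    1-smooth, matching the beta-smoothness required by Assumption 4.1."""
    out = np.zeros_like(z, dtype=float)
    mid = (z >= 1.0 - gamma) & (z < 1.0)
    out[mid] = (1.0 - z[mid]) ** 2 / (2.0 * gamma)
    low = z < 1.0 - gamma
    out[low] = 1.0 - z[low] - gamma / 2.0
    return out

def smoothed_hinge_conjugate(y, gamma=1.0):
    """Convex conjugate: finite (and gamma-strongly convex) only on [-1, 0]."""
    out = np.full_like(y, np.inf, dtype=float)
    dom = (y >= -1.0) & (y <= 0.0)
    out[dom] = y[dom] + gamma * y[dom] ** 2 / 2.0
    return out
```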
Figure 1: Convergence comparison of the different algorithms on the smoothed hinge loss. For six datasets, we show the decrease of the relative primal objective $(P(x^{(t)}) - P^\star)/P^\star$ over CPU time. Our algorithm (brown) achieves around a 10× speedup over all other methods, except on the smallest dataset, duke.

| Dataset Name | # Features | # Samples | # Non-Zeros | Solution Sparsity (Ratio) |
| --- | --- | --- | --- | --- |
| duke breast-cancer [4] | 7,129 | 44 | 313,676 | 423 (5.9%) |
| rcv1 [4] | 47,236 | 20,242 | 1,498,952 | 1,169 (2.5%) |
| news20.binary [4] | 1,355,191 | 19,996 | 9,097,916 | 1,365 (0.1%) |
| MNIST.RB 0 vs 9 [4, 32] | 894,499 | 11,872 | 1,187,200 | 8,450 (0.9%) |
| ijcnn.RB [4, 32] | 58,699 | 49,990 | 14,997,000 | 715 (1.2%) |
| cod-rna.RB [4, 32] | 81,398 | 59,535 | 5,953,500 | 958 (1.2%) |

Table 2: Summary of the properties of the datasets.

These datasets have various scales of features, samples, and solution sparsity ratios. The results are shown in Fig. 1. To focus on the convergence behavior, we show the decrease of the loss function rather than prediction accuracy. From Fig. 1, our proposed algorithm consistently outperforms the benchmark algorithms, and the winning margin is roughly proportional to the solution sparsity ratio, which is consistent with our theory. We also implement Algorithm 2 for the trace norm ball and compare it with prior work, especially Block FW [1], in Appendix C.2. We generate synthetic data with optimal solutions of different ranks, and show that our proposal is consistently faster than the others.

## 6 Conclusion

In this paper we consider a class of problems whose solutions enjoy a simple structure induced by the constraints. We propose a FW-type algorithm that exploits this simple structure and conducts partial updates, remarkably reducing the time cost of each update while attaining linear convergence. For a class of ERM problems, our running time depends on the sparsity/rank of the optimal solution rather than the ambient feature dimension. Our empirical studies verify the improved performance compared to various state-of-the-art algorithms.

**Acknowledgements.** This work is supported by NSF Grants 1618689, EECS-1609279, CCF-1302435, CNS-1704778, IIS-1546452, CCF-1564000, DMS-1723052, CCF-1763702, AF-1901292 and research gifts by Google, Western Digital and NVIDIA.

## References

[1] Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, and Yuanzhi Li. Linear convergence of a Frank-Wolfe type algorithm over trace-norm balls. In Advances in Neural Information Processing Systems, 2017.
[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 2008.
[3] Emmanuel J. Candes, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 2015.
[4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2011.
[5] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 2010.
[6] Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems, 2011.
[7] Ding-Zhu Du and Panos M. Pardalos. Minimax and Applications. Springer Science & Business Media, 2013.
[8] Simon S. Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. arXiv preprint arXiv:1802.01504, 2018.
[9] Miroslav Dudik, Zaid Harchaoui, and Jérôme Malick. Lifted coordinate descent for learning with trace-norm regularization. In Artificial Intelligence and Statistics, 2012.
[10] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets.
In 32nd International Conference on Machine Learning (ICML), 2015.
[11] Donald Goldfarb, Garud Iyengar, and Chaoxu Zhou. Linear convergence of stochastic Frank-Wolfe variants. arXiv preprint arXiv:1703.07269, 2017.
[12] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.
[13] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, 2016.
[14] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
[15] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
[16] Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization.
[17] Sai Praneeth Karimireddy, Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Efficient greedy coordinate descent for composite problems. arXiv preprint arXiv:1810.06999, 2018.
[18] Thomas Kerdreux, Alexandre d'Aspremont, and Sebastian Pokutta. Restarting Frank-Wolfe. In International Conference on Machine Learning, 2019.
[19] Subhash Khot. Hardness of approximating the shortest vector problem in lattices. Journal of the ACM (JACM), 2005.
[20] Piyush Kumar and E. Alper Yıldırım. A linearly convergent linear-time first-order algorithm for support vector classification with a core set result. INFORMS Journal on Computing, 2011.
[21] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.
[22] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. arXiv preprint arXiv:1207.4747, 2012.
[23] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 2016.
[24] Qi Lei, Ian E. H. Yen, Chao-yuan Wu, Inderjit S. Dhillon, and Pradeep Ravikumar. Doubly greedy primal-dual coordinate descent for sparse empirical risk minimization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
[25] Qi Lei, Kai Zhong, and Inderjit S. Dhillon. Coordinate-wise power method. In Advances in Neural Information Processing Systems, 2016.
[26] Gerard G. L. Meyer. Accelerated Frank-Wolfe algorithms. SIAM Journal on Control, 1974.
[27] Ricardo Ñanculef, Emanuele Frandi, Claudio Sartori, and Héctor Allende. A novel Frank-Wolfe algorithm: Analysis and applications to large-scale SVM training. Information Sciences, 2014.
[28] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 2012.
[29] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
[30] Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning, 2015.
[31] Ting Kei Pong, Paul Tseng, Shuiwang Ji, and Jieping Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 2010.
[32] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2008.
[33] Shai Shalev-Shwartz and Tong Zhang.
Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 2013.
[34] Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, 2014.
[35] Jialei Wang and Lin Xiao. Exploiting strong convexity from data with primal-dual first-order algorithms. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
[36] Andrés Weintraub, Carmen Ortiz, and Jaime González. Accelerating convergence of the Frank-Wolfe algorithm. Transportation Research Part B: Methodological, 1985.
[37] Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. Journal of Machine Learning Research, 2019.
[38] Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. arXiv preprint arXiv:1409.3257, 2014.
[39] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005.