Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization

Geoffrey Négiar (1), Gideon Dresdner (2), Alicia Yi-Ting Tsai (1), Laurent El Ghaoui (1,3), Francesco Locatello (2,4), Robert M. Freund (5), Fabian Pedregosa (6)

(1) Berkeley AI Research, University of California, Berkeley, CA, USA. (2) Department of Computer Science, ETH Zurich, Switzerland. (3) SumUp Analytics. (4) Max Planck Institute for Intelligent Systems, Tübingen, Germany. (5) MIT Sloan School of Management. (6) Google Research. Correspondence to: Geoffrey Négiar.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient) algorithm for constrained smooth finite-sum minimization with a generalized linear prediction/structure. This class of problems includes empirical risk minimization with sparse, low-rank, or other structured constraints. The proposed method is simple to implement, does not require step-size tuning, and has a constant per-iteration cost that is independent of the dataset size. Furthermore, as a byproduct of the method we obtain a stochastic estimator of the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the setting, the proposed method matches or improves on the best computational guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several datasets highlight different regimes in which the proposed method exhibits a faster empirical convergence than related methods. Finally, we provide an implementation of all considered methods in an open-source package.

Table 1. Worst-case convergence rates for the function suboptimality after t iterations, for a dataset with n samples. Here $\kappa \le n$, and $\kappa$ can be much smaller than n for datasets of interest; $\kappa$ is introduced in Section 5.

    Related work              Convex            Non-convex
    Frank & Wolfe (1956)      O(n/t)            O(n/sqrt(t))
    Mokhtari et al. (2018)    O(1/t^{1/3})      —
    Lu & Freund (2018)        O(n/t)            —
    This work                 O(κ/t)            → 0

1 Introduction

We consider constrained finite-sum optimization problems of the form

$$\min_{w \in \mathcal{C}} \ \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w), \tag{OPT}$$

where $\mathcal{C}$ is a compact and convex set and $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times d}$ is a data matrix, with n samples and d features. This template includes several problems of interest, such as constrained empirical risk minimization. The LASSO (Tibshirani, 1996) may be written in this form, with $f_i(x_i^\top w) = \frac{1}{2}(x_i^\top w - y_i)^2$ and $\mathcal{C} = \{w : \|w\|_1 \le \lambda\}$ for some parameter $\lambda$. We focus on the case where the $f_i$ are differentiable with L-Lipschitz derivative, and study both the convex and the non-convex case.

The classical Frank-Wolfe (FW) or Conditional Gradient algorithm (Frank & Wolfe, 1956; Levitin & Polyak, 1966; Demyanov & Rubinov, 1967) is an algorithm for constrained optimization. Contrary to projection-based constrained optimization algorithms, such as Projected Gradient Descent, it relies on a Linear Minimization Oracle (LMO) over the constraint set $\mathcal{C}$, rather than on a Quadratic Minimization Oracle (the projection subroutine). For certain constraint sets, such as the trace-norm ball or most $\ell_p$ balls, the LMO can be computed more efficiently than the projection. Recently, the Frank-Wolfe algorithm has garnered much attention in the machine learning community, where polytope constraints and sparsity are of large interest, e.g. Jaggi (2013); Lacoste-Julien & Jaggi (2015); Locatello et al. (2017).
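To make the LMO concrete for the $\ell_1$ ball used throughout this paper: minimizing a linear function over the ball is attained at a signed, scaled coordinate vector, so the oracle is a single O(d) pass over its input. The following sketch is ours and is not taken from the paper's implementation:

```python
import numpy as np

def lmo_l1_ball(u, lam):
    """Linear minimization oracle for C = {s : ||s||_1 <= lam}.

    Returns an argmin of <s, u> over C, attained at the signed vertex
    -lam * sign(u_j) * e_j, where j = argmax_j |u_j|.
    """
    j = np.argmax(np.abs(u))
    s = np.zeros_like(u)
    s[j] = -lam * np.sign(u[j])
    return s

# Example: the oracle selects the coordinate with largest magnitude.
print(lmo_l1_ball(np.array([0.5, -2.0, 1.0]), lam=1.0))  # [0. 1. 0.]
```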
In the unconstrained setting, stochastic variance-reduced methods (Shalev-Shwartz & Zhang, 2013; Schmidt et al., 2013; Hofmann et al., 2015) exhibit the same iteration complexity as full-gradient (non-stochastic) methods, while having a much smaller per-iteration cost, usually at some (small) additional memory cost. This work takes a step towards designing such a method for Frank-Wolfe type algorithms, which remains an important open problem.

Our main contributions are:

1. A constant batch-size Stochastic Frank-Wolfe (SFW) algorithm for finite sums with linear prediction. We describe the method in Section 2 and discuss its computational and memory cost.

2. A non-asymptotic rate analysis for smooth and convex objectives. The suboptimality of the SFW algorithm after t iterations can be bounded as $O(\kappa/t)$, where $\kappa$ is a data-dependent constant discussed in Section 5. It is upper bounded by the sample size n but, depending on the setting, can be much smaller.

3. An asymptotic analysis for non-convex objectives. We prove that SFW converges to a stationary point for smooth but potentially non-convex functions. This is the first stochastic FW variant with convergence guarantees in this setting of large practical interest.

Finally, we compare the SFW algorithm with other stochastic Frank-Wolfe algorithms amenable to constant batch sizes on different machine learning tasks. These experiments show that the proposed method converges at least as fast as previous work, and notably faster on several instances.

1.1 Related Work

We split existing stochastic FW algorithms into two categories: methods with increasing batch size and methods with constant batch size.

Increasing batch-size Stochastic Frank-Wolfe. This variant allows the number of gradient evaluations to grow with the iteration number (Goldfarb et al., 2017; Hazan & Luo, 2016; Reddi et al., 2016). Because of the growing number of gradient evaluations, these methods converge towards a deterministic full-gradient FW algorithm and so asymptotically share its computational requirements. In this work we are instead interested in constant batch-size methods, in which the number of gradient evaluations does not increase with the iteration number. See Hazan & Luo (2016) for a detailed comparison of assumptions and complexities for Stochastic Frank-Wolfe methods with increasing batch sizes, in terms of both iterations and gradient calls.

Constant batch-size Stochastic Frank-Wolfe. These methods use a constant batch size b, chosen by the user as a hyperparameter. In the convex and smooth setting, Mokhtari et al. (2018) and Locatello et al. (2019) reach $O(1/t^{1/3})$ convergence rates. The rate of Locatello et al. (2019) further holds for non-smooth and non-Lipschitz objectives. Zhang et al. (2019) require second-order knowledge of the objective. Lu & Freund (2018) prove convergence of an averaged iterate at rate O(n/t), with n the number of samples in the dataset. Let us assume for simplicity a unit batch size. Since each iteration then involves only one data point, the per-iteration complexity of their method is a factor of n smaller than that of the full-gradient method. On the other hand, their method loses this factor in the rate in the number of iterations, reaching the same overall complexity as the deterministic full-gradient method.
Depending on the use case (large or small datasets), each of the rates reported in Lu & Freund (2018) and Mokhtari et al. (2018) can have an advantage over the other. In favorable cases, the rate of convergence achieved by our method is nearly independent of the number of samples in the dataset. In these cases, our method is therefore faster than both. In the worst case, it matches the O(n/t) bound of Lu & Freund (2018).

1.2 Notation

Throughout the paper we denote vectors in lowercase boldface letters (w), matrices in uppercase boldface letters (X), and sets in calligraphic letters (e.g., $\mathcal{C}$). We say a function f is L-smooth in the norm $\|\cdot\|$ if it is differentiable and its gradient is L-Lipschitz continuous with respect to $\|\cdot\|$, that is, if it verifies $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$ for all x, y in the domain (where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$). For a one-dimensional function f, this reduces to $|f'(z) - f'(z')| \le L |z - z'|$ for all z, z' in the domain. For the time-dependent vector $u_t$, we denote by $u_t^{(i)}$ its i-th coordinate. We distinguish $\mathbb{E}$, the full expectation taken with respect to all the randomness in the system, from $\mathbb{E}_t$, the conditional expectation with respect to the random index sampled at iteration t, conditioned on all randomness up to iteration t. Finally, $\mathrm{LMO}(u)$ returns an arbitrary element of $\arg\min_{s \in \mathcal{C}} \langle s, u \rangle$.

2.1 A Primal-Dual View on Frank-Wolfe

In this subsection, we present the Frank-Wolfe algorithm as an alternating optimization scheme on a saddle-point problem. This point of view motivates the design of the proposed SFW algorithm. The perspective is similar to the two-player game point of view of Abernethy & Wang (2017); Abernethy et al. (2018), which we express using convex conjugacy. We suppose here that f is closed, convex, and differentiable. Let us rewrite our initial problem (OPT) in the equivalent unconstrained formulation

$$\min_{w \in \mathbb{R}^d} \ f(Xw) + \imath_{\mathcal{C}}(w), \tag{1}$$

where $\imath_{\mathcal{C}}$ is the indicator function of $\mathcal{C}$: it is 0 over $\mathcal{C}$ and $+\infty$ outside of $\mathcal{C}$. We denote by $f^*$ the convex conjugate of f, that is, $f^*(\alpha) \stackrel{\mathrm{def}}{=} \max_{w} \langle \alpha, w \rangle - f(w)$. Whenever f is closed and convex, it is known that $f = (f^*)^*$, and so we can write $f(Xw) = \max_{\alpha} \{ -f^*(\alpha) + \langle Xw, \alpha \rangle \}$. Plugging this identity into the previous equation, we arrive at a saddle-point reformulation of the original problem:

$$\min_{w \in \mathbb{R}^d} \ \max_{\alpha \in \mathbb{R}^n} \ \Big\{ \mathcal{L}(w, \alpha) \stackrel{\mathrm{def}}{=} -f^*(\alpha) + \imath_{\mathcal{C}}(w) + \langle Xw, \alpha \rangle \Big\}. \tag{2}$$

This reformulation allows us to derive the Frank-Wolfe algorithm as an alternating optimization method on the saddle-point problem. To distinguish the algorithm in this section from the stochastic algorithm we propose, we denote the iterates in this section by $\bar\alpha_t, \bar w_t$. The first step of the Frank-Wolfe algorithm is to compute the gradient of the objective at the current iterate. In the saddle-point formulation, this corresponds to maximizing over the dual variable $\alpha$ at step t:

$$\bar\alpha_t \in \arg\max_{\alpha \in \mathbb{R}^n} \{ \mathcal{L}(\bar w_{t-1}, \alpha) = -f^*(\alpha) + \langle X \bar w_{t-1}, \alpha \rangle \} \iff \bar\alpha_t = \nabla f(X \bar w_{t-1}). \tag{3}$$

Then, the LMO step corresponds to fixing the dual variable and minimizing over the primal one, w. This gives

$$\bar s_t \in \arg\min_{w \in \mathbb{R}^d} \{ \mathcal{L}(w, \bar\alpha_t) = \imath_{\mathcal{C}}(w) + \langle w, X^\top \bar\alpha_t \rangle \} \iff \bar s_t = \mathrm{LMO}(X^\top \bar\alpha_t). \tag{4}$$

Note that from the definition of the LMO, $\bar s_t$ can always be chosen as an extreme point of the constraint set $\mathcal{C}$. We then update our iterate using the convex combination

$$\bar w_t = (1 - \gamma_t) \bar w_{t-1} + \gamma_t \bar s_t, \tag{5}$$

where $\gamma_t$ is a step size to be chosen. These updates determine the Frank-Wolfe algorithm.
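The alternation (3)-(5) translates directly into code. Below is a minimal NumPy sketch of the resulting deterministic algorithm; the helper names `grad_f` and `lmo` are ours, and `grad_f(theta)` is assumed to return $\nabla f(\theta)$ for $\theta = Xw$:

```python
import numpy as np

def frank_wolfe(X, grad_f, lmo, w0, n_iters=100):
    """Deterministic Frank-Wolfe via the primal-dual alternation
    of Section 2.1. Each iteration costs one full gradient (O(nd))
    plus one LMO call."""
    w = w0.copy()
    for t in range(n_iters):
        alpha = grad_f(X @ w)      # dual step (3): alpha_t = grad f(X w_{t-1})
        s = lmo(X.T @ alpha)       # primal step (4): s_t = LMO(X^T alpha_t)
        gamma = 2.0 / (t + 2.0)
        w = (1.0 - gamma) * w + gamma * s   # convex-combination update (5)
    return w
```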
2.2 The Stochastic Frank-Wolfe Algorithm

We now consider a variant in which we replace the exact maximization over the dual variable in (3) by a maximization over a single coordinate, chosen uniformly at random. Let us define the function f from $\mathbb{R}^n$ to $\mathbb{R}$ as $f(\theta) \stackrel{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n f_i(\theta_i)$. We can then write our original optimization problem as the minimization of $f(Xw)$ over $w \in \mathcal{C}$. Still alternating between the primal and the dual problems, we replace the maximization over the full vector $\alpha$ in (3) with an optimization along the coordinate i only. We obtain the update $\alpha_t^{(i)} = \frac{1}{n} f_i'(x_i^\top w_{t-1})$. Doing so changes the cost per iteration from O(nd) to O(d), and yields Algorithm 1.

Algorithm 1 Stochastic Frank-Wolfe

1: Initialization: $w_0 \in \mathcal{C}$, $\alpha_0 \in \mathbb{R}^n$, $r_0 = X^\top \alpha_0$
2: for t = 1, 2, ... do
3:   Sample $i \in \{1, \dots, n\}$ uniformly at random.
4:   Update $\alpha_t^{(i)} = \frac{1}{n} f_i'(x_i^\top w_{t-1})$
5:   Update $\alpha_t^{(j)} = \alpha_{t-1}^{(j)}$ for all $j \ne i$
6:   $r_t = r_{t-1} + (\alpha_t^{(i)} - \alpha_{t-1}^{(i)})\, x_i$
7:   $s_t = \mathrm{LMO}(r_t)$
8:   $w_t = w_{t-1} + \frac{2}{t+2}(s_t - w_{t-1})$
9: end for

We now describe our main contribution, Algorithm 1 (SFW). It follows the classical Frank-Wolfe algorithm, but replaces the gradient with a stochastic estimate. Throughout Algorithm 1, we maintain the following iterates: the iterate $w_t$; the stochastic estimator $\alpha_t \in \mathbb{R}^n$ of $\nabla f(Xw_{t-1})$; and the stochastic estimator $r_t \in \mathbb{R}^d$ of the full gradient of our loss, $X^\top \nabla f(Xw_{t-1})$.

Algorithm. At the beginning of iteration t, we have access to $\alpha_{t-1}$, $r_{t-1}$, and the iterate $w_{t-1}$. Thus equipped, we sample an index i uniformly at random from $\{1, \dots, n\}$. We then compute the derivative of our loss function for that datapoint at the current iterate, yielding $[\nabla f(Xw_{t-1})]_i = \frac{1}{n} f_i'(x_i^\top w_{t-1})$. We update the stochastic gradient estimator $\alpha_t$ by refreshing the contribution of the i-th datapoint and leaving the other coordinates untouched.

Remark 1. Coordinate j of our estimator $\alpha_t$ contains the latest sampled one-dimensional derivative $\frac{1}{n} f_j'(x_j^\top w)$, evaluated at the iterate w at which coordinate j was last updated.

To get $r_t$, we do the same: we remove the previous contribution of the i-th datapoint and add the refreshed contribution. Each iteration thus touches only the sampled row $x_i$, so in particular the full data matrix need not be held in memory. The rest of the algorithm proceeds as the deterministic Frank-Wolfe algorithm from the previous subsection: we find the update direction $s_t = \mathrm{LMO}(r_t)$, and we update our iterate as a convex combination of the previous iterate $w_{t-1}$ and $s_t$, whereby the new iterate is feasible.

Remark 2. Our algorithm requires keeping track of the vector $\alpha_t$, which amounts to storing one scalar per sample in memory. Our method thus carries the same small memory caveat as variance-reduced algorithms such as SDCA (Shalev-Shwartz & Zhang, 2013), SAG (Schmidt et al., 2013), or SAGA (Defazio et al., 2014). Despite the resemblance of our gradient estimator to the Stochastic Average Gradient (Schmidt et al., 2013), the convergence rate analyses are quite different.
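For concreteness, here is a minimal NumPy sketch of Algorithm 1 with unit batch size; the helper names `fi_prime` and `lmo` are ours, with `fi_prime(i, z)` assumed to return $f_i'(z)$:

```python
import numpy as np

def sfw(X, fi_prime, lmo, w0, n_iters, seed=0):
    """Sketch of Algorithm 1 (SFW), unit batch size. Maintains the
    gradient estimator alpha_t in R^n and r_t = X^T alpha_t in R^d;
    each iteration costs O(d) plus one LMO call."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = w0.copy()
    alpha = np.zeros(n)  # alpha_0 = 0, hence r_0 = 0 (cheap init, Section 6)
    r = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                   # line 3: sample i uniformly
        a_new = fi_prime(i, X[i] @ w) / n     # line 4: refresh coordinate i
        r += (a_new - alpha[i]) * X[i]        # line 6: incremental update of r_t
        alpha[i] = a_new
        s = lmo(r)                            # line 7: LMO step
        w += (2.0 / (t + 2.0)) * (s - w)      # line 8: convex combination
    return w, alpha, r
```

For $\ell_1$-constrained logistic regression, for instance, one would pass `fi_prime = lambda i, z: -y[i] / (1 + np.exp(y[i] * z))` together with the `lmo_l1_ball` sketch above.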
3.1 Preliminary Tools

Recall that in our setting, the objective function is $w \mapsto f(Xw)$, where $f(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta_i)$. We suppose that each $f_i$ is L-smooth, which then implies that f satisfies the following non-standard smoothness condition:

$$\|\nabla f(\theta) - \nabla f(\tilde\theta)\|_p \le \frac{L}{n} \|\theta - \tilde\theta\|_p \tag{6}$$

for every $p \in [1, \infty]$. Note that in this inequality, unlike in the standard definition of L-smoothness with respect to the $\ell_p$ norm, the same norm appears on both sides. This inequality is proven in Appendix A. In particular, it follows from (6) that f is (L/n)-smooth with respect to the $\ell_2$ norm. We therefore have the following quadratic upper bound on our objective function, valid for all w, v in the domain:

$$f(Xw) \le f(Xv) + \langle \nabla f(Xv), X(w - v) \rangle + \frac{L}{2n} \|X(w - v)\|_2^2. \tag{7}$$

For $p \in \{1, 2, \infty\}$, we define the diameters

$$D_p = \max_{u, v \in \mathcal{C}} \|X(u - v)\|_p. \tag{8}$$

Remark 3. For $p \in \{1, 2\}$, we have $D_p \le \sqrt[p]{n}\, D_\infty$.

3.2 Worst-Case Convergence Rates for Smooth and Convex Objectives

We state our main result in the L-smooth, convex setting. In this section, we suppose that the $f_i$ are L-smooth and convex and that $f(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta_i)$ for all $\theta$. The objective function then satisfies (6), as noted previously.

Theorem 1. Let $H_0 \stackrel{\mathrm{def}}{=} \|\alpha_0 - \nabla f(Xw_0)\|_1$ be the initial error of our gradient estimator and let $w^\star \in \mathcal{C}$ be a solution to (OPT). We run Algorithm 1 with step sizes $\gamma_t = 2/(t+2)$. At time step $t \ge 2$, the expected primal suboptimality $\mathbb{E}\varepsilon_t = \mathbb{E}[f(Xw_t) - f(Xw^\star)]$ has the following upper bound:

$$\mathbb{E}\varepsilon_t \le \frac{2L \big( D_2^2 + 4(n-1) D_1 D_\infty \big)\, t}{n\,(t+1)(t+2)} + \frac{2\varepsilon_0 + \big( 2 D_\infty H_0 + 64 L D_1 D_\infty \big)\, n^2}{(t+1)(t+2)}. \tag{9}$$

Remark 4. The rate of the proposed method in terms of gradient calls is also given by (9) (one gradient call per iteration), whereas for deterministic Frank-Wolfe, the (deterministic) suboptimality after t gradient calls has the following upper bound (Jaggi, 2013; Hazan & Luo, 2016):

$$\varepsilon_t \le \frac{2 L D_2^2}{t}. \tag{10}$$

In this paper, we only discuss unit batch size. We can adapt our algorithm and proofs to sampling a mini-batch of b datapoints at each step. The leading term in the rate from Theorem 1 then changes: we use $\rho = 1 - \frac{b}{n}$ in Lemma 3, and the overall rate is modified accordingly. The per-iteration complexity then becomes O(bd).

We first sketch the outline of the proof before delving into details. The proof of this convergence rate builds on three key lemmas. The first is an adaptation of Lemma 2 of Mokhtari et al. (2018), which bounds the suboptimality at step t by the sum of a contraction of the suboptimality at step t−1, a vanishing term due to smoothness, and a last term depending on our gradient estimator's error in $\ell_1$ norm. The first two terms show up in the convergence proof of full-gradient Frank-Wolfe; see Lacoste-Julien & Jaggi (2015). The last term is an error, or noise, term. If the error term vanishes fast enough, we can fall back on the full-gradient proof technique (Frank & Wolfe, 1956; Jaggi, 2013). From there, we show in Lemma 2 that the error term verifies a particular recursive inequality. In Lemma 3, we then leverage this inequality to prove that the error term vanishes as O(1/t), finally allowing us to obtain the promised rate. The formal statements of these lemmas follow.

Lemma 1. Let $f_i$ be convex and L-smooth for all i. For any direction $\alpha_t \in \mathbb{R}^n$, define $s_t = \mathrm{LMO}(X^\top \alpha_t)$, $w_t = (1 - \gamma_t) w_{t-1} + \gamma_t s_t$, and $H_t = \|\alpha_t - \nabla f(Xw_{t-1})\|_1$. We have the following upper bound on the primal suboptimality at step t:

$$\varepsilon_t \le (1 - \gamma_t)\, \varepsilon_{t-1} + \frac{\gamma_t^2 L D_2^2}{2n} + \underbrace{\gamma_t D_\infty H_t}_{\text{error term}}. \tag{11}$$

We defer this proof to Appendix B.

Remark 5. This lemma holds for any direction $\alpha_t \in \mathbb{R}^n$, not necessarily the $\alpha_t$ given by the SFW algorithm.

Remark 6. This lemma generalizes the key inequality used in many proofs in the Frank-Wolfe literature (Jaggi, 2013), but includes an extra error term to account for the fact that the direction $\alpha_t$, which we use for the LMO step and therefore to compute the updated iterate, is not the true gradient. If $\alpha_t = \nabla f(Xw_{t-1})$, that is, if we compute the gradient on the full dataset, then $H_t = 0$ and we recover the standard quadratic upper bound.
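For diagnostic purposes, the error term $H_t$ of Lemma 1 can be evaluated exactly, at the O(nd) cost of one full-gradient pass (so only on small problems). A sketch, continuing the hypothetical helper names used above:

```python
import numpy as np

def estimator_error(X, fi_prime, alpha, w):
    """H_t = ||alpha_t - grad f(X w_{t-1})||_1, where the true gradient
    has coordinates (1/n) f_i'(x_i^T w)."""
    n = X.shape[0]
    theta = X @ w
    true_grad = np.array([fi_prime(i, theta[i]) for i in range(n)]) / n
    return float(np.abs(alpha - true_grad).sum())
```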
In the following, $\alpha_t$ is the direction given by Algorithm 1, and the $\ell_1$ error term is defined in terms of that $\alpha_t$:

$$H_t \stackrel{\mathrm{def}}{=} \|\alpha_t - \nabla f(Xw_{t-1})\|_1 \tag{12}$$

for t > 0, and $H_0 = \|\alpha_0 - \nabla f(Xw_0)\|_1$. Notice that we define the gradient estimator's error with the $\ell_1$ norm. The previous lemma also holds with the $\ell_2$ norm of the gradient error (replacing $D_\infty$ by $D_2$). We prefer the $\ell_1$ norm because of the finite-sum assumption: it induces a coordinate-wise separation over $\alpha_t$, which corresponds to a datapoint-wise separation. The following lemma crucially leverages this assumption to upper bound the $H_t$ given by the SFW algorithm.

Lemma 2. For the stochastic gradient estimator $\alpha_t$ given by Algorithm 1 (SFW), we can upper bound $H_t = \|\alpha_t - \nabla f(Xw_{t-1})\|_1$ in conditional expectation as follows:

$$\mathbb{E}_t H_t \le \Big(1 - \frac{1}{n}\Big) \Big( H_{t-1} + \gamma_{t-1} \frac{L D_1}{n} \Big). \tag{13}$$

Proof. We have the following expression for $\alpha_t$, supposing that index i was sampled at step t:

$$\alpha_t = \alpha_{t-1} + \Big( \frac{1}{n} f_i'(x_i^\top w_{t-1}) - \alpha_{t-1}^{(i)} \Big)\, e_i, \tag{14}$$

where $e_i$ is the i-th vector of the canonical basis of $\mathbb{R}^n$. Consider a fixed coordinate j. Since there is a 1/n chance of $\alpha^{(j)}$ being updated to $\frac{1}{n} f_j'(x_j^\top w_{t-1})$, taking conditional expectations we have

$$H_t^j \stackrel{\mathrm{def}}{=} \Big| \alpha_t^{(j)} - \frac{1}{n} f_j'(x_j^\top w_{t-1}) \Big|, \tag{15}$$

$$\mathbb{E}_t H_t^j = \Big(1 - \frac{1}{n}\Big) \Big| \alpha_{t-1}^{(j)} - \frac{1}{n} f_j'(x_j^\top w_{t-1}) \Big|. \tag{16}$$

Summing over all coordinates, we then have

$$\mathbb{E}_t H_t = \sum_{j=1}^n \mathbb{E}_t H_t^j \tag{17}$$

$$= \Big(1 - \frac{1}{n}\Big) \underbrace{\|\alpha_{t-1} - \nabla f(Xw_{t-1})\|_1}_{\delta_{t-1}}. \tag{18}$$

We denote the $\ell_1$-norm term by $\delta_{t-1}$ for ease. Let us introduce the full gradient at the previous step, $\nabla f(Xw_{t-2})$, and use the triangle inequality. Our finite-sum assumption gives that for all $j \in \{1, \dots, n\}$ and $w \in \mathcal{C}$, $[\nabla f(Xw)]_j = \frac{1}{n} f_j'(x_j^\top w)$. Then, we separate the $\ell_1$ norm, use the L-smoothness of each $f_j$, and use the definition of $w_{t-1}$:

$$\delta_{t-1} \le H_{t-1} + \|\nabla f(Xw_{t-2}) - \nabla f(Xw_{t-1})\|_1 \tag{19}$$

$$\le H_{t-1} + \frac{L}{n} \sum_{j=1}^n |x_j^\top (w_{t-1} - w_{t-2})| \tag{20}$$

$$= H_{t-1} + \gamma_{t-1} \frac{L}{n} \sum_{j=1}^n |x_j^\top (s_{t-1} - w_{t-2})| \tag{21}$$

$$= H_{t-1} + \gamma_{t-1} \frac{L}{n} \|X(s_{t-1} - w_{t-2})\|_1, \tag{22}$$

where we used $w_{t-1} - w_{t-2} = \gamma_{t-1}(s_{t-1} - w_{t-2})$. Finally, using the definition of the diameter $D_1$, we obtain inequality (13).

Now, we can use the structure of this recurrence to obtain the desired rate of convergence for our gradient estimator. We state this in the following lemma.

Lemma 3. Let $\gamma_t = 2/(t+2)$. We have the following bound on the expected error $\mathbb{E}H_t$, for $t \ge 2$:

$$\mathbb{E}H_t \le \frac{2 L D_1}{n} \left( \frac{2(n-1)}{t+2} + \Big(1 - \frac{1}{n}\Big)^{t/2} \log t \right) + \Big(1 - \frac{1}{n}\Big)^{t} H_0. \tag{23}$$

Remark 7. Our gradient estimator's error in $\ell_1$ norm goes to zero as $O(D_1/t)$. This rate depends on the assumption of the separability of f into a finite sum of L-smooth $f_i$. On the other hand, it does not require that each (or any) $f_i$ be convex.

Proof. Consider a general sequence of nonnegative numbers $u_0, u_1, u_2, \dots, u_t \in \mathbb{R}_+$ for which the following recurrence holds for all t:

$$u_t \le \rho \Big( u_{t-1} + \frac{K}{t+1} \Big), \tag{24}$$

where $0 < \rho < 1$ and $K > 0$ are scalars. First note that all the iterates are nonnegative. Unrolling the recurrence, we obtain

$$u_t \le \rho^t u_0 + K \sum_{k=1}^{t} \frac{\rho^{t-k+1}}{k+1} \le \rho^t u_0 + K \left( \rho^{t/2} \sum_{k=1}^{\lfloor t/2 \rfloor} \frac{1}{k+1} + \frac{2}{t+2} \sum_{k=\lfloor t/2 \rfloor + 1}^{t} \rho^{t-k+1} \right).$$

To go from the middle expression to the right-hand side, we observe that for the old terms with large step sizes, we are saved by the higher power in the geometric term; for the more recent terms, the step size is small enough to ensure convergence. More formally, in the early terms ($1 \le k \le \lfloor t/2 \rfloor$) we upper bound $\rho^{t-k+1}$ by $\rho^{t/2}$, and in the later terms ($\lfloor t/2 \rfloor + 1 \le k \le t$) we upper bound $\frac{1}{k+1}$ by $\frac{2}{t+2}$. To obtain the full rate, we now study both parts separately. For the first part, we use knowledge of the harmonic series:

$$\sum_{k=1}^{\lfloor t/2 \rfloor} \frac{1}{k+1} \le \log\Big(\frac{t}{2} + 1\Big); \tag{25}$$

for $t \ge 2$, we can upper bound $\log(\frac{t}{2} + 1)$ by $\log t$. For the second part, we use knowledge of the geometric series:

$$\sum_{k=\lfloor t/2 \rfloor + 1}^{t} \rho^{t-k+1} \le \frac{\rho}{1 - \rho}. \tag{26}$$

Finally, for $t \ge 2$,

$$0 \le u_t \le K \left( \frac{\rho}{1-\rho} \cdot \frac{2}{t+2} + \rho^{t/2} \log t \right) + \rho^t u_0. \tag{27}$$
The expected error $\mathbb{E}H_t$ verifies these general conditions with $u_0 = H_0 = \|\alpha_0 - \nabla f(Xw_{-1})\|_1$ (defining $w_{-1} \stackrel{\mathrm{def}}{=} w_0$ for the sake of the proof), $\rho = 1 - \frac{1}{n}$, and $K = \frac{2 L D_1}{n}$. Specifying these values gives us the claimed bound.

The remainder of the proof of Theorem 1 follows the usual Frank-Wolfe proofs in the full-gradient case, which can be found e.g. in Frank & Wolfe (1956); Jaggi (2013). Here is a brief sketch of these steps: we tie the three key lemmas together, plugging the bound on $\mathbb{E}H_t$ given by Lemma 3 into the upper bound on the suboptimality at step t given by Lemma 1. Specifying the step size $\gamma_t = 2/(t+2)$ and scaling the bounds by a factor of $(t+1)(t+2)$, we obtain a telescoping sum, allowing us to upper bound the expected suboptimality at the latest step considered. The details are deferred to Appendix C.

3.3 Worst-Case Convergence Rates for Smooth, Non-Convex Objectives

We start by recalling the definition of the Frank-Wolfe gap:

$$g_t = \max_{s \in \mathcal{C}} \ \langle \nabla f(Xw_{t-1}), X(w_{t-1} - s) \rangle. \tag{28}$$

Previous work (Jaggi, 2013) has shown the importance of the Frank-Wolfe gap. In the convex setting, it is a primal-dual gap and, as such, upper bounds both primal and dual suboptimalities. In the general non-convex setting, it is a measure of near-stationarity. We define a stationary point as any point $w^\star$ such that $\langle \nabla f(Xw^\star), X(w - w^\star) \rangle \ge 0$ for all $w \in \mathcal{C}$ (Bertsekas, 1999). From this definition, it is clear that the Frank-Wolfe gap $g_t$ is zero only at a stationary point. In this section, we suppose that each $f_i$, $i \in \{1, \dots, n\}$, is L-smooth but not necessarily convex. The following theorem states that we can still obtain a stationary point from Algorithm 1.

Theorem 2. Let $w_t$ be computed according to Algorithm 1. Then

$$\liminf_{t \to \infty} \ \mathbb{E}\, g_t = 0, \tag{29}$$

where $g_t$ is the Frank-Wolfe gap. The proof of this result is deferred to Appendix F.

4 Stopping Criterion

In this section, we define a natural stochastic Frank-Wolfe gap and explain why it can be used as a stopping criterion. We recall the definition of the true Frank-Wolfe gap $g_t$ and define the stochastic Frank-Wolfe gap $\hat g_t$ as

$$g_t = \max_{s \in \mathcal{C}} \ \langle \nabla f(Xw_{t-1}), X(w_{t-1} - s) \rangle, \tag{30}$$

$$\hat g_t = \max_{s \in \mathcal{C}} \ \langle \alpha_t, X(w_{t-1} - s) \rangle, \tag{31}$$

for $\alpha_t$ given by SFW. The Frank-Wolfe gap's properties make estimating it very desirable: when the gap is small at a given iteration of a Frank-Wolfe-type algorithm, we can guarantee that we are close to an optimum (or to a stationary point in the general non-convex case). Unfortunately, since it depends on the full gradient, computing this gap can be impractical on datasets with many samples. The following proposition shows that the stochastic Frank-Wolfe gap estimator resulting from Algorithm 1 can be used as a proxy for the true Frank-Wolfe gap.

Proposition 1. For $\alpha_t$ given by Algorithm 1, we can bound the distance between the stochastic Frank-Wolfe gap and the true Frank-Wolfe gap as follows:

$$|g_t - \hat g_t| \le D_\infty H_t, \tag{32}$$

which, combined with Lemma 3, yields the following bound in expectation:

$$\mathbb{E} |g_t - \hat g_t| \le \frac{2 L D_1 D_\infty}{n} \left( \frac{2(n-1)}{t+2} + \Big(1 - \frac{1}{n}\Big)^{t/2} \log t \right) + \Big(1 - \frac{1}{n}\Big)^{t} D_\infty H_0. \tag{33}$$

We defer the proof to Appendix E. If $\hat g_t$ goes to 0, then the true Frank-Wolfe gap is expected to vanish as well. We therefore propose to use $\hat g_t$, which is computed as a byproduct of our SFW algorithm, as a heuristic stopping criterion, but defer a more in-depth theoretical and empirical analysis of this gap to future work.
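For the $\ell_1$ ball $\mathcal{C} = \{s : \|s\|_1 \le \lambda\}$, the maximization in (31) has a closed form in terms of the vector $r_t = X^\top \alpha_t$ that Algorithm 1 already maintains: $\hat g_t = \langle r_t, w_{t-1} \rangle + \lambda \|r_t\|_\infty$. The following sketch (ours) computes it in O(d):

```python
import numpy as np

def stochastic_fw_gap_l1(r, w, lam):
    """Stochastic Frank-Wolfe gap (31) over the l1 ball of radius lam,
    computed from r_t = X^T alpha_t maintained by Algorithm 1:
    g_hat_t = <r_t, w_{t-1}> + lam * ||r_t||_inf."""
    return float(r @ w + lam * np.max(np.abs(r)))
```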
5 Discussion

In this section, we compare the convergence rates of the proposed SFW, Lu & Freund (2018), and Mokhtari et al. (2018), as shown in Table 1. We use big-O notation, focusing only on the dependencies on n and t in the upper bound on the suboptimality at step t. To make a fair comparison, including dependencies on n, the number of samples, we first standardize notations across papers. Lu & Freund (2018) use the same formal setting as ours, where $x_i^\top w$ is the argument of the i-th objective $f_i$ and the full objective is the average of these. Mokhtari et al. (2018) place themselves in a more general setting, where they only assume access to an unbiased estimator of the full gradient. For ease of comparison, we rewrite the two algorithms of Lu & Freund (2018) and Mokhtari et al. (2018) in Appendix G using our notations. Because of their more general setting, the Lipschitz constant $L_{\mathrm{mok}}$ appearing in Mokhtari et al. (2018) can be bounded as $L_{\mathrm{mok}} \le L \max_i \|x_i\|_2^2$ (using Cauchy-Schwarz), which is independent of n. Their diameter constant $D_{\mathrm{mok}} = \max_{u, v \in \mathcal{C}} \|u - v\|_2$ is also independent of n. Finally, their term $\sigma^2$, controlling the variance of their stochastic estimator, is likewise n-independent. Under this notation, their convergence rate (Theorem 3, Mokhtari et al. (2018)) is $O(1/t^{1/3})$, with no dependency on n, as expected. Lu & Freund (2018) have a detailed discussion of the rate of their method, and achieve an overall rate of O(n/t).

To fairly compare these rates to the one given by Theorem 1, we must consider the $D_1$ and $D_\infty$ terms, which may depend on the number of samples n. The rate we obtain has a leading term of $O(D_1 D_\infty / t)$ and a second term of $O(D_1 D_\infty n^2 / t^2)$. The second term is dominated by the first in the regime $t > n^2$. Defining $\kappa = D_1 / D_\infty$, we can write $D_1 D_\infty$ as $\kappa D_\infty^2$. We have $\kappa \le n$, meaning that in the worst case this bound matches the one of Lu & Freund (2018). When the constraint set is the $\ell_1$ ball $\{w : \|w\|_1 \le \lambda\}$, we have the following closed-form expression:

$$\kappa = \frac{\max_j \sum_{i=1}^n |X_{ij}|}{\max_{i,j} |X_{ij}|}. \tag{34}$$

We can therefore easily compute it for given datasets.

Remark 8. We briefly remark that if, for every feature, the contribution of that feature is limited to a few datapoints, this ratio will be small, and the overall bound therefore does not depend on the number of samples. This tends to happen for TF-IDF text representations and for fat-tailed data. A formal analysis of this ratio exceeds the scope of this paper, and we defer it to future work.

We report values of $\kappa$ for the considered datasets in Section 7.
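The ratio (34) amounts to two reductions over the data matrix; a minimal sketch (ours):

```python
import numpy as np

def kappa_l1(X):
    """kappa = D_1 / D_inf for an l1-ball constraint, via the closed
    form (34): max_j sum_i |X_ij| divided by max_ij |X_ij|."""
    A = np.abs(X)
    return float(A.sum(axis=0).max() / A.max())
```

Dividing the result by n gives quantities comparable to the $\kappa/n$ column of Table 2.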
6 Implementation Details

Our implementation is available in the copt package (https://github.com/openopt/copt).

Initialization. We use the cheapest possible initialization: our initial stochastic gradient estimator $\alpha_0$ starts out at 0. We then also have $r_0 = 0$.

Sparsity in X. Suppose there are at most s non-zero features for any datapoint $x_i$. Then, for instances where $\mathcal{C}$ is an $\ell_1$ ball, all updates in the SFW algorithm can be implemented using only the support of the current datapoint, making the per-iteration cost of SFW O(s) instead of O(d). Large-scale datasets are often extremely sparse, so leveraging this sparsity is crucial. For example, in the LibSVM dataset suite, 8 of the 11 datasets with more than a million samples have a density between $10^{-4}$ and $10^{-6}$.

7 Experiments

We compare the proposed SFW algorithm with the other constant batch-size algorithms of Mokhtari et al. (2018) and Lu & Freund (2018).

Experimental Setting. We consider $\ell_1$-constrained logistic regression problems on the BREAST CANCER and RCV1 datasets, and an $\ell_1$-constrained least-squares regression problem on the CALIFORNIA HOUSING dataset, all from the UCI dataset repository (Dua & Graff, 2017). See Table 2 for details and links.

Table 2. Datasets and tasks used in the experiments.

    Dataset              n        d        κ/n     f_i                              C
    BREAST CANCER        683      10       0.929   log(1 + exp(−y_i x_i^T w))       {‖w‖₁ ≤ λ}, λ = 5
    RCV1                 20,242   47,236   0.021   log(1 + exp(−y_i x_i^T w))       {‖w‖₁ ≤ λ}, λ = 100
    CALIFORNIA HOUSING   20,640   8        0.040   ½(y_i − x_i^T w)²                {‖w‖₁ ≤ λ}, λ = 0.1

[Figure 1. Comparison of our SFW method to the related works of Lu & Freund (2018) and Mokhtari et al. (2018). From left to right: BREAST CANCER, RCV1, and CALIFORNIA HOUSING. We plot relative suboptimality against the number of sampled gradients processed, in log-log scale, to show empirical rates of convergence. We use batch size b = n/100.]

We compare the relative suboptimality of each method, given at step t by $(f(Xw_t) - f_{\min}) / (f_{\max} - f_{\min})$, where $f_{\min}$ and $f_{\max}$ are the smallest and largest function values encountered by any of the compared methods. We compute these values at regular time intervals (the same for each method), chosen according to problem size to limit the duration of each run. We use batches of 1% of the dataset at each step, following Lu & Freund (2018). Within a batch, data points are sampled without replacement. We plot these values as a function of the number of gradient evaluations, equal to the number of iterations times the batch size b: for all of the considered methods, an iteration involves exactly b gradient evaluations and one call to the LMO. This allows us to fairly compare convergence speeds in practice.

Compared to the methods of Mokhtari et al. (2018) and Lu & Freund (2018), the proposed SFW achieves lower suboptimality for a given number of iterations on the considered tasks and datasets. We have no explanation for the initial regime on the CALIFORNIA HOUSING dataset, before the methods start exhibiting what resembles the sublinear rate the theory prescribes. Notice that the RCV1 dataset has the lowest $\kappa/n$ (due to the sparsity of its TF-IDF representation), and that the method presented in this paper performs particularly well on this dataset.

Comparison with Mokhtari et al. (2018). Although the step size in our SFW algorithm and the one proposed in that paper are of the same order of magnitude, O(1/t), Mokhtari et al. (2018) use $f_i'(x_i^\top w_{t-1})$ instead of our $\frac{1}{n} f_i'(x_i^\top w_{t-1})$, because they require an unbiased estimator. Their choice induces higher variance, which in turn requires the algorithm to use momentum with a vanishing step size in their stochastic gradient estimator, damping the contributions of the later gradients (using $\rho_t = 1/t^{2/3}$; see the pseudo-code in Appendix G). This may explain why the method proposed in Mokhtari et al. (2018) achieves slower convergence.
On the contrary, the lower variance of our estimator $\alpha_t$ allows us to give the same weight to the contributions of later gradients as to earlier ones, and to forget all but the last gradient computed at a given datapoint.

Comparison with Lu & Freund (2018). The method of Lu & Freund (2018) computes the gradient at an averaged iterate, putting more weight on earlier iterates, which makes it more conservative. This may explain its slower convergence relative to the SFW algorithm proposed in this paper in certain settings.

8 Conclusion and Future Work

Similarly to methods from the variance-reduction literature such as SAG, SAGA, and SDCA, we propose a Stochastic Frank-Wolfe algorithm tailored to the finite-sum setting. Our method is a step towards attaining, in the smooth convex setting, an iteration complexity comparable to that of deterministic, true-gradient Frank-Wolfe, at a per-iteration cost which, in favorable settings, can be nearly independent of the number of samples in the dataset. Our rate of convergence depends on the norm ratio $\kappa$ of the dataset, which is related to a measure of the weight of the data distribution's tails. We will explore this intriguing fact in future work.

We propose a stochastic Frank-Wolfe gap estimator, which may be used as a heuristic stopping criterion, including in the non-convex setting. Its distance to the true gap may be difficult to evaluate numerically; obtaining a practical bound on this distance is an interesting avenue for future work.

Guélat & Marcotte (1986) and Lacoste-Julien & Jaggi (2015) have proposed variants of the FW algorithm that converge linearly on polytope constraint sets for strongly convex objectives: the Away-Steps Frank-Wolfe and the Pairwise Frank-Wolfe. Goldfarb et al. (2017) studied stochastic versions of these and showed linear convergence over polytopes using increasing batch sizes. Our SFW algorithm, the natural stochastic gap, and the analyses in this paper should be amenable to such variants as well, which we plan to explore in future work.

Acknowledgments

The authors would like to thank Donald Goldfarb for early encouragement in this direction of research, and Armin Askari, Sara Fridovich-Keil, Yana Hasson, Thomas Kerdreux, Nicolas Le Roux, Romain Lopez, Grégoire Mialon, Courtney Paquette, Hector Roux de Bézieux, Alice Schoenauer-Sebag, Dhruv Sharma, Yi Sun, and Nilesh Tripuraneni for their constructive criticism on drafts of this paper. The authors also warmly thank Maria-Luiza Vladareanu for finding and reporting an error in an earlier draft's proof, and Alex Belloni and Jose Moran for discussions as well. Francesco Locatello is supported by the Max Planck ETH Center for Learning Systems, by an ETH core grant (to Gunnar Rätsch), and by a Google Ph.D. Fellowship. Robert Freund's research is supported by AFOSR Grant No. FA9550-19-1-0240.

References

Abernethy, J., Lai, K. A., Levy, K. Y., and Wang, J.-K. Faster rates for convex-concave games. In Proceedings of the 31st Conference on Learning Theory, 2018.

Abernethy, J. D. and Wang, J.-K. On Frank-Wolfe and equilibrium computation. In Advances in Neural Information Processing Systems 30, 2017.

Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 1999.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, 2014.

Demyanov, V. and Rubinov, A. The minimization of a smooth convex functional on a convex set. SIAM Journal on Control, 1967.
Dua, D. and Graff, C. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics (NRL), 1956.

Goldfarb, D., Iyengar, G., and Zhou, C. Linear convergence of stochastic Frank-Wolfe variants. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Guélat, J. and Marcotte, P. Some comments on Wolfe's "away step". Mathematical Programming, 1986.

Hazan, E. and Luo, H. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, 2016.

Hofmann, T., Lucchi, A., Lacoste-Julien, S., and McWilliams, B. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, 2015.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.

Levitin, E. S. and Polyak, B. T. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 1966.

Locatello, F., Khanna, R., Tschannen, M., and Jaggi, M. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Locatello, F., Yurtsever, A., Fercoq, O., and Cevher, V. Stochastic Frank-Wolfe for composite convex minimization. In Advances in Neural Information Processing Systems 32, 2019.

Lu, H. and Freund, R. M. Generalized stochastic Frank-Wolfe algorithm with stochastic substitute gradient for structured convex optimization. arXiv, 2018.

Mokhtari, A., Hassani, H., and Karbasi, A. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv, 2018.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. Stochastic Frank-Wolfe methods for nonconvex optimization. In 54th Annual Allerton Conference on Communication, Control, and Computing, 2016.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2013.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 2013.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267-288, 1996.

Zhang, M., Shen, Z., Mokhtari, A., Hassani, H., and Karbasi, A. One sample stochastic Frank-Wolfe. arXiv, 2019.