# Stochastic Frank-Wolfe for Composite Convex Minimization

Francesco Locatello¹ · Alp Yurtsever² · Olivier Fercoq³ · Volkan Cevher²
francesco.locatello@inf.ethz.ch · {alp.yurtsever,volkan.cevher}@epfl.ch · olivier.fercoq@telecom-paristech.fr
¹ Department of Computer Science, ETH Zurich, Switzerland
² LIONS, École Polytechnique Fédérale de Lausanne, Switzerland
³ LTCI, Télécom Paris, Université Paris-Saclay, France

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** A broad class of convex optimization problems can be formulated as a semidefinite program (SDP): the minimization of a convex function over the positive-semidefinite cone subject to affine constraints. The majority of classical SDP solvers are designed for the deterministic setting, where the problem data is readily available. In this setting, generalized conditional gradient methods (aka Frank-Wolfe-type methods) provide scalable solutions by leveraging the so-called linear minimization oracle instead of the projection onto the semidefinite cone. Most problems in machine learning and modern engineering applications, however, contain some degree of stochasticity. In this work, we propose the first conditional-gradient-type method for solving stochastic optimization problems under affine constraints. Our method guarantees an $\mathcal{O}(k^{-1/3})$ convergence rate in expectation on the objective residual and $\mathcal{O}(k^{-5/12})$ on the feasibility gap.

## 1 Introduction

We focus on the following stochastic convex composite optimization template, which covers finite-sum and online learning problems:

$$\min_{x \in \mathcal{X}} \; \mathbb{E}_\omega f(x, \omega) + g(Ax) =: F(x). \tag{P}$$

In this optimization template, we consider the following setting:
- $\mathcal{X} \subseteq \mathbb{R}^n$ is a convex and compact set,
- $\omega$ is a realization of a random variable drawn from the distribution $P$,
- $\mathbb{E}_\omega f(\cdot, \omega) : \mathcal{X} \to \mathbb{R}$ is a smooth (see Section 1.2 for the definition) convex function,
- $A : \mathbb{R}^n \to \mathbb{R}^d$ is a given linear map,
- $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a convex function (possibly non-smooth).

We consider two distinct specific cases for $g$ (a small sketch of both cases appears at the end of this section):

(i) $g$ is a Lipschitz-continuous function, for which the proximal operator is easy to compute:

$$\mathrm{prox}_g(y) = \arg\min_{z \in \mathbb{R}^d} \; g(z) + \tfrac{1}{2}\|z - y\|^2. \tag{1}$$

(ii) $g$ is the indicator function of a convex set $\mathcal{K} \subseteq \mathbb{R}^d$:

$$g(z) = \begin{cases} 0 & \text{if } z \in \mathcal{K}, \\ +\infty & \text{otherwise.} \end{cases} \tag{2}$$

The former covers regularized optimization problems. This type of regularization is common in machine learning applications to promote a desired structure in the solution. The latter handles affine constraints of the form $Ax \in \mathcal{K}$. We can also attack the combination of both: the minimization of a regularized loss function subject to affine constraints.

In this paper, we propose a conditional-gradient-type method (aka Frank-Wolfe-type) for (P). In summary, our main contributions are as follows:
- We propose the first CGM variant for solving (P). By CGM variant, we mean that our method avoids projection onto $\mathcal{X}$ and uses the lmo of $\mathcal{X}$ instead. The majority of the known methods for (P) require projections onto $\mathcal{X}$.
- We prove an $\mathcal{O}(k^{-1/3})$ convergence rate on the objective residual when $g$ is Lipschitz-continuous.
- We prove an $\mathcal{O}(k^{-1/3})$ convergence rate on the objective residual and $\mathcal{O}(k^{-5/12})$ on the feasibility gap when $g$ is an indicator function. Surprisingly, affine constraints that make the lmo challenging for existing CGM variants can be easily incorporated in this framework by using smoothing.
- We provide empirical evidence that validates our theoretical findings. Our results highlight the benefits of our framework against projection-based algorithms.
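To make the two cases of $g$ concrete, the following is a minimal sketch (ours, not part of the paper's implementation) of a prox oracle for each case, taking $g = \lambda\|\cdot\|_1$ as a representative Lipschitz regularizer and a box $\mathcal{K}$ as a representative constraint set; `prox_l1` and `prox_box` are illustrative helper names.

```python
import numpy as np

# Case (i): g = lam * ||.||_1 is Lipschitz continuous, and its proximal
# operator (1) is the elementwise soft-thresholding map.
def prox_l1(y, lam):
    """prox_{lam ||.||_1}(y) = argmin_z lam*||z||_1 + 0.5*||z - y||^2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Case (ii): g is the indicator (2) of a box K = [lo, hi]^d. The prox of
# an indicator function reduces to the Euclidean projection onto the set.
def prox_box(y, lo, hi):
    """prox of the indicator of K: project y onto the box [lo, hi]^d."""
    return np.clip(y, lo, hi)
```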
### 1.1 Motivation: Stochastic Semidefinite Programming

Consider the following stochastic semidefinite programming template: minimization of a convex function over the positive-semidefinite cone subject to affine constraints,

$$\min_{X \in \mathbb{S}^n_+, \; \mathrm{tr}(X) \leq \beta} \; \mathbb{E}_\omega f(X, \omega) \quad \text{subject to} \quad AX \in \mathcal{K}, \tag{3}$$

where $\mathbb{S}^n_+$ denotes the positive-semidefinite cone. We are interested in solving (3) rather than the classical SDP since it does not require access to the whole data at one time. This creates a new vein of SDP applications in machine learning. Examples span online variants of clustering [33], streaming PCA [4], kernel learning [24], community detection [1], optimal power-flow [29], etc.

**Example: Clustering.** Consider the SDP formulation of the k-means clustering problem [33]:

$$\min_{X \in \mathbb{S}^n_+, \; \mathrm{tr}(X) = k} \; \langle D, X \rangle \quad \text{subject to} \quad X 1_n = 1_n, \; X \geq 0. \tag{4}$$

Here, $1_n$ denotes the vector of ones, $X \geq 0$ enforces entrywise non-negativity, and $D$ is the Euclidean distance matrix. Classical SDP solvers assume that we can access the whole data matrix $D$ at each time instance. By considering (3), we can solve this problem using only a subset of the entries of $D$ at each iteration. Remark that a subset of the entries of $D$ can be computed from a subset of the datapoints, since $D$ is the Euclidean distance matrix.

We can attack (P), and (3) as a special case, by using operator splitting methods, assuming that we can efficiently project a point onto $\mathcal{X}$ (see [2] and the references therein). However, projection onto the semidefinite cone may require a full eigendecomposition, which imposes a computational bottleneck (with its cubic cost) even for medium-scale problems with a few thousand dimensions.

When affine constraints are absent from the formulation (3), we can use stochastic CGM variants from the literature. The main workhorse of these methods is the so-called linear minimization oracle:

$$S = \arg\min_{Y \in \mathbb{S}^n_+, \; \mathrm{tr}(Y) \leq \beta} \; \langle \nabla f(X, \omega), \, Y \rangle.$$

We can compute $S$ if we can find an eigenvector that corresponds to the smallest eigenvalue of $\nabla f(X, \omega)$. We can compute these eigenvectors efficiently by using shifted power methods or randomized subspace iterations [12]; a minimal sketch is given at the end of this subsection. When we also consider affine constraints in our problem template, however, the lmo becomes an SDP instance in the canonical form. In this setting, neither the projection nor the lmo is easy to compute. To our knowledge, no existing CGM variant is effective for solving (3) (and (P)). We specifically bridge this gap.
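As an illustration of the preceding paragraph, here is a minimal sketch of the lmo over $\{Y \succeq 0, \ \mathrm{tr}(Y) \le \beta\}$, using SciPy's Lanczos-based eigensolver in place of the shifted power method or randomized subspace iterations of [12]; the helper name is ours.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lmo_psd_trace(grad, beta):
    """argmin over {Y psd, tr(Y) <= beta} of <grad, Y>. The minimum is
    attained at beta * u u^T, where u is a unit eigenvector for the
    smallest eigenvalue of grad, whenever that eigenvalue is negative;
    otherwise Y = 0 is optimal."""
    w, u = eigsh(grad, k=1, which='SA')  # smallest algebraic eigenpair
    if w[0] >= 0.0:
        return np.zeros_like(grad)
    u = u[:, 0]
    return beta * np.outer(u, u)
```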
### 1.2 Notation and Preliminaries

We denote the expectation with respect to the random variable $\omega$ by $\mathbb{E}_\omega$, and the expectation with respect to the sources of randomness in the optimization simply by $\mathbb{E}$. Furthermore, we denote $f^\star := \mathbb{E}_\omega f(x^\star, \omega)$, where $x^\star$ is a solution of (P). Throughout the paper, $y^\star$ represents a solution of the dual problem of (P). We assume that strong duality holds. Slater's condition is a common sufficient condition for strong duality that implies the existence of a dual solution with finite norm.

**Solution.** We denote a solution to (P) and the optimal value by $x^\star$ and $F^\star$ respectively:

$$F^\star = F(x^\star) \leq F(x), \quad \forall x \in \mathcal{X}. \tag{5}$$

$x \in \mathcal{X}$ is an $\varepsilon$-suboptimal solution (or simply an $\varepsilon$-solution) if and only if

$$F(x) - F^\star \leq \varepsilon. \tag{6}$$

**Stochastic first-order oracle (sfo).** For the stochastic function $\mathbb{E}_\omega f(x, \omega)$, suppose that we have access to a stochastic first-order oracle that returns a pair $(f(x, \omega), \nabla f(x, \omega))$ given $x$, where $\omega$ is an iid sample from the distribution $P$.

**Lipschitz continuity & smoothness.** A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz continuous if

$$|g(z_1) - g(z_2)| \leq L \|z_1 - z_2\|, \quad \forall z_1, z_2 \in \mathbb{R}^d. \tag{7}$$

A differentiable function $f$ is said to be $L$-smooth if its gradient $\nabla f$ is $L$-Lipschitz continuous.

## 2 Stochastic Homotopy CGM

**Algorithm 1 (SHCGM)**

Input: $x_1 \in \mathcal{X}$, $\beta_0 > 0$, $d_0 = 0$
for $k = 1, 2, \ldots$ do
&nbsp;&nbsp; $\eta_k = 9/(k+8)$, $\quad \beta_k = \beta_0/(k+8)^{1/2}$, $\quad \rho_k = 4/(k+7)^{2/3}$
&nbsp;&nbsp; $d_k = (1 - \rho_k)\, d_{k-1} + \rho_k \nabla_x f(x_k, \omega_k)$
&nbsp;&nbsp; $v_k = d_k + \beta_k^{-1} A^\top \big(A x_k - \mathrm{prox}_{\beta_k g}(A x_k)\big)$
&nbsp;&nbsp; $s_k = \arg\min_{x \in \mathcal{X}} \langle v_k, x \rangle$
&nbsp;&nbsp; $x_{k+1} = x_k + \eta_k (s_k - x_k)$
end for

Most stochastic CGM variants require the minibatch size to increase in order to reduce the variance of the gradient estimator. However, Mokhtari et al. [31] have recently shown that the following (biased) estimator, which can be implemented with a single sample, can be incorporated into the CGM analysis:

$$d_k = (1 - \rho_k)\, d_{k-1} + \rho_k \nabla_x f(x_k, \omega_k). \tag{8}$$

The resulting method guarantees an $\mathcal{O}(1/k^{1/3})$ convergence rate for convex smooth minimization, but it does not apply to our composite problem template (P). On the other hand, we introduced a CGM variant for composite problems (which also covers affine constraints) in the deterministic setting in our prior work [41]. That framework combines Nesterov smoothing [32] (and the quadratic penalty for affine constraints) with the CGM analysis. Unfortunately, this method does not work for stochastic problems.

In this paper, we propose the Stochastic Homotopy Conditional Gradient Method (SHCGM) for solving (P). The proposed method combines the stochastic CGM of [31] with our (deterministic) CGM for composite problems [41] in a non-trivial way. Remark that the following formulation uniformly covers Nesterov smoothing (with the Euclidean prox-function $\frac{1}{2}\|\cdot\|^2$) and the quadratic penalty (but the analyses for these two cases differ):

$$g_\beta(z) = \max_{y \in \mathbb{R}^d} \; \langle z, y \rangle - g^*(y) - \frac{\beta}{2}\|y\|^2, \quad \text{where} \quad g^*(y) = \max_{z \in \mathbb{R}^d} \; \langle z, y \rangle - g(z) \tag{9}$$

denotes the Fenchel conjugate of $g$. We call $g_\beta$ the smooth approximation of $g$, parametrized by the penalty (or smoothing) parameter $\beta > 0$. It is easy to show that $g_\beta$ is $1/\beta$-smooth. Remark that the gradient of $g_\beta$ can be computed by the following formula:

$$\nabla_x g_\beta(Ax) = A^\top \mathrm{prox}_{\beta^{-1} g^*}\big(\beta^{-1} A x\big) = \beta^{-1} A^\top \big(Ax - \mathrm{prox}_{\beta g}(Ax)\big), \tag{10}$$

where the second equality follows from the Moreau decomposition.

The main idea is to replace the non-smooth component $g$ by the smooth approximation $g_\beta$ in (P). Clearly, the solutions of (P) with $g(Ax)$ and with $g_\beta(Ax)$ do not coincide for any fixed value of $\beta$; however, $g_\beta \to g$ as $\beta \to 0$. Hence, we adopt a homotopy technique: we decrease $\beta$ at a controlled rate as we progress in the optimization procedure, so that the decision variable converges to a solution of the original problem.

SHCGM is characterized by the following iterative steps:
- Decrease the step-size, smoothing, and gradient-averaging parameters $\eta_k$, $\beta_k$ and $\rho_k$.
- Call the stochastic first-order oracle and compute the gradient estimator $d_k$ in (8).
- Compute the gradient estimator $v_k$ for the smooth approximation of the composite objective,
$$F_{\beta_k}(x) = \mathbb{E}_\omega f(x, \omega) + g_{\beta_k}(Ax) \;\Longrightarrow\; v_k = d_k + \nabla_x g_{\beta_k}(A x_k). \tag{11}$$
- Compute the lmo with respect to $v_k$.
- Perform a CGM step to find the next iterate.

The roles of $\rho_k$ and $\beta_k$ are coupled: the former controls the variance of the gradient estimator, and the latter decides how fast we reduce the smoothing parameter to approach the original problem. A carefully tuned interaction between these two parameters allows us to prove the convergence rates below; a compact sketch of the loop follows.
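For concreteness, the following is a minimal sketch of Algorithm 1, assuming user-supplied oracles: `sfo(x, k)` returns a stochastic gradient $\nabla_x f(x, \omega_k)$, `lmo(v)` solves the linear minimization over $\mathcal{X}$, `prox_g(z, beta)` evaluates $\mathrm{prox}_{\beta g}$, and `A`/`At` apply the linear map and its adjoint. All of these names are ours; the paper's own implementation is in the supplements.

```python
import numpy as np

def shcgm(x1, beta0, sfo, lmo, prox_g, A, At, n_iters):
    """Sketch of SHCGM (Algorithm 1) with user-supplied oracles."""
    x = x1
    d = np.zeros_like(x1)                      # d_0 = 0
    for k in range(1, n_iters + 1):
        eta = 9.0 / (k + 8)                    # step size
        beta = beta0 / np.sqrt(k + 8)          # smoothing parameter
        rho = 4.0 / (k + 7) ** (2.0 / 3.0)     # averaging weight
        d = (1.0 - rho) * d + rho * sfo(x, k)  # biased estimator (8)
        Ax = A(x)
        v = d + At(Ax - prox_g(Ax, beta)) / beta  # gradient of (11)
        s = lmo(v)                             # linear minimization step
        x = x + eta * (s - x)                  # CGM update
    return x
```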
**Assumption (Bounded variance).** We assume that the following bounded-variance condition holds:

$$\mathbb{E}\,\big\|\nabla_x f(x, \omega) - \nabla_x \mathbb{E}_\omega f(x, \omega)\big\|^2 \;\leq\; \sigma^2 \;<\; +\infty. \tag{12}$$

**Theorem 1 (Lipschitz-continuous regularizer).** Assume that $g : \mathbb{R}^d \to \mathbb{R}$ is $L_g$-Lipschitz continuous. Then, the sequence $x_k$ generated by Algorithm 1 satisfies the following convergence bound:

$$\mathbb{E}\, F(x_{k+1}) - F^\star \;\leq\; \frac{C}{(k+8)^{1/3}}, \tag{13}$$

where $C := 81\, D_{\mathcal{X}}^2 \big(L_f + \beta_0^{-1}\|A\|^2\big) + 36\, \sigma D_{\mathcal{X}} + 27\,\cdots$ (here $D_{\mathcal{X}}$ denotes the diameter of $\mathcal{X}$ and $L_f$ the smoothness constant of $\mathbb{E}_\omega f(\cdot, \omega)$).

*Proof sketch.* The proof follows these steps: (i) relate the stochastic gradient to the full gradient (Lemma 7); (ii) show convergence of the gradient estimator to the full gradient (Lemma 8); (iii) show an $\mathcal{O}(1/k^{1/3})$ convergence rate on the smoothed gap $\mathbb{E}\, F_{\beta_k}(x_{k+1}) - F^\star$ (Theorem 9); (iv) translate this bound into the actual suboptimality $\mathbb{E}\, F(x_{k+1}) - F^\star$ by using the envelope property of Nesterov smoothing, see Equation (2.7) in [32].

Convergence rate guarantees for stochastic CGM with Lipschitz-continuous $g$ (also based on Nesterov smoothing) are already known in the literature; see [16, 22, 23] for examples. Our rate is not faster than the ones in [22, 23], but we obtain an $\mathcal{O}(1/\varepsilon^{3})$ sample complexity in the statistical setting as opposed to $\mathcal{O}(1/\varepsilon^{4})$.

In contrast with the existing stochastic CGM variants, our algorithm can also handle affine constraints. Remark that indicator functions are not Lipschitz continuous, hence the Nesterov smoothing technique does not work for affine constraints.

**Assumption (Strong duality).** For problems with affine constraints, we further assume that strong duality holds. Slater's condition is a common sufficient condition for strong duality. By Slater's condition, we mean

$$\mathrm{relint}(\mathcal{X} \times \mathcal{K}) \;\cap\; \big\{(x, r) \in \mathbb{R}^n \times \mathbb{R}^d : Ax = r\big\} \;\neq\; \emptyset.$$

Recall that strong duality ensures the existence of a finite dual solution.

**Theorem 2 (Affine constraints).** Suppose that $g$ is the indicator function of a simple convex set $\mathcal{K} \subseteq \mathbb{R}^d$. Assuming that strong duality holds, the sequence $x_k$ generated by SHCGM satisfies

$$\mathbb{E}\,\mathbb{E}_\omega f(x_{k+1}, \omega) - f^\star \;\geq\; -\,\|y^\star\| \; \mathbb{E}\,\mathrm{dist}(A x_{k+1}, \mathcal{K}),$$
$$\mathbb{E}\,\mathbb{E}_\omega f(x_{k+1}, \omega) - f^\star \;\leq\; \frac{C}{(k+8)^{1/3}},$$
$$\mathbb{E}\,\mathrm{dist}(A x_{k+1}, \mathcal{K}) \;\leq\; \frac{2\beta_0 \|y^\star\|}{\sqrt{k+8}} + \mathcal{O}\!\big((k+8)^{-5/12}\big).$$

*Proof sketch.* We re-use the ingredients of the proof of Theorem 1, except that at step (iv) we translate the bound on the smoothed gap (penalized objective) into the actual convergence measures (objective residual and feasibility gap) by using the Lagrange saddle-point formulation and strong duality. See Corollaries 1 and 2.

**Remark (Comparison to baseline).** SHCGM combines ideas from [31] and [41]. Surprisingly,
- the $\mathcal{O}(1/k^{1/3})$ rate in the objective residual matches the rate in [31] for smooth minimization;
- the $\mathcal{O}(1/k^{5/12})$ rate in the feasibility gap is only a factor of $k^{1/12}$ worse than the deterministic variant in [41].

**Remark (Inexact oracles).** We assume exact lmo solutions in SHCGM in Theorems 1 and 2. In many applications, however, it is much easier to find an approximate lmo solution; for instance, this is the case for the SDP problems in Section 1.1. To this end, we extend our results to inexact lmo calls with additive and multiplicative error in the supplements.

**Remark (Splitting).** An important use-case of affine constraints in (P) is splitting (see Section 5.6 in [41]). Suppose that $\mathcal{X}$ can be written as the intersection of two (or more) sets $\mathcal{A} \cap \mathcal{B}$ that are simpler (in terms of the computational cost of the lmo or the projection). By using the standard product-space technique, we can reformulate the problem in the extended space $(x, y) \in \mathcal{A} \times \mathcal{B}$ with the constraint $x = y$:

$$\min_{(x, y) \in \mathcal{A} \times \mathcal{B}} \; \mathbb{E}_\omega f(x, \omega) \quad \text{subject to} \quad x = y. \tag{16}$$

This allows us to decompose the difficult optimization domain $\mathcal{X}$ into simpler pieces. SHCGM requires the lmo of $\mathcal{A}$ and the lmo of $\mathcal{B}$ separately (see the sketch after this remark). Alternatively, we can also use the projection onto one of the component sets (say $\mathcal{B}$) by reformulating the problem in the domain $\mathcal{A}$ with the affine constraint $x \in \mathcal{B}$:

$$\min_{x \in \mathcal{A}} \; \mathbb{E}_\omega f(x, \omega) \quad \text{subject to} \quad x \in \mathcal{B}. \tag{17}$$
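As a concrete illustration of (16), a minimal sketch (with our own helper names `lmo_A`, `lmo_B`) of how the product-space oracles decompose: the lmo over $\mathcal{A} \times \mathcal{B}$ splits into the two component lmos, and the coupling $x = y$ becomes the affine map $z = (x, y) \mapsto x - y$ with $\mathcal{K} = \{0\}$, whose prox is the zero map.

```python
import numpy as np

def lmo_product(v, lmo_A, lmo_B):
    """lmo over A x B for the splitting (16): the linear objective
    separates over the blocks (vx, vy), so the minimization does too."""
    vx, vy = v
    return (lmo_A(vx), lmo_B(vy))

def affine_map(z):
    """The coupling constraint x = y, written as A z in K = {0}."""
    x, y = z
    return x - y

def prox_zero(r, beta):
    """prox of the indicator of K = {0}: projection onto the origin."""
    return np.zeros_like(r)
```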
An important example is the completely positive cone (the intersection of the positive-semidefinite cone and the first orthant). Remark that the clustering SDP example in Section 1.1 is also defined on this cone. While the lmo of this intersection can only be evaluated in $\mathcal{O}(n^3)$ computation by using the Hungarian method, we can compute the lmo for the semidefinite cone and the projection onto the first orthant much more efficiently.

## 3 Related Works

CGM dates back to the 1956 paper of Frank and Wolfe [8]. It did not attract much interest in machine learning until the last decade because of its slower convergence rate in comparison with (projected) accelerated gradient methods. However, there has been a resurgence of interest in CGM and its variants, following the seminal papers of Hazan [14] and Jaggi [18]. They demonstrated that CGM might offer superior computational complexity compared with state-of-the-art methods in many large-scale optimization problems (that arise in machine learning) despite its slower convergence rate, thanks to its lower per-iteration cost.

The original method by Frank and Wolfe [8] was proposed for smooth convex minimization on polytopes. The analysis was extended to smooth convex minimization on the simplex by Clarkson [3], on the spectrahedron by Hazan [14], and finally to arbitrary compact convex sets by Jaggi [18]. All these methods are restricted to smooth problems. Lan [21] proposed a variant for non-smooth minimization based on the Nesterov smoothing technique. Lan and Zhou [23] also introduced the conditional gradient sliding method and extended it to non-smooth minimization in a similar way. These methods, however, are not suitable for solving (P) because we allow $g$ to be an indicator function, which is not smoothing-friendly.

In prior work [41], we introduced the homotopy CGM (HCGM) for composite problems (also with affine constraints). HCGM combines the Nesterov smoothing and quadratic penalty techniques under the CGM framework. It has $\mathcal{O}(1/\varepsilon^2)$ iteration complexity. In a follow-up work [40], we extended this method from the quadratic penalty to an augmented Lagrangian formulation for empirical benefits. Gidel et al. [10] also proposed an augmented Lagrangian CGM, but the analysis and guarantees differ. We refer to the references in [40, 41] for other variants in this direction.

So far, we have focused on deterministic variants of CGM. The literature on stochastic variants is much younger. We can trace it back to Hazan and Kale's projection-free methods for online learning [16]. When $g$ is a non-smooth but Lipschitz-continuous function, their method returns an $\varepsilon$-solution in $\mathcal{O}(1/\varepsilon^4)$ iterations. The standard extension of CGM to the stochastic setting achieves $\mathcal{O}(1/\varepsilon)$ iteration complexity for smooth minimization, but with an increasing minibatch size; overall, this method requires $\mathcal{O}(1/\varepsilon^3)$ sample complexity, see [17] for the details. More recently, Mokhtari et al. [31] proposed a new variant with an $\mathcal{O}(1/\varepsilon^3)$ convergence rate whose iterations can work with a single sample. Hazan and Luo [17] and Yurtsever et al. [42] incorporated various variance-reduction techniques for further improvements. Goldfarb et al. [11] introduced two stochastic CGM variants, with away-steps and pairwise-steps. These methods enjoy a linear convergence rate (although the batch size increases exponentially), but only for strongly convex objectives over polytope domains. None of these stochastic CGM variants works for non-smooth (or composite) problems.
The non-smooth conditional gradient sliding method by Lan and Zhou [23] also has extensions to the stochastic setting, and there is a lazy variant with further improvements by Lan et al. [22]. Note, however, that similarly to their deterministic counterparts, these methods are based on Nesterov smoothing and are not suitable for problems with affine constraints.

Garber and Kaplan [9] consider problem (P). They also propose a variance-reduced algorithm, but this method in fact solves the smooth relaxation of (P) (see Definition 1, Section 4.1 in [9]). Contrary to SHCGM, this method might not asymptotically converge to a solution of the original problem.

Lu and Freund [28] also studied a similar problem template. However, their method incorporates the non-smooth term into the linear minimization oracle. This is restrictive in practice because the non-smooth term can increase the cost of the linear minimization; in particular, this is the case when $g$ is an indicator function, as in SDP problems. This is orthogonal to our scenario, in which the affine constraints are processed by smoothing, not directly through the lmo.

In recent years, CGM has also been extended to non-convex problems. These extensions are beyond the scope of this paper. We refer to Yu et al. [39] and Lacoste-Julien [19] for the non-convex extensions in the deterministic setting, and to Reddi et al. [34], Yurtsever et al. [42], and Shen et al. [37] in the stochastic setting.

To the best of our knowledge, SHCGM is the first CGM-type algorithm for solving (P) with cheap linear minimization oracles. Another popular approach for solving large-scale instances of (P) is operator splitting; see [2] and the references therein for stochastic operator splitting methods. Unfortunately, these methods still require a projection onto $\mathcal{X}$ at each iteration, which is arguably more expensive than the linear minimization. For instance, for solving (3), the projection has cubic cost (with respect to the problem dimension $n$), while the linear minimization can be solved efficiently using subspace iterations, as depicted in Table 1.

| Algorithm | Iteration complexity | Sample complexity | Solves (3) | Per-iteration cost (for (3)) |
|---|---|---|---|---|
| [41] | $\mathcal{O}(1/\varepsilon^2)$ | $N$ | Yes | $\mathcal{O}(N_r/\delta)$ |
| [9] | $\mathcal{O}(1/\varepsilon^2)$ | $\mathcal{O}(1/\varepsilon^4)$ | No | $\mathcal{O}(N_r/\delta)$ |
| [17] | $\mathcal{O}(1/\varepsilon)$ | $\mathcal{O}(1/\varepsilon^3)$ | No | $\mathcal{O}(N_r/\delta)$ |
| [28] | $\mathcal{O}(1/\varepsilon)$ | $\mathcal{O}(1/\varepsilon^2)$ | No | SDP |
| [15] | $\mathcal{O}(1/\varepsilon)$ | $N$ | No | $\mathcal{O}(N_r/\delta)$ |
| [2] | — | — | Yes | $\mathcal{O}(n^3)$ |
| SHCGM | $\mathcal{O}(1/\varepsilon^3)$ | $\mathcal{O}(1/\varepsilon^3)$ | Yes | $\mathcal{O}(N_r/\delta)$ |

Table 1: Existing algorithms to tackle (3). $N$ is the size of the dataset, $n$ is the dimension of each datapoint, $N_r$ is the number of non-zeros of the gradient, and $\delta$ is the accuracy of the approximate lmo. The per-iteration cost of [28] is the cost of solving an SDP in the canonical form. [2] has $\mathcal{O}(1/\varepsilon^2)$ iteration and sample complexity when the objective function is strongly convex; this is not the case in our model problem, and [2] only has an asymptotic convergence guarantee.

## 4 Numerical Evidence

This section presents the empirical performance of the proposed method for the stochastic k-means clustering, covariance matrix estimation, and matrix completion problems. We performed the experiments in MATLAB R2018a using a computing system with 4 Intel Xeon E5-2630 v3 CPUs @ 2.40 GHz and 16 GB RAM. We include the code to reproduce the results in the supplements.

### 4.1 Stochastic k-means Clustering

We consider the SDP formulation (4) of the k-means clustering problem. The same problem is used in the numerical experiments of Mixon et al. [30], and we design our experiment based on their problem setup¹ with a sample of 1000 datapoints from the MNIST data².
See [30] for details on the preprocessing.

We solve this problem with SHCGM and compare it against HCGM [41] as the baseline. HCGM is a deterministic algorithm, hence it uses the full gradient. For SHCGM, we compute a gradient estimator by randomly sampling 100 datapoints at each iteration. Remark that this corresponds to observing approximately 1 percent of the entries of $D$.

We use $\beta_0 = 1$ for HCGM and $\beta_0 = 10$ for SHCGM. We set these values by tuning both methods over $\beta_0 = 0.01, 0.1, \ldots, 1000$. We display the results in Figure 1, where we denote a full pass over the entries of $D$ as an epoch. Figure 1 demonstrates that SHCGM performs similarly to HCGM although it uses less data.

Figure 1: Comparison of SHCGM with HCGM for the k-means clustering SDP in Section 4.1.

### 4.2 Online Covariance Matrix Estimation

Covariance matrix estimation is an important problem in multivariate statistics with applications in many fields, including gene microarrays, social networks, finance, and climate analysis [35, 36, 7, 6]. In the online setting, we suppose that the data is received as a stream of datapoints in time.

The deterministic approach is to first collect some data, and then to train an empirical risk minimization model on the collected data. This has obvious limitations, since it may not be clear a priori how much data suffices to estimate the covariance matrix precisely. Furthermore, the data can be too large to store or work with as a batch. To this end, we consider an online learning setting, in which we use each datapoint as it arrives and then discard it.

¹ D. G. Mixon, S. Villar, R. Ward. Available at https://github.com/solevillar/kmeans_sdp
² Y. LeCun and C. Cortes. Available at http://yann.lecun.com/exdb/mnist/

Figure 2: SHCGM and HCGM on online covariance matrix estimation from streaming data.

Let us consider the following sparse covariance matrix estimation template (this template also covers other problems such as graph denoising and link prediction [35]):

$$\min_{X \in \mathbb{S}^n_+, \; \mathrm{tr}(X) \leq \beta_1} \; \mathbb{E}_\omega \, \|X - \omega\omega^\top\|_F^2 \quad \text{subject to} \quad \|X\|_1 \leq \beta_2, \tag{18}$$

where $\|X\|_1$ denotes the $\ell_1$-norm (the sum of the absolute values of the entries).

Our test setup is as follows: We first create a block-diagonal covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ using 10 blocks of the form $\varphi\varphi^\top$, where the entries of $\varphi$ are drawn uniformly at random from $[-1, 1]$. This gives us a sparse matrix of rank 10. Then, as datapoints, we stream observations of $\Sigma$ in the form $\omega_i \sim \mathcal{N}(0, \Sigma)$. We fix the problem dimension $n = 1000$.

We compare SHCGM with the deterministic method, HCGM. We use $\beta_0 = 1$ for both methods. Both methods require the lmo for the positive-semidefinite cone with the trace constraint, and the projection oracle for the $\ell_1$-norm constraint, at each iteration (a sketch of these problem-specific oracles follows below).

We study two different setups: In Figure 2, we use SHCGM in the online setting, sampling a new datapoint at each iteration. HCGM, on the other hand, does not work in the online setting; hence, we use the same sample of datapoints for all of its iterations. We consider 4 different cases for HCGM, with sample sizes of 10, 50, 100 and 200 datapoints. Although this approach converges fast up to some accuracy, the objective value saturates at some estimation accuracy; naturally, HCGM achieves higher accuracy as the sample size increases. We can also read the empirical convergence rates of SHCGM from Figure 2: approximately $\mathcal{O}(k^{-1/2})$ for the objective residual and $\mathcal{O}(k^{-1})$ for the feasibility gap, significantly better than the theoretical guarantees.
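For reference, a minimal sketch (our own naming) of the two problem-specific ingredients mentioned above: the one-sample gradient of $\mathbb{E}_\omega\|X - \omega\omega^\top\|_F^2$, and the Euclidean projection onto the $\ell_1$-ball, which is the prox of the indicator constraint in (18), via the standard sort-based algorithm.

```python
import numpy as np

def stoch_grad(X, omega):
    """One-sample gradient of E ||X - ww^T||_F^2: grad = 2 (X - ww^T)."""
    return 2.0 * (X - np.outer(omega, omega))

def project_l1_ball(X, radius):
    """Euclidean projection of X onto {Z : ||Z||_1 <= radius}, applied
    to the flattened entries (sort-based simplex-projection algorithm)."""
    v = X.ravel()
    if np.abs(v).sum() <= radius:
        return X
    u = np.sort(np.abs(v))[::-1]            # magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    k = np.nonzero(u * idx > css - radius)[0][-1]  # largest valid index
    theta = (css[k] - radius) / (k + 1.0)   # soft-threshold level
    w = np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
    return w.reshape(X.shape)
```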
Figure 3: Comparison of SHCGM with HCGM (batch size 200) for online covariance matrix estimation.

If we can store larger samples, we can also use minibatches with the stochastic method. Figure 3 compares the deterministic approach with 200 datapoints against the stochastic approach with a minibatch size of 200: while the deterministic method uses the same 200 datapoints for all iterations, SHCGM draws a fresh batch of 200 datapoints at each iteration.

### 4.3 Stochastic Matrix Completion

We consider the problem of matrix completion with the following mathematical formulation:

$$\min_{\|X\|_* \leq \beta_1} \; \sum_{(i,j) \in \Omega} (X_{i,j} - Y_{i,j})^2 \quad \text{subject to} \quad 1 \leq X \leq 5, \tag{19}$$

where $\Omega$ is the set of observed ratings (samples of entries of the true matrix $Y$ that we try to recover), and $\|X\|_*$ denotes the nuclear norm (the sum of the singular values). The affine constraint $1 \leq X \leq 5$ imposes a hard threshold on the estimated ratings (in other words, on the entries of $X$).

| | Train RMSE | Test RMSE |
|---|---|---|
| SHCGM | 0.5574 ± 0.0498 | 1.1446 ± 0.0087 |
| SFW | 1.8360 ± 0.3266 | 2.0416 ± 0.2739 |

Figure 4: Training error, feasibility gap, and test error for MovieLens 100k. The table shows the mean and standard deviation of the train and test RMSE over 5 different train/test splits at the end of $10^4$ iterations.

We first compare SHCGM with the Stochastic Frank-Wolfe (SFW) method from [31]. We consider a test setup with the MovieLens 100k dataset³ [13]. This dataset contains 100,000 integer-valued ratings between 1 and 5, assigned by 943 users to 1682 movies. The aim of this experiment is to emphasize the flexibility of SHCGM: recall that SFW does not directly apply to (19), as it cannot handle the affine constraint $1 \leq X \leq 5$. Therefore, we apply SFW to a relaxation of (19) that omits this constraint; then we solve (19) with SHCGM and compare the results.

We use the default ub.train and ub.test partitions provided with the original data. We set the model parameter for the nuclear-norm constraint to $\beta_1 = 7000$, and the initial smoothing parameter to $\beta_0 = 10$. At each iteration, we compute a gradient estimator from 1000 iid samples. We perform the same test independently 10 times to compute the average performance and confidence intervals. In Figure 4, we report the training and test errors (root mean squared error) as well as the feasibility gap; the solid lines display the average performance, and the shaded areas show one standard deviation. Note that SHCGM performs uniformly better than SFW, in terms of both the training and test errors. The table shows the values achieved at the end of 100,000 iterations.

Finally, we compare SHCGM with the stochastic three-composite convex minimization method (S3CCM) from [43]. S3CCM is a projection-based method that applies to (19). In this experiment, we aim to demonstrate the advantages of projection-free methods for large-scale problems. We consider a test setup with the MovieLens 1M dataset³, with 1 million ratings from ~6000 users on ~4000 movies. We partition the data into training and test samples with an 80/20 train/test split. We use 100,000 iid samples at each iteration to compute a gradient estimator. We set the model parameter $\beta_1 = 200{,}000$. We use $\beta_0 = 10$ for SHCGM, and we set the step-size parameter $\gamma = 1$ for S3CCM. We implement the lmo efficiently using the power method, and we refer to the code in the supplements for details on the implementation (a schematic sketch of the oracles follows below).

Figure 5: SHCGM vs. S3CCM on MovieLens 1M.

Figure 5 reports the outcomes of this experiment. SHCGM clearly outperforms S3CCM in this test.
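A minimal sketch of the oracles this formulation calls for, with our own helper names: the sparse sample gradient over a batch of observed ratings, the prox of the box indicator $1 \le X \le 5$ (entrywise clipping), and a rank-one lmo over the nuclear-norm ball computed from the top singular pair (here via SciPy's `svds`, standing in for the power method used in our implementation).

```python
import numpy as np
from scipy.sparse.linalg import svds

def stoch_grad_mc(X, rows, cols, vals):
    """Sample gradient of the loss over a batch of observed ratings:
    nonzero only on the sampled entries, equal to 2 * (X_ij - Y_ij)."""
    G = np.zeros_like(X)
    G[rows, cols] = 2.0 * (X[rows, cols] - vals)
    return G

def prox_ratings_box(Z, beta):
    """g = indicator of {1 <= Z <= 5}: its prox (for any beta) is the
    entrywise projection onto the box of valid ratings."""
    return np.clip(Z, 1.0, 5.0)

def lmo_nuclear(G, beta1):
    """lmo over {||X||_* <= beta1}: the minimizer of <G, X> is the
    rank-one matrix -beta1 * u1 v1^T for the top singular pair of G."""
    u, s, vt = svds(G, k=1)
    return -beta1 * np.outer(u[:, 0], vt[0])
```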
We run both methods for 2 hours. Within this time limit, SHCGM performs 270,860 iterations, while S3CCM completes only 435 because of the high computational cost of the projection.

## 5 Conclusions

We introduced a scalable stochastic CGM-type method for solving convex optimization problems with affine constraints, and we demonstrated the empirical superiority of our approach in various numerical experiments. In particular, we considered the stochastic optimization of SDPs, for which we give the first projection-free algorithm. In general, we showed that our algorithm provably converges to an optimal solution of (P) with $\mathcal{O}(k^{-1/3})$ and $\mathcal{O}(k^{-5/12})$ rates in the objective residual and feasibility gap respectively, with an $\mathcal{O}(1/\varepsilon^3)$ sample complexity in the statistical setting. The possibility of a faster rate with the same (or even better) sample complexity remains an open question, as does an adaptive approach that achieves the $\mathcal{O}(k^{-1/2})$ rate when fed with exact gradients.

³ F. M. Harper, J. A. Konstan. Available at https://grouplens.org/datasets/movielens/

## Acknowledgements

Francesco Locatello has received funding from the Max Planck ETH Center for Learning Systems, an ETH core grant (to Gunnar Rätsch), and a Google Ph.D. Fellowship. Volkan Cevher and Alp Yurtsever have received funding from the Swiss National Science Foundation (SNSF) under grant number 200021_178865/1, and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594 – time-data).

## References

[1] E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:1–86, 2018.
[2] V. Cevher, B. C. Vũ, and A. Yurtsever. Stochastic forward Douglas-Rachford splitting method for monotone inclusions. In P. Giselsson and A. Rantzer, editors, Large Scale and Distributed Optimization, chapter 7, pages 149–179. Springer International Publishing, 2018.
[3] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4), 2010.
[4] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[5] C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal-dual rates and certificates. In Proc. 33rd International Conference on Machine Learning, 2016.
[6] J. Fan, F. Han, and H. Liu. Challenges of big data analysis. National Science Review, 1(2):293–314, 2014.
[7] J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
[9] D. Garber and A. Kaplan. Fast stochastic algorithms for low-rank and nonsmooth matrix problems. arXiv:1809.10477, 2018.
[10] G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In Proc. 21st International Conference on Artificial Intelligence and Statistics, 2018.
[11] D. Goldfarb, G. Iyengar, and C. Zhou. Linear convergence of stochastic Frank-Wolfe variants. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[12] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[13] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
[14] E. Hazan. Sparse approximate solutions to semidefinite programs. In Proc. 8th Latin American Conference on Theoretical Informatics, pages 306–316, 2008.
[15] E. Hazan. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, pages 306–316. Springer, 2008.
[16] E. Hazan and S. Kale. Projection-free online learning. In Proc. 29th International Conference on Machine Learning, 2012.
[17] E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In Proc. 33rd International Conference on Machine Learning, 2016.
[18] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. 30th International Conference on Machine Learning, 2013.
[19] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.
[20] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. 30th International Conference on Machine Learning, 2013.
[21] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2014.
[22] G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent. arXiv:1703.05840, 2017.
[23] G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM J. Optim., 26(2):1379–1409, 2016.
[24] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004.
[25] F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[26] F. Locatello, A. Raj, S. P. Karimireddy, G. Rätsch, B. Schölkopf, S. U. Stich, and M. Jaggi. On matching pursuit and coordinate descent. arXiv:1803.09539, 2018.
[27] F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems 30, 2017.
[28] H. Lu and R. M. Freund. Generalized stochastic Frank-Wolfe algorithm with stochastic substitute gradient for structured convex optimization. arXiv:1807.07680, 2018.
[29] R. Madani, S. Sojoudi, and J. Lavaei. Convex relaxation for optimal power flow problem: Mesh networks. IEEE Trans. on Power Syst., 30(1):199–211, 2015.
[30] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, 6(4):389–415, 2017.
[31] A. Mokhtari, H. Hassani, and A. Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv:1804.09554, 2018.
[32] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005.
[33] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. Optim., 18(1):186–205, 2007.
[34] S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. arXiv:1607.08254, 2016.
[35] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proc. 29th International Conference on Machine Learning, 2012.
[36] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
[37] Z. Shen, C. Fang, P. Zhao, J. Huang, and H. Qian. Complexities in projection-free stochastic non-convex minimization. In Proc. 22nd International Conference on Artificial Intelligence and Statistics, 2019.
[38] Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., 28(1):96–134, 2018.
[39] Y. Yu, X. Zhang, and D. Schuurmans. Generalized conditional gradient for sparse estimation. arXiv:1410.4828v1, 2014.
[40] A. Yurtsever, O. Fercoq, and V. Cevher. A conditional-gradient-based augmented Lagrangian framework. In Proc. 36th International Conference on Machine Learning, 2019.
[41] A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proc. 35th International Conference on Machine Learning, 2018.
[42] A. Yurtsever, S. Sra, and V. Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In Proc. 36th International Conference on Machine Learning, 2019.
[43] A. Yurtsever, B. C. Vũ, and V. Cevher. Stochastic three-composite convex minimization. In Advances in Neural Information Processing Systems 29, 2016.