# Stochastic Frank-Wolfe for Composite Convex Minimization

Francesco Locatello¹ · Alp Yurtsever² · Olivier Fercoq³ · Volkan Cevher²
francesco.locatello@inf.ethz.ch · {alp.yurtsever,volkan.cevher}@epfl.ch · olivier.fercoq@telecom-paristech.fr
¹ Department of Computer Science, ETH Zurich, Switzerland
² LIONS, École Polytechnique Fédérale de Lausanne, Switzerland
³ LTCI, Télécom Paris, Université Paris-Saclay, France

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** A broad class of convex optimization problems can be formulated as a semidefinite program (SDP): the minimization of a convex function over the positive-semidefinite cone subject to affine constraints. The majority of classical SDP solvers are designed for the deterministic setting, where the problem data is readily available. In this setting, generalized conditional gradient methods (aka Frank-Wolfe-type methods) provide scalable solutions by leveraging the so-called linear minimization oracle instead of the projection onto the semidefinite cone. Most problems in machine learning and modern engineering applications, however, contain some degree of stochasticity. In this work, we propose the first conditional-gradient-type method for solving stochastic optimization problems under affine constraints. Our method guarantees an $\mathcal{O}(k^{-1/3})$ convergence rate in expectation on the objective residual and $\mathcal{O}(k^{-5/12})$ on the feasibility gap.

## 1 Introduction

We focus on the following stochastic convex composite optimization template, which covers finite-sum and online learning problems:

$$\min_{x \in \mathcal{X}} \; \mathbb{E}_\omega f(x, \omega) + g(Ax) =: F(x). \tag{P}$$

In this optimization template, we consider the following setting:
- $\mathcal{X} \subseteq \mathbb{R}^n$ is a convex and compact set,
- $\omega$ is a realization of a random variable drawn from the distribution $P$,
- $\mathbb{E}_\omega f(\cdot, \omega) : \mathcal{X} \to \mathbb{R}$ is a smooth (see Section 1.2 for the definition) convex function,
- $A : \mathbb{R}^n \to \mathbb{R}^d$ is a given linear map,
- $g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a convex function (possibly non-smooth).

We consider two distinct specific cases for $g$ (a small sketch of both cases appears at the end of this section):

(i) $g$ is a Lipschitz-continuous function, for which the proximal operator is easy to compute:

$$\mathrm{prox}_g(y) = \arg\min_{z \in \mathbb{R}^d} \; g(z) + \tfrac{1}{2}\|z - y\|^2. \tag{1}$$

(ii) $g$ is the indicator function of a convex set $\mathcal{K} \subseteq \mathbb{R}^d$:

$$g(z) = \begin{cases} 0 & \text{if } z \in \mathcal{K}, \\ +\infty & \text{otherwise.} \end{cases} \tag{2}$$

The former covers regularized optimization problems. This type of regularization is common in machine learning applications to promote a desired structure in the solution. The latter handles affine constraints of the form $Ax \in \mathcal{K}$. We can also attack the combination of both: the minimization of a regularized loss function subject to affine constraints.

In this paper, we propose a conditional-gradient-type method (aka Frank-Wolfe-type) for (P). In summary, our main contributions are as follows:
- We propose the first CGM variant for solving (P). By CGM variant, we mean that our method avoids projection onto $\mathcal{X}$ and uses the lmo of $\mathcal{X}$ instead. The majority of the known methods for (P) require projections onto $\mathcal{X}$.
- We prove an $\mathcal{O}(k^{-1/3})$ convergence rate on the objective residual when $g$ is Lipschitz-continuous.
- We prove an $\mathcal{O}(k^{-1/3})$ convergence rate on the objective residual and $\mathcal{O}(k^{-5/12})$ on the feasibility gap when $g$ is an indicator function. Surprisingly, affine constraints that make the lmo challenging for existing CGM variants can be easily incorporated in this framework by using smoothing.
- We provide empirical evidence that validates our theoretical findings. Our results highlight the benefits of our framework against projection-based algorithms.
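To make the two cases of $g$ concrete, the following is a minimal sketch (ours, not part of the paper's implementation) of a prox oracle for each case, taking $g = \lambda\|\cdot\|_1$ as a representative Lipschitz regularizer and a box $\mathcal{K}$ as a representative constraint set; `prox_l1` and `prox_box` are illustrative helper names.

```python
import numpy as np

# Case (i): g = lam * ||.||_1 is Lipschitz continuous, and its proximal
# operator (1) is the elementwise soft-thresholding map.
def prox_l1(y, lam):
    """prox_{lam ||.||_1}(y) = argmin_z lam*||z||_1 + 0.5*||z - y||^2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Case (ii): g is the indicator (2) of a box K = [lo, hi]^d. The prox of
# an indicator function reduces to the Euclidean projection onto the set.
def prox_box(y, lo, hi):
    """prox of the indicator of K: project y onto the box [lo, hi]^d."""
    return np.clip(y, lo, hi)
```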
### 1.1 Motivation: Stochastic Semidefinite Programming

Consider the following stochastic semidefinite programming template: minimization of a convex function over the positive-semidefinite cone subject to affine constraints,

$$\min_{X \in \mathbb{S}^n_+, \; \mathrm{tr}(X) \leq \beta} \; \mathbb{E}_\omega f(X, \omega) \quad \text{subject to} \quad AX \in \mathcal{K}, \tag{3}$$

where $\mathbb{S}^n_+$ denotes the positive-semidefinite cone. We are interested in solving (3) rather than the classical SDP since it does not require access to the whole data at one time. This creates a new vein of SDP applications in machine learning. Examples span online variants of clustering [33], streaming PCA [4], kernel learning [24], community detection [1], optimal power-flow [29], etc.

**Example: Clustering.** Consider the SDP formulation of the k-means clustering problem [33]:

$$\min_{X \in \mathbb{S}^n_+, \; \mathrm{tr}(X) = k} \; \langle D, X \rangle \quad \text{subject to} \quad X 1_n = 1_n, \; X \geq 0. \tag{4}$$

Here, $1_n$ denotes the vector of ones, $X \geq 0$ enforces entrywise non-negativity, and $D$ is the Euclidean distance matrix. Classical SDP solvers assume that we can access the whole data matrix $D$ at each time instance. By considering (3), we can solve this problem using only a subset of the entries of $D$ at each iteration. Remark that a subset of the entries of $D$ can be computed from a subset of the datapoints, since $D$ is the Euclidean distance matrix.

We can attack (P), and (3) as a special case, by using operator splitting methods, assuming that we can efficiently project a point onto $\mathcal{X}$ (see [2] and the references therein). However, projection onto the semidefinite cone may require a full eigendecomposition, which imposes a computational bottleneck (with its cubic cost) even for medium-scale problems with a few thousand dimensions.

When affine constraints are absent from the formulation (3), we can use stochastic CGM variants from the literature. The main workhorse of these methods is the so-called linear minimization oracle:

$$S = \arg\min_{Y \in \mathbb{S}^n_+, \; \mathrm{tr}(Y) \leq \beta} \; \langle \nabla f(X, \omega), \, Y \rangle.$$

We can compute $S$ if we can find an eigenvector that corresponds to the smallest eigenvalue of $\nabla f(X, \omega)$. We can compute these eigenvectors efficiently by using shifted power methods or randomized subspace iterations [12]; a minimal sketch is given at the end of this subsection. When we also consider affine constraints in our problem template, however, the lmo becomes an SDP instance in the canonical form. In this setting, neither the projection nor the lmo is easy to compute. To our knowledge, no existing CGM variant is effective for solving (3) (and (P)). We specifically bridge this gap.
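As an illustration of the preceding paragraph, here is a minimal sketch of the lmo over $\{Y \succeq 0, \ \mathrm{tr}(Y) \le \beta\}$, using SciPy's Lanczos-based eigensolver in place of the shifted power method or randomized subspace iterations of [12]; the helper name is ours.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lmo_psd_trace(grad, beta):
    """argmin over {Y psd, tr(Y) <= beta} of <grad, Y>. The minimum is
    attained at beta * u u^T, where u is a unit eigenvector for the
    smallest eigenvalue of grad, whenever that eigenvalue is negative;
    otherwise Y = 0 is optimal."""
    w, u = eigsh(grad, k=1, which='SA')  # smallest algebraic eigenpair
    if w[0] >= 0.0:
        return np.zeros_like(grad)
    u = u[:, 0]
    return beta * np.outer(u, u)
```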
### 1.2 Notation and Preliminaries

We denote the expectation with respect to the random variable $\omega$ by $\mathbb{E}_\omega$, and the expectation with respect to the sources of randomness in the optimization simply by $\mathbb{E}$. Furthermore, we denote $f^\star := \mathbb{E}_\omega f(x^\star, \omega)$, where $x^\star$ is a solution of (P). Throughout the paper, $y^\star$ represents a solution of the dual problem of (P). We assume that strong duality holds. Slater's condition is a common sufficient condition for strong duality that implies the existence of a dual solution with finite norm.

**Solution.** We denote a solution to (P) and the optimal value by $x^\star$ and $F^\star$ respectively:

$$F^\star = F(x^\star) \leq F(x), \quad \forall x \in \mathcal{X}. \tag{5}$$

$x \in \mathcal{X}$ is an $\varepsilon$-suboptimal solution (or simply an $\varepsilon$-solution) if and only if

$$F(x) - F^\star \leq \varepsilon. \tag{6}$$

**Stochastic first-order oracle (sfo).** For the stochastic function $\mathbb{E}_\omega f(x, \omega)$, suppose that we have access to a stochastic first-order oracle that returns a pair $(f(x, \omega), \nabla f(x, \omega))$ given $x$, where $\omega$ is an iid sample from the distribution $P$.

**Lipschitz continuity & smoothness.** A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz continuous if

$$|g(z_1) - g(z_2)| \leq L \|z_1 - z_2\|, \quad \forall z_1, z_2 \in \mathbb{R}^d. \tag{7}$$

A differentiable function $f$ is said to be $L$-smooth if its gradient $\nabla f$ is $L$-Lipschitz continuous.

## 2 Stochastic Homotopy CGM

**Algorithm 1 (SHCGM)**

Input: $x_1 \in \mathcal{X}$, $\beta_0 > 0$, $d_0 = 0$
for $k = 1, 2, \ldots$ do
&nbsp;&nbsp; $\eta_k = 9/(k+8)$, $\quad \beta_k = \beta_0/(k+8)^{1/2}$, $\quad \rho_k = 4/(k+7)^{2/3}$
&nbsp;&nbsp; $d_k = (1 - \rho_k)\, d_{k-1} + \rho_k \nabla_x f(x_k, \omega_k)$
&nbsp;&nbsp; $v_k = d_k + \beta_k^{-1} A^\top \big(A x_k - \mathrm{prox}_{\beta_k g}(A x_k)\big)$
&nbsp;&nbsp; $s_k = \arg\min_{x \in \mathcal{X}} \langle v_k, x \rangle$
&nbsp;&nbsp; $x_{k+1} = x_k + \eta_k (s_k - x_k)$
end for

Most stochastic CGM variants require the minibatch size to increase in order to reduce the variance of the gradient estimator. However, Mokhtari et al. [31] have recently shown that the following (biased) estimator, which can be implemented with a single sample, can be incorporated into the CGM analysis:

$$d_k = (1 - \rho_k)\, d_{k-1} + \rho_k \nabla_x f(x_k, \omega_k). \tag{8}$$

The resulting method guarantees an $\mathcal{O}(1/k^{1/3})$ convergence rate for convex smooth minimization, but it does not apply to our composite problem template (P). On the other hand, we introduced a CGM variant for composite problems (which also covers affine constraints) in the deterministic setting in our prior work [41]. That framework combines Nesterov smoothing [32] (and the quadratic penalty for affine constraints) with the CGM analysis. Unfortunately, this method does not work for stochastic problems.

In this paper, we propose the Stochastic Homotopy Conditional Gradient Method (SHCGM) for solving (P). The proposed method combines the stochastic CGM of [31] with our (deterministic) CGM for composite problems [41] in a non-trivial way. Remark that the following formulation uniformly covers Nesterov smoothing (with the Euclidean prox-function $\frac{1}{2}\|\cdot\|^2$) and the quadratic penalty (but the analyses for these two cases differ):

$$g_\beta(z) = \max_{y \in \mathbb{R}^d} \; \langle z, y \rangle - g^*(y) - \frac{\beta}{2}\|y\|^2, \quad \text{where} \quad g^*(y) = \max_{z \in \mathbb{R}^d} \; \langle z, y \rangle - g(z) \tag{9}$$

denotes the Fenchel conjugate of $g$. We call $g_\beta$ the smooth approximation of $g$, parametrized by the penalty (or smoothing) parameter $\beta > 0$. It is easy to show that $g_\beta$ is $1/\beta$-smooth. Remark that the gradient of $g_\beta$ can be computed by the following formula:

$$\nabla_x g_\beta(Ax) = A^\top \mathrm{prox}_{\beta^{-1} g^*}\big(\beta^{-1} A x\big) = \beta^{-1} A^\top \big(Ax - \mathrm{prox}_{\beta g}(Ax)\big), \tag{10}$$

where the second equality follows from the Moreau decomposition.

The main idea is to replace the non-smooth component $g$ by the smooth approximation $g_\beta$ in (P). Clearly, the solutions of (P) with $g(Ax)$ and with $g_\beta(Ax)$ do not coincide for any fixed value of $\beta$; however, $g_\beta \to g$ as $\beta \to 0$. Hence, we adopt a homotopy technique: we decrease $\beta$ at a controlled rate as we progress in the optimization procedure, so that the decision variable converges to a solution of the original problem.

SHCGM is characterized by the following iterative steps:
- Decrease the step-size, smoothing, and gradient-averaging parameters $\eta_k$, $\beta_k$ and $\rho_k$.
- Call the stochastic first-order oracle and compute the gradient estimator $d_k$ in (8).
- Compute the gradient estimator $v_k$ for the smooth approximation of the composite objective,
$$F_{\beta_k}(x) = \mathbb{E}_\omega f(x, \omega) + g_{\beta_k}(Ax) \;\Longrightarrow\; v_k = d_k + \nabla_x g_{\beta_k}(A x_k). \tag{11}$$
- Compute the lmo with respect to $v_k$.
- Perform a CGM step to find the next iterate.

The roles of $\rho_k$ and $\beta_k$ are coupled: the former controls the variance of the gradient estimator, and the latter decides how fast we reduce the smoothing parameter to approach the original problem. A carefully tuned interaction between these two parameters allows us to prove the convergence rates below; a compact sketch of the loop follows.
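For concreteness, the following is a minimal sketch of Algorithm 1, assuming user-supplied oracles: `sfo(x, k)` returns a stochastic gradient $\nabla_x f(x, \omega_k)$, `lmo(v)` solves the linear minimization over $\mathcal{X}$, `prox_g(z, beta)` evaluates $\mathrm{prox}_{\beta g}$, and `A`/`At` apply the linear map and its adjoint. All of these names are ours; the paper's own implementation is in the supplements.

```python
import numpy as np

def shcgm(x1, beta0, sfo, lmo, prox_g, A, At, n_iters):
    """Sketch of SHCGM (Algorithm 1) with user-supplied oracles."""
    x = x1
    d = np.zeros_like(x1)                      # d_0 = 0
    for k in range(1, n_iters + 1):
        eta = 9.0 / (k + 8)                    # step size
        beta = beta0 / np.sqrt(k + 8)          # smoothing parameter
        rho = 4.0 / (k + 7) ** (2.0 / 3.0)     # averaging weight
        d = (1.0 - rho) * d + rho * sfo(x, k)  # biased estimator (8)
        Ax = A(x)
        v = d + At(Ax - prox_g(Ax, beta)) / beta  # gradient of (11)
        s = lmo(v)                             # linear minimization step
        x = x + eta * (s - x)                  # CGM update
    return x
```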
**Assumption (Bounded variance).** We assume that the following bounded-variance condition holds:

$$\mathbb{E}\,\big\|\nabla_x f(x, \omega) - \nabla_x \mathbb{E}_\omega f(x, \omega)\big\|^2 \;\leq\; \sigma^2 \;<\; +\infty. \tag{12}$$

**Theorem 1 (Lipschitz-continuous regularizer).** Assume that $g : \mathbb{R}^d \to \mathbb{R}$ is $L_g$-Lipschitz continuous. Then, the sequence $x_k$ generated by Algorithm 1 satisfies the following convergence bound:

$$\mathbb{E}\, F(x_{k+1}) - F^\star \;\leq\; \frac{C}{(k+8)^{1/3}}, \tag{13}$$

where $C := 81\, D_{\mathcal{X}}^2 \big(L_f + \beta_0^{-1}\|A\|^2\big) + 36\, \sigma D_{\mathcal{X}} + 27\,\cdots$ (here $D_{\mathcal{X}}$ denotes the diameter of $\mathcal{X}$ and $L_f$ the smoothness constant of $\mathbb{E}_\omega f(\cdot, \omega)$).

*Proof sketch.* The proof follows these steps: (i) relate the stochastic gradient to the full gradient (Lemma 7); (ii) show convergence of the gradient estimator to the full gradient (Lemma 8); (iii) show an $\mathcal{O}(1/k^{1/3})$ convergence rate on the smoothed gap $\mathbb{E}\, F_{\beta_k}(x_{k+1}) - F^\star$ (Theorem 9); (iv) translate this bound into the actual suboptimality $\mathbb{E}\, F(x_{k+1}) - F^\star$ by using the envelope property of Nesterov smoothing, see Equation (2.7) in [32].

Convergence rate guarantees for stochastic CGM with Lipschitz-continuous $g$ (also based on Nesterov smoothing) are already known in the literature; see [16, 22, 23] for examples. Our rate is not faster than the ones in [22, 23], but we obtain an $\mathcal{O}(1/\varepsilon^{3})$ sample complexity in the statistical setting as opposed to $\mathcal{O}(1/\varepsilon^{4})$.

In contrast with the existing stochastic CGM variants, our algorithm can also handle affine constraints. Remark that indicator functions are not Lipschitz continuous, hence the Nesterov smoothing technique does not work for affine constraints.

**Assumption (Strong duality).** For problems with affine constraints, we further assume that strong duality holds. Slater's condition is a common sufficient condition for strong duality. By Slater's condition, we mean

$$\mathrm{relint}(\mathcal{X} \times \mathcal{K}) \;\cap\; \big\{(x, r) \in \mathbb{R}^n \times \mathbb{R}^d : Ax = r\big\} \;\neq\; \emptyset.$$

Recall that strong duality ensures the existence of a finite dual solution.

**Theorem 2 (Affine constraints).** Suppose that $g$ is the indicator function of a simple convex set $\mathcal{K} \subseteq \mathbb{R}^d$. Assuming that strong duality holds, the sequence $x_k$ generated by SHCGM satisfies

$$\mathbb{E}\,\mathbb{E}_\omega f(x_{k+1}, \omega) - f^\star \;\geq\; -\,\|y^\star\| \; \mathbb{E}\,\mathrm{dist}(A x_{k+1}, \mathcal{K}),$$
$$\mathbb{E}\,\mathbb{E}_\omega f(x_{k+1}, \omega) - f^\star \;\leq\; \frac{C}{(k+8)^{1/3}},$$
$$\mathbb{E}\,\mathrm{dist}(A x_{k+1}, \mathcal{K}) \;\leq\; \frac{2\beta_0 \|y^\star\|}{\sqrt{k+8}} + \mathcal{O}\!\big((k+8)^{-5/12}\big).$$

*Proof sketch.* We re-use the ingredients of the proof of Theorem 1, except that at step (iv) we translate the bound on the smoothed gap (penalized objective) into the actual convergence measures (objective residual and feasibility gap) by using the Lagrange saddle-point formulation and strong duality. See Corollaries 1 and 2.

**Remark (Comparison to baseline).** SHCGM combines ideas from [31] and [41]. Surprisingly,
- the $\mathcal{O}(1/k^{1/3})$ rate in the objective residual matches the rate in [31] for smooth minimization;
- the $\mathcal{O}(1/k^{5/12})$ rate in the feasibility gap is only a factor of $k^{1/12}$ worse than the deterministic variant in [41].

**Remark (Inexact oracles).** We assume exact lmo solutions in SHCGM in Theorems 1 and 2. In many applications, however, it is much easier to find an approximate lmo solution; for instance, this is the case for the SDP problems in Section 1.1. To this end, we extend our results to inexact lmo calls with additive and multiplicative error in the supplements.

**Remark (Splitting).** An important use-case of affine constraints in (P) is splitting (see Section 5.6 in [41]). Suppose that $\mathcal{X}$ can be written as the intersection of two (or more) sets $\mathcal{A} \cap \mathcal{B}$ that are simpler (in terms of the computational cost of the lmo or the projection). By using the standard product-space technique, we can reformulate the problem in the extended space $(x, y) \in \mathcal{A} \times \mathcal{B}$ with the constraint $x = y$:

$$\min_{(x, y) \in \mathcal{A} \times \mathcal{B}} \; \mathbb{E}_\omega f(x, \omega) \quad \text{subject to} \quad x = y. \tag{16}$$

This allows us to decompose the difficult optimization domain $\mathcal{X}$ into simpler pieces. SHCGM requires the lmo of $\mathcal{A}$ and the lmo of $\mathcal{B}$ separately (see the sketch after this remark). Alternatively, we can also use the projection onto one of the component sets (say $\mathcal{B}$) by reformulating the problem in the domain $\mathcal{A}$ with the affine constraint $x \in \mathcal{B}$:

$$\min_{x \in \mathcal{A}} \; \mathbb{E}_\omega f(x, \omega) \quad \text{subject to} \quad x \in \mathcal{B}. \tag{17}$$
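As a concrete illustration of (16), a minimal sketch (with our own helper names `lmo_A`, `lmo_B`) of how the product-space oracles decompose: the lmo over $\mathcal{A} \times \mathcal{B}$ splits into the two component lmos, and the coupling $x = y$ becomes the affine map $z = (x, y) \mapsto x - y$ with $\mathcal{K} = \{0\}$, whose prox is the zero map.

```python
import numpy as np

def lmo_product(v, lmo_A, lmo_B):
    """lmo over A x B for the splitting (16): the linear objective
    separates over the blocks (vx, vy), so the minimization does too."""
    vx, vy = v
    return (lmo_A(vx), lmo_B(vy))

def affine_map(z):
    """The coupling constraint x = y, written as A z in K = {0}."""
    x, y = z
    return x - y

def prox_zero(r, beta):
    """prox of the indicator of K = {0}: projection onto the origin."""
    return np.zeros_like(r)
```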
An important example is the completely positive cone (the intersection of the positive-semidefinite cone and the first orthant). Remark that the clustering SDP example in Section 1.1 is also defined on this cone. While the lmo of this intersection can only be evaluated in $\mathcal{O}(n^3)$ computation by using the Hungarian method, we can compute the lmo for the semidefinite cone and the projection onto the first orthant much more efficiently.

## 3 Related Works

CGM dates back to the 1956 paper of Frank and Wolfe [8]. It did not attract much interest in machine learning until the last decade because of its slower convergence rate in comparison with (projected) accelerated gradient methods. However, there has been a resurgence of interest in CGM and its variants, following the seminal papers of Hazan [14] and Jaggi [18]. They demonstrated that CGM might offer superior computational complexity compared with state-of-the-art methods in many large-scale optimization problems (that arise in machine learning) despite its slower convergence rate, thanks to its lower per-iteration cost.

The original method by Frank and Wolfe [8] was proposed for smooth convex minimization on polytopes. The analysis was extended to smooth convex minimization on the simplex by Clarkson [3], on the spectrahedron by Hazan [14], and finally to arbitrary compact convex sets by Jaggi [18]. All these methods are restricted to smooth problems. Lan [21] proposed a variant for non-smooth minimization based on the Nesterov smoothing technique. Lan and Zhou [23] also introduced the conditional gradient sliding method and extended it to non-smooth minimization in a similar way. These methods, however, are not suitable for solving (P) because we allow $g$ to be an indicator function, which is not smoothing-friendly.

In prior work [41], we introduced the homotopy CGM (HCGM) for composite problems (also with affine constraints). HCGM combines the Nesterov smoothing and quadratic penalty techniques under the CGM framework. It has $\mathcal{O}(1/\varepsilon^2)$ iteration complexity. In a follow-up work [40], we extended this method from the quadratic penalty to an augmented Lagrangian formulation for empirical benefits. Gidel et al. [10] also proposed an augmented Lagrangian CGM, but the analysis and guarantees differ. We refer to the references in [40, 41] for other variants in this direction.

So far, we have focused on deterministic variants of CGM. The literature on stochastic variants is much younger. We can trace it back to Hazan and Kale's projection-free methods for online learning [16]. When $g$ is a non-smooth but Lipschitz-continuous function, their method returns an $\varepsilon$-solution in $\mathcal{O}(1/\varepsilon^4)$ iterations. The standard extension of CGM to the stochastic setting achieves $\mathcal{O}(1/\varepsilon)$ iteration complexity for smooth minimization, but with an increasing minibatch size; overall, this method requires $\mathcal{O}(1/\varepsilon^3)$ sample complexity, see [17] for the details. More recently, Mokhtari et al. [31] proposed a new variant with an $\mathcal{O}(1/\varepsilon^3)$ convergence rate whose iterations can work with a single sample. Hazan and Luo [17] and Yurtsever et al. [42] incorporated various variance-reduction techniques for further improvements. Goldfarb et al. [11] introduced two stochastic CGM variants, with away-steps and pairwise-steps. These methods enjoy a linear convergence rate (although the batch size increases exponentially), but only for strongly convex objectives over polytope domains. None of these stochastic CGM variants works for non-smooth (or composite) problems.
The non-smooth conditional gradient sliding method by Lan and Zhou [23] also has extensions to the stochastic setting, and there is a lazy variant with further improvements by Lan et al. [22]. Note, however, that similarly to their deterministic counterparts, these methods are based on Nesterov smoothing and are not suitable for problems with affine constraints.

Garber and Kaplan [9] consider problem (P). They also propose a variance-reduced algorithm, but this method in fact solves the smooth relaxation of (P) (see Definition 1, Section 4.1 in [9]). Contrary to SHCGM, this method might not asymptotically converge to a solution of the original problem.

Lu and Freund [28] also studied a similar problem template. However, their method incorporates the non-smooth term into the linear minimization oracle. This is restrictive in practice because the non-smooth term can increase the cost of the linear minimization; in particular, this is the case when $g$ is an indicator function, as in SDP problems. This is orthogonal to our scenario, in which the affine constraints are processed by smoothing, not directly through the lmo.

In recent years, CGM has also been extended to non-convex problems. These extensions are beyond the scope of this paper. We refer to Yu et al. [39] and Lacoste-Julien [19] for the non-convex extensions in the deterministic setting, and to Reddi et al. [34], Yurtsever et al. [42], and Shen et al. [37] in the stochastic setting.

To the best of our knowledge, SHCGM is the first CGM-type algorithm for solving (P) with cheap linear minimization oracles. Another popular approach for solving large-scale instances of (P) is operator splitting; see [2] and the references therein for stochastic operator splitting methods. Unfortunately, these methods still require a projection onto $\mathcal{X}$ at each iteration, which is arguably more expensive than the linear minimization. For instance, for solving (3), the projection has cubic cost (with respect to the problem dimension $n$), while the linear minimization can be solved efficiently using subspace iterations, as depicted in Table 1.

| Algorithm | Iteration complexity | Sample complexity | Solves (3) | Per-iteration cost (for (3)) |
|---|---|---|---|---|
| [41] | $\mathcal{O}(1/\varepsilon^2)$ | $N$ | Yes | $\mathcal{O}(N_r/\delta)$ |
| [9] | $\mathcal{O}(1/\varepsilon^2)$ | $\mathcal{O}(1/\varepsilon^4)$ | No | $\mathcal{O}(N_r/\delta)$ |
| [17] | $\mathcal{O}(1/\varepsilon)$ | $\mathcal{O}(1/\varepsilon^3)$ | No | $\mathcal{O}(N_r/\delta)$ |
| [28] | $\mathcal{O}(1/\varepsilon)$ | $\mathcal{O}(1/\varepsilon^2)$ | No | SDP |
| [15] | $\mathcal{O}(1/\varepsilon)$ | $N$ | No | $\mathcal{O}(N_r/\delta)$ |
| [2] | — | — | Yes | $\mathcal{O}(n^3)$ |
| SHCGM | $\mathcal{O}(1/\varepsilon^3)$ | $\mathcal{O}(1/\varepsilon^3)$ | Yes | $\mathcal{O}(N_r/\delta)$ |

Table 1: Existing algorithms to tackle (3). $N$ is the size of the dataset, $n$ is the dimension of each datapoint, $N_r$ is the number of non-zeros of the gradient, and $\delta$ is the accuracy of the approximate lmo. The per-iteration cost of [28] is the cost of solving an SDP in the canonical form. [2] has $\mathcal{O}(1/\varepsilon^2)$ iteration and sample complexity when the objective function is strongly convex; this is not the case in our model problem, and [2] only has an asymptotic convergence guarantee.

## 4 Numerical Evidence

This section presents the empirical performance of the proposed method for the stochastic k-means clustering, covariance matrix estimation, and matrix completion problems. We performed the experiments in MATLAB R2018a using a computing system with 4 Intel Xeon E5-2630 v3 CPUs @ 2.40 GHz and 16 GB RAM. We include the code to reproduce the results in the supplements.

### 4.1 Stochastic k-means Clustering

We consider the SDP formulation (4) of the k-means clustering problem. The same problem is used in the numerical experiments of Mixon et al. [30], and we design our experiment based on their problem setup¹ with a sample of 1000 datapoints from the MNIST data².
See [30] for details on the preprocessing.

We solve this problem with SHCGM and compare it against HCGM [41] as the baseline. HCGM is a deterministic algorithm, hence it uses the full gradient. For SHCGM, we compute a gradient estimator by randomly sampling 100 datapoints at each iteration. Remark that this corresponds to observing approximately 1 percent of the entries of $D$.

We use $\beta_0 = 1$ for HCGM and $\beta_0 = 10$ for SHCGM. We set these values by tuning both methods over $\beta_0 = 0.01, 0.1, \ldots, 1000$. We display the results in Figure 1, where we denote a full pass over the entries of $D$ as an epoch. Figure 1 demonstrates that SHCGM performs similarly to HCGM although it uses less data.

Figure 1: Comparison of SHCGM with HCGM for the k-means clustering SDP in Section 4.1.

### 4.2 Online Covariance Matrix Estimation

Covariance matrix estimation is an important problem in multivariate statistics with applications in many fields, including gene microarrays, social networks, finance, and climate analysis [35, 36, 7, 6]. In the online setting, we suppose that the data is received as a stream of datapoints in time.

The deterministic approach is to first collect some data, and then to train an empirical risk minimization model on the collected data. This has obvious limitations, since it may not be clear a priori how much data suffices to estimate the covariance matrix precisely. Furthermore, the data can be too large to store or work with as a batch. To this end, we consider an online learning setting, in which we use each datapoint as it arrives and then discard it.

¹ D. G. Mixon, S. Villar, R. Ward. Available at https://github.com/solevillar/kmeans_sdp
² Y. LeCun and C. Cortes. Available at http://yann.lecun.com/exdb/mnist/

Figure 2: SHCGM and HCGM on online covariance matrix estimation from streaming data.

Let us consider the following sparse covariance matrix estimation template (this template also covers other problems such as graph denoising and link prediction [35]):

$$\min_{X \in \mathbb{S}^n_+, \; \mathrm{tr}(X) \leq \beta_1} \; \mathbb{E}_\omega \, \|X - \omega\omega^\top\|_F^2 \quad \text{subject to} \quad \|X\|_1 \leq \beta_2, \tag{18}$$

where $\|X\|_1$ denotes the $\ell_1$-norm (the sum of the absolute values of the entries).

Our test setup is as follows: We first create a block-diagonal covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ using 10 blocks of the form $\varphi\varphi^\top$, where the entries of $\varphi$ are drawn uniformly at random from $[-1, 1]$. This gives us a sparse matrix of rank 10. Then, as datapoints, we stream observations of $\Sigma$ in the form $\omega_i \sim \mathcal{N}(0, \Sigma)$. We fix the problem dimension $n = 1000$.

We compare SHCGM with the deterministic method, HCGM. We use $\beta_0 = 1$ for both methods. Both methods require the lmo for the positive-semidefinite cone with the trace constraint, and the projection oracle for the $\ell_1$-norm constraint, at each iteration (a sketch of these problem-specific oracles follows below).

We study two different setups: In Figure 2, we use SHCGM in the online setting, sampling a new datapoint at each iteration. HCGM, on the other hand, does not work in the online setting; hence, we use the same sample of datapoints for all of its iterations. We consider 4 different cases for HCGM, with sample sizes of 10, 50, 100 and 200 datapoints. Although this approach converges fast up to some accuracy, the objective value saturates at some estimation accuracy; naturally, HCGM achieves higher accuracy as the sample size increases. We can also read the empirical convergence rates of SHCGM from Figure 2: approximately $\mathcal{O}(k^{-1/2})$ for the objective residual and $\mathcal{O}(k^{-1})$ for the feasibility gap, significantly better than the theoretical guarantees.
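For reference, a minimal sketch (our own naming) of the two problem-specific ingredients mentioned above: the one-sample gradient of $\mathbb{E}_\omega\|X - \omega\omega^\top\|_F^2$, and the Euclidean projection onto the $\ell_1$-ball, which is the prox of the indicator constraint in (18), via the standard sort-based algorithm.

```python
import numpy as np

def stoch_grad(X, omega):
    """One-sample gradient of E ||X - ww^T||_F^2: grad = 2 (X - ww^T)."""
    return 2.0 * (X - np.outer(omega, omega))

def project_l1_ball(X, radius):
    """Euclidean projection of X onto {Z : ||Z||_1 <= radius}, applied
    to the flattened entries (sort-based simplex-projection algorithm)."""
    v = X.ravel()
    if np.abs(v).sum() <= radius:
        return X
    u = np.sort(np.abs(v))[::-1]            # magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    k = np.nonzero(u * idx > css - radius)[0][-1]  # largest valid index
    theta = (css[k] - radius) / (k + 1.0)   # soft-threshold level
    w = np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
    return w.reshape(X.shape)
```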
Figure 3: Comparison of SHCGM with HCGM (batch size 200) for online covariance matrix estimation.

If we can store larger samples, we can also use minibatches with the stochastic method. Figure 3 compares the deterministic approach with 200 datapoints against the stochastic approach with a minibatch size of 200: while the deterministic method uses the same 200 datapoints for all iterations, SHCGM draws a fresh batch of 200 datapoints at each iteration.

### 4.3 Stochastic Matrix Completion

We consider the problem of matrix completion with the following mathematical formulation:

$$\min_{\|X\|_* \leq \beta_1} \; \sum_{(i,j) \in \Omega} (X_{i,j} - Y_{i,j})^2 \quad \text{subject to} \quad 1 \leq X \leq 5, \tag{19}$$

where $\Omega$ is the set of observed ratings (samples of entries of the true matrix $Y$ that we try to recover), and $\|X\|_*$ denotes the nuclear norm (the sum of the singular values). The affine constraint $1 \leq X \leq 5$ imposes a hard threshold on the estimated ratings (in other words, on the entries of $X$).

| | Train RMSE | Test RMSE |
|---|---|---|
| SHCGM | 0.5574 ± 0.0498 | 1.1446 ± 0.0087 |
| SFW | 1.8360 ± 0.3266 | 2.0416 ± 0.2739 |

Figure 4: Training error, feasibility gap, and test error for MovieLens 100k. The table shows the mean and standard deviation of the train and test RMSE over 5 different train/test splits at the end of $10^4$ iterations.

We first compare SHCGM with the Stochastic Frank-Wolfe (SFW) method from [31]. We consider a test setup with the MovieLens 100k dataset³ [13]. This dataset contains 100,000 integer-valued ratings between 1 and 5, assigned by 943 users to 1682 movies. The aim of this experiment is to emphasize the flexibility of SHCGM: recall that SFW does not directly apply to (19), as it cannot handle the affine constraint $1 \leq X \leq 5$. Therefore, we apply SFW to a relaxation of (19) that omits this constraint; then we solve (19) with SHCGM and compare the results.

We use the default ub.train and ub.test partitions provided with the original data. We set the model parameter for the nuclear-norm constraint to $\beta_1 = 7000$, and the initial smoothing parameter to $\beta_0 = 10$. At each iteration, we compute a gradient estimator from 1000 iid samples. We perform the same test independently 10 times to compute the average performance and confidence intervals. In Figure 4, we report the training and test errors (root mean squared error) as well as the feasibility gap; the solid lines display the average performance, and the shaded areas show one standard deviation. Note that SHCGM performs uniformly better than SFW, in terms of both the training and test errors. The table shows the values achieved at the end of 100,000 iterations.

Finally, we compare SHCGM with the stochastic three-composite convex minimization method (S3CCM) from [43]. S3CCM is a projection-based method that applies to (19). In this experiment, we aim to demonstrate the advantages of projection-free methods for large-scale problems. We consider a test setup with the MovieLens 1M dataset³, with 1 million ratings from ~6000 users on ~4000 movies. We partition the data into training and test samples with an 80/20 train/test split. We use 100,000 iid samples at each iteration to compute a gradient estimator. We set the model parameter $\beta_1 = 200{,}000$. We use $\beta_0 = 10$ for SHCGM, and we set the step-size parameter $\gamma = 1$ for S3CCM. We implement the lmo efficiently using the power method, and we refer to the code in the supplements for details on the implementation (a schematic sketch of the oracles follows below).

Figure 5: SHCGM vs. S3CCM on MovieLens 1M.

Figure 5 reports the outcomes of this experiment. SHCGM clearly outperforms S3CCM in this test.
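A minimal sketch of the oracles this formulation calls for, with our own helper names: the sparse sample gradient over a batch of observed ratings, the prox of the box indicator $1 \le X \le 5$ (entrywise clipping), and a rank-one lmo over the nuclear-norm ball computed from the top singular pair (here via SciPy's `svds`, standing in for the power method used in our implementation).

```python
import numpy as np
from scipy.sparse.linalg import svds

def stoch_grad_mc(X, rows, cols, vals):
    """Sample gradient of the loss over a batch of observed ratings:
    nonzero only on the sampled entries, equal to 2 * (X_ij - Y_ij)."""
    G = np.zeros_like(X)
    G[rows, cols] = 2.0 * (X[rows, cols] - vals)
    return G

def prox_ratings_box(Z, beta):
    """g = indicator of {1 <= Z <= 5}: its prox (for any beta) is the
    entrywise projection onto the box of valid ratings."""
    return np.clip(Z, 1.0, 5.0)

def lmo_nuclear(G, beta1):
    """lmo over {||X||_* <= beta1}: the minimizer of <G, X> is the
    rank-one matrix -beta1 * u1 v1^T for the top singular pair of G."""
    u, s, vt = svds(G, k=1)
    return -beta1 * np.outer(u[:, 0], vt[0])
```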
We run both methods for 2 hours. Within this time limit, SHCGM performs 270,860 iterations, while S3CCM completes only 435 because of the high computational cost of the projection.

## 5 Conclusions

We introduced a scalable stochastic CGM-type method for solving convex optimization problems with affine constraints, and we demonstrated the empirical superiority of our approach in various numerical experiments. In particular, we considered the stochastic optimization of SDPs, for which we give the first projection-free algorithm. In general, we showed that our algorithm provably converges to an optimal solution of (P) with $\mathcal{O}(k^{-1/3})$ and $\mathcal{O}(k^{-5/12})$ rates in the objective residual and feasibility gap respectively, with an $\mathcal{O}(1/\varepsilon^3)$ sample complexity in the statistical setting. The possibility of a faster rate with the same (or even better) sample complexity remains an open question, as does an adaptive approach that achieves the $\mathcal{O}(k^{-1/2})$ rate when fed with exact gradients.

³ F. M. Harper, J. A. Konstan. Available at https://grouplens.org/datasets/movielens/

## Acknowledgements

Francesco Locatello has received funding from the Max Planck ETH Center for Learning Systems, an ETH core grant (to Gunnar Rätsch), and a Google Ph.D. Fellowship. Volkan Cevher and Alp Yurtsever have received funding from the Swiss National Science Foundation (SNSF) under grant number 200021_178865/1, and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594 – time-data).

## References

[1] E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:1–86, 2018.
[2] V. Cevher, B. C. Vũ, and A. Yurtsever. Stochastic forward Douglas-Rachford splitting method for monotone inclusions. In P. Giselsson and A. Rantzer, editors, Large Scale and Distributed Optimization, chapter 7, pages 149–179. Springer International Publishing, 2018.
[3] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4), 2010.
[4] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[5] C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal-dual rates and certificates. In Proc. 33rd International Conference on Machine Learning, 2016.
[6] J. Fan, F. Han, and H. Liu. Challenges of big data analysis. National Science Review, 1(2):293–314, 2014.
[7] J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016.
[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
[9] D. Garber and A. Kaplan. Fast stochastic algorithms for low-rank and nonsmooth matrix problems. arXiv:1809.10477, 2018.
[10] G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In Proc. 21st International Conference on Artificial Intelligence and Statistics, 2018.
[11] D. Goldfarb, G. Iyengar, and C. Zhou. Linear convergence of stochastic Frank-Wolfe variants. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[12] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[13] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
[14] E. Hazan. Sparse approximate solutions to semidefinite programs. In Proc. 8th Latin American Conference on Theoretical Informatics, pages 306–316, 2008.
[15] E. Hazan. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, pages 306–316. Springer, 2008.
[16] E. Hazan and S. Kale. Projection-free online learning. In Proc. 29th International Conference on Machine Learning, 2012.
[17] E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In Proc. 33rd International Conference on Machine Learning, 2016.
[18] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. 30th International Conference on Machine Learning, 2013.
[19] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.
[20] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. 30th International Conference on Machine Learning, 2013.
[21] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2014.
[22] G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent. arXiv:1703.05840, 2017.
[23] G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM J. Optim., 26(2):1379–1409, 2016.
[24] G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004.
[25] F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[26] F. Locatello, A. Raj, S. P. Karimireddy, G. Rätsch, B. Schölkopf, S. U. Stich, and M. Jaggi. On matching pursuit and coordinate descent. arXiv:1803.09539, 2018.
[27] F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems 30, 2017.
[28] H. Lu and R. M. Freund. Generalized stochastic Frank-Wolfe algorithm with stochastic substitute gradient for structured convex optimization. arXiv:1807.07680, 2018.
[29] R. Madani, S. Sojoudi, and J. Lavaei. Convex relaxation for optimal power flow problem: Mesh networks. IEEE Trans. on Power Syst., 30(1):199–211, 2015.
[30] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, 6(4):389–415, 2017.
[31] A. Mokhtari, H. Hassani, and A. Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv:1804.09554, 2018.
[32] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005.
[33] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. Optim., 18(1):186–205, 2007.
[34] S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. arXiv:1607.08254, 2016.
[35] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proc. 29th International Conference on Machine Learning, 2012.
[36] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
[37] Z. Shen, C. Fang, P. Zhao, J. Huang, and H. Qian. Complexities in projection-free stochastic non-convex minimization. In Proc. 22nd International Conference on Artificial Intelligence and Statistics, 2019.
[38] Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., 28(1):96–134, 2018.
[39] Y. Yu, X. Zhang, and D. Schuurmans. Generalized conditional gradient for sparse estimation. arXiv:1410.4828v1, 2014.
[40] A. Yurtsever, O. Fercoq, and V. Cevher. A conditional-gradient-based augmented Lagrangian framework. In Proc. 36th International Conference on Machine Learning, 2019.
[41] A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proc. 35th International Conference on Machine Learning, 2018.
[42] A. Yurtsever, S. Sra, and V. Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In Proc. 36th International Conference on Machine Learning, 2019.
[43] A. Yurtsever, B. C. Vũ, and V. Cevher. Stochastic three-composite convex minimization. In Advances in Neural Information Processing Systems 29, 2016.