# Non-convex Conditional Gradient Sliding

Chao Qu¹, Yan Li², Huan Xu²

*Equal contribution. ¹EE Faculty, Technion, Israel. ²H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, USA. Correspondence to: Chao Qu. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We investigate projection-free optimization methods, namely non-convex conditional gradient sliding (NCGS), for non-convex optimization problems in the batch, stochastic, and finite-sum settings. The conditional gradient sliding (CGS) method, by integrating Nesterov's accelerated gradient method with the Frank-Wolfe (FW) method in a smart way, outperforms FW for convex optimization by reducing the number of gradient computations. However, the study of CGS in the non-convex setting is limited. In this paper, we propose non-convex conditional gradient sliding (NCGS) methods and analyze their convergence properties. We also leverage the idea of variance reduction from recent progress in convex optimization to obtain a new algorithm termed variance-reduced NCGS (NCGS-VR), which attains a faster convergence rate than batch NCGS in the finite-sum setting. We show that NCGS algorithms outperform their Frank-Wolfe counterparts both in theory and in practice in all three settings, namely batch, stochastic, and finite-sum. This significantly improves our understanding of optimizing non-convex functions with complicated feasible sets (where projection is prohibitively expensive).

1. Introduction

This paper studies non-convex optimization problems with complicated feasible sets. Specifically, we consider the following problem

$$\min_{\theta \in \Omega} F(\theta), \tag{1}$$

where the objective function F(θ) is non-convex and L-smooth, and Ω is a convex compact set. Besides this general form, we also consider a stochastic setting and a finite-sum setting. In the stochastic setting, we assume $F(\theta) = \mathbb{E}_{\xi} f(\theta, \xi)$, where f(θ, ξ) is smooth and non-convex. In the finite-sum case, we study the following problem

$$\min_{\theta \in \Omega} F(\theta) := \frac{1}{n} \sum_{i=1}^{n} f_i(\theta),$$

where each f_i(θ) is non-convex and L-smooth, and Ω is a convex compact set. Here, we are interested in the case where n, i.e., the number of training examples, is very large.

We focus on the case where the feasible set Ω is complicated, in the sense that projection onto Ω is expensive (for instance, projection onto the trace norm ball), or even computationally intractable (Collins et al., 2008). To alleviate such difficulty, the Frank-Wolfe method (Frank & Wolfe, 1956) (a.k.a. the conditional gradient method), which was initially developed for convex problems in the 1950s, has recently attracted much attention again in the machine learning community, due to its projection-free property (Jaggi, 2013). In each iteration, the Frank-Wolfe algorithm calls a first-order oracle to get $\nabla F(\theta)$, and then calls a linear oracle of the form $\arg\min_{\theta \in \Omega} \langle \theta, g \rangle$, which avoids the projection operation. This setup is motivated by several popular problems in machine learning, wherein the above linear optimization is easy but the projection is much more computationally demanding. The example par excellence is the nuclear norm constraint, which is widely used in multi-task learning, multi-class classification, recommendation systems, and matrix learning.
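To make this contrast concrete, consider the nuclear-norm ball Ω = {Θ : ‖Θ‖_* ≤ R}. Projecting onto Ω requires a full singular value decomposition, whereas the linear oracle $\arg\min_{\Theta \in \Omega} \langle \Theta, g \rangle$ only needs the top singular vector pair of g: the minimizer is $-R\, u_1 v_1^T$. Below is a minimal sketch of this oracle; the function name and the use of a dense SVD (rather than a faster truncated one) are our own illustration choices.

```python
import numpy as np

def nuclear_norm_lmo(g, R):
    """Linear oracle over the nuclear-norm ball {Theta : ||Theta||_* <= R}.

    Solves argmin_{||Theta||_* <= R} <Theta, g>. The solution is the rank-one
    matrix -R * u1 v1^T built from the top singular vector pair of g, so no
    projection (full SVD plus singular-value thresholding) is ever needed.
    """
    u, s, vt = np.linalg.svd(g, full_matrices=False)  # only the top pair is used;
    return -R * np.outer(u[:, 0], vt[0, :])           # dense SVD kept for clarity
```

For large matrices, the top pair can be obtained by a few power iterations or a truncated SVD, making each LO call far cheaper than a projection.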
We briefly explain some of them. Suppose there are m tasks and each column i of a matrix Θ represents task i. One popular multi-task learning formulation, proposed by Pong et al. (2010), has the following form:

$$\min_{\Theta, b} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \ell(y^i_j, \theta_i^T x^i_j + b_i) \quad \text{subject to} \quad \|\Theta\|_* \le R, \tag{2}$$

where ℓ(·, ·) is the loss function, which can potentially be non-convex, n_i is the number of samples in each task, Θ = [θ₁, ..., θ_m], and ‖·‖_* is the nuclear norm, whose constraint encourages the low-rank property. In the multi-class classification problem, suppose there are n training examples (x_i, y_i), i = 1, ..., n, where x_i is a feature vector and y_i is the label. The goal is to find an accurate linear predictor with parameter Θ = [θ₁, ..., θ_h]. In Zhang et al. (2012) and Dudik et al. (2012), the multivariate logistic loss with nuclear norm regularization is proposed, with the following form:

$$f_i(\Theta) = \log\Big(1 + \sum_{\ell \neq y_i} \exp(\theta_\ell^T x_i - \theta_{y_i}^T x_i)\Big), \qquad \Omega = \{\Theta : \|\Theta\|_* \le R\}.$$

The logistic loss can be replaced by a non-convex loss function due to the superior robustness and classification accuracy of non-convex losses (Mei et al., 2016). Apart from the nuclear norm constraint, other examples of complicated feasible sets include polytopes with exponentially many facets, often resulting from combinatorial problems (e.g., the convex hull of all spanning trees). We refer the reader to Garber & Hazan (2013) and Lacoste-Julien & Jaggi (2015) for details.

In the convex case, it is well known that the Frank-Wolfe method requires O(1/ϵ) iterations to achieve an ϵ-solution if F(θ) is convex and smooth. This rate is significantly worse than the optimal rate O(1/√ϵ) for smooth convex problems (Nesterov, 2013), which raises the question of whether this O(1/ϵ) complexity bound is improvable. Unfortunately, the answer is no in the general setting (Lan, 2013; Guzmán & Nemirovski, 2015), and improved results can only be obtained under stronger assumptions, see e.g., Garber & Hazan (2013; 2015); Lacoste-Julien & Jaggi (2015). Lan & Zhou (2016) proposed the conditional gradient sliding (CGS) method, which combines the idea of Nesterov's accelerated gradient with the Frank-Wolfe method. While the number of linear oracle calls remains the same, the number of gradient computations (calls to the first-order oracle) is significantly improved from O(1/ϵ) to O(1/√ϵ). Under a strong convexity assumption, this can be further improved to O(log(1/ϵ)) using restarting techniques (Lan & Zhou, 2016).

Recently, non-convex optimization has attracted much attention and become a frontier of machine learning; a partial list of applications includes robust regression and classification (Mei et al., 2016), dictionary learning (Mairal et al., 2009), phase retrieval (Candes et al., 2015), and training neural networks (Goodfellow et al., 2016). Accordingly, the convergence of the non-convex Frank-Wolfe method has been studied in the batch, stochastic, and finite-sum settings (Lacoste-Julien, 2016; Reddi et al., 2016b). A natural question arises: can we apply the sliding technique of convex CGS in the non-convex setting and improve the complexity of the first-order oracle? This paper provides an affirmative answer.

Summary of contributions: We propose the non-convex conditional gradient sliding (NCGS) method and provide a convergence analysis in the batch and stochastic settings. Compared to the convex CGS, the difficulty comes from handling the non-convexity in the analysis.
In the finite-sum setting, we propose the variance-reduced non-convex conditional gradient sliding method (NCGS-VR), which achieves faster convergence than the batch method; here we need to carefully balance the numbers of calls to the linear oracle and the first-order oracle. All results are summarized in Tables 1, 2, and 3. Table 1 compares non-convex conditional gradient sliding with the non-convex Frank-Wolfe method in the batch setting. Table 2 compares our method with SAGAFW and SVFW (Reddi et al., 2016b) in the stochastic setting. In Table 3, our method leverages the popular stochastic variance reduction technique; we compare it with the stochastic variance-reduced Frank-Wolfe method of Reddi et al. (2016b). To the best of our knowledge, our algorithms outperform the Frank-Wolfe method in all corresponding settings. We defer the detailed comparison to the related work subsection. See Section 2 for the formal definitions of the first-order oracle (FO), stochastic first-order oracle (SFO), incremental first-order oracle (IFO), and linear oracle (LO).

We remark that the convergence criterion used in this paper differs from that of Frank-Wolfe methods. We follow the criterion $\|\nabla F(\theta)\|^2 \le \epsilon$ (see Section 2.3 for the precise definition via the gradient mapping), as in most work on non-convex optimization (Lan, 2013; Allen-Zhu & Hazan, 2016; Reddi et al., 2016c; Nesterov, 2013), while the Frank-Wolfe method uses the Frank-Wolfe gap. Understanding the precise relationship between these convergence criteria is an important direction for future research.

| Algorithm | FO complexity | LO complexity |
|---|---|---|
| NCGS | O(1/ϵ) | O(1/ϵ²) |
| FW | O(1/ϵ²) | O(1/ϵ²) |

Table 1. Comparison of complexities of algorithms in the batch setting.

| Algorithm | SFO complexity | LO complexity |
|---|---|---|
| NCGS | O(1/ϵ²) | O(1/ϵ²) |
| SAGAFW | O(1/ϵ^{8/3}) | O(1/ϵ²) |
| SVFW | O(1/ϵ^{10/3}) | O(1/ϵ²) |

Table 2. Comparison of complexities of algorithms in the stochastic setting.

| Algorithm | IFO complexity | LO complexity |
|---|---|---|
| NCGS | O(n/ϵ) | O(1/ϵ²) |
| FW | O(n/ϵ²) | O(1/ϵ²) |
| NCGS-VR | O(n + n^{2/3}/ϵ) | O(1/ϵ²) |
| SVFW | O(n + n^{2/3}/ϵ²) | O(1/ϵ²) |

Table 3. Comparison of complexities of algorithms in the finite-sum setting. Since NCGS and FW evaluate n gradients in each iteration, their IFO complexities are n times the FO results in Table 1.

Related work. The classical Frank-Wolfe method considers optimizing a smooth convex function F(θ) over a polyhedral set and enjoys an O(1/ϵ) convergence rate (Frank & Wolfe, 1956; Jaggi, 2013). Recent work (Garber & Hazan, 2013; 2015) proves faster convergence rates under additional assumptions. Conditional gradient sliding was proposed in Lan & Zhou (2016), aiming at minimizing a convex objective function. While our high-level algorithmic idea is the same, the analysis differs significantly due to the non-convexity of the objective function. Most existing work on non-convex optimization solves the problem with a projection or proximal operation; we list some representative works below. Ghadimi & Lan (2013) investigate SGD in the non-convex setting and extend Nesterov's acceleration method to constrained stochastic optimization. The performance of non-convex stochastic variance reduction methods is analyzed in Reddi et al. (2016a); Allen-Zhu & Hazan (2016); Shalev-Shwartz (2016); Allen-Zhu & Yuan (2016). Note that stochastic variance reduction techniques were first introduced for solving convex optimization problems (Johnson & Zhang, 2013; Xiao & Zhang, 2014).
There is very little work on projection-free methods for non-convex optimization. The early work of Bertsekas (1999) proves asymptotic convergence of the Frank-Wolfe method to a stationary point, but the convergence rate is not studied. Lacoste-Julien (2016) establishes an O(1/ϵ²) convergence rate for the Frank-Wolfe method in the (batch) non-convex setting, under the Frank-Wolfe gap criterion; in that work, both the FO and LO complexities are shown to be O(1/ϵ²). In contrast, for our proposed NCGS, the FO complexity is O(1/ϵ) and the LO complexity is O(1/ϵ²). Recent work on Frank-Wolfe methods in the non-convex stochastic setting shows that the SFO and LO complexities are O(1/ϵ^{10/3}) and O(1/ϵ²) respectively for SVFW, and O(1/ϵ^{8/3}) and O(1/ϵ²) for SAGAFW (Reddi et al., 2016b). Our SFO and LO complexities in the same setting are O(1/ϵ²) and O(1/ϵ²), respectively. In the finite-sum setting, our variance-reduced NCGS (NCGS-VR) has IFO complexity O(n + n^{2/3}/ϵ), while the state-of-the-art variance-reduced FW has IFO complexity O(n + n^{2/3}/ϵ²) with the same LO complexity (Reddi et al., 2016b). Thus, in all three settings we improve upon the best known results in the literature in terms of the computation needed for gradient evaluation.

2. Preliminary

2.1. Oracle model

We consider the following set of oracles:

First-Order Oracle (FO): given θ, the FO returns ∇F(θ).

Stochastic First-Order Oracle (SFO): for a function F(θ) = E_ξ f(θ, ξ) where ξ ∼ P, the SFO returns the stochastic gradient G(θ_k, ξ_k) = ∇_θ f(θ_k, ξ_k), where ξ_k is a sample drawn i.i.d. from P in the k-th call.

Incremental First-Order Oracle (IFO): for the setting F(θ) = (1/n) Σ_{i=1}^{n} f_i(θ), the IFO samples i ∈ [n] and returns ∇f_i(θ).

Linear Oracle (LO): the LO solves the problem $\arg\min_{\theta \in \Omega} \langle \theta, g \rangle$ for a given vector g.

Throughout the paper, the complexity of the FO, SFO, IFO, and LO denotes the number of calls to the oracle needed to obtain a solution with ϵ accuracy (see Section 2.3 for details).

2.2. Assumptions

F(θ) is L-smooth if $\|\nabla F(\theta_1) - \nabla F(\theta_2)\| \le L\|\theta_1 - \theta_2\|$. This definition is equivalent to the following form:

$$-\frac{L}{2}\|\theta_1 - \theta_2\|^2 \le F(\theta_1) - F(\theta_2) - \langle \nabla F(\theta_2), \theta_1 - \theta_2 \rangle \le \frac{L}{2}\|\theta_1 - \theta_2\|^2, \quad \forall \theta_1, \theta_2 \in \Omega.$$

We say F(θ) is ℓ-lower smooth if it satisfies

$$-\frac{\ell}{2}\|\theta_1 - \theta_2\|^2 \le F(\theta_1) - F(\theta_2) - \langle \nabla F(\theta_2), \theta_1 - \theta_2 \rangle.$$

Intuitively, the lower smoothness quantifies the amount of non-convexity of the function. Clearly, L-smoothness trivially implies ℓ-lower smoothness with ℓ = L. However, in some cases the non-convexity parameter ℓ is much smaller than L, and we will show how it affects the results in our theorems.

We then define the prox-mapping-type function ψ(x, ω, γ):

$$\psi(x, \omega, \gamma) = \arg\min_{\theta \in \Omega} \langle \omega, \theta \rangle + \frac{1}{2\gamma}\|\theta - x\|^2.$$

It is closely related to the projected gradient step, obtained by setting ω = ∇F(θ_k), γ to the step size, and x = θ_k. Following Lan (2013), we assume ‖ψ(x, ω, γ)‖ ≤ M for all γ ∈ (0, ∞), x ∈ Ω, and ω ∈ ℝᵖ. Since in our work Ω is compact and convex, this assumption is satisfied.

For the stochastic setting, we make the following additional assumptions: for any θ ∈ ℝᵖ and k ≥ 1, (1) E[G(θ, ξ_k)] = ∇F(θ) and (2) E‖G(θ, ξ_k) − ∇F(θ)‖² ≤ σ², i.e., the stochastic gradient G(θ, ξ_k) is unbiased and has bounded variance. In the finite-sum setting, we assume each f_i(θ) is L-smooth.

2.3. Convergence criteria

Conventionally, the convergence criterion in non-convex optimization defines a solution with ϵ accuracy as $\|\nabla F(\theta)\|^2 \le \epsilon$ (Lan & Zhou, 2016; Nesterov, 2013). When the problem has constraints, however, a different termination criterion based on the gradient mapping is needed (Lan & Zhou, 2016). This is a natural extension of the gradient: if there is no constraint, it reduces to the gradient. The gradient mapping is defined as follows:

$$g(\theta, \nabla F(\theta), \gamma) = \frac{1}{\gamma}\big(\theta - \psi(\theta, \nabla F(\theta), \gamma)\big).$$

Throughout the paper, we use g(θ, ∇F(θ), γ) as the convergence criterion, i.e., we want to find a solution θ such that ‖g(θ, ∇F(θ), γ)‖² ≤ ϵ. There is another criterion, the Frank-Wolfe gap $\max_{x \in \Omega} \langle x - \theta_k, -\nabla F(\theta_k) \rangle$, used in some recent analyses of non-convex Frank-Wolfe methods (Lacoste-Julien, 2016; Reddi et al., 2016a), which was initially used for the convex Frank-Wolfe method. In this paper we follow the gradient mapping criterion, since it is a natural generalization of the gradient.
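When the prox-mapping is available in closed form, the gradient mapping is straightforward to compute: completing the square shows that ψ(x, ω, γ) is exactly the Euclidean projection of x − γω onto Ω. A minimal sketch follows, using for illustration an ℓ₂ ball where projection is a simple rescaling; the function names and the choice of Ω are ours, and for the nuclear-norm-type sets targeted by this paper the projection step is precisely what condg (introduced next) avoids.

```python
import numpy as np

def psi(x, w, gamma, R=1.0):
    """Prox-mapping: argmin_{theta in Omega} <w, theta> + ||theta - x||^2 / (2*gamma).

    Completing the square shows this equals Proj_Omega(x - gamma * w).
    Omega is taken to be the l2 ball of radius R purely for illustration.
    """
    z = x - gamma * w
    nz = np.linalg.norm(z)
    return z if nz <= R else (R / nz) * z

def gradient_mapping(theta, grad, gamma, R=1.0):
    """g(theta, grad, gamma) = (theta - psi(theta, grad, gamma)) / gamma."""
    return (theta - psi(theta, grad, gamma, R)) / gamma
```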
3. Batched non-convex conditional gradient sliding

3.1. Algorithm

Before presenting the non-convex conditional gradient sliding algorithm, we present the procedure condg in Algorithm 1, which is used as a subroutine in all three (i.e., batch, stochastic, and finite-sum) settings. Algorithm 2 presents non-convex conditional gradient sliding in the batch setting; notice that it calls the procedure condg. There are two options to update θ^{ag}, and we provide theoretical guarantees for both of them in Theorem 1.

Algorithm 1: Procedure u⁺ = condg(l, u, γ, η)
1. Set u₁ = u and t = 1.
2. Let v_t be an optimal solution of the subproblem
$$V(u_t) = \max_{x \in \Omega} \Big\langle l + \frac{1}{\gamma}(u_t - u),\ u_t - x \Big\rangle.$$
3. If V(u_t) ≤ η, set u⁺ = u_t and terminate the procedure.
4. Set u_{t+1} = (1 − ξ_t)u_t + ξ_t v_t with
$$\xi_t = \min\Big\{1,\ \frac{\langle \frac{1}{\gamma}(u - u_t) - l,\ v_t - u_t \rangle}{\frac{1}{\gamma}\|v_t - u_t\|^2}\Big\}.$$
Set t ← t + 1 and go to step 2.

Algorithm 2: Non-convex conditional gradient sliding (NCGS)
Input: step sizes α_k, λ_k, β_k; smoothness parameter L.
Initialization: θ₀^{ag} = θ₀, k = 1.
for k = 1, ..., N do
  update θ_k^{md} = (1 − α_k) θ_{k−1}^{ag} + α_k θ_{k−1}
  update θ_k = condg(∇F(θ_k^{md}), θ_{k−1}, λ_k, η_k)
  Option I: θ_k^{ag} = θ_k^{md} − β_k g(θ_{k−1}, ∇F(θ_k^{md}), λ_k, η_k), where g(θ_{k−1}, ∇F(θ_k^{md}), λ_k, η_k) := (θ_{k−1} − θ_k)/λ_k.
  Option II: θ_k^{ag} = condg(∇F(θ_k^{md}), θ_k^{md}, β_k, χ_k).
end for

3.2. Theoretical result

Theorem 1. Suppose the objective function F(θ) satisfies the assumptions in Section 2, where L is the smoothness parameter and ℓ is the lower smoothness parameter. If we set α_k = 2/(k+1), β_k = 1/(2L), λ_k = β_k, and η_k = 1/N in Option I of Algorithm 2, then

$$\min_{k=1,\dots,N} \|g(\theta_{k-1}, \nabla F(\theta_k^{md}), \lambda_k)\|^2 \le \frac{12L\,(F(\theta_0) - F(\theta^*)) + 16L}{N},$$

where θ* is the optimal solution of Equation (1). In Option II, setting α_k = 2/(k+1), β_k = 1/(2L), λ_k = kβ_k/2, and η_k = 1/N, with M the positive constant defined in Section 2.2, we have

$$\min_{k=1,\dots,N} \|g(\theta_{k-1}, \nabla F(\theta_k^{md}), \beta_k)\|^2 \le \frac{192L^2\|\theta_0 - \theta^*\|^2}{N^2(N+1)} + \frac{48\,\ell L}{N}\big(\|\theta^*\|^2 + 2M^2\big) + \frac{96L}{N}.$$

Some remarks are in order. In the procedure condg, we solve the subproblem to accuracy η, and the choice of η is important: if η is chosen too small, we need too many LO calls; on the other hand, if η is too large, the algorithm may not converge. The FO complexities of Options I and II are order-wise equivalent, namely O(1/ϵ). Examining each term of the Option II guarantee: the term L²‖θ₀ − θ*‖²/(N²(N+1)) corresponds to the convex part of the function; the term (ℓL/N)(‖θ*‖² + 2M²) corresponds to the non-convex part; and the last term, of order L/N, corresponds to the procedure condg. When the objective function has a finite-sum form with n terms, the IFO complexity of NCGS is O(n/ϵ); we improve this complexity using stochastic variance reduction techniques in Section 5.
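The condg procedure is itself a Frank-Wolfe loop applied to the quadratic φ(x) = ⟨l, x⟩ + ‖x − u‖²/(2γ), stopped once the Frank-Wolfe gap V(u_t) drops below η; ξ_t is an exact line search on φ, clipped to [0, 1]. The sketch below, together with an Option I outer loop using the step sizes of Theorem 1, assumes a user-supplied linear oracle lmo(g) returning argmin_{x∈Ω}⟨g, x⟩ (e.g., the nuclear_norm_lmo sketched earlier); the callable names and the max_iter safeguard are our additions.

```python
import numpy as np

def condg(l, u, gamma, eta, lmo, max_iter=10000):
    """Algorithm 1: approximately minimize <l, x> + ||x - u||^2/(2*gamma)
    over Omega via Frank-Wolfe, using only linear oracle calls."""
    u_t = u.copy()
    for _ in range(max_iter):
        grad = l + (u_t - u) / gamma            # gradient of the quadratic at u_t
        v_t = lmo(grad)                         # v_t = argmin_{x in Omega} <grad, x>
        if np.vdot(grad, u_t - v_t) <= eta:     # V(u_t) <= eta: eta-accurate, stop
            break
        d = v_t - u_t                           # exact line search, clipped to [0, 1]
        xi = min(1.0, float(np.vdot((u - u_t) / gamma - l, d))
                 / (float(np.vdot(d, d)) / gamma))
        u_t = (1 - xi) * u_t + xi * v_t
    return u_t

def ncgs(grad_F, theta0, L, N, lmo):
    """Algorithm 2 (Option I) with the parameter choices of Theorem 1."""
    theta_ag, theta = theta0.copy(), theta0.copy()
    for k in range(1, N + 1):
        alpha, beta = 2.0 / (k + 1), 1.0 / (2 * L)
        lam, eta = beta, 1.0 / N
        theta_md = (1 - alpha) * theta_ag + alpha * theta
        theta_next = condg(grad_F(theta_md), theta, lam, eta, lmo)
        g_map = (theta - theta_next) / lam      # approximate gradient mapping
        theta_ag = theta_md - beta * g_map      # Option I update
        theta = theta_next
    return theta
```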
Theorem 1 presents convergence in terms of the iteration number, which we translate into FO and LO complexity in the following corollary.

Corollary 1. Under the same conditions as Theorem 1, in Options I and II of Algorithm 2, the FO complexity to achieve accuracy ϵ is O(1/ϵ) and the LO complexity is O(1/ϵ²). Our algorithm thus has the same LO complexity as FW but improves the FO complexity from O(1/ϵ²) to O(1/ϵ).

4. Stochastic non-convex conditional gradient sliding

In this section we consider the following stochastic optimization problem:

$$\min_{\theta \in \Omega} F(\theta) := \mathbb{E}_{\xi} f(\theta, \xi). \tag{5}$$

4.1. Algorithm

A natural way to extend the batch NCGS method to the stochastic case is to replace the exact gradient ∇F(θ) in Algorithm 2 by the stochastic gradient G(θ, ξ). However, Ghadimi & Lan (2016) show that a mini-batch stochastic gradient is necessary to guarantee convergence, and we incorporate this technique into the stochastic NCGS. In particular, we define

$$\bar{G}_k = \frac{1}{m_k} \sum_{i=1}^{m_k} G(\theta_k^{md}, \xi_{k,i}).$$

By our assumptions in Section 2, we have

$$\mathbb{E}\bar{G}_k = \frac{1}{m_k}\sum_{i=1}^{m_k} \mathbb{E}\,G(\theta_k^{md}, \xi_{k,i}) = \nabla F(\theta_k^{md}), \qquad \mathbb{E}\|\bar{G}_k - \nabla F(\theta_k^{md})\|^2 \le \frac{\sigma^2}{m_k}.$$

We present the stochastic NCGS in Algorithm 3. Notice that it uses a randomized termination criterion on the total iteration number R.

Algorithm 3: Stochastic non-convex conditional gradient sliding
Input: step sizes α_k, λ_k, β_k; smoothness parameter L; a probability mass function P_R(·) with Prob{R = k} = p_k, k = 1, ..., N.
Initialization: θ₀^{ag} = θ₀, k = 1. Let R be a random variable chosen according to P_R(·).
for k = 1, ..., R do
  update θ_k^{md} = (1 − α_k)θ_{k−1}^{ag} + α_k θ_{k−1}
  update θ_k = condg(Ḡ_k, θ_{k−1}, λ_k, η_k)
  update θ_k^{ag} = condg(Ḡ_k, θ_k^{md}, β_k, χ_k)
end for

4.2. Theoretical result

In the following theorem, we carefully choose the step sizes and the tolerance in the procedure condg to guarantee convergence of the algorithm.

Theorem 2. Suppose F(θ) is L-smooth and ℓ-lower smooth, and σ² is the variance bound defined in Section 2. In Algorithm 3, set α_k = 2/(k+1), β_k = 1/(2L), λ_k = kβ_k/2, η_k = χ_k = 1/N, and m_k = k, and set $p_k = \Gamma_k^{-1} / \sum_{k=1}^{N} \Gamma_k^{-1}$, where Γ_k = 2/(k(k+1)). Then we have

$$\mathbb{E}\big[\|g(\theta_R^{md}, \bar{G}_R, \beta_R)\|^2\big] \le 192L\Big(\frac{4L\|\theta_0 - \theta^*\|^2}{N^2(N+1)} + \frac{\ell}{N}\big(\|\theta^*\|^2 + 2M^2\big) + \frac{1}{N} + \frac{c\sigma^2}{N}\Big),$$

where θ* is the optimal solution of Equation (5) and c is an absolute constant.

Comparing this result with its batch counterpart, Theorem 1, we see an additional term of order cσ²/N due to the variance of the stochastic gradient. The mini-batch size m_k = k increases with the iteration counter; this choice is made to guarantee convergence at a fast rate. Theorem 2 implies the following corollary on the LO and SFO complexities.

Corollary 2. Under the same setting as Theorem 2, the SFO and LO complexities of Algorithm 3 are O(1/ϵ²) and O(1/ϵ²), respectively.

Notice that Algorithm 3 uses mini-batches to calculate Ḡ_k. Thus, even though the number of iterations of stochastic NCGS matches that of the batch method, it needs more SFO calls. To the best of our knowledge, the corresponding Frank-Wolfe method in Reddi et al. (2016b) has SFO complexity O(1/ϵ^{8/3}); our algorithm has the same LO complexity while improving the SFO complexity.

5. Variance-reduced non-convex conditional gradient sliding in the finite-sum setting

The stochastic variance reduction technique has been very successful in optimizing convex functions in the form of finite sums. In this section we incorporate it into our non-convex conditional gradient sliding and propose NCGS-VR in Algorithm 4.
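At the core of this approach is the SVRG-style gradient estimator: once per epoch the full gradient of a snapshot θ̃ is computed, and each inner step corrects a mini-batch gradient at the current point with the snapshot's component gradients. The estimator stays unbiased while its variance shrinks as the iterates approach the snapshot. Below is a minimal sketch under our assumptions, with a hypothetical grads(i, theta) callable returning ∇f_i(θ); in Algorithm 4 this estimator replaces the exact gradient passed to condg.

```python
def vr_gradient(grads, theta, theta_snap, g_snap, batch):
    """Variance-reduced estimator used in Algorithm 4's inner loop:

        v = (1/b) * sum_{i in I} (grad f_i(theta) - grad f_i(theta_snap)) + g_snap,

    where g_snap = (1/n) * sum_i grad f_i(theta_snap) is the full gradient at
    the epoch snapshot. Unbiased: E[v] = grad F(theta).
    """
    corr = sum(grads(i, theta) - grads(i, theta_snap) for i in batch)
    return corr / len(batch) + g_snap
```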
5.1. Algorithm

We consider minimizing the finite-sum problem

$$\min_{\theta \in \Omega} F(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta), \tag{8}$$

where each f_i is possibly non-convex and smooth with parameter L. If we view the finite-sum problem as a special case of the batch problem and apply Algorithm 2, we obtain an IFO complexity of O(n/ϵ). The variance reduction technique has been proposed for finite-sum problems to reduce the dependence of the IFO complexity on the number of components n, and we incorporate this technique into NCGS. In the outer loop, we calculate the full gradient and use it in the inner loop to reduce the variance of the stochastic gradient; we then call the procedure condg. As we show below, the new algorithm (Algorithm 4) achieves IFO complexity O(n + n^{2/3}/ϵ). To the best of our knowledge, this outperforms Frank-Wolfe-type algorithms for the non-convex finite-sum problem. Notice that in Algorithm 4, unlike Algorithms 2 and 3, we do not apply Nesterov's acceleration step; whether acceleration can further improve the rate of convergence (e.g., exploit the lower smoothness) in this setting is left for future research.

Algorithm 4: Variance-reduced non-convex conditional gradient sliding (NCGS-VR)
Input: θ̃⁰ = θ⁰_m = θ₀ ∈ ℝᵈ; epoch length m; step size λ_t; tolerance η; mini-batch size b; iteration limit T; S = T/m.
for s = 0, ..., S − 1 do
  θ₀^{s+1} = θ_m^{s}
  g^{s+1} = (1/n) Σ_{i=1}^{n} ∇f_i(θ̃^{s})
  for t = 0, ..., m − 1 do
    Pick I_t uniformly from {1, ..., n} with replacement such that |I_t| = b.
    v_t^{s+1} = (1/b) Σ_{i∈I_t} (∇f_i(θ_t^{s+1}) − ∇f_i(θ̃^{s})) + g^{s+1}
    θ_{t+1}^{s+1} = condg(v_t^{s+1}, θ_t^{s+1}, λ_t, η)
  end for
  θ̃^{s+1} = θ_m^{s+1}
end for
Output: θ_α chosen uniformly at random from {{θ_t^{s+1}}_{t=0}^{m−1}}_{s=0}^{S−1}.

5.2. Theoretical result

Theorem 3. Suppose each f_i(θ) is non-convex and L-smooth. In Algorithm 4, set b = n^{2/3}, λ_t = 1/(3L), and m = n^{1/3} with T a multiple of m, and η = 1/T. Then for the output θ_α we have

$$\mathbb{E}\big[\|g(\theta_\alpha, \nabla F(\theta_\alpha), \lambda)\|^2\big] \le \frac{18L\big(F(\theta_0) - F(\theta^*) + 1\big)}{T},$$

where θ* is an optimal solution to (8).

If b = 1 and m = n, then v_t^{s} reduces to the regular stochastic variance-reduced gradient. However, this means that in every step we sample one data point and then call condg, which may deteriorate the performance of the algorithm. The mini-batch size b = n^{2/3} and epoch length m = n^{1/3} are carefully chosen to guarantee convergence of the algorithm at a fast rate. Theorem 3 leads to the following results on the IFO and LO complexities.

Corollary 3. With the parameters set as in Theorem 3, the IFO and LO complexities of Algorithm 4 are O(n + n^{2/3}/ϵ) and O(1/ϵ²), respectively, to achieve E[‖g(θ_α, ∇F(θ_α), λ)‖²] ≤ ϵ.

The stochastic variance-reduced Frank-Wolfe method has IFO complexity O(n + n^{2/3}/ϵ²), while our method has IFO complexity O(n + n^{2/3}/ϵ); the LO complexity of both algorithms is the same.
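Putting the pieces together, here is a compact sketch of the NCGS-VR loop with the parameter choices of Theorem 3; it reuses the condg and vr_gradient sketches above, and the grads(i, theta) interface remains our assumption.

```python
import numpy as np

def ncgs_vr(grads, n, theta0, L, T, lmo, seed=0):
    """Algorithm 4 with b = n^(2/3), m = n^(1/3), lambda_t = 1/(3L), eta = 1/T."""
    rng = np.random.default_rng(seed)
    b = max(1, round(n ** (2 / 3)))                # mini-batch size
    m = max(1, round(n ** (1 / 3)))                # epoch length
    lam, eta = 1.0 / (3 * L), 1.0 / T
    theta_snap, theta = theta0.copy(), theta0.copy()
    iterates = []
    for _ in range(T // m):                        # S = T/m epochs
        g_snap = sum(grads(i, theta_snap) for i in range(n)) / n  # full gradient
        for _ in range(m):
            batch = rng.integers(0, n, size=b)     # I_t, sampled with replacement
            v = vr_gradient(grads, theta, theta_snap, g_snap, batch)
            theta = condg(v, theta, lam, eta, lmo)
            iterates.append(theta)
        theta_snap = theta.copy()                  # new snapshot at epoch end
    return iterates[rng.integers(len(iterates))]   # theta_alpha, chosen uniformly
```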
6. Simulation Results

In this section we test our algorithms in the batch (NCGS) and finite-sum (NCGS-VR) settings and compare them with their Frank-Wolfe counterparts (FW and SVFW (Reddi et al., 2016b)), as well as with SVRG, a projection-based algorithm.

6.1. Synthetic datasets

We first use a toy example on matrix completion to examine the convergence of the gradient mapping, which is the convergence criterion for our algorithms. Since in practice the objective function value is typically a more relevant metric, we also report such results using a multi-task learning problem.

6.1.1. MATRIX COMPLETION

We consider a toy matrix completion problem for our simulation and observe the magnitude of the gradient mapping. In particular, we optimize the following trace-norm-constrained non-convex problem using the candidate algorithms:

$$\min_{\theta} \sum_{(i,j)\in\Omega} f_{i,j}(\theta) \quad \text{s.t.}\ \|\theta\|_* \le R, \tag{9}$$

where Ω is the set of observed entries, $f_{i,j}(\theta) = 1 - \exp\big(-(\theta_{i,j} - Y_{i,j})^2/\sigma\big)$, Y_{i,j} is the observation of the (i, j)-th entry, and ‖·‖_* is the nuclear norm. Here f_{i,j} is a smoothed ℓ₀ loss with enhanced robustness to outliers in the data, so it can handle the sparse-plus-low-rank matrix completion setting of Chandrasekaran et al. (2009). Clearly, f_{i,j} is non-convex and satisfies the assumptions of Algorithms 2, 3, and 4.

We compare our non-convex conditional gradient sliding method with the Frank-Wolfe method in Figure 1. In the left panel we report the result for the batch setting. The matrix dimension is 200 × 200, the rank is r = 5, and each entry is observed with probability 0.1. The sparse noise is sampled uniformly from [−3, 3], and each entry is corrupted by noise with probability 0.05. We set σ = 1 and R = 5 in Problem (9). We observe that Algorithm 2 (NCGS) clearly outperforms the non-convex Frank-Wolfe method (FW). In the right panel, we treat Problem (9) as a finite-sum problem and solve it using Algorithm 4 (NCGS-VR), comparing against SVFW (Reddi et al., 2016b). The matrix dimension is 400 × 400, with rank r = 8, σ = 1, and R = 8; the sparse noise and observation probability are generated as in the batch setting. NCGS-VR takes around 50 seconds of CPU time to reach 10⁻³ accuracy in squared gradient mapping, while SVFW needs more than 300 seconds.

Figure 1. Left: NCGS vs. the non-convex Frank-Wolfe method. Right: NCGS-VR vs. non-convex stochastic variance-reduced Frank-Wolfe (SVFW). The x-axis is CPU time in seconds; the y-axis is the squared gradient mapping. Note that in this example NCGS-VR is not necessarily faster than NCGS, since computing the gradient is very cheap.

6.1.2. NON-CONVEX MULTI-TASK LEARNING

In this section, we consider a non-convex multi-task learning problem and compare the performance of NCGS, NCGS-VR, non-convex Frank-Wolfe (FW) (Lacoste-Julien, 2016), stochastic variance-reduced Frank-Wolfe (SVFW) (Reddi et al., 2016b), and SVRG (Johnson & Zhang, 2013). In SVRG, we update θ and then project it back onto the feasible set of the nuclear norm constraint. We apply the mini-batch technique to SVRG, since regular SVRG (with mini-batch size 1) samples one data point and then performs the expensive projection, resulting in very slow convergence. We choose the mini-batch size b = n^{2/3} and m = n^{1/3}, as in Reddi et al. (2016a).

We first generate m different covariance matrices Σ_i, i = 1, ..., m, according to Σ_i = U_i D_i U_i^T, where D_i ∈ ℝ^{d×d} is a diagonal matrix with each entry drawn uniformly from (1, 2), and U_i ∈ ℝ^{d×d} is a random matrix with each entry drawn from N(0, 1). The feature vectors of task (i, k) are generated from the normal distribution N(0, Σ_i + Σ̃_{ik}), k = 1, ..., K, where Σ̃_{ik} = U_i D̃_i U_i^T and D̃_i is a small perturbation of D_i, uniformly sampled from (0, 0.1). Thus for each i there are K similar tasks, giving mK tasks in total, some of which are similar while others may differ. The target y_j^{i,k} ∈ {0, 1} in task (i, k) is sampled from the distribution

$$P\big(y_j^{i,k} = 1 \mid x_j^{i,k} = x\big) = \frac{1}{1 + \exp(-\langle \theta_i^*, x \rangle - b_i^*)},$$

where θ_i^*, b_i^* are the true parameters, whose entries (−1 or 1) are sampled with equal probability.
In each task we generate n such data points. We use Equation (2) as our objective function, with the loss

$$\ell(y, \theta^T x + b) = \Big(y - \frac{1}{1 + \exp(-\theta^T x - b)}\Big)^2.$$

This non-convex loss function has been used in Mei et al. (2016) and Nguyen & Sanner (2013) and enjoys better accuracy than convex losses (e.g., the logistic loss).

NCGS, NCGS-VR, FW, SVFW, and SVRG are compared in this experiment. We test different settings of m, K, and n, and report the results in Figure 2. In all experiments, the feature dimension is d = 300 and we choose R = 10 in Equation (2). We observe from Figure 2 that although we adapt SVRG into a mini-batch version to speed up convergence, it remains very slow due to the expensive singular value decompositions required for the projection onto the nuclear norm ball. We also observe that in all experiments, the proposed non-convex conditional gradient sliding methods outperform their respective Frank-Wolfe counterparts. In particular, NCGS-VR performs best, followed by NCGS, FW, and SVFW. When the sample size is large (the right panel, where m = 30, K = 10, and n = 3000), our methods are significantly better than the Frank-Wolfe methods.

Figure 2. The x-axis is CPU time; the y-axis is the objective function. Left: m = 50, K = 5, n = 1000. Middle: m = 30, K = 5, n = 1500. Right: m = 30, K = 10, n = 3000.

6.2. Real datasets

We test our algorithms on three real datasets: Aloi (n = 108000, d = 128) (Geusebroek et al., 2005), Covertype (n = 581012, d = 54) (Blackard & Dean, 1999), and the Sensorless Drive Diagnosis dataset (n = 58509, d = 49) (Bator, 2015). We again test the multi-task learning problem in Equation (2). For all datasets, we normalize the features to the range [−1, 1]. Note that these datasets are multi-class; we generate tasks in the following way. For each class of a dataset, we generate five noisy versions by adding small noise to the features, and set the labels of these data to one. We randomly sample the same amount of data from the other classes, generate noisy versions of them in the same way, and set their labels to zero. Each noisy version, with its one- and zero-labeled data, forms one individual task in our multi-task learning problem.

We report the experimental results in Figure 3. As before, we use the mini-batch version of SVRG. In all experiments, SVRG converges very slowly due to the heavy computational cost of the projection onto the nuclear norm ball. In the left panel, NCGS-VR performs best, followed by NCGS; the non-convex Frank-Wolfe method converges fast at the beginning and then slows down, while SVFW behaves in the opposite way. In the middle panel, NCGS-VR is fastest, followed by NCGS, SVFW, and FW.
In the right panel, the performance of NCGS-VR and NCGS is almost identical, and both outperform their Frank-Wolfe counterparts.

Figure 3. The x-axis is CPU time; the y-axis is the objective function. From left to right: the Aloi, Covertype, and Sensorless Drive Diagnosis datasets.

7. Conclusion and future work

In this paper, we propose non-convex conditional gradient sliding methods to solve batch, stochastic, and finite-sum non-convex problems with complicated constraints, for which projecting onto the feasible set is time-consuming. Our algorithms surpass state-of-the-art Frank-Wolfe-type methods both theoretically and empirically. One future research direction is to incorporate accelerated steps into the proposed NCGS-VR algorithm, in the hope of further improving the convergence speed.

References

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In Proceedings of The 33rd International Conference on Machine Learning, pp. 699-707, 2016.

Allen-Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1080-1089, 2016.

Bator, M. UCI machine learning repository, 2015. URL http://archive.ics.uci.edu/ml.

Bertsekas, D. P. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

Blackard, J. A. and Dean, D. J. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131-151, 1999.

Candes, E. J., Li, X., and Soltanolkotabi, M. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985-2007, 2015.

Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. Sparse and low-rank matrix decompositions. IFAC Proceedings Volumes, 42(10):1493-1498, 2009.

Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. L. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9(Aug):1775-1822, 2008.

Dudik, M., Harchaoui, Z., and Malick, J. Lifted coordinate descent for learning with trace-norm regularization. In Artificial Intelligence and Statistics, pp. 327-336, 2012.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

Garber, D. and Hazan, E. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.

Garber, D. and Hazan, E. Faster rates for the Frank-Wolfe method over strongly-convex sets. In ICML, pp. 541-549, 2015.

Geusebroek, J.-M., Burghouts, G. J., and Smeulders, A. W. The Amsterdam library of object images. International Journal of Computer Vision, 61(1):103-112, 2005.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59-99, 2016.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Guzmán, C. and Nemirovski, A. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1-14, 2015.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pp. 427-435, 2013.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction.
In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Lacoste-Julien, S. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pp. 496-504, 2015.

Lan, G. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, Department of Industrial and Systems Engineering, University of Florida, Gainesville, 2013.

Lan, G. and Zhou, Y. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379-1409, 2016.

Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., and Bach, F. R. Supervised dictionary learning. In Advances in Neural Information Processing Systems, pp. 1033-1040, 2009.

Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Nguyen, T. and Sanner, S. Algorithms for direct 0-1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085-1093, 2013.

Pong, T. K., Tseng, P., Ji, S., and Ye, J. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465-3489, 2010.

Reddi, S. J., Hefny, A., Sra, S., Póczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning, pp. 314-323, 2016a.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. Stochastic Frank-Wolfe methods for nonconvex optimization. In 54th Annual Allerton Conference on Communication, Control, and Computing, pp. 1244-1251. IEEE, 2016b.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pp. 1145-1153, 2016c.

Shalev-Shwartz, S. SDCA without duality, regularization, and individual convexity. In Proceedings of The 33rd International Conference on Machine Learning, pp. 747-754, 2016.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Zhang, X., Schuurmans, D., and Yu, Y.-L. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems, pp. 2906-2914, 2012.