Beyond Convexity: Stochastic Quasi-Convex Optimization

Elad Hazan, Princeton University, ehazan@cs.princeton.edu
Kfir Y. Levy, Technion, kfiryl@tx.technion.ac.il
Shai Shalev-Shwartz, The Hebrew University, shais@cs.huji.ac.il

Abstract

Stochastic convex optimization is a basic and well studied primitive in machine learning. It is well known that convex and Lipschitz functions can be minimized efficiently using Stochastic Gradient Descent (SGD). The Normalized Gradient Descent (NGD) algorithm is an adaptation of Gradient Descent which updates according to the direction of the gradients, rather than the gradients themselves. In this paper we analyze a stochastic version of NGD and prove its convergence to a global minimum for a wider class of functions: we require the functions to be quasi-convex and locally-Lipschitz. Quasi-convexity broadens the concept of unimodality to multiple dimensions and allows for certain types of saddle points, which are a known hurdle for first-order optimization methods such as gradient descent. Locally-Lipschitz functions are only required to be Lipschitz in a small region around the optimum. This assumption circumvents gradient explosion, which is another known hurdle for gradient descent variants. Interestingly, unlike the vanilla SGD algorithm, the stochastic normalized gradient descent algorithm provably requires a minimal minibatch size.

1 Introduction

The benefits of using the Stochastic Gradient Descent (SGD) scheme for learning cannot be stressed enough. For convex and Lipschitz objectives, SGD is guaranteed to find an ϵ-optimal solution within O(1/ϵ²) iterations, and it requires only an unbiased estimator of the gradient, which is obtained with only one (or a few) data samples. However, when applied to non-convex problems, several drawbacks are revealed. In particular, SGD is widely used for deep learning [2], one of the most interesting fields where stochastic non-convex optimization problems arise. Often, the objective in these kinds of problems exhibits two extreme phenomena [3]: on the one hand, plateau regions with vanishing gradients; on the other hand, cliffs with exceedingly high gradients. As expected, applying SGD to such problems is often reported to yield unsatisfactory results.

In this paper we analyze a stochastic version of the Normalized Gradient Descent (NGD) algorithm, which we denote by SNGD. Each iteration of SNGD is as simple and efficient as that of SGD, but SNGD is much more appropriate for non-convex optimization problems, overcoming some of the pitfalls that SGD may encounter. In particular, we define a family of locally-quasi-convex and locally-Lipschitz functions, and prove that SNGD is suitable for optimizing such objectives. Local-quasi-convexity is a generalization of unimodal functions to multiple dimensions, which includes quasi-convex and convex functions as subclasses. Locally-quasi-convex functions allow for certain types of plateaus and saddle points which are difficult for SGD and other gradient descent variants. Local-Lipschitzness is a generalization of Lipschitzness that only assumes the function is Lipschitz in a small region around the minimum, whereas farther away the gradients may be unbounded. Gradient explosion is thus another difficulty that is successfully tackled by SNGD and poses difficulties for other stochastic gradient descent variants.
Our contributions: We introduce local-quasi-convexity, a property that extends quasi-convexity and captures unimodal functions which are not quasi-convex. We prove that NGD finds an ϵ-optimal minimum for such functions within O(1/ϵ²) iterations. As a special case, we show that the above rate can be attained for quasi-convex functions that are Lipschitz in an Ω(ϵ)-region around the optimum (gradients may be unbounded outside this region). For objectives that are also smooth in an Ω(√ϵ)-region around the optimum, we prove a faster rate of O(1/ϵ). We introduce a new setup, stochastic optimization of locally-quasi-convex functions, and show that this setup captures Generalized Linear Model (GLM) regression [14]. For this setup, we devise a stochastic version of NGD (SNGD), and show that it converges within O(1/ϵ²) iterations to an ϵ-optimal minimum. The above positive result requires that, at each iteration of SNGD, the gradient be estimated using a minibatch of a minimal size. We provide a negative result showing that if the minibatch size is too small then the algorithm may indeed diverge. We report experimental results supporting our theoretical guarantees and demonstrating the accelerated convergence attained by SNGD.

1.1 Related Work

Quasi-convex optimization problems arise in numerous fields, spanning economics [20, 12], industrial organization [21], and computer vision [8]. It is well known that quasi-convex optimization tasks can be solved by a series of convex feasibility problems [4]; however, solving such feasibility problems may generally be very costly [6]. There exists a rich literature concerning quasi-convex optimization in the offline case [17, 22, 9, 18]. A pioneering paper by [15] was the first to suggest an efficient algorithm, namely Normalized Gradient Descent, and to prove that this algorithm attains an ϵ-optimal solution within O(1/ϵ²) iterations given a differentiable quasi-convex objective. This work was later extended by [10], establishing the same rate for upper semi-continuous quasi-convex objectives. In [11] faster rates for quasi-convex optimization are attained, but under the assumption that the optimal value of the objective is known, an assumption that generally does not hold in practice.

Among the deep learning community there have been several attempts to tackle plateaus and gradient explosion. Ideas spanning gradient clipping [16], smart initialization [5], and more [13] have been shown to improve training in practice. Yet none of these works provides a theoretical analysis showing better convergence guarantees. To the best of our knowledge, there are no previous results on stochastic versions of NGD, nor any results regarding locally-quasi-convex/locally-Lipschitz functions.

1.2 Plateaus and Cliffs - Difficulties for GD

Figure 1: A quasi-convex, locally-Lipschitz function with plateaus (where ‖∇f(x)‖ = m → 0) and cliffs (where ‖∇f(x)‖ = M → ∞).

Gradient descent with fixed step sizes, including its stochastic variants, is known to perform poorly when the gradients are too small in a plateau area of the function, or alternatively when the other extreme happens: gradient explosion. These two phenomena have been reported in certain types of non-convex optimization, such as the training of deep networks. Figure 1 depicts a one-dimensional family of functions for which GD behaves provably poorly. With a large step size, GD will hit the cliffs and then oscillate between the two boundaries. Alternatively, with a small step size, the low gradients will cause GD to miss the middle valley, which has constant size of 1/2. On the other hand, this exact function is quasi-convex and locally-Lipschitz, and hence the NGD algorithm provably converges to the optimum quickly.
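To make these failure modes concrete, the sketch below runs fixed-step GD and NGD on a simple one-dimensional stand-in for Figure 1: a narrow valley flanked by plateaus of slope m and cliffs of slope M. The particular function, the constants m and M, and the step sizes are illustrative choices, not the exact construction behind the figure.

```python
import numpy as np

# Illustrative quasi-convex, locally-Lipschitz 1-D function: a narrow valley
# around 0, flanked by plateaus of slope m and cliffs of slope M.
m, M = 1e-3, 1e3

def f(x):
    a = abs(x)
    if a <= 0.25:
        return a                                # valley
    if a <= 1.0:
        return 0.25 + m * (a - 0.25)            # plateau
    return 0.25 + 0.75 * m + M * (a - 1.0)      # cliff

def grad(x):
    s = 1.0 if x >= 0 else -1.0
    a = abs(x)
    return s * (1.0 if a <= 0.25 else m if a <= 1.0 else M)

def descend(x0, eta, T, normalized):
    x = x0
    for _ in range(T):
        g = grad(x)
        x -= eta * (g / abs(g) if normalized else g)
    return x

x0, T = 2.0, 2000
print(f(descend(x0, 1e-3, T, normalized=False)))  # small step: GD gets stuck on the plateau
print(f(descend(x0, 4e-3, T, normalized=False)))  # larger step: GD oscillates across the cliffs
print(f(descend(x0, 1e-2, T, normalized=True)))   # NGD: reaches the valley, f <= ~1e-2
```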
2 Definitions and Notations

We use ‖·‖ to denote the Euclidean norm. B_d(x, r) denotes the d-dimensional Euclidean ball of radius r centered at x, and B_d := B_d(0, 1). [N] denotes the set {1, ..., N}. For simplicity, throughout the paper we always assume that functions are differentiable (but unless stated explicitly, we do not assume any bound on the norm of their gradients).

Definition 2.1 (Local-Lipschitzness and Local-Smoothness). Let z ∈ R^d and G, ϵ ≥ 0. A function f : K → R is called (G, ϵ, z)-Locally-Lipschitz if for every x, y ∈ B_d(z, ϵ) we have
  |f(x) − f(y)| ≤ G‖x − y‖.
Similarly, the function is (β, ϵ, z)-locally-smooth if for every x, y ∈ B_d(z, ϵ) we have
  |f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (β/2)‖x − y‖².

Next we define quasi-convex functions:

Definition 2.2 (Quasi-Convexity). We say that a function f : R^d → R is quasi-convex if for all x, y ∈ R^d such that f(y) ≤ f(x), it follows that
  ⟨∇f(x), y − x⟩ ≤ 0.
We further say that f is strictly-quasi-convex if it is quasi-convex and its gradients vanish only at the global minima, i.e., for every y with f(y) > min_{x ∈ R^d} f(x) we have ‖∇f(y)‖ > 0.

Informally, the above characterization states that the (opposite) gradient of a quasi-convex function directs us in a global descent direction. The following is an equivalent (more common) definition:

Definition 2.3 (Quasi-Convexity). We say that a function f : R^d → R is quasi-convex if every α-sublevel-set of f is convex, i.e., for every α ∈ R the set L_α(f) = {x : f(x) ≤ α} is convex.

The equivalence between the above definitions can be found in [4]. Throughout this paper we denote the sublevel-set of f at x by
  S_f(x) = {y : f(y) ≤ f(x)}.   (1)

3 Local-Quasi-Convexity

Quasi-convexity does not fully capture the notion of unimodality in several dimensions. As an example, let x = (x1, x2) ∈ [−10, 10]² and consider the function
  g(x) = (1 + e^{−x1})^{−1} + (1 + e^{−x2})^{−1}.   (2)
It is natural to consider g as unimodal, since it has no local minima other than the unique global minimum at x* = (−10, −10). However, g is not quasi-convex: the points x = (log 16, −log 4) and y = (−log 4, log 16) belong to the 1.2-sublevel set, yet their average does not belong to the same sublevel set, since g(x/2 + y/2) = 4/3 (a short numerical check of this counterexample appears after Definition 3.1 below).

Quasi-convex functions always enable us to explore, meaning that the gradient always directs us in a global descent direction. Intuitively, from an optimization point of view, we only need such a direction whenever we do not exploit, i.e., whenever we are not approximately optimal. In what follows we define local-quasi-convexity, a property that enables us to either explore or exploit. This property captures a wider class of unimodal functions (such as g above) than mere quasi-convexity. Later we justify this definition by showing that it captures Generalized Linear Model (GLM) regression, see [14, 7].

Definition 3.1 (Local-Quasi-Convexity). Let x, z ∈ R^d and κ, ϵ > 0. We say that f : R^d → R is (ϵ, κ, z)-Strictly-Locally-Quasi-Convex (SLQC) in x if at least one of the following applies:
1. f(x) − f(z) ≤ ϵ.
2. ‖∇f(x)‖ > 0, and for every y ∈ B(z, ϵ/κ) it holds that ⟨∇f(x), y − x⟩ ≤ 0.

Note that if f is a G-Lipschitz and strictly-quasi-convex function, then for every x, z ∈ R^d and ϵ > 0, f is (ϵ, G, z)-SLQC in x.
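A minimal numerical check of the counterexample above (using NumPy): the two points lie in the 1.2-sublevel set of g, but their midpoint does not, so that sublevel set is not convex and g is not quasi-convex.

```python
import numpy as np

def g(x):
    # Equation (2): g(x) = (1 + exp(-x1))^{-1} + (1 + exp(-x2))^{-1}
    return 1.0 / (1.0 + np.exp(-x[0])) + 1.0 / (1.0 + np.exp(-x[1]))

x = np.array([np.log(16.0), -np.log(4.0)])
y = np.array([-np.log(4.0), np.log(16.0)])

print(g(x), g(y))        # both ~1.141 <= 1.2: x and y lie in the 1.2-sublevel set
print(g((x + y) / 2.0))  # 4/3 ~ 1.333 > 1.2: their midpoint escapes it
```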
Recalling the function g of Equation (2), it can be shown that for every ϵ ∈ (0, 1] and every x ∈ [−10, 10]², this function is (ϵ, 1, x*)-SLQC in x, where x* = (−10, −10).

3.1 Generalized Linear Models (GLM)

3.1.1 The Idealized GLM

In this setup we have a collection of m samples {(x_i, y_i)}_{i=1}^m ⊆ B_d × [0, 1] and an activation function φ : R → R. We are guaranteed to have w* ∈ R^d such that y_i = φ⟨w*, x_i⟩ for every i ∈ [m] (we denote φ⟨w, x⟩ := φ(⟨w, x⟩)). The performance of a predictor w ∈ R^d is measured by the average square error over all samples,
  êrr_m(w) = (1/m) Σ_{i=1}^m (y_i − φ⟨w, x_i⟩)².   (3)
In [7] it is shown that the Perceptron problem with γ-margin is a special case of GLM regression. The sigmoid function φ(z) = (1 + e^{−z})^{−1} is a popular activation function in the field of deep learning. The next lemma states that in the idealized GLM problem with sigmoid activation, the error function is SLQC (but not quasi-convex). As we will see in Section 4, this implies that Algorithm 1 finds an ϵ-optimal minimum of êrr_m(w) within poly(1/ϵ) iterations.

Lemma 3.1. Consider the idealized GLM problem with the sigmoid activation, and assume that ‖w*‖ ≤ W. Then the error function appearing in Equation (3) is (ϵ, e^W, w*)-SLQC in w, for every ϵ > 0 and w ∈ B_d(0, W) (but it is not generally quasi-convex).

3.1.2 The Noisy GLM

In the noisy GLM setup (see [14, 7]), we may draw i.i.d. samples {(x_i, y_i)}_{i=1}^m ⊆ B_d × [0, 1] from an unknown distribution D. We assume that there exists a predictor w* ∈ R^d such that E_{(x,y)∼D}[y | x] = φ⟨w*, x⟩, where φ is an activation function. Given w ∈ R^d, we define its expected error as
  E(w) = E_{(x,y)∼D}(y − φ⟨w, x⟩)²,
and it can be shown that w* is a global minimum of E. We are interested in schemes that obtain an ϵ-optimal minimum of E within poly(1/ϵ) samples and optimization steps. Given m samples from D, their empirical error êrr_m(w) is defined as in Equation (3). The following lemma states that in this setup, letting m = Ω(1/ϵ²), êrr_m is SLQC with high probability. This property will enable us to apply Algorithm 2 to obtain an ϵ-optimal minimum of E within poly(1/ϵ) samples from D and poly(1/ϵ) optimization steps.

Lemma 3.2. Let δ, ϵ ∈ (0, 1). Consider the noisy GLM problem with the sigmoid activation, and assume that ‖w*‖ ≤ W. Given a fixed point w ∈ B(0, W), then with probability at least 1 − δ, after m ≥ (8e^{2W}(W + 1)²/ϵ²) log(1/δ) samples, the empirical error function appearing in Equation (3) is (ϵ, e^W, w*)-SLQC in w.

Note that had we required SLQC to hold for every w ∈ B(0, W), the number of samples would need to depend on the dimension d, which we would like to avoid. Instead, we require SLQC to hold at a fixed w. This satisfies the conditions of Algorithm 2, enabling us to find an ϵ-optimal solution with a sample complexity that is independent of the dimension.

4 NGD for Locally-Quasi-Convex Optimization

Here we present the NGD algorithm and prove its convergence rate for SLQC objectives. Our analysis is simple, enabling us to extend the convergence rate presented in [15] beyond quasi-convex functions. We then show that quasi-convex and locally-Lipschitz objectives are SLQC, implying that NGD converges even if the gradients are unbounded outside a small region around the minimum. For quasi-convex and locally-smooth objectives, we show that NGD attains a faster convergence rate.

Algorithm 1 Normalized Gradient Descent (NGD)
  Input: #iterations T, x_1 ∈ R^d, learning rate η
  for t = 1 ... T do
    Update: x_{t+1} = x_t − η ĝ_t, where g_t = ∇f(x_t), ĝ_t = g_t / ‖g_t‖
  end for
  Return: x̄_T = argmin_{t ∈ [T]} f(x_t)

NGD is presented in Algorithm 1. NGD is similar to GD, except that we normalize the gradients. It is intuitively clear that to obtain robustness to plateaus (where the gradient can be arbitrarily small) and to exploding gradients (where the gradient can be arbitrarily large), one must ignore the size of the gradient. It is more surprising that the information in the direction of the gradient suffices to guarantee convergence.
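A minimal NumPy sketch of Algorithm 1 is given below. The driver runs it on the SLQC function g of Equation (2); the box constraint is handled by a simple clipping step, standing in for the projection mentioned below for constrained problems, and the step size, iteration count, and starting point are illustrative choices.

```python
import numpy as np

def ngd(grad_f, x1, eta, T, project=lambda x: x):
    """Algorithm 1 (NGD): descend along the normalized gradient g_t / ||g_t||.

    Returns all iterates; the caller picks x_bar_T = argmin_t f(x_t), matching
    the Return step of Algorithm 1.  `project` is an optional projection for
    constrained problems (identity by default).
    """
    x = np.asarray(x1, dtype=float)
    iterates = [x.copy()]
    for _ in range(T):
        g = grad_f(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:          # vanishing gradient: by SLQC we are already eps-optimal
            break
        x = project(x - eta * g / norm)
        iterates.append(x.copy())
    return iterates

# Driver: the SLQC (but not quasi-convex) function g of Equation (2) on [-10, 10]^2.
def g(x):
    return 1.0 / (1.0 + np.exp(-x[0])) + 1.0 / (1.0 + np.exp(-x[1]))

def grad_g(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)         # elementwise sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))

eps, kappa = 0.1, 1.0            # g is (eps, 1, x*)-SLQC, so eta = eps/kappa
xs = ngd(grad_g, [9.0, -9.0], eta=eps / kappa, T=1000,
         project=lambda x: np.clip(x, -10.0, 10.0))
x_bar = min(xs, key=g)
print(x_bar, g(x_bar))           # close to x* = (-10, -10), where g ~ 9.1e-5
```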
The following is the main theorem of this section:

Theorem 4.1. Fix ϵ > 0, let f : R^d → R, and let x* ∈ argmin_{x ∈ R^d} f(x). Suppose that f is (ϵ, κ, x*)-SLQC in every x ∈ R^d. Then running the NGD algorithm with T ≥ κ²‖x_1 − x*‖²/ϵ² and η = ϵ/κ, we have f(x̄_T) − f(x*) ≤ ϵ.

Theorem 4.1 states that (ϵ, κ, x*)-SLQC functions admit a poly(1/ϵ) convergence rate using NGD. The intuition lies in Definition 3.1, which asserts that at a point x, either the (opposite) gradient points out a global optimization direction, or we are already ϵ-optimal. Note that the requirement of (ϵ, κ, x*)-SLQC in every x is not restrictive: as we have seen in Section 3, there are interesting examples of functions that admit this property for every ϵ ∈ [0, 1] and every x.

For simplicity we have presented NGD for unconstrained problems. Using projections we can easily extend the algorithm and its analysis to constrained optimization over convex sets. This enables a convergence rate of O(1/ϵ²) for the objective presented in Equation (2) and for the idealized GLM problem presented in Section 3.1.1. We are now ready to prove Theorem 4.1:

Proof of Theorem 4.1. First note that if the gradient of f vanishes at x_t, then by the SLQC assumption we must have f(x_t) − f(x*) ≤ ϵ. Assume next that we perform T iterations and the gradient of f at x_t never vanishes during these iterations. Consider the update rule of NGD (Algorithm 1); by standard algebra we get
  ‖x_{t+1} − x*‖² = ‖x_t − x*‖² − 2η⟨ĝ_t, x_t − x*⟩ + η².
Assume that for all t ∈ [T] we have f(x_t) − f(x*) > ϵ. Take y = x* + (ϵ/κ)ĝ_t, and observe that ‖y − x*‖ ≤ ϵ/κ. The (ϵ, κ, x*)-SLQC assumption implies that ⟨ĝ_t, y − x_t⟩ ≤ 0, and therefore
  ⟨ĝ_t, x* + (ϵ/κ)ĝ_t − x_t⟩ ≤ 0  ⟹  ⟨ĝ_t, x_t − x*⟩ ≥ ϵ/κ.
Setting η = ϵ/κ, the above implies
  ‖x_{t+1} − x*‖² ≤ ‖x_t − x*‖² − 2ηϵ/κ + η² = ‖x_t − x*‖² − ϵ²/κ².
Thus, after T iterations for which f(x_t) − f(x*) > ϵ we get
  0 ≤ ‖x_{T+1} − x*‖² ≤ ‖x_1 − x*‖² − Tϵ²/κ²,
and therefore we must have T ≤ κ²‖x_1 − x*‖²/ϵ².

4.1 Locally-Lipschitz/Smooth Quasi-Convex Optimization

It can be shown that strict-quasi-convexity and (G, ϵ/G, x*)-local-Lipschitzness of f imply that f is (ϵ, G, x*)-SLQC in every x ∈ R^d, for every ϵ ≥ 0 and x* ∈ argmin_{x ∈ R^d} f(x). Therefore the following is a direct corollary of Theorem 4.1:

Corollary 4.1. Fix ϵ > 0, let f : R^d → R, and let x* ∈ argmin_{x ∈ R^d} f(x). Suppose that f is strictly quasi-convex and (G, ϵ/G, x*)-locally-Lipschitz. Then running the NGD algorithm with T ≥ G²‖x_1 − x*‖²/ϵ² and η = ϵ/G, we have f(x̄_T) − f(x*) ≤ ϵ.

In case f is also locally-smooth, we establish an even faster rate:

Theorem 4.2. Fix ϵ > 0, let f : R^d → R, and let x* ∈ argmin_{x ∈ R^d} f(x). Suppose that f is strictly quasi-convex and (β, √(2ϵ/β), x*)-locally-smooth. Then running the NGD algorithm with T ≥ β‖x_1 − x*‖²/(2ϵ) and η = √(2ϵ/β), we have f(x̄_T) − f(x*) ≤ ϵ.

Remark 1. The above corollary (resp. theorem) implies that f could have arbitrarily large gradients and second derivatives outside B(x*, ϵ/G) (resp. B(x*, √(2ϵ/β))), yet NGD is still ensured to output an ϵ-optimal point within G²‖x_1 − x*‖²/ϵ² (resp. β‖x_1 − x*‖²/(2ϵ)) iterations. We are not familiar with a similar guarantee for GD even in the convex case.
5 SNGD for Stochastic SLQC Optimization

Here we describe the setting of stochastic SLQC optimization. We then describe our SNGD algorithm, which is ensured to yield an ϵ-optimal solution within poly(1/ϵ) queries. We also show that the (noisy) GLM problem described in Section 3.1.2 is an instance of stochastic SLQC optimization, allowing us to provably solve this problem within poly(1/ϵ) samples and optimization steps using SNGD.

The stochastic SLQC optimization setup: Consider the problem of minimizing a function f : R^d → R, and assume there exists a distribution D over functions such that
  f(x) := E_{ψ∼D}[ψ(x)].
We assume that we may access f by randomly sampling minibatches of size b and querying the gradients of these minibatches. Thus, upon querying a point x_t ∈ R^d, a random minibatch {ψ_i}_{i=1}^b ∼ D^b is sampled, and we receive ∇f_t(x_t), where f_t(x) = (1/b) Σ_{i=1}^b ψ_i(x). We make the following assumption regarding the minibatch averages:

Assumption 5.1. Let T, ϵ, δ > 0 and x* ∈ argmin_{x ∈ R^d} f(x). There exist κ > 0 and a function b_0 : R³ → R such that for every b ≥ b_0(ϵ, δ, T), with probability at least 1 − δ and for every t ∈ [T], the minibatch average f_t(x) = (1/b) Σ_{i=1}^b ψ_i(x) is (ϵ, κ, x*)-SLQC in x_t. Moreover, we assume |f_t(x)| ≤ M for every t ∈ [T] and x ∈ R^d. Note that we assume b_0 = poly(1/ϵ, log(T/δ)).

Justification of Assumption 5.1: Noisy GLM regression (see Section 3.1.2) is an interesting instance of a stochastic optimization problem where Assumption 5.1 holds. Indeed, according to Lemma 3.2, given ϵ, δ, T > 0, for b ≥ Ω(log(T/δ)/ϵ²) samples, the minibatch average function is (ϵ, κ, x*)-SLQC in x_t, for every t ∈ [T], with probability at least 1 − δ.

Local-quasi-convexity of minibatch averages is a plausible assumption when we optimize an expected sum of quasi-convex functions that share a common global minimum (or when the different global minima are close by). As seen from the examples presented in Equation (2) and in Sections 3.1.1 and 3.1.2, such a sum is generally not quasi-convex, but it is more often locally-quasi-convex. Note that in the general case, when the objective is a sum of quasi-convex functions, the number of local minima of the objective may grow exponentially with the dimension d, see [1]. This might imply that a general setup where each ψ ∼ D is quasi-convex is generally hard.
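To make the setup concrete, the following sketch implements such a stochastic minibatch oracle for the noisy sigmoid GLM of Section 3.1.2. The data-generating choices (the dimension, the value of W, Bernoulli-valued labels, and the input sampler) are illustrative assumptions, not part of the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical noisy-GLM instance: x in the unit ball, y in {0, 1} (a subset of [0, 1]),
# with E[y | x] = sigmoid(<w_star, x>), as in Section 3.1.2.
d, W = 5, 2.0
w_star = rng.normal(size=d)
w_star *= W / np.linalg.norm(w_star)            # ||w_star|| = W

def sample_minibatch(b):
    x = rng.normal(size=(b, d))
    x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)  # force ||x_i|| <= 1
    y = rng.binomial(1, sigmoid(x @ w_star)).astype(float)
    return x, y

def minibatch_value_and_grad(w, x, y):
    """f_t(w) = (1/b) * sum_i (y_i - sigmoid(<w, x_i>))^2 and its gradient."""
    s = sigmoid(x @ w)
    value = np.mean((y - s) ** 2)
    grad = (-2.0 * (y - s) * s * (1.0 - s)) @ x / len(y)  # chain rule through the sigmoid
    return value, grad
```

Querying this oracle at a point w yields exactly the information one SNGD iteration uses, and Lemma 3.2 suggests that for b = Ω(log(T/δ)/ϵ²) the resulting f_t is SLQC with high probability.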
5.1 Main Results

Algorithm 2 Stochastic Normalized Gradient Descent (SNGD)
  Input: #iterations T, x_1 ∈ R^d, learning rate η, minibatch size b
  for t = 1 ... T do
    Sample: {ψ_i}_{i=1}^b ∼ D^b, and define f_t(x) = (1/b) Σ_{i=1}^b ψ_i(x)
    Update: x_{t+1} = x_t − η ĝ_t, where g_t = ∇f_t(x_t), ĝ_t = g_t / ‖g_t‖
  end for
  Return: x̄_T = argmin_{t ∈ [T]} f_t(x_t)

SNGD is presented in Algorithm 2. SNGD is similar to SGD, except that we normalize the gradients. The normalization is crucial in order to take advantage of the SLQC assumption and to overcome the hurdles of plateaus and cliffs (a minimal code sketch of the scheme appears at the end of this section). The following is our main theorem:

Theorem 5.1. Fix δ, ϵ, G, M, κ > 0. Suppose we run SNGD with T ≥ κ²‖x_1 − x*‖²/ϵ² iterations, η = ϵ/κ, and
  b ≥ max{M² log(4T/δ)/(2ϵ²), b_0(ϵ, δ, T)}.
Assume that for b ≥ b_0(ϵ, δ, T), with probability at least 1 − δ and for every t ∈ [T], the function f_t defined in the algorithm is M-bounded and is (ϵ, κ, x*)-SLQC in x_t. Then, with probability of at least 1 − 2δ, we have f(x̄_T) − f(x*) ≤ 3ϵ.

We prove Theorem 5.1 at the end of this section.

Remark 2. Since strict-quasi-convexity and (G, ϵ/G, x*)-local-Lipschitzness are equivalent to SLQC, the theorem implies that f could have arbitrarily large gradients outside B(x*, ϵ/G), yet SNGD is still ensured to output an ϵ-optimal point within G²‖x_1 − x*‖²/ϵ² iterations. We are not familiar with a similar guarantee for SGD even in the convex case.

Remark 3. Theorem 5.1 requires the minibatch size to be Ω(1/ϵ²). In the context of learning, the number of functions, n, corresponds to the number of training examples. By standard sample complexity bounds, n should also be of order 1/ϵ². One may therefore wonder whether the minibatch size should be of order n. This is not the case: the required training set size is 1/ϵ² times the VC dimension of the hypothesis class. In many practical cases the VC dimension is the more significant factor, and therefore n will be much larger than the required minibatch size. The reason our analysis requires a minibatch of size 1/ϵ², without the VC dimension factor, is that we are only "validating" rather than "learning".

In SGD, for the case of convex functions, even a minibatch of size 1 suffices for guaranteed convergence. In contrast, for SNGD we require a minibatch of size Ω(1/ϵ²). The theorem below shows that the requirement of a large minibatch is not an artifact of our analysis but is truly necessary.

Theorem 5.2. Let ϵ ∈ (0, 0.1]. There exists a distribution over convex functions such that running SNGD with a minibatch size of b = 0.2/ϵ, with high probability, never reaches an ϵ-optimal solution.

The gap between the upper bound of 1/ϵ² and the lower bound of 1/ϵ remains an open question. We now provide a sketch of the proof of Theorem 5.1:

Proof of Theorem 5.1. Theorem 5.1 is a consequence of the following two lemmas. In the first we show that whenever all the f_t's are SLQC, there exists some t such that f_t(x_t) − f_t(x*) ≤ ϵ. In the second lemma, we show that for a large enough minibatch size b, for every t ∈ [T] we have f(x_t) ≤ f_t(x_t) + ϵ and f(x*) ≥ f_t(x*) − ϵ. Combining these two lemmas, we conclude that f(x̄_T) − f(x*) ≤ 3ϵ.

Lemma 5.1. Let ϵ, δ > 0. Suppose we run SNGD for T ≥ κ²‖x_1 − x*‖²/ϵ² iterations, with b ≥ b_0(ϵ, δ, T) and η = ϵ/κ. Assume that, with probability at least 1 − δ, all the f_t's are (ϵ, κ, x*)-SLQC in x_t whenever b ≥ b_0(ϵ, δ, T). Then, with probability at least 1 − δ, there must be some t ∈ [T] for which f_t(x_t) − f_t(x*) ≤ ϵ.

Lemma 5.1 is proved similarly to Theorem 4.1; we omit the proof due to space constraints. The second lemma relates f_t(x_t) − f_t(x*) ≤ ϵ to a bound on f(x_t) − f(x*).

Lemma 5.2. Suppose b ≥ M² log(4T/δ)/(2ϵ²). Then, with probability at least 1 − δ, for every t ∈ [T]:
  f(x_t) ≤ f_t(x_t) + ϵ,  and also  f(x*) ≥ f_t(x*) − ϵ.

Lemma 5.2 is a direct consequence of Hoeffding's bound. Using the definition of x̄_T (Algorithm 2) together with Lemma 5.2 gives
  f(x̄_T) − f(x*) ≤ f_t(x_t) − f_t(x*) + 2ϵ,  for every t ∈ [T].
Combining the latter with Lemma 5.1 establishes Theorem 5.1.
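For concreteness, below is a minimal sketch of Algorithm 2 driven by a generic minibatch oracle. The demo mirrors the noisy-GLM oracle sketched above, but for a clean illustration it uses noiseless sigmoid labels (the idealized model of Section 3.1.1 with freshly drawn inputs playing the role of D); the dimension, W, step size, iteration count, and minibatch size are all illustrative choices.

```python
import numpy as np

def sngd(minibatch_value_and_grad, x1, eta, T, b):
    """Algorithm 2 (SNGD): at each step draw a fresh minibatch, evaluate the
    minibatch average f_t and its gradient, and move along the normalized
    gradient.  Returns the iterate with the smallest observed f_t(x_t)."""
    x = np.asarray(x1, dtype=float)
    best_x, best_val = x.copy(), np.inf
    for _ in range(T):
        val, g = minibatch_value_and_grad(x, b)   # f_t(x_t) and grad f_t(x_t)
        if val < best_val:
            best_x, best_val = x.copy(), val
        norm = np.linalg.norm(g)
        if norm > 0.0:
            x = x - eta * g / norm                # x_{t+1} = x_t - eta * g_t / ||g_t||
    return best_x

# Demo: sigmoid-GLM regression with freshly drawn inputs and noiseless labels.
rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d, W = 5, 2.0
w_star = rng.normal(size=d)
w_star *= W / np.linalg.norm(w_star)

def glm_minibatch(w, b):
    x = rng.normal(size=(b, d))
    x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)   # ||x_i|| <= 1
    y = sigmoid(x @ w_star)                                          # noiseless labels
    s = sigmoid(x @ w)
    grad = (-2.0 * (y - s) * s * (1.0 - s)) @ x / b
    return np.mean((y - s) ** 2), grad

eps = 0.05
w_hat = sngd(glm_minibatch, np.zeros(d), eta=eps, T=2000, b=int(1 / eps ** 2))
print(np.linalg.norm(w_hat - w_star))   # on the order of eta, versus an initial distance of W
```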
Figure 2: Comparison between optimization schemes. Left: test error. Middle: objective value (on the training set). Right: the objective of SNGD for different minibatch sizes.

6 Experiments

A better understanding of how to train deep neural networks is one of the greatest challenges in current machine learning and optimization. Since learning NN (Neural Network) architectures essentially requires solving a hard non-convex program, we have decided to focus our empirical study on this type of task. As a test case, we train a Neural Network with a single hidden layer of 100 units over the MNIST data set. We use a ReLU activation function and minimize the square loss. We employ a regularization over the weights with a parameter of λ = 5·10⁻⁴.

At first we were interested in comparing the performance of SNGD to MSGD (Minibatch Stochastic Gradient Descent) and to a stochastic variant of Nesterov's accelerated gradient method [19], which is considered to be state-of-the-art. For MSGD and Nesterov's method we used a step size rule of the form η_t = η_0(1 + γt)^{−3/4}, with η_0 = 0.01 and γ = 10⁻⁴. For SNGD we used a constant step size of 0.1. In Nesterov's method we used a momentum of 0.95. The comparison appears in Figures 2(a) and 2(b). As expected, MSGD converges relatively slowly. Conversely, the performance of SNGD is comparable with Nesterov's method. All methods employed a minibatch size of 100.

Later, we were interested in examining the effect of the minibatch size on the performance of SNGD. We ran SNGD with different minibatch sizes. As seen in Figure 2(c), the performance improves significantly as the minibatch size increases.

7 Discussion

We have presented the first provable gradient-based algorithm for stochastic quasi-convex optimization. This is a first attempt at generalizing the well-developed machinery of stochastic convex optimization to the challenging non-convex problems facing machine learning, and at better characterizing the border between NP-hard non-convex optimization and tractable cases such as the ones studied herein. Amongst the numerous challenging questions that remain, we note that there is a gap between the upper and lower bounds on the minibatch size sufficient for SNGD to provably converge.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 336078 ERC-SUBLRN. Shai Shalev-Shwartz is supported by ISF grant no. 1673/14 and by Intel's ICRI-CI.

References

[1] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, pages 316-322, 1996.
[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1:75-80, 1993.
[6] Jean-Louis Goffin, Zhi-Quan Luo, and Yinyu Ye. Complexity analysis of an interior cutting plane method for convex feasibility problems. SIAM Journal on Optimization, 6(3):638-652, 1996.
[7] Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
[8] Qifa Ke and Takeo Kanade. Quasiconvex optimization for robust geometric reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1834-1847, 2007.
[9] Rustem F. Khabibullin. A method to find a point of a convex set. Issled. Prik. Mat., 4:15-22, 1977.
[10] Krzysztof C. Kiwiel. Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical Programming, 90(1):1-25, 2001.
[11] Igor V. Konnov. On convergence properties of a subgradient method. Optimization Methods and Software, 18(1):53-62, 2003.
[12] Jean-Jacques Laffont and David Martimort. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, 2009.
[13] James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033-1040, 2011.
[14] P. McCullagh and J. A. Nelder. Generalized Linear Models. London: Chapman and Hall/CRC, 1989.
[15] Yu. E. Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, 29:519-531, 1984.
[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1310-1318, 2013.
[17] Boris T. Polyak. A general method of solving extremum problems. Dokl. Akademii Nauk SSSR, 174(1):33, 1967.
[18] Jarosław Sikorski. Quasi subgradient algorithms for calculating surrogate constraints. In Analysis and Algorithms of Optimization Problems, pages 203-236. Springer, 1986.
[19] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139-1147, 2013.
[20] Hal R. Varian. Price discrimination and social welfare. The American Economic Review, pages 870-875, 1985.
[21] Elmar Wolfstetter. Topics in Microeconomics: Industrial Organization, Auctions, and Incentives. Cambridge University Press, 1999.
[22] Yaroslav Ivanovich Zabotin, A. I. Korablev, and Rustem F. Khabibullin. The minimization of quasicomplex functionals. Izv. Vyssh. Uch. Zaved. Mat., (10):27-33, 1972.