# Black-box Optimization with a Politician

Sébastien Bubeck, Microsoft Research (sebubeck@microsoft.com)
Yin-Tat Lee, MIT (yintat@mit.edu)

We propose a new framework for black-box convex optimization which is well-suited for situations where gradient computations are expensive. We derive a new method for this framework which leverages several concepts from convex optimization, from standard first-order methods (e.g., gradient descent or quasi-Newton methods) to analytical centers (i.e., minimizers of self-concordant barriers). We demonstrate empirically that our new technique compares favorably with state-of-the-art algorithms (such as BFGS).

1. Introduction

In standard black-box convex optimization (Nemirovski and Yudin, 1983; Nesterov, 2004; Bubeck, 2015), first-order methods interact with an oracle: given a query point $x$, the oracle reports the value and gradient of the underlying objective function $f$ at $x$. In this paper we propose to replace the oracle by a politician. Instead of answering the original query $x$, the politician changes the question and answers a new query $y$ which is guaranteed to be better than the original query $x$, in the sense that $f(y) \leq f(x)$. The newly selected query $y$ also depends on the history of queries that were made to the politician. Formally we introduce the following definition (for the sake of simplicity we write $\nabla f(x)$ for either a gradient or a subgradient of $f$ at $x$).

Definition 1. Let $X \subseteq \mathbb{R}^n$ and $f : X \to \mathbb{R}$. A politician $\Phi$ for $f$ is a mapping from $X \times \bigcup_{k=0}^{\infty} (X \times \mathbb{R} \times \mathbb{R}^n)^k$ to $X$ such that for any $k \geq 0$, $x \in X$, $h \in (X \times \mathbb{R} \times \mathbb{R}^n)^k$ one has $f(\Phi(x, h)) \leq f(x)$. Furthermore, when queried at $x$ with history $h$, a politician for $f$ also outputs $f(\Phi(x, h))$ and $\nabla f(\Phi(x, h))$ (in order not to overload notation we do not include these outputs in the range of $\Phi$).

Let us clarify the interaction of a first-order method with a politician. Note that we refer to the couple (first-order method, politician) as the algorithm. Let $M : \bigcup_{k=0}^{\infty} (X \times \mathbb{R} \times \mathbb{R}^n)^k \to X$ be a first-order method and $\Phi$ a politician for some function $f : X \to \mathbb{R}$. The course of the algorithm $(M, \Phi)$ then goes as follows: at iteration $k+1$ one first calculates the method's query point $x_{k+1} = M(h_k)$ (with $h_0 = \emptyset$), then one calculates the politician's new query point $y_{k+1} = \Phi(x_{k+1}, h_k)$ and the first-order information at this point, $(f(y_{k+1}), \nabla f(y_{k+1}))$, and finally one updates the history with this new information: $h_{k+1} = (h_k, (y_{k+1}, f(y_{k+1}), \nabla f(y_{k+1})))$. Note that a standard oracle simply corresponds to the politician $O$ for $f$ such that $O(x, h) = (x, f(x), \nabla f(x))$ (in particular the algorithm $(M, O)$ is the usual algorithm corresponding to the first-order method $M$).

The philosophy of the above definition is that it gives, in some sense, an automatic way to combine different optimization algorithms. Say for example that we wish to combine the ellipsoid method with gradient descent. One way to do so is to design an "ellipsoidal politician": the politician keeps track of a feasible ellipsoidal region based on the previously computed gradients, and when asked the query $x$ the politician chooses as a new query $y$ the result of a line search on the line between $x$ and the center of the current ellipsoid. Gradient descent with this ellipsoidal politician would then replace the step $x \leftarrow x - \eta \nabla f(x)$ by $x \leftarrow y - \eta \nabla f(y)$. The hope is that in practice such a combination would integrate the fast incremental progress of gradient descent with the geometrical progress of the ellipsoid method.
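To make the interaction protocol concrete, here is a minimal Python sketch (ours, purely illustrative; the names `run`, `method`, `politician` and `oracle` are not from the paper) of the driver loop for an algorithm $(M, \Phi)$, together with the standard oracle viewed as the trivial politician:

```python
import numpy as np

def run(method, politician, T):
    """Run the algorithm (M, Phi) for T iterations.

    method(history)        -> query point x_{k+1}    (the first-order method M)
    politician(x, history) -> (y, f(y), grad f(y))   (the politician Phi, with f(y) <= f(x))
    history is the list h_k = [(y_i, f(y_i), grad f(y_i))].
    """
    history = []
    for _ in range(T):
        x = method(history)                  # x_{k+1} = M(h_k)
        y, fy, gy = politician(x, history)   # y_{k+1} = Phi(x_{k+1}, h_k)
        history.append((y, fy, gy))          # h_{k+1} = (h_k, (y_{k+1}, f(y_{k+1}), grad f(y_{k+1})))
    return min(history, key=lambda rec: rec[1])  # best point seen

def oracle(f, grad):
    """The trivial politician O(x, h) = (x, f(x), grad f(x)), i.e. a standard oracle."""
    return lambda x, history: (x, f(x), grad(x))

# Example: gradient descent with step size eta, expressed as a first-order method M.
eta, x0 = 0.1, np.zeros(2)
gd = lambda h: x0 if not h else h[-1][0] - eta * h[-1][2]
f = lambda x: float(x @ x)
g = lambda x: 2 * x
best_point, best_value, _ = run(gd, oracle(f, g), T=50)
```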
In this paper we focus on unconstrained convex optimization. We are particularly interested in situations where calculating a (sub)gradient has superlinear complexity (i.e., much larger than $n$), such as in logistic regression and semidefinite programming. In such cases it is natural to try to make the most out of the computed gradients by incorporating geometric reasoning (such as in the ellipsoid method). We do so by introducing the geometric politician (Section 3), which is based on a combination of the recent ideas of (Bubeck et al., 2015) with standard cutting plane/interior point methods machinery (through the notion of a center of a set, see Section 4). For a given first-order method M, we denote by M+ the algorithm obtained by running M with the geometric politician. We demonstrate empirically (Section 5) the effectiveness of the geometric politician on various standard first-order methods for convex optimization (gradient descent, Nesterov's accelerated gradient descent, non-linear conjugate gradient, BFGS). In particular we show that BFGS+ is a surprisingly robust and parameter-free algorithm with state-of-the-art performance across a wide range of problems (both smooth and nonsmooth).

2. Affine invariant politician

As mentioned above, we assume that the complexity of computing the map $x \mapsto \nabla f(x)$ is superlinear. This implies that we can afford a politician such that the complexity of computing the map $(x, h) \mapsto \Phi(x, h)$ is $O(n \cdot \mathrm{poly}(k))$ (we think of the number of iterations $k$ as typically much smaller than the dimension $n$). We show in this section that this condition is (essentially) automatically satisfied as long as the politician is affine invariant in the following sense (we use a slight abuse of language and refer to a map $f \mapsto \Phi_f$, where $\Phi_f$ is a politician for $f$, as a politician):

Definition 2. A politician $f \mapsto \Phi_f$ is called affine invariant if for any function $f$, any affine map $T : \mathbb{R}^m \to \mathbb{R}^n$ such that $T(x) = z + Lx$ for some matrix $L$, any $k \geq 0$, $x \in \mathbb{R}^m$, and $(y_i, v_i, g_i) \in \mathbb{R}^m \times \mathbb{R} \times \mathbb{R}^n$, one has
$$T\big(\Phi_{f \circ T}(x, (y_i, v_i, L^\top g_i)_{i \in [k]})\big) = \Phi_f\big(T(x), (T(y_i), v_i, g_i)_{i \in [k]}\big).$$
We say that an affine invariant politician has cost $\psi : \mathbb{N} \to \mathbb{N}$ if for any $f : \mathbb{R}^k \to \mathbb{R}$ the map $(x, h) \in \mathbb{R}^k \times (\mathbb{R}^k \times \mathbb{R} \times \mathbb{R}^k)^k \mapsto \Phi_f(x, h)$ can be computed in time $\psi(k)$.

Proposition 1. Let $\Phi$ be an affine invariant politician with cost $\psi$. Then for any $f : \mathbb{R}^n \to \mathbb{R}$, $(y_i, v_i, g_i) \in \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^n$, $i \in [k]$, and $x, y_i \in y_1 + \mathrm{Span}(g_1, \ldots, g_k)$, one can compute $\Phi_f(x, (y_i, v_i, g_i)_{i \in [k]}) \in \mathbb{R}^n$ in time $\psi(k) + O(nk^2)$.

Proof. Let $G$ be the $n \times k$ matrix with $i$th column given by $g_i$. We consider the QR decomposition of $G$, which can be computed in time $O(nk^2)$; that is, $Q$ is an $n \times k$ matrix and $R$ a $k \times k$ matrix such that $G = QR$ and $Q^\top Q = I_k$. Let $T$ be the affine map defined by $T = y_1 + Q$. Note that since $x \in y_1 + \mathrm{Span}(g_1, \ldots, g_k)$ one has $x = T(Q^\top (x - y_1))$ (and similarly for the $y_i$). Thus by affine invariance one has
$$\Phi_f(x, (y_i, v_i, g_i)) = \Phi_f\big(T(Q^\top(x - y_1)), (T(Q^\top(y_i - y_1)), v_i, g_i)\big) = y_1 + Q\,\Phi_{f \circ T}\big(Q^\top(x - y_1), (Q^\top(y_i - y_1), v_i, R_i)\big),$$
where $R_i$ is the $i$th column of $R$. Furthermore, by definition of the cost $\psi$, and since $f \circ T$ is defined on $\mathbb{R}^k$, this last quantity can be computed in time $\psi(k) + O(nk^2)$, thus concluding the proof.
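The proof translates almost line by line into code. The following sketch (ours; `phi_low_dim` stands for an implementation of the politician on functions over $\mathbb{R}^k$, which the paper does not spell out) carries out the reduction of Proposition 1 with a QR decomposition:

```python
import numpy as np

def reduced_politician_call(phi_low_dim, f, x, ys, vs, gs):
    """Evaluate an affine invariant politician in time psi(k) + O(n k^2).

    ys, vs, gs: past queries y_i, values v_i, gradients g_i, with x and all
    y_i in y_1 + Span(g_1, ..., g_k).
    phi_low_dim(f_low, x_low, hist_low) is the politician for functions on R^k.
    """
    y1 = ys[0]
    G = np.column_stack(gs)          # n x k matrix with columns g_i
    Q, R = np.linalg.qr(G)           # G = QR with Q^T Q = I_k; cost O(n k^2)
    T = lambda w: y1 + Q @ w         # the affine map T = y1 + Q
    f_low = lambda w: f(T(w))        # f o T, defined on R^k
    x_low = Q.T @ (x - y1)           # x = T(Q^T (x - y1)) since x - y1 is in range(Q)
    hist_low = [(Q.T @ (y - y1), v, R[:, i])
                for i, (y, v) in enumerate(zip(ys, vs))]
    w = phi_low_dim(f_low, x_low, hist_low)  # politician for f o T, cost psi(k)
    return T(w)                      # Phi_f(x, h) = y1 + Q Phi_{f o T}(...)
```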
The above proposition shows that, with an affine invariant politician and a first-order method $M$ verifying, for any $(y_i, v_i, g_i)_{i \in [k]} \in (\mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^n)^k$,
$$M\big((y_i, v_i, g_i)_{i \in [k]}\big) \in y_1 + \mathrm{Span}(y_1, \ldots, y_k, g_1, \ldots, g_k),$$
one can run $k$ steps of the corresponding algorithm in time $O(nk^2 + k\psi(k))$, plus the time to compute the $k$ function values and gradients of the underlying function $f$ to be optimized. Note that one gets a time of $O(nk^2)$ instead of $O(nk^3)$ because one can store the QR decomposition from one step to the next, and updating the decomposition only costs $O(nk)$.

3. Geometric politician

We describe in this section the geometric politician, which is based on ideas developed in (Bubeck et al., 2015). A key observation in the latter paper is that if $f$ is an $\alpha$-strongly convex function minimized at $x^*$, then for any $x$,
$$\Big\| x^* - x + \frac{1}{\alpha}\nabla f(x) \Big\|^2 \leq \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\big(f(x) - f(x^*)\big).$$
This motivates the following definition:
$$B(x, \alpha, f_{\mathrm{val}}) := \Big\{ z \in \mathbb{R}^n : \Big\| z - x + \frac{1}{\alpha}\nabla f(x) \Big\|^2 \leq \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\big(f(x) - f_{\mathrm{val}}\big) \Big\}.$$
In particular, given the first-order information at $y_1, \ldots, y_k$, one knows that the optimum $x^*$ lies in the region
$$R_k := \bigcap_{i \in [k]} B(y_i, \alpha, f_{\mathrm{val}}) \subseteq \mathbb{R}^n, \quad \text{where } f_{\mathrm{val}} = \min_{i \in [k]} f(y_i). \tag{1}$$

Now suppose that, given this first-order information at $y_1, \ldots, y_k$, the first-order method asks to query $x$. How should we modify this query in order to take into account the geometric information that $x^* \in R_k$? First observe that for any $z$, $B(z, \alpha, f_{\mathrm{val}})$ is contained in a halfspace that has $z$ on its boundary (in the limiting case $\alpha \to 0$ the set $B(z, \alpha, f(z))$ is exactly a halfspace). In particular, if the next query point $y_{k+1}$ is the center of gravity of $R_k$, then the volume of $R_{k+1}$ is at most $1 - 1/e$ times the volume of $R_k$ (see (Grünbaum, 1960)), thus leading to an exponential convergence rate. However, the region $R_k$ can be very large initially, and the center of gravity might have a large function value and gradient, which means that $R_k$ would be intersected with a large sphere (possibly so large that it is close to a halfspace). On the other hand, the first-order method recommends to query $x$, which we can think of as a local improvement of $y_k$, and which should lead to a much smaller sphere. The issue is that the position of this sphere might be such that its intersection with $R_k$ is almost as large as the sphere itself. In order to balance the geometric and function value/gradient considerations, we propose for the new query to do a line search between the center of $R_k$ and the recommended query $x$.

The geometric politician follows this recipe with two important modifications: (i) there are many choices of centers that would guarantee an exponential convergence rate while being much easier to compute than the center of gravity, and we choose here the volumetric center (see Section 4 for the definition and more details about this notion); (ii) we use a simple heuristic to adapt online the strong convexity parameter $\alpha$: we start with some large value for $\alpha$, and if it happens that the feasible region $R_k$ is empty then we know that $\alpha$ was too large, in which case we reduce it.
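To illustrate the objects involved, here is a sketch (ours, with hypothetical helper names) of the balls $B(y_i, \alpha, f_{\mathrm{val}})$ and of the $\alpha$-reduction heuristic. Testing emptiness of an intersection of balls exactly is nontrivial, and the paper detects it through the center computation; the check below is only a cheap necessary condition:

```python
import numpy as np
from itertools import combinations

def balls(history, alpha):
    """Balls B(y_i, alpha, f_val) containing x* when f is alpha-strongly convex.
    Each ball is returned as (center, squared_radius)."""
    f_val = min(fy for _, fy, _ in history)
    out = []
    for y, fy, gy in history:
        center = y - gy / alpha                       # the ball is centered at y - grad/alpha
        sq_radius = (gy @ gy) / alpha**2 - 2.0 * (fy - f_val) / alpha
        out.append((center, sq_radius))
    return out

def plausibly_nonempty(bs):
    """Cheap necessary condition for a nonempty intersection: every ball is
    nonempty and every pair of balls overlaps (illustrative only)."""
    if any(s < 0 for _, s in bs):
        return False
    return all(np.linalg.norm(c1 - c2) <= np.sqrt(s1) + np.sqrt(s2)
               for (c1, s1), (c2, s2) in combinations(bs, 2))

def shrink_alpha(history, alpha):
    """Heuristic in the spirit of Algorithm 1: if R_k(alpha) looks empty,
    reduce alpha (here simply by repeated division by 4)."""
    while alpha > 1e-12 and not plausibly_nonempty(balls(history, alpha)):
        alpha /= 4.0
    return alpha
```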
We can now describe the geometric politician formally, see Algorithm 1. Importantly, one can verify that the geometric politician is affine invariant and thus can be implemented efficiently (see the proof of Proposition 1).

Algorithm 1: Geometric Politician
Parameter: an upper bound on the strong convexity parameter $\alpha$ (can be $+\infty$).
Input: query $x$, past queries and the corresponding first-order information $(y_i, f(y_i), \nabla f(y_i))_{i \in [k]}$.
Let $f_{\mathrm{val}} = \min_{i \in [k]} f(y_i)$ and the feasible region $R_k(\alpha) = \bigcap_{i \in [k]} B(y_i, \alpha, f_{\mathrm{val}})$.
if $R_k(\alpha) = \emptyset$ then
  Let $\alpha$ be the largest number such that $R_k(\alpha) \neq \emptyset$.
  $\alpha \leftarrow \alpha / 4$.
end
Let $y_{k+1} = \mathrm{argmin}_{y \in \{tx + (1-t)\,c(R_k(\alpha)) \,:\, t \in \mathbb{R}\}} f(y)$, where $c(R_k(\alpha))$ is the volumetric center of $R_k(\alpha)$ (see Section 4).
Output: $y_{k+1}$, $f(y_{k+1})$ and $\nabla f(y_{k+1})$.

4. Volumetric center

The volumetric barrier for a polytope was introduced in (Vaidya, 1996) to construct an algorithm with both the oracle complexity of the center of gravity method and the computational complexity of the ellipsoid method (see [Section 2.3, (Bubeck, 2015)] for more details, and (Lee et al., 2015) for recent advances on this construction). Recalling that the standard logarithmic barrier $F_P$ for the polytope $P = \{x \in \mathbb{R}^n : a_i^\top x < b_i, i \in [m]\}$ is defined by
$$F_P(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x),$$
one defines the volumetric barrier $v_P$ for $P$ by $v_P(x) = \log\det(\nabla^2 F_P(x))$. The volumetric center $c(P)$ is then defined as the minimizer of $v_P$.

In the context of the geometric politician (see Algorithm 1) we are dealing with an intersection of balls rather than an intersection of halfspaces. More precisely, the region of interest is of the form
$$R = \bigcap_{i=1}^k \{x \in \mathbb{R}^n : \|x - c_i\| \leq r_i\}.$$
For such a domain the natural self-concordant barrier to consider is
$$F_R(x) = -\sum_{i=1}^k \log\big(r_i^2 - \|x - c_i\|^2\big).$$
The volumetric barrier is defined as before by $v_R(x) = \log\det(\nabla^2 F_R(x))$, and the volumetric center of $R$ is the minimizer of $v_R$. It is shown in (Anstreicher, 2004) that $v_R$ is a self-concordant barrier, which means that the center can be updated (when a new ball is added to $R$) via a few iterations of Newton's method. In practice it often takes fewer than 5 iterations to update the minimizer of a self-concordant barrier when a new constraint is added (Goffin and Vial, 1999; Bahn et al., 1995). Hence the complexity depends mainly on how fast we can compute the gradient and Hessian of $F_R$ and $v_R$.

Proposition 2. For the analytic barrier $F_R$ we have
$$\nabla F_R(x) = A^\top 1_{k \times 1}, \qquad \nabla^2 F_R(x) = 2 A^\top A + \lambda^{(1)} I,$$
where $d$ is the vector defined by $d_i(x) = \big(r_i^2 - \|x - c_i\|^2\big)^{-1}$, $A$ is the $k \times n$ matrix with $i$th row given by $d_i(x)(x - c_i)^\top$, $\lambda^{(p)} = \sum_{i \in [k]} d_i^p(x)$, and $1_{k \times 1}$ is the $k \times 1$ matrix with all entries equal to 1. For the volumetric barrier we have
$$\nabla v_R(x) = \big(2\,\mathrm{tr}(H^{-1})\, I + 4 H^{-1}\big) A^\top d + 8 A^\top \sigma,$$
$$\begin{aligned}
\nabla^2 v_R(x) ={}& 48\, A^\top \Sigma A - 64\, A^\top \big(A H^{-1} A^\top\big)^{(2)} A + \big(8\,\mathrm{tr}(D\Sigma) + 2\lambda^{(2)}\,\mathrm{tr}(H^{-1})\big) I + 4\lambda^{(2)} H^{-1} \\
&+ 8\,\mathrm{tr}(H^{-1})\, A^\top D A + 16\,\mathrm{sym}\big(A^\top D A H^{-1}\big) - 4\,\mathrm{tr}(H^{-2})\, A^\top D J D A - 8\, H^{-1} A^\top D J D A H^{-1} \\
&- 8\,\mathrm{sym}\big(A^\top D J D A H^{-2}\big) - 8\,\big(d^\top A H^{-1} A^\top d\big)\, H^{-1} - 16\,\mathrm{sym}\big(A^\top \mathrm{diag}\big(A H^{-2} A^\top\big) J D A\big) \\
&- 32\,\mathrm{sym}\big(A^\top \mathrm{diag}\big(A H^{-1} A^\top d\big) A H^{-1}\big),
\end{aligned} \tag{2}$$
where $H = \nabla^2 F_R(x)$, $\sigma_i = e_i^\top A H^{-1} A^\top e_i$ with $e_i$ the $i$th standard basis vector, $J$ is the $k \times k$ all-ones matrix, $\mathrm{sym}(B) = B + B^\top$, $\mathrm{diag}(v)$ is the diagonal matrix with $\mathrm{diag}(v)_{ii} = v_i$, $D = \mathrm{diag}(d)$, $\Sigma = \mathrm{diag}(\sigma)$, and $B^{(2)}$ is the Schur square of $B$ defined by $B^{(2)}_{ij} = B_{ij}^2$.

The above proposition shows that one step of Newton's method for the analytic center requires one dense matrix multiplication and the solution of one linear system, while for the volumetric center it requires five dense matrix multiplications, one matrix inversion, and the solution of one linear system if implemented correctly. Although the analytic center is the more popular choice for geometrical algorithms, we choose the volumetric center here because it gives a better convergence rate (Vaidya, 1996; Atkinson and Vaidya, 1995) and the extra cost $\psi(k)$ is negligible compared to the $O(nk)$ cost of updating the QR decomposition.
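As an illustration of the first half of Proposition 2, the following sketch (ours) performs one Newton step toward the analytic center of an intersection of balls; constant factors in $\nabla F_R$ and $\nabla^2 F_R$ cancel in the Newton direction, and the volumetric-center step is analogous with $\nabla v_R$ and $\nabla^2 v_R$:

```python
import numpy as np

def newton_step_analytic_center(x, centers, radii):
    """One Newton step toward the analytic center of R = {x : ||x - c_i|| <= r_i},
    using the formulas of Proposition 2: A has ith row d_i(x) (x - c_i)^T and
    lambda^(1) = sum_i d_i(x). Cost: one dense multiply and one linear solve."""
    d = 1.0 / (radii**2 - np.sum((x - centers)**2, axis=1))  # d_i(x) > 0 inside R
    assert np.all(d > 0), "x must lie strictly inside R"
    A = d[:, None] * (x - centers)                # k x n matrix
    grad = A.T @ np.ones(len(d))                  # grad F_R(x) = A^T 1
    H = 2.0 * A.T @ A + d.sum() * np.eye(x.size)  # hess F_R(x) = 2 A^T A + lambda^(1) I
    return x - np.linalg.solve(H, grad)

# A handful of such steps suffices to re-center after a new ball is added
# (cf. Goffin and Vial, 1999; Bahn et al., 1995).
```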
5. Experiments

In this section we compare the geometric politician against two libraries for first-order methods, minFunc (Schmidt, 2012) and TFOCS (Becker et al., 2011). Both are popular MATLAB libraries for minimizing general smooth convex functions.

Since the focus of this paper is on how to find a good step direction using a politician, we use exact line search (up to machine accuracy) whenever possible. This eliminates the effect of different line searches and reduces the number of algorithms we need to test. TFOCS is the only algorithm we test that does not use a line search, because the library does not provide such an option. To compensate for this unfairness to TFOCS, we note that the algorithm TFOCS uses is accelerated gradient descent, and hence we also implement Gonzaga and Karas's accelerated gradient descent (Gonzaga and Karas, 2013), which is specifically designed to be used with exact line search. Another reason to pick this variant of accelerated gradient descent is that we found it to be the fastest variant of accelerated gradient descent (excluding the geometric descent of (Bubeck et al., 2015)) on our test data (Gonzaga and Karas made the same observation on their own datasets).

The algorithms to be tested are the following:

[SD] Steepest descent algorithm in minFunc.
[Nes] Accelerated gradient descent, General Scheme 2.2.6 in (Nesterov, 2004).
[TFOCS] Accelerated gradient descent in TFOCS.
[GK] Gonzaga-Karas's variant of accelerated gradient descent (Sec 5.1).
[Geo] Geometric descent (Bubeck et al., 2015).
[CG] Non-linear conjugate gradient in minFunc.
[BFGS] Broyden-Fletcher-Goldfarb-Shanno algorithm in minFunc.
[PCG] Preconditioned non-linear conjugate gradient in minFunc.
[+] The geometric politician by itself (Sec 5.1).
[GK+] GK with the geometric oracle (Sec 5.1).
[BFGS+] BFGS with the geometric oracle (Sec 5.1).

We only tested the geometric oracle on GK and BFGS because they are respectively the best algorithms in theory and in practice on our test data. The [+] algorithm is used as a control group, to test whether the geometric politician by itself is sufficient to achieve a good convergence rate. We note that all algorithms except Nes are parameter-free; each step of SD, Nes, TFOCS, GK, Geo and CG takes $O(n)$ time, and each step of BFGS, PCG, +, GK+ and BFGS+ takes roughly $O(nk)$ time at the $k$th iteration.

5.1. Details of Implementations

The first algorithm we implement is the [+] algorithm, which simply calls the politician repeatedly. As we will see, this algorithm is great for nonsmooth problems but not competitive for smooth problems.

Algorithm 2: +
Input: $x_1$.
for $k = 1, 2, \ldots$ do
  Set $x_{k+1} \leftarrow \Phi_f\big(x_k, (x_i, f(x_i), \nabla f(x_i))_{i \in [k]}\big)$.
end

The second algorithm we implement is the accelerated gradient descent proposed by Gonzaga and Karas (2013). This algorithm uses the line search to learn the smoothness and strong convexity parameters, see Algorithm 3. We disable line (*) in the algorithm when $\Phi_f$ is a politician instead of an oracle, because the inequality $\gamma \geq \alpha$ need not hold for the strong convexity parameter $\alpha$ when $\Phi_f$ is not an oracle.

Algorithm 3: Gonzaga-Karas's variant of accelerated gradient descent
Input: $x_1$. Set $\gamma = 2\alpha$, $v_0 = x_1$ and $y_0 = x_1$.
for $k = 1, 2, \ldots$ do
  $y_k \leftarrow \Phi_f(y_{k-1})$.
  $x_{k+1} = \mathrm{line\_search}(y_k, \nabla f(y_k))$.
  if $\alpha \geq \gamma / 1.02$ and we are using a first-order oracle then $\alpha = \gamma / 2$.   (*)
  if $\alpha \geq \frac{\|\nabla f(y_k)\|^2}{2(f(y_k) - f(x_{k+1}))}$ then $\alpha = \frac{\|\nabla f(y_k)\|^2}{20(f(y_k) - f(x_{k+1}))}$.
  $G = \frac{\gamma - \alpha}{2}\|v_k - y_k\|^2 + \langle \nabla f(y_k), v_k - y_k \rangle$.
  $A = G + \frac{1}{2}\|\nabla f(y_k)\|^2 + (\alpha - \gamma)(f(x_k) - f(y_k))$.
  $B = (\alpha - \gamma)(f(x_{k+1}) - f(x_k)) - \gamma(f(y_k) - f(x_k)) - G$.
  $C = \gamma(f(x_{k+1}) - f(x_k))$.
  $\beta = \frac{-B + \sqrt{B^2 - 4AC}}{2A}$, $\gamma^+ = (1 - \beta)\gamma + \beta\alpha$.
  $v_{k+1} = \frac{1}{\gamma^+}\big((1 - \beta)\gamma v_k + \beta(\alpha y_k - \nabla f(y_k))\big)$, $\gamma \leftarrow \gamma^+$.
end
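All of the tested methods except TFOCS rely on a line search; the following golden-section search (ours, with a deliberately simplistic bracketing phase) is an illustrative stand-in for the "exact line search up to machine accuracy" used in the experiments:

```python
import numpy as np

def line_search(f, x, direction, t_max=1.0, tol=1e-12):
    """Approximately minimize t -> f(x - t * direction) by golden-section
    search; returns the new point. For simplicity each step re-evaluates f
    at both interior points (a real implementation would reuse one)."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0   # golden ratio conjugate, about 0.618
    # Crude bracketing: expand the interval while the far end keeps improving.
    while f(x - t_max * direction) < f(x - phi * t_max * direction):
        t_max *= 2.0
    a, b = 0.0, t_max
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol * max(1.0, t_max):
        if f(x - c * direction) < f(x - d * direction):
            b, d = d, c                # minimum lies in [a, d]
            c = b - phi * (b - a)
        else:
            a, c = c, d                # minimum lies in [c, b]
            d = a + phi * (b - a)
    return x - 0.5 * (a + b) * direction
```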
The third algorithm we implement is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. This algorithm uses past gradients to reconstruct the Hessian and uses this reconstruction to approximate Newton's method, see Algorithm 4. We note that another natural way to employ the politician with BFGS is to set $x_{k+1} = \mathrm{line\_search}(\Phi_f(x_k), p)$, and this runs faster in practice; however, that version computes two gradients per iteration (namely $\nabla f(x_k)$ and $\nabla f(\Phi_f(x_k))$), while we restrict ourselves to algorithms that compute one gradient per iteration.

Algorithm 4: BFGS
Input: $x_1$.
for $k = 1, 2, \ldots$ do
  $p = \nabla f(x_k)$.
  for $i = k-1, \ldots, 1$ do
    $\alpha_i = \langle s_i, p \rangle / \langle s_i, y_i \rangle$; $p = p - \alpha_i y_i$.
  end
  $p = \big(\langle s_{k-1}, y_{k-1} \rangle / \langle y_{k-1}, y_{k-1} \rangle\big)\, p$.
  for $i = 1, \ldots, k-1$ do
    $\beta_i = \langle y_i, p \rangle / \langle s_i, y_i \rangle$; $p = p + (\alpha_i - \beta_i) s_i$.
  end
  $x_{k+1} = \Phi_f\big(\mathrm{line\_search}(x_k, p)\big)$.
  $s_k = x_{k+1} - x_k$, $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
end

5.2. Quadratic function

We consider the function
$$f(x) = (x - c)^\top D (x - c), \tag{2}$$
where $D$ is a diagonal matrix with entries uniformly sampled from $[0, 1]$ and $c$ is a random vector with entries sampled from the standard normal distribution $N(0, 1)$. Since this is a quadratic function, CG, BFGS and BFGS+ are equivalent and optimal; namely, they output the minimum point in the span of all previous gradients.

Figure 1. Comparison of first-order methods for the function (2) with $n = 10000$.

5.3. Variant of Nesterov's Worst Function

(Nesterov, 2004) introduced the function
$$f(x) = (1 - x[1])^2 + \sum_{k=1}^{n-1} \big(x[k] - x[k+1]\big)^2$$
and used it to give a lower bound for all first-order methods. To distinguish the performance of CG, BFGS and BFGS+, we consider the following non-quadratic variant:
$$f(x) = g(1 - x[1]) + \sum_{k=1}^{n-1} g\big(x[k] - x[k+1]\big) \tag{3}$$
for some function $g$ to be defined. If we pick $g(x) = |x|$ then all first-order methods take at least $n$ iterations to minimize $f$ exactly. On the other hand, with $g(x) = \max(|x| - 0.1, 0)$ one of the minimizers of $f$ is $(1, \frac{9}{10}, 0, 0, \ldots, 0)$, and thus it takes at least 11 iterations for first-order methods to minimize $f$ in this case. We regularize the situation a bit and consider the function
$$g(x) = \begin{cases} \sqrt{(x - 0.1)^2 + 0.001^2} - 0.001 & \text{if } x \geq 0.1, \\ \sqrt{(x + 0.1)^2 + 0.001^2} - 0.001 & \text{if } x \leq -0.1, \\ 0 & \text{otherwise.} \end{cases}$$
Since this function is far from quadratic, our algorithms (+, GK+, BFGS+) converge much faster. This is thus a nice example where the geometric politician helps a lot, because the underlying dimension of the problem is small.

Figure 2. Comparison of first-order methods for the function (3) with $n = 10000$.

5.4. Binary regression with smoothed hinge loss

We consider the binary classification problem on datasets from (Chang and Lin, 2011). The problem is to minimize the regularized empirical risk
$$\sum_{i=1}^m \varphi_t\big(b_i a_i^\top x\big) + \lambda \|x\|^2, \tag{4}$$
where $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$ are given by the dataset, $\lambda$ is the regularization coefficient, $\varphi_t$ is the smoothed hinge loss defined by
$$\varphi_t(z) = \begin{cases} 0 & \text{if } z \leq -1, \\ z + 1 - \frac{t}{2} & \text{if } z \geq -1 + t, \\ \frac{1}{2t}(z + 1)^2 & \text{otherwise,} \end{cases}$$
and $t$ is the smoothness parameter. The usual choice for $t$ is 1; here we test both $t = 1$ and $t = 10^{-4}$. The latter case tests how well the algorithms perform when the function is essentially non-smooth. We note that for this problem it would be natural to compare ourselves with SGD (stochastic gradient descent) or more refined stochastic algorithms such as SAG (Le Roux et al., 2012) or SVRG (Johnson and Zhang, 2013). However, since the focus of this paper is on general black-box optimization, we stick to comparing only general methods. It is an interesting open problem to extend our algorithms to the stochastic setting, see Section 6.
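For concreteness, here is a sketch (ours) of the objective (4) and its gradient, under the convention for $\varphi_t$ displayed above; `Amat` and `b` are hypothetical names for the data matrix and labels:

```python
import numpy as np

def smoothed_hinge(z, t):
    """phi_t(z): 0 for z <= -1, linear for z >= -1 + t, quadratic in between."""
    u = z + 1.0
    return np.where(u <= 0.0, 0.0,
                    np.where(u >= t, u - t / 2.0, u**2 / (2.0 * t)))

def smoothed_hinge_grad(z, t):
    u = z + 1.0
    return np.where(u <= 0.0, 0.0, np.where(u >= t, 1.0, u / t))

def risk(x, Amat, b, lam, t):
    """Regularized empirical risk of problem (4); the regularizer is taken
    to be lambda * ||x||^2, up to the paper's normalization."""
    z = b * (Amat @ x)
    return smoothed_hinge(z, t).sum() + lam * (x @ x)

def risk_grad(x, Amat, b, lam, t):
    z = b * (Amat @ x)
    return Amat.T @ (b * smoothed_hinge_grad(z, t)) + 2.0 * lam * x
```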
In Figures 3 and 4 we show the performance profile for problems in the LIBSVM datasets (and with different values of the regularization parameter $\lambda$). More precisely, for a given algorithm we plot $x \in [1, 10]$ versus the fraction of datasets that the algorithm can solve (up to a prespecified accuracy) in a number of iterations which is at most $x$ times the number of iterations of the best algorithm for that dataset. Figure 3 shows the case $t = 1$ with target accuracy $10^{-6}$; Figure 4 shows the case $t = 10^{-4}$ with target accuracy $10^{-3}$.

We see that TFOCS is slower than SD on many problems; this is simply because SD uses the line search while TFOCS does not, and this makes a huge difference on simple problems. Among the algorithms taking $O(n)$ time per iteration, CG and Geo perform best, while among the $O(nk)$ algorithms, BFGS, BFGS+ and GK+ perform best. The gap in performance is particularly striking in the non-smooth case, where BFGS+ is the fastest algorithm on almost all problems and all other methods (except GK+) lag far behind (on 20% of the problems, all other methods take 10 times more iterations than BFGS+ and GK+).

Finally, in Figures 5 and 6 we test five algorithms on three specific datasets (respectively in the smooth and non-smooth case). In both figures we see that BFGS+ performs best on all three datasets. BFGS performs second best on smooth problems, while GK+ performs second best on non-smooth problems.

5.5. Summary

The experiments show that BFGS+ and BFGS perform best among all methods on smooth test problems, while BFGS+ and GK+ perform best on non-smooth test problems. The first phenomenon is due to the optimality of these algorithms for quadratic problems. We leave the explanation of the second phenomenon as an open problem. At the least, the experiments show that it is not due to the geometric oracle by itself, since + is much slower, nor to the original algorithm, since GK performs much worse than GK+ on those problems. Overall these experiments are very promising for the geometric oracle, as a replacement for quasi-Newton methods on non-smooth problems and as a general purpose solver due to its robustness.

Figure 3. Performance profile on problem (4) with $t = 1$ and $\lambda = 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}$.

Figure 4. Performance profile on problem (4) with $t = 10^{-4}$ and $\lambda = 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}$.

6. Discussion

First-order methods generally involve only very basic operations at each step (addition, scalar multiplication). In this paper we formalize each step's operations (besides the gradient calculation) as the work of the politician. We showed that the cost per step $\psi(k)$ of an affine invariant politician is negligible compared to the gradient calculation (which is $\Omega(n)$). This opens up a lot of possibilities: instead of basic additions or scalar multiplications, one can imagine computing a center of gravity, solving a linear program, or even searching over an exponential space (indeed, say $k < 30$ and $n > 10^{10}$; then $2^k < n$). Our experiments demonstrate the effectiveness of this strategy. On the other hand, from a theoretical point of view a lot remains to be done. For example, one can prove results of the following flavor:

Theorem 1. Let $f$ be such that $\alpha I \preceq \nabla^2 f(x) \preceq \beta I$ for all $x \in \mathbb{R}^n$, and let $\kappa = \beta/\alpha$. Suppose that in the geometric politician we replace the volumetric center by the center of gravity or the center of the John ellipsoid.
Let $y_k$ be the output of the $k$th step of SD+ with some initial point $x_0$. Then we have
$$f(y_k) - f(x^*) \leq \kappa \left(1 - \frac{1}{\Theta\big(\min(n \log \kappa, \kappa)\big)}\right)^k \big(f(x_0) - f(x^*)\big),$$
$$f(y_k) - f(x^*) \leq \frac{2 \beta R^2}{k + 4},$$
where $R = \max_{x : f(x) \leq f(x_0)} \|x - x^*\|$.

This claim says that, up to a logarithmic factor, SD+ enjoys simultaneously the incremental progress of gradient descent and the geometrical progress of cutting plane methods. There are three caveats in this claim:

- We use the center of gravity or the center of the John ellipsoid instead of the volumetric center. Note however that it is well known that the volumetric center is usually more difficult to analyze (Vaidya, 1996; Atkinson and Vaidya, 1995).
- The extraneous $\log(\kappa)$ comes from the number of potential restarts when we decrease $\alpha$. Is there a better way to learn $\alpha$ that would not incur this additional logarithmic term?
- (Bubeck et al., 2015) shows essentially that one can combine the ellipsoid method with gradient descent to achieve the optimal $1 - \sqrt{1/\kappa}$ rate. Can we prove such a result for SD+?

The geometric politician could be refined in many ways. Here are two simple questions that we leave for future work:

- One can think of gradient descent as storing 1 past gradient, accelerated gradient descent as storing 2, and our method as storing all past gradients. We believe that neither 1, 2, nor all is the correct answer. Instead, the algorithm should dynamically decide the number of gradients to store based on the size of its memory, the cost of computing gradients, and the information each gradient reveals.
- Is there a stochastic version of our algorithm? How well would such a method compare with state-of-the-art stochastic algorithms such as SAG (Le Roux et al., 2012) and SVRG (Johnson and Zhang, 2013)?

Figure 5. Comparison between Geo, CG, BFGS, GK+, BFGS+ on problem (4) with $t = 1$ and $\lambda = 10^{-4}, 10^{-6}, 10^{-8}$.

Figure 6. Comparison between Geo, CG, BFGS, GK+, BFGS+ on problem (4) with $t = 10^{-4}$ and $\lambda = 10^{-4}, 10^{-6}, 10^{-8}$.

References

K. M. Anstreicher. The volumetric barrier for convex quadratic constraints. Mathematical Programming, 100(3):613-662, 2004.
David S. Atkinson and Pravin M. Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1-43, 1995.
Olivier Bahn, O. Du Merle, J.-L. Goffin, and J.-P. Vial. A cutting plane method from analytic centers for stochastic programming. Mathematical Programming, 69(1-3):45-73, 1995.
Stephen R. Becker, Emmanuel J. Candès, and Michael C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165-218, 2011.
S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
S. Bubeck, Y.-T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Jean-Louis Goffin and Jean-Philippe Vial. Shallow, deep and very deep cuts in the analytic center cutting plane method. Mathematical Programming, 84(1):89-103, 1999.
Clóvis C. Gonzaga and Elizabeth W. Karas. Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Mathematical Programming, 138(1-2):141-166, 2013.
B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257-1261, 1960.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
Y.-T. Lee, A. Sidford, and S. C.-W. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. arXiv preprint arXiv:1508.04874, 2015.
A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
M. Schmidt. minFunc: Unconstrained differentiable multivariate optimization in Matlab. URL http://www.di.ens.fr/mschmidt/Software/minFunc.html, 2012.
P. M. Vaidya. A new algorithm for minimizing convex functions over convex sets. Mathematical Programming, 73(3):291-341, 1996.