# Global optimization of Lipschitz functions

Cédric Malherbe¹, Nicolas Vayatis¹

¹ CMLA, ENS Cachan, CNRS, Université Paris-Saclay, 94235 Cachan, France. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract. The goal of the paper is to design sequential strategies which lead to efficient optimization of an unknown function under the only assumption that it has a finite Lipschitz constant. We first identify sufficient conditions for the consistency of generic sequential algorithms and formulate the expected minimax rate for their performance. We introduce and analyze a first algorithm called LIPO which assumes the Lipschitz constant to be known. Consistency and minimax rates for LIPO are proved, as well as fast rates under an additional Hölder-like condition. An adaptive version of LIPO is also introduced for the more realistic setup where the Lipschitz constant is unknown and has to be estimated along with the optimization. Similar theoretical guarantees are shown to hold for the adaptive algorithm, and a numerical assessment is provided at the end of the paper to illustrate the potential of this strategy with respect to state-of-the-art methods over typical benchmark problems for global optimization.

## 1. Introduction

In many applications such as complex system design or hyperparameter calibration for learning systems, the goal is to optimize the output value of an unknown function with as few evaluations as possible. Indeed, in such contexts, evaluating the performance of a single set of parameters often requires numerical simulations or cross-validations with significant computational cost, and the operational constraints impose a sequential exploration of the solution space with small samples. Moreover, it can generally not be assumed that the function has good properties such as linearity or convexity. This generic problem of sequentially optimizing the output of an unknown and potentially nonconvex function is often referred to as global optimization (Pintér, 1991), black-box optimization (Jones et al., 1998) or derivative-free optimization (Rios & Sahinidis, 2013). A large number of algorithms based on various heuristics have been introduced to solve this problem, such as genetic algorithms, model-based methods or Bayesian optimization. We focus here on the smoothness-based approach to global optimization. This approach is based on the simple observation that, in many applications, the system presents some regularity with respect to the input. In particular, the use of the Lipschitz constant, first proposed in the seminal works of (Shubert, 1972; Piyavskii, 1972), initiated an active line of research and played a major role in the development of many efficient global optimization algorithms such as DIRECT (Jones et al., 1993), MCS (Huyer & Neumaier, 1999) or SOO (Preux et al., 2014). Convergence properties of global optimization methods have been developed in the works of (Valko et al., 2013; Munos, 2014) under local smoothness assumptions but, to our knowledge, such properties have not been considered in the case where only the global smoothness of the function can be specified. An interesting question is whether global regularity assumptions, which in some sense subsume local ones, can improve the convergence of methods that only exploit local smoothness.
In this work, we address the following questions: (i) what are the limitations and the best performance that can be achieved by any algorithm over the class of Lipschitz functions, and (ii) how to design efficient and optimal algorithms for this class of problems. Our contribution with regard to the above-mentioned works is twofold. First, we introduce two novel algorithms for global optimization which exploit the global smoothness of the function and display good performance on typical optimization benchmarks. Second, we show that these algorithms can achieve faster rates of convergence on globally smooth problems than previously known methods which only exploit the local smoothness of the function. The rest of the paper is organized as follows. In Section 2, we introduce the framework and give generic results about the convergence of sequential algorithms. In Section 3, we introduce and analyze the LIPO algorithm, which requires the knowledge of the Lipschitz constant. In Section 4, the algorithm is extended to the case where the Lipschitz constant is unknown, and the adaptive algorithm is compared to existing methods in Section 5. All proofs can be found in the Supplementary Material provided as a separate document.

## 2. Setup and preliminary results

### 2.1. Setup and notations

Setup. Let X ⊂ ℝ^d be a compact and convex set with nonempty interior and let f : X → ℝ be an unknown function which is only supposed to admit a maximum over its input domain X. The goal in global optimization is to find some point

$$x^\star \in \arg\max_{x \in X} f(x)$$

with a minimal number of function evaluations. The standard setup involves a sequential procedure which starts by evaluating the function f(X₁) at an initial point X₁, then selects at each step t ≥ 1 an evaluation point X_{t+1} ∈ X depending on the previous evaluations (X₁, f(X₁)), …, (X_t, f(X_t)), and receives the evaluation of the unknown function f(X_{t+1}) at this point. After n iterations, we consider that the algorithm returns an evaluation point X_{î_n} with î_n ∈ arg max_{i=1…n} f(X_i), i.e. the point which has recorded the highest evaluation. The performance of the algorithm over the function f is then measured after n iterations through the difference between the value of the true maximum and the highest evaluation observed so far:

$$\max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i).$$

The analysis provided in the paper considers that the number n of evaluation points is not fixed in advance, and it is assumed that function evaluations are noiseless. Moreover, the assumption made on the unknown function f throughout the paper is that it has a finite Lipschitz constant k, i.e.,

$$\exists\, k \ge 0 \ \text{s.t.}\ |f(x) - f(x')| \le k\,\|x - x'\|_2 \quad \forall (x, x') \in X^2.$$

Before starting the analysis, we point out that similar settings have also been studied in (Munos, 2014; Malherbe et al., 2016) and that (Valko et al., 2013; Grill et al., 2015) considered the noisy scenario.

Notations. For all x = (x₁, …, x_d) ∈ ℝ^d, we denote by ‖x‖₂ = (∑_{i=1}^d x_i²)^{1/2} the standard ℓ₂-norm and by B(x, r) = {x′ ∈ ℝ^d : ‖x − x′‖₂ ≤ r} the ball centered at x of radius r ≥ 0. For any bounded set X ⊂ ℝ^d, we define its inner radius as rad(X) = max{r > 0 : ∃x ∈ X such that B(x, r) ⊆ X}, its diameter as diam(X) = max_{(x,x′)∈X²} ‖x − x′‖₂, and we denote by µ(X) its volume, where µ(·) stands for the Lebesgue measure. Lip(k) = {f : X → ℝ s.t. |f(x) − f(x′)| ≤ k‖x − x′‖₂ for all (x, x′) ∈ X²} denotes the class of k-Lipschitz functions defined on X, and ⋃_{k≥0} Lip(k) denotes the set of Lipschitz continuous functions.
U(X) stands for the uniform distribution over a bounded measurable domain X, B(p) for the Bernoulli distribution with parameter p, I{·} for the standard indicator function taking values in {0, 1}, and the notation X ∼ P means that the random variable X has distribution P.

### 2.2. Preliminary results

In order to design efficient procedures, we first investigate the best performance that can be achieved by any algorithm over the class of Lipschitz functions.

Sequential algorithms and optimization consistency. We describe the sequential procedures considered here and the corresponding concept of consistency in the sense of global optimization.

Definition 1 (SEQUENTIAL ALGORITHM) The class of optimization algorithms we consider, denoted in the sequel by A, contains all the algorithms A = {A_t}_{t≥1} completely described by:
1. a distribution A₁ taking values in X which generates the first evaluation point, i.e. X₁ ∼ A₁;
2. an infinite collection of distributions {A_t}_{t≥2} taking values in X and based on the previous evaluations, which define the iteration loop, i.e. X_{t+1} ∼ A_{t+1}((X₁, f(X₁)), …, (X_t, f(X_t))).

Note that this class of algorithms also includes deterministic methods, in which case the distributions {A_t}_{t≥1} are degenerate. The next definition introduces the notion of asymptotic convergence.

Definition 2 (OPTIMIZATION CONSISTENCY) A global optimization algorithm A is said to be consistent over a set F of real-valued functions admitting a maximum over X if and only if

$$\forall f \in F, \qquad \max_{i=1\dots n} f(X_i) \xrightarrow{\ p\ } \max_{x \in X} f(x)$$

where X₁, …, X_n denotes a sequence of n evaluation points generated by the algorithm A over the function f.

Asymptotic performance. We now investigate the minimal conditions for a sequential algorithm to achieve asymptotic convergence. Of course, it is expected that a global optimization algorithm should be consistent at least over the class of Lipschitz functions, and the following result reveals a necessary and sufficient condition (NSC) in this case.

Proposition 3 (CONSISTENCY NSC) A global optimization algorithm A is consistent over the set of Lipschitz functions if and only if

$$\forall f \in \bigcup_{k \ge 0} \mathrm{Lip}(k), \qquad \sup_{x \in X}\ \min_{i=1\dots n}\ \|X_i - x\|_2 \ \xrightarrow{\ p\ }\ 0.$$

A crucial consequence of this proposition is that the design of any consistent method comes down to covering the whole input space, regardless of the function values. The example below introduces the most popular space-filling method, which will play a central role in our analysis.

Example 4 (PURE RANDOM SEARCH) Pure Random Search (PRS) consists in sequentially evaluating the function over a sequence of points X₁, X₂, X₃, … uniformly and independently distributed over the input space X. For this method, a simple union bound shows that for all n ∈ ℕ⋆ and δ ∈ (0, 1), we have, with probability at least 1 − δ and independently of the function values,

$$\sup_{x \in X}\ \min_{i=1\dots n}\ \|X_i - x\|_2 \ \le\ \mathrm{diam}(X)\cdot\left(\frac{\ln(n/\delta) + d\ln(d)}{n}\right)^{1/d}.$$

In addition to this result, we point out that the covering rate of any method can easily be shown, by covering arguments, to be at best of order Ω(n^{-1/d}) and is thus subject to the curse of dimensionality.
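As a concrete illustration of Example 4, a minimal Pure Random Search loop together with a Monte Carlo estimate of its covering radius might look as follows. This is only a sketch in Python: the objective `f` and the box domain in the usage example are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def pure_random_search(f, bounds, n, rng=None):
    """Evaluate f at n points drawn uniformly over the box `bounds`.

    bounds: array of shape (d, 2) holding the lower/upper bound of each coordinate.
    Returns the sampled points and their evaluations.
    """
    rng = np.random.default_rng(rng)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n, len(bounds)))
    y = np.array([f(x) for x in X])
    return X, y

def covering_radius(X, bounds, n_probe=10000, rng=None):
    """Monte Carlo estimate of sup_x min_i ||X_i - x||_2 over the box."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds[:, 0], bounds[:, 1]
    probes = rng.uniform(lo, hi, size=(n_probe, len(bounds)))
    dists = np.linalg.norm(probes[:, None, :] - X[None, :, :], axis=-1)
    return dists.min(axis=1).max()

if __name__ == "__main__":
    # Illustrative 2-d Lipschitz objective (not from the paper).
    f = lambda x: -np.sum((x - 0.3) ** 2)
    bounds = np.array([[0.0, 1.0], [0.0, 1.0]])
    X, y = pure_random_search(f, bounds, n=200, rng=0)
    print("best value:", y.max(), "covering radius ~", covering_radius(X, bounds, rng=1))
```

On a growing budget n, the estimated covering radius decays roughly like n^{-1/d}, in line with the bound of Example 4.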
Keeping in mind the equivalence of Proposition 3, we may now turn to the nonasymptotic analysis.

Finite-time performance. We investigate here the best performance that can be achieved by any algorithm with a finite number of function evaluations. We start with a negative result stating that any algorithm can suffer, at any time, an arbitrarily large loss over the class of Lipschitz functions.

Proposition 5 Consider any global optimization algorithm A. Then, for any constant C > 0 arbitrarily large, any n ∈ ℕ⋆ and δ ∈ (0, 1), there exists a function f ∈ ⋃_{k≥0} Lip(k), depending only on (A, C, n, δ), for which we have with probability at least 1 − δ,

$$C \ \le\ \max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i).$$

This result is not very surprising since the class of Lipschitz functions includes functions with finite but arbitrarily large variations. When considering the subclass of functions with fixed Lipschitz constant, it becomes possible to derive finite-time bounds on the minimax rate of convergence.

Proposition 6 (MINIMAX RATE, adapted from (Bull, 2011)) For any Lipschitz constant k ≥ 0 and any n ∈ ℕ⋆, the following inequalities hold true:

$$c_1\, k\, n^{-1/d}\ \le\ \inf_{A \in \mathcal{A}}\ \sup_{f \in \mathrm{Lip}(k)} \mathbb{E}\Big[\max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i)\Big]\ \le\ c_2\, k\, n^{-1/d},$$

where c₁ = rad(X)/(8√d), c₂ = diam(X)·√d, and the expectation is taken over a sequence of n evaluation points X₁, …, X_n generated by the algorithm A over f. We point out that this minimax rate of convergence of order Θ(n^{-1/d}) can be achieved by any method with an optimal covering rate of order O(n^{-1/d}). Observe indeed that, since E[max_{x∈X} f(x) − max_{i=1…n} f(X_i)] ≤ k·E[sup_{x∈X} min_{i=1…n} ‖x − X_i‖₂] for all f ∈ Lip(k), an optimal covering rate necessarily implies minimax efficiency. However, as can be seen by examining the proof of Proposition 6 provided in the Supplementary Material, the functions constructed to prove the limiting bound of Ω(n^{-1/d}) are spikes which are almost constant everywhere and are of limited interest from a practical perspective. In particular, we will see in the sequel that one can design:

I) an algorithm with fixed constant k ≥ 0 which achieves minimax efficiency and also presents exponentially decreasing rates over a large subset of functions, as opposed to space-filling methods (LIPO, Section 3);

II) a consistent algorithm which does not require the knowledge of the Lipschitz constant and presents performance comparable to the case where the constant k is assumed to be known (AdaLIPO, Section 4).

## 3. Optimization with fixed Lipschitz constant

In this section, we consider the problem of optimizing an unknown function f given the knowledge that f ∈ Lip(k) for a given k ≥ 0.

### 3.1. The LIPO Algorithm

The inputs of the LIPO algorithm (Algorithm 1) are a number n of function evaluations, a Lipschitz constant k ≥ 0, the input space X and the unknown function f. At each iteration t ≥ 1, a candidate point X_{t+1} is sampled uniformly over the input space X and the algorithm decides whether or not to evaluate the function at this point. It evaluates the function at X_{t+1} if and only if the upper bound on the possible values of f,

$$\mathrm{UB}_t : x \mapsto \min_{i=1\dots t}\big(f(X_i) + k\,\|x - X_i\|_2\big),$$

computed from the previous evaluations, is at this point at least equal to the value of the best evaluation observed so far, max_{i=1…t} f(X_i). As an example, the computation of the decision rule of LIPO is illustrated in Figure 1.

Algorithm 1 LIPO(n, k, X, f)
1. Initialization:
   Let X₁ ∼ U(X)
   Evaluate f(X₁), t ← 1
2. Iterations: repeat while t < n
   Let X_{t+1} ∼ U(X)
   If min_{i=1…t} ( f(X_i) + k‖X_{t+1} − X_i‖₂ ) ≥ max_{i=1…t} f(X_i):
      Evaluate f(X_{t+1}), t ← t + 1
3. Output: return X_{î_n} where î_n ∈ arg max_{i=1…n} f(X_i)

Figure 1. Left: a Lipschitz function, a sample of 4 evaluations and the upper bound UB_t : x ↦ min_{i=1…t}( f(X_i) + k‖x − X_i‖₂ ) in grey. Right: the set of points X_{k,t} := {x ∈ X : UB_t(x) ≥ max_{i=1…t} f(X_i)} which satisfy the decision rule.
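For concreteness, a minimal Python sketch of Algorithm 1 is given below. It is a direct transcription of the pseudocode (only accepted candidates count toward the evaluation budget n, and rejected candidates only cost an upper-bound computation); the box domain and the objective used in the usage example are illustrative placeholders, not specifications from the paper.

```python
import numpy as np

def lipo(f, bounds, n, k, rng=None):
    """Sketch of LIPO (Algorithm 1) with a known Lipschitz constant k.

    bounds: array of shape (d, 2); n: number of function evaluations.
    Returns the best point found and its value.
    """
    rng = np.random.default_rng(rng)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = [rng.uniform(lo, hi)]          # first evaluation point, X_1 ~ U(X)
    y = [f(X[0])]
    while len(X) < n:
        candidate = rng.uniform(lo, hi)
        # Upper bound UB_t(candidate) = min_i f(X_i) + k * ||candidate - X_i||_2.
        ub = min(yi + k * np.linalg.norm(candidate - xi) for xi, yi in zip(X, y))
        # Evaluate only if the candidate can still improve on the best value so far;
        # the acceptance region always contains a neighborhood of the maximizer.
        if ub >= max(y):
            X.append(candidate)
            y.append(f(candidate))
    best = int(np.argmax(y))
    return X[best], y[best]

if __name__ == "__main__":
    # Illustrative objective (not from the paper): 1-Lipschitz on [0, 10]^2.
    f = lambda x: -np.linalg.norm(x - np.array([2.0, 7.0]))
    x_best, y_best = lipo(f, bounds=np.array([[0.0, 10.0], [0.0, 10.0]]), n=50, k=1.0, rng=0)
    print(x_best, y_best)
```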
More formally, the mechanism behind this rule can be explained using the active subset of consistent functions previously considered in active learning (see, e.g., (Dasgupta, 2011) and (Hanneke, 2011)).

Definition 7 (CONSISTENT FUNCTIONS) The active subset of k-Lipschitz functions consistent with the unknown function f over a sample (X₁, f(X₁)), …, (X_t, f(X_t)) of t ≥ 1 evaluations is defined as follows:

$$F_{k,t} := \{\, g \in \mathrm{Lip}(k) : g(X_i) = f(X_i)\ \ \forall i \in \{1,\dots,t\}\,\}.$$

From this definition, one can recover the subset of points which can still maximize the function f.

Definition 8 (POTENTIAL MAXIMIZERS) Using the same notations as in Definition 7, we define the subset of potential maximizers estimated over any sample of t ≥ 1 evaluations with a constant k ≥ 0 as follows:

$$X_{k,t} := \Big\{\, x \in X : \exists\, g \in F_{k,t}\ \text{such that}\ x \in \arg\max_{x' \in X} g(x')\,\Big\}.$$

We may now provide an equivalence which makes the link with the decision rule of the LIPO algorithm.

Lemma 9 If X_{k,t} denotes the set of potential maximizers defined above, then we have the following equivalence:

$$x \in X_{k,t} \iff \min_{i=1\dots t}\big(f(X_i) + k\,\|x - X_i\|_2\big) \ \ge\ \max_{i=1\dots t} f(X_i).$$

Hence, we deduce from this lemma that the algorithm only evaluates the function at points that still have a chance to be maximizers of the unknown function.

Remark 10 (EXTENSION TO OTHER SMOOTHNESS ASSUMPTIONS) It is important to note that the proposed optimization scheme could easily be extended to a large number of sets of globally and locally smooth functions by slightly adapting the decision rule. For instance, when F_ℓ = { f : X → ℝ | x⋆ is unique and ∀x ∈ X, f(x⋆) − f(x) ≤ ℓ(x⋆, x) } denotes the set of functions locally smooth around their maxima with regard to any semi-metric ℓ : X × X → ℝ, as previously considered in (Munos, 2014), a straightforward adaptation of Lemma 9 directly gives that the decision rule applied at X_{t+1} would simply consist in testing whether max_{i=1…t} f(X_i) ≤ min_{i=1…t} ( f(X_i) + ℓ(X_{t+1}, X_i) ). However, since the purpose of this work is to design fast algorithms for Lipschitz functions, we only derive convergence results for the version of the algorithm stated above.

### 3.2. Convergence analysis

We start with the consistency property of the algorithm.

Proposition 11 (CONSISTENCY) For any Lipschitz constant k ≥ 0, the LIPO algorithm tuned with the parameter k is consistent over the set of k-Lipschitz functions, i.e.,

$$\forall f \in \mathrm{Lip}(k), \qquad \max_{i=1\dots n} f(X_i) \xrightarrow{\ p\ } \max_{x \in X} f(x).$$

The next result shows that the value of the highest evaluation observed by the algorithm always dominates, in the usual stochastic ordering sense, that of a PRS.

Proposition 12 (FASTER THAN PURE RANDOM SEARCH) Consider the LIPO algorithm tuned with any constant k ≥ 0. Then, for any f ∈ Lip(k) and n ∈ ℕ⋆, we have

$$\forall y \in \mathbb{R}, \qquad \mathbb{P}\Big(\max_{i=1\dots n} f(X_i) \le y\Big) \ \le\ \mathbb{P}\Big(\max_{i=1\dots n} f(X_i') \le y\Big)$$

where X₁, …, X_n is a sequence of n evaluation points generated by LIPO and X′₁, …, X′_n is a sequence of n independent random variables uniformly distributed over X.

Based on this result, one can easily derive a first finite-time bound on the difference between the value of the true maximum and its approximation.

Corollary 13 (UPPER BOUND) For any f ∈ Lip(k), n ∈ ℕ⋆ and δ ∈ (0, 1), we have with probability at least 1 − δ,

$$\max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i) \ \le\ k\,\mathrm{diam}(X)\cdot\left(\frac{\ln(1/\delta)}{n}\right)^{1/d}.$$

This bound, which establishes the minimax optimality of LIPO stated in Proposition 6, does however not show any improvement over PRS, and it cannot be significantly improved without additional assumptions, as shown below.
Proposition 14 For any n ∈ ℕ⋆ and δ ∈ (0, 1), there exists a function f ∈ Lip(k), depending only on n and δ, for which, with probability at least 1 − δ, the loss max_{x∈X} f(x) − max_{i=1…n} f(X_i) remains lower bounded by a term of order k·diam(X)·n^{-1/d}.

As announced in Section 2.2, one can nonetheless get tighter polynomial bounds, and even an exponential decay, by using the following condition which describes the behavior of the function around its maximum.

Figure 2. Three one-dimensional functions satisfying Condition 1 with κ = 1/2 (left), κ = 1 (middle) and κ = 2 (right).

Condition 1 (DECREASING RATE AROUND THE MAXIMUM) A function f : X → ℝ is (κ, c_κ)-decreasing around its maximum for some κ ≥ 0, c_κ ≥ 0 if:
1. the global optimizer x⋆ ∈ X is unique;
2. for all x ∈ X, we have f(x⋆) − f(x) ≥ c_κ‖x⋆ − x‖₂^κ.

This condition, already considered in the works of (Zhigljavsky & Pintér, 1991) and (Munos, 2014), captures how fast the function decreases around its maximum. It can be seen as a local one-sided Hölder condition which can only be met with κ ≥ 1 when f is assumed to be Lipschitz. As an example, three functions satisfying this condition with different values of κ are displayed in Figure 2.

Theorem 15 (FAST RATES) Let f ∈ Lip(k) be any Lipschitz function satisfying Condition 1 for some κ ≥ 1, c_κ > 0. Then, for any n ∈ ℕ⋆ and δ ∈ (0, 1), we have with probability at least 1 − δ,

$$\max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i) \ \le\ k\,\mathrm{diam}(X)\cdot
\begin{cases}
\exp\!\left(-\,C_{k,\kappa}\,\dfrac{n\ln(2)}{\ln(n/\delta) + 2(2\sqrt{d})^{d}}\right), & \kappa = 1,\\[3mm]
\left(1 + C_{k,\kappa}\,\dfrac{n\,(2^{d(\kappa-1)}-1)}{\ln(n/\delta) + 2(2\sqrt{d})^{d}}\right)^{-\frac{\kappa}{d(\kappa-1)}}, & \kappa > 1,
\end{cases}$$

where C_{k,κ} = ( c_κ · max_{x∈X} ‖x⋆ − x‖₂^{κ−1} / (8k) )^d.

The last result we provide states an exponentially decreasing lower bound.

Theorem 16 (LOWER BOUND) For any f ∈ Lip(k) satisfying Condition 1 for some κ ≥ 1, c_κ > 0, and any n ∈ ℕ⋆ and δ ∈ (0, 1), we have with probability at least 1 − δ,

$$c_\kappa\,\mathrm{rad}(X)^{\kappa}\,\exp\!\left(-\,\frac{\kappa}{d}\Big(2n + \sqrt{2n\ln(1/\delta)} + \ln(1/\delta)\Big)\right)\ \le\ \max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i).$$

A discussion on these results can be found in the next section, where LIPO is compared with similar algorithms.

### 3.3. Comparison with previous works

The Piyavskii algorithm (Piyavskii, 1972) is a Lipschitz method with fixed k ≥ 0 consisting in sequentially evaluating the function at a point X_{t+1} ∈ arg max_{x∈X} min_{i=1…t} ( f(X_i) + k‖x − X_i‖₂ ) maximizing the upper bound displayed in Figure 1. (Munos, 2014) also proposed a similar algorithm (DOO), which uses a hierarchical partitioning of the space in order to sequentially expand and evaluate the function at the center of the cell with the highest upper bound computed from a semi-metric ℓ given as input. Up to our knowledge, only the consistency of the Piyavskii algorithm was proven in (Mladineo, 1986), while (Munos, 2014) derived finite-time bounds for DOO under weaker local assumptions. To compare our results, we thus considered DOO tuned with ℓ(x, x′) = k‖x − x′‖₂ over X = [0, 1]^d partitioned into a 2^d-ary tree of hypercubes, and f belonging to the following sets of globally smooth functions: (a) Lip(k), (b) F_κ = {f ∈ Lip(k) satisfying Condition 1 with constants c_κ > 0, κ ≥ 0} and (c) F̄_κ = {f ∈ F_κ : ∃c₂ > 0, f(x⋆) − f(x) ≤ c₂‖x⋆ − x‖₂^κ}. The results of the comparison can be found in Table 1. In addition to the novel lower bounds and the rate over Lip(k), we obtain upper bounds similar to those of DOO over F_κ, uniformly better rates for the functions in F̄_κ locally equivalent to ‖x⋆ − x‖₂^κ with κ > 1, and a similar exponential rate, up to a constant factor, when κ = 1.
Hence, when f is only known to be k-Lipschitz, one should expect the algorithm exploiting the global smoothness (LIPO) to perform asymptotically better than, or at least similarly to, the one using the local smoothness (DOO) or no information at all (PRS). However, keeping in mind that the constants are not necessarily optimal, it is also interesting to note that the term (k√d/c_κ)^d appearing in both the exponential rates of LIPO and DOO suggests that, if f is also known to be locally smooth for some k_ℓ ≤ k, then one should expect an algorithm exploiting the local smoothness k_ℓ to be asymptotically faster than the one using the global smoothness k in the case where κ = 1.

Table 1. Comparison of the results reported on the difference max_{x∈X} f(x) − max_{i=1…n} f(X_i) in Lipschitz optimization. Dash symbols are used when no results could be found.

| Class | Bound | DOO | LIPO | Piyavskii | PRS |
|---|---|---|---|---|---|
| f ∈ Lip(k) | Consistency | ✓ | ✓ | ✓ | ✓ |
| | Upper bound | – | O_P(n^{-1/d}) | – | O_P(n^{-1/d}) |
| f ∈ F_κ, κ > 1 | Upper bound | O(n^{-κ/(d(κ-1))}) | O_P(n^{-κ/(d(κ-1))}) | – | O_P(n^{-1/d}) |
| | Lower bound | – | Ω_P(e^{-2κn/d}) | – | Ω_P(n^{-κ/d}) |
| f ∈ F̄_κ, κ > 1 | Upper bound | O(n^{-κ/(d(κ-1))}) | O_P(n^{-κ²/(d(κ-1))}) | – | O_P(n^{-κ/d}) |
| | Lower bound | – | Ω_P(e^{-2κn/d}) | – | Ω_P(n^{-κ/d}) |
| f ∈ F̄_κ, κ = 1 | Upper bound | O(e^{-n ln(2)/(2k√d/c_κ)^d}) | O_P(e^{-n ln(2)/(2(16k√d/c_κ)^d)}) | – | O_P(n^{-1/d}) |
| | Lower bound | – | Ω_P(e^{-2n/d}) | – | Ω_P(n^{-1/d}) |

## 4. Optimization with unknown Lipschitz constant

In this section, we consider the problem of optimizing any unknown function f in the class ⋃_{k≥0} Lip(k).

### 4.1. The adaptive algorithm

The AdaLIPO algorithm (Algorithm 2) is an extension of LIPO which involves an estimate of the Lipschitz constant. It takes as input a parameter p ∈ (0, 1) and a nondecreasing sequence of Lipschitz constants (k_i)_{i∈ℤ} defining a meshgrid of ℝ⁺ (i.e., such that for all x > 0 there exists i ∈ ℤ with k_i ≤ x ≤ k_{i+1}). The algorithm is initialized with a Lipschitz constant estimate ˆk₁ set to 0 and alternates randomly between two distinct phases: exploration and exploitation. At each step t < n, a Bernoulli random variable B_{t+1} with parameter p driving this trade-off is sampled. If B_{t+1} = 1, the algorithm explores the space by evaluating the function at a point uniformly sampled over X. Otherwise, if B_{t+1} = 0, the algorithm exploits the previous evaluations by making an iteration of the LIPO algorithm with ˆk_t, the smallest Lipschitz constant of the sequence associated with a subset of Lipschitz functions that probably contains f (step abbreviated in the algorithm by X_{t+1} ∼ U(X_{ˆk_t,t})). Once an evaluation has been made, the Lipschitz constant estimate ˆk_t is updated.

Algorithm 2 AdaLIPO(n, p, (k_i)_{i∈ℤ}, X, f)
1. Initialization:
   Let X₁ ∼ U(X)
   Evaluate f(X₁), t ← 1, ˆk₁ ← 0
2. Iterations: repeat while t < n
   Let B_{t+1} ∼ B(p)
   If B_{t+1} = 1 (exploration): let X_{t+1} ∼ U(X)
   If B_{t+1} = 0 (exploitation): let X_{t+1} ∼ U(X_{ˆk_t,t}), where X_{ˆk_t,t} denotes the set of potential maximizers introduced in Definition 8, computed with k set to ˆk_t
   Evaluate f(X_{t+1}), t ← t + 1
   Let ˆk_t := inf{ k_i, i ∈ ℤ : max_{j≠l} |f(X_j) − f(X_l)| / ‖X_j − X_l‖₂ ≤ k_i }
3. Output: return X_{î_n} where î_n ∈ arg max_{i=1…n} f(X_i)

Remark 17 (EXAMPLES OF MESHGRIDS) Several sequences of Lipschitz constants with various shapes, such as k_i = |i|^{sgn(i)}, k_i = ln(1 + |i|^{sgn(i)}) or k_i = (1 + α)^i for some α > 0, could be considered to implement the algorithm. In particular, we point out that with these sequences the computation of the estimate is straightforward. For instance, when k_i = (1 + α)^i, we have ˆk_t = (1 + α)^{i_t} where

$$i_t = \Big\lceil \ln\Big( \max_{j \ne l} |f(X_j) - f(X_l)| \,/\, \|X_j - X_l\|_2 \Big) \big/ \ln(1+\alpha) \Big\rceil.$$
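A compact Python sketch of Algorithm 2 with the geometric meshgrid k_i = (1 + α)^i of Remark 17 is shown below. The rejection-sampling step used here to draw from U(X_{ˆk_t,t}), the guard for the case ˆk_t = 0, and the example objective are implementation choices made for illustration, not specifications from the paper.

```python
import numpy as np

def adalipo(f, bounds, n, p=0.1, alpha=0.01, rng=None):
    """Sketch of AdaLIPO (Algorithm 2) with the meshgrid k_i = (1 + alpha)^i."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = [rng.uniform(lo, hi)]
    y = [f(X[0])]
    k_hat = 0.0
    while len(X) < n:
        # Guard: with k_hat = 0 the decision rule only holds if all observed values
        # are equal, in which case the potential maximizers cover the whole domain.
        if rng.random() < p or k_hat == 0.0:
            # Exploration: uniform sample over the domain.
            x_next = rng.uniform(lo, hi)
        else:
            # Exploitation: uniform sample over the potential maximizers X_{k_hat, t},
            # drawn by rejection sampling on the LIPO decision rule.
            while True:
                x_next = rng.uniform(lo, hi)
                ub = min(yi + k_hat * np.linalg.norm(x_next - xi) for xi, yi in zip(X, y))
                if ub >= max(y):
                    break
        X.append(x_next)
        y.append(f(x_next))
        # Update of the Lipschitz constant estimate on the geometric grid (Remark 17).
        # Recomputed from scratch for clarity; an incremental update is cheaper.
        slopes = [abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j])
                  for i in range(len(X)) for j in range(i)]
        m = max(slopes)
        k_hat = 0.0 if m == 0 else float((1 + alpha) ** np.ceil(np.log(m) / np.log(1 + alpha)))
    best = int(np.argmax(y))
    return X[best], y[best]

if __name__ == "__main__":
    # Illustrative objective (not from the paper).
    f = lambda x: -np.linalg.norm(x - np.array([2.0, 7.0]))
    print(adalipo(f, np.array([[0.0, 10.0], [0.0, 10.0]]), n=50, rng=0))
```

In the experiments of Section 5, the paper uses p = 0.1 and the sequence k_i = (1 + 0.01/d)^i, i.e. α = 0.01/d in the notation above.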
### 4.2. Convergence analysis

Lipschitz constant estimate. Before starting the analysis of AdaLIPO, we first provide a control on the Lipschitz constant estimate based on a sample of random evaluations, which will be useful to analyze its performance. In particular, the next result illustrates the purpose of using a discretization of the Lipschitz constants instead of a raw estimate of the maximum slope, by showing that, with this estimate, a small subset of functions containing the unknown function can be recovered in finite time.

Proposition 18 Let f be any non-constant Lipschitz function. Then, if ˆk_t denotes the Lipschitz constant estimate of Algorithm 2, computed with any increasing sequence (k_i)_{i∈ℤ} defining a meshgrid of ℝ⁺ over a sample (X₁, f(X₁)), …, (X_t, f(X_t)) of t ≥ 2 evaluations where X₁, …, X_t are uniformly and independently distributed over X, we have

$$\mathbb{P}\big(f \in \mathrm{Lip}(\hat{k}_t)\big)\ \ge\ 1 - \big(1 - \Gamma(f, k_{i^\star-1})\big)^{\lfloor t/2 \rfloor}$$

where the coefficient Γ(f, k_{i⋆−1}) := ℙ( |f(X₁) − f(X₂)| / ‖X₁ − X₂‖₂ > k_{i⋆−1} ), with i⋆ = min{i ∈ ℤ : f ∈ Lip(k_i)}, is strictly positive.

Remark 19 (MEASURE OF GLOBAL SMOOTHNESS) The coefficient Γ(f, k_{i⋆−1}) which appears in the lower bound of Proposition 18 can be seen as a measure of the global smoothness of the function f with regard to k_{i⋆−1}. Indeed, observing that

$$\frac{1}{\lfloor t/2 \rfloor}\sum_{i=1}^{\lfloor t/2 \rfloor} \mathbb{I}\Big\{|f(X_i) - f(X_{i+\lfloor t/2 \rfloor})| > k_{i^\star-1}\,\|X_i - X_{i+\lfloor t/2 \rfloor}\|_2\Big\}\ \xrightarrow{\ p\ }\ \Gamma(f, k_{i^\star-1}),$$

it is easy to see that Γ records the proportion of the product space X × X on which f is witnessed to have a slope larger than k_{i⋆−1}.

Remark 20 (DENSITY OF THE SEQUENCE) As a direct consequence of the previous remark, we point out that the density of the sequence (k_i)_{i∈ℤ}, captured here by α = sup_{i∈ℤ} (k_{i+1} − k_i)/k_i, has opposite impacts on the maximal deviation of the estimate and on its convergence rate. Indeed, since α is involved in both of the following upper bounds, on the deviation, (lim_{t→∞} ˆk_t − k⋆)/k⋆ ≤ α where k⋆ = sup{k ≥ 0 : f ∉ Lip(k)}, and on the coefficient, Γ(f, k_{i⋆−1}) ≤ Γ(f, k⋆/(1 + α)), we deduce that using a sequence with a small α reduces the bias of the estimate but also slows down its convergence through a small coefficient Γ(f, k_{i⋆−1}).

Analysis of AdaLIPO. Given the consistency equivalence of Proposition 3, one can directly obtain the following asymptotic result.

Proposition 21 (CONSISTENCY) The AdaLIPO algorithm tuned with any parameter p ∈ (0, 1) and any sequence of Lipschitz constants (k_i)_{i∈ℤ} covering ℝ⁺ is consistent over the set of Lipschitz functions, i.e.,

$$\forall f \in \bigcup_{k \ge 0}\mathrm{Lip}(k), \qquad \max_{i=1\dots n} f(X_i)\ \xrightarrow{\ p\ }\ \max_{x \in X} f(x).$$

The next result provides a first finite-time bound on the difference between the maximum and its approximation.

Proposition 22 (UPPER BOUND) Consider AdaLIPO tuned with any p ∈ (0, 1) and any sequence (k_i)_{i∈ℤ} defining a meshgrid of ℝ⁺. Then, for any non-constant f ∈ ⋃_{k≥0} Lip(k), any n ∈ ℕ⋆ and δ ∈ (0, 1), we have with probability at least 1 − δ,

$$\max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i)\ \le\ k_{i^\star}\,\mathrm{diam}(X)\left(\frac{\ln(3/\delta) + 2\ln(\delta/3)/\ln\!\big(1-\Gamma(f,k_{i^\star-1})\big)}{p\,n}\right)^{1/d}$$

where Γ(f, k_{i⋆−1}) and i⋆ are defined as in Proposition 18. This result might be misleading since it suggests that pure exploration gives the best rate (i.e., when p → 1). However, as Proposition 18 provides the guarantee that f ∈ Lip(ˆk_t) within a finite number of iterations, where ˆk_t denotes the Lipschitz constant estimate, one can recover faster convergence rates, similar to those reported for LIPO when the constant k is assumed to be known.
Theorem 23 (FAST RATES) Consider the same assumptions as in Proposition 22 and assume in addition that the function f satisfies Condition 1 for some κ ≥ 1, c_κ > 0. Then, for any n ∈ ℕ⋆ and δ ∈ (0, 1), we have with probability at least 1 − δ,

$$\max_{x \in X} f(x) - \max_{i=1\dots n} f(X_i)\ \le\ \mathrm{diam}(X)\,k_{i^\star}\,\exp\!\left(\frac{2\ln(\delta/4)}{p\ln\!\big(1-\Gamma(f,k_{i^\star-1})\big)} + 7\ln(4/\delta)\right)\times
\begin{cases}
\exp\!\left(-\,C_{k_{i^\star},\kappa}\,\dfrac{n(1-p)\ln(2)}{2\ln(n/\delta) + 4(2\sqrt{d})^{d}}\right), & \kappa = 1,\\[3mm]
2^{2\kappa}\left(1 + C_{k_{i^\star},\kappa}\,\dfrac{n(1-p)\,(2^{d(\kappa-1)}-1)}{2\ln(n/\delta) + 4(2\sqrt{d})^{d}}\right)^{-\frac{\kappa}{d(\kappa-1)}}, & \kappa > 1,
\end{cases}$$

where C_{k_{i⋆},κ} = ( c_κ · max_{x∈X} ‖x⋆ − x‖₂^{κ−1} / (8k_{i⋆}) )^d.

This bound shows the precise impact of the parameters p and (k_i)_{i∈ℤ} on the convergence of the algorithm. In particular, it illustrates the complexity of the exploration/exploitation trade-off through a constant term and a convergence rate which are inversely related to the exploration parameter and to the density of the sequence of Lipschitz constants.

### 4.3. Comparison with previous works

The DIRECT algorithm (Jones et al., 1993) is a Lipschitz algorithm with unknown constant which uses a deterministic splitting technique of the search space to evaluate the function on subdivisions of the space that have recorded the highest evaluation among all subdivisions of similar size. Moreover, (Munos, 2014) generalized DIRECT to a broader setting by extending DOO to any unknown and arbitrary local semi-metric. With regard to these works, we propose an alternative stochastic strategy which directly relies on the estimation of the Lipschitz constant and thus only presents guarantees for globally smooth functions. However, as far as we know, only the consistency property of DIRECT was shown in (Finkel & Kelley, 2004), and (Munos, 2014) derived convergence rates of the same order as for DOO, except that the best rate derived there is of order O(e^{-c√n}), to be compared with the fast rate of AdaLIPO, which is of order O_P(e^{-cn}). The conclusion of the comparison thus remains the same as in Section 3: exploiting the global smoothness instead of only the local one allows faster algorithms in some cases where the unknown function is indeed globally smooth.

## 5. Experiments

We compare here the empirical performance of AdaLIPO with five state-of-the-art global optimization methods.

Algorithms. BAYESOPT (Martinez-Cantin, 2014) is a Bayesian optimization algorithm which uses a distribution over functions to build a surrogate model of the unknown function; the parameters of the distribution are estimated during the optimization process. CMA-ES (Hansen, 2006) is an evolutionary algorithm which samples the new evaluation points according to a multivariate normal distribution whose mean vector and covariance matrix are computed from the previous evaluations. CRS (Kaelo & Ali, 2006) is a variant of PRS including local mutations, which starts with a random population and evolves these points by a heuristic rule. MLSL (Kan & Timmer, 1987) is a multistart algorithm performing a series of local optimizations starting from points randomly chosen by a clustering heuristic that helps to avoid repeated searches of the same local optima. DIRECT (Jones et al., 1993) and PRS were previously introduced. For a fair comparison, the tuning parameters were all set to their default values, and AdaLIPO was always used with the parameter p set to 0.1 and the sequence k_i = (1 + 0.01/d)^i, fixed by an arbitrary rule of thumb.¹

¹ In Python 2.7, using BayesOpt (Martinez-Cantin, 2014), CMA 1.1.06 (Hansen, 2011) and NLopt (Johnson, 2014).

Data sets. Following the steps of (Malherbe & Vayatis, 2016), we first studied the task of estimating the regularization parameter λ and the bandwidth σ of a Gaussian kernel ridge regression minimizing the empirical mean squared
error of the predictions over a 10-fold cross-validation with real data sets. The optimization was performed over (ln(λ), ln(σ)) ∈ [−3, 5] × [−2, 2] with five data sets from the UCI Machine Learning Repository (Lichman, 2013): Auto-MPG, Breast Cancer Wisconsin (Prognostic), Concrete Slump Test, Housing and Yacht Hydrodynamics. We then compared the algorithms on a series of five synthetic problems commonly met in standard global optimization benchmarks, taken from (Jamil & Yang, 2013; Surjanovic & Bingham, 2013): Holder Table, Rosenbrock, Sphere, Linear Slope and Deb N.1. This series includes multimodal and non-linear functions as well as ill-conditioned and well-shaped functions, with a dimensionality ranging from 2 to 5.
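For concreteness, the hyperparameter-tuning objective used on the real data sets can be wrapped as a black-box function of (ln(λ), ln(σ)) as in the following sketch. It is scikit-learn based; the diabetes data set used as a stand-in and the kernel-width convention gamma = 1/(2σ²) are assumptions made here for illustration, since the paper's exact preprocessing is not specified.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

def make_krr_objective(X, y):
    """Black-box objective mapping (ln(lambda), ln(sigma)) to the (negated)
    10-fold cross-validated mean squared error of a Gaussian kernel ridge regression."""
    def objective(z):
        log_lam, log_sigma = z
        lam, sigma = np.exp(log_lam), np.exp(log_sigma)
        # Assumed convention for the Gaussian kernel: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
        model = KernelRidge(alpha=lam, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
        scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
        return scores.mean()  # higher is better, so this is a maximization problem
    return objective

if __name__ == "__main__":
    # Stand-in regression data set (the paper uses five UCI data sets instead).
    data = load_diabetes()
    f = make_krr_objective(data.data, data.target)
    bounds = np.array([[-3.0, 5.0], [-2.0, 2.0]])  # search box for (ln(lambda), ln(sigma))
    print(f(np.array([0.0, 0.0])))                 # one evaluation of the black box
```

Such an objective can then be passed directly to the LIPO or AdaLIPO sketches above, or to any of the baselines of Table 2.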
A complete description of the test functions of the benchmark can be found in the Supplementary Material.

Protocol and performance metrics. For each problem and each algorithm, we performed K = 100 distinct runs with a budget of n = 1000 function evaluations. For each target parameter t = 90%, 95% and 99%, we collected the stopping times corresponding to the number of evaluations required by each method to reach the specified target,

$$\tau_k := \min\{\, i = 1,\dots,n : f(X_i^{(k)}) \ge f_{\mathrm{target}}(t)\,\},$$

with the convention min{∅} = 1000, where {f(X_i^{(k)})}_{i=1}^n denotes the evaluations made by a given method on the k-th run, k ≤ K, and the target value is set to

$$f_{\mathrm{target}}(t) := \max_{x \in X} f(x) - \Big(\max_{x \in X} f(x) - \int_X f(x)\,dx \,/\, \mu(X)\Big)\cdot(1-t).$$

The normalization of the target with respect to the average value of the function prevents the performance measures from depending on any constant term in the unknown function. In practice, the average was estimated from a Monte Carlo sampling of 10⁶ evaluations and the maximum by taking the best value observed over all the sets of experiments. Based on these stopping times, we computed the average and standard deviation of the number of evaluations required to reach the target, i.e. τ̄_K = ∑_{k=1}^K τ_k / K and σ̂_τ = ( ∑_{k=1}^K (τ_k − τ̄_K)² / K )^{1/2}.

Table 2. Results of the numerical experiments: number of evaluations required by each method to reach the target t = 90% (top), 95% (middle) and 99% (bottom), reported as mean (± standard deviation) over K = 100 runs. In bold, the best result obtained in terms of average number of function evaluations.

Target 90%:

| Algorithm | Auto-MPG | Breast Cancer | Concrete | Housing | Yacht | Holder Table | Rosenbrock | Linear Slope | Sphere | Deb N.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| AdaLIPO | 14.6 (±9) | **5.4 (±3)** | **4.9 (±2)** | **5.4 (±4)** | 25.2 (±21) | **77 (±58)** | 7.5 (±7) | 29 (±13) | 36 (±12) | 916 (±225) |
| BayesOpt | **10.8 (±3)** | 6.8 (±4) | 6.4 (±3) | 7.5 (±4) | 13.8 (±20) | 410 (±417) | 7.6 (±5) | 32 (±58) | **19 (±3)** | 814 (±276) |
| CMA-ES | 29.3 (±25) | 11.1 (±9) | 10.4 (±8) | 12.4 (±12) | 29.6 (±25) | 80 (±115) | 10.0 (±10) | 100 (±76) | 171 (±68) | 930 (±166) |
| CRS | 28.7 (±14) | 8.9 (±8) | 10.0 (±9) | 13.8 (±10) | 32.6 (±15) | 307 (±422) | 9.0 (±9) | 94 (±43) | 233 (±54) | 980 (±166) |
| DIRECT | 11.0 (±0) | 6.0 (±0) | 6.0 (±0) | 6.0 (±0) | **11.0 (±0)** | 80 (±0) | 10.0 (±0) | 92 (±0) | 31 (±0) | 1000 (±0) |
| MLSL | 13.1 (±15) | 6.6 (±3) | 6.1 (±4) | 7.2 (±3) | 14.4 (±13) | 305 (±379) | **6.9 (±5)** | **16 (±33)** | 175 (±302) | **198 (±326)** |
| PRS | 65.1 (±62) | 10.6 (±10) | 9.8 (±9) | 11.5 (±10) | 73.3 (±72) | 210 (±202) | 9.0 (±9) | 831 (±283) | 924 (±210) | 977 (±117) |

Target 95%:

| Algorithm | Auto-MPG | Breast Cancer | Concrete | Housing | Yacht | Holder Table | Rosenbrock | Linear Slope | Sphere | Deb N.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| AdaLIPO | 17.7 (±9) | **6.6 (±4)** | **6.4 (±4)** | 17.9 (±25) | 33.3 (±26) | 102 (±65) | 11.5 (±11) | 53 (±22) | **42 (±11)** | 986 (±255) |
| BayesOpt | 12.2 (±6) | 8.4 (±3) | 7.9 (±3) | **13.9 (±22)** | **15.9 (±21)** | 418 (±410) | 12.0 (±8) | 32 (±59) | 45 (±16) | 949 (±153) |
| CMA-ES | 42.9 (±31) | 13.7 (±10) | 13.5 (±10) | 23.0 (±16) | 40.5 (±30) | 136 (±184) | 16.1 (±13) | 151 (±94) | 223 (±57) | 952 (±127) |
| CRS | 35.8 (±13) | 13.6 (±10) | 14.6 (±11) | 22.8 (±12) | 38.3 (±31) | 580 (±444) | 15.8 (±14) | 131 (±62) | 340 (±66) | 997 (±127) |
| DIRECT | **11.0 (±0)** | 11.0 (±0) | 11.0 (±0) | 19.0 (±0) | 27.0 (±0) | **80 (±0)** | 10.0 (±0) | 116 (±0) | 98 (±0) | 1000 (±0) |
| MLSL | 15.0 (±15) | 7.6 (±3) | 7.3 (±4) | 16.3 (±10) | 16.3 (±13) | 316 (±384) | **8.8 (±5)** | **18 (±37)** | 226 (±336) | **215 (±328)** |
| PRS | 139 (±131) | 17.7 (±17) | 14.0 (±12) | 39.6 (±39) | 247 (±249) | 349 (±290) | 18.0 (±17) | 985 (±104) | 1000 (±0) | 998 (±25) |

Target 99%:

| Algorithm | Auto-MPG | Breast Cancer | Concrete | Housing | Yacht | Holder Table | Rosenbrock | Linear Slope | Sphere | Deb N.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| AdaLIPO | 32.6 (±16) | 34.1 (±36) | 70.8 (±58) | 65.4 (±62) | 61.7 (±39) | 212 (±129) | 44.6 (±39) | 122 (±31) | **52 (±10)** | 1000 (±0) |
| BayesOpt | **14.0 (±7)** | 31.0 (±51) | 28.2 (±34) | 17.9 (±22) | **18.5 (±22)** | 422 (±407) | 27.6 (±22) | 32 (±59) | 222 (±77) | 1000 (±0) |
| CMA-ES | 73.7 (±49) | 35.1 (±20) | 46.3 (±29) | 61.5 (±85) | 70.9 (±50) | 215 (±198) | 43.5 (±37) | 211 (±92) | 308 (±60) | 962 (±106) |
| CRS | 48.5 (±16) | 34.8 (±12) | 36.6 (±15) | 43.7 (±14) | 52.9 (±18) | 599 (±427) | 42.7 (±23) | 168 (±76) | 607 (±81) | 1000 (±0) |
| DIRECT | 47.0 (±0) | 27.0 (±0) | 37.0 (±0) | 41.0 (±0) | 49.0 (±0) | **80 (±0)** | 24.0 (±0) | 226 (±0) | 548 (±0) | 1000 (±0) |
| MLSL | 20.6 (±17) | **12.8 (±3)** | **14.7 (±10)** | **16.3 (±10)** | 21.4 (±14) | 322 (±382) | **19.4 (±49)** | **22 (±42)** | 304 (±357) | **256 (±334)** |
| PRS | 747 (±330) | 145 (±124) | 176 (±148) | 406 (±312) | 779 (±334) | 772 (±310) | 100 (±106) | 1000 (±0) | 1000 (±0) | 1000 (±0) |

Results. The results are collected in Table 2. Due to space constraints, we only make a few comments. First, we point out that the proposed method displays very competitive results on most of the problems of the benchmark (except on the non-smooth Deb N.1, where most methods fail). In particular, AdaLIPO obtains several times the best performance for the targets 90% and 95% (see, e.g., Breast Cancer, Holder Table, Sphere), and the Linear Slope and Sphere experiments also suggest that, for smooth functions, it can be robust against the dimensionality of the input space. However, in some cases, the algorithm reaches the 95% target with very few evaluations while getting more slowly to the 99% target (see, e.g., Concrete, Housing). This behavior is due to the instability of the Lipschitz constant estimate around the maxima, but could certainly be addressed by adding a noise parameter that would make the algorithm more robust against local perturbations. Additionally, investigating better values of p and (k_i)_{i∈ℤ}, as well as alternative covering methods such as LHS (Stein, 1987), could also be promising ways to improve its performance. However, an empirical analysis of the algorithm with these extensions is beyond the scope of this paper and will be carried out in future work.

## 6. Conclusion

We introduced two novel strategies for global optimization: LIPO, which requires the knowledge of the Lipschitz constant, and its adaptive version AdaLIPO, which estimates the constant during the optimization process. A theoretical analysis is provided, and empirical results based on synthetic and real problems demonstrate the performance of the adaptive algorithm with regard to existing state-of-the-art global optimization methods.

References

Bull, Adam D. Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research, 12:2879–2904, 2011.

Dasgupta, Sanjoy. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.

Finkel, Daniel E. and Kelley, C. T. Convergence analysis of the DIRECT algorithm. Optimization Online Digest, 2004.

Grill, Jean-Bastien, Valko, Michal, and Munos, Rémi. Black-box optimization of noisy functions with unknown smoothness. In Neural Information Processing Systems, 2015.
Hanneke, Steve. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

Hansen, Nikolaus. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pp. 75–102. Springer, 2006.

Hansen, Nikolaus. The CMA evolution strategy: a tutorial. Retrieved May 15, 2016, from http://www.lri.fr/hansen/cmaesintro.html, 2011.

Huyer, Waltraud and Neumaier, Arnold. Global optimization by multilevel coordinate search. Journal of Global Optimization, 14(4):331–355, 1999.

Jamil, Momin and Yang, Xin-She. A literature survey of benchmark functions for global optimization problems. International Journal of Mathematical Modelling and Numerical Optimisation, 4(2):150–194, 2013.

Johnson, Steven G. The NLopt nonlinear-optimization package. Retrieved May 15, 2016, from http://ab-initio.mit.edu/nlopt, 2014.

Jones, Donald R., Perttunen, Cary D., and Stuckman, Bruce E. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993.

Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

Kaelo, P. and Ali, Montaz. Some variants of the controlled random search algorithm for global optimization. Journal of Optimization Theory and Applications, 130(2):253–264, 2006.

Kan, A. H. G. Rinnooy and Timmer, Gerrit T. Stochastic global optimization methods part I: Clustering methods. Mathematical Programming, 39(1):27–56, 1987.

Lichman, Moshe. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Malherbe, Cédric and Vayatis, Nicolas. A ranking approach to global optimization. arXiv preprint arXiv:1603.04381, 2016.

Malherbe, Cédric, Contal, Emile, and Vayatis, Nicolas. A ranking approach to global optimization. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1539–1547, 2016.

Martinez-Cantin, Ruben. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. The Journal of Machine Learning Research, 15(1):3735–3739, 2014.

Mladineo, Regina Hunter. An algorithm for finding the global maximum of a multimodal, multivariate function. Mathematical Programming, 34(2):188–200, 1986.

Munos, Rémi. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.

Pintér, János D. Global optimization in action. Scientific American, 264:54–63, 1991.

Piyavskii, S. A. An algorithm for finding the absolute extremum of a function. USSR Computational Mathematics and Mathematical Physics, 12(4):57–67, 1972.

Preux, Philippe, Munos, Rémi, and Valko, Michal. Bandits attack function optimization. In Evolutionary Computation (CEC), 2014 IEEE Congress on, pp. 2245–2252. IEEE, 2014.

Rios, Luis Miguel and Sahinidis, Nikolaos V. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, 2013.

Shubert, Bruno O. A sequential method seeking the global maximum of a function. SIAM Journal on Numerical Analysis, 9(3):379–388, 1972.

Stein, Michael. Large sample properties of simulations using Latin hypercube sampling. Technometrics, 29(2):143–151, 1987.

Surjanovic, Sonja and Bingham, Derek. Virtual library of simulation experiments: Test functions and datasets. Retrieved May 15, 2016, from http://www.sfu.ca/ssurjano, 2013.

Valko, Michal, Carpentier, Alexandra, and Munos, Rémi. Stochastic simultaneous optimistic optimization. In Proceedings of the 30th International Conference on Machine Learning, pp. 19–27, 2013.

Zhigljavsky, A. A. and Pintér, J. D. Theory of Global Random Search. Mathematics and its Applications. Springer Netherlands, 1991. ISBN 9780792311225.