# a_ranking_approach_to_global_optimization__ddee3454.pdf

A ranking approach to global optimization

C edric Malherbe MALHERBE@CMLA.ENS-CACHAN.FR Emile Contal CONTAL@CMLA.ENS-CACHAN.FR Nicolas Vayatis VAYATIS@CMLA.ENS-CACHAN.FR CMLA, ENS Cachan, CNRS, Universit e Paris-Saclay, 94235, Cachan, France

We consider the problem of maximizing an unknown function f over a compact and convex set X Rd using as few observations f(x) as possible. We observe that the optimization of the function f essentially relies on learning the induced bipartite ranking rule of f. Based on this idea, we relate global optimization to bipartite ranking which allows to address problems with high dimensional input space, as well as cases of functions with weak regularity properties. The paper introduces novel meta-algorithms for global optimization which rely on the choice of any bipartite ranking method. Theoretical properties are provided as well as convergence guarantees and equivalences between various optimization methods are obtained as a by-product. Eventually, numerical evidence is given to show that the main algorithm of the paper which adapts empirically to the underlying ranking structure essentially outperforms existing state-of-the-art global optimization algorithms in typical benchmarks.

1. Introduction

In many applications such as complex system design or hyperparameter calibration for learning systems, the goal is to optimize some output value of a non-explicit function with as few evaluations as possible. Indeed, in such contexts, one has access to the function values only through numerical evaluations by simulation or cross-validation with signiﬁcant computational cost. Moreover, the operational constraints generally impose a sequential exploration of the solution space with small samples. The generic problem of sequentially optimizing the output of an unknown and potentially non-convex function is often referred to as global optimization (Pint er, 1991), black-box optimization

Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

(Jones et al., 1998) or derivative-free optimization (Rios & Sahinidis, 2013). There are several algorithms based on various heuristics which have been introduced in order to address complicated optimization problems with limited regularity assumptions, such as genetic algorithms, modelbased algorithms, branch-and-bound methods... (see (Rios & Sahinidis, 2013) for a recent overview). This paper follows the line of the approaches recently considered in the machine learning literature (Bull, 2011; Munos, 2011; Sergeyev et al., 2013). These approaches extend the seminal work on Lipschitz optimization of (Hansen et al., 1992; Jones et al., 1993) and they led to signiﬁcant relaxations of the conditions required for convergence, e.g. only the existence of a local smoothness around the optimum is required (Munos, 2011; Grill et al., 2015). More precisely, in the work of (Bull, 2011; Munos, 2011), speciﬁc conditions have been identiﬁed to derive a ﬁnite-time analysis of the algorithms. However, these guarantees do not hold when the unknown function is not assumed to be locally smooth around (one of) its optimum.

In the present work, we propose to explore concepts from ranking theory based on overlaying estimated level sets (Cl emenc on et al., 2010) in order to develop global optimization algorithms that do not rely on the smoothness of the function. The idea behind this approach is simple: even if the unknown function presents arbitrary large variations, most of the information required to identify its optimum may be contained in its induced ranking rule, i.e. how the level sets of the function are included one in another. To exploit this idea, we introduce a novel optimization scheme where the complexity of the function is characterized by the underlying pairwise ranking which it deﬁnes. Our contribution is twofold: ﬁrst, we introduce two novel global optimization algorithms that learn the ranking rule induced by the unknown function with a sequential scheme, and second, we provide mathematical results in terms of statistical consistency and convergence to the optimum. Moreover the algorithms proposed lead to efﬁcient implementation and they display good performance on the classical benchmarks for global optimization as shown at the end of the paper.

A ranking approach to global optimization

The rest of the paper is organized as follows. In Section 2 we introduce the framework and give the main deﬁnitions. In Section 3, we introduce and analyze the RANKOPT algorithm which requires a prior information on the ranking structure underlying the unknown function. In Section 4, an adaptive version of the algorithm is presented. Companion results which establish the equivalence between learning algorithms and optimization procedures are discussed in Section 5 as they support implementation choices. The adaptive version of the algorithm is compared to other global optimization algorithms in Section 6. Proof sketches are postponed to the Appendix section and full proofs can be found in the supplementary material provided as a separate document.

2. Global optimization and ranking structure

2.1. Setup and notations

Setup. We consider the problem of sequentially maximizing an unknown real-valued function f : X R which is assumed to admit at least one global maximum over the compact and convex set X Rd. The objective is to identify some point

x arg max x X f(x)

with a minimal amount of function evaluations. The setup we consider is the following: at each iteration t = 1 . . . n 1, an algorithm selects an evaluation point Xt+1 X which depends on the previous evaluations {(Xi, f(Xi))}t i=1 and receives the evaluation of the unknown function f(Xt+1) at this point. After n iterations, the algorithm returns the argument of the highest value observed so far:

Xˆın where ˆın arg max i=1,...,n f(Xi).

The analysis provided in the paper considers that the number n of evaluation points is not ﬁxed and it is assumed that function evaluations are noiseless.

Notations. For any x = {x1 . . . xd} Rd, we deﬁne the standard ℓ2-norm x 2 2 = Pd i=1 x2 i , we denote by , the corresponding inner product and we denote by B(x, r) = {x Rd : x x 2 r} the ℓ2-ball of radius r 0 centered in x. For any set X Rd, we deﬁne its inner-radius as rad(X) = max{r > 0 : x X s.t. B(x, r) X}, its diameter as diam(X) = max(x,x ) X 2 x x 2 and we denote by µ(X) its volume where µ stands for the Lebesgue measure. Finally, we denote by C0(X, R) the set of continuous functions deﬁned on X taking values in R, we denote by PN(X, R) the set of (multivariate) polynomial functions of degree N deﬁned on X and we denote by U(A) the uniform distribution over a bounded measurable domain A.

2.2. The ranking structure of a real-valued function

In this section, we introduce the ranking structure as a complexity characterization for a general real-valued function to be optimized. First, we observe that real-valued functions induce an order relation over the input space X, and the underlying ordering induces a ranking rule which records pairwise comparisons between evaluation points. Deﬁnition 1. (INDUCED RANKING RULE) The ranking rule rf : X X { 1, 0, 1} induced by a function f : X R is deﬁned by:

rf(x, x ) =

1 if f(x) > f(x ) 0 if f(x) = f(x ) 1 if f(x) < f(x )

for all (x, x ) X 2.

The key argument of the paper is that the optimization of any weakly regular real-valued function only depends on the nested structure of its level sets. Hence there is an equivalence class of real-valued functions that share the same induced ranking rule as shown by the following proposition.

Proposition 1. (RANKING RULE EQUIVALENCE) Let h C0(X, R) be any continuous function. Then, a function f : X R shares the same induced ranking rule with h (i.e. rf = rh) if and only if there exists a strictly increasing (not necessary continuous) function ψ : R R such that h = ψ f.

Proposition 1 states that even if the unknown function f admits non-continuous or large variations, up to a transformation ψ, there might exist a simpler function h = ψ f that shares the same induced ranking rule. Figure 1 gives an example of two functions that share the same ranking while they display highly different regularity properties. As a second example, we may consider the problem of maximizing the function f(x) = 1 1/ |ln(x)| if x = 0 and 1 otherwise over X = [0, 1/2]. The function f in this case is not smooth around its unique global maximizer x = 0 but shares the same induced ranking rule with h(x) = x over X.

We can now introduce a complexity characterization of real-valued functions of a set X through the complexity class of its induced ranking rule. We call this class a ranking structure.

Figure 1. Two functions f and h that share the same ranking

A ranking approach to global optimization

Deﬁnition 2. (CONTINUOUS RANKING STRUCTURE

AND CONTINUOUS RANKING RULES) We say that a realvalued function f has a continuous ranking rule if rf R where R = {rh : h C0(X, R)} denotes the set of continuous ranking rules (i.e. ranking rules induced by continuous functions).

In the continuation of this deﬁnition, we further introduce two examples of more stringent ranking structures.

Deﬁnition 3. (POLYNOMIAL RANKING RULES) The set of polynomial ranking rules of degree N 1 is deﬁned as

RP(N) = {rh : h PN(X, R)}.

We point out that even a polynomial function of degree N may admit a lower degree polynomial ranking rule. For example, consider the polynomial function f(x) = (x2 3x+1)9. Since f(x) = ψ(x2 3x) where ψ : x 7 (x+1)9

is a strictly increasing function, the ranking rule induced by f is a polynomial ranking rule of degree 2.

The second class of ranking structures we introduce is a class of non-parametric rankings. Deﬁnition 4. (CONVEX RANKING RULES) The set of convex ranking rules of degree N 1 is deﬁned as

RC(N) = {r R such that for any x X, the set

{x X : r(x, x ) 0} is a union of Nconvex sets}.

It is easy to see that the ranking rule of a function f is a convex ranking rule of degree N if and only all the level sets of the function f are unions of at most N convex sets.

2.3. Identiﬁability and regularity

We now state two conditions that will be used in the theoretical analysis: the ﬁrst condition is about the identiﬁability of the maximum of the function and the second is about the regularity of f around its maximum. Condition 1. (IDENTIFIABILITY) The maximum of a function f : X R is said to be identiﬁable if for any ε > 0 arbitrarily small, µ({x X : f(x) max x X f(x) ε}) > 0.

Condition 1 prevents the function from having a jump on its maximum and will be useful to state asymptotic results of the type f(Xˆın) maxx X f(x) when n + .

Condition 2. (REGULARITY OF THE LEVEL SETS) A function f : X R has (cα, α)-regular level sets for some cα > 0, α 0 if:

1. The global optimizer x X is unique.

2. For any y Im(f), the iso-level set f 1(y) = {x X : f(x) = y} satisﬁes

max x f 1(y) x x 2 cα min x f 1(y) x x 1/(1+α) 2 .

Condition 2 guarantees that the points associated with high evaluations are close to the unique optimizer with respect to the Euclidean distance. This condition will be used to derive some ﬁnite-time bounds on the distance x Xˆın 2 between the optimizer and its estimation. Note that for any iso-level set f 1(y) with ﬁnite distance to the optimum, the condition is satisﬁed with α = 0 and cα = diam(X) / minx f 1(y) x x 2. Therefore, this condition concerns the behavior of the level sets when minx f 1(y) x x 2 0. As an example, the iso-level sets of three simple functions satisfying the condition with different values of α are shown in Figure 2.

Figure 2. Illustration of the regularity of the level sets on two simple functions. Left: f(x1, x2) = exp ( x2 1 2x2 2) where α = 0. Middle: f(x) = |x1|3 2x2 2 where α = 1/2. Right: f(x1, x2) = x4 1 2x2 2 where α = 1.

3. Optimization with ﬁxed ranking structure

In this section, we consider the problem of optimizing an unknown function f given the prior knowledge that its ranking rf belongs to a given ranking structure R R .

3.1. The RANKOPT algorithm

The input of Algorithm 1 are a number n of iterations, the unknown function f, a compact and convex set X Rd and a ranking structure R R . At each iteration t < n, a point Xt+1 is uniformly sampled over X and the algorithm decides, whether or not, to evaluate the function at this point. The decision rule involves the active subset of R which contains the ranking rules that are consistent with the ranking rule induced by f over the points sampled so far. We thus set Rt = {r R : Lt(r) = 0} where Lt is the empirical ranking loss:

Lt(r) = 2 t(t + 1)

1 i<j t 1 {rf(Xi, Xj) = r(Xi, Xj)} .

Algorithm 1 RANKOPT(n, f, X, R) 1. Initialization: Let X1 U(X) ..... Evaluate f(X1), t 1, R1 R, ˆı1 1 2. Iterations: Repeat while t < n ..... Let Xt+1 U(X) ..... If there exists r Rt such that r(Xt+1, Xˆıt) 0 ........... Evaluate f(Xt+1), t t + 1 ........... Rt {r R : Lt(r) = 0} ........... ˆıt arg maxi=1...t f(Xi) 3. Output: Return Xˆın

A ranking approach to global optimization

Figure 3. Two samples generated by the RANKOPT algorithm after n = 30 iterations with the polynomial ranking rules RP(4) on the Styblinski-Tang function deﬁned in Section 6.

As a direct consequence of the deﬁnition of the active subset, if there does not exist any r Rt such that r(Xt+1, Xˆın) 0, it implies that rf(Xt+1, Xˆıt) = 1 which means that f(Xt+1) < f(Xˆıt). Thus, the algorithm never evaluates the function at a point that will not return certainly an evaluation at least equal to the highest evaluation f(Xˆıt) observed so far.

Remark 1. (APPROXIMATION OF THE LEVEL SETS) Since rf can be any ranking of Rt, at each iteration, the sampling area Xt = {x X : r Rt s.t. r(x, Xˆıt) 0} is the smallest set that contains certainly the level set {x X : f(x) f(Xˆıt)} of the best value observed so far.

Remark 2. (CONNECTION WITH ACTIVE LEARNING) The algorithm can be seen as an extension to ranking of the baseline active learning algorithm CAL (Cohn et al., 1994; Hanneke, 2011). However, in active learning we aim at estimating a classiﬁer h : X 7 {0, 1} where the goal in global optimization is to estimate the winner of a tournament deriving from the ranking rule rf : X X 7 { 1, 0, 1} and not the ranking rule itself.

3.2. Convergence analysis

We state here some convergence properties of the RANKOPT algorithm. The results are stated in a probabilistic framework. The source of randomness comes from the algorithm itself (which generates uniform random variables) and not from the evaluations which are assumed noiseless. The next result will be important in order to formulate the consistency property of the algorithm.

Proposition 2. Fix any n N , let X Rd be any compact and convex set and let R R be any set of ranking rules. Then, for any function f : X R such that rf R and any y R, if Xˆın denotes the random output of RANKOPT(n, f, X, R), we have that,

P(f(Xˆın) y) P( max i=1...n f(X i) y),

where {X i}n i=1 are n independent random variables, uniformly distributed over X.

Combining the previous proposition with the identiﬁability condition gives the following asymptotic result.

Corollary 1. (CONSISTENCY) Using the same notations and assumptions as in Proposition 2 and if the maximum of the function f is identiﬁable (Condition 1), then,

f(Xˆın) P max x X f(x).

We now provide our ﬁnite-sample loss bounds.

Theorem 1. (UPPER BOUND) Under the same assumptions as in Proposition 2 and if the function f has (cα, α)- regular level sets (Condition 2), then, for any δ (0, 1), with probability at least 1 δ,

x Xˆın 2 Cα,X ln(1/δ)

where Cα,X = c(2+α)/(1+α) α diam(X)1/(1+α)2 .

More surprisingly, a lower bound can be derived by making the link with the theoretical PURE ADAPTIVE SEARCH (Zabinsky & Smith, 1992) that uses the knowledge of the level sets of the unknown function.

Proposition 3. Fix any n N and let {X i }n i=1 be a sequence distributed as the Markov process deﬁned by ( X 1 U(X) X t+1| X t U(X t ) t {1 . . . n 1}

where X t = {x X : f(x) f(X t )}. Then, using the same notations and assumptions as in Proposition 2, for any y R, we have that,

P(f(Xˆın) y) P(f(X n) y).

We are now ready to establish our second loss bound by combining Proposition 3 with the level set assumption.

Theorem 2. (LOWER BOUND) Under the same assumptions as in Proposition 2 and if the function f has (cα, α)- regular level sets (Condition 2), then, for any δ (0, 1), with probability at least 1 δ,

Cα,X e (1+α)2

2n ln(1/δ)+ln(1/δ) x Xˆın 2 ,

where Cα,X = c (1+α)(2+α) α rad(X)(1+α)2 .

Remark 3. (ON THE PERFORMANCE CRITERION) The level set assumption, which is used in Theorem 1 and Theorem 2, is invariant to any strictly increasing composition ψ (i.e. if f has (cα, α)-regular level sets so has ψ f). It implies that the bounds on the distance x Xˆın 2 between the exact solution and its approximation hold independently of the smoothness of the function.

Remark 4. (RELATED WORK) To the best of our knowledge, this is the ﬁrst analysis of an optimization algorithm

A ranking approach to global optimization

which uses the ranking rule induced by the unknown function. For a different approach to global optimization, we refer to the work of (Munos, 2011) where the function is assumed to be locally smooth around (one of) its optimum. In the latter work, ﬁnite-time bounds on the difference maxx X f(x) f(Xˆın) are obtained when the smoothness is known (DOO) and the smoothness is unknown (SOO).

4. Adaptive algorithm and stopping time analysis

4.1. The ADARANKOPT algorithm

The ADARANKOPT algorithm (Algorithm 2) is an extension of the RANKOPT algorithm which involves model selection following the principle of Structural Risk Minimization. We consider a parameter p (0, 1) and a nested sequence of ranking structures {RN}N N satisfying:

R1 R2 R . (1)

The algorithm is initialized by considering the smallest ranking structure R1 of the sequence. At each iteration t < n, a Bernoulli random variable Bt+1 of parameter p is sampled. If Bt+1 = 1, the algorithm explores the space by evaluating the function at a point uniformly sampled over X. If Bt+1 = 0, the algorithm exploits the previous evaluations by making an iteration of the RANKOPT algorithm with the smallest ranking structure RNt of the sequence that probably contains the true ranking rf. Once a new evaluation has been made, the index Nt+1 is updated. The parameter p drives the trade-off between the exploitation phase and the exploration phase which prevents the algorithm from getting stuck in a local maximum.

Remark 5. (NESTED SEQUENCES) Condition (1) is crucial for practical reasons discussed in Section 5. We point out that both the sequence of polynomial ranking rules {RP(N)}N N and the sequence of convex ranking rules {RC(N)}N N deﬁned in Section 2 satisfy this condition.

Algorithm 2 ADARANKOPT(n, f, X, p, {RN}N N ) 1. Initialization: Let X1 U(X) Evaluate f(X1), t 1, R R1, ˆı1 1 2. Iterations: Repeat while t < n Let Bt+1 B(p) If Bt+1 = 1 {Exploration} ..... Let Xt+1 U(X) If Bt+1 = 0 {Exploitation} ..... Let Xt+1 U({x X : r R s.t. r(x, Xˆıt) 0}) Evaluate f(Xt+1), t t + 1 ˆıt arg maxi=1...t f(Xi) Nt min{N N : minr RN Lt(r) = 0} {Update} R {r RNt : Lt(r) = 0} 3. Output: Return Xˆın

4.2. Theoretical properties of ADARANKOPT

We start by casting the consistency result for the ADARANKOPT algorithm.

Proposition 4. (CONSISTENCY) Fix any p (0, 1) and any sequence of ranking structures {RN}N N satisfying (1). Then, if the function f has an identiﬁable maximum (Condition 1) and Xˆın denotes the random output of ADARANKOPT(n, f, X, p, {RN}N N ), we have that,

f(Xˆın) P max x X f(x).

Proposition 4 reveals that even if the algorithm is poorly tuned, it will end up ﬁnding the true maximum of any function with an identiﬁable maximum.

We now investigate the number of iterations required to identify a ranking structure that contains the true ranking rule.

Deﬁnition 5. (STOPPING TIME) Let N = min{N N : rf RN} be the index of the smallest ranking structure that contains the true ranking rule and let {Nt}t N be the sequence of random variables deﬁned in the ADARANKOPT algorithm. Deﬁne the stopping time:

τN = min{t N : Nt = N }

which corresponds to the number of iterations required to identify N .

In order to bound τN , we need to control the complexity of the sequence of ranking structures. Let us denote by L(r) = P(rf(X, X ) = r(X, X )) the true ranking loss where (X, X ) is a couple of independent random variables uniformly distributed over X and deﬁne the Rademacher average of a ranking structure R as

Rn = sup r R

i=1 ϵi1[rf (Xi,X n/2 +i) =r(Xi,X n/2 +i)]

where {Xi}n i=1 are n i.i.d. copies of X U(X) and ϵ1 . . . ϵ n/2 are n/2 independent Rademacher random variables (i.e. random symmetric sign variables), independent of {Xi}n i=1.

Proposition 5. (STOPPING TIME UPPER BOUND) Assume that the index N > 1 is ﬁnite, and that infr RN 1 L(r) > 0, and that there exists V > 0 such that the Rademacher complexity of RN 1 satisﬁes E [Rn] p

V/n for all n N . Then, for any δ (0, 1), with probability at least 1 δ,

p V + ln(2/δ) infr RN 1 L(r)2

A ranking approach to global optimization

In the situation described above, one can recover an upper bound similar to the one of Theorem 1 by combining the previous result with the analysis of the RANKOPT algorithm where the structure RN is assumed to be known.

Theorem 3. (UPPER BOUND) Consider the same assumptions as in Proposition 5 and assume that the function f has (cα, α)-regular level sets (Condition 2). Then, if Xˆın denotes the random output of ADARANKOPT(n, f, X, p, {RN}N N ), for any δ (0, 1) and any n > nδ, with probability at least 1 δ,

x Xˆın 2 Cα,X ln(2/δ)

where Cα,X is the same constant as in Theorem 1 and nδ = 10(V + ln(4/δ))/(p infr RN 1 L(r)2) .

Remark 6. (COMPLEXITY ASSUMPTION) We refer here to (Cl emenc on, 2011) and we point out that standard VCtype arguments can be used in order to bound E [Rn]. If the set of functions F = {(x, x ) X 2 7 1{rf(x, x ) = r(x, x )} : r R} is a VC major class with ﬁnite VC dimension V , then E [Rn] c p

V/n for a universal constant c > 0. This covers the case of polynomial ranking rules.

5. Computational aspects

We discuss here some technical aspects involved in the practical implementation of the ADARANKOPT algorithm.

5.1. General ranking structures

Fix any nested sequence of ranking structures {RN}N N and any sample {(Xi, f(Xi))}n i=1. We address the questions of (i) sampling Xn+1 uniformly over the non-trivial subset Xn = {x X : r RNn s.t. Ln(r) = 0 and r(x, Xˆın) 0} and (ii) updating the index Nn+1 once f(Xn+1) has been evaluated. We start to show that both these steps can be done by testing if

min r RN Ln+1(r) = 0 (2)

holds true for a given N N where the empirical ranking loss is taken over a set of n + 1 samples.

(i) Sampling X U(X) until X Xn allows to sample uniformly over Xn. Using the deﬁnition of the subset, we know that X Xn if there exists a ranking r RNn {r : Ln(r) = 0} such that r(X, Xˆın) = 0 or 1. Rewriting the previous statement in terms of minimal error gives that X Xn if:

- either minr RNn Ln+1(r) = 0 where Ln+1 is taken over the sample {(Xi, f(Xi))}n i=1 (X, f(Xˆın)),

- or minr RNn Ln+1(r) = 0 where Ln+1 is taken over the sample {(Xi, f(Xi))}n i=1 (X, f(Xˆın)+c) where c > 0 is any strictly positive constant.

(ii) Assume now that f(Xn+1) has been evaluated. Since {RN}N N forms a nested sequence, we have that Nn+1 = Nn + min{i N : minr RNn+i Ln+1(r) = 0} where the empirical loss is taken over {(Xi, f(Xi))}n+1 i=1 . Therefore, Nn+1 can be updated by sequentially testing if minr RNn+i Ln+1(r) = 0 for i = 0, 1, 2 . . .

As mentioned earlier, both the previous steps can be done using a generic procedure that tests if (2) holds true. We now provide some equivalences that can be used to design this procedure for the ranking structures introduced in Section 2. For simplicity, we assume that all the evaluations of the sample are distinct:

f(X(1)) < f(X(2)) < . . . < f(X(n+1)). (3)

where (1) . . . (n+1) denote the indexes of the corresponding reordering.

5.2. Polynomial ranking rules

Consider the sequence of polynomial ranking rules {RP(N)}N N and let φN : Rd Rdim(φN) be the function that maps any point of Rd into the polynomial feature space of degree N where dim(φN) = N+d d 1. For example, φ2(x1, x2) = {x1, x2, x1x2, x2 1, x2 2}. We start by making the link with linear separability in the polynomial feature space. Proposition 6. (LINEAR SEPARABILITY) Fix any N N and assume that all the evaluations are distinct (3). Then, (2) holds true if and only if there exists an axis ω Rdim(φN) such that,

ω, φN(X(i+1)) φN(X(i)) > 0, i {1 . . . n}.

Interestingly, testing the linear separability of a sample is equivalent to testing the emptiness of a sample-dependent polyhedron. Corollary 2. Let MN be the (dim(φN), n)-matrix where its i-th column is equal to (φN(X(i+1)) φN(X(i)))T and let the operator stands for the inequality componentwise (i.e. x x xi x i i {1 . . . d}). Then, under the same assumptions as in Proposition 6, (2) holds true if and only if the following polyhedron is empty:

{λ Rn : MNλT = 0, 1, λ = 1, λ 0} = .

Remark. 7 (ALGORITHMIC ASPECTS) The problem of testing the emptiness of a polyhedron can be seen as the problem of ﬁnding a feasible point of a linear program. We refer to Chapter 11.4 in (Boyd & Vandenberghe, 2004) where algorithmic solutions are discussed.

A ranking approach to global optimization

50 100 150 200 0

0.8 Ada Rank

20 40 60 80 0

50 100 150 200 250 300 0

50 Ada Rank

(a) (b) (c) Figure 4. Empirical estimation of E [maxx X f(x) f(Xˆıt)] in terms of iteration t = 1 . . . n where the expectation was obtained by running a 1000 times each algorithm.

5.3. Convex ranking rules

Consider the sequence of convex ranking rules {RC(N)}N N . Following the steps of (Cl emenc on & Vayatis, 2010) leads to the next equivalence.

Proposition 7. (OVERLAYING CLASSIFIERS) Fix any N N and let X = [a, b]. Then, (2) holds true if and only if there a exists a nested sequence h1 h2 . . . hn+1 of n + 1 classiﬁers of the form hi(x) = PN k=1 1 {li,k x ui,k} satisfying hi(X(j)) = 1 {(j) i}, (i, j) {1 . . . n + 1}2.

The problem of overlaying classiﬁers admits a tractable solution when d = 1. In the speciﬁc case where N = 1 and d N , the problem of testing the existence of nested convex classiﬁers is equivalent to the problem of testing the emptiness of a cascade of polyhedrons.

Proposition 8. Fix any d N , let N = 1 and assume that all the evaluations are distinct. Then, (2) holds true if and only if for each k {1 . . . n} the polyhedron deﬁned by:

{λ Rk : Mkλ = XT (n+1 k), 1, λ = 1, λ 0}

where Mk is the (d, k)-matrix where its i-th column is equal to XT (n+2 i), is empty.

6. Experiments

We now compare the empirical performances of the ADARANKOPT algorithm with four global optimization algorithms taken from the NLOpt library (Johnson, 2014):

CRS (Kaelo & Ali, 2006) is a controlled random search with local mutations. It starts with a random population and randomly evolve these points by an heuristic rule.

DIRECT (Jones et al., 1993) is a Lipschitz optimization algorithm where the Lipschitz constant is unknown. It uses a deterministic splitting technique of the search space.

ESCH (Santos et al., 2010) and ISRES (Runarsson & Yao, 2000) are two evolutionary algorithms. The evolution

strategies are based on a combination of mutation rules and differential variations.

The tuning parameters were set to default and the parameter p was set to 1/4 for the convex ranking rules and to 1/10 for the polynomial ranking rules. The algorithms were compared on three synthetic problems:

(a) The task consists in maximizing the function f(x) = 1{x x }(| cos (50(x x ))|3/2 15|x x |1/2)/10 1{x > x }(|x|1/2 + 0.05| sin (50x)|3/2) over X = [0, 1] where x = 0.499. The function f has 17 local maxima and presents a discontinuity around its unique optimizer x . The horizon n was set to 200 evaluations and the convex ranking rules were used.

(b) The task consists in minimizing a perturbed version of the Styblinski-Tang function f(x) = P2 i=1(x4 i 16x2 i + 5xi)/2+cos(x1+x2) over X = [ 5, 5]2. The level sets of the Styblinski Tang function are displayed on Figure 3 and the function has 4 local minima. The polynomial ranking rules were used and the horizon n was set to 80 evaluations.

(c) The task consists in maximizing the function f(x) = 1 | P10 i=1(xi 4.5)/10|5/2 over X = [ 5, 5]10. The hyperplane {x R10 : P10 i=1 xi = 45} maximizes the function. The horizon n was set to 300 evaluations and the polynomial ranking rules were used.

The results are shown in Figure 5.1. We remark that the ADARANKOPT converges fast and avoids falling in local maxima, as opposed to most of its competitors.

7. Conclusion

We have provided a global optimization strategy based on a sequential estimation of the ranking of the unknown function. We introduced two algorithms: RANKOPT which requires a prior knowledge of the ranking rule of the unknown function and its adaptive version ADARANKOPT which performs model selection. A theoretical analysis is provided and the adaptive algorithm is shown to be empirically competitive with the state-of-the-art methods on representative synthetic examples.

A ranking approach to global optimization

Appendix - Sketch of Proofs

Full proofs can be found in the Supplementary Material.

Proof of Proposition 1. The second inclusion ( ) is a direct consequence of the deﬁnition of the ranking rules. To state the ﬁrst inclusion ( ), we introduce the function M(x) = µ({x X : rf(x, x ) = 1}) and we show that there exists two strictly increasing functions ψ and ψ

such that f = ψ M and h = ψ M. We ﬁnally get that h = (ψ ψ 1) f where ψ ψ 1 is strictly increasing.

Proof of Proposition 2. The result is obtained by induction. Since X1 U(X), the result trivially holds for n = 1. Assume that the statement holds for a given n N , ﬁx any y R and let Xy = {x X : f(x) y}. First step: using the fact that {x X : f(x) f(Xˆın)} Xn X we show that P(f(Xˆın+1) y) P(f(Xˆın) y) + P(f(Xˆın) < y)µ(Xy)/µ(X). Second step: plugging the induction assumption in the last equation and using the fact that P(f(X n+1) y) = µ(Xy)/µ(X) gives the result.

Proof of Corollary 1. Fix any ε > 0 and let Xε = {x X : f(x) maxx X f(x) ε} be the corresponding level set. Applying Proposition 2 leads to P(f(Xˆın) < maxx X f(x) ε) (1 µ(Xε)/µ(X))n n 0.

Proof of Theorem 1. Fix any δ (0, 1) and let rδ,n be the upper bound of the theorem. First step: using Proposition 3 and the level set assumption, we show that P( Xˆın x 2 rδ,n) P( n i=1{X i B(x , diam(X) (ln(1/δ)/n)1/d)}) where {X i}n i=1 are n i.i.d. copies of X U(X). Second step: we show that for any r diam(X) we have that µ(B(x , r))/µ(X) (r/ diam(X))d. Third step: using independence and the fact that 1 x e x, we ﬁnally get that P( Xˆın x 2 rδ,n) 1 (1 ln(1/δ)/n)n 1 δ.

Proof of Proposition 3. We use again an induction argument. Assume that the result holds for a given n N and ﬁx any y R. First step: using the fact that Xf(Xˆın) = {x X : f(x) f(Xˆın)} Xn we show that P(f(Xˆın+1) y) E min(1, µ(Xy)/µ(Xf(Xˆın))) . Second step: using the induction assumption we show that E min(1, µ(Xy)/µ(Xf(Xˆın))) E min(1, µ(Xy)/µ(Xf(X n))) = P(f(X n+1) y).

Proof of Theorem 2. Fix any δ (0, 1) and let rδ,n be the lower bound of the corollary. First step: using Theorem 3 and the level set assumption we show that P( Xˆın x 2 rδ,n) P(µ(X n)/µ(X) δ exp ( n p

2n ln(1/δ))) where X n = {x X : f(x) f(X n)}. Second step: we show that u (0, 1), P(µ(X n)/µ(X) u) P(Qn i=1 Ui u) where {Ui}n i=1 are n i.i.d. copies of U U([0, 1]). Third step: using concentration inequalities for sub-gamma random variables gives that P(Qn i=1 Ui δ exp ( n p

2n ln(1/δ))) < δ.

Proof of Proposition 4. Fix any ε > 0 and let Xε = {x X : f(x) maxx X f(x) ε} be the corresponding level set. Using the fact that P(Xi Xε) p µ(Xε)/µ(X) for any i N , we show by induction that P(f(Xˆın) < maxx X f(x) ε) (1 p µ(Xε)/µ(X))n n 0.

Proof of Proposition 5. Fix any δ (0, 1) and let nδ be the integer part of the upper bound of the proposition. First step: since we have a nested sequence of sets of ranking rules, P(τ nδ) = P(minr RN 1 Lnδ(r) > 0). Second step: using Hoeffding s inequality gives a lower bound on the number of i.i.d. samples collected: P(Pnδ i=1 Bi p nδ p

nδ log(2/δ)/2 = n δ) 1 δ/2. Third step: applying concentration results of ranking rules over the n δ i.i.d. samples gives that P(minr RN 1 Ln δ(r) minr RN 1 L(r) 2 p

ln (2/δ)/(n δ 1) > 0) 1 δ/2.

Proof of Theorem 3. Fix any δ (0, 1). We know that after nδ iterations the true ranking structure RN is identiﬁed with probability at least 1 δ/2 (Proposition 5). Once the structure RN is identiﬁed, one can use a similar technique as the one used in Theorem 1 to get an upper bound with probability at least 1 δ/2 thanks to the n nδ samples.

Proof of Proposition 6. The proof is a consequence of the deﬁnition of polynomial ranking rules: if r RP(N) then there exists ωr Rdim(φN) and cr R such that r(x, x ) = sgn( ωr, φN(x) + cr ωr, φN(x ) cr) = sgn( ωr, φN(x) φN(x ) ). Noticing that rf(X(i+1), X(i)) = 1 for all i {1 . . . n} gives the result.

Proof of Corollary 2. Let X i = φN(X(i+1)) φN(X(i)), i {1 . . . n} and let CH{X i}n i=1 be the convex hull of {X i}n i=1. First step: we show the following equivalence: ω Rdim(φN) such that i {1 . . . n}, ω, X i > 0 0 / CH{X i}n i=1. Second step: using the deﬁnition of convex hull we get the second equivalence: 0 CH{X i}n i=1 (λ1, . . . λn) Rn s.t. 0 = Pn i=1 λi X i, Pn i=1 λi = 1 and λi 0, i {1 . . . n}.

Proof of Proposition 7. The ﬁrst inclusion ( ) is a direct consequence of the deﬁnition of convex ranking rules. Assume now that there exists a nested sequence of classiﬁers h1 . . . hn+1 satisfying the conditions. To state the second inclusion ( ) we build a continuous approximation of the step function f(x) = Pn i=1 hi(x) that perfectly ranks the sample and induces a convex ranking.

Proof of Proposition 8. First step: using the deﬁnition of convex hulls, we show that each polyhedron of the cascade is empty iff CH{X(i)}n+1 i=n CH{X (i)}n+1 i=n 1 CH{X (i)}n+1 i=1 . Second step: we build a continuous approximation of the function f(x) = Pn k=1 1 x CH{X(i)}n+1 i=k which induces a convex ranking rule and perfectly ranks the sample.

A ranking approach to global optimization

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press, 2004.

Bull, Adam D. Convergence rates of efﬁcient global optimization algorithms. The Journal of Machine Learning Research, 12:2879 2904, 2011.

Cl emenc on, St ephan. On U-processes and clustering performance. In Advances in Neural Information Processing Systems, pp. 37 45, 2011.

Cl emenc on, St ephan and Vayatis, Nicolas. Overlaying classiﬁers: a practical approach to optimal scoring. Constructive Approximation, 32(3):619 648, 2010.

Cl emenc on, St ephan, Lugosi, Gabor, and Vayatis, Nicolas. Ranking and empirical minimization of u-statistics. The Annals of Statistics, pp. 844 874, 2010.

Cohn, David, Atlas, Les, and Ladner, Richard. Improving generalization with active learning. Machine learning, 15(2):201 221, 1994.

Grill, Jean-Bastien, Valko, Michal, and Munos, R emi. Black-box optimization of noisy functions with unknown smoothness. In Neural Information Processing Systems, 2015.

Hanneke, Steve. Rates of convergence in active learning. The Annals of Statistics, 39(1):333 361, 2011.

Hansen, Pierre, Jaumard, Brigitte, and Lu, Shi-Hui. Global optimization of univariate lipschitz functions: I. survey and properties. Mathematical programming, 55(1-3): 251 272, 1992.

Johnson, Steven G. The NLopt nonlinear-optimization package, 2014. http://ab-initio.mit.edu/nlopt.

Jones, Donald R, Perttunen, Cary D, and Stuckman, Bruce E. Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157 181, 1993.

Jones, Donald R, Schonlau, Matthias, and Welch, William J. Efﬁcient global optimization of expensive black-box functions. Journal of Global optimization, 13 (4):455 492, 1998.

Kaelo, P and Ali, MM. Some variants of the controlled random search algorithm for global optimization. Journal of optimization theory and applications, 130(2):253 264, 2006.

Munos, R emi. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In Advances in neural information processing systems, 2011.

Pint er, J anos D. Global optimization in action. Scientiﬁc American, 264:54 63, 1991.

Rios, Luis Miguel and Sahinidis, Nikolaos V. Derivativefree optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247 1293, 2013.

Runarsson, Thomas P and Yao, Xin. Stochastic ranking for constrained evolutionary optimization. Evolutionary Computation, IEEE Transactions on, 4(3):284 294, 2000.

Santos, Carlos Henrique da Silva, Goncalves, Marcos Sergio, and Hernandez-Figueroa, Hugo Enrique. Designing novel photonic devices by bio-inspired computing. Photonics Technology Letters, IEEE, 22(15):1177 1179, 2010.

Sergeyev, Yaroslav D, Strongin, Roman G, and Lera, Daniela. Introduction to global optimization exploiting space-ﬁlling curves. Springer Science & Business Media, 2013.

Zabinsky, Zelda B. Stochastic adaptive search for global optimization, volume 72. Springer Science & Business Media, 2013.

Zabinsky, Zelda B and Smith, Robert L. Pure adaptive search in global optimization. Mathematical Programming, 53(1-3):323 338, 1992.