# a_ranking_approach_to_global_optimization__ddee3454.pdf A ranking approach to global optimization C edric Malherbe MALHERBE@CMLA.ENS-CACHAN.FR Emile Contal CONTAL@CMLA.ENS-CACHAN.FR Nicolas Vayatis VAYATIS@CMLA.ENS-CACHAN.FR CMLA, ENS Cachan, CNRS, Universit e Paris-Saclay, 94235, Cachan, France We consider the problem of maximizing an unknown function f over a compact and convex set X Rd using as few observations f(x) as possible. We observe that the optimization of the function f essentially relies on learning the induced bipartite ranking rule of f. Based on this idea, we relate global optimization to bipartite ranking which allows to address problems with high dimensional input space, as well as cases of functions with weak regularity properties. The paper introduces novel meta-algorithms for global optimization which rely on the choice of any bipartite ranking method. Theoretical properties are provided as well as convergence guarantees and equivalences between various optimization methods are obtained as a by-product. Eventually, numerical evidence is given to show that the main algorithm of the paper which adapts empirically to the underlying ranking structure essentially outperforms existing state-of-the-art global optimization algorithms in typical benchmarks. 1. Introduction In many applications such as complex system design or hyperparameter calibration for learning systems, the goal is to optimize some output value of a non-explicit function with as few evaluations as possible. Indeed, in such contexts, one has access to the function values only through numerical evaluations by simulation or cross-validation with significant computational cost. Moreover, the operational constraints generally impose a sequential exploration of the solution space with small samples. The generic problem of sequentially optimizing the output of an unknown and potentially non-convex function is often referred to as global optimization (Pint er, 1991), black-box optimization Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). (Jones et al., 1998) or derivative-free optimization (Rios & Sahinidis, 2013). There are several algorithms based on various heuristics which have been introduced in order to address complicated optimization problems with limited regularity assumptions, such as genetic algorithms, modelbased algorithms, branch-and-bound methods... (see (Rios & Sahinidis, 2013) for a recent overview). This paper follows the line of the approaches recently considered in the machine learning literature (Bull, 2011; Munos, 2011; Sergeyev et al., 2013). These approaches extend the seminal work on Lipschitz optimization of (Hansen et al., 1992; Jones et al., 1993) and they led to significant relaxations of the conditions required for convergence, e.g. only the existence of a local smoothness around the optimum is required (Munos, 2011; Grill et al., 2015). More precisely, in the work of (Bull, 2011; Munos, 2011), specific conditions have been identified to derive a finite-time analysis of the algorithms. However, these guarantees do not hold when the unknown function is not assumed to be locally smooth around (one of) its optimum. In the present work, we propose to explore concepts from ranking theory based on overlaying estimated level sets (Cl emenc on et al., 2010) in order to develop global optimization algorithms that do not rely on the smoothness of the function. The idea behind this approach is simple: even if the unknown function presents arbitrary large variations, most of the information required to identify its optimum may be contained in its induced ranking rule, i.e. how the level sets of the function are included one in another. To exploit this idea, we introduce a novel optimization scheme where the complexity of the function is characterized by the underlying pairwise ranking which it defines. Our contribution is twofold: first, we introduce two novel global optimization algorithms that learn the ranking rule induced by the unknown function with a sequential scheme, and second, we provide mathematical results in terms of statistical consistency and convergence to the optimum. Moreover the algorithms proposed lead to efficient implementation and they display good performance on the classical benchmarks for global optimization as shown at the end of the paper. A ranking approach to global optimization The rest of the paper is organized as follows. In Section 2 we introduce the framework and give the main definitions. In Section 3, we introduce and analyze the RANKOPT algorithm which requires a prior information on the ranking structure underlying the unknown function. In Section 4, an adaptive version of the algorithm is presented. Companion results which establish the equivalence between learning algorithms and optimization procedures are discussed in Section 5 as they support implementation choices. The adaptive version of the algorithm is compared to other global optimization algorithms in Section 6. Proof sketches are postponed to the Appendix section and full proofs can be found in the supplementary material provided as a separate document. 2. Global optimization and ranking structure 2.1. Setup and notations Setup. We consider the problem of sequentially maximizing an unknown real-valued function f : X R which is assumed to admit at least one global maximum over the compact and convex set X Rd. The objective is to identify some point x arg max x X f(x) with a minimal amount of function evaluations. The setup we consider is the following: at each iteration t = 1 . . . n 1, an algorithm selects an evaluation point Xt+1 X which depends on the previous evaluations {(Xi, f(Xi))}t i=1 and receives the evaluation of the unknown function f(Xt+1) at this point. After n iterations, the algorithm returns the argument of the highest value observed so far: Xˆın where ˆın arg max i=1,...,n f(Xi). The analysis provided in the paper considers that the number n of evaluation points is not fixed and it is assumed that function evaluations are noiseless. Notations. For any x = {x1 . . . xd} Rd, we define the standard ℓ2-norm x 2 2 = Pd i=1 x2 i , we denote by , the corresponding inner product and we denote by B(x, r) = {x Rd : x x 2 r} the ℓ2-ball of radius r 0 centered in x. For any set X Rd, we define its inner-radius as rad(X) = max{r > 0 : x X s.t. B(x, r) X}, its diameter as diam(X) = max(x,x ) X 2 x x 2 and we denote by µ(X) its volume where µ stands for the Lebesgue measure. Finally, we denote by C0(X, R) the set of continuous functions defined on X taking values in R, we denote by PN(X, R) the set of (multivariate) polynomial functions of degree N defined on X and we denote by U(A) the uniform distribution over a bounded measurable domain A. 2.2. The ranking structure of a real-valued function In this section, we introduce the ranking structure as a complexity characterization for a general real-valued function to be optimized. First, we observe that real-valued functions induce an order relation over the input space X, and the underlying ordering induces a ranking rule which records pairwise comparisons between evaluation points. Definition 1. (INDUCED RANKING RULE) The ranking rule rf : X X { 1, 0, 1} induced by a function f : X R is defined by: rf(x, x ) = 1 if f(x) > f(x ) 0 if f(x) = f(x ) 1 if f(x) < f(x ) for all (x, x ) X 2. The key argument of the paper is that the optimization of any weakly regular real-valued function only depends on the nested structure of its level sets. Hence there is an equivalence class of real-valued functions that share the same induced ranking rule as shown by the following proposition. Proposition 1. (RANKING RULE EQUIVALENCE) Let h C0(X, R) be any continuous function. Then, a function f : X R shares the same induced ranking rule with h (i.e. rf = rh) if and only if there exists a strictly increasing (not necessary continuous) function ψ : R R such that h = ψ f. Proposition 1 states that even if the unknown function f admits non-continuous or large variations, up to a transformation ψ, there might exist a simpler function h = ψ f that shares the same induced ranking rule. Figure 1 gives an example of two functions that share the same ranking while they display highly different regularity properties. As a second example, we may consider the problem of maximizing the function f(x) = 1 1/ |ln(x)| if x = 0 and 1 otherwise over X = [0, 1/2]. The function f in this case is not smooth around its unique global maximizer x = 0 but shares the same induced ranking rule with h(x) = x over X. We can now introduce a complexity characterization of real-valued functions of a set X through the complexity class of its induced ranking rule. We call this class a ranking structure. Figure 1. Two functions f and h that share the same ranking A ranking approach to global optimization Definition 2. (CONTINUOUS RANKING STRUCTURE AND CONTINUOUS RANKING RULES) We say that a realvalued function f has a continuous ranking rule if rf R where R = {rh : h C0(X, R)} denotes the set of continuous ranking rules (i.e. ranking rules induced by continuous functions). In the continuation of this definition, we further introduce two examples of more stringent ranking structures. Definition 3. (POLYNOMIAL RANKING RULES) The set of polynomial ranking rules of degree N 1 is defined as RP(N) = {rh : h PN(X, R)}. We point out that even a polynomial function of degree N may admit a lower degree polynomial ranking rule. For example, consider the polynomial function f(x) = (x2 3x+1)9. Since f(x) = ψ(x2 3x) where ψ : x 7 (x+1)9 is a strictly increasing function, the ranking rule induced by f is a polynomial ranking rule of degree 2. The second class of ranking structures we introduce is a class of non-parametric rankings. Definition 4. (CONVEX RANKING RULES) The set of convex ranking rules of degree N 1 is defined as RC(N) = {r R such that for any x X, the set {x X : r(x, x ) 0} is a union of Nconvex sets}. It is easy to see that the ranking rule of a function f is a convex ranking rule of degree N if and only all the level sets of the function f are unions of at most N convex sets. 2.3. Identifiability and regularity We now state two conditions that will be used in the theoretical analysis: the first condition is about the identifiability of the maximum of the function and the second is about the regularity of f around its maximum. Condition 1. (IDENTIFIABILITY) The maximum of a function f : X R is said to be identifiable if for any ε > 0 arbitrarily small, µ({x X : f(x) max x X f(x) ε}) > 0. Condition 1 prevents the function from having a jump on its maximum and will be useful to state asymptotic results of the type f(Xˆın) maxx X f(x) when n + . Condition 2. (REGULARITY OF THE LEVEL SETS) A function f : X R has (cα, α)-regular level sets for some cα > 0, α 0 if: 1. The global optimizer x X is unique. 2. For any y Im(f), the iso-level set f 1(y) = {x X : f(x) = y} satisfies max x f 1(y) x x 2 cα min x f 1(y) x x 1/(1+α) 2 . Condition 2 guarantees that the points associated with high evaluations are close to the unique optimizer with respect to the Euclidean distance. This condition will be used to derive some finite-time bounds on the distance x Xˆın 2 between the optimizer and its estimation. Note that for any iso-level set f 1(y) with finite distance to the optimum, the condition is satisfied with α = 0 and cα = diam(X) / minx f 1(y) x x 2. Therefore, this condition concerns the behavior of the level sets when minx f 1(y) x x 2 0. As an example, the iso-level sets of three simple functions satisfying the condition with different values of α are shown in Figure 2. Figure 2. Illustration of the regularity of the level sets on two simple functions. Left: f(x1, x2) = exp ( x2 1 2x2 2) where α = 0. Middle: f(x) = |x1|3 2x2 2 where α = 1/2. Right: f(x1, x2) = x4 1 2x2 2 where α = 1. 3. Optimization with fixed ranking structure In this section, we consider the problem of optimizing an unknown function f given the prior knowledge that its ranking rf belongs to a given ranking structure R R . 3.1. The RANKOPT algorithm The input of Algorithm 1 are a number n of iterations, the unknown function f, a compact and convex set X Rd and a ranking structure R R . At each iteration t < n, a point Xt+1 is uniformly sampled over X and the algorithm decides, whether or not, to evaluate the function at this point. The decision rule involves the active subset of R which contains the ranking rules that are consistent with the ranking rule induced by f over the points sampled so far. We thus set Rt = {r R : Lt(r) = 0} where Lt is the empirical ranking loss: Lt(r) = 2 t(t + 1) 1 i 1 is finite, and that infr RN 1 L(r) > 0, and that there exists V > 0 such that the Rademacher complexity of RN 1 satisfies E [Rn] p V/n for all n N . Then, for any δ (0, 1), with probability at least 1 δ, p V + ln(2/δ) infr RN 1 L(r)2 A ranking approach to global optimization In the situation described above, one can recover an upper bound similar to the one of Theorem 1 by combining the previous result with the analysis of the RANKOPT algorithm where the structure RN is assumed to be known. Theorem 3. (UPPER BOUND) Consider the same assumptions as in Proposition 5 and assume that the function f has (cα, α)-regular level sets (Condition 2). Then, if Xˆın denotes the random output of ADARANKOPT(n, f, X, p, {RN}N N ), for any δ (0, 1) and any n > nδ, with probability at least 1 δ, x Xˆın 2 Cα,X ln(2/δ) where Cα,X is the same constant as in Theorem 1 and nδ = 10(V + ln(4/δ))/(p infr RN 1 L(r)2) . Remark 6. (COMPLEXITY ASSUMPTION) We refer here to (Cl emenc on, 2011) and we point out that standard VCtype arguments can be used in order to bound E [Rn]. If the set of functions F = {(x, x ) X 2 7 1{rf(x, x ) = r(x, x )} : r R} is a VC major class with finite VC dimension V , then E [Rn] c p V/n for a universal constant c > 0. This covers the case of polynomial ranking rules. 5. Computational aspects We discuss here some technical aspects involved in the practical implementation of the ADARANKOPT algorithm. 5.1. General ranking structures Fix any nested sequence of ranking structures {RN}N N and any sample {(Xi, f(Xi))}n i=1. We address the questions of (i) sampling Xn+1 uniformly over the non-trivial subset Xn = {x X : r RNn s.t. Ln(r) = 0 and r(x, Xˆın) 0} and (ii) updating the index Nn+1 once f(Xn+1) has been evaluated. We start to show that both these steps can be done by testing if min r RN Ln+1(r) = 0 (2) holds true for a given N N where the empirical ranking loss is taken over a set of n + 1 samples. (i) Sampling X U(X) until X Xn allows to sample uniformly over Xn. Using the definition of the subset, we know that X Xn if there exists a ranking r RNn {r : Ln(r) = 0} such that r(X, Xˆın) = 0 or 1. Rewriting the previous statement in terms of minimal error gives that X Xn if: - either minr RNn Ln+1(r) = 0 where Ln+1 is taken over the sample {(Xi, f(Xi))}n i=1 (X, f(Xˆın)), - or minr RNn Ln+1(r) = 0 where Ln+1 is taken over the sample {(Xi, f(Xi))}n i=1 (X, f(Xˆın)+c) where c > 0 is any strictly positive constant. (ii) Assume now that f(Xn+1) has been evaluated. Since {RN}N N forms a nested sequence, we have that Nn+1 = Nn + min{i N : minr RNn+i Ln+1(r) = 0} where the empirical loss is taken over {(Xi, f(Xi))}n+1 i=1 . Therefore, Nn+1 can be updated by sequentially testing if minr RNn+i Ln+1(r) = 0 for i = 0, 1, 2 . . . As mentioned earlier, both the previous steps can be done using a generic procedure that tests if (2) holds true. We now provide some equivalences that can be used to design this procedure for the ranking structures introduced in Section 2. For simplicity, we assume that all the evaluations of the sample are distinct: f(X(1)) < f(X(2)) < . . . < f(X(n+1)). (3) where (1) . . . (n+1) denote the indexes of the corresponding reordering. 5.2. Polynomial ranking rules Consider the sequence of polynomial ranking rules {RP(N)}N N and let φN : Rd Rdim(φN) be the function that maps any point of Rd into the polynomial feature space of degree N where dim(φN) = N+d d 1. For example, φ2(x1, x2) = {x1, x2, x1x2, x2 1, x2 2}. We start by making the link with linear separability in the polynomial feature space. Proposition 6. (LINEAR SEPARABILITY) Fix any N N and assume that all the evaluations are distinct (3). Then, (2) holds true if and only if there exists an axis ω Rdim(φN) such that, ω, φN(X(i+1)) φN(X(i)) > 0, i {1 . . . n}. Interestingly, testing the linear separability of a sample is equivalent to testing the emptiness of a sample-dependent polyhedron. Corollary 2. Let MN be the (dim(φN), n)-matrix where its i-th column is equal to (φN(X(i+1)) φN(X(i)))T and let the operator stands for the inequality componentwise (i.e. x x xi x i i {1 . . . d}). Then, under the same assumptions as in Proposition 6, (2) holds true if and only if the following polyhedron is empty: {λ Rn : MNλT = 0, 1, λ = 1, λ 0} = . Remark. 7 (ALGORITHMIC ASPECTS) The problem of testing the emptiness of a polyhedron can be seen as the problem of finding a feasible point of a linear program. We refer to Chapter 11.4 in (Boyd & Vandenberghe, 2004) where algorithmic solutions are discussed. A ranking approach to global optimization 50 100 150 200 0 0.8 Ada Rank 20 40 60 80 0 50 100 150 200 250 300 0 50 Ada Rank (a) (b) (c) Figure 4. Empirical estimation of E [maxx X f(x) f(Xˆıt)] in terms of iteration t = 1 . . . n where the expectation was obtained by running a 1000 times each algorithm. 5.3. Convex ranking rules Consider the sequence of convex ranking rules {RC(N)}N N . Following the steps of (Cl emenc on & Vayatis, 2010) leads to the next equivalence. Proposition 7. (OVERLAYING CLASSIFIERS) Fix any N N and let X = [a, b]. Then, (2) holds true if and only if there a exists a nested sequence h1 h2 . . . hn+1 of n + 1 classifiers of the form hi(x) = PN k=1 1 {li,k x ui,k} satisfying hi(X(j)) = 1 {(j) i}, (i, j) {1 . . . n + 1}2. The problem of overlaying classifiers admits a tractable solution when d = 1. In the specific case where N = 1 and d N , the problem of testing the existence of nested convex classifiers is equivalent to the problem of testing the emptiness of a cascade of polyhedrons. Proposition 8. Fix any d N , let N = 1 and assume that all the evaluations are distinct. Then, (2) holds true if and only if for each k {1 . . . n} the polyhedron defined by: {λ Rk : Mkλ = XT (n+1 k), 1, λ = 1, λ 0} where Mk is the (d, k)-matrix where its i-th column is equal to XT (n+2 i), is empty. 6. Experiments We now compare the empirical performances of the ADARANKOPT algorithm with four global optimization algorithms taken from the NLOpt library (Johnson, 2014): CRS (Kaelo & Ali, 2006) is a controlled random search with local mutations. It starts with a random population and randomly evolve these points by an heuristic rule. DIRECT (Jones et al., 1993) is a Lipschitz optimization algorithm where the Lipschitz constant is unknown. It uses a deterministic splitting technique of the search space. ESCH (Santos et al., 2010) and ISRES (Runarsson & Yao, 2000) are two evolutionary algorithms. The evolution strategies are based on a combination of mutation rules and differential variations. The tuning parameters were set to default and the parameter p was set to 1/4 for the convex ranking rules and to 1/10 for the polynomial ranking rules. The algorithms were compared on three synthetic problems: (a) The task consists in maximizing the function f(x) = 1{x x }(| cos (50(x x ))|3/2 15|x x |1/2)/10 1{x > x }(|x|1/2 + 0.05| sin (50x)|3/2) over X = [0, 1] where x = 0.499. The function f has 17 local maxima and presents a discontinuity around its unique optimizer x . The horizon n was set to 200 evaluations and the convex ranking rules were used. (b) The task consists in minimizing a perturbed version of the Styblinski-Tang function f(x) = P2 i=1(x4 i 16x2 i + 5xi)/2+cos(x1+x2) over X = [ 5, 5]2. The level sets of the Styblinski Tang function are displayed on Figure 3 and the function has 4 local minima. The polynomial ranking rules were used and the horizon n was set to 80 evaluations. (c) The task consists in maximizing the function f(x) = 1 | P10 i=1(xi 4.5)/10|5/2 over X = [ 5, 5]10. The hyperplane {x R10 : P10 i=1 xi = 45} maximizes the function. The horizon n was set to 300 evaluations and the polynomial ranking rules were used. The results are shown in Figure 5.1. We remark that the ADARANKOPT converges fast and avoids falling in local maxima, as opposed to most of its competitors. 7. Conclusion We have provided a global optimization strategy based on a sequential estimation of the ranking of the unknown function. We introduced two algorithms: RANKOPT which requires a prior knowledge of the ranking rule of the unknown function and its adaptive version ADARANKOPT which performs model selection. A theoretical analysis is provided and the adaptive algorithm is shown to be empirically competitive with the state-of-the-art methods on representative synthetic examples. A ranking approach to global optimization Appendix - Sketch of Proofs Full proofs can be found in the Supplementary Material. Proof of Proposition 1. The second inclusion ( ) is a direct consequence of the definition of the ranking rules. To state the first inclusion ( ), we introduce the function M(x) = µ({x X : rf(x, x ) = 1}) and we show that there exists two strictly increasing functions ψ and ψ such that f = ψ M and h = ψ M. We finally get that h = (ψ ψ 1) f where ψ ψ 1 is strictly increasing. Proof of Proposition 2. The result is obtained by induction. Since X1 U(X), the result trivially holds for n = 1. Assume that the statement holds for a given n N , fix any y R and let Xy = {x X : f(x) y}. First step: using the fact that {x X : f(x) f(Xˆın)} Xn X we show that P(f(Xˆın+1) y) P(f(Xˆın) y) + P(f(Xˆın) < y)µ(Xy)/µ(X). Second step: plugging the induction assumption in the last equation and using the fact that P(f(X n+1) y) = µ(Xy)/µ(X) gives the result. Proof of Corollary 1. Fix any ε > 0 and let Xε = {x X : f(x) maxx X f(x) ε} be the corresponding level set. Applying Proposition 2 leads to P(f(Xˆın) < maxx X f(x) ε) (1 µ(Xε)/µ(X))n n 0. Proof of Theorem 1. Fix any δ (0, 1) and let rδ,n be the upper bound of the theorem. First step: using Proposition 3 and the level set assumption, we show that P( Xˆın x 2 rδ,n) P( n i=1{X i B(x , diam(X) (ln(1/δ)/n)1/d)}) where {X i}n i=1 are n i.i.d. copies of X U(X). Second step: we show that for any r diam(X) we have that µ(B(x , r))/µ(X) (r/ diam(X))d. Third step: using independence and the fact that 1 x e x, we finally get that P( Xˆın x 2 rδ,n) 1 (1 ln(1/δ)/n)n 1 δ. Proof of Proposition 3. We use again an induction argument. Assume that the result holds for a given n N and fix any y R. First step: using the fact that Xf(Xˆın) = {x X : f(x) f(Xˆın)} Xn we show that P(f(Xˆın+1) y) E min(1, µ(Xy)/µ(Xf(Xˆın))) . Second step: using the induction assumption we show that E min(1, µ(Xy)/µ(Xf(Xˆın))) E min(1, µ(Xy)/µ(Xf(X n))) = P(f(X n+1) y). Proof of Theorem 2. Fix any δ (0, 1) and let rδ,n be the lower bound of the corollary. First step: using Theorem 3 and the level set assumption we show that P( Xˆın x 2 rδ,n) P(µ(X n)/µ(X) δ exp ( n p 2n ln(1/δ))) where X n = {x X : f(x) f(X n)}. Second step: we show that u (0, 1), P(µ(X n)/µ(X) u) P(Qn i=1 Ui u) where {Ui}n i=1 are n i.i.d. copies of U U([0, 1]). Third step: using concentration inequalities for sub-gamma random variables gives that P(Qn i=1 Ui δ exp ( n p 2n ln(1/δ))) < δ. Proof of Proposition 4. Fix any ε > 0 and let Xε = {x X : f(x) maxx X f(x) ε} be the corresponding level set. Using the fact that P(Xi Xε) p µ(Xε)/µ(X) for any i N , we show by induction that P(f(Xˆın) < maxx X f(x) ε) (1 p µ(Xε)/µ(X))n n 0. Proof of Proposition 5. Fix any δ (0, 1) and let nδ be the integer part of the upper bound of the proposition. First step: since we have a nested sequence of sets of ranking rules, P(τ nδ) = P(minr RN 1 Lnδ(r) > 0). Second step: using Hoeffding s inequality gives a lower bound on the number of i.i.d. samples collected: P(Pnδ i=1 Bi p nδ p nδ log(2/δ)/2 = n δ) 1 δ/2. Third step: applying concentration results of ranking rules over the n δ i.i.d. samples gives that P(minr RN 1 Ln δ(r) minr RN 1 L(r) 2 p ln (2/δ)/(n δ 1) > 0) 1 δ/2. Proof of Theorem 3. Fix any δ (0, 1). We know that after nδ iterations the true ranking structure RN is identified with probability at least 1 δ/2 (Proposition 5). Once the structure RN is identified, one can use a similar technique as the one used in Theorem 1 to get an upper bound with probability at least 1 δ/2 thanks to the n nδ samples. Proof of Proposition 6. The proof is a consequence of the definition of polynomial ranking rules: if r RP(N) then there exists ωr Rdim(φN) and cr R such that r(x, x ) = sgn( ωr, φN(x) + cr ωr, φN(x ) cr) = sgn( ωr, φN(x) φN(x ) ). Noticing that rf(X(i+1), X(i)) = 1 for all i {1 . . . n} gives the result. Proof of Corollary 2. Let X i = φN(X(i+1)) φN(X(i)), i {1 . . . n} and let CH{X i}n i=1 be the convex hull of {X i}n i=1. First step: we show the following equivalence: ω Rdim(φN) such that i {1 . . . n}, ω, X i > 0 0 / CH{X i}n i=1. Second step: using the definition of convex hull we get the second equivalence: 0 CH{X i}n i=1 (λ1, . . . λn) Rn s.t. 0 = Pn i=1 λi X i, Pn i=1 λi = 1 and λi 0, i {1 . . . n}. Proof of Proposition 7. The first inclusion ( ) is a direct consequence of the definition of convex ranking rules. Assume now that there exists a nested sequence of classifiers h1 . . . hn+1 satisfying the conditions. To state the second inclusion ( ) we build a continuous approximation of the step function f(x) = Pn i=1 hi(x) that perfectly ranks the sample and induces a convex ranking. Proof of Proposition 8. First step: using the definition of convex hulls, we show that each polyhedron of the cascade is empty iff CH{X(i)}n+1 i=n CH{X (i)}n+1 i=n 1 CH{X (i)}n+1 i=1 . Second step: we build a continuous approximation of the function f(x) = Pn k=1 1 x CH{X(i)}n+1 i=k which induces a convex ranking rule and perfectly ranks the sample. A ranking approach to global optimization Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press, 2004. Bull, Adam D. Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research, 12:2879 2904, 2011. Cl emenc on, St ephan. On U-processes and clustering performance. In Advances in Neural Information Processing Systems, pp. 37 45, 2011. Cl emenc on, St ephan and Vayatis, Nicolas. Overlaying classifiers: a practical approach to optimal scoring. Constructive Approximation, 32(3):619 648, 2010. Cl emenc on, St ephan, Lugosi, Gabor, and Vayatis, Nicolas. Ranking and empirical minimization of u-statistics. The Annals of Statistics, pp. 844 874, 2010. Cohn, David, Atlas, Les, and Ladner, Richard. Improving generalization with active learning. Machine learning, 15(2):201 221, 1994. Grill, Jean-Bastien, Valko, Michal, and Munos, R emi. Black-box optimization of noisy functions with unknown smoothness. In Neural Information Processing Systems, 2015. Hanneke, Steve. Rates of convergence in active learning. The Annals of Statistics, 39(1):333 361, 2011. Hansen, Pierre, Jaumard, Brigitte, and Lu, Shi-Hui. Global optimization of univariate lipschitz functions: I. survey and properties. Mathematical programming, 55(1-3): 251 272, 1992. Johnson, Steven G. The NLopt nonlinear-optimization package, 2014. http://ab-initio.mit.edu/nlopt. Jones, Donald R, Perttunen, Cary D, and Stuckman, Bruce E. Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157 181, 1993. Jones, Donald R, Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13 (4):455 492, 1998. Kaelo, P and Ali, MM. Some variants of the controlled random search algorithm for global optimization. Journal of optimization theory and applications, 130(2):253 264, 2006. Munos, R emi. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In Advances in neural information processing systems, 2011. Pint er, J anos D. Global optimization in action. Scientific American, 264:54 63, 1991. Rios, Luis Miguel and Sahinidis, Nikolaos V. Derivativefree optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247 1293, 2013. Runarsson, Thomas P and Yao, Xin. Stochastic ranking for constrained evolutionary optimization. Evolutionary Computation, IEEE Transactions on, 4(3):284 294, 2000. Santos, Carlos Henrique da Silva, Goncalves, Marcos Sergio, and Hernandez-Figueroa, Hugo Enrique. Designing novel photonic devices by bio-inspired computing. Photonics Technology Letters, IEEE, 22(15):1177 1179, 2010. Sergeyev, Yaroslav D, Strongin, Roman G, and Lera, Daniela. Introduction to global optimization exploiting space-filling curves. Springer Science & Business Media, 2013. Zabinsky, Zelda B. Stochastic adaptive search for global optimization, volume 72. Springer Science & Business Media, 2013. Zabinsky, Zelda B and Smith, Robert L. Pure adaptive search in global optimization. Mathematical Programming, 53(1-3):323 338, 1992.