# multiobjective_bayesian_optimisation_with_preferences_over_objectives__049ac2a3.pdf

Multi-objective Bayesian optimisation with preferences over objectives

Majid Abdolshah, Alistair Shilton, Santu Rana, Sunil Gupta, Svetha Venkatesh The Applied Artiﬁcial Intelligence Institute (A2I2), Deakin University, Australia {majid,alistair.shilton,santu.rana,sunil.gupta,svetha.venkatesh} @deakin.edu.au

We present a multi-objective Bayesian optimisation algorithm that allows the user to express preference-order constraints on the objectives of the type objective A is more important than objective B . These preferences are deﬁned based on the stability of the obtained solutions with respect to preferred objective functions. Rather than attempting to ﬁnd a representative subset of the complete Pareto front, our algorithm selects those Pareto-optimal points that satisfy these constraints. We formulate a new acquisition function based on expected improvement in dominated hypervolume (EHI) to ensure that the subset of Pareto front satisfying the constraints is thoroughly explored. The hypervolume calculation is weighted by the probability of a point satisfying the constraints from a gradient Gaussian Process model. We demonstrate our algorithm on both synthetic and real-world problems.

1 Introduction

In many real world problems, practitioners are required to sequentially evaluate a noisy black-box and expensive to evaluate function f with the goal of ﬁnding its optimum in some domain X. Bayesian optimisation is a well-known algorithm for such problems. There are a variety of studies such as hyperparameter tuning [27, 13, 12], expensive multi-objective optimisation for Robotics [2, 1], and experimentation optimisation in product design such as short polymer ﬁber materials [16].

Multi-objective Bayesian optimisation involves at least two conﬂicting, black-box, and expensive to evaluate objectives to be optimised simultaneously. Multi-objective optimisation usually assumes that all objectives are equally important, and solutions are found by seeking the Pareto front in the objective space [4, 5, 3]. However, in most cases, users can stipulate preferences over objectives. This information will impart on the relative importance on sections of the Pareto front. Thus using this information to preferentially sample the Pareto front will boost the efﬁciency of the optimiser, which is particularly advantageous when the objective functions are expensive.

In this study, preferences over objectives are stipulated based on the stability of the solutions with respect to a set of objective functions. As an example, there are scenarios when investment strategists are looking for Pareto optimal investment strategies that prefer stable solutions for return (objective 1) but more diverse solutions with respect to risk (objective 2) as they can later decide their appetite for risk. As can be inferred, the stability in one objective produces more diverse solutions for the other objectives. We believe in many real-world problems our proposed method can be useful in order to reduce the cost, and improve the safety of experimental design.

Whilst multi-objective Bayesian optimisation for sample efﬁcient discovery of Pareto front is an established research track [9, 18, 8, 15], limited work has examined the incorporation of preferences. Recently, there has been a study [18] wherein given a user speciﬁed preferred region in objective space,

33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada.

Figure 1: (a) Local Pareto optimality for 2 objective example with 1D design space. Local optimality implies f0(x)

x and f1(x)

x have opposite signs since the weighted sum of gradients of the objectives with respect to x must be zero: s T

xf (x) = 0. In (b) we additionally require that || f1(x)

x || > || f0(x)

x ||, so perturbation of x will cause relatively more change in f1 than f0 - i.e. such solutions are (relatively) stable in objective f0. (c) Shows the converse, namely || f0(x)

x || > || f1(x)

x || favoring solutions that are (relatively) stable in objective f1 and diverse in f0.

the optimiser focuses its sampling to derive the Pareto front efﬁciently. However, such preferences are based on the assumption of having an accurate prior knowledge about objective space and the preferred region (generally a hyperbox) for Pareto front solutions. The main contribution of this study is formulating the concept of preference-order constraints and incorporating that into a multi-objective Bayesian optimisation framework to address the unavailability of prior knowledge and boosting the performance of optimisation in such scenarios.

We are formulating the preference-order constraints through ordering of derivatives and incorporating that into multi-objective optimisation using the geometry of the constraints space whilst needing no prior information about the functions. Formally, we ﬁnd a representative set of Pareto-optimal solutions to the following multi-objective optimisation problem:

D X = argmax x X f (x) (1)

subject to preference-order constraints - that is, assuming f = [f0, f1, . . . , fm], f0 is more important (in terms of stability) than f1 and so on. Our algorithm aims to maximise the dominated hypervolume of the solution in a way that the solutions that meet the constraints are given more weights.

To formalise the concept of preference-order constraints, we ﬁrst note that a point is locally Pareto optimal if any sufﬁciently small perturbation of a single design parameter of that point does not simultaneously increase (or decrease) all objectives. Thus, equivalently, a point is locally Pareto optimal if we can deﬁne a set of weight vectors such that, for each design parameter, the weighted sum of gradients of the objectives with respect to that design parameter is zero (see Figure 1a). Therefore, the weight vectors deﬁne the relative importance of each objective at that point. Figure 1b illustrates this concept where the blue box deﬁnes the region of stability for the function f0. Since in this section the magnitude of partial derivative for f0 is smaller compared to that of f1, the weights required to satisfy Pareto optimality would need higher weight corresponding to the gradient of f0 compared to that of f1 (see Figure 1b). Conversely, in Figure 1c, the red box highlights the section of the Pareto front where solutions have high stability in f1. To obtain samples from this section of the Pareto front, we need to make the weights corresponding to the gradient of f0 to be smaller to that of the f1.

Our solution is based on understanding the geometry of the constraints in the weight space. We show that preference order constraints gives rise to a polyhedral proper cone in this space. We show that for the pareto-optimality condition, it necessitates the gradients of the objectives at pareto-optimal points to lie in a perpendicular cone to that polyhedral. We then quantify the posterior probability that any point satisﬁes the preference-order constraints given a set of observations. We show how these posterior probabilities may be incorporated into the EHI acquisition function [11] to steer the Bayesian optimiser toward Pareto optimal points that satisfy the preference-order constraint and away from those that do not.

Sets are written A, B, C, . . . where R+ is the positive reals, R+ = R+ {0}, Z+ = {1, 2, . . .}, and Zn = {0, 1, . . . , n 1}. |A| is the cardinality of the set A. Tuples (ordered sets) are denoted A, B, C, . . .. Distributions are denoted A, B, C, . . .. column vectors are bold lower case a, b, c, . . .. Matrices bold upper case A, B, C, . . .. Element i of vector a is ai, and element i, j of matrix A is Ai,j (all indexed i, j = 0, 1, . . .). The transpose is denoted a T, AT. I is the identity matrix, 1 is a vector of 1s, 0 is a vector of 0s, and ei is a vector e(i)j = δij, where δij is the Kronecker-Delta. x = [

x0 x1 . . . xn 1 ]T, sgn(x) is the sign of x (where sgn(0) = 0), and the indicator function is denoted as 1(A).

3 Background

3.1 Gaussian Processes

Let X Rn be compact. A Gaussian process [23] GP(µ, K) is a distribution on the function space f : X R deﬁned by mean µ : X R (assumed zero without loss of generality) and kernel (covariance) K : X X R. If f(x) GP(0, K(x, x )) then the posterior of f given D = {(x(j), y(j)) Rn R|y(j) = f(x(j))+ϵ, ϵ N(0, σ2), j ZN}, f(x)|D N(µD(x), σD(x, x )), where: µD (x) = k T (x) K + σ2I 1 y σD (x, x ) = K (x, x ) k T (x) K + σ2I 1 k (x ) (2)

and y, k(x) R|D|, K R|D| |D|, k(x)j = K(x, x(j)), Kjk = K(x(j), x(k)).

Since differentiation is a linear operation, the derivative of a Gaussian process is also a Gaussian process [17, 22]. The posterior of xf given D is xf(x)|D N(µ D(x), σ D(x, x )), where:

µ D (x) = xk T (x) K + σ2I 1 y

σ D (x, x ) = x T x K (x, x ) xk T (x) (K + σ2 i I) 1 x k T (x ) T (3)

3.2 Multi-Objective Optimisation

A multi-objective optimisation problem has the form:

argmax x X f (x) (4)

where the components of f : X Rn Y Rm represent the m distinct objectives fi : X R. X and Y are called design space and objective space, respectively. A Pareto-optimal solution is a point x X for which it is not possible to ﬁnd another solution x X such that fi(x) > fi(x ) for all objectives f0, f1, . . . fm 1. The set of all Pareto optimal solutions is the Pareto set X = {x X| x X : f (x) f (x )} where y y (y dominates y ) means y = y , yi y i i, and y y means y y or y = y .

Given observations D = {(x(j), y(j)) Rn Rm|y(j) = f(x(j)) + ϵ, ϵi N(0, σ2 i )} of f the dominant set D = {(x , y ) D| (x, y) D : y y } is the most optimal subset of D (in the Pareto sense). The goodness of D is often measured by the dominated hypervolume (S-metric, [31, 10]) with respect to some reference point z Rm: S (D) = S (D ) = R

y z 1 y(i) D y(i) y dy. Thus our aim is to ﬁnd the set D that maximises the hypervolume. Optimised algorithms exist for calculating hypervolume [29, 25], S(D), which is typically calculated by sorting the dominant observations along each axis in objective space to form a grid. Dominated hypervolume (with respect to z) is then the sum of the hypervolumes of the dominated cells (ck) - i.e. S (D) = P

k vol (ck) .

3.3 Bayesian Multi-Objective Optimisation

In the multi-objective case one typically assumes that the components of f are draws from independent Gaussian processes, i.e. fi(x) GP(0, K(i)(x, x )), and fi and fi are independent i = i . A

popular acquisition function for multi-objective Bayesian optimisation is expected hypervolume improvement (EHI). The EHI acquisition function is deﬁned by:

at (x| D) = Ef(x)|D [S (D {(x, f (x))}) S (D)] (5)

[26, 30] and represents the expected change in the dominated hypervolume by the set of observations based on the posterior Gaussian process.

4 Problem Formulation

Let f : X Rn Y Rm be a vector of m independent draws fi GP(0, K(i)(x, x)) from zeromean Gaussian processes. Assume that f is expensive to evaluate. Our aim is to ﬁnd a representative set of Pareto-optimal solutions to the following multi-objective optimisation problem:

D X = argmax x XI X f (x) (6)

subject to preference-order constraints. Speciﬁcally, we want to explore only that subset of solutions XI X that place more importance on one objective fi0 than objective fi1, and so on, as speciﬁed by the (ordered) preference tuple I = (i0, i1, . . . i Q|{i0, i1, . . .} Zm, ik = ik k = k ), where Q Zm is the number of deﬁned preferences over objectives.

4.1 Preference-Order Constraints

Let x int(X) X be a Pareto-optimal point in the interior of X. Necessary (but not sufﬁcient, local) Pareto optimality conditions require that, for all sufﬁciently small δx Rn, f(x + δx) f(x), or, equivalently δx T x f (x ) / Rm +. A necessary (again not sufﬁcient) equivalent condition is that, for each axis j Zn in design space, sufﬁciently small changes in xj do not cause all objectives to simultaneously increase (and/or remain unchanged) or decrease (and/or remain unchanged). Failure of this condition would indicate that simply changing design parameter xj could improve all objectives, and hence that x was not in fact Pareto optimal. In summary, local Pareto optimality requires that j Zn there exists s(j) Rm +\{0} such that:

s T (j) xj f (x) = 0 (7)

It is important to note that this is not the same as the optimality conditions that may be derived from linear scalarisation, as the optimality conditions that arrise from linear scalarisation additionally require that s(0) = s(1) = . . . = s(n 1). Moreover (7) applies to all Pareto-optimal points, whereas linear scalarisation optimisation conditions fail for Pareto points on non-convex regions [28].

Deﬁnition 1 (Preference-Order Constraints) Let I = (i0, i1, . . . i Q|{i0, i1, . . .} Zm, ik = ik k = k ) be an (ordered) preference tuple. A vector x X satisﬁes the associated preference-order constraint if s(0), s(1), . . . , s(n 1) SI such that:

s T (j) xj f (x) = 0 j Zn

where SI s Rm +\ {0} si0 si1 si2 . . . . Further we deﬁne XI to be the set of all x X satisfying the preference-order constraint. Equivalently:

xj f (x) S I j Zn}

where S I x X| s SI, s Tx = 0 .

It is noteworthy to mention that (7) and Deﬁnition 1 are the key for calculating the compliance of a recommended solution with the preference-order constraints. Having deﬁned preference-order constraints we then calculate the posterior probability that x XI, and showing how these posterior probabilities may be incorporated into the EHI acquisition function to steer the Bayesian optimiser toward Pareto optimal points that satisfy the preference-order constraint. Before proceeding, however, it is necessary to brieﬂy consider the geometry of SI and S I .

~a(1) ~a(0)

Figure 2: Illustration of S I , SI and the vectors a(0), a(1) for a 2D case where I = (0, 1), so s0 > s1, SI is a proper cone representing the preference-order constraints; S I is the union of two sub-spaces. v S I implies a solution complying with preference-order constraints. b0 and b1 are the projection of v over a(0) and a(1). In order to satisfy v S I , it is necessary that s SI s.t. v T s = 0 or equivalently v = 0 or b0 = a T (0)v and b1 = a T (1)v have different signs.

4.2 The geometry of SI and S I

In the following we assume, w.l.o.g, that the preference-order constraints follows the order of indices in objective functions (reorder, otherwise), and that there is at least one constraint.

We now deﬁne the preference-order constraints by assumption I = (0, 1, . . . , Q|Q Zm\{0}), where Q > 0. This deﬁnes the sets SI and S I , which in turn deﬁne the constraints that must be met by the gradients of f(x) - either s(0), s(1), . . . , s(n 1) SI such that s T (j) xj f (x) = 0 j Zn or, equivalently xj f (x) S I j Zn. Next, Theorem 1 deﬁnes the representation of SI.

Theorem 1 Let I = (0, 1, . . . , Q|Q Zm\{0}) be an (ordered) preference tuple. Deﬁne SI as per deﬁnition 1. Then SI is a polyhedral (ﬁnitely-generated) proper cone (excluding the origin) that may be represented using either a polyhedral representation:

SI = n s Rm| a T (i)s 0 i Zm o \ {0} (8)

or a generative representation:

i Zm ci a(i) c Rm + o \ {0} (9)

where i Zm:

2 (ei ei+1) if i ZQ ei otherwise

l Zi+1 el if i ZQ+1

ei otherwise

and e0, e1, . . . , em 1 are the Euclidean basis of Rm.

Proof of Theorem 1 is available in the supplementary material. To test if a point satisﬁes this requirement we need to understand the geometry of the set SI. The Theorem 1 shows that SI {0} is a polyhedral (ﬁnitely generated) proper cone, represented either in terms of half-space constraints (polyhedral form) or as a positive span of extreme directions (generative representation). The geometrical intuition for this is given in Figure 2 for a simple, 2-objective case with a single preference order constraint.

Algorithm 1 Test if v S I .

Input: Preference tuple I Test vector v Rm. Output: 1(v S I ). // Calculate 1(v S I ). Let bj = a T (j)v j Zm. if i = k Zm : sgn(bi) = sgn(bk) return TRUE elseif b = 0 return TRUE else return FALSE.

Algorithm 2 Preference-Order Constrained Bayesian Optimisation (MOBO-PC).

Input: preference-order tuple I. Observations D = {(x(i), y(i)) X Y}. for t = 0, 1, . . . , T 1 do Select the test point: x = argmax x X a PEHI t (x|Dt).

(a PEHI t is evaluated using algorithm 4). Perform Experiment y = f(x) + ϵ. Update Dt+1 := Dt {(x, y)}. end for

Algorithm 3 Calculate Pr(x XI|D).

Input: Observations D = {(x(i), y(i)) X Y}. Number of Monte Carlo samples R. Test vector x X. Output: Pr(x XI|D). Let q = 0. for k = 0, 1, . . . , R 1 do //Construct samples v(0), v(1), . . . , v(n 1) Rm. Let v(j) = 0 j Zn. for i = 0, 1, . . . , m 1 do Sample u N(µ Di(x), σ Di(x, x)) (see (3)). Let [v(0)i, v(1)i, . . . , v(n 1)i] := u T. end for //Test if v(j) S I j Zn. Let q := q + Q

j Zn 1(v(j) S I ) (see algo

rithm 1). end for Return q

Algorithm 4 Calculate a PEHI t (x|D).

Input: Observations D = {(x(i), y(i)) X Y}. Number of Monte Carlo samples R. Test vector x X. Output: a PEHI t (x|D). Using algorithm 3, calculate: sx = Pr (x XI| D) s(j) = Pr x(j) XI D x(j), y(j) D Let q = 0. for k = 0, 1, . . . , R 1 do Sample yi N(µDi(x), σDi(x))) i Zm (see (2)). Construct cells c0, c1, . . . from D {(x, y)} by sorting along each axis in objective space to form a grid. Calculate: q = q+ sx P

k:y yck vol (ck) Q

j ZN:y(j) yck

end for Return q/ R.

The subsequent corollary allows us to construct a simple algorithm (algorithm 1) to test if a vector v lies in the set S I . We will use this algorithm to test if xj f(x) S I j Zn - that is, if x satisﬁes the preference-order constraints. The proof of corollary 1 is available in the supplementary material.

Corollary 1 Let I = (0, 1, . . . , Q|Q Zm\{0}) be an (ordered) preference tuple. Deﬁne S I as per deﬁnition 1. Using the notation of Theorem 1, v S I if and only if v = 0 or i = k Zm such that sgn( a T (i)v) = sgn( a T (k)v), where sgn(0) = 0.

5 Preference Constrained Bayesian Optimisation

In this section we do two things. First, we show how the Gaussian process models of the objectives fi (and their derivatives) may be used to calculate the posterior probability that x XI deﬁned by I = (0, 1, . . . , Q|Q Zm\{0}). Second, we show how the EHI acquisition function may be modiﬁed and calculated to incorporate these probabilities and hence only reward points that satisfy the preference-order conditions. Finally, we give our algorithm using this acquisition function.

5.1 Calculating Posterior Probabilities

Given that fi GP(0, K(i)(x, x)) are draws from independent Gaussian processes, and given observations D, we wish to calculate the posterior probability that x XI -

i.e.: Pr (x XI| D) = Pr xj f (x) S I j Zn . As fi GP(0, K(i)(x, x)) it follows that

xfi(x)|D Ni N(µ Di(x), σ Di(x, x )), as deﬁned by (3). Hence:

Pr (x XI| D) = Pr

v(j) S I j Zn

v(0)i v(1)i ... v(n 1)i

where v P( xf|D). We estimate it using Monte-Carlo [6] sampling as per algorithm 3.

5.2 Preference-Order Constrained Bayesian Optimisation Algorithm (MOBO-PC)

Our complete Bayesian optimisation algorithm with Preference-order constraints is given in algorithm 2. The acquisition function introduced in this algorithm gives higher importance to points satisfying the preference-order constraints. Unlike standard EHI, we take expectation over both the expected experimental outcomes fi(x) N(µDi(x), σDi(x, x)), i Zm, and the probability that points x(i) XI and x XI satisfy the preference-order constraints. We deﬁne our preference-based EHI acquisition function as:

a PEHI t (x| D) = E [SI (D {(x, f (x))}) SI (D)| D] (10)

where SI(D) is the hypervolume dominated by the observations (x, y) D satisfying the preference-order constraints. The calculation of SI(D) is illustrated in the supplementary material. The expectation of SI(D) given D is:

E [SI (D)| D] = P

k vol (ck) Pr( (x, y) D| y yck . . . x XI) . . .

k vol (ck) (1 Q

(x,y) D:y yck (1 Pr (x XI| D)))

where yck is the dominant corner of cell ck, vol(ck) is the hypervolume of cell ck, and the cells ck are constructed by sorting D along each axis in objective space. The posterior probabilities Pr(x XI|D) are calculated using algorithm 3. It follows that:

a PEHI t (x| D) = Pr (x XI| D) E h P

k:y yck vol (ck) Q

j ZN:y(j) yck

1 Pr x(j) XI D yi . . .

N (µDi (x) , σDi (x)) i Zm i

where the cells ck are constructed using the set D {(x, y)} by sorting along the axis in objective space.We estimate this acquisition function using Monte-Carlo simulation shown in algorithm 4.

6 Experiments

We conduct a series of experiments to test the empirical performance of our proposed method MOBO-PC and compare with other strategies. These experiments including synthetic data as well as optimizing the hyper-parameters of a feed-forward neural network. For Gaussian process, we use maximum likelihood estimation for setting hyperparameters [21].

6.1 Baselines

To the best of our knowledge there are no studies aiming to solve our proposed problem, however we are using PESMO, SMSego, SUR, Par EGO and EHI [9, 20, 19, 14, 7] to conﬁrm the validity of the obtained Pareto front solutions. The obtained Pareto front must be in the ground-truth whilst also satisfying the preference-order constraints. We compare our results with MOBO-RS [18] by suitably specifying bounding boxes in the objective space that can replicate a preference-order constraint.

6.2 Synthetic Functions

We begin with a comparison on minimising synthetic function Schaffer function N. 1 with 2 conﬂicting objectives f0, f1 and 1-dimensional input. (see [24]). Figure 3a shows the ground-truth Pareto front

Full Pareto

(a) Full Pareto front

(b) Case 1, s0 s1

(c) Case 2, s0 < s1

(d) Case 3, s0 > s1

0 1 2 3 4 f0

0 1 2 3 4 f0

Figure 3: Finding Pareto front which comply with the preference-order constraint. Figure 3a shows the full Pareto front solution (with no preferences). Figure 3b illustrates the Pareto front by assuming stability of ﬁrst objective f0 is similar to second objective f1. In Figure 3c, stability of f1 is preferred over f0. Figure 3d shows more stable results for f0 than f1 (s0 > s1). Figure 3e and 3f shows the results obtained by MOBO-RS and the corresponding bounding boxes. The gradient color of the Pareto front points in Figure 3b-3d indicates their degree of compliance with the constraints.

for this function. To illustrate the behavior of our method, we impose distinct preferences. Three test cases are designed to illustrate the effects of imposing preference-order constraints on the objective functions for stability. Case (1): s0 s1, Case (2): s0 < s1 and Case (3): s0 > s1. For our method it is only required to deﬁne the preference-order constraints, however for MOBO-RS, additional information as a bounding box is obligatory. Figure 3b (case 1), shows the results of preference-order constraints SI s Rm +\ {0} s0 s1 for our proposed method, where s0 represents the importance of stability in minimising f0 and s1 is the importance of stability in minimising f1. Due to same importance of both objectives, a balanced optimisation is expected. Higher weights are obtained for the Pareto front points in the middle region with highest stability for both objectives. Figure 3c (case 2) is based on the preference-order of s0 < s1 that implies the importance of stability in f1 is more than f0. The results show more stable Pareto points for f1 than f0. Figure 3d (case 3) shows the results of s0 > s1 preference-order constraint. As expected, we see more number of stable Pareto points for the important objective (i.e. f0 in this case). We deﬁned two bounding boxes for MOBO-RS approach which can represent the preference-order constraints in our approach (Figure 3e and 3f). There are inﬁnite possible bounding boxes can serve as constraints on objectives in such problems, consequently, the instability of results is expected across the various deﬁnitions of bounding boxes. We believe our method can obtain more stable Pareto front solutions especially when prior information is sparse. Also, having extra information as the weight (importance) of the Pareto front points is another advantage.

Figure 4 illustrates a special test case in which s0 > s1 and s2 > s1, yet no preferences speciﬁed over f2 and f0 while minimising Viennet function. The proposed complex preference-order constraint does not form a proper cone as elaborated in Theorem 1. However, s0 > s1 independently constructs a proper cone, likewise for s2 > s1. Figure 4a shows the results of processing these two independent constraints separately, merging their results and ﬁnding the Pareto front. Figure 4b implies more stable solutions for f0 comparing to f1. Figure 4c shows the Pareto front points comply with s2 > s1.

15.0 15.4 15.8 16.2 16.6

Full Pareto points Pareto points

Weight ( 10 3)

(a) Obtained Pareto points

1 2 3 4 5 f0

Pareto points

(b) Projection of Pareto points in f0 and f1 space

15.0 15.5 16.0 16.5 17.0 f1

Pareto points

(c) Projection of Pareto points in f1 and f2 space

Figure 4: Finding Pareto front points with partial constraints as speciﬁed by s0 > s1 and s2 > s1. Figure 4a shows the 3D plot of the obtained Pareto front points satisfying preference-order constraints with the color indicating the degree of compliance. Figure 4b illustrates the projection of Pareto optimal points on f0 f1 sub-space, and ﬁgure 4c shows the projection on f1 f2 sub-space.

0.015 0.020 0.025 0.030 f0 (error)

SUR 100 PESMO 100 PAREGO 100 SMSEGO 100 MOBO PC 100 MOBO RS 100

0.015 0.020 0.025 0.030 f0 (error)

SUR 200 PESMO 200 PAREGO 200 SMSEGO 200 MOBO PC 200 MOBO RS 200

Figure 5: Average Pareto fronts obtained by proposed method in comparison to other methods. This experiment deﬁnes s1 > s0 i.e. stability of run time is more important than the error. For MOBO-RS, [[0.02, 0], [0.03, 2]] is an additional information used as bounding box. The other methods do not incorporate preferences. The results are shown for 100 evaluations of the objectives (left) and 200 evaluations of the objectives (right).

6.3 Finding a Fast and Accurate Neural Network

Next, we train a neural network with two objectives of minimising both prediction error and prediction time, as per [9]. These are conﬂicting objectives because reducing the prediction error generally involves larger networks and consequently longer testing time. We are using MNIST dataset and the tuning parameters include number of hidden layers (x1 [1, 3]), the number of hidden units per layer (x2 [50, 300]), the learning rate (x3 (0, 0.2]), amount of dropout (x4 [0.4, 0.8]), and the level of l1 (x5 (0, 0.1]) and l2 (x6 (0, 0.1]) regularization. For this problem we assume stability of f1(time) in minimising procedure is more important than the f0(error). For MOBO-RS method, we selected [[0.02, 0], [0.03, 2]] bounding box to represent an accurate prior knowledge (see Figure 5). The results were averaged over 5 independent runs. Figure 5 illustrates that one can simply ask for more stable solutions with respect to test time (without any prior knowledge) of a neural network while optimising the hyperparameters. As all the solutions found with MOBO-PC are in range of (0, 5) test time. In addition, it seems the proposed method ﬁnds more number of Pareto front solutions in comparison with MOBO-RS. 7 Conclusion In this paper we proposed a novel multi-objective Bayesian optimisation algorithm with preferences over objectives. We deﬁne objective preferences in terms of stability and formulate a common framework to focus on the sections of the Pareto front where preferred objectives are more stable, as is required. We evaluate our method on both synthetic and real-world problems and show that the obtained Pareto fronts comply with the preference-order constraints.

Acknowledgments

This research was partially funded by Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

[1] Roberto Calandra, Nakul Gopalan, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian gait optimization for bipedal locomotion. In International Conference on Learning and Intelligent Optimization, pages 274 290. Springer, 2014.

[2] Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artiﬁcial Intelligence, 76(1-2):5 23, 2016.

[3] Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pages 273 316. Springer, 2005.

[4] Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pages 403 449. Springer, 2014.

[5] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and Tanaka Meyarivan. A fast elitist nondominated sorting genetic algorithm for multi-objective optimization: Nsga-ii. In International conference on parallel problem solving from nature, pages 849 858. Springer, 2000.

[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411 436, 2006.

[7] Michael Emmerich and Jan-willem Klinkenberg. The computation of the expected improvement in dominated hypervolume of pareto front approximations. Rapport technique, Leiden University, 34, 2008.

[8] Paul Feliot, Julien Bect, and Emmanuel Vazquez. A bayesian approach to constrained single-and multi-objective optimization. Journal of Global Optimization, 67(1-2):97 133, 2017.

[9] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams. Predictive entropy search for multi-objective bayesian optimization. In International Conference on Machine Learning, pages 1492 1501, 2016.

[10] Simon Huband, Phil Hingston, Lyndon While, and Luigi Barone. An evolution strategy with probabilistic mutation for multi-objective optimisation. In Proceedings of the IEEE Congress on Evolutionary Computation, volume 4, pages 2284 2291, 2003.

[11] Iris Hupkens, André Deutz, Kaifeng Yang, and Michael Emmerich. Faster exact algorithms for computing expected hypervolume improvement. In International Conference on Evolutionary Multi-Criterion Optimization, pages 65 79. Springer, 2015.

[12] Ilija Ilievski, Taimoor Akhtar, Jiashi Feng, and Christine Annette Shoemaker. Efﬁcient hyperparameter optimization for deep learning algorithms using deterministic rbf surrogates. In Thirty-First AAAI Conference on Artiﬁcial Intelligence, 2017.

[13] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. ar Xiv preprint ar Xiv:1605.07079, 2016.

[14] Joshua Knowles. Parego: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10(1):50 66, 2006.

[15] Marco Laumanns and Jiri Ocenasek. Bayesian optimization algorithms for multi-objective optimization. In International Conference on Parallel Problem Solving from Nature, pages 298 307. Springer, 2002.

[16] Cheng Li, David Rubín de Celis Leal, Santu Rana, Sunil Gupta, Alessandra Sutti, Stewart Greenhill, Teo Slezak, Murray Height, and Svetha Venkatesh. Rapid bayesian optimisation for synthesis of short polymer ﬁber materials. Scientiﬁc reports, 7(1):5683, 2017.

[17] A. O Hagan. Some bayesian numerical analysis. Bayesian Statistics, 7:345 363, 1992.

[18] Biswajit Paria, Kirthevasan Kandasamy, and Barnabás Póczos. A ﬂexible multi-objective bayesian optimization approach using random scalarizations. Co RR, abs/1805.12168, 2018.

[19] Victor Picheny. Multiobjective optimization using gaussian process emulators via stepwise uncertainty reduction. Statistics and Computing, 25(6):1265 1280, 2015.

[20] Wolfgang Ponweiser, Tobias Wagner, Dirk Biermann, and Markus Vincze. Multiobjective optimization on a limited budget of evaluations using model-assisted s-metric selection. In International Conference on Parallel Problem Solving from Nature, pages 784 794. Springer, 2008.

[21] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63 71. Springer, 2004.

[22] Carl Edward Rasmussen. Gaussian processes to speed up hybrid monte carlo for expensive bayesian integrals. Bayesian statistics, 7:651 659, 2008.

[23] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[24] James David Schaffer. Some experiments in machine learning using vector evaluated genetic algorithms (artiﬁcial intelligence, optimization, adaptation, pattern recognition). 1984.

[25] Alistair Shilton, Santu Rana, Sinil Kumar Gupta, and Svetha Venkatesh. A simple recursive algorithm for calculating expected hypervolume improvement. In Bayes Opt2017 NIPS Workshop on Bayesian Optimisation, 2017.

[26] Ofer M. Shir, Michael Emmerich, Thomas Back, and Marc J. J. Vrakking. The application of evolutionary multi-criteria optimization to dynamic molecular aligment. In Proceedings of 2007 IEEE Congress on Evolutionary Computation, 2007.

[27] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951 2959, 2012.

[28] Kristof Van Moffaert, Madalina M Drugan, and Ann Nowé. Scalarized multi-objective reinforcement learning: Novel design techniques. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 191 199, 2013.

[29] Lyndon While, Philip Hingston, Luigi Barone, and Simon Huband. A faster algorithm for calculating hypervolume. IEEE Transactions on Evolutionary Computation, 10(1):29 38, 2006.

[30] Martin Zaefferer, Thomax Bartz-Beielstein, Boris Naujoks, Tobias Wagner, and Michael Emmerich. A case study on multi-criteria optimization of an event detection software under limited budgets. In Proceedings of the 2013 International Conference on Evolutionary Multi-Criterion Optimization, pages 756 770. Springer, 2013.

[31] Eckart Zitzler. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. Ph D thesis, Swiss Federal Institute of Technology Zurich, 1999.