# gaussian_process_optimization_with_mutual_information__72d20198.pdf

Gaussian Process Optimization with Mutual Information

Emile Contal CONTAL@CMLA.ENS-CACHAN.FR CMLA, UMR CNRS 8536, ENS Cachan, France

Vianney Perchet VIANNEY.PERCHET@NORMALESUP.ORG LPMA, Universit e Paris Diderot, France

Nicolas Vayatis VAYATIS@CMLA.ENS-CACHAN.FR CMLA, UMR CNRS 8536, ENS Cachan, France

In this paper, we analyze a generic algorithm scheme for sequential global optimization using Gaussian processes. The upper bounds we derive on the cumulative regret for this generic algorithm improve by an exponential factor the previously known bounds for algorithms like GPUCB. We also introduce the novel Gaussian Process Mutual Information algorithm (GP-MI), which signiﬁcantly improves further these upper bounds for the cumulative regret. We conﬁrm the efﬁciency of this algorithm on synthetic and real tasks against the natural competitor, GP-UCB, and also the Expected Improvement heuristic.

1. Introduction

Stochastic optimization problems are encountered in numerous real world domains including engineering design (Wang & Shan, 2007), ﬁnance (Ziemba & Vickson, 2006), natural sciences (Floudas & Pardalos, 2000), or in machine learning for selecting models by tuning the parameters of learning algorithms (Snoek et al., 2012). We aim at ﬁnding the input of a given system which optimizes the output (or reward). In this view, an iterative procedure uses the previously acquired measures to choose the next query predicted to be the most useful. The goal is to maximize the sum of the rewards received at each iteration, that is to minimize the cumulative regret by balancing exploration (gathering information by favoring locations with high uncertainty) and exploitation (focusing on the optimum by favoring locations with high predicted reward). This optimization task becomes challenging when the dimension

Proceedings of the 31 st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

of the search space is high and the evaluations are noisy and expensive. Efﬁcient algorithms have been studied to tackle this challenge such as multiarmed bandit (Auer et al., 2002; Kleinberg, 2004; Bubeck et al., 2011; Audibert et al., 2011), active learning (Carpentier et al., 2011; Chen & Krause, 2013) or Bayesian optimization (Mockus, 1989; Grunewalder et al., 2010; Srinivas et al., 2012; de Freitas et al., 2012). The theoretical analysis of such optimization procedures requires some prior assumptions on the underlying function f. Modeling f as a function distributed from a Gaussian process (GP) enforces near-by locations to have close associated values, and allows to control the general smoothness of f with high probability according to the kernel of the GP (Rasmussen & Williams, 2006). Our main contribution is twofold: we propose a generic algorithm scheme for Gaussian process optimization and we prove sharp upper bounds on its cumulative regret. The theoretical analysis has a direct impact on strategies built with the GP-UCB algorithm (Srinivas et al., 2012) such as (Krause & Ong, 2011; Desautels et al., 2012; Contal et al., 2013). We suggest an alternative policy which achieves an exponential speed up with respect to the cumulative regret. We also introduce a novel algorithm, the Gaussian Process Mutual Information algorithm (GP-MI), which improves furthermore upper bounds for the cumulative regret from O( a

T(log T)d+1) for GP-UCB, the current state of the art, to the spectacular O( a

(log T)d+1), where T is the number of iterations, d is the dimension of the input space and the kernel function is Gaussian. The remainder of this article is organized as follows. We ﬁrst introduce the setup and notations. We deﬁne the GP-MI algorithm in Section 2. Main results on the cumulative regret bounds are presented in Section 3. We then provide technical details in Section 4. We ﬁnally conﬁrm the performances of GP-MI on real and synthetic tasks compared to the state of the art of GP optimization and some heuristics used in practice.

Gaussian Process Optimization with Mutual Information

Figure 1. One dimensional Gaussian process inference of the posterior mean µ (blue line) and posterior deviation σ (half of the height of the gray envelope) with squared exponential kernel, based on four observations (blue crosses).

2. Gaussian process optimization and the GP-MI algorithm

2.1. Sequential optimization and cumulative regret

Let f : X R, where X Rd is a compact and convex set, be the unknown function modeling the system we want to be optimized. We consider the problem of ﬁnding the maximum of f denoted by:

f(x ) = max x X f(x) ,

via successive queries x1, x2, . . . X. At iteration T + 1, the choice of the next query x T +1 depends on the previous noisy observations, YT = {y1, . . . , y T } at locations XT = {x1, . . . , x T } where yt = f(xt) + ϵt for all t T, and the noise variables ϵ1, . . . , ϵT are independently distributed as a Gaussian random variable N(0, σ2) with zero mean and variance σ2. The efﬁciency of a policy and its ability to address the exploration/exploitation trade-off is measured via the cumulative regret RT , deﬁned as the sum of the instantaneous regret rt, the gaps between the value of the maximum and the values at the sample locations,

rt = f(x ) f(xt) for t T

t=1 f(x ) f(xt) .

Our aim is to obtain upper bounds on the cumulative regret RT with high probability.

2.2. The Gaussian process framework

Prior assumption on f. In order to control the smoothness of the underlying function, we assume that f is sampled from a Gaussian process GP(m, k) with mean function m : X R and kernel function k : X X R+. We formalize in this manner the prior assumption that high local variations of f have low probability. The prior mean function is considered without loss of generality to be zero, as the kernel k can completely deﬁne the GP (Rasmussen & Williams, 2006). We consider the normalized and dimensionless framework introduced by (Srinivas et al., 2010) where the variance is assumed to be bounded, that is k(x, x) 1 for all x X.

Bayesian inference. At iteration T + 1, given the previously observed noisy values YT at locations XT , we use Bayesian inference to compute the current posterior distribution (Rasmussen & Williams, 2006), which is a GP of mean µT +1 and variance σ2 T +1 given at any x X by,

µT +1(x) = k T (x) C 1 T YT (1)

and σ2 T +1(x) = k(x, x) k T (x) C 1 T k T (x) , (2)

where k T (x) = [k(xt, x)]xt XT is the vector of covariances between x and the query points at time T, and CT = KT +σ2I with KT = [k(xt, xt )]xt,xt XT the kernel matrix, σ2 is the variance of the noise and I stands for the identity matrix. The Bayesian inference is illustrated on Figure 1 in a sample problem in dimension one, where the posteriors are based on four observations of a Gaussian Process with squared exponential kernel. The height of the gray area represents two posterior standard deviations at each point.

2.3. The Gaussian Process Mutual Information algorithm

The Gaussian Process Mutual Information algorithm (GPMI) is presented as Algorithm 1. The key statement is the choice of the query, xt = argmaxx X µt(x) + φt(x). The exploitation ability of the procedure is driven by µt, while the exploration is governed by φt : X R, which is an increasing function of σ2 t (x). The novelty in the GP-MI algorithm is that φt is empirically controlled by the amount of exploration that has been already done, that is, the more the algorithm has gathered information on f, the more it will focus on the optimum. In the GP-UCB algorithm from (Srinivas et al., 2012) the exploration coefﬁcient is a O(log t) and therefore tends to inﬁnity. The parameter α in Algorithm 1 governs the trade-off between precision and conﬁdence, as shown in Theorem 2. The efﬁciency of the algorithm is robust to the choice of its value. We conﬁrm empirically this property and provide further discussion on the calibration of α in Section 5.

Gaussian Process Optimization with Mutual Information

Algorithm 1 GP-MI

pγ0 0 for t = 1, 2, . . . do

// Bayesian inference (Eq. 1, 2) Compute µt and σ2 t // Deﬁnition of φt(x) for all x X

φt(x) ?α ˆb

σ2 t (x) + pγt 1 a

// Selection of the next query location xt argmaxx X µt(x) + φt(x)

// Update pγt pγt pγt 1 + σ2 t (xt) // Query Sample xt and observe yt end for

Mutual information. The quantity pγT controlling the exploration in our algorithm forms a lower bound on the information acquired on f by the query points XT . The information on f is formally deﬁned by IT (XT ), the mutual information between f and the noisy observations YT at XT , hence the name of the GP-MI algorithm. For a Gaussian process distribution IT (XT ) = 1 2 log det(I + σ 2KT ) where KT is the kernel matrix [k(xi, xj)]xi,xj XT . We refer to (Cover & Thomas, 1991) for further reading on mutual information. We denote by γT = max XT X:|XT |=T IT (XT ) the maximum mutual information obtainable by a sequence of T query points. In the case of Gaussian processes with bounded variance, the following inequality is satisﬁed (Lemma 5.4 in (Srinivas et al., 2012)):

t=1 σ2 t (xt) C1γT (3)

where C1 = 2 log(1+σ 2) and σ2 is the noise variance. The upper bounds on the cumulative regret we derive in the next section depend mainly on this key quantity.

3. Main Results

3.1. Generic Optimization Scheme

We ﬁrst consider the generic optimization scheme deﬁned in Algorithm 2, where we let φt as a generic function viewed as a parameter of the algorithm. We only require φt to be measurable with respect to Yt 1, the observations available at iteration t 1. The theoretical analysis of Algorithm 2 can be used as a plug-in theorem for existing algorithms. For example the GP-UCB algorithm with parameter βt = O(log t) is obtained with φt(x) = a

βtσ2 t (x). A generic analysis of Algorithm 2 leads to the following upper bounds on the cumulative regret with high probability.

Algorithm 2 Generic Optimization Scheme (φt)

for t = 1, 2, . . . do

Compute µt and φt xt argmaxx X µt(x) + φt(x) Sample xt and observe yt end for

Theorem 1 (Regret bounds for the generic algorithm). For all δ > 0 and T > 0, the regret RT incurred by Algorithm 2 on f distributed as a GP perturbed by independent Gaussian noise with variance σ2 satisﬁes the following bound with high probability, with C1 = 2 log(1+σ 2) and α = log 2

t=1 pφt(xt) φt(x )q

α(C1γT + 1) + ?α

PT t=1 σ2 t (x ) ?C1γT + 1

The proof of Theorem 1, relies on concentration guarantees for Gaussian processes (see Section 4.1). Theorem 1 provides an intermediate result used for the calibration of φt to face the exploration/exploitation trade-off. For example by choosing φt(x) = ?α

2 σ2 t (x) (where the dimensional constant is hidden), Algorithm 2 becomes a variant of the GP-UCB algorithm where in particular the exploration parameter ?βt is ﬁxed to ?α

2 instead of being an increasing function of t. The upper bounds on the cumulative regret with this deﬁnition of φt are of the form RT = O(γT ), as stated in Corollary 1. We then also consider the case where the kernel k of the Gaussian process is under the form of a squared exponential (RBF) kernel, k(x1, x2) =

exp x1 x2 2

2l2 , for all x1, x2 X and length scale l R. In this setting, the maximum mutual information γT satisﬁes the upper bound γT = Op(log T)d+1q, where d is the dimension of the input space (Srinivas et al., 2012).

Corollary 1. Consider the Algorithm 2 where we set φt(x) = ?α

2 σ2 t (x). Under the assumptions of Theorem 1, we have that the cumulative regret for Algorithm 2 satisﬁes the following upper bounds with high probability:

For f sampled from a GP with general kernel: RT = O(γT ).

For f sampled from a GP with RBF kernel: RT = Op(log T)d+1q.

To prove Corollary 1 we apply Theorem 1 with the given deﬁnition of φt and then Equation 3, which leads to Pr r RT ?α

2 C1γT + 4 a

α(C1γT + 1)s 1 δ. The

Gaussian Process Optimization with Mutual Information

previously known upper bounds on the cumulative regret for the GP-UCB algorithm are of the form RT = Op?TβT γT q where βT = Op log T

δ q. The improvement of the generic Algorithm 2 with φt(x) = ?α

2 σ2 t (x) over the GP-UCB algorithm with respect to the cumulative regret is then exponential in the case of Gaussian processes with RBF kernel. For f sampled from a GP with linear kernel, corresponding to f(x) = w T x with w N(0, I), we obtain RT = Opd log Tq. We remark that the GP assumption with linear kernel is more restrictive than the linear bandit framework, as it implies a Gaussian prior over the linear coefﬁcients w. Hence there is no contradiction with the lower bounds stated for linear bandit like those of (Dani et al., 2008). We refer to (Srinivas et al., 2012) for the analysis of γT with other kernels widely used in practice.

3.2. Regret bounds for the GP-MI algorithm

We present here the main result of the paper, the upper bound on the cumulative regret for the GP-MI algorithm. Theorem 2 (Regret bounds for the GP-MI algorithm). For all δ > 0 and T > 1, the regret RT incurred by Algorithm 1 on f distributed as a GP perturbed by independent Gaussian noise with variance σ2 satisﬁes the following bound with high probability, with C1 = 2 log(1+σ 2) and α=log 2

αC1γT + 4?α ı 1 δ .

The proof for Theorem 2 is provided in Section 4.2, where we analyze the properties of the exploration functions φt. Corollary 2 describes the case with RBF kernel for the GPMI algorithm. Corollary 2 (RBF kernels). The cumulative regret RT incurred by Algorithm 1 on f sampled from a GP with RBF kernel satisﬁes with high probability,

RT = O (log T) d+1

The GP-MI algorithm signiﬁcantly improves the upper bounds for the cumulative regret over the GP-UCB algorithm and the alternative policy of Corollary 1.

4. Theoretical analysis

In this section, we provide the proofs for Theorem 1 and Theorem 2. The approach presented here to study the cumulative regret incurred by Gaussian process optimization strategies is general and can be used further for other algorithms.

4.1. Analysis of the general algorithm

The theoretical analysis of Theorem 1 uses a similar approach to the Azuma-Hoeffding inequality adapted for

Gaussian processes. Let rt = f(x ) f(xt) for all t T. We deﬁne MT , which is shown later to be a martingale with respect to YT 1,

rt pµt(x ) µt(xt)q , (4)

for T 1 and M0 = 0. Let Yt be deﬁned as the martingale difference sequence with respect to MT , that is the difference between the instantaneous regret and the gap between the posterior mean for the optimum and the one for the point queried,

Yt = Mt Mt 1 = rt pµt(x ) µt(xt)q for t 1 .

Lemma 1. The sequence MT is a martingale with respect to YT 1 and for all t T, given Yt 1, the random variable Yt is distributed as a Gaussian N(0, ℓ2 t) with zero mean and variance ℓ2 t, where:

ℓ2 t = σ2 t (x ) + σ2 t (xt) 2k(x , xt) . (5)

Proof. From the GP assumption, we know that given Yt 1, the distribution of f(x) is Gaussian Npµt(x), σ2 t (x)q for all x X, and rt is a projection of a Gaussian random vector, that is rt is distributed as a Gaussian Npµt(x ) µt(xt), ℓ2 tq and Yt is distributed as Gaussian N(0, ℓ2 t), with ℓ2 t = σ2 t (x ) + σ2 t (xt) 2k(x , xt), hence MT is a Gaussian martingale.

We now give a concentration result for MT using inequalities for self-normalized martingales.

Lemma 2. For all δ > 0 and T > 1, the martingale MT normalized by the predictable quadratic variation PT t=1 ℓ2 t satisﬁes the following concentration inequality with α = log 2

δ and y = 8(C1γT + 1):

t=1 σ2 t (x )

Proof. Let y = 8(C1γT + 1). We introduce the notation Pr>[A] for Pr[A PT t=1 ℓ2 t > y] and Pr [A] for Pr[A PT t=1 ℓ2 t y]. Given that Mt is a Gaussian martingale, using Theorem 4.2 and Remark 4.2 from (Bercu & Touati, 2008) with M T = PT t=1 ℓ2 t and a = 0 and b = 1 we obtain for all x > 0:

Pr> MT PT t=1 ℓ2 t > x ı < exp x2y

y where α = log 2

δ , we have:

y PT t=1 ℓ2 t ı < δ

Gaussian Process Optimization with Mutual Information

By deﬁnition of ℓt in Eq. 5 and with k(x , xt) 0, we have for all t 1 that ℓ2 t σ2 t (xt) + σ2 t (x ). Using Equation 3 we have pγT y

8, we ﬁnally get:

Pr> MT > ?2αy

y PT t=1 σ2 t (x ) ı < δ

Now, using Theorem 4.1 and Remark 4.2 from (Bercu & Touati, 2008) the following inequality is satisﬁed for all x > 0: Pr r MT > xs < exp x2

With x = ?2αy we have:

Pr r MT > ?2αys < δ

Combining Equations 6 and 7 leads to,

t=1 σ2 t (x )

proving Lemma 2.

The following lemma concludes the proof of Theorem 1 using the previous concentration result and the properties of the generic Algorithm 2.

Lemma 3. The cumulative regret for Algorithm 2 on f sampled from a GP satisﬁes the following bound for all δ > 0 and α and y deﬁned in Lemma 2:

t=1 pφt(xt) φt(x )q + a

t=1 σ2 t (x )

Proof. By construction of the generic Algorithm 2, we have xt = argmaxx X µt(x) + φt(x), which guarantees for all t 1 that µt(x ) µt(xt) φt(xt) φt(x ). Replacing MT by its deﬁnition in Eq. 4 and using the previous property in Lemma 2 proves Lemma 3.

4.2. Analysis of the GP-MI algorithm

In order to bound the cumulative regret for the GP-MI algorithm, we focus on an alternative deﬁnition of the exploration functions φt where the last term is modiﬁed inductively so as to simplify the sum PT t=1 φt(xt) for all T > 0. Being a constant term for a ﬁxed t > 0, Algorithm 1 remains unchanged. Let φt be deﬁned as,

α(σ2 t (x) + pγt 1)

i=1 φi(xi) ,

where xt is the point selected by Algorithm 1 at iteration t. We have for all T > 1,

t=1 φt(xt) = a

t=1 φt(xt) +

We can now derive upper bounds for PT t=1 pφt(xt) φt(x )q which will be plugged in Theorem 1 in order to cancel out the terms involving x . In this manner we can calibrate sharply the exploration/exploitation trade-off by optimizing the remaining terms.

Lemma 4. For the GP-MI algorithm, the exploration term in the equation of Theorem 1 satisﬁes the following inequality:

t=1 pφt(xt) φt(x )q a

PT t=1 σ2 t (x ) a

Proof. Using our alternative deﬁnition of φt which gives the equality stated in Equation 8, we know that,

t=1 pφt(xt) φt(x )q =

pγt 1 + σ2 t (x )

By concavity of the square root, we have for all a b that ?

a + b ?a b 2?a. Introducing the notations at = pγt 1 + σ2 t (x ) and bt = σ2 t (x ), we obtain,

t=1 pφt(xt) φt(x )q a

Moreover, with 0 σ2 t (x) 1 for all x X, we have at pγT + 1 and bt 0 for all t T which gives,

bt ?at PT t=1 σ2 t (x ) a

leading to the inequality of Lemma 4.

The following lemma combines the results from Theorem 1 and Lemma 4 to derive upper bounds on the cumulative regret for the GP-MI algorithm with high probability.

Lemma 5. The cumulative regret for Algorithm 1 on f sampled from a GP satisﬁes the following bound for all δ > 0 and α deﬁned in Lemma 2,

αC1γT + 4?α ı 1 δ .

Gaussian Process Optimization with Mutual Information

Proof. Considering Theorem 1 in the case of the GPMI algorithm and bounding PT t=1 pφt(xt) φt(x )q with Lemma 4, we obtain the following bound on the cumulative regret incurred by GP-MI:

PT t=1 σ2 t (x ) a

α(C1γT + 1) + ?α

PT t=1 σ2 t (x ) ?C1γT + 1

which simpliﬁes to the inequality of Lemma 5 using Equation 3, and thus proves Theorem 2.

5. Practical considerations and experiments

5.1. Numerical experiments

Protocol. We compare the empirical performances of our algorithm against the state-of-the-art of GP optimization, the GP-UCB algorithm (Srinivas et al., 2012), and a commonly used heuristic, the Expected Improvement (EI) algorithm with GP (Jones et al., 1998). The tasks used for assessment come from two real applications and ﬁve synthetic problems described here. For all data sets and algorithms the learners were initialized with a random subset of 10 observations {(xi, yi)}i 10. When the prior distribution of the underlying function was not known, the Bayesian inference was made using a squared exponential kernel. We ﬁrst picked the half of the data set to estimate the hyperparameters of the kernel via cross validation in this subset. In this way, each algorithm was running with the same prior information. The value of the parameter δ for the GP-MI and the GP-UCB algorithms was ﬁxed to δ = 10 6 for all these experimental tasks. Modifying this value by several orders of magnitude is insigniﬁcant with respect to the empirical mean cumulative regret incurred by the algorithms, as discussed in Section 5.2. The results are provided in Figure 3. The curves show the evolution of the average regret RT

T in term of iteration T. We report the mean value with the conﬁdence interval over a hundred experiments.

Description of the data sets. We describe brieﬂy all the data sets used for assessment.

Generated GP. The generated Gaussian process functions are random GPs drawn from an isotropic Mat ern kernel in dimension 2 and 4, with the kernel bandwidth set to 1 for dimension 2, and 16 for dimension 4. The Mat ern parameter was set to ν = 3 and the noise standard deviation to 1% of the signal standard deviation.

Gaussian Mixture. This synthetic function comes from the addition of three 2-D Gaussian functions.

(a) Gaussian mixture

(b) Himmelblau

Figure 2. Visualization of the synthetic functions used for assessment

We then perturb these Gaussian functions with smooth variations generated from a Gaussian Process with isotropic Mat ern Kernel and 1% of noise. It is shown on Figure 2(a). The highest peak being thin, the sequential search for the maximum of this function is quite challenging.

Himmelblau. This task is another synthetic function in dimension 2. We compute a slightly tilted version of the Himmelblau s function with the addition of a linear function, and take the opposite to match the challenge of ﬁnding its maximum. This function presents four peaks but only one global maximum. It gives a practical way to test the ability of a strategy to manage exploration/exploitation trade-offs. It is represented in Figure 2(b).

Gaussian Process Optimization with Mutual Information

Branin. The Branin or Branin-Hoo function is a common benchmark function for global optimization. It presents three global optimum in the 2-D square [ 5, 10] [0, 15]. This benchmark is one of the two synthetic functions used by (Srinivas et al., 2012) to evaluate the empirical performances of the GP-UCB algorithm. No noise has been added to the original signal in this experimental task.

Goldstein-Price. The Goldstein & Price function is an other benchmark function for global optimization, with a single global optimum but several local optima in the 2-D square [ 2, 2] [ 2, 2]. This is the second synthetic benchmark used by (Srinivas et al., 2012). Like in the previous challenge, no noise has been added to the original signal.

Tsunamis. Recent post-tsunami survey data as well as the numerical simulations of (Hill et al., 2012) have shown that in some cases the run-up, which is the maximum vertical extent of wave climbing on a beach, in areas which were supposed to be protected by small islands in the vicinity of coast, was significantly higher than in neighboring locations. Motivated by these observations (Stefanakis et al., 2012) investigated this phenomenon by employing numerical simulations using the VOLNA code (Dutykh et al., 2011) with the simpliﬁed geometry of a conical island sitting on a ﬂat surface in front of a sloping beach. In the study of (Stefanakis et al., 2013) the setup was controlled by ﬁve physical parameters and the aim was to ﬁnd with conﬁdence and with the least number of simulations the parameters leading to the maximum run-up ampliﬁcation.

Mackey-Glass function. The Mackey-Glass delaydifferential equation is a chaotic system in dimension 6, but without noise. It models real feedback systems and is used in physiological domains such as hematology, cardiology, neurology, and psychiatry. The highly chaotic behavior of this function makes it an exceptionally difﬁcult optimization problem. It has been used as a benchmark for example by (Flake & Lawrence, 2002).

Empirical comparison of the algorithms. Figure 3 compares the empirical mean average regret RT

T for the three algorithms. On the easy optimization assessments like the Branin data set (Fig. 3(e)) the three strategies behave in a comparable manner, but the GP-UCB algorithm incurs a larger cumulative regret. For more difﬁcult assessments the GP-UCB algorithm performs poorly and our algorithm always surpasses the EI heuristic. The improvement of the GP-MI algorithm against the two competitors is the most signiﬁcant for exceptionally challenging optimization tasks as illustrated in Figures 3(a) to 3(d) and

50 100 150 200 250 0

(a) Generated GP(d=2)

100 200 300 400 500 0

(b) Generated GP(d=4)

0 200 400 600 800 1,000 0

(c) Gaussian Mixture

0 50 100 150 200 250 0

(d) Himmelblau

0 50 100 150 200 250 0

0 50 100 150 200 250 0

(f) Goldstein

0 100 200 300 400 500 0

(g) Tsunamis

0 200 400 600 800 1,000 0

(h) Mackey-Glass

Figure 3. Empirical mean and conﬁdence interval of the average regret RT

T in term of iteration T on real and synthetic tasks for the GP-MI and GP-UCB algorithms and the EI heuristic (lower is better).

3(h), where the underlying functions present several local optima. The ability of our algorithm to deal with the exploration/exploitation trade-off is emphasized by these experimental results as its average regret decreases directly after the ﬁrst iterations, avoiding unwanted exploration like GP-UCB on Figures 3(a) to 3(d), or getting stuck in some local optimum like EI on Figures 3(c), 3(g) and 3(h). We further mention that the GP-MI algorithm is empirically robust against the number of dimensions of the data set (Fig. 3(b), 3(g), 3(h)).

5.2. Practical aspects

Calibration of α. The value of the parameter α is chosen following Theorem 2 as α = log 2

δ with 0 < δ < 1 being a conﬁdence parameter. The guarantees we prove in Section 4.2 on the cumulative regret for the GP-MI algorithm holds with probability at least 1 δ. With α increasing linearly for δ decreasing exponentially toward 0, the algorithm is

Gaussian Process Optimization with Mutual Information

50 100 150 200 250 0

Figure 4. Small impact of the value of δ on the mean average regret of the GP-MI algorithm running on the Himmelblau data set.

robust to the choice of δ. We present on Figure 4 the small impact of δ on the average regret for four different values selected on a wide range.

Numerical Complexity. Even if the numerical cost of GP-MI is insigniﬁcant in practice compared to the cost of the evaluation of f, the complexity of the sequential Bayesian update (Osborne, 2010) is O(T 2) and might be prohibitive for large T. One can reduce drastically the computational time by means of Lazy Variance Calculation (Desautels et al., 2012), built on the fact that σ2 T (x) always decreases for increasing T and for all x X. We further mention that approximated inference algorithms such as the EP approximation and MCMC sampling (Kuss et al., 2005) can be used as an alternative if the computational time is a restrictive factor.

6. Conclusion

We introduced the GP-MI algorithm for GP optimization and prove upper bounds on its cumulative regret which improve exponentially the state-of-the-art in common settings. The theoretical analysis was presented in a generic framework in order to expand its impact to other similar algorithms. The experiments we performed on real and synthetic assessments conﬁrmed empirically the efﬁciency of our algorithm against both the theoretical state-of-the-art of GP optimization, the GP-UCB algorithm, and the commonly used EI heuristic.

ACKNOWLEDGEMENTS

The authors would like to thank David Buffoni and Rapha el Bonaque for fruitful discussions. The author also thank the anonymous reviewers for their detailed feedback.

Audibert, J-Y., Bubeck, S., and Munos, R. Bandit view on noisy optimization. In Optimization for Machine Learning, pp. 431 454. MIT Press, 2011.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002.

Bercu, B. and Touati, A. Exponential inequalities for selfnormalized martingales with applications. The Annals of Applied Probability, 18(5):1848 1869, 2008.

Bubeck, S., Munos, R., Stoltz, G., and Szepesv ari, C. Xarmed bandits. Journal of Machine Learning Research, 12:1655 1695, 2011.

Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., and Auer, P. Upper-conﬁdence-bound algorithms for active learning in multi-armed bandits. In Proceedings of the International Conference on Algorithmic Learning Theory, pp. 189 203. Springer-Verlag, 2011.

Chen, Y. and Krause, A. Near-optimal batch mode active learning and adaptive submodular optimization. In Proceedings of the International Conference on Machine Learning. icml.cc / Omnipress, 2013.

Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel Gaussian process optimization with upper conﬁdence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, volume 8188, pp. 225 240. Springer Berlin Heidelberg, 2013.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 1991.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355 366, 2008.

de Freitas, N., Smola, A. J., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning. icml.cc / Omnipress, 2012.

Desautels, T., Krause, A., and Burdick, J.W. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning, pp. 1191 1198. icml.cc / Omnipress, 2012.

Dutykh, D., Poncet, R, and Dias, F. The VOLNA code for the numerical modelling of tsunami waves: generation, propagation and inundation. European Journal of Mechanics B/Fluids, 30:598 615, 2011.

Gaussian Process Optimization with Mutual Information

Flake, G. W. and Lawrence, S. Efﬁcient SVM regression training with SMO. Machine Learning, 46(1-3):271 290, 2002.

Floudas, C.A. and Pardalos, P.M. Optimization in Computational Chemistry and Molecular Biology: Local and Global Approaches. Nonconvex Optimization and Its Applications. Springer, 2000.

Grunewalder, S., Audibert, J-Y., Opper, M., and Shawe Taylor, J. Regret bounds for Gaussian process bandit problems. In Proceedings of the International Conference on Artiﬁcial Intelligence and Statistics, pp. 273 280. MIT Press, 2010.

Hill, E. M., Borrero, J. C., Huang, Z., Qiu, Q., Banerjee, P., Natawidjaja, D. H., Elosegui, P., Fritz, H. M., Suwargadi, B. W., Pranantyo, I. R., Li, L., Macpherson, K. A., Skanavis, V., Synolakis, C. E., and Sieh, K. The 2010 Mw 7.8 Mentawai earthquake: Very shallow source of a rare tsunami earthquake determined from tsunami ﬁeld survey and near-ﬁeld GPS data. J. Geophys. Res., 117: B06402 , 2012.

Jones, D. R., Schonlau, M., and Welch, W. J. Efﬁcient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455 492, December 1998.

Kleinberg, R. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17, pp. 697 704. MIT Press, 2004.

Krause, A. and Ong, C.S. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems 24, pp. 2447 2455, 2011.

Kuss, M., Pﬁngsten, T., Csat o, L., and Rasmussen, C.E. Approximate inference for robust Gaussian process regression. Max Planck Inst. Biological Cybern., Tubingen, Germany Tech. Rep, 136, 2005.

Mockus, J. Bayesian approach to global optimization. Mathematics and its applications. Kluwer Academic, 1989.

Osborne, Michael. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. Ph D thesis, Oxford University New College, 2010.

Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2960 2968, 2012.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning, pp. 1015 1022. icml.cc / Omnipress, 2010.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250 3265, 2012.

Stefanakis, T. S., Dias, F., Vayatis, N., and Guillas, S. Long-wave runup on a plane beach behind a conical island. In Proceedings of the World Conference on Earthquake Engineering, 2012.

Stefanakis, T. S., Contal, E., Vayatis, N., Dias, F., and Synolakis, C. E. Can small islands protect nearby coasts from tsunamis ? An active experimental design approach. ar Xiv preprint ar Xiv:1305.7385, 2013.

Wang, G. and Shan, S. Review of metamodeling techniques in support of engineering design optimization. Journal of Mechanical Design, 129(4):370 380, 2007.

Ziemba, W. T. and Vickson, R. G. Stochastic optimization models in ﬁnance. World Scientiﬁc Singapore, 2006.