# Adversarially Robust Optimization with Gaussian Processes

Ilija Bogunovic (LIONS, EPFL, ilija.bogunovic@epfl.ch), Jonathan Scarlett (National University of Singapore, scarlett@comp.nus.edu.sg), Stefanie Jegelka (MIT CSAIL, stefje@mit.edu), Volkan Cevher (LIONS, EPFL, volkan.cevher@epfl.ch)

**Abstract.** In this paper, we consider the problem of Gaussian process (GP) optimization with an added robustness requirement: the returned point may be perturbed by an adversary, and we require the function value to remain as high as possible even after this perturbation. This problem is motivated by settings in which the underlying functions during the optimization and implementation stages are different, or when one is interested in finding an entire region of good inputs rather than only a single point. We show that standard GP optimization algorithms do not exhibit the desired robustness properties, and provide a novel confidence-bound based algorithm STABLEOPT for this purpose. We rigorously establish the required number of samples for STABLEOPT to find a near-optimal point, and we complement this guarantee with an algorithm-independent lower bound. We experimentally demonstrate several potential applications of interest using real-world data sets, and we show that STABLEOPT consistently succeeds in finding a stable maximizer where several baseline methods fail.

## 1 Introduction

Gaussian processes (GPs) provide a powerful means for sequentially optimizing a black-box function f that is costly to evaluate and for which noisy point evaluations are available. Since its introduction, this approach has successfully been applied to numerous applications, including robotics [21], hyperparameter tuning [30], recommender systems [34], environmental monitoring [31], and more.

In many such applications, one is faced with various forms of uncertainty that are not accounted for by standard algorithms. In robotics, the optimization is often performed via simulations, creating a mismatch between the assumed function and the true one; in hyperparameter tuning, the function is typically similarly mismatched due to limited training data; in recommender systems and several other applications, the underlying function is inherently time-varying, so the returned solution may become increasingly stale over time; the list goes on.

In this paper, we address these considerations by studying the GP optimization problem with an additional requirement of adversarial robustness: the returned point may be perturbed by an adversary, and we require the function value to remain as high as possible even after this perturbation. This problem is of interest not only for attaining improved robustness to uncertainty, but also for settings where one seeks a region of good points rather than a single point, and for other related max-min optimization settings (see Section 4 for further discussion).

**Related work.** Numerous algorithms have been developed for GP optimization in recent years [7, 16, 17, 26, 28, 31, 35]. Beyond the standard setting, several important extensions have been considered, including batch sampling [11, 12, 14], contextual and time-varying settings [6, 20], safety requirements [33], and high-dimensional settings [18, 25, 36], just to name a few. Various forms of robustness in GP optimization have been considered previously.
A prominent example is that of outliers [22], in which certain function values are highly unreliable; however, this is a separate issue from that of the present paper, since in [22] the returned point does not undergo any perturbation. Another related recent work is [2], which assumes that the sampled points (rather than the returned one) are subject to uncertainty. In addition to this difference, the uncertainty in [2] is random rather than adversarial, which is complementary but distinct from our work. The same is true of a setting called unscented Bayesian optimization in [23]. Moreover, no theoretical results are given in [2, 23]. In [8], a robust form of batch optimization is considered, but with yet another form of robustness, namely, some experiments in the batch may fail to produce an outcome. Level-set estimation [7, 15] is another approach to finding regions of good points rather than a single point.

Our problem formulation is also related to other works on non-convex robust optimization, particularly those of Bertsimas et al. [3, 4]. In these works, a stable design $x$ is sought that solves $\min_{x \in D} \max_{\delta \in U} f(x + \delta)$. Here, $\delta$ resides in some uncertainty set $U$, and represents the perturbation against which the design $x$ needs to be protected. Related problems have also recently been considered in the context of adversarial training (e.g., [29]). Compared to these works, our work bears the crucial difference that the objective function is unknown, and we can only learn about it through noisy point evaluations (i.e., bandit feedback). Other works, such as [5, 9, 19, 32, 37], have considered robust optimization problems of the following form: for a given set of objectives $\{f_1, \dots, f_m\}$, find $x$ achieving $\max_{x \in D} \min_{i=1,\dots,m} f_i(x)$. We discuss variations of our algorithm for this type of formulation in Section 4.

**Contributions.** We introduce a variant of GP optimization in which the returned solution is required to exhibit stability/robustness to an adversarial perturbation. We demonstrate the failures of standard algorithms, and introduce a new algorithm STABLEOPT that overcomes these limitations. We provide a novel theoretical analysis characterizing the number of samples required for STABLEOPT to attain a near-optimal robust solution, and we complement this with an algorithm-independent lower bound. We provide several variations of our max-min optimization framework and theory, including connections and comparisons to previous works. Finally, we experimentally demonstrate a variety of potential applications of interest using real-world data sets, and we show that STABLEOPT consistently succeeds in finding a stable maximizer where several baseline methods fail.

## 2 Problem Setup

**Model.** Let $f$ be an unknown reward function over a domain $D \subseteq \mathbb{R}^p$ for some dimension $p$. At time $t$, we query $f$ at a single point $x_t \in D$ and observe a noisy sample $y_t = f(x_t) + z_t$, where $z_t \sim N(0, \sigma^2)$. After $T$ rounds, a recommended point $x^{(T)}$ is returned. In contrast with the standard goal of making $f(x^{(T)})$ as high as possible, we seek to find a point such that $f$ remains high even after an adversarial perturbation; a formal description is given below.

We assume that $D$ is endowed with a kernel function $k(\cdot, \cdot)$, and $f$ has a bounded norm in the corresponding Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_k(D)$. Specifically, we assume that $f \in \mathcal{F}_k(B)$, where
$$\mathcal{F}_k(B) = \{f \in \mathcal{H}_k(D) : \|f\|_k \le B\}, \quad (1)$$
and $\|f\|_k$ is the RKHS norm in $\mathcal{H}_k(D)$.
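To make the evaluation model concrete, the following is a minimal sketch of our own (not from the paper; all names are hypothetical, and an SE kernel is assumed). It exploits the fact that any finite kernel expansion $f(\cdot) = \sum_i \alpha_i k(c_i, \cdot)$ lies in $\mathcal{H}_k(D)$ with $\|f\|_k^2 = \alpha^T K \alpha$, and simulates the noisy feedback $y_t = f(x_t) + z_t$:

```python
import numpy as np

# A minimal sketch (not the paper's code) of the observation model in Section 2.
# f is built as a finite kernel expansion, which is guaranteed to lie in the
# RKHS H_k(D), with RKHS norm ||f||_k^2 = alpha^T K alpha.

def k_se(a, b, lengthscale=0.2):
    """Squared exponential kernel matrix; normalized so that k(x, x) = 1."""
    d = np.linalg.norm(np.atleast_2d(a)[:, None, :] - np.atleast_2d(b)[None, :, :], axis=-1)
    return np.exp(-d**2 / (2 * lengthscale**2))

rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, size=(5, 1))   # expansion points c_i in D = [0, 1]
alpha = rng.normal(size=5)                 # expansion coefficients alpha_i

f = lambda x: k_se(np.atleast_2d(x), centers) @ alpha        # f(x) = k(x, C) alpha
rkhs_norm = np.sqrt(alpha @ k_se(centers, centers) @ alpha)  # ||f||_k, cf. (1)

sigma = 0.1                                # noise standard deviation
x_t = np.array([[0.3]])
y_t = f(x_t) + sigma * rng.normal()        # noisy sample y_t = f(x_t) + z_t
print(f"||f||_k = {rkhs_norm:.2f}, y_t = {y_t.item():.3f}")
```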
It is well-known that this assumption permits the construction of confidence bounds via Gaussian process (GP) methods; see Lemma 1 below for a precise statement. We assume that the kernel is normalized to satisfy $k(x, x) = 1$ for all $x \in D$. Two commonly-considered kernels are the squared exponential (SE) and Matérn kernels:
$$k_{\text{SE}}(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right), \quad (2)$$
$$k_{\text{Mat}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - x'\|}{l}\right)^{\nu} J_{\nu}\left(\frac{\sqrt{2\nu}\,\|x - x'\|}{l}\right), \quad (3)$$
where $l$ denotes the length-scale, $\nu > 0$ is an additional parameter that dictates the smoothness, and $J_\nu(\cdot)$ and $\Gamma(\cdot)$ denote the modified Bessel function and the gamma function, respectively [24].

[Figure 1: (Left) A function $f$ and its maximizer $x^*_0$. (Middle) For $\epsilon = 0.06$ and $d(x, x') = |x - x'|$, the decision corresponding to the wider local maximum of $f$ is the optimal $\epsilon$-stable decision. (Right) GP-UCB selects a point that nearly maximizes $f$, but is suboptimal in the $\epsilon$-stable sense.]

Given a sequence of decisions $\{x_1, \dots, x_t\}$ and their noisy observations $\{y_1, \dots, y_t\}$, the posterior distribution under a $GP(0, k(x, x'))$ prior is also Gaussian, with the following mean and variance:
$$\mu_t(x) = k_t(x)^T \big(K_t + \sigma^2 I_t\big)^{-1} y_t, \quad (4)$$
$$\sigma_t^2(x) = k(x, x) - k_t(x)^T \big(K_t + \sigma^2 I_t\big)^{-1} k_t(x), \quad (5)$$
where $k_t(x) = [k(x_i, x)]_{i=1}^t$, $y_t = [y_1, \dots, y_t]^T$, and $K_t = [k(x_i, x_j)]_{i,j \le t}$ is the kernel matrix.

**Optimization goal.** Let $d(x, x')$ be a function mapping $D \times D \to \mathbb{R}$, and let $\epsilon \ge 0$ be a constant known as the stability parameter. For each point $x \in D$, we define the set of admissible perturbations
$$\Delta_\epsilon(x) = \big\{x' - x : x' \in D \text{ and } d(x, x') \le \epsilon\big\}. \quad (6)$$
One can interpret this as the set of perturbations of $x$ such that the newly obtained point $x'$ is within a distance $\epsilon$ of $x$. While we refer to $d(\cdot, \cdot)$ as the distance function throughout the paper, we allow it to be a general function, and not necessarily a distance in the mathematical sense. As we exemplify in Section 5, the parameter $\epsilon$ might be naturally specified as part of the application, or might be better treated as a parameter that can be tuned for the purpose of the overall learning goal.

We define an $\epsilon$-stable optimal input to be any
$$x^*_\epsilon \in \arg\max_{x \in D} \min_{\delta \in \Delta_\epsilon(x)} f(x + \delta). \quad (7)$$
Our goal is to report a point $x^{(T)}$ that is stable in the sense of having low $\epsilon$-regret, defined as
$$r_\epsilon(x) = \min_{\delta \in \Delta_\epsilon(x^*_\epsilon)} f(x^*_\epsilon + \delta) - \min_{\delta \in \Delta_\epsilon(x)} f(x + \delta). \quad (8)$$
Note that once $r_\epsilon(x) \le \eta$ for some accuracy value $\eta \ge 0$, it follows that
$$\min_{\delta \in \Delta_\epsilon(x)} f(x + \delta) \ge \min_{\delta \in \Delta_\epsilon(x^*_\epsilon)} f(x^*_\epsilon + \delta) - \eta. \quad (9)$$
We assume that $d(\cdot, \cdot)$ and $\epsilon$ are known, i.e., they are specified as part of the optimization formulation. As a running example, we consider the case that $d(x, x') = \|x - x'\|$ for some norm $\|\cdot\|$ (e.g., the 2-norm), in which case achieving low $\epsilon$-regret amounts to favoring broad peaks instead of narrow ones, particularly for higher $\epsilon$; see Figure 1 for an illustration. In Section 4, we discuss how our framework also captures a variety of other max-min optimization settings of interest.

**Failure of classical methods.** Various algorithms have been developed for achieving small regret in the standard GP optimization problem. A prominent example is GP-UCB, which chooses
$$x_t \in \arg\max_{x \in D} \mathrm{ucb}_{t-1}(x), \quad (10)$$
where $\mathrm{ucb}_{t-1}(x) := \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$. This algorithm is guaranteed to achieve sublinear cumulative regret with high probability [31], for a suitably chosen $\beta_t$. While this is useful when $x^*_\epsilon = x^*_0$,¹ in general, for a given fixed $\epsilon \ne 0$, these two decisions may not coincide; hence, $\min_{\delta \in \Delta_\epsilon(x^*_0)} f(x^*_0 + \delta)$ can be significantly smaller than $\min_{\delta \in \Delta_\epsilon(x^*_\epsilon)} f(x^*_\epsilon + \delta)$.

¹In this discussion, we take $d(x, x') = \|x - x'\|_2$, so that $\epsilon = 0$ recovers the standard non-stable regret [31].

A visual example is given in Figure 1 (Right), where the selected point of GP-UCB for $t = 20$ is shown. This point nearly maximizes $f$, but it is strictly suboptimal in the $\epsilon$-stable sense. The same limitation applies to other GP optimization strategies (e.g., [7, 16, 17, 26, 28, 35]) whose goal is to identify the global non-robust maximum $x^*_0$. In Section 5, we will see that more advanced baseline strategies also perform poorly when applied to our problem.

## 3 Proposed Algorithm and Theory

Our proposed algorithm, STABLEOPT, is described in Algorithm 1, and makes use of the following confidence bounds depending on an exploration parameter $\beta_t$ (cf. Lemma 1 below):
$$\mathrm{ucb}_{t-1}(x) := \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x), \quad (11)$$
$$\mathrm{lcb}_{t-1}(x) := \mu_{t-1}(x) - \beta_t^{1/2} \sigma_{t-1}(x). \quad (12)$$

**Algorithm 1** The STABLEOPT algorithm
Input: Domain $D$, GP prior $(\mu_0, \sigma_0, k)$, parameters $\{\beta_t\}_{t \ge 1}$, stability $\epsilon$, distance function $d(\cdot, \cdot)$
1: for $t = 1, 2, \dots, T$ do
2:   Set $\tilde{x}_t = \arg\max_{x \in D} \min_{\delta \in \Delta_\epsilon(x)} \mathrm{ucb}_{t-1}(x + \delta) \quad (13)$
3:   Set $\delta_t = \arg\min_{\delta \in \Delta_\epsilon(\tilde{x}_t)} \mathrm{lcb}_{t-1}(\tilde{x}_t + \delta)$
4:   Sample $\tilde{x}_t + \delta_t$, and observe $y_t = f(\tilde{x}_t + \delta_t) + z_t$
5:   Update $\mu_t$, $\sigma_t$, $\mathrm{ucb}_t$, and $\mathrm{lcb}_t$ according to (4)–(5) and (11)–(12), by including $(\tilde{x}_t + \delta_t, y_t)$
6: end for

The point $\tilde{x}_t$ defined in (13) is the one having the highest stable upper confidence bound. However, the queried point is not $\tilde{x}_t$, but instead $\tilde{x}_t + \delta_t$, where $\delta_t \in \Delta_\epsilon(\tilde{x}_t)$ is chosen to minimize the lower confidence bound. As a result, the algorithm is based on two distinct principles: (i) optimism in the face of uncertainty when it comes to selecting $\tilde{x}_t$; (ii) pessimism in the face of uncertainty when it comes to anticipating the perturbation of $\tilde{x}_t$. The first of these is inherent to existing algorithms such as GP-UCB [31], whereas the second is unique to the adversarially robust GP optimization problem. An example illustration of STABLEOPT's execution is given in the supplementary material.

We have left the final reported point $x^{(T)}$ unspecified in Algorithm 1, as there are numerous reasonable choices. The simplest choice is to simply return $x^{(T)} = \tilde{x}_T$, but in our theory and experiments, we will focus on $x^{(T)}$ equaling the point in $\{\tilde{x}_1, \dots, \tilde{x}_T\}$ with the highest lower confidence bound.

Finding an exact solution to the optimization of the acquisition function in (13) can be challenging in practice. When $D$ is continuous, a natural approach is to find an approximate solution using an efficient local search algorithm for robust optimization with a fully known objective function, such as that of [4].

### 3.1 Upper bound on $\epsilon$-regret

Our analysis makes use of the maximum information gain under $t$ noisy measurements:
$$\gamma_t = \max_{x_1, \dots, x_t} \frac{1}{2} \log \det\big(I_t + \sigma^{-2} K_t\big), \quad (14)$$
which has been used in numerous theoretical works on GP optimization following [31].

STABLEOPT depends on the exploration parameter $\beta_t$, which determines the width of the confidence bounds. In our main result, we set $\beta_t$ as in [10] and make use of the following.

**Lemma 1.** [10] Fix $f \in \mathcal{F}_k(B)$, and consider the sampling model $y_t = f(x_t) + z_t$ with $z_t \sim N(0, \sigma^2)$, with independence between times. Under the choice $\beta_t^{1/2} = B + \sigma \sqrt{2\big(\gamma_{t-1} + \log \frac{e}{\xi}\big)}$, the following holds with probability at least $1 - \xi$:
$$\mathrm{lcb}_{t-1}(x) \le f(x) \le \mathrm{ucb}_{t-1}(x), \quad \forall x \in D,\ \forall t \ge 1. \quad (15)$$

The following theorem bounds the performance of STABLEOPT under a suitable choice of the recommended point $x^{(T)}$. The proof is given in the supplementary material.

**Theorem 1.** (Upper Bound) Fix $\epsilon > 0$, $\eta > 0$, $B > 0$, $T \in \mathbb{Z}$, $\xi \in (0, 1)$, and a distance function $d(x, x')$, and suppose that
$$\frac{T}{\beta_T \gamma_T} \ge \frac{C_1}{\eta^2}, \quad (16)$$
where $C_1 = 8 / \log(1 + \sigma^{-2})$.
For any $f \in \mathcal{F}_k(B)$, STABLEOPT with $\beta_t$ set as in Lemma 1 achieves $r_\epsilon(x^{(T)}) \le \eta$ after $T$ rounds with probability at least $1 - \xi$, where
$$x^{(T)} = \tilde{x}_{t^*}, \qquad t^* = \arg\max_{t \in \{1, \dots, T\}} \min_{\delta \in \Delta_\epsilon(\tilde{x}_t)} \mathrm{lcb}_{t-1}(\tilde{x}_t + \delta). \quad (17)$$

This result holds for general kernels, and for both finite and continuous $D$. Our analysis bounds function values according to the confidence bounds in Lemma 1 analogously to GP-UCB [31], but also addresses the non-trivial challenge of characterizing the perturbations $\delta_t$. While we focused on the non-Bayesian RKHS setting, the proof can easily be adapted to handle the Bayesian optimization (BO) setting in which $f \sim GP(0, k)$; see Section 4 for further discussion.

Theorem 1 can be made more explicit by substituting bounds on $\gamma_T$; in particular, $\gamma_T = O\big((\log T)^{p+1}\big)$ for the SE kernel, and $\gamma_T = O\big(T^{\frac{p(p+1)}{2\nu + p(p+1)}} \log T\big)$ for the Matérn kernel [31]. The former yields $T = O^*\big(\frac{1}{\eta^2} \big(\log \frac{1}{\eta}\big)^{2(p+1)}\big)$ in Theorem 1 for constant $B$, $\sigma^2$, and $\xi$ (where $O^*(\cdot)$ hides dimension-independent log factors), which we will shortly see nearly matches an algorithm-independent lower bound.

### 3.2 Lower bound on $\epsilon$-regret

Establishing lower bounds under general kernels and input domains is an open problem even in the non-robust setting. Accordingly, the following theorem focuses on a more specific setting than the upper bound: we let the input domain be $[0, 1]^p$ for some dimension $p$, and we focus on the SE and Matérn kernels. In addition, we only consider the case that $d(x, x') = \|x - x'\|_2$, though extensions to other norms (e.g., $\ell_1$ or $\ell_\infty$) follow immediately from the proof.

**Theorem 2.** (Lower Bound) Let $D = [0, 1]^p$ for some dimension $p$, and set $d(x, x') = \|x - x'\|_2$. Fix $\eta \in (0, 1)$, $B > 0$, and $T \in \mathbb{Z}$. Suppose there exists an algorithm that, for any $f \in \mathcal{F}_k(B)$, reports a point $x^{(T)}$ achieving $\epsilon$-regret $r_\epsilon(x^{(T)}) \le \eta$ after $T$ rounds with probability at least $1 - \xi$. Then, provided that $\frac{\eta}{B}$ and $\xi$ are sufficiently small, we have the following:

1. For $k = k_{\text{SE}}$, it is necessary that $T = \Omega\Big(\frac{\sigma^2}{\eta^2} \big(\log \frac{B}{\eta}\big)^{p/2}\Big)$.
2. For $k = k_{\text{Mat}}$, it is necessary that $T = \Omega\Big(\frac{\sigma^2}{\eta^2} \big(\frac{B}{\eta}\big)^{p/\nu}\Big)$.

Here we assume that the stability parameter $\epsilon$, dimension $p$, target probability $\xi$, and kernel parameters $l$, $\nu$ are fixed (i.e., not varying as a function of the parameters $T$, $\eta$, $\sigma$, and $B$).

The proof is based on constructing a finite subset of difficult functions in $\mathcal{F}_k(B)$ and applying lower bounding techniques from the multi-armed bandit literature [1], also making use of several auxiliary results from the non-robust setting [27]. More specifically, the functions in the restricted class consist of narrow negative valleys that the adversary can perturb the reported point into, but that are hard to identify until a large number of samples have been taken.

For constant $\sigma^2$ and $B$, the condition for the SE kernel simplifies to $T = \Omega\big(\frac{1}{\eta^2} \big(\log \frac{1}{\eta}\big)^{p/2}\big)$, nearly matching the upper bound $T = O^*\big(\frac{1}{\eta^2} \big(\log \frac{1}{\eta}\big)^{2(p+1)}\big)$ of STABLEOPT established above. In the case of the Matérn kernel, more significant gaps remain between the upper and lower bounds; however, similar gaps remain even in the standard (non-robust) setting [27].

## 4 Variations of STABLEOPT

While the above problem formulation seeks robustness within an $\epsilon$-ball corresponding to the distance function $d(\cdot, \cdot)$, our algorithm and theory apply to a variety of seemingly distinct settings. We outline a few such settings here; in the supplementary material, we give details of their derivations.

**Robust Bayesian optimization.** Algorithm 1 and Theorem 1 extend readily to the Bayesian setting in which $f \sim GP(0, k(x, x'))$. In particular, since the proof of Theorem 1 is based on confidence bounds, the only change required is selecting $\beta_t$ to be that used for the Bayesian setting in [31].
As a result, our framework also captures the novel problem of adversarially robust Bayesian optimization. All of the variations outlined below similarly apply to both the Bayesian and non-Bayesian settings.

**Robustness to unknown parameters.** Consider the scenario where an unknown function $f : D \times \Theta \to \mathbb{R}$ has a bounded RKHS norm under some composite kernel $k((x, \theta), (x', \theta'))$. Some special cases include $k((x, \theta), (x', \theta')) = k(x, x') + k(\theta, \theta')$ and $k((x, \theta), (x', \theta')) = k(x, x') \cdot k(\theta, \theta')$ [20]. The posterior mean $\mu_t(x, \theta)$ and variance $\sigma_t^2(x, \theta)$ conditioned on the previous observations $(x_1, \theta_1, y_1), \dots, (x_{t-1}, \theta_{t-1}, y_{t-1})$ are computed analogously to (5) [20]. A robust optimization formulation in this setting is to seek $x$ that solves
$$\max_{x \in D} \min_{\theta \in \Theta} f(x, \theta). \quad (18)$$
That is, we seek to find a configuration $x$ that is robust against any possible parameter vector $\theta \in \Theta$. Potential applications of this setup include the following:

- In areas such as robotics, we may have numerous parameters to tune (given by $x$ and $\theta$ collectively), but when it comes to implementation, some of them (i.e., only $\theta$) become out of our control. Hence, we need to be robust against whatever values they may take.
- We wish to tune hyperparameters in order to make an algorithm work simultaneously for a number of distinct data types that bear some similarities/correlations. The data types are represented by $\theta$, and we can choose the data type to our liking during the optimization stage.

STABLEOPT can be used to solve (18); we maintain $\theta_t$ instead of $\delta_t$, and modify the main steps to
$$\tilde{x}_t \in \arg\max_{x \in D} \min_{\theta \in \Theta} \mathrm{ucb}_{t-1}(x, \theta), \quad (19)$$
$$\theta_t \in \arg\min_{\theta \in \Theta} \mathrm{lcb}_{t-1}(\tilde{x}_t, \theta). \quad (20)$$
At each time step, STABLEOPT receives a noisy observation $y_t = f(\tilde{x}_t, \theta_t) + z_t$, which is used along with $(\tilde{x}_t, \theta_t)$ for computing the posterior. As explained in the supplementary material, the guarantee $r_\epsilon(x^{(T)}) \le \eta$ in Theorem 1 is replaced by $\min_{\theta \in \Theta} f(x^{(T)}, \theta) \ge \max_{x \in D} \min_{\theta \in \Theta} f(x, \theta) - \eta$.

**Robust estimation.** Continuing with the consideration of a composite kernel on $(x, \theta)$, we consider the following practical problem variant proposed in [4]. Let $\hat{\theta} \in \Theta$ be an estimate of the true problem coefficient $\theta \in \Theta$. Since $\hat{\theta}$ is an estimate, the true coefficient satisfies $\theta = \hat{\theta} + \delta_\theta$, where $\delta_\theta$ represents the uncertainty in $\hat{\theta}$. Often, practitioners solve $\max_{x \in D} f(x, \hat{\theta})$ and ignore this uncertainty. As a more sophisticated approach, we let $\Delta_\epsilon(\hat{\theta}) = \{\theta - \hat{\theta} : \theta \in \Theta \text{ and } d(\theta, \hat{\theta}) \le \epsilon\}$, and consider the following robust problem formulation:
$$\max_{x \in D} \min_{\delta_\theta \in \Delta_\epsilon(\hat{\theta})} f(x, \hat{\theta} + \delta_\theta). \quad (21)$$
For the given estimate $\hat{\theta}$, the main steps of STABLEOPT in this setting are
$$\tilde{x}_t \in \arg\max_{x \in D} \min_{\delta_\theta \in \Delta_\epsilon(\hat{\theta})} \mathrm{ucb}_{t-1}(x, \hat{\theta} + \delta_\theta), \quad (22)$$
$$\delta_{\theta, t} \in \arg\min_{\delta_\theta \in \Delta_\epsilon(\hat{\theta})} \mathrm{lcb}_{t-1}(\tilde{x}_t, \hat{\theta} + \delta_\theta), \quad (23)$$
and the noisy observations take the form $y_t = f(\tilde{x}_t, \hat{\theta} + \delta_{\theta, t}) + z_t$. The guarantee $r_\epsilon(x^{(T)}) \le \eta$ in Theorem 1 is replaced by $\min_{\delta_\theta \in \Delta_\epsilon(\hat{\theta})} f(x^{(T)}, \hat{\theta} + \delta_\theta) \ge \max_{x \in D} \min_{\delta_\theta \in \Delta_\epsilon(\hat{\theta})} f(x, \hat{\theta} + \delta_\theta) - \eta$.

**Robust group identification.** In some applications, it is natural to partition $D$ into disjoint subsets, and search for the subset with the highest worst-case function value (see Section 5 for a movie recommendation example). Letting $\mathcal{G} = \{G_1, \dots, G_k\}$ denote the groups that partition the input space, the robust optimization problem is given by
$$\max_{G \in \mathcal{G}} \min_{x \in G} f(x), \quad (24)$$
and the algorithm reports a group $G^{(T)}$. The main steps of STABLEOPT are given by
$$\tilde{G}_t \in \arg\max_{G \in \mathcal{G}} \min_{x \in G} \mathrm{ucb}_{t-1}(x), \quad (25)$$
$$x_t \in \arg\min_{x \in \tilde{G}_t} \mathrm{lcb}_{t-1}(x), \quad (26)$$
and the observations are of the form $y_t = f(x_t) + z_t$ as usual. The guarantee $r_\epsilon(x^{(T)}) \le \eta$ in Theorem 1 is replaced by $\min_{x \in G^{(T)}} f(x) \ge \max_{G \in \mathcal{G}} \min_{x \in G} f(x) - \eta$.
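To make the two selection principles of Algorithm 1 concrete, the following is a minimal sketch of our own (not the authors' implementation; all names are hypothetical) on a finite 1-D grid, where the inner min/max are exact enumerations, the SE kernel is assumed known, and $\beta_t^{1/2}$ is held constant:

```python
import numpy as np

# A minimal sketch of Algorithm 1 (STABLEOPT) on a finite 1-D grid, written for
# illustration under simplifying assumptions (known SE kernel, exact inner
# enumeration, fixed beta). All names here are our own.

def k_se(a, b, lengthscale=0.2):
    """Squared exponential kernel matrix between 1-D point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lengthscale ** 2))

def posterior(X, y, grid, sigma):
    """GP posterior mean/std on the grid given observations (X, y); cf. (4)-(5)."""
    K = k_se(X, X) + sigma ** 2 * np.eye(len(X))
    Ks = k_se(X, grid)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = 1.0 - np.einsum('ij,ij->j', Ks, sol)   # k(x, x) = 1 by normalization
    return mu, np.sqrt(np.maximum(var, 1e-12))

def stableopt(f, grid, eps, T, sigma=0.1, beta_sqrt=2.0, seed=0):
    rng = np.random.default_rng(seed)
    balls = [np.flatnonzero(np.abs(grid - x) <= eps) for x in grid]  # Delta_eps(x)
    X = np.array([grid[0]])
    y = np.array([f(grid[0]) + sigma * rng.normal()])
    picks = []
    for _ in range(T):
        mu, sd = posterior(X, y, grid, sigma)
        ucb, lcb = mu + beta_sqrt * sd, mu - beta_sqrt * sd
        # (i) optimism: x~_t maximizes the worst-case UCB over its eps-ball, cf. (13)
        i = int(np.argmax([ucb[b].min() for b in balls]))
        picks.append(i)
        # (ii) pessimism: query the perturbed point minimizing the LCB in the ball
        j = balls[i][int(np.argmin(lcb[balls[i]]))]
        X = np.append(X, grid[j])
        y = np.append(y, f(grid[j]) + sigma * rng.normal())
    # report the x~_t whose eps-ball has the highest worst-case LCB (the
    # common-bound variant used in the experiments; cf. (17))
    mu, sd = posterior(X, y, grid, sigma)
    lcb = mu - beta_sqrt * sd
    return grid[max(picks, key=lambda i: lcb[balls[i]].min())]
```

In this sketch, the variations of the present section amount to replacing `balls` with the appropriate uncertainty sets, e.g., the groups $\mathcal{G}$ for (24)–(26).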
The preceding variations of STABLEOPT, as well as their resulting variations of Theorem 1, follow by substituting certain (unconventional) choices of $d(\cdot, \cdot)$ and $\epsilon$ into Algorithm 1 and Theorem 1, with $(x, \theta)$ in place of $x$ where appropriate. The details are given in the supplementary material.

## 5 Experiments

In this section, we experimentally validate the performance of STABLEOPT by comparing it against several baselines. Each algorithm that we consider may distinguish between the sampled point (i.e., the point that produces the noisy observation $y_t$) and the reported point (i.e., the point believed to be near-optimal in terms of $\epsilon$-stability). For STABLEOPT, as described in Algorithm 1, the sampled point is $\tilde{x}_t + \delta_t$, and the reported point after time $t$ is the one in $\{\tilde{x}_1, \dots, \tilde{x}_t\}$ with the highest value of $\min_{\delta \in \Delta_\epsilon(\tilde{x}_t)} \mathrm{lcb}_t(\tilde{x}_t + \delta)$.²

²This is slightly different from Theorem 1, which uses the confidence bound $\mathrm{lcb}_{t-1}$ for $\tilde{x}_t$ instead of adopting the common bound $\mathrm{lcb}_t$. We found the latter to be more suitable when the kernel hyperparameters are updated online, whereas Theorem 1 assumes a known kernel. Theorem 1 can be adapted to use $\mathrm{lcb}_t$ alone by intersecting the confidence bounds at each time instant so that they are monotonically shrinking [15].

In addition, we consider the following baselines:

- **GP-UCB** (see (10)). We consider GP-UCB to be a good representative of the wide range of existing methods that search for the non-robust global maximum.
- **MAXIMIN-GP-UCB.** A natural generalization of GP-UCB in which, at each time step, the sampled and reported point are both given by
$$x_t = \arg\max_{x \in D} \min_{\delta \in \Delta_\epsilon(x)} \mathrm{ucb}_{t-1}(x + \delta). \quad (27)$$
- **STABLE-GP-RANDOM.** The sampled point $x_t$ at every time step is chosen uniformly at random, while the reported point at time $t$ is chosen among the sampled points $\{x_1, \dots, x_t\}$ according to the same rule as the one used for STABLEOPT.
- **STABLE-GP-UCB.** The sampled point is given by the GP-UCB rule, while the reported point is again chosen in the same way as in STABLEOPT.

As observed in existing works (e.g., [7, 31]), the theoretical choice of $\beta_t$ is overly conservative. We therefore adopt a constant value of $\beta_t^{1/2} = 2.0$ in each of the above methods, which we found to provide a suitable exploration/exploitation trade-off for each of the above algorithms. Given a reported point $x^{(t)}$ at time $t$, the performance metric is the $\epsilon$-regret $r_\epsilon(x^{(t)})$ given in (8).

Two observations are in order: (i) all the baselines are heuristic approaches for our problem, and they do not have guarantees in terms of near-optimal stability; (ii) we do not compare against other standard GP optimization methods, as their performance is comparable to that of GP-UCB; in particular, they suffer from exactly the same pitfalls described at the end of Section 2.

**Synthetic function.** We consider the synthetic function from [4] (see Figure 2a), given by
$$f_{\text{poly}}(x, y) = -2x^6 + 12.2x^5 - 21.2x^4 - 6.2x + 6.4x^3 + 4.7x^2 - y^6 + 11y^5 - 43.3y^4 + 10y + 74.8y^3 - 56.9y^2 + 4.1xy + 0.1x^2y^2 - 0.4xy^2 - 0.4x^2y. \quad (28)$$

[Figure 2: (a) The synthetic function $f_{\text{poly}}$ from [4]; (b) its counterpart $g_{\text{poly}}$ with worst-case perturbations; (c) the $\epsilon$-regret of STABLEOPT, GP-UCB, MAXIMIN-GP-UCB, STABLE-GP-UCB, and STABLE-GP-RANDOM over $t = 0, \dots, 100$ rounds. In this example, STABLEOPT significantly outperforms the baselines.]

The decision space $D$ is a uniformly spaced grid of $10^4$ points in $[-0.95, 3.2] \times [-0.45, 4.4]$. There exist multiple local maxima, and the global maximum is at $(x^*_f, y^*_f) = (2.82, 4.0)$, with $f_{\text{poly}}(x^*_f, y^*_f) = 20.82$.
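As a concrete check of this setup, the following short sketch (our own code, not the paper's) implements (28), with the sign conventions as reconstructed above, and evaluates it on the stated grid:

```python
import numpy as np

# Our own implementation of the synthetic objective (28) from [4], evaluated
# on the paper's 10^4-point grid over [-0.95, 3.2] x [-0.45, 4.4].

def f_poly(x, y):
    return (-2 * x**6 + 12.2 * x**5 - 21.2 * x**4 - 6.2 * x + 6.4 * x**3
            + 4.7 * x**2 - y**6 + 11 * y**5 - 43.3 * y**4 + 10 * y
            + 74.8 * y**3 - 56.9 * y**2 + 4.1 * x * y + 0.1 * x**2 * y**2
            - 0.4 * x * y**2 - 0.4 * x**2 * y)

xs = np.linspace(-0.95, 3.2, 100)   # 100 x 100 = 10^4 grid points
ys = np.linspace(-0.45, 4.4, 100)
X, Y = np.meshgrid(xs, ys, indexing='ij')
F = f_poly(X, Y)

i, j = np.unravel_index(np.argmax(F), F.shape)
# The paper reports the global maximum near (2.82, 4.0) with value ~20.82.
print(f"non-robust max ~ {F[i, j]:.2f} at ({xs[i]:.2f}, {ys[j]:.2f})")
```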
Similarly to [4], given stability parameters $\delta = (\delta_x, \delta_y)$ with $\|\delta\|_2 \le 0.5$, the robust optimization problem is
$$\max_{(x, y) \in D} g_{\text{poly}}(x, y), \quad (29)$$
$$g_{\text{poly}}(x, y) := \min_{(\delta_x, \delta_y) \in \Delta_{0.5}(x, y)} f_{\text{poly}}(x - \delta_x, y - \delta_y). \quad (30)$$
A plot of $g_{\text{poly}}$ is shown in Figure 2b. The global maximum is attained at $(x^*_g, y^*_g) = (-0.195, 0.284)$ with $g_{\text{poly}}(x^*_g, y^*_g) = -4.33$, while the inputs maximizing $f_{\text{poly}}$ yield only $g_{\text{poly}}(x^*_f, y^*_f) = -22.34$.

We initialize the above algorithms by selecting 10 uniformly random inputs $(x, y)$, keeping those points the same for each algorithm. The kernel adopted is a squared exponential ARD kernel. We randomly sample 500 points whose function value is above $-15.0$ to learn the GP hyperparameters via maximum likelihood, and then run the algorithms with these hyperparameters. The observation noise standard deviation is set to 0.1, and the number of sampling rounds is $T = 100$. We repeat the experiment 100 times and show the average performance in Figure 2c. We observe that STABLEOPT significantly outperforms the baselines in this experiment. In the later rounds, the baselines report points that are close to the global (non-robust) optimizer, which is suboptimal with respect to the $\epsilon$-regret.

**Lake data.** In the supplementary material, we provide an analogous experiment to that above using chlorophyll concentration data from Lake Zürich, with STABLEOPT again performing best.

**Robust robot pushing.** We consider the deterministic version of the robot pushing objective from [35], with publicly available code.³ The goal is to find a good pre-image for pushing an object to a target location. The 3-dimensional function takes as input the robot location $(r_x, r_y)$ and pushing duration $r_t$, and outputs $f(r_x, r_y, r_t) = 5 - d_{\text{end}}$, where $d_{\text{end}}$ is the distance from the pushed object to the target location. The domain $D$ is continuous: $r_x, r_y \in [-5, 5]$ and $r_t \in [1, 30]$.

³https://github.com/zi-w/Max-value-Entropy-Search

We consider a twist on this problem in which there is uncertainty regarding the precise target location, so one seeks a set of input parameters that is robust against a number of different potential locations. In the simplest case, the number of such locations is finite, meaning we can model this problem as $r^* \in \arg\max_{r \in D} \min_{i \in [m]} f_i(r)$, where each $f_i$ corresponds to a different target location, and $[m] = \{1, \dots, m\}$. This is a special case of (18) with a finite set $\Theta$ of cardinality $m$.

In our experiment, we use $m = 2$; hence, our goal is to find an input configuration $r$ that is robust against two different target locations. The first target is drawn uniformly over the domain, and the second is drawn uniformly over the $\ell_1$-ball of radius 2.0 centered at the first target location. We initialize each algorithm with one random sample from each $f_i$. We run each method for $T = 100$ rounds, and for a reported point $r^{(t)}$ at time $t$, we compare the methods in terms of the robust objective $\min_{i \in [m]} f_i(r^{(t)})$. We perform a fully Bayesian treatment of the hyperparameters, sampling every 10 rounds as in [17, 35]. We average over 30 random pairs $\{f_1, f_2\}$ and report the results in Figure 3. STABLEOPT noticeably outperforms its competitors except in some of the very early rounds. We note that the apparent discontinuities in certain curves are a result of the hyperparameter re-estimation.

[Figure 3: The average minimum objective value $\min_{i \in [m]} f_i(r^{(t)})$ over rounds $t$ in the robust robot pushing experiment (Left), and the average $\epsilon$-regret at $t = 5, 10, 15, 20, 25$ in the MovieLens-100K experiment (Right).]
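To make the finite-target specialization concrete, here is a minimal sketch of our own (hypothetical names) of one STABLEOPT round for the $\max_r \min_{i \in [m]} f_i(r)$ objective, assuming for simplicity that each $f_i$ is modeled by its own GP posterior over a finite candidate set (the formulation in (18) would instead place a single GP over the joint space via a composite kernel):

```python
import numpy as np

# Sketch of the finite-target special case of (18)-(20): one GP per target
# objective f_i over n candidate inputs (a simplification; names are ours).

def select_round(mus, sds, beta_sqrt=2.0):
    """One STABLEOPT round; mus/sds are (m, n) arrays of per-objective
    posterior means/stds over the n candidate inputs."""
    ucb = mus + beta_sqrt * sds
    lcb = mus - beta_sqrt * sds
    r = int(np.argmax(ucb.min(axis=0)))   # cf. (19): optimistic worst-case input
    i = int(np.argmin(lcb[:, r]))         # cf. (20): pessimistic objective index
    return r, i                           # next query: evaluate f_i at input r
```

The noisy observation $y_t = f_i(r) + z_t$ then updates the $i$-th posterior before the next round.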
**Group movie recommendation.** Our goal in this task is to recommend a group of movies to a user such that every movie in the group is to their liking. We use the MovieLens-100K dataset, which consists of 1682 movies and 943 users. The data takes the form of an incomplete matrix $R$ of ratings, where $R_{i,j}$ is the rating of movie $i$ given by user $j$. To impute the missing rating values, we apply non-negative matrix factorization with $p = 15$ latent factors. This produces a feature vector for each movie $m_i \in \mathbb{R}^p$ and each user $u_j \in \mathbb{R}^p$. We use 10% of the user data for training, in which we fit a Gaussian distribution $P(u) = N(u \mid \mu, \Sigma)$. For a given user $u_j$ in the test set, $P(u)$ is considered to be a prior, and the objective is given by $f_j(m_i) = m_i^T u_j$, corresponding to a GP with a linear kernel. We cluster the movie feature vectors into $k = 80$ groups, i.e., $\mathcal{G} = \{G_1, \dots, G_k\}$, via the k-means algorithm. Similarly to (24), the robust optimization problem for a given user $j$ is
$$\max_{G \in \mathcal{G}} g_j(G), \quad (31)$$
where $g_j(G) = \min_{m_i \in G} f_j(m_i)$. That is, for the user with feature vector $u_j$, our goal is to find the group of movies to recommend such that the entire collection of movies is robust with respect to the user's preferences.

In this experiment, we compare STABLEOPT against GP-UCB and MAXIMIN-GP-UCB. We report the $\epsilon$-regret given by $g_j(G^*) - g_j(G^{(t)})$, where $G^*$ is the maximizer of (31), and $G^{(t)}$ is the reported group after time $t$. Since we are reporting groups rather than points, the baselines require slight modifications: at time $t$, GP-UCB selects the movie $m_t$ (i.e., asks for the user's rating of it) and reports the group $G^{(t)}$ to which $m_t$ belongs. MAXIMIN-GP-UCB reports $G^{(t)} \in \arg\max_{G \in \mathcal{G}} \min_{m \in G} \mathrm{ucb}_{t-1}(m)$ and then selects $m_t \in \arg\min_{m \in G^{(t)}} \mathrm{ucb}_{t-1}(m)$. Finally, STABLEOPT reports a group in the same way as MAXIMIN-GP-UCB, but selects $m_t$ analogously to (26). In Figure 3 (Right), we show the average $\epsilon$-regret, where the average is taken over 500 different test users. In this experiment, the average $\epsilon$-regret decreases rapidly after only a small number of rounds. Among the three methods, STABLEOPT is the only one that finds the optimal movie group.

## 6 Conclusion

We have introduced and studied a variant of GP optimization in which one requires stability/robustness to an adversarial perturbation. We demonstrated the failures of existing algorithms, and provided a new algorithm STABLEOPT that overcomes these limitations, with rigorous guarantees. We showed that our framework naturally applies to several interesting max-min optimization formulations, and we demonstrated significant improvements over some natural baselines in the experimental examples. An interesting direction for future work is to study the $\epsilon$-stable optimization formulation in the context of hyperparameter tuning (e.g., for deep neural networks). One might expect that wide function maxima in hyperparameter space provide better generalization than narrow maxima, but establishing this requires further investigation. Similar considerations are an ongoing topic of debate for the parameter space rather than the hyperparameter space, e.g., see [13].

**Acknowledgment.** This work was partially supported by the Swiss National Science Foundation (SNSF) under grant number 407540_167319, by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594, time-data), by DARPA DSO's Lagrange program under grant FA86501827838, and by an NUS startup grant.

## References

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire.
Gambling in a rigged casino: The adversarial multi-armed bandit problem. Technical report, http://www.dklevine.com/archive/refs4462.pdf, 1998.

[2] Justin J. Beland and Prasanth B. Nair. Bayesian optimization under uncertainty. NIPS BayesOpt 2017 Workshop, 2017.

[3] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Nonconvex robust optimization for problems with constraints. INFORMS Journal on Computing, 22(1):44–58, 2010.

[4] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Robust optimization for unconstrained simulation-based problems. Operations Research, 58(1):161–178, 2010.

[5] Ilija Bogunovic, Slobodan Mitrović, Jonathan Scarlett, and Volkan Cevher. Robust submodular maximization: A non-uniform partitioning approach. In International Conference on Machine Learning (ICML), pages 508–516, 2017.

[6] Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 314–323, 2016.

[7] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1507–1515, 2016.

[8] Ilija Bogunovic, Junyao Zhao, and Volkan Cevher. Robust maximization of non-submodular objectives. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 890–899, 2018.

[9] Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems (NIPS), pages 4708–4717, 2017.

[10] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844–853, 2017.

[11] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.

[12] Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923, 2014.

[13] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning (ICML), 2017.

[14] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 648–657, 2016.

[15] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1344–1350, 2013.

[16] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012.

[17] José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918–926, 2014.

[18] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models.
In International Conference on Machine Learning (ICML), pages 295–304, 2015.

[19] Andreas Krause, H. Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Robust submodular observation selection. Journal of Machine Learning Research, 9(Dec):2761–2801, 2008.

[20] Andreas Krause and Cheng S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455, 2011.

[21] Daniel J. Lizotte, Tao Wang, Michael H. Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949, 2007.

[22] Ruben Martinez-Cantin, Kevin Tee, and Michael McCourt. Practical Bayesian optimization in the presence of outliers. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

[23] J. Nogueira, R. Martinez-Cantin, A. Bernardino, and L. Jamone. Unscented Bayesian optimization for safe robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.

[24] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, 2006.

[25] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 298–307, 2018.

[26] Binxin Ru, Michael Osborne, and Mark McLeod. Fast information-theoretic Bayesian optimisation. arXiv preprint arXiv:1711.00673, 2017.

[27] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory (COLT), 2017.

[28] Shubhanshu Shekhar and Tara Javidi. Gaussian process bandits with adaptive discretization. arXiv preprint arXiv:1712.01447, 2017.

[29] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations (ICLR), 2018.

[30] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951–2959, 2012.

[31] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015–1022, 2010.

[32] Matthew Staib, Bryan Wilder, and Stefanie Jegelka. Distributionally robust submodular maximization. arXiv preprint arXiv:1802.05249, 2018.

[33] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML), pages 997–1005, 2015.

[34] Hastagiri P. Vanchinathan, Isidor Nikolic, Fabio De Bona, and Andreas Krause. Explore-exploit in top-N recommender systems via Gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 225–232. ACM, 2014.

[35] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), pages 3627–3635, 2017.

[36] Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning (ICML), pages 3656–3664, 2017.
[37] Bryan Wilder. Equilibrium computation for zero sum games with submodular structure. In Conference on Artificial Intelligence (AAAI), 2017.