Robust Bayesian Satisficing

Artun Saday (Bilkent University) artun.saday@bilkent.edu.tr
Yaşar Cahit Yıldırım (Bilkent University) cahit.yildirim@bilkent.edu.tr
Cem Tekin (Bilkent University) cemtekin@ee.bilkent.edu.tr

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Distributional shifts pose a significant challenge to achieving robustness in contemporary machine learning. To overcome this challenge, robust satisficing (RS) seeks a robust solution to an unspecified distributional shift while achieving a utility above a desired threshold. This paper focuses on the problem of RS in contextual Bayesian optimization when there is a discrepancy between the true and reference distributions of the context. We propose a novel robust Bayesian satisficing algorithm called RoBOS for noisy black-box optimization. Our algorithm guarantees sublinear lenient regret under certain assumptions on the amount of distribution shift. In addition, we define a weaker notion of regret called robust satisficing regret, in which our algorithm achieves a sublinear upper bound independent of the amount of distribution shift. To demonstrate the effectiveness of our method, we apply it to various learning problems and compare it to other approaches, such as distributionally robust optimization.

1 Introduction

Bayesian optimization (BO) [1, 2] is a powerful technique for optimizing complex black-box functions that are expensive to evaluate. It is particularly useful in situations where the function is noisy or has multiple local optima. The approach combines a probabilistic model of the objective function with a search algorithm to efficiently identify the best input values. In recent years, BO has become a popular method for various sequential decision-making problems such as parameter tuning in machine learning [3], vaccine and drug development [4], and dynamic treatment regimes [5].

Contextual BO [6] is an extension of BO that allows for optimization in the presence of contexts, i.e., exogenous variables associated with the environment that can affect the outcome. A common approach to BO when the context distribution is known is to maximize the expected utility [7, 8]. Often, however, there exists a distributional mismatch between the reference distribution that the learner assumes and the true covariate distribution that the environment decides. When there is uncertainty surrounding the reference distribution, choosing the solution that maximizes the expected utility may result in a suboptimal or even disastrous outcome.

A plethora of methods have been developed to tackle distribution shifts in BO; the ones most closely related to our work are adversarially robust BO (STABLEOPT) [9] and distributionally robust BO (DRBO) [10]. STABLEOPT aims to maximize the utility under an adversarial perturbation to the input. DRBO aims to maximize the utility under the worst-case context distribution in a known uncertainty set. We focus on the contextual framework of DRBO. However, unlike DRBO, our algorithm does not require an uncertainty set as input. This provides an additional level of robustness to distribution shifts, even when the true distribution lies outside of a known uncertainty set.

In this paper, we introduce the concept of robust Bayesian satisficing (RBS), whose roots were laid by Herbert Simon [11].
In his 1978 Nobel Prize in Economics speech, Simon noted that decision makers can satisfice either by finding optimum solutions for a simplified world or by finding satisfactory solutions for a more realistic world. Satisficing can be described as achieving a satisfactory threshold $\tau$ (a.k.a. aspiration level) of utility under uncertainty. It has been observed that satisficing behavior is prevalent in decision-making scenarios where agents face risk and uncertainty and exhibit boundedly rational behavior due to the immense complexity of the problem, computational limits, and time constraints [12]. Since its introduction, satisficing in decision-making has been investigated in many different disciplines, including economics [13], management science [14, 15], psychology [16], and engineering [17, 18]. The concept of satisficing has also recently been formalized within the multi-armed bandit framework in terms of regret minimization [19, 20, 21, 22] and good arm identification [23]. Recently, it has been shown that satisficing designs can be found with a sample complexity much smaller than what is necessary to identify optimal designs [23]. Moreover, [21] demonstrated that when future rewards are discounted, i.e., learning is time-sensitive, algorithms that seek a satisficing design yield considerably larger returns than algorithms that converge to an optimal design.

The concept of robust satisficing (RS) is intimately connected to satisficing. In the seminal work of Schwartz et al. [24], robust satisficing is described as finding a design that maximizes robustness to uncertainty while satisficing. Long et al. [25] cast robust satisficing as an optimization problem and propose models to estimate a robust satisficing decision efficiently. Inspired by this rich line of literature, we introduce the concept of robust Bayesian satisficing (RBS) as an alternative paradigm for robust Bayesian optimization. The objective of RBS is to satisfice under ever-evolving conditions by achieving rewards that are comparable with an aspiration level $\tau$, even under an unrestricted distributional shift.

Contributions. Our contributions can be summarized as follows:

- We propose robust Bayesian satisficing, a new decision-making framework that merges robust satisficing with the power of Bayesian surrogate modeling. We provide a detailed comparison between RBS and other BO methods.
- To measure the performance of the learner with respect to an aspiration level $\tau$, we introduce two regret definitions. The first one is the lenient regret studied by [22] and [23]. The second one is the robust satisficing regret, which measures the loss of the learner with respect to a benchmark that tracks the quality of the true robust satisficing actions. We also provide a connection between these two regret measures.
- We propose a Gaussian process (GP) based learning algorithm called Robust Bayesian Optimistic Satisficing (RoBOS). RoBOS only requires as input an aspiration level $\tau$ that it seeks to achieve. Unlike algorithms for robust BO, it does not require as input an uncertainty set that quantifies the degree of distribution shift. We prove that RoBOS achieves, with high probability, $\mathcal{O}(\gamma_T \sqrt{T})$ robust satisficing regret and $\mathcal{O}(\gamma_T \sqrt{T} + E_T)$ lenient regret, where $\gamma_T$ is the maximum information gain over $T$ rounds and $E_T := \sum_{t=1}^{T} \epsilon_t$ is the sum of distribution shifts by round $T$, with $\epsilon_t$ the amount of distribution shift in round $t$.
- We provide a detailed numerical comparison between RoBOS and other robust BO algorithms, verifying the practical resilience of RoBOS in handling unknown distribution shifts.

Organization. The remainder of the paper is organized as follows. The problem formulation and regret definitions are introduced in Section 2. RoBOS is introduced in Section 3, and its regret analysis is carried out in Section 4. Experimental results are reported in Section 5. Conclusion, limitations, and future research are discussed in Section 6. Additional results and complete proofs of the theoretical results in the main paper can be found in the appendix.

2 Problem definition

Let $f : \mathcal{X} \times \mathcal{C} \to \mathbb{R}$ be an unknown reward function defined over a parameter space with finite action and context sets, $\mathcal{X}$ and $\mathcal{C} := \{c_1, \ldots, c_n\}$, respectively. Let $\mathcal{P}_0$ represent the set of all distributions over $\mathcal{C}$. The objective is to sequentially optimize $f$ using noisy observations. At each round $t \in [T]$, in turn, the environment provides a reference distribution $P_t \in \mathcal{P}_0$, the learner chooses an action $x_t \in \mathcal{X}$, and the environment provides a context $c_t \in \mathcal{C}$ together with a noisy observation $y_t = f(x_t, c_t) + \eta_t$, where $\eta_t$ is conditionally $\sigma$-subgaussian given $x_1, c_1, y_1, \ldots, x_{t-1}, c_{t-1}, y_{t-1}, x_t, c_t$. We assume that $c_t$ is sampled independently from a time-dependent, unknown distribution $P^*_t \in \mathcal{P}_0$, which can be different from $P_t$. We represent the distributions $P_t$ and $P^*_t$ with $n$-dimensional non-negative vectors $w_t$ and $w^*_t$ such that $\|w_t\|_1 = \|w^*_t\|_1 = 1$. We represent the distance between $P, P' \in \mathcal{P}_0$ with $d(P, P')$. In particular, we consider maximum mean discrepancy (MMD) as the distance measure.

Table 1: Comparison of optimization objectives.

| Method | Inputs | Objective |
| --- | --- | --- |
| SO | $f$, $P_t$ | Find $x \in \mathcal{X}$ that maximizes $\mathbb{E}_{c \sim P_t}[f(x, c)]$ |
| S | $f$, $P_t$, $\tau$ | Find $x \in \mathcal{X}$ that satisfies $\mathbb{E}_{c \sim P_t}[f(x, c)] \ge \tau$ |
| WRO | $f$, $\mathcal{C}_t$ | Find $x \in \mathcal{X}$ that maximizes $\min_{c \in \mathcal{C}_t} f(x, c)$ |
| DRO | $f$, $\mathcal{U}_t$ | Find $x \in \mathcal{X}$ that maximizes $\inf_{P \in \mathcal{U}_t} \mathbb{E}_{c \sim P}[f(x, c)]$ |
| RS | $f$, $P_t$, $\tau$ | Find $x \in \mathcal{X}$ that minimizes $\kappa(x)$, where $\kappa(x) = \min_{k \ge 0} k$ s.t. $\mathbb{E}_{c \sim P}[f(x, c)] \ge \tau - k\, d(P, P_t)$ for all $P \in \mathcal{P}_0$ |

Optimization objective. To motivate robust Bayesian satisficing, we review and compare various optimization objectives. Throughout this section, we assume that $f$ is known. The key novelty of our work is combining the new robust satisficing objective from [25] with Bayesian surrogate modeling to address unmet real-world challenges faced by BO. Robust satisficing aims to perform satisfactorily well over a wide range of possible distributions on $\mathcal{C}$. This is different from stochastic optimization (SO) [26], which aims to optimize for a given reference distribution $P_t$ (footnote 1); distributionally robust optimization (DRO) [27, 28], which aims to optimize the worst-case scenario in an ambiguity set $\mathcal{U}_t$, usually taken as a ball of radius $r$ centered at $P_t$; worst-case robust optimization (WRO) [29], which aims to optimize under the worst-case contexts from an uncertainty set $\mathcal{C}_t$ of contexts; and satisficing (S) [30], which seeks a satisfactory solution that achieves threshold $\tau$. Table 1 compares the different optimization objectives.

In RS, the objective is to find $x^*_t \in \mathcal{X}$ that solves, in each round $t$,

$$\kappa_{\tau,t} = \min_{x \in \mathcal{X},\, k \ge 0} k \quad \text{s.t.} \quad \mathbb{E}_{c \sim P}[f(x, c)] \ge \tau - k\, d(P, P_t), \ \forall P \in \mathcal{P}_0 \,. \quad (1)$$

To find $x^*_t$, we can first compute the fragility of $x \in \mathcal{X}$ as

$$\kappa_{\tau,t}(x) = \min_{k \ge 0} k \quad \text{s.t.} \quad \mathbb{E}_{c \sim P}[f(x, c)] \ge \tau - k\, d(P, P_t), \ \forall P \in \mathcal{P}_0 \,. \quad (2)$$

When (1) is feasible, the RS solution at round $t$ is the one with the minimum fragility, i.e., $x^*_t \in \arg\min_{x \in \mathcal{X}} \kappa_{\tau,t}(x)$ and $\kappa_{\tau,t} = \min_{x \in \mathcal{X}} \kappa_{\tau,t}(x)$.
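Since $\mathcal{C}$ is finite, the fragility in (2) can equivalently be written as the clamped ratio $\max\{\max_{w \ne w_t} (\tau - \langle w, f_x\rangle)/\|w - w_t\|_M,\, 0\}$ (cf. Table 2 in the appendix), where $\|\cdot\|_M$ is the MMD norm defined in Section 3. The following is a minimal illustrative sketch of that computation, not the authors' implementation; the kernel matrix, reward vector, and reference distribution are made-up toy values, and the simplex maximization is done with a generic multi-start SLSQP solver.

```python
# Illustrative sketch: fragility kappa_{tau,t}(x) of Eq. (2) over a finite context set,
# via the equivalent ratio form max_{w != w_t} (tau - <w, f_x>) / ||w - w_t||_M.
import numpy as np
from scipy.optimize import minimize

def mmd(w, w_ref, M):
    """MMD between two distributions on C: ||w - w_ref||_M = sqrt((w - w_ref)^T M (w - w_ref))."""
    d = w - w_ref
    return np.sqrt(max(d @ M @ d, 0.0))

def fragility(f_x, w_ref, M, tau, n_restarts=10, seed=0):
    """Fragility of an action x with reward vector f_x = [f(x, c_1), ..., f(x, c_n)]."""
    n = len(f_x)
    if w_ref @ f_x < tau:          # tau unreachable even under the reference distribution
        return np.inf
    rng = np.random.default_rng(seed)

    def neg_ratio(w):              # minimize the negative of (tau - <w, f_x>) / ||w - w_ref||_M
        return -(tau - w @ f_x) / (mmd(w, w_ref, M) + 1e-12)

    cons = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    best = 0.0                     # fragility is clamped at 0, since k >= 0 in Eq. (2)
    for _ in range(n_restarts):    # multi-start because the ratio is not concave in general
        w0 = rng.dirichlet(np.ones(n))
        res = minimize(neg_ratio, w0, bounds=[(0.0, 1.0)] * n, constraints=cons, method="SLSQP")
        if res.success:
            best = max(best, -res.fun)
    return best

# Toy usage with hypothetical numbers: 3 contexts, an assumed context kernel matrix M.
M = np.array([[1.0, 0.6, 0.1], [0.6, 1.0, 0.6], [0.1, 0.6, 1.0]])
f_x = np.array([2.0, 1.0, -1.0])       # rewards of one action over the contexts
w_ref = np.array([0.5, 0.4, 0.1])      # reference distribution P_t
print(fragility(f_x, w_ref, M, tau=0.8))
```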
Similar to DRO, RS utilizes a reference distribution assumed to be a proxy for the true distribution. However, unlike DRO, we do not define an ambiguity set $\mathcal{U}_t$ that represents all plausible distributions on $\mathcal{C}$, but rather define a threshold value $\tau$ which we aim to satisfy. In cases where we are not confident that the reference distribution $P_t$ accurately represents the true distribution $P^*_t$, finding a meaningful ambiguity set can be difficult. In contrast, $\tau$ has a meaningful interpretation and can be expressed as a percentage of the SO solution computed under the reference distribution $P_t$ over $\mathcal{C}$, given by (footnote 2)

$$Z_t := \max_{x \in \mathcal{X}} \mathbb{E}_{c \sim P_t}[f(x, c)] \,. \quad (3)$$

Unlike DRO, in RS we are certain that our formulation covers the true distribution $P^*_t$. The fragility can be viewed as the minimum rate of suboptimality one can incur with respect to the threshold per unit of distribution shift from $P_t$. The success of an action is measured by whether it achieves the desired threshold value $\tau$ in expectation. Depending on the objective function and the ambiguity set, the RS solution can differ from the DRO and the SO solutions (see Figure 1 for an example).

Footnote 1: One particular instance of this is when $P_t$ is the empirical context distribution.
Footnote 2: When $f$ is unknown, $Z_t$ cannot be computed exactly. However, one can reflect the uncertainty about $f$ by using a GP surrogate model, by which upper and lower bounds on $Z_t$ can be computed.

Figure 1: Examples of RS, DRO, and SO solutions, where $Z_t$ is the solution to the SO problem given in (3). The left panel shows a heatmap of $f(x, c)$; the right panel shows the expected reward under $P_t$ and $P^*_t$ for each action, with the RS solution for $\tau = 0.65\, Z_t$ and DRO solutions for ambiguity radii $r \in \{\epsilon/3, \epsilon, 3\epsilon\}$, where $\epsilon := d(P_t, P^*_t)$. For DRO, $\mathcal{U}_t$ corresponds to a ball centered at $P_t$ with radius $r$. Rhombus, cross, and hexagon correspond to the RS solution and two other suboptimal solutions, respectively. Note that the SO solution corresponds to the point marked with a hexagon, i.e., a suboptimal solution. When the radius of the ambiguity ball of DRO captures the discrepancy between $P_t$ and $P^*_t$ perfectly, it selects the RS solution. When the ambiguity ball is too small or too large, it fails to find the RS solution and selects a suboptimal solution.

An intriguing question is whether $\tau$ of RS is more interpretable than $\mathcal{U}_t$ of DRO. While [25] provides a detailed discussion of the pros and cons of these two approaches, below we compare them on an important dynamic drug dosage problem.

Example 1. Type 1 Diabetes Mellitus (T1DM) patients require bolus insulin doses (id) after meals for postprandial blood glucose (pbg) regulation. One of the most important factors that affects pbg is meal carbohydrate (cho) intake [31]. Let $\mathcal{X}$ and $\mathcal{C}$ represent admissible id and cho values. For $x \in \mathcal{X}$, $c \in \mathcal{C}$, let $g(x, c)$ represent the corresponding (expected) pbg value. The function $g$ depends on the patient's characteristics and can be regarded as unknown. The main goal of pbg regulation is to keep pbg close to a target level $K$ in order to prevent two potentially life-threatening events called hypoglycemia and hyperglycemia. This requires $x_t$ to be chosen judiciously based on the current $c_t$. Patients rely on a method called cho counting to calculate $c_t$. Often, this method is prone to errors [32]; the reported cho intake $\zeta_t$ can differ significantly from $c_t$. In order to use DRO, one needs to identify a range of plausible distributions for cho calculation errors, which is hard to calculate and interpret. On the other hand, specifying $\tau$ corresponds to defining an interval of safe pbg values around $K$ that one is content with, which is in line with standard clinical practice [33]. We provide experiments on this application in Section 5.
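Whether $\tau$ is tied to a clinically meaningful range as in Example 1 or anchored to the reference SO value $Z_t$ of (3) (as done in the experiments of Section 5), it remains a single interpretable scalar. Below is a small sketch of the $Z_t$-based choice on a finite grid, assuming $f$ is known; all numbers are hypothetical.

```python
# Sketch: anchoring the aspiration level to the SO value Z_t of Eq. (3) on a finite grid.
import numpy as np

F = np.array([[2.0, 1.0, -1.0],    # F[i, j] = f(x_i, c_j) on a finite action/context grid
              [1.5, 1.5,  0.5],
              [0.5, 0.9,  1.1]])
w_ref = np.array([0.2, 0.5, 0.3])  # reference distribution P_t over the contexts

Z_t = np.max(F @ w_ref)            # Z_t = max_x E_{c ~ P_t}[f(x, c)]
tau = 0.65 * Z_t                   # e.g., ask for 65% of the best reference-expected reward
print(Z_t, tau)
```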
Regret measures. We consider RS from a regret minimization perspective, where the objective is to choose a sequence of actions $x_1, \ldots, x_T$ that minimizes the growth rate of the regret. In particular, we focus on two different regret measures. The first one is the lenient regret [22], given as

$$R^{l}_T := \sum_{t=1}^{T} \big( \tau - \mathbb{E}_{c \sim P^*_t}[f(x_t, c)] \big)_+ \,, \quad (4)$$

where $(\cdot)_+ := \max\{0, \cdot\}$. The lenient regret measures the cumulative loss of the learner on the chosen sequence of actions w.r.t. the specified threshold $\tau$ that we aim to achieve. If an action achieves $\tau$ in expectation, it accumulates no regret; otherwise, it accumulates the difference. Achieving sublinear lenient regret under an arbitrary distribution shift is an impossible task. Note that even the RS action $x^*_t$ computed with complete knowledge of $f$ only guarantees an expected reward at least as large as $\tau - \kappa_{\tau,t}\, d(P^*_t, P_t)$. Therefore, our lenient regret upper bound will depend on the distribution shifts. We also define a new notion of regret called robust satisficing regret, given as

$$R^{rs}_T := \sum_{t=1}^{T} \big( \tau - \kappa_{\tau,t}\, d(P^*_t, P_t) - \mathbb{E}_{c \sim P^*_t}[f(x_t, c)] \big)_+ \,. \quad (5)$$

$R^{rs}_T$ measures the accumulated loss of the learner with respect to the robust satisficing benchmark $\tau - \kappa_{\tau,t}\, d(P^*_t, P_t)$ under the true distribution. In particular, the true robust satisficing action $x^*_t$ achieves $\mathbb{E}_{c \sim P^*_t}[f(x^*_t, c)] \ge \tau - \kappa_{\tau,t}\, d(P^*_t, P_t)$. It is obvious that $R^{rs}_T \le R^{l}_T$. When there is no distribution shift, i.e., $P^*_t = P_t$, and (1) is feasible for all $t$, the two regret notions are equivalent. In order to minimize the regrets in (4) and (5), we will develop an algorithm that utilizes Gaussian processes (GPs) as a surrogate model. This requires us to impose mild assumptions on $f$, which are detailed below.

Regularity assumptions. We assume that $f$ belongs to a reproducing kernel Hilbert space (RKHS) with kernel function $k$. We assume that $k((x, c), (x', c')) = k_{\mathcal{X}}(x, x')\, k_{\mathcal{C}}(c, c')$ is the product kernel formed by kernel functions $k_{\mathcal{X}}$ and $k_{\mathcal{C}}$ defined over the action and context spaces, respectively. Let the Hilbert norm $\|f\|_{\mathcal{H}}$ of $f$ be bounded above by $B$. This is a common assumption in the BO literature which allows working with GPs as a surrogate model.

GP surrogate and confidence intervals. Define $\mathcal{Z} := \mathcal{X} \times \mathcal{C}$. Our algorithm uses a GP to model $f$, defined by a prior mean function $\mu(z) = \mathbb{E}[f(z)]$ and a positive definite kernel function $k(z, z') = \mathbb{E}[(f(z) - \mu(z))(f(z') - \mu(z'))]$. Furthermore, we assume $\mu(z) = 0$ and $k(z, z) \le 1$ for all $z \in \mathcal{Z}$. The prior distribution over $f$ is modeled as $GP(0, k(z, z'))$. Using a Gaussian likelihood with variance $\lambda > 0$, the posterior distribution of $f$ given the observations $\mathbf{y}_{t-1} = [y_1, \ldots, y_{t-1}]^{\mathsf{T}}$ at points $\mathbf{z}_{t-1} = [z_1, \ldots, z_{t-1}]^{\mathsf{T}}$ is a GP whose mean and covariance at the beginning of round $t \ge 1$ are given as

$$\mu_t(z) = k_{t-1}(z)^{\mathsf{T}} (K_{t-1} + \lambda I_{t-1})^{-1} \mathbf{y}_{t-1} \,,$$
$$k_t(z, z') = k(z, z') - k_{t-1}(z)^{\mathsf{T}} (K_{t-1} + \lambda I_{t-1})^{-1} k_{t-1}(z') \,,$$
$$\sigma^2_t(z) = k_t(z, z) \,,$$

where $k_t(z) = [k(z, z_1), \ldots, k(z, z_t)]^{\mathsf{T}}$, $K_t$ is the $t \times t$ kernel matrix of the observations with $(K_t)_{ij} = k(z_i, z_j)$, and $I_t$ is the $t \times t$ identity matrix. Let $\sigma^2_t(z) := k_t(z, z)$ represent the posterior variance of the model. Define $\mu_0 := 0$ and $\sigma^2_0(z) := k(z, z)$. The maximum information gain over $t$ rounds is defined as [1]

$$\gamma_t := \max_{A \subset \mathcal{X} \times \mathcal{C} : |A| = t} \frac{1}{2} \log\big( \det(I_t + \lambda^{-1} K_A) \big) \,, \quad (6)$$

where $K_A$ is the kernel matrix of the sampling points $A$.
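The following is a compact sketch of the GP surrogate machinery defined above: posterior mean and standard deviation with a product kernel, and the information-gain quantity of (6) evaluated on the sampled points. The RBF form of $k_{\mathcal{X}}$ and $k_{\mathcal{C}}$ and all numerical values are illustrative assumptions, not the paper's configuration.

```python
# Minimal GP surrogate sketch: product kernel k((x,c),(x',c')) = k_X(x,x') k_C(c,c'),
# posterior mean/std, and (1/2) log det(I + lambda^{-1} K) on the sampled points (Eq. (6)).
import numpy as np

def rbf(a, b, ls):
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def product_kernel(Za, Zb, ls_x=0.2, ls_c=5.0):
    # Each row of Za/Zb is (x, c); the kernel factorizes over the action and context inputs.
    return rbf(Za[:, 0], Zb[:, 0], ls_x) * rbf(Za[:, 1], Zb[:, 1], ls_c)

def gp_posterior(Zq, Zobs, y, lam=1.0):
    """Posterior mean and std at query points Zq given observations (Zobs, y)."""
    K = product_kernel(Zobs, Zobs) + lam * np.eye(len(y))
    Kqo = product_kernel(Zq, Zobs)
    mu = Kqo @ np.linalg.solve(K, y)
    var = product_kernel(Zq, Zq).diagonal() - np.einsum("ij,ij->i", Kqo, np.linalg.solve(K, Kqo.T).T)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def information_gain(Zobs, lam=1.0):
    sign, logdet = np.linalg.slogdet(np.eye(len(Zobs)) + product_kernel(Zobs, Zobs) / lam)
    return 0.5 * logdet

# Toy usage with made-up observations.
Zobs = np.array([[0.1, 1.0], [0.3, 2.0], [0.7, 0.5]])
y = np.array([0.4, 0.9, 0.1])
Zq = np.array([[0.2, 1.5], [0.9, 2.5]])
mu, sigma = gp_posterior(Zq, Zobs, y)
print(mu, sigma, information_gain(Zobs))
```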
The following lemma from [34], which is based on [35, Theorem 2], provides tight confidence intervals for functions $f$ with bounded Hilbert norm in the RKHS.

Lemma 1. [34, Theorem 1] Let $\delta \in (0, 1)$, $\bar{\lambda} := \max\{1, \lambda\}$, and

$$\beta_t(\delta) := B + \frac{\sigma}{\sqrt{\bar{\lambda}}} \sqrt{\log\big(\det(K_{t-1} + \bar{\lambda} I_{t-1})\big) + 2 \log\frac{1}{\delta}} \,.$$

Then, the following holds with probability at least $1 - \delta$:

$$|\mu_t(x, c) - f(x, c)| \le \beta_t(\delta)\, \sigma_t(x, c) \,, \quad \forall x \in \mathcal{X},\ c \in \mathcal{C},\ t \ge 1 \,.$$

For simplicity, we set $\lambda = 1$ in the rest of the paper, and observe that $\log(\det(K_{t-1} + I_{t-1})) \le 2\gamma_{t-1}$. When the confidence parameter $\delta$ is clear from the context, we use $\beta_t$ to represent $\beta_t(\delta)$ to reduce clutter. We define the upper confidence bound (UCB) and lower confidence bound (LCB) for $(x, c) \in \mathcal{X} \times \mathcal{C}$ as follows:

$$\mathrm{ucb}_t(x, c) := \mu_t(x, c) + \beta_t \sigma_t(x, c) \,, \qquad \mathrm{lcb}_t(x, c) := \mu_t(x, c) - \beta_t \sigma_t(x, c) \,.$$

For $x \in \mathcal{X}$, we denote the corresponding UCB and LCB vectors in $\mathbb{R}^n$ by $\mathrm{ucb}^t_x := [\mathrm{ucb}_t(x, c_1), \ldots, \mathrm{ucb}_t(x, c_n)]^{\mathsf{T}}$ and $\mathrm{lcb}^t_x := [\mathrm{lcb}_t(x, c_1), \ldots, \mathrm{lcb}_t(x, c_n)]^{\mathsf{T}}$. Also, let $f_x := [f(x, c_1), \ldots, f(x, c_n)]^{\mathsf{T}}$.

3 RoBOS for robust Bayesian satisficing

To perform robust Bayesian satisficing (RBS), we propose a learning algorithm called Robust Bayesian Optimistic Satisficing (RoBOS), whose pseudocode is given in Algorithm 1. At the beginning of each round $t$, RoBOS observes the reference distribution $w_t$. Then, it computes the UCB index $\mathrm{ucb}_t(x, c)$ for each action-context pair $(x, c) \in \mathcal{X} \times \mathcal{C}$ by using the GP posterior mean and standard deviation at round $t$. The UCB indices, $\tau$, and $w_t$ are used to compute the estimated fragility of each action $x$ at round $t$, given as

$$\hat{\kappa}_{\tau,t}(x) := \begin{cases} \max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \dfrac{\tau - \langle w, \mathrm{ucb}^t_x \rangle}{\|w - w_t\|_M} & \text{if } \langle w_t, \mathrm{ucb}^t_x \rangle \ge \tau \,, \\ +\infty & \text{if } \langle w_t, \mathrm{ucb}^t_x \rangle < \tau \,, \end{cases} \quad (7)$$

where $\Delta(\mathcal{C})$ represents the probability simplex over $\mathcal{C}$, $M$ represents the $n \times n$ kernel matrix, and $\|w\|_M := \sqrt{w^{\mathsf{T}} M w}$ is the MMD measure. Specifically, given the kernel $k_{\mathcal{C}} : \mathcal{C} \times \mathcal{C} \to \mathbb{R}_+$, $M$ is the kernel matrix of the context set $\mathcal{C}$, i.e., $M_{ij} = k_{\mathcal{C}}(c_i, c_j)$. The estimated fragility $\hat{\kappa}_{\tau,t}(x)$ of an action $x$ is an optimistic proxy for the true fragility $\kappa_{\tau,t}(x)$. Note that when $\hat{\kappa}_{\tau,t}(x) \le 0$, the threshold $\tau$ is achieved under any context distribution given the UCBs of the rewards of $x$. On the other hand, if $\tau > \mathbb{E}_{c \sim P_t}[\mathrm{ucb}_t(x, c)]$, then $\tau$ cannot be achieved under the reference distribution given the UCB indices, and thus $\hat{\kappa}_{\tau,t}(x) = \infty$. The next lemma, whose proof is given in the appendix, relates $\hat{\kappa}_{\tau,t}(x)$ to $\kappa_{\tau,t}(x)$.

Lemma 2. Fix $\tau \in \mathbb{R}$. With probability at least $1 - \delta$, $\hat{\kappa}_{\tau,t}(x) \le \kappa_{\tau,t}(x)$ for all $x \in \mathcal{X}$ and $t \ge 1$.

To perform robust satisficing, RoBOS chooses as $x_t$ the action with the lowest estimated fragility, i.e., $x_t = \arg\min_{x \in \mathcal{X}} \hat{\kappa}_{\tau,t}(x)$. After action selection, $c_t$ and $y_t$ are observed from the environment, and are then used to update the GP posterior.

Algorithm 1: RoBOS
Inputs: $\mathcal{X}$, $\mathcal{C}$, $\tau$, GP kernel $k$, $\mu_0 = 0$, $\sigma$, confidence parameter $\delta$, $B$
for $t = 1, 2, \ldots$ do
  1. Observe the reference distribution $w_t$
  2. Compute $\mathrm{ucb}_t(x, c) = \mu_t(x, c) + \beta_t \sigma_t(x, c)$ for all $x \in \mathcal{X}$ and $c \in \mathcal{C}$
  3. Compute $\hat{\kappa}_{\tau,t}(x)$ as in (7)
  4. Choose action $x_t = \arg\min_{x \in \mathcal{X}} \hat{\kappa}_{\tau,t}(x)$
  5. Observe $c_t \sim P^*_t$ and $y_t = f(x_t, c_t) + \eta_t$
  6. Use $\{x_t, c_t, y_t\}$ to compute $\mu_{t+1}$ and $\sigma_{t+1}$
end
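The following sketch illustrates one acquisition step of Algorithm 1 (steps 2-4), assuming the UCB matrix has already been formed from the GP posterior. The inner maximization over the simplex in (7) is approximated here by crude Dirichlet sampling purely for illustration; the released implementation may solve it differently.

```python
# Sketch of one RoBOS acquisition step: estimated fragility (Eq. (7)) and action selection.
import numpy as np

def estimated_fragility(ucb_x, w_ref, M, tau, n_samples=2000, rng=None):
    """kappa_hat_{tau,t}(x) of Eq. (7) for one action with UCB vector ucb_x."""
    if w_ref @ ucb_x < tau:
        return np.inf                       # tau unreachable under the reference distribution
    rng = rng or np.random.default_rng(0)
    W = rng.dirichlet(np.ones(len(ucb_x)), size=n_samples)   # candidate context distributions
    D = W - w_ref
    mmd = np.sqrt(np.maximum(np.einsum("ij,jk,ik->i", D, M, D), 1e-12))
    return np.max((tau - W @ ucb_x) / mmd)

def robos_select(UCB, w_ref, M, tau):
    """Pick the action with the smallest estimated fragility. UCB[i, j] = ucb_t(x_i, c_j)."""
    rng = np.random.default_rng(0)
    frag = np.array([estimated_fragility(u, w_ref, M, tau, rng=rng) for u in UCB])
    return int(np.argmin(frag)), frag

# Toy usage with hypothetical UCB values over 4 actions and 3 contexts.
UCB = np.array([[2.1, 1.2, -0.8],
                [1.6, 1.6,  0.6],
                [0.6, 1.0,  1.2],
                [0.2, 0.1,  0.0]])
M = np.array([[1.0, 0.6, 0.1], [0.6, 1.0, 0.6], [0.1, 0.6, 1.0]])
w_ref = np.array([0.2, 0.5, 0.3])
print(robos_select(UCB, w_ref, M, tau=0.8))
```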
4 Regret analysis

We will analyze the regret under the assumption that (1) is feasible.

Assumption 1. Let $\hat{x}_t := \arg\max_{x \in \mathcal{X}} \langle w_t, f_x \rangle$ be the best action under the reference distribution. We assume that $\tau \le \langle w_t, f_{\hat{x}_t} \rangle$ for all $t \in [T]$, i.e., the threshold is at most the solution to the SO problem.

If Assumption 1 does not hold in round $t$, then (1) is infeasible, which means that $\kappa_{\tau,t} = \infty$ and there is no robust satisficing solution. Therefore, measuring the regret in such a round would be meaningless. In practice, if the learner is flexible about its aspiration level, Assumption 1 can be relaxed by dynamically selecting $\tau$ at each round to be less than $\langle w_t, \mathrm{lcb}^t_{\hat{x}'_t} \rangle$, where $\hat{x}'_t := \arg\max_{x \in \mathcal{X}} \langle w_t, \mathrm{lcb}^t_x \rangle$ (footnote 3). The following theorem provides an upper bound on the robust satisficing regret of RoBOS based on the maximum information gain given in (6).

Theorem 3. Fix $\delta \in (0, 1)$. When RoBOS is run under Assumption 1 with confidence parameter $\delta/2$ and $\beta_t := \beta_t(\delta/2)$, where $\beta_t(\delta)$ is defined in Lemma 1, then with probability at least $1 - \delta$, the robust satisficing regret $R^{rs}_T$ of RoBOS is upper-bounded as

$$R^{rs}_T = \mathcal{O}\!\left( \beta_T \sqrt{T \left( 2\gamma_T + 2 \log\frac{12}{\delta} \right)} \right).$$

Footnote 3: Indeed, our algorithm can be straightforwardly adapted to work with dynamic thresholds $\tau_t$, $t \ge 1$. When we also change the thresholds in our regret definitions (4) and (5) to $\tau_t$, and update Assumption 1 such that $\tau_t \le \langle w_t, f_{\hat{x}_t} \rangle$ for all $t \in [T]$, all derived regret bounds still hold.

Proof Sketch: Assume that the confidence bounds in Lemma 1 hold (which happens with probability at least $1 - \delta/2$). Let $r^{rs}_t := \tau - \kappa_{\tau,t}\, d(P^*_t, P_t) - \mathbb{E}_{c \sim P^*_t}[f(x_t, c)]$ be the instantaneous regret. The robust satisficing regret can be written as $R^{rs}_T = \sum_{t=1}^{T} r^{rs}_t\, \mathbb{I}(r^{rs}_t \ge 0)$. We have

$$r^{rs}_t \le \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (8)$$
$$\le \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \big( \tau - \hat{\kappa}_{\tau,t} \|w^*_t - w_t\|_M \big) + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (9)$$
$$= \|w^*_t - w_t\|_M (\hat{\kappa}_{\tau,t} - \kappa_{\tau,t}) + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \le 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \,,$$

where (8) comes from the confidence bounds in Lemma 1, (9) comes from the fact that our algorithm at each round guarantees $\langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle \ge \tau - \hat{\kappa}_{\tau,t} \|w^*_t - w_t\|_M$, and the last inequality is due to Lemma 2. The rest of the proof follows the standard methods used in the Gaussian process literature together with an auxiliary concentration lemma.

Our analysis uses the fact that under Assumption 1, both the fragility $\kappa_{\tau,t}$ and the estimated fragility $\hat{\kappa}_{\tau,t}$ are finite. We also note that when $\tau > \langle w_t, f_{\hat{x}_t} \rangle$ and $P_t = P^*_t$, then $\kappa_{\tau,t} = \infty$, and the instantaneous regret is $(\tau - \kappa_{\tau,t}\, d(P^*_t, P_t) - \mathbb{E}_{c \sim P^*_t}[f(x_t, c)])_+ = 0$.

Setting $\beta_T$ as in Lemma 1, Theorem 3 gives a regret bound of $\mathcal{O}(\gamma_T \sqrt{T})$. The next corollary uses the bounds on $\gamma_T$ from [36] to bound the regret for known kernel families.

Corollary 1. When the kernel $k(z, z')$ is either the Matérn-$\nu$ kernel or the squared exponential kernel, the robust satisficing regret of RoBOS on an input domain of dimension $d$ is bounded (up to polylogarithmic factors) by $\mathcal{O}\big(T^{\frac{2\nu + 3d}{4\nu + 2d}}\big)$ and $\mathcal{O}(\sqrt{T})$, respectively.
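As a quick sanity check of the Matérn rate in Corollary 1, the exponent $(2\nu + 3d)/(4\nu + 2d)$ is strictly less than 1, i.e., the bound is sublinear in $T$, exactly when $d < 2\nu$:

```python
# Regret-rate exponent of Corollary 1 for the Matérn-nu kernel in dimension d.
for nu in (0.5, 1.5, 2.5):
    for d in (1, 2, 3, 4):
        exponent = (2 * nu + 3 * d) / (4 * nu + 2 * d)
        print(f"nu={nu}, d={d}: T^{exponent:.3f}", "(sublinear)" if d < 2 * nu else "(not sublinear)")
```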
Next, we analyze the lenient regret of RoBOS. As discussed in Section 2, the lenient regret is a stronger regret measure under which even $x^*_t$ can suffer linear regret. Therefore, our lenient regret bound depends on the amount of distribution shift $\epsilon_t := \|w^*_t - w_t\|_M$, $t \in [T]$. We highlight that the regret analysis of RoBOS is different from that of DRBO in [10], which is done under the assumption that $\|w^*_t - w_t\|_M \le \epsilon'_t$ for some $\epsilon'_t \ge 0$, $t \in [T]$. The major difference is that DRBO requires $(\epsilon'_t)_{t \in [T]}$ as input, while RoBOS does not require $(\epsilon_t)_{t \in [T]}$ as input. Also note that $\epsilon_t \le \epsilon'_t$. Before stating the lenient regret bound, we let $\bar{B} := \max_x \|f_x\|_{M^{-1}}$. It is known that $\bar{B} \le B \sqrt{\lambda_{\max}(M^{-1})\, n}$, where $\lambda_{\max}(M^{-1})$ represents the largest eigenvalue of $M^{-1}$ and $B$ represents an upper bound on the RKHS norm of $f$ [10].

Theorem 4. Fix $\delta \in (0, 1)$. Under Assumption 1, when RoBOS is run with confidence parameter $\delta/2$ and $\beta_t := \beta_t(\delta/2)$, where $\beta_t$ is defined in Lemma 1, then with probability at least $1 - \delta$, the lenient regret $R^{l}_T$ of RoBOS is upper-bounded as

$$R^{l}_T = \mathcal{O}\!\left( \beta_T \sqrt{T \left( 2\gamma_T + 2 \log\frac{12}{\delta} \right)} + \bar{B} \sum_{t=1}^{T} \epsilon_t \right).$$

Proof Sketch: When $\langle w_t, \mathrm{ucb}^t_x \rangle \ge \tau$, let $w^t_x := \arg\max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, \mathrm{ucb}^t_x \rangle}{\|w - w_t\|_M}$ be the context distribution that achieves $\hat{\kappa}_{\tau,t}(x)$. We have

$$r^{l}_t := \tau - \langle w^*_t, f_{x_t} \rangle = \tau - \langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle + \langle w^*_t, \mathrm{ucb}^t_{x_t} - f_{x_t} \rangle \le \tau - \langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \,, \quad (10)$$

where (10) follows from the confidence bounds in Lemma 1. Consider $w^*_t = w_t$. By Assumption 1 and the selection rule of RoBOS (i.e., $\langle w_t, \mathrm{ucb}^t_{x_t} \rangle \ge \tau$), we obtain $r^{l}_t \le \tau - \langle w_t, \mathrm{ucb}^t_{x_t} \rangle + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \le 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle$. Consider $w^*_t \ne w_t$. Continuing from (10),

$$r^{l}_t \le \|w^*_t - w_t\|_M \, \frac{\tau - \langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle}{\|w^*_t - w_t\|_M} + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle$$
$$\le \|w^*_t - w_t\|_M \, \hat{\kappa}_{\tau,t}(x_t) + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (11)$$
$$\le \|w^*_t - w_t\|_M \, \hat{\kappa}_{\tau,t}(\hat{x}_t) + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (12)$$
$$\le \|w^*_t - w_t\|_M \, \frac{\langle w_t, f_{\hat{x}_t} \rangle - \langle w^t_{\hat{x}_t}, \mathrm{ucb}^t_{\hat{x}_t} \rangle}{\|w^t_{\hat{x}_t} - w_t\|_M} + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (13)$$
$$\le \|w^*_t - w_t\|_M \, \|f_{\hat{x}_t}\|_{M^{-1}} + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (14)$$
$$\le \epsilon_t \bar{B} + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \,, \quad (15)$$

where (11) comes from the definition of $\hat{\kappa}_{\tau,t}(x_t)$; (12) follows from $x_t = \arg\min_{x \in \mathcal{X}} \hat{\kappa}_{\tau,t}(x)$; (13) results from the assumption on $\tau$ in Assumption 1; (14) utilizes Lemma 1 and the Cauchy-Schwarz inequality; and finally (15) follows from the definition of $\epsilon_t$. The remainder of the proof follows similar arguments as in the proof of Theorem 3.

Theorem 4 shows that the lenient regret scales as $\mathcal{O}(\gamma_T \sqrt{T} + E_T)$, where $E_T := \sum_{t=1}^{T} \epsilon_t$. An important special case in which $E_T$ is sublinear is data-driven optimization. For this case, $P^*_t = P^*$ is fixed for $t \in [T]$, $P_1$ is the uniform distribution over the contexts, and $P_t = \frac{1}{t-1} \sum_{s=1}^{t-1} \delta_{c_s}$, $t > 1$, is the empirical distribution of the observed contexts, where $\delta_c$ is the Dirac measure defined for $c \in \mathcal{C}$ such that $\delta_c(A) = 1$ if $c \in A$ and $0$ otherwise. The next result follows from steps similar to the proof of [10, Corollary 4].

Lemma 5. Consider the data-driven optimization model with $\delta \in (0, 1)$. Under Assumption 1, when RoBOS is run with confidence parameter $\delta/3$ and $\beta_t := \beta_t(\delta/3)$, where $\beta_t$ is defined in Lemma 1, then with probability at least $1 - \delta$,

$$R^{l}_T = \mathcal{O}\!\left( \beta_T \sqrt{T \gamma_T} + \bar{B}\, \epsilon_1 + \bar{B} \sqrt{T \log\frac{T}{\delta}} \right).$$

5 Experiments

We evaluate the proposed algorithm on one synthetic and one real-world environment. We compare the lenient regret and robust satisficing regret of RoBOS with the following benchmark algorithms; a sketch of their selection rules is given below. Our implementation is available at http://github.com/Bilkent-CYBORG/RoBOS.

SO: As the representative of SO, we use a stochastic version of the GP-UCB algorithm that samples at each round the point $x_t = \arg\max_{x \in \mathcal{X}} \mathbb{E}_{c \sim P_t}[\mathrm{ucb}_t(x, c)]$.

DRBO: For the DRO approach, we consider the DRBO algorithm from [10]. DRBO samples at each round $x_t = \arg\max_{x \in \mathcal{X}} \inf_{P \in \mathcal{U}_t} \mathbb{E}_{c \sim P}[\mathrm{ucb}_t(x, c)]$.

WRBO: For the WRO approach, we consider a maxi-min algorithm we call WRBO, which maximizes the worst-case reward over the context set. Note that when the ambiguity set $\mathcal{U}_t$ of DRBO is $\mathcal{P}_0$, i.e., the set of all possible distributions on $\mathcal{C}$, DRBO reduces to WRBO, which samples at each round the point $x_t = \arg\max_{x \in \mathcal{X}} \min_{c \in \mathcal{C}} \mathrm{ucb}_t(x, c)$.

Synthetic environment. The objective function is given in Figure 1 (left). The reference distribution and the true distribution are chosen to be stationary with $P_t = \mathcal{N}(2, 1)$ and $P^*_t = \mathcal{N}(0, 25)$. We denote the true MMD distance $d(P_t, P^*_t)$ by $\epsilon$, and run different instances of DRBO with ambiguity balls of radius $r = 3\epsilon$, $r = \epsilon$, and $r = \epsilon/3$. When the radius of the ambiguity ball is $\epsilon$, DRBO knows the exact distributional shift. The threshold $\tau$ is set to 60% of the maximum value of the objective function, i.e., $\tau = 0.6 \max_{(x, c) \in \mathcal{X} \times \mathcal{C}} f(x, c) \approx 0.65\, Z_t$. We use an RBF kernel with Automatic Relevance Determination (ARD) and lengthscales 0.2 and 5 for dimensions $\mathcal{X}$ and $\mathcal{C}$, respectively. Simulations are run with observation noise $\eta_t \sim \mathcal{N}(0, 0.02^2)$. Results for the robust satisficing and lenient regrets are given in Figure 2.
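The following is a minimal sketch of the three baseline selection rules listed above, all operating on the same UCB matrix. DRBO's inner infimum over the MMD ball is approximated here by rejection-sampled candidate distributions, which is only a rough stand-in for the exact inner problem solved in [10]; all names and numbers are illustrative.

```python
# Sketches of the baseline acquisition rules: SO, WRBO, and (approximate) DRBO.
import numpy as np

def so_select(UCB, w_ref):
    return int(np.argmax(UCB @ w_ref))                 # maximize E_{c ~ P_t}[ucb_t(x, c)]

def wrbo_select(UCB):
    return int(np.argmax(UCB.min(axis=1)))             # maximize the worst-case context reward

def drbo_select(UCB, w_ref, M, r, n_samples=4000, rng=None):
    rng = rng or np.random.default_rng(0)
    W = rng.dirichlet(np.ones(UCB.shape[1]), size=n_samples)
    D = W - w_ref
    inside = np.sqrt(np.maximum(np.einsum("ij,jk,ik->i", D, M, D), 0.0)) <= r
    W = np.vstack([w_ref, W[inside]])                  # candidate distributions in the ambiguity ball
    return int(np.argmax((UCB @ W.T).min(axis=1)))     # maximize the worst case over the ball

# Toy usage with a hypothetical UCB matrix (rows: actions, columns: contexts).
UCB = np.array([[2.1, 1.2, -0.8], [1.6, 1.6, 0.6], [0.6, 1.0, 1.2]])
M = np.array([[1.0, 0.6, 0.1], [0.6, 1.0, 0.6], [0.1, 0.6, 1.0]])
w_ref = np.array([0.2, 0.5, 0.3])
print(so_select(UCB, w_ref), wrbo_select(UCB), drbo_select(UCB, w_ref, M, r=0.3))
```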
Under this configuration, the RS solution, as noted in Figure 1, is the only solution that achieves the desired threshold $\tau$. Hence, algorithms that converge to a different solution accumulate linear lenient regret, as can be seen in Figure 2 (right). When DRBO is run with an overconfident ambiguity set (radius $\epsilon/3$), it converges to a suboptimal solution (hexagon); when the ambiguity set is underconfident (radius $3\epsilon$), it converges to another suboptimal solution (cross). Only when the distribution shift is precisely known and the ambiguity set is adjusted accordingly (radius $\epsilon$) does DRBO converge to the RS solution. Refer to Figure 1 to see the exact solutions converged to.

Figure 2: Results for the synthetic environment, comparing DRBO ($3\epsilon$, $\epsilon$, $\epsilon/3$), WRBO, SO, and RoBOS over 200 rounds. Plots show robust satisficing regret (left) and lenient regret (right) averaged over 50 independent runs with error bars corresponding to standard deviations divided by 2.

Figure 3: Average reward for different $\tau$ and $r$ values in RoBOS and DRBO. $\tau$ is selected linearly between the minimum and maximum function values. The radius $r$ of the ambiguity ball for DRBO is selected from $0.1\epsilon$ to $10\epsilon$. Plots show the average of 10 independent runs, each with 200 rounds.

We also investigate the sensitivity of the choice of $\tau$ in RoBOS compared to the choice of $r$ in DRBO. Figure 3 compares the average rewards of RoBOS and DRBO for a wide range of $\tau$ and $r$ values. While RoBOS is designed to satisfice the reward rather than to maximize it, its performance remains competitive with DRBO across diverse hyperparameter settings. Notably, for $\tau \in [0.1, 0.3]$, RoBOS opts for the solution indicated by a cross in Figure 1, signifying a trade-off: with a smaller $\tau$, RoBOS prioritizes robustness guarantees over maximizing the reward.

Insulin dosage for T1DM. We test our algorithm on the problem of personalized insulin dose allocation for Type 1 Diabetes Mellitus (T1DM) patients. We use the open-source implementation of the U.S. FDA-approved University of Virginia (UVA)/PADOVA T1DM simulator [37, 5]. The simulator takes as input the fasting blood glucose level of the patient, the amount of carbohydrate intake during the monitored meal, and the insulin dosage given to the patient; it outputs the blood glucose level measured 150 minutes after the meal. We assume the insulin is administered to the patient right after the meal. Similar to [5], we set the target blood glucose level as $K = 112.5$ mg/dl and define the pseudo-reward function as $r(t) = -|o(t) - K|$, where $o(t)$ is the achieved blood glucose level at round $t$. We further define a safe blood glucose level range of 102.5-122.5 mg/dl. For our setup, this corresponds to setting the threshold $\tau = -10$. At each round $t$, the environment picks the reference and true distributions as $P_t = \mathcal{N}(\zeta_t, 2.25)$ and $P^*_t = \mathcal{N}(\zeta_t + N, 9)$, where $\zeta_t \sim \mathcal{U}(20, 80)$ and $N$ is the random term setting the distributional shift. We define the action set $\mathcal{X}$ to be the insulin dose and the context set $\mathcal{C}$ to be the amount of carbohydrate intake. Here, the distributional shift can be interpreted as the estimation error of the patient on their carbohydrate intake. We ran our experiments with $N \sim \mathcal{U}(-6, 6)$ and with $N_t \sim \mathcal{U}(-6/\log(t+2),\ 6/\log(t+2))$. Simulations are run with observation noise $\eta_t \sim \mathcal{N}(0, 1)$. For the GP surrogate, we used a Matérn-$\nu$ kernel with length-scale parameter 10.
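In both experiments, the reference and true context distributions are Gaussians over a discretized context grid, and the MMD shift $\epsilon_t = \|w_t - w^*_t\|_M$ drives the DRBO ambiguity radii. Below is a sketch of how such discretized distributions and the resulting shift can be computed; the grid, kernel length-scale, and shift model are illustrative stand-ins for the setups above.

```python
# Sketch: discretize Gaussian reference/true context distributions and compute eps_t.
import numpy as np
from scipy.stats import norm

contexts = np.linspace(20.0, 80.0, 61)                       # e.g., a carbohydrate grid (hypothetical)
M = np.exp(-0.5 * (contexts[:, None] - contexts[None, :]) ** 2 / 10.0 ** 2)  # context kernel matrix

def discretize(mean, var):
    w = norm.pdf(contexts, loc=mean, scale=np.sqrt(var))
    return w / w.sum()                                        # probability vector over the finite grid

rng = np.random.default_rng(0)
for t in (1, 10, 100):
    zeta = rng.uniform(20, 80)                                # reported carbohydrate intake
    shift = rng.uniform(-6, 6) / np.log(t + 2)                # decaying estimation error (as in Fig. 4b)
    w_ref, w_true = discretize(zeta, 2.25), discretize(zeta + shift, 9.0)
    d = w_ref - w_true
    print(t, np.sqrt(d @ M @ d))                              # eps_t, the true MMD shift at round t
```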
As seen in Figures 4a and 4b, when the amount of distribution shift at each round is known exactly by DRBO, it can perform better than RoBOS. However, when the distributional shift is either underestimated or overestimated, RoBOS achieves better results. Plots of the average cumulative rewards of the algorithms can be found in the appendix.

Figure 4: Results for the insulin dose allocation simulation, where $\epsilon_t$ is the true MMD distance between the true and reference distributions at round $t$, and the threshold $\tau = -10$ specifies a satisficing level of at most 10 mg/dl away from the target blood glucose level. Each panel compares DRBO ($3\epsilon$, $\epsilon$, $\epsilon/3$), WRBO, SO, and RoBOS over 200 rounds. In (b), the true MMD distance $\epsilon_t \approx \epsilon_0 / \log(t)$ decays with $t$. Plots show robust satisficing regret (left) and lenient regret (right) averaged over 50 independent runs with error bars corresponding to standard deviations divided by 2.

6 Conclusion, limitations, and future research

We introduced robust Bayesian satisficing as a new sequential decision-making paradigm that offers a satisfactory level of protection against incalculable distribution shifts. We introduced the first RBS algorithm, RoBOS, and proved information-gain-based lenient and robust satisficing regret bounds. In our experiments, we observed that when the range of the distribution shift can be correctly estimated with a tight uncertainty set $\mathcal{U}_t$ centered at the reference distribution, DRBO [10] can perform better than RoBOS, especially when it comes to maximizing total reward. This is not unexpected, since one can guarantee better performance with more information about the distribution shift. Nevertheless, in cases when the estimated shift does not adequately represent the true shift, or when the main objective is to achieve a desired value instead of maximizing the reward, RoBOS emerges as a robust alternative.

Our fundamental research on RBS brings forth many interesting future research directions. One potential direction is to extend RoBOS to work in continuous action and context spaces. This will require a more nuanced regret analysis and computationally efficient procedures to calculate the estimated fragility at each round. Another interesting direction for future work is to design alternative acquisition strategies for RBS. For instance, one can investigate a Thompson sampling based approach instead of the UCB approach we pursued in this work.

Acknowledgements: Y. Cahit Yıldırım was supported by Turk Telekom as part of the 5G and Beyond Joint Graduate Support Programme coordinated by the Information and Communication Technologies Authority.

References

[1] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012.
[2] Roman Garnett. Bayesian Optimization. Cambridge University Press, 2023.
[3] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, 2012.
[4] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, 2011.
[5] Ilker Demirel, Ahmet Alparslan Celik, and Cem Tekin. ESCADA: Efficient safety and context aware dose allocation for precision medicine.
In Advances in Neural Information Processing Systems, volume 35, pages 27441-27454, 2022.
[6] Andreas Krause and Cheng Soon Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, volume 24, 2011.
[7] Rafael Oliveira, Lionel Ott, and Fabio Ramos. Bayesian optimisation under uncertain inputs. In 22nd International Conference on Artificial Intelligence and Statistics, pages 1177-1184, 2019.
[8] Johannes Kirschner and Andreas Krause. Stochastic bandits with context distributions. In Advances in Neural Information Processing Systems, volume 32, 2019.
[9] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems, volume 31, 2018.
[10] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 2174-2184, 2020.
[11] Herbert A. Simon. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99-118, 1955.
[12] James C. T. Mao. Survey of capital budgeting: Theory and practice. Journal of Finance, pages 349-360, 1970.
[13] Robert Bordley and Marco LiCalzi. Decision analysis using targets instead of utility functions. Decisions in Economics and Finance, 23(1):53-74, 2000.
[14] Terry M. Moe. The new economics of organization. American Journal of Political Science, pages 739-777, 1984.
[15] Sidney G. Winter. The satisficing principle in capability learning. Strategic Management Journal, 21(10-11):981-996, 2000.
[16] Barry Schwartz, Andrew Ward, John Monterosso, Sonja Lyubomirsky, Katherine White, and Darrin R. Lehman. Maximizing versus satisficing: Happiness is a matter of choice. Journal of Personality and Social Psychology, 83(5):1178, 2002.
[17] Hirotaka Nakayama and Yoshikazu Sawaragi. Satisficing trade-off method for multiobjective programming. In Interactive Decision Analysis, pages 113-122, 1984.
[18] Michael A. Goodrich, Wynn C. Stirling, and Richard L. Frost. A theory of satisficing decisions and control. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 28(6):763-779, 1998.
[19] Paul Reverdy, Vaibhav Srivastava, and Naomi Ehrich Leonard. Satisficing in multi-armed bandit problems. IEEE Transactions on Automatic Control, 62(8):3788-3803, 2016.
[20] Alihan Hüyük and Cem Tekin. Multi-objective multi-armed bandit with lexicographically ordered and satisficing objectives. Machine Learning, 110(6):1233-1266, 2021.
[21] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. Mathematics of Operations Research, 47(4):2815-2839, 2022.
[22] Nadav Merlis and Shie Mannor. Lenient regret for multi-armed bandits. In AAAI Conference on Artificial Intelligence, volume 35, pages 8950-8957, 2021.
[23] Xu Cai, Selwyn Gomes, and Jonathan Scarlett. Lenient regret and good-action identification in Gaussian process bandits. In International Conference on Machine Learning, pages 1183-1192, 2021.
[24] Barry Schwartz, Yakov Ben-Haim, and Cliff Dacso. What makes a good decision? Robust satisficing as a normative standard of rational decision making. Journal for the Theory of Social Behaviour, 41(2):209-227, 2011.
[25] Daniel Zhuoyu Long, Melvyn Sim, and Minglong Zhou. Robust satisficing. Operations Research, 71(1):61-82, 2023.
[26] Anton J. Kleywegt, Alexander Shapiro, and Tito Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479-502, 2002.
[27] Herbert E. Scarf. A min-max solution of an inventory problem. Rand Corporation, Santa Monica, 1957.
[28] Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
[29] Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Robust optimization for unconstrained simulation-based problems. Operations Research, 58(1):161-178, 2010.
[30] Herbert A. Simon. Rational choice and the structure of the environment. Psychological Review, 63(2):129, 1956.
[31] John Walsh, Ruth Roberts, and Timothy Bailey. Guidelines for optimal bolus calculator settings in adults. Journal of Diabetes Science and Technology, 5(1):129-135, 2011.
[32] Tomoyuki Kawamura, Chihiro Takamura, Masakazu Hirose, Tomomi Hashimoto, Takashi Higashide, Yoneo Kashihara, Kayako Hashimura, and Haruo Shintaku. The factors affecting the estimation of carbohydrate content of meals in carbohydrate counting. Clinical Pediatric Endocrinology, 24(4):153-165, 2015.
[33] Lindy Kahanovitz, Patrick M. Sluss, and Steven J. Russell. Type 1 diabetes - a clinical perspective. Point of Care, 16(1):37, 2017.
[34] Christian Fiedler, Carsten W. Scherer, and Sebastian Trimpe. Practical and rigorous uncertainty bounds for Gaussian process regression. In AAAI Conference on Artificial Intelligence, volume 35, pages 7439-7447, 2021.
[35] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844-853, 2017.
[36] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82-90, 2021.
[37] Boris P. Kovatchev, Marc Breton, Chiara Dalla Man, and Claudio Cobelli. In silico preclinical trials: A proof of concept in closed-loop control of type 1 diabetes, 2009.
[38] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.

A Table of notations

Table 2: Table of notations.
$\tau$ : aspiration level (threshold)
$\|w\|_M$ : $\sqrt{w^{\mathsf{T}} M w}$, the MMD measure with kernel matrix $M$
$\mathcal{X}$ : action set
$\mathcal{C}$ : context set
$w^*_t$ : true distribution at round $t$
$w_t$ : reference distribution at round $t$
$f_x$ : $[f(x, c_1), \ldots, f(x, c_n)]^{\mathsf{T}}$
$\mathrm{ucb}^t_x$ : upper confidence bound of $f_x$, given by $[\mathrm{ucb}_t(x, c_1), \ldots, \mathrm{ucb}_t(x, c_n)]^{\mathsf{T}}$
$\kappa_{\tau,t}(x)$ : $\max\big\{ \max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, f_x \rangle}{\|w - w_t\|_M},\ 0 \big\}$ if $\langle w_t, f_x \rangle \ge \tau$; $+\infty$ if $\langle w_t, f_x \rangle < \tau$
$\hat{\kappa}_{\tau,t}(x)$ : $\max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, \mathrm{ucb}^t_x \rangle}{\|w - w_t\|_M}$ if $\langle w_t, \mathrm{ucb}^t_x \rangle \ge \tau$; $+\infty$ if $\langle w_t, \mathrm{ucb}^t_x \rangle < \tau$
$\hat{x}_t$ : $\arg\max_{x \in \mathcal{X}} \langle w_t, f_x \rangle$
$x^*_t$ : $\arg\min_{x \in \mathcal{X}} \kappa_{\tau,t}(x)$ when $\langle w_t, f_{\hat{x}_t} \rangle \ge \tau$
$x_t$ : $\arg\min_{x \in \mathcal{X}} \hat{\kappa}_{\tau,t}(x)$ when $\langle w_t, f_{\hat{x}_t} \rangle \ge \tau$
$\kappa_{\tau,t}$ : $\kappa_{\tau,t}(x^*_t)$ when $\langle w_t, f_{\hat{x}_t} \rangle \ge \tau$
$\hat{\kappa}_{\tau,t}$ : $\hat{\kappa}_{\tau,t}(x_t)$ when $\langle w_t, f_{\hat{x}_t} \rangle \ge \tau$
$\bar{w}^t_x$ : $\arg\max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, f_x \rangle}{\|w - w_t\|_M}$ when $\langle w_t, f_x \rangle \ge \tau$
$w^t_x$ : $\arg\max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, \mathrm{ucb}^t_x \rangle}{\|w - w_t\|_M}$ when $\langle w_t, \mathrm{ucb}^t_x \rangle \ge \tau$

B Additional experimental results

Figure 5: Cumulative rewards of the insulin dosage for T1DM experiments given in the main paper. Left and right plots are continuations of Figures 4a and 4b, respectively, comparing DRBO ($3\epsilon$, $\epsilon$, $\epsilon/3$), WRBO, SO, and RoBOS over 200 rounds. Plots show the average of 50 independent runs with error bars corresponding to standard deviation divided by 2. The pseudo-reward function is defined as $r(t) = -|o(t) - K|$.

C.1 Proof of Lemma 2

Assume that the confidence intervals in Lemma 1 hold. When $\hat{\kappa}_{\tau,t}(x) = \infty$, we also have $\kappa_{\tau,t}(x) = \infty$.
When $\hat{\kappa}_{\tau,t}(x) < \infty$ and $\kappa_{\tau,t}(x) = \infty$, the inequality trivially holds. Recall that when $\langle w_t, \mathrm{ucb}^t_x \rangle \ge \tau$, $w^t_x := \arg\max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, \mathrm{ucb}^t_x \rangle}{\|w - w_t\|_M}$ represents the context distribution that achieves $\hat{\kappa}_{\tau,t}(x)$. When $\langle w_t, f_x \rangle \ge \tau$, let $\bar{w}^t_x := \arg\max_{w \in \Delta(\mathcal{C}) \setminus \{w_t\}} \frac{\tau - \langle w, f_x \rangle}{\|w - w_t\|_M}$ be the context distribution that achieves $\kappa_{\tau,t}(x)$. When $\hat{\kappa}_{\tau,t}(x) < \infty$ and $\kappa_{\tau,t}(x) < \infty$, we have

$$\hat{\kappa}_{\tau,t}(x) - \kappa_{\tau,t}(x) \le \frac{\tau - \langle w^t_x, \mathrm{ucb}^t_x \rangle}{\|w^t_x - w_t\|_M} - \frac{\tau - \langle \bar{w}^t_x, f_x \rangle}{\|\bar{w}^t_x - w_t\|_M} \le \frac{\tau - \langle w^t_x, \mathrm{ucb}^t_x \rangle}{\|w^t_x - w_t\|_M} - \frac{\tau - \langle w^t_x, f_x \rangle}{\|w^t_x - w_t\|_M} = \frac{\langle w^t_x, f_x - \mathrm{ucb}^t_x \rangle}{\|w^t_x - w_t\|_M} \le 0 \,.$$

C.2 An auxiliary concentration lemma

Lemma 6. ([10, Lemma 7]) Let $S_t \ge 0$ be a non-negative stochastic process with filtration $\mathcal{F}_t$, and define $m_t = \mathbb{E}[S_t \,|\, \mathcal{F}_{t-1}]$. Further assume that $S_t \le B$ for $B \ge 1$. Then for any $T \ge 1$, with probability at least $1 - \delta$ it holds that

$$\sum_{t=1}^{T} m_t \le 2 \sum_{t=1}^{T} S_t + 4B \log\frac{1}{\delta} + 8B \log(4B) + 1 \le 2 \sum_{t=1}^{T} S_t + 8B \log\frac{6B}{\delta} \,.$$

C.3 Proof of Theorem 3

The robust satisficing regret is upper-bounded by the sum of the instantaneous regrets $r^{rs}_t := \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \langle w^*_t, f_{x_t} \rangle$ as follows:

$$R^{rs}_T = \sum_{t=1}^{T} \big( \tau - \kappa_{\tau,t}\, d(P^*_t, P_t) - \mathbb{E}_{c \sim P^*_t}[f(x_t, c)] \big)_+ = \sum_{t=1}^{T} \big( \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \langle w^*_t, f_{x_t} \rangle \big)_+ = \sum_{t=1}^{T} r^{rs}_t\, \mathbb{I}(r^{rs}_t \ge 0) \,. \quad (16)$$

Assume that the confidence intervals in Lemma 1 hold (an event that happens with probability at least $1 - \delta/2$). For the instantaneous regret we have

$$r^{rs}_t = \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle + \langle w^*_t, \mathrm{ucb}^t_{x_t} - f_{x_t} \rangle$$
$$\le \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (17)$$
$$\le \tau - \kappa_{\tau,t} \|w^*_t - w_t\|_M - \big( \tau - \hat{\kappa}_{\tau,t} \|w^*_t - w_t\|_M \big) + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (18)$$
$$= \|w^*_t - w_t\|_M (\hat{\kappa}_{\tau,t} - \kappa_{\tau,t}) + 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \le 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \,. \quad (19)$$

In the derivation above, (17) follows from the confidence bounds given in Lemma 1; (18) follows from $\langle w^*_t, \mathrm{ucb}^t_{x_t} \rangle \ge \tau - \hat{\kappa}_{\tau,t}(x_t) \|w^*_t - w_t\|_M$ and $\hat{\kappa}_{\tau,t} = \hat{\kappa}_{\tau,t}(x_t)$; (19) is due to Lemma 2. From this point on, we bound the robust satisficing regret by following standard steps for bounding the regret of GP bandits (see, e.g., [10]). First, by plugging the upper bound on $r^{rs}_t$ obtained in (19) into (16), and using the monotonicity of $\beta_t$, we obtain

$$R^{rs}_T \le 2\beta_T \sum_{t=1}^{T} \langle w^*_t, \sigma_t(x_t, \cdot) \rangle \quad (20)$$
$$\le 2\beta_T \sqrt{T \sum_{t=1}^{T} \langle w^*_t, \sigma_t(x_t, \cdot) \rangle^2} \quad (21)$$
$$\le 2\beta_T \sqrt{T \sum_{t=1}^{T} \langle w^*_t, \sigma_t(x_t, \cdot)^2 \rangle} \,, \quad (22)$$

where (21) uses the Cauchy-Schwarz inequality and (22) uses Jensen's inequality. To complete the proof, we use Lemma 6 to relate the expectation of the posterior variance in (22) to the posterior variance of the observations. Namely, we have with probability at least $1 - \delta/2$

$$\sum_{t=1}^{T} \langle w^*_t, \sigma_t(x_t, \cdot)^2 \rangle \le 2 \sum_{t=1}^{T} \sigma_t(x_t, c_t)^2 + 8 \log\frac{12}{\delta} \,. \quad (23)$$

Next, we relate the sum of variances that appears in (23) to the maximum information gain. We use $x \le 2 \log(1 + x)$ for all $x \in [0, 1]$ to obtain

$$\sum_{t=1}^{T} \sigma_t(x_t, c_t)^2 \le \sum_{t=1}^{T} 2 \log(1 + \sigma_t(x_t, c_t)^2) \le 4\gamma_T \,, \quad (24)$$

where (24) follows from [35, Lemma 3]. Finally, to bound the robust satisficing regret, we plug the upper bound in (24) into the r.h.s. of (23), and use it to upper-bound (22), which yields the bound stated in Theorem 3. Let $A$ and $B$ be the events that Lemma 1 and Lemma 6 hold, respectively. Setting the confidence of each event to $1 - \delta/2$, we can bound from below the probability that both events hold:

$$\mathbb{P}(A \cap B) = 1 - \mathbb{P}(\bar{A} \cup \bar{B}) \ge 1 - \mathbb{P}(\bar{A}) - \mathbb{P}(\bar{B}) = 1 - \delta \,,$$

completing the proof of Theorem 3.
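The step leading to (24) can be checked numerically: for any sequence of sampled points (with $\lambda = 1$), the sum of posterior variances is at most twice the sum of $\log(1 + \sigma_t^2)$ terms, and the latter sum equals $\log\det(I_T + K_T)$ for that particular point set, which is at most $2\gamma_T$ since $\gamma_T$ in (6) maximizes over all sets of size $T$. A small sketch with an assumed RBF kernel and random sampling:

```python
# Numerical check: sum_t sigma_t^2(z_t) <= 2 sum_t log(1 + sigma_t^2(z_t)) = 2 log det(I + K_T)
# for the sampled set (lambda = 1); the true gamma_T of Eq. (6) is at least half that log-det.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, size=(30, 2))                      # 30 sequentially sampled points (illustrative)
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :]) ** 2, axis=-1) / 0.3 ** 2)

sum_var, sum_log = 0.0, 0.0
for t in range(len(Z)):
    Kprev = k(Z[:t], Z[:t]) + np.eye(t)                   # K_{t-1} + I
    kz = k(Z[t:t + 1], Z[:t]).ravel()
    var = 1.0 - kz @ np.linalg.solve(Kprev, kz) if t else 1.0
    sum_var += var
    sum_log += np.log1p(var)

half_logdet = 0.5 * np.linalg.slogdet(np.eye(len(Z)) + k(Z, Z))[1]
print(sum_var, 2 * sum_log, 4 * half_logdet)              # first <= second; second equals third up to rounding
```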
When w t = wt, continuing from (26), we obtain rl t w t wt M τ w t , ucbt xt w t wt M + 2βt w t , σt(xt, ) w t wt M τ wt xt, ucbt xt wtxt wt M + 2βt w t , σt(xt, ) (27) w t wt M τ wt ˆxt, ucbt ˆxt wt ˆxt wt M + 2βt w t , σt(xt, ) (28) w t wt M wt, fˆxt wt ˆxt, ucbt ˆxt wt ˆxt wt M + 2βt w t , σt(xt, ) (29) w t wt M wt wt ˆxt, fˆxt wt ˆxt wt M + 2βt w t , σt(xt, ) (30) w t wt M wt ˆxt wt M fˆxt M 1 wt ˆxt wt M + 2βt w t , σt(xt, ) (31) = w t wt M fˆxt M 1 + 2βt w t , σt(xt, ) 2βt w t , σt(xt, ) + ϵt B . (32) Since the final bound is non-negative, it can be used to bound all terms rl t I(rl t 0) in Rl T . In the derivation above, (26) follows from Lemma 1; (27) uses the fact that wt xt is the distribution that achieves ˆκτ,t(xt); (28) uses the fact that xt minimizes ˆκτ,t(x), (29) uses Assumption 1, (30) again uses Lemma 1, (31) follows from the Cauchy-Schwarz inequality; and (32) follows from the facts ||w t wt||M = ϵt and B = maxx fx M 1. From this point on, the regret bound follows the same arguments as in the proof of Theorem 3. In particular, using (32), we write t=1 w t , σt(xt, ) + B T X t=1 ϵt . (33) We complete the proof by continuing from (20), by using the same steps as in the proof of Theorem 3 starting from (20), which leads to the regret bound in the statement of Theorem 4. C.5 Proof of Lemma 5 The first term of Lemma 5 directly follows from Theorem 4. For the second term, we use the result of [38, Theorem 3.4], which is restated below. [38, Theorem 3.4] Assume k(ci, cj) 1 for all ci, cj C. Then, with probability at least 1 δ d(P , ˆPt) 1 t 1(2 + p 2 log(1/δ)) . To use the lemma above, let δt = 2δ π2t2 . Then, d(P , ˆPt) > 1 t 1 Using a union bound, we obtain t [T] : d(P , ˆPt) > 1 t 1 2δ π2t2 = 2δ The inequality above implies that t [T] : d(P , ˆPt) > 1 t 1 Since ϵt 1 t 1 2δ , we have with probability at least 1 δ T 2γT + 2 log 12 T 2γT + 2 log 12 + B ϵ1 + B T X T 2γT + 2 log 12 + B ϵ1 + B T X 2 log π2T 2 T 2γT + 2 log 12 + B ϵ1 + B T 1 X 2 log π2T 2 T 2γT + 2 log 12 + B ϵ1 + B 2 2 log π2T 2 Also, note that ϵ1 = w w1 M = w 1 n M sup {w: w 1=1} w 1 n M = O(1) .