# Bayesian Regret Minimization in Offline Bandits

Marek Petrik 1, Guy Tennenholtz 2, Mohammad Ghavamzadeh 3

*Equal contribution. 1University of New Hampshire, 2Google Research, 3Amazon AGI. Correspondence to: Marek Petrik.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests that one must take actions with maximum lower confidence bound (LCB) on their reward. We argue that the reliance on LCB is inherently flawed in this setting and propose a new algorithm that directly minimizes upper bounds on the Bayesian regret using efficient conic optimization solvers. Our bounds build heavily on new connections to monetary risk measures. Proving a matching lower bound, we show that our upper bounds are tight, and by minimizing them we are guaranteed to outperform the LCB approach. Our numerical results on synthetic domains confirm that our approach is superior to LCB.

1. Introduction

The problem of offline bandits is an important special case of offline reinforcement learning (RL) in which the model consists of a single state and involves no state transitions (Hong et al., 2023). Offline RL, a challenging research problem with a rich history, is inspired by the need to make reliable decisions when learning from a logged dataset (Lange et al., 2012; Rashidinejad et al., 2022). Practical problems from recommendations to search to ranking can be modeled as offline bandits; see, for example, Hong et al. (2023) and references therein. Moreover, gaining a deeper theoretical understanding of offline bandits is a vital stepping stone toward understanding the complete offline RL problem.

We study the problem of minimizing the Bayesian regret in the offline linear bandit setting. Bayesian regret differs substantially from its frequentist counterpart. While frequentist regret assumes a fixed true model and studies an algorithm's response to random datasets, Bayesian regret assumes a fixed dataset and studies an algorithm's regret as a function of the true model. When provided with good priors, Bayesian methods offer sufficiently tight bounds to achieve excellent practical results (Lattimore & Szepesvari, 2018; Gelman et al., 2014; Vaart, 2000). As such, the strengths of Bayesian methods complement the scalability and simplicity of frequentist algorithms.

Most prior work on Bayesian offline RL and bandits has adopted a form of pessimism that chooses actions with the highest lower confidence bounds (LCBs). These LCB-style algorithms compute a policy or action with the largest expected return (or reward), penalized by its uncertainty. The uncertainty penalty is computed from credible regions derived from the posterior distribution (Delage & Mannor, 2010; Hong et al., 2023; Brown et al., 2020; Javed et al., 2021; Lobo et al., 2023), and often gives rise to some form of robust optimization (Behzadian et al., 2021; Petrik & Russel, 2019). LCB-style Bayesian algorithms are generally inspired by the success of this approach in frequentist settings, where the LCB is typically computed from concentration inequalities (Xie et al., 2021; Rashidinejad et al., 2022; Jin et al., 2022; Ghosh et al., 2022; Cheng et al., 2022).

In this paper, we propose a Bayesian regret minimization algorithm, called BRMOB, that takes a new approach to Bayesian offline bandits.
Instead of adopting an LCB-style strategy, we directly minimize new regret upper bounds. To derive these bounds, we reformulate the usual high-confidence objective as a Value-at-Risk (VaR) of the epistemic uncertainty. Then, we bound the VaR by combining techniques from robust optimization and Chernoff analysis. Our bounds apply to both Gaussian and sub-Gaussian posteriors over the latent reward parameter. BRMOB minimizes the regret bounds efficiently using convex conic solvers. Finally, we also establish a matching lower bound that shows our upper bounds are tight.

Compared with prior work in Bayesian offline bandits, BRMOB achieves tighter theoretical guarantees and better empirical performance. Two main innovations enable these improvements. First, BRMOB computes randomized policies. Our numerical results show that randomizing among actions results in hedging that can significantly reduce regret compared to deterministic policies. In contrast to BRMOB, most existing algorithms in Bayesian offline bandits (Hong et al., 2023) and Bayesian offline RL (Delage & Mannor, 2010; Petrik & Russel, 2019; Behzadian et al., 2021; Angelotti et al., 2021) are restricted to deterministic policies. Second, BRMOB is the only algorithm that explicitly minimizes Bayesian regret bounds. As discussed above, existing algorithms usually maximize the LCB on returns (Hong et al., 2023; Uehara & Sun, 2023), which is not guaranteed to reduce Bayesian regret. Similarly, recent algorithms that maximize the expected return (Steimle et al., 2021; Su & Petrik, 2023) are also not known to reduce Bayesian regret.

We also study the general suitability of LCB algorithms for minimizing Bayesian regret. While BRMOB significantly outperforms a particular LCB algorithm known as FlatOPO (Hong et al., 2023), the more critical question is whether the general LCB approach is viable for Bayesian regret minimization. Using our new regret lower bounds, we answer this question negatively. More precisely, we show that penalizing reward uncertainty, the core of all LCB algorithms, is guaranteed to increase the algorithm's regret even in very simple problems. This is because actions with high uncertainty may also have a high upside, and avoiding them increases regret. Therefore, we believe that explicit regret minimization, as in BRMOB, is a more promising future direction than LCB-style algorithms.

Bayesian regret minimization in offline bandits can also be framed as a chance-constrained optimization problem (Ben-Tal et al., 2009). The recent chance-constrained optimization literature is mostly focused on constraints in which the function is concave or linear in the uncertain parameter (Gupta, 2019; Bertsimas et al., 2021). However, the chance constraint in the Bayesian regret minimization problem is convex, preventing us from using these methods. Another approach to non-concave chance-constrained optimization is to resort to scenario-based or sample-based methods (Calafiore & Campi, 2005; Nemirovski & Shapiro, 2006; Luedtke & Ahmed, 2008; Brown et al., 2020). We briefly discuss these methods in Section 3.2. Such sample-based formulations are general, simple to implement, and work well in practice. However, they scale poorly to large problems, provide no theoretical insights, and struggle to compute randomized policies.
The closest result to our work is the Bernstein technique for bounding linear chance-constrained programs (Nemirovski & Shapiro, 2007; Pintér, 1989), which is a special case of one of our bounds.

The paper is organized as follows. After introducing our notation and the popular risk measure Value-at-Risk (VaR) in Section 2, we formally define the problem of Bayesian regret minimization in offline bandits and connect it to minimizing VaR in Section 3. In Section 4, we derive two new upper bounds on the Bayesian regret and propose our main algorithm, BRMOB, which is based on a simultaneous minimization of these two regret bounds. We also prove a lower bound on the regret that shows our upper bound is tight. In Section 5, we first derive a regret bound for BRMOB in terms of problem parameters and show that it compares favorably with LCB-based algorithms. We then argue that the general LCB approach is unsuitable for minimizing Bayesian regret. Finally, in Section 6, we compare BRMOB's performance with three baseline algorithms on synthetic domains and show that it is preferable to LCB-style algorithms.

2. Preliminaries

We begin by defining the notation we use throughout the paper. We use lower- and upper-case bold letters to denote vectors and matrices, such as x ∈ Rⁿ and A ∈ R^{n×n}, and normal font for the elements of vectors and matrices, e.g., x_i. We define the weighted ℓ2-norm for any vector x ∈ R^d and positive definite matrix A ∈ R^{d×d} as ‖x‖_A = √(xᵀAx). We denote by Δ_k, k ∈ N, the k-dimensional probability simplex, and by I, 0, 1, and 1_a the identity matrix, the zero vector, the all-ones vector, and the one-hot vector, all with appropriate dimensions. Random variables are adorned with a tilde and are not capitalized; for example, x̃ represents a vector-valued random variable. Finally, we denote by Ω the probability space of a random variable.

Suppose that x̃: Ω → R is a random variable that represents costs. Then, its value-at-risk (VaR) at a risk level α ∈ [0, 1) is usually defined as the largest lower bound on its α-quantile (e.g., Follmer & Schied 2016, definition 4.45 and remark A.20):
$$ \mathrm{VaR}_{\alpha}[\tilde{x}] \;=\; \inf\big\{t \in \mathbb{R} \,\big|\, \mathbb{P}[\tilde{x} > t] \le 1-\alpha\big\} \tag{1a} $$
$$ \phantom{\mathrm{VaR}_{\alpha}[\tilde{x}]} \;=\; \sup\big\{t \in \mathbb{R} \,\big|\, \mathbb{P}[\tilde{x} \ge t] > 1-\alpha\big\}. \tag{1b} $$
The exact definition of VaR in the literature depends on whether x̃ represents costs or rewards (Hau et al., 2023); when x̃ represents rewards, maximizing the corresponding reward quantile is equivalent to minimizing VaR_α[−x̃]. For Gaussian random variables, x̃ ∼ N(μ, σ²), VaR has the following analytical form (Follmer & Schied, 2016):
$$ \mathrm{VaR}_{\alpha}[\tilde{x}] \;=\; \mu + \sigma\, z_{\alpha}, \tag{2} $$
where z_α is the α-quantile of N(0, 1).

3. Bayesian Offline Bandits

In this section, we first formally define the problem of Bayesian regret minimization in offline bandits and connect it to monetary risk measures. We then describe two techniques that have been used in solving this problem.

3.1. Problem Definition

Consider a stochastic linear bandit problem with k ∈ N arms (actions) from the set A = {a₁, …, a_k}. Each arm a ∈ A is associated with a d-dimensional feature vector ϕ_a ∈ R^d, and its reward distribution has mean r(a; θ) = ϕ_aᵀθ for some unknown parameter θ ∈ R^d. We define the feature matrix Φ ∈ R^{d×k} as Φ = (ϕ_a)_{a∈A}. The goal of the agent is to learn a (possibly randomized) policy π ∈ Δ_k to choose its actions accordingly. We denote by π_a the probability with which policy π selects an action a ∈ A. The mean reward, or value, of a policy π is defined as
$$ r(\pi; \theta) \;=\; \mathbb{E}\big[r(\tilde{a}; \theta) \,\big|\, \tilde{a} \sim \pi\big] \;=\; \sum_{a\in\mathcal{A}} \pi_a\, r(a;\theta) \;=\; \pi^{\mathsf{T}}\Phi^{\mathsf{T}}\theta. \tag{3} $$
An optimal policy π*(θ) is one that maximizes (3).
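As a quick numerical sanity check on the closed form in (2) and the policy value in (3), the following minimal Python sketch (our own illustration, not code from the paper, with arbitrary toy parameters) compares the analytical Gaussian VaR with an empirical quantile and evaluates r(π; θ) = πᵀΦᵀθ on a small random instance.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# VaR of a Gaussian cost x ~ N(mu, sigma^2): VaR_alpha[x] = mu + sigma * z_alpha  (Eq. 2)
mu, sigma, alpha = 1.0, 2.0, 0.9
var_closed_form = mu + sigma * norm.ppf(alpha)
var_empirical = np.quantile(rng.normal(mu, sigma, size=1_000_000), alpha)
print(f"VaR_0.9: closed form {var_closed_form:.3f}, empirical {var_empirical:.3f}")

# Policy value r(pi; theta) = pi^T Phi^T theta  (Eq. 3) for a toy linear bandit
d, k = 3, 5
Phi = rng.normal(size=(d, k))        # feature matrix with phi_a as columns
theta = rng.normal(size=d)           # a fixed reward parameter
pi = np.full(k, 1.0 / k)             # uniform randomized policy
print(f"policy value {pi @ Phi.T @ theta:.3f}, best arm value {np.max(Phi.T @ theta):.3f}")
```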
In the offline bandit setting, the agent only has access to a logged dataset D = {(ã_i, ỹ_i)}_{i=1}^n and cannot interact further with the environment. Each pair (ã_i, ỹ_i) in D̃ consists of an action ã_i selected according to some arbitrary logging policy and a reward ỹ_i sampled from the reward distribution of action ã_i. We use D to refer to an instantiation of the random dataset D̃. We take the Bayesian perspective in this paper and model our uncertainty about the reward parameter θ̃: Ω → R^d by assuming it is a random variable with a known prior P_θ̃(θ). Therefore, all quantities that depend on θ̃ are also random. The logged dataset D is used to derive the posterior density P_{θ̃|D}(θ) over the reward parameter. To streamline the notation, we denote by θ̃_D := (θ̃ | D̃ = D) the random variable distributed according to this posterior distribution P_{θ̃|D}. We discuss the derivation of the posterior in Section 5.

As described above, in the Bayesian offline bandit setting we assume that the logged data D̃ is fixed to some D and the uncertainty is over the reward parameter θ̃. This is different from the frequentist offline setting, in which the reward parameter is fixed, θ = θ*, and the randomness is over the different datasets generated by the logging policy. In the Bayesian offline bandit setting, our goal is to compute a policy π ∈ Δ_k that minimizes the high-confidence Bayesian regret R_δ: Δ_k → R_+ defined as
$$ \mathcal{R}_\delta(\pi) \;:=\; \min_{\epsilon}\; \epsilon \quad \text{subject to} \quad \mathbb{P}\Big[\max_{a\in\mathcal{A}} r(a;\tilde{\theta}_D) - r(\pi;\tilde{\theta}_D) \le \epsilon\Big] \;\ge\; 1-\delta, \tag{4} $$
where δ ∈ (0, 1/2) is a small error-tolerance parameter. We also use α = 1 − δ to denote the confidence in the solution. Note that (4) compares the value of a fixed policy π with the reward of an action (the max action) that depends on the posterior random variable θ̃_D. Thus, one cannot expect to achieve a regret of zero. By taking a close look at the definition of regret in (4) and using the definition of VaR in (1a), we may equivalently write our objective in (4) as
$$ \mathcal{R}_\delta(\pi) \;=\; \mathrm{VaR}_{1-\delta}\Big[\max_{a\in\mathcal{A}} r(a;\tilde{\theta}_D) - r(\pi;\tilde{\theta}_D)\Big]. \tag{5} $$
One could optimize other objectives besides the high-confidence regret in (5). Other objectives, such as maximizing the VaR of the reward, are easier to solve, and Appendix E discusses them in greater detail.

3.2. Baseline Algorithms

We now provide a brief description of two methods that have been used to solve Bayesian offline bandits (defined in Section 3.1) and closely related problems.

Lower Confidence Bound (LCB). Pessimism with respect to the uncertainty in the problem's parameter is the most common approach in offline decision-making problems, ranging from offline RL (Uehara & Sun, 2023; Rashidinejad et al., 2022; Xie et al., 2022), to robust RL (Petrik & Russel, 2019; Behzadian et al., 2021; Lobo et al., 2020), and offline bandits (Hong et al., 2023). In the case of offline bandits, this approach is compellingly simple and is known as maximizing a lower confidence bound, or LCB. The general recipe of the LCB algorithm for Gaussian and sub-Gaussian posteriors θ̃_D is to simply choose an action â ∈ A such that
$$ \hat{a} \;\in\; \arg\max_{a\in\mathcal{A}} \; \ell_\beta(a) \;:=\; \mu_n^{\mathsf{T}}\phi_a - \beta \sqrt{\phi_a^{\mathsf{T}}\Sigma_n\phi_a} \tag{6} $$
for some β > 0. The terms μ_nᵀϕ_a and √(ϕ_aᵀΣ_nϕ_a) represent the posterior mean and standard deviation of r(a; θ̃_D) = ϕ_aᵀθ̃_D. The parameter β is typically chosen to guarantee that ℓ_β(a) is a high-probability lower bound on the return of action a ∈ A: P[ℓ_β(a) ≤ r(a; θ̃_D)] ≥ 1 − δ. The FlatOPO algorithm (Hong et al., 2023) is a particular instance of the LCB approach to offline bandits that uses β = √(5d log(1/δ)) for Gaussian posteriors. When β = 0, we refer to an algorithm that implements (6) as Greedy.
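To make the recipe in (6) concrete, here is a short sketch (our own code, not the authors' implementation) that forms the Gaussian posterior from logged data via standard Bayesian linear regression (the same update that appears later as (16)) and then selects the LCB and Greedy actions; the prior parameters and logged data below are made up purely for illustration.

```python
import numpy as np

def gaussian_posterior(Phi, actions, rewards, mu0, Sigma0, sigma):
    """Bayesian linear-regression posterior over theta given logged (action, reward) pairs."""
    B = Phi[:, actions]                      # d x n matrix of observed features (columns)
    G = B @ B.T                              # empirical covariance of the observed features
    Sigma_n = np.linalg.inv(np.linalg.inv(Sigma0) + G / sigma**2)
    mu_n = Sigma_n @ (np.linalg.inv(Sigma0) @ mu0 + B @ rewards / sigma**2)
    return mu_n, Sigma_n

def lcb_action(Phi, mu_n, Sigma_n, beta):
    """Pick argmax_a mu_n^T phi_a - beta * sqrt(phi_a^T Sigma_n phi_a); beta = 0 gives Greedy."""
    means = Phi.T @ mu_n
    stds = np.sqrt(np.einsum("da,dD,Da->a", Phi, Sigma_n, Phi))
    return int(np.argmax(means - beta * stds))

# toy instance (all numbers are illustrative)
rng = np.random.default_rng(1)
d = k = 4
Phi, sigma, delta = np.eye(d), 1.0, 0.1
mu0, Sigma0 = np.zeros(d), np.eye(d)
actions = rng.integers(0, k, size=20)
rewards = rng.normal(size=20)
mu_n, Sigma_n = gaussian_posterior(Phi, actions, rewards, mu0, Sigma0, sigma)
beta_opo = np.sqrt(5 * d * np.log(1 / delta))      # FlatOPO's choice of beta
print("FlatOPO picks", lcb_action(Phi, mu_n, Sigma_n, beta_opo))
print("Greedy  picks", lcb_action(Phi, mu_n, Sigma_n, 0.0))
```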
Scenario-based Methods. Another natural approach to minimizing the Bayesian regret in offline bandits is to treat the optimization in (4) as a chance-constrained optimization problem. The most general way to solve chance-constrained optimization is to use scenario-based techniques to minimize the regret R_δ(π) (Calafiore & Campi, 2005; Nemirovski & Shapiro, 2006; Luedtke & Ahmed, 2008). A typical scenario-based algorithm first approximates θ̃_D with a discrete random variable q̃ constructed by sampling from its posterior P_{θ̃|D}(θ), and then computes a deterministic policy by solving
$$ \arg\min_{\hat{a}\in\mathcal{A}} \; \mathrm{VaR}_{1-\delta}\Big[\max_{a\in\mathcal{A}} r(a;\tilde{q}) - r(\hat{a};\tilde{q})\Big]. \tag{7} $$
The optimization in (7) can be solved by enumerating all the actions and computing the VaR of the discrete random variable within the brackets. The important question that has been extensively studied here is the number of samples needed to obtain a solution with high confidence (Nemirovski & Shapiro, 2007; 2006). The time complexity of this algorithm is a function of the number of samples and the desired confidence needed to guarantee a certain suboptimality of the solution; we refer the interested reader to Nemirovski & Shapiro (2007) for a detailed analysis.

Despite the generality and simplicity of scenario-based methods, they have several important drawbacks. They require sampling from the posterior, with a sample complexity that scales poorly with the dimension d, the number of actions k, and particularly the confidence level 1 − δ. They do not provide theoretical guarantees for the regret of the obtained policy and offer no insights into how the regret scales with the parameters of the problem. Finally, minimizing the regret in (5) over the space of randomized policies is challenging using scenario-based methods because it requires solving a mixed-integer linear program (Lobo et al., 2020). Other ideas have been explored (Calafiore & Campi, 2005; Brown et al., 2020), but a detailed study of such algorithms is beyond our scope.

4. Minimizing Analytical Regret Bounds

In this section, we propose our new approach for minimizing the Bayesian regret R_δ(π) defined in (5). In particular, we derive two upper bounds on R_δ(π) that complement each other depending on the relative sizes of the feature dimension d and the action space k. We also prove a lower bound on R_δ(π) that shows our upper bound is tight. Finally, we propose our BRMOB algorithm, which aims at jointly minimizing our two upper bounds. The proofs of this section are in Appendix B.

4.1. Bayesian Regret Bounds

To avoid unnecessary complexity, we assume in this section that the posterior distribution over the reward parameter is Gaussian. We show analogous results for the general sub-Gaussian case in Appendix D.

Assumption 4.1. The posterior over the latent reward parameter is distributed as θ̃_D ∼ N(μ, Σ), with mean μ ∈ R^d and a positive definite covariance matrix Σ ∈ R^{d×d}.

We begin by showing that, under Assumption 4.1, the regret r(a; θ̃_D) − r(π; θ̃_D) of any policy π ∈ Δ_k with respect to any action a ∈ A has a Gaussian distribution.

Lemma 4.2. Suppose that θ̃_D ∼ N(μ, Σ). Then, for any policy π ∈ Δ_k, the Bayesian regret in (5) can be written as
$$ \mathcal{R}_\delta(\pi) \;=\; \mathrm{VaR}_{1-\delta}\Big[\max_{a\in\mathcal{A}} \tilde{x}^{\pi}_a\Big], \tag{8} $$
where x̃^π_a ∼ N(μ^π_a, (σ^π_a)²) with
$$ \mu^{\pi}_a \;=\; \mu^{\mathsf{T}}\Phi(\mathbf{1}_a - \pi), \qquad \sigma^{\pi}_a \;=\; \big\|\Phi(\mathbf{1}_a - \pi)\big\|_{\Sigma}. \tag{9} $$
Lemma 4.2 points to the main challenge in deriving tight bounds on R_δ(π). Even when θ̃_D is normally distributed, the random variable max_{a∈A} x̃^π_a is unlikely to be Gaussian.
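The quantities μ^π_a and σ^π_a in (9) are straightforward to compute, and the VaR of the maximum in (8) can be estimated by sampling from the posterior, which is essentially how the Scenario baseline above and the regret evaluation in Section 6 operate. The sketch below (our own illustration, with arbitrary toy parameters) does exactly that.

```python
import numpy as np

def regret_params(Phi, mu, Sigma, pi):
    """Per-action regret mean mu_a^pi and std sigma_a^pi from Lemma 4.2 / Eq. (9)."""
    k = Phi.shape[1]
    D = np.eye(k) - np.outer(np.ones(k), pi)       # rows are (1_a - pi)^T
    M = Phi @ D.T                                   # columns are Phi (1_a - pi)
    mu_pi = mu @ M
    sigma_pi = np.sqrt(np.einsum("da,dD,Da->a", M, Sigma, M))
    return mu_pi, sigma_pi

def empirical_regret(Phi, mu, Sigma, pi, delta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of R_delta(pi) = VaR_{1-delta}[max_a x_a^pi] in Eq. (8)."""
    rng = np.random.default_rng(seed)
    thetas = rng.multivariate_normal(mu, Sigma, size=n_samples)   # posterior samples
    regrets = (thetas @ Phi).max(axis=1) - thetas @ Phi @ pi
    return np.quantile(regrets, 1 - delta)

# toy example: note that max_a x_a^pi is *not* Gaussian, so (2) cannot be applied directly
k = d = 3
Phi, mu, Sigma = np.eye(3), np.zeros(3), np.eye(3)
pi = np.full(k, 1.0 / k)
print("mu^pi, sigma^pi:", regret_params(Phi, mu, Sigma, pi))
print("estimated R_0.1(pi):", empirical_regret(Phi, mu, Sigma, pi, delta=0.1))
```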
The lack of normality prevents us from deriving an exact analytical expression for R_δ(π) using (2). In the remainder of the section, we derive two separate techniques for upper bounding the VaR of the maximum of random variables in (8), thereby also bounding the Bayesian regret R_δ(π).

Our first bound expresses the overall regret as a maximum over individual action regrets. We refer to it as an action-set bound, because it grows with the size of the action space k, and state it in Theorem 4.3.

Theorem 4.3. The regret for any policy π ∈ Δ_k satisfies
$$ \mathcal{R}_\delta(\pi) \;\le\; \min_{\xi\in\Delta_k} \max_{a\in\mathcal{A}} \; \mu^{\pi}_a + \sigma^{\pi}_a\, z_{1-\delta\xi_a} \tag{10a} $$
$$ \phantom{\mathcal{R}_\delta(\pi)} \;\le\; \min_{\xi\in\Delta_k} \max_{a\in\mathcal{A}} \; \mu^{\pi}_a + \sigma^{\pi}_a \sqrt{2\log\big(1/(\delta\xi_a)\big)}, \tag{10b} $$
where z_{1−δξ_a} is the (1 − δξ_a)-quantile of the standard normal distribution.

A special case of (10) is when ξ = (1/k) 1 is uniform, in which case it simplifies to
$$ \mathcal{R}_\delta(\pi) \;\le\; \max_{a\in\mathcal{A}} \; \mu^{\pi}_a + \sigma^{\pi}_a \sqrt{2\log(k/\delta)}. \tag{11} $$
This shows that the action-set bound in Theorem 4.3 grows sub-logarithmically with the number of actions k.

Because the bound in Theorem 4.3 is based on a union bound, the question of its tightness is particularly salient. To address this, we prove a lower bound on the regret when the arms are independent (e.g., multi-armed bandits).

Theorem 4.4. Suppose that π ∈ Δ_k is a deterministic policy such that π_{a₁} = 1 for a₁ ∈ A, without loss of generality. When μ₂ = μ₃ = ⋯ = μ_k, Σ is diagonal with Σ_{2,2} = Σ_{3,3} = ⋯ = Σ_{k,k}, and Φ = I, then
$$ \mathcal{R}_\delta(\pi) \;\ge\; \mu^{\pi}_{a_2} + \sigma^{\pi}_{a_2}\, \kappa_l(k-1), \qquad \kappa_l(k) \;=\; -1 + \sqrt{\,1 - \log(2\pi) - 2\log\!\big(1-(1-\delta)^{1/k}\big)}\,. $$
The lower bound in Theorem 4.4 indicates that Theorem 4.3 is tight. For ease of reference, we use κ_u(k) = √(2 log(k/δ)) to refer to the coefficient on the RHS of (11). The main difference between the upper and lower bounds is the coefficients κ_u(k) and κ_l(k−1). One can readily show that κ_u(k) ∈ O(κ_l(k−1)), since κ_u(1)/κ_l(1) < 10 when δ ≤ 1/2 and κ_u(k)/κ_l(k−1) is a non-increasing function of k. Figure 1 depicts the quotient of the upper and lower bound coefficients as a function of k for δ = 0.1.

Figure 1. The quotient κ_u(k)/κ_l(k−1) of the upper bound coefficient κ_u(k) and the lower bound coefficient κ_l(k−1) as a function of the number of actions k.

We now state our second upper bound on the regret in Theorem 4.5. We refer to it as the parameter-space bound because it grows with the dimension d of the parameter.

Theorem 4.5. The regret for any policy π ∈ Δ_k satisfies
$$ \mathcal{R}_\delta(\pi) \;\le\; \max_{a\in\mathcal{A}} \; \mu^{\pi}_a + \sigma^{\pi}_a \sqrt{\chi^2_d(1-\delta)} \tag{12a} $$
$$ \phantom{\mathcal{R}_\delta(\pi)} \;\le\; \max_{a\in\mathcal{A}} \; \mu^{\pi}_a + \sigma^{\pi}_a \sqrt{5d\log(1/\delta)}, \tag{12b} $$
where χ²_d(1 − δ) is the (1 − δ)-quantile of the χ² distribution with d degrees of freedom.

Note that a growing body of literature argues that using credible regions in constructing robust approximations of VaR is overly conservative when used with linear or concave functions (Gupta, 2019; Bertsimas et al., 2021; Petrik & Russel, 2019). However, these results do not apply to our setting because the maximum in (8) is non-concave. To compare our two upper bounds in (10) and (12), it is sufficient to compare the terms z_{1−δξ_a} and √(χ²_d(1−δ)). From Theorems 4.3 and 4.5, we can conclude that the second upper bound is preferable when d < log k.

4.2. Optimization Algorithm

We now describe our main algorithm, Bayesian Regret Minimization for Offline Bandits (BRMOB), whose pseudo-code is reported in Algorithm 1. Before describing BRMOB in greater detail, it is important to note that it returns a randomized policy. Unlike in online bandits, here the goal of randomization among the actions is not to explore, but rather to reduce the risk of incurring high regret.
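As a small numerical illustration of why randomization can help (our own sketch, not an experiment from the paper), the action-set bound (11) can be evaluated directly from (9): in a symmetric instance with μ = 0, Σ = I, and Φ = I, the uniform policy attains a strictly smaller bound than any deterministic policy, since σ^π_a = ‖1_a − π‖₂ shrinks when the probability mass is spread out.

```python
import numpy as np

def action_set_bound(Phi, mu, Sigma, pi, delta):
    """Upper bound (11): max_a mu_a^pi + sigma_a^pi * sqrt(2 log(k / delta))."""
    k = Phi.shape[1]
    kappa_u = np.sqrt(2 * np.log(k / delta))
    bound = -np.inf
    for a in range(k):
        x = Phi @ (np.eye(k)[a] - pi)                  # Phi (1_a - pi)
        bound = max(bound, mu @ x + np.sqrt(x @ Sigma @ x) * kappa_u)
    return bound

k = d = 3
Phi, mu, Sigma, delta = np.eye(k), np.zeros(d), np.eye(d), 0.1
det = np.eye(k)[0]                                      # deterministic policy playing a_1
uni = np.full(k, 1.0 / k)                               # uniform randomized policy
print("bound for deterministic policy:", action_set_bound(Phi, mu, Sigma, det, delta))
print("bound for uniform policy:      ", action_set_bound(Phi, mu, Sigma, uni, delta))
```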
The numerical results in Section 6 show that the ability to randomize over actions significantly reduces the Bayesian regret in many situations.

Algorithm 1: BRMOB: Bayesian Regret Minimization for Offline Bandits
Input: posterior parameters μ and Σ, risk tolerance δ ∈ (0, 1/2), feature matrix Φ ∈ R^{d×k}, number of iterations m
1: Initialize ν⁰_a ← min{ √(χ²_d(1−δ)), z_{1−δ/k} } for all a ∈ A; i ← 0
2: Minimize regret bounds: let ρ⁰ and π⁰ be the optimizers of
$$ \begin{aligned} \underset{\pi\in\mathbb{R}^k_+,\, s\in\mathbb{R}^k_+,\, \rho\in\mathbb{R}}{\text{minimize}} \quad & \rho \\ \text{subject to} \quad & \mathbf{1}^{\mathsf{T}}\pi = 1, \quad \rho \ge \mu^{\pi}_a + s_a\, \nu^{i}_a, \quad s_a^2 \ge (\sigma^{\pi}_a)^2, \quad \forall a\in\mathcal{A}. \end{aligned} \tag{13} $$
3: for i = 1, …, m do
4:   Tighten regret bounds: let ξ^i be an optimizer of
$$ \begin{aligned} \underset{\xi\in\mathbb{R}^k_+,\, s\in\mathbb{R}^k_+,\, l\in\mathbb{R}^k,\, \rho\in\mathbb{R}}{\text{minimize}} \quad & \rho \\ \text{subject to} \quad & \mathbf{1}^{\mathsf{T}}\xi = \delta, \quad \rho \ge \mu^{\pi^{i-1}}_a + \sigma^{\pi^{i-1}}_a s_a, \quad s_a^2 \ge -2 l_a, \quad l_a \le \log \xi_a, \quad \forall a\in\mathcal{A}. \end{aligned} \tag{14} $$
5:   Set ν^i_a ← z_{1−ξ^i_a} for all a ∈ A
6:   Solve (13) and let ρ^i and π^i be its optimizers
7: i* ← arg min_{i=0,…,m} ρ^i   // choose the policy with the best regret guarantee
8: return randomized policy π^{i*} and regret upper bound ρ^{i*}

BRMOB's strategy is to compute a policy with the minimum regret guarantee. In Line 2, it computes a policy π⁰ that simultaneously minimizes our two proposed upper bounds: the one in Theorem 4.3 with a uniform ξ, as given in (11), and the one in Theorem 4.5, as given in (12). The bounds can be optimized jointly because they differ only in the constant ν. The optimization in (13) is a second-order conic program (SOCP), because ν ≥ 0, and can be solved very efficiently (ApS, 2022; Lubin et al., 2023). The actual time complexity depends on the particular SOCP solver used, but most interior-point algorithms run in O(k⁶) time or faster (Kitahara & Tsuchiya, 2018).

After completing Line 2, BRMOB proceeds with m iterations of tightening the regret bound and improving the policy. In each iteration i, it tightens the regret bound in Theorem 4.3 by optimizing ξ^i in (14) for the incumbent policy π^{i−1}. The minimum in (14) can be computed efficiently using exponential and second-order cones (ApS, 2022; Lubin et al., 2023). Exponential conic optimization is hypothesized to be polynomial time, but to the best of our knowledge this fact has not been established yet. The algorithm then minimizes the tightened bound by solving (13) and obtains an improved policy π^i. The tightening steps in Algorithm 1 can be seen as a coordinate-descent procedure for the joint minimization over π and ξ in (10). It would be preferable to minimize the bound simultaneously over π and ξ, but such an optimization appears to be intractable.

Finally, Algorithm 1 returns the policy in the set {π^i}_{i=0}^m with the smallest regret bound ρ^i (Line 7). Although ρ^i will generally be non-increasing in i, this is not guaranteed. This is because the tightening step in (14) minimizes the bounds in (10b) and (12b), which are generally looser than the bounds in (10a) and (12a) optimized by (13).

We provide a worst-case error bound on the regret of BRMOB in Section 5. Our regret bound holds for any number of tightening steps, including m = 0. We focus on bounds that are independent of m for the sake of simplicity, since the improvements that arise from the tightening procedure can be difficult to quantify cleanly. We conclude this section with the following result, which shows that BRMOB indeed minimizes the regret upper bounds in Theorems 4.3 and 4.5 as intended.

Proposition 4.6. Suppose that BRMOB returns a policy π̂ ∈ Δ_k and a regret bound ρ̂. Then
$$ \mathcal{R}_\delta(\hat{\pi}) \;\le\; \hat{\rho} \;\le\; \min_{\pi\in\Delta_k} \max_{a\in\mathcal{A}} \; \mu^{\pi}_a + \sigma^{\pi}_a\, \eta, \tag{15} $$
where η = min{ √(2 log(k/δ)), √(5d log(1/δ)) }.
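The bound-minimization step (13) is a small SOCP. The following sketch solves it with cvxpy, which is our choice for illustration only (the paper's experiments use MOSEK, via a JuMP model), and it eliminates the auxiliary variables s_a by writing the constraint directly as ρ ≥ μ^π_a + ν_a ‖Σ^{1/2}Φ(1_a − π)‖₂.

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm, chi2

def brmob_policy(Phi, mu, Sigma, delta):
    """One pass of BRMOB (m = 0): minimize the joint regret bound (13) over randomized policies."""
    d, k = Phi.shape
    nu = min(np.sqrt(chi2.ppf(1 - delta, df=d)), norm.ppf(1 - delta / k))  # nu^0_a from Line 1
    L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T, so x^T Sigma x = ||L^T x||_2^2
    pi = cp.Variable(k, nonneg=True)
    rho = cp.Variable()
    constraints = [cp.sum(pi) == 1]
    for a in range(k):
        e = np.zeros(k); e[a] = 1.0
        mu_a = (mu @ Phi) @ (e - pi)                          # mu_a^pi
        sigma_a = cp.norm(L.T @ Phi @ (e - pi), 2)            # sigma_a^pi
        constraints.append(rho >= mu_a + nu * sigma_a)
    cp.Problem(cp.Minimize(rho), constraints).solve()
    return pi.value, rho.value

rng = np.random.default_rng(2)
d, k, delta = 3, 6, 0.1
Phi = rng.normal(size=(d, k))
pi_hat, rho_hat = brmob_policy(Phi, mu=np.zeros(d), Sigma=np.eye(d), delta=delta)
print("policy:", np.round(pi_hat, 3), " regret bound:", round(float(rho_hat), 3))
```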
5. Regret Analysis

In this section, we derive a regret bound for BRMOB and compare it with that of FlatOPO (Hong et al., 2023), an LCB-style algorithm. We use a frequentist analysis to bound the Bayesian regret of BRMOB as a function of k, d, the number of samples n, and the coverage of the dataset D. Section 5.3 concludes by arguing that the general LCB approach in (6) is unsuitable for minimizing Bayesian regret; see Appendix E for other objectives that can be optimized using LCB-style algorithms. Our lower bound shows that LCB can match BRMOB's regret only if the confidence penalty β is very small and decreases with k and d. All the proofs of this section are reported in Appendix C.

5.1. Sample-Based Regret Bound

As in prior work (Hong et al., 2023), we assume a Gaussian prior distribution over the reward parameter, P_θ̃ = N(μ₀, Σ₀) with an invertible Σ₀, and Gaussian rewards ỹ ∼ N(r(a; θ) = ϕ_aᵀθ, σ²) for each action a ∈ A. As a result, the posterior distribution over the parameter given a dataset D = {(a_i, y_i)}_{i=1}^n is also Gaussian, θ̃_D ∼ N(μ_n, Σ_n), with
$$ \Sigma_n \;=\; \big(\Sigma_0^{-1} + \sigma^{-2} G_n\big)^{-1}, \qquad \mu_n \;=\; \Sigma_n\big(\Sigma_0^{-1}\mu_0 + \sigma^{-2} B_n y_n\big), \tag{16} $$
where B_n = (ϕ_{a_i})_{i=1}^n ∈ R^{d×n} is the matrix with the observed features in its columns, y_n = (y_i)_{i=1}^n is the vector of observed rewards, and G_n = B_n B_nᵀ is the empirical covariance matrix (see Bayesian linear regression, for example, in Rasmussen & Williams 2006; Deisenroth et al. 2021).

To express the regret bound as a function of the dataset D, we make the following standard quality assumption.

Assumption 5.1. The feature vectors satisfy ‖ϕ_a‖₂ ≤ 1 for all a ∈ A, and there exists a γ > 0 such that G_n ⪰ γ n ϕ_a ϕ_aᵀ for all a ∈ A and n ≥ 1.

Intuitively, Assumption 5.1 states that the dataset provides sufficient information such that the norm of the covariance matrix Σ_n of the posterior distribution over θ̃_D decreases linearly with n. From a frequentist perspective, this assumption holds with high probability by the Bernstein-von Mises theorem under mild conditions (Vaart, 2000).

We are now ready to bound the Bayesian regret of BRMOB. We state the bound for the general case and then tighten it when μ_n = 0 (only the variance of the actions matters).

Theorem 5.2. Suppose that the parameter has a Gaussian posterior θ̃_D ∼ N(μ_n, Σ_n) and BRMOB returns a policy π̂. Then, the regret of BRMOB is bounded as R_δ(π̂) ≤ 2η, where
$$ \eta \;=\; \sqrt{\frac{\min\{2\log(k/\delta),\; 5d\log(1/\delta)\}}{\lambda_{\max}(\Sigma_0)^{-1} + \gamma n \sigma^{-2}}}. \tag{17} $$
Moreover, if μ_n = 0, then R_δ(π̂) ≤ 2(1 − max_{a∈A} π̂_a) η with max_{a∈A} π̂_a ≥ 1/(d+1).

5.2. Comparison with FlatOPO

We now compare the regret bound of BRMOB with that of FlatOPO (Hong et al., 2023), an LCB-based algorithm for regret minimization in Bayesian offline bandits. As discussed in Section 3.2, using the LCB principle is the most common approach to regret minimization in offline decision-making. Hong et al. (2023) derived the following regret bound for FlatOPO under Assumption 5.1:
$$ \mathcal{R}_\delta(\hat{\pi}) \;\le\; \sqrt{\frac{5d^2\log(1/\delta)}{\lambda_{\max}(\Sigma_0)^{-1} + \gamma n \sigma^{-2}}}, \tag{18} $$
where π̂ is the deterministic policy returned by the algorithm.

Comparing our regret bound for BRMOB in (17) with FlatOPO's in (18), we notice two main improvements. The first one is that BRMOB's regret is bounded in terms of log k. Thus, when the number of actions satisfies k ≪ exp(d), the regret guarantee of BRMOB can be dramatically lower than that achieved by FlatOPO. It is unclear how one could extend the existing analysis in Hong et al. (2023) to bound its regret in terms of k. Its design and analysis rely on a robust set, which is difficult to restrict using k.
The second improvement is that the regret bound of BRMOB grows more slowly than FlatOPO's by a factor of √d, which is a significant reduction in regret. This improvement is probably a consequence of our tighter analysis rather than of a better algorithmic design. The analysis in Hong et al. (2023) uses a general upper bound on the trace of a rank-one matrix, which introduces an unnecessary d term. Yet, applying our techniques to FlatOPO yields additional constant terms missing in (18).

5.3. Limitation of LCB

In this section, we argue that the popular LCB approach is inherently unsuitable for minimizing Bayesian regret in offline bandits. As we discussed in Section 5.2, BRMOB achieves significantly better regret guarantees than FlatOPO. Our numerical results in Section 6 also show that BRMOB outperforms FlatOPO. However, these results are obtained for a particular value of β in (6). Our theoretical analysis suggests that even the simple Greedy algorithm, which uses β = 0 in (6), can significantly outperform LCB.

The intuition behind the LCB approach is that one should prefer actions with low uncertainty, and thus, limited downside. This intuition is correct when the goal is to maximize the VaR of the reward, as shown in Appendix E.1. However, it does not apply when the objective is regret minimization. In fact, actions with low uncertainty also have limited upside and high regret, and thus, as we show, penalizing high-variance actions is counterproductive.

We now construct a simple class of offline bandit problems to illustrate LCB's limitations. For this class of problems, we show that the lower bound on the regret of LCB can be far greater than the upper bound on the regret of the BRMOB, or even Greedy, policy. In what follows, we assume that LCB computes the high-confidence lower bound as in (6) for some value of β.

Figure 2. The value of β used by FlatOPO in Example 1, β_OPO, and the upper bound β* that may avoid the under-performance of LCB, defined in (22), as functions of the number of actions k.

Example 1. Consider a class of offline bandit problems parametrized by the β used in (6). The bandit has k ≥ 2 arms, feature dimension d = k, and a feature matrix Φ = I. Suppose that the posterior covariance over the reward parameter Σ ∈ R^{k×k} is diagonal with Σ_{i,i} = σ_i², where σ₁ = 0 and σ₂ = ⋯ = σ_k > 0, and the posterior mean has the following form: μ₁ = 0 and μ₂ = ⋯ = μ_k = β σ₂ ≥ 0.

The intuition underlying the bandit problems in Example 1 is as follows. The bandit has one action, a₁, with low reward and low variance. The other k − 1 arms are i.i.d. with higher mean and variance. LCB prefers to take action a₁ because of its low variance, forgoing the higher mean of the other actions. The next theorem shows that taking any of the other actions with a higher mean, as would be chosen by BRMOB, or even by Greedy, which selects an action with the largest posterior mean, leads to a far lower regret.

Theorem 5.3. Consider the bandit problems in Example 1 and assume a realization of LCB with a coefficient β > 0 that breaks ties by choosing the a_i with the smallest i. Then, LCB returns π_LCB ∈ Δ_k with π_LCB(a₁) = 1 and
$$ \mathcal{R}_\delta(\pi_{\mathrm{LCB}}) \;\ge\; \big(\beta + \kappa_l(k-1)\big)\, \sigma_{a_2}. \tag{19} $$
Moreover, Greedy with the same tie-breaking will return a policy π_G ∈ Δ_k with π_G(a₂) = 1 and
$$ \mathcal{R}_\delta(\pi_{\mathrm{G}}) \;\le\; \sqrt{2}\, \sigma_{a_2}\, \kappa_u(k). \tag{20} $$
Finally, BRMOB's regret also satisfies the bound in (20).

Theorem 5.3 shows that even in a simple class of problems, Greedy (or BRMOB) computes a policy that outperforms LCB significantly.
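The following sketch (our own illustration, with arbitrary values of k, β, and σ₂) instantiates Example 1, verifies that the LCB rule (6) selects a₁ while Greedy selects a₂, and estimates the Bayesian regret of both choices by Monte Carlo sampling from the posterior, mirroring the comparison in Theorem 5.3.

```python
import numpy as np

def mc_regret(Phi, mu, Sigma, action, delta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of R_delta for the deterministic policy playing `action` (Eq. 5)."""
    rng = np.random.default_rng(seed)
    thetas = rng.multivariate_normal(mu, Sigma, size=n_samples)
    values = thetas @ Phi                                  # value of each arm under each sample
    return np.quantile(values.max(axis=1) - values[:, action], 1 - delta)

# Example 1: k arms, d = k, Phi = I, sigma_1 = 0, sigma_2 = ... = sigma_k, mu_a = beta * sigma_2 for a >= 2
k, beta, sigma2, delta = 10, 3.0, 1.0, 0.1
Phi = np.eye(k)
mu = np.concatenate(([0.0], np.full(k - 1, beta * sigma2)))
Sigma = np.diag(np.concatenate(([0.0], np.full(k - 1, sigma2**2))))

stds = np.sqrt(np.diag(Sigma))
lcb_arm = int(np.argmax(mu - beta * stds))        # ties resolve to the smallest index, i.e., a_1
greedy_arm = int(np.argmax(mu))                   # picks a_2 (smallest index among the maximizers)
print("LCB arm:", lcb_arm, " Greedy arm:", greedy_arm)
print("regret of LCB:   ", round(mc_regret(Phi, mu, Sigma, lcb_arm, delta), 3))
print("regret of Greedy:", round(mc_regret(Phi, mu, Sigma, greedy_arm, delta), 3))
```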
The increase in regret of LCB relative to Greedy (or BRMOB) can be bounded from below as
$$ \mathcal{R}_\delta(\pi_{\mathrm{LCB}}) - \mathcal{R}_\delta(\pi_{\mathrm{G}}) \;\ge\; \big(\beta + \kappa_l(k-1) - \sqrt{2}\,\kappa_u(k)\big)\, \sigma_{a_2}. \tag{21} $$
Note that the bound, when positive, can be made arbitrarily large by scaling σ_{a₂}. Using algebraic manipulation of the bound in (21), we can show that β should satisfy the following condition for LCB to perform better than Greedy and BRMOB:
$$ \beta \;\le\; \beta^{\star} \;:=\; \sqrt{2}\,\kappa_u(k) - \kappa_l(k-1). \tag{22} $$
The inequality in (22) indicates that penalizing an action's uncertainty with a β greater than β* increases the regret of LCB. For comparison, the β used by FlatOPO for the class of problems in Example 1, in which d = k, is β_OPO = √(5k log(1/δ)). Figure 2 shows that for δ = 0.1, β_OPO exceeds β*, and thus violates the condition in (22) and performs worse than Greedy and BRMOB for all values of k.

6. Numerical Results

In this section, we compare BRMOB to several baselines on synthetic domains. Here, we evaluate the basic version of BRMOB with m = 0 iterations and defer the results that demonstrate the improvement from the tightening step to Appendix F. In particular, we compare it to (i) FlatOPO (Hong et al., 2023), which is based on the LCB principle, (ii) Greedy, which selects an action a with the largest value of μ_a, and finally, (iii) Scenario, the scenario-based method described in (7) in Section 3.2. We execute Scenario with 4000 samples from the posterior. Increasing this number did not improve our results.

Our experiments use synthetic domains, each defined by a normal prior N(μ₀, I) and a feature matrix Φ. To evaluate the Bayesian regret as a function of the data size n, we first sample a single large dataset and then vary the number of data points n from this set used to estimate the posterior distributions. The regret of each policy is computed by a scenario-based algorithm that samples from the posterior and computes the empirical VaR. We use an error tolerance of δ = 0.1 throughout. Results are averaged over 100 runs of this process to reduce variance. As the confidence intervals were negligible for all algorithms except FlatOPO, to avoid clutter, we do not plot them here and refer the reader to Appendix F for additional details.

We evaluate the algorithms on three domains. The first one uses k = d actions, an identity feature matrix Φ = I, and zero prior mean μ₀ = 0. The second one is the same, except that (μ₀)_a = a, to simulate varying rewards across actions. Finally, the third one fixes the dimension d = 4 and varies k while using features generated randomly from the ℓ∞-ball.

Figure 3. Bayesian regret with k = d = 5 (left) and k = d = 50 (middle and right). The prior mean is μ₀ = 0 (left and middle) and (μ₀)_a = a for a = 1, …, 50 (right).

Figure 4. Bayesian regret with d = 4 and k = 10 (left), k = 50 (middle), and k = 100 (right).

Figures 3 and 4 summarize our numerical results. They consistently show across all domains that BRMOB significantly outperforms all the other algorithms, particularly when the posterior uncertainty is large. The only challenging setting for BRMOB is when k ≫ d. Note that FlatOPO's performance is noisy in Figure 3 because its β coefficient grows fast with d.
A common practice is to tune β, but we did not find any value of β for which Flat OPO performs better than Greedy, which is consistent with our theoretical analysis in Section 5.3. It is also notable that Greedy outperforms LCB significantly, furnishing further evidence that LCB is unsuitable for minimizing regret. We provide additional results, including confidence bounds and runtime, in Appendix F. Bayesian Regret Minimization in Offline Bandits 7. Conclusion We proposed BRMOB, a new approach for Bayesian regret minimization in offline bandits, that is based on jointly minimizing two analytical upper bounds on the Bayesian regret. We proved a regret bound for BRMOB and showed that it compares favorably with an existing LCB-style algorithm Flat OPO (Hong et al., 2023). Finally, we showed theoretically and empirically that the popular LCB approach is unsuitable for minimizing Bayesian regret. Our approach can be extended to several more general settings. The algorithm and bounds can generalize to sub Gaussian posterior distributions as described in Appendix D. The algorithm can also be extended to contextual linear bandits by computing a separate policy π for each context individually or by assuming a random context. Another important future direction is understanding the implications of our results to frequentist settings where similar concerns about the value of the LCB approach have been raised (Xie et al., 2022; Xiao et al., 2021). Acknowledgments We thank the anonymous referees for their helpful comments and suggestions. The work in the paper was supported, in part, by NSF grants 2144601 and 2218063. In addition, this work was performed in part while Mohammad Ghavamzadeh and Marek Petrik were at Google Research. Impact Statement This paper aims to advance the theory field of Machine Learning. There are many potential societal consequences of our work, but it is difficult to speculate what they may be. Of course, we sincerely hope that the impacts will only be positive. Ahmadi-Javid, A. Entropic Value-at-Risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105 1123, 2012. Angelotti, G., Drougard, N., and Chanel, C. P. C. Exploitation vs caution: Risk-sensitive policies for offline learning. ar Xiv:2105.13431 [cs, eess], 2021. Ap S, M. The MOSEK optimization toolbox for MATLAB manual. Version 10.0., 2022. Behzadian, B., Russel, R., Ho, C. P., and Petrik, M. Optimizing percentile criterion using robust MDPs. In International Conference on Artificial Intelligence and Statistics (AIStats), 2021. Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization. Princeton University Press, 2009. Bertsimas, D., den Hertog, D., and Pauphilet, J. Probabilistic Guarantees in Robust Optimization. SIAM Journal on Optimization, 31(4):2893 2920, 2021. Brown, D. S., Niekum, S., and Petrik, M. Bayesian robust optimization for imitation learning. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Calafiore, G. and Campi, M. C. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25 46, 2005. Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. Adversarially trained actor critic for offline reinforcement learning, 2022. David, H. A. and Nagaraja, H. N. Order Statistics. John Wiley & Sons, Ltd, 3 edition, 2003. Deisenroth, M. P., Faisal, A. A., and Ong, C. S. Mathematics for Machine Learning. Cambridge University Press, 2021. Delage, E. and Mannor, S. 
Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203 213, 2010. Follmer, H. and Schied, A. Stochastic Finance: Introduction in Discrete Time. De Gruyter Graduate, fourth edition, 2016. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. Bayesian Data Analysis. Chapman and Hall/CRC, 3rd edition, 2014. Ghosh, D., Ajay, A., Agrawal, P., and Levine, S. Offline RL policies should be trained to be adaptive, 2022. Gupta, V. Near-optimal bayesian ambiguity sets for distributionally robust optimization. Management Science, 65 (9):4242 4260, 2019. Hau, J. L., Petrik, M., and Ghavamzadeh, M. Entropic risk optimization in discounted MDPs. In Artificial Intelligence and Statistics (AISTATS), 2023. Hong, J., Kveton, B., Katariya, S., Zaheer, M., and Ghavamzadeh, M. Multi-task off-policy learning from bandit feedback. In International Conference on Machine Learning, 2023. Horn, R. A. and Johnson, C. A. Matrix Analysis. Cambridge University Press, second edition, 2013. Javed, Z., Brown, D., Sharma, S., Zhu, J., Balakrishna, A., Petrik, M., Dragan, A., and Goldberg, K. Policy gradient Bayesian robust optimization for imitation learning. In International Conference on Machine Learning (ICML), 2021. Bayesian Regret Minimization in Offline Bandits Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. A short note on concentration inequalities for random vectors with subgaussian norm. ar Xiv preprint ar Xiv:1902.03736, 2019. Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL?, 2022. Kitahara, T. and Tsuchiya, T. An extension of Chubanov s polynomial-time linear programming algorithm to second-order cone programming. Optimization Methods and Software, 33(1):1 25, January 2018. ISSN 1055-6788. doi: 10.1080/10556788.2017.1382495. Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pp. 45 73. Springer, 2012. Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, 2018. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302 1338, 2000. Lobo, E., Cousins, C., Petrik, M., and Zick, Y. Percentile criterion optimization in offline reinforcement learning. In Neural Information Processing Systems (Neur IPS), 2023. Lobo, E. A., Ghavamzadeh, M., and Petrik, M. Soft-Robust Algorithms for Handling Model Misspecification, 2020. Lubin, M., Dowson, O., Garcia, J. D., Huchette, J., Legat, B., and Vielma, J. P. Jump 1.0: Recent improvements to a modeling language for mathematical optimization. Mathematical Programming Computation, 2023. Luedtke, J. and Ahmed, S. A sample approximation approach for optimization with probabilistic constraints. SIAM Journal on Optimization, 19(2):674 699, 2008. Nemirovski, A. and Shapiro, A. Scenario Approximations of Chance Constraints. In Calafiore, G. and Dabbene, F. (eds.), Probabilistic and Randomized Methods for Design under Uncertainty, pp. 3 47. Springer, 2006. Nemirovski, A. and Shapiro, A. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969 996, 2007. Petrik, M. and Russel, R. H. Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. In Advances in Neural Information Processing Systems, volume 32, 2019. Pintér, J. Deterministic approximations of probability inequalities. Zeitschrift für Operations-Research, 33(4): 219 239, 1989. 
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. IEEE Transactions on Information Theory, 68(12):8156 8196, 2022. Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006. Rockafellar, R. T. and Wets, R. J. Variational Analysis. Springer, 2009. Shapiro, A., Dentcheva, D., and Ruszczynski, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014. Steimle, L. N., Kaufman, D. L., and Denton, B. T. Multimodel Markov decision processes. IISE Transactions, 53, 2021. Su, X. and Petrik, M. Solving multi-model MDPs by coordinate ascent and dynamic programming,. In Uncertainty in Artificial Intelligence (UAI), 2023. Uehara, M. and Sun, W. Pessimistic model-based offline reinforcement learning under partial coverage. In International Conference on Learning Representations (ICLR), 2023. Vaart, A. W. V. D. Asymptotic Statistics. Cambridge University Press, 2000. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. Xiao, C., Wu, Y., Littlemore, T., Dai, B., Mei, J., Li, L., Szepesvari, C., and Schuurmans, D. On the optimality of batch policy optimization algorithms. In International Conference of Machine Learning (ICML), 2021. Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 34, pp. 6683 6694, 2021. Xie, T., Bhardwaj, M., Jiang, N., and Cheng, C.-A. ARMOR: A model-based framework for improving arbitrary baseline policies with offline data, 2022. Bayesian Regret Minimization in Offline Bandits A. Technical Background and Lemmas A scalar random variable x: Ω R with mean µ = E[ x] is sub-Gaussian with a variance factor σ2 0 when E exp λ( x µ) exp λ2σ2/2 , λ R . (23) A multivariate random variable x: Ω Rd with a mean µ = E[ x] is sub-Gaussian with a covariance factor Σ Rd d when (Vershynin, 2010; Jin et al., 2019) E exp λw T( x µ) exp λ2w TΣw/2 , λ R, w d . (24) The Entropic Value at Risk (EVa R) is a risk measure related to Va R, defined as (Ahmadi-Javid, 2012) EVa Rα [ x] = inf β>0 β 1 E[exp(β x)] log(1 α) , α [0, 1). (25) The following lemma shows that EVa R is an upper bound on Va R. This is a property that will be useful in our proofs later on. Lemma A.1. For any random variable x: Ω R, we have that Va Rα [ x] EVa Rα [ x] , α [0, 1). Proof. This is a consequence of Proposition 3.2 in Ahmadi-Javid (2012) and the fact that CVa R upper bounds Va R. Similar to (2) for Va R, we can show that for Gaussian random variables, x N(µ, σ2), EVa R has the following analytical form (Ahmadi-Javid, 2012): EVa Rα [ x] = µ + σ p 2 log(1 α). (26) One advantage of EVa R over Va R is that we can bound it in the more general case of sub-Gaussian random variables by the same bound as for a normal random variable in (26) (see the following lemma). Lemma A.2. Let x: Ω R be a sub-Gaussian random variable defined according to (23). Then, we have EVa Rα [ x] µ + σ p 2 log(1 α) , α [0, 1) . Proof. From the translation invariance of EVa R (Ahmadi-Javid, 2012, Theorem 3.1) and the definitions in (23) and (25), we have EVa Rα [ x] = µ + EVa Rα [ x µ] = µ + inf β>0 β 1 E exp β ( x µ) log(1 α) µ + inf β>0 β 1 β2σ2 2 log(1 α) = µ + σ p 2 log(1 α) . The last step follows by solving for the optimal β = σ 1p 2 log(1 α) from the first-order optimality conditions of the convex objective function. 
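To make the relationship in Lemma A.1 and the closed forms (2) and (26) concrete, the short sketch below (our own check, with arbitrary parameters) computes both quantities for a Gaussian and confirms numerically that EVaR dominates VaR.

```python
import numpy as np
from scipy.stats import norm

# For x ~ N(mu, sigma^2):  VaR_alpha[x]  = mu + sigma * z_alpha                      (Eq. 2)
#                          EVaR_alpha[x] = mu + sigma * sqrt(-2 * log(1 - alpha))    (Eq. 26)
mu, sigma = 0.5, 2.0
for alpha in (0.9, 0.95, 0.99):
    var_ = mu + sigma * norm.ppf(alpha)
    evar = mu + sigma * np.sqrt(2 * np.log(1 / (1 - alpha)))
    assert evar >= var_            # Lemma A.1: EVaR upper bounds VaR
    print(f"alpha={alpha}: VaR={var_:.3f}  EVaR={evar:.3f}")
```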
B. Proofs of Section 4 Proof of Lemma 4.2. We obtain by algebraic manipulation that max a A r(a; θD) r(π; θD) = max a A 1T aΦT θD πTΦT θD = max a A 1T a ΦT θD 1πTΦT θD = max a A 1T a I 1πT ΦT θD . Let xπ = I 1πT ΦT θD, which is a linear transformation of the normal random variable θD. The result follows because linear transformations preserve normality (Deisenroth et al., 2021). Bayesian Regret Minimization in Offline Bandits B.1. Proof of Theorem 4.3 We first report a result in the following lemma which we later use to prove Theorem 4.3. Lemma B.1. Suppose x: Ω Rk is a random variable such that all α-quantiles, α [0, 1), for each xa, a A are unique. Then, the following inequality holds for each α [0, 1): inf max a A Va R1 ξa [ xa] | ξ Rk +, 1Tξ = 1 α . We interpret the maximum of all as . Proof. The result develops as (a)= sup t R | P max a A xa t > 1 α (b) sup t R | P max a A xa t 1 α a A P [ xa t] 1 α a A P [ xa t] X | ξ Rk +, 1Tξ = 1 α (e) inf max a A sup {t R | P [ xa t] ξa} | ξ Rk +, 1Tξ = 1 α (f) inf max a A Va R1 ξa [ xa] | ξ Rk +, 1Tξ = 1 α . (a) is from the definition of Va R. (b) follows by relaxing the set by replacing the strict inequality with a non-strict one. (c) follows by relaxing the constraint further using the union bound. (d) follows from algebraic manipulation because the objective is constant in the choice of ξ. (e) holds by relaxing the sum constraints and then representing the supremum over a union of sets by a maximum of the suprema of the sets as a A P [ xa t] X sup {t R | P [ xa t] ξa, a A} = max a A sup {t R | P [ xa t] ξa} . Finally, (f) follows from the definition of Va R and because then the quantiles are unique (Follmer & Schied, 2016) Va R1 ξa [ xa] = sup {t R | P [ xa t] ξa} = sup {t R | P [ xa t] > ξa} . The first equality is the definition of the upper quantile q+ and the second equality is the definition of the lower quantile q , which are equal by the uniqueness assumption. We are now ready to prove Theorem 4.3. Proof of Theorem 4.3. The first inequality in (10) follows from Lemma 4.2 and Lemma B.1 by some algebraic manipulation. The second inequality in (10) follows from upper bounding the Va R of a Gaussian random variable using (2) and the fact that xπ a is a Gaussian random variable with mean µTΦ(1a π) and standard deviation Φ(1a π) Σ. The inequality z1 δξa p 2 log 1/δξa holds because for a standard normal random variable y, we have that z1 δξa = Va R1 δξa [ y] (a) EVa R1 δξa [ y] (b)= p 2 log(1/δξa) . (a) follows from Lemma A.1 and (b) is by (26). Bayesian Regret Minimization in Offline Bandits B.2. Proof of Theorem 4.4 First, we prove a lower bound on the Va R of a single Gaussian random variable. Lemma B.2. Suppose that x N(0, 1) and α 1 Va Rα [ x] 1 + q 2π) 2 log(1 α). Proof. To establish this lower bound on Va R, we use the known bounds on the cumulative distribution function of a Gaussian random variable as stated, for example, in eq. (13.1) in Lattimore & Szepesvari (2018). For any t R we have that 4t2 + 16 exp t2 From the definition of Va R in (1b) we get that Va Rα [ x] = sup {t R | P [ x t] > 1 α} = sup {t R+ | P [ x t] > 1 α} 4t2 + 16 exp t2 4(t + 1) exp t2 Here, we restricted t to be non-negative, which does not impact the Va R value because for α 0.5 we have that Va Rα [ x] 0. The first inequality is a lower bound that follows by tightening the feasible set in the supremum. The final inequality follows since 4t2 + 16 2t + 4 from the triangle inequality. 
Then, algebraic manipulation of the right-hand side above gives us that Va Rα [ x] sup n t R+ | t2 2t > 1 log(1 α) + 2 log Then, using the fact that the constraint is concave in t, we get the final lower bound on Va R by solving the quadratic equation. The following lemma bounds the Va R of a maximum of independent random variables. This is possible because the maximum is the first order statistic which has an easy-to-represent CDF (David & Nagaraja, 2003). Lemma B.3. Suppose that xi : Ω R, i = 1, . . . , n are i.i.d. random variables. Then max i=1,...,n xi = Va Rα1/n [ x1] . Proof. Recall i.i.d. random variables satisfy that P max i=1,...,n xi i=1,...,n P [ xi] = P [ x1]n . The result then follows from the definition of Va R in (1a) and from algebraic manipulation as max i=1,...,n xi = inf t R | P max i=1,...,n xi > t 1 α = inf t R | P max i=1,...,n xi t α = inf {t R | P [ x1 t]n α} = inf n t R | P [ x1 t] α 1/no = Va Rα1/n [ x1] . Bayesian Regret Minimization in Offline Bandits Proof of Theorem 4.4. Define a restricted set of actions A2 = A \ {a1}. As in the remainder of the paper, we use α = 1 δ to simplify the notation in this proof. From the definition of regret in (5) and the monotonicity of Va R (Shapiro et al., 2014) we get that the regret of the π can be lower bounded as the maximum regret compared only to actions in A2: Rδ(π) = Va Rα max a A r( θ, a) r( θ, a1) Va Rα max a A2 r( θ, a) r( θ, a1) . From the theorem s assumptions, the random variables za = r( θ, a) r( θ, a1) for a A2 are independent and identically distributed as N(µ2 µ1, σ2 2 + σ2 1) where σi = Σi,i for i = 1, . . . , k. Then, using the inequality above and Lemma B.3 we get that Rδ(π) = Va Rα max a A2 r( θ, a) r( θ, a1) Va R1 α1/k [ z] = (µ2 µ1) + q σ2 1 + σ2 2 Va R1 α1/k " za2 (µ2 µ1) p σ2 1 + σ2 2 Here, we used the fact that Va R is positively homogenous and translation equivariant. The result follows by Lemma B.2 since the random variable inside of the Va R above is distributed as N(0, 1). B.3. Proof of Theorem 4.5 This result follows from standard robust optimization techniques (see, for example, Gupta (2019); Petrik & Russel (2019)) as well as bandit analysis. In fact, similar or perhaps almost identical analysis has been used to analyze the regret of Flat OPO in Hong et al. (2023). We provide an independent proof for the sake of completeness. The following two auxiliary lemmas are used to show that a robust optimization over a credible region can be used to upper bound the Va R of any random variable. The first auxiliary lemma establishes a sufficient condition for a robust optimization being an overestimate of Va R. Lemma B.4. Suppose that we are given an ambiguity set P X, a function g: X R, and a random variable x : Ω X. If P Z = for Z = x X | g(x) Va Rα [g( x)] , then Va Rα [g( x)] sup x P g(x) . Proof. By the hopothesis, there exists some ˆx P Z. Then, we have supx P g(x) g(ˆx) Va Rα [g( x)] that concludes the proof, where the first inequality is by definition and the second one is from the definition of the set Z. The second auxiliary lemma shows that a credible region is sufficient to upper bound Va R using a robust optimization problem. Lemma B.5. Suppose that we are given an ambiguity set P X, a function g: X R, and a random variable x : Ω X. Then, we have P[ x P] α = Va Rα [g( x)] sup x P g(x) . Proof. Our proof is by contradiction using Lemma B.4. We start by assuming that P[ x P]. Define Z = x X | g(x) Va Rα [g( x)] as in Lemma B.4. 
From Lemma B.4, we know that if supx P g(x) Va Rα [g( x)] is false, then we should have P Z = . By the definition of Va R, we have that P[ x Z] > 1 α. Then, we get a contradiction with P Z = as follows 1 P[ x P Z] = P[ x P] + P[ x Z] > α + 1 α > 1 . The following lemma uses a standard technique for constructing a credible region for a multivariate normal distribution (Hong et al., 2023; Gupta, 2019). Bayesian Regret Minimization in Offline Bandits Lemma B.6. Suppose that x N(µ, Σ) is a multi-variate normal random variable with a mean µ Rd and a covariance matrix Σ Rd d. Then the set P Rd, defined as P = x Rd | x µ 2 Σ 1 χ2 d(α) , with χ2 d(α) being the α-quantile of the χ2 d distribution, satisfies that P[ x P] = α . Proof. One can readily verify that Σ 1 2 ( x µ) N(0, I) is a standard multivariate normal distribution. The norm of this value is a sum of i.i.d. standard normal variables, and thus, is distributed according to the χ2 d distribution with d degrees of freedom: Σ 1 2 ( x µ) T Σ 1 2 ( x µ) = x µ 2 Σ 1 χ2 d . Therefore, by algebraic manipulation and the definition of a quantile, we obtain that P[ x P] = P x µ 2 Σ 1 χ2 d(α) = α . Finally, the following lemma derives the optimal solution of a quadratic optimization problem that arises in the formulation. Lemma B.7. The equality n x Tp | p ˆp 2 C b, p Rko = x T ˆp + b x C 1 (27) holds for any given vectors x, ˆp Rd and a matrix C Rd d that is positive definite: C 0. Proof. From the convexity of the optimization problem in (27), we can construct the optimizer p using KKT conditions as b x C 1 C 1x . The result then follows by substituting p into the maximization problem in the lemma. We are now ready to prove the main theorem. Proof of Theorem 4.5. We derive the bound in (12) using the robust representation of Va R (Ben-Tal et al., 2009). We first construct the set Pδ Rd as Pδ = θ Rd | θ µ 2 Σ 1 χ2 d(1 δ) . (28) Using Lemma B.6 and the definition of Pδ in (28), we can see that Pδ is indeed a credible region: P h θ Pδ i = 1 δ . Then, Lemma B.5 gives us the first inequality in (12): Rδ(π) max θ Pδ max a A (r(a; θ) r(π; θ)) . The second inequality in (12) is a consequence of Lemma B.7 with x = Φ(1a π), ˆp = µ, p = θ, C = Σ 1, and b = p Finally, the inequality p χ2 d(1 δ) p 5d log(1/δ) follows from Lemma 1 in Laurent & Massart (2000) as in the proof of Lemma 3 in Hong et al. (2023). Bayesian Regret Minimization in Offline Bandits B.4. Proof of Proposition 4.6 Proof. The corollary is an immediate consequence of Theorems 4.3 and 4.5 and the construction of Algorithm 1. By construction, π0 is the solution to π0 arg min π k max a A µTΦ(1a π) + Φ(1a π) Σ ν0 a , where ν0 a is defined in Algorithm 1. Therefore, using Theorems 4.3 and 4.5 to upper bound νa, we obtain Rδ(π0) min π k max a A µTΦ(1a π) + Φ(1a π) Σ min np 2 log(k/δ), p 5d log(1/δ) o . This proves the corollary when i = 0 in Algorithm 1. Then, using Theorem 4.3 with general ξ, we observe that the algorithm selects i > 0 only when Rδ(πi ) ρi ρ0, which means that the corollary also holds. C. Proofs of Section 5 C.1. Proof of Theorem 5.2 Proof. To prove the first claim of the theorem, let π be a policy that minimizes the linear component of the regret: π arg min π k µTΦ(1a π) . Note that the minimum above is upper-bounded by 0. Next we use Proposition 4.6 to bound the regret: Rδ(ˆπ) min π k max a A µTΦ(1a π) + Φ(1a π) Σn min np 2 log(k/δ), p 5d log(1/δ) o max a A µTΦ(1a π) + Φ(1a π) Σn min np 2 log(k/δ), p 5d log(1/δ) o max a A Φ(1a π) Σn min np 2 log(k/δ), p 5d log(1/δ) o . 
Now, we bound the term Φ(1a π) Σn. Recall that π 2 π 1 1, since π k. Then, for each a A, we have by algebraic manipulation that Φ(1a π) 2 Σn = (1a π)TΦTΣnΦ(1a π) = 1T aΦTΣnΦ1a + πTΦTΣnΦ π 2 1T aΦTΣnΦ π (a) 4 max a A 1T a ΦTΣnΦ1a = 4 max a A ϕT a Σnϕa . (a) holds by the Cauchy-Schwartz inequality because 1T aΦTΣnΦ π Σ 1/2 n Φ1a 2 Σ 1/2 n Φ π 2 max a A Σ 1/2 n Φ1a 2 2 . The last inequality in the above equation is satisfied because Σ 1/2 n Φ π 2 P a A πa Σ 1/2 n Φ1a 2 maxa A Σ 1/2 n Φ1a 2, which in turn follows by Jensen s inequality from the convexity of the ℓ2-norm and the fact that π k. The term πTΦTΣnΦ π is upper bounded by an analogous argument. Now Assumption 5.1 implies the following for each a A: Gn γn ϕaϕT a Σ 1 0 + σ 2Gn Σ 1 0 + σ 2 γn ϕaϕT a 0 (Σ 1 0 + σ 2Gn) 1 (Σ 1 0 + σ 2 γn ϕaϕT a) 1 ϕT a(Σ 1 0 + σ 2Gn) 1ϕa ϕT a(Σ 1 0 + σ 2 γn ϕaϕT a) 1ϕa ϕT aΣnϕa ϕT a(Σ 1 0 + σ 2 γn ϕaϕT a) 1ϕa. (29) Bayesian Regret Minimization in Offline Bandits The second line holds because we assumed Σ0 0, and thus, Σ 1 0 0, and adding a positive definite matrix preserves definiteness. The third line holds from the definiteness in the second line and Horn & Johnson (2013, corollary 7.7.4(a)). Finally, the fourth line holds from the definition of positive semi-definiteness. We continue by applying the Woodbury matrix identity to (29), which give us the following inequality for each a A: ϕT aΣnϕa ϕT a(Σ 1 0 + σ 2 γn ϕaϕT a) 1ϕa = 1 (ϕTaΣ0ϕa) 1 + σ 2 γn 1 λmax(Σ0) 1 + σ 2 γn , where λmax computes the maximum eigenvalues of the matrix. The inequality above holds because 0 ϕT aΣ0ϕa λmax(Σ0) ϕa , which can be seen from the eigendecomposition of the symmetric matrix. Substituting the inequality above proves the theorem. To prove the special case of the theorem with µn = 0, let π0 be the solution in the first iteration of Algorithm 1. Given the posterior distribution of θD, the policy π0 is chosen as π0 arg min π k max a A 0TΦ(1a π) + Φ(1a π) Σ ν0 a = arg min π k max a A Φ(1a π) Σ . The square of this minimization problem can be formulated as a convex quadratic program min t R, π k n t | t Σ 1/2 n ϕa Σ 1/2Φπ 2 2 , a A o . (30) Because Σ 1/2Φπ Rd and is a convex combination of points in Rd, there exists an optimal π0 such that l = | a A | π0 a > 0 | d + 1 (Rockafellar & Wets, 2009). Then, let ˆa arg maxa A π0 a . We have that π0 ˆa 1 l because l actions are positive, and the constraint t Σ 1/2 n ϕa Σ 1/2Φπ 2 2 is active (holds with equality). If the constraint were not active, this would be a contradiction with the optimality of π0 because decreasing π0 ˆa would reduce the objective. Then, using the inequalities above and the triangle inequality, we get that the optimal t in (30) satisfies t = Σ 1/2 n ϕˆa Σ 1/2Φπ0 2 = 1 max a A π0 a Σ 1/2 n ϕˆa Σ 1/2 n ϕa 2 1 max a A π0 a Σ 1/2 n ϕˆa Σ 1/2 n ϕa 2 2 1 max a A π0 a Σ 1/2 n ϕa 2. The remainder of the proof follows from the same steps as the proof of Theorem 5.2. The lower bound on maxa A ˆπa holds from the existence of π0 with at most d + 1 positive elements, as discussed above. C.2. Proof of Theorem 5.3 Proof of Theorem 5.3. First, from the construction of Example 1, we have that a1 arg min a A µa β σa = arg min a A β σa β σa = A, and therefore πLCB is the policy returned by LCB that breaks ties as specified. Then, using Theorem 4.4, we bound the regret of LCB as Rδ(πLCB) µa2 + σa2 κl(k 1) = β σa2 + σa2 κl(k 1) = (β + κl(k 1)) σa2. In contrast, Greedy selects a2 deterministically since a2 arg min a A µa = {a2, . . . , ak} . 
C.2. Proof of Theorem 5.3

Proof of Theorem 5.3. First, from the construction of Example 1, we have that
$$a_1 \;\in\; \arg\min_{a \in \mathcal{A}} \; -\mu_a + \beta\,\sigma_a \;=\; \arg\min_{a \in \mathcal{A}} \; -\beta\,\sigma_a + \beta\,\sigma_a \;=\; \mathcal{A} ,$$
and therefore $\pi_{\mathrm{LCB}}$ is the policy returned by LCB that breaks ties as specified. Then, using Theorem 4.4, we bound the regret of LCB as
$$R_\delta(\pi_{\mathrm{LCB}}) \;\ge\; \mu_{a_2} + \sigma_{a_2}\, \kappa_l(k-1) \;=\; \beta\,\sigma_{a_2} + \sigma_{a_2}\, \kappa_l(k-1) \;=\; \bigl( \beta + \kappa_l(k-1) \bigr)\, \sigma_{a_2} .$$
In contrast, Greedy selects $a_2$ deterministically since
$$a_2 \;\in\; \arg\min_{a \in \mathcal{A}} \; -\mu_a \;=\; \{ a_2, \dots, a_k \} .$$
Then, using Theorem 4.3, and (11) in particular, we upper bound the regret of $\pi_{\mathrm{G}}$ as
$$R_\delta(\pi_{\mathrm{G}}) \;\le\; \max_{a \in \mathcal{A}} \; \mu^{\pi_{\mathrm{G}}}_a + \sigma^{\pi_{\mathrm{G}}}_a\, \kappa_u(k) \;=\; \max_{a \in \{a_2, \dots, a_k\}} \; \mu^{\pi_{\mathrm{G}}}_a + \sigma^{\pi_{\mathrm{G}}}_a\, \kappa_u(k) \;=\; \max_{a \in \{a_2, \dots, a_k\}} \; \sqrt{\sigma_a^2 + \sigma_{a_2}^2}\; \kappa_u(k) \;=\; \sqrt{2}\,\sigma_{a_2}\, \kappa_u(k) .$$
The equalities follow from substituting the definitions of the relative means and variances and from algebraic manipulation.

D. Sub-Gaussian Posterior

We discuss here how our results extend to $\tilde{\theta}_D$ with a sub-Gaussian distribution. The necessary modifications are quite minor. The key to the approach is to generalize Theorem 4.3 to sub-Gaussian distributions, as the following theorem states.

Theorem D.1. Suppose that $\tilde{\theta}_D$ is a random variable with an atomless distribution that is sub-Gaussian with mean $\mu$ and covariance factor $\Sigma$. Then the regret of each $\pi \in \Delta_k$ satisfies
$$R_\delta(\pi) \;\le\; \min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \; \operatorname{VaR}_{1-\delta\xi_a}\bigl[ r(a; \tilde{\theta}_D) - r(\pi; \tilde{\theta}_D) \bigr] \;\le\; \min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \; \mu^{\mathsf T} \Phi(\mathbf{1}_a - \pi) + \|\Phi(\mathbf{1}_a - \pi)\|_{\Sigma}\, \sqrt{2 \log\bigl(1/(\delta\xi_a)\bigr)} . \quad (31)$$

Proof. The first inequality in (31) holds by Theorem 4.3, since that inequality does not require the posterior to be normal. That is, we have
$$R_\delta(\pi) \;\le\; \min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \operatorname{VaR}_{1-\delta\xi_a}\bigl[ r(a; \tilde{\theta}_D) - r(\pi; \tilde{\theta}_D) \bigr] \;=\; \min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \operatorname{VaR}_{1-\delta\xi_a}\bigl[ (\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T} \tilde{\theta}_D \bigr] \;\le\; \min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \operatorname{EVaR}_{1-\delta\xi_a}\bigl[ (\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T} \tilde{\theta}_D \bigr] .$$
The last inequality follows from Lemma A.1. For each $a \in \mathcal{A}$, the definition of a multivariate sub-Gaussian random variable in (24) with $w^{\mathsf T} = (\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T}$ implies that $(\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T} \tilde{\theta}_D$ is sub-Gaussian with mean $(\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T} \mu$ and variance factor $(\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T} \Sigma \Phi (\mathbf{1}_a - \pi)$. Therefore, from Lemma A.2 we have
$$\min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \operatorname{EVaR}_{1-\delta\xi_a}\bigl[ (\mathbf{1}_a - \pi)^{\mathsf T} \Phi^{\mathsf T} \tilde{\theta}_D \bigr] \;\le\; \min_{\xi \in \Delta_k} \max_{a \in \mathcal{A}} \; \mu^{\mathsf T} \Phi(\mathbf{1}_a - \pi) + \|\Phi(\mathbf{1}_a - \pi)\|_{\Sigma}\, \sqrt{2 \log\bigl(1/(\delta\xi_a)\bigr)} ,$$
which proves the result.

Theorem 4.5 can also be extended to the sub-Gaussian setting, but doing so appears to require the additional assumption that $\|\tilde{\theta} - \mu\|^2_{\Sigma^{-1}}$ is a sub-gamma random variable; we leave this extension for future work. Armed with Theorem D.1, we can adapt Algorithm 1 to the sub-Gaussian setting simply by setting $\nu^0_a = \sqrt{2 \log(k/\delta)}$. Note that (14) already uses the correct inequality for a sub-Gaussian distribution.
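To make the adaptation concrete, the first iteration of Algorithm 1 with the sub-Gaussian penalty $\nu^0_a = \sqrt{2\log(k/\delta)}$ amounts to a second-order conic program over the policy simplex. The sketch below is our illustration rather than the paper's code: the feature matrix $\Phi$, posterior mean $\mu$, and covariance factor $\Sigma$ are randomly generated stand-ins, and the program is written in epigraph form so that any SOCP-capable solver (for example MOSEK through cvxpy) can handle it.

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
d, k, delta = 4, 10, 0.05                      # illustrative sizes and confidence level
Phi = rng.normal(size=(d, k))                  # column a is the feature vector phi_a
mu = rng.normal(size=d)                        # posterior mean of theta_D (stand-in)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)              # posterior covariance factor (stand-in)
S = np.real(sqrtm(Sigma))                      # Sigma^{1/2}

nu = np.sqrt(2 * np.log(k / delta))            # nu_a^0 for the sub-Gaussian case

# Epigraph form of  min_pi max_a  mu^T Phi (1_a - pi) + nu * ||Phi (1_a - pi)||_Sigma.
pi = cp.Variable(k, nonneg=True)
rho = cp.Variable()
constraints = [cp.sum(pi) == 1]
for a in range(k):
    diff = Phi[:, a] - Phi @ pi                # Phi (1_a - pi)
    constraints.append(mu @ diff + nu * cp.norm(S @ diff, 2) <= rho)
prob = cp.Problem(cp.Minimize(rho), constraints)
prob.solve()                                    # e.g., solver=cp.MOSEK if it is installed

print("regret bound rho:", prob.value)
print("policy pi0:", np.round(pi.value, 3))
```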
E. Other Objectives

We now briefly discuss two other related objectives as alternatives to minimizing the high-confidence Bayesian regret defined in (4) and (5). These objectives may be preferable in some settings because they can be solved optimally using simple and tractable techniques.

E.1. Expected Bayes Regret

The first objective we discuss is the expected Bayes regret, obtained by replacing the VaR in (5) with an expectation. In this case, the goal of the agent is to minimize the expected regret, defined as
$$\min_{\pi \in \Delta_k} \; \mathbb{E}\Bigl[ \max_{a \in \mathcal{A}} \; r(a; \tilde{\theta}_D) - r(\pi; \tilde{\theta}_D) \Bigr] .$$
Using the linearity of the expectation operator and of the reward function $r$, we have
$$\arg\min_{\pi \in \Delta_k} \mathbb{E}\Bigl[ \max_{a \in \mathcal{A}} \; r(a; \tilde{\theta}_D) - r(\pi; \tilde{\theta}_D) \Bigr] \;=\; \arg\max_{\pi \in \Delta_k} \mathbb{E}\bigl[ r(\pi; \tilde{\theta}_D) \bigr] \;=\; \arg\max_{\pi \in \Delta_k} r\bigl( \pi;\, \mathbb{E}[\tilde{\theta}_D] \bigr) .$$
This means it is sufficient to maximize the return for the mean posterior parameter value. In most cases, such as when the posterior over $\tilde{\theta}_D$ is normal, this is an easy optimization problem to solve optimally.

E.2. High-Confidence Return

The second objective we discuss is the high-confidence return, obtained by replacing the regret in (5) with the return. In this case, the goal of the agent is to minimize the VaR of the return random variable:
$$\min_{\pi \in \Delta_k} \operatorname{VaR}_{1-\delta}\bigl[ -r(\pi; \tilde{\theta}_D) \bigr] \;=\; \min_{\pi \in \Delta_k} \operatorname{VaR}_{1-\delta}\bigl[ -\pi^{\mathsf T} \Phi^{\mathsf T} \tilde{\theta}_D \bigr] . \quad (32)$$
One may think of this objective as minimizing the regret with respect to $0$. The reward inside is negated because we use VaR, which measures costs rather than rewards. Note that $\operatorname{VaR}_{1-\delta}[-\tilde{x}] = -\operatorname{VaR}_\delta[\tilde{x}]$ for atomless (continuous) distributions. When $\tilde{\theta}_D \sim \mathcal{N}(\mu, \Sigma)$, the optimization in (32) can be solved optimally using an LCB-style algorithm. Using the properties of linear transformations of normal distributions, for each $\pi \in \Delta_k$ we obtain
$$\pi^{\mathsf T} \Phi^{\mathsf T} \tilde{\theta}_D \;\sim\; \mathcal{N}\bigl( \pi^{\mathsf T} \Phi^{\mathsf T} \mu,\; \pi^{\mathsf T} \Phi^{\mathsf T} \Sigma \Phi \pi \bigr) .$$
Combining the objective in (32) with (2), we get that the objective is
$$\max_{\pi \in \Delta_k} \; \pi^{\mathsf T} \Phi^{\mathsf T} \mu - \sqrt{\pi^{\mathsf T} \Phi^{\mathsf T} \Sigma \Phi \pi}\;\, z_{1-\delta} . \quad (33)$$
Recall that $z_{1-\delta}$ is the $(1-\delta)$-quantile of the standard normal distribution. We can reformulate (33) as the following second-order conic program (for $\delta \le 1/2$):
$$\begin{aligned} \underset{\pi \in \mathbb{R}^k,\; s \in \mathbb{R}}{\text{maximize}} \quad & \pi^{\mathsf T} \Phi^{\mathsf T} \mu - z_{1-\delta}\, s \\ \text{subject to} \quad & s^2 \ge \pi^{\mathsf T} \Phi^{\mathsf T} \Sigma \Phi \pi, \qquad \mathbf{1}^{\mathsf T} \pi = 1, \qquad \pi \ge 0 . \end{aligned}$$
When restricted to deterministic policies, the optimization in (33) reduces to a plain deterministic LCB algorithm. The FlatOPO algorithm can be seen as an approximation of (33) in which $z_{1-\delta}$ is replaced by its upper bound.
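As a concrete illustration, the program above is easy to state with a conic modeling tool. The sketch below is ours, not the paper's code, and uses randomly generated $\mu$, $\Sigma$, and $\Phi$ as stand-ins; it replaces the constraint $s^2 \ge \pi^{\mathsf T}\Phi^{\mathsf T}\Sigma\Phi\pi$ with the equivalent second-order cone constraint $s \ge \|\Sigma^{1/2}\Phi\pi\|_2$, which is the form a solver such as MOSEK accepts.

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import norm

rng = np.random.default_rng(3)
d, k, delta = 4, 10, 0.1                      # illustrative sizes and confidence level
Phi = rng.normal(size=(d, k))                 # column a is phi_a
mu = rng.normal(size=d)                       # posterior mean (stand-in)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)             # posterior covariance (stand-in)
S = np.real(sqrtm(Sigma))                     # Sigma^{1/2}
z = norm.ppf(1 - delta)                       # z_{1-delta}, nonnegative for delta <= 1/2

# SOCP for (33): maximize pi^T Phi^T mu - z_{1-delta} * s  s.t.  s >= ||Sigma^{1/2} Phi pi||_2.
pi = cp.Variable(k, nonneg=True)
s = cp.Variable(nonneg=True)
objective = cp.Maximize((Phi.T @ mu) @ pi - z * s)
constraints = [cp.norm(S @ (Phi @ pi), 2) <= s, cp.sum(pi) == 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print("high-confidence return:", prob.value)
print("randomized policy:", np.round(pi.value, 3))
```

Restricting $\pi$ to the vertices of the simplex in this sketch recovers the plain deterministic LCB rule mentioned above.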
F. Additional Experimental Details

In this section, we provide some additional experimental results. First, Figures 5 and 6 report the same results as Figures 3 and 4, but additionally show the 95% confidence interval for the average regret over the 100 runs. Second, Figure 7 reports the effect of the tightening step on the quality of the bounds compared to a scenario-based estimation. In this simplified example, we fix some policy $\pi$ and assume particular parameters for the distribution of $\tilde{x}_a = \tilde{\theta}_D^{\mathsf T} \Phi(\mathbf{1}_a - \pi)$, $a \in \mathcal{A}$, which is normal by Lemma 4.2. The results in the figure show that when the distribution of $\tilde{x}$ is close to i.i.d., the tightening step does not improve the bound. This is expected since the optimal $\xi$ in (14) is nearly uniform. However, when the means or variances of the $\tilde{x}_a$ vary across actions $a \in \mathcal{A}$, the tightening step can significantly reduce the error bound.

Figure 5. Bayesian regret (y-axis) as a function of the sample count $n$ (x-axis) for BRMOB, FlatOPO, Greedy, and Scenario, with $k = d = 5$ (left) and $k = d = 50$ (middle and right). The prior mean is $\mu_0 = 0$ (left and middle) and $(\mu_0)_a = a$ for $a = 1, \dots, 50$ (right).

Figure 6. Bayesian regret (y-axis) as a function of the sample count $n$ (x-axis) with $d = 4$ and $k = 10$ (left), $k = 50$ (middle), and $k = 100$ (right).

Figure 7. Regret bounds in Theorem 4.3 (y-axis) for different choices of $\xi$ (Scenario, uniform $\xi$, tightened $\xi$) as a function of the action count $k$ (x-axis). The posterior distribution of $\tilde{x}$ is normal with $\mu = 0$, $\Sigma = I$ (left); $\mu = 0$, $\Sigma_{aa} = a^2/k$ (middle); and $\mu_a = a/k$, $\Sigma_{aa} = a^2/k$ (right), with $a = 1, \dots, k$.

Figure 8. Runtime comparison of the algorithms (BRMOB, FlatOPO, Greedy, Scenario) in seconds (y-axis) as a function of the action count $k$ (x-axis) for a problem with $\mu = 0$ and $\Sigma = I$.

Figure 8 compares the runtime of the algorithms considered as a function of the number of arms. The runtime excludes the time needed to compute the posterior distribution, which is independent of the particular method considered. We use MOSEK to solve the SOCP formulation and do not run any tightening steps. The number of samples $m$ needed for the Scenario algorithm was derived from the Dvoretzky–Kiefer–Wolfowitz bound as
$$m \;=\; \frac{100}{(1 - 0.95)^2}\, \log(2k) .$$
This number of samples guarantees a small sub-optimality gap with probability 95%. We suspect, however, that this number of samples could be reduced with more careful assumptions and algorithmic design (Calafiore & Campi, 2005; Nemirovski & Shapiro, 2007; 2006). Such an analysis is beyond the scope of this work.
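For reference, the sample-count rule above and a plain Monte Carlo estimate of the high-confidence regret of a fixed policy can be written in a few lines. The sketch below is our illustration only and is not necessarily the exact Scenario implementation used in the experiments; the inputs $\Phi$, $\mu$, $\Sigma$, and the uniform policy are hypothetical stand-ins.

```python
import numpy as np

def scenario_sample_count(k: int) -> int:
    """Sample count m = 100 / (1 - 0.95)^2 * log(2k) from the DKW-based rule above."""
    return int(np.ceil(100.0 / (1.0 - 0.95) ** 2 * np.log(2 * k)))

def empirical_regret_var(pi, Phi, mu, Sigma, delta, m, seed=0):
    """Monte Carlo estimate of the (1 - delta)-quantile of max_a [r(a; theta) - r(pi; theta)]
    under a normal posterior theta ~ N(mu, Sigma). Illustrative check, not the paper's code."""
    rng = np.random.default_rng(seed)
    thetas = rng.multivariate_normal(mu, Sigma, size=m)   # m posterior draws of theta
    rewards = thetas @ Phi                                 # (m, k): r(a; theta) = theta^T phi_a
    regret = rewards.max(axis=1) - rewards @ pi            # max_a r(a) - r(pi) for each draw
    return np.quantile(regret, 1 - delta)

# Example usage with small illustrative inputs.
d, k, delta = 3, 5, 0.05
rng = np.random.default_rng(0)
Phi = rng.normal(size=(d, k))
mu, Sigma = np.zeros(d), np.eye(d)
pi = np.full(k, 1.0 / k)                                   # uniform policy as an example
m = scenario_sample_count(k)
print("m =", m, " estimated R_delta(pi):", empirical_regret_var(pi, Phi, mu, Sigma, delta, m))
```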