# bandits_with_mean_bounds__81ff3f24.pdf Published in Transactions on Machine Learning Research (11/2024) Bandits with Mean Bounds Nihal Sharma nihal.sharma@utexas.edu The University of Texas at Austin Soumya Basu basusoumya@google.com Google Karthikeyan Shanmugam karthikeyanvs@google.com Google Deep Mind Sanjay Shakkottai sanjay.shakkottai@utexas.edu The University of Texas at Austin Reviewed on Open Review: https: // openreview. net/ forum? id= 4TZ4DE24f X We study a variant of the bandit problem where side information in the form of bounds on the mean of each arm is provided. We prove that these translate to tighter estimates of subgaussian factors and develop novel algorithms that exploit these estimates. In the linear setting, we present the Restricted-set OFUL (R-OFUL) algorithm that additionally uses the geometric properties of the problem to (potentially) restrict the set of arms being played and reduce exploration rates for suboptimal arms. In the stochastic case, we propose the non-optimistic Global Under-Explore (GLUE) algorithm which employs the inferred subgaussian estimates to adapt the rate of exploration for the arms. We analyze the regret of R-OFUL and GLUE, showing that our regret upper bounds are never worse than that of the standard OFUL and UCB algorithms respectively. Further, we also consider a practically motivated setting of learning from confounded logs where mean bounds appear naturally. 1 Introduction We study the problem of bandits with mean bounds and bounded rewards where an agent is presented with a set of K arms, along with side-information in the form of upper and lower bounds on the average reward for each of the arms. The agent is then asked to successively choose arms based on previous observations in order to maximize the cumulative reward. As is standard in MABs, the agent s performance is compared to that of a genie that always chooses the arm that gives the largest reward in expectation. We consider this in the commonly studied frameworks of stochastic Multi-Armed Bandits (MABs) and the Linear Bandit. In the former, rewards of each arm are drawn independently from its associated distribution and are uncorrelated with rewards of other arms. The linear setting couples the rewards from all arms through an unknown, but fixed latent vector that is used to paramterize the mean rewards. We seek to design arm selection policies that efficiently utilize the provided mean bounds in offering both improve regret performance and computational complexity. Our setting is motivated by the problem of inferring efficacy of interventions from confounded logs. As a concrete example, consider the following healthcare example: Interavenous tissue plasminogen (t PA) activators are known to be highly effective in treating acute ischemic strokes if administered within 3 hours of the onset of symptoms. Otherwise, a less-effective medical therapy is recommended, as t PA causes higher chances of adverse side effects and hemorrhages Powers et al. (2019). If a log is then generated without recording the time since the onset of symptoms, it would present t PA to be better than the alternative. Naively using such a log to infer t PA to be the best intervention could lead to unfavourable outcomes in areas with poor access Published in Transactions on Machine Learning Research (11/2024) to healthcare, where patients take longer to reach the hospital after symptoms appear. Inferring the optimal decision in the presence partial contextual information is non-trivial in general, however, one can extract bounds on the mean of the treatments using these logs. The offline problem of extracting bounds on mean effects of interventions is well-studied by works such as Robins et al. (2000); Brumback et al. (2004); Richardson et al. (2014); Zhang & Bareinboim (2017); Yadlowsky et al. (2022). We are interested in using these bounds to aid the online learning of optimal actions, also studied in Zhang & Bareinboim (2017); Combes et al. (2017) for the Bernoulli MABs (the latter also treats gaussian rewards). In this work, we show that using the mean bounds in non-trivial ways can lead to improvements in regret performance over existing methods in the case of linear bandits and stochastic MABs with this side information. The key intuition for our improvements comes from the notion of subgaussian factors of bounded random variables. A random variable taking values in [0, 1] is known to be 1/2 subgaussian, which is tight for a Bernoulli random variable with mean 1 2. Given additional information about the location of its mean (in an interval, say), tighter estimates of the variance can be inferred. However, whether such information provides a sharper subgaussian factor, which in itself is an upper bound to the variance, is not known. The worst-case factor above is commonly used in the bandit setting, to characterize the concentration behavior of (bounded) random variables around their mean. Further, the jump from estimation to decision-making means that estimates can be incorrect (e.g., biased, poor confidence) so long as it does not alter the decision. In a bandit setting, this observation has been classically used to explore sub-optimal actions at a lower rate than that for the best action; this leads to a worse accuracy bound for the reward estimates of sub-optimal actions but does not affect decision-making. In this paper, we go beyond this intuition: we show that, surprisingly, known bounds on mean rewards for some actions (side information) enable us to explore other possibly unrelated actions at lower-rates, thus improving overall cumulative regret. Our contributions are detailed below: Contribution 1: Improved Subgaussian Factors with Mean Bounds: We provide a characterization of the subgaussian factor σ 1 2 of any random variable bounded in [0, 1] when upper and lower bounds on its mean are known. Specifically, in Theorem 3 and Corollary 3.1, we show that when the mean is known to be towards either half of this interval, one can infer factors that are strictly less than 0.5. These immediately imply tighter concentration bounds for bounded random variables. This result could be of independent interest, however, we study the effects of such information in bandit learning. Contribution 2: Linear Bandits: We present the Restricted-set Optimism in the Face of Uncertainty for Linear bandits (R-OFUL) algorithm for bounded rewards. R-OFUL first uses the structure imposed by the linear rewards to refine the given side information and produces the tightest possible mean bounds for each arm in the action set before online interaction. Then, at each time, it leverages the geometry of the problem to restrict the set of arms to be considered. Finally, it reduces exploration by using the sharp estimates of subgaussian factors above for arms in the restricted set and chooses actions much like the standard OFUL algorithm of Abbasi-Yadkori et al. (2011). We show that the lower bounds are key to our improvements in the online phase of the algorithm. First, the restricted set is constructed as a cone around the arm with the largest lower bound. Combining the lower bounds of the arms that remain with the boundedness of rewards then leads to our reduced exploration rates. Our analysis in Theorem 4 shows that using side information, R-OFUL can improve over the regret guarantees of standard OFUL by a constant factor. It also improves the computational cost due to the restriction of arms. To the best of our knowledge, this is the first investigation on Linear Bandits with Mean bounds. Contribution 3: Stochastic MABs: We develop GLobal Under-Explore (GLUE) an index based policy which, unlike the UCB Auer et al. (2002) and kl-UCB Cappé et al. (2013) algorithms, is not optimistic; i.e. the indices do not serve as high-probability upper bounds to true means of each arm. We use the fact that violating the upper bound property for the indices of suboptimal arms does not adversely affect the regret as long as the property is maintained for the (unknown) best arm. In particular, our indices for an arm are formed using a quantity that can be strictly lesser than the true subgaussian factor of the arm only if the arm is sub-optimal. This causes the sub-optimal arms to be under-explored which leads to improvement in regret performance. Published in Transactions on Machine Learning Research (11/2024) This problem is a specific form of the Structured Bandit framework of Combes et al. (2017) where the authors develop OSSB, a non-optimistic algorithm to balance exploration across arms in structured spaces. In the case of Bernoulli rewards, OSSB reduces to the B-kl-UCB algorithm of Zhang & Bareinboim (2017) that uses the kl-UCB index for each arm truncated at the corresponding upper bound. In both B-kl-UCB and OSSB, the lower bounds on arm means are only used initially to prune away arms that can be identified as suboptimal, and the upper bounds are used to clip the arm indices at each time. For general bounded rewards, the upper bound on s.g-factor of the optimal arm can be inferred from the highest lower bound of arm means. Therefore, we define the exploration rates of the arms to be the minimum of the individual arm subgaussian factors and the aforementioned upper bound. Our analysis in Theorem 6 shows that in instances with non-informative mean bounds, we recover the performance of UCB. However, using our adapted rates in instances with rich side information leads to GLUE significant improvements over vanilla UCB. Empirically, we see that our performance is comparable to B-KL-UCB when it is known that one of the arms has a mean close to 1. Contribution 4: Mean bounds from confounded logs: We develop techniques to extract upper and lower bounds on the means of arm rewards from partially confounded logged data. Specifically, we consider a dataset that has been collected by an oracle that observes the full context, takes the optimal action and receives the corresponding rewards. However, the log only contains some parts of the context along with the corresponding (action, reward) pair, generalizing the work of Zhang & Bareinboim (2017), where none of the contexts are recorded. In the stochastic case, using bounds on the gap between the means of the best and second best arms as observed by the oracle, as well as the corresponding gap between best and worst arms, we derive upper and lower bounds on the mean rewards of arms to be used by an agent that acts only based only on the recorded parts of the context. We show that these are tight, i.e. there are instances that meet both the upper and lower bounds. We also show that these bounds can be inferred in a linear setting without the knowledge of gaps. We validate our work through synthetic and semi-synthetic experiments with the Movielens 1M dataset Harper & Konstan (2015). 1.1 Related Work Bandit problems have seen a lot of interest over the past few decades (see Bubeck et al. (2012); Lattimore & Szepesvári (2020) for comprehensive surveys). A vein of generalization for the same has seen numerous advances in incorporating several forms of side information to induce further structure into this setting. Notable among them are graph-structures information (Buccapatnam et al., 2014; Valko et al., 2014; Amin et al., 2012), latent models (Li et al., 2010; Bareinboim et al., 2015; Lattimore et al., 2016; Sen et al., 2017), expert models (Auer et al., 1995; Mannor & Shamir, 2011), smoothness of the search space (Kleinberg et al., 2008; Srinivas et al., 2010; Bubeck et al., 2011), among several others. We assume side information in the form of mean bounds and study how such information affects the decision making process. In the stochastic setting, our bandit problem has connections to the works by Zhang & Bareinboim (2017) and Combes et al. (2017). An in-depth comparison with these can be found in Section 5.3. Along another thread, Bubeck et al. (2013) provide algorithms with bounded regret if the mean of the best arm and a lower bound on the minimum suboptimality gap is known. These techniques, however, do not apply in our setting as the side information we consider does not allow us to extract such quantities in general. In the linear setting, with time-varying action sets, the works of Li et al. (2010); Abbasi-Yadkori et al. (2011) are inspired by the upper confidence bound-type arguments of Auer et al. (2002) for the stochastic case. When action sets remain fixed over time, arm-elimination type algorithms like ones in Valko et al. (2014) improve dependence on the dimension of the arms. We study the novel setting of linear bandits with mean bounds and varying action sets in this work. The extraction of mean bounds from confounded logs has been studied in the context of estimating treatment effects in the presence of confounders. Here, actions are treatments, and the rewards capture the effects of this choice. A line of existing work performs sensitivity analysis by varying a model on the latents, measured variables, treatments and outcomes in a way that is consistent with the observed data (Robins et al., 2000; Brumback et al., 2004; Richardson et al., 2014). Recently in (Yadlowsky et al., 2022; Zhao et al., 2019), a universal bound on the ratio of selection bias due to the unobserved confounder is assumed. This means that Published in Transactions on Machine Learning Research (11/2024) the treatment choice has a bounded sensitivity on the unknown context (i.e., mostly irrelevant). We deal with the other extreme, where we assume that the outcomes in the log are recorded using an unknown optimal policy (under complete information), and that the knowledge of worst case sub-optimality gaps for the given latent context space is known. Our assumption allows for strong dependence on the unknown context. The use of logged data to improve online learning has been studied recently by (Zhang & Bareinboim, 2017; Zhang et al., 2019; Ye et al., 2020). The first assumes that the log contains no information of the variables that affect reward generation, while the others assumes that all such variables are present. We consider the middle ground, by assuming that a fraction of these variables are included in the logs. the authors of Tennenholtz et al. (2020) studied a related linear bandit problem where the agent is provided with partial observations collected offline according to a fixed behavioral policy which can be sampled from. These are then used to aid online decision making after the agent observes the full context. In contrast, we consider the case where the agent can only observe the partial context at each time and is provided with confounded logs collected from a policy (to which we do not assume sampling access) that is optimal under the full context. 2 Bandits with Mean Bounds We consider the round-based interaction of a learning agent with a stochastic environment through a set of K0 actions (or arms). At each round, the agent chooses one of the provided arms and observes a stochastic reward. To aid its decision making, the agent is also provided with side information in the form of bounds on the mean reward of each arm. The goal of the agent is to choose arms such that the cumulative reward is competitive with respect that obtained by a genie that only chooses the best arm in each round. In this work, we concentrate on two commonly studied formulations of this problem: Stochastic Multi-armed Bandits and Linear Bandits. We detail the notations, structure of rewards, and notions of the best arm for each of these below. Stochastic Multi-armed Bandit: The agent is provided with K0 arms indexed by the set [K0] = {1, 2, ...K0}, with each arm being associated with a fixed and unknown reward distribution supported over the interval [0, 1]. The mean reward of arm k [K0] is denoted by µk. For each arm k [K0], the side information is given by a tuple (lk(t), uk(t)) such that µk [lk(t), uk(t)] and lk(t), uk(t) [0, 1]. These tuples thus specify upper and lower bounds on the mean reward for each of the arms and can be different for each arm. With this knowledge, at round t, the agent chooses an arm At and observes a reward Yt sampled from the distribution associated with the chosen arm. We define k = arg maxk [K0] µk be the (unique) best arm with µk = µ and the genie chooses this arm at each round. The agent thus aims to minimize its average cumulative regret, which at round T is given by RT = PT t=1 µ E[Yt]. The expectation here is over the randomness of the rewards and the choice of arms of the agent. If we have that for all k [K0], lk = 0, uk = 1, then our setting matches that of the standard Multi-armed Bandit with bounded rewards. Linear Bandit: In this case, in each round t, the agent is provided a (possibly different) set of action At A sampled from a (possibly infinite) set of actions A such that At = K0 . Each arm a A is a vector in Rd. At round t, the agent chooses an arm At At and observes a reward Yt = At, θ + ηt where θ Rd is a fixed unknown vector. The noise ηt is a conditionally σ(At) subgaussian random variable with respect to the filtration Ft = σ (A1, Y1, ..., At 1, Yt 1, At}). This arm-specific subguassian factor is not revealed to the agent. As in the case above, for all rounds t, Yt [0, 1] and the agent is provided with tuples (la, ua) for each a A such that θT a [la, ua] and la, ua [0, 1]. We also assume that θ [m, M] and that a = 1 ( is the Euclidean norm) for a A. As before, the agent aims to maximize its cumulative reward in order to remain competitive with a genie that chooses the arm with highest mean reward at each round. Equivalently, the agent aims to minimize the regret, which at round T is given by RT = PT t=1 maxa At a At, θ . In contrast to the setting above, RT is a random variable due to the randomness in At. With la = 0, ua = 1 for a A, our setting becomes that of the standard Linear Bandit with bounded rewards. Published in Transactions on Machine Learning Research (11/2024) For both the settings above, we note three things: a) The agent is allowed to use the all historical information and accumulated rewards to inform future arm choices, b) The provided bounds do not restrict the range of observed rewards, but only their mean, c) Our discussion can be easily generalized to bounded rewards supported over any interval [a, b] by appropriate shifting and scaling. Our methods will also apply when bounds were known on the norm of the actions instead of the strict equality in the linear bandit case. In the standard versions of both the above settings, Optimism-based policies have been well-studied. In the stochastic case, the standard UCB algorithm in Auer et al. (2002) and the KL-UCB algorithm in Cappé et al. (2013) achieve provably optimal performance for specific families of reward distributions. However, with non-trivial mean bounds, it is shown in Zhang & Bareinboim (2017) that one can outperform these methods by leveraging the information that is provided by these bounds. In the linear case, the OFUL algorithm of Abbasi-Yadkori et al. (2011) (or Lin UCB of Li et al. (2010)) provides optimal regret guarantees up to logarithmic factors (see Chapter 24 in Lattimore & Szepesvári (2020)). To the best of our knowledge, the linear bandit problem with mean bounds has not been considered before. These naive algorithms that do not use the knowledge of the provided side information will serve as baselines to the methods that we develop. Before we propose our methods, we will first spend time developing improved concentration bounds for empirical means of random variables given mean bounds. These concentrations will be crucial in analyzing our arm selection policies for both the settings above. 3 Improved Concentrations with Mean Bounds A random variable X is said to be σ subgaussian if and only if E [exp(s(X E[X])] exp s2σ2 2 . As a consequence, we have the following Chernoff-Hoeffding concentration inequality for σ subgaussian variables: P(X t) exp t2 Further, it is well known that any random variable that is bounded in an interval [a, b] is b a 2 subguassian. With no additional information about this variable, we can not improve this factor. Suppose the mean of the random variable were known. We ask the question Does this additional information lead to a tighter (< 1 2) estimate of the subgaussian factor? . This is important because a tighter subgaussian parameter would lead to faster concentrations of the random variable. Suppose now that the random variable X [0, 1] has mean m that is known. Since it is bounded, the variance of this random variable is bounded by that of a Bernoulli random variable with the same mean. Further, the square of the subgaussian factor for any random variable is an upper bound on its variance (See Lemma 5.4 in Lattimore & Szepesvári (2020)). Therefore, we have that var(X) m(1 m) sg(Bernoulli(m))2 1 2 where var( ) is the variance and sg( ) is the subgaussian factor of the argument. We will now show that there indeed exist σ (m(1 m), 1 2) such that any random variable X [0, 1] with mean m (0, 1) is σ subgaussian (the cases when m = 0, 1 are trivial). For this to hold, we require that E h es(X m)i m exp (s(1 m)) + (1 m) exp ( ms) exp s2σ2 s2 log (1 m)e ms + µes(1 m) max s R 2 s2 log mes(1 m) + (1 m)e ms (1) Thus, it is sufficient to prove that there exist a unique maximizer to the RHS of Equation 1 and that the achieved maxima is < 1 2. Below, we give the steps involved in proving this. The full proof is deferred to Appendix A. We will use a sequence of lemmas to establish our result. For simplicity, we define fm(x) = mex(1 m) + (1 m)e mx and begin by proving the following facts: Lemma 1. For all m (0, 1) and x R , fm(x) > 1. Further, fm(x) = f1 m( x) and limx 0 2 x2 log (fm(x)) = m(1 m). Published in Transactions on Machine Learning Research (11/2024) Given these properties of fm(x), we then prove the following Lemmas: Lemma 2. The following are true: 1. When m (0.5, 1), the function 2 x2 log (fm(x)) is not maximized at any x > 0. 2. For all m (0, 0.5), x > 0, we have that with x1 as defined in Lemma 15 a) For all x (0, x1), 2 x2 log (fm(x)) > m(1 m), b) For all x > 2 m, 2 x2 log (fm(x)) < m(1 m). Combining these two Lemmas, we have our final result: Theorem 3. Let X [0, 1] be a random variable with mean m (0, 1), m = 0.5. Then, X is σ-subgaussian for σ2 = maxx R 2 x2 log me(1 m)x + (1 m)e mx . Additionally, p m(1 m) < σ < 1 Figure 1: Improved Subgaussian factor vs. Bernoulli variance The lower bound of p m(1 m) on σ implies that it increases in (0, 0.5) and decreases in (0.5, 1). In Figure 1, we plot the values of σ2 and m(1 m), the variance of a Bernoulli with mean m, for different values of m in (0, 1) to verify this trend. This leads to the following corollary: Corollary 3.1 (Improved Subgaussian factors with mean bounds). Let X be a random variable over [0, 1] such that E[X] [l, u] for some l, u [0, 1]. For any m (0, 1), let c(m) = maxs R 2 s2 log(fm(s)) with fm(x) = mex(1 m) + (1 m)e mx. Then, the following are true: 1. If l > 0.5, we have that X is p c(l)-subgaussian. 2. If u < 0.5, we have that X is p c(u)-subgaussian. Thus, bounds on the mean of the random variable that do not contain the worst-case value of 0.5 always lead to improved estimates of its subgaussian factor. We call such mean bounds to be non-trivial . 4 Restricted-set Optimism under Uncertainty for Linear Bandits with Mean Bounds Now that we know the implications of non-trivial mean bounds on the concentration behavior of a random variable, we move on to studying the effects of such information on the regret of a linear bandit agent. Recall that at each round t, the agent is given an action set At of K0 vectors in Rd sampled from a set A and must learn to choose actions At At. Observing Yt = At, θ + ηt, where θ is an unknown vector and ηt a conditionally σ(At)-subgaussian random variable with respect to the filtration generated by all observations up to t 1 and the choice of arm At, the agent competes with a genie that always picks the arm in the Published in Transactions on Machine Learning Research (11/2024) Algorithm 1 Rrestricted-set Optimism in the Face of Uncertainty for Linear bandits with mean bounds 1: Inputs: Action set A, bound tuples (la, ua) for each a A 2: Tightening bounds: Update the tuples (la, ua) as in Equation 2. 3: for t = 1,2,3,... do 4: Receive set At, identify ˆat and lmax(t). 5: Restricting instantaneous action sets: 6: Prune out all arms a with ua < lmax(t). 7: Form set Ar(t) defined in Equation 3. 8: Reduced exploration: 9: Compute σt as in Equation 4. 10: With Et defined in Equation 5, choose at to be arg maxa Ar(t),θ Et a, θ , observe reward Yt. 11: end for provided set with largest mean reward. That is, the agent aims to minimize its regret at round T given by RT = PT t=1 maxa At a At, θ . As is standard in Linear Bandits, this regret RT is random due to the randomness in the choice of AT and thus, we seek to minimize this regret with high probability. Specifically, we use δ to be the user-specified failure probability and develop a policy that minimizes regret with probability at least 1 δ. The agent is also provided with tuples (la, ua) for each arm a A which can be used to aid its decision making. Further, it is known that θ [m, M] and that for all a A, a = 1. We propose the Restricted-set Optimism under Uncertainty for Linear Bandits (R-OFUL) algorithm (summarized in Algorithm 1) that uses these mean bounds and the linear structure of rewards to minimize regret. At each round, the agent performs three steps: a) Tightening the mean bounds, b) Restricting the instantaneous action set, and c) Invoking the update similar to the OFUL algorithm of Abbasi-Yadkori et al. (2011) with potentially improved subgaussian factors that leads to reduced exploration. We describe each of these steps below. Tightening the Mean Bounds: Since all rewards are obtained using the same parameter θ , bounds on the mean of some action give us non-trivial information about the mean reward of all other actions. This fact is used to tighten the provided upper and lower bounds on arm means. Formally, define Cb = {θ : θ [m, M], a A, a, θ [la, ua]} be the set of feasible parameter vectors. Then, the tight bounds la, ua (we abuse notation by representing both the provided and tighter bounds by the same variables) are computed for each arm a A as la min θ Cb a, θ , ua max θ Cb a, θ . (2) After these tight bounds are computed, the set of feasible vectors Cb is recomputed with these updated versions of the mean bounds. Restricting instantaneous action sets: This phase is carried out after receiving the set At at time t and chooses a subset of arms to be considered at each round. First the agent prunes away deterministically suboptimal arms used lmax(t) = maxa At la. This restricts the set of arms in consideration to Ap(t) = {a : ua lmax(t)}. Now we denote by ˆat = arg maxa At la the most promising arm at time t and use ang(a, a ) = cos 1 ( a,a / a a ) as the angle between the vectors a and a . With αt = cos 1 (lmax(t)/M), the restricted set of actions at time t is given by Ar(t) = {a Ap(t) : ang(a, ˆat) 2αt} (3) Reduced Exploration: In this phase, the agent computes an upper bound on the s.g-factors of arms in the set Ar(t) and uses this to reduce exploration in its arm selection strategy. Recall that for any y (0, 1) we use c(y) = maxs R 2 s2 log(fy(s)) where fy(x) = ye(1 y)x + (1 y)e yx. For each arm a At with (tightened) Published in Transactions on Machine Learning Research (11/2024) mean bounds la, ua, we define ψ2(la, ua) = c(la) if la > 0.5 c(ua) if ua < 0.5 0.25 otherwise to be the function that maps the upper and lower bounds to the reduced s.g-factor as a result of Corollary 3.1. Then, upper bound on the s.g-factor of any arm in Ar(t) is given by σt = min ψ(m cos(3αt), b), max a Ar(t) ψ(la, ua) . (4) The agent forms the ridge regression estimate of the parameter θ at time t as ˆθt = V 1 t Pt n=1 an Yn with V t = Pt n=1 ana T n + λId. Here Id is the d-dimensional identity matrix and λ > 0 is a regularization parameter. It then defines the following set around θ : Ct = θ Cb : ˆθt θ V t 1 βt 1(δ) , p + d log 1 + t Here γ = maxn t σn and δ is the user-specified failure probability. The choice of arm At is given by At = arg max a Ar(t),θ Ct a, θ . (6) This selection rule is similar to that of vanilla OFUL in Abbasi-Yadkori et al. (2011). The difference here, we restrict a) the set Ct with the feasible set Cb, b) the set of arms that can be played at time t with Ar(t) and c) the confidence width βt(δ) using the tighter subgaussian estimate of γ. 4.1 Key Ideas 1. Deriving the restricted set: The boundedness and linearity of rewards implies that if there exists an arm that promises a high mean (in our case, ˆat), its angle with the parameter θ is upper bounded. This is captured by the quantity αt. Further, it also implies that the angle between the best arm in the set Ar(t) and θ can be no more than αt. Combining these, we get that the best arm lies in a cone of angle 2αt around ˆat. 2. Tight s.g-factors: Since the goal is to minimize regret, the worst-case arm that can be played (one with the lowest mean reward) in Ar(t) is one that is at an angle 3αt away from θ . As rewards are bounded, this lower bound on the mean reward of the worst arm can be translated into an upper bound on the s.g-factor of arms in Ar(t) using the function ψ. If the side information of arms in Ar(t) can provide sharper estimates on this upper bound, we use those instead. This is reflected in our definition of σt in Equation 4 4.2 Regret of R-OFUL We now present our high-probability regret guarantees for Algorithm 1. Theorem 4. With probability at least 1 δ, the regret of R-OFUL suffers a worst-case regret of 8dtβt 1(δ) log 1 + t Proof Sketch. First, we establish the following lemma: Lemma 5. Let a t = arg maxa At a, θ be the best arm at time t. Then, a t Ar(t). Further, for any arm a At, the s.g-factor σ(a) satisfies σa σt. Published in Transactions on Machine Learning Research (11/2024) (a) 5 arms per round (b) 10 arms per round (c) 15 Arms per round Figure 2: Comparing R-OFUL (Algorithm 1) with vanilla OFUL with arms in R10 and bounded rewards. Results are averaged over 200 runs and error bars for one standard deviation are displayed. R-OFUL restricts arms and only chooses between 2-3 arms per round on average and thus, its average regret is comparable over all three figures, while OFUL suffers regret that grows with the number of arms per round. We also observe that R-OFUL computes arm updates 6 6.7 faster on average. Using this lemma, we generalize the concentration arguments of Abbasi-Yadkori et al. (2011) to prove that with probability at least 1 δ, the parameter vector satisfies θ Cb. The result then follows using the standard arguments for bounding regret in linear bandits. The full proof can be found in Appendix B. 4.3 Comparisons and Discussions R-OFUL vs OFUL: The regret of the OFUL algorithm is of the order O p dtσmax(d + log(1/δ)) with σmax = (b a)/2 as the universal upper bound on the s.g-factor of the arms in the set A. If no informative bounds are present (the worst case) our R-OFUL algorithm matches the performance of OFUL exactly. However, richer side information can lead to a constant factor improvement in the finite-time regret guarantees as γ σmax. Further, in cases where Ar(t) = 1, the instance-dependent regret of R-OFUL is 0, while any linear bandit algorithm that does not perform this arm restriction will suffer non-zero regret on average. We also improve on the computation complexity of OFUL. Specifically, to compute At (Equation 6), the usual practice is to solve the optimization problem for all a individually for all arms and then choose one with the maximum objective value. If the mean bounds are such that |Ar(t)| < |At|, that is, if the set of arms is restricted by R-OFUL, then, so is the number of optimization sub-problems to solve, which speeds up computation. Optimality: We study a generalization of OFUL which is known to be optimal up to logarithmic factors in the worst case. However, it is known that optimality in general linear bandit problems is achieved by complex non-optimistic policies. The study of regret lower bounds and policies that match them are left open. Empirical Evaluations: We compare the regret performance of R-OFUL with the vanilla OFUL algorithm empirically in Figure 2. For this, we sample a random set of 100 arms in R10 and sample 5, 10, 15 of these in each round. We set θ as the vector 0, 1 10 and normalize it. To generate rewards, for each arm a At, we first sample a Gaussian random variable with mean a, θ and variance 0.8, and clip this to be in [0, 1] ([ 1, 0]) if a, θ > 0( 0) to form our bounded rewards. The upper and lower bounds on the rewards are generated separately to be away from the mean by a uniform random variable in [0, 0.5] and these are also clipped the same way as the rewards. That is, arms with positive (non-positive) mean reward are forced to have bounds in [0, 1] ([ 1, 0]). The subgaussian upper bound for OFUL is provided as 0.5, the worst-case subgaussian factor for any random variable bounded in either [0, 1] or [ 1, 0] while R-OFUL computes tighter subgaussian factors based on the mean bounds and uses γ as in Equation 5 to choose arms. We also use the norm bounds on θ to m = 0, M = 1. We average our results over 200 independent runs. We see that R-OFUL consistently achieves much lower regret than the vanilla variant. Further, we also observe that Published in Transactions on Machine Learning Research (11/2024) Algorithm 2 GLobal Under-Explore (GLUE) for MABs with Mean Bounds 1: Inputs: Upper and lower bounds for each arm k [K0] : uk, lk. 2: Pruning Phase: 3: Define lmax = maxk K0 lk and eliminate all arms i [K0] with ui < lmax. 4: Reindex the remaining arms to be in {1, 2, ..., K}. 5: Learning Phase: 6: Set UCB index scaling parameters ψk for each arm as in Equation (7). 7: for t = 1, 2, 3, ... do 8: Play arm At = arg maxk [K] Uk(t 1) and observe reward Yt. 9: Increment TAt(t) and update ˆµAt(t). 10: Update Uk(t) as in Equation (8). 11: end for empirically, R-OFUL restricts arms and only chooses between 2-3 arms per round on average. This leads to a 6 6.7 computation speed up compared to vanilla OFUL. 5 Global Under-exploration for Stochastic Multi-armed Bandits with Mean Bounds Now we move on to the stochastic Multi-armed Bandit (MAB) setting. We recall that in this case, there is no linear structure on the rewards. At each time, the agent picks one of K0 arms, and observes a reward that is randomly sampled from the distribution that is associated with this arm. The agent can also access the tuples (lk, uk) for all arms k [K0] to inform its choices. At round T, the agent aims to minimize the expected cumulative regret RT = PT t=1 E[µ Yt]. Prior work in the Zhang & Bareinboim (2017); Combes et al. (2017) have investigated this problem before, the for Bernoulli rewards, and the latter as a structured bandit problem. The B-KL-UCB algorithm of the former sets arms indices as the clipped version (using the provided bounds) of the the standard KL-UCB indices from Cappé et al. (2013). The latter proposes OSSB for general reward distributions where the agent decides arms at each round by solving a distribution-dependent, semi-infinite optimization problem. The authors also prove that OSSB achieves asymptotically optimal regret for Bernoulli and Gaussian rewards when reward distributions are known apriori. In the special case of Bernoulli rewards, OSSSB and B-KL-UCB are equivalent (see Appendix D), and are thus optimal. However, it is often not practical to assume that the reward distributions are known apriori. Further, even with this knowledge, it is unclear how computationally tractable the per-round optimization problem of OSSB is. While the Bernoulli version in B-KL-UCB is still applicable in this case, the KL-UCB indices for each arm in each round are themselves optimization problems with no known closed form solution. As the number of arms grow, computing these indices efficiently poses a challenge. The UCB algorithm of Auer et al. (2002) provides an index-based solution that is inexpensive to compute, albeit at the cost of the optimal regret performance. Specifically, it is well known that in the vanilla MAB setting, KL-UCB always outperforms UCB for bounded reward distributions, with larger gains when the mean rewards are closer to the extremities. This is mainly due to the fact that UCB chooses the worst-case exploration rate for any bounded distribution, while KL-UCB adapts its rate according to the (unknown) value of the mean rewards. Works such as Liu et al. (2018) address the computation issue by using a set of semi-distance functions to boost the UCB index. This leads to a trade-offbetween optimal performance and efficient compute. We propose Global Under-Exploration (GLUE) for Stochastic MABs with Mean Bounds in Algorithm 2. It draws inspiration from KL-UCB, adapting its exploration rate to the location of the arm means using the provided mean bounds but like UCB, provides arm indices that are inexpensive to compute. We note that unlike B-KL-UCB, GLUE is not the clipped version of vanilla UCB. In particular, GLUE is not optimistic, i.e, its arm indices are not high-probability upper bounds to the true mean of the arm. The algorithm works in two phases: Published in Transactions on Machine Learning Research (11/2024) Pruning Phase: Before the start of the online learning process, GLUE examines the provided mean bounds of each of the K0 arms are prunes out any deterministically suboptimal arms. Specifically, we compute lmax = maxk [K0] lk and discard any arm k with uk < lmax. Such a strategy is also followed by B-KL-UCB of Zhang & Bareinboim (2017). Learning Phase: Let [K] = {1, 2, ..., K} be the (re-indexed) set of pruned arms, with K K0. For each of the arm k [K], we compute c(lk) if l > 0.5 c(uk) if u < 0.5 0.25 otherwise , ψk = ( σkmax if lkmax > 0.5 σk otherwise. (7) Here, kmax = arg maxk [K] lk, and c(y) = maxs R 2 s2 fy(s) with fy(x) = ye(1 y)x + (1 y)e yx. Let Tk(n) = Pn t=1 1{At = k} and ˆµk(n) = 1 Tk(n) Pn t=1 Yt1{At = k} be the number of plays and the empirical mean reward obtained from arm k up to time n respectively. The index of arm k at time t is then set to be Uk(t) = min uk, ˆµk(t) + r 2ψ2 k log(f(t + 1)) with f(t) = 1 + t log2(t). (8) We observe two key things here. First, for each arm k, Uk(t) is set using the quantity ψk rather than the true subgaussian factor σ2 k. When lmax > 0.5, ψk for all arms is set as the subgaussian factor of the arm with the largest lower bound kmax which is strictly lower than σ2 k for k = kmax. Thus in general, ψk σ2 k. Second, arm indices are clipped at the respective upper bounds, which is sensible as these indices are a representation of our belief of true arm means. We also note that using ψk = 0.25 and setting Uk(t) without clipping at uk recovers the standard Upper Confidence Bound (UCB) algorithm in Auer et al. (2002). 5.1 Key Ideas 1. Sharper subgaussian factors: As we saw in Section 3, bounds on the mean of a random variable (here, the arm rewards), provide tighter estimates of its subgausssian factor (Corollary 3.1). This is reflected in our definition of σk in Equation 7. 2. Underexploring suboptimal arms using ψk: Consider a standard MAB instance with K arms where it is known that arm k provides rewards with a subgaussian factor of σ2 k. The standard UCB algorithm would then explore arm k at the rate dictated by the corresponding subgaussian factor. Now suppose additionally, that it was known that the (unknown) best arm produced rewards with subgaussian factor σ2 best. We construct the algorithm A which sets the index of arm k analogous to UCB, but with exploration rate dictated by min{σ2 k, σ2 best}. As a result, A potentially assumes incorrect s.g-factors for suboptimal arms. Consequently, the indices of A are no longer upper confidence bounds to the arm means, i.e., A is not optimistic. The key difference in the dynamics of UCB and A is that the latter potentially explores suboptimal arms less frequently than the former. This would lead to A incurring lower regret than standard UCB. In GLUE, min{σ2 k, σ2 best} is analogous to ψk. Specifically, when lmax > 0.5, the side information allows us to estimate an upper bound on the subgaussian factor of the best arm. We note that this upper bound is computed using the arm kmax; the identity of the best arm is still unknown. Thus, the regret minimization problem remains non-trivial in this case. 5.2 Regret Upper Bounds for GLUE In this section, we study the regret performance of GLUE. We regard improvements from pruning as trivial and analyze regret for the set of arms that remain after pruning. We assume without loss of generality that after pruning the arms are indexed in non-decreasing order of mean, i.e., µ = µ1 > µ2 ... µK, where K K0. We also define the sub-optimality gap of an arm k as k = µ µk. The following theorem bounds the regret of GLUE, and also presents its asymptotic regret scaling which is an upper bound to lim supn Rn log(n). This quantity captures the long-term (logarithmic) contribution of arms towards regret. Published in Transactions on Machine Learning Research (11/2024) (a) lmax > 0.5; Arm 3 is meta-pruned (b) Non-Informative Bounds; Arm 2 is meta-pruned (c) lmax < 0.5; Improved UCB and GLUE are equivalent (d) Non-informative bounds; Clipping only Figure 3: Empirical validation for Stochastic MABs with Mean Bounds under Clipped Uniform and Bernoulli rewards: Each row corresponds to a different specification of arm means and mean bounds shown in the left subplot. Regret performance under clipped uniform and Bernoulli rewards are shown in the latter two. The regret is averaged over 200 runs and error bars of one standard deviation are shown. In Figure 3a, for each arm, ψk = σ1. In Figure 3b, the bounds reveal no non-trivial information, however, Arm 2 is meta-pruned. In Figure 3c, we set ψk = σk since lmax < 0.5 and thus, Improved UCB and GLUE coincide. In Figure 3d, the bounds do not provide non-trivial information about subgaussian factors and are only used to clip rewards. B-UCB, UCBImproved and GLUE compute arm choices 11 faster than B-KL-UCB on average. Theorem 6. Let K1(δ) = {k [K] : µ > uk + δ, k > 1} and K2(δ) = {k [K] : µ uk + δ, k > 1}, for any δ > 0. Then, the expected cumulative regret of Algorithm 2 satisfies Rn infδ>0 Rn(δ) with 5ψ2 1 k (max{δ, µ uk})2 + X k(1 + q(n)) + 5ψ2 1 + 2ψ2 k log f(n) k Here, the function q(n) Θ log(f(n)) 2 3 . Further, the asymptotic regret of GLUE satisfies Rn log(n) X 2ψ2 k k . (9) Proof Sketch for Theorem 6. As is usual in the regret analysis of stochastic MABs, we upper bound the number of times a suboptimal arm is played, which translates to a bound on the cumulative regret. In contrast however, we can not rely on the UCB property of the indices any longer. Recall that we only need to consider the set of K arms that remain after pruning. Theorem 3 and Corollary 3.1 gives us that σk is an upper bound on the true s.g-factor for arm k. We first show that the best arm k is always ψk subgaussian. Then, we establish that the asymptotic contribution to regret of any arm in K1(δ) is a constant. We deem these arms to be meta-pruned . For K2(δ), we show that the use of pseudo-variances is sufficient to explore these arms at a logarithmic rate. The theorem then follows by combining these two results, with the full proof in Appendix C. 5.3 Comparison and Discussions Regret Upper Bounds: GLUE vs Existing Algorithms: We compare the regret upper bound of GLUE to a slew of baselines in Table 1. Here, B-UCB is the clipped version of the vanilla UCB algorithm of Auer et al. (2002). It is equivalent to using GLUE with ψk = 0.25, the worst-case subgaussian factor for each arm k. Further, d(p, q) = p log (p/q) + (1 p) log ((1 p)/(1 q)) is the Bernoulli Kullback-Liebler divergence. Published in Transactions on Machine Learning Research (11/2024) Table 1: Asymptotic Regret Comparisons: Columns list the algorithm, the (bounded) distribution of rewards, whether it displays meta-pruning, and the asymptotic regret upper bound. Algorithm Reward dist. Meta Asymptotic (bounded) pruning Regret UCB Any No P B-UCB Any Yes P KL-UCB Any No P B-KL-UCB Any Yes P OSSB Bernoulli Yes P GLUE Any Yes P The vanilla UCB and KL-UCB algorithms do not display meta-pruning as they do not use the mean bounds. The regret bound of OSSB in Combes et al. (2017) matches that of B-KL-UCB for Bernoulli rewards (see Appendix D) and is optimal. The UCB, B-UCB and GLUE algorithms require O(K0) compute per iteration in the worst-case (when no arms are pruned). KL-UCB, B-KL-UCB (and OSSB for Bernoulli rewards) require O(αopt K0) computational complexity, with αopt the cost of solving the index optimization problem for each arm. Uncertain mean bounds: Our focus in this work has been the case when the provided mean bounds hold with probability 1. The immediate follow-up would be the case where these bounds only hold for each arm k with some probability pk < 1. Practically, such settings occur when we are given historical data about plays from each arm (this is the setting studied in Shivaswamy & Joachims (2012)). The following corollary gives a horizon-dependent regret bound for GLUE in this case. Corollary 6.1. Let µk [lk, uk] hold with probability (1 pk) for each arm k [K]. Then by time n, GLUE suffers an additional regret of n P k [K] pk over Theorem 6. If we know the horizon to be T and each arm is observed O(log T) times in the provided history, using standard Chernoff-type arguments, we easily see that P k [K] pk = O(1/T). Thus, the additional regret incurred by time T is simply a constant. However, the cases when the number of samples for an arm are sub-logarithmic in the horizon and that with an unknown horizon are left open. Lack of global under-exploration in R-OFUL: In the stochastic MAB with bounds, we used GLUE to explore all arms at rate ψ2 k the subgaussian factor of the best arm (when lmax > 0.5). This was possible because the estimate of the best arm remained a ψ2 k-subgaussian random variable, and was thus explored at the correct rate. In the linear case, the estimates for each arm uses all samples collected so far in ˆθt, the ridge regression estimate of θ . The estimate of the best arm is thus no longer ψ2 k-subgaussian and under-exploring arms in the set Ar(t) using ψ(lmax, b), for example, would lead to poor estimates even for the best arm and thus larger regret. Empirical Comparisons: We compare GLUE with B-UCB, Improved UCB (GLUE with σk instead of ψk in Uk(t)) and B-KL-UCB for clipped uniform distributions (drawn from a uniform distribution in an interval around the mean that is fully contained in [0, 1]) and Bernoulli rewards. We chose not to compare with OSSB since it is unclear how the associated optimization problem can be solved for this distributions. We drop UCB and KL-UCB since the clipped variants in B-UCB and B-KL-UCB, respectively, are never worse then the vanilla counterprats. All plots are averaged over 200 independent runs. When lmax is close to 1, we see that GLUE outperforms the optimistic UCB-based baselines (B-UCB and Improved UCB) and is comparable to B KL UCB. We note that this is the case where KL variants enjoy maximum improvements over UCB variants, but GLUE competes with the optimal in this case. In the case when all upper bounds are lesser than 0.5, GLUE matches the performance of Improved UCB, thus improving over B-UCB. However, since there is no information about the best arm in this case, B-KL-UCB outperforms Published in Transactions on Machine Learning Research (11/2024) GLUE. We also note that each of B-UCB, Improved UCB and GLUE compute each iteration 11 faster than B-KL-UCB on average. 6 A Use Case: Learning from Confounded Logs Up to this point, we have assumed that the learning agent has access to a tuple of mean bounds for each arm before the start of the online learning process. In this section, we describe a practically motivated scenario in which such mean bounds arise naturally. Specifically, we consider the task of extracting non-trivial bounds on means from partially confounded data from an optimal oracle. We show that under mild assumptions, this problem leads to the stochastic MAB problem (and a Linear Bandit) with mean bounds. 6.1 Confounded Logs for Stochastic Multi-armed Bandits The Oracle Environment: We consider a contextual environment, where nature samples a context vector (z, u) C = Z U from an unknown but fixed distribution P. We assume that sets Z and U are both discrete. At each time any of the K0 actions from the set A = {1, 2, ..., K0} can be taken. The reward of each arm k A for a context (z, u) C has mean µk,z,u and support [0, 1] (with appropriate shifting and scaling, this can be generalized for any finite interval [a, b]). For each context (z, u) C, let there be a unique best arm k z,u and let µ z,u be the mean of this arm. Confounded Logs: The oracle observes the complete context (zt, ut) C at each time t and also knows the optimal arm k zt,ut for this context. She picks this arm and observes an independently sampled reward yt with mean value µk zt,ut,zt,ut = µ zt,ut. She logs the information in a data set while omitting the partial context ut. In particular, she creates the data set D = {(zt, kt, yt) : t N}. The Agent Environment: A new agent is provided with the oracle s log. In this paper, we consider the infinite data setting. The agent makes sequential decisions about the choice of arms having observed the context zt Z at each time t = {1, 2, ...}, while the part of the context u U is hidden from this agent. Let at be the arm that is chosen, zt be the context, and Yt be the reward at time t. Define the average reward of arm k A under the observed context z Z as µk,z = P u U µk,z,u P(u|z). The optimal reward of the agent under context z Z is defined as µ z = maxk A µk,z. The agent aims to minimize its cumulative regret for each context separately. The cumulative reward for each z Z at time T is defined as: Rz(T) = E [P t I(zt = z)(µ z Yt)] 6.1.1 Transferring Knowledge through Bounds The agent is interested in the quantities µk,z, the mean reward of arm k under the partial context z, for all arms k A and contexts z Z, to minimize its cumulative regret. As the oracle only plays an optimal arm after seeing the hidden context u, the log provided by the oracle is biased and thus µk,z can not be recovered from the log in general. However, it is possible to extract non-trivial upper and lower bounds on the average µk,z. In a binary reward and action setting, similar observations have been made in Zhang & Bareinboim (2017). Alternative approaches to this problem, that include assuming bounds on the inverse probability weighting among others, are discussed in Section 1.1. Our assumption below is different, and in a setting where there are more than two arms. We specify that the logs have been collected using a policy that plays an optimal arm for each (z, u) C, but do not explicitly impose conditions on the distributions. Instead, in Assumption 1, we impose a separation condition on the means of the arms conditioned on the full context. We define: Definition 1. Let us define δz, δz for each z Z as follows: µ z,u max k A,k =k z,u µk,z,u , δz max u U µ z,u min k A,k =k z,u µk,z,u Thus, for each observed context z, these quantities specify the sub-optimality gaps between the best and second best arms, as well as the best and the worst arms, respectively and which hold uniformly over the hidden contexts u U. Published in Transactions on Machine Learning Research (11/2024) Assumption 1 (Separation Assumption). The vectors {δz, δz : z Z} are provided as a part of the log. Additionally, for each z Z, we have that δz > 0. Remarks on Model and Assumption 1: Relation to Gap in Agent Space: We note that these gaps in the latent space do not allow us to infer the gap in the agent space. However, due to optimal play by oracle, these gaps help in obtaining mean bounds for all arms, even those which have not been recorded in the log (Theorem 7). Interpretation: The gap δz gives us information about the hardest arm to differentiate in the latent space when the full context was observed (δz analogously is for the easiest arm to discredit). We also note that our approach can still be applied when the trivial bound δz = 1, or universal bounds not depending on context δ δz, δ δz are used, leading to bounds that are not as tight. Motivating Healthcare Example: Often, hospital medical records contain information zt about the patient, the treatment given kt, and the corresponding reward yt that is obtained (in terms of patient s health outcome). However, a good doctor looks at some other information ut (that is not recorded) during consultation and prescribes the best action kt under the full context (zt, ut). If one is now tasked with developing a machine learning algorithm (an agent) to automate prescriptions given the medical record z, this agent algorithm needs to find the best treatment k(z) on average over P(u|z). Furthermore, the gaps on treatments effects can potentially be inferred from other data sets like placebo-controlled trials. Existing alternate assumptions: Studies in Yadlowsky et al. (2022); Zhao et al. (2019) impose conditions on bounded sensitivity of the effects with respect to the hidden/unrecorded context (effectively, that this context does not significantly alter the effect). In our work, we explore the other alternative where the treatment has been chosen optimally with respect to the hidden/unrecorded context, and allows strong dependence on this hidden context. Quantities computed from log data: The following quantities can now be computed by the agent from the observed log for each arm k [K0] and each visible context z Z: 1. pz(k) : The probability of picking arm k under each context z is denoted as pz(k). Mathematically, pz(k) = P u U:k=k z,u P(u|z). 2. µz : The average reward observed under each context z is defined as µz. It can be computed by averaging observed rewards for all the entries with context z. This can be written as µz = P u U µ z,u P(u|z). 3. µz(k): The contribution from arm k to the average reward µz. Thus we have that µz(k) = P u U:k=k z,u µ z.u P(u|z). 4. K>(k, z): Finally, we identify the set of arms with "large rewards" K>(k, z) = {k [K0] : k = k, µz(k ) > δzpz(k )} for each k [K] and z Z. To see that these quantities can indeed be inferred, assume that the logs are given as an infinite table with columns Z, A, Y . Now, we collapse rows that share Z = z, A = k into a single row with Y now being the mean reward of all such rows in the original table. With this new finite table, to compute pz(k), we fix all rows with Z = z and compute the fraction of these that have A = k. µz(k) is the average reward in all rows that have Z = z, A = k, and µz = P k A µz(k). Finally, the set K>(k, z) can be computed from µz(k), pz(k) and the upper bound on the latent suboptimality gap δz. Bounds in terms of computed quantities: Using the quantities defined above, the following theorem describes how one can compute lower and upper bounds on the arm rewards in the agent space. Theorem 7. The following statements are true for each k [K0] and z Z: 1. Upper Bound: µk,z uk,z := µz δz(1 pz(k)). Published in Transactions on Machine Learning Research (11/2024) 2. Lower Bound: µk,z lk,z := µz(k) + P µz(k ) δzpz(k ) . Note that these bounds can be provided for all arms k [K0] and contexts z Z. Specifically, bounds can also be extracted for the arms that are never played in the log. This comes as a result of the gaps defined in Definition 1. Proving this results requires a careful use of the total probability theorem which can be found in Appendix E Remarks: 1. Finite Log: While we study the case of infnite logs, our approach can be readily extended to finite logs by replacing the quantities µz, µz(k), pz(k) with empirical estimates and using standard Chernoffbounds to augment the bounds in Theorem 7 to hold with a probability which is a function of the size of the log. Then, the performance of the learning algorithm is as in Corollary 6.1 and the discussion that follows. 2. Distribution shifts: Our results can also be used in the case where under a fixed visible context, the conditional distribution of the hidden contexts changes from oracle to the learner. In particular, for some z Z, suppose that ψz maxu U |Po(u|z) Pl(u|z)|, where Po and Pl are used to differentiate the oracle and learner environments respectively. Using ψz, we can readily modify our results in Theorem 7 to produce upper and lower bounds for arm rewards under the context z when the distributions are no longer the same. Now, we show the existence of instances for which our bounds are tight. We say an instance is admissible if and only if it satisfies all the statistics generated by the log data. The following proposition shows all the bounds defined in Theorem 7 are partially tight. Proposition 8 (Tightness of Transfer). For any log with uk,z, lk,z as defined in Theorem 7 the following statements hold: 1. There exists an admissible instance where upper bounds µk,z = uk,z, for all k [K0] and z Z. 2. For each k [K0], there exists a (separate) admissible instance such that µk,z = lk,z for each z Z. The learning procedure is then carried out by using the bounds of Theorem 7 to instantiate GLUE (Algorithm 2, with regret as in Theorem 6) for each partial context z Z with uk = uk,z and lk = lk,z. 6.2 Confounded Logs for Linear Bandits The Oracle Environment: Consider a linear bandit environment with K arms ak = (ak,z, ak,u) for i [K] and (random) latent vector θ = (θ z, Tu) such that θ z is fixed. Suppose a, θ Rp, and for all k [K], au,k, Tu Rd Further, we assume that Tu is such that each entry of the vector is always between [m, M], and ak = 1 for all k [K] (the equality can be generalized to bounds on the norm). At each round t, Tu(t) is drawn independently from some distribution C and the full vector θ (t) = (θ z, Tu(t)) is revealed to the oracle, based on which the arm At = arg maxk [K] ak, θ (t) is played. The oracle then observes the reward Yt = At, θ (t) + ηt where ηt is the conditionally σ(At) subgaussian bounded random variable with respect to the filtration generated by the reward observations up to time t 1 and arm choices up to time t. Confounded Logs: We define ˆk(T) to be the index of the best arm when the random realization of Tu was T, i.e., ˆk(T) = arg maxk [K],Tu=T ak, θ . The oracle records the tuple (ˆk(Tu(t)), θ z, Yt) at the end of each round and omits the explicit information about Tu(t). These tuples are collected over an infinite horizon to form the dataset D = {(ˆk(Tu(t)), θ z, Yt) : t N}. The Agent Environment: The agent is provided with the infinite log generated by the oracle and needs to interact with the same environment with no other information about θ being revealed at each time. Specifically, at each round t, the agent chooses an arm At {ak : k [K]} and observes Yt = At, θ + ηt with Tu(t) being drawn independently according to C and ηt the conditionally σ(At)-subgaussian noise as before. Since the latent vector θ is hidden and is random at each round, the agent seeks to minimize the average regret given by t=1 max k [K] E[ ak At, θ ] Published in Transactions on Machine Learning Research (11/2024) t=1 max k [K] ak,z At,z, θ z + E[ ak,u At,u, Tu ] t=1 max k [K] ak,z At,z, θ z + ak,u At,u, θ u In the above, θ u us the average value of the vector Tu under the distribution C and At,z, At,u are the parts of the arm At corresponding to parameters θ z, θ u respectively. Note that the first term in the sum is known fully to the agent as θ z is recorded in the oracle logs and thus, the learning problem of the agent now reduces to a Linear Bandit over d dimensions (the dimension of θ u) using the pseudo-rewards Y t = Yt At, θ z for the chosen arm At. The genie policy in this case is to always play the arm k = arg maxk [K] ak,z, θ z + ak,u, θ u . We note that this definition averages out the randomness in Tu and in general can not be inferred by knowing ˆk(Tu) from the logs. Mapping these back to our running healthcare example, the seen context θ z are the known effects of each intervention in {ak : k [K]} on the patient that have been established using medical trials, while Tu represents the responses patient to the intervention based on biomarkers that have not been explored in the trials. Thus, θ u is the average of these unknown effects over the population of patients. Further, the following quantities can be inferred from the logs: 1. pk: The probability that arm k was best in the oracle environment defined as pk = PC ˆk(Tu(t)) = k . 2. νk: The average reward obtained when arm k was optimal. This can be written as νk = E h Yt|ˆk(Tu(t)) = k i . Using these quantities, the following result suggests the mean bounds that the agent can infer from the logs: Theorem 9. For all arms ak : k [K], let pk, νk be as defined above and 1d be the vector of all 1 s in d dimensions. Then, we have that E[ ak, θ ] νkpk (1 pk) ak,z, θ z [md 1d, ak,u , Md 1d, ak,u ] . Using these bounds, the agent can then employ R-OFUL (Algorithm 1) in order to minimize its regret. Contrary to the stochastic setting, due to the linearity of rewards, we need not make any assumptions on the suboptimality gaps in the oracle environment to provide these mean bounds. 6.3 Empirical Validation Confounded Logs for Stochastic MABs: We present a recommendation systems example where an agent is tasked with learning the best movie to recommend to users in an online manner: the agent observes the user s occupation at each round and solves an independent bandit instance for each occupation. To assist in its learning, logs collected from an oracle are provided to this agent. The oracle has access to more information about users and observes occupation, age and gender in order to recommend movies, but only records the occupation for each user. We assume that environments do not change between the oracle and the agent, i.e., users are drawn randomly from the same population in both cases. For this set of experiments, we use the Movielens 1M dataset of Harper & Konstan (2015). This data set consists of 6040 users, from whom over 1 million ratings of 3952 movies are collected. Each user is associated with a gender, age, occupation and zip-code. In this work, we ignore the zip-code. The ratings (or rewards) lie in the interval [1, 5]. These are normalized to [0, 1] through normal shifting and scaling. The reward matrix of size 6040 3952 is then completed using the Soft Impute algorithm from Mazumder et al. (2010). Post matrix completion, we delete all users who s total reward across all movies is below 1500. We also further cluster the age attribute to have 4 classes: 0 17, 18 25, 25 49, 50+. We also combine the occupation attributes appropriately to have 8 classes namely: Student, Academic, Scientific, Office, Arts, Law, Retired and Others. Published in Transactions on Machine Learning Research (11/2024) After the above, the Occupation attribute is treated as the visible one, age and gender are hidden. We form meta-users by averaging all the users according to the tuple of attributes (gender,age). The reward matrix is also modified to now have each row represent a particular (gender, age) realization (for example, ( M , 0-17)). These reward matrices are separately computed for each of the 8 occupations. We then sample a random collection of 15 movies for each occupation to be considered as arms. Theorem 7 is then employed in order to form the confounded logs for each occupation class. The realized upper and lower bounds after this process are shown in Figure 6 in the appendix. In Figure 4, we provide the online learning behavior for two of these instances. (a) Visible Context: LAW (b) Visible Context: ACADEMIC Figure 4: Online Learning Behavior of two instances from the experiments using the Movielens 1M dataset. The visible context is displayed in the captions. (a) Setting 1 (b) Setting 2 Figure 5: Online Learning Behavior of two instances from the synthetic Linear Bandit setup. In each of the figures, the left figure summarizes the inferred bounds on θ u, ak for each k [12]. The right figure displays the online experiment. Results are averaged over 200 independent runs and one standard deviation error bars are displayed. In the first setting, the bounds do not provide any improved subgaussian estimates, however, the restriction helps improve regret. In the second setting, the bounds provide improved exploration rates. Confounded Logs for Linear Bandits: For this set of experiments, we use a synthetic setup: We first fix θ z, θ u = (1, 2, 3, 4, 5)T R5. We normalize θ z to have norm 1, and θ u to have norm 0.9. For the arms, for each of the 12 arms, we draw ak,z to from a Folded Normal Distribution (|X| is folded normal if X is normally distributed) in R5. We set a1,u to be in a ball of radius 0.1 around θ u at random and for all other k, ak,u is set to be a normalized 2-sparse vector in R5. With these, we generate logs by uniformly sampling Tu(t) in a ball of radius 0.1 around θ u independently for each t, then picking the best arm for this Tu(t). Then we sample a noisy latent reward uniformly at random from a symmetric interval around Tu(t), aˆk(Tu(t)),u , ensuring that it is in [0, 1]. The reward at each time is given by the sum of this noisy latent reward and θ z, aˆk(Tu(t)),u . We collect logs of size 105 to approximate the infinite logs. We invoke Theorem 9 in order to infer upper and lower bounds on each of the 12 arms. Then, we spawn variants of the R-OFUL (Algorithm 1 and vanilla OFUL from Abbasi-Yadkori et al. (2011) and run an online linear bandit algorithm on our setting for 5000 iterations. The results are averaged over 200 independent runs. In Figure 5, we present out results. We see that when the bounds do no help in restricting the arms under consideration, R-OFUL matches vanilla OFUL (improvements here are from clipping the optimisitc indices for each arm). When non-trivial information can be gathered from the mean bounds, R-OFUL significatntly outperforms the vanilla variant. Published in Transactions on Machine Learning Research (11/2024) 7 Conclusion In this work, we treated the problem of bandit learning with mean bounds in the linear and stochastic settings. Beginning with the study of how mean bounds lead to inference of sharp subgaussian factors when rewards are bounded, we present the R-OFUL algorithm for the linear setting and the GLUE algorithm for stochastic bandits that use these sharper factors to reduce exploration. In the linear case, by restricting the set of arms being considered at each round, we show that R-OFUL enjoys not only improved regret, but also increased compute efficiency in comparison to vanilla OFUL. In the stochastic setting, GLUE can offer comparable performance to the optimal B-KL-UCB algorithm when rich side information is available, and is never worse than the vanilla UCB policy. Further, we studied the practical use case of learning from (infinite) confounded logs where such mean bounds on rewards of actions can be inferred under mild assumptions on the latent environment. Several avenues of future work were discussed over the course of exposition, chief among which is the question of lower bounds for the linear setting. While we present a compute-efficient baseline for the case of linear bandits with mean bounds, drawing from intuition in standard linear bandits, the lower bound is likely achieved by a complex, non-optimistic algorithm that further leverages the side information. Studying the optimality would also help characterize the environments in which the mean bounds improve regret performance. Further, the case of learning from finitely many confounded observations, which manifests as a problem of learning from probabilistic mean bounds remains open. In Corollary 6.1, we provide a natural a regret upper bound to the performance of GLUE in this case (which can also be extended to R-OFUL for linear bandits) which serves as a baseline. Finally, our assumption of bounded rewards led to tighter subgaussian factors and regret improvements. Identifying other use cases and modes of side information with arbitrary, potentially unbounded, reward distributions from which such improved factors can be computed also serves an an interesting direction. Acknowledgements We sincerely thank all the reviewers and editors for the feedback that helped improve the quality of the paper. This research was partially supported by NSF Grants 1826320, 2019844, 2107037 and 2112471, ARO grant W911NF-17-1-0359, US DOD grant H98230-18-D-0007, ONR Grant N00014-19-1-2566 and the US Do T supported D-STOP Tier 1 University Transportation Center. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Kareem Amin, Michael Kearns, and Umar Syed. Graphical models for bandit problems. ar Xiv preprint ar Xiv:1202.3782, 2012. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 322 331. IEEE, 1995. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002. Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems, pp. 1342 1350, 2015. Babette A Brumback, Miguel A Hernán, Sebastien JPA Haneuse, and James M Robins. Sensitivity analyses for unmeasured confounding assuming a marginal structural model for repeated measures. Statistics in medicine, 23(5):749 767, 2004. Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(May):1655 1695, 2011. Published in Transactions on Machine Learning Research (11/2024) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Conference on Learning Theory, pp. 122 134, 2013. Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM international conference on Measurement and modeling of computer systems, pp. 289 300, 2014. Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516 1541, 2013. Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pp. 1763 1771, 2017. F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (Tii S), 5(4):1 19, 2015. Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 681 690, 2008. Finnian Lattimore, Tor Lattimore, and Mark D Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pp. 1181 1189, 2016. Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Fang Liu, Sinong Wang, Swapna Buccapatnam, and Ness Shroff. Ucboost: a boosting approach to tame complexity and optimality for stochastic bandits. ar Xiv preprint ar Xiv:1804.05929, 2018. Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pp. 684 692, 2011. Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research, 11(Aug):2287 2322, 2010. William J Powers, Alejandro A Rabinstein, Teri Ackerson, Opeolu M Adeoye, Nicholas C Bambakidis, Kyra Becker, José Biller, Michael Brown, Bart M Demaerschalk, Brian Hoh, et al. Guidelines for the early management of patients with acute ischemic stroke: 2019 update to the 2018 guidelines for the early management of acute ischemic stroke: A guideline for healthcare professionals from the american heart association/american stroke association. Stroke, 50(12):e344 e418, 2019. Amy Richardson, Michael G Hudgens, Peter B Gilbert, and Jason P Fine. Nonparametric bounds and sensitivity analysis of treatment effects. Statistical science: a review journal of the Institute of Mathematical Statistics, 29(4):596, 2014. James M Robins, Andrea Rotnitzky, and Daniel O Scharfstein. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical models in epidemiology, the environment, and clinical trials, pp. 1 94. Springer, 2000. Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An nmf approach. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 518 527, 2017. Pannagadatta Shivaswamy and Thorsten Joachims. Multi-armed bandit problems with history. In Artificial Intelligence and Statistics, pp. 1046 1054, 2012. Published in Transactions on Machine Learning Research (11/2024) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015 1022, 2010. Guy Tennenholtz, Uri Shalit, Shie Mannor, and Yonathan Efroni. Bandits with partially observable offline data. ar Xiv preprint ar Xiv:2006.06731, 2020. Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pp. 46 54, 2014. Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the conditional and average treatment effect with unobserved confounding factors. The Annals of Statistics, 50(5):2587 2615, 2022. Li Ye, Yishi Lin, Hong Xie, and John Lui. Combining offline causal inference and online bandit learning for data driven decisions. ar Xiv preprint ar Xiv:2001.05699, 2020. Chicheng Zhang, Alekh Agarwal, Hal Daumé Iii, John Langford, and Sahand Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. In International Conference on Machine Learning, pp. 7335 7344, 2019. Junzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandit: A causal approach. In Proceedings of the 16th Conference on Autonomous Agents and Multi Agent Systems, pp. 1778 1780, 2017. Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4):735 761, 2019. Published in Transactions on Machine Learning Research (11/2024) A Extreme bounds imply tighter subgaussian bounds Proof of Lemma 1. Since e mx and e(1 m)x are both convex, so is f(x) (since m, 1 m 0). Thus, f is minimized when f (x) = m(1 m) e(1 m)x e mx = 0 e(1 m)x = e mx (1 m)x = mx Therefore, the minimum value is f(0) = 1. Now, we have f1 m(x) = (1 m)e(1 (1 m))x + (1 (1 m))e (1 m)x = (1 m)e m( x) + me(1 m)( x) = fm( x). Finally, the limit at x = 0 can be computed by using L Hopital s rule. We now prove each of the facts in Lemma 2 separately. Together, these two facts and Lemma 1, we have that there exist a σ < 0.5 for all m = 0.5. A.1 Non-existence of Positive Maximizer for Large Means We begin with the case when m > 0.5 and prove some useful Lemmas that will help establish our result. Lemma 10. Let A = mx + m(1 m)x2 2 , B = x + log(m). Then, A > B for all m (0.5, 1) and x R Proof. We have that mx + m(1 m)x2 2 (x + log(m)) > 0 m(1 m)x2 + 2(m 1)x 2 log(m) > 0 Call m(1 m)x2 + 2(m 1)x 2 log(m) = h(x), differentiating and setting it to 0 gives 2m(1 m)x = 2(1 m) = = x = 1 m. Thus, minx h(x) = h 1 m 2 log(m). Since log(1 + y) < y for all y > 1, with m > 0.5, we can write m 2 log(m) > m 1 m 2(m 1) = (m 1)(1 2m) Lemma 11. With A and B as defined in Lemma 10, the function e A e B is non-decreasing for m (0.5, 1) and x 1 Proof. We have that e A e Bnon-decreasing d(e A e B) (m + m(1 m)x) e A e B 0. From Lemma 10, (m + m(1 m)x) e A e B > e B (m + m(1 m)x 1) e A e Bnon-decreasing = e B (m + m(1 m)x 1) 0 m(1 m)x (1 m) 0 x 1 Published in Transactions on Machine Learning Research (11/2024) Lemma 12. With A and B as defined in Lemma 10, the function e A e B is increasing for m (0.5, 1) and x 0, 1 Proof. We begin with e A e Bincreasing d(e A e B) (m + m(1 m)x) e A e B > 0 e B (m + m(1 m)x) e A B 1 > 0 (m + m(1 m)x) e A B > 1 e B A < (m + m(1 m)x) Note that at x = 0, both sides of this equation compute to m. Thus, the inequality holds if d(e B A) d(m+m(1 m)x) dx for x > 0. We have that d(e B A) dx = e B A d(B A) dx = (1 m)(1 mx)e B A and d(m+m(1 m)x) dx = m(1 m). Thus, we require that (1 mx)e B A < m. Let k(x) = (1 mx)e B A. Then, we have that k (x) = (1 mx)d e B A dx me B A = e B A (1 m)(1 mx)2 m . Note that (1 mx)2 decreases from 1 to 0 in (0, 1 m). Therefore, (1 mx)2 < maxx (0, 1 m )(1 mx)2 = 1 < m 1 m for m > 0.5. Therefore, (1 m)(1 mx)2 m < 0 in (0, 1 m). Since e B A > 0, k(x) is also decreasing in this range. Thus, we have that for m (0, 1 (1 mx)e B A < max x (0, 1 m )(1 mx)e B A = 1 m. Lemma 13. For all m (0.5, 1) and x > 0, we have that log (fm(x)) < m(1 m)x2 Proof. With A, B as defined in Lemma 10, we require that log (fm(x)) < m(1 m)x2 mex + 1 m < exp mx + m(1 m)x2 1 m < e A mex = e A ex+log(m) = e A e B. From Lemmas 11 and 12, we have that e A e B is non-decreasing in (0, ). Thus, e A e B > minx (0, ) e A e B = 1 m, giving us the result. Lemma 14 (Part 1 of Lemma 2). When m (0.5, 1), the function 2 x2 log (fm(x)) is not maximized at any x > 0. Proof. Using 13, we have for any m (0.5, 1), x > 0, 2 x2 log (fm(x)) < 2 x2 m(1 m)x2 2 = m(1 m). However, Lemma 1 shows that limx 0 2 x2 log (fm(x)) = m(1 m). Therefore, the function is not maximized when x > 0. Published in Transactions on Machine Learning Research (11/2024) A.2 Existence of Positive Maximizer for Small Means We now move on to the case when m < 0.5. As before, we will establish the result using some useful lemmas. Lemma 15. Let A = mx + m(1 m)x2 2 and B = x + log(m). Then, for m (0, 0.5), we have that the function e A e B is a) decreasing for x (0, x1) where x1 = 1 b) increasing for x > 2 Proof. Part a: We require that in the given range, (m + m(1 m)x) < e B A. Both sides of the above inequality compute to m at x = 0. Therefore, the inequality holds for x (0, x1) if d(m + m(1 m)x) dx = m(1 m) < (1 m)(1 mx)e B A = d(e B A) dx That is, m < (1 mx)e B A. Let k(x) = (1 mx)e B A. Then, we have k (x) = e B A (1 m)(1 mx)2 m > 0 (1 m)(1 mx)2 m > 0. Let p(x) = (1 m)(1 mx)2 m. Then, we have that p(x) = 0 at x = 1 m(1 m). Therefore, p(x) > 0 for x < x1. Thus, k(x) is increasing in (0, x1) and hence k(x) > k(0) = m. This implies that (m + m(1 m)x) < e B A in (0, x1) which leads to the result. Part b. We first note that at x = 2 m +log(m) since m < 1. Further, d A dx = m+m(m 1)x > 1 = d B dx for all x > 1 m. Thus, A > B for x > 2 Thus, we get that dx e B = (m + m(1 m)x) e A e B > e B (m + m(1 m)x 1) > 0 for x > 1 And thus, e A e B is increasing in this range. Lemma 16 (Part 2 in lemma 2). For all m (0, 0.5), x > 0, we have that with x1 as defined in Lemma 15 a) For all x (0, x1), 2 x2 log (fm(x)) > m(1 m), b) For all x > 2 m, 2 x2 log (fm(x)) < m(1 m). Proof. Part a. For this, we require that in (0, x1), log (fm(x)) > m(1 m)x2 e mx (mex + 1 m) > exp m(1 m)x2 1 m > e A e B Published in Transactions on Machine Learning Research (11/2024) with A, B as in Lemma 15. Using part a of this Lemma, the inequality holds since e A e B is decreasing in (0, x1) and is maximized at x = 0 with value 1 m. Part b. To see this, we require that log (fm(x)) < m(1 m)x2 2 1 m < e A e B This holds since e A e B is increasing when x > 2 m (Lemma 15 part b) and is hence minimized at x = 2 m with value (1 m)e2/m > (1 m) since m < 1. A.3 Putting it all together The following lemma will help us prove our final result. Lemma 17. Define C = mx + x2 8 and D = x + log(m). Then, e C e D is increasing in (0, ) for any m (0, 0.5). Proof. This hold if and only if d C dx e D > 0 for all x > 0. That is, e C e D > 0 m + x 4 > exp(D C). For x = 0, both sides of this inequality compute to m. Thus, the inequality holds if dx exp(D C) 1 Since exp(D C) > 0 for all x, the inequality is trivially true for x > (1 m). To see that this is true for x (0, 4(1 m)), define k(x) = 1 m x 4 exp(D C). Since k (x) = 1 m x 4 exp(D C), we have that k (x) = 0 1 m x = 0 x = 4(1 m) 2. Therefore, in (0, 4(1 m)), k(x) = 0 x = 4(1 m) 2 = 2(1 2m). Further, since 1 m x 4 > 0 for x < 4(1 m), x = 2(1 2m) must be a maximizer of k(x). We are now left to prove that k(2(1 2m)) < 1 4 for any m < 1 2. For this we have k(2(1 2m)) = 1 m 1 2m exp 2(1 2m) + log(m) 2m(1 2m) (1 2m)2 2 exp log(m) + (1 2m) 2 2m 1 2m 2 exp (1 2m)(3 2m) Let q(x) = x 2 exp (1 2m)(3 2m) 2 . We have that q (x) = exp (1 2m)(3 2m) = exp (1 2m)(3 2m) Published in Transactions on Machine Learning Research (11/2024) 2 exp (1 2m)(3 2m) Thus, q(x) is increasing for all x. Specifically, we have that for any m (0, 1 2), k(2(1 2m)) = q(m) < q( 1 Thus, k(2(1 2m)) < 1 4 for any m (0, 1 2) and x (0, 4(1 m)). Therefore, 1 4 exp(D C) and thus, e C e D is increasing for all x > 0. We are now ready to present our final result. Proof of Theorem 3. Subgaussianity Let gm(x) = 2 x2 log (fm(x)). For m < 0.5, part b of Lemma 1 and Theorem 14 imply that the maximum of gm(x) is not achieved at any negative x. Further, Theorem 16 implies that there exists x > 0 : gm(x) > m(1 m). Additionally, since gm(x) < m(1 m) for x > 2 m, we have that the function of interest is maximized at some 0 < x < 2 Analogously, for m > 0.5, this function is maximized at some 2 Thus, for any random variable X [0, 1] with mean m (0, 1), m = 0.5, we have E[exp (s(X m))] 2 . Thus, X is σ-subgaussian. Bounds on σ From Part 2 of Lemma 2 (Lemma 16), we have that σ2 lim x 0 2 x2 log(fm(x)) = m(1 m). Thus, we are left with showing that σ < 1 2 when m < 0.5. Exploiting part b of Lemma 1 implies the result for m < 0.5. We require that maxx R 2 x2 log (fm(x)) < 1 4. However, since we have already shown that this maximum is achieved at a positive x, it is sufficient to show maxx>0 2 x2 log (fm(x)) < 1 4. That is, for all x > 0, 2 x2 log (fm(x)) < 1 log (fm(x)) < x2 mex + 1 m < exp mx + x2 1 m < exp mx + x2 exp(x + log(m)) From Lemma 17, we have that the RHS is increasing in (0, ) and is thus minimized when x = 0 with minimum value 1 4. This gives us the result. B Linear Bandits with Mean Bounds Now we concentrate on the setting of Section 4 and provide proofs for the result of Theorem 4. First, we will show that the tight bounds we propose in Equation 2 are valid and tight. Then, we prove that the best instantaneous arm at each time is always contained in the restricted set at each time. Next, we show that the quantity σt is a valid upper bound on the subgaussian factor for all arms in the restricted set. This culminates in our final regret result. We begin with the bounds. Published in Transactions on Machine Learning Research (11/2024) Claim 1. The solutions to the optimization problems in Equation 2 is tight, that is, θ Cb such that for any a A, a, θ < la or a, θ > ua Proof. First we note that the constraint set Cb is non-empty since θ Cb and thus the quantities la, ua are well-defined for each a A. Suppose θ Cb with a, θ < la for some action a. This implies that la = minθ Cb a, θ which is a contradiction. A similar argument for the upper bound completes the proof. Note that at this point, Cb is redefined to be the set {θ : θ [m, M], a A, a, θ [la, ua]}. We now show that the best arm is always in contention at any time. Lemma 18. Let a t = arg maxa At a, θ be the instantaneous best arm. Then, a t Ar(t). Proof. First, we show that a t Ap(t). Suppose that ua t < lmax(t) = a, θ ua t < lmax(t) ˆat, θ , that is, a t , θ < ˆat, θ . This is a contradiction to the definition of a t and thus, a t Ap(t). To show that a t Aprune(t), we first observe that lmax(t) ˆat, θ = ˆa θ cos(ang(ˆat, θ )) M cos(ang(ˆat, θ )) θ M, a = 1 = ang(ˆat, θ ) cos 1 lmax(t) Further, since lmax(t) a t , θ , we get that ang(a t , θ ) αt. Combining these two, we get that ang(ˆat, a t ) 2αt and thus, a t Ar(t). The following lemma proves that the quantity σt is a valid upper bound to the s.g-factor of any arm in the restricted set. Lemma 19. For any arm a Ar(t), σ(a) σt. Proof. From Corollary 3.1, we have that σ(a) ψ(la, ua) for any arm a Ar(t). Therefore, it follows that σ(a) maxa Ar(t) ψ(l a, u a). We are only left to prove that σ(a) mψ(3αt). For this, we note that the arm in Ar(t) with lowest mean reward is at most 3αt away from θ . This happens when θ is at an angle αt away from ˆat and there exists some arm abad at an angle αt on the opposite side of ˆat. For this fictitious arm abad, we have abad, θ = abad θ cos(3αt) m cos(3αt). Now, there are two cases: 1. If m cos(3αt) < 0.2178a+0.7822b: Then, ψ(m cos(3αt), b) = (b a)2 4 and a Ar(t), σ(a) ψ(m cos(3αt), b) follows by definition. 2. If m cos(3αt) 0.2178a + 0.7822b: Then any arm a Ar(t), a, θ abad, θ m cos(3αt). Using this as a lower bound for the arm a, we can write σ(a) ψ(m cos(3αt), ua) = ψ(m cos(3αt), b). This completes the proof. We now prove our final regret result. Proof of Theorem 4. Let S t = Pt n=1 ηnan σn , where the random variable ηt is conditionally σat-subgaussian. From Lemma 19, σat σt and thus, ηt is also conditionally σt-subgaussian. We claim that Mt(x) = exp x, S t x 2 Vt 2 with M0(x) := 1 is a supermartingale with respect to the filtration Ft 1 = σ(a1, Y1, ..., at 1, Yt 1, at). To see this, observe that Mt(x) = exp σn + x ana T n 2 Published in Transactions on Machine Learning Research (11/2024) = E [Mt 1|Ft 1] = Mt 1(x) E σt + x ata T t 2 Mt 1(x) ηt is conditionally σt-s.g. Now let H = λId and h N(0, H 1). With some linear algebra, we can write Rd Mt(x)dh(x) = det V t exp S t V 1 t 2 Since Mt(x) is a supermartingale, so is M t. Therefore, using the Maximal inequality for non-negative supermartingales, we get that δ P t : S t 2 V 1 t 2 log 1 + log det V t + log det V t The last step uses the fact that Pt n=1 ηnan σt 1 maxn t σn Pt n=1 ηnan = 1 γ Pt n=1 ηnan. Now, with Vt = Pt n=1 ana T n, we have V 1 t Vt Id θ λθ T I V 1 t Vt θ Combining Equations 10 and 11, we get that with probability at least 1 δ, ˆθt θ V 1 t + log det V t βt(δ) det V t λd trace V t Now, with Et = n θ : ˆθt θ V 1 t βt(δ) o observe that Ct = Cb Et. Thus, P( t : θ Ct) P( t : θ Et) δ. The result of the theorem follows by using the standard arguments of Theorem 3 in Abbasi-Yadkori et al. (2011). C GLUE for stochastic Multi-Armed Bandits with Mean Bounds We begin with a few useful lemmas. Lemma 20. Any arm k is σk subgaussian. The best arm is always ψk-subgaussian. Published in Transactions on Machine Learning Research (11/2024) Proof. Both these are an application of Corollary 3.1. The first one follows using bounds lk, uk for the suboptimal arm k. For the second, observe that µ [lmax, uk ], which are used to form ψk. The next lemma proves that meta-pruned arms are only played a constant number of times asymptotically. Lemma 21. Let arm k K1(δ) := {k [K] : µ uk + δ, k = 1}. Then, for all n 1, E[Tk(n)] 5ψ2 1 (max{δ, µ uk})2 . Proof. Since k K1(δ) = k > 0, we have that {At = k} {U1(t) uk}. Thus, E[Tk(n)] = E t=1 1{At = k} t=1 1{U1(t 1) uk} ˆµ1(t) µ uk µ 2ψ2 1 log(f(t)) T1(t 1) 2ψ2 1 log(f(t)) (Using Union Bound, Lemma 20) r=1 exp r(µ uk)2 2ψ2 1 (µ uk)2 5ψ2 1 (µ uk)2 5ψ2 1 (max{δ, µ uk})2 where the last inequality follows from the assumption on uk. The next lemma is a restatement of Lemma 8.2 in Lattimore & Szepesvári (2020) for standard subgaussian variables. It will be used to bound the number of plays of a suboptimal arm that is not meta-pruned. Lemma 22 (Lemma 8.2 in Lattimore & Szepesvári (2020)). Let {Xi} be a sequence of zero mean, independent σ-subgaussian random variables. Let ˆµt = 1 t Pt r=1 Xr, δ > 0, a > 0, u = 2aδ 2 and Then, E[κ] E[κ ] 1 + 2δ 2 a + σ2πa + σ2 for each n 1. Proof. This is a restatement of the lemma for the general case of σ-subgaussian random variables. Published in Transactions on Machine Learning Research (11/2024) Clearly, we have that E[κ] E[κ ]. Thus, 2a/t 2 (Xi σ subgaussian) Now, using x2 = (δ 2σ2 , we have δ2 = dt = 2 Note that the definition of u = 2a δ2 gives t = u = x = 0. Thus, we get E[κ ] 1 + 2a 0 xe x2dx + 4 πaσ2 + σ2 . Where we use that R 0 xe x2dx = 1 2 and R 0 e x2dx = π/2. Lemma 23. For the best arm, E [Pn t=1 1{U1(t 1) µ ϵ}] 5ψ2 1 ϵ2 for any ϵ > 0. Proof. We have t=1 1{U1(t 1) µ ϵ} t=1 1{U1(t 1) µ ϵ} 2ψ2 1 log(f(t)) T1(t 1) This follows by using the union bound and Lemma 20 for Arm 1 (See Theorem 8.1 in Lattimore & Szepesvári (2020) for more details on the inequality). Lemma 24. If k K2(δ) := {k [K] : µ uk + δ, k = 1} for an arbitrary δ > 0, then, E[Tk(n)] 1 + 5ψ2 1 2 k + 2 ψ2 k log(f(n)) + q ψ2 kσ2 kπ log(f(n)) + σ2 k Where h(n) = Ω log(f(n))2/3 . Published in Transactions on Machine Learning Research (11/2024) Proof. Since k = 1, {At = k} {U1(t 1) µ ϵ} {Uk(t 1) > µ ϵ, At = k} for any ϵ (0, k). Thus, we have that t=1 1{U1(t 1) µ ϵ} t=1 1{Uk(t 1) > µ ϵ, At = k} 5ψ2 1 ϵ2 + E[ t=1 1{Uk(t 1) > µ ϵ, At = k}] (Using Lemma 23) For the second term, following the steps in the Proof of Theorem 8.1 in Lattimore & Szepesvári (2020) and using Lemma 22 above, we get t=1 1{Uk(t 1) > µ ϵ, At = k}] 2ψ2 k log(f(t)) Tk(t 1) > µ ϵ, At = k ˆµkr µk > k ϵ 2ψ2 k log(f(n)) 1 + 2 ( k ϵ)2 ψ2 k log(f(n)) + q ψ2 kσ2 kπ log(f(n)) + σ2 k Here, ˆµkt is the empirical mean of t i.i.d samples from arm k. The last inequality follows from Lemma 22 with a = ψ2 k log(f(n)), δ = ( k ϵ) and σ = σk(using Lemma 20). Substituting this into the original expression, we get E[Tk(n)] 5ψ2 1 ϵ2 + 1 + 2 ( k ϵ)2 ψ2 k log(f(n)) + q ψ2 kσ2 kπ log(f(n)) + σ2 k inf ϵ (0, k) 5ψ2 1 ϵ2 + 1 + 2 ( k ϵ)2 ψ2 k log(f(n)) + q ψ2 kσ2 kπ log(f(n)) + σ2 k Now, let g(n) = ψ2 k log(f(n)) + p ψ2 kσ2 kπ log(f(n)) + σ2 k , and 0 < ϵ = k αg(n)1/3+1 < k for α = (2/5ψ2 1)1/3. We have the following: inf ϵ (0, k) 5ψ2 1 ϵ2 + 1 + 2 ( k ϵ)2 g(n) 1 + 5ψ2 1 2 k (αg(n)1/3 + 1)2 + 2 (αg(n)1/3 + 1)2 α2g(n)2/3 g(n) = 1 + 5ψ2 1 2 k α2g(n)2/3 + 2αg(n)1/3 + 1 + 2g(n) (αg(n)1/3 + 1)2 = 1 + 5ψ2 1 2 k + 2g(n) 2 k + g(n)2/3 5ψ2 1α2 + 4 10ψ2 1α + 2 = 1 + 5ψ2 1 2 k + 2g(n) 2 k + (20ψ2 1)1/3g(n)2/3 1 + (20ψ2 1)1/3 + 3g(n)1/3 | {z } =h(n) The result follows with h(n) being defined as the last two terms of the expression above. We are now ready to prove the theorem. Published in Transactions on Machine Learning Research (11/2024) Proof of Theorem 6. The theorem now follows immediately using Lemmas 21 and 24. This is because we can decompose regret as t=1 E[µ Yt] = X K1(δ) k E[Tk(n)] + X K2(δ) k E[Tk(n)], where K1(δ) = {k [K] : µ > uk + δ, k > 1} and K2(δ) = {k [K] : µ uk + δ, k > 1}, for any δ > 0. The first part of the theorem follows by bounding the two summations using Lemmas 21 and 24 respectively. To prove the asymptotic part, we simply take δ 0 and n in the above expression. D OSSB for MABs with mean bounds and Bernoulli Rewards Here, we derive closed form expressions for the semi-infinite optimization problem in Combes et al. (2017) in the case of Bernoulli rewards and bandits with mean bounds. In this section, we follow the notation in their paper. An instance θ is represented by a vector of arm means (θ1, θ2, ...θK). We say an instance θ is feasible if for each k [K], we have that θk [lk, uk]. Let Θ be the set of all feasible instances. With Θ as above, ν(θk) Bernoulli(θk) and the mapping k, θ 7 µ(k, θ) as µ(k, θ) = θk, our setting of Bernoulli bandits with mean bounds can hence be viewed as a Structured Bandit. Define for any α, β Θ, D(α, β, k) = d(αk, βk) where d( , ) is the Bernoulli kl-divergence function. Let µ (θ) = maxk [K]µ(k, θ). Then, the optimization problem that determines the regret lower bound in our case can be given as: k [K] η(k) (µ (θ) µ(k, θ)) k [K] η(k)D(θ, λ, k) 1 λ Λ(θ) Λ(θ) = {λ Θ : D(θ, λ, k θ) = 0, k (θ) = k (λ)} We now present the solution of the above optimization for a given instance θ. For simplicity, consider the case with 2 arms: Arm 1 and Arm k. Let the instance θ Θ be such that µ(1, θ) = µ (θ). We now have two cases: 1. µ (θ) > uk: In this case, the set Λ(θ) is empty since there is can be no other instance λ Θ with µ(1, λ) = µ(1, θ) = µ (θ) and Arm k as optimal (since µ(k, θ) uk < µ (θ)). Thus, the optimal solution to the optimization problem is η = (0, 0). 2. µ θ uk(θ): In this case, we note that to satisfy the constraint over all instances in Λ(θ), it it sufficient to set η(k) = 1 minµ [µ (θ),uk] d(µ(k,θ),µ) = 1 d(µ(k,θ),µ (θ)). This uses the fact that d(a, x) as a function of x is increasing in [a, 1]. Thus, the optimal solution in this case is η = 0, 1 d(µ(k,θ),µ (θ)) . Generalizing this to the case with K arms, we get that the solution to this problem for a given θ Θ is given by ( 0 if µ (θ) > uk or if k = k (θ) 1 d(µ(k,θ),µ (θ)) otherwise Finally, using this as the optima of the problem above, we have that the value at this optima is given by µ (θ) µ(k, θ) d (µ(k, θ), µ (θ)) = X k(θ) d (µ(k, θ), µ (θ)) Thus, the asymptotic regret of OSSB reduces to C(θ) as ϵ, γ 0. We note that in our case the OSSB algorithm uses the solution to the above optimization at the t-th round with θ replaced with the truncated empirical means, i.e. {min(uk, max(lk, ˆµk(t))) k [K]}. Published in Transactions on Machine Learning Research (11/2024) E Confounded Logs E.1 Stochastic Multi-armed Bandit Setting We begin with the proof of Theorem 7 Proof of Theorem 7. For the upper bound, we split the expectation into two parts: (z, u) where k is optimal, and (z, u) where k is not optimal. We have u U µk,z,u P(u|z) u U:k=k z,u µ z,u P(u|z) + X u U:k =k z,u µk,z,u P(u|z) (Splitting the sum into parts based on the optimality of k) u U:k=k z,u µ z,u P(u|z) + X u U:k =k z,u µ z,u δz P(u|z) (If k = k z,u, then its reward is at most µ z,u δz) u U µ z,u P(u|z) δZ u U:k =k z,u P(u|z) = µz δZ(1 pz(k)) (Using the definitions of µz and pz(k)) The inequality follows since the logs are assumed to be collected under an optimal policy. This completes the proof of the upper bound. To prove the lower bound, we fix an arbitratry k [K0] and z Z. Recall that K>(k, z) = {k : k = k, µz(k ) > δzpz(k )} and let K (k, z) := [K0]\K>(k, z) = {k : k = k, µz(k ) δzpz(k )}. We note that these sets can be identified from the logged data because µz(k ) and pz(k ) can be derived for any k and z. Now we define the sets U>(k, z), U (k, z) as U>(k, z) = {u U : k z,u K>(k, z)}, U (k, z) = {u U : k z,u K (k, z)} We now expand the mean of the arm k under partial context z as follows. u U:k=k z,u µ z,u P(u|z) + X u U:k =k z,u µk,z,u P(u|z) = µz(k) + X u U (k,z) µk,z,u P(u|z) + X u U>(k,z) µk,z,u P(u|z) u U>(k,z) (µ z,u δz)P(u|z) = µz(k) + X k K>(k,z) µz(k ) δz X k K>(k,z) pz(k ). The inequality holds because, firstly, we have µk,z,u 0 which we use for u U (k, z). Secondly, we have µk,z,u (µ z,u δz) by definition of δz, which we use for u U>(k, z). Since this holds for any arbitrary k, z, this proves the lower bound. Now we move on to the proof of our tightness claim Proof of 8. Upper Bound Tightness: For the tightness of the upper bound, notice that u U µ z,u P(u|z) δz X u U:k =k z,u P(u|z) Published in Transactions on Machine Learning Research (11/2024) u U:k=k z,u µk,z,u P(u|z) + X u U:k =k z,u (µ z,u δz)P(u|z). Which is the same as the quantity µk,z = P u U µk,z,u P(u|z) when µk,z,u = µ z,u δz u U : k = k z,u If the above equality holds for all k [K0] and all z Z, we have that µk,z = uk,z. This instance is admissible since it maintains the means of the best arms are unchanged, and thus do not affect the quantities µz and pz(k) for any k. Thus, there exists an admissible instance where the upper bounds for each arm is tight. Lower Bound Tightness: Fix an arm k, and a partial context z. For the tightness of the lower bound for this particular arm k and partial context z, we present an instance now. We assign µk,z,u = 0 for all u U (k, z), and µk,z,u = µ z,u δz for all u U (k, z). For this particular choice, it is easy to see that the above lower bound is tight. We now need to prove that this instance is admissible. 1. By construction we know µk,z,u 0 as µ z,u δz for all partial contexts u U>(k, z). That µk,z,u 1, is easy to check. 2. We know that (µ z,u µk,z,u) δz by construction for all partial contexts u U>(k, z), and due to the fact that µ z,u δz and µk,z,u = 0 for all partial contexts u U (k, z). 3. We know that (µ z,u µk,z,u) δz by construction for all partial contexts u U>(k, z), and due to the fact that µ z,u δz and µk,z,u = 0 for all partial contexts u U (k, z). Here, µ z,u δz due to non-negativity of the mean-rewards and by definition of δz (the smallest gaps). 4. Finally, k is never optimal for any partial contexts u U (k, z) U>(k, z). Indeed, for all the partial contexts u U (k, z), we have µk,z,u = 0 and µ z,u δz > 0. For all the partial contexts u U>(k, z) we have µk,z,u = (µ z,u δz) < µ z,u. Therefore, for this instance we never observe k for any partial contexts u U (k, z) U>(k, z), which ensures the log statistics does not change. This concludes that the instance is admissible, and it proves that the lower bound is tight. E.2 Confounded Logs for Linear Bandits with Fixed Arms We now prove our result for mean bounds from confounded logs in the linear bandit case. Proof of Theorem 9. Let Ek = n k = ˆk(Tu) o be the event that arm k was optimal under the full context θ when Tu was drawn according to the distribution C. By definition, we have pk = P(Ek) and E[ ak, θ ] = ak,z, θ z + E[ ak,u, Tu ] = ak,z, θ z + pk E [ ak,u, Tu |Ek] + (1 pk)E ak,u, Tu |EC k = pkνk + (1 pk) ak,z, θ z + E ak,u, Tu |EC k Since Tu is such that the values of each of its entries are in [m, M], we can write dm 1d, ak,u E ak,u, Tu |EC k d M 1d, ak,u . Rearranging the above equation then gives us the required result. E.3 More Empirical Results In Figure 6, we present the instances extracted out of each of the visible contexts of Occupation. The bounds are calculated as suggested by Theorem 7. Published in Transactions on Machine Learning Research (11/2024) (a) Academic (f) Retired (g) Scientific (h) Student Figure 6: Instances with Occupation as visible context: The occupations are listed in the caption. The data filtration and movie selection process is explained in Section 6.3.