# Thompson Sampling with Information Relaxation Penalties

Seungki Min (Columbia Business School), Costis Maglaras (Columbia Business School), Ciamac C. Moallemi (Columbia Business School)

We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. With this framework, we define an intuitive family of control policies that include Thompson sampling (TS) and the Bayesian optimal policy as endpoints. Analogous to TS, which, at each decision epoch, pulls an arm that is best with respect to the randomly sampled parameters, our algorithms sample entire future reward realizations and take the corresponding best action. However, this is done in the presence of penalties that seek to compensate for the availability of future information. We develop several novel policies and performance bounds for MAB problems that vary in terms of improving performance and increasing computational complexity between the two endpoints. Our policies can be viewed as natural generalizations of TS that simultaneously incorporate knowledge of the time horizon and explicitly consider the exploration-exploitation trade-off. We prove associated structural results on performance bounds and suboptimality gaps. Numerical experiments suggest that this new class of policies performs well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem.

1 Introduction

Dating back to the earliest work [2, 10], multi-armed bandit (MAB) problems have been considered within a Bayesian framework, in which the unknown parameters are modeled as random variables drawn from a known prior distribution. In this setting, the problem can be viewed as a Markov decision process (MDP) whose state is an information state describing the beliefs about the unknown parameters, which evolve stochastically upon each play of an arm according to Bayes' rule. Under the objective of expected performance, where the expectation is taken with respect to the prior distribution over unknown parameters, the (Bayesian) optimal policy (OPT) is characterized by Bellman equations that follow immediately from the MDP formulation. In the discounted infinite-horizon setting, the celebrated Gittins index [10] determines an optimal policy, although its computation remains challenging. In the non-discounted finite-horizon setting, which we consider, the problem becomes more difficult [1], and except for some special cases, the Bellman equations are neither analytically nor numerically tractable, due to the curse of dimensionality. In this paper, we focus on the Bayesian setting, and attempt to apply ideas from dynamic programming (DP) to develop tractable policies with good performance. To this end, we apply the idea of information relaxation [4], a technique that provides a systematic way of obtaining performance bounds on the optimal policy. In multi-period stochastic DP problems, admissible policies are required to make decisions based only on previously revealed information. The idea of information relaxation is to treat non-anticipativity as a constraint imposed on the policy space that can be relaxed, while simultaneously introducing a penalty for this relaxation into the objective, as in the usual Lagrangian relaxations of convex duality theory.
Under such a relaxation, the decision maker (DM) is allowed to access future information and is asked to solve an optimization problem so as to maximize her total reward, in the presence of penalties that punish any violation of the non-anticipativity constraint. When the penalties satisfy a certain condition (dual feasibility, formally defined in Section 3), the expected value of the maximal reward adjusted by the penalties provides an upper bound on the expected performance of the (non-anticipating) optimal policy. The idea of relaxing the non-anticipativity constraint has been studied in different contexts [17, 6, 18, 11], and was later formulated as a formal framework by [4], upon which our methodology is developed. This framework has been applied to a variety of applications including optimal stopping problems [7], linear-quadratic control [12], dynamic portfolio execution [13], and more (see [3]). Typically, the application of this method to a specific class of MDPs requires custom analysis. In particular, it is not always easy to determine penalty functions that (1) yield a relaxation that is tractable to solve, and (2) provide tight upper bounds on the performance of the optimal policy. Moreover, the established information relaxation theory focuses on upper bounds and provides no guidance on the development of tractable policies.

Our contribution is to apply information relaxation techniques to the finite-horizon stochastic MAB problem, explicitly exploiting the structure of a Bayesian learning process. In particular,

1. we propose a series of information relaxations and penalties of increasing computational complexity;
2. we systematically obtain upper bounds on the best achievable expected performance that trade off tightness against computational complexity;
3. and we develop associated (randomized) policies that generalize Thompson sampling (TS) in the finite-horizon setting.

In our framework, which we call information relaxation sampling, each of the penalty functions (and information relaxations) determines one policy and one performance bound given a particular problem instance specified by the time horizon and the prior beliefs. As a base case for our algorithms, we have TS [21] and the conventional regret benchmark that has been used for Bayesian regret analysis since [15]. At the other extreme, the optimal policy OPT and its expected performance follow from the ideal penalty (which, not surprisingly, is intractable to compute). By picking increasingly strict information penalties, we can improve the policy and the associated bound between the two extremes of TS and OPT. As an example, one of our algorithms, IRS.FH, provides a very simple modification of TS that naturally incorporates the time horizon $T$. Recalling that TS makes a decision based on parameters sampled from the posterior distribution in each epoch, we note that knowing the parameters is as informative as having an infinite number of future reward observations for the purpose of best-arm identification. By contrast, IRS.FH makes a decision based on future Bayesian estimates, updated with only $T-1$ future reward realizations for each arm, where the rewards are sampled based on the current prior belief. When $T = 1$ (equivalently, at the last decision epoch), such a policy takes the myopically best action based only on the current estimates, which is indeed an optimal decision, whereas TS would still explore unnecessarily.
While keeping the recursive structure of the sequential decision-making process of TS, IRS.FH naturally performs less exploration than TS as the remaining time horizon diminishes. This mitigates a common practical criticism of TS: it explores too much. Beyond this, we propose other algorithms that more explicitly quantify the benefit of exploration and trade off between exploration and exploitation, at the cost of additional computational complexity. As we increase the complexity, we achieve policies that improve performance, and separately obtain tighter, tractable, computational upper bounds on the expected performance of any policy for a particular problem instance. By providing natural generalizations of TS, our work provides both a deeper understanding of TS and improved policies that do not require tuning. Since TS has been shown to be asymptotically regret optimal [5], our improvements can at best be (asymptotically) constant factor improvements by that metric. On the other hand, TS is extremely popular in practice, and we demonstrate in numerical examples that the improvements can be significant and are likely to be of practical interest. Moreover, we develop upper bounds on performance that are useful in their own right. Suppose that a decision maker faces a particular problem instance and is considering any particular MAB policy (be it one we suggest or otherwise). By simulating the policy, a lower bound on the performance of the optimal policy can be found. We introduce a series of upper bounds that can also be evaluated in any problem instance via simulation. Paired with the lower bound, these provide a computational, simulation-based confidence interval that can be helpful to the decision maker. For example, if the upper bound and lower bound are close, the suboptimality gap of the policy under consideration is guaranteed to be small, and it is not worth investing in better policies.

2 Notation and Preliminaries

Problem. We consider a classical stochastic MAB problem with $K$ independent arms and finite horizon $T$. At each decision epoch $t = 1, \ldots, T$, the decision maker (DM) pulls an arm $a_t \in \mathcal{A} \equiv \{1, \ldots, K\}$ and earns a stochastic reward associated with arm $a_t$. More formally, the reward from the $n$th pull of arm $a$ is denoted by $R_{a,n}$, which is drawn independently from an unknown distribution $\mathcal{R}_a(\theta_a)$, where $\theta_a \in \Theta_a$ is the parameter associated with arm $a$. We also have a prior distribution $\mathcal{P}_a(y_a)$ over the unknown parameter $\theta_a$, where $y_a \in \mathcal{Y}_a$, which we call a belief, is a hyperparameter describing the prior distribution: $\theta_a \sim \mathcal{P}_a(y_a)$ and $R_{a,n} \mid \theta_a \sim \mathcal{R}_a(\theta_a)$ for all $n \in \mathbb{N}$ and all $a \in \mathcal{A}$. We define the outcome $\omega \equiv \left((\theta_a)_{a \in \mathcal{A}}, (R_{a,n})_{a \in \mathcal{A}, n \in \mathbb{N}}\right)$, which encapsulates all of the uncertainty that the DM encounters. Given the prior belief vector $y \equiv (y_1, \ldots, y_K) \in \mathcal{Y}$, we let $\mathcal{I}(y)$ be the prior distribution of the outcome $\omega$ induced by the $\mathcal{P}_a$'s and $\mathcal{R}_a$'s. We additionally define the true mean reward $\mu_a$ and its Bayesian estimate $\hat{\mu}_{a,n}$ as follows:

$$\mu_a(\theta_a) \equiv \mathbb{E}\left[R_{a,n} \mid \theta_a\right], \qquad \hat{\mu}_{a,n}(\omega; y_a) \equiv \mathbb{E}\left[\mu_a(\theta_a) \mid R_{a,1}, \ldots, R_{a,n}\right]. \quad (1)$$

Throughout the paper, we assume that the rewards are absolutely integrable under the prior distribution: i.e., $\mathbb{E}[|R_{a,n}|] < \infty$, or more explicitly, $\mathbb{E}_{r \sim \mathcal{R}_a(\mathcal{P}_a(y_a))}[|r|] < \infty$, where $\mathcal{R}_a(\mathcal{P}_a(y_a))$ denotes the (unconditional) distribution of the reward $R_{a,n}$ viewed as a doubly stochastic random variable.
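For concreteness, the following is a minimal sketch of this outcome model and of the Bayesian estimate in (1) for a single Gaussian arm with a conjugate Normal prior and known noise variance (an illustrative special case; the framework is agnostic to the choice of $\mathcal{P}_a$ and $\mathcal{R}_a$, and the class and method names here are hypothetical).

```python
import numpy as np

# Sketch of one arm's share of the outcome omega = (theta_a, R_{a,1}, R_{a,2}, ...)
# under a Normal prior and Normal rewards with known noise variance.

class GaussianArm:
    def __init__(self, prior_mean, prior_var, noise_var, rng):
        self.m, self.v, self.s2, self.rng = prior_mean, prior_var, noise_var, rng

    def sample_outcome(self, horizon):
        """Draw theta_a ~ P_a(y_a) and the first `horizon` rewards R_{a,n} ~ R_a(theta_a)."""
        theta = self.rng.normal(self.m, np.sqrt(self.v))
        rewards = self.rng.normal(theta, np.sqrt(self.s2), size=horizon)
        return theta, rewards

    def posterior_mean(self, rewards):
        """Bayesian estimate mu_hat_{a,n} after observing the first n rewards (eq. (1))."""
        n = len(rewards)
        if n == 0:
            return self.m
        # Normal-Normal conjugate update: precision-weighted average of prior mean and data.
        prec = 1.0 / self.v + n / self.s2
        return (self.m / self.v + np.sum(rewards) / self.s2) / prec

rng = np.random.default_rng(0)
arm = GaussianArm(prior_mean=0.0, prior_var=1.0, noise_var=4.0, rng=rng)
theta, rewards = arm.sample_outcome(horizon=10)
print(arm.posterior_mean(rewards[:5]))   # mu_hat_{a,5}(omega)
```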
Policy. Given an action sequence up to time $t$, $a_{1:t} \equiv (a_1, \ldots, a_t) \in \mathcal{A}^t$, define the number of pulls $n_t(a_{1:t}, a) \equiv \sum_{s=1}^{t} \mathbb{1}\{a_s = a\}$ for each arm $a$, and the corresponding reward realization $r_t(a_{1:t}, \omega) \equiv R_{a_t, n_t(a_{1:t}, a_t)}$. The natural filtration $\mathcal{F}_t(a_{1:t}, \omega; T, y) \equiv \sigma\left(T, y, (a_s, r_s(a_{1:s}, \omega))_{s \in [t]}\right)$ encodes the observations revealed up to time $t$ (inclusive). Let $a^\pi_{1:t}$ be the action sequence taken by a policy $\pi$. A policy $\pi$ is called non-anticipating if every action $a^\pi_t$ is $\mathcal{F}_{t-1}$-measurable, and we define $\Pi_{\mathcal{F}}$ to be the set of all non-anticipating policies, including randomized ones. The (Bayesian) performance of a policy $\pi$ is defined as the expected total reward over the randomness associated with the outcome, i.e.,

$$V(\pi, T, y) \equiv \mathbb{E}_{\omega \sim \mathcal{I}(y)}\left[\sum_{t=1}^{T} r_t(a^\pi_{1:t}, \omega)\right]. \quad (2)$$

MDP formulation. We assume that we are equipped with a Bayesian update function $\mathcal{U} : \mathcal{Y} \times \mathcal{A} \times \mathbb{R} \to \mathcal{Y}$ so that after observing $R_{a,1} = r$ from some arm $a$, the belief vector is updated from $y$ to $\mathcal{U}(y, a, r)$ according to Bayes' rule, where only the $a$th component is updated in this step. In a Bayesian framework, the MAB problem has a recursive structure. Given a time horizon $T$ and prior belief $y$, suppose the DM has just earned $r$ by pulling an arm $a$ at time $t = 1$. The remaining problem for the DM is equivalent to a problem with time horizon $T - 1$ and prior belief $\mathcal{U}(y, a, r)$. Following from this Markovian structure, we obtain the Bellman equations for the MAB problem:

$$Q^*(T, y, a) \equiv \mathbb{E}_{r \sim \mathcal{R}_a(\mathcal{P}_a(y_a))}\left[r + V^*(T - 1, \mathcal{U}(y, a, r))\right], \qquad V^*(T, y) \equiv \max_{a \in \mathcal{A}} Q^*(T, y, a), \quad (3)$$

with $V^*(0, y) \equiv 0$ for all $y \in \mathcal{Y}$. While the Bellman equations are intractable to analyze, they offer a characterization of the Bayesian optimal policy (OPT) and the best achievable performance $V^*$: i.e., $V^*(T, y) = V(\mathrm{OPT}, T, y) = \sup_{\pi \in \Pi_{\mathcal{F}}} V(\pi, T, y)$.

3 Information Relaxation Sampling

We propose a general framework, which we refer to as information relaxation sampling (IRS), that takes as an input a penalty function $z_t(\cdot)$, and produces as outputs a policy $\pi^z$ and an associated performance bound $W^z$.

Information relaxation penalties and inner problem. If we relax the non-anticipativity constraint imposed on the policy space $\Pi_{\mathcal{F}}$ (i.e., that $a^\pi_t$ is $\mathcal{F}_{t-1}$-measurable), the DM is allowed to first observe all future outcomes in advance, and then pick an action (i.e., $a^\pi_t$ is $\sigma(\omega)$-measurable). To compensate for this relaxation, we impose a penalty on the DM for violating the non-anticipativity constraint. We introduce a penalty function $z_t(a_{1:t}, \omega; T, y)$ to denote the penalty that the DM incurs at time $t$ when taking an action sequence $a_{1:t}$, given a particular instance specified by $\omega$, $T$ and $y$. The clairvoyant DM can find the action sequence that is optimal for a particular outcome $\omega$ in the presence of the penalties $z_t$, by solving the following (deterministic) optimization problem, referred to as the inner problem:

$$\underset{a_{1:T} \in \mathcal{A}^T}{\text{maximize}} \;\; \sum_{t=1}^{T} \left[ r_t(a_{1:t}, \omega) - z_t(a_{1:t}, \omega; T, y) \right]. \quad (\star)$$

Definition 1 (Dual feasibility). A penalty function $z_t$ is dual feasible if it is ex-ante zero-mean, i.e.,

$$\mathbb{E}\left[z_t(a_{1:t}, \omega; T, y) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega; T, y)\right] = 0, \quad \forall a_{1:t} \in \mathcal{A}^t, \; \forall t \in [T]. \quad (4)$$

To clarify the notion of conditional expectation, we remark that the mapping $a_{1:t} \mapsto z_t(a_{1:t}, \omega; T, y)$ is a stochastic function of the action sequence $a_{1:t}$ since the outcome $\omega$ is random.¹ The dual feasibility condition requires that a DM who makes decisions on the natural filtration receives zero penalties in expectation.

IRS performance bound. Let $W^z(T, y)$ be the expected maximal value of the inner problem $(\star)$ when the outcome $\omega$ is randomly drawn from its prior distribution $\mathcal{I}(y)$, i.e., the expected total payoff that a clairvoyant DM can achieve in the presence of the penalties:

$$W^z(T, y) \equiv \mathbb{E}_{\omega \sim \mathcal{I}(y)}\left[\max_{a_{1:T} \in \mathcal{A}^T} \sum_{t=1}^{T} \left( r_t(a_{1:t}, \omega) - z_t(a_{1:t}, \omega; T, y) \right)\right]. \quad (5)$$

We can obtain this value numerically via simulation: draw outcomes $\omega^{(1)}, \omega^{(2)}, \ldots, \omega^{(S)}$ independently from $\mathcal{I}(y)$, solve the inner problem for each outcome separately, and then take the average of the maximal values across these samples.
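The following is a hedged sketch of this sample average approximation of $W^z$ in (5), assuming two problem-specific callbacks with hypothetical names: `sample_outcome` draws $\omega \sim \mathcal{I}(y)$ (parameters plus a table of future reward realizations), and `solve_inner_problem` returns the optimal value of the penalized inner problem $(\star)$ for that fixed outcome.

```python
import numpy as np

def estimate_irs_bound(sample_outcome, solve_inner_problem, T, y,
                       num_samples=10_000, seed=0):
    """Monte Carlo estimate of the IRS performance bound W^z(T, y)."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(num_samples):
        omega = sample_outcome(T, y, rng)                  # clairvoyant scenario omega^(s)
        values.append(solve_inner_problem(omega, T, y))    # max_a sum_t (r_t - z_t)
    values = np.asarray(values)
    # The sample mean estimates W^z; the standard error quantifies the Monte Carlo
    # accuracy of this upper bound on V*(T, y).
    return values.mean(), values.std(ddof=1) / np.sqrt(num_samples)
```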
The following theorem shows that $W^z$ is indeed a valid performance bound for the stochastic MAB problem.

Theorem 1 (Weak duality and strong duality). If the penalty function $z_t$ is dual feasible, $W^z$ is an upper bound on the optimal value $V^*$: for any $T$ and $y$,

$$\text{(Weak duality)} \quad W^z(T, y) \geq V^*(T, y). \quad (6)$$

There exists a dual feasible penalty function, referred to as the ideal penalty $z_t^{\mathrm{ideal}}$, such that

$$\text{(Strong duality)} \quad W^{\mathrm{ideal}}(T, y) = V^*(T, y). \quad (7)$$

The ideal penalty function $z_t^{\mathrm{ideal}}$ has the following functional form:

$$z_t^{\mathrm{ideal}}(a_{1:t}, \omega) \equiv r_t(a_{1:t}, \omega) - \mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right] + V^*(T-t, y_t(a_{1:t}, \omega)) - \mathbb{E}\left[V^*(T-t, y_t(a_{1:t}, \omega)) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right]. \quad (8)$$

A good penalty function precisely penalizes the additional profit extracted from using the future information $\omega$. At the extreme, the ideal penalty $z_t^{\mathrm{ideal}}$, although intractable, removes any incentive to deviate from OPT and results in strong duality. In (8), $y_t(a_{1:t}, \omega)$ represents the posterior belief that the DM would have at time $t$ after observing the reward realizations associated with $a_{1:t}$ given $\omega$.

¹ As in usual probability theory, $Z(\omega) \equiv \mathbb{E}[X(\omega) \mid Y(\omega)]$ represents the expected value of a random variable $X(\omega)$ given the information $Y(\omega)$, and $Z(\omega)$ is itself a random variable that depends on $\omega$.

IRS policy. Given a penalty function $z_t$, we characterize a randomized and non-anticipating IRS policy $\pi^z \in \Pi_{\mathcal{F}}$ as follows. The policy $\pi^z$ specifies which arm to pull when the remaining time is $T$ and the current belief is $y$. Given $T$ and $y$, it (i) first samples an outcome $\tilde{\omega}$ from $\mathcal{I}(y)$ randomly, (ii) solves the inner problem to find a best action sequence $a^*_{1:T}$ with respect to $\tilde{\omega}$ in the presence of the penalties $z_t$, and (iii) takes the first action $a^*_1$ that the clairvoyant optimal solution $a^*_{1:T}$ suggests. Analogous to Thompson sampling, it repeats steps (i)–(iii) at every decision epoch, while updating the remaining time $T$ and belief $y$ upon each reward realization.

Algorithm 1: Information relaxation sampling (IRS) policy

Function IRS($T$, $y$; $z$):
1. Sample $\tilde{\omega} \sim \mathcal{I}(y)$ (equivalently, $\tilde{\theta}_a \sim \mathcal{P}_a(y_a)$ and $\tilde{R}_{a,n} \sim \mathcal{R}_a(\tilde{\theta}_a)$ for all $a \in \mathcal{A}$, $n \in [T]$).
2. Find the best action sequence with respect to $\tilde{\omega}$ under the penalties $z_t$: $a^*_{1:T} \in \operatorname{argmax}_{a_{1:T} \in \mathcal{A}^T} \left\{ \sum_{t=1}^{T} r_t(a_{1:t}, \tilde{\omega}) - z_t(a_{1:t}, \tilde{\omega}; T, y) \right\}$.
3. Return $a^*_1$.

Procedure IRS-Outer($T$, $y_0$; $z$):
1. for $t = 1, 2, \ldots, T$ do
2. &nbsp;&nbsp;&nbsp;&nbsp;Play $a_t \leftarrow$ IRS($T - t + 1$, $y_{t-1}$; $z$)
3. &nbsp;&nbsp;&nbsp;&nbsp;Earn and observe a reward $r_t$ and update the belief $y_t \leftarrow \mathcal{U}(y_{t-1}, a_t, r_t)$
4. end

Remark 1. The ideal penalty yields the Bayesian optimal policy: i.e., $V(\pi^{\mathrm{ideal}}, T, y) = V^*(T, y)$.

Choice of penalty functions. IRS policies include Thompson sampling and the Bayesian optimal policy as two extremal cases. We propose a set of penalty functions spanning these two. While deferring the detailed explanations to Sections 3.1–3.4, we briefly list the penalty functions:

$$z_t^{\mathrm{TS}}(a_{1:t}, \omega) \equiv r_t(a_{1:t}, \omega) - \mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \theta_1, \ldots, \theta_K\right] \quad (9)$$
$$z_t^{\mathrm{IRS.FH}}(a_{1:t}, \omega) \equiv r_t(a_{1:t}, \omega) - \mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \hat{\mu}_{1,T-1}(\omega), \ldots, \hat{\mu}_{K,T-1}(\omega)\right] \quad (10)$$
$$z_t^{\mathrm{IRS.V\text{-}ZERO}}(a_{1:t}, \omega) \equiv r_t(a_{1:t}, \omega) - \mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right] \quad (11)$$
$$z_t^{\mathrm{IRS.V\text{-}EMAX}}(a_{1:t}, \omega) \equiv r_t(a_{1:t}, \omega) - \mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right] + W^{\mathrm{TS}}(T-t, y_t(a_{1:t}, \omega)) - \mathbb{E}\left[W^{\mathrm{TS}}(T-t, y_t(a_{1:t}, \omega)) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right] \quad (12)$$

To aid understanding, we provide an identity as an example:

$$\mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right] = \mathbb{E}\left[\mu_{a_t}(\theta_{a_t}) \mid R_{a_t,1}, \ldots, R_{a_t, n_{t-1}(a_{1:t-1}, a_t)}\right] = \hat{\mu}_{a_t, n_{t-1}(a_{1:t-1}, a_t)}(\omega);$$

these conditional expectations all represent the mean reward that the DM expects to get from arm $a_t$ right before making a decision at time $t$.
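As a complement to Algorithm 1, here is a minimal Python rendering of the IRS decision and outer loop. It is a template, not a full implementation: `sample_outcome`, `solve_inner`, `pull_arm`, and `bayes_update` are hypothetical problem-specific callbacks for sampling $\tilde{\omega} \sim \mathcal{I}(y)$, solving the penalized inner problem $(\star)$, interacting with the environment, and applying the update $\mathcal{U}$.

```python
import numpy as np

def irs_decision(T, y, sample_outcome, solve_inner, rng):
    """One call of Function IRS(T, y; z): sample, solve, play the first action."""
    omega = sample_outcome(T, y, rng)     # step 1: sample omega-tilde ~ I(y)
    a_star = solve_inner(omega, T, y)     # step 2: clairvoyant best action sequence
    return a_star[0]                      # step 3: commit only to the first action

def irs_outer(T, y0, sample_outcome, solve_inner, pull_arm, bayes_update, seed=0):
    """Procedure IRS-Outer: run the non-anticipating IRS policy for T periods."""
    rng = np.random.default_rng(seed)
    y, total_reward = y0, 0.0
    for t in range(T):
        a = irs_decision(T - t, y, sample_outcome, solve_inner, rng)
        r = pull_arm(a)                   # environment returns a stochastic reward
        y = bayes_update(y, a, r)         # posterior update U(y, a, r)
        total_reward += r
    return total_reward
```

With the penalty $z_t^{\mathrm{TS}}$ of (9), `solve_inner` reduces to ranking the sampled mean rewards, and the loop recovers Thompson sampling, as discussed in Section 3.1 below.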
Remark 2. All penalty functions (8)–(12) are dual feasible.

As we sequentially increase the complexity of the penalty function, from $z_t^{\mathrm{TS}}$ to $z_t^{\mathrm{ideal}}$, it more accurately penalizes the benefit of knowing the future outcomes, more explicitly preventing the DM from exploiting the future information. As summarized in Table 1, this makes the inner problem closer to the original stochastic optimization problem, which results in a better performing policy and a tighter performance bound. As a result, we obtain a family of algorithms that are intuitive and tractable, exhibiting a trade-off between quality and computational efficiency.

3.1 Thompson Sampling

With the penalty function $z_t^{\mathrm{TS}}(a_{1:t}, \omega) = r_t(a_{1:t}, \omega) - \mu_{a_t}(\theta_{a_t})$, the inner problem $(\star)$ reduces to

$$\max_{a_{1:T} \in \mathcal{A}^T} \sum_{t=1}^{T} \left[ r_t(a_{1:t}, \omega) - z_t^{\mathrm{TS}}(a_{1:t}, \omega) \right] = \max_{a_{1:T} \in \mathcal{A}^T} \sum_{t=1}^{T} \mu_{a_t}(\theta_{a_t}) = T \max_{a \in \mathcal{A}} \mu_a(\theta_a). \quad (13)$$

The resulting performance bound $W^{\mathrm{TS}}(T, y) = \mathbb{E}\left[T \max_{a \in \mathcal{A}} \mu_a(\theta_a)\right]$ is the conventional benchmark in a Bayesian setting [15, 19]. The corresponding IRS policy $\pi^{\mathrm{TS}}$ recovers Thompson sampling: when the sampled outcome $\tilde{\omega}$ is used instead, it plays the arm $a^*_1 = \operatorname{argmax}_a \mu_a(\tilde{\theta}_a)$, where each $\tilde{\theta}_a$ is sampled from $\mathcal{P}_a(y_a)$. Recall that this sampling-based decision making is repeated in each epoch, while updating the belief sequentially, as described in IRS-Outer in Algorithm 1.

| Penalty function | Policy | Performance bound | Inner problem | Run time |
|---|---|---|---|---|
| $z_t^{\mathrm{TS}}$ | TS | $W^{\mathrm{TS}}$ | Find a best arm given parameters. | $O(K)$ |
| $z_t^{\mathrm{IRS.FH}}$ | $\pi^{\mathrm{IRS.FH}}$ | $W^{\mathrm{IRS.FH}}$ | Find a best arm given finite observations. | $O(K)$ or $O(KT)$ |
| $z_t^{\mathrm{IRS.V\text{-}ZERO}}$ | $\pi^{\mathrm{IRS.V\text{-}ZERO}}$ | $W^{\mathrm{IRS.V\text{-}ZERO}}$ | Find an optimal allocation of $T$ pulls. | $O(KT^2)$ |
| $z_t^{\mathrm{IRS.V\text{-}EMAX}}$ | $\pi^{\mathrm{IRS.V\text{-}EMAX}}$ | $W^{\mathrm{IRS.V\text{-}EMAX}}$ | Find an optimal action sequence. | $O(KT^K)$ |
| $z_t^{\mathrm{ideal}}$ | OPT | $V^*$ | Solve Bellman equations. |  |

Table 1: List of algorithms associated with the penalty functions (8)–(12). Run time represents the time complexity of solving one instance of the inner problem, that is, the time required to obtain one sample of the performance bound $W^z$ or to make a single decision with the policy $\pi^z$. For IRS.FH, $O(K)$ is achievable when the prior distribution $\mathcal{P}_a$ is a conjugate prior of the reward distribution $\mathcal{R}_a$.

3.2 IRS.FH

Recall that $\hat{\mu}_{a,T-1}(\omega)$ is the Bayesian estimate of the mean reward of arm $a$ inferred from the first $T-1$ reward realizations $R_{a,1}, \ldots, R_{a,T-1}$. Given (10), the optimal solution to the inner problem $(\star)$ is to pull an arm with the highest $\hat{\mu}_{a,T-1}(\omega)$ from beginning to end:

$$\max_{a_{1:T} \in \mathcal{A}^T} \sum_{t=1}^{T} \left[ r_t(a_{1:t}, \omega) - z_t^{\mathrm{IRS.FH}}(a_{1:t}, \omega) \right] = \max_{a_{1:T} \in \mathcal{A}^T} \sum_{t=1}^{T} \hat{\mu}_{a_t, T-1}(\omega) = T \max_{a \in \mathcal{A}} \hat{\mu}_{a,T-1}(\omega). \quad (14)$$

IRS.FH is almost identical to TS except that $\mu_a(\theta_a)$ is replaced with $\hat{\mu}_{a,T-1}(\omega)$. Note that $\hat{\mu}_{a,T-1}(\omega)$ is less informative than $\mu_a(\theta_a)$ for the DM, since she will never be able to learn $\mu_a(\theta_a)$ perfectly within a finite horizon. In terms of estimation, knowing the parameters is equivalent to having an infinite number of observations. The inner problem of TS asks the DM to identify the best arm based on an infinite number of samples, whereas that of IRS.FH asks her to identify the best arm based on a finite number of samples, which takes the length of the time horizon into account explicitly. Focusing on the policies $\pi^{\mathrm{IRS.FH}}$ and $\pi^{\mathrm{TS}}$ (where the randomly generated $\mu_a(\tilde{\theta}_a)$ and $\hat{\mu}_{a,T-1}(\tilde{\omega})$ are used), we observe that the distribution of $\hat{\mu}_{a,T-1}(\tilde{\omega})$ is more concentrated, while both have the same mean $\bar{\mu}_a \equiv \mathbb{E}[\mu_a(\tilde{\theta}_a)] = \hat{\mu}_{a,0}$. Since the variances of $\hat{\mu}_{a,T-1}(\tilde{\omega})$ and $\mu_a(\tilde{\theta}_a)$ govern the degree of random exploration (deviating from the myopic decision of pulling an arm with the largest $\bar{\mu}_a$), $\pi^{\mathrm{IRS.FH}}$ naturally explores less than TS, in particular as it approaches the end of the horizon ($T \approx 1$).
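To make this concrete, the following is a minimal sketch of a single $\pi^{\mathrm{IRS.FH}}$ decision in the Gaussian case, assuming Normal conjugate priors and known noise variances (illustrative assumptions; the function name is hypothetical). It uses the batch-sampling shortcut described in the "Sampling $\hat{\mu}_{a,T-1}(\tilde{\omega})$ at once" paragraph below.

```python
import numpy as np

def irs_fh_decision(prior_means, prior_vars, noise_vars, T, rng):
    """One pi^IRS.FH decision for a Gaussian MAB with Normal priors (means, vars)."""
    prior_means, prior_vars, noise_vars = map(np.asarray, (prior_means, prior_vars, noise_vars))
    if T <= 1:
        return int(np.argmax(prior_means))          # last period: myopic (optimal) play
    theta = rng.normal(prior_means, np.sqrt(prior_vars))          # theta_a ~ P_a(y_a)
    # Sample the sum of T-1 synthesized rewards in one shot: N((T-1) theta_a, (T-1) sigma_a^2).
    reward_sums = rng.normal((T - 1) * theta, np.sqrt((T - 1) * noise_vars))
    # Posterior mean mu_hat_{a,T-1} via the Normal-Normal conjugate update.
    prec = 1.0 / prior_vars + (T - 1) / noise_vars
    mu_hat = (prior_means / prior_vars + reward_sums / noise_vars) / prec
    return int(np.argmax(mu_hat))                    # pull the arm best under mu_hat_{a,T-1}

rng = np.random.default_rng(1)
print(irs_fh_decision([0.0] * 5, [1.0] * 5, [0.01, 0.16, 1.0, 16.0, 100.0], T=50, rng=rng))
```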
By the same concentration argument, the performance bounds satisfy $W^{\mathrm{IRS.FH}} = \mathbb{E}[T \max_a \hat{\mu}_{a,T-1}(\omega)] \leq W^{\mathrm{TS}} = \mathbb{E}[T \max_a \mu_a(\theta_a)]$, meaning that IRS.FH yields a performance bound that is tighter than the conventional regret benchmark.

Sampling $\hat{\mu}_{a,T-1}(\tilde{\omega})$ at once. In order to obtain $\hat{\mu}_{a,T-1}(\tilde{\omega})$ for a synthesized outcome $\tilde{\omega}$, one may apply Bayes' rule sequentially for each reward realization, which takes $O(KT)$ computations in total. It can be done in $O(K)$ when the belief can be updated in a batch through the use of sufficient statistics. In Beta-Bernoulli and Gaussian MABs, for example, $\hat{\mu}_{a,T-1}(\tilde{\omega})$ can be represented as a convex combination of the current estimate $\bar{\mu}_a$ and the sample mean $\frac{1}{T-1}\sum_{n=1}^{T-1} \tilde{R}_{a,n}$, where $\sum_{n=1}^{T-1} \tilde{R}_{a,n}$ is distributed as $\mathrm{Binomial}(T-1, \tilde{\theta}_a)$ in the Beta-Bernoulli case and as $\mathcal{N}((T-1)\tilde{\theta}_a, (T-1)\sigma^2_a)$ in the Gaussian case ($\sigma^2_a$ denotes the noise variance). After sampling the parameter $\tilde{\theta}_a$, we can sample $\sum_{n=1}^{T-1} \tilde{R}_{a,n}$ directly from the known distribution, and then use it to compute $\hat{\mu}_{a,T-1}(\tilde{\omega})$ without sequentially updating the belief. In such cases, a single decision of $\pi^{\mathrm{IRS.FH}}$ can be made within $O(K)$ operations, similar in complexity to TS.

3.3 IRS.V-ZERO

Under the penalty $z_t^{\mathrm{IRS.V\text{-}ZERO}}$, the DM at time $t$ earns $\mathbb{E}\left[r_t(a_{1:t}, \omega) \mid \mathcal{F}_{t-1}(a_{1:t-1}, \omega)\right]$, the expected mean reward that she can infer from the observations prior to time $t$. As we defined $R_{a,n}$ to be the reward from the $n$th pull of arm $a$ (not the pull at time $n$), the posterior belief associated with each arm is determined only by the number of past pulls on that arm: from the $n$th pull of arm $a$, the DM earns $\hat{\mu}_{a,n-1}(\omega)$, irrespective of the detailed sequence of past actions. Following this observation, solving the inner problem $(\star)$ is equivalent to finding an optimal allocation $(n^*_1, n^*_2, \ldots, n^*_K)$ of the $T$ remaining opportunities: omitting $\omega$ for brevity, it reduces to

$$\max_{a_{1:T} \in \mathcal{A}^T} \sum_{t=1}^{T} \hat{\mu}_{a_t, n_{t-1}(a_{1:t-1}, a_t)} = \max_{a_{1:T} \in \mathcal{A}^T} \sum_{a \in \mathcal{A}} \sum_{n=1}^{n_T(a_{1:T}, a)} \hat{\mu}_{a,n-1} = \max_{n_{1:K} \in \mathcal{N}_T} \sum_{a=1}^{K} S_{a,n_a}, \quad (15)$$

where $S_{a,n} \equiv \sum_{m=1}^{n} \hat{\mu}_{a,m-1}$ is the cumulative payoff from the first $n$ pulls of arm $a$, and $\mathcal{N}_T \equiv \{(n_1, \ldots, n_K) \in \mathbb{Z}^K_+ : \sum_{a=1}^{K} n_a = T\}$ is the set of all feasible allocations. Once the $S_{a,n}$'s are computed, this inner problem can be solved within $O(KT^2)$ operations by sequentially applying sup-convolution $K$ times. The detailed implementation is provided in Appendix B.1. Given an optimal allocation $n^*$, the policy $\pi^{\mathrm{IRS.V\text{-}ZERO}}$ needs to select which arm to pull next. In principle, any arm $a$ that is included in the solution of the inner problem, $n^*_a > 0$, would be fine, but we suggest a selection rule in which the arm that needs the most pulls is chosen, i.e., $\operatorname{argmax}_a n^*_a$. A sketch of this allocation step is given below.
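The following is a hedged sketch of the allocation problem in (15): given precomputed cumulative payoffs $S_{a,n}$, it applies a sup-convolution arm by arm in $O(KT^2)$ time and backtracks an optimal allocation. It follows the description above rather than the exact implementation in Appendix B.1, and the function name is hypothetical.

```python
import numpy as np

def solve_vzero_allocation(S):
    """S: array of shape (K, T+1) with S[a, 0] = 0 and S[a, n] = sum_{m<=n} mu_hat_{a,m-1}.
    Returns (optimal value, allocation n* with sum(n*) = T)."""
    K, T = S.shape[0], S.shape[1] - 1
    # best[t] = max total payoff using exactly t pulls on the arms processed so far.
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    choice = np.zeros((K, T + 1), dtype=int)          # pulls given to arm a at budget t
    for a in range(K):                                 # sup-convolution with S[a, .]
        new_best = np.full(T + 1, -np.inf)
        for t in range(T + 1):
            for n in range(t + 1):                     # pulls assigned to arm a
                v = best[t - n] + S[a, n]
                if v > new_best[t]:
                    new_best[t], choice[a, t] = v, n
        best = new_best
    # Backtrack the allocation from the full budget T.
    alloc, t = np.zeros(K, dtype=int), T
    for a in range(K - 1, -1, -1):
        alloc[a] = choice[a, t]
        t -= alloc[a]
    return best[T], alloc
```

The policy $\pi^{\mathrm{IRS.V\text{-}ZERO}}$ would then pull the arm with the largest entry of the returned allocation.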
3.4 IRS.V-EMAX

Under the perfect information relaxation, the DM perfectly knows not only (i) what she will earn at future times but also (ii) how her belief will evolve as a result of her action sequence. The previous algorithms address the former component by making the DM adjust the future rewards by conditioning (e.g., $\mathbb{E}[r_t(a_t) \mid \theta]$, $\mathbb{E}[r_t(a_t) \mid \hat{\mu}_{1:K,T-1}]$ and $\mathbb{E}[r_t(a_t) \mid \mathcal{F}_{t-1}]$). IRS.V-EMAX also addresses the second component by charging an additional cost for using the information about future belief transitions. Specifically, the penalty function $z_t^{\mathrm{IRS.V\text{-}EMAX}}$ is obtained from $z_t^{\mathrm{ideal}}$ in (8) by replacing $V^*(T, y)$ with $W^{\mathrm{TS}}(T, y)$, which is a tractable alternative.

The use of $W^{\mathrm{TS}}(T, y)$ leads to a simple expression for its conditional expectation: since $\theta \mid \mathcal{F}_{t-1}$ is distributed according to $\mathcal{P}(y_{t-1})$, we have

$$\mathbb{E}\left[W^{\mathrm{TS}}(T-t, y_t) \mid \mathcal{F}_{t-1}\right] = (T-t)\, \mathbb{E}\left[\max_a \mu_a(\theta_a) \mid \mathcal{F}_{t-1}\right] \quad (16)$$
$$= (T-t)\, \mathbb{E}_{\theta \sim \mathcal{P}(y_{t-1})}\left[\max_a \mu_a(\theta_a)\right] = W^{\mathrm{TS}}(T-t, y_{t-1}). \quad (17)$$

We further observe that, given $\omega$, the future belief $y_t(a_{1:t}, \omega)$ depends only on how many times each arm has been pulled, irrespective of the sequence of the pulls, and hence the number of possible future beliefs is $O(T^K)$, not $O(K^T)$. Given the above observations, we can solve the inner problem within $O(KT^K)$ computations by dynamic programming (i.e., by finding a best action at each future belief while iterating over the beliefs in an appropriate order). See Appendix B.2 for details.

Remark 3 (Single-period optimality). When $T = 1$, $\pi^{\mathrm{IRS.FH}}$, $\pi^{\mathrm{IRS.V\text{-}ZERO}}$, and $\pi^{\mathrm{IRS.V\text{-}EMAX}}$ all take the optimal action, namely pulling the myopically best arm $a^* = \operatorname{argmax}_a \mathbb{E}[\mu_a(\theta_a)]$.

Proposition 1 (Asymptotic behavior). Assume that $\mu_i(\theta_i) \neq \mu_j(\theta_j)$ almost surely for any two distinct arms $i \neq j$. As $T \to \infty$, the distributions of IRS.FH's and IRS.V-ZERO's actions² converge to that of Thompson sampling: for all $a \in \mathcal{A}$,

$$\lim_{T \to \infty} \mathbb{P}\left[\mathrm{IRS.FH}(T, y) = a\right] = \lim_{T \to \infty} \mathbb{P}\left[\mathrm{IRS.V\text{-}ZERO}(T, y) = a\right] = \mathbb{P}\left[\mathrm{TS}(y) = a\right]. \quad (18)$$

Here $\mathrm{TS}(y)$, $\mathrm{IRS.FH}(T, y)$ and $\mathrm{IRS.V\text{-}ZERO}(T, y)$ denote the action taken by the policies $\pi^{\mathrm{TS}}$, $\pi^{\mathrm{IRS.FH}}$ and $\pi^{\mathrm{IRS.V\text{-}ZERO}}$, respectively, when the remaining time is $T$ and the prior belief is $y$. These are random variables, since each of these policies uses its own randomly sampled outcome $\tilde{\omega}$.

Remark 3 and Proposition 1 state that IRS.FH and IRS.V-ZERO behave like TS during the initial decision epochs, gradually shift toward the myopic scheme, and end up making the optimal decision; in contrast, TS continues to explore throughout. The transition from exploration to exploitation under these IRS policies occurs smoothly, without relying on an auxiliary control parameter. While maintaining their recursive structure, IRS policies take the horizon $T$ into account, and naturally balance exploitation and exploration.

² For IRS.V-ZERO, we assume the particular selection rule $a^* = \operatorname{argmax}_a n^*_a$.

Theorem 2 (Monotonicity in performance bounds). IRS.FH and IRS.V-ZERO monotonically improve the performance bound: for any $T$ and $y$,

$$W^{\mathrm{TS}}(T, y) \geq W^{\mathrm{IRS.FH}}(T, y) \geq W^{\mathrm{IRS.V\text{-}ZERO}}(T, y). \quad (19)$$

Note that $W^{\mathrm{TS}}(T, y) = \mathbb{E}_{\theta \sim \mathcal{P}(y)}\left[T \max_a \mu_a(\theta_a)\right]$ is the conventional benchmark. In addition, we have $W^{\mathrm{IRS.V\text{-}EMAX}} \geq W^{\mathrm{ideal}}$, since $W^{\mathrm{ideal}}$ is the lowest attainable upper bound (Theorem 1). Empirically, we also observe $W^{\mathrm{IRS.V\text{-}ZERO}} \geq W^{\mathrm{IRS.V\text{-}EMAX}}$.

Theorem 3 (Suboptimality gap). In the Beta-Bernoulli MAB, for any $T$ and $y$,

$$W^{\mathrm{TS}}(T, y) - V(\pi^{\mathrm{TS}}, T, y) \leq 3K + 2\sqrt{KT \log T},$$
$$W^{\mathrm{IRS.FH}}(T, y) - V(\pi^{\mathrm{IRS.FH}}, T, y) \leq 3K + 2\sqrt{KT \log T},$$
$$W^{\mathrm{IRS.V\text{-}ZERO}}(T, y) - V(\pi^{\mathrm{IRS.V\text{-}ZERO}}, T, y) \leq 2K + \sqrt{KT \log T}.$$

We do not have a theoretical guarantee of monotonicity in the actual performance $V(\pi^z, T, y)$ among IRS policies. Instead, Theorem 3 indirectly shows the improvement in the suboptimality gap $W^z(T, y) - V(\pi^z, T, y)$: although all the bounds have the same asymptotic order of $O(\sqrt{KT \log T})$, the IRS policies improve the leading coefficient or the additive term. Theorems 2 and 3 highlight that a better choice of penalty function $z_t$ leads to a tighter performance bound $W^z$ and a better performing policy $\pi^z$. Recall that the penalties are designed to penalize the gain from having additional future information.
While all IRS algorithms are basically optimistic, in the sense that the DM makes a decision believing that the informed outcome ($\omega$ or $\tilde{\omega}$) will be realized, a better penalty function prevents the DM from picking an action that is overly optimized to a particular future realization.

5 Numerical Experiment

We visualize the effectiveness of IRS policies and performance bounds in the case of a Gaussian MAB with five arms ($K = 5$) with different noise variances. More specifically, each arm $a \in \mathcal{A}$ has an unknown mean reward $\theta_a \sim \mathcal{N}(0, 1^2)$ and yields stochastic rewards $R_{a,n} \sim \mathcal{N}(\theta_a, \sigma^2_a)$, where $\sigma_1 = 0.1$, $\sigma_2 = 0.4$, $\sigma_3 = 1.0$, $\sigma_4 = 4.0$ and $\sigma_5 = 10.0$. Our experiment includes state-of-the-art algorithms that are particularly suitable in a Bayesian framework: Bayesian upper confidence bound [14] (BAYES-UCB, with a quantile of $1 - 1/t$), information-directed sampling [20] (IDS), and the optimistic Gittins index [9] (OGI, one-step look-ahead approximation with a discount factor $\gamma_t = 1 - 1/t$). In the simulation, we randomly generate a set of outcomes $\omega^{(1)}, \ldots, \omega^{(S)}$ and measure the performance of each policy $\pi$, $V(\pi, T, y)$, and the performance bounds $W^z(T, y)$, via sample average approximation across these sampled outcomes ($S = 20{,}000$). Figure 1 plots the regret of the policies (solid lines, $W^{\mathrm{TS}}(T, y) - V(\pi, T, y)$) and the regret bounds (dashed lines, $W^{\mathrm{TS}}(T, y) - W^z(T, y)$) measured at different values of $T = 5, 10, \ldots, 500$. Our regret measure

$$W^{\mathrm{TS}} - V(\pi) = \mathbb{E}\left[\sum_{t=1}^{T}\left( \max_a \mu_a(\theta_a) - \mu_{a^\pi_t}(\theta_{a^\pi_t}) \right)\right]$$

is equivalent to the conventional Bayesian regret [19], and the measure $W^{\mathrm{TS}} - W^z$ provides a lower bound on the achievable regret, since $W^{\mathrm{TS}} - V(\pi) \geq W^{\mathrm{TS}} - W^z$ for any policy $\pi \in \Pi_{\mathcal{F}}$ by weak duality. Despite the fact that we cannot compute the Bayesian optimal policy directly, we can infer where its regret curve is located within the shaded region of the plot. Note that lower regret curves are better, and higher bound curves are better. As we incorporate the more complicated IRS algorithms, from TS to IRS.V-ZERO, we observe a clear improvement in both the performances and the bounds, as predicted by Theorems 2 and 3. In this particular example, it is crucial to account for how much we can learn about each of the arms during the remaining time periods, which heavily depends on the noise level $\sigma_a$ and the time horizon $T$. Accordingly, the IRS policies outperform the others, since they more explicitly incorporate the exploration-exploitation trade-off.

[Figure 1: Bayesian regret $W^{\mathrm{TS}}(T, y) - V(\pi, T, y)$ versus time horizon $T$, for the Gaussian MAB ($K = 5$) with heteroscedastic noise.]

Figure 1: Regret plot for the Gaussian MAB with different noise variances. The solid lines represent the (Bayesian) regret of the policies, $W^{\mathrm{TS}}(T, y) - V(\pi, T, y)$, and the dashed lines represent the regret bounds that the IRS algorithms produce, $W^{\mathrm{TS}}(T, y) - W^z(T, y)$. The lowest achievable regret ($W^{\mathrm{TS}}(T, y) - V^*(T, y)$) must lie within the shaded area. The times in the legend represent the average time required to simulate each policy for a single problem instance with $T = 500$: TS (34 ms), Bayes-UCB (88 ms), IRS.FH (128 ms), OGI (829 ms), IDS (2.9 s), IRS.V-ZERO (7.4 s).
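As a small illustration of the sample average approximation used here, the snippet below estimates the benchmark $W^{\mathrm{TS}}(T, y) = \mathbb{E}[T \max_a \mu_a(\theta_a)]$ for this heteroscedastic Gaussian instance (for Gaussian arms, $\mu_a(\theta_a) = \theta_a$); subtracting a policy's simulated value from this estimate gives its Bayesian regret. The function name is hypothetical.

```python
import numpy as np

def estimate_wts(T, prior_means, prior_vars, num_samples=20_000, seed=0):
    """Monte Carlo estimate of W^TS(T, y) = E[T * max_a theta_a] under Normal priors."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(prior_means, np.sqrt(prior_vars),
                       size=(num_samples, len(prior_means)))
    return T * theta.max(axis=1).mean()

# Five arms with theta_a ~ N(0, 1), as in the experiment of Section 5.
print(estimate_wts(T=500, prior_means=np.zeros(5), prior_vars=np.ones(5)))
```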
6 Discussion

We have developed a unified framework providing a principled method of improving TS that does not require any tuning or additional parameters. Although this paper focuses on a finite-horizon MAB with independent arms, the general idea of information relaxation sampling is not restricted to this setting: we briefly illustrate how to extend the framework to a broader class of problems.

MAB with unknown time horizon. The framework (penalties, policies, and upper bounds) can naturally incorporate an unknown $T$ within the Bayesian setting: i.e., the horizon $T$ is also a random variable whose prior distribution is known. As a simple case, if $T$ is independent of the DM's actions, we can reformulate the objective function of the inner problem as $\sum_{t=1}^{\infty} \gamma_t \left(r_t(a_{1:t}, \omega) - z_t(a_{1:t}, \omega)\right)$, where the discount factor $\gamma_t \equiv \mathbb{P}[T \geq t]$ is the survival probability, and $r_t(\cdot)$ and $z_t(\cdot)$ are the reward and penalty terms used in the paper. Alternatively, we can treat the random variable $T$ like the random reward realizations: sample $\tilde{T}$ from its prior distribution while a penalty function (additionally) penalizes the gain from knowing $T$ (one can imagine that the outcome $\omega$ now includes the realization of $T$). Structural results such as weak duality and strong duality continue to hold.

MAB in more complicated settings. Consider the following examples. (i) A finite-horizon MAB with correlated arms (e.g., $R_{a,n} \sim \mathcal{N}(x_a^\top \theta, \sigma^2_a)$, where $\theta \in \mathbb{R}^d$ is shared across the arms and $x_a \in \mathbb{R}^d$ is an arm's feature vector): IRS.V-ZERO can be immediately implemented by adopting the DP algorithm discussed in Appendix B.2. (ii) MAB with delayed reward realizations: IRS.FH can be immediately implemented by simulating the DM's learning process in the presence of the delay. (iii) MAB with a budget constraint (in which each arm consumes a certain amount of budget and the DM wants to maximize the total reward within a limited budget; see [8]): all IRS algorithms can be implemented by solving a budget-constrained optimization problem instead of a horizon-constrained optimization problem. In these extensions, we can obtain not only the online decision-making policies but also their performance bounds, as in this paper. Generally speaking, our framework provides a systematic way of improving TS by taking the exploration-exploitation trade-off into account more carefully, particularly in the presence of constraints that force incomplete learning. The main challenge is to design a suitable penalty function that is tractable yet captures the problem-specific exploration-exploitation trade-off precisely.

References

[1] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.

[2] Russell N. Bradt, S. M. Johnson, and Samuel Karlin. On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics, 27(4):1060–1074, 1956.

[3] David B. Brown and Martin B. Haugh. Information relaxation bounds for infinite horizon Markov decision processes. Operations Research, 65(5):1355–1379, 2017.

[4] David B. Brown, James E. Smith, and Peng Sun. Information relaxations and duality in stochastic dynamic programs. Operations Research, 58(4):785–801, 2010.

[5] Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 638–646, 2013.

[6] M. H. A. Davis and I. Karatzas. A Deterministic Approach to Optimal Stopping. Wiley, 1994.

[7] Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi. Pathwise optimization for optimal stopping problems. Management Science, 58(12):2292–2308, 2012.
[8] Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. Proceedings of the 27th AAAI Conference on Artificial Intelligence, 2013.

[9] Vivek F. Farias and Eli Gutin. Optimistic Gittins indices. Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3161–3169, 2016.

[10] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41(2):148–177, 1979.

[11] Martin B. Haugh and Leonid Kogan. Pricing American options: A duality approach. Operations Research, 52(2):258–270, 2004.

[12] Martin B. Haugh and Andrew E. B. Lim. Linear-quadratic control and information relaxations. Operations Research Letters, 40:521–528, 2012.

[13] Martin B. Haugh and Chun Wang. Dynamic portfolio execution and information relaxations. SIAM Journal on Financial Mathematics, 5:316–359, 2014.

[14] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 22:592–600, 2012.

[15] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

[16] Olivier Marchal and Julyan Arbel. On the sub-Gaussianity of the Beta and Dirichlet distributions. 2017.

[17] R. T. Rockafellar and Roger J.-B. Wets. Scenarios and policy aggregation in optimization under uncertainty. Mathematics of Operations Research, 16(1):119–147, 1991.

[18] L. C. G. Rogers. Monte Carlo valuation of American options. Mathematical Finance, 12(3):271–286, 2002.

[19] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[20] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2017.

[21] W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.