Contextual Information-Directed Sampling

Botao Hao 1  Tor Lattimore 1  Chao Qin 2

Abstract

Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm (Lu et al., 2021). However, it remains unclear what the right form of the information ratio is to optimize when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more in actions that are beneficial for future unseen contexts, whereas conditional IDS can be myopic. We further propose a computationally efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.

1. Introduction

Information-directed sampling (IDS) (Russo & Van Roy, 2018) is a promising approach to balancing the exploration-exploitation tradeoff (Lu et al., 2021). Its theoretical properties have been systematically studied in a range of problems (Kirschner & Krause, 2018; Liu et al., 2018a; Kirschner et al., 2020a;b; Hao et al., 2021a). However, these analyses are limited to the fixed-action-set setting; one exception is Kirschner et al. (2020a), who extended IDS to contextual partial monitoring.

In this work, we study the IDS design for contextual bandits (Langford & Zhang, 2007), which can be viewed as a simplified case of full reinforcement learning. The context at each round is generated independently rather than determined by the actions. Contextual bandits are widely used in real-world applications such as recommender systems (Li et al., 2010).

A natural generalization of IDS to the contextual case is conditional IDS. Given the current context, conditional IDS acts as if it were facing a fixed action set. However, this design ignores the context distribution and can be myopic. We mainly address the question of what the right form of the information ratio is in the contextual setting and highlight the necessity of utilizing the context distribution.

*Equal contribution. 1DeepMind 2Columbia University. Correspondence to: Botao Hao. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Contributions. Our contribution is four-fold:

- We introduce a version of contextual IDS that takes the context distribution into consideration for contextual bandits with graph feedback and sparse linear contextual bandits.

- For contextual bandits with graph feedback, we prove that conditional IDS suffers an Ω(√(β(G)n)) Bayesian regret lower bound for a particular prior and graph, while contextual IDS achieves an Õ(min{√(β(G)n), δ(G)^{1/3} n^{2/3}}) Bayesian regret upper bound for any prior. Here, n is the time horizon, G is a directed feedback graph over the set of actions, β(G) is the independence number and δ(G) is the weak domination number of the graph. In the regime where β(G) ≥ (δ(G)² n)^{1/3}, contextual IDS achieves a better regret bound than conditional IDS.
- For sparse linear contextual bandits, we prove that conditional IDS suffers an Ω(√(nds)) Bayesian regret lower bound for a particular sparse prior, while contextual IDS achieves an Õ(min{√(nds), s n^{2/3}}) Bayesian regret upper bound for any sparse prior. Here, d is the feature dimension and s is the sparsity. In the data-poor regime where d ≥ s n^{1/3}, contextual IDS achieves a better regret bound than conditional IDS.

- We further propose a computationally efficient algorithm to approximate contextual IDS based on Actor-Critic (Konda & Tsitsiklis, 2000) and evaluate it empirically on a neural network contextual bandit.

2. Related works

Russo & Van Roy (2018) introduced IDS and derived Bayesian regret bounds for multi-armed bandits, linear bandits and combinatorial bandits. Kirschner & Krause (2018) and Kirschner et al. (2020a) investigated frequentist IDS for bandits with heteroscedastic noise and for partial monitoring. Kirschner et al. (2020b) proved the asymptotic optimality of frequentist IDS for linear bandits. Hao et al. (2021a) developed a class of information-theoretic Bayesian regret bounds for sparse linear bandits that nearly match existing lower bounds on a variety of problem instances. Recently, Lu et al. (2021) proposed a general reinforcement learning framework and used IDS as a key component to build a data-efficient agent. Arumugam & Van Roy (2021) investigated different versions of learning targets when designing IDS. However, no concrete regret bounds are provided in those works.

Graph feedback. Mannor & Shamir (2011) and Alon et al. (2015) gave a full characterization of online learning with graph feedback. Tossou et al. (2017) provided the first information-theoretic analysis of Thompson sampling for bandits with graph feedback, and Liu et al. (2018a) derived a Bayesian regret bound for IDS. Both bounds depend on the clique cover number, which can be much larger than the independence number. Liu et al. (2018b) further improved the analysis when the feedback graph is unknown and derived a Bayesian regret bound for a mixture policy in terms of the independence number. In contrast, our work is the first to consider contextual bandits with graph feedback, and it derives a nearly optimal regret bound in terms of both the independence number and the weak domination number for a single algorithm. Very recently, Dann et al. (2020) considered reinforcement learning with graph feedback. Their O(√n) upper bound depends on the mas-number rather than the independence number. We briefly summarize the comparison in Table 1.

Sparse linear bandits. For sparse linear bandits with a fixed action set, Abbasi-Yadkori et al. (2012) proposed an inefficient online-to-confidence-set conversion approach that achieves an Õ(√(sdn)) upper bound for an arbitrary action set. Lattimore et al. (2015) developed a selective explore-then-commit algorithm that only works when the action set is exactly the binary hypercube and derived an optimal O(s√n) upper bound. Hao et al. (2020) introduced the notion of an exploratory action set and proved a Θ(poly(s) n^{2/3}) minimax rate for the data-poor regime using an explore-then-commit algorithm. Note that this minimax lower bound automatically holds for the contextual setting when the context distribution is a Dirac on the hard problem instance. Hao et al. (2021b) extended this concept to the MDP setting. It has recently become popular to study the contextual setting.
These results cannot be reduced to our setting, since they rely on careful assumptions on the context distribution to achieve Õ(poly(s)√n) or Õ(poly(s) log(n)) regret bounds (Bastani & Bayati, 2020; Wang et al., 2018; Kim & Paik, 2019; Wang et al., 2020; Ren & Zhou, 2020; Oh et al., 2021), such that classical high-dimensional statistics can be used. However, minimax lower bounds are missing, so it is unclear whether those assumptions are fundamental. Another line of work has a polynomial dependency on the number of actions (Agarwal et al., 2014; Foster & Rakhlin, 2020; Simchi-Levi & Xu, 2020). We briefly summarize the comparison in Table 2.

Table 1. Comparisons with existing results on regret upper bounds and lower bounds for bandits with graph feedback. Here, β(G) is the independence number, δ(G) is the weak domination number and c(G) is the clique cover number of the graph. Note that β(G) ≤ c(G).

Upper bounds:
| Reference | Regret | Contextual? | Algorithm |
| Alon et al. (2015) | Õ(√(β(G)n)) | no | Exp3.G for strongly observable graphs |
| Alon et al. (2015) | Õ(δ(G)^{1/3} n^{2/3}) | no | Exp3.G for weakly observable graphs |
| Tossou et al. (2017) | Õ(√(c(G)n)) | no | Thompson sampling |
| Liu et al. (2018a) | Õ(√(c(G)n)) | no | fixed-action-set IDS |
| This paper | Õ(min{√(β(G)n), δ(G)^{1/3} n^{2/3}}) | yes | contextual IDS |

Minimax lower bounds:
| Alon et al. (2015) | Ω(√(β(G)n)) | no | strongly observable graphs |
| Alon et al. (2015) | Ω(δ(G)^{1/3} n^{2/3}) | no | weakly observable graphs |

Table 2. Comparisons with existing results on regret upper bounds and lower bounds for sparse linear bandits. Here, s is the sparsity, d is the feature dimension, n is the number of rounds, K is the number of arms, C_min is the minimum eigenvalue of the data matrix and τ is a problem-dependent parameter that may have a complicated form.

Upper bounds:
| Reference | Regret | Contextual? | Assumption | Algorithm |
| Abbasi-Yadkori et al. (2012) | Õ(√(sdn)) | yes | none | UCB |
| Sivakumar et al. (2020) | Õ(√(sdn)) | yes | adversarial + Gaussian noise | greedy |
| Bastani & Bayati (2020) | Õ(τ K s² (log n)²) | yes | compatibility condition | Lasso greedy |
| Lattimore et al. (2015) | Õ(s√n) | no | action set is hypercube | explore-then-commit |
| Hao et al. (2021a) | Õ(min{√(sdn), s n^{2/3}/C_min^{1/3}}) | no | none | fixed-action-set IDS |
| This paper | Õ(min{√(sdn), s n^{2/3}/C_min^{1/3}}) | yes | none | contextual IDS |

Minimax lower bound:
| Hao et al. (2020) | Ω(min{√(sdn), s^{1/3} n^{2/3}/C_min^{1/3}}) | no | N.A. | N.A. |
3. Problem Setting

We study the stochastic contextual bandit problem with a horizon of n rounds and a finite set of possible contexts S. For each context m ∈ S there is an action set A_m. The interaction protocol is as follows. First, the environment samples a sequence of independent contexts (s_t)_{t=1}^n from a distribution ξ over S. At the start of round t, the context s_t is revealed to the agent, who may use its observations and possibly an external source of randomness to choose an action A_t ∈ A_t := A_{s_t}. Then the agent receives an observation O_{t,A_t}, including an immediate reward Y_{t,A_t} as well as some side information. Here

Y_{t,a} = f(s_t, a, θ*) + η_{t,a},

where f is the reward function, θ* is the unknown parameter and η_{t,a} is 1-sub-Gaussian noise. We sometimes write Y_t = Y_{t,A_t} for short. We consider the Bayesian setting, in the sense that θ* is sampled from some prior distribution. Let A = ∪_{m∈S} A_m and let the history be F_t = {(s_1, A_1, O_1), ..., (s_{t-1}, A_{t-1}, O_{t-1})}. A policy π = (π_t)_{t∈N} is a sequence of (suitably measurable) deterministic mappings from the history F_t and the context space S to a probability distribution in P(A), with the constraint that for each context m, π_t(·|m) can only put mass on A_m. The Bayesian regret of a policy π is defined as

BR(n; π) = E[ Σ_{t=1}^n ( max_{a∈A_t} f(s_t, a, θ*) − f(s_t, A_t, θ*) ) ],   (3.1)

where the expectation is over the interaction sequence induced by the agent and the environment, the context distribution and the prior distribution over θ*.

Notations. Given a measure P and jointly distributed random variables X and Y, we let P_X denote the law of X and P_{X|Y} the conditional law of X given Y, so that P_{X|Y}(·) = P(X ∈ · | Y). The mutual information between X and Y is I(X; Y) = E[D_KL(P_{X|Y} ‖ P_X)], where D_KL is the relative entropy. We write P_t(·) = P(· | F_t) for the posterior measure, where P is the probability measure over θ* and the history, and E_t[·] = E[· | F_t]. Denote I_t(X; Y) = E_t[D_KL(P_{t,X|Y} ‖ P_{t,X})]. We write 1{·} for the indicator function. For a positive semidefinite matrix A, σ_min(A) is its minimum eigenvalue. Denote by P(A) the space of probability measures over a set A with the Borel σ-algebra.

4. Conditional IDS versus Contextual IDS

We introduce the formal definitions of conditional IDS and contextual IDS for general contextual bandits. Suppose the optimal action associated with context s_t is a*_t = argmax_{a∈A_t} f(s_t, a, θ*), which in the Bayesian setting is a random variable. We first define the one-step expected regret of taking action a ∈ A_t as

Δ_t(a | s_t) := E_t[ f(s_t, a*_t, θ*) − f(s_t, a, θ*) | s_t ],   (4.1)

and write Δ_t(s_t) ∈ R^{|A_t|} for the corresponding vector.

4.1. Conditional IDS

A natural generalization of the standard IDS design (Russo & Van Roy, 2018) to the contextual setting is conditional IDS. It has been investigated empirically by Lu et al. (2021) in the full reinforcement learning setting. Conditional on the current context, we define the information gain of taking action a with respect to the current optimal action a*_t as

I_t(a*_t; O_{t,a} | s_t) := E_t[ D_KL( P_t(a*_t ∈ · | s_t, O_{t,a}) ‖ P_t(a*_t ∈ · | s_t) ) | s_t ],

and write I_t(a*_t, s_t) ∈ R^{|A_t|} for the vector with [I_t(a*_t, s_t)]_a = I_t(a*_t; O_{t,a} | s_t). We introduce the (α, λ)-conditional information ratio (CIR) Γ^λ_{t,α} : P(A_t) → R as

Γ^λ_{t,α}(π(·|s_t)) = ( max{0, Δ_t(s_t)^⊤ π(·|s_t) − α} )^λ / ( I_t(a*_t, s_t)^⊤ π(·|s_t) ),   (4.2)

for some parameters α, λ > 0. Conditional IDS minimizes the (α, λ)-CIR to find a probability distribution over the current action set:

π_t(·|s_t) = argmin_{π(·|s_t) ∈ P(A_t)} Γ^λ_{t,α}(π(·|s_t)).

Remark 4.1. In comparison with the standard information ratio of Russo & Van Roy (2018), who specified λ = 2 and α = 0 in the non-contextual setting, we consider the generalized version introduced by Lattimore & György (2020). As observed by Lattimore & György (2020), the right value of λ depends on how the regret scales with the horizon. The parameter α is always chosen of order O(1/√n) for certain problems so that its contribution to the regret is negligible. The roles of α and λ will become clearer in Sections 5 and 6.
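For λ = 2 and α = 0, the minimizer of the CIR randomizes over at most two actions (Russo & Van Roy, 2018, Proposition 6), so the optimization above reduces to a search over action pairs. The following minimal sketch (Python/NumPy; the function name and grid resolution are illustrative, and the per-action regret and information-gain vectors are assumed to be estimated elsewhere, e.g., from posterior samples) makes this concrete.

```python
import numpy as np

def conditional_ids(delta, info, n_grid=1001):
    """Sketch of conditional IDS with alpha = 0, lambda = 2.

    Minimises Gamma(pi) = (delta . pi)^2 / (info . pi) over distributions
    pi on the current action set A_t.  Since an optimal pi is supported on
    at most two actions, it suffices to search over action pairs and, for
    each pair, over a grid of mixing weights.

    delta : (k,) one-step expected regret Delta_t(a | s_t) per action
    info  : (k,) information gain I_t(a_t*; O_{t,a} | s_t) per action
    """
    k = len(delta)
    q = np.linspace(0.0, 1.0, n_grid)
    best_val, best_pi = np.inf, np.full(k, 1.0 / k)
    for i in range(k):
        for j in range(k):
            d = q * delta[i] + (1.0 - q) * delta[j]   # mixed regret
            g = q * info[i] + (1.0 - q) * info[j]     # mixed information gain
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(g > 0, d ** 2 / g,
                                 np.where(d <= 0, 0.0, np.inf))
            m = int(np.argmin(ratio))
            if ratio[m] < best_val:
                best_val = ratio[m]
                best_pi = np.zeros(k)
                best_pi[i] += q[m]
                best_pi[j] += 1.0 - q[m]
    return best_pi, best_val
```

In practice, delta and info would be computed from the same posterior samples used by Thompson sampling, so this step adds little overhead per round.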
4.2. Contextual IDS

A possible limitation of conditional IDS is that the CIR does not take the context distribution into consideration. As we show later, this can result in sub-optimality in certain problems. Contextual IDS, first proposed by Kirschner et al. (2020a) for partial monitoring, is introduced to remedy this shortcoming. Let π* : S → A be the optimal policy such that for each m ∈ S, π*(m) = argmax_{a∈A_m} f(m, a, θ*), with ties broken arbitrarily. We define the information gain of taking action a with respect to π* as

I_t(π*; O_{t,a}) := E_t[ D_KL( P_t(π* ∈ · | O_{t,a}) ‖ P_t(π* ∈ ·) ) ],

and write I_t(π*) ∈ R^{|A_t|} for the corresponding vector. Define the policy class Π = {π : S → P(A) : supp(π(·|m)) ⊆ A_m}, where supp(·) is the support of a vector. We introduce the (α, λ)-marginal information ratio (MIR) Ψ^λ_{t,α} : Π → R as

Ψ^λ_{t,α}(π) = ( max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α} )^λ / ( E_{s_t}[I_t(π*)^⊤ π(·|s_t)] ),   (4.3)

where the expectation is taken with respect to the context distribution. Contextual IDS minimizes the (α, λ)-MIR to find a mapping from the context space to the action space:

π_t = argmin_{π∈Π} Ψ^λ_{t,α}(π).

When receiving context s_t at round t, the agent plays an action according to π_t(·|s_t). We highlight the key differences between conditional IDS and contextual IDS:

- Conditional IDS only optimizes a probability distribution for the current context, while contextual IDS optimizes a full mapping from the context space to the action space.

- Conditional IDS only seeks information about the current optimal action, while contextual IDS seeks information about the whole optimal policy.

Remark 4.2. We could also define the information gain in conditional IDS and contextual IDS with respect to the unknown parameter θ*. This can bring a certain computational advantage when approximating the information gain. However, θ* contains much more information to learn than the optimal action or the optimal policy, so the agent may suffer more regret.

4.3. Why can conditional IDS be myopic?

Conditional IDS myopically balances exploration and exploitation without taking the context distribution into consideration. This has the advantage of being simple to implement, but it can lead to both under- and over-exploration, as the following examples illustrate. In both examples there are two contexts arriving with equal probability.

Example 1 [UNDER-EXPLORATION]. Consider a noiseless case. Context set 1 contains k actions, of which one is optimal and the remaining k − 1 actions yield regret 1. Context set 2 contains a revealing action with regret 1 and one action with no regret. The revealing action provides an observation of the rewards of all k actions in context set 1. When context set 2 arrives, conditional IDS will never play the revealing action, since it incurs high immediate regret with no useful information for the current context set. However, this ignores the fact that the revealing action is informative for the unseen context set 1. Conditional IDS under-explores and suffers O(k) regret. In contrast, contextual IDS exploits the context distribution, plays the revealing action in context set 2, and suffers only O(1) regret (a toy simulation is given at the end of this subsection).

Example 2 [OVER-EXPLORATION]. Context set 1 contains a single revealing action (hence no regret). Context set 2 has k actions. The first is a revealing action with a (known) regret of Θ(√k ε), where ε = Θ(1/√n). Of the remaining actions, one is optimal (zero regret) and the others have regret ε, with the prior such that the identity of the optimal action is unknown. Contextual IDS avoids the revealing action in context set 2 because it understands that this information can be obtained more cheaply in context set 1; its regret is O(√n). Meanwhile, if the constants are tuned appropriately, conditional IDS will play the revealing action in context set 2 and suffer regret Ω(√(kn)).

Remark 4.3. A similar shortcoming also holds for the class of UCB and Thompson sampling algorithms, since they have no explicit way to encode information about the context distribution.
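To make Example 1 concrete, the toy simulation below hard-codes the qualitative behaviour of the two designs in the noiseless instance rather than solving the IDS optimization exactly: the conditional-IDS-like agent never pays for information it cannot use in the current context, while the contextual-IDS-like agent pays a one-off cost for the revealing action in context set 2. All names and the elimination order are illustrative.

```python
import numpy as np

def example1_regret(k, n, seed=0):
    """Toy, noiseless version of Example 1 (under-exploration)."""
    rng = np.random.default_rng(seed)
    optimal = rng.integers(k)             # unknown optimal arm of context set 1
    contexts = rng.integers(2, size=n)    # two contexts, equal probability

    # Conditional-IDS-like agent: in context 2 it plays the safe arm, so it
    # must identify the optimal arm of context 1 by trying candidates.
    reg_cond, untried, known = 0.0, list(range(k)), False
    for s in contexts:
        if s == 0 and not known:
            a = untried.pop(0)
            if a == optimal:
                known = True              # noiseless: one observation suffices
            else:
                reg_cond += 1.0

    # Contextual-IDS-like agent: the first time context 2 arrives it plays
    # the revealing arm (regret 1) and learns everything about context 1.
    reg_ctx, untried, known = 0.0, list(range(k)), False
    for s in contexts:
        if known:
            continue                      # plays optimally from now on
        if s == 1:
            reg_ctx += 1.0                # revealing arm
            known = True
        else:
            a = untried.pop(0)            # must guess in context 1 meanwhile
            if a == optimal:
                known = True
            else:
                reg_ctx += 1.0
    return reg_cond, reg_ctx

# e.g. example1_regret(k=100, n=2000) typically returns roughly (50, 1-2),
# illustrating the O(k) versus O(1) separation.
```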
4.4. Generic regret bound for contextual IDS

Before we step into specific examples, we first provide a generic bound for contextual IDS. Denote by I_{α,λ} the worst-case MIR, i.e., a constant such that Ψ^λ_{t,α}(π_t) ≤ I_{α,λ} almost surely for all t ∈ [n].

Theorem 4.4. Let π^MIR = (π_t)_{t∈[n]} be the contextual IDS policy such that π_t minimizes the (α, 2)-MIR at each round. Then the following regret upper bound holds:

BR(n; π^MIR) ≤ nα + inf_{λ≥2} 2^{1−2/λ} I_{α,λ}^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

The proof is deferred to Appendix A.1. An interesting consequence of Theorem 4.4 is that contextual IDS minimizing the λ = 2 ratio can adapt to any λ ≥ 2. This shows the power of contextual IDS to adapt to different information-regret structures.

5. Contextual Bandits with Graph Feedback

We consider a Bayesian formulation of contextual k-armed bandits with graph feedback, which generalizes the setup of Alon et al. (2015). Let G = (A, E) be a directed feedback graph over the set of actions A with |A| = k. For each i ∈ A, we define its in-neighborhood and out-neighborhood as N_in(i) = {j ∈ A : (j, i) ∈ E} and N_out(i) = {j ∈ A : (i, j) ∈ E}. The feedback graph G is fixed and known to the agent. Let the unknown parameter be θ* ∈ R^k. At each round t, the agent receives an observation characterized by G, in the sense that after taking action a ∈ A_t, the observation O_{t,a} contains the rewards Y_{t,a'} = θ*_{a'} + η_{t,a'} for every action a' ∈ N_out(a). We review two standard quantities describing the graph structure, used in Alon et al. (2015).

Definition 5.1 (Independence number). An independent set of a graph G = (A, E) is a subset C ⊆ A such that no two different i, j ∈ C are connected by an edge in E. The cardinality of a largest independent set is the independence number of G, denoted by β(G).

A vertex a is observable if N_in(a) ≠ ∅. A vertex is strongly observable if it has either a self-loop or incoming edges from all other vertices, and it is weakly observable if it is observable but not strongly observable. A graph G is observable if all its vertices are observable, strongly observable if all its vertices are strongly observable, and weakly observable if it is observable but not strongly observable.

Definition 5.2 (Weak domination number). In a directed graph G = (A, E) with a set of weakly observable vertices W ⊆ A, a weakly dominating set D ⊆ A is a set of vertices that dominates W, meaning that for any w ∈ W there exists d ∈ D such that w ∈ N_out(d). The weak domination number of G, denoted by δ(G), is the size of the smallest weakly dominating set.
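Both graph quantities are NP-hard to compute in general, but for the small graphs used in our examples a brute-force enumeration suffices. The sketch below (plain Python; the edge-list representation is an assumption of this sketch) follows Definitions 5.1 and 5.2 directly.

```python
from itertools import combinations

def independence_number(vertices, edges):
    """beta(G): largest set of vertices with no edge (in either direction)
    between two of its members.  Brute force; only for small graphs."""
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    for size in range(len(vertices), 0, -1):
        for cand in combinations(vertices, size):
            if all(frozenset((u, v)) not in edge_set
                   for u, v in combinations(cand, 2)):
                return size
    return 0

def weak_domination_number(vertices, edges):
    """delta(G): smallest set D whose out-neighbourhoods cover all weakly
    observable vertices.  Brute force; only for small graphs."""
    out = {v: {w for (u, w) in edges if u == v} for v in vertices}
    inc = {v: {u for (u, w) in edges if w == v} for v in vertices}
    strongly = {v for v in vertices
                if v in out[v] or inc[v] >= set(vertices) - {v}}
    weak = {v for v in vertices if inc[v] and v not in strongly}
    for size in range(0, len(vertices) + 1):
        for cand in combinations(vertices, size):
            covered = set().union(*(out[d] for d in cand)) if cand else set()
            if weak <= covered:
                return size
    return None
```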
5.1. Lower bound for conditional IDS

We first prove that conditional IDS suffers an Ω(√(nβ(G))) Bayesian regret lower bound for a particular prior, and we show later (Section 5.2) that the optimal rate for this prior can be much smaller when β(G) is very large.

Theorem 5.3. Let π^CIR be a conditional IDS policy. There exists a contextual bandit with graph feedback instance such that β(G) = k − 1 and

BR(n; π^CIR) ≥ c √( n(β(G) − 2) )

for an absolute constant c > 0.

Proof. We construct a hard problem instance that generalizes Example 1 in Section 4.3. Suppose {x_1, ..., x_k} ⊆ R^k is the set of basis vectors corresponding to the k arms. The first k − 1 arms form a standard multi-armed Gaussian bandit with unit variance. When the k-th arm is pulled, the agent always suffers a regret of 1 but obtains samples of the first k − 1 arms. In the language of graph feedback, the first k − 1 arms only have self-loops, while the out-neighborhood of the last arm contains all of the first k − 1 arms. One can verify that β(G) = k − 1.

There are two context sets and the arrival probability of each context is 1/2. Context set 1 consists of {x_{k−1}, x_k}, while context set 2 consists of {x_1, ..., x_{k−2}}. For each i ∈ {2, ..., k − 2}, let θ^(i) ∈ R^k with θ^(i)_j = 1{i = j} γ for j ∈ {2, ..., k − 2}, and θ^(i)_1 = 0, θ^(i)_{k−1} = 0, θ^(i)_k = γ − 1, where γ > 0 is a gap that will be chosen later. Assume the prior of θ* is uniform over {θ^(2), ..., θ^(k−2)}. We write E_i[·] = E_{θ^(i)}[·] for short, where the expectation is taken with respect to the measure on the sequence of outcomes (A_1, Y_1, ..., A_n, Y_n) determined by θ^(i). Writing x*_{s_t} for the optimal arm of context s_t, define the cumulative regret of policy π interacting with bandit θ^(i) as

R_{θ^(i)}(n; π) = Σ_{t=1}^n E_i[ ( ⟨x*_{s_t}, θ^(i)⟩ − Y_t ) 1(s_t = 1) ] + Σ_{t=1}^n E_i[ ( ⟨x*_{s_t}, θ^(i)⟩ − Y_t ) 1(s_t = 2) ],

such that BR(n; π) = (1/(k−3)) Σ_{i=2}^{k−2} R_{θ^(i)}(n; π).

Step 1. Fix a conditional IDS policy π. From the definition of conditional IDS in Eq. (4.2), when context set 1 arrives, conditional IDS will always pull x_{k−1} under this prior, since x_{k−1} is always the optimal arm for context set 1. This means conditional IDS suffers no regret for context set 1, which implies that for any i ∈ {2, ..., k − 2},

Σ_{t=1}^n E_i[ ( ⟨x*_{s_t}, θ^(i)⟩ − Y_t ) 1(s_t = 1) ] = 0.

Step 2. Define T_j(n) as the number of pulls of arm j over n rounds. On the other hand,

Σ_{t=1}^n E_i[ ( ⟨x*_{s_t}, θ^(i)⟩ − Y_t ) 1(s_t = 2) ] = E_i[ Σ_{j=1, j≠i}^{k−2} T_j(n) ] γ = E_i[ n_2 − T_i(n) ] γ,

where the first equality comes from the fact that the context sets 1 and 2 are disjoint and n_2 is the number of times that context 2 arrives over n rounds. Since the context is generated independently of the learning process, E_i[n_2] = n/2. Moreover, by Step 1 the policy never pulls the revealing arm x_k, so information about coordinate i can only come from direct pulls of arm i. By Pinsker's inequality and the divergence decomposition lemma (Lattimore & Szepesvári, 2020, Lemma 15.1), choosing γ of order √((k − 3)/n) ensures that the average (1/(k−3)) Σ_{i=2}^{k−2} E_i[T_i(n)] is at most a constant fraction of E_i[n_2], so that the averaged E_i[n_2 − T_i(n)] is of order n. Plugging this into

BR(n; π) = (1/(k−3)) Σ_{i=2}^{k−2} E_i[ n_2 − T_i(n) ] γ

yields the claimed Ω(√(n(β(G) − 2))) lower bound. This finishes the proof.

Remark 5.4. We can strengthen the lower bound so that it holds for any graph by replacing the standard multi-armed bandit instance in context set 2 with the hard instance of Alon et al. (2017, Theorem 5).

5.2. Upper bound achieved by contextual IDS

In this section, we derive a Bayesian regret upper bound achieved by contextual IDS. According to Theorem 4.4, it suffices to bound the worst-case MIR I_{α,λ} for λ = 2, 3, as well as the cumulative information gain. Let R_max be an almost sure upper bound on the maximum expected reward.

Lemma 5.5 (Squared information ratio). For any ϵ ∈ [0, 1] and a strongly observable graph G, the (2ϵR_max, 2)-MIR can be bounded as

I_{2ϵR_max, 2} ≤ (4R_max² + 4)/(1 − ϵ) · β(G) log( 4k²/(β(G)ϵ) ).

The proof is deferred to Appendix B.1; the idea is to bound the worst-case squared information ratio by the squared information ratio of Thompson sampling. Next we bound I_{α,λ} for λ = 3 in terms of an explorability constant.

Definition 5.6 (Explorability constant). We define the explorability constant as

ϑ(G, ξ) := max_{π: S→P(A)} min_{d∈A} E_{s∼ξ, a∼π(·|s)}[ 1{d ∈ N_out(a)} ].

Lemma 5.7 (Cubic information ratio). For any ϵ ∈ [0, 1] and a weakly observable graph G, the (2ϵR_max, 3)-MIR can be bounded as

I_{2ϵR_max, 3} ≤ C (R_max³ + R_max) / ϑ(G, ξ)

for an absolute constant C > 0. The proof is deferred to Appendix B.2; the idea is to bound the worst-case cubic information ratio by the cubic information ratio of a policy that mixes Thompson sampling with a policy maximizing the explorability constant.
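The explorability constant in Definition 5.6 is the value of a max-min of a linear function over a product of simplices, so for finite context and action sets it can be computed exactly as a small linear program. Below is a sketch assuming SciPy is available; the input format (an out-neighbourhood dictionary and per-context action lists) is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def graph_explorability(out_nbrs, action_sets, xi):
    """Compute theta(G, xi) of Definition 5.6 as an LP:
        maximise t  s.t.  sum_m xi[m] sum_{a in A_m} pi(a|m) 1{d in N_out(a)} >= t
                          for every vertex d, with each pi(.|m) a distribution on A_m.
    out_nbrs:    dict  a -> set of vertices observed when playing a
    action_sets: list of per-context action lists A_m
    xi:          list of context arrival probabilities
    """
    idx = {}
    for m, A in enumerate(action_sets):
        for a in A:
            idx[(m, a)] = len(idx)
    n_var = len(idx) + 1                        # pi-variables plus the slack t
    c = np.zeros(n_var); c[-1] = -1.0           # linprog minimises, so use -t
    A_ub, b_ub = [], []
    for d in sorted(out_nbrs):                  # t - coverage(d) <= 0
        row = np.zeros(n_var); row[-1] = 1.0
        for (m, a), j in idx.items():
            if d in out_nbrs[a]:
                row[j] -= xi[m]
        A_ub.append(row); b_ub.append(0.0)
    A_eq, b_eq = [], []
    for m, A in enumerate(action_sets):         # each pi(.|m) sums to one
        row = np.zeros(n_var)
        for a in A:
            row[idx[(m, a)]] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    bounds = [(0, 1)] * len(idx) + [(0, 1)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun                             # res.x[:-1] holds the maximising policy
```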
Lemma 5.8. The cumulative information gain can be bounded as

E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ H(π*) ≤ M log(k),

where H(·) is the entropy and M is the number of available contexts. The proof is deferred to Appendix B.3.

Combining the results in Lemmas 5.5-5.8 with the generic regret bound in Theorem 4.4, we obtain the following theorem.

Theorem 5.9 (Regret bound for graph contextual IDS). Suppose π^MIR = (π_t)_{t∈N}, where π_t minimizes the (2R_max/√n, 2)-MIR at each round. If G is strongly observable, we have

BR(n; π^MIR) ≤ C R_max min{ √( β(G) log(4k²√n) · M log(k) · n ), ( M log(k)/ϑ(G, ξ) )^{1/3} n^{2/3} },

where C is an absolute constant.

Ignoring constant and logarithmic terms, the Bayesian regret upper bound in Theorem 5.9 simplifies to

Õ( min{ √(β(G) M n), (M/ϑ(G, ξ))^{1/3} n^{2/3} } ).

When the independence number is large enough that β(G) ≥ ϑ(G, ξ)^{−2/3} (n/M)^{1/3}, this bound is dominated by Õ((M/ϑ(G, ξ))^{1/3} n^{2/3}), which is independent of the independence number. Together with Theorem 5.3, we can argue that conditional IDS is sub-optimal compared with contextual IDS in this regime.

Remark 5.10 (Connection between ϑ(G, ξ) and δ(G)). Let D ⊆ A be the smallest weakly dominating set of the full graph, with |D| = δ(G). Consider a policy µ such that µ(·|s_t) is uniform over D ∩ A_{s_t}. Then

min_{d∈A} E_{s∼ξ, a∼µ(·|s)}[ 1{d ∈ N_out(a)} ] ≥ 1/(M δ(G)),

which implies 1/ϑ(G, ξ) ≤ M δ(G). When M = 1, Alon et al. (2015) demonstrated that the minimax optimal rate for weakly observable graphs is Θ̃(δ(G)^{1/3} n^{2/3}), which implies that 1/ϑ(G) coincides with δ(G) up to logarithmic factors.

Remark 5.11 (Open problem). For the fixed-action-set setting, Alon et al. (2015) proved that the independence number and the weak domination number are the fundamental quantities describing the graph structure. However, our regret upper bound contains the number of contexts M. It would be interesting to investigate whether M can be removed for contextual bandits with graph feedback. Note that one can also bound H(π*) by H(θ*), which scales with k rather than M, so that the dependency on M does not appear; but it is unclear what the right dependency on M and k is for the optimal regret rate in the contextual setting.

6. Sparse Linear Contextual Bandits

We consider a Bayesian formulation of sparse linear contextual bandits. The notion of sparsity is defined through the parameter space

Θ = { θ ∈ R^d : Σ_{j=1}^d 1{θ_j ≠ 0} ≤ s, ‖θ‖_2 ≤ 1 }.

We assume s is known; this can be relaxed by putting a prior on it. We consider the case where A_m = A for all m ∈ S. Suppose ϕ : S × A → R^d is a feature map known to the agent. In sparse linear contextual bandits, the reward function is f(s_t, a, θ*) = ϕ(s_t, a)^⊤ θ*, so that the reward of taking action a is

Y_{t,a} = ϕ(s_t, a)^⊤ θ* + η_{t,a},

where θ* is a random variable taking values in Θ whose prior distribution we denote by ρ. We first define a quantity describing the explorability of the feature set.

Definition 6.1 (Explorability constant). We define the explorability constant as

C_min(ϕ, ξ) := max_{π: S→P(A)} σ_min( E_{s∼ξ} E_{a∼π(·|s)}[ ϕ(s, a) ϕ(s, a)^⊤ ] ).

Remark 6.2. The explorability constant is a problem-dependent constant that was previously introduced by Hao et al. (2020; 2021a) for non-contextual sparse linear bandits. For an easy instance where {ϕ(s_t, a)}_{a∈A} is a full hypercube for every s_t ∈ S, C_min(ϕ, ξ) can be as large as 1, achieved by the policy that samples uniformly from the corners of the hypercube no matter which context arrives.
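The explorability constant of Definition 6.1 is a concave maximization over a product of simplices, since the minimum eigenvalue of a linear function of π is concave in π. The sketch below approximates it by exponentiated-gradient (mirror) ascent, using the fact that ξ(m)(u^⊤ϕ(m, a))² is a supergradient coordinate when u is a unit eigenvector for the smallest eigenvalue of the expected design matrix; the array shapes, step size and iteration count are illustrative choices, not part of the paper.

```python
import numpy as np

def explorability_constant(phi, xi, n_iters=2000, lr=0.5):
    """Approximate C_min(phi, xi) = max_pi lambda_min(E[phi phi^T]).

    phi: (M, K, d) array of features phi(m, a) per context m and action a
    xi:  (M,) context distribution
    """
    M, K, d = phi.shape
    pi = np.full((M, K), 1.0 / K)                 # uniform initialisation
    for _ in range(n_iters):
        # expected design matrix under the current policy
        V = np.einsum('m,mk,mkd,mke->de', xi, pi, phi, phi)
        w, U = np.linalg.eigh(V)
        u = U[:, 0]                                # eigenvector of lambda_min
        grad = xi[:, None] * (phi @ u) ** 2        # supergradient, shape (M, K)
        pi *= np.exp(lr * grad)                    # exponentiated-gradient step
        pi /= pi.sum(axis=1, keepdims=True)        # stay on the simplices
    V = np.einsum('m,mk,mkd,mke->de', xi, pi, phi, phi)
    return np.linalg.eigvalsh(V)[0], pi
```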
6.1. Lower bound for conditional IDS

We prove an algorithm-dependent lower bound for a sparse linear contextual bandit instance.

Theorem 6.3. Let π^CIR be a conditional IDS policy. There exists a sparse linear contextual bandit instance whose explorability constant is 1/2 such that

BR(n; π^CIR) ≥ c √(nds)

for an absolute constant c > 0.

How about the principle of optimism? Hao et al. (2021a, Section 4) showed that even in the non-contextual case, optimism-based algorithms such as UCB or Thompson sampling fail to optimally balance information and regret, resulting in sub-optimal regret bounds.

6.2. Upper bound achieved by contextual IDS

In this section, we prove that contextual IDS can achieve a nearly dimension-free regret bound. We say that the feature set has sparse optimal actions if the feature vector of the optimal action is s-sparse for each context, almost surely with respect to the prior.

Lemma 6.4 (Bounds on the information ratio). We have I_{0,2} ≤ d/2. When the feature set has sparse optimal actions, we also have I_{0,3} ≤ s²/(4 C_min(ϕ, ξ)).

From the definition π*(m) = argmax_{a∈A_m} ϕ(m, a)^⊤ θ*, we know that π* is a deterministic function of θ*. By the data processing inequality, I(π*; F_{n+1}) ≤ I(θ*; F_{n+1}). Combining this with Lemma 5.8 of Hao et al. (2021a), we obtain the following lemma.

Lemma 6.5. The cumulative information gain can be bounded as

E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ 2s log(c_1 d n^{1/2}/s),

for some constant c_1 > 0.

Combining Lemmas 6.4 and 6.5 with the generic regret bound in Theorem 4.4, we obtain the following regret bound.

Theorem 6.6 (Regret bound for sparse contextual IDS). Suppose π^MIR = (π_t)_{t∈N} with π_t = argmin_{π∈Π} Ψ²_{t,0}(π). When the feature set has sparse optimal actions, the following regret bound holds:

BR(n; π^MIR) ≤ C min{ √( n d s log(d n^{1/2}/s) ), s n^{2/3} (log(c_1 d n^{1/2}/s))^{1/3} (C_min(ϕ, ξ))^{−1/3} },

for an absolute constant C and some constant c_1 > 0.

In the data-poor regime where d ≥ n^{1/3} s / C_min(ϕ, ξ)^{2/3}, contextual IDS achieves an Õ(s n^{2/3}) regret bound, which is tighter than the lower bound for conditional IDS. This rate matches the minimax lower bound derived by Hao et al. (2020) up to an s factor in the data-poor regime.

7. Practical Algorithm

As shown in Section 4.1, conditional IDS only needs to optimize over probability distributions. As proved by Russo & Van Roy (2018, Proposition 6), it suffices for conditional IDS to randomize over two actions at each round, and the computational complexity was further improved to O(|A|) by Kirschner (2021, Lemma 2.7). However, contextual IDS as defined in Section 4.2 needs to optimize over a mapping from the context space to the action space at each round, which in general is much more computationally demanding.

To obtain a practical and scalable algorithm, we approximate contextual IDS using Actor-Critic (Konda & Tsitsiklis, 2000). We parametrize the policy class by a neural network and optimize the information ratio by multiple steps of stochastic gradient descent. Consider a parametrized policy class Π_θ = {π_θ : S → P(A)}. The policy π_θ, which is a conditional probability distribution, is parameterized by a neural network that maps (deterministically) from a context s to a probability distribution over A. We further parametrize the critic, which is the reward function, by a value network Q_θ. To avoid additional approximation errors, we assume we can sample from the true context distribution. At each round, the parametrized contextual IDS minimizes the following empirical MIR loss through SGD:

( Σ_{i=1}^w Δ_t(s_t^(i))^⊤ π(·|s_t^(i)) )² / ( Σ_{i=1}^w I_t(π*)^⊤ π(·|s_t^(i)) ),   (7.1)

where {s_t^(1), ..., s_t^(w)} are independent samples of contexts. We use Epistemic Neural Networks (ENN) (Osband et al., 2021a) to quantify the posterior uncertainty of the value network. For each s_t^(i), following the procedure in Sections 6.3.3 and 6.3.4 of Lu et al. (2021), we can approximate the one-step expected regret and the information gain efficiently using the samples output by the ENN.
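A minimal PyTorch sketch of the actor update for the empirical MIR loss in Eq. (7.1). The per-action regret and information-gain estimates are assumed to be produced from ENN posterior samples as in Lu et al. (2021) and are treated as constants when differentiating; the network sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a context vector to a probability distribution over |A| actions."""
    def __init__(self, d_context, n_actions, hidden=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_context, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):                        # s: (w, d_context)
        return torch.softmax(self.net(s), dim=-1)

def mir_loss(policy, contexts, regret_est, info_est, alpha=0.0, eps=1e-8):
    """Empirical (alpha, 2)-MIR of Eq. (7.1).

    contexts:   (w, d_context) i.i.d. samples from the context distribution
    regret_est: (w, n_actions) per-action one-step regret estimates
    info_est:   (w, n_actions) per-action information-gain estimates
    """
    pi = policy(contexts)
    regret = (pi * regret_est).sum(dim=-1).sum()     # numerator sum over samples
    info = (pi * info_est).sum(dim=-1).sum()         # denominator sum over samples
    return torch.clamp(regret - alpha, min=0.0) ** 2 / (info + eps)

# One actor step (several such SGD updates are taken per round):
# opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
# loss = mir_loss(policy, contexts, regret_est, info_est)
# opt.zero_grad(); loss.backward(); opt.step()
```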
8. Experiment

We conduct preliminary experiments to evaluate the parametrized contextual IDS on a neural network contextual bandit. At each round t, the environment independently generates an observation in the form of a d-dimensional contextual vector x_t from some distribution. Let f_θ : R^d → R^{|A|} be a neural network link function. When the agent takes action a, she will receive a reward of the form Y_{t,a} = [f_{θ*}(x_t)]_a + η_{t,a}. This is the non-shared-parameter formulation of contextual bandits, which means each arm has its own parameters. We set the generative model f_{θ*} to be a 2-hidden-layer ReLU MLP with 10 hidden neurons. The number of actions is 5. The contextual vector x_t ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution.

We compare contextual IDS with conditional IDS and Thompson sampling. For a fair comparison, we use the same ENN architecture to obtain posterior samples for all three agents. As reported by Osband et al. (2021b), ensemble sampling with a randomized prior function tends to be the ENN that best balances computation and accuracy, so we use 10 ensembles in our experiment. With 200 posterior samples, we use the same procedure described by Lu et al. (2021) to approximate the one-step regret and information gain for both conditional IDS and contextual IDS. We sample 20 independent contextual vectors at each round. Both the policy network and the value network are 2-hidden-layer ReLU MLPs with 20 hidden neurons, optimized by Adam with learning rate 0.001. We report our results in Figure 1, where parametrized contextual IDS achieves reasonably good regret.

Figure 1. Cumulative regret for a neural network contextual bandit.

9. Discussion and Future Work

In this work, we investigate the right form of the information ratio for contextual bandits and emphasize the importance of utilizing context-distribution information, through contextual bandits with graph feedback and sparse linear contextual bandits. For linear contextual bandits with a moderately small number of actions, one direction for future work is to see whether contextual IDS can achieve an O(√(dn log(k))) regret bound.

References

Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1-9, 2012.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638-1646. PMLR, 2014.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pp. 23-35. PMLR, 2015.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785-1826, 2017.

Arumugam, D. and Van Roy, B. The value of information when deciding what to learn. Advances in Neural Information Processing Systems, 34, 2021.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294, 2020.

Dann, C., Mansour, Y., Mohri, M., Sekhari, A., and Sridharan, K. Reinforcement learning with feedback graphs. Advances in Neural Information Processing Systems, 33, 2020.

Foster, D. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199-3210. PMLR, 2020.

Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits.
Advances in Neural Information Processing Systems, 33:10753 10763, 2020. Hao, B., Lattimore, T., and Deng, W. Information directed sampling for sparse linear bandits. Advances in Neural Information Processing Systems, 34, 2021a. Hao, B., Lattimore, T., Szepesv ari, C., and Wang, M. Online sparse reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 316 324. PMLR, 2021b. Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, pp. 5869 5879, 2019. Contextual Information-Directed Sampling Kirschner, J. Information-Directed Sampling-Frequentist Analysis and Applications. Ph D thesis, ETH Zurich, 2021. Kirschner, J. and Krause, A. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp. 358 384. PMLR, 2018. Kirschner, J., Lattimore, T., and Krause, A. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pp. 2328 2369. PMLR, 2020a. Kirschner, J., Lattimore, T., Vernade, C., and Szepesv ari, C. Asymptotically optimal information-directed sampling. ar Xiv preprint ar Xiv:2011.05944, 2020b. Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008 1014, 2000. Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96 1, 2007. Lattimore, T. and Gy orgy, A. Mirror descent and the information ratio. ar Xiv preprint ar Xiv:2009.12228, 2020. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Lattimore, T., Crammer, K., and Szepesv ari, C. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pp. 964 972, 2015. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Liu, F., Buccapatnam, S., and Shroff, N. Information directed sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018a. Liu, F., Zheng, Z., and Shroff, N. Analysis of thompson sampling for graphical bandits without the graphs. ar Xiv preprint ar Xiv:1805.08930, 2018b. Lu, X., Van Roy, B., Dwaracherla, V., Ibrahimi, M., Osband, I., and Wen, Z. Reinforcement learning, bit by bit. ar Xiv preprint ar Xiv:2103.04047, 2021. Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24:684 692, 2011. Oh, M.-h., Iyengar, G., and Zeevi, A. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, pp. 8271 8280. PMLR, 2021. Osband, I., Wen, Z., Asghari, M., Ibrahimi, M., Lu, X., and Van Roy, B. Epistemic neural networks. ar Xiv preprint ar Xiv:2107.08924, 2021a. Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Hao, B., Ibrahimi, M., Lawson, D., Lu, X., O Donoghue, B., and Van Roy, B. Evaluating predictive distributions: Does bayesian deep learning work? ar Xiv preprint ar Xiv:2110.04629, 2021b. Ren, Z. and Zhou, Z. Dynamic batch learning in highdimensional sparse linear contextual bandits. ar Xiv preprint ar Xiv:2008.11918, 2020. Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221 1243, 2014. Russo, D. 
and Van Roy, B. Learning to optimize via information-directed sampling. Operations Research, 66(1):230 252, 2018. Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN, 2020. Sivakumar, V., Wu, Z. S., and Banerjee, A. Structured linear contextual bandits: A sharp and geometric smoothed analysis. International Conference on Machine Learning, 2020. Tossou, A. C., Dimitrakakis, C., and Dubhashi, D. Thompson sampling for stochastic bandits with graph feedback. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. Wang, X., Wei, M., and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pp. 5200 5208, 2018. Wang, Y., Chen, Y., Fang, E. X., Wang, Z., and Li, R. Nearly dimension-independent sparse linear bandit over small action spaces via best subset selection. ar Xiv preprint ar Xiv:2009.02003, 2020. Contextual Information-Directed Sampling A. General Results A.1. Proof of Theorem 4.4 Proof. From the definitions in Eqs. (3.1) and (4.1), we could decompose the Bayesian regret bound of contextual IDS as BR(n; πMIR) = nα + E t=1 Est ( t(st) α) πt( |st) # Recall that at each round contextual IDS follows πt = argmin π:S P(A) Ψ2 t,α(π) = argmin π:S P(A) max 0, (Est[ t(st) π( |st)] α) 2 Est[It(π ) π( |st)] . Moreover, we define qt,λ = argmin π:S P(A) Ψλ t,α(π) = argmin π:S P(A) max 0, (Est[ t(st) π( |st)] α) λ Est[It(π ) π( |st)] . Suppose for a moment that Est[ t(st) πt( |st)] α > 0. Let M be the number of contexts. Then the derivative can be written as π( |m)Ψ2 t,α(πt) = 2pm t(m)(Est[ t(st) π( |st)] α) Est[It(π ) π( |st)] pm It(m) Est[ t(st) π( |st)] α) 2 (Est[It(π ) π( |st)])2 , where pm is the arrival probability for context m. Using the first-order optimality condition for each m [M], Est π( |st)Ψt,2(πt), qt,λ( |st) πt( |st) = m=1 pm π( |m)Ψ2 t,α(πt), qt,λ( |m) πt( |m) 0 . This implies 2Est t(st) (qt,λ( |st) πt( |st)) (Est[ t(st) πt( |st)] α) Est[It(π ) πt( |st)] Est It(π ) (qt,λ( |st) πt( |st)) Est[ t(st) πt( |st)] α) 2 (Est[It(π ) πt( |st)])2 . Since we assume Est[ t(st) (πt( |st)] α > 0 and the information gain is always non-negative, we have 2Est t(st) (qt,λ( |st) πt( |st)) Est It(π ) (qt,λ( |st) πt( |st)) Est[ t(st) πt( |st)] α) Est[It(π ) πt( |st)] . which implies 2 Est[ t(st) qt,λ( |st)] α Est[ t(st) πt( |st)] α 1 + Est[It(π ) qt,λ( |st)] Est[It(π ) πt( |st)] Est[ t(st) πt( |st)] α . Putting the above results together, (Est[ t(st) πt( |st)] α)λ Est[It(π ) π( |st)] = (Est[ t(st) πt( |st)] α)2(Est[ t(st) π( |st)] α)λ 2 Est[It(π ) π( |st)] 2λ 2(Est[ t(st) πt( |st)] α)2(Est[ t(st) qt,λ( |st)] α)λ 2 Est[It(π ) πt( |st)] 2λ 2(Est[ t(st) qt,λ( |st)] α)2(Est[ t(st) qt,λ( |st)] α)λ 2 Est[It(π ) qt,λ( |st)] = 2λ 2(Est[ t(st) qt,λ( |st)] α)λ Est[It(π ) qt,λ( |st)] . Contextual Information-Directed Sampling Recall that Iα,λ is the worst-case information ratio such that for any t [n], Ψλ t,α(πt) Iα,λ almost surely. Therefore, Est t(st) πt( |st) α 21 2/λ Est It(π ) πt( |st) 1/λ I1/λ α,λ , (A.2) which is obvious when Est[ t(st) (πt( |st)] α 0. Combining Eqs. (A.1) and (A.2) together and using Holder s inequality, we obtain BR(n; πIDS) nα + 21 2/λI1/λ α,λ n1 1/λE t=1 Est It(π ) πt( |st) #1/λ This ends the proof. B. Contextual Bandits with Graph Feedback B.1. Proof of Lemma 5.5 Proof. We bound the squared information ratio in terms of independence number β(G). 
Let πTS t ( |st) = Pt(a t = ) and consider a mixture policy πmix t = (1 ϵ)πTS t + ϵ/k with some mixing parameter ϵ [0, 1] that will be determined later. First, we will derive an upper bound for one-step regret. From the definition of one-step regret, t(st) πmix t ( |st) = X a At πmix t (a|st)Et Yt,a t Yt,a|st a At πTS t (a|st)Et Yt,a t Yt,a|st + ϵ X 1 k Et Yt,a t Yt,a|st . (B.1) Recall that Rmax is the almost surely upper bound of maximum expected reward and let dt(a, a ) = DKL(Pt(Yt,a |a t = a ||Pt(Yt,a ))). It is easy to see Yt,a is a p R2max + 1 sub-Gaussian random variable. For the first term of Eq. (B.1), according to Lemma 3 in Russo & Van Roy (2014), we have a At πTS t (a|st)Et Yt,a t Yt,a|st X a At πTS t (a|st) 2 dt(a, a) . (B.2) For the second term of Eq. (B.1), we directly bound it by 1 k Et Yt,a t Yt,a|st 2ϵRmax . (B.3) Putting Eqs. (B.1)-(B.3) together, we have t(st) πmix t ( |st) 2ϵRmax X a At πTS t (a|st) 2 dt(a, a) . Second, we will derive a lower bound for the information gain. From data processing inequality and conditional on st, It π ; (Yt,a )a Nout(a) It a t ; (Yt,a )a N out(a) . (B.4) Let E = (aij)1 i,j |A| be the adjacency matrix that represents the graph feedback structure G. Then aij = 1 if there exists an edge (i, j) E and aij = 0 otherwise. Since for any ai, aj N out(a), Yt,ai and Yt,aj are mutually independent, It a t ; (Yt,a )a N out(a) X a Nout(a) It(a t ; Yt,a ) = X a At E(a, a )I(a t ; Yt,a ) . Contextual Information-Directed Sampling It(a t ) πmix t ( |st) X a At It a t ; (Yt,a )a Nout(a) πmix t (a|st) a At E(a, a )It(a t ; Yt,a )πmix t (a|st) πmix t (a|st) + X a N in(a) πmix t (a |st) It(a t ; Yt,a) πmix t (a|st) + X a N in(a) πmix t (a |st) a At πTS t (a |st)dt(a, a ) πmix t (a|st) + X a N in(a) πmix t (a |st) πTS t (a|st)dt(a, a) πmix t (a|st) + P a Nin(a) πmix t (a |st) πTS t (a|st) πTS t (a|st)2dt(a, a) . Let s denote Ua,t = πmix t (a|st) + P a N in st(a) πmix t (a |st) πTS t (a|st) . Putting the above together, t(st) πmix t ( |st) 2ϵRmax a At Ua,tπTS t (a|st)2dt(a, a) It(a t ) πmix t ( |st) , where the first inequality is due to Cauchy Schwarz inequality. This implies t(st) πmix t ( |st) 2ϵRmax 2 It(a t ) πmix t ( |st) R2 max + 1 πTS t (a|st) πmix t (a|st) + P a Nin(a) πmix t (a |st) R2 max + 1 2(1 ϵ) πmix t (a|st) πmix t (a|st) + P a Nin(a) πmix t (a |st) , where we use πmix t (1 ϵ)πTS t . Using Jenson s inequality and Eq. (B.4), (Est[ t(st) πmix t ( |st)] 2ϵRmax)2 Est[It(π ) πmix t ( |st)] (Est[ t(st) πmix t ( |st)] 2ϵRmax)2 Est[It(a t ) πmix t ( |st)] ( t(st) πmix t ( |st) α)2 It(a t ) πmix t ( |st) " R2 max + 1 2(1 ϵ) πmix t (a|st) πmix t (a|st) + P a Nin(a) πmix t (a |st) Next we restate Lemma 5 from Alon et al. (2015) to bound the right hand side. Lemma B.1. Let G = (A, E) be a directed graph with |A| = k. Assume a distribution π over A such that π(i) η for all i A with some constant 0 < η < 0.5. Then we have X π(i) π(i) + P j Nin(i) π(j) 4β(G) log 4k β(G)η Contextual Information-Directed Sampling Using Lemma B.1 with η = ϵ/k, (Est[ t(st) πmix t ( |st)] 2ϵRmax)2 Est[It(π ) πmix t ( |st)] 2R2 max + 2 1 ϵ 2β(G) log 4k2 According to the definition of contextual IDS, this ends the proof. B.2. Proof of Lemma 5.7 Proof. We bound the worst-case marginal information ratio with λ = 3. Define an exploratory policy µ : S A such that µ = argmax π min d Es,a π( |s) 1 d N out(a) . Consider a mixture policy πmix t = (1 ϵ)πTS t + ϵµ for some mixing parameter ϵ [0, 1]. 
Since for any ai, aj N out(a), Yt,ai and Yt,aj are mutually independent, we have Est It(π ) πmix t ( |st) = Est,a πmix t ( |st) It π ; (Yt,a )a Nout(a) Est,a πmix t ( |st) It a t ; (Yt,a )a Nout(a) Est,a πmix t ( |st) a Nout(a) It (a t ; Yt,a ) where the first inequality is from data processing inequality. From the definition of the mixture policy, πmix(a|st) ϵµ(a|st) for any a At. Then we have Est It(π ) πmix t ( |st) ϵEst,a µ( |st) a Nout(a) It (a t ; Yt,a ) ϵEst,a µ( |st) a Nout(a) It (a t ; Yt,a ) 1 a N out(a) ϵEst,a µ( |st) a At It (a t ; Yt,a ) 1 a N out(a) # Then we have Est It(π ) πmix t ( |st) ϵEst,a µ( |st) a At It(a t ; Yt,a ) min a Est,a µ( |st) 1 a N out(a) ϵϑ(G, ξ)Est a At It(a t ; Yt,a ) R2max + 1Est a At Pt(a t = a) X a At (Et[Yt,a |a t = a] Et[Yt,a ])2 # where the last inequality is from Pinsker s inequality. On the other hand, by the definition of mixture policy and Jenson s inequality, Est[ t(st) πmix t ( |st)] (1 ϵ)Est,a πTS t ( |st)Et[Yt,a t Yt,a] + 2ϵRmax Est,a πTS t ( |st) [Et[Yt,a|a t = a] Et[Yt,a]] + 2ϵRmax v u u t Est a At Pt(a t = a) (Et[Yt,a|a t = a] Et[Yt,a])2 # Contextual Information-Directed Sampling Combining with the lower bound of the information gain in Eq. (B.5), Est[ t(st) πmix t ( |st)] 2ϵϑ(G, ξ) Est It(π ) πmix t ( |st) + 2ϵRmax . By choosing ϵ = R2 max + 1 1 8ϑ(G, ξ)E It(π ) πmix t ( |st) 1/3 , Est[ t(st) πmix t ( |st)] 2Rmax 1 8ϑ(G, ξ)Est It(π ) πmix t ( |st) 1/3 . This implies Est[ t(st) πmix t ( |st)] 3 Est It(π ) πmix t ( |st) R3 max + Rmax This ends the proof. B.3. Proof of Lemma 5.8 Proof. According the the definition of cumulative information gain, t=1 Est It(π ) πt( |st) # = Et,st[It(π ; Ot|At, st)] . (B.6) It(π ; (st, At, Ot)) = Et,st[It(π ; Ot|At, st)] + Et,st[It(π ; At|st)] + It(π ; st) = Et,st[It(π ; Ot|At, st)] . By the chain rule, I(π ; Fn+1) = t=1 E [It(π ; (st, At, Ot))] = t=1 E Est It(π ) πt( |st) . On the other hand, I(π ; Fn+1) H(π ) M log(k) , where H( ) is the entropy. Putting the above together, t=1 Est It(π ) πt( |st) # This ends the proof. Remark B.2. We analyze the upper bound H(π ) under the following examples. Denote the number of contexts by M, which could be infinite. 1. The possible choices of π is at most k M, and thus H(π ) log(k M) = M log(k). When M = , this bound is vacuous. 2. We can divide the context space into fixed and disjoint clusters such that for each parameter in the parameter space, the optimal policy maps all the contexts in a single cluster to a single action. For example, a naive partition is that each cluster contains a single context, and in this case, the number of clusters denoted equals the number of contexts M. Let P be the minimum number of such clusters, so P M. The possible choice of π is k P , and thus H(π ) log(k P ) = P log(k). When M = but P < , we obtain a valid upper bound instead of . Contextual Information-Directed Sampling B.4. Proof of Theorem 5.9 Proof. According to Theorem 4.4, we have BR(n; πMIR) 2Rmax n + inf λ 2 21 2/λI1/λ 2Rmax/ n,λn1 1/λE t=1 E It(π ) πt( |st) #1/λ Using Lemma 5.5 with ϵ = 1/ n, we have I2Rmax/ n,2 2R2 max + 2 1 1/ n 2β(G) log 4k2 n 16(R2 max + 1)β(G) log 4k2 n Combining Lemmas 5.7-5.8 together, we have BR(n; πMIR) 2Rmax n+ 16(R2 max + 1)β(G) log 4k2 n n M log(k) 1/2 , R3 max + Rmax ϑ(G, ξ) 2M log(k) 1 β(G) log 4k2 n n M log(k), 2M log(k) for some absolute constant C > 0. This ends the proof. C. Sparse Linear Contextual Bandits C.1. Proof of Theorem 6.3 Proof. We assume there are two available contexts. Context set 1 consists of a single action x0 = (0, . . . 
, 0) Rd+1 and an informative action set H as follows: H = n x Rd+1 xj { κ, κ} for j [d], xd = 1 o , (C.1) where 0 < κ 1 is a constant. Let d = sp for some integer p 2. Context set 2 consists of a multi-task bandit action set A = {({ei Rp : i [p]}s, 0)} Rd+1 whose element s last coordinate is always 0. The arriving probability of each context is 1/2. One can verify for this feature set, the explorability constant Cmin(ϕ, ξ) = 1/2 achieved by a policy that uniformly samples from H when context set 1 arrives. Fix a conditional IDS policy πCIR. Let > 0 and Θ = { ei : i [p]} Rp. Given θ {(Θs, 1)} Rd+1 and i [s], let θ(i) Rp be defined by θ(i) k = θ(i 1)s+k, which means that θ = [θ(1) , . . . , θ(s) , 1] . Assume the prior of θ is uniformly distributed over {(Θs, 1)}. Define the cumulative regret of policy π interacting with bandit θ as t=1 Eθ x st, θ Yt t=1 Eθ x st, θ Yt 1(st = 1) + t=1 Eθ x st, θ Yt 1(st = 2) , where we write x st = ϕ(st, a t ) for short. Therefore, BR(n; π) = 1 |Θ|s X θ {(Θs,1)} Rθ(n; π) . Contextual Information-Directed Sampling Note that when context set 1 arrives, the action from H suffers at least 1 s regret and thus since x0 is always the optimal action for context set 1. From the definition of conditional IDS in Eq. (4.2), when context set 1 is arriving and s < 1, conditional IDS will always pull x0 for this prior. That means conditional IDS will suffer no regret for context set 1 and implies for any θ {(Θs, 1)}, t=1 Eθ x st, θ Yt I(st = 1) = t=1 Eθ x 1, θ πCIR(i|1), θ |st = 1 P(st = 1) = 0 . It remains to bound the second term in Eq. (C.2). It essentially follows the proof of Theorem 24.3 in Lattimore & Szepesv ari (2020). From the proof of multi-task bandit lower bound, we have t=1 Eθ x st, θ Yt 1(st = 2) 1 This ends the proof. C.2. Proof of Lemma 6.4 Proof. If one can derive a worst-case bound of Ψλ t,α(eπ) for a particular policy eπ,, we can have an upper bound for Iα,λ automatically. The remaining step is to choose a proper policy eπ for λ = 2, 3 separately. First, we bound the information ratio with λ = 2. By the definition of mutual information, for any a At, we have It(a t ; Yt,a) = X a At Pt(a t = a )DKL (Pt(Yt,a = |a t = a )||Pt(Yt,a = )) . (C.3) Recall that Rmax is the upper bound of maximum expected reward. It is easy to see Yt,a is a p R2max + 1 sub-Gaussian random variable. According to Lemma 3 in Russo & Van Roy (2014), we have It(a t ; Yt,a) 2 R2max + 1 a At Pt(a t = a ) Et[Yt,a|a t = a ] Et[Yt,a] 2 . (C.4) The MIR of contextual IDS can be bounded by the MIR of Thompson sampling: I0,2 max t [n] max 0, E[ t(st) πTS t ( |st)] 2 E[It(π ) πTS t ( |st)] max t [n] max 0, E[ t(st) πTS t ( |st)] 2 E[It(a t ) πTS t ( |st)] max t [n] E ( t(st) πTS t ( |st))2 It(a t ) πTS t ( |st) where the first inequality is from data processing inequality and the second inequality is from Jenson s inequality. Using the matrix trace rank trick described in Proposition 5 in Russo & Van Roy (2014), we have I0,2 (R2 max + 1)d/2 in the end. Second, we bound the information ratio with λ = 3. Let s define an exploratory policy µ such that µ = argmax π:S P(A) σmin Es ξ h Ea π( |s)[ϕ(s, a)ϕ(s, a) ] i . (C.5) Consider a mixture policy πmix t = (1 ϵ)πTS t + ϵµ where the mixture rate ϵ 0 will be decided later. Step 1: Bound the information gain According the lower bound of information gain in Eq. 
(C.4), Est It(a t ) πmix t ( |st) 2 R2max + 1Est ξ,a πmix t ( |st) a At Pt(a t = a ) (Et[Yt,a|a t = a ] Et[Yt,a])2 # = 2 R2max + 1Est ξ,a πmix t ( |st) a At Pt(a t = a ) ϕ(st, a) Et[θ |a t = a ] ϕ(st, a) Et[θ ] 2 # Contextual Information-Directed Sampling By the definition of the mixture policy, we know that πmix t (a|st) ϵµ(a|st) for any a At. Then we have Est It(a t ) πmix t ( |st) 2ϵ R2max + 1 a At Pt(a t = a ) Est ξ,a µ( |st) h (Et[θ |a t = a ] Et[θ ]) ϕ(st, a)ϕ(st, a) (Et[θ |a t = a ] Et[θ ]) i . From the definition of µ in Eq. (C.5), we have Est It(a t ) πmix t ( |st) 2ϵ R2max + 1 a At Pt(a t = a )Cmin(ϕ, ξ) Et[θ |a t = a ] Et[θ ] 2 2 . Step 2: Bound the instant regret We decompose the regret by the contribution from the exploratory policy and the one from TS: Est t(st) πmix t ( |st) = (1 ϵ)Est a Pt(a t = a) Et ϕ(st, a) θ a t = a Et ϕ(st, a) θ # a Et h ϕ(st, a t ) θ ϕ(st, a) θ i µ(a|st) Since Rmax is the upper bound of maximum expected reward, the second term can be bounded 2Rmaxϵ. Using Jenson s inequality, the first term can be bounded by a Pt(a t = a) Et ϕ(st, a) θ a t = a Et ϕ(st, a) θ # v u u t Est a Pt(a t = a) Et ϕ(st, a) θ a t = a Et [ϕ(st, a) θ ] 2 # Since all the optimal actions are sparse, any action a with Pt(a t = a) > 0 must be sparse. Then we have ϕ(st, a) (Et[θ |a t = a] Et[θ ]) 2 s2 Et[θ |a t = a] Et[θ ] 2 for any action a with Pt(a t = a) > 0. This further implies a Pt(a t = a) Et ϕ(st, a) θ a t = a Et ϕ(st, a) θ # v u u t Est a Pt(a t = a)s2 Et[θ |a t = a] Et[θ ] 2 v u u ts2(R2max + 1) 2ϵCmin(ϕ, ξ) 2ϵCmin(ϕ, ξ) (R2max + 1) Est a At Pt(a t = a)s2 Et[θ |a t = a] Et[θ ] 2 s2(R2max + 1) 2ϵCmin(ϕ, ξ) E It(a t ) πmix t ( |st) . Putting Eq. (C.6) and (C.7) together, we have E t(st) πmix t ( |st) s2(R2max + 1) 2ϵCmin(ϕ, ξ) E It(a t ) πmix t ( |st) + 2Rmaxϵ . Contextual Information-Directed Sampling By optimizing the mixture rate ϵ, we have Est t(st) πmix t ( |st) 3 Est It(π ) πmix t ( |st) Est t(st) πmix t ( |st) 3 Est It(a t ) πmix t ( |st) s2(R2 max + 1) 8R2max Cmin s2 4Cmin(ϕ, ξ) . This ends the proof.