# Context-lumpable stochastic bandits

Chung-Wei Lee, University of Southern California, leechung@usc.edu
Qinghua Liu, Princeton University, qinghual@princeton.edu
Yasin Abbasi-Yadkori, Google DeepMind, yadkori@google.com
Chi Jin, Princeton University, chij@princeton.edu
Tor Lattimore, Google DeepMind, lattimore@google.com
Csaba Szepesvári, Google DeepMind and University of Alberta, szepesva@ualberta.ca

Most of this work was done while interning at Google DeepMind.

We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t = 1, 2, \ldots$ the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r \le \min\{S, K\}$ groups such that the mean reward for the various actions is the same for any two contexts in the same group, we give an algorithm that outputs an $\varepsilon$-optimal policy after using at most $\tilde O(r(S + K)/\varepsilon^2)$ samples with high probability, and we provide a matching $\Omega(r(S + K)/\varepsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\tilde O(\sqrt{r^3(S + K)T})$. To the best of our knowledge, we are the first to show near-optimal sample complexity in the PAC setting and $\tilde O(\sqrt{\mathrm{poly}(r)(S + K)T})$ minimax regret in the online setting for this problem. We also show that our algorithms can be applied to more general low-rank bandits and yield improved regret bounds in some scenarios.

1 Introduction

Consider a recommendation platform that interacts with a finite set of users in an online fashion. Users arrive at the platform and receive a recommendation. If they engage with the recommendation (e.g., they click), then the platform receives a reward; otherwise no reward is obtained. Assume that the users can be partitioned into a small number of groups such that users in the same group have the same preferences. We ask whether there exist algorithms that can take advantage of the lumpability of users into a few groups, even when the identity of the group a user belongs to is unknown and is only learnable because users share preferences with the other members of their group.

A slightly more general version of this problem can be formalized as follows. Viewing users as contexts and recommendations as actions (or arms), assume that there are $S$ contexts and $K$ actions. In round $t = 1, 2, \ldots$ the learner first receives a context $i_t$, sampled from an unknown distribution on the set $[S] := \{1, \ldots, S\}$ of possible contexts. The learner then chooses an action $j_t \in [K] := \{1, \ldots, K\}$ and observes a reward $y_t = A(i_t, j_t) + \eta_t$, where, given the past, $\eta_t$ has a subgaussian tail (precise definitions are postponed to Section 2) and $A : [S] \times [K] \to \mathbb{R}$ is an unknown function of mean rewards ($\mathbb{R}$ denotes the set of reals). We consider two settings, in which the goal of the learner is either to identify a near-optimal policy $\pi : [S] \to [K]$, or to keep its regret small. Policy $\pi$ is called $\varepsilon$-optimal if

$$\mathbb{E}[A(i_1, \pi(i_1))] \ge \max_{\pi'} \mathbb{E}[A(i_1, \pi'(i_1))] - \varepsilon, \qquad (1)$$

while the regret of the learner for a horizon of $T$ is

$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^{T} \max_{j \in [K]} A(i_t, j) - \sum_{t=1}^{T} A(i_t, j_t)\right]. \qquad (2)$$

The expectations are taken with respect to the randomness of both the learner and the environment, including contexts and rewards.
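The interaction protocol and the lumpable structure are easy to simulate. The following is a minimal sketch (not from the paper), assuming Gaussian noise and, for simplicity, a uniform context distribution; the class name and interface are illustrative only.

```python
import numpy as np

class LumpableBandit:
    """Toy r-lumpable contextual bandit: the S x K mean-reward matrix is A = U V,
    where each row of U is one-hot, so contexts in the same group share rewards."""

    def __init__(self, S, K, r, seed=0):
        self.rng = np.random.default_rng(seed)
        self.g = self.rng.integers(0, r, size=S)           # hidden group label g(i) of each context
        self.V = self.rng.uniform(0.0, 1.0, size=(r, K))   # mean reward of each (group, action) pair
        self.A = self.V[self.g]                            # A(i, j) = V(g(i), j); at most r distinct rows
        self.nu = np.full(S, 1.0 / S)                      # context distribution (uniform for simplicity)

    def sample_context(self):
        return int(self.rng.choice(len(self.nu), p=self.nu))

    def pull(self, i, j):
        """Noisy reward y = A(i, j) + eta with Gaussian (hence 1-subgaussian) noise."""
        return self.A[i, j] + self.rng.normal()
```

In such an instance, an $\varepsilon$-optimal policy only needs to identify, for each hidden group, one near-best action; the difficulty is that the group labels are never observed.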
It is well known (e.g., Lattimore and Szepesvári [2020]) that there are algorithms such that an $\varepsilon$-optimal policy will be discovered after

$$\tilde O\!\left(SK/\varepsilon^2\right) \qquad (3)$$

interactions, and there are also algorithms for which the regret satisfies

$$\mathrm{Reg}_T = \tilde O\!\left(\sqrt{SKT}\right). \qquad (4)$$

Here, the notation $\tilde O(\cdot)$ hides polylogarithmic factors of the variables involved. We say that the stochastic, finite, contextual bandit problem specified by $A$ is $r$-lumpable (or, in short, the bandit problem is context-lumpable) if there is a partitioning of $[S]$ into $r \le \min\{S, K\}$ groups such that $A(i, \cdot) = A(i', \cdot)$ holds whenever $i, i' \in [S]$ belong to the same group. It is not hard to see that any algorithm needs at least $\Omega(r(S + K)/\varepsilon^2)$ interactions to discover an $\varepsilon$-optimal policy (Theorem 3). Indeed, if we view $A$ as an $S \times K$ matrix, the lumpability condition states that $A = UV$, where $U$ is an $S \times r$ binary matrix in which each row has a single nonzero element, and $V$ is an $r \times K$ matrix that gives the unique mean rewards given the group indices. Hence, crude parameter counting suggests that there are $r(S + K)$ parameters to be learned. The question is whether $SK$ in Eq. (3) and Eq. (4) can be replaced with $(S + K)\,\mathrm{poly}(r)$ without knowing the grouping of the contexts. More generally, we can ask the same questions for contextual bandits with a low-rank structure, where the matrix $A$ has rank $r$. Equivalently, the low-rank condition means that we have the same decomposition $A = UV$ as above but with no further constraints on $U$. In the example of recommendation platforms, this assumption is more versatile, as the users are modeled as mixtures of $r$ preference types instead of belonging to one type only.

1.1 Related works

Our problem can be seen as an instance of the contextual bandit problem introduced by Auer et al. [2002]. For a good summary of the history of the contextual bandit problem, the reader is advised to consult the article by Tewari and Murphy [2017]. Further review of prior work can also be found in the books of Slivkins [2019] and Lattimore and Szepesvári [2020]. Another way to approach context-lumpable stochastic bandits is to model them as prediction with expert advice or as stochastic linear bandits with changing action sets [Auer, 2002, Chu et al., 2011, Abbasi-Yadkori et al., 2011]. However, lumpability does not give improvements on the regret bound when running the standard algorithms in these settings (EXP4 and SupLinRel, respectively). We provide more details in the appendix.

We are not the first to study the $r$-context-lumpable stochastic bandit problem. The earliest work is probably that of Maillard and Mannor [2014], who named this problem latent bandits: they consider the group identity of the context as latent, or missing, information. While the paper introduces this problem, the main theoretical results are derived for the case when the reward distributions given the group information are known to the learner. Further, the results derived are instance dependent. Altogether, while this paper makes interesting contributions, it does not help in answering our questions.² The earlier work of Gentile et al. [2014] considered another generalization of our problem where in round $t = 1, 2, \ldots$ the learner gets an action set $\{x_{t,1}, \ldots, x_{t,K_t}\} \subset \mathbb{R}^d$ and the mean payoff of choosing action $1 \le j \le K_t$, given the past and the current context $i_t$, is $x_{t,j}^\top \theta_{g(i_t)}$ with $\theta_1, \ldots, \theta_r \in \mathbb{R}^d$ unknown. When specializing this to our setting, their result depends on the minimum separation $\min_{i,j:\, g(i) \ne g(j)} \|A(i, \cdot) - A(j, \cdot)\|$, which, according to the authors, may be an artifact of their analysis.
An extra assumption, which the authors suggest could be removed, is that the distribution of the contexts is uniform (as we shall see, removing this assumption requires considerable work). Further, as the authors note, the worst case for their algorithm is when the contexts are equipartitioned, in which case their bound reduces to the naive bound that one gets for non-lumpable problems. Li et al. [2016] consider the variation of the problem where the set of arms (and their feature vectors) is fixed and known before learning begins, and separation is defined with respect to this fixed set of arms (and hence could be larger than before). From our perspective, their theoretical results have the same weaknesses as the previous paper. Gentile et al. [2017] further generalize these previous works to the case when the set of contexts is not fixed a priori. However, the previously mentioned weaknesses of the results remain.

Our problem is also a special case of learning in lumpable Markov decision processes. In such Markov decision processes (MDPs) the states can be partitioned such that states within a group of the partition behave identically, both in terms of the transitions and the rewards received [e.g., Ren and Krogh, 2002].³ We give a survey of this line of work in Appendix A.6. In summary, although the problem is extensively studied in the literature, we are not aware of any previous results that have implications for the existence of minimax regret bounds of the form $\tilde O(\sqrt{\mathrm{poly}(r)(S + K)T})$, which we believe is a fundamental question in this area.

2 Notation and problem definition

For a positive integer $n$, we use $[n]$ to denote the set $\{1, \ldots, n\}$. Further, for a finite set $U$, $\Delta(U)$ denotes the set of probability distributions over $U$ and $\mathrm{unif}(U)$ denotes the uniform distribution over $U$. The set of real numbers is denoted by $\mathbb{R}$. As introduced earlier, we consider context-lumpable bandits with $S$ contexts and $K$ actions such that the $S$ contexts can be lumped into $r$ groups so that the mean payoff $A(i, j)$, given that action $j \in [K]$ is used while the context is $i \in [S]$, depends only on the identity of the group that $i$ belongs to and the action $j$. That is, for some map $g : [S] \to [r]$,

$$A(i, j) = A(i', j) \quad \text{for any } j \in [K],\ i, i' \in [S] \text{ such that } g(i) = g(i').$$

Neither $A$ nor $g$ is known to the learner, who interacts with the bandit instance in the following way. In rounds $t = 1, 2, \ldots$ the learner first observes a context $i_t \in [S]$ randomly chosen from a fixed context distribution $\nu \in \Delta([S])$, independently of the past. The distribution $\nu$ is also unknown to the learner. Next, the learner chooses an action $j_t \in [K]$ and receives a reward of

$$y_t = A(i_t, j_t) + \eta_t,$$

where $\eta_t$ is 1-subgaussian given the past: for any $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda \eta_t) \mid i_1, j_1, \ldots, i_t, j_t] \le \exp(\lambda^2/2)$ almost surely (a.s.). As described earlier, we are interested in two problem settings. In the PAC setting, the learner is given a target suboptimality level $\varepsilon > 0$ and a target confidence $\delta$. The learner is then tasked to discover a policy $\pi : [S] \to [K]$ that is $\varepsilon$-optimal (cf. Eq. (1)) with probability at least $1 - \delta$. In this setting, in addition to deciding what actions to choose, the learner also needs to decide when to stop collecting data, and when it stops it has to return a map $\hat\pi : [S] \to [K]$. If $T$ is the (random) time when the learner stops, then the efficiency of the learner is characterized by $\mathbb{E}[T]$. In the online setting, the goal of the learner is to keep its regret (cf. Eq. (2)) low.
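As a concrete reading of Eq. (1), the quality of a deterministic policy is simply its expected reward under the context distribution. A small helper (again only a sketch, with `A` an S-by-K array and `nu` the context distribution, both of which are unknown to the learner) makes the $\varepsilon$-optimality check explicit.

```python
import numpy as np

def policy_value(A, nu, policy):
    """Expected one-round reward E[A(i, pi(i))] of a deterministic policy
    (policy is an integer array of length S)."""
    return float(np.sum(nu * A[np.arange(len(nu)), policy]))

def suboptimality(A, nu, policy):
    """Gap to the optimal policy, which plays argmax_j A(i, j) in every context i."""
    optimal_value = float(np.sum(nu * A.max(axis=1)))
    return optimal_value - policy_value(A, nu, policy)

# A policy is eps-optimal in the sense of Eq. (1) iff suboptimality(A, nu, policy) <= eps.
```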
In Section 3, we will consider the PAC setting, while the online setting will be considered in Section 4. 2Somewhat confusingly, Hong et al. [2020] define latent bandit problems differently from the definition given by Maillard and Mannor [2014]: They assume that in addition to the context, a latent (unobserved) variable, influences the rewards, however, they also assume that the distributions of the reward given an action, context and latent state are all known. Their results therefore can not help us in answering our question. 3Lumpability was first introduced first by Kemeny and Snell [1976] in the context of Markov chains. In the remainder of the paper, we will also need some further notation. The map g induces a partitioning P = {B(1), . . . , B(r)} of [S] in the natural way: B(b) = {i : g(i) = b} . We call each subset B P a block and call B(b) block b for every b [r]. We also define block reward µ(b, j) = A(i, j) of block b for every context i B(b) and arm j [K]. Finally, we define the block distribution ω (r) so that ω(b) = P i B(b) ν(i). 3 Near-optimal PAC Learning in Context-lumpable Bandits In this section, we present an algorithm for PAC learning in context-lumpable bandits, and prove that its sample complexity guarantee matches the minimax rate up to a logarithmic factor. 3.1 Special case: almost-uniform context distribution To better illustrate the key ideas behind the algorithm design, we first consider a special case of almost-uniform context distribution. Formally, we assume ν(i) [1/(8S), 8/S] for all i [S] throughout this subsection. The pseudocode of our algorithm is provided in Algorithm 1, which consists of three main modules. Below we elaborate on each of them separately. Data collection (Line 3-4 and Algorithm 2). The high-level goal of this step is to collect a sufficient amount of data for every block/action pair so that in later steps we have sufficient information to select a near-optimal arm for each block. One naive approach is to try every action on every context for a sufficient amount of time (e.g., e O(1/ε2)). However, this will cause an undesired factor of SK in the sample complexity. To address this issue, note that contexts from the same block share the same reward function, which means that for every block/action pair (b, j) [r] [K], we only need to sufficiently try action j on a single context i from block b (i.e., i B(b)). However, the block structure is unknown a priori. We circumvent this problem by assigning a random action ψ(i) to each context i for them to explore (Line 4). And after action ψ(i) has been tried on context i sufficiently many times, we update ψ(i) by reassigning context i with a new random action (Line 8-10). However, there is still one key problem unresolved yet: how many samples to use for estimating each A(i, ψ(i)) before reassigning context i with another random action, given a fixed budget of total sample complexity. On one hand, if we only try each (i, ψ(i)) pair a very few numbers of times before switching, then we can potentially explore a lot of different actions for each block but the accuracy of the reward estimate could be too low to identify a near-optimal policy. On the other hand, estimating each A(i, ψ(i)) with a huge number of samples resolves the poor accuracy issue but could potentially cause the problem of under-exploration, especially for those blocks consisting of very few contexts. 
In the extreme case, if a block only contains a single context, then our total budget may only afford to test a single action on that context (block). To address this issue, we choose different levels of accuracy for different blocks adaptively depending on their block size. Specifically, contexts from larger blocks can afford to try each random action before switching for a larger number of times to achieve higher estimate accuracy because larger blocks consist of more contexts, which means that the average number of random actions in each context inside a larger block needs to try is smaller. And for smaller blocks, the case is the opposite. Finally, we remark that the above scheme of using adaptive estimate accuracy can be implemented without any prior knowledge of the block structure. We only need to iterate over different accuracy levels using an exponential schedule (Line 3), and each block will be sufficiently explored at its appropriate accuracy level. Specifically, let N = log(1/ε2) , and consider accuracy levels n [N]. For accuracy level n, we collect a dataset Dn by adding a context-action pair after the pair is played for 2n timesteps (Line 9). Screening optimal action candidates (Line 5-11). Equipped with the dataset collected from Step 1, we want to identify a subset W of [K] so that (i) for every block b [r], W contains a near-optimal action of block b, and (ii) the size of W is as small as possible. We construct such W in an iterative way. For each accuracy level n [N], we first compute the context-action pair (i , j ) with the highest reward estimate in Dn (Line 7). Intuitively, as (i , j ) has been tried for 2n timesteps in constructing Dn, it is natural to guess that j could potentially be a e O(2 n/2)-optimal action for certain blocks so we add it into optimal action candidate set W (Line 8). To identify the contexts from those blocks, we sample new data to estimate the reward of j on each context i [S]. This can be done by calling Algorithm 2 and setting the exploration set K to be j (Line 10). If a context i can achieve reward close enough to b An(i , j ) at action j , then we peel off every (i, j) Dn (Line 11). This is because we have found j as an optimal action candidate for i at this accuracy level and don t need to consider other j. We repeat the above process, and add a new arm to W in each iteration, until Dn becomes empty. Solving the simplified context-lumpable bandit (Line 12). After obtaining the optimal action candidate set W, we can discard all actions not in W and solve a simplified context-lumpable bandit problem by replacing the original action set [K] with W. Note that e O(S|W|/ε2) samples suffice for learning an ε-optimal policy for this simplified problem. For example, we can directly try each action j W on each context i [S] for eΘ(1/ε2) times, and define πout(i) arg maxj W A(i, j) where A(i, j) is the empirical estimate of A(i, j) based on eΘ(1/ε2) samples. Algorithm 1 Algorithm for Almost-uniform Context Distribution 1 Let t denote the current time and initialize N log(1/ε2) , W , elg 16 log(r SK/δ) 2 Define K(i) [K] for i [S], L r(S + K)elg/ε2 Step 1. Data collection 3 for accuracy level n = 1, . . . , N do 4 Execute Algorithm 2 with input L, n, K, and receive Dn and b An Step 2. Screening optimal action candidates 5 for accuracy level n = 1, . . . 
, N do 6 while Dn = do 7 Compute (i , j ) arg max(i,j) Dn b An(i, j) 8 Update optimal action candidates W W S{j } 9 Update K(i) {j } for i [S] and L 8elg2n S 10 Execute Algorithm 2 with input L , n, K, and reassign output to e Dn and e An 11 Shrink Dn {(i, j) Dn : | e An(i, j ) b An(i , j )| q Step 3. Solving the simplified context-lumpable bandit 12 Use 4S|W| δ samples to find πout s.t. A(i, πout(i)) maxj W A(i, j) ε for all i [S] 13 Output πout Algorithm 2 Data Collection 1 Input L, n, K 2 Let t denote the current time and initialize Dn 3 for context i [S] do 4 Assign ψt(i) to be an arm drawn from unif(K(i)) 5 for L timesteps do 6 Receive context it, play arm jt = ψ(it), and receive reward yt 7 Preset ψt+1(i) = ψt(i) for every i [S] 8 if (Pt τ=1 1(iτ = it))%2n = 0 then 9 Dn Dn S{(it, ψ(it))} and b An(it, ψ(it)) τ t yτ 1[iτ =it,jτ =ψ(it)] P τ t 1[iτ =it,jτ =ψ(it)] 10 Reassign ψt+1(it) unif(K(it)); 11 Output Dn and b An Now we present the theoretical guarantee for Algorithm 1. The proof and exact constants in the bound can be found in Appendix C. Theorem 1. Let δ (0, 1) and assume ν(i) [1/(8S), 8/S] for all i [S]. Algorithm 1 outputs an e O(ε)-optimal policy within e O(r(S + K) log(1/δ)/ε2) samples with probability at least 1 δ. 3.2 Extension to general context distribution Algorithm 3 Algorithm for General Context Distribution 1 Let J = 4S ε log(S/δ), L = log(S/ε) , N = 524 log r SK δ 1 + 2 log 1 2 Estimate the context distribution by sampling J contexts, and denote the estimate by ˆν 3 Split the context set into L disjoint subsets {Xl}l [L] where Xl {i [S] : ˆν(i) (2 l 1, 2 l]}, l [0 : L 1] {i [S] : ˆν(i) [0, 2 l]}, l = L 4 for l [L 1] do 5 Execute Algorithm 1 to learn policy πl for subset Xl from N time steps 6 Output πout such that πout equals to πl over Xl for l [0 : L 1], and arbitrary over XL In this subsection, we show how to generalize Algorithm 1 to handle general context distributions. We present the pseudo-code in Algorithm 3. The algorithm consists of two key steps. In the first step, we use J = e O(S/ε) samples to obtain an empirical estimate of the context distribution, denoted by ˆν. Then we divide the context set into many disjoint subsets {Xl}l [0:L] such that inside each subset Xl, the conditional empirical context distribution is almost uniform. As a result, we can invoke Algorithm 1 to find a near-optimal policy πl for each subset Xl (Line 5). It requires e O(r(S + K)/ε2) time steps for every ℓbut we only use samples where contexts are from Xl and ignore the rest. Finally, we glue all πl together to get a policy πout that is near-optimal for the original problem. Formally, we have the following theoretical guarantee for Algorithm 3. The proof and exact constants are in Appendix C. Theorem 2. Let δ (0, 1). Algorithm 3 outputs an e O(ε)-optimal policy within e O(r(S + K) log(1/δ)/ε2) samples with probability at least 1 δ. Note that we can always learn an ε-optimal policy for any context-lumpable bandit within e O(SK/ε2) samples by invoking any existing near-optimal algorithm for finite contextual bandits. As a result, by combining Theorem 2 with the e O(SK/ε2) upper bound, we obtain that e O(min{r(S + K), SK}/ε2) samples suffice for learning any context-lumpable bandit, which according to the following theorem is minimax optimal up to a logarithmic factor. Theorem 3. Learning an ε-optimal policy for a context-lumpable bandit with probability no smaller than 1/2 requires at least Ω(min{r(S + K), SK}/ε2) samples in the worst case. 
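Before moving to regret minimization, here is a simplified rendering of the data-collection routine (Algorithm 2) in code. It is a sketch only: it reuses the `sample_context`/`pull` interface from the earlier environment sketch and omits the paper's exact bookkeeping and constants. Each context keeps one randomly assigned arm from its exploration set; once that arm has been pulled $2^n$ times on that context, the pair and its empirical mean are recorded and a fresh arm is drawn.

```python
from collections import defaultdict
import numpy as np

def collect_data(env, L, n, action_sets, seed=0):
    """Simplified Algorithm-2-style loop: run for L rounds and record a (context, arm)
    pair in D, with its empirical mean, once the arm has been pulled 2**n times
    on that context; then reassign the context a fresh random arm."""
    rng = np.random.default_rng(seed)
    psi = {i: rng.choice(list(ks)) for i, ks in action_sets.items()}  # current arm of each context
    counts, sums = defaultdict(int), defaultdict(float)
    D, A_hat = set(), {}
    for _ in range(L):
        i = env.sample_context()
        j = psi[i]
        y = env.pull(i, j)
        counts[(i, j)] += 1
        sums[(i, j)] += y
        if counts[(i, j)] == 2 ** n:                    # reached the target accuracy level
            D.add((i, j))
            A_hat[(i, j)] = sums[(i, j)] / counts[(i, j)]
            psi[i] = rng.choice(list(action_sets[i]))   # reassign a fresh random arm to context i
            counts[(i, j)], sums[(i, j)] = 0, 0.0       # start counting afresh for the new assignment
    return D, A_hat
```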
4 Regret Minimization in Context-lumpable Bandits In this section, we extend the idea from the PAC setting to the online setting. To better introduce the key ideas, we first consider a special case when both context and block distributions are uniform (Section 4.1). Then we consider the most general case in Section 4.2. 4.1 Special Case: Uniform Context and Block Distribution In this section, we assume that distributions ν and ω are uniform, and thus, g evenly splits the contexts into r blocks so that there are S/r contexts in each block and every context appears with the same probability at each timestep. We will relax these assumptions and consider the general case in the next subsection. For this case, we introduce Algorithm 4, which uses phased elimination in combination with a clustering procedure. The algorithm runs in phases h = 1, 2, . . . that are specified by error tolerance parameter εh = 2 h/2. Like phased elimination algorithms for multi-armed bandits, we need to ensure at phase h actions the algorithms play are all e O(εh)-optimal. Thus, at the end of each phase h, we eliminate all actions that are not e O(εh)-optimal. However, the set of e O(εh)-optimal actions of each block is different. Therefore, we also perform clustering on contexts and perform elimination for each subset of the partition. Specifically, we maintain a partition of contexts Ph at each phase h and initialize P1 = {[S]}. For each cluster C Ph, we maintain a set of good arms GOODh(C), which we will prove is e O(εh)-optimal for contexts in the cluster with high probability. We use Algorithm 2 to collect data similar to Algorithm 1 (Line 6). At phase h, we try to estimate mean reward up to error e O(εh) with probability 1 δh. The difference is that we assume ω is uniform, so we don t need different accuracy levels n, which will be required for the algorithm that handles the general case. Also, for every context i C, we only explore arms good for now, that is, in GOODh(C) instead of exploring all the arms [K]. This reflects that in the online setting, we need to minimize regret and cannot afford to explore bad arms too many times. Based on the data we collect during the exploration stage, we then check if there is a large gap across contexts in the same subset for any arms (Line 7). A large gap suggests that (i) the subset contains at least two blocks (ii) the mean reward of the arm is significantly different in these blocks and we can use this arm to partition the contexts by running Algorithm 5 (clustering stage). We repeatedly do clustering (Line 9) and split heterogeneous subsets (Line 10) until we cannot find a large gap within the same subset. If no large gap is detected, then each arm has similar mean rewards (up to error e O(εh)) across all blocks in the same subset. Then we eliminate arms that are significantly worse than the empirical best arm (elimination stage) from GOODh(C) for every subset C and start a new episode (Line 14). Algorithm 5 plays j for each context i C for e O(1/ε2) rounds and calculates empirical means of arm j for each context in C. It then sorts the contexts by the empirical means and performs clustering (Line 4). Specifically, the algorithm enumerates contexts in descending order of empirical means and splits contexts until a large gap between consecutive values is detected (Line 7). 
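The splitting rule itself is short. The sketch below mirrors the sorting-and-cutting logic just described, assuming the empirical means `A_hat` for the separating arm have already been computed; the confidence width is abstracted into a `threshold` argument, since the exact constant in the paper involves logarithmic factors.

```python
def split_by_gap(contexts, A_hat, j, threshold):
    """Sort the contexts of a cluster by the empirical mean of the separating arm j
    (descending) and start a new part whenever two consecutive values differ by
    more than `threshold`."""
    order = sorted(contexts, key=lambda i: A_hat[(i, j)], reverse=True)
    parts, current = [], [order[0]]
    for prev, cur in zip(order, order[1:]):
        if A_hat[(prev, j)] - A_hat[(cur, j)] > threshold:  # large gap: close the current part
            parts.append(current)
            current = []
        current.append(cur)
    parts.append(current)
    return parts
```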
As we call Algorithm 5 only if a large gap is detected, we prove that in the appendix it correctly separates the subset into at least two parts without putting any contexts in the same block into different parts by setting ε = εh/r. Algorithm 4 Algorithm for Uniform Block Distribution 1 Initialize P1 {[S]}, GOOD1(C) [K] for C P1 2 Let t denote the current time 3 for phase h = 1, 2, . . . , do 4 Let εh 2 h/2, δh ε2 h/(r3SK), elgh 64 log(r SK/δh), eεh q elgh εh Step 1. Data collection 5 Define Lh r(S+K)elgh ε2 h , nh log( 1 ε2 h ), Kh(i) GOODh(C), C Ph, C i, i [S] 6 Execute Algorithm 2 with input Lh, nh, Kh, and receive Dh and b Ah Step 2. Test homogeneity and perform clustering on heterogeneous subsets 7 while C Ph, i, i C, j [K] such that b Ah(i, j) b Ah(i, j) eεh do 8 Define ε εh r , K(i) {j} if i C else GOODh(C) for C Ph, C i 9 Execute Algorithm 5 with input ε , δ , K, C, and j, and get P, a partition of C 10 Initialize GOODh(C ) GOODh(C), C P and update Ph (P Ph)\{C} Step 3. Eliminate suboptimal actions in each subset 12 for C Ph+1 do 13 Calculate µh(C, j) maxi:i C,(i,j) Dh b Ah(i, j), j GOODh(C) 14 Update GOODh+1(C) {j : j GOODh(C), maxj µh(C, j ) µh(C, j) 2eεh} Similar to the analysis of other phased elimination algorithms, we have to show that in a phase specified by error level εh, with high probability, (i) the optimal arm is not eliminated and (ii) all eω(εh)-suboptimal arms are eliminated, that is, all arms in GOODh(C) for all C are e O(εh)-optimal. We show the final regret here and defer the details to Appendix D Theorem 4. Under the assumption that context distribution ν and block distribution ω are uniform, regret of Algorithm 6 is bounded as Reg T = e O( p r3(S + K)T). Algorithm 5 Split Contexts Into Multiple Blocks 1 Input: error ε , confidence δ , exploration sets K, subset C, separating arm j 2 Initialize elg 64 log(S/δ ), L S elg/ε 2, n 1/ε 2 3 Execute Algorithm 2 with input L , n , K, and reassign output to D and b A 4 Sort contexts in C and label them as i1, . . . , i|C| so that b A(i1, j) b A(i2, j) b A(i|C|, j) 5 Initialize P1 {i1}, b 1 6 for k = 2, 3, . . . , |C| do 7 if b A(ik 1, j) b A(ik, j) q 8 Update b b + 1 and initialize Pb {} 9 Pb Pb {ik} 10 Output {P1, . . . , Pb} Compared to our PAC result, we get an extra dependency on r. This is because we pay e O(S/ε 2) = e O(r2S/ε2 h) samples to do clustering instead of e O(1/ε2 h) in order to ensure a perfect partition, that is, we never separate contexts in the same block with high probability. This is crucial for our phase-elimination algorithm as we may call Algorithm 5 in different phases. We left getting better than e O( p r3(S + K)T) regret upper bounds or better than Ω( p r(S + K)T) regret lower bounds (even for this uniform special case) as an important future direction. 4.2 Non-uniform Context and Block Distribution Similar to the PAC learning setting, we can use Algorithm 3 to reduce general context distributions to (nearly) uniform ones. We provide more details in Appendix F. As a result, without loss of generality, we may assume ν is almost-uniform, and we focus on how to handle non-uniform block distribution ω. In the extreme, there may only be one context for some blocks, which becomes challenging to estimate their mean rewards. For this case, we introduce Algorithm 6. Intuitively, based on Algorithm 4, we can further employ different accuracy levels as Algorithm 1. However, as our goal becomes minimizing regret, it is difficult to control regret for smaller n and larger blocks. 
Specifically, for smaller n, we explore more actions (with fewer samples) for each context in a single phase. Since fewer samples are used, we may unavoidably play suboptimal actions and suffer large regret. This becomes worse for a large block as more contexts in the block suffer large regret. We note that this is not a problem in the PAC setting because we only need to control sample complexity. We fix the issue by setting different lengths L for different accuracy levels. Recall in Algorithm 1, we use the same length L = e O(r(S+K)/ε2) for every n. Intuitively, since we allow less accurate estimations for smaller n, we may use fewer data. It turns out indeed we can set L = e O(r(S + K)2(n+h)/2) for level n at phase h, and with more refined analysis, it provides similar guarantees as before. More importantly, since smaller n uses (exponentially) fewer samples, the overall regret is well controlled. In addition to setting the lengths sophisticatedly, we also need to carefully maintain sets of good actions GOODh,n not only at each phase h but also at each accuracy level n. The partition of contexts Ph, however, only evolves with phases and is shared between different levels at the same phase. Similar to Algorithm 4, we also do clustering (Step 2) and elimination (Step 3) but do the procedures in parallel for every accuracy level n. In the clustering stage, we use different thresholds to define a large gap . This reflects that n represents different accuracy levels and thus has different widths of confidence intervals. For the elimination stage, we enforce the inclusion relation GOODh(n, C) GOODh(n , C) for n n (Line 17), which will be useful in the regret analysis. We now present the main theorem and put the complete proof in Appendix G. Theorem 5. Regret of Algorithm 6 is bounded as Reg T = e O( p r3(S + K)T) . We remark that Algorithm 6 is anytime, that is, it doesn t require the time horizon T as input. However, it does require the number of blocks r as prior knowledge. Removing the knowledge of r is an interesting future direction. One promising idea is to apply a doubling trick on r. Specifically, we have an initial guess r = e O(1) and run Algorithm 6; when |Ph| > r, we double r and restart Algorithm 6 Algorithm for Non-uniform Block Distribution 1 Initialize P1 {[S]}, GOOD1,1(C) [K], C P1 2 Let t denote the current time 3 for phase h = 1, 2, . . . , do 4 Let εh 2 h/2, Nh h, δh ε2 h/(r3SK), elgh 128 log(r SKNh/δh) Step 1. Data collection 5 for accuracy level n = 1, . . . , Nh do 6 Define Lh,n r(S + K)elgh2(n+h)/2, Kh,n(i) GOODh,n(C), C Ph, C i, i [S] 7 Execute Algorithm 2 with input Lh, n, Kh,n, and receive Dh,n and b Ah,n Step 2. Test homogeneity and perform clustering on heterogeneous subsets 8 while C Ph, i, i C, j [K], n [Nh] such that b Ah,n(i, j) b Ah,n(i, j) q 9 Define ε εh r , K(i) {j} if i C else GOODh,n(C) for C Ph, C i 10 Execute Algorithm 5 with input ε , δ , K, C, and j, and get P, a partition of C 11 Initialize GOODh,n(C ) GOODh,n(C), C P and update Ph (P Ph)\{C} Step 3. Eliminate suboptimal actions in each subset 12 Ph+1 Ph, GOOD1,1(C) [K], C Ph+1 13 for accuracy level n = 2, . . . , Nh do 14 for C Ph+1 do 15 Calculate µh,n(C, j) maxi:i C,(i,j) Dh,n b Ah,n(i, j), j GOODh,n(C) 16 Let Gh,n(C) j : j GOODh,n(C), maxj µh,n(C, j ) µh,n(C, j) 2 q 17 Update GOODh+1,n(C) GOODh+1,n 1(C) Gh,n(C) the algorithm. The analysis, though, may be much more complicated. 
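The restart schedule of this (unanalyzed) doubling idea is easy to state in code. The sketch below is purely illustrative: `run_algorithm6` is a hypothetical handle on Algorithm 6 that is assumed to stop and report failure as soon as its maintained partition exceeds the current guess.

```python
def run_with_doubling(run_algorithm6, horizon, r_guess=2):
    """Hypothetical doubling trick on the unknown number of blocks r: run Algorithm 6
    with a guess, and whenever the partition it maintains grows past the guess,
    double the guess and restart on the remaining rounds."""
    rounds_left = horizon
    while rounds_left > 0:
        # Assumed contract: returns the number of rounds consumed and whether the run
        # aborted because the partition size exceeded r_guess.
        used, aborted = run_algorithm6(r_guess, rounds_left)
        rounds_left -= used
        if aborted:
            r_guess *= 2
```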
Finally, we note that when knowing r, one can calculate in advance if SKT is less than p r3(S + K)T and switch to a standard algorithm achieving e O( SKT) in that case. This modification guarantees the regret is never worse than the standard bound. 5 From Context-lumpable Bandits to Contextual Low-rank Bandits Finally, we consider the more general contextual low-rank bandit problem. Specifically, we allow A to have rank r, that is A = UV for some S r matrix U and r K matrix V . Lumpability is a special case in the sense that U is binary where each row has a single nonzero element. To solve the problem, We show a reduction from contextual low-rank bandits to context-lumpable bandits. Consider an α-covering of rows of U, and notice that the covering number, Rα, is 1/αr. The context-lumpable bandits can be seen as α-approximate context-lumpable bandits with Rα blocks, where the reward of contexts on the same block differs at most α. Ignoring this misspecification, we may run Algorithm 1 and Algorithm 6 for the PAC and online settings, respectively. Moreover, it turns out that our algorithms are robust to this misspecification when α is sufficiently small (O(ε) in the PAC setting, for example). Therefore, the sample complexity and the regret bounds of these algorithms will be in terms of Rα despite having an exponential dependency on r. The resulting bounds can still be smaller than the naive SK/ε2 and SKT bounds in some scenarios, for example, when S and K are super large. We put the details in Appendix H. 6 Conclusions and Future Directions We consider a contextual bandit problem with S contexts and K actions. Under the assumption that the context-action reward matrix has r min{S, K} unique rows, we show an algorithm that outputs an ε-optimal policy and has the optimal sample complexity of e O(r(S + K)/ε2) with high probability. In the regret minimization setting, we show an algorithm whose cumulative regret up to time T is bounded as e O( p r3(S + K)T). An immediate next question is whether a regret bound of order e O( p r(S + K)T) is achievable in the regret minimization setting. A second open question is concerned with obtaining a e O( p poly(r)(S + K)T) regret bound in contextual low-rank bandits. Our regret analysis heavily relies on the assumption that contexts arrive in an I.I.D fashion. Extending our results to the setting with adversarial contexts remains another important future direction. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models going beyond svd. In 2012 IEEE 53rd annual symposium on foundations of computer science, pages 1 10. IEEE, 2012. P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397 422, 2002. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19 26. JMLR Workshop and Conference Proceedings, 2011. Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. Arxiv, 2020. 
Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098 3121, 2020. Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208 214. JMLR Workshop and Conference Proceedings, 2011. Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. On oracle-efficient PAC RL with rich observations. Advances in neural information processing systems, 31, 2018. Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665 1674. PMLR, 2019. Yaqi Duan, Tracy Ke, and Mengdi Wang. State aggregation learning from Markov transition data. Advances in Neural Information Processing Systems, 32, 2019. Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169 178, 2011. Fei Feng, Ruosong Wang, Wotao Yin, Simon S Du, and Lin Yang. Provably efficient exploration for reinforcement learning using unsupervised learning. Advances in Neural Information Processing Systems, 33:22492 22504, 2020. Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199 3210. PMLR, 2020. Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757 765. PMLR, 2014. Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 1253 1262, 2017. Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, and Craig Boutilier. Latent bandits revisited. In Neur IPS, 2020. Prateek Jain and Soumyabrata Pal. Online low rank matrix completion, 2022. Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665 674, 2013. Kyoungseok Jang, Kwang-Sung Jun, Se-Young Yun, and Wanmo Kang. Improved regret bounds of bilinear bandits using action space analysis. In International Conference on Machine Learning, pages 4744 4754. PMLR, 2021. Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704 1713. PMLR, 2017. Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pages 3163 3172. PMLR, 2019. Yue Kang, Cho-Jui Hsieh, and Thomas Chun Man Lee. Efficient frameworks for generalized low-rank matrix bandit problems. In Advances in Neural Information Processing Systems, 2022. 
Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Artificial Intelligence and Statistics, pages 392 401. PMLR, 2017. John G Kemeny and James Laurie Snell. Finite Markov chains. Springer, 1976. Branislav Kveton, Csaba Szepesvári, Anup Rao, Zheng Wen, Yasin Abbasi-Yadkori, and S Muthukrishnan. Stochastic low-rank bandits. ar Xiv preprint ar Xiv:1712.04644, 2017. Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvári, and Craig Boutilier. Differentiable meta-learning in contextual bandits. ar Xiv:2006.05094v1, 2020. Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih wei Hsu, Martin Mladenov, Craig Boutilier, and Csaba Szepesvari. Meta-Thompson sampling. Arxiv, 2021. Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34: 24523 24534, 2021. Tor Lattimore and Botao Hao. Bandit phase retrieval. Advances in Neural Information Processing Systems, 34:18801 18811, 2021. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 539 548, 2016. Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Low-rank generalized linear bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 460 468. PMLR, 2021. Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. In International Conference on Machine Learning, pages 136 144. PMLR, 2014. Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961 6971. PMLR, 2020. Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. Model-free representation learning and exploration in low-rank MDPs. ar Xiv preprint ar Xiv:2102.07035, 2021. Chengzhuo Ni, Anru R Zhang, Yaqi Duan, and Mengdi Wang. Learning good state and action representations via tensor decomposition. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1682 1687. IEEE, 2021. Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli, Alessandro Lazaric, and Matteo Pirotta. Reinforcement learning in linear MDPs: Constant regret and representation selection. Advances in Neural Information Processing Systems, 34:16371 16383, 2021. Zhiyuan Ren and B.H. Krogh. State aggregation in Markov decision processes. In Proceedings of the 41st IEEE Conference on Decision and Control, volume 4, pages 3819 3824, 2002. Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An NMF approach, 2016. David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 47(3):1904 1931, 2022. Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, 2019. A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495 517. Springer, 2017. Cindy Trinh, Emilie Kaufmann, Claire Vernade, and Richard Combes. 
Solving Bernoulli rank-one bandits with unimodal Thompson sampling. In Algorithmic Learning Theory, pages 862 889. PMLR, 2020. Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in low-rank MDPs. In International Conference on Learning Representations, 2022. Amy Zhang, Shagun Sodhani, Khimya Khetarpal, and Joelle Pineau. Learning robust state abstractions for hidden-parameter block MDPs. In International Conference on Learning Representations, 2020. Weitong Zhang, Jiafan He, Dongruo Zhou, Amy Zhang, and Quanquan Gu. Provably efficient representation learning in low-rank Markov decision processes. ar Xiv preprint ar Xiv:2106.11935, 2021. Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, and Wen Sun. Efficient reinforcement learning in block MDPs: A model-free representation learning approach. In International Conference on Machine Learning, pages 26517 26547. PMLR, 2022. A More Related Works A.1 From bilinear to low-rank bandits The low-rank structure has been shown useful in other bandit problems. Katariya et al. [2017] considered the case when the learner chooses both context and arm and the reward map A, when viewed as a S K matrix, has rank one. This is a special case of a linear bandit setting with a fixed action set where both the features underlying the actions and the parameter vectors are shaped as matrices of matching dimensions, and both matrices are rank-one. Similarly to our problem, the challenge is to make the regret depend on S + K rather than on S K (but note that S does not have the interpretation of the number of contexts here). This paper demonstrates that this is possible when considering gap-dependent bounds. Trinh et al. [2020] improved the result of Katariya et al. [2017] and gave an asymptotically optimal regret bound. Kveton et al. [2017] generalized the setting of Katariya et al. [2017] to the case when the reward A (viewed as matrix) is of rank r min(S, K) and is a hott topics matrix , the learner in each round can choose r row and column indices, observes the entries of A at the resulting submatrix of A in noise and incurs the maximum reward in this submatrix. For this problem they show an instance dependent bound that depends on S, K, r only through (S + K)poly(r). Jun et al. [2019] dropped the extra conditions on the mean reward matrix besides assuming that it has low rank. Jang et al. [2021] gave the first upper bound of the form O((S + K)r T), though with inefficient algorithms, which was the first bound better than the naive bound O(SK T).4 Later, Lu et al. [2021] dropped the condition on the action matrices and also extended the results to the generalized linear setting (they assume that both the action matrices and the parameter matrix has a constant Frobenius norm bound). The regret bound of Jun et al. [2019], Lu et al. [2021] takes the form O((S + K)3/2 r T), which is still worse than the earlier mentioned naive bound. Lattimore and Hao [2021] prove that (up to logarithmic factors) when S = K and both the action and reward matrices are rank one and symmetric, the minimax regret is of order S Recently, Kang et al. [2022] improved the previous state-of-the-art for the generalized linear setting. The new regret bound, which they prove under a mild extra assumption, takes the form O((S + K)r T) and is conjectured to match the order of the minimax regret. 
Noticing that the minimax lower bounds developed for the finite-armed stochastic bandits (e.g., Exercise 15.2 of Lattimore and Szepesvári 2020) is applicable to the rank-one setting (as the proofs use actions and rewards that are rank one, one-hot matrices), we see that the regret is at least Ω( SKT) in these problems. As max(S, K)2 = max(S, K) S + K, this lower bound rules out upper bounds of the form O( p (S + K)poly(r)T), though it is compatible with the upper bound mentioned above.5 Jain and Pal [2022] study a related problem where the learner chooses one action per context in each round and incurs a reward for each of the actions. They also propose an epsilon-greedy type algorithm and show that its regret scales as T 2/3polylog(S + K) provided some technical conditions hold, one of which is that the condition number of the reward A when viewed as a matrix is constant. The main limitation of this work is that the technical conditions are restrictive and the problem is easier as in each round an observation is available for each context, whereas in our setting the contexts arrive at random from a distribution which may be very far from the uniform distribution, which, as we shall see, will require extra care. A.2 Bandit meta-learning Bandit meta-learning is concerned with learning a lower dimensional subspace across bandit tasks [Kveton et al., 2020, 2021, Cella et al., 2020], while we consider only a single task. 4To get this naive bound, following an argument of Jang et al. [2021], notice that the problem is an instance of d-dimensional linear bandits with d = SK with actions and parameters belonging to the unit sphere, in which case the minimax regret, up to logarithmic factors, is of order d T (e.g., Theorem 24.2, Lattimore and Hao 2021). If the action set has one-hot matrices only, this bound improves to d T, as discussed before. 5Here, f(S, K) g(S, K) means that f, g are within a constant factor of each other, while f(S, K) g(S, K) means that f(S, K) = o(g(S, K)) as S, K . A.3 Contextual bandit problems Contextual bandit problems, introduced by Auer et al. [2002], can be seen as a special case of the prediction with expert advice problem, which is an online problem. In a prediction with expert advice problem, the learner is given access to the recommendations of N experts in the form of distributions over the K actions. The learner still needs to choose an action in each round with the goal of competing with the total reward collected by the best expert in hindsight. Auer et al. [2002] consider the adversarial case when in each round, each action is assigned a reward in an arbitrary way from a known, finite interval. The authors propose the EXP4 algorithm that is shown to achieve O( KT log N) regret in T rounds of interactions. Beygelzimer et al. [2011] extended this result to control the regret with high probability, while Dudik et al. [2011] introduced a computationally efficient variant assuming a computationally efficient cost-sensitive classification oracle. To reduce an r-lumpable contextual bandit problem to prediction with expert advice, we need to choose the experts. The obvious choice here is that an expert is a [S] [K] map that is the composition of an [S] [r] map followed by an [r] [K] map. Denoting by N the number of such experts, we see that log(N) = S log(r) + r log(K), and hence the regret of EXP4 is of order Ω( SKT), which is not better than not using the lumpability structure. 
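Spelling out this count (a routine calculation, not quoted from the paper): there are $r^S$ ways to map contexts to groups and $K^r$ ways to map groups to actions, so

$$
N = r^{S} K^{r}
\;\Longrightarrow\;
\log N = S \log r + r \log K
\;\Longrightarrow\;
\sqrt{KT \log N} = \sqrt{T\left(SK \log r + rK \log K\right)} \ \ge\ \sqrt{SKT \log r}\,,
$$

which for $r \ge 2$ is no smaller than the naive $\tilde O(\sqrt{SKT})$ rate, confirming that the EXP4 reduction brings no improvement.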
Another line of work focuses on oracle-based contextual bandits [Foster and Rakhlin, 2020, Simchi Levi and Xu, 2022]. In this setting, the learner has access to function class F and the optimal regret is O( p KT log(|F|)), where |F| is the size of the function class. However, with the same argument, we can conclude |F| = Ω(S) even under the lumpable assumption. Therefore, the results in this direction also give no direct implication to the problem we consider. A.4 Stochastic linear bandits with changing action sets In this setting, in each round t = 1, 2, . . . the learner first receives K d-dimensional vectors, xt,1, . . . , xt,K, each corresponding to an action. Choosing action j gives a reward whose mean (given the past) is x t,jθ for some unknown parameter vector θ Rd. For our case, one can choose d = SK: θ can be the flattening of A and xt,j is a unit vector so that x t,jθ = A(it, j). Applying the Sup Lin Rel algorithm from Section 4 of Auer [2002] to this setting, we get the regret bound d T log3(KT)), which shows no improvement compared to ignoring lumpability. A.5 Matrix completion problems The offline version of our problem is closely related to matrix completion problems, where the goal is to reconstruct a matrix with missing values under the low-rank constraint [Arora et al., 2012, Jain et al., 2013, Chen et al., 2020]. Based on these ideas, Sen et al. [2016] propose an epsilon-greedy algorithm for the contextual low-rank bandit problem and show that its regret is of order T 2/3(S poly(r, log K))1/3. This result holds under an assumption that the reward matrix has a nonnegative decomposition A = UV where entries of U and V are all nonnegative. If A is nonnegative valued itself, r-lumpability of the contexts implies that such a decomposition exist. However, the main result of this work needs additional conditions. In particular, the context distribution needs to be near-uniform, and due to their restricted isometry property assumptions, the context groups need to be of approximately the same size. In particular, if all context groups except one have a single member, their bound degrades to the trivial bound mentioned earlier. A.6 Lumpable Markov decision processes Lumpable MDPs in contemporary literature on machine learning are referred to as block MDPs [Du et al., 2019]. Learning to act in a block MDPs has been the subject a numerous recent papers [Dann et al., 2018, Du et al., 2019, Misra et al., 2020, Feng et al., 2020, Zhang et al., 2020, 2022]. Furthermore, learning to act in a block MDP is a special case of the so-called low-rank setting, which was also heavily studied [Jiang et al., 2017, Uehara et al., 2022, Modi et al., 2021, Papini et al., 2021, Zhang et al., 2021, Misra et al., 2020], both in the online and in the PAC settings. The learner in these problems is given a set of feature maps with the promise that at least one of the feature maps will allow a factored (low-rank) representation of the transition dynamics and the reward. Despite the significant effort that went into studying this problem, in our setting none of the existing results improve upon the naive bound that one can get for non-lumpable problems. Duan et al. [2019], Ni et al. [2021] study the offline setting, where the considerations are rather different. Finally, Kwon et al. [2021] consider the problem when over multiple episodes, a learner interacts with a finite-horizon MDP, which is chosen at random from one of r such unknown MDPs. 
The learner is given no additional information about the hidden identity of the MDP that it faces. The challenge here is that even if the MDPs were known, the problem of acting optimally is nontrivial, while in our case this is not a problem. As such, while superficially, the problems may look similar, they are quite different. B Auxiliary Lemmas In this section, we show some probabilistic lemmas that will be useful to prove our results. The following lemma is a high-probability version of the classical coupon collector s problem , which says that the expected number of coupons required to draw with replacement is Θ(K log K) in order to get each of K coupon at least once. Lemma 6. Given a set K with |K| K and consider M i.i.d samples drawn from unif(K). Then with probability 1 δ , every element in K appears at least once in these samples as long as M K log(K/δ ). Proof. Fix an element j K. The probability that j appears in none of the M samples can be bounded by A union bound on every j in K finishes the proof. Lemma 7 (Concentration of Subgaussian random variables). Let X1 µ, . . . , XM µ be a sequence of independent 1-subgaussian random variables and bµ = 1 M PM m=1 Xm. Then with probability 1 δ , δ < 1 Lemma 8 (Bernstein s inequality). Let X1, . . . , XM be a sequence of independent random variables with Xm E[Xm] b almost surely for every m and v = PM m=1 Var[Xm]. With probability 1 δ , we have m=1 E[Xm] + p 2v log(1/δ ) + 2b 3 log(1/δ ) The following lemma is a direct consequence of Bernstein s inequality, which states, in terms of the coupon collector s problem, that after drawing M coupons, one can expect with a high probability a particular coupon appears at least Mp/2 times if its probability of appearing is p. Lemma 9. Let X1, . . . , XM be i.i.d Bernoulli random variables so that E[Xm] = p for all m. With probability 1 δ , we have PM m=1 Xm Mp/2 as long as Mp 16 log(1/δ ). Proof. Let Ym = 1 Xm We have E[Ym] = 1 p. By Lemma 8, with probability 1 δ , we have m=1 Xm M (1 p) + p 2Mp(1 p) log(1/δ ) + log(1/δ ). Rearranging terms and using Mp 16 log(1/δ ) gives m=1 Xm Mp p 2Mp(1 p) log(1/δ ) log(1/δ ) C Proofs for Section 3 We first provide an outline of the proof in Appendix C.1 and show the complete proofs of the lemma in the corresponding subsections. C.1 Proof of Theorem 1 To begin with, we assign adaptive target optimality (accuracy) to each block according to their block size: ε, for b [r]. The following lemma states that as long as W contains an e O (εb)-optimal action for each block b [r], then the output policy is e O (ε)-optimal with high probability. Lemma 10. For a positive constant C, suppose for any b [r] with ω(b) ε/r, W contains a Cεb-optimal action, then πout is 2(C +1)ε-optimal for the original context-lumpable bandit problem. As a result, if we can prove that the precondition of Lemma 10 holds with high probability, then the correctness of Algorithm 1 follows immediately. To do so, we first argue that for every block b [r], we have sufficiently explored the entire action set at accuracy level n = log(1/ε2 b) in the data collection step, as stated in the next lemma. Lemma 11. With probability at least 1 2δ, for any b [r] with ω(b) ε/r, we have that: for any j [K], there exists (i, j) Dn satisfying g(i) = b where n = log(1/ε2 b) . Intuitively, Lemma 11 says that we have tested all actions on each block b [r] at the corresponding accuracy level n = log(1/ε2 b) . 
Based on Lemma 11, the following lemma states that in the second step we are able to filter out an e O (εb)-optimal action for block b by using the information contained in Dn. Lemma 12. With probability at least 1 4δ, for any b [r] with ω(b) ε/r, a 10εb p log(r SK/δ)- optimal action of block b is added into W in the process of shrinking Dn where n = log(1/ε2 b) . Now we have proved the correctness of Algorithm 1. However, to obtain the desired sample complexity, it remains to show that the size of the optimal action candidate set W is relatively small, which we prove next. Lemma 13. With probability at least 1 4δ, |W| Nr. Now we are ready to prove Theorem 1 by combining all the above lemmas. Proof of Theorem 1. The step of data collection uses 8 1 + log 1 samples. By Lemma 13, the step of screening optimal action candidates uses 2(162) 1 + log 1 samples and solving the simplified context-lumpable bandit problem uses 4 1 + log 1 samples. Overall, the number of samples is bounded by 524 log r SK 1 + 2 log 1 By Lemma 10 and Lemma 12, πout is 2 10 p log(r SK/δ) + 1 ε-optimal. C.2 Proof of Lemma 10 Proof. We control the suboptimality of πout in the following way: X i [S] ν(i) max j [K] A(i, j) A(i, πout(i)) b [r]: ω(b) ε/r i B(b) ν(i) max j W A(i, j) A(i, πout(i)) + Cεb b [r]: ω(b)<ε/r ω(b) b [r] ω(b)εb + ε 2(C + 1)ε, where the first inequality uses the precondition of the lemma, the second one uses the ε-optimality of πout in the simplified context-lumpable bandit, and the final one follows from Cauchy-Schwarz inequality. C.3 Proof of Lemma 11 Proof. Fix a block b [r]. For each accuracy level, recall in the step of data collection, we sample L = r(S + K)elg/ε2 contexts i.i.d. from ν. By Lemma 9, with probability 1 δ r, at least Lω(b)/2 of them are from block b as Lω(b) Lε r elg 16 log(r/δ). Further, recall that we define accuracy level n = log(1/ε2 b) . We first show that this n is well defined, that is, n 1. Indeed, we have 1 ε2 b = 1 ε2/(rω(b)) ε as long as ε 1 2. Since we add a context-action pair into Dn once we have collected 2n = 2 log(1/ε2 b) < 2log(1/ε2 b)+1 = 2/ε2 b samples for estimating its reward. Note that in the end there are at most |B(b)|(2n 1) samples from block b unused. Thus, with probability 1 δ r, there are at least Lω(b)/2 |B(b)|(2n 1) samples used. Therefore, the number of context-action pairs, where the contexts are from block b, that are added into Dn is at least Lω(b)/2 |B(b)|(2n 1) 2/ε2 b S (2n 2/ε2 b and |B(b)| S) Lε2/(2rε2 b) 2/ε2 b S (by definition of εb) = 16(S + K) log(r SK/δ) S (the value of L) K log(r SK/δ). Conditioned on this event, with probability 1 δ r, for any j [K], there exists (i, j) Dn satisfying g(i) = b by Lemma 6. Therefore, the lemma holds for block b with probability at least 1 2δ r . We complete the proof by a union bound on all blocks. C.4 Proof of Lemma 12 Proof. Denote by i the first context being removed from Dn, which satisfies g(i) = b. By the rule of shrinking Dn, we have | e An(i, j ) b An(i , j )| 4 log(r SK/δ) By Lemma 7 and a union bound on all pairs in Dn, with probability 1 2δ/r, we have | e An(i , j ) A(i , j )| 1 log(r SK/δ) 2n and | b A(i , j ) An(i , j )| 1 log(r SK/δ) for every (i , j ) Dn. Therefore, we have |A(i, j ) A(i , j )| 5 log(r SK/δ) By Lemma 11, we know that for any j [K], there exists (i , j) Dn satisfying (g(i ), j) = (b, j), before we remove context i. Therefore, by the definition of (i , j ), for any j [K], there exists (i , j) Dn satisfying g(i ) = b and b An(i , j ) b An(i , j), which, again by Eq. 
(5), implies that with probability $1-2\delta/r$,
$$A(i^*,j^*) \ge \max_{j}\ \mu(b,j) - 5\sqrt{\frac{\log(rSK/\delta)}{2^n}}. \quad (7)$$
By combining Eq. (6) and Eq. (7), we conclude that $j^*$ is a $10\sqrt{\log(rSK/\delta)}\,\varepsilon_b$-optimal action for block $b$, because
$$5\sqrt{\frac{\log(rSK/\delta)}{2^n}} + 5\sqrt{\frac{\log(rSK/\delta)}{2^n}} \le 10\sqrt{\log(rSK/\delta)\,\varepsilon_b^2} = 10\sqrt{\log(rSK/\delta)}\,\varepsilon_b,$$
using $2^n \ge 1/\varepsilon_b^2$. This completes the proof after a union bound over all blocks.

C.5 Proof of Lemma 13

Proof. It suffices to show that for each accuracy level $n$, we add at most $r$ actions into $\mathcal W$ before shrinking $\mathcal D_n$ to the empty set. Below we prove a stronger claim: each time we shrink $\mathcal D_n$, we remove all the contexts of at least one block from $\mathcal D_n$. Denote by $(i,j)$ an arbitrary context-action pair from $\mathcal D_n$ such that $g(i) = g(i^*)$. Then, by Eq. (5), we have
$$|\tilde A_n(i,j^*) - \hat A_n(i^*,j^*)| \le |\tilde A_n(i,j^*) - A(i,j^*)| + |A(i^*,j^*) - \hat A_n(i^*,j^*)| \le 4\sqrt{\frac{\log(rSK/\delta)}{2^n}},$$
where we used $A(i,j^*) = A(i^*,j^*)$ since $g(i)=g(i^*)$. This implies that all context-action pairs whose context is from block $g(i^*)$ are removed from $\mathcal D_n$.

C.6 Proof of Theorem 2

Proof. By Lemma 9 and a union bound over $[S]$, we have that with probability $1-\delta$, for all $i\in[S]$,
$$\frac{\nu(i)}{2} - \frac{\log(S/\delta)}{N} \le \hat\nu(i) \le 2\,\nu(i) + \frac{\log(S/\delta)}{N},$$
which implies that $\nu(i)\ge\varepsilon/(4S)$ whenever $\hat\nu(i)\ge\varepsilon/S$, and $\nu(i)\le3\varepsilon/S$ whenever $\hat\nu(i)<\varepsilon/S$. By definition, for any $l\in[L-1]$, $\mathcal X_l$ consists of the contexts with $\hat\nu(i) > 2^{l-1}\varepsilon/S$. As a result, for any $l\in[L-1]$, by Lemma 9, at least $N\nu(\mathcal X_l)/2$ out of the $N$ samples correspond to $\mathcal X_l$ and can thus be used to learn $\pi_l$ by executing Algorithm 1, where $\nu(\mathcal X_l) := \sum_{i\in\mathcal X_l}\nu(i) \ge \varepsilon/(4S)$. Moreover, notice that inside each $\mathcal X_l$ the conditional context distribution is almost uniform, so we can invoke Theorem 1 to obtain that $\pi_l$ is $\frac{C\varepsilon}{\sqrt{\nu(\mathcal X_l)}}$-optimal over the contexts in $\mathcal X_l$, where $C = 2\bigl(10\sqrt{\log(rSK/\delta)}+1\bigr)$. This means that
$$\sum_{i\in\mathcal X_l}\nu(i)\Bigl(\max_j A(i,j) - A(i,\pi_l(i))\Bigr) \le C\varepsilon\sqrt{\nu(\mathcal X_l)}.$$
As a result, the total suboptimality of $\pi_{\mathrm{out}}$ can be upper bounded as follows:
$$\sum_{i\in[S]}\nu(i)\Bigl(\max_{j\in[K]}A(i,j) - A(i,\pi_{\mathrm{out}}(i))\Bigr) \le \sum_{l\in[L-1]}\sum_{i\in\mathcal X_l}\nu(i)\Bigl(\max_j A(i,j) - A(i,\pi_l(i))\Bigr) + \varepsilon \le \sum_{l\in[L-1]} C\varepsilon\sqrt{\nu(\mathcal X_l)} + \varepsilon \le C\varepsilon\sqrt{L-1} + \varepsilon,$$
where the final step uses the Cauchy-Schwarz inequality and $L\le\log(S/\varepsilon)+1$.

C.7 Proof of Theorem 3

Proof. Since $r\le S$, we have
$$\Omega(\min\{r(S+K),\,SK\}) = \begin{cases}\Omega(SK), & r\ge K,\\ \Omega(rS), & r<K \text{ and } S\ge K,\\ \Omega(rK), & r<K \text{ and } S<K.\end{cases}$$
Case (1). We construct the following hard instance: the reward after pulling action $j\in[K]$ in block $b$ is sampled from $\mathrm{Bernoulli}\bigl(1/2+\varepsilon\,\mathbb 1(b=j)\bigr)$; the context distribution is uniform over $[S]$; and for each $i\in[S]$, we sample $g(i)$ uniformly at random from $[K]$. This is the standard instance commonly used for proving lower bounds for contextual bandits with $S$ contexts and $K$ arms [e.g., Lattimore and Szepesvári, 2020], and learning an $\varepsilon$-optimal policy for it requires at least $\Omega(SK/\varepsilon^2)$ samples.

Case (2). We only need to slightly modify the above hard instance: the reward after pulling action $j\in[K]$ in block $b\in[r]$ is sampled from $\mathrm{Bernoulli}\bigl(1/2+\varepsilon\,\mathbb 1(b=j)\bigr)$; the context distribution is uniform over $[S]$; and for each $i\in[S]$, we sample $g(i)$ uniformly at random from $[r]$. The modified problem is equivalent to the original one with $S$ contexts but $r$ arms. As a result, a lower bound of the form $\Omega(Sr/\varepsilon^2)$ holds.

Case (3). Similarly, we slightly modify the first hard instance: for each block $b\in[r]$, sample $j^*_b$ uniformly at random from $[K]$; the reward after pulling action $j\in[K]$ in block $b\in[r]$ is sampled from $\mathrm{Bernoulli}\bigl(1/2+\varepsilon\,\mathbb 1(j=j^*_b)\bigr)$; the context distribution is uniform over $[r]$ and $g(i)=\min\{i,r\}$. The modified problem is equivalent to the original one with $K$ arms but $r$ contexts. As a result, a lower bound of the form $\Omega(Kr/\varepsilon^2)$ holds.
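To make the lower-bound construction concrete, the snippet below builds a Case (2)-style instance: uniform contexts, a latent block $g(i)\sim\mathrm{unif}([r])$ per context, and Bernoulli rewards with a single $\varepsilon$-elevated arm per block. It is a purely illustrative sketch; the function names are ours and not part of the paper.

```python
import numpy as np

def make_hard_instance(S, K, r, eps, rng):
    """Sketch of the Case (2)-style hard instance from the proof of Theorem 3:
    contexts are uniform over [S], each context has a latent block
    g(i) ~ unif([r]), and pulling action j in block b yields a
    Bernoulli(1/2 + eps * 1{j == b}) reward (assumes eps <= 1/2)."""
    g = rng.integers(0, r, size=S)        # latent block of each context
    means = 0.5 + eps * np.eye(r, K)      # means[b, j] = 1/2 + eps * 1{j == b}

    def sample_context():
        return int(rng.integers(0, S))

    def step(i, j):
        """Sample a reward for playing action j in context i."""
        return int(rng.binomial(1, means[g[i], j]))

    return g, means, sample_context, step

# toy usage
rng = np.random.default_rng(0)
g, means, sample_context, step = make_hard_instance(S=30, K=10, r=5, eps=0.1, rng=rng)
i = sample_context()
print(i, g[i], step(i, j=g[i]))           # pulling the block's elevated arm
```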
D Proof of Theorem 4

In this section, we first present the key lemmas used to prove Theorem 4; the proofs of these lemmas are deferred to Appendix E. In the following discussion, we consider a single phase (i.e., we fix an error level $\varepsilon_h$). As in the analysis of other phased elimination algorithms, we have to show that in the phase specified by error level $\varepsilon_h$, with high probability, (i) the optimal arm is not eliminated and (ii) every $\omega(\tilde\varepsilon_h)$-suboptimal arm is eliminated, that is, all arms in $\mathrm{GOOD}_h(C)$, for all $C$, are $O(\tilde\varepsilon_h)$-optimal.

The following lemma is the counterpart of Lemma 11; it ensures that every arm is played in every block.

Lemma 14. With probability at least $1-2\delta_h$, for any cluster $C\in\mathcal P_h$ and any block $b$ with $\mathcal B(b)\subseteq C$, we have that for any $j\in\mathrm{GOOD}_h(C)$ there exists $(i,j)\in\mathcal D_h$ satisfying $g(i)=b$.

The above lemma ensures that every block has at least one context $i$ assigned to include $(i,j)$ in $\mathcal D_h$; thus, every arm is explored in every block. Next, we define an event $\mathcal E_h$ under which the estimates $\hat A_h$ are accurate:
$$\mathcal E_h = \Bigl\{\,|\hat A_h(i,j) - A(i,j)| \le \frac{\tilde\varepsilon_h}{4},\ \forall (i,j)\in\mathcal D_h\Bigr\}.$$
The precision level $\tilde\varepsilon_h/4$ is finer than what the elimination step requires and will be helpful in the analysis. The following lemma states that $\mathcal E_h$ is a high-probability event.

Lemma 15. Event $\mathcal E_h$ holds with probability $1-\delta_h$.

Next, let $j_b = \arg\max_{j\in[K]}\mu(b,j)$ be the optimal arm in block $b$. The next lemma says that the optimal arm $j_b$ is not eliminated during the execution of the algorithm.

Lemma 16. Assume that action $j_b\in\mathrm{GOOD}_h(C)$ for every block $b\in[r]$ and its corresponding cluster $C\in\mathcal P_h$ with $\mathcal B(b)\subseteq C$. Then for every block $b\in[r]$, $j_b$ is not eliminated from $\mathrm{GOOD}_{h+1}(C)$ with probability at least $1-3\delta_h$.

The high-level idea of the proof is that the error of the estimated means is smaller than $\tilde\varepsilon_h$, so $\hat A_h(i,j_b)$ cannot be much worse than that of the other arms, given that its true mean is the largest. The next lemma shows that the arms in $\mathrm{GOOD}_{h+1}(C)$ are all $O(\varepsilon_h)$-optimal. Formally, we say an arm $j$ in block $b$ is $\varepsilon$-optimal if $\max_{j'}\mu(b,j') - \mu(b,j) \le \varepsilon$; similarly, we say an arm $j$ in block $b$ is $\varepsilon$-suboptimal if $\max_{j'}\mu(b,j') - \mu(b,j) > \varepsilon$.

Lemma 17. For any block $b\in[r]$ and its corresponding cluster $C\in\mathcal P_h$ with $\mathcal B(b)\subseteq C$, all $3\tilde\varepsilon_h$-suboptimal arms in block $b$ are eliminated from $\mathrm{GOOD}_{h+1}(C)$ with probability at least $1-3\delta_h$. Consequently, only $6\tilde\varepsilon_h$-optimal arms in block $b$ are played in phase $h+1$ for every context in block $b$.

Finally, we need to argue that Algorithm 5 is not called too many times. To show this, we first provide a general guarantee for Algorithm 5.

Lemma 18. Suppose there are contexts $i_u\in\mathcal B(b_u)\cap C$ in block $b_u$ and $i_l\in\mathcal B(b_l)\cap C$ in block $b_l$ such that
$$A(i_u,j) - A(i_l,j) = \mu(b_u,j) - \mu(b_l,j) > 3r\sqrt{\tilde\ell}\,\varepsilon',$$
where $\varepsilon'$ is the accuracy parameter of Algorithm 5. Then Algorithm 5 separates $\mathcal B(b_u)$ and $\mathcal B(b_l)$ perfectly with probability at least $1-2\delta'$. In other words, there exist indices $c_u$ and $c_l$ with $c_u\neq c_l$ such that $\mathcal B(b_u)\subseteq P_{c_u}$ and $\mathcal B(b_l)\subseteq P_{c_l}$.

Consequently, with high probability, Algorithm 4 makes progress in clustering the contexts every time it calls Algorithm 5, and the number of calls is bounded by $r$, as shown in the following lemma.

Lemma 19. When Algorithm 5 is called by Algorithm 4, it separates at least two blocks and never separates contexts in the same block, with probability at least $1-5\delta_h$. Consequently, Algorithm 5 is called at most $r$ times.
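Lemma 18 concerns a splitting rule of the following flavor (our reading of the rule analyzed via Eq. (9) in Appendix E.5): sort the contexts of a cluster by their estimated mean reward on a probe arm and cut wherever two consecutive estimates differ by more than a threshold. The snippet below is a minimal, purely illustrative sketch under that reading; the function name and toy numbers are ours.

```python
import numpy as np

def split_by_gaps(est_means, threshold):
    """Illustrative splitting rule in the spirit of Lemma 18 / Algorithm 5:
    sort contexts by their estimated mean on one probe arm and start a new
    cluster whenever two consecutive estimates differ by more than
    `threshold`. Returns a list of clusters (lists of context indices)."""
    order = np.argsort(-est_means)            # contexts, best to worst
    clusters, current = [], [int(order[0])]
    for prev, cur in zip(order[:-1], order[1:]):
        if est_means[prev] - est_means[cur] > threshold:
            clusters.append(current)          # large gap: close the cluster
            current = []
        current.append(int(cur))
    clusters.append(current)
    return clusters

# toy usage: two blocks whose means on the probe arm differ by much more than
# the estimation noise are recovered exactly
rng = np.random.default_rng(0)
true = np.array([0.9, 0.9, 0.9, 0.2, 0.2])     # blocks {0,1,2} and {3,4}
est = true + rng.normal(scale=0.02, size=5)    # noisy estimates
print(split_by_gaps(est, threshold=0.1))       # [[0, 1, 2], [3, 4]] up to within-cluster order
```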
We are now ready to bound the regret.

Proof of Theorem 4. Since there are at most $O(\log T)$ phases, it suffices to bound the regret within a single phase $h$. Conditioned on all the high-probability good events of the previous lemmas, we first bound the total number of timesteps spent in phase $h$. In the data collection stage, we use at most $L_h = r(S+K)\tilde\ell_h/\varepsilon_h^2$ samples. As for the clustering stage, by Lemma 19 we know that the total length of executing Algorithm 5 is at most
$$r\cdot S\tilde\ell/\varepsilon'^2 = 16Sr^3\tilde\ell/\varepsilon_h^2 = \tilde O\Bigl(\frac{r^3S}{\varepsilon_h^2}\Bigr),$$
recalling that Algorithm 4 calls Algorithm 5 with $\varepsilon' = \varepsilon_h/(4r)$. Thus, the length of phase $h$ is the minimum of $T$ and $\tilde O\bigl(\frac{r^3(S+K)}{\varepsilon_h^2}\bigr)$. Therefore, by Lemma 17, the regret of the phase is at most (recall that $\tilde\varepsilon_h = \tilde O(\varepsilon_h)$, as it equals $\varepsilon_h$ up to a square-root-logarithmic factor)
$$6\tilde\varepsilon_h\min\Bigl\{T,\ \tilde O\Bigl(\frac{r^3(S+K)}{\varepsilon_h^2}\Bigr)\Bigr\} = \min\Bigl\{6T\tilde\varepsilon_h,\ \tilde O\Bigl(\frac{r^3(S+K)}{\varepsilon_h}\Bigr)\Bigr\} = \min\Bigl\{\tilde O(T\varepsilon_h),\ \tilde O\Bigl(\frac{r^3(S+K)}{\varepsilon_h}\Bigr)\Bigr\} \le \tilde O\Bigl(\sqrt{r^3(S+K)T}\Bigr).$$
On the other hand, the bad events happen with probability $O(\delta_h) = O\bigl(\varepsilon_h^2/(r^3SK)\bigr)$, in which case they contribute at most $\tilde O(1)$ to the regret.

E Missing Proofs in Appendix D

E.1 Proof of Lemma 14

Proof. Note that under the uniform-block assumption, we have $\omega(b) = \frac1r$. Thus, the proof follows directly from the proof of Lemma 11 with $\varepsilon_b = \varepsilon_h$, $\delta = \varepsilon_h^2$, and $n = n_h$. The only difference is that when applying Lemma 6, Lemma 11 uses $\mathcal K = [K]$, whereas here we need $\mathcal K = \mathrm{GOOD}_h(C)$.

E.2 Proof of Lemma 15

Proof. Fix an $(i,j)$ pair. Applying Lemma 7, we have that with probability $1-\frac{\delta_h}{SK}$,
$$|\hat A_h(i,j) - A(i,j)| \le \sqrt{\frac{2\log(2SK/\delta_h)}{2^{n_h}}},$$
which is at most $\tilde\varepsilon_h/4$ by the choice of $n_h$. We complete the proof by applying a union bound over all $(i,j)$ pairs in $\mathcal D_h$.

E.3 Proof of Lemma 16

Proof. We prove the claim by contradiction. Assume that $j_b$ is removed from $\mathrm{GOOD}_h(C)$. Let $j^* = \arg\max_{j\in\mathrm{GOOD}_h(C)}\hat\mu_h(C,j)$ be the action achieving the highest empirical mean. By Lemma 14, there exists a context $i'\in\mathcal B(b)$ such that $(i',j^*)\in\mathcal D_h$. Moreover, the assumption that $j_b$ is eliminated implies that the condition at Line 7 of Algorithm 4 does not hold for the current partition $\mathcal P_h$, which further implies that there exists a context $i\in C$ such that
$$\hat A_h(i,j^*) - \hat A_h(i',j^*) = \hat\mu_h(C,j^*) - \hat A_h(i',j^*) < \tilde\varepsilon_h.$$
On the other hand, by the assumption that $j_b$ is eliminated, again by Lemma 14 we have that there exists a context $i''\in\mathcal B(b)$ with $(i'',j_b)\in\mathcal D_h$ such that
$$\hat\mu_h(C,j^*) - \hat A_h(i'',j_b) \ge \hat\mu_h(C,j^*) - \hat\mu_h(C,j_b) > 2\tilde\varepsilon_h.$$
Combining the two inequalities, we have $\hat A_h(i'',j_b) + \tilde\varepsilon_h < \hat\mu_h(C,j^*) - \tilde\varepsilon_h < \hat A_h(i',j^*)$. This means that $\mu(b,j_b) + \frac{\tilde\varepsilon_h}{2} < \mu(b,j^*)$ under $\mathcal E_h$, which contradicts the optimality of $j_b$. Therefore, we conclude that $j_b$ is not eliminated.

E.4 Proof of Lemma 17

Proof. By Lemma 16, $j_b$ is not eliminated for any block $b\in[r]$ with probability at least $1-3\delta_h$. Thus, for any block $b\in[r]$ and arm $j\in\mathrm{GOOD}_h(C)$ with $\mu(b,j_b) - \mu(b,j) > 3\tilde\varepsilon_h$, we have
$$\max_{j'}\hat\mu_h(C,j') - \hat\mu_h(C,j) \ge \hat\mu_h(C,j_b) - \hat\mu_h(C,j) \ge \mu(b,j_b) - \mu(b,j) - \frac{\tilde\varepsilon_h}{2} > 2\tilde\varepsilon_h.$$
Therefore, $j$ is eliminated at Line 14.
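The elimination rule analyzed in Lemmas 16 and 17 (keep arm $j$ only if its empirical cluster mean is within $2\tilde\varepsilon_h$ of the best empirical mean, while every estimate is accurate to $\tilde\varepsilon_h/4$ under $\mathcal E_h$) can be sanity-checked numerically. The following is a purely illustrative sketch of that rule; the names and toy parameters are ours.

```python
import numpy as np

def eliminate(good, mu_hat, eps_tilde):
    """Elimination rule in the spirit of Lemmas 16-17 (our reading): keep arm j
    iff its empirical mean is within 2*eps_tilde of the best empirical mean."""
    best = max(mu_hat[j] for j in good)
    return {j for j in good if best - mu_hat[j] <= 2 * eps_tilde}

# sanity check: with estimation error at most eps_tilde/4 (event E_h),
# the optimal arm survives and every 3*eps_tilde-suboptimal arm is removed
rng = np.random.default_rng(1)
K, eps_tilde = 20, 0.1
mu = rng.uniform(0, 1, size=K)                              # true means of one block
mu_hat = mu + rng.uniform(-eps_tilde / 4, eps_tilde / 4, size=K)
kept = eliminate(set(range(K)), mu_hat, eps_tilde)
assert int(np.argmax(mu)) in kept                            # role of Lemma 16
assert all(mu.max() - mu[j] <= 3 * eps_tilde for j in kept)  # role of Lemma 17
print(sorted(kept))
```

Both assertions hold deterministically whenever the estimation errors are bounded by $\tilde\varepsilon_h/4$, mirroring the two guarantees above.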
E.5 Proof of Lemma 18

Proof. With probability $1-\frac{\delta_h}{S}$, context $i$ receives at least $2/\varepsilon'^2$ samples by Lemma 9. Thus, $\hat A(i,j)$ is well defined for every $i\in C$ with probability $1-\delta_h$. Moreover, by Lemma 7 and a union bound, for every context $i$, with probability at least $1-\delta_h$,
$$|A(i,j) - \hat A(i,j)| \le \frac{\sqrt{\tilde\ell}\,\varepsilon'}{2}. \quad (8)$$
Clearly, we have $\hat A(i_u,j) \ge \hat A(i_l,j)$ under Eq. (8). Consequently, to simplify the notation, we modify the labels of contexts and blocks as follows. First, we restrict the game to $C$, where there are $S'$ contexts and $r'$ blocks; we also relabel the contexts so that $i_u = 1$, $i_l = S'$, and $\hat A(i_u,j) = \hat A(1,j) \ge \hat A(2,j) \ge \cdots \ge \hat A(S'-1,j) \ge \hat A(S',j) = \hat A(i_l,j)$. Finally, for a context $i\in[S']$ with block $b = g(i)$, we define $\bar\mu(b) = \max_{i'\in[S']:\,g(i')=b}\hat A(i',j)$ and $\underline\mu(b) = \min_{i'\in[S']:\,g(i')=b}\hat A(i',j)$, and we relabel the blocks so that $b_u = 1 \le g(i) \le r' = b_l$ and $\bar\mu(b_u) = \bar\mu(1) \ge \bar\mu(2) \ge \cdots \ge \bar\mu(r')$, with $\underline\mu(b_l) = \underline\mu(r')$. It is not hard to see that this modification is without loss of generality. We show next that there exists a context $i$ with $i_u < i \le i_l$ such that the condition at Line 7 of Algorithm 5 holds, that is,
$$\hat A(i-1,j) - \hat A(i,j) \ge \sqrt{\tilde\ell}\,\varepsilon'. \quad (9)$$
We prove this by contradiction: assume that Eq. (9) does not hold for any such $i$. Then we have
$$A(i_u,j) - A(i_l,j) \le |A(i_u,j) - \bar\mu(b_u)| + |A(i_l,j) - \underline\mu(b_l)| + \bar\mu(b_u) - \underline\mu(b_l)$$
$$\le \sqrt{\tilde\ell}\,\varepsilon' + \sum_{b=1}^{r'}\bigl(\bar\mu(b) - \underline\mu(b)\bigr) + \sum_{b=1}^{r'-1}\bigl(\underline\mu(b) - \bar\mu(b+1)\bigr) \le \sqrt{\tilde\ell}\,\varepsilon' + (2r'-1)\sqrt{\tilde\ell}\,\varepsilon' \le 3r\sqrt{\tilde\ell}\,\varepsilon',$$
where each within-block spread $\bar\mu(b)-\underline\mu(b)$ is at most $\sqrt{\tilde\ell}\,\varepsilon'$ by Eq. (8) (the true means within a block are equal), and each between-block drop $\underline\mu(b)-\bar\mu(b+1)$ is at most $\sqrt{\tilde\ell}\,\varepsilon'$ because Eq. (9) fails for every $i$. This contradicts the condition that $A(i_u,j) - A(i_l,j) > 3r\sqrt{\tilde\ell}\,\varepsilon'$. Therefore, we conclude that Eq. (9) holds for some $i$.

E.6 Proof of Lemma 19

Proof. The claim follows directly from Lemma 18 and the fact that Algorithm 4 calls Algorithm 5 with $\varepsilon' = \frac{\varepsilon_h}{4r}$, so that the contexts to be separated satisfy $A(i,j) - A(i',j) \ge \tilde\varepsilon_h$, which satisfies the condition of Lemma 18.

F Non-uniform Context Distribution

We show a reduction to problems with approximately uniform context distributions. The cost of this reduction is an extra $\tilde O(\sqrt{ST})$ additive regret and an extra $O(\log(ST))$ multiplicative factor on the regret. The idea is to learn the context distribution in the first $\tilde O(\sqrt{ST})$ timesteps. With high probability, we can estimate $\nu(i)$ for any context $i$ up to a constant multiplicative error as long as $\nu(i) = \tilde\Omega(1/\sqrt{ST})$. We then split the contexts into several buckets so that contexts within the same bucket have the same probability up to a constant factor. For contexts $i$ with $\nu(i) = o(1/\sqrt{ST})$, we cannot estimate the probability properly, but we can simply ignore such contexts and suffer regret at most $O(T\cdot S\cdot 1/\sqrt{ST}) = O(\sqrt{ST})$. We then run the algorithm that handles uniform context distributions on each bucket separately. Since there are $O(\log(ST))$ buckets, the overall regret is $O(\log(ST))$ times the maximum regret over the buckets (or $\sqrt{\log(ST)}$ times, with a refined analysis using the Cauchy-Schwarz inequality).

G Proof of Theorem 5

For every phase $h$, define
$$\varepsilon_{h,b} = \max\Bigl\{1,\ \frac{1}{\sqrt{r\,\omega(b)}}\Bigr\}\,\varepsilon_h, \quad \text{for } b\in[r].$$
Also, for every phase $h$ and every level $n$, we define the corresponding level accuracy $\tilde\varepsilon_{h,n}$, which equals $2^{-n/2}$ up to a logarithmic factor. Next, we show the counterpart of Lemma 14 in the non-uniform case.

Lemma 20. With probability at least $1-2\delta_h$, for any cluster $C\in\mathcal P_h$ and any block $b$ with $\mathcal B(b)\subseteq C$, we have that for any level $n\le\lceil\log(1/\varepsilon_{h,b}^2)\rceil$ and any action $j\in\mathrm{GOOD}_{h,n}(C)$, there exists $(i,j)\in\mathcal D_h$ satisfying $g(i)=b$.

Proof. Fix a block $b\in[r]$. Recall that in the data collection step, for each accuracy level we sample $L = r(S+K)\tilde\ell_h 2^n$ contexts i.i.d. from $\nu$. By Lemma 9, with probability $1-\frac{\delta_h}{r}$, at least $L\omega(b)/2$ of them are from block $b$, since $L\omega(b) \ge 16\log(r/\delta_h)$ by the near-uniform context distribution assumption and the value of $L$. We add a context-action pair into $\mathcal D_n$ once we have collected $2^n \le 2^{\lceil\log(1/\varepsilon_{h,b}^2)\rceil} < 2^{\log(1/\varepsilon_{h,b}^2)+1} = 2/\varepsilon_{h,b}^2$ samples for estimating its reward, and at the end at most $|\mathcal B(b)|(2^n-1)$ samples from block $b$ remain unused. Thus, with probability $1-\frac{\delta_h}{S}$, the number of context-action pairs whose contexts are from block $b$ that are added into $\mathcal D_n$ is at least
$$\frac{L\omega(b)/2 - |\mathcal B(b)|(2^n-1)}{2^n} \ge \frac{L\omega(b)/2}{2/\varepsilon_{h,b}^2} - S \quad (2^n\le2/\varepsilon_{h,b}^2 \text{ and } |\mathcal B(b)|\le S)$$
$$\ge 16(S+K)\log(rSK/\delta_h) - S \quad (\text{by definition of } \varepsilon_{h,b} \text{ and the value of } L)$$
$$\ge K\log(rSK/\delta_h).$$
Conditioned on this event, with probability $1-\frac{\delta_h}{r}$, for any $j\in\mathrm{GOOD}_{h,n}(C)$ there exists $(i,j)\in\mathcal D_n$ satisfying $g(i)=b$, by Lemma 6. Therefore, the lemma holds for block $b$ with probability at least $1-\frac{2\delta_h}{r}$. We complete the proof by a union bound over all blocks.

Now define a good event $\mathcal E_h$ as
$$\mathcal E_h = \Bigl\{\,|\hat A_{h,n}(i,j) - A(i,j)| \le \frac{\tilde\varepsilon_{h,n}}{4},\ \forall (i,j)\in\mathcal D_h,\ n\in[N_h]\Bigr\}.$$

Lemma 21. Event $\mathcal E_h$ holds with probability $1-\delta_h$.

Proof. Fix an $(i,j)$ pair and a level $n$. Applying Lemma 7, we have that with probability $1-\frac{\delta_h}{SKN_h}$,
$$|\hat A_{h,n}(i,j) - A(i,j)| \le \sqrt{\frac{2\log(SKN_h/\delta_h)}{2^{n}}},$$
which is at most $\tilde\varepsilon_{h,n}/4$ by the choice of $\tilde\varepsilon_{h,n}$. We complete the proof by applying a union bound over all $(i,j)$ pairs in $\mathcal D_h$ and all levels $n\in[N_h]$.

In the following, we present and then prove the counterpart of Lemma 16 for Algorithm 6.

Lemma 22. Assume that action $j_b\in\mathrm{GOOD}_{h,n}(C)$ for a block $b\in[r]$ with $\omega(b)\ge\frac{2\varepsilon_h}{r}$ and its corresponding cluster $C\in\mathcal P_h$ with $\mathcal B(b)\subseteq C$.
Then $j_b$ is not eliminated from $\mathrm{GOOD}_{h+1,n}(C)$ with probability at least $1-3\delta_h$, for any level $n\le\lceil\log(1/\varepsilon_{h,b}^2)\rceil$.

Proof. We prove the claim by induction on $n$. The base case is $n=1$, which satisfies the condition of Lemma 20, since $\lceil\log(1/\varepsilon_{h,b}^2)\rceil \ge \log(4) \ge 1 = n$ when $\omega(b)\ge\frac{2\varepsilon_h}{r}$. For the inductive step, we argue by contradiction. Assume that $j_b$ is eliminated from $\mathrm{GOOD}_{h,n+1}(C)$ but not from $\mathrm{GOOD}_{h,n}(C)$, for some $n\ge1$. This means that the following inequality holds:
$$\max_{j'\in\mathrm{GOOD}_{h,n}(C)}\hat\mu_{h,n}(C,j') - \hat\mu_{h,n}(C,j_b) > 2\tilde\varepsilon_{h,n}.$$
Let $j^* = \arg\max_{j\in\mathrm{GOOD}_{h,n}(C)}\hat\mu_{h,n}(C,j)$. By Lemma 20, there exists a context $i'\in\mathcal B(b)$ such that $j^*\in\psi(n,i')$, as $n\le\lceil\log(1/\varepsilon_{h,b}^2)\rceil$. Moreover, the assumption that $j_b$ is eliminated implies that the condition at Line 8 does not hold for the current partition $\mathcal P_h$, which further implies that
$$\hat\mu_{h,n}(C,j^*) - \hat A_{h,n}(i',j^*) < \tilde\varepsilon_{h,n}.$$
On the other hand, by the assumption that $j_b$ is eliminated, again by Lemma 20 we have that there exists a context $i''\in\mathcal B(b)$ with $j_b\in\psi(n,i'')$ such that
$$\hat\mu_{h,n}(C,j^*) - \hat A_{h,n}(i'',j_b) > 2\tilde\varepsilon_{h,n}.$$
Combining the two inequalities, we have $\hat A_{h,n}(i'',j_b) + \tilde\varepsilon_{h,n} < \hat A_{h,n}(i',j^*)$. This means that $\mu(b,j_b) + \frac{\tilde\varepsilon_{h,n}}{2} < \mu(b,j^*)$ by Lemma 21, which contradicts the optimality of $j_b$. Therefore, we conclude that $j_b$ is not eliminated.

Now we present a key lemma similar to Lemma 17.

Lemma 23. For any block $b\in[r]$ with $\omega(b)\ge\frac{2\varepsilon_h}{r}$ and its corresponding cluster $C\in\mathcal P_h$ with $\mathcal B(b)\subseteq C$, all $3\tilde\varepsilon_{h,n}$-suboptimal arms in block $b$ are eliminated from $\mathrm{GOOD}_{h+1,n}(C)$ for any level $n\le\lceil\log(1/\varepsilon_{h,b}^2)\rceil$, with probability at least $1-3\delta_h$. Consequently, only $6\tilde\varepsilon_{h,n}$-optimal arms in block $b$ are played in phase $h+1$ for every context in block $b$.

Proof. By Lemma 22, $j_b$ is not eliminated with high probability. Hence, for every block $b\in[r]$ with $\omega(b)\ge\frac{2\varepsilon_h}{r}$ and arm $j\in\mathrm{GOOD}_{h,n}(C)$ with $\mu(b,j_b) - \mu(b,j) > 3\tilde\varepsilon_{h,n}$, we have
$$\max_{j'}\hat\mu_{h,n}(C,j') - \hat\mu_{h,n}(C,j) \ge \hat\mu_{h,n}(C,j_b) - \hat\mu_{h,n}(C,j) \ge \mu(b,j_b) - \mu(b,j) - \frac{\tilde\varepsilon_{h,n}}{2} \ge 2.5\,\tilde\varepsilon_{h,n} > 2\tilde\varepsilon_{h,n}.$$
Therefore, $j$ is eliminated from $\mathrm{GOOD}_{h,n}(C)$, and thus from $\mathrm{GOOD}_{h+1,n'}(C)$ for $n'>n$.

G.1 Proof of Theorem 5

Proof. Fix a phase $h$ and a level $n$; it suffices to bound the regret within the single pair $(h,n)$. For $b\in[r]$, let $n_b = \lceil\log(1/\varepsilon_{h,b}^2)\rceil$. By Lemma 23, if $n > n_b$, all $6\tilde\varepsilon_{h,n_b}$-suboptimal arms are eliminated from $\mathrm{GOOD}_{h,n}(C)$ for any cluster $C\in\mathcal P_h$; therefore, the regret of playing an action from $\mathrm{GOOD}_{h,n}(C)$ is at most $6\tilde\varepsilon_{h,n_b}$. In this case, the regret is bounded by
$$\sum_{b\in[r]} r(S+K)\,2^{(n+h)/2}\,\omega(b)\cdot 6\tilde\varepsilon_{h,n_b} \le 6\sqrt{\tilde\ell_h}\sum_{b\in[r]} r(S+K)\,2^{(n+h)/2}\,\omega(b)\,\varepsilon_{h,b} \le 6\sqrt{\tilde\ell_h}\sum_{b\in[r]} r(S+K)\,2^{(n+h)/2}\,\omega(b)\Bigl(1+\frac{1}{\sqrt{r\,\omega(b)}}\Bigr)\varepsilon_h = \tilde O\bigl(r(S+K)\,2^{(n+h)/2}\,\varepsilon_h\bigr) = \tilde O\bigl(r(S+K)\,2^{h/2}\bigr),$$
where we used $\sum_{b}\omega(b)/\sqrt{r\,\omega(b)}\le1$ (Cauchy-Schwarz) and $2^{n/2}\varepsilon_h = O(1)$ for $n\le N_h$. If $n\le n_b$, a similar argument shows that the regret of playing an action from $\mathrm{GOOD}_{h,n}(C)$ is at most $6\tilde\varepsilon_{h,n}$. Therefore, the regret is bounded by
$$\sum_{b\in[r]} r(S+K)\,2^{(n+h)/2}\,\omega(b)\cdot 6\tilde\varepsilon_{h,n} \le 6\sqrt{\tilde\ell_h}\sum_{b\in[r]}\omega(b)\, r(S+K)\,2^{(n+h)/2}\,2^{-n/2} = \tilde O\bigl(r(S+K)\,2^{h/2}\bigr).$$
We conclude that the overall regret of the data collection stage is $\tilde O\bigl(\sqrt{r(S+K)T}\bigr)$ by noting that $T \ge r(S+K)2^{h-1}$, as phase $h-1$ is executed completely. The same argument applies to the clustering stage after replacing $r$ with $r^3$, so we conclude that the regret is bounded by $\tilde O\bigl(\sqrt{r^3(S+K)T}\bigr)$.

H Omitted Details in Section 5

In this section, we discuss how to solve the more general low-rank bandit problem. Note that $A$ has rank $r$ if and only if there exist vectors $w_1,\ldots,w_S\in\mathbb R^r$ and $v_1,\ldots,v_K\in\mathbb R^r$ such that $A(i,j) = w_i^\top v_j$. We first consider $r=1$ and assume that every $w_i$ is non-negative. In this case, we have
$$\arg\max_{j\in[K]} A(i,j) = \arg\max_{j\in[K]} w_i v_j = \arg\max_{j\in[K]} v_j. \quad (10)$$
In other words, there exists an arm that is optimal for all contexts.
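The following tiny numeric check of Eq. (10) is purely illustrative (the variable names and the strict positivity of the weights, used only to avoid ties, are ours): for a rank-1 reward matrix $A = wv^\top$ with non-negative $w$, every context shares the same optimal arm.

```python
import numpy as np

# Illustrative check of Eq. (10): for A = w v^T with w > 0, the argmax over
# arms of A(i, .) is the same for every context i.
rng = np.random.default_rng(0)
S, K = 6, 8
w = rng.uniform(0.1, 1.0, size=S)      # positive context weights (no ties)
v = rng.normal(size=K)                 # arbitrary arm features
A = np.outer(w, v)                     # A[i, j] = w[i] * v[j]
best_arm_per_context = A.argmax(axis=1)
assert np.all(best_arm_per_context == v.argmax())
print("optimal arm for every context:", int(v.argmax()))
```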
The problem then becomes simple, as we can run the EXP4 algorithm [Auer et al., 2002] with constant experts that recommend the same arm for all contexts. Thus, we only need $K$ experts, and we have the following proposition.

Proposition 24. Consider the rank-1 bandit problem with $r=1$ and $w_i\ge0$ for every $i\in[S]$. The regret of the EXP4 algorithm run with $K$ constant experts is bounded by $O(\sqrt{KT\log K})$.

However, this idea seems hard to generalize to $w_i\in\mathbb R$, since Eq. (10) no longer holds and EXP4 would require exponentially many experts. Next, we introduce a new idea based on a reduction to context-lumpable bandits. In the following, define the constant $B = \max_i|w_i|$. To better illustrate the idea, we first assume $r=1$ and consider the PAC setting. We create an $\alpha$-covering of $[-B,B]$ and assign each context $i$ to one of its segments. Specifically, we have $\lceil\frac{2B}{\alpha}\rceil$ intervals $[-B,-B+\alpha), [-B+\alpha,-B+2\alpha), \ldots$, and each $w_i$ is assigned to the interval that contains it. Given $\alpha$, let $R_\alpha$ denote the number of intervals that contain at least one context. Conceptually, we can view contexts in the same interval as belonging to the same block of a context-lumpable bandit, so that we have $R_\alpha$ blocks analogously. Contexts $i$ and $i'$ in the same interval are indeed similar, in the sense that $|A(i,j) - A(i',j)| = O(\alpha)$ for every arm $j$. Intuitively, if $\alpha$ is much smaller than $\varepsilon$, then Algorithm 1 can proceed as if this were an $R_\alpha$-block context-lumpable bandit problem. Consequently, we have the following theorem.

Theorem 25. For $r=1$, by choosing $\alpha = \Theta(\varepsilon)$, Algorithm 1 uses $\tilde O\bigl(R_\alpha(S+K)/\varepsilon^2\bigr)$ samples and outputs an $\tilde O(\varepsilon)$-optimal policy.

Clearly, we have $R_\alpha \le \min\bigl\{O(\frac1\alpha), S\bigr\}$, so the above theorem yields an $\tilde O\bigl((S+K)/\varepsilon^3\bigr)$ sample complexity in the worst case. The idea can be generalized to regret minimization and to $r>1$. Specifically, we construct an $\alpha$-grid of $[-B,B]^r$, so that $R_\alpha = O\bigl(\frac{1}{\alpha^r}\bigr)$, and run Algorithm 6 for regret minimization. Consequently, we have the following theorem.

Theorem 26. Let $p = \frac{1}{3r+2}$ and choose $\alpha = (S+K)^p\,T^{-p}$. Then the regret of Algorithm 6 is bounded as $\mathrm{Reg}_T = \tilde O\bigl((S+K)^p\,T^{1-p}\bigr)$.

Proof. Similarly to the previous analysis, the regret within a single phase $h$ is
$$\tilde O(\varepsilon_h)\cdot\min\Bigl\{T,\ \tilde O\Bigl(\frac{R_\alpha^3(S+K)}{\varepsilon_h^2}\Bigr)\Bigr\} = \min\Bigl\{\tilde O(\varepsilon_h T),\ \tilde O\Bigl(\frac{R_\alpha^3(S+K)}{\varepsilon_h}\Bigr)\Bigr\}. \quad (11)$$
Recall that $R_\alpha = O(1/\alpha^r)$. Therefore, when $\alpha = \Theta(\varepsilon_h)$, the above regret is bounded by
$$\tilde O\Bigl(\frac{R_\alpha^3(S+K)}{\alpha} + \alpha T\Bigr) = \tilde O\Bigl(\frac{S+K}{\alpha^{3r+1}} + \alpha T\Bigr) = \tilde O\bigl((S+K)^p\,T^{1-p}\bigr)$$
when choosing $\alpha$ optimally as $\alpha = (S+K)^p\,T^{-p}$. Otherwise, $\alpha = o(\varepsilon_h)$, and Eq. (11) can be bounded by
$$\tilde O\Bigl(\frac{R_\alpha^3(S+K)}{\varepsilon_h}\Bigr) \le \tilde O\Bigl(\frac{R_\alpha^3(S+K)}{\alpha}\Bigr) = \tilde O\bigl((S+K)^p\,T^{1-p}\bigr).$$
We finish the proof by noting that there are at most $\log T$ phases.

The bound becomes non-trivial when $S$ and $K$ are large. For example, when $S = K = \sqrt T$, the bound is $T^{1-p/2} = o(T)$ for any $r$, while the $\tilde O(\sqrt{SKT})$ bound given by EXP4 is vacuous. The factor of 3 in the exponent $p = \frac{1}{3r+2}$ comes from the $r^3$ term in the regret bound of Theorem 5. It is a promising direction to first improve this factor for context-lumpable bandits and then extend the improvement to low-rank bandits via the reduction introduced here.
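The $\alpha$-covering argument for the rank-1 case can be illustrated with the short sketch below: bucket the contexts by which $\alpha$-interval of $[-B,B]$ their weight falls into and check that two contexts in the same bucket have mean rewards within $\alpha\max_j|v_j|$ of each other. It is purely illustrative: the weights are taken as known here only to visualize the covering, and all names and numbers are ours.

```python
import numpy as np

# Minimal sketch of the alpha-covering reduction for r = 1: bucket contexts by
# the alpha-interval of [-B, B] containing w_i, count non-empty intervals
# (R_alpha), and verify the within-bucket reward gap is at most alpha*max|v_j|.
rng = np.random.default_rng(0)
S, K, alpha = 200, 15, 0.05
w = rng.uniform(-1.0, 1.0, size=S)                  # B = 1
v = rng.normal(size=K)
A = np.outer(w, v)                                  # A[i, j] = w[i] * v[j]

buckets = np.floor((w + 1.0) / alpha).astype(int)   # interval index per context
R_alpha = len(np.unique(buckets))                   # non-empty intervals only

worst_gap = 0.0
for c in np.unique(buckets):
    idx = np.where(buckets == c)[0]
    # spread of mean rewards inside one bucket, over all arms
    worst_gap = max(worst_gap, float((A[idx].max(axis=0) - A[idx].min(axis=0)).max()))

print(f"R_alpha = {R_alpha} (at most {int(np.ceil(2 / alpha))} intervals)")
print(f"largest within-bucket reward gap = {worst_gap:.4f} <= {alpha * np.abs(v).max():.4f}")
```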