# Stochastic bandits with groups of similar arms

Fabien Pesquerel (fabien.pesquerel@inria.fr), Hassan Saber (hassan.saber@inria.fr), Odalric-Ambrym Maillard (odalric.maillard@inria.fr)
Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France

We consider a variant of the stochastic multi-armed bandit problem where arms are known to be organized into different groups having the same mean. The groups are unknown, but a lower bound $q$ on their size is known. This situation typically appears when each arm can be described by a list of categorical attributes, and the (unknown) mean reward function only depends on a subset of them, the others being redundant. In this case, $q$ is naturally linked to the number of attributes considered redundant and to the number of categories of each attribute. For this structured problem of practical relevance, we first derive the asymptotic regret lower bound and the corresponding constrained optimization problem. They reveal that the achievable regret can be substantially reduced compared to the unstructured setup, possibly by a factor $q$. However, solving the exact constrained optimization problem involves a combinatorial problem. We introduce a lower-bound-inspired strategy involving a computationally efficient relaxation that is based on a sorting mechanism. We further prove that it achieves a regret close to the optimal lower bound, up to a controlled factor, and an asymptotic regret $q$ times smaller than the unstructured one. We believe this makes it a valuable strategy for the practitioner. Last, we illustrate the performance of the considered strategy in numerical experiments involving a large number of arms.

## 1 Introduction

The finite stochastic multi-armed bandit problem is a popular framework for studying sequential decision-making problems in which a learner sequentially samples from a finite set of distributions called arms. It was first introduced in the context of medical trials [Thompson, 1933b, 1935] and later formalized under this name by Robbins [1952]. We refer the interested reader to Lattimore and Szepesvári [2020] for a recent survey. This is one of the simplest theoretical frameworks in which one can study the notion of exploration-exploitation tradeoff. This tension between exploration and exploitation arises from the sequential optimization problem a learner is trying to solve while being uncertain about the very problem it is optimizing.

Formally, a multi-armed bandit configuration is specified by a set of unknown real-valued probability distributions $\nu = (\nu_a)_{a \in A}$ with means $(\mu_a)_{a \in A}$, indexed by a set of arms $A$. We hereafter consider a finite $A$ and assume that all $\nu_a$, $a \in A$, belong to the same family of distributions $\mathcal F$ (e.g. Bernoulli, Gaussian, etc.), that is, $\nu \in \mathcal F^A$. The bandit game proceeds as follows. At each time $t \in \mathbb N$, the learner chooses an arm $a_t \in A$ based on past observations and decisions, then receives and observes a sample $X_t$ (called the reward), conditionally independent of the past, sampled from $\nu_{a_t}$. Her goal is to maximize the cumulative reward received over time. The mean of each arm is unknown, which makes the problem non-trivial, hence the learner should adjust her sampling strategy based on past information obtained from drawing different arms in order to maximize the expected sum of rewards. The maximal expected value of a finite bandit configuration is denoted by $\mu^\star$, defined as $\mu^\star = \max_{a \in A} \mu_a$.
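To make the interaction protocol concrete, the following minimal sketch (not taken from the paper's code) simulates the bandit loop with unit-variance Gaussian arms, the reward family used in the experiments of Section 5; the arm means and the naive placeholder policy are purely illustrative, and the last line evaluates the pseudo-regret of Definition 1 below on one realization.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.9])            # hypothetical unknown means, one per arm
T = 1000

counts = np.zeros(len(means), dtype=int)     # N_a(t): number of pulls of each arm
sums = np.zeros(len(means))                  # cumulative reward collected from each arm

def policy(counts, sums, t):
    """Placeholder strategy: pull each arm once, then pull the empirically best arm."""
    if t < len(counts):
        return t
    return int(np.argmax(sums / counts))

for t in range(T):
    a = policy(counts, sums, t)
    x = rng.normal(means[a], 1.0)            # reward X_t drawn from nu_{a_t}
    counts[a] += 1
    sums[a] += x

# Pseudo-regret: sum over arms of (mu_star - mu_a) * N_a(T).
print(np.sum((means.max() - means) * counts))
```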
The performance of the strategy used by the agent is measured by the (pseudo) regret, which compares the expected sum of rewards obtained by an oracle that would constantly pull an optimal arm with the one obtained by the learner, up to some time horizon $T$ (that we assume is unknown to the learner).

Definition 1 (Regret). The regret incurred by a sampling strategy after $T$ time steps on a bandit configuration $\nu$ is given by
$$R(\nu, T) = \mathbb E_\nu\Big[\sum_{t=1}^{T} (\mu^\star - \mu_{a_t})\Big] = \sum_{a \in A} (\mu^\star - \mu_a)\, \mathbb E_\nu[N_a(T)],$$
where $N_a(T) = \sum_{t=1}^{T} \mathbb 1\{a_t = a\}$ denotes the number of selections of arm $a$ after $T$ time steps.

Group of similar arms. Motivated by various practical reasons, one may want to restrict to a subset $\mathcal B \subset \mathcal F^A$ of allowed bandit configurations instead of the full set $\mathcal F^A$. In this paper, we study a variant of the multi-armed bandit problem in which the reward function, $\mu : a \in A \mapsto \mu_a$, is assumed to satisfy a cluster-like structural property. A bandit configuration $\nu$ is said to satisfy the $q$-equivalence property if for every arm $a \in A$ there are at least $q - 1$ other distinct arms having the same expected value:
$$\forall a \in A, \quad |\{a' \in A \,:\, \mu_{a'} = \mu_a\}| \geq q.$$
Assuming the set of arms $A$ and the family of distributions $\mathcal F$ are known to the learner, we denote by $\mathcal B_q$ the set of bandit configurations having the $q$-equivalence property. We also denote by $\mathcal B_q(\mu)$ the set of mean vectors of configurations in $\mathcal B_q$; formally, $\mathcal B_q(\mu)$ is the image of $\mathcal B_q$ under the mapping $\mu$.

Definition 2 (Arm equivalence and equivalence class). Given a bandit configuration $\nu$, two arms $a, a' \in A$ are said to be equivalent if their associated distributions have the same expected value: $a \sim a' \iff \mu_a = \mu_{a'}$. An equivalence class $c$ in $\nu$ is a maximal subset of arms of $A$ having the same mean, i.e., for all arms $a, a' \in c$, $\mu_a = \mu_{a'}$, and for all $a \in c$ and $a' \in A \setminus c$, $\mu_a \neq \mu_{a'}$.

This situation typically appears in practice when each arm can be described by a list of categorical attributes, and the (unknown) mean reward function only depends on a subset of them, the others being redundant. In this case, $q$ is naturally linked to the number of attributes considered redundant (or useless descriptors) and to the number of categories of each attribute. Precisely, $q = \prod_{i \in R} c_i$, where $R$ is the set of redundant attributes and $c_i$ the number of categories of attribute $i$. The learner may know that such a structure exists while not knowing a closed-form formula mapping the list of categorical attributes to the significant subset. In this case, $q$ might only be a lower bound on the sizes of the classes, since the set $R$ might not be the largest possible one, or because the number of redundant attributes depends on the number of relevant attributes. In all cases, the smallest possible number of redundant attributes can naturally be linked to $q$. We hereafter consider that the learner only knows $q$ but would like to exploit the prior knowledge of this structure in a bandit problem.
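As a hypothetical illustration of the relation $q = \prod_{i \in R} c_i$ (the attribute spaces and means below are invented for the example): suppose each arm is described by three attributes with 2, 3 and 4 categories, and only the first one influences the mean. Every value of the relevant attribute then yields a group of $q = 3 \times 4 = 12$ arms sharing the same mean, out of 24 arms in total.

```python
from collections import Counter
from itertools import product

categories = [2, 3, 4]          # sizes of the attribute spaces (hypothetical)
relevant = {0}                  # only attribute 0 matters for the mean
redundant = [i for i in range(len(categories)) if i not in relevant]

arms = list(product(*[range(c) for c in categories]))    # one arm per attribute tuple

base_means = {0: 0.3, 1: 0.9}                            # hypothetical mean per value of attribute 0
mu = {arm: base_means[arm[0]] for arm in arms}           # the mean only depends on the relevant attribute

q = 1
for i in redundant:              # q is the product of the category counts of the redundant attributes
    q *= categories[i]

print(q, Counter(mu.values()))   # 12 and {0.3: 12, 0.9: 12}: every level set of mu has at least q arms
```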
Regret lower bounds overview. In order to assess the performance of a bandit algorithm on a set of configurations $\mathcal B$, one naturally studies the best guarantee achievable by a uniformly efficient algorithm on $\mathcal B$, i.e., an algorithm with sub-linear regret on any instance $\nu \in \mathcal B$ of the bandit problem. When $\mathcal B = \mathcal F^A$, such a guarantee was first provided by Lai and Robbins [1985] for parametric families $\mathcal F$, and then extended by Burnetas and Katehakis [1996] to more general families. It states that any algorithm that is uniformly efficient on a family of distributions $\mathcal F$ (formally, for each bandit configuration in $\mathcal F^A$ and each suboptimal arm $k$, i.e. with $\mu^\star - \mu_k > 0$, $\mathbb E[N_k(T)] = o(T^\alpha)$ for all $\alpha \in (0, 1]$) must satisfy
$$\liminf_{T\to\infty} \frac{R(\nu, T)}{\log T} \;\geq\; \sum_{a:\, \mu_a < \mu^\star} \frac{\mu^\star - \mu_a}{\mathcal K_{\mathcal F}(\nu_a, \mu^\star)}, \qquad \mathcal K_{\mathcal F}(\nu_a, \mu^\star) = \inf_{G \in \mathcal F}\,\{\mathrm{KL}(\nu_a, G) : \mathbb E_G(X) > \mu^\star\}. \tag{1}$$
This popular result entails that any strategy having the desirable property of sub-linear regret on any instance in $\mathcal F^A$ must incur a non-trivial minimal regret. When $\mathcal B$ is a strict subset of $\mathcal F^A$, the bandit problem is called structured, as in this case pulling an arm may reveal information that makes it possible to refine the estimation of other arms (e.g. think of the set of bandit configurations having a Lipschitz mean function with respect to $A \subset \mathbb R^d$). The presence of structure may considerably modify the achievable lowest regret, as shown in Burnetas and Katehakis [1996] and Graves and Lai [1997], who extended the (unstructured) lower bounds to arbitrarily structured bandit problems (and beyond). These lower bounds take the generic form $\liminf_{T\to\infty} R(\nu, T)/\log T \geq C_{\mathcal B}(\nu)$, where $C_{\mathcal B}(\nu)$ is a constant obtained as the solution of a constrained linear-optimization problem. A bandit algorithm is then called asymptotically optimal on a set $\mathcal B$ when its regret asymptotically matches this lower bound.

Existing strategies. In order to minimize the regret, a learner faces the classical exploration/exploitation dilemma: it needs to balance exploration, that is, gaining information about the expected values of the arms by sampling them, and exploitation, that is, playing the most promising arm sufficiently often. Many algorithms have been proposed to solve the multi-armed bandit problem (see Lattimore and Szepesvári [2020] for a survey). The study of the lower bounds had a crucial impact on the development of provably asymptotically optimal strategies. In the unstructured case $\mathcal B = \mathcal F^A$, this includes strategies that build on the concept of Optimism in the Face of Uncertainty (the most celebrated of which are the Upper Confidence Bound (UCB) algorithms [Agrawal, 1995, Auer et al., 2002]), such as KL-UCB [Lai, 1987, Cappé et al., 2013, Maillard, 2018] and DMED and IMED [Honda and Takemura, 2011, 2015], which are proven asymptotically optimal for various families $\mathcal F$ (e.g. one-dimensional exponential families) and directly exploit the lower bound in their structure. Alternative asymptotically optimal strategies include Thompson Sampling (TS) [Thompson, 1933a, Agrawal and Goyal, 2012], which uses a Bayesian posterior distribution given a specific prior and whose optimality was shown in Korda et al. [2013]. See also Kveton et al. [2019] for other randomized algorithms and Kveton et al. [2020], Chan [2020], Baudry et al. [2020] for recent non-parametric extensions using re-sampling methods. Further, some authors also allow many optimal arms, see de Heide et al. [2021], or even countably many arms, see Kalvit and Zeevi [2020]. However, these works do not consider nor exploit a constraint on the level sets of the mean function, and they follow an optimistic paradigm while we follow an information-minimization approach targeting optimality.
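As a concrete reading of the quantities appearing in equation (1), the following sketch evaluates the unstructured lower-bound constant for unit-variance Gaussian arms, for which $\mathcal K_{\mathcal F}(\nu_a, \mu^\star) = (\mu^\star - \mu_a)^2 / 2$ is a standard closed form; the means are hypothetical.

```python
import numpy as np

def kl_gaussian(mu_a, mu_b, sigma=1.0):
    """KL divergence between two Gaussians sharing the same variance sigma**2."""
    return (mu_a - mu_b) ** 2 / (2.0 * sigma ** 2)

means = np.array([0.3, 0.5, 0.9])        # hypothetical arm means
mu_star = means.max()
suboptimal = means < mu_star

# Lai-Robbins constant of equation (1): sum over suboptimal arms of (mu* - mu_a) / K_F(nu_a, mu*).
c_unstructured = np.sum((mu_star - means[suboptimal]) / kl_gaussian(means[suboptimal], mu_star))
print(c_unstructured)                    # the unstructured regret grows like c_unstructured * log(T)
```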
On the other hand, several instances of structured bandits have received considerable attention in the last few years. This is the case for instance of linear bandits [Abbasi-Yadkori et al., 2011, Srinivas et al., 2010, Durand et al., 2017, Kveton et al., 2020, Lattimore and Szepesvari, 2017], Lipschitz bandits [Magureanu et al., 2014, Wang et al., 2020, Lu et al., 2019], unimodal bandits [Yu and Mannor, 2011, Combes and Proutiere, 2014, Saber et al., 2020], or combinatorial bandits [Kveton et al., 2015, Magureanu, 2018] and, more recently, Cuvelier et al. [2021b]. A generic asymptotically optimal algorithm, called OSSB (Optimal Structured Stochastic Bandit), has been introduced in the work of Combes et al. [2017] and proven to be asymptotically optimal for all structures satisfying some weak properties that include all the aforementioned structures. Although asymptotically optimal, this algorithm often suffers from a long burn-in phase that may hinder its finite-time practical performance. It further comes with a high computational price, as it requires solving an empirical version of the optimization problem $C_{\mathcal B}(\nu)$ at each step. This motivates the quest for alternative strategies, perhaps less generic but better suited to a specific structure. Inspired by combinatorial structures for which computing $C_{\mathcal B}(\nu)$ is simply not feasible, a relaxation of the generic constrained optimization problem was recently proposed in Cuvelier et al. [2021a]. The authors show that this comes at the price of trading off regret optimality for computational efficiency. Indeed, in some structures, combinatorial properties are at stake and asymptotically optimal algorithms may require solving combinatorial optimization problems related to $C_{\mathcal B}(\nu)$ (see Cuvelier et al. [2021a]). In order to exploit combinatorial structures in a numerically efficient way, research has been devoted to relaxing these combinatorial optimization problems while preserving theoretical guarantees on the regret of the relaxed algorithms (see Cuvelier et al. [2021b,a]). Our work considers similar computational issues, from a different perspective.

Goal. For the structure $\mathcal B_q$, as we show in Theorem 1 below, the term $C_{\mathcal B_q}(\nu)$ unfortunately involves, in general, a combinatorial optimization problem. This makes resorting to OSSB or to any strategy targeting exact asymptotic optimality a daunting task for the practitioner. In this paper, our goal is to provide a computationally efficient strategy adapted to the structure $\mathcal B_q$ that is able to reach optimality up to a controlled error term.

Outline and contributions. The rest of this paper is organized as follows. In Section 2, we derive a lower bound on the regret for the structured set of bandit configurations $\mathcal B_q$. This bound exhibits two components: one that we call non-combinatorial, as it can be optimized efficiently, and a second term that we call combinatorial, as it involves solving a combinatorial problem. Interestingly, we show in Lemma 1 and Theorem 3 that the contribution of the combinatorial part of the lower bound can be controlled. Owing to this key insight, we introduce in Section 3 IMED-EC, an adaptation of the IMED strategy of Honda and Takemura [2015] to the structured set $\mathcal B_q$. One advantage of IMED over a KL-UCB alternative is its reduced complexity, which carries over to the equivalence-class setup. At each time step, the complexity of computing the next arm to be pulled by IMED-EC is no more than that of sorting a list of $|A|$ elements once the IMED indexes have been computed, which is only $\log |A|$ times larger than looking for the minimal IMED index. In Section 4, we prove that IMED-EC achieves a controlled asymptotic regret that matches the non-combinatorial part of the lower bound and is at most a factor of 2 away from the optimal regret bound.
Last, we illustrate the benefit of IMED-EC over its unstructured version in Section 5, where it shows a substantial improvement. Our experiments also highlight the robustness of the algorithm to a misspecified parameter $q$, which is a desirable feature for the practitioner.

## 2 A regret lower bound with combinatorial and non-combinatorial parts

In this section, we derive a lower bound on the number of pulls of suboptimal arms that involves a combinatorial optimization problem. Using that lower bound, we derive a simple algorithm, IMED-EC, that does not involve solving any optimization problem. While it is not asymptotically optimal, we will show in the next section that our algorithm has an upper bound on its regret that is no more than a fraction of the unstructured one.

The proof of Theorem 1 is based on the concept of most confusing instance. Most confusing instances make it possible to assess the intrinsic difficulty of a bandit problem and to compute lower bounds on the number of times suboptimal arms are pulled. The lower bound informs us about the minimal amount of exploration one needs to perform to solve a bandit problem. More formally, a confusing instance $\nu'$ associated to a suboptimal arm $a$ of a bandit problem $\nu$ is a bandit instance with the same set of arms as the original one, but in which $\mu_a$ has been changed to $\mu'_a > \mu^\star$. An optimal sampling strategy (one that does not sample suboptimal arms too much) should behave differently on the two problems. Studying this difference, we can compute the minimal amount of exploration performed by an optimal strategy on arm $a$ in the original problem $\nu$. Doing so for all suboptimal arms allows us to bound the number of samples of suboptimal arms, and therefore to characterize the intrinsic complexity of a bandit instance $\nu$. In a structured setting, a confusing instance also has to respect the structure. In our case, this means that a confusing instance cannot have a class with fewer than $q$ arms. We will therefore consider confusing instances associated to classes rather than to individual arms.

Definition 3 (Confusing instance). Given a bandit configuration $\nu \in \mathcal B_q$, a real number $\lambda$ and a subset $c_q \subset A$ of $q$ equivalent arms in $\nu$, we denote by $\mathcal B_q(\nu, c_q, \lambda)$ the set of all bandit configurations $\nu'$ having the same set of arms as $\nu$ and such that $\nu' \in \mathcal B_q$ and, for every arm $a$ in $c_q$, $\mu'_a \geq \lambda$. When $\lambda > \mu^\star$ and $c_q$ is a subset of a suboptimal class, a bandit configuration in $\mathcal B_q(\nu, c_q, \lambda)$ is called a confusing instance of $\nu$. Similarly to the notation introduced above, we write $\mathcal B_q(\mu, c_q, \lambda)$ for the set of mean vectors of the bandit configurations in $\mathcal B_q(\nu, c_q, \lambda)$.

The aim of an asymptotic lower bound on the number of pulls of a suboptimal arm is to capture mathematically the minimal asymptotic amount of exploration an algorithm should perform.

Assumption 1: The family $\mathcal F$ is such that for all $\kappa \in \mathcal F$, $\mu \mapsto \mathcal K_{\mathcal F}(\kappa, \mu)$ and $\mu \mapsto \mathcal K_{\mathrm{eq}}(\kappa, \mu)$ are continuous, where $\mathcal K_{\mathrm{eq}}(\kappa, \mu) = \inf_{G \in \mathcal F}\{\mathrm{KL}(\kappa, G) : \mathbb E_G(X) = \mu\}$, $\mathrm{KL}$ denoting the relative entropy, or Kullback-Leibler divergence.

Assumption 2: The family $\mathcal F$ is an exponential family of dimension 1. The KL divergences are therefore parameterized by the means, and we may write the KL divergence as a function of the means: for $\kappa, \chi \in \mathcal F$, $\mathrm{KL}(\kappa, \chi) = \mathrm{KL}(\mathbb E_{X \sim \kappa}[X], \mathbb E_{X \sim \chi}[X])$ (identification of the KL divergence with its parameterization by the means).
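As an illustration of Definition 3 (with hypothetical means, not taken from the paper), the sketch below lifts a subset $c_q$ of $q$ arms of a suboptimal class just above $\mu^\star$; whether the arms of the class that were not moved force further modifications is exactly what distinguishes the two regimes discussed after Theorem 1 below.

```python
import numpy as np

def lift_subset(means, class_members, q, lam):
    """Move q arms of a suboptimal class up to the level lam (> mu*), leaving the others unchanged.

    Returns the modified mean vector and the remainder of the class, i.e. the arms of the
    class that were not moved. If the remainder has size strictly between 0 and q, the
    resulting configuration violates the q-equivalence property and further arms must be
    adjusted (this is the role of the infimum in equation (2)); otherwise nothing else moves.
    """
    moved = list(class_members)[:q]
    remainder = list(class_members)[q:]
    confusing = np.array(means, dtype=float)
    confusing[moved] = lam
    return confusing, remainder

means = [0.3, 0.3, 0.3, 0.5, 0.5, 0.5, 0.9, 0.9, 0.9]   # 3 classes of q = 3 arms (hypothetical)
confusing, remainder = lift_subset(means, class_members=[0, 1, 2], q=3, lam=max(means) + 0.05)
print(confusing, remainder)    # empty remainder: the q-equivalence property still holds
```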
Theorem 1 (Asymptotic lower bound). Let $q \in \mathbb N$ be a positive integer and let $\nu \in \mathcal B_q$ be a bandit configuration having the $q$-equivalence property. Let $c \subset A$ be a suboptimal equivalence class in $\nu$. Assume the sampling strategy is uniformly consistent, that is, for all suboptimal arms $a$ and all $\alpha \in (0, 1]$, $\mathbb E_\nu[N_a(T)] = o(T^\alpha)$ as $T \to +\infty$. Then, under Assumption 1, we have the following asymptotic, configuration-dependent lower bound on the number of pulls of the arms of $c$:
$$\liminf_{T\to\infty} \frac{1}{\log T}\, \min_{c_q}\left[\; \sum_{a \in c_q} \mathbb E_\nu[N_a(T)]\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star) \;+\; \inf_{\mu' \in \mathcal B_q(\mu, c_q, \mu^\star)} \sum_{a \notin c_q} \mathbb E_\nu[N_a(T)]\, \mathcal K_{\mathrm{eq}}(\nu_a, \mu'_a) \;\right] \;\geq\; 1, \tag{2}$$
where the minimum is taken over all subsets $c_q$ of $c$ containing $q$ distinct arms.

We briefly sketch how confusing instances are used in the proof of Theorem 1. We consider confusing instances in which $q$ arms from a suboptimal class $c$ are moved above the optimal mean. If there are exactly $q$ arms in the class, then no arm remains to be moved. If there are at least $2q$ arms, then moving $q$ arms leaves a remainder of size at least $q$, meaning that the crafted confusing instance respects the equivalence structure. However, if there are between $q + 1$ and $2q - 1$ arms, then the remainder has size at least 1 but strictly smaller than $q$. The created confusing instance does not respect the equivalence structure and we have to deal with the arms of the remainder (the infimum of equation (2)). There are $\binom{|c|}{q}$ possible choices of the $q$ arms to move from class $c$ (the minimum of equation (2)). All in all, the lower bound involves a combinatorial optimization problem. While this lower bound involves a combinatorial optimization term, one can distinguish between two regimes depending on the size of the suboptimal class: the combinatorial regime and the non-combinatorial regime.

Non-combinatorial regime. For a suboptimal class $c$, if $|c| = q$ or $|c| \geq 2q$, then the lower bound reduces to
$$\liminf_{T\to\infty} \frac{1}{\log T}\, \min_{c_q} \sum_{a \in c_q} \mathbb E_\nu[N_a(T)]\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star) \;\geq\; 1,$$
because the remainder is either empty or of size at least $q$, so the infimum from Theorem 1 disappears. Indeed, this infimum is always 0, as it can be attained by choosing $\mu'_a = \mu_a$ for all $a \in c \setminus c_q$. Furthermore, the minimum over all size-$q$ subsets of $c$ is in fact the sum of the $q$ smallest elements of $\{\mathbb E_\nu[N_a(T)]\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)\}_{a \in c}$. Searching among all size-$q$ subsets of $c$ thus amounts to finding the $q$ smallest elements, which is no more complex than sorting a list of $|c|$ elements. Hence the problem is no longer a combinatorial optimization one, and we call this case the non-combinatorial regime.

Lemma 1. Let $\nu \in \mathcal B_q$ be a bandit configuration having the $q$-equivalence property and let $c$ be a suboptimal class in the non-combinatorial regime. Then, under Assumptions 1 and 2,
$$\liminf_{T\to\infty} \frac{\sum_{a \in c} \mathbb E_\nu[N_a(T)]}{\log T} \;\geq\; \sum_{a \in c} \frac{1}{q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)}. \tag{3}$$

While we do not have information about the individual number of times each arm of a class has been sampled, Lemma 1 roughly tells us that, on average, the lower bound on the minimal amount of exploration of an arm in a suboptimal class has been divided by $q$.

Lemma 2. If all suboptimal classes are in the non-combinatorial regime, then, under Assumptions 1 and 2, the regret may be asymptotically lower bounded as
$$\liminf_{T\to\infty} \frac{R(\nu, T)}{\log T} \;\geq\; \sum_{a:\, \mu_a < \mu^\star} \frac{\mu^\star - \mu_a}{q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)}. \tag{4}$$

Lemma 2 informs us that in the non-combinatorial regime, the classical lower bound on the regret given by equation (1) has been divided by $q$.
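In the non-combinatorial regime, the optimization over size-$q$ subsets therefore boils down to a partial sort. A minimal sketch of this reduction, on hypothetical numbers:

```python
import numpy as np

# Hypothetical values of E[N_a(T)] * K_F(nu_a, mu*) for the arms of one suboptimal class c.
weighted_pulls = np.array([4.2, 1.1, 3.5, 0.9, 2.7, 5.0])
q = 3

# The minimum over all size-q subsets of c of the subset sum is the sum of the q smallest
# elements, so there is no need to enumerate the |c| choose q subsets.
best_subset_sum = np.sort(weighted_pulls)[:q].sum()
# Equivalently, with a partial selection (linear time instead of a full sort):
best_subset_sum_fast = np.partition(weighted_pulls, q - 1)[:q].sum()
print(best_subset_sum, best_subset_sum_fast)   # both equal 0.9 + 1.1 + 2.7 = 4.7
```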
Combinatorial regime. For a suboptimal class $c$ to be in the combinatorial regime, we need $q < |c| < 2q$, since the remainder then satisfies $0 < |c \setminus c_q| < q$ and the infimum in Theorem 1 is not 0. In that case, the lower bound (2) involves a combinatorial optimization problem. Two difficulties arise from the term $\inf_{\mu' \in \mathcal B_q(\mu, c_q, \lambda)} \sum_{a \notin c_q} \mathbb E_\nu[N_a(T)]\, \mathcal K_{\mathrm{eq}}(\nu_a, \mu'_a)$. First, while one could have thought that summing over the remainder $c \setminus c_q$ would be enough, the sum has to run over all $a \notin c_q$. Indeed, the residual $c \setminus c_q$ may be of size $q - 1$, meaning that it might cost less to move an arm from another class to the residual in order to complete it, rather than moving the whole remainder. Second, while one could have thought that moving elements from one class of $\nu$ to another would be enough, the infimum has to be taken over $\mathcal B_q(\mu, c_q, \lambda)$. Indeed, the residual $c \setminus c_q$ may be of size $q - 1$ while the nearest class is of size exactly $q$. In this case, it may cost less to move all the $2q - 1$ distributions in between the two classes and create a new class, rather than merging one of the two classes with the other.

Lemma 3. Let $\nu \in \mathcal B_q$ be a bandit configuration having the $q$-equivalence property and let $c$ be a suboptimal class in the combinatorial regime. Then, under Assumptions 1 and 2, for any $a \in c$,
$$\liminf_{T\to\infty} \frac{\sum_{a' \in c} \mathbb E_\nu[N_{a'}(T)]}{\log T} \;\geq\; \frac{1}{\frac{q}{|c|}\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star) + \frac{|c| - q}{|c|}\, \min_{\kappa \in \nu} \mathcal K_{\mathrm{eq}}(\nu_a, \kappa)}, \tag{5}$$
$$\liminf_{T\to\infty} \frac{\sum_{a' \in c} \mathbb E_\nu[N_{a'}(T)]}{\log T} \;\geq\; \frac{1}{\mathcal K_{\mathcal F}(\nu_a, \mu^\star)}. \tag{6}$$

These inequalities can be compared to equation (3) from the non-combinatorial regime. We emphasize that the lower bounds given by equations (5) and (6) are not the largest possible lower bounds and hence do not provide as much information about the algorithmically achievable regret as the largest one given by equation (2). However, together with a regret upper bound on the algorithm IMED-EC, these quantities will help us control the asymptotic discrepancy between the regret of IMED-EC and the asymptotic lower bound given by Theorem 1.

## 3 Information Minimization for bandits with equivalence classes

The algorithm we present, IMED-EC, relies on the (weak) indexes introduced in the IMED paper of Honda and Takemura [2015]. At each time step $t$, for each arm $a \in A$, we can compute its IMED index as
$$I_a(t) = N_a(t)\, \mathcal K_{\mathcal F}(\hat\mu_a(t), \hat\mu^\star(t)) + \log N_a(t),$$
where $\hat\mu^\star(t) = \max_{a \in A} \hat\mu_a(t)$ and, for each arm $a \in A$, $\hat\mu_a(t)$ is the empirical mean of arm $a$ computed with the samples collected from this arm up to time $t$, $\hat\mu_a(t) = \frac{1}{N_a(t)} \sum_{s=1}^{t} X_s \mathbb 1\{a_s = a\}$, where $X_s$ is the sample collected by the algorithm at time step $s$. Let $\nu \in \mathcal B_q$ be a bandit configuration having the $q$-equivalence property. We denote by $A^\star(t) = \arg\max_{a \in A} \hat\mu_a(t)$ the set of empirically optimal arms at time $t$, and by $A_q(t)$ the set of arms having the $q$ smallest IMED indexes (breaking ties randomly so that this set has size $q$). We also consider the two following quantities at each time $t$:
$$I^\star(t) = \min_{a \in A^\star(t)} I_a(t) = \min_{a \in A^\star(t)} \log N_a(t), \qquad I(t) = \min_{A' \subset A,\, |A'| = q}\; \sum_{a \in A'} I_a(t) = \sum_{a \in A_q(t)} I_a(t).$$
$I(t)$ can be computed efficiently by summing the $q$ smallest elements of the list of IMED indexes. Finding the $q$ smallest elements can be done by maintaining a sorted array of IMED indexes while computing them; the procedure costs a constant factor of $\log |A|$. Computing $I(t)$ therefore costs $O(|A| \log |A|)$, which is only $\log |A|$ times larger than looking for the minimal IMED index. Computing $A^\star(t)$ can be done by maintaining the set of arms having the best empirical mean (an additional constant factor). The IMED-EC algorithm is presented in Algorithm 1.

Algorithm 1: IMED-EC (IMED for Equivalence Classes)
  Pull each arm once.
  for $t = |A|, \dots, T - 1$ do
    if $I^\star(t) \leq I(t)$ then
      Pull $a_{t+1} \in \arg\min_{a \in A^\star(t)} N_a(t)$ (chosen arbitrarily)
    else
      Pull $a_{t+1} \in \arg\min_{a \notin A^\star(t)} I_a(t)$ (chosen arbitrarily)
    end if
  end for

While the original problem involves combinatorial quantities, these are not involved in the IMED-EC algorithm. From a time-complexity viewpoint, this makes the algorithm on par with other popular algorithms such as UCB, KL-UCB and IMED.
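The following is a minimal sketch of one IMED-EC decision step, written from the description above for unit-variance Gaussian arms (for which $\mathcal K_{\mathcal F}(\hat\mu_a, \hat\mu^\star) = (\hat\mu^\star - \hat\mu_a)^2/2$); it is an illustration under these assumptions, not the authors' reference implementation, and ties are broken by the first minimizer rather than arbitrarily.

```python
import numpy as np

def kl_gaussian(mu_a, mu_b, sigma=1.0):
    return (mu_a - mu_b) ** 2 / (2.0 * sigma ** 2)

def imed_ec_step(counts, sums, q):
    """Return the arm IMED-EC pulls next, assuming every arm has already been pulled once."""
    mu_hat = sums / counts
    mu_hat_star = mu_hat.max()
    # IMED indexes: I_a(t) = N_a(t) * K_F(mu_hat_a, mu_hat_star) + log N_a(t)
    indexes = counts * kl_gaussian(mu_hat, mu_hat_star) + np.log(counts)

    best = np.flatnonzero(mu_hat == mu_hat_star)     # A*(t): empirically optimal arms
    i_star = indexes[best].min()                     # I*(t) = min over A*(t) of log N_a(t)
    i_group = np.sort(indexes)[:q].sum()             # I(t): sum of the q smallest IMED indexes

    if i_star <= i_group:
        # Exploit: among the empirically best arms, pull the least sampled one.
        return int(best[np.argmin(counts[best])])
    # Explore: among the remaining arms, pull the one with the smallest IMED index.
    others = np.flatnonzero(mu_hat < mu_hat_star)
    return int(others[np.argmin(indexes[others])])

# Toy usage with hypothetical statistics accumulated after the initialization phase.
counts = np.array([10, 3, 4, 25, 2, 2])
sums = np.array([2.5, 1.2, 1.0, 21.0, 0.4, 0.5])
print(imed_ec_step(counts, sums, q=2))
```

Note that no subset enumeration appears: the only non-elementwise operation is the (partial) sort used to form $I(t)$.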
By contrast, the generic structured algorithm OSSB involves solving a combinatorial optimization problem at each time step, which makes it numerically inefficient. We are not aware of any general relaxation method for this algorithm with which we could compare IMED-EC. It is interesting to note that in the case $q = 1$, the IMED-EC algorithm coincides with the IMED algorithm.

Intuition. For an arm $a$, $N_a(t)\, \mathcal K_{\mathcal F}(\hat\mu_a(t), \hat\mu^\star(t))$ may be interpreted as the opposite of a log-likelihood of optimality of that arm. $\log N_a(t)$ is linked to the log-frequency of play of that arm, the frequency of play of an arm being interpreted as the probability of pulling that arm in a sequence of length $t$. The IMED algorithm can thus be intuitively understood as an algorithm matching an empirical log-probability with a log-frequency of play. In our setting, there are at least $q$ elements in each group. It therefore makes sense to test for the optimality of a group rather than of single elements. Since all arms are independent, it makes sense to sum the log-likelihoods of optimality over the size-$q$ subsets of the set of arms. Since we have the intuition that this first part is the logarithm of a product of probabilities, we may compare it to the product of the frequencies. Therefore, the important quantities are the sums of IMED indexes over each size-$q$ subset of arms, each seen as a comparison between the optimality of this group of $q$ elements and the associated frequency of play of that group. The minimal IMED index is the one whose frequency of play is the lowest compared to its likelihood of optimality, and similarly for the sums of IMED indexes. Other intuitions regarding the fairness (frequency of pulls within the same class) of the algorithm are given in Appendix D.

## 4 Regret analysis

In this section, we detail the main bound on the regret of IMED-EC.

Theorem 2 (Upper bound on the number of pulls). Under the IMED-EC algorithm and Assumptions 1 and 2, the number of pulls of a suboptimal arm $a$ is upper bounded as
$$\mathbb E_\nu[N_a(T)] \;\leq\; \frac{\log T}{q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)}\,(1 + \alpha(\varepsilon)) + f(\varepsilon), \tag{7}$$
where $0 < \varepsilon < \frac13 \min_{a \in A \setminus A^\star} (\mu^\star - \mu_a)$, $f$ is a function that depends on concentration properties of $\mathcal F$, and $\alpha(\varepsilon)$ tends to 0 as $\varepsilon$ tends to 0.

Remark: The functions $\alpha$ and $f$ are mostly used to derive the theoretical guarantees in the regret analysis of IMED-EC. $\alpha$ is controlled thanks to property 2, as in the paper of Honda and Takemura [2015] for the IMED regret analysis. A finite-sample analysis can be derived from a careful analysis of the term $f$; being more precise requires scrutinizing the properties of the considered family.

Corollary 1. Under the IMED-EC algorithm and Assumptions 1 and 2, for any subset $c_q$ of $q$ arms of a suboptimal class,
$$\sum_{a \in c_q} \mathbb E_\nu[N_a(T)]\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star) \;\leq\; (1 + \alpha(\varepsilon)) \log T + g(\varepsilon), \tag{8}$$
where $0 < \varepsilon < \frac13 \min_{a \in A \setminus A^\star} (\mu^\star - \mu_a)$ and $\alpha(\varepsilon)$ tends to 0 as $\varepsilon$ tends to 0.

Theorem 3 (Asymptotic upper bound on the number of pulls). Under the IMED-EC algorithm and Assumptions 1 and 2, the number of pulls of a suboptimal arm $a$ is asymptotically upper bounded as
$$\limsup_{T\to\infty} \frac{\mathbb E_\nu[N_a(T)]}{\log T} \;\leq\; \frac{1}{q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)}. \tag{9}$$
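As a rough numerical reading of Theorem 3, under the unit-variance Gaussian assumption used above and with hypothetical values $\mu_a = 0.5$, $\mu^\star = 0.9$ and $q = 3$, the asymptotic per-arm constant $1/(q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star))$ is $q$ times smaller than the constant $1/\mathcal K_{\mathcal F}(\nu_a, \mu^\star)$ achieved by unstructured asymptotically optimal strategies such as IMED:

```python
def kl_gaussian(mu_a, mu_b, sigma=1.0):
    return (mu_a - mu_b) ** 2 / (2.0 * sigma ** 2)

mu_a, mu_star, q = 0.5, 0.9, 3                        # hypothetical instance
unstructured = 1.0 / kl_gaussian(mu_a, mu_star)       # asymptotic E[N_a(T)] / log T without structure
structured = 1.0 / (q * kl_gaussian(mu_a, mu_star))   # Theorem 3 constant for IMED-EC
print(unstructured, structured)                       # 12.5 versus ~4.17: a factor q = 3 smaller
```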
Discussion. This upper bound shows in particular that the number of pulls of a suboptimal class, $\sum_{a \in c} \mathbb E_\nu[N_a(T)]$, is asymptotically no more than $\frac{|c|}{q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)} \log T$. This hence matches the lower bound in the non-combinatorial regime. In the combinatorial regime, together with equation (6), this upper bound shows that
$$\frac{1}{\mathcal K_{\mathcal F}(\nu_a, \mu^\star)} \;\leq\; \liminf_{T\to\infty} \frac{\sum_{a \in c} \mathbb E_\nu[N_a(T)]}{\log T} \;\leq\; \limsup_{T\to\infty} \frac{\sum_{a \in c} \mathbb E_\nu[N_a(T)]}{\log T} \;\leq\; \frac{|c|}{q\, \mathcal K_{\mathcal F}(\nu_a, \mu^\star)} \;<\; \frac{2}{\mathcal K_{\mathcal F}(\nu_a, \mu^\star)},$$
proving that the regret of the proposed IMED-EC does not differ from the optimal lower bound by a factor of more than 2. This is a striking result. Equation (5) can be used to obtain an even more precise control of the discrepancy to the optimal regret bound, as it shows that the factor 2 can actually be replaced with $1 + \frac{|c| - q}{q}\, \frac{\min_{\kappa \in \nu} \mathcal K_{\mathrm{eq}}(\nu_a, \kappa)}{\mathcal K_{\mathcal F}(\nu_a, \mu^\star)}$. Since the factor $\frac{|c| - q}{q}\, \frac{\min_{\kappa \in \nu} \mathcal K_{\mathrm{eq}}(\nu_a, \kappa)}{\mathcal K_{\mathcal F}(\nu_a, \mu^\star)}$ is strictly between 0 and 1 in the combinatorial regime that we are studying, the discrepancy between the lower bound and the regret of IMED-EC is indeed always bounded by 2. On the other hand, this refined error measurement is problem-dependent, while the factor of 2 is universal. We provide the full proofs of Theorem 3 and Theorem 2 in Appendix C, where we also discuss how to weaken Assumption 2 while still obtaining the result of Theorem 2.

## 5 Experiments

In this section, we support our theoretical analysis by conducting three sets of experiments. The Python code used to perform these experiments is available on GitHub at https://github.com/fabienpesquerel/stochastic-bandits-with-groups-of-similar-arms-neurips-2021. We report cumulative regret curves. All the experiments in this section are conducted using Gaussian distributions whose means lie between 0 and 1, with unit standard deviation. These plots are representative of all the experiments that we conducted; more plots and experiments may be found in Appendix D.

Balanced classes, perfect knowledge. In this set of experiments (see Figure 1), we focus on bandit configurations in which all equivalence classes have the same cardinality and assume that we know the number of elements per class. This setting is interesting for two reasons. First, one can compute the theoretical lower bound without solving a combinatorial optimization problem. Second, the theoretical analysis shows that IMED-EC is asymptotically optimal in this case. This setting thus allows us to numerically grasp what happens in the most structured case. We compare IMED-EC to the unspecialized bandit algorithms UCB, IMED and KL-UCB. To make the comparison fairer, we also compare IMED-EC to OSSB, an algorithm specialized in structured bandits. Since OSSB has to solve a combinatorial optimization problem at each time step, we cannot carry out experiments on large sets of arms when comparing IMED-EC to it. In this particular setting, we see that while OSSB and IMED-EC are both provably asymptotically optimal, IMED-EC numerically performs better over a finite time horizon. We recall that it is furthermore numerically more efficient, since it does not involve any combinatorial optimization. Unsurprisingly, IMED-EC also outperforms the unspecialized algorithms.

[Figure 1: 3 classes, 3 distributions per class; set of means = {0.3, 0.5, 0.9}]

Imperfect knowledge. In the experiment plotted in Figure 2, we weaken the knowledge hypothesis and assume that we only know a lower bound on the number of elements per class, while the classes are still balanced. We compare IMED-EC to the unspecialized bandit algorithms IMED and KL-UCB. We drop OSSB from our test bed due to the computational burden of solving a combinatorial optimization problem at each time step. We can see that the finite-time cumulative regret of IMED-EC is indeed much smaller than the regret of the unspecialized algorithms.

[Figure 2: 7 classes, 8 distributions per class; set of means = {0.1, 0.3, 0.4, 0.5, 0.6, 0.75, 0.9}]

Influence of the parameter q. Here we show the numerical robustness of IMED-EC with respect to the lower bound parameter $q$ on the number of elements per class.
On the same bandit problem, we compare different instances of IMED-EC in which different values of $q$ are used. In the legend, "opt." stands for optimal and corresponds to the largest valid lower bound on the number of elements per class, i.e., the minimal number of elements in a class. The experiment of Figure 3 is performed on a bandit problem with 7 classes and an uneven number of distributions per class: the smallest class has 4 elements and the largest 23. As $q$ increases up to the minimum cardinality of a class, we see that the performance of IMED-EC improves. It is rather remarkable that once we go beyond this theoretical threshold, the performance of IMED-EC does not deteriorate; we even found it difficult to find settings that deteriorate it at all. While the expected regret does not seem to deteriorate, we sometimes see that the tails of the regret widen, as can be seen in Figure 3 for $q = 7$ and $q = 20$, where the 0.9-quantile curves are large for those values of $q$. We attribute part of this robustness to the fact that the relaxation made in IMED-EC makes the algorithm over-explore compared to what the true lower bound suggests: increasing $q$ reduces the exploration and may therefore improve the performance of the algorithm. However, this robustness is observed even in the case where the classes are balanced, so this interpretation does not explain everything about the numerical robustness of IMED-EC. This type of experiment does not take more than roughly 10 to 15 minutes in a notebook run on Google Colab, depending on the number of arms, the horizon and the number of runs. This supports the numerical efficiency of the relaxation made in IMED-EC.

[Figure 3: 7 classes, unbalanced; set of means = {0.1, 0.3, 0.4, 0.5, 0.6, 0.75, 0.9}]

## 6 Conclusion

In this paper, we introduced IMED-EC, a numerically efficient algorithm to solve a structured bandit problem for which we derived a lower bound involving a combinatorial optimization problem. While not asymptotically optimal, we proved that the asymptotic regret of IMED-EC is always smaller than the unstructured one and that the discrepancy with respect to the structured regret lower bound is controlled by a factor of at most 2.

## Acknowledgments and Disclosure of Funding

This work has been supported by the French Ministry of Higher Education and Research, Inria, Scool, the French Agence Nationale de la Recherche (ANR) under grant ANR-16-CE40-0002 (the BADASS project), the MEL, and the I-Site ULNE regarding project R-PILOTE-19-004-APPRENF. The PhD of Fabien Pesquerel is supported by a grant from École Normale Supérieure.

## References

Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1995.

S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 2002.

D. Baudry, E. Kaufmann, and O.-A. Maillard. Sub-sampling for efficient non-parametric bandit exploration. In NeurIPS 2020, Vancouver, Canada, Dec. 2020. URL https://hal.archives-ouvertes.fr/hal-02977552.
A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.

O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.

H. P. Chan. The multi-armed bandit problem: An efficient nonparametric solution. The Annals of Statistics, 48(1):346–373, 2020.

R. Combes and A. Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, 2014.

R. Combes, S. Magureanu, and A. Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1763–1771, 2017.

T. Cuvelier, R. Combes, and E. Gourdin. Asymptotically optimal strategies for combinatorial semi-bandits in polynomial time. In Algorithmic Learning Theory, pages 505–528. PMLR, 2021a.

T. Cuvelier, R. Combes, and E. Gourdin. Statistically efficient, polynomial-time algorithms for combinatorial semi-bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 5(1):1–31, 2021b.

R. de Heide, J. Cheshire, P. Ménard, and A. Carpentier. Bandits with many optimal arms. In Advances in Neural Information Processing Systems, 2021. URL https://arxiv.org/abs/2103.12452.

A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, 1998.

A. Durand, O.-A. Maillard, and J. Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. arXiv preprint arXiv:1708.00768, 2017.

A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. arXiv preprint arXiv:1602.07182, 2016.

T. L. Graves and T. L. Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.

J. Honda and A. Takemura. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391, 2011.

J. Honda and A. Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16:3721–3756, 2015.

A. Kalvit and A. Zeevi. From finite to countable-armed bandits. In Advances in Neural Information Processing Systems, volume 33, pages 8259–8269. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/5dbc8390f17e019d300d5a162c3ce3bc-Paper.pdf.

N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, 2013.

B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776. PMLR, 2015.

B. Kveton, C. Szepesvari, Z. Wen, M. Ghavamzadeh, and T. Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In ICML, 2019.

B. Kveton, M. Zaheer, C. Szepesvari, L. Li, M. Ghavamzadeh, and C. Boutilier. Randomized exploration in generalized linear bandits. In Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2066–2076. PMLR, 2020. URL http://proceedings.mlr.press/v108/kveton20a.html.
T. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 1985.

T. L. Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987.

T. L. Lai. Boundary crossing problems for sample means. The Annals of Probability, pages 375–396, 1988.

T. Lattimore and C. Szepesvari. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.

T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

S. Lu, G. Wang, Y. Hu, and L. Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4154–4163. PMLR, 2019. URL http://proceedings.mlr.press/v97/lu19c.html.

S. Magureanu. Efficient Online Learning under Bandit Feedback. PhD thesis, KTH Royal Institute of Technology, 2018.

S. Magureanu, R. Combes, and A. Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. Machine Learning, 35:1–25, 2014.

O.-A. Maillard. Boundary crossing probabilities for general exponential families. Mathematical Methods of Statistics, 27(1):1–31, 2018.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.

H. Saber, P. Ménard, and O.-A. Maillard. Forced-exploration free strategies for unimodal bandits. arXiv preprint arXiv:2006.16569, 2020.

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022. Omnipress, 2010.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 1933a.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933b.

W. R. Thompson. On a criterion for the rejection of observations and the distribution of the ratio of deviation to sample standard deviation. The Annals of Mathematical Statistics, 6(4):214–219, 1935.

T. Wang, W. Ye, D. Geng, and C. Rudin. Towards practical Lipschitz bandits. In Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, Oct. 2020. doi: 10.1145/3412815.3416885. URL http://dx.doi.org/10.1145/3412815.3416885.

J. Y. Yu and S. Mannor. Unimodal bandits. In ICML, 2011.