Incentivizing Combinatorial Bandit Exploration

Xinyan Hu1, Dung Daniel Ngo2, Aleksandrs Slivkins3, and Zhiwei Steven Wu4
1University of California, Berkeley. Email: xinyanhu@berkeley.edu
2University of Minnesota. Email: ngo00054@umn.edu
3Microsoft Research NYC. Email: slivkins@microsoft.com
4Carnegie Mellon University. Email: zstevenwu@cmu.edu

Abstract

Consider a bandit algorithm that recommends actions to self-interested users in a recommendation system. The users are free to choose other actions and need to be incentivized to follow the algorithm's recommendations. While the users prefer to exploit, the algorithm can incentivize them to explore by leveraging the information collected from the previous users. All published work on this problem, known as incentivized exploration, focuses on small, unstructured action sets and mainly targets the case when the users' beliefs are independent across actions. However, realistic exploration problems often feature large, structured action sets and highly correlated beliefs. We focus on a paradigmatic exploration problem with structure: combinatorial semi-bandits. We prove that Thompson Sampling, when applied to combinatorial semi-bandits, is incentive-compatible when initialized with a sufficient number of samples of each arm (where this number is determined in advance by the Bayesian prior). Moreover, we design incentive-compatible algorithms for collecting the initial samples.

1 Introduction

We consider incentivized exploration: how to incentivize self-interested users to explore. A social planner interacts with self-interested users (henceforth, agents) and can make recommendations, but cannot force the agents to comply with these recommendations. The agents face uncertainty about the available alternatives. The social planner would like the agents to trade off exploration, for the sake of acquiring new information, against exploitation, making optimal near-term decisions based on the current information. The agents, on the other hand, prefer to exploit. However, the algorithm can incentivize them to explore by leveraging the information collected from the previous users. This problem has been studied since Kremer et al. (2014); see Slivkins (2019, Ch. 11) for an overview.

The basic model of incentivized exploration is very stylized and can be extended along two dimensions: more sophisticated economic models for the agents' behavior and incentives, and more complex machine-learning models for action structures and rewards. All published work has only dealt with the former, whereas we pursue the latter. In particular, all published work focuses on small, unstructured action sets. Moreover, the case of independent priors, when the users' beliefs are independent across actions, is emphasized as the main, paradigmatic special case when specific performance guarantees are derived. However, realistic exploration problems often feature large action sets with some known structure that connects actions to one another. A major recurring theme in the vast literature on multi-armed bandits is taking advantage of the available structure so as to enable the algorithm to cope with the large number of actions.

Research done while X. Hu was an undergraduate student at Peking University and a (virtual) visiting student at Carnegie Mellon University.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
We focus on a paradigmatic, well-studied exploration problem with structured actions: combinatorial semi-bandits. Here, each arm is a subset of some ground set, whose elements are called atoms. In each round, the algorithm chooses an arm and observes/collects the reward for each atom in this arm. The reward for each atom is drawn independently from some fixed (but unknown) distribution specific to this atom. The set of feasible arms reflects the structure of the problem; e.g., it can comprise all subsets of atoms of a given cardinality, or all edge-paths in a given graph. Since the number of arms ($K$) can be exponential in the number of atoms ($d$), the main theme is replacing the dependence on $K$ in regret bounds for "unstructured" $K$-armed bandits with a similar dependence on $d$.

We adopt a standard model for incentivized exploration from Kremer et al. (2014). The social planner is implemented as a bandit algorithm. Each round corresponds to a new agent, who arrives and receives the arm chosen by the algorithm as a recommendation. Agents have Bayesian beliefs, independent across the atoms (but highly correlated across the arms). The algorithm must ensure that following its recommendation is in each agent's best interest, a condition called Bayesian incentive-compatibility (BIC). Each agent does not observe what happened with the previous agents, but the algorithm does. This information asymmetry is crucial for creating incentives.

Our contributions. We prove that Thompson Sampling is BIC when initialized with at least $n_{\mathrm{TS}}$ samples of each atom, where $n_{\mathrm{TS}}$ is determined by the prior and scales polynomially in the number of atoms ($d$). Thompson Sampling (Thompson, 1933) is a well-known bandit algorithm with near-optimal regret bounds and good empirical performance. The initial samples can be provided by another BIC algorithm (more on this below), or procured exogenously, e.g., bought with money.

Next, we consider the problem of initial exploration: essentially, design a BIC algorithm that samples each atom at least once. Such algorithms are interesting in their own right, and can be used to bootstrap Thompson Sampling, as per above. We present two such algorithms, which build on prior work (Mansour et al., 2020; Simchowitz and Slivkins, 2021) and extend it in non-trivial ways. The objective to be optimized is the sufficient number of rounds $T_0$, and particularly its dependence on $d$. To calibrate, prior work on incentivized exploration in multi-armed bandits with correlated priors does not provide any guarantees for a super-constant number of arms ($K$), and is known to have $T_0 > \exp(\Omega(K))$ in some natural examples (Mansour et al., 2020). In contrast, our algorithms satisfy $T_0 \le \exp(O(d))$ for a paradigmatic special case, and $T_0 \le \exp(O(d^3))$ in general.

Finally, what if the prior is not independent across atoms? We focus on two arms with arbitrary correlation, a fundamental special case of incentivized exploration, and prove that our analysis of Thompson Sampling extends to this case. This result may be of independent interest.

Discussion. Like all prior work on incentivized exploration, we consider standard, yet idealized models for the agents' economic behavior and for the machine-learning problem being solved by the social planner. The modeling captures something essential about exploration and incentives in recommendation systems, but is not supposed to capture all the particularities of any specific application scenario.
The goal of this paper is to bring more complexity into the machine-learning problem; advancing the economic model is beyond our scope. We focus on establishing the BIC property and asymptotic guarantees in terms of the number of atoms, without attempting to optimize the dependence on the per-atom Bayesian priors. Our Thompson Sampling result has an encouraging practical implication: a standard, well-performing bandit algorithm plays well with users' incentives, provided a small (in theory) amount of initial data.

(Stylized) motivating examples of incentivized exploration in combinatorial semi-bandits include: recommending online content, e.g., for news or entertainment (with atoms being, e.g., specific news articles); recommending complementary products, e.g., a suit that consists of multiple items of clothing; and recommending driving directions; see Appendix A for more details. In all examples, the social planner corresponds to the online platform issuing the respective recommendations. Such online platforms are often interested in maximizing users' happiness, rather than (or in addition to) the immediate revenue, as a way to ensure user engagement and long-term success.

Related work. Incentivized exploration, as defined in this paper, was introduced in Kremer et al. (2014) and subsequently studied, e.g., in Mansour et al. (2020, 2022); Immorlica et al. (2020); Bahar et al. (2016, 2019), along with some extensions. Most related is Sellke and Slivkins (2021), which obtains similar BIC results for the special case of multi-armed bandits with independent priors, both for Thompson Sampling and for initial exploration. A yet unpublished working paper of Simchowitz and Slivkins (2021) provides a BIC algorithm for initial exploration in reinforcement learning; we build on this result in one of ours. Similar, but technically incomparable versions of the problem have been studied, e.g., with time-discounted rewards (Bimpikis et al., 2018) and with incentives created via money (Frazier et al., 2014; Chen et al., 2018).

From the perspective of theoretical economics, incentivized exploration is related to the literature on information design (Kamenica, 2019; Bergemann and Morris, 2019): essentially, one round of incentivized exploration is an instance of Bayesian persuasion, a central model in this literature. Other "online" models of Bayesian persuasion have been studied (e.g., Castiglioni et al., 2020; Zu et al., 2021), but they are very different from ours in that the planner's private signal is drawn IID in each round (whereas in our model it is the algorithm's history); their problem has nothing to do with exploration, and is not even meaningful without incentives.

On the machine learning side, this paper is related to the work on combinatorial semi-bandits, starting from György et al. (2007), e.g., (Chen et al., 2013; Kveton et al., 2015, 2014), and the work on Thompson Sampling; see Russo et al. (2018) for a survey. In particular, near-optimal Bayesian regret bounds have been derived in Russo and Van Roy (2014, 2016), and frequentist ones in (Agrawal and Goyal, 2017; Kaufmann et al., 2012). Thompson Sampling has been applied to combinatorial semi-bandits (e.g., Gopalan et al., 2014; Wen et al., 2015; Degenne and Perchet, 2016; Wang and Chen, 2018), with Bayesian regret bounds derived in Russo and Van Roy (2016).
2 Problem Formulation and Preliminaries

Our algorithm operates according to the standard protocol for combinatorial semi-bandits, with an additional incentive-compatibility constraint that is standard in the literature on incentivized exploration.

Combinatorial semi-bandits. There are $T$ rounds, $d$ atoms and $K$ arms, where each arm is a subset of atoms. The set $\mathcal{A}$ of feasible arms is fixed and known. In each round $t$, each atom $a$ generates a reward $r_a(t) \in [0,1]$. The algorithm chooses an arm $A(t) \in \mathcal{A}$ and observes the reward of each atom in this arm (and nothing else). The algorithm's reward in this round is the total reward of these atoms. Formally, we write $[T] := \{1, \ldots, T\}$ for the set of all rounds and $[d]$ for the set of all atoms, so that arms are subsets $A \subseteq [d]$. Let $\mu_a$ be the expected reward of atom $a \in [d]$, and let $\mu(A) = \sum_{a \in A} \mu_a$ be the expected reward of a given arm $A \subseteq [d]$. Note that $d$-armed bandits are a special case in which the feasible arms are the singleton sets $\{a\}$, $a \in [d]$.

Stochastic rewards and Bayesian priors. The reward of each atom $a$ lies in $[0,1]$ and is drawn independently in each round from a fixed distribution $\mathcal{D}_a$ specific to this atom. This distribution comes from a parametric family, parameterized by the expected reward $\mu_a$. The (realized) problem instance is therefore specified by the mean reward vector $\mu = (\mu_1, \ldots, \mu_d)$. Initially, each $\mu_a$ is drawn independently from a Bayesian prior $\mathcal{P}_a$ with support $[0,1]$. Put differently, the mean reward vector $\mu$ is drawn from the product prior $\mathcal{P} = \mathcal{P}_1 \times \cdots \times \mathcal{P}_d$.

Incentive-compatibility. The algorithm must ensure that in each round $t$, conditional on a particular arm $A = A(t)$ being chosen, the expected reward of this arm is at least as large as that of any other arm. Formally, the algorithm is called Bayesian incentive-compatible (BIC) if for each round $t \in [T]$,
$$\mathbb{E}\big[\, \mu(A) - \mu(A') \;\big|\; A(t) = A \,\big] \;\ge\; 0 \qquad \forall \text{ arms } A, A' \in \mathcal{A} \text{ with } \mathbb{P}[A(t) = A] > 0. \tag{1}$$

This definition is based on the following stylized story. In each round $t$, a new user arrives at a recommendation system, observes the arm $A(t)$ chosen by our algorithm, and interprets it as a recommendation. Then the user decides which arm to choose (not necessarily the arm recommended), and receives the corresponding reward. Accordingly, the user needs to be incentivized to follow the recommendation. We adopt a standard setup from economic theory (and the prior work on incentivized exploration): each user has the same prior $\mathcal{P}$, knows the algorithm, and wishes to maximize her expected reward. We posit that the user does not observe anything else before making her decision, other than the recommended arm. In particular, she does not observe anything about the previous rounds. Then, (1) ensures that she is (weakly) incentivized to follow her recommendation, assuming that the previous users followed theirs. We posit that under (1), the user does follow the recommendation, and then reports the rewards of all observed atoms to the algorithm.

We emphasize that this story is not a part of our formal model (although it can be expressed as such if needed). In fact, the story can be extended to allow the algorithm to reveal an arbitrary "message" to each user, but this additional power is useless: essentially, anything that can be achieved with arbitrary messages can also be achieved with arm recommendations. This can easily be proved as a version of Myerson's direct revelation principle from theoretical economics.
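To make the protocol above concrete, here is a minimal simulation sketch of the semi-bandit interaction with Bernoulli per-atom rewards (the paradigmatic case used later). The class and function names, and the small numeric example, are our own illustrative choices and are not part of the paper's formalism.

```python
# A minimal sketch of the combinatorial semi-bandit protocol, assuming Bernoulli
# per-atom rewards. Each feasible arm is a frozenset of atom indices.
import numpy as np

class SemiBanditEnv:
    def __init__(self, mu, rng=None):
        self.mu = np.asarray(mu)             # mean reward of each atom, in [0, 1]
        self.rng = rng or np.random.default_rng()

    def pull(self, arm):
        """Semi-bandit feedback: per-atom Bernoulli rewards for the atoms in `arm`."""
        return {a: int(self.rng.random() < self.mu[a]) for a in arm}

def run(env, feasible_arms, choose_arm, T):
    """Generic interaction loop: in each round the algorithm picks a feasible arm,
    observes the reward of every atom in that arm, and collects their sum."""
    history, total = [], 0.0
    for t in range(T):
        arm = choose_arm(t, history)         # must be an element of feasible_arms
        feedback = env.pull(arm)
        history.append((arm, feedback))
        total += sum(feedback.values())
    return history, total

# Example: d = 4 atoms, arms = all subsets of size 2, a greedy-on-prior-mean baseline.
d = 4
feasible_arms = [frozenset(s) for s in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]]
env = SemiBanditEnv(mu=[0.9, 0.2, 0.5, 0.7])
prior_mean = np.array([0.5] * d)             # e.g., a Beta(1,1) prior for every atom
best_on_prior = max(feasible_arms, key=lambda A: prior_mean[list(A)].sum())
run(env, feasible_arms, lambda t, h: best_on_prior, T=100)
```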
Conventions. Each atom $a$ satisfies $\mathbb{P}[\mu_a > 0] > 0$: else, its rewards are all 0, so it can be ignored. No arm is contained in any other arm. This is w.l.o.g. for Bernoulli rewards, and more generally if $\mathbb{P}[\,r_a(t)=0 \mid \mu_a > 0\,] > 0$: if $A \subsetneq A'$ for arms $A, A'$, then $A$ cannot be chosen by any BIC algorithm. W.l.o.g., order the atoms by their prior mean rewards $\mu^0_a := \mathbb{E}[\mu_a]$, so that $\mu^0_1 \ge \mu^0_2 \ge \cdots \ge \mu^0_d$. Let $A^* = \arg\max_{A \in \mathcal{A}} \mu(A)$ denote the best arm overall, with some fixed tie-breaking rule. By a slight abuse of notation, each arm $A \subseteq [d]$ is sometimes identified with a binary vector $v \in \{0,1\}^d$ such that $v_a = 1 \Leftrightarrow a \in A$, for each atom $a \in [d]$. In particular, $A^*$ is identified with the corresponding vector $v^*$.

Thompson Sampling has a very simple definition, generic to many versions of multi-armed bandits. Let $\mathcal{F}_t$ denote the realized history (the tuples of chosen arms and realized rewards of all observed atoms) up to and not including round $t$. Write $\mathbb{E}^{(t)}[\cdot] = \mathbb{E}[\,\cdot \mid \mathcal{F}_t\,]$ and $\mathbb{P}^{(t)}[\cdot] = \mathbb{P}[\,\cdot \mid \mathcal{F}_t\,]$ as shorthand for posterior updates. Thompson Sampling in a given round $t$ draws an arm independently at random from the distribution $p^{(t)}(A) = \mathbb{P}^{(t)}[A^* = A]$, $A \in \mathcal{A}$. If Thompson Sampling is started from some fixed round $t_0 > 1$, this is tantamount to starting the algorithm from round 1, but with prior $\mathcal{P}(\cdot \mid \mathcal{F}_{t_0})$ rather than $\mathcal{P}$. The algorithm is well-defined for an arbitrary prior $\mathcal{P}$.

While this paper is not concerned with computational issues, they are as follows. The posterior update $\mathcal{P}(\cdot \mid \mathcal{F}_t)$ can be performed for each atom separately: $\mathcal{P}_a(\cdot \mid \mathcal{F}_{t,a})$, where $\mathcal{F}_{t,a}$ is the corresponding history of samples from this atom. A standard implementation draws $\mu'_a \in [0,1]$ independently from $\mathcal{P}_a(\cdot \mid \mathcal{F}_{t,a})$ for each atom $a$, then chooses the best arm according to these draws: $\arg\max_{A \in \mathcal{A}} \sum_{a \in A} \mu'_a$. The posterior updates $\mathcal{P}_a(\cdot \mid \mathcal{F}_{t,a})$ and the arg max choice are not computationally efficient in general, and may require heuristics (this is a common situation for all variants of Thompson Sampling). A paradigmatic special case is Beta priors $\mathcal{P}_a$ and Bernoulli reward distributions $\mathcal{D}_a$, so that the posterior $\mathcal{P}_a(\cdot \mid \mathcal{F}_{t,a})$ is another Beta distribution.

Composition of BIC algorithms. We rely on a generic observation from Mansour et al. (2020, 2022) that the composition of two BIC algorithms is also BIC.

Lemma 2.1. Let ALG be a BIC algorithm which stops after some round $T_0$. Let ALG$'(H)$ be another algorithm that takes as input the history $H$ collected by ALG, and suppose it is BIC. Consider the composite algorithm: ALG followed by ALG$'(H)$, which stops at the time horizon $T$. If $T_0$ is determined before the composite algorithm starts, then this algorithm is BIC.

3 Thompson Sampling is BIC

Our main result is that Thompson Sampling is BIC when initialized with at least $n_{\mathrm{TS}}$ samples of each atom, where $n_{\mathrm{TS}}$ is determined by the prior and scales polynomially in $d$, the number of atoms.

Theorem 3.1. Let ALG be any BIC algorithm such that, by some time $T_0 \le T$ (determined by the prior), it almost surely collects at least $n_{\mathrm{TS}} = C_{\mathrm{TS}} \cdot d^2\, \varepsilon_{\mathrm{TS}}^{-2} \log(d/\delta_{\mathrm{TS}})$ samples from each atom, where
$$\varepsilon_{\mathrm{TS}} = \min_{A \ne A' \in \mathcal{A}} \mathbb{E}\big[(\mu(A) - \mu(A'))_+\big] \qquad \text{and} \qquad \delta_{\mathrm{TS}} = \min_{A \in \mathcal{A}} \mathbb{P}[A^* = A], \tag{2}$$
and $C_{\mathrm{TS}}$ is a large enough absolute constant. Consider the composite algorithm which runs ALG for the first $T_0$ rounds, followed by Thompson Sampling. This algorithm is BIC.

Note that $T_0$ and $n_{\mathrm{TS}}$ are "constants" once the prior is fixed, in the sense that they do not depend on the time horizon $T$, the mean reward vector $\mu$, or the rewards in the data collected by ALG.

Remark 3.2. For statistical guarantees, consider Bayesian regret, i.e., regret in expectation over the Bayesian prior. The Bayesian regret of the composite algorithm in Theorem 3.1 is at most $T_0$ plus the Bayesian regret of Thompson Sampling. The latter is $O(\sqrt{d\,T\,\log d})$ for any prior (Russo and Van Roy, 2016).
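For concreteness, the per-atom implementation of Thompson Sampling described in Section 2 can be sketched as follows in the Beta-Bernoulli case. The class name and the brute-force arg max are our own illustrative choices; in structured problems a combinatorial optimization oracle would replace the brute-force step. Warm-starting as in Theorem 3.1 amounts to feeding the initial samples of each atom through `update()` before the first recommendation.

```python
# A sketch of the standard per-atom implementation of Thompson Sampling for
# combinatorial semi-bandits with Beta priors and Bernoulli rewards.
import numpy as np

class CombinatorialThompsonSampling:
    def __init__(self, d, feasible_arms, alpha=None, beta=None, rng=None):
        self.feasible_arms = list(feasible_arms)
        # Beta(alpha_a, beta_a) prior for each atom a; Beta(1,1) if unspecified.
        self.alpha = np.array(alpha if alpha is not None else [1.0] * d, dtype=float)
        self.beta = np.array(beta if beta is not None else [1.0] * d, dtype=float)
        self.rng = rng or np.random.default_rng()

    def choose_arm(self):
        # Draw a mean-reward vector from the per-atom posteriors, then pick the
        # arm that is best under this draw; this realizes p_t(A) = P_t[A* = A].
        theta = self.rng.beta(self.alpha, self.beta)
        return max(self.feasible_arms, key=lambda A: theta[list(A)].sum())

    def update(self, feedback):
        # feedback: dict atom -> 0/1 reward; the usual Beta-Bernoulli conjugate update.
        for a, r in feedback.items():
            self.alpha[a] += r
            self.beta[a] += 1 - r
```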
For an end-to-end result, we provide a suitable ALG in Section 4.1, with a specific $T_0$.

Remark 3.3. We can invoke Lemma 2.1 since $T_0$ is determined in advance. So, it suffices to show that each round $t$ of Thompson Sampling satisfies the BIC condition (1).

Let us clarify the dependence on $d$. Note that the parameters $\varepsilon_{\mathrm{TS}}$ and $\delta_{\mathrm{TS}}$ may depend on $d$ through the prior. To separate the dependence on $d$ from that on the prior, we posit that each per-atom prior $\mathcal{P}_a$ belongs to a fixed collection $\mathcal{C}$. We make mild non-degeneracy assumptions:
$$\mathbb{P}\big[\, \mu(A') < \mathbb{E}[\mu(A)] \,\big] > 0 \quad \text{for all arms } A \ne A'. \tag{3}$$
$$\mathbb{P}[\mu_a > \gamma] > 0 \quad \text{for all atoms } a \in [d] \text{ and some } \gamma \in (0,1). \tag{4}$$
$$\mathbb{P}[\mu_a < x] \ge \mathrm{poly}(1/x)\, \exp\!\big(-x^{-\lambda}\big) \quad \text{for all atoms } a \in [d], \text{ all } x \in (0,1/2), \text{ and some } \lambda \ge 0. \tag{5}$$
(For the special case of $d$-armed bandits, assumption (3) is necessary and sufficient for the respective arm $A$ to be explorable: chosen in some round by some BIC algorithm (Sellke and Slivkins, 2021).)

Corollary 3.4. Suppose all priors $\mathcal{P}_a$ of atoms $a \in [d]$ belong to some fixed, finite collection $\mathcal{C}$ of priors, and assumptions (3)-(5) are satisfied with some absolute constants $\gamma, \lambda$. Then $n_{\mathrm{TS}} = O_{\mathcal{C}}(d^{3+\lambda} \log d)$, where $O_{\mathcal{C}}$ hides the absolute constants and the dependence on $\mathcal{C}$.

Remark 3.5. The initial data can also be provided to Thompson Sampling exogenously (rather than via a BIC algorithm ALG), e.g., purchased with money. More formally, one would need to provide a collection of (arm, atoms' rewards) datapoints such that each atom is sampled at least $n_{\mathrm{TS}}$ times. (A subtlety: the number of samples of each arm should be known in advance. This is because otherwise the Bayesian update on this data may become dependent on the data-collection algorithm.)

Proof Sketch for Theorem 3.1 (full proof in Appendix B). In order to establish the BIC condition (1) for Thompson Sampling, we first observe that $\mathbb{P}[A^* = A]$ is a positive prior-dependent constant for all arms $A$, so it suffices to prove that $\mathbb{E}\big[\, \mathbb{E}^{(t)}[\mu(A) - \mu(A')] \cdot \mathbf{1}\{A^* = A\} \,\big] \ge 0$ for all $A, A'$. Next, to show a lower bound on $\mathbb{E}\big[(\mu(A) - \mu(A')) \cdot \mathbf{1}\{A^* = A\}\big]$, we leverage the Harris inequality (Theorem D.2), which says that increasing functions of independent random variables are non-negatively correlated. Observe that the functions $(\mu(A) - \mu(A'))_+$ and $\mathbf{1}\{A^* = A\}$ are co-monotone in each coordinate of $\mu$ (i.e., either both increasing or both decreasing in that coordinate). Then, the mixed-monotonicity Harris inequality (see Theorem D.2) implies that
$$\mathbb{E}\big[(\mu(A) - \mu(A'))\, \mathbf{1}\{A^* = A\}\big] \;=\; \mathbb{E}\big[(\mu(A) - \mu(A'))_+\, \mathbf{1}\{A^* = A\}\big] \;\ge\; \mathbb{E}\big[(\mu(A) - \mu(A'))_+\big] \cdot \mathbb{P}[A^* = A] \;\ge\; \varepsilon_{\mathrm{TS}}\, \delta_A,$$
where $\delta_A := \mathbb{P}[A^* = A] \ge \delta_{\mathrm{TS}}$. To finish the proof, we show that the expected absolute difference between $\mathbb{E}^{(t)}[\mu(A) - \mu(A')] \cdot \mathbf{1}\{A^* = A\}$ and $(\mu(A) - \mu(A')) \cdot \mathbf{1}\{A^* = A\}$ is upper bounded by $\varepsilon_{\mathrm{TS}}\, \delta_A$. By regrouping and using the triangle inequality, together with $|x \cdot \mathbf{1}\{A^* = A\}| = |x| \cdot \mathbf{1}\{A^* = A\}$, we can upper-bound this estimation error by the sum of $\mathbb{E}\big[\,|\mathbb{E}^{(t)}[\mu(A)] - \mu(A)| \cdot \mathbf{1}\{A^* = A\}\,\big]$ and $\mathbb{E}\big[\,|\mathbb{E}^{(t)}[\mu(A')] - \mu(A')| \cdot \mathbf{1}\{A^* = A\}\,\big]$. Since the mean reward of each atom can be estimated by its empirical average, we can apply a Bayesian Chernoff bound (see Lemma D.1) and observe that these two terms are $n_{\mathrm{TS}}^{-1/2}$ times $O(1)$-sub-Gaussian random variables. By the sub-Gaussian tail bound (see Lemma D.4), we upper-bound both terms by $O\big(n_{\mathrm{TS}}^{-1/2} \log(1/\delta_A)\big)$. We conclude by using our choice of $n_{\mathrm{TS}}$ and observing that $\delta_{\mathrm{TS}} \le \delta_A$.
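The prior-dependent constants $\varepsilon_{\mathrm{TS}}$ and $\delta_{\mathrm{TS}}$ in (2) can be estimated by Monte Carlo sampling from the prior, as in the following sketch. The helper names and the placeholder constant `C_TS` are ours, and the `n_TS` formula simply instantiates our reading of (2); it is illustrative rather than a tuned recommendation.

```python
# Monte Carlo estimation of eps_TS (smallest expected positive gap between two arms)
# and delta_TS (smallest probability of any arm being best) under independent
# Beta priors, for a small feasible set. Brute force over arm pairs; sketch only.
import numpy as np
from itertools import permutations

def estimate_prior_constants(feasible_arms, alpha, beta, n_samples=100_000, rng=None):
    rng = rng or np.random.default_rng(0)
    arms = [np.array(sorted(A)) for A in feasible_arms]
    # Sample mean-reward vectors from the independent per-atom Beta priors.
    mu = rng.beta(np.asarray(alpha, float), np.asarray(beta, float),
                  size=(n_samples, len(alpha)))                 # shape (n_samples, d)
    arm_means = np.stack([mu[:, A].sum(axis=1) for A in arms], axis=1)
    # delta_TS = min_A P[A is the best arm]  (may be estimated as 0 if n_samples is small).
    best = arm_means.argmax(axis=1)
    delta_TS = min(np.mean(best == i) for i in range(len(arms)))
    # eps_TS = min over ordered arm pairs of E[(mu(A) - mu(A'))_+].
    eps_TS = min(np.mean(np.maximum(arm_means[:, i] - arm_means[:, j], 0.0))
                 for i, j in permutations(range(len(arms)), 2))
    return eps_TS, delta_TS

def n_TS(d, eps_TS, delta_TS, C_TS=1.0):
    # Placeholder instantiation of (2); C_TS stands in for the unspecified absolute constant.
    return int(np.ceil(C_TS * d**2 * eps_TS**-2 * np.log(d / delta_TS)))
```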
Proof Sketch for Corollary 3.4 (full proof in Appendix B). To derive the dependence of $n_{\mathrm{TS}}$ on $d$, we investigate how the prior-dependent constants $\varepsilon_{\mathrm{TS}}$ and $\delta_{\mathrm{TS}}$ depend on $d$. First, we can let $\varepsilon_{\mathcal{C}}$ be a version of $\varepsilon_{\mathrm{TS}}$ where the min is taken over all ordered pairs of priors in $\mathcal{C}$. Since $\mathcal{C}$ is finite and satisfies the pairwise non-dominance assumption (3), $\varepsilon_{\mathrm{TS}} \ge \varepsilon_{\mathcal{C}} > 0$. By definition, $\delta_{\mathrm{TS}} = \min_{A \in \mathcal{A}} \mathbb{P}[A^* = A]$ is the minimum probability that a given arm is the best arm overall. Fix an arm $A$. We observe that the event that arm $A$ is the best arm is more likely than the event that each atom in $A$ is larger than $\gamma$ while every atom not in $A$ is smaller than $\gamma/d$. Hence, we can lower-bound $\mathbb{P}[A^* = A]$ by $\mathbb{E}\big[\, \mathbf{1}\{\forall a \in A:\ \mu_a \ge \gamma\} \cdot \mathbf{1}\{\forall a \notin A:\ \mu_a < \gamma/d\} \,\big]$. Since the prior $\mathcal{P}$ is independent across atoms, we can write this expression as the product of $\mathbb{E}\big[\mathbf{1}\{\forall a \in A:\ \mu_a \ge \gamma\}\big]$ and $\mathbb{E}\big[\mathbf{1}\{\forall a \notin A:\ \mu_a < \gamma/d\}\big]$. As the values $\{\mu_a\}_{a \in [d]}$ are independent and the indicators are co-monotone in each coordinate of $\mu$, repeated application of the mixed-monotonicity Harris inequality (see Remark D.3) implies that this quantity is at least
$$\prod_{a \in A} \mathbb{P}[\mu_a \ge \gamma] \; \cdot \; \prod_{a \notin A} \mathbb{P}[\mu_a < \gamma/d].$$
By the full-support assumption (4), we can define a prior-dependent constant $\tilde\gamma := \min_{a \in [d]} \mathbb{P}[\mu_a \ge \gamma] > 0$. Then, by the definition of $\tilde\gamma$ and the non-degeneracy assumption (5), the expression above is lower bounded by $\tilde\gamma^{\,d} \cdot \mathrm{poly}(d^d/\gamma^d) \cdot \exp\!\big(-d\,(\gamma/d)^{-\lambda}\big)$. Plugging this bound into $n_{\mathrm{TS}}$, we obtain $n_{\mathrm{TS}} = O_{\mathcal{C}}(d^{3+\lambda} \log d)$.

3.1 The two-arm case with arbitrary correlation

What if the prior is not independent across atoms? Our analysis extends to the case of $K = 2$ arms $A, A'$ with arbitrary correlation between the atoms. In fact, we do not assume the combinatorial semi-bandit structure, and instead focus on the fundamental special case of incentivized exploration: two arms $A, A'$ with an arbitrary joint prior on $(\mu(A), \mu(A'))$. (Equivalently, we have $d = 2$ atoms with an arbitrary joint prior on $(\mu_1, \mu_2)$, and the feasible arms are the two singleton arms $\{1\}$ and $\{2\}$.)

Theorem 3.6. The assertion in Theorem 3.1 holds for the case of two arms $A, A'$ and an arbitrary joint prior on $(\mu(A), \mu(A'))$.

This result completes our understanding of incentivized exploration with two correlated arms: indeed, a necessary and sufficient condition (and the corresponding algorithm) for collecting the initial data are known (Mansour et al., 2020). A similar result for two independent arms appears in (Sellke and Slivkins, 2021). The analysis is very similar to that of Theorem 3.1 and is omitted. Unfortunately, our technique does not appear to extend to a larger number of arbitrarily correlated arms.

4 BIC algorithms for initial exploration

We present two BIC algorithms for initial exploration, where the objective is to sample each atom at least once (i.e., choose arms whose union is $[d]$) and complete in $N_0$ rounds for some $N_0$ determined by the prior. Such algorithms are interesting in their own right, and can be used to bootstrap Thompson Sampling as per Theorem 3.1. (To collect $n$ samples of each atom, repeat the algorithm $n$ times.) Both algorithms complete in a number of rounds that is exponential in $\mathrm{poly}(d)$. The first algorithm completes in $\exp(O_{\mathcal{P}}(d))$ rounds, but is restricted to arms of the same size and Beta-Bernoulli priors; we obtain $\exp(O_{\mathcal{P}}(d^2))$ for arbitrary sets of arms. The second algorithm sidesteps the Beta-Bernoulli restriction, but completes in $\exp(O_{\mathcal{P}}(d^3))$ rounds.

4.1 Reduction to K-armed bandits

The first algorithm builds a substantial "super-structure" on top of the "hidden exploration" paradigm from Mansour et al. (2020). The latter paradigm is defined for $K$-armed bandits and is proved to work for arms with independent priors, ordered by their prior mean rewards. However, for combinatorial semi-bandits, the arms' priors are highly correlated. We provide a new interpretation of their analysis in terms of exploring a generic sequence of arms that satisfies a certain property (P). Our technical contribution is to construct a sequence of arms and prove that it satisfies Property (P). Note that it suffices to explore a sequence of arms which collectively cover all the atoms.
Throughout this subsection, we make the following assumptions:
$$\text{the prior } \mathcal{P}_a \text{ for each atom } a \text{ is a Beta distribution with parameters } (\alpha_a, \beta_a); \tag{7}$$
$$\text{the reward distributions } \mathcal{D}_a \text{ are Bernoulli distributions.} \tag{8}$$
This is a paradigmatic special case for Thompson Sampling (and for Bayesian inference in general). Let
$$\mu^0_a(n) := \frac{\alpha_a}{\alpha_a + \beta_a + n}, \qquad n \ge 0,$$
be the posterior mean reward of atom $a$ conditioned on $n$ samples of this atom such that each of these samples returns reward 0 (so that $\mu^0_a(0) = \mu^0_a$, the prior mean).

Given any number $n \in \mathbb{N}$, let us define a sequence of $\ell(n) \ge 1$ arms $V^n_1, \ldots, V^n_{\ell(n)} \in \mathcal{A}$. Let $V^n_1$ be a prior-best arm: any arm with the largest prior mean reward. The subsequent arms are defined inductively. Essentially, we pretend that each atom in each arm in the sequence so far has been sampled exactly $n$ times and received reward 0 each time it has been sampled. The next arm is then defined as the posterior-best arm: an arm with the largest posterior mean reward after seeing these samples. Formally, for each $i \ge 2$, we define arm $V^n_i$ given the previous arms $V^n_1, \ldots, V^n_{i-1}$. For each atom $a \in [d]$, set $Z^n_i(a) = n$ if this atom is contained in one of the previous arms in the sequence, and set $Z^n_i(a) = 0$ otherwise. Then, define $V^n_i$ as a posterior-best arm when the posterior mean reward of each atom $a$ is given by $\mu^0_a(Z^n_i(a))$. That is:
$$V^n_i \in \operatorname*{arg\,max}_{A \in \mathcal{A}} \ \sum_{a \in A} \mu^0_a\big(Z^n_i(a)\big). \tag{9}$$
The sequence stops when the arms therein cover all atoms at least once, and continues infinitely otherwise; this defines $\ell(n)$. (In Theorem 4.2 and Theorem 4.4, we upper-bound $\ell(n)$ for the relevant prior-dependent choice $n = n_{\mathcal{P}}$.)

To state Property (P), we focus on this sequence for a particular, prior-dependent choice of $n$.

(P) There exist numbers $n_{\mathcal{P}} \in \mathbb{N}$ and $\tau_{\mathcal{P}}, \rho_{\mathcal{P}} \in (0,1)$, determined by the prior $\mathcal{P}$, which satisfy the following. Focus on the sequence of arms $V_1, \ldots, V_\ell$, where $\ell = \ell(n_{\mathcal{P}})$ and $V_i = V^{n_{\mathcal{P}}}_i$ for each $i \in [\ell]$. Let $H^N_i$, $i \in [\ell]$, be a dataset that consists of exactly $N \in \mathbb{N}$ samples of each arm $V_1, \ldots, V_i$, where each sample contains the reward of each atom in the respective arm, and let $H^N_0$ be the empty dataset. Then
$$\mathbb{P}\big[\, X^N_i \ge \tau_{\mathcal{P}} \,\big] \ \ge\ \rho_{\mathcal{P}} \qquad \forall i \in [\ell] \text{ and } N \ge n_{\mathcal{P}}, \tag{10}$$
where the random variable $X^N_i$ is defined as
$$X^N_i := \min_{\text{arms } A \ne V_i} \ \mathbb{E}\big[\, \mu(V_i) - \mu(A) \ \big|\ H^N_{i-1} \,\big].$$
Intuitively, any given arm $V_i$ can be the posterior-best arm with margin $\tau_{\mathcal{P}}$, with probability at least $\rho_{\mathcal{P}}$, after seeing at least $n_{\mathcal{P}}$ samples of each of the previous arms $V_1, \ldots, V_{i-1}$.

Given Property (P), prior work guarantees the following (without relying on assumptions (7)-(8)).

Theorem 4.1 (Mansour et al. (2020)). Assume Property (P) holds with constants $n_{\mathcal{P}}, \tau_{\mathcal{P}}, \rho_{\mathcal{P}}$ and $\ell = \ell(n_{\mathcal{P}})$. Then there exists a BIC algorithm which explores each arm $V_1, \ldots, V_\ell$ at least $n_{\mathcal{P}}$ times and completes in $T_0$ rounds, where $T_0 = n_{\mathcal{P}}\,(1 + d)\,/\,(\tau_{\mathcal{P}}\, \rho_{\mathcal{P}})$.
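The following sketch implements the construction (9) for the Beta-Bernoulli case, i.e., the arm sequence that the algorithm in Theorem 4.1 explores. Function names are ours; feasible arms are passed as sets of atom indices, and ties in the arg max are broken arbitrarily.

```python
# A sketch of the arm sequence V^n_1, ..., V^n_{ell(n)} from (9): pretend every atom
# already covered by earlier arms in the sequence has been sampled n times with reward 0,
# and pick a posterior-best arm under these pseudo-posterior means.
import numpy as np

def posterior_mean_after_zeros(alpha, beta, n_zeros):
    # mu^0_a(n) = alpha_a / (alpha_a + beta_a + n): posterior mean of atom a
    # after n Bernoulli samples that all returned reward 0.
    return alpha / (alpha + beta + n_zeros)

def exploration_sequence(feasible_arms, alpha, beta, n):
    alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
    d = len(alpha)
    sequence, covered = [], set()
    while len(covered) < d:
        # Z^n_i(a) = n for atoms already covered by previous arms, 0 otherwise.
        z = np.array([n if a in covered else 0 for a in range(d)])
        pseudo_means = posterior_mean_after_zeros(alpha, beta, z)
        best = max(feasible_arms, key=lambda A: pseudo_means[list(A)].sum())
        if set(best) <= covered:
            # Degenerate case: the sequence would continue forever without new atoms.
            raise ValueError("sequence cannot cover all atoms for this choice of n")
        sequence.append(best)
        covered |= set(best)
    return sequence   # V^n_1, ..., V^n_{ell(n)}: covers every atom at least once
```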
Next, we establish (P). First, we state a result for a paradigmatic case when all arms have the same cardinality, then relax it in what follows (with a somewhat weaker guarantee).

Theorem 4.2. Assume Beta-Bernoulli priors (7)-(8). Further, assume that
$$\text{the arms are all subsets of } [d] \text{ of a given size } m. \tag{11}$$
Then Property (P) holds with $\ell = \ell(n_{\mathcal{P}}) = \lceil d/m \rceil$ and
$$n_{\mathcal{P}} = \lceil \beta_d / \alpha_d \rceil \cdot \max_{a \in [d]} \lceil \alpha_a \rceil, \tag{12}$$
$$\tau_{\mathcal{P}} = \min_{\text{atoms } a \ne a' \in [d],\ n, n' \in \{0,\, n_{\mathcal{P}}\}} \big|\, \mu^0_a(n) - \mu^0_{a'}(n') \,\big|, \tag{13}$$
$$\rho_{\mathcal{P}} = \Big( \min_{a \in [d]} \mathbb{E}[1 - \mu_a] \Big)^{d\, n_{\mathcal{P}}}, \tag{14}$$
as long as $\tau_{\mathcal{P}}$ and $\rho_{\mathcal{P}}$ are strictly positive.

Proof Sketch. For each arm $V_i$, $i \in [\ell]$, we consider the event that the dataset $H^n_{i-1}$ from Property (P) contains reward 0 for each sample of each atom. We take $n$ to be large enough so that this event makes all arms $V_1, \ldots, V_{i-1}$ look inferior to $V_i$, in terms of the posterior mean reward. The key is to lower-bound the probability of this event; a non-trivial step here requires the Harris inequality.

We show that $T_0$, the requisite number of rounds, depends exponentially on the number of atoms $d$. To this end, we define a suitable parameterization of the priors. To handle $\tau_{\mathcal{P}}$ in Theorem 4.2, we posit a lower bound that depends on $d$, but this dependence is very mild.

Corollary 4.3. Assume Beta-Bernoulli priors (7)-(8) and that (11) holds. Fix some absolute constants $c_0 \in \mathbb{N}$ and $c, c_0' \in (0,1)$. Suppose $\mathbb{E}[\mu_a] \le c_0'$ for all atoms, and the priors satisfy the following non-degeneracy conditions:
$$\max_{a, a' \in [d]} \lceil \beta_a / \alpha_a \rceil \cdot \lceil \alpha_{a'} \rceil \le c_0, \qquad \min_{a \ne a' \in [d],\ n, n' \in \{0,\, c_0\}} \big|\, \mu^0_a(n) - \mu^0_{a'}(n') \,\big| \ \ge\ (c\, d)^{-1}.$$
Then there exists a BIC algorithm which samples each atom at least once and completes in $\mathrm{poly}(d) \cdot \Phi^{-d}$ rounds, where $\Phi = c\, (1 - c_0')^{c_0} \in (0,1)$ is a constant.

Finally, we handle general feasible sets, i.e., without assumption (11). The guarantee becomes slightly weaker, in that we have $d^2$ in the exponent rather than $d$.

Theorem 4.4. Assume Beta-Bernoulli priors (7)-(8). Then Property (P) holds with $\ell(n_{\mathcal{P}}) \le d$ and
$$n_{\mathcal{P}} = \lceil (\alpha_d + \beta_d) / \alpha_d \rceil \cdot \max_{a \in [d]} \lceil \alpha_a \rceil \cdot d, \tag{15}$$
$$\tau_{\mathcal{P}} = \min_{A \ne A' \in \mathcal{A},\ n \in \{0,\, n_{\mathcal{P}}\}^d} \Big|\, \sum_{a \in A} \mu^0_a(n_a) - \sum_{a \in A'} \mu^0_a(n_a) \,\Big|, \tag{16}$$
$$\rho_{\mathcal{P}} = \Big( \min_{a \in [d]} \mathbb{E}[1 - \mu_a] \Big)^{d\, n_{\mathcal{P}}}, \tag{17}$$
as long as $\tau_{\mathcal{P}}$ and $\rho_{\mathcal{P}}$ are strictly positive.

Corollary 4.5. Assume Beta-Bernoulli priors (7)-(8). Fix some absolute constants $c_1 \in \mathbb{N}$ and $c_2, c_3 \in (0,1)$. Suppose $\mathbb{E}[\mu_a] \le c_3$ for all atoms, and the priors satisfy the following non-degeneracy conditions:
$$\max_{\text{atoms } a, a' \in [d]} \lceil (\alpha_a + \beta_a) / \alpha_a \rceil \cdot \lceil \alpha_{a'} \rceil \le c_1, \qquad \min_{A \ne A' \in \mathcal{A},\ n \in \{0,\, n_{\mathcal{P}}\}^d} \Big|\, \sum_{a \in A} \mu^0_a(n_a) - \sum_{a \in A'} \mu^0_a(n_a) \,\Big| \ \ge\ c_2^{\,d^2}. \tag{18}$$
Then there exists a BIC algorithm which samples each atom at least once and completes in $\mathrm{poly}(d) \cdot \Phi^{-d^2}$ rounds, where $\Phi = c_2\, (1 - c_3)^{c_1} \in (0,1)$ is a constant.

In Appendix C.4, we provide some motivation for why (18) is a mild assumption. Our intuition is that "typically" $\tau_{\mathcal{P}}$ should be on the order of $e^{-O(d)}$, whereas (18) only requires it to be $e^{-\Omega(d^2)}$.

4.2 Reduction to incentivized reinforcement learning

Our second algorithm builds on the Hidden Hallucination approach from Simchowitz and Slivkins (2021), which targets incentivized exploration for episodic reinforcement learning. We use this approach by "encoding" a problem instance of combinatorial semi-bandits as a tabular MDP, so that actions in the MDP correspond to atoms, and feasible trajectories correspond to feasible arms. Then we invoke a theorem in Simchowitz and Slivkins (2021) and "translate" this theorem back to combinatorial semi-bandits.

More specifically, consider a tabular MDP with deterministic transitions and a unique initial state. Each action in this MDP corresponds to some atom $a$; the action's reward is then drawn from the corresponding reward distribution $\mathcal{D}_a$. In general, only a subset of actions is feasible at a given state-stage pair of the MDP. Let $G$ be the transition graph in such an MDP: it is a rooted directed graph such that the nodes of $G$ correspond to state-stage pairs in the MDP (the root node corresponding to the initial state and stage 0). Each edge $(u, v)$ in $G$ corresponds to some MDP action feasible at $u$, i.e., to some atom. While different edges in $G$ can correspond to the same atom, we require that any rooted directed path in $G$ cannot contain two edges that correspond to the same atom. Let $A_P$ be the subset of atoms that corresponds to a given rooted directed path $P$, and let $\mathcal{A}_G = \{ A_P : \text{rooted directed paths } P \text{ in } G \}$ be the family of arms "encoded" by $G$. A family of arms $\mathcal{A}$ is called MDP-encodable if $\mathcal{A} = \mathcal{A}_G$ for some transition graph $G$ as defined above with $O(d^2)$ nodes. Our result applies to all MDP-encodable feasible sets. In particular, the set of all subsets of exactly $m$ atoms, for some fixed $m \le d$, is MDP-encodable. To see this, consider an MDP with $m$ stages and $d$ states, where each state $\sigma \in [d]$ corresponds to the largest atom already included in the arm, and the actions feasible at a given stage $i$ and state $\sigma$ correspond to all atoms larger than $\sigma$.
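The following sketch instantiates the transition-graph encoding $\mathcal{A} = \mathcal{A}_G$ for the family of all size-$m$ subsets of $[d]$, i.e., the MDP described in the preceding paragraph. The function names and the brute-force path enumeration are ours and purely illustrative.

```python
# Transition-graph encoding of the "all size-m subsets" arm family: nodes are
# (stage, largest-atom-so-far) pairs, each edge is labeled by an atom, and every
# arm is the set of atom labels along a rooted path of length m.
from itertools import combinations

def msubset_transition_graph(d, m):
    # edges: (stage, state) -> list of (atom, next_state); state -1 is the root ("no atom yet").
    edges = {}
    for stage in range(m):
        states = [-1] if stage == 0 else range(d)
        for s in states:
            # Feasible actions: atoms strictly larger than the largest atom used so far.
            # (Paths that dead-end before stage m do not produce an arm.)
            edges[(stage, s)] = [(a, a) for a in range(s + 1, d)]
    return edges

def arms_of_graph(edges, m):
    """Enumerate A_G: the atom sets of all rooted directed paths of length m."""
    arms, stack = set(), [((0, -1), frozenset())]
    while stack:
        (stage, s), atoms = stack.pop()
        if stage == m:
            arms.add(atoms)
            continue
        for atom, nxt in edges.get((stage, s), []):
            stack.append(((stage + 1, nxt), atoms | {atom}))
    return arms

# Sanity check: the encoded family equals all size-m subsets of [d].
d, m = 5, 3
assert arms_of_graph(msubset_transition_graph(d, m), m) == \
       {frozenset(c) for c in combinations(range(d), m)}
```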
Our result allows for arbitrary per-atom priors $\mathcal{P}_a$, subject to a minor non-degeneracy condition, and reward distributions $\mathcal{D}_a$ that are supported on the same countable set.

Theorem 4.6. Consider a feasible arm set $\mathcal{A}$ that is MDP-encodable, as defined above. Suppose the per-atom priors $\mathcal{P}_a$ lie in some fixed, finite collection $\mathcal{C}$ such that every $\mathcal{P} \in \mathcal{C}$ satisfies $\mathcal{P}[\mu < \epsilon] > 0$ for all $\epsilon > 0$ and $\mathbb{E}[\mu] > 0$. Further, suppose all reward distributions $\mathcal{D}_a$ are supported on the same countable set. Fix a parameter $\delta \in (0,1)$. There is a BIC algorithm such that, with probability at least $1 - \delta$, each atom is sampled at least once. This algorithm completes in $N_0$ rounds, where $N_0 = \Phi_{\mathcal{C}}^{-d^3} \cdot \mathrm{poly}(d) \cdot \log(\delta^{-1})$ for some constant $\Phi_{\mathcal{C}} \in (0,1)$ determined by the collection $\mathcal{C}$.

Remark 4.7. While the guarantee in Theorem 4.6 holds with probability $1 - \delta$, rather than almost surely, it suffices to bootstrap Thompson Sampling in Theorem 3.1. To see this, let ALG be an algorithm that runs for $T_0 = N_0\, n_{\mathrm{TS}} + d\, n_{\mathrm{TS}}$ rounds (with $n_{\mathrm{TS}}$ from Theorem 3.1) and proceeds as follows: first it repeats the algorithm from Theorem 4.6 $n_{\mathrm{TS}}$ times, and in the remaining rounds it deterministically plays an arm with the largest prior mean reward (this algorithm is BIC). Define the "success event" as the event that ALG samples each atom $n_{\mathrm{TS}}$ times within the first $N_0\, n_{\mathrm{TS}}$ rounds. Consider another algorithm, ALG$^*$, which runs for $T_0$ rounds, coincides with ALG on the first $N_0\, n_{\mathrm{TS}}$ rounds, and in the remaining rounds coincides with ALG on the success event, and otherwise plays some arms so as to sample each atom at least $n_{\mathrm{TS}}$ times (this algorithm is not necessarily BIC). Now, if Thompson Sampling is preceded by ALG$^*$, then the analysis in Theorem 3.1 guarantees that each round $t$ of Thompson Sampling satisfies the BIC property (1), and does so with a strictly positive prior-dependent constant on the right-hand side of (1). Therefore, the same holds for ALG, since it coincides with ALG$^*$ w.h.p., provided that the failure probability $\delta$ in Theorem 4.6 is chosen small enough.

Acknowledgments and Disclosure of Funding

ZSW and DN were supported in part by the NSF FAI Award #1939606, NSF SCC Award #1952085, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation and other funding agencies. We would like to thank Keegan Harris and Max Simchowitz for brief collaborations, and Mark Sellke for insightful conversations on incentivized exploration.

References

S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. J. of the ACM, 64(5):30:1–30:24, 2017. Preliminary version in AISTATS 2013.
G. Bahar, R. Smorodinsky, and M. Tennenholtz. Economic recommendation systems. In 16th ACM Conf. on Electronic Commerce (ACM-EC), 2016.
G. Bahar, R. Smorodinsky, and M. Tennenholtz. Social learning and the innkeeper's challenge. In ACM Conf. on Economics and Computation (ACM-EC), pages 153–170, 2019.
D. Bergemann and S. Morris. Information design: A unified perspective. Journal of Economic Literature, 57(1):44–95, March 2019.
K. Bimpikis, Y. Papanastasiou, and N. Savva. Crowdsourcing exploration. Management Science, 64(4):1477–1973, 2018.
M. Castiglioni, A. Celli, A. Marchesi, and N. Gatti. Online Bayesian persuasion. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
B. Chen, P. I. Frazier, and D. Kempe. Incentivizing exploration by heterogeneous users. In Conf. on Learning Theory (COLT), pages 798–818, 2018.
W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In 20th Intl. Conf. on Machine Learning (ICML), pages 151–159, 2013.
R. Degenne and V. Perchet. Combinatorial semi-bandit with known covariance. In Advances in Neural Information Processing Systems, volume 29, 2016.
P. Frazier, D. Kempe, J. M. Kleinberg, and R. Kleinberg. Incentivizing exploration. In ACM Conf. on Economics and Computation (ACM-EC), 2014.
A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In ICML, 2014.
A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path problem under partial monitoring. J. of Machine Learning Research (JMLR), 8:2369–2403, 2007.
T. E. Harris. A lower bound for the critical probability in a certain percolation process. Mathematical Proceedings of the Cambridge Philosophical Society, 56(1):13–20, 1960.
N. Immorlica, J. Mao, A. Slivkins, and S. Wu. Incentivizing exploration with selective data disclosure. In ACM Conf. on Economics and Computation (ACM-EC), 2020. Working paper available at https://arxiv.org/abs/1811.06026.
E. Kamenica. Bayesian persuasion and information design. Annual Review of Economics, 11(1):249–272, 2019.
E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In 23rd Intl. Conf. on Algorithmic Learning Theory (ALT), pages 199–213, 2012.
I. Kremer, Y. Mansour, and M. Perry. Implementing the "wisdom of the crowd". J. of Political Economy, 122(5):988–1012, 2014. Preliminary version in ACM EC 2013.
B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In 13th Conf. on Uncertainty in Artificial Intelligence (UAI), pages 420–429, 2014.
B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In 18th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2015.
Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. Operations Research, 68(4):1132–1161, 2020. Preliminary version in ACM EC 2015.
Y. Mansour, A. Slivkins, V. Syrgkanis, and S. Wu. Bayesian exploration: Incentivizing exploration in Bayesian games. Operations Research, 70(2), 2022. Preliminary version in ACM EC 2016.
D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. J. of Machine Learning Research (JMLR), 17:68:1–68:30, 2016.
D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. Published with Now Publishers (Boston, MA, USA). Also available at https://arxiv.org/abs/1707.02038.
M. Sellke and A. Slivkins. The price of incentivizing exploration: A characterization via Thompson sampling and sample complexity. In 22nd ACM Conf. on Economics and Computation (ACM-EC), 2021.
M. Simchowitz and A. Slivkins. Incentives and exploration in reinforcement learning, 2021.
Working paper, available at https://arxiv.org/abs/2103.00360.
A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, Nov. 2019. Published with Now Publishers (Boston, MA, USA). Also available at https://arxiv.org/abs/1904.07272. Latest online revision: Jan 2022.
W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
S. Wang and W. Chen. Thompson sampling for combinatorial semi-bandits. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5114–5122. PMLR, 10–15 Jul 2018.
Z. Wen, B. Kveton, and A. Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In 32nd Intl. Conf. on Machine Learning (ICML), pages 1113–1122, 2015.
Y. Zu, K. Iyer, and H. Xu. Learning to persuade on the fly: Robustness against ignorance. In ACM Conf. on Economics and Computation (ACM-EC), pages 927–928, 2021.