Journal of Machine Learning Research 21 (2020) 1-23. Submitted 11/19; Published 4/20

Multi-Player Bandits: The Adversarial Case

Pragnya Alatur PRAGNYA.ALATUR@GMAIL.COM
Kfir Y. Levy KFIRYLEVY@TECHNION.AC.IL
Faculty of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, 3200003, Israel

Andreas Krause KRAUSEA@ETHZ.CH
Department of Computer Science, ETH Zurich, 8092 Zürich, Switzerland

Editor: Ohad Shamir

Abstract

We consider a setting where multiple players sequentially choose among a common set of actions (arms). Motivated by an application to cognitive radio networks, we assume that players incur a loss upon colliding, and that communication between players is not possible. Existing approaches assume that the system is stationary. Yet this assumption is often violated in practice, e.g., due to signal strength fluctuations. In this work, we design the first multi-player Bandit algorithm that provably works in arbitrarily changing environments, where the losses of the arms may even be chosen by an adversary. This resolves an open problem posed by Rosenski et al. (2016).

Keywords: Multi-Armed Bandits, Multi-Player Problems, Online Learning, Sequential Decision Making, Cognitive Radio Networks

1. Introduction

The Multi-Armed Bandit (MAB) problem is a fundamental setting for capturing and analyzing sequential decision making. Since the seminal work of Robbins (1952) there has been a plethora of research on this topic (Cesa-Bianchi and Lugosi, 2006; Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2018), addressing both the stochastic and adversarial MAB settings. In the stochastic setting it is assumed that the environment is stationary, namely that, except for noisy fluctuations, the environment does not change over time. The adversarial setting is more general, and makes it possible to capture dynamic (arbitrarily changing) environments.
Most existing work on MABs considers a single player who sequentially interacts with the environment. Nevertheless, in many real-world scenarios, the learner also interacts with other players, either collaboratively or competitively. One such intriguing multi-player setting arises in cognitive radio networks, where multiple broadcasters (players) share a common set of transmission channels (arms). In this setting, players incur an extra loss upon colliding (transmitting on the same channel), and communication between players is generally not possible. This challenging setting has recently received considerable attention; see Avner and Mannor (2014); Rosenski et al. (2016); Bistritz and Leshem (2018).

©2020 Pragnya Alatur, Kfir Y. Levy and Andreas Krause. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-912.html.

Despite impressive progress on multi-player Bandit problems, existing work only addresses the stochastic setting where the environment is stationary. Yet, this may not capture common phenomena in cognitive radio networks, such as channel breakdowns or signal strength fluctuations due to changing environmental conditions. In this work, we address the adversarial multi-player MAB setting, and provide the first efficient algorithm with provable guarantees. This resolves an open problem posed by Rosenski, Shamir, and Szlak (2016). Assuming that $K$ players choose among a set of $N$ arms, we provide an efficient method with two variants that ensure respective regret bounds of $\widetilde{O}(K^{4/3}N^{2/3}T^{2/3})$ and $\widetilde{O}(K^{4/3}N^{1/3}T^{2/3})$.¹ Our key algorithmic technique is to imitate the idealized case where there is full communication between the players. Then, to address the no-communication constraint, we force the players to keep the same arms within long periods of time (blocks).
This gives them the chance to coordinate among themselves via a simple protocol that uses collisions as a primitive, yet effective, manner of coordination. We suggest two different coordination schemes, yielding two different guarantees.

Related Work: The stochastic multi-player MAB problem has been extensively investigated in the past years. The majority of work on this topic assumes that players may communicate with each other (Lai et al., 2008; Liu and Zhao, 2010; Vakili et al., 2013; Liu et al., 2013; Avner and Mannor, 2016, 2018). The more realistic "no-communication" setting is discussed in Anandkumar et al. (2011); Avner and Mannor (2014); Rosenski et al. (2016); Bistritz and Leshem (2018). Avner and Mannor (2014) are the first to provide regret guarantees for the "no-communication" stochastic setting, establishing a bound of $O(T^{2/3})$. This was later improved by Rosenski et al. (2016), who establish a constant regret (independent of $T$) for the case where there exists a fixed gap between mean losses. Recently, Bistritz and Leshem (2018) have explored a more challenging setting, where each player has a different loss vector for the arms. They provide an algorithm that ensures $O(\log^2 T)$ regret for this setting. The stochastic setting where the number of players may change throughout the game is addressed by Rosenski et al. (2016), where a regret bound of $O(\sqrt{T})$ is established. Avner and Mannor (2014) also discuss this case and provide an algorithm that in some scenarios ensures an $O(T^{2/3})$ regret. Different multi-player adversarial MAB settings are explored in Awerbuch and Kleinberg (2008); Cesa-Bianchi et al. (2016). Nevertheless, these works allow players to communicate, and do not assume a "collision loss". There also exists a rich literature on combinatorial bandit settings (Uchiya et al., 2010; Audibert et al., 2013; Combes et al., 2015), where several players may fully communicate to jointly choose a set of arms in each round.
Nevertheless, these works do not address the "no-communication" setup. In a contemporary unpublished work, Bubeck et al. (2019) obtain a regret of $O(\sqrt{T})$ for the two-player MAB setting. However, their paper does not establish any guarantees for the general multi-player MAB setting. Conversely, we provide an efficient algorithm which holds for the general multi-player setting.

1. Using $\widetilde{O}(\cdot)$ we ignore logarithmic factors in $T, N$.

2. Background and Setting

2.1. Background

The $N$-armed bandit setting can be described as a repeated game over $T$ rounds between a single player and an adversary. At each round $t \in [T]$ (we denote $[N] := \{1, \ldots, N\}$ for any $N \in \mathbb{Z}^+$), the player chooses an arm $I_t \in [N]$ and the adversary independently chooses a loss for each arm, $\ell_i^t \in [0, 1],\ i \in [N]$. Then, the player incurs the loss of the chosen arm, $\ell_{I_t}^t$, and gets to view the loss of this arm only (bandit feedback). The goal of the player is to minimize the regret, defined as
$$R_T := \sum_{t=1}^{T} \ell_{I_t}^t - \min_{i \in [N]} \sum_{t=1}^{T} \ell_i^t.$$
We are interested in learning algorithms that ensure an expected regret which is sublinear in $T$; here the expectation is with respect to possible randomization in the player's strategy as well as in the choices of the adversary. The seminal work of Auer et al. (2002) presents an algorithm that achieves an optimal regret bound of $O(\sqrt{TN \log N})$ for this setting. Their algorithm, called EXP3, devises an unbiased estimate of the loss vector in each round, $\{\tilde{\ell}_i^t\}_{i \in [N]}$. These are then used to pick an arm in each round by sampling $\Pr[I_t = i] \propto \exp(-\eta \sum_{\tau=1}^{t-1} \tilde{\ell}_i^{\tau})$, for any arm $i \in [N]$.

2.2. K-Player MAB Setting

We consider a repeated game of $T$ rounds between $K$ players and an adversary in the $N$-armed bandit setting. For now, assume that each player has a unique rank in $[K]$, and that each player knows her own rank (but does not necessarily know the rank of other players).² We also refer to the player with rank $k$ as "player $k$".
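To make the single-player EXP3 strategy from Section 2.1 concrete, the following minimal sketch (our own illustration, not the authors' code; the function name and interface are hypothetical) shows the importance-weighted loss estimate and the exponential-weights sampling step. For simplicity it takes the full loss matrix as input, but each round it only reads the entry of the arm it actually chose (bandit feedback):

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """Sketch of EXP3: `losses` is a (T, N) array fixed in advance by an
    oblivious adversary; only the chosen arm's loss is observed per round."""
    T, N = losses.shape
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(N)              # cumulative loss estimates \sum_tau ~l_i^tau
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * cum_est)
        p = w / w.sum()                # Pr[I_t = i] propto exp(-eta * cum_est[i])
        arm = rng.choice(N, p=p)
        loss = losses[t, arm]          # bandit feedback: this arm only
        total += loss
        cum_est[arm] += loss / p[arm]  # unbiased importance-weighted estimate
    return total
```

The estimate $\tilde{\ell}_i^t = \ell_i^t \,\mathbb{I}\{I_t = i\}/p_t(i)$ is unbiased because the chosen arm's loss is divided by the probability of observing it.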
Now, at each round $t \in [T]$:

1. each player $k \in [K]$ chooses an arm $I_k^t \in [N]$;
2. the adversary independently chooses a loss for each arm, $\ell_i^t \in [0, 1],\ i \in [N]$;
3. for each player $k \in [K]$ one of two cases applies:
   Collision: if another player chose the same arm, i.e., $\exists m \neq k$ such that $I_k^t = I_m^t$, then player $k$ gets to know that a collision occurred, and incurs a loss of 1.
   No Collision: if there was no collision, player $k$ incurs the loss of the chosen arm, $\ell_{I_k^t}^t$, and gets to view the loss of this arm only (bandit feedback).

We emphasize that at each round all players play simultaneously. We further assume that communication between players is not possible. Finally, note that the ability to distinguish between collision and non-collision is a reasonable assumption when modeling cognitive radio networks and was also used in previous work, e.g., by Rosenski et al. (2016). Our assumption is that the players are cooperative and thus their goal is to obtain low regret together with respect to the $K$ distinct best arms in hindsight. Let $C_k^t \in \{0, 1\}$ be an indicator for whether player $k$ collided at time $t$ ($C_k^t = 1$) or not ($C_k^t = 0$). With this, we define the regret $R_T$ after $T$ rounds as follows:
$$R_T := \underbrace{\sum_{t=1}^{T} \sum_{k:\, C_k^t = 0} \ell_{I_k^t}^t}_{\text{no collisions}} + \underbrace{\sum_{t=1}^{T} \sum_{k=1}^{K} C_k^t}_{\text{collisions}} \;-\; \min_{\substack{i_1, \ldots, i_K \in [N] \\ i_m \neq i_n,\ \forall m \neq n}} \sum_{t=1}^{T} \sum_{k=1}^{K} \ell_{i_k}^t.$$
We are interested in learning algorithms that ensure an expected regret that is sublinear in $T$.

2. As we show in Section 3, such a ranking can be achieved by running a simple procedure at the beginning of the game (see Algorithm 2).

Staying quiet: For simplicity, we assume that a player may choose to stay quiet, i.e., not choose an arm, in any given round. By staying quiet she does not cause any collisions, but she will still suffer a loss of 1 for that round. This is a reasonable assumption in cognitive radio applications, as a user may choose to not transmit anything. In Appendix A.3 we show how to relax this assumption.
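The multi-player regret just defined can be computed offline given the full history of plays and losses. The sketch below (our own, with a hypothetical function name) makes the definition concrete: a colliding player counts a loss of 1, a non-colliding player counts her arm's loss, and the comparator is the best set of $K$ distinct arms in hindsight:

```python
import numpy as np
from itertools import combinations

def multiplayer_regret(plays, losses):
    """Regret of Section 2.2.  plays: (T, K) int array of chosen arms;
    losses: (T, N) array with entries in [0, 1]."""
    T, K = plays.shape
    N = losses.shape[1]
    incurred = 0.0
    for t in range(T):
        counts = np.bincount(plays[t], minlength=N)
        for k in range(K):
            arm = plays[t, k]
            # collision -> loss 1; otherwise the arm's loss
            incurred += 1.0 if counts[arm] > 1 else losses[t, arm]
    # best K distinct arms in hindsight (brute force, for illustration only)
    best = min(losses[:, list(c)].sum() for c in combinations(range(N), K))
    return incurred - best
```

For example, with two players who collide in the first of two rounds, the collision contributes two units of loss that a coordinated pair would have avoided.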
Adversary: Our analysis focuses on oblivious adversaries, meaning that the adversary may know the strategy of the players, yet he is limited to choosing the loss sequence before the game starts.

Further assumptions: We assume that every player knows $T$, the number of arms $N$, the number of players $K$, and that $K < N$ and $N < T$. Furthermore, we assume that the set of players is fixed and no player enters or exits during the game. Using a standard doubling technique, we may extend our method to the case where $T$ is unknown.

3. Multi-Player MABs

In this section, we present our algorithm for the $N$-armed bandit setting with $K$ players. We first discuss an idealized setting in which players can fully communicate and then build our $K$-player communication-free algorithm on top of that. For ease of exposition, we mainly discuss a variant of our method ensuring a regret of $\widetilde{O}(K^{4/3}N^{2/3}T^{2/3})$ (see Thm. 2). We then explain how to improve this rate to $\widetilde{O}(K^{4/3}N^{1/3}T^{2/3})$ (see Thm. 4) by using a more sophisticated coordination mechanism.

3.1. Idealized, Communication-Enabled Setting

In this setting, the players can fully communicate and thus they can coordinate their choices to avoid collisions, resulting in a collision loss of 0, i.e., $\sum_{t=1}^{T} \sum_{k=1}^{K} C_k^t = 0$. In this case, the $K$ players would behave as a single player who chooses $K$ distinct arms in each round and aims to obtain low regret with respect to the $K$ best arms in hindsight. Such a hypothetical player (let us call her the $K$-Metaplayer) chooses $K$ distinct arms $I_t := \{I_1^t, \ldots, I_K^t\}$ in each step $t$ and receives the losses of these arms.³ Her regret after $T$ rounds is
$$R_T^{\text{meta}} := \sum_{t=1}^{T} \sum_{k=1}^{K} \ell_{I_k^t}^t - \min_{\substack{i_1, \ldots, i_K \in [N] \\ i_m \neq i_n,\ \forall m \neq n}} \sum_{t=1}^{T} \sum_{k=1}^{K} \ell_{i_k}^t.$$
We will soon show that a simple adaptation of the EXP3 algorithm (Auer et al., 2002) yields low regret for the $K$-Metaplayer. First, let us define the set of meta-arms $\mathcal{M}$ as follows:
$$\mathcal{M} := \big\{ \{i_1, \ldots, i_K\} \subseteq [N] \;:\; i_m \neq i_n \text{ for any } m \neq n \big\}.$$
We further define the loss $\ell_I^t$ of a meta-arm $I \in \mathcal{M}$ at time $t$ as $\ell_I^t := \sum_{i \in I} \ell_i^t$. From this, it is immediate that the best meta-arm in hindsight w.r.t. the losses $(\ell_I^t)_{I \in \mathcal{M}, t \in [T]}$ consists of the $K$ best arms in hindsight w.r.t. $(\ell_i^t)_{i \in [N], t \in [T]}$. With these definitions, one can view the $K$-Metaplayer as a player who chooses one meta-arm in each step, receives that meta-arm's loss, and aims to obtain low regret with respect to the best meta-arm in hindsight. This exact setting was described and analyzed by Uchiya et al. (2010), who present an adaptation of EXP3 to obtain a low-regret $K$-Metaplayer algorithm.

Feedback model: In Alg. 1 we describe and analyze a variant of the above setting. We assume a more restrictive bandit feedback, where the Metaplayer gets to view only a single arm chosen uniformly at random (u.a.r.) among $I_t := \{I_1^t, \ldots, I_K^t\}$.³ As we show later, this serves as a building block for our algorithm in the more realistic no-communication setting.

Algorithm 1 K-Metaplayer algorithm
1: Input: $\eta$
2: for $t = 1$ to $T$ do
3:   Set cumulative loss estimate $\widetilde{L}_I^t = \sum_{\tau=1}^{t-1} \sum_{i \in I} \tilde{\ell}_i^{\tau}$, for all meta-arms $I \in \mathcal{M}$
4:   Set probability $p_t(I) = \frac{e^{-\eta \widetilde{L}_I^t}}{\sum_{J \in \mathcal{M}} e^{-\eta \widetilde{L}_J^t}}$, for all meta-arms $I \in \mathcal{M}$
5:   Sample meta-arm $I_t = \{I_1^t, \ldots, I_K^t\}$ at random according to $P_t = (p_t(I))_{I \in \mathcal{M}}$
6:   Pick one of the $K$ arms, $J_t \sim \mathrm{Uniform}(I_1^t, \ldots, I_K^t)$
7:   Choose arms $I_1^t, \ldots, I_K^t$ in the game, suffer losses $\ell_{I_1^t}^t, \ldots, \ell_{I_K^t}^t$ and observe $\ell_{J_t}^t$
8:   Set loss estimate $\tilde{\ell}_i^t = \frac{K\, \ell_i^t}{\sum_{I \in \mathcal{M}:\, i \in I} p_t(I)}\, \mathbb{I}\{J_t = i\}$, for all arms $i \in [N]$
9: end for

3. Actually, as we will soon see, we analyze a slightly different setting where the Metaplayer gets to view only a single arm chosen uniformly at random from $I_t$.

The following lemma states the guarantees of Algorithm 1.

Lemma 1 Employing the $K$-Metaplayer algorithm (Alg. 1) with $\eta = \sqrt{\frac{\log \binom{N}{K}}{KTN}}$ guarantees a regret bound of
$$\mathbb{E}[R_T^{\text{meta}}] \leq 2\sqrt{KTN \log \tbinom{N}{K}} \leq 2K\sqrt{TN \log N}.$$
We defer the proof to Appendix A.1.
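One round of Algorithm 1 can be sketched by enumerating all meta-arms explicitly. This brute-force version (our own illustration; function name and interface are hypothetical) is exponential in $K$ and is only meant to make the sampling step and the marginal probability in the loss estimate concrete; Section 4 shows how to do both efficiently:

```python
import numpy as np
from itertools import combinations

def metaplayer_round(cum_est, K, eta, rng):
    """One round of Algorithm 1, enumerating the meta-arm set M explicitly.
    cum_est: length-N array of cumulative loss estimates."""
    N = len(cum_est)
    metas = list(combinations(range(N), K))        # the set M of meta-arms
    w = np.array([np.exp(-eta * sum(cum_est[i] for i in I)) for I in metas])
    p = w / w.sum()                                # p_t(I) over all I in M
    I = metas[rng.choice(len(metas), p=p)]         # sampled meta-arm I_t
    j = I[rng.integers(K)]                         # single observed arm J_t
    marginal = sum(p[m] for m, M in enumerate(metas) if j in M)
    # the loss estimate for arm j would be K * observed_loss / marginal
    return I, j, marginal
```

With all cumulative estimates equal, the distribution over meta-arms is uniform, so each arm's marginal probability is $\binom{N-1}{K-1}/\binom{N}{K} = K/N$.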
Note that the above bound is worse by a factor of $K$ compared to the bound appearing in Uchiya et al. (2010). This is since we consider a more restrictive feedback model (i.e., viewing a single arm rather than $K$ arms in each round).

Together as one K-Metaplayer: Let us turn our attention back to the $K$ players in an idealized setting with full communication. How do the players need to play in order to behave as the $K$-Metaplayer in Alg. 1? We suggest to do so by assigning roles as follows: Player 1 takes the role of a global coordinator, who decides which arm each of the $K$ players should pick. She samples $K$ arms in each step using the Metaplayer algorithm, chooses one out of those $K$ u.a.r. for herself, and assigns the rest to the other players. She then communicates to the other players which arms she has chosen for them. Players $2, \ldots, K$ simply behave as followers and accept whatever arm the coordinator chooses for them. With this, they are playing exactly as the Metaplayer from Algorithm 1 and their regret would be bounded according to Lemma 1. Note that the coordinator samples $K$ arms but receives feedback only for one of them. This is the reason behind the feedback model considered in Alg. 1. Also, note that in this case, the coordinator is the only player that actually "learns" from the feedback. All other players follow the coordinator and ignore their loss observations.

3.2. Real, Communication-Free Setting

In the previous section, we described and analyzed an idealized setting where all players can fully communicate and can therefore act as a single Metaplayer. Then we have shown that by assigning Player 1 the role of a global coordinator, with the rest of the players being followers, we can exactly imitate the Metaplayer algorithm. This strategy, however, requires full communication.
Here, we show how to build on these ideas to devise an algorithm for the realistic "no-communication" setting.

Figure 1: Illustration of the K-player algorithm. The upper part illustrates the timeline of the algorithm (a Ranking phase followed by Blocks 1, 2, 3, ..., spanning Rounds 1 to T), and the lower part shows a close-up view of a single block (sub-blocks 2 through K, each of N steps). Coordinate phases are marked in orange and Play phases are shown in blue. At the beginning of the algorithm, the players compute a ranking (red). This will be discussed further below.

Our C&P (Coordinate & Play) algorithm is depicted in Figure 1, as well as in Alg. 3 and 4. Its guarantees are stated in Theorem 2, and at the end of this section we discuss an efficient implementation of our method. Our method builds on top of the idealized scheme, with two additional ideas.

Infrequent switches: In order to give players the opportunity to coordinate, we prevent them from frequently switching their decisions. Concretely, as described in Fig. 1, instead of sampling $K$ arms in each round, the coordinator (as well as the followers) keeps the same $K$ arms for a block of $\tau$ consecutive rounds. The coordinator (Alg. 3) runs a blocked version of the $K$-Metaplayer algorithm (Alg. 1): In each block, the coordinator samples a meta-arm according to Alg. 1, but stays on it for the entire block. Then she feeds the average loss of her own arm back into Alg. 1 to update her loss estimates. While these blocks enable coordination, they cause a degradation of the regret guarantees (Dekel et al., 2012). We elaborate on that in the analysis.

Coordinate and Play: We depict the timeline of our algorithm in Figure 1. As can be seen, we divide each block into two phases: a Coordinate phase (orange) and a Play phase (blue). At the beginning of each block, the coordinator picks $K$ arms according to the blocked version of the $K$-Metaplayer algorithm.
Then, during Coordinate, the coordinator assigns an arm to each of the $K-1$ followers. Thus, the Coordinate phase is further divided into $K-1$ sub-blocks $2, \ldots, K$ (Fig. 1, bottom part). At sub-block $k$, the $k$-th follower gets assigned an arm by a protocol that uses collisions as a primitive, yet efficient, manner of coordination. This protocol (see Alg. 3 and Alg. 4) is very simple: during sub-block $k$, the coordinator stays on the arm intended for player $k$ (a follower). Player $k$ tries out all arms in a round-robin fashion until she collides with the coordinator. At this point, player $k$ learns her arm and the coordinator can repeat this procedure with the other players. While player $k$ is trying to find her arm, all other followers stay quiet. Since each follower needs at most $N$ trials, all followers will have learned their arms after $(K-1) \cdot N$ rounds. After Coordinate, each player has learned her arm. During Play, all players stay on their arms for the remaining steps of the block. At the end of the block, the coordinator uses the feedback she has collected in order to update her loss estimates. If $T$ is not divisible by $\tau$, the players will play for $\lfloor T/\tau \rfloor$ blocks and choose arms uniformly at random for the remaining steps. This will increase the regret by at most $K\tau$.

Ranking: So far we assumed that the players have unique ranks in $[K]$. They can compute the ranking by using a scheme that we adopt from Rosenski et al. (2016). The idea is to play a "Musical Chairs" game on the first $K$ arms $\{1, \ldots, K\}$ for $T_R$ rounds: A player chooses arms uniformly at random until she chooses an arm $i$ without colliding. At this point, that player becomes the owner of arm $i$ and receives the rank $i$. This player then just stays on arm $i$ for the remainder of the $T_R$ rounds. We will set $T_R$ in a way that the ranking completes successfully with high probability.

Theorem 2 Suppose that the $K$ players use our C&P Algorithm.
Meaning, they first compute a ranking using Algorithm 2 with $T_R = Ke\log T$. Afterwards, player 1 acts as coordinator and plays according to Algorithm 3. The other players behave as followers and run Algorithm 4. Then, the expected regret of the $K$ players is bounded as follows,
$$\mathbb{E}[R_T] \leq 4K^{4/3}N^{2/3}(\log N)^{1/3}T^{2/3} + 2K^2 e \log T,$$
for block size $\tau = \left(\frac{K^2 N T}{\log N}\right)^{1/3}$ and $\eta = \sqrt{\frac{\tau \log \binom{N}{K}}{KTN}}$.

Remark: In the variant that we present and analyze above, the coordinator requires $N$ steps in each sub-block in order to communicate an arm in $[N]$ to the follower. Nevertheless, one requires roughly $\log_2 N$ bits to communicate a number in $[N]$. This observation can be used to design a coordination mechanism that requires only $\approx \log_2 N$ rounds in each sub-block. This leads to an improved regret bound of $\widetilde{O}(K^{4/3}N^{1/3}T^{2/3})$. See Section 3.3 for more details.

Proof [Proof of Theorem 2] By setting the length of the ranking phase to $T_R = Ke\log T$, the ranking completes after $T_R$ rounds with probability at least $1 - \frac{K}{T}$ (see Appendix A.2 for the derivation).

Case 1: Ranking unsuccessful. With probability at most $\frac{K}{T}$, the players do not succeed in computing a ranking. The worst regret that they could obtain in the game is $KT$.

Case 2: Ranking successful. In the idealized setting with communication from the previous Section 3.1, the Coordinate phase would not be necessary. In that case, Algorithms 3 and 4 together are just the result of applying the blocking technique to the $K$-Metaplayer algorithm (Alg. 1). This can be analyzed using the following theorem from Dekel et al. (2012).

Algorithm 2 C&P Ranking
1: Input: $T_R$
2: for $t = 1$ to $T_R$ do
3:   Choose arm $r \sim \mathrm{Uniform}(1, \ldots, K)$
4:   if I did not collide then my rank is $r$:
5:     Choose arm $r$ for the remainder of the $T_R$ rounds and return.
6: end for
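The dynamics of the Musical Chairs ranking (Algorithm 2) can be simulated in a few lines. The sketch below is our own illustration (hypothetical function name); it plays all $K$ players simultaneously, with ranked players sitting on their arms and unranked players picking uniformly at random until they land on an arm alone:

```python
import numpy as np

def musical_chairs_ranking(K, T_R, seed=0):
    """Simulate Algorithm 2 for all K players at once.
    rank[p] is the arm (= rank) player p has claimed, or None."""
    rng = np.random.default_rng(seed)
    rank = [None] * K
    for _ in range(T_R):
        # ranked players stay put; unranked players pick u.a.r. in {0..K-1}
        picks = [rank[p] if rank[p] is not None else int(rng.integers(K))
                 for p in range(K)]
        for p in range(K):
            # no collision on the picked arm -> player p claims it as her rank
            if rank[p] is None and picks.count(picks[p]) == 1:
                rank[p] = picks[p]
        if all(r is not None for r in rank):
            break
    return rank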
Algorithm 3 C&P Coordinator algorithm
1: Input: $\eta$, block size $\tau$
2: for block $b = 1$ to $\lfloor T/\tau \rfloor$ do   ▷ Choose K arms according to the Metaplayer
3:   Set cumulative loss estimate $\widetilde{L}_I^b = \sum_{t=1}^{b-1} \sum_{i \in I} \tilde{\ell}_i^t$, for all meta-arms $I \in \mathcal{M}$
4:   Set probability $p_b(I) = \frac{e^{-\eta \widetilde{L}_I^b}}{\sum_{J \in \mathcal{M}} e^{-\eta \widetilde{L}_J^b}}$, for all meta-arms $I \in \mathcal{M}$
5:   Choose meta-arm $\widetilde{J}^b = \{\widetilde{J}_1^b, \ldots, \widetilde{J}_K^b\}$ at random according to $P_b = (p_b(I))_{I \in \mathcal{M}}$
6:   Let $\widetilde{I}^b = (\widetilde{I}_1^b, \ldots, \widetilde{I}_K^b)$ be a uniform random permutation of $\widetilde{J}^b$
7:   for sub-block $r = 2$ to $K$ do   ▷ Each sub-block has exactly N steps
8:     Choose $I_1^t = \widetilde{I}_r^b$ in steps $t$ until collision
9:     After the collision, choose $I_1^t = \widetilde{I}_1^b$ for the remaining steps $t$ of sub-block $r$
10:  end for
     ▷ Play
11:  Choose arm $I_1^t = \widetilde{I}_1^b$ for the remaining steps $t$ of block $b$
     ▷ Feed average loss of arm $\widetilde{I}_1^b$ back to the Metaplayer
12:  Set $\hat{\ell}_i^b = \sum_{t=(b-1)\tau+1}^{b\tau} \mathbb{I}\{I_1^t = i\}\, \ell_i^t$, for all arms $i \in [N]$
13:  Set loss estimate $\tilde{\ell}_i^b = K \cdot \frac{1}{\tau} \cdot \frac{\hat{\ell}_i^b}{\sum_{I \in \mathcal{M}:\, i \in I} p_b(I)}\, \mathbb{I}\{\widetilde{I}_1^b = i\}$, for all arms $i \in [N]$
14: end for

Algorithm 4 C&P Follower algorithm
1: Input: block size $\tau$, rank $r$
2: for block $b = 1$ to $\lfloor T/\tau \rfloor$ do   ▷ Coordinate
3:   Stay quiet during sub-blocks $2, \ldots, r-1$   ▷ Each sub-block has exactly N steps
4:   During sub-block $r$, explore arms in a round-robin fashion until a collision; $\widetilde{I}_r^b$ is the arm on which the collision occurred. Choose $\widetilde{I}_r^b$ for the remaining steps of sub-block $r$.
5:   Stay quiet during the remaining sub-blocks $r+1, \ldots, K$   ▷ Play
6:   Choose $I_r^t = \widetilde{I}_r^b$ for the remaining steps $t$ of block $b$
7: end for

Theorem 3 (Dekel et al. (2012)) Let $\mathcal{A}$ be a bandit algorithm with an expected regret bound of $R(T)$. Then using the blocked version of $\mathcal{A}$ with blocks of size $\tau$ gives a regret bound of $\tau R(T/\tau) + \tau$.

The term $\tau$ above accounts for the additional regret in case $T$ is not divisible by $\tau$. Since we have $K$ players, we replace that term by $K\tau$.
Hence, by applying the above theorem to the regret bound from Lemma 1, we obtain that the regret of the $K$ players in a setting with communication would be $C\, T^{1/2}\tau^{1/2} + K\tau$ for $C = 2\sqrt{KN \log \binom{N}{K}} \leq 2K\sqrt{N \log N}$.

In the real setting without communication, the Coordinate phase is needed and takes $(K-1) \cdot N$ steps. During the Coordinate phase in one block, each player adds at most $(K-1) \cdot N$ to the total regret, either by staying quiet (loss 1) or by not choosing the optimal arm (round-robin exploration). Thus, the Coordinate phase increases the total regret by at most $\frac{T}{\tau} \cdot (K-1) \cdot N \cdot K$. Finally, the ranking algorithm adds $K \cdot T_R = K^2 e \log T$ to the regret. Put together, the expected regret of the $K$ players, assuming that the ranking was successful (we denote this event by $S$), is bounded as follows:
$$\mathbb{E}[R_T \,|\, S] \leq \underbrace{C\, T^{1/2}\tau^{1/2} + K\tau}_{\text{Thm. 3 + Lemma 1}} + \underbrace{\frac{T}{\tau} K^2 N}_{\text{Coordinate}} + \underbrace{K^2 e \log T}_{\text{Ranking}} \leq O\!\left(K^{4/3}N^{2/3}(\log N)^{1/3}T^{2/3}\right),$$
where in the second inequality we use $C \leq 2K\sqrt{N \log N}$, which holds by Lemma 1; we also take $\tau = \left(\frac{K^2 N T}{\log N}\right)^{1/3}$ and $\eta = \sqrt{\frac{\tau \log \binom{N}{K}}{KTN}}$. Furthermore, we use $K < T$. Combining the results from cases 1 and 2 with $T_R = Ke\log T$ gives the following bound:
$$\mathbb{E}[R_T] = \underbrace{\Pr[S]}_{\leq 1} \mathbb{E}[R_T \,|\, S] + \underbrace{\Pr[S^c]}_{\leq K/T} \underbrace{\mathbb{E}[R_T \,|\, S^c]}_{\leq KT} \leq 4K^{4/3}N^{2/3}(\log N)^{1/3}T^{2/3} + 2K^2 e \log T,$$
where $S$ denotes the event where the ranking is successful, and $S^c$ is its complement.

Remark: So far we assumed that the players need to stay quiet during the Coordinate phase, but this assumption is actually not necessary. In Appendix A.3 we show how to relax this assumption.

Efficient Implementation: As the number of meta-arms is exponential, $|\mathcal{M}| = \Theta(N^K)$, sampling a meta-arm directly according to Alg. 3 can be very slow. Nevertheless, as we show in Section 4, we can reduce the computational complexity to $O(NK)$. The key insight that enables an efficient implementation is that the sampling distribution may be described as a K-DPP.
This allows us to use powerful sampling mechanisms to reduce the computational cost to $O(NK)$ in any block. Similarly, computing the marginal probability $\sum_{I \in \mathcal{M}:\, i \in I} p_b(I)$ for any fixed arm $i \in [N]$ can be done in $O(NK)$.

3.3. Improved Coordination Scheme and Regret Bounds

In Section 3.2, we discussed our C&P algorithm, which achieves low regret for the $K$-player setting. C&P is a blocked algorithm and consists of a Coordinate and a Play phase in each block (see Fig. 1). In this section, we discuss a more efficient scheme to shorten the Coordinate phase. Concretely, we show how to reduce the size of each sub-block in Coordinate from $N$ to $\lceil \log_2(N+1) \rceil$ rounds. In C&P, the Coordinate part between the coordinator and player $k$ (a follower) in block $b$ works as follows:

1. The coordinator chooses arm $\widetilde{I}_k^b \in [N]$ for player $k$.
2. In sub-block $k$, the coordinator stays on arm $\widetilde{I}_k^b$ until a collision. At the same time, player $k$ explores arms in a round-robin fashion until a collision.

The collision happens on arm $\widetilde{I}_k^b$ and at this point, player $k$ learns her arm. Using the round-robin exploration of arms, player $k$ will learn her arm in at most $N$ steps. Instead of doing a naive round-robin exploration on the arms, we can have a more efficient coordination scheme using binary encoding. The idea is that in order to encode any number in $[N]$ (we assume $N > 1$), we need at most $\lceil \log_2(N+1) \rceil$ bits. This makes it possible to reduce the length of each sub-block from $N$ to $\lceil \log_2(N+1) \rceil$. Next, we elaborate on this efficient coordination scheme. Based on the above idea, we propose the following coordination scheme using sub-blocks of size $\lceil \log_2(N+1) \rceil$: In sub-block $k$ of block $b$, the coordinator first computes the binary representation of $\widetilde{I}_k^b \in [N]$, which is the arm that she has chosen for player $k$. Then, the coordinator and player $k$ iterate over arms $1, \ldots, \lceil \log_2(N+1) \rceil$ in increasing order.
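At its core, this binary scheme is just encoding an arm index bit by bit as collision/no-collision. The following sketch is our own illustration (hypothetical function names; bits are taken least-significant first, one of several valid conventions): the coordinator's per-round schedule is the bit pattern, and the follower reconstructs the arm from the collisions she observed:

```python
from math import ceil, log2

def encode_rounds(arm, N):
    """Coordinator's schedule for one sub-block of the improved scheme:
    bit i of the arm id (arm in {1,...,N}) says whether she sits on the
    i-th probing round (forcing a collision, read as 1) or stays quiet (0)."""
    B = ceil(log2(N + 1))                 # rounds (= bits) per sub-block
    return [(arm >> i) & 1 for i in range(B)]

def decode_rounds(bits):
    """Follower reconstructs the arm id from the observed collision bits."""
    return sum(b << i for i, b in enumerate(bits))
```

For instance, with $N = 10$ each sub-block needs $\lceil \log_2 11 \rceil = 4$ rounds, and every arm id in $\{1, \ldots, 10\}$ round-trips through encode/decode.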
In step $i \in \{1, \ldots, \lceil \log_2(N+1) \rceil\}$ of the iteration, if the binary representation of $\widetilde{I}_k^b$ has a "1" at position $i$, the coordinator will choose arm $i$; otherwise she will stay quiet. Player $k$ will choose arm $i$ in step $i$ of the iteration and interpret a collision (no collision) as a "1" ("0"). At the end of sub-block $k$, player $k$ can reconstruct the binary number and obtain the value of $\widetilde{I}_k^b$ from it. With this, we reduce the size of each sub-block in Coordinate from $N$ to $\lceil \log_2(N+1) \rceil$ rounds. The next theorem demonstrates the effect of this coordination scheme on the regret of C&P.

Theorem 4 Suppose that the $K$ players use our C&P Algorithm with the improved coordination scheme we describe above. Meaning, they first compute a ranking using Algorithm 2 with $T_R = Ke\log T$. Afterwards, player 1 acts as coordinator and plays according to Algorithm 3. The other players behave as followers and run Algorithm 4. Then, the expected regret of the $K$ players is bounded as follows,
$$\mathbb{E}[R_T] \leq 7K^{4/3}N^{1/3}(\log N)^{2/3}T^{2/3} + 2K^2 e \log T,$$
for block size $\tau = \left(\frac{4K^2 T \log N}{(\log 2)^2 N}\right)^{1/3}$ and $\eta = \sqrt{\frac{\tau \log \binom{N}{K}}{KTN}}$.

Proof [Proof of Theorem 4] As the following analysis is based on the proof of Theorem 2, we recommend the reader take a look at that proof first. In the original C&P, the regret accounted for by the Coordinate phases was $\frac{T}{\tau} K^2 N$. With the improvement, the regret for Coordinate becomes $\frac{T}{\tau} K^2 \lceil \log_2(N+1) \rceil \leq \frac{2T}{\tau} K^2 \log_2 N$ (for $N > 1$). Plugging this into the expression for $\mathbb{E}[R_T \,|\, S]$ (see Theorem 2), we obtain:
$$\mathbb{E}[R_T \,|\, S] \leq \underbrace{C\, T^{1/2}\tau^{1/2} + K\tau}_{\text{Thm. 3 + Lemma 1}} + \underbrace{\frac{2T}{\tau} K^2 \log_2 N}_{\text{Coordinate}} + \underbrace{K^2 e \log T}_{\text{Ranking}} \leq 7K^{4/3}N^{1/3}(\log N)^{2/3}T^{2/3} + K^2 e \log T,$$
where in the second inequality we use $C \leq 2K\sqrt{N \log N}$ (by Lemma 1), $\tau = \left(\frac{4K^2 T \log N}{(\log 2)^2 N}\right)^{1/3}$, $\eta = \sqrt{\frac{\tau \log \binom{N}{K}}{KTN}}$, and the base change $\log_2 N = \frac{\log N}{\log 2}$. Finally, we also use $K < T$. As in Theorem 2, combining the results for the two cases, i.e.,
$S$ and $S^c$, with $T_R = Ke\log T$, we obtain the following improved regret bound:
$$\mathbb{E}[R_T] = \underbrace{\Pr[S]}_{\leq 1} \mathbb{E}[R_T \,|\, S] + \underbrace{\Pr[S^c]}_{\leq K/T} \underbrace{\mathbb{E}[R_T \,|\, S^c]}_{\leq KT} \leq 7K^{4/3}N^{1/3}(\log N)^{2/3}T^{2/3} + 2K^2 e \log T.$$

4. Efficient Sampling from the K-Metaplayer's Distribution Using K-DPPs

In this section, we discuss how the coordinator can efficiently sample $K$ arms and compute marginal probabilities in Alg. 3 using K-DPPs. We first give some background on DPPs and K-DPPs before explaining how to use them for our case.

DPPs (Determinantal Point Processes) are probabilistic models that can model certain probability distributions of the type $P : 2^{\mathcal{Y}} \to [0, 1]$, where $\mathcal{Y} = [N]$ and $2^{\mathcal{Y}}$ is the power set of $\mathcal{Y}$.⁴ Hence, a DPP samples subsets of a ground set $\mathcal{Y}$. In general, a DPP $P$ is specified by a kernel matrix (see Definition 2.1 of Kulesza and Taskar (2012)). L-ensembles are a specific type of DPPs, and we focus only on those, since this is what we need for the coordinator algorithm. An L-ensemble DPP $P$ is defined by an $N \times N$ kernel matrix $L$ as follows (see Definition 2.2 of Kulesza and Taskar (2012)):
$$P(\mathbf{Y} = Y) \propto \det(L_Y), \quad Y \subseteq \mathcal{Y},$$
where $\mathbf{Y}$ is a random variable specifying the outcome of the DPP, and $L_Y$ is the submatrix of $L$ obtained by keeping only the rows and columns indexed by $Y$. The only restriction on $L$ is that it needs to be symmetric and positive semidefinite. K-DPPs define probability distributions over subsets of size $K$, while the outcome of a DPP can have any size. A K-DPP $P^K$ is specified by an $N \times N$ kernel matrix $L$ as follows (see Definition 5.1 of Kulesza and Taskar (2012)):
$$P^K(\mathbf{Y} = Y) = \frac{\det(L_Y)}{\sum_{Y' \subseteq \mathcal{Y},\, |Y'| = K} \det(L_{Y'})}.$$
As before, $L$ needs to be symmetric and positive semidefinite. For DPPs and K-DPPs, sampling and marginalization can be done efficiently. Because of this, K-DPPs are appealing for our purposes, as they allow us to efficiently sample a set of exactly $K$ distinct arms, which is what we need for the coordinator.

4. In general, $\mathcal{Y}$ does not need to be discrete. For more information on the continuous case, please refer to Kulesza and Taskar (2012).
We will now see how we can model the coordinator's probability distribution over meta-arms as a K-DPP, i.e., we will determine how $L$ needs to be set. Let us first recall the coordinator's probability for meta-arms. For this, let $\widetilde{L}_i^b = \sum_{\tau=1}^{b-1} \tilde{\ell}_i^{\tau}$ denote the cumulative loss estimate for any arm $i \in \mathcal{Y}$ in block $b$, and let $\widetilde{L}_I^b = \sum_{i \in I} \widetilde{L}_i^b$ be the cumulative loss estimate for any meta-arm $I \in \mathcal{M}$ in block $b$. The probability that the coordinator chooses $I \in \mathcal{M}$ in block $b$ is:
$$p_b(I) = \frac{e^{-\eta \widetilde{L}_I^b}}{\sum_{J \in \mathcal{M}} e^{-\eta \widetilde{L}_J^b}} \quad \text{(see Alg. 3)}.$$
For our K-DPP, $\mathcal{Y} = [N]$ is the ground set and $\mathcal{M}$ the set of outcomes. For block $b$, let the $N \times N$ kernel matrix $L^b$ be the diagonal matrix
$$L^b_{ij} = \begin{cases} e^{-\eta \widetilde{L}_i^b}, & i = j \\ 0, & i \neq j. \end{cases}$$
Clearly, $L^b$ is symmetric and positive definite. Hence, it induces the following K-DPP $P^K$ (with $\mathbf{Y}$ the random variable specifying the K-DPP's outcome and $I \in \mathcal{M}$):
$$P^K(\mathbf{Y} = I) \propto \det(L^b_I) = \prod_{i \in I} L^b_{ii} = e^{-\eta \sum_{i \in I} \widetilde{L}_i^b} = e^{-\eta \widetilde{L}_I^b},$$
where the second equality holds since $L^b_I$ is a diagonal matrix, and the last by the definition of $\widetilde{L}_I^b$. Note that a K-DPP samples subsets of size $K$, i.e., $\mathbf{Y}$ does not contain any element twice and its size is $K$. Since the probabilities need to sum up to one, we conclude:
$$P^K(\mathbf{Y} = I) = \frac{e^{-\eta \widetilde{L}_I^b}}{\sum_{J \in \mathcal{M}} e^{-\eta \widetilde{L}_J^b}} = \Pr[\widetilde{J}^b = I],$$
the coordinator's probability of choosing meta-arm $I$.

Cost of sampling a meta-arm: Kulesza and Taskar (2012) describe in their Algorithm 1 how to sample from a general DPP, whose outcome can be any subset of $\mathcal{Y}$; its size is not necessarily equal to $K$. Their algorithm consists of two phases:

1. Sample eigenvectors of $L^b$. This determines the size of the DPP outcome.
2. Use the sampled eigenvectors to actually choose a subset of $\mathcal{Y}$.

In order to obtain a K-DPP algorithm, they replace phase 1 with an algorithm that samples exactly $K$ eigenvectors (see Alg. 6).
This then fixes the size of the outcome to $K$, which is what we want in a K-DPP. Algorithm 5 describes the procedure for sampling from a K-DPP. Since our matrix $L^b$ is diagonal, its eigendecomposition is trivial: the eigenvalues are the diagonal entries of $L^b$, and the eigenvectors are the standard basis vectors. This means that in phase 2 of Alg. 5, we would simply end up choosing the indices of the eigenvectors computed in phase 1. Thus, we can actually stop after phase 1. Algorithm 6 describes how to sample exactly $K$ eigenvectors. Since this part requires $O(NK)$ time, we conclude that sampling a meta-arm in any block can be done in $O(NK)$. For completeness, Algorithms 6 and 7 reproduce the algorithm for sampling $K$ eigenvectors from Kulesza and Taskar (2012) and the sub-algorithm it uses, respectively.

Algorithm 5 Sampling from a K-DPP (based on Algorithm 1 from Kulesza and Taskar (2012))
1: $(v_n, \lambda_n)_{n=1}^N \leftarrow$ eigendecomposition of $L^b$  ▷ phase 1 begins
2: $V \leftarrow$ sample $K$ eigenvectors of $L^b$ (Algorithm 6)
3: $J \leftarrow \emptyset$  ▷ phase 2 begins
4: while $|V| > 0$ do
5:   select $i$ from $[N]$ with $\Pr(i) = \frac{1}{|V|}\sum_{v \in V}(v^\top e_i)^2$  ▷ $e_i$ = $i$-th standard basis vector
6:   $J \leftarrow J \cup \{i\}$
7:   $V \leftarrow V_\perp$, an orthonormal basis for the subspace of $V$ orthogonal to $e_i$
8: end while
Output: $J$

Algorithm 6 Sampling $K$ eigenvectors (Algorithm 8 in Kulesza and Taskar (2012))
Input: $K$, $(v_n, \lambda_n)_{n=1}^N$ = eigendecomposition of $L^b$
1: compute $e^n_l$ for $l = 0, 1, \ldots, K$ and $n = 0, 1, \ldots, N$ (Algorithm 7)
2: $J \leftarrow \emptyset$, $l \leftarrow K$
3: for $n = N, \ldots, 2, 1$ do
4:   if $l = 0$ then
5:     break
6:   end if
7:   if $u \sim U[0,1] < \lambda_n \frac{e^{n-1}_{l-1}}{e^n_l}$ then
8:     $J \leftarrow J \cup \{v_n\}$
9:     $l \leftarrow l - 1$
10:  end if
11: end for
Output: $J$

Cost of computing the marginal probability of one arm. If the K-Metaplayer decides to update arm $i$ at the end of block $b$, she needs to compute the marginal probability $\sum_{I \in \mathcal{M}:\, i \in I} p_b(I)$.
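For the diagonal kernel, the whole sampler therefore collapses to Algorithms 6 and 7. A minimal Python sketch (function names are ours, not from the paper's code):

```python
import numpy as np

def elementary_symmetric(lams, K):
    """e[n, l] = l-th elementary symmetric polynomial of lams[:n] (cf. Algorithm 7)."""
    N = len(lams)
    e = np.zeros((N + 1, K + 1))
    e[:, 0] = 1.0                # e^n_0 = 1 for all n; e^0_l = 0 for l >= 1
    for l in range(1, K + 1):
        for n in range(1, N + 1):
            e[n, l] = e[n - 1, l] + lams[n - 1] * e[n - 1, l - 1]
    return e

def sample_k_subset(lams, K, rng):
    """Sample J, |J| = K, with Pr(J) proportional to prod_{n in J} lams[n] (cf. Algorithm 6)."""
    e = elementary_symmetric(lams, K)
    J, l = [], K
    for n in range(len(lams), 0, -1):
        if l == 0:
            break
        if rng.uniform() < lams[n - 1] * e[n - 1, l - 1] / e[n, l]:
            J.append(n - 1)      # keep the index of the selected eigenvalue
            l -= 1
    return J

# For the coordinator: the eigenvalues are exp(-eta * Lhat_i), and the sampled
# index set is itself the meta-arm, so phase 2 of Algorithm 5 is unnecessary.
eta, K = 0.5, 3
rng = np.random.default_rng(2)
Lhat = rng.uniform(0.0, 10.0, size=8)    # hypothetical cumulative loss estimates
meta_arm = sample_k_subset(np.exp(-eta * Lhat), K, rng)
assert len(set(meta_arm)) == K
```

Both loops are $O(NK)$, matching the stated sampling cost. What remains is the marginal probability $\sum_{I \in \mathcal{M}:\, i \in I} p_b(I)$ of the updated arm.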
We can rewrite this as follows (with normalizer $Z^N_K := \sum_{J \in \mathcal{M}} e^{-\eta \tilde{L}^b_J}$):
$$\sum_{I \in \mathcal{M}:\, i \in I} p_b(I) = \frac{1}{Z^N_K} \sum_{I \in \mathcal{M}:\, i \in I} e^{-\eta \sum_{j \in I} \tilde{L}^b_j} \quad (\text{by definition of } \tilde{L}^b_I)$$
$$= \frac{e^{-\eta \tilde{L}^b_i}}{Z^N_K} \underbrace{\sum_{\{i_1,\ldots,i_{K-1}\} \subseteq [N] \setminus \{i\}} e^{-\eta \sum_{k=1}^{K-1} \tilde{L}^b_{i_k}}}_{=:(\ast)}.$$

Algorithm 7 Sub-algorithm for Alg. 6: computing elementary symmetric polynomials (Algorithm 7 in Kulesza and Taskar (2012))
Input: $K$, eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_N$
1: $e^n_0 \leftarrow 1 \quad \forall n \in \{0, 1, \ldots, N\}$
2: $e^0_l \leftarrow 0 \quad \forall l \in \{1, \ldots, K\}$
3: for $l = 1, 2, \ldots, K$ do
4:   for $n = 1, 2, \ldots, N$ do
5:     $e^n_l \leftarrow e^{n-1}_l + \lambda_n e^{n-1}_{l-1}$
6:   end for
7: end for
Output: values $e^n_l$, $l \in \{0, \ldots, K\}$, $n \in \{0, \ldots, N\}$

By inspecting the sum $(\ast)$ more closely, we observe that it looks very similar to the K-DPP defined before. In fact, it can be seen as a $(K-1)$-DPP over the ground set $[N] \setminus \{i\}$ with kernel matrix $L^b_{-i}$, i.e., $L^b$ with its $i$-th row and column removed. Therefore, the sum $(\ast)$ is exactly the normalization constant, call it $Z^{N \setminus i}_{K-1}$, of that $(K-1)$-DPP. Hence, the marginal probability of arm $i$ can be written as:
$$\sum_{I \in \mathcal{M}:\, i \in I} p_b(I) = e^{-\eta \tilde{L}^b_i}\, \frac{Z^{N \setminus i}_{K-1}}{Z^N_K}.$$
From Proposition 5.1 in Kulesza and Taskar (2012), we know that $Z^N_K$ and $Z^{N \setminus i}_{K-1}$ can each be computed in $O(NK)$ time. We conclude that computing the marginal probability of one arm in any block can be done in $O(NK)$.

5. Experiments

We run experiments with three different setups and compare the performance of C&P to the Musical Chairs algorithm (MC) of Rosenski et al. (2016). MC achieves constant regret with high probability in a stochastic setting by assuming a fixed gap between the $K$-th and $(K+1)$-th best arms. It starts with a learning phase of $T_0 \in O(1)$ rounds, during which players explore arms uniformly at random and observe losses. Based on these observations, the players compute a mean loss estimate for each arm. In the second phase, the players play a musical chairs game, in which each player keeps choosing u.a.r. among the $K$ best arms according to her own estimates.
As soon as a player chooses an arm without colliding for the first time, she becomes the owner of that arm and stays there for the rest of the game.

For all experiments, we set $N = 8$, $K = 4$, $T = 240000$, $T_R = 20$ and $T_0 = 3000$. This value for $T_0$ was also used in the experiments of Rosenski et al. (2016). We perform 10 runs for each setup and measure the online regret $R_t$, i.e., the difference between the cumulative player loss at time $t \in [T]$ and the cumulative loss of the $K$ arms that are best over the period $[t]$. For each setup, we create a plot showing the average regret and the standard deviation (as a colored region around the average). In the plots, the blue curves show the results of MC and the green curves the results of C&P. The black dashed line indicates the end of MC's learning phase ($t = T_0$).

Figure 2: Results for (a) Experiment 1 (stochastic), (b) Experiment 2 (link failures), and (c) Experiment 3 (link improves). The red dashed lines indicate when a link failed or came up.

For all of the following three setups, we also run experiments measuring the accumulated regret $R_T$ after $T$ rounds; these can be found in Section A.4 of Appendix A.

Experiment 1 (Stochastic) We choose $N$ mean losses in $[0,1]$ u.a.r., with a gap of at least $0.05$ between the $K$-th and $(K+1)$-th best arms. For each arm, the losses are sampled i.i.d. from a Bernoulli distribution with the selected mean. This setup is similar to the one of Rosenski et al. (2016). The results are shown in Figure 2a. As we can see, MC (blue curve) accumulates regret up to time $T_0$, but its regret stays constant afterwards. Our algorithm (green curve) keeps accumulating regret until the end of the game.

Experiment 2 (Non-stochastic) We model a network scenario in which good links (i.e., links with small losses) fail suddenly.
Concretely, we initially set the mean loss $\mu_i$ of each arm $i$ as follows: $\mu_1 = \mu_2 = \mu_3 = \mu_4 = 0.1$ and $\mu_5 = \mu_6 = \mu_7 = \mu_8 = 0.3$. Each arm $i$'s losses are sampled i.i.d. from the Bernoulli distribution $\mathrm{Ber}(\mu_i)$. At time $T/4$, "link" (arm) 1 fails, and its remaining losses are sampled i.i.d. from $\mathrm{Ber}(0.9)$. After a while, at time $T/3$, link 3 also fails, and from then on its losses are also drawn from $\mathrm{Ber}(0.9)$. Figure 2b shows the results of this experiment; the red dashed lines mark the two link failures.

Experiment 3 (Non-stochastic) We model another network scenario, in which a bad link suddenly improves (or a link that was down comes up). We set the initial mean losses as follows: $\mu_1 = 0.9$ and $\mu_2 = \mu_3 = \cdots = \mu_8 = 0.7$. As before, the losses are sampled i.i.d. from Bernoulli distributions with the corresponding means. At time $T/4$, link 1 improves and its losses are from then on drawn from $\mathrm{Ber}(0.1)$. Figure 2c shows the results of this experiment; the red dashed line marks when link 1 improves.

In Experiments 2 and 3, the link failures and improvements, respectively, happen after MC's learning phase of length $T_0$. Because of this, MC (blue curve) cannot react to them, and its regret starts to increase. For C&P (green curve), the regret is initially larger than MC's, yet it is able to react to the link changes.

6. Discussion and Conclusions

We have presented the first efficient algorithm for the multi-player "no communication" adversarial setting. Our method obtains a regret bound of $\tilde{O}(K^{4/3}N^{1/3}T^{2/3})$, and it is interesting to understand whether one can devise an efficient method that obtains a rate of $O(\mathrm{poly}(K,N)\sqrt{T})$. In our algorithm, there is a single learner (the coordinator), while all other players simply accept the coordinator's decisions and ignore the loss feedback that they receive. This poses a single point of failure.
One possible way to remedy this might be to switch coordinators after each block in a round-robin fashion: player 1 would be the coordinator in block 1, player 2 in block 2, and so on.

Acknowledgments

We would like to thank Johannes Kirschner and Mojmír Mutný for their valuable feedback on the manuscript. We would also like to thank the anonymous reviewers for their reviews and suggestions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 815943), as well as from the ETH Zurich Postdoctoral Fellowship and the Marie Curie Actions for People COFUND program.

Appendix A.

A.1. Regret analysis for Lemma 1 (K-Metaplayer)

In this section, we prove that using $\eta = \sqrt{\frac{\log \binom{N}{K}}{KTN}}$, the metaplayer's regret with Alg. 1 is bounded by
$$\mathbb{E}[R^{\mathrm{meta}}_T] \le 2\sqrt{KTN \log \tbinom{N}{K}}.$$
Proof Using the view of meta-arms (see Section 3.1), Alg. 1 is essentially EXP3 played on $|\mathcal{M}| = \binom{N}{K}$ meta-arms. In order to apply regret guarantees from the EXP3 analysis, we need to show that:
1. A meta-arm $I \in \mathcal{M}$ is chosen with probability proportional to $\exp(-\eta \tilde{L}^t_I)$ at time $t$, where $\tilde{L}^t_I = \sum_{\tau=1}^{t-1} \sum_{i \in I} \tilde{\ell}^\tau_i$ is the cumulative loss estimate of $I$ at time $t$. This can be seen directly from Alg. 1.
2. $\tilde{\ell}^t_I = \sum_{i \in I} \tilde{\ell}^t_i$ is an unbiased estimate of the true meta-arm loss $\ell^t_I = \sum_{i \in I} \ell^t_i$ at time $t$, for any $I \in \mathcal{M}$ and any $t$. For this, we first show that for any arm $i \in [N]$, $\tilde{\ell}^t_i$ is an unbiased estimate of $\ell^t_i$:
$$\mathbb{E}[\tilde{\ell}^t_i \mid p^t] = \frac{K \ell^t_i}{\sum_{Z \in \mathcal{M}:\, i \in Z} p^t(Z)}\, \Pr[J^t = i] = \frac{K \ell^t_i}{\sum_{Z \in \mathcal{M}:\, i \in Z} p^t(Z)}\, \underbrace{\Pr[i \in I^t]}_{=\sum_{Z \in \mathcal{M}:\, i \in Z} p^t(Z)}\, \underbrace{\Pr[J^t = i \mid i \in I^t]}_{= 1/K} = \ell^t_i.$$
From the law of total expectation, we can derive that $\mathbb{E}[\tilde{\ell}^t_i] = \mathbb{E}[\mathbb{E}[\tilde{\ell}^t_i \mid p^t]] = \ell^t_i$. Finally, by linearity of expectation (as $\tilde{\ell}^t_I = \sum_{i \in I} \tilde{\ell}^t_i$), we conclude that $\tilde{\ell}^t_I$ is an unbiased estimate of $\ell^t_I$. Given 1.
and 2., we can apply standard EXP3 regret guarantees (e.g., see Lecture 9 of McMahan and Dekel (2014); we also use $|\mathcal{M}| = \binom{N}{K}$) to obtain the following bound on the metaplayer's regret:
$$\mathbb{E}[R^{\mathrm{meta}}_T] \le \eta \sum_{t=1}^T \sum_{I \in \mathcal{M}} \mathbb{E}\big[p^t(I)\, \underbrace{\mathbb{E}[(\tilde{\ell}^t_I)^2 \mid p^t]}_{=:(\ast)}\big] + \frac{\log \binom{N}{K}}{\eta}.$$
The variance term $(\ast)$ can be simplified as follows:
$$\mathbb{E}[(\tilde{\ell}^t_I)^2 \mid p^t] = \mathbb{E}\Big[\Big(\sum_{i \in I} \tilde{\ell}^t_i\Big)^2 \,\Big|\, p^t\Big] \quad (\text{by definition})$$
$$= \sum_{j,k \in I} \mathbb{E}[\tilde{\ell}^t_j \tilde{\ell}^t_k \mid p^t] \quad (\text{linearity of expectation})$$
$$= \sum_{i \in I} \mathbb{E}[(\tilde{\ell}^t_i)^2 \mid p^t] \quad (\text{the loss estimate at time } t \text{ is non-zero for at most one arm, so all cross terms } j \neq k \text{ vanish})$$
$$= \sum_{i \in I} \Big(\frac{K \ell^t_i}{\sum_{Z:\, i \in Z} p^t(Z)}\Big)^2 \Pr[J^t = i] = \sum_{i \in I} \Big(\frac{K \ell^t_i}{\sum_{Z:\, i \in Z} p^t(Z)}\Big)^2 \underbrace{\Pr[i \in I^t]}_{=\sum_{Z:\, i \in Z} p^t(Z)}\, \underbrace{\Pr[J^t = i \mid i \in I^t]}_{=1/K} = \sum_{i \in I} \frac{K (\ell^t_i)^2}{\sum_{Z:\, i \in Z} p^t(Z)}.$$
Plugging this back into our expression for the regret, we obtain:
$$\mathbb{E}[R^{\mathrm{meta}}_T] \le \eta \sum_{t=1}^T \mathbb{E}\Big[\underbrace{\sum_{I \in \mathcal{M}} p^t(I) \sum_{i \in I} \frac{K (\ell^t_i)^2}{\sum_{Z:\, i \in Z} p^t(Z)}}_{(\ast\ast)}\Big] + \frac{\log \binom{N}{K}}{\eta}.$$
In $(\ast\ast)$, we first sum over all meta-arms $I$ and then over all arms $i \in I$. We can instead sum over all arms $i$ first and then over all meta-arms $I$ containing $i$. Hence, we can rewrite $(\ast\ast)$ as follows:
$$\sum_{I \in \mathcal{M}} p^t(I) \sum_{i \in I} \frac{K (\ell^t_i)^2}{\sum_{Z:\, i \in Z} p^t(Z)} = \sum_{i=1}^N \frac{K (\ell^t_i)^2}{\sum_{Z:\, i \in Z} p^t(Z)} \sum_{I \in \mathcal{M}:\, i \in I} p^t(I) = K \sum_{i=1}^N (\ell^t_i)^2.$$
Plugging this back into the regret expression, we get:
$$\mathbb{E}[R^{\mathrm{meta}}_T] \le K\eta \sum_{t=1}^T \sum_{i=1}^N \mathbb{E}\big[(\underbrace{\ell^t_i}_{\in [0,1]})^2\big] + \frac{\log \binom{N}{K}}{\eta} \le KTN\eta + \frac{\log \binom{N}{K}}{\eta} = 2\sqrt{KTN \log \tbinom{N}{K}} \quad \Big(\text{with } \eta = \sqrt{\tfrac{\log \binom{N}{K}}{KTN}}\Big).$$
This concludes the regret analysis for Lemma 1.

A.2. Success analysis of the ranking algorithm (Alg. 2)

In this section, we show that the players successfully compute a ranking using Algorithm 2 within $T_R = Ke \log T$ rounds, with probability at least $1 - \frac{K}{T}$. The analysis uses ideas from the proof of Lemma 3 in Rosenski et al. (2016). For a fixed player, let $q_t$ be the probability that she gets a rank assigned in step $t$.
$q_t$ can be bounded as:
$$q_t \ge \frac{|\mathrm{Free}|}{K}\Big(1 - \frac{1}{K}\Big)^{\mathrm{Unranked}-1} \quad (\mathrm{Free} = \text{set of available arms at time } t,\ \mathrm{Unranked} = \text{number of players without a rank yet})$$
$$\ge \frac{|\mathrm{Free}|}{K}\Big(1 - \frac{1}{K}\Big)^{K-1} \quad (\mathrm{Unranked} \le K)$$
$$\ge \frac{1}{Ke} \quad \Big(|\mathrm{Free}| \ge 1 \text{ and } \big(1 - \tfrac{1}{K}\big)^{K-1} \ge e^{-1} \text{ for } K \ge 1\Big).$$
The probability that she does not have a rank after step $t$ is thus at most
$$\Big(1 - \frac{1}{Ke}\Big)^t \le e^{-\frac{t}{Ke}} \quad (\text{using the inequality } 1 - x \le e^{-x}).$$
By a union bound, the probability that at least one player is not fixed after $t = T_R$ rounds is at most $K e^{-\frac{T_R}{Ke}}$. By setting $T_R = Ke \log T$, we conclude that after $T_R$ rounds, all players have a rank with probability at least
$$1 - K e^{-\frac{Ke \log T}{Ke}} = 1 - \frac{K}{T}.$$

A.3. Staying Quiet

So far, we assumed that players need to stay quiet during the Coordinate phase of our C&P algorithm presented in Section 3; i.e., during sub-block $k$, all players except the coordinator and player $k$ do not pick any arms. This assumption can be relaxed using a simple modification of our protocol: during sub-block $k \in \{2, \ldots, K\}$, all followers except player $k$ stay on arm 1. Player $k$ explores all arms in a round-robin fashion for at most $N$ steps, until she collides on an arm $i \neq 1$. If she does, $i$ is the arm that the coordinator has chosen for her. If player $k$ does not collide on any arm other than arm 1, she can conclude that the coordinator has picked arm 1 for her.

A.4. Experiments (measuring the accumulated regret)

For the three setups described in Section 5, we run experiments measuring the accumulated regret $R_T$ of both MC and our algorithm. We visualize the outcome in a loglog plot to compare the experimental results with our theoretical bound (Theorem 2). In all three experiments, we set $N = 8$, $K = 4$, $T_R = 25$ and $T_0 = 3000$ (length of MC's learning phase). For $T$, we choose $T = 100000 + i \cdot 1000$, where $i \in \{0, \ldots, 1300\}$. Per value of $T$, we perform 10 runs and measure the regrets. In the loglog plots, the blue dots show the average regrets of MC and the green dots the average regrets of our algorithm.
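On a loglog plot, a regret curve of the form $R_T = c\, T^{2/3}$ appears as a straight line of slope $2/3$, so the fitted slope gives a quick check against Theorem 2. A minimal sketch of such a fit on synthetic data (ours, purely illustrative; the constant and noise level are made up):

```python
import numpy as np

# Synthetic regrets growing as R_T = c * T^(2/3), the rate of Theorem 2,
# with small multiplicative noise.
rng = np.random.default_rng(3)
T = 100000 + 1000 * np.arange(0, 1301)                  # the grid of horizons used above
R = 5.0 * T ** (2.0 / 3.0) * (1 + 0.01 * rng.normal(size=T.size))
slope, intercept = np.polyfit(np.log(T), np.log(R), 1)  # fit a line in log-log space
assert abs(slope - 2.0 / 3.0) < 0.01                    # recovered slope is close to 2/3
```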
The standard deviations are shown as colored regions around the average regrets. Besides this, we fit a line to the log average regrets of each algorithm and plot it as well. With these lines, we can check whether the experimental results match what we expect from Theorem 2.

Experiment 1 We use the same setup as in Experiment 1 from Section 5, i.e., arms with i.i.d. Bernoulli losses whose means are sampled u.a.r. from $[0,1]$ with a gap of at least $0.05$ between the $K$-th and $(K+1)$-th best arm. The results are shown in Figure 3a.

Experiment 2 In this experiment, we use the setup from Experiment 2 in Section 5, i.e., we model a network in which two links go down, at times $T/4$ and $T/3$, respectively. Figure 3b shows the results of this experiment.

Experiment 3 Here, we use the setup from Experiment 3 in Section 5, in which a bad link suddenly improves (or comes up) at time $T/4$. The outcome of this experiment is shown in Figure 3c.

Figure 3: Loglog plots of (a) experiment 1 (stochastic losses), (b) experiment 2 (link failures), and (c) experiment 3 (link improves).

References

Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, and Ananthram Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.

Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Orly Avner and Shie Mannor. Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 66–81. Springer, 2014.

Orly Avner and Shie Mannor. Multi-user lax communications: a multi-armed bandit approach.
In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.

Orly Avner and Shie Mannor. Multi-user communication networks: A coordinated multi-armed bandit approach. arXiv preprint arXiv:1808.04875, 2018.

Baruch Awerbuch and Robert Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.

Ilai Bistritz and Amir Leshem. Distributed multi-player bandits - a game of thrones approach. In Advances in Neural Information Processing Systems, pages 7222–7232, 2018.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Sébastien Bubeck, Yuanzhi Li, Yuval Peres, and Mark Sellke. Non-stochastic multi-player multi-armed bandits: Optimal rate with collision information, sublinear without. arXiv preprint arXiv:1904.12233, 2019.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49:605–622, 2016.

Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

Ofer Dekel, Ambuj Tewari, and Raman Arora. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML, 2012.

Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012. ISBN 9781601986283.

Lifeng Lai, Hai Jiang, and H. Vincent Poor. Medium access in cognitive radio networks: A competitive multi-armed bandit framework. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pages 98–102.
IEEE, 2008.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

Haoyang Liu, Keqin Liu, Qing Zhao, et al. Learning in a changing world: Restless multi-armed bandit with unknown dynamics. IEEE Transactions on Information Theory, 59(3):1902–1916, 2013.

Keqin Liu and Qing Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing, 58(11):5667–5681, 2010.

Brendan McMahan and Ofer Dekel. CSE599s: Online learning, 2014. URL https://courses.cs.washington.edu/courses/cse599s/14sp/scribes.html.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Jonathan Rosenski, Ohad Shamir, and Liran Szlak. Multi-player bandits: A musical chairs approach. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, 2016.

Taishi Uchiya, Atsuyoshi Nakamura, and Mineichi Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pages 375–389. Springer, 2010.

Sattar Vakili, Keqin Liu, and Qing Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759–767, 2013.