# A Gang of Adversarial Bandits

Mark Herbster*, Stephen Pasteris*
Department of Computer Science, University College London, London WC1E 6BT
{m.herbster,s.pasteris}@cs.ucl.ac.uk

Fabio Vitale
University of Lille, 59653 Villeneuve d'Ascq CEDEX, France
fabio.vitale2@univ-lille.fr

Massimiliano Pontil
CSML, Istituto Italiano di Tecnologia, and Department of Computer Science, University College London
massimiliano.pontil@iit.it

* Equal contribution.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

We consider running multiple instances of multi-armed bandit (MAB) problems in parallel. A main motivation for this study is online recommendation systems, in which each of N users is associated with a MAB problem and the goal is to exploit users' similarity in order to learn users' preferences over K items more efficiently. We consider the adversarial MAB setting, whereby an adversary is free to choose which user and which loss to present to the learner during the learning process. Users are in a social network and the learner is aided by a-priori knowledge of the strengths of the social links between all pairs of users. It is assumed that if the social link between two users is strong then they tend to share the same action. The regret is measured relative to an arbitrary function which maps users to actions. The smoothness of the function is captured by a resistance-based dispersion measure Ψ. We present two learning algorithms, GABA-I and GABA-II, which exploit the network structure to bias towards functions of low Ψ values. We show that GABA-I has an expected regret bound of O(√(ln(NK/Ψ) Ψ K T)) and per-trial time complexity of O(K ln(N)), whilst GABA-II has a weaker O(√(ln(N/Ψ) ln(NK/Ψ) Ψ K T)) regret, but a better O(ln(K) ln(N)) per-trial time complexity. We highlight improvements of both algorithms over running independent standard MABs across users.

1 Introduction

During the last decade multi-armed bandits (MAB) have received a great deal of attention in machine learning and related fields, due to their wide practical and theoretical importance. The central problem is to design a decision strategy whereby a learner sequentially explores the environment in order to find the best item (arm) within a prescribed set. At each step in the exploration the learner chooses an arm, after which feedback (typically a loss or reward corresponding to the selected arm) is observed from the environment. Then the next decision is made by the learner based on past interactions, and the process repeats. The goal is to design efficient exploration strategies which incur a small cumulative loss in comparison to the cumulative loss that would have been obtained by always selecting the best arm in hindsight. Applications of MAB are numerous, including recommender systems [1], clinical trials [2], and adaptive routing [3], among others.

In this paper we study the problem in which the learner faces several MAB problems that are related according to a prescribed interaction graph. A main motivation behind this problem is online recommendation systems, whereby each of several users is associated with a MAB problem (task), where the arms correspond to a finite set of products, and the graph represents a social network among users. The goal is to exploit users' similarity in order to improve the efficiency of learning users' preferences via online exploration of products.
In the standard full information setting, a lot of work has been done showing that techniques from multitask learning are effective in reducing the amount of data needed to learn each of the individual tasks, both in the statistical and adversarial settings; see [4, 5, 6, 7, 8, 9, 10, 11] and references therein. Graphs have been used to model task relationships, with the parameters of different tasks encouraged to be close according to the graph topology. In contrast, multitask learning in the bandit setting is much less explored. The algorithms that we present exploit the network homophily principle [12], which posits that users who are connected in the network have similar preferences, that is, they tend to share preferred recommendations. We will show that our algorithms exploit graph structure and enjoy potentially much smaller regret bounds than the cumulative regret of standard MABs run independently on each user. Since the original graph may be dense, we exploit a randomized sparsification technique to develop fast prediction algorithms. Our approach builds upon previous work on online learning over graphs [13, 14] to generate a perfect full oriented binary tree, whose leaves are in one-to-one correspondence with the nodes of the original graph. This construction approximately preserves the relevant graph properties in expectation, and provides the starting point for designing our efficient algorithms. A further ingredient in our algorithms is the method of specialists [15, 16]. Our learning strategies combine the above ingredients to devise efficient online algorithms under partial feedback.

Contributions. We introduce two Gang of Adversarial BAndit algorithms, GABA-I and GABA-II, that jointly learn MAB models for N users over K possible actions. Both algorithms are designed to exploit network structure while being (extremely) computationally efficient. We derive expected (over the algorithms' randomizations) regret bounds. The bounds scale with the dispersion measure Ψ ∈ [1, N] of the best actions over the graph. For GABA-I the bound is of order O(√(ln(NK/Ψ) Ψ K T)), where T is the number of trials, and the per-trial time complexity is O(K ln(N)). On the other hand GABA-II has a weaker expected regret bound of O(√(ln(N/Ψ) ln(NK/Ψ) Ψ K T)) but is faster, having a per-trial time complexity of O(ln(K) ln(N)). Thus the GABA-I algorithm improves on algorithms that treat each user independently, as in the best case the √N factor of the independent baseline improves to O(√(ln N)), and in the worst case the regret degrades by at most a constant factor. GABA-II has slightly weaker regret bounds; however, it is more computationally efficient.

Outline of Main Results. The social network graph G is determined by a set of undirected links between users {ω_{u,v}}_{u,v∈[N]}.

SPECIALISTEXP (Parameters: learning rate η > 0; distribution w_1 : 𝒮 → [0, 1] s.t. Σ_{s∈𝒮} w_1(s) = 1.)
For t = 1, . . . , T do
1. ∀a ∈ [K]: p_{t,a} ← Σ_{s∈𝒮 : s(u_t)=a} w_t(s);
2. Predict a_t by drawing from [K] with probability P[a_t = a] := p_{t,a} / ‖p_t‖_1;
3. Receive ℓ_{t,a_t};
4. λ_t ← exp(−η ℓ_{t,a_t} ‖p_t‖_1 / p_{t,a_t}); z_t ← ‖p_t‖_1 / (‖p_t‖_1 − (1 − λ_t) p_{t,a_t});
   w_{t+1}(s) ← w_t(s) if s(u_t) = ⊥; w_t(s) z_t if s(u_t) ≠ a_t; w_t(s) z_t λ_t if s(u_t) = a_t.

Figure 2: SPECIALISTEXP Algorithm

3 Predicting with Specialists

We build on the prediction with expert advice framework [71, 72, 73, 74], specifically that with bandit feedback, pioneered by the EXP4 algorithm [18]. This type of online algorithm maintains a distribution over a set of predictors ("experts"). After the predictors predict, they incur a loss and the distribution is updated accordingly.
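To make the update rule in Figure 2 concrete, below is a minimal Python sketch of SPECIALISTEXP over an explicitly enumerated (hence small) specialist set. The encoding of specialists as objects indexable by user, with None standing for the abstain symbol, and all function and variable names are illustrative choices made here, not the paper's implementation.

```python
import numpy as np

def specialist_exp(specialists, w1, user_seq, loss_fn, K, eta, T, rng=None):
    """Sketch of SPECIALISTEXP (Figure 2) over an explicit specialist set.

    specialists: list of objects indexable by user (tuple or dict), giving an
                 action in range(K) or None (abstain) for each user.
    w1:          initial weights, one per specialist, summing to one.
    user_seq:    user_seq[t] is the user u_t revealed on trial t.
    loss_fn:     loss_fn(t, a) returns the loss in [0, 1] of action a on trial t;
                 it is only queried at the played action (bandit feedback).
    """
    rng = rng or np.random.default_rng()
    w = np.array(w1, dtype=float)
    total_loss = 0.0
    for t in range(T):
        u = user_seq[t]
        # Step 1: mass of the awake specialists behind each action.
        p = np.zeros(K)
        for i, s in enumerate(specialists):
            if s[u] is not None:
                p[s[u]] += w[i]
        # Step 2: play an action with probability proportional to p.
        a = rng.choice(K, p=p / p.sum())
        # Step 3: observe only the loss of the played action.
        loss = loss_fn(t, a)
        total_loss += loss
        # Step 4: exponential update of the awake specialists; sleepers keep
        # their weight, and z rescales the awake mass so its total is preserved.
        lam = np.exp(-eta * loss * p.sum() / p[a])
        z = p.sum() / (p.sum() - (1.0 - lam) * p[a])
        for i, s in enumerate(specialists):
            if s[u] is None:
                continue                      # asleep: weight unchanged
            w[i] *= z * (lam if s[u] == a else 1.0)
    return total_loss
```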
Although, except in special cases, this procedure does not have a natural Bayesian interpretation, probabilistic methods may still be transferred into the expert advice framework. In particular we will exploit an analogue of message-passing as used in graphical models [75] to predict very efficiently over exponentially-sized sets of predictors. Broadly speaking we would like to build a graphical model that is isomorphic to the user graph G. However, it is well-known that exact prediction with graphical models that contain cycles is NP-hard [76]. Thus a benefit of the embedding into a BST (see Section 2.2) is that it enables fast and exact computation, as the graph is now cycle-free, and Lemma 2 ensures that the embedding only modestly increases our regret bounds. Surprisingly, we also improve in terms of computation over standard message-passing techniques: if we embedded into a line graph we would require O(KN) time per trial to predict [75], or O(K^3 log N) time using the method of [77]. In contrast, we will require only O(K log N) and O(log K log N) time for the GABA-I and GABA-II algorithms respectively (see Figures 3 and 4).

To accomplish this technically we adapt the method of specialists [15, 16]. A specialist is a prediction function s : [N] → {1, 2, . . . , K, ⊥} from a context space to an extended output space with abstentions. For us the context space is just the set of users [N], and the extended output space is {1, 2, . . . , K, ⊥}, where [K] corresponds to predicted actions and ⊥ indicates that the specialist abstains from predicting an action. Thus a specialist specializes its prediction to part of the context space. We denote the set of all specialists by 𝒮 := {1, . . . , K, ⊥}^[N]. As a single specialist only predicts over part of the context space, we need a set of specialists S ⊆ 𝒮 if we wish to define a function that predicts an action for every context. A specialist set S ⊆ 𝒮 is well-formed if for each u ∈ [N] there exists a unique specialist s ∈ S such that s(u) ∈ [K]. For such a user u and specialist s we then define S(u) := s(u), so that S is a function from [N] into [K]. Finally, a specialist model is defined by giving a distribution w_1 : 𝒮 → [0, 1] s.t. Σ_{s∈𝒮} w_1(s) = 1. To predict with specialists we adapt [15] to the EXP3/4 [18] setting, giving the SPECIALISTEXP algorithm (see Figure 2). We then bound the regret by combining the analyses of [15, 18] into the following theorem.

Theorem 3. The expected regret of SPECIALISTEXP with initial specialist distribution w_1 : 𝒮 → [0, 1] and learning rate η > 0 is bounded above by

E[ Σ_{t∈[T]} ℓ_{t,a_t} − ℓ_{t,S(u_t)} ] ≤ (1/η) Σ_{s∈S} ln(1/(w_1(s)|S|)) + ηKT/2

for all well-formed specialist sets S ⊆ 𝒮.

In the following we give the two distributions that define the two specialist models corresponding to GABA-I and GABA-II, in (7) and (9); in the supplementary material we detail how these distributions lead to the regret bounds in Corollaries 4 and 5.

We now give the distribution w_1(·) over 𝒮 that defines the GABA-I model. The model has a single parameter φ ∈ (0, 1), and we use the following helper functions to define the distribution:

valid1(s) := ⟦∀u, v ∈ [N] : s(u) = s(v) or s(u) = ⊥ or s(v) = ⊥⟧
cut(s) := Σ_{u∈[N−1]} ⟦s(u) ≠ s(u + 1)⟧
startfactor(s) := ((K − 1)/K) ⟦s(1) = ⊥⟧ + (1/K) ⟦s(1) ≠ ⊥⟧ .

The function valid1(·) determines the support of w_1(·), namely the specialists that predict a unique action or abstain; hence the cardinality of the support of w_1(·) is K(2^N − 1) + 1. The remaining two functions quantitatively determine the probability mass of a specialist as:

w_1(s) := valid1(s) (1/K) startfactor(s) (1 − φ)^{N−1−cut(s)} φ^{cut(s)}   (∀s ∈ 𝒮). (7)

We note that this specialist selection is similar to that of the Markov circadian specialists in [16], except that the nodes of the Markov chain are now users instead of trials.
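For intuition, the combinatorial structure of this prior is easy to examine numerically for tiny N. The sketch below is illustrative only: it uses the reconstruction of (7) given above, encodes ⊥ as None, indexes users from 0, and checks that the number of supported specialists is K(2^N − 1) + 1. The same encoding can be reused with the SPECIALISTEXP sketch given earlier.

```python
from itertools import product

def valid1(s):
    """1 if s predicts at most one distinct action (ignoring abstentions), else 0."""
    return 1 if len({a for a in s if a is not None}) <= 1 else 0

def cut(s):
    """Number of consecutive user pairs on which s disagrees."""
    return sum(1 for u in range(len(s) - 1) if s[u] != s[u + 1])

def startfactor(s, K):
    """(K-1)/K if the first user is an abstention, 1/K otherwise."""
    return (K - 1) / K if s[0] is None else 1 / K

def w1(s, K, phi):
    """Prior mass of specialist s under the reconstruction of equation (7) above."""
    N = len(s)
    return (valid1(s) * (1 / K) * startfactor(s, K)
            * (1 - phi) ** (N - 1 - cut(s)) * phi ** cut(s))

# Tiny instance: enumerate all (K+1)^N specialists and count the support of w1.
N, K, phi = 4, 3, 0.25
specialists = list(product(list(range(K)) + [None], repeat=N))
support = sum(1 for s in specialists if w1(s, K, phi) > 0)
print(support, K * (2 ** N - 1) + 1)   # both should equal K(2^N - 1) + 1 = 46
```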
Corollary 4. Let SPECIALISTEXP be run with distribution w_1(·) as defined by (7), parameter φ = 4Ψ(y)/(K(N − 1)), and learning rate η = √(10 Ψ(y) ln(KN/Ψ(y)) / (KT)). Then, provided Ψ(y) ≤ (N − 1)/4, its expected regret satisfies

E[ Σ_{t∈[T]} ℓ_{t,a_t} − ℓ_{t,y(u_t)} ] = O(√(ln(NK/Ψ(y)) Ψ(y) K T))

for any mapping of users to actions y : [N] → [K].

We now give the distribution w_1(·) over 𝒮 that defines the GABA-II model. Whereas for GABA-I the cardinality of the support was exponential in N, for GABA-II the cardinality is just K(2N − 1). The supported specialists in GABA-II predict a unique action over a contiguous segment of users l, . . . , r and abstain everywhere else; thus they are of the form

s^{l,r}_a(u) := a if u ∈ {l, . . . , r}, and ⊥ if u ∉ {l, . . . , r},

but not all contiguous segments are supported. The segments supported are those that correspond to the sets of all leaf-descendants of the nodes of the BST (see Section 2.2). As an example, if N = 4 the supported (l, r) segments are {(1, 1), (2, 2), (3, 3), (4, 4), (1, 2), (3, 4), (1, 4)}. Expressing this algebraically leads to a relatively complex validity function

valid2(s) := ⟦∃ a ∈ [K], l, r ∈ [N], i ∈ {0, 1, . . . , log_2 N}, j ∈ [N] : 1 + r − l = 2^i and l = 2^i(j − 1) + 1 and s = s^{l,r}_a⟧ ,

and then the distribution is defined as

w_1(s) := valid2(s) / (K(2N − 1))   (∀s ∈ 𝒮). (9)

We note that this selection of specialists is a simple multi-action extension of those defined in [65].

Corollary 5. Let SPECIALISTEXP be run with distribution w_1(·) as defined by (9) and learning rate η = √(8 Ψ(y) log_2(eN/Ψ(y)) ln(3KN/(2Ψ(y))) / (KT)). Then, provided Ψ(y) ≤ N/2, its expected regret satisfies

E[ Σ_{t∈[T]} ℓ_{t,a_t} − ℓ_{t,y(u_t)} ] = O(√(ln(N/Ψ(y)) ln(NK/Ψ(y)) Ψ(y) K T))

for any mapping of users to actions y : [N] → [K].

4 The GABA Algorithms

We now introduce the GABA algorithms. Both algorithms are based on the BST C (see Section 2.2).

4.1 GABA-I

Since GABA-I has an exponential number of non-zero weight specialists, a direct implementation of SPECIALISTEXP would take per-trial time and space exponential in N. We now describe how GABA-I implements SPECIALISTEXP, bringing the per-trial time down to O(K ln(N)) and the space down to O(KN). The implementation works by performing, for each action independently, online belief propagation [64] over the tree C. We note that each of these K online belief propagations is over two states {0, 1} and hence takes a per-trial time of only O(ln(N)). We now detail this procedure. In what follows, ↑(n), ◁(n) and ▷(n) denote the parent, left child and right child of a vertex n of C, and A(v) denotes the set of ancestors of a vertex v (including v itself).

GABA-I maintains a vector-valued function α_t : C × {0, 1} × {0, 1} → R^K which, for all i, j ∈ {0, 1} and t ∈ [T], has the following properties:

∀u ∈ [N] \ {u_t} : α_{t+1}(u, i, j) = α_t(u, i, j)   (11)

and, for all internal vertices n of C,

α_t(n, i, j) = Σ_{k∈{0,1}} α_t(◁(n), i, k) α_t(▷(n), k, j)   (12)

(with products of the K-dimensional vectors taken componentwise). On trial t, GABA-I computes p_t by sending vector-valued messages down the path in C from the root to u_t. Specifically, we construct left and right message functions β◁_t, β▷_t : A(u_t) × {0, 1} → R^K as follows. Each (non-root, proper) ancestor n of u_t receives, for i ∈ {0, 1}, K-dimensional vector messages β◁_t(↑(n), i) and β▷_t(↑(n), i) from its parent. It then constructs its own messages β◁_t(n, i) and β▷_t(n, i) from the parent's messages and the values α_t(·, j, i), α_t(·, i, j) of the sibling of n (for all i, j ∈ {0, 1}; see steps 5(a)-(b) of Figure 3), and sends these messages to its child that is next on the path to u_t. Once u_t has received the messages from its parent, it combines them with α_t(u_t, 1, i) (for i ∈ {0, 1}) to create p_t. On the receipt of ℓ_{t,a_t} we update the function α_t to α_{t+1}, noting that by (11) and (12) we need only modify the values α_t(n, i, j) when n is an ancestor of u_t.
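The O(K ln N) per-trial cost rests on the observation that both (12) and the downward messages involve only vertices on the root-to-u_t path. The sketch below (illustrative only, not the paper's code) isolates that bookkeeping for a single action: each internal node caches the 2x2 matrix product of its children, so replacing the matrix at one leaf requires recomputing only its O(log N) ancestors, mirroring steps 10-13 of Figure 3.

```python
import numpy as np

class MatrixProductTree:
    """Perfect binary tree over N = 2**h leaves. Each leaf holds a 2x2 matrix;
    each internal node caches the product of its two children, so the root holds
    the product of all leaf matrices. A point update touches O(log N) nodes,
    mirroring the maintenance of equation (12) for a single action."""

    def __init__(self, leaf_matrices):
        n = len(leaf_matrices)
        assert n and (n & (n - 1)) == 0, "number of leaves must be a power of two"
        self.n = n
        # Heap layout: node 1 is the root, node i has children 2i and 2i+1,
        # leaf j lives at index n + j.
        self.node = [np.eye(2) for _ in range(2 * n)]
        for j, m in enumerate(leaf_matrices):
            self.node[n + j] = np.array(m, dtype=float)
        for i in range(n - 1, 0, -1):            # bottom-up initialisation
            self.node[i] = self.node[2 * i] @ self.node[2 * i + 1]

    def update_leaf(self, j, new_matrix):
        """Replace leaf j's matrix and refresh only its ancestors (O(log N))."""
        i = self.n + j
        self.node[i] = np.array(new_matrix, dtype=float)
        i //= 2
        while i >= 1:
            self.node[i] = self.node[2 * i] @ self.node[2 * i + 1]
            i //= 2

    def root(self):
        return self.node[1]


# Tiny usage example: N = 8 users, one action's two-state matrices.
phi = 0.25
leaf = np.array([[1 - phi, phi], [phi, 1 - phi]])
tree = MatrixProductTree([leaf] * 8)
tree.update_leaf(3, leaf * np.array([[1.0, 1.0], [0.5, 0.5]]))  # reweight user 3
print(tree.root())
```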
4.2 GABA-II

For GABA-II there are O(K ln(N)) non-zero weight specialists that do not abstain on any given trial, so a direct implementation of SPECIALISTEXP would take a per-trial time of O(K ln(N)). We now show how GABA-II implements SPECIALISTEXP so as to take the per-trial time down to O(ln(K) ln(N)) whilst maintaining the space complexity of O(KN).

We first note that SPECIALISTEXP maintains a weight for each specialist. For any vertex n of C and any action a, the weight, on trial t, of the specialist that predicts a on the leaf-descendants of n and abstains otherwise is kept by GABA-II in the following factored form:

μ_t(n) θ_t(n, a) / (K(2N − 1))   (13)

where μ_{t+1}(n) := μ_t(n) whenever n ∉ A(u_t), and θ_{t+1}(n, a) := θ_t(n, a) whenever n ∉ A(u_t) or a ≠ a_t. In addition to the tree C, GABA-II also works with an oriented full binary tree B whose leaves are the actions (in this overview we assume that the cardinality of the action set is an integer power of two, although this is not required by GABA-II). For any vertex n of C, the function θ_t(n, ·) is extended onto all internal vertices of B by the inductive relationship

θ_t(n, m) := θ_t(n, ◁(m)) + θ_t(n, ▷(m)).   (14)

To sample the action a_t, GABA-II first samples an ancestor δ_t of u_t with probability P[δ_t = n] ∝ μ_t(n) θ_t(n, r), where r is the root of B. GABA-II then uses the function θ_t(δ_t, ·) to sample action a_t with probability P[a_t = a | δ_t = n] = θ_t(n, a)/θ_t(n, r) in O(ln(K)) time. The law of total probability and (13) can then be used to show that P[a_t = a] ∝ p_{t,a}, where p_{t,a} is as defined in SPECIALISTEXP. On the receipt of ℓ_{t,a_t} we update the functions μ_t and θ_t to μ_{t+1} and θ_{t+1}, noting that by the above equalities and (14) we need only modify the values μ_t(n) and θ_t(n, m) when n is an ancestor of u_t and m is an ancestor of a_t.

GABA-I (Learning rate: η > 0; Model parameter: φ ∈ (0, 1))
0. Construct binary support tree C via the CONSTRUCTBST-C algorithm (see Figure 1).
1. ∀ leaf n ∈ C, ∀i, j ∈ {0, 1}: α_1(n, i, j) ← ⟦i ≠ j⟧ φ 1 + ⟦i = j⟧ (1 − φ) 1 (where 1 denotes the all-ones vector in R^K);
2. For d = 1, 2, . . . , h − 1, ∀n ∈ C at depth h − d, ∀i, j ∈ {0, 1}: α_1(n, i, j) ← Σ_{k∈{0,1}} α_1(◁(n), i, k) α_1(▷(n), k, j);
For t = 1, 2, . . . , T do
3. ∀d ∈ [h] ∪ {0}: ν_{t,d} ← ancestor of u_t at depth d in C;
4. ∀i ∈ {0, 1}: β◁_t(ν_{t,0}, i) ← (1 + ⟦i = 0⟧(K − 2)) 1/K; ∀i ∈ {0, 1}: β▷_t(ν_{t,0}, i) ← 1;
5. For d = 1, 2, . . . , h do
   (a) if ν_{t,d} = ◁(ν_{t,d−1}) then ∀i ∈ {0, 1}:
       i. β◁_t(ν_{t,d}, i) ← β◁_t(ν_{t,d−1}, i);
       ii. β▷_t(ν_{t,d}, i) ← Σ_{j∈{0,1}} α_t(▷(ν_{t,d−1}), i, j) β▷_t(ν_{t,d−1}, j);
   (b) if ν_{t,d} = ▷(ν_{t,d−1}) then ∀i ∈ {0, 1}:
       i. β▷_t(ν_{t,d}, i) ← β▷_t(ν_{t,d−1}, i);
       ii. β◁_t(ν_{t,d}, i) ← Σ_{j∈{0,1}} β◁_t(ν_{t,d−1}, j) α_t(◁(ν_{t,d−1}), j, i);
6. p_t ← (1/K) Σ_{i∈{0,1}} β◁_t(ν_{t,h}, 1) α_t(ν_{t,h}, 1, i) β▷_t(ν_{t,h}, i);
7. Predict a_t ∈ [K] with probability P[a_t = a] = p_{t,a}/‖p_t‖_1;
8. Receive ℓ_{t,a_t};
9. ∀a ∈ [K]: c_{t,a} ← exp(−η ⟦a = a_t⟧ ℓ_{t,a_t} ‖p_t‖_1 / p_{t,a}); π_t ← ‖p_t‖_1 c_t / (p_t · c_t);
10. ∀i ∈ {0, 1}: α_{t+1}(ν_{t,h}, 1, i) ← π_t α_t(ν_{t,h}, 1, i);
11. ∀i ∈ {0, 1}: α_{t+1}(ν_{t,h}, 0, i) ← α_t(ν_{t,h}, 0, i);
12. ∀n ∈ C \ {ν_{t,d} | d ∈ [h] ∪ {0}}, ∀i, j ∈ {0, 1}: α_{t+1}(n, i, j) ← α_t(n, i, j);
13. For d = 1, 2, . . . , h − 1, ∀i, j ∈ {0, 1}: α_{t+1}(ν_{t,h−d}, i, j) ← Σ_{k∈{0,1}} α_{t+1}(◁(ν_{t,h−d}), i, k) α_{t+1}(▷(ν_{t,h−d}), k, j);

Figure 3: GABA-I Algorithm
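Before the full pseudocode of GABA-II (Figure 4 below), here is a minimal sketch, not the paper's implementation, of the O(ln K) action-sampling idea described in Section 4.2: a binary tree over the K actions stores at each internal node the sum of its children's weights, mirroring (14), so that sampling an action proportionally to its weight and rescaling a single weight each touch only one root-to-leaf path.

```python
import random

class SumTree:
    """Binary tree over K (a power of two, in this sketch) non-negative action
    weights. Internal node i stores the sum of its subtree, mirroring (14).
    Sampling an action proportionally to its weight and updating one weight
    both walk a single root-to-leaf path, i.e. O(log K) work."""

    def __init__(self, weights):
        k = len(weights)
        assert k and (k & (k - 1)) == 0, "K must be a power of two in this sketch"
        self.k = k
        self.node = [0.0] * (2 * k)
        for a, w in enumerate(weights):
            self.node[k + a] = float(w)
        for i in range(k - 1, 0, -1):
            self.node[i] = self.node[2 * i] + self.node[2 * i + 1]

    def sample(self, rng=random):
        """Draw action a with probability proportional to its weight."""
        i = 1
        while i < self.k:                   # descend left or right proportionally
            left = self.node[2 * i]
            i = 2 * i if rng.random() * self.node[i] < left else 2 * i + 1
        return i - self.k

    def scale(self, a, factor):
        """Multiply action a's weight by factor (e.g. a loss factor lambda_t)
        and refresh the sums on its root-to-leaf path."""
        i = self.k + a
        self.node[i] *= factor
        i //= 2
        while i >= 1:
            self.node[i] = self.node[2 * i] + self.node[2 * i + 1]
            i //= 2


# Usage: 8 actions, downweight action 3 after observing its loss.
tree = SumTree([1.0] * 8)
a = tree.sample()
tree.scale(3, 0.5)
```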
GABA-II (Learning rate: η > 0)
0. Construct binary support tree C via the CONSTRUCTBST-C algorithm (see Figure 1).
1. Construct a full perfect oriented binary tree B with height g := ⌈log_2(K)⌉, whose first K leaves represent the actions [K]; set r to be the root of B;
2. ∀ vertex n ∈ C:
   (a) μ_1(n) ← 1; ∀ leaf m ∈ B: if m ∈ [K] then θ_1(n, m) ← 1, else θ_1(n, m) ← 0;
   (b) ∀d ∈ {1, 2, . . . , g}, ∀m ∈ B at depth g − d: θ_1(n, m) := θ_1(n, ◁(m)) + θ_1(n, ▷(m));
For t = 1, 2, . . . , T do
3. Draw δ_t from A(u_t) with probability P[δ_t = n] ∝ μ_t(n) θ_t(n, r); ζ_{t,0} ← r;
4. For d = 0, . . . , g − 1: draw ζ_{t,d+1} from {◁(ζ_{t,d}), ▷(ζ_{t,d})} with probability P[ζ_{t,d+1} = m] ∝ θ_t(δ_t, m);
5. Predict a_t ← ζ_{t,g};
6. Receive ℓ_{t,a_t};
7. ψ_t ← Σ_{n∈A(u_t)} μ_t(n) θ_t(n, r); ϱ_t ← Σ_{n∈A(u_t)} μ_t(n) θ_t(n, a_t); λ_t ← exp(−η ℓ_{t,a_t} ψ_t/ϱ_t);
8. ∀n ∈ A(u_t):
   (a) μ_{t+1}(n) ← μ_t(n) ψ_t/(ψ_t − (1 − λ_t) ϱ_t); θ_{t+1}(n, a_t) ← λ_t θ_t(n, a_t);
   (b) ∀m ∈ B \ A(a_t): θ_{t+1}(n, m) := θ_t(n, m);
   (c) For d = 1, 2, . . . , g: θ_{t+1}(n, ζ_{t,g−d}) ← θ_{t+1}(n, ◁(ζ_{t,g−d})) + θ_{t+1}(n, ▷(ζ_{t,g−d}));
9. ∀n ∈ C \ A(u_t): μ_{t+1}(n) := μ_t(n); ∀m ∈ B: θ_{t+1}(n, m) := θ_t(n, m);

Figure 4: GABA-II Algorithm

4.3 Parameter Tuning

A limitation of the GABA-I regret bound is that it depends on knowing the optimal values of the parameters φ and η, and for GABA-II on the parameter η. In the following we will (1) sketch how to autotune φ at little cost and (2) autotune η, although at essentially the cost of moving Ψ(y) outside of the square root.

We first sketch how to automatically tune the parameter φ that appears in GABA-I. Assume, without loss of generality, that N is an integer power of 2. The idea of our tuning method is that, since φ is unknown, we mix over possible values of φ ∈ [0, 1]. In fact, at little cost in regret it is sufficient to mix over just the exponentially increasing values φ = 2/N, 4/N, 8/N, . . . , N/N. Thus each specialist is split into log_2 N specialists, so that the new distribution over specialists is

w_1(s_φ) := (1/log_2 N) valid1(s) (1/K) startfactor(s) (1 − φ)^{N−1−cut(s)} φ^{cut(s)},

with Σ_{s∈𝒮, φ∈{2/N,4/N,...,1}} w_1(s_φ) = 1. Implementing this efficiently is similar to the implementation of GABA-I, except that we now have log_2 N copies of the BST C_φ, each initialized with a different value of φ. On each trial the computed values from the log_2 N copies of the BST C_φ are summed to find the prediction vector. After receipt of the loss, all copies of the BST are updated as in GABA-I. The regret bound of this autotuning with respect to φ is equal, up to an O(√(log log N)) factor, to that of GABA-I with the optimal φ, but comes at the cost of an additional O(log N) factor in the computation time.

Now that we have shown how to automatically tune φ in GABA-I, we are left with the learning rate η in both algorithms. We first note that, with any η, the regret of both algorithms is Υ/η + ηKT/2, where Υ is the robustified resistance-weighted cutsize Ψ(y) multiplied by logarithmic terms (one in GABA-I and two in GABA-II). By setting η = √(2/(KT)) we get a regret of (Υ + 1)√(KT/2). In addition, if T is unknown then a doubling trick can be performed with this result to get a regret bound of O(Υ√(KT)) with no parameters needed. We compare this to the regret bound of O(√(ΥKT)) that comes from the optimal tuning of η. It remains an open problem to bring Υ inside the square root.
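The doubling trick mentioned above admits a compact illustration. The sketch below is an assumed wrapper interface, not the paper's procedure: run_epoch is a stand-in for running a fresh copy of the bandit algorithm (e.g. GABA-I/II) with the η = √(2/(KT)) tuning on an epoch of guessed horizon, and the epochs double in length.

```python
import math

def doubling_trick(run_epoch, K, total_rounds):
    """Horizon-free wrapper: restart a horizon-tuned learner on epochs of doubling length.

    run_epoch(eta, num_rounds) is an illustrative stand-in: it should run a fresh
    copy of the bandit algorithm for num_rounds trials with learning rate eta and
    return the cumulative loss incurred over those trials.
    """
    t, epoch, total_loss = 0, 0, 0.0
    while t < total_rounds:
        horizon = 2 ** epoch                   # guessed horizon for this epoch
        eta = math.sqrt(2.0 / (K * horizon))   # the eta = sqrt(2/(KT)) tuning from the text
        rounds = min(horizon, total_rounds - t)
        total_loss += run_epoch(eta, rounds)   # fresh learner each epoch
        t += rounds
        epoch += 1
    return total_loss
```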
Even with the above knowledge-free tuning of η, our methods improve over the baseline comparator of running an independent EXP3 algorithm for each of the N users in many natural scenarios. Recall that in this case the induced regret is Õ(√(NKT)) (see (2)). Consider a very large social network where the bandit problem is to show 1-of-K advertisements (for simplicity assume K ∈ O(1)) at the nodes (users). Now consider the case that each user is served at most one advert, i.e., there is at most a single trial for any given user. Since N ≥ T, the bound of the baseline is now the vacuous regret Θ(T). We can intuitively see that this analysis is correct, since the baseline algorithm is now just picking a single advertisement uniformly at random from [K] for each user independently. However, observe that when Ψ(y) ∈ o(√T) we get Õ(Ψ(y)√T) ⊆ o(T), which is non-vacuous. Intuitively, GABA-I/II may achieve this because they algorithmically exploit the network structure.

5 Conclusion

We considered a contextual, non-stochastic bandit problem in which the finite set of contexts (a.k.a. users) forms a social network, and the inductive bias is that if the social link between two users is strong then actions that perform well for one of these users are likely to perform well for the other. We gave two highly efficient algorithms for this problem, both with good regret bounds. Since this work is theoretical in nature we cannot foresee any potential negative societal impacts. In the future it may be interesting to investigate extensions of our algorithms to the stochastic setting, as well as to continuous bandit settings. Finally, it would be valuable to study potential applications of our algorithms, with large-scale recommender systems being a natural candidate. On the theory side, our bounds are based on an exponential potential function. Improved adversarial regret bounds were proven for an alternative potential function in [19], and it is an open question whether our techniques can be extended to that potential.

Acknowledgements. Mark Herbster was supported by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. Stephen Pasteris and Massimiliano Pontil were supported in part by EPSRC Grant N. EP/P009069/1 and SAP SE.

References

[1] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 661-670, New York, NY, USA, 2010. Association for Computing Machinery.
[2] Sofía S. Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30(2):199-215, 2015.
[3] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97-114, 2008.
[4] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817-1853, December 2005.
[5] Andreas Maurer and Massimiliano Pontil. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pages 55-76, 2013.
[6] Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424-433, 2019.
[7] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. J. Mach. Learn. Res., 11:2901-2934, December 2010.
[8] Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261-269, Fort Lauderdale, FL, USA, 20-22 Apr 2017. PMLR.
[9] Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018.
[10] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1566-1575, Long Beach, California, USA, 09-15 Jun 2019. PMLR.
[11] Anastasia Pentina and Ruth Urner. Lifelong learning with weighted majority votes. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3612-3620. Curran Associates, Inc., 2016.
[12] Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415-444, 2001.
[13] M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In NIPS, 2008.
[14] Nicolò Cesa-Bianchi, C. Gentile, Fabio Vitale, and Giovanni Zappella. Random spanning trees and the prediction of weighted graphs. In ICML, 2010.
[15] Y. Freund, R. Schapire, Y. Singer, and Manfred K. Warmuth. Using and combining predictors that specialize. In STOC '97, 1997.
[16] Wouter M. Koolen, Dmitry Adamskiy, and Manfred K. Warmuth. Putting Bayes to sleep. In NIPS, 2012.
[17] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1-286, 2019.
[18] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, January 2003.
[19] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(94):2785-2836, 2010.
[20] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[21] N. Cesa-Bianchi, C. Gentile, and Giovanni Zappella. A gang of bandits. In NIPS, 2013.
[22] Kaige Yang, X. Dong, and L. Toni. Laplacian-regularized graph bandits: Algorithms and theoretical analysis. In AISTATS, 2020.
[23] Sharan Vaswani, M. Schmidt, and L. Lakshmanan. Horde of bandits using Gaussian Markov random fields. In AISTATS, 2017.
[24] Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. Contextual bandits in a collaborative environment. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.
[25] Xin Wang, S. Hoi, Chenghao Liu, and M. Ester. Interactive social recommendation. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017.
[26] Huazheng Wang, Qingyun Wu, and Hongning Wang. Factorization bandits for interactive recommendation. In AAAI, 2017.
[27] C. Gentile, S. Li, and Giovanni Zappella. Online clustering of bandits. In ICML, 2014.
[28] S. Li, C. Gentile, and Alexandros Karatzoglou. Graph clustering bandits for recommendation. arXiv, abs/1605.00596, 2016.
[29] C. Gentile, S. Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. In ICML, 2017.
[30] S. Li, C. Gentile, Alexandros Karatzoglou, and Giovanni Zappella. Data-dependent clustering in exploration-exploitation algorithms. arXiv, abs/1502.03473, 2015.
[31] S. Li and Purushottam Kar. Context-aware bandits. arXiv, abs/1510.03164, 2015.
[32] S. Li, C. Gentile, Alexandros Karatzoglou, and Giovanni Zappella. Online context-dependent clustering in recommendations based on exploration-exploitation algorithms. arXiv, abs/1608.03544, 2016.
[33] S. Li, Alexandros Karatzoglou, and C. Gentile. Collaborative filtering bandits. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.
[34] Kaige Yang and L. Toni. Graph-based recommendation system. 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 798-802, 2018.
[35] Trong T. Nguyen and H. W. Lauw. Dynamic clustering of contextual multi-armed bandits. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, 2014.
[36] Zongguo Wang, Chicheng Zhang, M. Singh, L. Riek, and K. Chaudhuri. Multitask bandit learning through heterogeneous feedback aggregation. In AISTATS, 2021.
[37] Zongguo Wang, M. Singh, Chicheng Zhang, L. Riek, and K. Chaudhuri. Stochastic multi-player bandit learning from player-dependent feedback. 2020.
[38] Hassan Saber, Pierre Ménard, and Odalric-Ambrym Maillard. Optimal strategies for graph-structured bandits. arXiv, abs/2007.03224, 2020.
[39] Sabina Tomkins, Peng Liao, P. Klasnja, and S. Murphy. IntelligentPooling: Practical Thompson sampling for mHealth. arXiv, abs/2008.01571, 2020.
[40] Silviu Maniu, Stratis Ioannidis, and B. Cautis. Bandits under the influence. 2020 IEEE International Conference on Data Mining (ICDM), pages 1172-1177, 2020.
[41] L. Yang, Bo Liu, L. Lin, Feng Xia, Kai Chen, and Qiang Yang. Exploring clustering of bandits for online recommendation system. Fourteenth ACM Conference on Recommender Systems, 2020.
[42] Huazheng Wang, Qingyun Wu, and Hongning Wang. Learning hidden features for contextual bandits. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016.
[43] Swapna Buccapatnam, A. Eryilmaz, and N. Shroff. Multi-armed bandits in the presence of side observations in social networks. 52nd IEEE Conference on Decision and Control, pages 7309-7314, 2013.
[44] S. Kar, H. Poor, and S. Cui. Bandit problems in networks: Asymptotically efficient distributed allocation rules. IEEE Conference on Decision and Control and European Control Conference, pages 1771-1778, 2011.
[45] Balázs Szörényi, R. Busa-Fekete, I. Hegedüs, Róbert Ormándi, M. Jelasity, and B. Kégl. Gossip-based distributed stochastic bandit algorithms. In ICML, 2013.
[46] N. Korda, Balázs Szörényi, and S. Li. Distributed clustering of linear bandits in peer to peer networks. arXiv, abs/1604.07706, 2016.
[47] Abishek Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3:1-35, 2019.
[48] L. E. Celis and F. Salehi. Lean from thy neighbor: Stochastic and adversarial bandits in a network. arXiv, abs/1704.04470, 2017.
[49] Yogev Bar-On and Y. Mansour. Individual regret in cooperative nonstochastic multi-armed bandits. In NeurIPS, 2019.
[50] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. arXiv, abs/1602.04741, 2016.
[51] Bo Liu, Y. Wei, Y. Zhang, Z. Yan, and Qiang Yang. Transferable contextual bandit for cross-domain recommendation. In AAAI, 2018.
[52] L. Cella, A. Lazaric, and M. Pontil. Meta-learning with stochastic linear bandits. arXiv, abs/2005.08531, 2020.
[53] Michal Valko, R. Munos, B. Kveton, and Tomás Kocák. Spectral bandits for smooth graph functions. In ICML, 2014.
[54] M. K. Hanawal, Venkatesh Saligrama, Michal Valko, and R. Munos. Cheap bandits. arXiv, abs/1506.04782, 2015.
[55] A. Deshmukh, Ü. Dogan, and C. Scott. Multi-task learning for contextual bandits. In NIPS, 2017.
[56] Michal Valko. Bandits on graphs and structures. Dissertation, École normale supérieure de Cachan, 2016.
[57] M. K. Hanawal and Venkatesh Saligrama. Efficient detection and localization on graph structured data. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5590-5594, 2015.
[58] M. K. Hanawal and Venkatesh Saligrama. Cost effective algorithms for spectral bandits. 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1323-1329, 2015.
[59] Sylvain Lamprier, Thibault Gisselbrecht, and P. Gallinari. Variational Thompson sampling for relational recurrent bandits. In ECML/PKDD, 2017.
[60] Aadirupa Saha, Shreyas Sheshadri, and C. Bhattacharyya. Be greedy: How chromatic number meets regret minimization in graph bandits. In UAI, 2019.
[61] N. Alon, N. Cesa-Bianchi, C. Gentile, Shie Mannor, Y. Mansour, and O. Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput., 46:1785-1826, 2017.
[62] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren. Online learning with feedback graphs: Beyond bandits. In COLT, 2015.
[63] Meng Fang and D. Tao. Networked bandits with disjoint linear payoffs. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
[64] M. Herbster, Stephen Pasteris, and Fabio Vitale. Online sum-product computation over trees. In NIPS, 2012.
[65] M. Herbster and J. Robinson. Online prediction of switching graph labelings with cluster specialists. In NeurIPS, 2019.
[66] Russell Lyons and Yuval Peres. Probability on Trees and Networks. Cambridge University Press, 2017.
[67] Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. Random spanning trees and the prediction of weighted graphs. Journal of Machine Learning Research, 14(2):1251-1284, 2013.
[68] R. Pemantle. Uniform random spanning trees. arXiv: Probability, 2004.
[69] Aaron Schild. An almost-linear time algorithm for uniform random spanning tree generation. Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.
[70] D. Durfee, Rasmus Kyng, J. Peebles, A. Rao, and Sushant Sachdeva. Sampling random spanning trees faster than matrix multiplication. Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017.
[71] Volodimir G. Vovk. Aggregating strategies. Proc. of Computational Learning Theory, 1990.
[72] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906-1925, 1998.
[73] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[74] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[75] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[76] Leslie Ann Goldberg and Mark Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability & Computing, 16(1):43-61, 2007.
[77] Arthur L. Delcher, Adam J. Grove, Simon Kasif, and Judea Pearl. Logarithmic-time updates and queries in probabilistic networks. CoRR, abs/1408.1479, 2014.
[78] Guy Bresler, D. Shah, and L. F. Voloch. Collaborative filtering with low regret. Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, 2016.
[79] Guy Bresler, D. Shah, and L. F. Voloch. Regret guarantees for item-item collaborative filtering. arXiv: Learning, 2015.
[80] Nicolò Cesa-Bianchi and G. Lugosi. Combinatorial bandits. J. Comput. Syst. Sci., 78:1404-1422, 2009.
[81] Alon Cohen, Tamir Hazan, and T. Koren. Tight bounds for bandit combinatorial optimization. In COLT, 2017.
[82] S. Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and K. Kawarabayashi. Improved regret bounds for bandit combinatorial optimization. In NeurIPS, 2019.
[83] Kareem Amin, M. Kearns, and U. Syed. Graphical models for bandit problems. In UAI, 2011.
[84] Aleksandrs Slivkins. Contextual bandits with similarity information. J. Mach. Learn. Res., 15:2533-2568, 2011.
[85] Qingyun Wu, N. Iyer, and Hongning Wang. Learning contextual bandits in a non-stationary environment. The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018.
[86] Qingyun Wu, Huazheng Wang, Yanen Li, and Hongning Wang. Dynamic ensemble of contextual bandits to satisfy users' changing interests. The World Wide Web Conference, 2019.
[87] Xiao Xu, Fang Dong, Yanghua Li, Shaojian He, and X. Li. Contextual-bandit based personalized recommendation with time-varying user interests. arXiv, abs/2003.00359, 2020.
[88] Chuanhao Li, Qingyun Wu, and Hongning Wang. Unifying clustered and non-stationary bandits. In AISTATS, 2021.
[89] Mark Herbster, Stephen Pasteris, and Lisa Tse. Online multitask learning with long-term memory. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17779-17791. Curran Associates, Inc., 2020.