# Unimodal Thompson Sampling for Graph-Structured Arms

Stefano Paladino, Francesco Trovò, Marcello Restelli, Nicola Gatti
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano, Milano, Italy
{stefano.paladino, francesco1.trovo, marcello.restelli, nicola.gatti}@polimi.it

We study, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, for any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In our paper, we design a Thompson Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound for the considered setting. We show that, as happens in a wide range of scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our algorithm and of state-of-the-art algorithms as the properties of the graph vary.

Introduction

Multi-Armed Bandit (MAB) algorithms (Auer, Cesa-Bianchi, and Fischer 2002) have been proven to provide effective solutions for a wide range of applications fitting the sequential decision-making scenario. In this framework, at each round over a finite horizon T, the learner selects an action (usually called arm) from a finite set and observes only the reward corresponding to the choice she made. The goal of a MAB algorithm is to converge to the optimal arm, i.e., the one with the highest expected reward, while minimizing the loss incurred in the learning process; its performance is therefore measured through the expected pseudo-regret, defined as the difference between the expected reward achieved by an oracle algorithm always selecting the optimal arm and the one achieved by the considered algorithm. We focus on the so-called Unimodal MAB (UMAB), introduced in (Combes and Proutiere 2014a), in which each arm corresponds to a node of a graph and each edge is associated with a relationship specifying which node of the edge gives the largest expected reward (thus providing a partial ordering over the arm space). Furthermore, from any node there is a path leading to the unique node with the maximum expected reward, along which the expected reward is monotonically increasing. While the graph structure may (but need not) be known a priori by the UMAB algorithm, the relationship defined over the edges is discovered during learning. In the present paper, we propose a novel algorithm relying on the Bayesian learning approach for a generic UMAB setting.

Models presenting a graph structure have become more and more interesting in recent years due to the spread of social networks. Indeed, the relationships among the entities of a social network have a natural graph structure. A practical problem in this scenario is the targeted advertisement problem, whose goal is to discover the part of the network that is interested in a given product.
This task is heavily influenced by the graph structure, since in social networks people tend to have characteristics similar to those of their friends (i.e., neighboring nodes in the graph); as a consequence, the interests of people in a social network change smoothly, and neighboring nodes in the graph look similar to each other (McPherson, Smith-Lovin, and Cook 2001; Crandall et al. 2008). More specifically, an advertiser aims at finding those users that maximize the ad expected revenue (i.e., the product between click probability and value per click), while at the same time reducing the number of times the advertisement is presented to people not interested in its content. Under the assumption of unimodal expected rewards, the learner can move from low expected rewards to high ones just by climbing them along the graph, removing the need for an exploration over all the graph nodes. This assumption reduces the complexity of the search for the optimal arm, since the learning algorithm can avoid pulling the arms corresponding to some subset of non-optimal nodes, thus reducing the regret. Other applications might benefit from this structure, e.g., recommender systems, which aim at coupling items with those users who are likely to enjoy them. Similarly, the use of the unimodal graph structure might provide more meaningful recommendations without testing all the users in the social network. Finally, notice that unimodal problems with a single variable, e.g., sequential pricing (Jia and Mannor 2011), bidding in online sponsored search auctions (Edelman and Ostrovsky 2007), and single-peaked preferences in economics and voting settings (Mas-Colell, Whinston, and Green 1995), are graph-structured problems in which the graph is a line.

Frequentist approaches for UMABs with graph structure are proposed in (Jia and Mannor 2011) and (Combes and Proutiere 2014a). Jia and Mannor (2011) introduce the GLSE algorithm with a regret of order $O(\sqrt{T \log(T)})$. However, GLSE performs better than classical bandit algorithms only when the number of arms is $\Theta(T)$. Combes and Proutiere (2014a) present the OSUB algorithm, based on KLUCB, achieving an asymptotic regret of $O(\log(T))$ and outperforming GLSE in settings with a few arms. To the best of our knowledge, no Bayesian approach has been proposed for unimodal bandit settings, including the UMAB setting we study. However, it is well known that Bayesian MAB algorithms (the most popular being Thompson Sampling (TS)) usually suffer a regret of the same order as the best frequentist ones (e.g., in unstructured settings (Kaufmann, Korda, and Munos 2012)), but they outperform the frequentist methods in a wide range of problems (e.g., in bandit problems without structure (Chapelle and Li 2011) and in bandit problems with budget (Xia et al. 2015)). Furthermore, in problems with structure, the classical Thompson Sampling (not exploiting the problem structure) may outperform frequentist algorithms exploiting the problem structure. For this reason, in this paper we explore Bayesian approaches for the UMAB setting.
More precisely, we provide the following original contributions: (i) we design a novel Bayesian MAB algorithm, called UTS, based on the TS algorithm; (ii) we derive a tight upper bound over the pseudo-regret of UTS, which asymptotically matches the lower bound for the UMAB setting; (iii) we describe a wide experimental campaign showing that the performance of UTS in applicative scenarios is better than that of state-of-the-art algorithms, evaluating also how the performance of the algorithms (ours and those of the state of the art) varies as the graph structure properties vary.

Related work

Here, we mention the main works related to ours. Some works deal with unimodal reward functions in continuous-armed bandit settings (Jia and Mannor 2011; Combes and Proutiere 2014b; Kleinberg, Slivkins, and Upfal 2008). In (Jia and Mannor 2011) a successive elimination algorithm, called LSE, is proposed, achieving a regret of $O(\sqrt{T \log T})$. In this case, assumptions over the minimum local decrease and increase of the expected reward are required. Combes and Proutiere (2014b) consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. They propose the SP algorithm, based on a stochastic pentachotomy procedure to narrow the search space. Unimodal MABs on metric spaces are studied in (Kleinberg, Slivkins, and Upfal 2008). An application-dependent solution for recommendation systems, which exploits the similarity graph of a social network for targeted advertisement, has been proposed in (Valko et al. 2014). Similar information has been considered in (Caron and Bhagat 2013), where the problem of cold-start users (i.e., new users) is studied. Another type of structure considered in sequential games is the monotonicity of the conversion rate in the price (Trovò et al. 2015). Interestingly, the assumptions of monotonicity and unimodality are orthogonal, neither being a special case of the other; therefore, the results for the monotonic setting cannot be used in unimodal bandits. In (Alon et al. 2013; Mannor and Shamir 2011), a graph structure over the arm feedback in an adversarial setting is studied. More precisely, they assume correlation over the observed rewards and not over the expected values of the arms.

Problem Formulation

A learner receives as input a finite undirected graph MAB setting $G = (A, E)$, whose vertices $A = \{a_1, \ldots, a_K\}$ with $K \in \mathbb{N}$ correspond to the arms, and an edge $(a_i, a_j) \in E$ exists only if there is a direct partial-order relationship between the expected rewards of arms $a_i$ and $a_j$. The learner knows a priori the nodes and the edges (i.e., she knows the graph), but, for each edge, she does not know a priori which node of the edge has the largest expected reward (i.e., she does not know the ordering relationship). At each round $t$ over a time horizon of $T \in \mathbb{N}$, the learner selects an arm $a_i$ and gains the corresponding reward $x_{i,t}$. This reward is drawn from an i.i.d. random variable $X_{i,t}$ (i.e., we consider a stochastic MAB setting) characterized by an unknown distribution $\mathcal{D}_i$ with finite known support $\Omega \subset \mathbb{R}$ (as customary in MAB settings, from now on we consider $\Omega \subseteq [0, 1]$) and by an unknown expected value $\mu_i := \mathbb{E}[X_{i,t}]$. We assume that there is a single optimal arm, i.e., there exists a unique arm $a_{i^*}$ s.t. its expected value satisfies $\mu_{i^*} = \max_i \mu_i$ and, for the sake of notation, we denote $\mu_{i^*}$ with $\mu^*$. Here we analyze a graph bandit setting with the unimodality property, defined as follows.
Definition 1. A graph unimodal MAB (UMAB) setting $G = (A, E)$ is a graph bandit setting $G$ s.t. for each suboptimal arm $a_i$, $i \neq i^*$, there exists a finite path $p = (i_1 = i, \ldots, i_m = i^*)$ s.t. $\mu_{i_k} < \mu_{i_{k+1}}$ and $(a_{i_k}, a_{i_{k+1}}) \in E$ for each $k \in \{1, \ldots, m-1\}$.

This definition assures that, if one is able to identify a path in $G$ along which the expected rewards are increasing, she is able to reach the optimal arm without getting stuck in local optima. Note that the unimodality property implies that the graph $G$ is connected, and therefore we consider only connected graphs from here on. A policy $\mathcal{U}$ over a UMAB setting is a procedure able to select at each round $t$ an arm $a_{i_t}$ based on the history $h_t$, i.e., the sequence of past selected arms and past rewards gained. The pseudo-regret $R_T(\mathcal{U})$ of a generic policy $\mathcal{U}$ over a UMAB setting is defined as:

$$R_T(\mathcal{U}) := T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_{i_t,t}\right], \qquad (1)$$

where the expected value $\mathbb{E}[\cdot]$ is taken w.r.t. the stochasticity of the gained rewards $X_{i_t,t}$ and of the policy $\mathcal{U}$. Let us define the neighborhood of arm $a_i$ as $N(i) := \{j \mid (a_i, a_j) \in E\}$, i.e., the set of the indexes $j$ of the arms $a_j$ connected to the arm $a_i$ by an edge $(a_i, a_j) \in E$. It has been shown in (Combes and Proutiere 2014a) that the problem of learning in a UMAB setting presents a lower bound over the regret $R_T(\mathcal{U})$ of the following form:

Theorem 1. Let $\mathcal{U}$ be a uniformly good policy, i.e., a policy s.t. $R_T(\mathcal{U}) = o(T^c)$ for each $c > 0$. Given a UMAB setting $G = (A, E)$ we have:

$$\liminf_{T \to \infty} \frac{R_T(\mathcal{U})}{\log(T)} \geq \sum_{i \in N(i^*)} \frac{\mu^* - \mu_i}{KL(\mu_i, \mu^*)}, \qquad (2)$$

where $KL(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$, i.e., the Kullback-Leibler divergence of two Bernoulli distributions with means $p$ and $q$, respectively.

This result is similar to the one provided in (Lai and Robbins 1985), with the only difference that the summation is restricted to the arms lying in the neighborhood of the optimal arm $N(i^*)$, and it reduces to it when the optimal arm is connected to all the others (i.e., $N(i^*) \equiv \{1, \ldots, K\}$) or the graph is completely connected (i.e., $N(i) \equiv \{1, \ldots, K\}$ for each $i$). We would like to point out that, by relying on the assumption of having a single maximum of the expected rewards, we also ensure that the optimal arm neighborhood $N(i^*)$ is uniquely defined and, thus, the lower bound inequality in Equation 2 is well defined.
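As a concrete illustration of Theorem 1, the following minimal Python sketch (ours, not part of the paper; the function names and the example graph are purely illustrative) computes the Kullback-Leibler divergence between Bernoulli distributions and the asymptotic constant on the right-hand side of Equation 2.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def umab_lower_bound_constant(mu, neighbors):
    """Constant of Equation 2: sum over the neighbors of the optimal arm of
    (mu* - mu_i) / KL(mu_i, mu*)."""
    i_star = max(range(len(mu)), key=lambda i: mu[i])
    return sum((mu[i_star] - mu[i]) / kl_bernoulli(mu[i], mu[i_star])
               for i in neighbors[i_star])

# Example: a 5-arm line graph whose expected rewards peak at the middle arm.
mu = [0.1, 0.5, 0.9, 0.5, 0.1]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(umab_lower_bound_constant(mu, neighbors))
```

In this example only the two arms adjacent to the optimal one enter the summation, which is exactly what makes the bound smaller than the unstructured bound of (Lai and Robbins 1985).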
The UTS algorithm

We describe the UTS algorithm and we show that its regret is asymptotically optimal, i.e., it asymptotically matches the lower bound of Theorem 1. The algorithm is an extension of Thompson Sampling (Thompson 1933) that exploits the graph structure and the unimodality property of the UMAB setting. Basically, the rationale of the algorithm is to apply a simple variation of the TS algorithm only to the arms associated with the nodes that compose the neighborhood of the arm with the highest empirical mean reward, called the leader.

The UTS pseudo code

The pseudo code of the UTS algorithm is presented in Algorithm 1. The algorithm receives as input the graph structure $G$, the time horizon $T$, and a Bayesian prior $\pi_i$ for each expected reward $\mu_i$. At each round $t$, the algorithm computes the empirical expected reward for each arm (Line 3):

$$\hat{\mu}_{i,t} = \begin{cases} \frac{S_{i,t}}{T_{i,t}} & \text{if } T_{i,t} > 0 \\ 0 & \text{otherwise} \end{cases},$$

where $S_{i,t} = \sum_{h=1}^{t-1} X_{i,h}\,\mathbb{1}\{\mathcal{U}(h) = a_i\}$ is the cumulative reward of arm $a_i$ up to round $t$ and $T_{i,t} = \sum_{h=1}^{t-1} \mathbb{1}\{\mathcal{U}(h) = a_i\}$ is the number of times the arm $a_i$ has been pulled up to round $t$.¹ After that, UTS selects the arm denoted as the leader $a_{l(t)}$ for round $t$, i.e., the one having the maximum empirical expected reward:

$$a_{l(t)} = \arg\max_{a_i \in A} \hat{\mu}_{i,t}. \qquad (3)$$

¹ We here denote with $\mathbb{1}\{\cdot\}$ the indicator function.

Algorithm 1 UTS
1: Input: UMAB setting $G = (A, E)$, horizon $T$, priors $\{\pi_i\}_{i=1}^{K}$
2: for $t \in \{1, \ldots, T\}$ do
3:   Compute $\hat{\mu}_{i,t}$ for each $i \in \{1, \ldots, K\}$
4:   Find the leader $a_{l(t)}$
5:   if $L_{l(t),t} \bmod |N^+(l(t))| = 0$ then
6:     Pull the leader and collect reward $x_{l(t),t}$
7:   else
8:     Draw $\theta_{i,t}$ from $\pi_{i,t}$ for each $i \in N^+(l(t))$
9:     Collect reward $x_{i_t,t}$, where $i_t = \arg\max_{i \in N^+(l(t))} \theta_{i,t}$

Once the leader has been chosen, we restrict the selection procedure to it and its neighborhood, considering only the arms with indexes in $N^+(l(t)) := N(l(t)) \cup \{l(t)\}$. Denote with $L_{i,t} := \sum_{h=1}^{t-1} \mathbb{1}\{l(h) = i\}$ the number of times the arm $a_i$ has been selected as leader before round $t$. If $L_{l(t),t}$ is a multiple of $|N^+(l(t))|$, then the leader is pulled and the reward $x_{l(t),t}$ is gained (Line 6).² Otherwise, the TS algorithm is performed over the arms $a_i$ s.t. $i \in N^+(l(t))$ (Lines 8-9). Basically, under the assumption of having a prior $\pi_i$, we can compute the posterior distribution $\pi_{i,t}$ for $\mu_i$ after $t$ rounds, using the information gathered from the rounds in which $a_i$ has been pulled. We denote with $\theta_{i,t}$ a sample drawn from $\pi_{i,t}$, called the Thompson sample. For instance, for Bernoulli rewards and assuming uniform priors, we have $\pi_{i,t} = \text{Beta}(1 + S_{i,t}, 1 + T_{i,t} - S_{i,t})$, where $\text{Beta}(\alpha, \beta)$ is the beta distribution with parameters $\alpha$ and $\beta$. Finally, the UTS algorithm pulls the arm with the largest Thompson sample $\theta_{i,t}$ and collects the corresponding reward $x_{i_t,t}$. See (Kaufmann, Korda, and Munos 2012) for further details.

² We here denote with $|\cdot|$ the cardinality operator.

Remark 1. Assuming that the UTS algorithm receives as input the whole graph $G$ is unnecessary. The algorithm just requires an oracle that, at each round $t$, is able to return the neighborhood $N(l(t))$ of the arm which is currently the leader $a_{l(t)}$. This is crucial in all the applications in which the graph is discovered by means of a series of queries and the queries have a non-negligible cost (e.g., in social networks a query might be computationally costly). Finally, we remark that the frequentist counterpart of our algorithm (i.e., the OSUB algorithm) requires the computation of the maximum node degree $\gamma := \max_i |N(i)|$, thus requiring at least an initial analysis of the entire graph $G$.
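The following Python sketch is a minimal, self-contained interpretation of Algorithm 1 for Bernoulli rewards with uniform Beta priors; the class, its attribute names, and the small simulation loop are ours and purely illustrative, not the authors' implementation.

```python
import random

class UTS:
    """Minimal sketch of Unimodal Thompson Sampling for Bernoulli rewards
    with uniform Beta(1, 1) priors (cf. Algorithm 1)."""

    def __init__(self, neighbors):
        self.neighbors = neighbors          # adjacency lists: node index -> list of node indexes
        self.K = len(neighbors)
        self.S = [0.0] * self.K             # cumulative rewards S_{i,t}
        self.N = [0] * self.K               # pull counts T_{i,t}
        self.L = [0] * self.K               # leader counts L_{i,t}

    def select_arm(self):
        # Line 3: empirical means; Line 4: the leader is the arm maximizing them.
        mu_hat = [self.S[i] / self.N[i] if self.N[i] > 0 else 0.0 for i in range(self.K)]
        leader = max(range(self.K), key=lambda i: mu_hat[i])
        candidates = self.neighbors[leader] + [leader]          # N^+(l(t))
        pull_leader = (self.L[leader] % len(candidates) == 0)   # Line 5
        self.L[leader] += 1
        if pull_leader:
            return leader                                       # Line 6
        # Lines 8-9: Thompson Sampling restricted to the leader's neighborhood.
        theta = {i: random.betavariate(1 + self.S[i], 1 + self.N[i] - self.S[i])
                 for i in candidates}
        return max(theta, key=theta.get)

    def update(self, arm, reward):
        self.S[arm] += reward
        self.N[arm] += 1

# Usage on a 5-arm line graph with unimodal Bernoulli expected rewards.
mu = [0.1, 0.3, 0.5, 0.7, 0.9]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
bandit = UTS(neighbors)
for t in range(10_000):
    arm = bandit.select_arm()
    bandit.update(arm, 1.0 if random.random() < mu[arm] else 0.0)
print(bandit.N)  # pulls should concentrate on the optimal arm (index 4)
```

Note that, as in Line 5, the leader is pulled deterministically once every $|N^+(l(t))|$ rounds in which it is the leader, a property that the analysis below exploits through the Hoeffding inequality.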
Finite time analysis of UTS

Theorem 2. Given a UMAB setting $G = (A, E)$, the expected pseudo-regret of the UTS algorithm satisfies, for every $\varepsilon > 0$:

$$R_T(\text{UTS}) \leq (1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu^* - \mu_i}{KL(\mu_i, \mu^*)} \left[\log(T) + \log\log(T)\right] + C,$$

where $C > 0$ is a constant depending on $\varepsilon$, the number of arms $K$, and the expected rewards $\{\mu_1, \ldots, \mu_K\}$.

Sketch of proof. (The complete version of the proof is reported in (Paladino et al. 2016).) At first, we remark that a straightforward application of the proof provided for OSUB is not possible in the case of UTS. Indeed, the use of frequentist upper bounds over the expected reward in OSUB implies that, in finite time and with high probability, the bounds are ordered as the expected values. Since we are using a Bayesian algorithm, we would require the same assurance over the Thompson samples $\theta_{i,t}$, but we do not have a direct bound over $\mathbb{P}(\theta_{i,t} > \theta_{i',t})$, where $a_{i'}$ is the optimal arm in the neighborhood $N^+(i)$. This fact requires us to follow a completely different strategy when we analyze the case in which the leader is not the optimal arm. The regret of the UTS algorithm $R_T(\text{UTS})$ can be divided into two parts: the one obtained during those rounds in which the optimal arm $a_{i^*}$ is the leader, called $R_1$, and the summation of the regrets obtained in the rounds in which the leader is an arm $a_i \neq a_{i^*}$, each called $R_i$.

$R_1$ is obtained when $a_{i^*}$ is the leader; thus, the UTS algorithm behaves like Thompson Sampling restricted to the optimal arm and its neighborhood $N^+(i^*)$, and the regret upper bound is the one derived in (Kaufmann, Korda, and Munos 2012) for the TS algorithm. $R_i$ is upper bounded by the expected number of rounds $\mathbb{E}[L_{i,T}]$ the arm $a_i$ has been selected as leader over the horizon $T$. Let us consider $\hat{L}_{i,T}$, defined as the number of rounds spent with $a_i$ as leader when restricting the problem to its neighborhood $N^+(i)$. $\mathbb{E}[\hat{L}_{i,T}]$ is an upper bound over $\mathbb{E}[L_{i,T}]$, since there is a nonzero probability that the UTS algorithm moves to another neighborhood. Since $i \neq i^*$ and the setting is unimodal, there exists a best adjacent arm $a_{i'}$, $i' \neq i$, among those in the neighborhood $N(i)$ s.t. $\mu_{i'} = \max_{j \mid a_j \in N(i)} \mu_j > \mu_i$, and $a_i$ being the leader implies $\hat{\mu}_{i,t} \geq \hat{\mu}_{i',t}$. Thus:

$$R_i \leq \mathbb{E}[\hat{L}_{i,T}] = \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\left\{\hat{\mu}_{i,t} = \max_{a_j \in N^+(i)} \hat{\mu}_{j,t}\right\}\right] \leq \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i,t} \geq \hat{\mu}_{i',t}\right) \leq \underbrace{\sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i,t} \geq \mu_i + \frac{\Delta_i}{2}\right)}_{R_{i1}} + \underbrace{\sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i',t} \leq \mu_{i'} - \frac{\Delta_i}{2}\right)}_{R_{i2}},$$

where $\Delta_i = \max_{i' \mid a_{i'} \in N(i)} \mu_{i'} - \mu_i$ is the expected loss incurred in choosing $a_i$ instead of its best adjacent arm $a_{i'}$. $R_{i1}$ can be upper bounded by a constant by relying on the definition of conditional probability and the Hoeffding inequality (Hoeffding 1963). Specifically, we rely on the fact that the leader has been pulled at least $L_{l(t),t}/|N^+(l(t))|$ times. Upper bounding $R_{i2}$ by a constant term requires the use of Proposition 1 in (Kaufmann, Korda, and Munos 2012), which bounds the expected number of rounds in which the optimal arm has been pulled less than $t^b$ times by TS, where $b \in (0, 1)$ is a constant, and the use of a technique already used for $R_{i1}$. Summing up the regret over $i \neq i^*$ and considering the three obtained bounds concludes the proof.

Experimental Evaluation

In this section, we compare the empirical performance of the proposed algorithm UTS with the performance of a number of algorithms. We study the performance of the state-of-the-art algorithm OSUB (Combes and Proutiere 2014a) to evaluate the improvement due to the employment of Bayesian approaches w.r.t. frequentist approaches. Furthermore, we study the performance of TS (Thompson 1933) to evaluate the improvement in Bayesian approaches due to the exploitation of the problem structure. For completeness, we also study the performance of KLUCB (Garivier and Cappé 2011), being a frequentist algorithm that is optimal for Bernoulli distributions.

Figures of merit

Given a policy $\mathcal{U}$, we evaluate the average and 95% confidence intervals of the following figures of merit: the pseudo-regret $R_T(\mathcal{U})$ as defined in Equation 1 (the lower $R_T(\mathcal{U})$, the better the performance), and the regret ratio $R_\%(\mathcal{U}_1, \mathcal{U}_2) = \frac{R_T(\mathcal{U}_1)}{R_T(\mathcal{U}_2)}$, showing the ratio between the total regret of policy $\mathcal{U}_1$ after $T$ rounds and the one obtained with $\mathcal{U}_2$ (the lower $R_\%(\mathcal{U}_1, \mathcal{U}_2)$, the larger the relative improvement of $\mathcal{U}_1$ w.r.t. $\mathcal{U}_2$).
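As a minimal sketch (ours, with illustrative names) of how these two quantities can be estimated from a single run, given the vector of expected rewards and the sequence of pulled arms:

```python
def pseudo_regret(mu, pulled_arms):
    """Pseudo-regret of Equation 1 estimated from a trace: cumulated gap between
    mu* and the expected reward of each pulled arm."""
    mu_star = max(mu)
    return sum(mu_star - mu[arm] for arm in pulled_arms)

def regret_ratio(mu, pulled_arms_1, pulled_arms_2):
    """Regret ratio R%(U1, U2) between two policies run over the same horizon T."""
    return pseudo_regret(mu, pulled_arms_1) / pseudo_regret(mu, pulled_arms_2)
```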
Line graphs

We initially consider the same experimental settings, composed of line graphs, that are studied in (Combes and Proutiere 2014a). They consider graphs with $K \in \{17, 129\}$ arms, where the arms are ordered on a line from the arm with the smallest index to the arm with the largest index, and with Bernoulli rewards whose averages have a triangular shape with the maximum on the arm in the middle of the line. More precisely, the minimum average reward is 0.1, associated with arms $a_1$ and $a_{17}$ when $K = 17$ and with arms $a_1$ and $a_{129}$ when $K = 129$, while the maximum average reward is $\mu^* = 0.9$, associated with arm $a_9$ when $K = 17$ and with arm $a_{65}$ when $K = 129$. The averages decrease linearly from the maximum one to the minimum one. For both experiments, we average the regret over 100 independent trials of length $T = 10^5$. We report $R_t(\mathcal{U})$ for each policy $\mathcal{U}$ as $t$ varies in Fig. 1(a), for $K = 17$, and in Fig. 1(b), for $K = 129$. The UTS algorithm outperforms all the other algorithms along the whole time horizon, providing a significant improvement in terms of regret w.r.t. the state-of-the-art algorithms.

[Figure 1: Results for the pseudo-regret $R_t(\mathcal{U})$ in line graph settings with $K = 17$ (a) and $K = 129$ (b) as defined in (Combes and Proutiere 2014a).]

In order to have a more precise evaluation of the reduction of the regret w.r.t. the OSUB algorithm, we report $R_\%(\mathcal{U}, \text{OSUB})$ in Tab. 1.

Table 1: Results concerning $R_\%(\mathcal{U}, \text{OSUB})$ in the setting with $K = 17$ and $K = 129$ and a line graph.

|       | K = 17      | K = 129     |
|-------|-------------|-------------|
| KLUCB | 3.08 ± 0.05 | 6.51 ± 0.07 |
| TS    | 1.34 ± 0.07 | 2.68 ± 0.05 |
| UTS   | 0.52 ± 0.07 | 0.76 ± 0.15 |

As also confirmed below by a more exhaustive series of experiments, in line graphs the relative improvement of performance due to UTS w.r.t. OSUB reduces as the number of arms increases, while the relative improvement of performance due to UTS w.r.t. TS increases as the number of arms increases.

Erdős-Rényi graphs

To provide a thorough experimental evaluation of the considered algorithms in settings in which the space of arms has a graph structure, we generate graphs using the model proposed by Erdős and Rényi (1959), which allows us to simulate graph structures more complex than a simple line. An Erdős-Rényi graph is generated by connecting nodes randomly: each edge is included in the graph with probability $p$, independently of the other edges. We consider connected graphs with $K \in \{5, 10, 20, 50, 100, 1000\}$ and with probability $p \in \{1, 1/2, \log(K)/K, \ell\}$, where $p = 1$ corresponds to a fully connected graph (and therefore the graph structure is useless), $p = 1/2$ corresponds to a number of edges that increases linearly in the number of nodes, $p = \log(K)/K$ corresponds to a few edges w.r.t. the nodes, and we use $p = \ell$ to denote line graphs (these line graphs differ from those used for the experimental evaluation discussed above in the reward function, as discussed in what follows). We use different values of $p$ in order to see how the performance of UTS changes w.r.t. the number of edges in the graph; we remark that such an analysis is unexplored in the literature so far. The optimal arm is chosen randomly among the existing arms and its reward is given by a Bernoulli distribution with expected value 0.9. The rewards of the suboptimal arms are given by Bernoulli distributions with expected values depending on their distance from the optimal one. More precisely, let $d^*_i$ be the shortest path from the $i$-th arm to the optimal arm and let $d^*_{\max} = \max_{i \in \{1, \ldots, K\}} d^*_i$ be the maximum shortest path of the graph. The expected reward of the $i$-th arm is:

$$\mu_i = 0.9 - \frac{d^*_i}{d^*_{\max}} (0.9 - 0.1),$$

i.e., the arm at distance $d^*_{\max}$ has a value equal to 0.1 and the expected rewards of the arms along the path from it to the optimal arm are evenly spaced between 0.1 and 0.9.
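A minimal sketch of this construction (ours; the paper does not specify how connectivity is enforced, so here we simply resample until the graph is connected) could look as follows.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in hops) from source; None marks unreachable nodes."""
    dist = [None] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_erdos_renyi(K, p, rng):
    """Sample Erdos-Renyi graphs (each edge kept independently with probability p)
    until a connected one is obtained, returned as adjacency lists."""
    while True:
        adj = [[] for _ in range(K)]
        for i in range(K):
            for j in range(i + 1, K):
                if rng.random() < p:
                    adj[i].append(j)
                    adj[j].append(i)
        if None not in bfs_distances(adj, 0):
            return adj

def unimodal_expected_rewards(adj, rng, mu_max=0.9, mu_min=0.1):
    """Pick the optimal arm at random and set mu_i = 0.9 - d*_i / d*_max * (0.9 - 0.1)."""
    i_star = rng.randrange(len(adj))
    d = bfs_distances(adj, i_star)
    d_max = max(d)
    return [mu_max - (di / d_max) * (mu_max - mu_min) for di in d]

rng = random.Random(0)
adj = connected_erdos_renyi(K=20, p=0.25, rng=rng)
mu = unimodal_expected_rewards(adj, rng)
```

Since the expected rewards decrease linearly with the hop distance from the optimal arm, they strictly increase along any shortest path towards it, so the resulting instance satisfies the unimodality requirement of Definition 1.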
We generate 10 different graphs for each combination of $K$ and $p$, and we run 100 independent trials of length $T = 10^5$ for each graph. We average the regret over the results of the 10 graphs. In Tab. 2, we report $R_T(\mathcal{U})$ for each combination of policy $\mathcal{U}$, $K$, and $p$.

Table 2: Results concerning $R_T(\mathcal{U})$ ($T = 10^5$) in the setting with Erdős-Rényi graphs.

| K    |       | p = 1       | p = 1/2      | p = log(K)/K  | p = ℓ         |
|------|-------|-------------|--------------|---------------|---------------|
| 5    | KLUCB | 34 ± 0.4    | 50 ± 1.5     | 52 ± 3.7      | 56 ± 2.2      |
|      | TS    | 18 ± 0.2    | 23 ± 0.6     | 24 ± 1.3      | 25 ± 0.7      |
|      | OSUB  | 34 ± 0.3    | 32 ± 7.2     | 35 ± 5.8      | 31 ± 4.1      |
|      | UTS   | 17 ± 0.1    | 15 ± 2.4     | 16 ± 2.2      | 14 ± 1.3      |
| 10   | KLUCB | 77 ± 0.5    | 107 ± 5.5    | 127 ± 11.2    | 159 ± 7.0     |
|      | TS    | 40 ± 0.2    | 50 ± 2.0     | 56 ± 3.8      | 67 ± 2.5      |
|      | OSUB  | 77 ± 0.3    | 76 ± 8.1     | 57 ± 5.6      | 70 ± 8.1      |
|      | UTS   | 39 ± 0.2    | 35 ± 3.2     | 27 ± 2.1      | 34 ± 2.4      |
| 20   | KLUCB | 163 ± 0.7   | 217 ± 6.2    | 262 ± 16.2    | 386 ± 21.3    |
|      | TS    | 84 ± 0.5    | 102 ± 2.3    | 117 ± 5.7     | 157 ± 6.9     |
|      | OSUB  | 163 ± 0.8   | 148 ± 14.9   | 86 ± 14.6     | 124 ± 11.7    |
|      | UTS   | 83 ± 0.3    | 70 ± 6.0     | 44 ± 4.8      | 65 ± 8.8      |
| 50   | KLUCB | 420 ± 0.7   | 560 ± 15.0   | 686 ± 30.5    | 1132 ± 49.2   |
|      | TS    | 217 ± 0.5   | 262 ± 4.4    | 303 ± 10.0    | 454 ± 19.9    |
|      | OSUB  | 420 ± 1.0   | 382 ± 35.6   | 162 ± 13.9    | 240 ± 15.8    |
|      | UTS   | 216 ± 0.7   | 182 ± 14.2   | 89 ± 5.5      | 156 ± 30.1    |
| 100  | KLUCB | 846 ± 2.0   | 1134 ± 17.8  | 1313 ± 59.7   | 2327 ± 63.5   |
|      | TS    | 436 ± 1.1   | 528 ± 4.9    | 586 ± 18.4    | 973 ± 31.8    |
|      | OSUB  | 846 ± 2.7   | 786 ± 39.0   | 226 ± 27.1    | 369 ± 10.7    |
|      | UTS   | 437 ± 0.5   | 372 ± 15.2   | 141 ± 9.1     | 290 ± 42.3    |
| 1000 | KLUCB | 8505 ± 12.2 | 11247 ± 60.1 | 12024 ± 464.7 | 10640 ± 291.5 |
|      | TS    | 4391 ± 3.4  | 5262 ± 23.0  | 5478 ± 151.3  | 6554 ± 115.2  |
|      | OSUB  | 8493 ± 13.6 | 7761 ± 153.4 | 1151 ± 45.0   | 1165 ± 20.7   |
|      | UTS   | 4388 ± 5.2  | 3718 ± 62.9  | 1000 ± 14.2   | 1165 ± 41.8   |

It can be observed that the UTS algorithm outperforms all the other algorithms, providing in every case the smallest regret, except for $K = 1000$ and $p = \ell$. Below we discuss how the relative performance of the algorithms varies as the values of the parameters $K$ and $p$ vary.

Consider the case with $p = 1$. The performance of UTS and TS are approximately equal, and the same holds for the performance of OSUB and KLUCB. This is due to the fact that the neighborhood of each node is composed of all the arms, the graphs being fully connected, and therefore UTS and OSUB cannot take any advantage from the structure of the problem. We notice, however, that UTS and TS do not exhibit exactly the same behavior and that UTS always performs slightly better than TS. It can be observed in Fig. 2, with $K = 5$ and $p = 1$, that the relative improvement is mainly at the beginning of the time horizon and that it goes to zero as $K$ increases (the same holds for OSUB w.r.t. KLUCB). The reason behind this behavior is that UTS reduces the exploration performed by TS in the first rounds, forcing the algorithm to pull the leader, chosen as the arm maximizing the empirical mean, for a larger number of rounds.

[Figure 2: Results for the pseudo-regret $R_t(\mathcal{U})$ in the setting with $K = 5$ and $p = 1$.]

Consider the case with $p = 1/2$. In the considered experimental setting, the relative performance of the algorithms does not depend on $K$. The ordering, from the best to the worst, over the performance of the algorithms is: UTS, TS, OSUB, and finally KLUCB. Surprisingly, even the dependency of the following ratios on $K$ is negligible: $R_\%(\text{UTS}, \text{TS}) = 0.68 \pm 0.03$, $R_\%(\text{UTS}, \text{OSUB}) = 0.47 \pm 0.01$, and $R_\%(\text{OSUB}, \text{KLUCB}) = 0.68 \pm 0.03$. This shows that the relative improvement due to UTS is constant w.r.t. TS and OSUB as $K$ varies. These results raise the question whether the relative performance of OSUB and TS would be the same, except for the numerical values, for every $p$ constant w.r.t. $K$. To answer this question, we consider the case in which $p = 0.1$, corresponding to the case in which the number of edges is linear in $K$, but smaller than in the case with $p = 1/2$.
The results in terms of $R_T(\mathcal{U})$, reported in Tab. 3, show that OSUB outperforms TS for $K \geq 10$, suggesting that, when $p$ is constant in $K$, OSUB may or may not outperform TS depending on the specific pair $(p, K)$.

Table 3: Results concerning $R_T(\mathcal{U})$ ($T = 10^5$) in the setting with Erdős-Rényi graphs and $p = 0.1$.

|      | K = 5 | K = 10 | K = 20 | K = 50 | K = 100 | K = 1000 |
|------|-------|--------|--------|--------|---------|----------|
| TS   | 25    | 66     | 162    | 278    | 519     | 4564     |
| OSUB | 29    | 64     | 126    | 144    | 266     | 2358     |

Consider the case with $p = \log(K)/K$. The ordering over the performance of the algorithms changes as $K$ varies. More precisely, while UTS keeps being the best algorithm for every $K$ and KLUCB the worst algorithm for every $K$, the ordering between TS and OSUB changes. When $K \leq 10$, TS performs better than OSUB; instead, when $K \geq 20$, OSUB outperforms TS, see Fig. 3. This is due to the fact that, with a small number of arms, exploiting the graph structure is not sufficient for a frequentist algorithm to outperform TS, while with many arms exploiting the graph structure even with a frequentist algorithm is much better than employing a general-purpose Bayesian algorithm.

[Figure 3: Results for the pseudo-regret $R_t(\mathcal{U})$ in the settings with $K = 5$ (a) and $K = 20$ (b) and $p = \log(K)/K$.]

The ratio $R_\%(\text{UTS}, \text{TS})$ monotonically decreases as $K$ increases, from 0.66 when $K = 5$ to 0.19 when $K = 1000$, suggesting that exploiting the graph structure provides advantages as $K$ increases. Instead, the ratio $R_\%(\text{UTS}, \text{OSUB})$ monotonically increases as $K$ increases, from 0.45 when $K = 5$ to 0.94 when $K = 1000$, suggesting that the improvement provided by employing Bayesian approaches reduces as $K$ increases, as observed above in line graphs.

Consider the case with $p = \ell$. As in the case discussed above, OSUB is outperformed by TS for a small number of arms ($K \leq 10$), while it outperforms TS for many arms ($K \geq 20$). The reason is the same as above. Similarly, the ratio $R_\%(\text{UTS}, \text{TS})$ monotonically decreases as $K$ increases, from 0.58 when $K = 5$ to 0.18 when $K = 1000$, and the ratio $R_\%(\text{UTS}, \text{OSUB})$ monotonically increases as $K$ increases, from 0.45 when $K = 5$ to 1.00 when $K = 1000$. This confirms that the performance of UTS and that of OSUB asymptotically match as $K$ increases when $p = \ell$ (as well as when $p = \log(K)/K$). In order to investigate the reasons behind such a behavior, we produce an additional experiment with the line graphs of Combes and Proutiere (2014a), except that the maximum expected reward is set to 0.108 when $K = 17$ and to 0.165 when $K = 129$ (thus, given any edge with terminals $i$ and $i+1$, we have $|\mu_i - \mu_{i+1}| = 0.001$). What we observe (details of these experiments and of those described below are in (Paladino et al. 2016)) is that, on average, OSUB outperforms UTS at $T = 10^5$, suggesting that, when it is necessary to repeatedly distinguish between three arms that have very similar expected rewards, frequentist methods may outperform the Bayesian ones. This is no longer true when $T$ is much larger, e.g., $T = 10^7$, where UTS outperforms OSUB (interestingly, differently from what happens in the other topologies, in line graphs with very small $|\mu_i - \mu_{i+1}|$ the average $R_T(\text{UTS})$ and $R_T(\text{OSUB})$ cross a number of times during the time horizon). Furthermore, we evaluate how the relative performance of OSUB w.r.t. UTS varies for $|\mu_i - \mu_{i+1}| \in \{0.001, 0.002, 0.005\}$, observing that it improves as $|\mu_i - \mu_{i+1}|$ decreases.
Finally, we evaluate whether this behavior also emerges in Erdős-Rényi graphs in which $p = c/K$, where $c$ is a constant (we use $p = 5/K$ and $p = 10/K$), and we observe that UTS outperforms OSUB, suggesting that line graphs with very small $|\mu_i - \mu_{i+1}|$ are pathological instances for UTS.

Conclusions and Future Work

In this paper, we focus on the Unimodal Multi-Armed Bandit problem with graph structure, in which each arm corresponds to a node of a graph and each edge is associated with a relationship in terms of expected reward between its arms. We propose, to the best of our knowledge, the first Bayesian algorithm for the UMAB setting, called UTS, which is based on the well-known Thompson Sampling algorithm. We derive a tight upper bound for UTS that asymptotically matches the lower bound for the UMAB setting, providing a non-trivial derivation of the bound. Furthermore, we present a thorough experimental analysis showing that our algorithm outperforms the state-of-the-art methods. In the future, we will evaluate the performance of the algorithms considered in this paper on other classes of graphs, e.g., Barabási-Albert graphs and lattices. Future developments of this work may consider an analysis of the proposed algorithm in the case of time-varying environments, i.e., settings in which the expected reward of each arm varies over time, assuming that the unimodal structure is preserved. Another interesting study may consider the case of a continuous decision space.

References

Alon, N.; Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2013. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems (NIPS), 1610-1618.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.

Caron, S., and Bhagat, S. 2013. Mixing bandits: A recipe for improved cold-start recommendations in a social network. In Proceedings of the Workshop on Social Network Mining and Analysis (SNA-KDD).

Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 2249-2257.

Combes, R., and Proutiere, A. 2014a. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 521-529.

Combes, R., and Proutiere, A. 2014b. Unimodal bandits without smoothness. arXiv preprint arXiv:1406.7447.

Crandall, D.; Cosley, D.; Huttenlocher, D.; Kleinberg, J.; and Suri, S. 2008. Feedback effects between similarity and social influence in online communities. In Proceedings of the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 160-168.

Edelman, B., and Ostrovsky, M. 2007. Strategic bidder behavior in sponsored search auctions. Decision Support Systems 43(1):192-198.

Erdős, P., and Rényi, A. 1959. On random graphs I. Publicationes Mathematicae (Debrecen) 6:290-297.

Garivier, A., and Cappé, O. 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the Conference on Learning Theory (COLT), 359-376.

Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13-30.

Jia, Y. Y., and Mannor, S. 2011. Unimodal bandits. In Proceedings of the International Conference on Machine Learning (ICML), 41-48.
Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of Algorithmic Learning Theory (ALT), 199-213.

Kleinberg, R.; Slivkins, A.; and Upfal, E. 2008. Multi-armed bandits in metric spaces. In Proceedings of the Symposium on Theory of Computing (STOC), 681-690.

Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4-22.

Mannor, S., and Shamir, O. 2011. From bandits to experts: On the value of side observations. In Advances in Neural Information Processing Systems (NIPS), 684-692.

Mas-Colell, A.; Whinston, M. D.; and Green, J. R. 1995. Microeconomic Theory. Oxford University Press.

McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 415-444.

Paladino, S.; Trovò, F.; Restelli, M.; and Gatti, N. 2016. Unimodal Thompson sampling for graph-structured arms. arXiv preprint arXiv:1611.05724.

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3):285-294.

Trovò, F.; Paladino, S.; Restelli, M.; and Gatti, N. 2015. Multi-armed bandit for pricing. In Proceedings of the European Workshop on Reinforcement Learning (EWRL).

Valko, M.; Munos, R.; Kveton, B.; and Kocak, T. 2014. Spectral bandits for smooth graph functions. In Proceedings of the International Conference on Machine Learning (ICML), 46-54.

Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015. Thompson sampling for budgeted multi-armed bandits. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).