# Graph-Triggered Rising Bandits

Gianmarco Genalti 1, Marco Mussi 1, Nicola Gatti 1, Marcello Restelli 1, Matteo Castiglioni 1, Alberto Maria Metelli 1

1 Politecnico di Milano, Milan, Italy. Correspondence to: G. Genalti. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In this paper, we propose a novel generalization of rested and restless bandits where the evolution of the arms' expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms (i, j) represents the fact that a pull of arm i triggers the evolution of arm j, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. Still, the model can represent far more general and interesting scenarios. We first tackle the problem of computing the optimal policy when no specific structure is assumed on the graph, showing that it is NP-hard. Then, we focus on a specific structure, forcing the graph to be composed of a set of fully connected sub-graphs (i.e., cliques), and we prove that the optimal policy can be easily computed in closed form. Subsequently, we move to the learning problem, presenting regret minimization algorithms for the deterministic and stochastic cases. Our regret bounds highlight the complexity of the learning problem by incorporating instance-dependent terms that encode specific properties of the underlying graph structure. Moreover, we illustrate how knowledge of the underlying graph is not necessary for achieving the no-regret property.

1. Introduction

In the basic stochastic Multi-Armed Bandit (MAB, Lattimore & Szepesvári, 2020) problem, at each round, the agent is asked to choose an action (a.k.a. arm) among a finite action set, observing a reward drawn from an unknown probability distribution. MABs are particularly appealing machine learning models as they provide important theoretical guarantees on the convergence to the optimal solution and the lower and upper bounds on the convergence rate (usually referred to as regret bounds). However, the standard MAB model is too naïve for many real-world applications, and understanding which kind of additional problem structure allows recovering good theoretical properties is a central scientific challenge. Examples include non-stationary (Gur et al., 2014), delayed (Pike-Burke et al., 2018), linear (Abbasi-Yadkori et al., 2011), contextual (Chu et al., 2011), and continuous-action-space (Kleinberg et al., 2008) bandits. We focus on rather recent MAB structures, called restless and rested bandits (Tekin & Liu, 2012). In the former, the expected rewards evolve with time (i.e., as an effect of nature); in the latter, the expected reward of an arm evolves as a function of the pulls we perform on that specific arm. In particular, we study a specific shape of the expected reward evolution, namely rising (Heidari et al., 2016). In a rising bandit, expected rewards increase according to monotonic and concave functions. Restless and rested rising bandits can capture many settings of practical interest. Consider, for instance, the scenario in which we have to choose which product to advertise (i.e., our arms), and the reward is the number of sales for such a product. The product we advertise will increase its sales and favor the sales of complementary products.
This scenario corresponds to a rested problem in which some elements present a restless behavior. In this paper, we propose a generalization of restless and rested bandits. We define a novel space of MABs called Graph-Triggered Rising Bandits (GTRBs). A GTRB is represented by a bandit with a graph describing the interactions between arms. Specifically, an arm triggers the evolution of its own expected reward (as for rested bandits) and the evolution of the connected arms. Figure 1b shows an example of this scenario. Interestingly, rested and restless bandits are two vertices of the space of GTRBs. In particular, restless bandits correspond to the case of a fully connected graph (Figure 1a), while rested ones correspond to the graph with self-loops only (Figure 1c).

Figure 1. Examples of 3-armed GTRBs with graph representation: (a) Restless Setting; (b) This work; (c) Rested Setting.

Contributions. In this paper, we present Graph-Triggered Rising Bandits (GTRBs), a setting aiming at generalizing the rested and restless bandit settings by introducing a graph representing the interaction between the arms. We focus on the case of rising bandits as they represent an interesting case study that will allow us to obtain no-regret algorithms. More in detail, the contributions are as follows. In Section 2, we formally present the fundamental notions on rested and restless bandits and the assumptions related to the rising scenario. Then, we introduce the novel setting of GTRBs and discuss the relevant quantities characterizing an instance, including a representation of the graph based on the adjacency matrix. Finally, we present the learning problem and the performance index used to evaluate the algorithms. In Section 3, we discuss the notion of optimal policy. We start with a negative result, proving that computing the optimal policy for a generic graph is NP-hard (Theorem 1). Then, we characterize the optimal policy for block-diagonal adjacency matrices, which can be computed in polynomial time (Theorem 2). In Section 4, we discuss the deterministic scenario and propose two algorithms, the first (DR-BG-UB) for block-diagonal matrices and the second (DR-G-UB) for general graphs. We analyze their regret guarantees, highlighting the dependence on the graph structure. In Section 5, we analyze the seminal algorithm R- -UCB (Metelli et al., 2022), designed for rested and restless stochastic rising bandits, which does not require knowledge of the graph. We analyze its regret guarantees, focusing on the dependence on the characteristics of the underlying graph. The relevant literature is discussed in Appendix A. The proofs of all statements are provided in Appendix B. An extensive numerical validation of the proposed algorithms is provided in Appendix D.

2. Problem Formulation

In this section, we first introduce notions related to Stochastic Rising Bandits (Section 2.1). Then, we present the Graph-Triggered Rising Bandits setting (Section 2.2). Finally, we formalize the learning problem (Section 2.3).

2.1. Notions on Rested and Restless Rising Bandits

Before introducing the setting, we present the fundamental notions concerning stochastic rising rested and restless bandits, and the related assumptions. Let T ∈ ℕ be the learning horizon. We define an instance ν = (ν_i)_{i∈[k]} of a k-armed bandit as a vector of probability distributions with support over ℝ, where k ∈ ℕ and [k] := {1, 2, . . . , k}. The agent interacts with the environment as follows.
At every round t ∈ [T], the agent is asked to select an action I_t among the k available ones, and it observes a reward X_{I_t,t} ∼ ν_{I_t}. We define N_{i,t} := Σ_{τ∈[t]} 1{I_τ = i} as the number of pulls of arm i ∈ [k] until round t. In this work, we consider two specific types of MAB, namely restless and rested bandits (Tekin & Liu, 2012). In both cases, to each arm i ∈ [k] corresponds a sequence of probability distributions ν = (ν_{i,n})_{i∈[k], n∈[T]}, where the expected reward µ_i(n) = E_{X∼ν_{i,n}}[X] evolves as a function of a history-dependent quantity n ∈ ℕ. In the rested scenario, we consider the case in which the expected reward of a generic arm i evolves according to the number of pulls of such an arm, i.e., n ≡ N_{i,t}. On the other hand, in the restless case, the expected reward of a generic arm i evolves according to the current time t, i.e., n ≡ t. In other words, in rested bandits, the reward distribution of an arm evolves only when it is pulled, while in restless bandits, it evolves at each round. As customary in this field, we consider expected rewards µ_i(n) bounded in [0, 1], for every i ∈ [k] and n ∈ [T]. Finally, we assume the distributions to be subgaussian¹ for every arm i and n ∈ ℕ, with their subgaussianity constants uniformly upper bounded by σ². Among the various types of restless and rested bandits available in the literature, we focus on Stochastic Rising Bandits (Metelli et al., 2022). They are a specific class of bandits in which the expected reward of each arm evolves in a non-decreasing and concave manner. The following assumption formalizes such a behavior.

Assumption 1 (Non-decreasing and Concave Payoffs). Let ν be an instance of a Stochastic Rising Bandit. Then, defining γ_i(n) := µ_i(n + 1) − µ_i(n) for every i ∈ [k] and n ∈ [T], it holds that:

Non-decreasing: γ_i(n) ≥ 0, (1)
Concave: γ_i(n + 1) ≤ γ_i(n). (2)

The two parts of this assumption allow us to provide theoretical guarantees in both the restless and rested settings. Such guarantees cannot be provided without the concavity assumption (see Theorem 4.2 of Metelli et al., 2022).

¹ A (zero-mean) random variable X is σ²-subgaussian if it holds E[exp(λX)] ≤ exp(σ²λ²/2) for every λ ∈ ℝ.

Instance Characterization. Assumption 1 ensures sufficient structure on the problem to allow for algorithms with provably strong theoretical guarantees. In this scenario, given an instance ν, we define the total increment as:

Υ_ν(M, q) := Σ_{t∈[M−1]} max_{i∈[k]} γ_i(t)^q, (3)

where M ∈ ℕ and q ∈ [0, 1]. This quantity figures in the (instance-dependent) theoretical guarantees of algorithms operating in this setting and characterizes the difficulty of learning in the instance ν.

2.2. Graph-Triggered Rising Bandits

In Stochastic Rising Bandits, either in the rested or restless fashion, there exists no structure among the different actions. In this work, we generalize the rested and restless bandits by adding a structure allowing arms to interact. We consider arms as connected through an undirected graph, which can be either known or unknown to the agent.² If we pull an arm i ∈ [k], we get its reward, and we trigger an evolution of the expected reward of arm i and of all the arms connected to i. We do not get nor observe rewards from the connected arms (i.e., bandit feedback). Such a graph can be represented by a symmetric adjacency matrix G ∈ {0, 1}^{k×k}. If the matrix contains the value 1 in position (i, j), this implies that a pull of arm i determines the evolution of the expected reward of arm j.
If the matrix contains a 0 in position (i, j), this implies that a pull of arm i does not cause an evolution of the expected reward of arm j. The pull of an arm i always implies the evolution of its own expected reward, formally G_{i,i} = 1 for every i ∈ [k]. For every round t ∈ [T] and arm i ∈ [k], we define the number Ñ_{i,t} of triggers that it has undergone as follows:

Ñ_{i,t} := Σ_{τ∈[t]} 1{G_{I_τ,i} = 1} = e_i^⊤ G N_t, (4)

where e_i is the vector of the standard basis of ℝ^k whose components are all zero except for the i-th, and N_t := (N_{1,t}, . . . , N_{k,t}) is the vector containing the number of pulls of each arm up to round t. In GTRBs, rewards are sampled from probability distributions whose average rewards vary with the number of triggers, i.e., n ≡ Ñ_{i,t}; consequently, the expected reward of an arm i evolves as µ_i(Ñ_{i,t}). Furthermore, we define t_{i,n} := Σ_{l∈[T]} 1{N_{i,l} < n} as the round in which arm i has been pulled for the n-th time. With t_{i,t} := (t_{i,n})_{n≤N_{i,t}} we refer to the vector containing all the rounds in which arm i has been pulled, up to time t. Moreover, we introduce t^I_{i,n} := Ñ_{i,t_{i,n}}, namely the internal time of the n-th pull of arm i, which is the number of triggers of arm i at the time of the n-th pull. Finally, we introduce the concept of degree. In a graph, the degree of a node is the number of edges incident to the node. Formally, given an arm i ∈ [k], we define: deg(i) := 1_k^⊤ G e_i. Given the adjacency matrix G of a graph, we define k_1 := |{i ∈ [k] : deg(i) = 1}| as the number of arms having degree 1. We now observe the relationship between rested and restless bandits and our setting.

Remark 1 (Inclusion of Rested and Restless Bandits in GTRBs). The GTRB setting includes both rested and restless bandits (Tekin & Liu, 2012). These two settings can be recovered by considering G = I_k and G = 1_{k×k} for the rested and restless settings, respectively.³ Indeed, a restless bandit can be seen as a particular instance of GTRB where all arms are triggered at each round, making them change every round independently from which action has been chosen (Ñ_{i,t} = t, for every i ∈ [k]). Instead, in a rested bandit an arm changes its expected reward only when it is directly chosen (Ñ_{i,t} = N_{i,t}).⁴

Block-Diagonal Adjacency Matrix. We now discuss a particular case of GTRB that is interesting from both the practical and the analytical point of view. Until now, we considered G ∈ {0, 1}^{k×k} to be a generic binary symmetric matrix. However, we now focus on the specific case in which G is a block-diagonal matrix, i.e., a matrix in which the main-diagonal blocks are square matrices of all ones, and all off-diagonal blocks are zero matrices. Formally, let B_k̃ ⊂ {0, 1}^{k×k} be the set of block-diagonal matrices with exactly k̃ ∈ [k] distinct blocks of 1s. We call GTRB with block connectivity the set of instances where Assumption 1 holds and the adjacency matrix G ∈ B_k̃ for some k̃ ≤ k. Moreover, we identify with C_G = {C_{m,G}}_{m∈[k̃]} the partition of [k] corresponding to the diagonal blocks of G. In particular, when G ∈ B_k̃, the graph associated with the adjacency matrix is only composed of completely connected components, i.e., cliques. Thus, C_G represents the set of cliques. Moreover, we indicate with Ñ_{C_m,t} := Σ_{i∈C_m} N_{i,t} the number of times an arm belonging to clique C_m ∈ C_G has been pulled, namely the number of triggers of the clique C_m.

² All the results we present also hold for directed graphs.
³ We denote by I_k the identity matrix of dimension k and by 1_{k×k} the square matrix of dimension k whose entries are all equal to 1.
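To make the triggering mechanism concrete, the sketch below computes the trigger counts of Equation (4) for the two corner cases of Remark 1 and for a block-diagonal matrix. It is a minimal Python illustration (the helper name trigger_counts is ours, not from the paper), assuming NumPy.

```python
import numpy as np

def trigger_counts(G: np.ndarray, pulls: np.ndarray) -> np.ndarray:
    """Number of triggers of each arm, i.e., N~_{i,t} = e_i^T G N_t (Equation (4))."""
    return G @ pulls

k = 3
G_rested = np.eye(k, dtype=int)          # rested bandit: self-loops only (G = I_k)
G_restless = np.ones((k, k), dtype=int)  # restless bandit: fully connected graph (G = 1_{kxk})
G_block = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 1]])          # block-diagonal: cliques {1, 2} and {3}

pulls = np.array([4, 1, 2])              # N_t after t = 7 rounds
print(trigger_counts(G_rested, pulls))    # [4 1 2] -> N~_{i,t} = N_{i,t}
print(trigger_counts(G_restless, pulls))  # [7 7 7] -> N~_{i,t} = t
print(trigger_counts(G_block, pulls))     # [5 5 2] -> clique-level trigger counts
```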
⁴ This can be easily seen by looking at Equation (4), considering G = I_k and observing that the vector e_i selects the i-th element of the vector N_t.

2.3. Learning Problem

We define H_t = {(I_l, X_{I_l,l})}_{l∈[t]} as the history of interactions up to a given round t ∈ [T]. We define a policy π(t) as a function π(t) : H_{t−1} ↦ I_t returning the next action given the history up to that round. For a given instance ν of a GTRB, the performance of a policy π is measured by means of the expected cumulative reward throughout T rounds, formally:

J_{ν,G,T}(π) := E[ Σ_{t∈[T]} µ_{I_t}(Ñ_{I_t,t}) ], (5)

where the expectation is taken over the randomness of both the environment and the policy/algorithm. A policy is optimal for an instance ν, an adjacency matrix G, and a time horizon T if it maximizes the expected cumulative reward, formally: π*_{ν,G,T} ∈ arg max_π J_{ν,G,T}(π). We denote by J*_{ν,G,T} = J_{ν,G,T}(π*_{ν,G,T}) the expected cumulative reward attained by the optimal policy. We can now define the expected policy regret as:

R_{ν,G,T}(π) = J*_{ν,G,T} − J_{ν,G,T}(π). (6)

Therefore, our learning problem is to find a policy π minimizing the expected policy regret R_{ν,G,T}(π). Since the optimal policy depends simultaneously on ν, G, and T, from now on, we consider as an instance of the GTRB problem the triple (ν, G, T), instead of the reward distributions ν only.

Remark 2 (On the chosen notion of regret). In GTRBs, we consider a notion of policy regret (Dekel et al., 2012). In this setting, diverging from the optimal sequence of actions influences not only the instantaneous regret but also leads to a sub-optimal history, implying future regret even when returning to an optimal policy from there on. This notion of regret, which shares similarities with the one of Reinforcement Learning, is more challenging to optimize.

3. Optimality in GTRB

In this section, we discuss the notion of optimality in our learning problem. We first characterize the complexity of finding the optimal policy followed by the clairvoyant.

Theorem 1 (Complexity of finding the Optimal Policy in GTRBs). Computing the optimal policy in GTRBs with arbitrary matrices G is NP-Hard.

This theorem follows from a reduction from the NP-Hard problem of determining whether a clique of a given size exists in a graph (Karp, 1972). Intuitively, given a graph (V, E), we build an instance in which the cumulative reward is maximum only if the learner plays a sequence of arms that represent vertices of a clique. Theorem 1 implies that the class of problems of GTRBs is computationally harder than restless bandits and rested rising bandits, for which the optimal policy can be computed in polynomial time (Heidari et al., 2016). Moreover, the optimal policy does not admit a simple closed-form representation. Thus, in general, the optimal policy cannot be reduced to a greedy one or to a fixed-arm policy. This result highlights how the present definition of optimal policy is closer to the one of MDPs rather than the one of standard bandit settings.

3.1. Optimality in GTRB with Block-Diagonal Connectivity Matrix

We now show how, for this special case of GTRBs with block-diagonal connectivity matrices, the optimal policy can be efficiently computed in a closed-form fashion.

Theorem 2 (Optimal Policy in Rising GTRB with Block-Diagonal Connectivity). For any instance (ν, G, T) s.t. G ∈ B_k̃, under Assumption 1, the optimal policy π*_{ν,G,T} ∈ arg max_π J_{ν,G,T}(π) is given by:

π*_{ν,G,T}(t) ∈ arg max_{j∈C*_{ν,G,T}} µ_j(t), ∀t ∈ [T], (7)

where C*_{ν,G,T} is the best cumulative reward clique:

C*_{ν,G,T} ∈ arg max_{C∈C_G} Σ_{t∈[T]} max_{j∈C} µ_j(t).
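For intuition, the following sketch computes the clairvoyant policy of Theorem 2 when the expected reward curves are known. It is an illustrative implementation with our own naming, assuming each µ_i is available as a Python callable; it is not one of the paper's learning algorithms, which of course do not know µ.

```python
def optimal_block_diagonal_policy(mu, cliques, T):
    """Clairvoyant policy of Theorem 2: pick the clique with the largest greedy
    cumulative reward over T rounds, then play greedily inside it.

    mu: list of callables, mu[i](n) = expected reward of arm i after n triggers.
    cliques: list of lists of arm indices (the blocks of G).
    """
    def greedy_value(C):
        # if we always play inside clique C, every arm in C is triggered each round,
        # so, mirroring Equation (7), at round t we rank arms in C by mu[j](t)
        return sum(max(mu[j](t) for j in C) for t in range(1, T + 1))

    best_clique = max(cliques, key=greedy_value)   # C*_{nu,G,T}
    return [max(best_clique, key=lambda j: mu[j](t)) for t in range(1, T + 1)]

# toy instance: two cliques, non-decreasing and concave rewards (Assumption 1)
mu = [lambda n: 1 - 0.9 ** n,
      lambda n: 0.6 * (1 - 0.5 ** n),
      lambda n: 0.8 - 0.8 / (n + 1)]
print(optimal_block_diagonal_policy(mu, [[0, 1], [2]], T=10))
```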
This result characterizes the optimal policy when the graph linking the actions is only composed of cliques. In particular, the clairvoyant would play a greedy policy, but always inside the same predefined subset of arms composing a clique. Naturally, the chosen clique would be the one having the maximum cumulative reward at the end of the trial. We point out how this policy combines the optimal policy of rising rested bandits (corresponding to always playing the arm with the highest cumulative reward) and the optimal policy of rising restless bandits (the greedy policy). From now on, for the sake of simplicity in the notation, we will omit explicit references to T.

4. Regret Minimization in Deterministic Settings

In this section, we propose a novel algorithm to solve the deterministic GTRB, i.e., all instances of GTRB where σ = 0. The deterministic scenario allows for a better understanding of the complex structure of this setting since it ignores the statistical learning problem. We start by introducing a novel biased estimator which, for every arm i ∈ [k], propagates its reward function to the current time t by estimating the first derivative using the last two observations:

µ̄_i(t) := µ_i(t^I_{i,N_{i,t−1}}) + (t − t^I_{i,N_{i,t−1}}) · (µ_i(t^I_{i,N_{i,t−1}}) − µ_i(t^I_{i,N_{i,t−1}−1})) / (t^I_{i,N_{i,t−1}} − t^I_{i,N_{i,t−1}−1}). (8)

This estimator relies on the concept of internal time. Internal times are particularly useful since they allow separating the bias into two components:

t − t^I_{i,N_{i,t−1}} = (t − t^I_{i,N_{i,t}}) + (t^I_{i,N_{i,t}} − t^I_{i,N_{i,t−1}}),

where we refer to the first addendum as (A) and to the second as (B). As we will see in Section 4.1, this decomposition assumes a particular meaning in instances where G ∈ B_k̃: (A) represents the rested component of the bias, since t^I_{i,N_{i,t}} = Ñ_{C_m,t_{i,N_{i,t}}}, making it equivalent to the bias of a rested bandit where cliques are the arms; and (B) represents the restless component of the bias, since from arm i's perspective t^I_{i,N_{i,t}} = Ñ_{i,t} can be interpreted as the current time inside the clique.

4.1. Algorithm for Deterministic GTRBs with Block-Diagonal Matrices

In this part, we introduce DR-BG-UB, an optimistic anytime regret minimization algorithm for the deterministic GTRB setting with block-diagonal matrices, whose pseudocode is provided in Algorithm 1.

Algorithm 1: DR-BG-UB.
Input: Adjacency matrix G.
1: Initialize N_{i,0} ← 0, ∀i ∈ [k].
2: for t ∈ [T] do
3:   Compute µ̄_i(t) as in Equation (8).
4:   Select I_t ∈ arg max_{i∈[k]} µ̄_i(t).
5:   Play I_t and observe µ_{I_t}(Ñ_{I_t,t}).
6:   N_{I_t,t} ← N_{I_t,t−1} + 1.
7:   N_{i,t} ← N_{i,t−1}, ∀i ∈ [k] \ {I_t}.

The algorithm takes as input the matrix G and employs the estimator presented in Equation (8). Then, after having initialized the counters of the number of pulls, it starts the interaction with the environment. At each round t ∈ [T], it estimates (line 3) µ̄_i(t) for every i ∈ [k] as in Equation (8) and plays greedily according to it (lines 4-5).⁵

⁵ At the beginning of the run, the algorithm is required to play every arm twice in a round-robin fashion in order to be able to compute µ̄_i(t).

Regret Analysis. We recall that block-diagonal matrices represent a special case of graph structure for GTRBs where the optimal policy can be characterized in closed form (Theorem 2). The following result provides the regret bound of DR-BG-UB, highlighting the impact of the graph topology.

Theorem 3 (Regret Upper Bound for DR-BG-UB in Block-Diagonal Matrices in Deterministic Settings). Let (ν, G) be an instance of the GTRB problem, where G ∈ B_k̃ and σ = 0.
Then, Algorithm 1 suffers a regret bounded by:

R_{ν,G}(DR-BG-UB) ≤ 4k + inf_{q∈[0,1]} { Σ_{C_m∈C_G} |C_m| T^q Υ_ν(⌈Ñ_{C_m,T}/|C_m|⌉, q) [(A) Rested Bias Contribution] + Σ_{C_m∈C_G} |C_m| Ñ_{C_m,T}^{q/(1+q)} Υ_ν(⌈Ñ_{C_m,T}/|C_m|⌉, q)^{1/(1+q)} [(B) Restless Bias Contribution] }.

In this theorem, we report the result as a function of the number of triggers Ñ_{C_m,T} of the cliques in order to better discuss the properties of the graph. However, this dependence can be removed by simply observing that Ñ_{C_m,T} ≤ T. This choice allows us to have an interesting discussion on the nature of this result w.r.t. the graph structure. First of all, we observe that we can separate two contributions to the regret: one coming from the rested behavior (part (A) of the bound), determined by the need for identifying the best clique, and the other from the restless behavior (part (B) of the bound), needed for identifying the best arm inside the clique. If we compare this result to the bounds in Theorems 4.4 and 5.2 of (Metelli et al., 2022), we can notice how the shapes of the two contributions correspond. We also remark that, in the two corner cases, i.e., rested and restless bandits, the regret bound is actually smaller and corresponds exactly to the bounds presented in (Metelli et al., 2022), even though this is not immediately visible in Theorem 3 because of a mathematical artifact of the proof. More details can be found in Remark 5 (Appendix B). In Theorem 3, the graph topology emerges by means of the clique sizes, which act as multiplicative constants. The major consequence is that having fewer cliques leads, in general, to a better bound. As intuition suggests, the rested scenario can lead to a worst-case bound in the first component (which is, by the way, the one having the greater order in T), and this can be seen by a simple application of Jensen's Inequality and by noticing that Υ_ν is a concave function:

Σ_{C_m∈C_G} |C_m| Υ_ν(⌈Ñ_{C_m,T}/|C_m|⌉, q) ≤ k Υ_ν(⌈T/k⌉, q).

We remark that in the two corner cases, one of the two contributions vanishes, even though it cannot be directly seen in Theorem 3. However, since the restless regret has a better order than the rested one, graphs with fewer cliques may lead, in general, to better bounds. Unfortunately, to precisely quantify this property, one would need to know the exact shape of Υ_ν and to solve a difficult optimization problem.

4.2. Algorithm for Deterministic GTRBs for General Matrices

After having studied the scenario of block-diagonal matrices, we now consider the case in which G can be arbitrary. Before introducing the algorithm, we need to define the concept of block sub-matrix.

Definition 1 (Block Sub-matrix). Let G ∈ {0, 1}^{k×k}. A block-diagonal matrix G^L ∈ B_k̃ is a sub-matrix of G if it satisfies:

G_{i,j} ≥ G^L_{i,j} ≥ 0, ∀i, j ∈ [k]. (9)

Moreover, we say that G^L ∈ B_k̃ is maximal if it also satisfies:

G^L ∈ arg min_{G^L satisfying Eq. (9)} |C_{G^L}|.

Informally, G^L ∈ B_k̃ is a sub-matrix of G if its graph can be obtained by only removing 1s from G. Finally, a maximal sub-matrix has the least number of cliques. Note that such a maximal sub-matrix is, in general, not unique. For this algorithm, we need to introduce a novel estimator whose definition recalls the one of Equation (8):

µ̄^L_i(t) := µ_i(t^{I,L}_{i,N_{i,t−1}}) + (t − t^{I,L}_{i,N_{i,t−1}}) · (µ_i(t^{I,L}_{i,N_{i,t−1}}) − µ_i(t^{I,L}_{i,N_{i,t−1}−1})) / (t^{I,L}_{i,N_{i,t−1}} − t^{I,L}_{i,N_{i,t−1}−1}), (10)

where t^{I,L}_{i,l} := e_i^⊤ G^L N_{t_{i,l}} is the internal time w.r.t. a maximal sub-matrix G^L of the actual matrix G. Given this new estimator, we can generalize Algorithm 1 to attain comparable performance even for an arbitrary G. We introduce DR-G-UB, whose pseudocode is provided in Algorithm 2; a short numerical sketch of the extrapolation estimator shared by the two algorithms is given below.
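The estimators in Equations (8) and (10) are the same linear extrapolation and differ only in which matrix (G or G^L) defines the internal times. A minimal sketch of this computation (our own helper, not code from the paper), assuming the rewards observed at an arm's pulls and the corresponding internal times are stored in two lists:

```python
import numpy as np

def extrapolated_reward(obs, internal_times, t):
    """Extrapolation estimator of Equations (8)/(10): take the last two observed
    rewards of an arm, estimate the slope between their internal times, and
    extend linearly up to the current round t.

    obs: rewards observed at the arm's pulls (deterministic setting: mu_i values).
    internal_times: internal time t^I of each pull (trigger count at that pull).
    """
    if len(obs) < 2:
        return np.inf  # arm not pulled twice yet: force exploration (cf. footnote 5)
    mu_last, mu_prev = obs[-1], obs[-2]
    tI_last, tI_prev = internal_times[-1], internal_times[-2]
    slope = (mu_last - mu_prev) / (tI_last - tI_prev)
    return mu_last + (t - tI_last) * slope
```

DR-G-UB (Algorithm 2) obtains the internal times from G^L instead of G and then plugs this estimate into the same greedy selection rule.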
Algorithm 2: DR-G-UB.
Input: Connectivity matrix G.
1: Initialize N_{i,0} ← 0, ∀i ∈ [k].
2: Compute a maximal sub-matrix G^L from G.
3: for t ∈ [T] do
4:   Compute µ̄^L_i(t) as in Equation (10).
5:   Select I_t ∈ arg max_{i∈[k]} µ̄^L_i(t).
6:   Play I_t and observe µ_{I_t}(Ñ_{I_t,t}).
7:   N_{I_t,t} ← N_{I_t,t−1} + 1.
8:   N_{i,t} ← N_{i,t−1}, ∀i ∈ [k] \ {I_t}.

The algorithm takes as input a generic matrix G and computes G^L. Then, the algorithm interacts with the environment as before and uses the estimator defined in Equation (10). In other words, DR-G-UB pretends to be interacting with a bandit with a graph defined by G^L.

Regret Analysis. We now provide a regret upper bound for DR-G-UB.

Theorem 4 (Regret Upper Bound for DR-G-UB for General Matrices in Deterministic Settings). Let (ν, G) be an instance of the GTRB problem, where G ∈ {0, 1}^{k×k} and σ = 0. Then, DR-G-UB suffers a regret bounded as:

R_{ν,G}(DR-G-UB) ≤ 4k + min_{q∈[0,1]} { Σ_{C^L_m∈C_{G^L}} |C^L_m| T^q Υ_ν(⌈Ñ_{C^L_m,T}/|C^L_m|⌉, q) + Σ_{C^L_m∈C_{G^L}} |C^L_m| Ñ_{C^L_m,T}^{q/(1+q)} Υ_ν(⌈Ñ_{C^L_m,T}/|C^L_m|⌉, q)^{1/(1+q)} },

where G^L ∈ B_k̃ is a maximal sub-matrix of G.

This result provides a formal justification to the intuition that the performance of Algorithm 2 can be bounded with the upper bound attained in a less favorable scenario, i.e., a block-diagonal instance that is closer to the worst-case instance of a rested bandit. The regret bound of DR-G-UB can be found by applying Theorem 3 using the matrix G^L.

Remark 3 (Computational Complexity). Note that, even if the optimal policy in this setting for a general G is NP-hard to retrieve, with DR-G-UB we achieve sublinear regret w.r.t. the optimal policy with a polynomial-time algorithm. This has been made possible by the ability of DR-G-UB to identify a convenient matrix G^L that is subsequently adopted as a proxy of the real environment in order to play in a computationally efficient manner.

5. Regret Minimization in Stochastic Setting

In this section, we focus on the stochastic GTRB scenario. We characterize the performance of R- -UCB (Metelli et al., 2022) in the GTRB setting. We show that such an algorithm achieves good performance for a general G. In particular, we develop a new proof strategy for the regret upper bound that makes graph-dependent terms explicit. As anticipated in Section 1, we aim at obtaining a computationally efficient algorithm enjoying sub-linear regret guarantees. Surprisingly, our analysis shows that R- -UCB not only enjoys sub-linear regret for any matrix G, but also that the graph-dependent quantities actually interpolate the regret between the two corner cases. Moreover, we show that there is no need to solve any additional NP-Hard problem before or during the algorithm's execution, letting R- -UCB keep its affordable computational cost, as in the two corner settings. Furthermore, in this case, the algorithm is completely unaware of the graph structure. The algorithm employs a biased estimator which, for every
However, it can be extended to GTRBs (more details are provided in Appendix C). Lemma 5 (Concentration of Estimator, adapted from Metelli et al. 2022). For every arm i [k], every round t [T], and window width 1 h j Ni,t 1 βh i (t, δ) := σ(t Ni,t 1 + h 1) Then, if the window size depends on the number of pulls only hi,t = h(Ni,t 1) and if δt = t α for some α > 2, it holds for every round t [T] that: P bµhi,t i (t) eµhi,t i (t) > βhi,t i (t, δt) 2t1 α. The algorithm, whose pseudocode is reported in Algorithm 3, takes as input the subgaussianity constants upper bound σ, sliding window size parameter ϵ, and a sequence of confidence levels δt, where t [T]. R- -UCB relies on the previously defined biased estimator and uses its confidence interval to make decisions in an optimistic manner. R- -UCB does not require the time horizon T as an input, making it an anytime algorithm. Moreover, the algorithm exploits the sliding window mechanism to deal with the environment s uncertainty while controlling the confidence degree by means of {δt}t [T ]. In particular, the window size employed by the algorithm is proportional to parameter ϵ (0, 1/2), in the form of hi,t = ϵNi,t 1 . As we show below, ϵ controls the bias-variance trade-off, where low values for ϵ result in less bias but higher variance, and vice versa. Remark 4 (Computational Complexity). At each round, Algorithm 3 only needs to update the estimator and the related confidence bounds for every arm, which can be done in a time linear in the number of arms at every step. For an efficient update, we refer the reader to (Mussi et al., 2024). 5.1. Regret Analysis for Block-Diagonal Matrices We analyze its performance in the block-diagonal case before bounding the regret of Algorithm 3 for general matrices. Theorem 6 (Regret Upper Bound for R- -UCB in Block- -Diagonal Matrices). Let (ν, G) be an instance of the GTRB problem, where G Bek. Let hi,t = ϵNi,t 1 for ϵ (0, 1/2) and δt = t α for α > 2. Then, Algorithm 3 suffers an expected regret bounded as: Rν,G(R- -UCB) min q [0,1] | {z } (A) Variance Contribution | {z } (B) Rested Bias Contribution Cm CG:|Cm|>1 |Cm|Υν | {z } (C) Restless Bias Contribution where k1 is the number of cliques in G containing only one action. Existence of a Bias-Variance Trade-off. In the regret upper bound, we can observe three distinct contributions. First, (A) represents the variance contribution, which is the regret suffered by the algorithm due to the stochastic nature of the environment. This contribution is due to the estimator s concentration properties and sets a minimum order of regret to e O(T 2/3). This term is independent of the total increment Υν but, differently from the others, is the only contribution depending on σ. The contribution due to the estimator s bias is split into two distinct parts. The term (B) represents the rested contribution, which scales with the number of blocks containing only one arm. The term (C), instead, represents the restless contribution that scales with the number and the sizes of cliques. The bias contributions depend explicitly on the shape of average reward functions by total increment Υν. The only term common to variance and bias contribution is ϵ. Indeed, ϵ regulates such a trade-off between bias and variance, and this effect can be observed in the complete form of the regret upper bound in Appendix B. The variance contribution depends linearly on ϵ 1; thus, a smaller window size implies a higher variance in the estimate. 
On the contrary, the bias tends to increase with ϵ: this is expected since a larger window means including older samples in the estimate. Dependence on Graph Topology. In the regret upper bound of Theorem 6, the only contribution depending on Graph-Triggered Rising Bandits graph topology is the one coming from bias (terms (B) and (C)). Indeed, the environment s randomness contribution has been decoupled from estimation bias to get a fully tractable stochastic structure. We observe how the different behaviors of arms not connected with the others (size-1 cliques, corresponding to rested arms) and arms belonging to larger cliques. The regret scales as T q in rested arms, but the dependence on the total increment Υν is linear. Instead, for cliques with size greater than 1, regret scales as T 2q 1+q , which is greater than in rested contribution, but scales with Υν to the 1 1+q, that is indeed a better dependence. Moreover, each clique contributes differently, based on its size. Overall, the higher the size, the higher the contribution is since the linear term is dominant w.r.t. the inverse term inside the total increment Υν. Another interesting dependence is the one on ϵ 1 for the restless contribution, which can be observed in the complete form of the bound in Appendix B. For connected arms, stochasticity and graph topology produce an interaction. Indeed, if one could design an estimator with strong concentration properties for connected arms, this would simplify the analysis of the restless contribution, eliminating the bad dependence on stochasticity. With such an estimator, we could reduce the dependence up to T q 1+q , matching the deterministic setting bound. Comparison with Known Results from Literature. Given that rested and restless rising bandits are special instances of GTRBs, we now comment on how the presented bound links to existing results when Algorithm 3 is run over one of those instances. We start from the rested scenario, i.e., when G = Ik. Then, we would have k1 = k and an empty summation in the restless bias contribution. The bound would thus assume the following form: Rν,Ik(R- -UCB) min q [0,1] (σT) 2 3 + k T qΥν The only other existing result for the rested rising bandit setting is the one of Theorem 4.4 of (Metelli et al., 2022), which is matched up to constants by ours. In the restless scenario, i.e., when G = 1k k, we have a unique clique of size k, and k1 = 0. Thus, the bound we presented in Theorem 6 becomes: Rν,1k k(R- -UCB) min q [0,1] (σT) 2 3 + k T Once again, this result matches (up to constants) the result from Theorem 5.3 of (Metelli et al., 2022), the current stateof-the-art for the restless rising bandit problem. To conclude, we generalize the stochastic rising rested/restless bandit setting, with regret bounds that are tight w.r.t. the known results for the two corner scenarios. 5.2. Regret Analysis for General Matrices We are now ready to generalize Theorem 6 to general matrices in G {0, 1}k k. We first introduce the notion of block super-matrix. Definition 2 (Block Super-matrix). Let G {0, 1}k k, a block-diagonal matrix GU Bek is super-matrix of G if it satisfies: Gi,j GU i,j 0, i, j [k]. (11) Moreover, we say that GU Bek is minimal if it also satisfies: GU arg max GU satisfying Eq. (11) |CGU | . This concept of minimal super-matrix plays an analogous role as the maximal sub-matrix in Theorem 4. We now have all the elements to present the upper bound on the regret for the stochastic case and general matrices. 
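A practical note before the statement: although the minimal super-matrix G^U of Definition 2 enters only the analysis, it is cheap to obtain. Since any block-diagonal super-matrix must keep connected arms in the same block, the blocks of a minimal G^U are exactly the connected components of G. The sketch below (our own illustration, not part of the paper) builds it from a NumPy adjacency matrix.

```python
import numpy as np

def minimal_super_matrix(G: np.ndarray) -> np.ndarray:
    """Minimal block super-matrix G^U (Definition 2): block-diagonal, covers G,
    and has as many blocks as possible, i.e., one block per connected component."""
    k = G.shape[0]
    labels = -np.ones(k, dtype=int)
    for root in range(k):
        if labels[root] >= 0:
            continue
        labels[root] = root
        queue = [root]
        while queue:                      # explore the component of `root`
            node = queue.pop()
            for nxt in np.flatnonzero(G[node]):
                if labels[nxt] < 0:
                    labels[nxt] = root
                    queue.append(nxt)
    # arms i, j share a block iff they lie in the same connected component
    return (labels[:, None] == labels[None, :]).astype(int)

# path 1-2-3 plus an isolated arm 4: G^U has one 3x3 block of ones and one singleton
G = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
print(minimal_super_matrix(G))
```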
Theorem 7 (Regret Upper Bound for R- -UCB in General Matrices). Let (ν, G) be an instance of the GTRB problem, where G {0, 1}k k. Let hi,t = ϵNi,t 1 for ϵ (0, 1/2) and δt = t α for α > 2. Then, Algorithm 3 suffers an expected regret bounded as: Rν,G(R- -UCB) min q [0,1] (σT) 2 3 + T q k1Υν CU m |CU m|Υν |CU m|, q 1 1+q )! where GU is the minimal super-matrix of G. We remark that the result has been obtained by bounding e N U CU m,T T for every CU m C GU to remove any stochastic quantity from the regret, but a more precise bound can be provided by finding the worst-case allocation of the triggers among the cliques (as discussed for the similar result in Theorem 4). However, this would require solving a challenging optimization problem that does not admit any closed-form solution, as discussed for the optimal policy (Theorem 1). This result is similar to the one presented in Theorem 4, with the only difference being that the dependence on graph topology is linked to the minimal super-matrix. In principle, the result holds for any super-matrix of G. Still, in the stochastic setting, the upper bound for the rested scenario is better than the one for the restless scenario. Hence, a block-diagonal matrix with as many cliques as possible will, in most cases, lead to better bounds. About the Knowledge of G. In the stochastic scenario, we avoid extracting the super-matrix structure from the graph before executing the algorithm, as it plays the same policy, Graph-Triggered Rising Bandits regardless of the graph. Indeed, Algorithm 3 does not require the knowledge on the graph: the algorithm plays as if the true matrix is the identity one (i.e., a rested instance). To justify this behavior in an intuitive way, we point out to Theorems 4.4 and 5.3 of (Metelli et al., 2022): in stochastic scenarios, the rested contribution to regret s upper bound has a better dependence on T w.r.t. the restless one. Moreover, our optimistic estimator computed by assuming a less connected graph will always be higher than the one computed from any more densely connected graph. Thus, by playing a purely rested policy, we are always sure to over-estimate the true reward (i.e., optimism holds) and we are guaranteed that the rested contribution to the regret is maximized w.r.t. the restless contribution. The final form of the regret bound is obtained by including the minimal super-matrix as a pessimistic proxy of the effect of connected arms (informally, the minimal super-matrix represents the maximum possible contribution to the regret that is due to the graphical connections). We point out that Algorithm 3 does not require the minimal super-matrix as an input, as it is needed only in the analysis. For this reason, one could reformulate the following result by removing the dependence on the minimal super-matrix and including a minimization over the set of all super-matrices. As a side effect, this dramatically reduces the computational burden w.r.t. the deterministic setting at the cost of a slightly higher regret bound. Comparison with Deterministic Regret Bounds. In deterministic scenarios (Theorems 3 and 4), the restless contributions are always of smaller order compared to the rested one, which is the contrary of what we observe in stochastic settings (Theorems 6 and 7). Due to this reason, in Algorithm 2, the regret bound scales with the maximal sub-matrix instead of the minimal super-matrix. 
In the deterministic setting, the maximal sub-matrix represents the maximum possible contribution to the regret that is due to the absence of graphical connections. In principle, we could remove the necessity for graph knowledge also in the deterministic setting by simply playing as in a rested scenario (i.e., run Algorithm 1 by setting G = Ik). This would be sensibly sub-optimal since any graphical connection can be used to obtain a strictly better regret bound. This is not the case for the stochastic setting, where over-estimating the number of connections (e.g., by playing as in a restless scenario) may result in a non-optimistic estimator, compromising the analysis of our algorithms. 6. Discussion and Conclusions In this paper, we proposed the graph-triggered rising bandits (GTRBs), a generalization of the rested and restless bandit settings, where the expected rewards of the different arms evolve by means of a graph. We focused on the stochastic rising bandits, a peculiar type of bandits where the expected rewards are non-decreasing and concave w.r.t. the number of triggers. As a first result, we showed that, in this setting, computing the optimal policy without additional assumptions on the graph structure is NP-Hard. Then, we characterized the optimal policy for the special class of block-diagonal adjacency matrices, and we showed that the optimal policy can be computed in closed form. This special family of instances allowed us to express a strong inter-dependence between the graph topology and the regret bounds and it plays a crucial role also in bounding the regret for the general case. In particular, we presented and studied the performance of two new algorithms, DR-BG-UB and DR-G-UB, to handle the deterministic scenarios in which the adjacency matrix is block-diagonal and for the general case, respectively. Finally, we studied R- -UCB (Metelli et al., 2022), an algorithm from the SRB literature, and we showed that it can provide regret guarantees also in this more general setting. This work aspires to be a first step in the study of graph-triggered bandits. We started from a special set of instances, namely rising bandits, with the goal of extending this framework to other classes of bandits, e.g., rotting bandits. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgments Funded by the European Union Next Generation EU within the project NRPP M4C2, Investment 1.3 DD. 341 - 15 March 2022 FAIR Future Artificial Intelligence Research Spoke 4 - PE00000013 - D53C22002380006, and by the EU Horizon project ELIAS (European Lighthouse of AI for Sustainability, No. 101120237). Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2312 2320, 2011. Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Proceedings of the Annual Conference on Learning Theory (COLT), pp. 23 35. PMLR, 2015. Bubeck, S., Stoltz, G., Szepesv ari, C., and Munos, R. Online optimization in x-armed bandits. Advances in Neural Information Processing Systems (Neur IPS), 21, 2008. Graph-Triggered Rising Bandits Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. 
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of JMLR Proceedings, pp. 208 214. JMLR, 2011. Dekel, O., Tewari, A., and Arora, R. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the International Conference on Machine Learning (ICML). Omnipress, 2012. Doob, J. Stochastic Processes. Probability and Statistics Series. Wiley, 1953. Garivier, A. and Moulines, E. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory (ALT), pp. 174 188. Springer, 2011. Gur, Y., Zeevi, A., and Besbes, O. Stochastic multi-armedbandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems (Neur IPS), pp. 199 207, 2014. Heidari, H., Kearns, M. J., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1562 1570, 2016. Herlihy, C. and Dickerson, J. P. Networked restless bandits with positive externalities. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 11997 12004. AAAI Press, 2023. Jhunjhunwala, P. R., Moharir, S., Manjunath, D., and Gopalan, A. On a class of restless multi-armed bandits with deterministic policies. In International Conference on Signal Processing and Communications (SPCOM), pp. 487 491. IEEE, 2018. Karp, R. M. Reducibility among combinatorial problems. Complexity of Computer Computations, pp. 85 103, 1972. Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pp. 681 690. ACM, 2008. Lattimore, T. and Szepesv ari, C. Bandit Algorithms. Cambridge University Press, 2020. Levine, N., Crammer, K., and Mannor, S. Rotting bandits. In Advances in Neural Information Processing Systems (Neur IPS), pp. 3074 3083, 2017. Metelli, A. M., Trovo, F., Pirola, M., and Restelli, M. Stochastic rising bandits. In Proceedings of the International Conference on Machine Learning (ICML), pp. 15421 15457. PMLR, 2022. Mussi, M., Montenegro, A., Trov o, F., Restelli, M., and Metelli, A. M. Best arm identification for stochastic rising bandits. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2024. Pike-Burke, C., Agrawal, S., Szepesv ari, C., and Gr unew alder, S. Bandits with delayed, aggregated anonymous feedback. In Proceedings of the International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 4102 4110. PMLR, 2018. Raj, V. and Kalyani, S. Taming non-stationary bandits: A bayesian approach. ar Xiv preprint ar Xiv:1707.09727, 2017. Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. Rotting bandits are no harder than stochastic ones. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pp. 2564 2572. PMLR, 2019. Seznec, J., M enard, P., Lazaric, A., and Valko, M. A single algorithm for both restless and rested rotting bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108 of Proceedings of Machine Learning Research, pp. 3784 3794. PMLR, 2020. Tekin, C. and Liu, M. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58 (8):5588 5611, 2012. Whittle, P. 
Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287 298, 1988. Graph-Triggered Rising Bandits A. Related Works In this appendix, we discuss the relevant literature for the GTRB setting. We divide this appendix into two parts. First, we discuss the relevant works concerning graph feedback, then, the literature related to restless and rested bandits, with particular attention to rotting and rising bandits. Graphical Relationships in Bandits. The graph-triggered restless bandit setting has been introduced in this work. Thus, no prior literature is available on this setting. However, we mention similar settings that appeared in the last years. Herlihy & Dickerson (2023) propose the networked restless bandit setting. Despite some similarities with our setting, e.g., the presence of a graph among arms, their action space and learning objectives radically differ from ours, and thus the two settings are not comparable. In (Jhunjhunwala et al., 2018), a restless bandit setting is proposed in which the graph structure is not explicit in the formulation; however, the authors develop a graphical representation of the policies in the deterministic scenario. Their algorithm builds and exploits a graph in an online fashion. Once again, we cannot properly compare this setting to ours, despite some sparse similarities. Finally, we mention bandits with graph feedback (Alon et al., 2015). Despite this setting being conceptually different from ours since arms do not interact, we report it here just because it features graph topology-dependent bounds. We remark that in this case, the graph has not to be intended as a structure for arms interactions but rather as a feedback structure for the learner, in GTRB the feedback is purely bandit. Rested and Restless Bandits. Restless and rested bandits are a well-established research field. Starting from the seminal paper of Whittle (1988) on restless bandits, several approaches have been proposed over the years to deal with non-stationary bandits (Tekin & Liu, 2012; Raj & Kalyani, 2017). Then, specialization of these settings such as rising (Metelli et al., 2022; Mussi et al., 2024) and rotting (Levine et al., 2017) has been introduced. In particular, rotting bandits are a family of restless bandits where the reward is assumed to decrease (contrary to rising bandits). Over the last years, several works tackled rotting bandits (Levine et al., 2017; Seznec et al., 2019). Remarkably, (Seznec et al., 2020) provide a single algorithm for dealing with both rested and restless rotting bandits but show that in the rotting setting, achieving sub-linear regret is not possible when there are both rested and restless arms in the same instance. We remark that for any two-armed rotting bandit where one arm is rested and the other is restless, we can construct an (asymmetric) matrix G such that the instance can be mapped to a graph-triggered rotting bandit instance. This highlights a crucial difference between rotting and rising bandits for what concerns graph-triggering. B. Omitted Proofs Theorem 1 (Complexity of finding the Optimal Policy in GTRBs). Computing the optimal policy in GTRBs with arbitrary matrices G is NP-Hard. Proof. We reduce from a decision problem related to finding cliques in graphs. In particular, given a graph (V, E) and f M N, it is NP-Hard to determine if there exists a clique of size f M (Karp, 1972). 
In the following, we design an instance of our problem such that the reward of the optimal policy is at least PT t=1(1 + t T 2 ) if and only if there exists a clique of size f M = T. Construction. Given a graph (V, E), we build an instance such that the horizon is T. Our set of actions can be constructed by assigning an action to every node and time step couple, i.e., A = {av,t}v V, t [T ]. We define the matrix e G is such that for any v, v V and t, t [T], it holds Gav,t,av ,t = 1 if (v, v ) E. Finally, for each arm av,t A, the reward is deterministic and evolves as eµav,t(n) = min{1 + ηt, n+1 t (1 + ηt)}, where η = T 2. We call eν the set of this functions. It is easy to see that the GTRB instance (eν, e G, T) satisfies Assumption 1. if. We show that if there exists a clique C = {v1, . . . , v T } of size T, then there exists a policy with a cumulative reward of at least PT t=1(1 + ηt). Consider the policy eπ s.t. eπ(t) = avt,t. It is easy to see that e Navt,t,t = t 1 for every t [T]. Hence, the reward of the policy eπ at time t is µavt,t( e Navt,t,t) = min{1 + ηt, (t 1) + 1 t (1 + ηt)} = 1 + ηt. Thus, Jeµ, e G,T (eπ) = PT t=1(1 + ηt) and the claim is proven. only if. We show that if there is a policy eπ s.t. Jeµ, e G,T (eπ) PT t=1(1 + ηt), then there exists a clique of size T. First, we Graph-Triggered Rising Bandits show that for each t , t [T] it holds that max t [T ] min{1 + ηt , t t (1 + ηt )} = 1 + ηt. Indeed, for each t < t, it holds min{1 + ηt , t t (1 + ηt )} 1 + ηt < 1 + ηt. On there other hand, for each t > t it holds min{1 + ηt , t t (1 + ηt )} t t (1 + ηt ) t t + 1(1 + 1 where in the second inequality we use η = T 2. Putting together, the previous inequalities imply that for each t = t we have min{1 + ηt , t t (1 + ηt )} < 1 + ηt. (12) This implies that, at any round t, the best obtainable reward is max t [T ] max v V max l t 1 eµav,t (l) = max t [T ] max v V eµav,t (t 1) = max t [T ] min 1 + ηt , t t (1 + ηt ) = min 1 + ηt, t t(1 + ηt) = 1 + ηt. Since by assumption there is a policy with reward at least PT t=1(1 + ηt), then there is a policy such that at each round t [T] the reward is exactly 1 + ηt. Consider a round t [T]. let av,t be the arm played by the policy at this round. It must be the case that: i) t = t, otherwise µav,t ( e Nav,t,t) µav,t (t 1) < 1 + ηt by Equation (12), and ii) e Nav,t ,t = t 1, otherwise µav ,t ( e Nav,t,t) t 1 t (1 + ηt) < 1 + ηt. Let avt,t be the arm chosen at round t. Then, each arm in {avt,t}t [T ], is chosen while having exactly t 1 triggers. By the definition of e G this directly implies that {vt}t=1 is a clique of size T. Theorem 2 (Optimal Policy in Rising GTRB with Block Diagonal Connectivity). For any instance (ν, G, T) s.t. G Bek, under Assumption 1, the optimal policy π ν,G,T arg maxπ Jν,G,T (π) is given by: π ν,G,T (t) arg max j C ν,G,T µj(t), t [T], (7) where C ν,G,T is the best cumulative reward clique: C ν,G,T arg max C CG t [T ] max j C µj(t). Proof. For each clique Cm CG, we substitute the reward function of every arm i Cm with µ i (t) = maxi Cm µi(t), for every t [T]. Now, since all arms sharing the same clique have the same reward function, our instance is equivalent to a ek-armed bandit problem. Since arms in different cliques are not connected, this corresponds to a rested bandit problem, and we use Proposition 1 from Heidari et al. (2016) to get that the optimal policy would only pull the best action in terms of cumulative reward at the end of the time horizon T. 
To conclude the proof, we remark that playing greedily inside a clique corresponds exactly to play on the reward function defined above, which dominates the initial problem, and so the maximum cumulative reward is exactly the one attained in the problem with ek arms. Graph-Triggered Rising Bandits Lemma 8 (DR-BG-UB Estimator s Instantaneous Bias). For every arm i [k], every round t = 1, let us define: µi(t) := µi(t I i,Ni,t 1) + (t t I i,Ni,t 1) µi(t I i,Ni,t 1) µi(t I i,Ni,t 1 1) t I i,Ni,t 1 t I i,Ni,t 1 1 , Then, µi(t) µi(t I i,Ni,t 1) and, if Ni,t 1 2 it holds that: µi(t) µi( e Ni,t) (t t I i,Ni,t 1)γi(t I i,Ni,t 1 1). Proof. Let us start by observing the following equality holding: µi( e Ni,t) = µi(t I i,Ni,t 1) + j=t I i,Ni,t 1 µi( e Ni,t) = µi(t I i,Ni,t 1) + j=t I i,Ni,t 1 µi(t I i,Ni,t 1) + ( e Ni,t t I i,Ni,t 1)γi(t I i,Ni,t 1 1) (13) µi(t I i,Ni,t 1) + (t t I i,Ni,t 1)γi(t I i,Ni,t 1 1) (14) where line (13) follows from Assumption 1, and line (14) is obtained from observing that e Ni,t t. Concerning the bias, when Ni,t 1 2, we have: µi(t) µi( e Ni,t) µi(t I i,Ni,t 1) µi( e Ni,t) + (t t I i,Ni,t 1) µi(t I i,Ni,t 1) µi(t I i,Ni,t 1 1) t I i,Ni,t 1 t I i,Ni,t 1 1 (15) (t t I i,Ni,t 1) µi(t I i,Ni,t 1) µi(t I i,Ni,t 1 1) t I i,Ni,t 1 t I i,Ni,t 1 1 (16) (t t I i,Ni,t 1)γi(t I i,Ni,t 1 1), (17) where line (16) follows from observing that µi(t I i,Ni,t 1) µi( e Ni,t), and line (17) derives from bounding µi(t I i,Ni,t 1) µi(t I i,Ni,t 1 1) t I i,Ni,t 1 t I i,Ni,t 1 1 γi(t I i,Ni,t 1 1) thanks to Assumption 1. Theorem 3 (Regret Upper Bound for DR-BG-UB in Block-Diagonal Matrices in Deterministic Settings). Let (ν, G) be an instance of the GTRB problem, where G Bek and σ = 0. Then, Algorithm 1 suffers a regret bounded by: Rν,G(DR-BG-UB) inf q [0,1] Cm C |Cm|Υν | {z } (A) Rested Bias Contribution Cm C |Cm| e N q 1+q Cm,T Υν | {z } (B) Restless Bias Contribution Proof. Let C ν,G CG be the optimal clique of the instance. We analyze the following expression: Rν,G(DR-BG-UB) = t=1 µi t (t) µIt( e NIt,t), Graph-Triggered Rising Bandits where i t arg maxi C ν,G µi(t) for all t [T]. Then: RDR-BG-UB ν,G = t=1 µi t (t) µIt( e NIt,t) t=1 µi t (t) µIt(t) µIt( e NIt,t) t=1 min{1, µIt(t) µIt( e NIt,t)} (18) t=1 min{1, (t t I It,NIt,t 1)γIt(t I It,NIt,t 1 1)} (19) t=1 min{1, (t t I It,NIt,t t I It,NIt,t 1)γIt(t I It,NIt,t 1 1)} t=1 min{1, (t t I It,NIt,t)γIt(t I It,NIt,t 1 1)} + X t=1 min{1, (t I It,NIt,t t I It,NIt,t 1)γIt(t I It,NIt,t 1 1)} (20) j=3 min{1, (t t I i,j)γ(t I i,j 2)} j=3 min{1, (t I i,j t I i,j 1)γ(t I i,j 2)} where lines (18) and (19) follow from Lemma 8, line (20) from the fact that min{1, x + y} min{1, x} + min{1, y} for any x, y 0, line (21) from a decomposition over the cliques, the arms in the cliques and the number of pulls of every arm. Let us bound the two terms separately, let q [0, 1]: j=3 min{1, Tγ(t I j,j 2)} (22) j=3 T qγ(t I i,j 2)q (23) Cm CG T q|Cm|Υν where line (22) follows by bounding t t I i,j T for every i and every j, line (23) from the inequality min{1, x} min{1, x}q xq for q [0, 1], line (24) from Lemma 13. 
Then, let y [0, 1 2], and q := y 1 y: j=3 (t I i,j t I i,j 1)yγ(t I i,j 2)y (25) j=3 (t I i,j t I i,j 1) j=3 γ(t I i,j 2) Graph-Triggered Rising Bandits e N y Cm,T |Cm|y e N y Cm,T |Cm|Υν where line (25) follows from the inequality min{1, x} min{1, x}q xq for q [0, 1], line (26) is obtained from H older s inequality with exponents 1 y 1 and 1 1 y 1 respectively, line (27) follows from bounding PNi,T j=3 (t I i,j t I i,j 1) e NCm,T and by Assumption 1, line (28) is obtained by an application of Jensen s Inequality, and line (29) follows from Lemma 13. By setting q = y 1 y [0, 1], and putting all together: RDR-BG-UB ν,G 4k + X Cm CG T q|Cm|Υν q 1+q Cm,T |Cm|Υν Remark 5 (Regret Bound in Rested and Restless Rising Bandits). When we are in a purely rested/restless scenario, the contribution term associated to the restless/rested scenario vanish, and we get the same regret orders from (Metelli et al., 2022). In particular, we can avoid to split the minimum in (20) and instead notice that in a rested setting we have t t I It,NIt,t 1 = t NIt,t 1, and thus we can bound the cumulative regret as we bound the term (a). Instead, in a restless setting we have t t I It,NIt,t 1 = t t It,NIt,t 1, and thus we can bound the cumulative regret as we bound the term (b). Theorem 4 (Regret Upper Bound for DR-G-UB for General Matrices in Deterministic Settings). Let (ν, G) be an instance of the GTRB problem, where G {0, 1}k k and σ = 0. Then, DR-G-UB suffers a regret bounded as: Rν,G(DR-G-UB) min q [0,1] CL m C GL |CL m|Υν & e NCL m,T |CL m| CL m C GL |CL m| e N q 1+q CL m,T Υν & e NCL m,T |CL m| where GL Bek is a maximal sub-matrix of G. Proof. The theorem can be proved by showing that estimator s bias is always larger when internal times are decreased. For every arm i [k] we define: fi(t; x, y) = µi(x) + (t x)µi(x) µi(y) for every triplet of natural numbers y x t T. Note that µi(t) = fi(t; t I i,Ni,t 1, t I i,Ni,t 1 1), so if we can show that fi is decreasing in both x and y we can prove the previous claim. We start with the second argument: fix t and x, then for any y: fi(t; x, y) fi(t; x, y 1) = (t x) Px 1 j=y γi(j) Px 1 j=y 1 γi(j) Px 1 j=y γi(j) (x y)γi(y 1) (x y)(x y + 1) 0, (31) where line (31) follows from Assumption 1. With slightly more calculations we show that fi is also decreasing in the first argument, fix t and y, then for any x: fi(t; x, y) fi(t; x 1, y) = (32) Graph-Triggered Rising Bandits = γi(x 1) + (t x) Px 1 j=y γi(j) x y (t x + 1) Px 2 j=y γi(j) = γi(x 1) + (t x)(x 1 y) Px 1 j=y γi(j) (x y)(t x + 1) Px 2 j=y (x y)(x 1 y) = γi(x 1) + Px 2 j=y γi(j)[(t x)(x 1 y) (x y)(t x + 1)] + (t x)(x 1 y)γi(x 1) (x y)(x 1 y) (33) = γi(x 1) 1 + t x + γi(x 2)x y 1 x y 1 (t x)(x 1 y) (x y)(t x + 1) = γi(x 1) t y x y γi(x 2) t y x y (γi(x 1) γi(x 2)) 0, (34) where line (33) follows from observing that (t x)(x 1 y) (x y)(t x+1) 0, line (34) follows from Assumption 1. We proved that the estimator is decreasing with internal times. Now we observe that, for every i and every t, we have t I i,Ni,t t I,L i,Ni,t. This is a trivial consequence of Definition 1, since t I i,Ni,t t I,L i,Ni,t = j=1 (GIt,i GL It,i) 0. As a consequence of this, we have fi(t; t I i,Ni,t 1, t I i,Ni,t 1) fi(t; t I,L i,Ni,t 1, t I,L i,Ni,t 1), (35) and µi(t I i,Ni,t) µi(t I,L i,Ni,t). 
(36) Finally, given the optimal policy as a sequence (i t )t [T ], we bound the regret: Rν,G(DR-G-UB) = t=1 µi t (t) µIt( e NIt,t) t=1 µi t (t) µIt(t I It,NIt,t) t=1 µi t (t) µIt(t) µIt(t I It,NIt,t) t=1 µIt(t) µIt(t I It,NIt,t) t=1 µL It(t) µIt(t I,L It,NIt,t), (37) where line (37) follows from (35) and (36). The proof can be concluded the same way as in Theorem 3. Lemma 9 (Estimator s Instantaneous Bias). For every arm i [k], every round t [T], and window width 1 h j Ni,t 1 2 k , let us define: eµh i (t) := 1 l=Ni,t 1 h+1 µi(t I i,l) + (t l) µi(t I i,l) µi(t I i,l h) h Graph-Triggered Rising Bandits otherwise if h = 0, we set eµh i (t) := + . Then, eµh i (t) µi(ti,Ni,t 1) and, if Ni,t 1 2 it holds that: eµh i (t) µi( e Ni,t) (2t 2Ni,t 1 + h 1)(t I i,Ni,t 1 t I i,Ni,t 1 2h+1) 2h γi(t I i,Ni,t 1 2h+1). Proof. Let us start by observing the following equality holding for every l {2, . . . , Ni,t 1}: µi( e Ni,t) = µi(t I i,l) + By averaging over a window of length h, we obtain: µi( e Ni,t) = 1 l=Ni,t 1 h+1 µi(t I i,l) + l=Ni,t 1 h+1 µi(t I i,l) + ( e Ni,t t I i,l)γi(t I i,l 1) (38) l=Ni,t 1 h+1 µi(t I i,l) + e Ni,t t I i,l t I i,l t I i,l h t I i,l 1 X j=t I i,l h l=Ni,t 1 h+1 µi(t I i,l) + (t l) µi(t I i,l) µi(t I i,l h) h =: eµh i (t), (40) where lines (38) and (39) follow from Assumption 1, and line (40) is obtained from observing that t I i,l l, e Ni,t t and t I i,l t I i,l h h. Concerning the bias, when Ni,t 1 2, we have: eµh i (t) µi( e Ni,t) = 1 l=Ni,t 1 h+1 µi(t I i,l) + (t l) µi(t I i,l) µi(t I i,l h) h µi( e Ni,t) l=Ni,t 1 h+1 (t l) µi(t I i,l) µi(t I i,l h) h (41) l=Ni,t 1 h+1 (t l) µi(t I i,l) µi(t I i,l h) t I i,l t I i,l h t I i,l t I i,l h h l=Ni,t 1 h+1 (t l)γi(t I i,l h) t I i,l t I i,l h h (42) t I i,Ni,t 1 t I i,Ni,t 1 2h+1 h2 γi(t I i,Ni,t 1 2h+1) l=Ni,t 1 h+1 (t l) (43) = (2t 2Ni,t 1 + h 1)(t I i,Ni,t 1 t I i,Ni,t 1 2h+1) 2h γi(t I i,Ni,t 1 2h+1), (44) where line (41) follows from observing that µi(t I i,l) µi( e Ni,t), line (42) derives from Assumption 1 and bounding µi(t I i,l) µi(t I i,l h) t I i,l t I i,l h γi(t I i,l h), line (43) is obtained by bounding t I i,l t I i,l h t I i,Ni,t 1 t I i,Ni,t 1 2h+1 and γi(t I i,l h) γi(t I i,Ni,t 1 2h+1), and line (44) follows from computing the summation. Graph-Triggered Rising Bandits Lemma 10 (Bound on Estimator s Cumulative Bias for Block Diagonal Matrices). Let (It)t=1 be a sequence of actions. For every action i [k], every round t [T], let window width hi,t = ϵNi,t 1 . Let G Bek be a block diagonal matrix, then for every q [0, 1], we have t=1 min n 1, eµ h It,t It (t) µIt( e NIt,t) o 2k + k1T q 1 1 2ϵ 2q 1+q (1 + log(ϵT)) Cm CG:|Cm|>1 |Cm|Υν , q 1 1+q , (45) where C is the set of blocks of matrix G, and k1 k is the number of blocks of size 1. Proof. 
We proceed decomposing over the cliques and then over the arms, splitting cliques with only one arm from the others: t=1 min n 1, eµ h It,t It (t) µIt( e NIt,t) o Cm CG:|Cm|=1 Cm={i} j=3 min n 1, eµ hi,ti,j i (ti,j) µi(j) o Cm CG:|Cm|>1 j=3 min n 1, eµ hi,ti,j i (ti,j) µi(t I i,j) o We start from bounding the first term: Cm CG:|Cm|=1 Cm={i} j=3 min 1, (2ti,j 2(j 1) + hi,ti,j 1)2hi,t 2hi,t γi(t I i,(j 1) 2hi,ti,j +1) (46) Cm CG:|Cm|=1 Cm={i} j=3 min n 1, Tγi(t I i,j 2 ϵ(j 1) ) o (47) Cm CG:|Cm|=1 Cm={i} j=3 min {1, Tγi( (1 2ϵ)j )} (48) Cm CG:|Cm|=1 Cm={i} j=3 γi( (1 2ϵ)j )q (49) Cm CG:|Cm|=1 Cm={i} (1 2ϵ)Ni,T X j=3 3(1 2ϵ) γi(j)q (50) k1T q 1 1 2ϵ where line (46) follows from Lemma (9) and the fact that, for cliques with a single arm, internal times are equivalent to the number of pulls (i.e., t I i,Ni,t 1 t I i,Ni,t 1 2h+1 = 2h), line (47) follows from Assumption 1, by hi,ti,j = ϵNi,ti,j 1 and by bounding 2ti,j 2(j 1) + hi,ti,j 1 T, line (48) by Assumption 1, line (49) from the inequality min{1, x} min{1, x}q xq for q [0, 1], line (50) from Lemma 12, and line (51) from Lemma 13. Graph-Triggered Rising Bandits We now proceed on bounding the second term: Cm CG:|Cm|>1 1, (2ti,j 2(j 1) + hi,ti,j 1)(t I i,j 1 t I i,j 2hi,t+1) 2hi,t γi(t I i,(j 1) 2hi,ti,j +1) Cm CG:|Cm|>1 1, T e NCm,T ϵ(j 1) γi(t I i,j 2 ϵ(j 1) ) Cm CG:|Cm|>1 1, T e NCm,T ϵ(j 1) γi( (1 2ϵ)j ) Cm CG:|Cm|>1 γi( (1 2ϵ)j ) Cm CG:|Cm|>1 j=3 γi( (1 2ϵ)j ) z 1 z Cm CG:|Cm|>1 ϵ(Ni,T 1) X (1 2ϵ)Ni,T X j= 3(1 2ϵ) γi(j) z 1 z T z(1 + log(ϵT))z 1 Cm CG:|Cm|>1 (1 2ϵ)Ni,T X j= 3(1 2ϵ) γi(j) z 1 z T z(1 + log(ϵT))z 1 Cm CG:|Cm|>1 e N z Cm,T |Cm|z (1 2ϵ)Ni,T X j= 3(1 2ϵ) γi(j) z 1 z T z(1 + log(ϵT))z 1 Cm CG:|Cm|>1 e N z Cm,T |Cm|Υν T 2z(1 + log(ϵT))z 1 Cm CG:|Cm|>1 |Cm|Υν where line (52) follows from the bias bound of Lemma 9, line (53) is obtained from bounding (2ti,j 2(j 1) + hi,ti,j 1)(t I i,j 1 t I i,j 2hi,t+1) 2T e Ni,T and using the definition of hi,t, line (54) derives from observing that γi(ti,j) γi(j) for Assumption 1, line (55) from the inequality min{1, x} min{1, x}z xz for z [0, 1/2], line (56) is obtained from H older s inequality with exponents 1 z 1 and 1 1 z 1 respectively, line (57) is an application of Lemma 12 to independently to both inner summations, line (58) derives from bounding the harmonic sum, i.e., P ϵ(Ni,T 1) 2ϵ 1 j 1 + log(ϵ(Ni,T 1)) 1 + log(ϵT), line (59) follows from Jensen s inequality, line (60) is obtained from Lemma 13, line (61) by bounding e Ni,T T. By recalling q = z 1 z [0, 1], we obtain: 2q 1+q (1 + log(ϵT)) Cm CG:|Cm|>1 |Cm|Υν , q 1 1+q . Theorem 6 (Regret Upper Bound for R- -UCB in Block-Diagonal Matrices). Let (ν, G) be an instance of the GTRB problem, where G Bek. Let hi,t = ϵNi,t 1 for ϵ (0, 1/2) and δt = t α for α > 2. Then, Algorithm 3 suffers an expected regret bounded as: Rν,G(R- -UCB) Graph-Triggered Rising Bandits min q [0,1] | {z } (A) Variance Contribution | {z } (B) Rested Bias Contribution Cm CG:|Cm|>1 |Cm|Υν | {z } (C) Restless Bias Contribution where k1 is the number of cliques in G containing only one action. Proof. Let us define the good events Et = T i [k] Ei,t that correspond to the event in which all confidence intervals hold: Ei,t := n bµhi,t i (t) eµhi,t i (t) βhi,t i (t) o i [T], i [k]. We have to analyze the following expression: Rν,G,T (DR-BG-UB) = E t=1 µi t (t) µIt(t) where i t arg maxi C ν,G,T µi(t) for all t = 1. 
We decompose according to the good events Et: Rν,G,T (πDR-BG-UB) = t=1 E µi t (t) µIt(t) 1{Et} + t=1 E µi t (t) µIt(t) 1{ Et} t=1 E µi t (t) µIt(t) 1{Et} + t=1 E [1{ Et}] , where we exploited µi t (t) µIt(t) 1 in the inequality. Now, we bound the second summation, recalling that α > 2: t=1 E [1{ Et}] = t=1 P ( Et) 1 + t=2 P ( Ei,t) , where the first inequality is obtained with P( E1) 1 and the second with a union bound over [k]. Recalling P( Ei,t) was bounded in Lemma 5, we bound the summation with the integral to get: t=2 P ( Ei,t) X t=2 2t1 α 2k Z + x=1 x1 αdx = 2k α 2. From now on, we will proceed the analysis under the good event Et, recalling that Bi(t) bµhi,t i (t) + βhi,t i (t). Let t [T], and we exploit the optimism, i.e., Bi t (t) BIt(t): µi (t) µIt(t) + BIt(t) BIt(t) min 1, µi t (t) Bi t (t) | {z } 0 +BIt(t) µIt(t) min {1, BIt(t) µIt(t)} . Now, we work on the term inside the minimum: BIt(t) µIt(t) = bµ h It,t It (t) + β h It,t It (t) µIt(t) (62) Graph-Triggered Rising Bandits eµ h It,t It (t) µIt(t) | {z } (a) + 2β h It,t It (t) | {z } (b) where line (62) follows from the definition of Bi(t) and line (63) from the good event Et. We make use of Lemma 10 and Lemma 14 to bound the summations over t of (a) and (b), respectively. Putting all together, we obtain: Rν,G,T (DR-BG-UB) 1 + 2k α 2 + 5k + k ϵ (2σT) 2 3 (10α log T) 2q 1+q (1 + log(ϵT)) + 2k + k1T q 1 1 2ϵ 2q 1+q (1 + log(ϵT)) Cm CG:|Cm|>1 |Cm|Υν , q 1 1+q . Lemma 11 (Bound on Estimator s Cumulative Bias for General Matrices). Let {It}t=1 be a sequence of actions. For every action i [k], every round t [T], let window width hi,t = ϵNi,t 1 . Let G {0, 1}k k, then for every q [0, 1], we have t=1 min n 1, eµ h It,t It (t) µIt( e NIt,t) o 2k + k1T q 1 1 2ϵ 2q 1+q (1 + log(ϵT)) (1 2ϵ) T k k1 , q 1 1+q , (64) where k1 k is the number of arms having degree of 1, i.e., k1 := |{i [k] : deg(i) = 1}|. Proof. The proof follows similar steps as Lemma 10. We decide to split arms based on their degree, in particular we bound separately the bias due to arms having degree of 1 (i.e., they only are triggered by themselves). t=1 min n 1, eµ h It,t It (t) µIt( e NIt,t) o i [k] deg (i)=1 j=3 min n 1, eµ hi,ti,j i (ti,j) µi(j) o i [k] deg (i)>1 j=3 min n 1, eµ hi,ti,j i (ti,j) µi(t I i,j) o We start from bounding the first term: i [k] deg (i)=1 j=3 min 1, (2ti,j 2(j 1) + hi,ti,j 1)2hi,t 2hi,t γi(ti,(j 1) 2hi,ti,j +1) (65) i [k] deg (i)=1 j=3 min n 1, Tγi(t I i,j 2 ϵ(j 1) ) o (66) Graph-Triggered Rising Bandits i [k] deg (i)=1 j=3 min {1, Tγi( (1 2ϵ)j )} (67) i [k] deg (i)=1 j=3 γi( (1 2ϵ)j )q (68) i [k] deg (i)=1 (1 2ϵ)Ni,T X j=3 3(1 2ϵ) γi(j)q (69) i [k] deg (i)=1 (1 2ϵ)Ni,T X j=3 3(1 2ϵ) γi(j)q (70) k1T q 1 1 2ϵ where line (65) follows from Lemma (9) and the fact that, for cliques with a single arm, internal times are equivalent to the number of pulls (i.e., t I i,Ni,t 1 t I i,Ni,t 1 2h+1 = 2h), line (66) follows from Assumption 1, by hi,ti,j = ϵNi,ti,j 1 and by bounding 2ti,j 2(j 1) + hi,ti,j 1 T, line (68) by Assumption 1, line (69) from the inequality min{1, x} min{1, x}q xq for q [0, 1], line (70) from Lemma 12, and line (71) from Lemma 13. As a trivial consequence of Definition 1, we observe that t I i,Ni,t t I,U i,Ni,t = j=1 (GIt,i GU It,i) 0. As a consequence of this, we have that, for every i and for every t: e Ni,t e N U i,t, (72) where e N U i,t := e i ( GU) Nt. 
We now proceed on bounding the second term: i [k] deg (i)>1 1, (2ti,j 2(j 1) + hi,ti,j 1)(t I i,j 1 t I i,j 2hi,t+1) 2hi,t γi(t I i,(j 1) 2hi,ti,j +1) i [k] deg (i)>1 1, T e Ni,T ϵ(j 1) γi(t I i,j 2 ϵ(j 1) ) i [k] deg (i)>1 1, T e Ni,T ϵ(j 1) γi( (1 2ϵ)j ) CU m C GU |CU m|>1 1, T e N U Cm,T ϵ(j 1) γi( (1 2ϵ)j ) CU m C GU |CU m|>1 i CU m ( e N U Cm,T )z Ni,T X γi( (1 2ϵ)j ) Graph-Triggered Rising Bandits CU m C GU |CU m|>1 i CU m ( e N U Cm,T )z j=3 γi( (1 2ϵ)j ) z 1 z CU m C GU |CU m|>1 i CU m ( e N U Cm,T )z ϵ(Ni,T 1) X (1 2ϵ)Ni,T X j= 3(1 2ϵ) γi(j) z 1 z T z(1 + log(ϵT))z 1 CU m C GU |CU m|>1 i CU m ( e N U Cm,T )z (1 2ϵ)Ni,T X j= 3(1 2ϵ) γi(j) z 1 z T z(1 + log(ϵT))z 1 CU m C GU |CU m|>1 (1 2ϵ)Ni,T X j= 3(1 2ϵ) γi(j) z 1 z T 2z(1 + log(ϵT))z 1 CU m C GU |CU m|>1 $ e N U Cm,T |CU m| where line (73) follows from the bias bound of Lemma 9, line (74) is obtained from bounding (2ti,j 2(j 1) + hi,ti,j 1)(t I i,j 1 t I i,j 2hi,t+1) 2T e Ni,T and using the definition of hi,t, line (75) derives from observing that γi(ti,j) γi(j) for Assumption 1, line (76) follows (72) and a decomposition of the pulls over the cliques of GU, line (77) from the inequality min{1, x} min{1, x}z xz for z [0, 1/2], line (78) is obtained from H older s inequality with exponents 1 z 1 and 1 1 z 1 respectively, line (79) is an application of Lemma 12 to independently to both inner summations, line (80) derives from bounding the harmonic sum, i.e., P ϵ(Ni,T 1) 2ϵ 1 j 1 + log(ϵ(Ni,T 1)) 1 + log(ϵT), line (81) follows from Jensen s inequality and by bounding e N U CU m,T T, line (82) is obtained from Lemma 13. By recalling q = z 1 z [0, 1], we obtain: 2q 1+q (1 + log(ϵT))z 1 CU m C GU |CU m|>1 Theorem 7 (Regret Upper Bound for R- -UCB in General Matrices). Let (ν, G) be an instance of the GTRB problem, where G {0, 1}k k. Let hi,t = ϵNi,t 1 for ϵ (0, 1/2) and δt = t α for α > 2. Then, Algorithm 3 suffers an expected regret bounded as: Rν,G(R- -UCB) min q [0,1] (σT) 2 3 + T q k1Υν CU m |CU m|Υν |CU m|, q 1 1+q )! where GU is the minimal super-matrix of G. Proof. The proof follows similar steps of the proof of Theorem 6, but uses Lemma 11 instead of Lemma 10 to bound cumulative estimator s bias. Graph-Triggered Rising Bandits Let us define the good events Et = T i [k] Ei,t that correspond to the event in which all confidence intervals hold: Ei,t := n bµhi,t i (t) eµhi,t i (t) βhi,t i (t) o i [T], i [k]. We have to analyze the following expression: Rν,G,T (DR-BG-UB) = E t=1 µi t (t) µIt(t) where i t arg maxi C ν,G,T µi(t) for all t [T]. We decompose according to the good events Et: Rν,G,T (DR-BG-UB) = t=1 E µi t (t) µIt(t) 1{Et} + t=1 E µi t (t) µIt(t) 1{ Et} t=1 E µi t (t) µIt(t) 1{Et} + t=1 E [1{ Et}] , where we exploited µi t (t) µIt(t) 1 in the inequality. Now, we bound the second summation, recalling that α > 2: t=1 E [1{ Et}] = t=1 P ( Et) 1 + t=2 P ( Ei,t) , where the first inequality is obtained with P( E1) 1 and the second with a union bound over [k]. Recalling P( Ei,t) was bounded in Lemma 5, we bound the summation with the integral to get: t=2 P ( Ei,t) X t=2 2t1 α 2k Z + x=1 x1 αdx = 2k α 2. From now on, we will proceed the analysis under the good event Et, recalling that Bi(t) bµhi,t i (t) + βhi,t i (t). Let t = 1, and we exploit the optimism, i.e., Bi t (t) BIt(t): µi (t) µIt(t) + BIt(t) BIt(t) min 1, µi t (t) Bi t (t) | {z } 0 +BIt(t) µIt(t) min {1, BIt(t) µIt(t)} . 
Now, we work on the term inside the minimum: BIt(t) µIt(t) = bµ h It,t It (t) + β h It,t It (t) µIt(t) (83) eµ h It,t It (t) µIt(t) | {z } (a) + 2β h It,t It (t) | {z } (b) where line (83) follows from the definition of Bi(t) and line (84) from the good event Et. We make use of Lemma 11 and Lemma 14 to bound the summations over t of (a) and (b), respectively. Putting all together, we obtain: Rν,G,T (R- -UCB) 1 + 2k α 2 + 5k + k ϵ (2σT) 2 3 (10α log T) Graph-Triggered Rising Bandits + 2k + k1T q 1 1 2ϵ 2q 1+q (1 + log(ϵT)) CU m C GU |CU m|>1 , q 1 1+q . C. Technical Lemmas for Stochastic Rising Bandits In this appendix, we report some useful technical lemmas from the literature of stochastic rising bandits that will play a role in the results of Section B. Lemma 12 (Lemma C.1 of Metelli et al. 2022). Let M 3, and let f : N R, and β (0, 1). Then it holds that: j=3 f( βj ) 1 l= 3β f(l). Proof. We simply observe that the minimum value of βj is 3β and its maximum value is βM . Each element βj changes value at least one time every l 1 β m times. Lemma 13 (Lemma C.2 of Metelli et al. 2022). Under Assumption 1, it holds that: max (Ni,T )i [k] Ni,T 0,P i [k] Ni,T =T l=1 γi(l)q kΥν Lemma 5 (Concentration of Estimator, adapted from Metelli et al. 2022). For every arm i [k], every round t [T], and window width 1 h j Ni,t 1 βh i (t, δ) := σ(t Ni,t 1 + h 1) Then, if the window size depends on the number of pulls only hi,t = h(Ni,t 1) and if δt = t α for some α > 2, it holds for every round t [T] that: P bµhi,t i (t) eµhi,t i (t) > βhi,t i (t, δt) 2t1 α. Proof Sketch. Using a Doob s optional skipping argument (Doob, 1953; Bubeck et al., 2008), and noting that, at round t, t I i,l is a stopping time for every arm i [k] and pull number l {1, . . . , Ni,t 1} w.r.t. the filtration Fτ 1 = σ(I1, X1, . . . , Iτ 1, Xτ 1, Iτ), we can proceed to prove this lemma as in Metelli et al. (2022) also for GTRB. Lemma 14 (Bound on Estimator s Variance, Metelli et al. (2022), Theorem 4.4). Let (It)t [T ] be a sequence of actions such that bµ h It,t It (t) eµ h It,t It (t) β h It,t It (t, t α), t [T], (85) where α > 2. For every action i [k], every round t [T], let window width hi,t = ϵNi,t 1 . Then, we have t=1 min n 1, 2β h It,t It (t, t α) o k 3 + 1 ϵ (2σT) 2 3 (10α log T) 1 3 . (86) Graph-Triggered Rising Bandits 0 1 2 3 4 5 104 µ0 µ1 µ2 µ3 µ4 (a) ν1, µi = mi(1 e κint). µ0 µ1 µ2 µ3 µ4 (b) ν2, µi = min{κint, mi}. 0 1 2 3 4 5 104 (c) ν3, µi = mi(1 e κint). Figure 2. Sets of functions used in the experimental campaign over deterministic settings with block-diagonal adjacency matrices. Under each figure, we report the family of analytical functions used for the instance construction, where κi > 0 and mi [0, 1]. D. Experiments In this appendix, we provide an experimental campaign to validate the proposed algorithmic solutions from an empirical perspective. We start with the deterministic setting: in Appendix D.1 we evaluate DR-BG-UB in 15 GTRB instances, varying both the functions and the adjacency matrices; in Appendix D.2 we evaluate DR-G-UB in 3 GTRB instances, but varying the sub-matrix used in the Algorithm 2 routine. Finally, we evaluate R- -UCB in 10 stochastic GTRB instances, varying both the functions and the adjacency matrices, and comparing its performances to a baseline from the literature, Sliding Window UCB (Garivier & Moulines, 2011). 
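To make the construction of these instances concrete, the following minimal Python sketch (not part of the paper's code; the values of κ_i and m_i below are placeholders) implements the two families of expected-reward functions reported in Figure 2, µ_i = m_i(1 − e^{−κ_i n}) and µ_i = min{κ_i n, m_i}, and numerically checks that they are non-decreasing with non-increasing increments, i.e., that they have the rising shape required by Assumption 1.

```python
import numpy as np

# Illustrative sketch: the two families of expected-reward functions used to build the
# experimental instances. The specific kappa and m values are placeholders.

def mu_exp(n, m, kappa):
    """Saturating exponential reward: m * (1 - exp(-kappa * n))."""
    return m * (1.0 - np.exp(-kappa * n))

def mu_lin(n, m, kappa):
    """Piecewise-linear reward: grows linearly until it caps at m."""
    return np.minimum(kappa * n, m)

def satisfies_rising_shape(mu_values, tol=1e-12):
    """Numerically check that a sampled reward curve is non-decreasing with
    non-increasing increments gamma(n) = mu(n+1) - mu(n)."""
    gamma = np.diff(mu_values)
    return bool(np.all(gamma >= -tol) and np.all(np.diff(gamma) <= tol))

if __name__ == "__main__":
    n = np.arange(0, 50_000)
    for curve in (mu_exp(n, m=0.9, kappa=1e-4), mu_lin(n, m=0.9, kappa=1e-4)):
        print(satisfies_rising_shape(curve))  # expected: True, True
```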
We decided not to evaluate R- -UCB under general adjacency matrices since there is no feasible way to compute a clairvoyant, and no reasonable sensitivity analysis can be conducted on the algorithm's inputs as we did in the deterministic setting by studying the impact of the specified sub-matrix.

D.1. Deterministic Setting with Block-Diagonal Matrices

This section assesses the empirical performance of DR-BG-UB in a synthetic environment. To do so, we propose a total of 15 different instances of Graph-Triggered Rising Bandits with block-diagonal matrices and no noise in the reward generation process.

Setting. In Figure 2, we report three different sets of functions satisfying Assumption 1, {ν1, ν2, ν3}, of 5, 5, and 15 arms, respectively. Some remarks are in order. The total increment exhibits different behaviors depending on the set of functions: indeed, Υν1 = O(log T) and Υν3 = O(log T), while Υν2 = O(T). This was done deliberately to stress the algorithm on these two corner cases and assess its performance on both. Moreover, in ν1 we can see one function, namely µ3, dominating all the others. This is a corner case in which, whatever the underlying graph, all the optimal policies coincide. Instead, in ν2, we observe that the optimal policy in the restless scenario would include pulling 4 different actions across the trial. Finally, ν3 is aimed at assessing the performance of Algorithm 1 when the action space is larger.

We now introduce a compact notation for block-diagonal matrices. Let G ∈ {0,1}^{k×k} be a block-diagonal matrix with k̃ distinct blocks. Then, we indicate the matrix by means of its block sizes as G := {b1, . . . , b_k̃}, with the convention that the first block is the one in the upper-left corner, and so on. Note that Σ_{i ∈ [k̃]} b_i = k. Together with this notation, we define two sets of block-diagonal matrices, namely B1 and B2, composed of 5 matrices each:
B1 = {I5, {2, 1, 1, 1}, {2, 1, 2}, {3, 2}, 1_{5×5}},
B2 = {I15, {3, 3, 3, 3, 3}, {5, 5, 5}, {5, 10}, 1_{15×15}}.
Note that the definition of sub-matrix partially orders the matrices increasingly, i.e., every matrix is a sub-matrix of the following one. When referring to a set of increasing matrices, we will indicate with G0 the minimum (e.g., G0 = I5), and so on until the maximum (e.g., G4 = 1_{5×5}). Set B1 is composed of sequentially nested matrices, while the matrices of set B2 are not all nested. Both sets have a decreasing number of blocks.

Figure 3. Cumulative regrets obtained by DR-BG-UB. The algorithm faces every set of functions 5 times under a different adjacency matrix, from the sparsest (G0 = I) to the complete matrix (G4 = 1). Panels: (a) ν1, Gi ∈ B1, T = 5·10^4; (b) ν2, Gi ∈ B1, T = 7.5·10^4; (c) ν3, Gi ∈ B2, T = 5·10^4.

Figure 4. Sets of functions used in the experimental campaign over deterministic settings with non-block-diagonal adjacency matrices. Under each figure, we report the family of analytical functions used for the instance construction, where κi > 0 and mi ∈ [0, 1]. Panels: (a) ν4, µi = mi(1 − e^{−κi n_t}); (b) ν5, µi = min{κi n_t, mi}; (c) ν6, µi = mi(1 − e^{−κi n_t}).

Together with the reward functions, the matrices form the 15 instances as follows: (ν1, G)_{G ∈ B1}, (ν2, G)_{G ∈ B1}, and (ν3, G)_{G ∈ B2}. Thus, every set of functions will be evaluated on 5 different block-diagonal matrices.
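As a complement to the compact block-size notation, the following sketch (illustrative only; the function names are ours) builds the block-diagonal adjacency matrices of B1 from their block sizes, assuming contiguous blocks ordered from the upper-left corner, and verifies the sequential nesting of B1 under the entrywise sub-matrix relation.

```python
import numpy as np

def block_diagonal(block_sizes):
    """Build the adjacency matrix of a block-diagonal graph from its block sizes."""
    k = sum(block_sizes)
    G = np.zeros((k, k), dtype=int)
    start = 0
    for b in block_sizes:
        G[start:start + b, start:start + b] = 1  # each block is a clique (all-ones sub-block)
        start += b
    return G

def is_submatrix(G_sub, G):
    """Entrywise dominance: G_sub is a sub-matrix of G if every edge of G_sub is also in G."""
    return bool(np.all(G_sub <= G))

# The set B1 of 5x5 block-diagonal matrices, from the identity to the all-ones matrix.
B1 = [block_diagonal(b) for b in ([1] * 5, [2, 1, 1, 1], [2, 1, 2], [3, 2], [5])]

# Sanity check of the sequential nesting of B1: each matrix is a sub-matrix of the next one.
print(all(is_submatrix(B1[i], B1[i + 1]) for i in range(len(B1) - 1)))  # expected: True
```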
Results. In Figure 3, we report the cumulative regrets obtained by DR-BG-UB on the three instances previously described, setting T to 50,000, 75,000, and 50,000, respectively. Since DR-BG-UB is an anytime algorithm, for every time t ∈ [T] we computed the cumulative reward achieved by the optimal policy for that specific time horizon and then tracked the algorithm's cumulative regret at every time. Some comments are in order. The instances corresponding to purely restless settings (i.e., G4) are the ones in which DR-BG-UB achieves the best performance. This is expected, since the restless contribution to the regret upper bound is considerably lower than the contribution given by rested arms (Theorem 3). This phenomenon is even more evident when looking at how the regret progresses as the number of blocks increases. The highest cumulative regret is always observed when the matrix is the identity (i.e., G0). However, the cumulative regret always exhibits a sub-linear shape, thus validating the theoretical findings of Theorem 3. Finally, the cumulative regret exhibits a non-concave behavior (see, e.g., Figure 3b): this is expected, since the clairvoyant is computed for every possible t and the optimal policy may drastically change from one time to the next.

D.2. Deterministic Setting with General Matrices

This section assesses the empirical performance of DR-G-UB in a synthetic environment. To do so, we propose a total of 3 different instances of Graph-Triggered Rising Bandits with non-block-diagonal matrices and no noise in the reward generation process. Moreover, we analyze the behavior of DR-G-UB under different choices of the employed sub-matrix, also deviating from the maximal sub-matrix choice prescribed by the pseudo-code of Algorithm 2.

Setting. In Figure 4, we report three different sets of functions satisfying Assumption 1, {ν4, ν5, ν6}, of 5, 5, and 15 functions, respectively. These sets of functions are very similar to the ones in the previous section; thus, all the remarks made before still apply.

Figure 5. Cumulative rewards obtained by DR-G-UB. The algorithm faces every set of functions 4 times under the same adjacency matrix, while the employed sub-matrix G^L_i changes, from the sparsest (G^L_0 = I) to the maximal sub-matrix (G^L_3). Panels: (a) ν4, Ga, T = 5·10^4; (b) ν5, Ga, T = 7.5·10^4; (c) ν6, Gb, T = 5·10^4.

We define two non-block-diagonal matrices, namely Ga ∈ {0,1}^{5×5} and Gb ∈ {0,1}^{15×15}:
1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1
Together with the sets of functions, these will constitute the 3 instances as follows: (ν4, Ga), (ν5, Ga), and (ν6, Gb). The goal of this section is mainly to study the behavior of DR-G-UB under different choices of the sub-matrix and to validate that the maximal sub-matrix is empirically the best choice. Thus, we will propose 4 possible sub-matrices of Ga and Gb, respectively, and run Algorithm 2 for every possible choice.
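Before listing the candidate sub-matrices, the following sketch (illustrative only) shows how one can check that a block partition induces a valid block-diagonal sub-matrix of a general adjacency matrix, i.e., that every block is a clique of the underlying graph. The 5×5 matrix used below is our row-by-row reading of Ga from the display above and should be treated as an assumption of the sketch, not as an official artifact.

```python
import numpy as np

# Assumed row-by-row reading of the 5x5 matrix Ga (an assumption of this sketch).
G_a = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
])

def is_valid_block_submatrix(G, block_sizes):
    """True if contiguous blocks of the given sizes are all cliques in G, so that the
    induced block-diagonal matrix is an entrywise sub-matrix of G."""
    start = 0
    for b in block_sizes:
        if not np.all(G[start:start + b, start:start + b] == 1):
            return False
        start += b
    return True

# Candidate block partitions in the compact block-size notation (cf. the set B^L_a defined next).
for blocks in ([1, 1, 1, 1, 1], [2, 1, 1, 1], [2, 1, 2], [3, 2]):
    print(blocks, is_valid_block_submatrix(G_a, blocks))
```

Under this reading, all four candidate partitions are valid sub-matrices of Ga, and {3, 2} is the maximal one, consistently with the choice of G^L_3 below.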
We define two sets of sub-matrices, namely B^L_a and B^L_b, s.t.:
B^L_a = {I5, {2, 1, 1, 1}, {2, 1, 2}, {3, 2}}, (87)
B^L_b = {I15, {3, 3, 3, 3, 3}, {5, 5, 5}, {5, 10}}. (88)
We will indicate with G^L_0 the minimum sub-matrix (e.g., G^L_0 = I5), and so on until the maximal sub-matrix (e.g., G^L_3 = {3, 2}), which is here uniquely defined.

Results. When the matrix is non-block-diagonal, computing the clairvoyant is NP-hard. Thus, in Figure 5, we report the cumulative rewards obtained by DR-G-UB on the three instances defined above when varying the employed sub-matrix, setting T to 50,000, 75,000, and 50,000, respectively. The cumulative regret is always sub-linear, independently of the choice of the employed sub-matrix. However, the best performance is obtained using the maximal sub-matrix, and this is an expected consequence of Theorem 4. When the provided sub-matrix has fewer blocks, the cumulative reward improves, even though the true matrix underlying the process is kept fixed. This outcome validates the design choice of Algorithm 2 to use the maximal sub-matrix.

Figure 6. Cumulative regrets obtained by R- -UCB and SW-UCB. The algorithms face the same set of functions ν1 for 5 times under different adjacency matrices; Gi moves from the sparsest (G0 = I5) to the complete matrix (G4 = 1_{5×5}). For each instance, we performed 20 trials and report mean ± std. Panels: (a) G0; (b) G1; (c) G2; (d) G3; (e) G4, all with T = 7.5·10^4.

D.3. Stochastic Setting with Block-Diagonal Matrices and Comparison with Sliding-Window UCB

In this section, we validate the need for ad-hoc algorithmic solutions to solve the Graph-Triggered Rising Bandits problem. In particular, we evaluate the performance of Algorithm 3 in a stochastic setting with block-diagonal adjacency matrices and compare it to that of Sliding Window UCB (SW-UCB for short, Garivier & Moulines, 2011). To the authors' knowledge, no existing algorithm from the literature deals appropriately with the GTRB setting. So, as a comparison baseline, we decided to use SW-UCB, since it is one of the most robust and well-known algorithms from the non-stationary bandit literature.

Setting. We evaluate the two algorithms on a total of 10 instances, corresponding to the first 10 instances of the experimental campaign in Appendix D.1 (rescaled), i.e., (ν1, G)_{G ∈ B1} and (ν2, G)_{G ∈ B1}. The hyper-parameters of SW-UCB have been set according to the original paper (Garivier & Moulines, 2011) and then optimized to obtain the smallest regret upper bound. Instead, the hyper-parameters of R- -UCB have been fixed for all experiments, using ϵ = 0.1 and α = 3. The time horizon was set to T = 75,000, and the optimal policy's cumulative reward has been computed for every time t ∈ [T]. We perform 20 trials for each setting, varying the seed of the additive, zero-mean Gaussian noise generator, with σ = 0.1.

Results. In Figure 6, we report the average cumulative regrets obtained by the two algorithms on the first 5 instances, together with their standard deviations. We can observe that, in most of these, the performance of the two algorithms is comparable, even if SW-UCB tends to achieve a lower regret; however, the cumulative regret of SW-UCB exhibits a linear behavior. We remark that the set of functions ν1 contains a dominant function, µ3; thus, the optimal policy prescribes to always play that arm, independently of the matrix.
In this kind of setting, standard algorithms for non-stationary bandits can still achieve satisfactory performance in practice. However, as we can observe in Figure 7, which contains the regrets for the second half of the instances, the performance of SW-UCB quickly deteriorates w.r.t. R- -UCB as the optimal policy becomes less trivial. Indeed, as the adjacency matrix increases (i.e., becomes denser), the optimal policy pulls a larger set of different arms, and standard techniques for non-stationary bandits fail to model this kind of interaction among the arms. The results of this section highlight the need for a specific algorithm to deal with GTRB problems, as standard algorithms from the non-stationary bandit literature may perform well in simpler instances but severely deteriorate in harder ones. Finally, this section also assesses the good performance of R- -UCB in stochastic settings. Indeed, all the cumulative regrets exhibit a sub-linear shape.

Figure 7. Cumulative regrets obtained by R- -UCB and SW-UCB. The algorithms face the same set of functions ν2 for 5 times under different adjacency matrices; Gi moves from the sparsest (G0 = I5) to the complete matrix (G4 = 1_{5×5}). For each instance, we performed 20 trials and report mean ± std. Panels: (a) G0; (b) G1; (c) G2; (d) G3; (e) G4, all with T = 7.5·10^4.
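To complement the description of the stochastic experiments, the following minimal sketch (illustrative only; all parameter values are placeholders, and the greedy baseline below is neither R- -UCB nor SW-UCB) simulates one GTRB round loop with a block-diagonal graph: pulling an arm advances the trigger counter of every arm connected to it, the expected rewards rise with the number of triggers, and rewards are observed with additive zero-mean Gaussian noise with σ = 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
k, T, sigma = 5, 10_000, 0.1

# Block-diagonal graph with blocks {3, 2}: arms in the same block trigger each other.
G = np.zeros((k, k), dtype=int)
G[:3, :3] = 1
G[3:, 3:] = 1

m = rng.uniform(0.5, 1.0, size=k)        # plateaus m_i (placeholder values)
kappa = rng.uniform(1e-4, 1e-3, size=k)  # growth rates kappa_i (placeholder values)

def mu(i, n):
    # Expected reward of arm i after n triggers: m_i * (1 - exp(-kappa_i * n)).
    return m[i] * (1.0 - np.exp(-kappa[i] * n))

triggers = np.zeros(k)   # graph-triggered counters: pulls of any connected arm count
pulls = np.zeros(k)
emp_mean = np.zeros(k)

for t in range(1, T + 1):
    # Explore each arm once, then play greedily on the empirical means (naive baseline).
    i = t - 1 if t <= k else int(np.argmax(emp_mean))
    reward = mu(i, triggers[i]) + rng.normal(0.0, sigma)
    pulls[i] += 1
    emp_mean[i] += (reward - emp_mean[i]) / pulls[i]
    triggers += G[i]     # pulling arm i advances every arm j with G[i, j] = 1

print(pulls, np.round(emp_mean, 3))
```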