# Stochastic Rising Bandits

Alberto Maria Metelli 1, Francesco Trovò 1, Matteo Pirola 1, Marcello Restelli 1

Abstract

This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing a regret bound depending on the properties of the instance and, under certain circumstances, of order $\widetilde{O}(T^{2/3})$. Finally, we empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset, illustrating the effectiveness of the proposed approaches.

1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy. Correspondence to: Alberto Maria Metelli.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

The classical stochastic MAB framework (Lattimore & Szepesvári, 2020) has been successfully applied to a number of applications, such as advertising, recommendation, and networking. MABs model the scenario in which a learner sequentially selects (a.k.a. pulls) an option (a.k.a. arm) in a finite set and receives a feedback (a.k.a. reward) corresponding to the chosen option. The goal of online learning algorithms is to guarantee the no-regret property, meaning that the loss due to not knowing the best arm grows sublinearly with the number of pulls. One of the assumptions that allows designing no-regret algorithms consists in requiring that the payoff (a.k.a. expected reward) provided by the available options is stationary, i.e., rewards come from a fixed distribution. However, the arms' payoff may change over time due to intrinsic modifications of the arms or the environment. A no-regret approach is offered by adversarial algorithms, in which no assumption on the nature of the reward is required. It has been shown that, in this setting, it is possible to design effective algorithms, e.g., EXP3 (Auer et al., 1995). However, in practice, their performance is unsatisfactory because the non-stationarity of real-world cases is far from being adversarial. Instead, non-stationarity is explicitly accounted for by a surge of methods that consider either abrupt changes (e.g., Garivier & Moulines, 2011), smoothly changing environments (e.g., Trovò et al., 2020), or bounded reward variation (e.g., Besbes et al., 2014). While in non-stationary MABs the arms' payoff changes naturally over time, a different setting arises when the payoff changes as an effect of pulling the arm. This is the case of rotting bandits (Levine et al., 2017; Seznec et al., 2019), in which the payoffs of the arms are monotonically non-increasing over the pulls, modeling degradation phenomena.
Knowing the monotonicity property allows deriving more specialized algorithms, which exploit the characteristics of the process to further decrease the regret w.r.t. the unrestricted case. Notably, the symmetric problem of monotonically non-decreasing payoffs cannot be addressed with the same approaches. Indeed, it was shown that it represents a significantly more complex problem, even for deterministic arms (Heidari et al., 2016). In this non-decreasing setting, a common assumption is the concavity of the payoff function, which defines the rising bandits setting (Li et al., 2020). The goal of this paper is to study the stochastic MAB problem when the arms' payoff is monotonically non-decreasing. This setting arises in several real-world sequential selection problems. For instance, suppose we have to choose among a set of optimization algorithms to maximize an unknown stochastic concave function. In this setting, we expect that all the algorithms progressively increase (on average) the function value and eventually converge to an optimal value, possibly at different speeds. Therefore, we wonder to which candidate algorithm we should assign the available resources (e.g., computational power or samples) so as to identify the one that converges fastest to the optimum. This online model selection process can be modeled as a rested MAB (Tekin & Liu, 2012), like the rotting bandits (Levine et al., 2017), but with non-decreasing payoffs. Indeed, each optimization algorithm (arm) and the corresponding function value do not evolve if we do not select (pull) it. Another example that shows a non-decreasing expected reward is the selection of athletes for competitions. Athletes train in parallel and increase (on average) their performance. However, if participation in competitions is allowed to one athlete only, the trainer should select the one who has achieved the best performance so far. This problem is akin to the restless case (Tekin & Liu, 2012), like non-stationary bandits (Besbes et al., 2014), but with the additional assumption that payoffs are non-decreasing. Indeed, the athletes (arms) evolve even if they are not selected (pulled) to compete.

Original Contribution  In this paper, we study the stochastic rising bandits, i.e., stochastic bandits in which the payoffs are monotonically non-decreasing and concave, in both the restless and rested formulations. More specifically: (i) we show that the rested bandit with non-decreasing payoffs is non-learnable, i.e., the loss due to learning is linear in the number of pulls, unless additional assumptions on the payoff functions are enforced (e.g., concavity); (ii) we design R-ed-UCB and R-less-UCB, optimistic algorithms for the rising rested and restless bandits; (iii) we show that R-ed-UCB and R-less-UCB suffer an expected regret that depends on the payoff function profile and, under some conditions, is of order $\widetilde{O}(T^{2/3})$, where $\widetilde{O}(\cdot)$ disregards logarithmic terms; (iv) we illustrate, using synthetic and real-world data, the effectiveness of our approaches compared with state-of-the-art algorithms for the non-stationary (restless) bandits.
2. Related Works

Restless and Rested Bandits  The rested and restless bandit settings have been introduced by Tekin & Liu (2012) and further developed by Ortner et al. (2012) and Russac et al. (2019) in the restless version and by Mintz et al. (2020) and Pike-Burke & Grunewalder (2019) in the rested one. Originally, the evolution of the payoff was modeled via a suitable process, e.g., a Markov chain with finite state space or a linear regression process. For instance, Wang et al. (2020) propose an optimistic approach based on the estimation of the transition kernel of the underlying chain. More recently, the terms rested and restless have been employed to denote arms whose payoff changes as time passes, for restless ones, or whenever the arm is pulled, for rested ones (Seznec et al., 2019; 2020). That is the setting we target in this work.

Non-Stationary Bandits  Restless bandits without a fixed temporal reward evolution are usually addressed via non-stationary MAB approaches, which include both passive (e.g., Garivier & Moulines, 2011; Besbes et al., 2014; Auer et al., 2019; Trovò et al., 2020) and active (e.g., Liu et al., 2018; Besson et al., 2019; Cao et al., 2019) methods. The former algorithms base their selection criterion on the most recent feedback, while the latter actively try to detect whether a change in the arms' rewards occurred and use only data gathered after the last change. Garivier & Moulines (2011) employ a discounted reward approach (D-UCB) or an adaptive sliding window (SW-UCB), proving a $\widetilde{O}(\sqrt{T})$ regret when the number of abrupt changes is known. Similar results have been obtained by Auer et al. (2019) without knowing the number of changes, at the price of resorting to the doubling trick. Besbes et al. (2014) provide an algorithm, namely Rexp3, a modification of EXP3 (originally designed for adversarial MABs), with a regret bound of $O(T^{2/3})$ under the assumption that the total variation $V_T$ of the arms' expected reward is known. The knowledge of $V_T$ has been removed by Chen et al. (2019) using the doubling trick. Trovò et al. (2020) propose an approach in which a sliding window combined with a Thompson-Sampling-like algorithm provides theoretical guarantees in both abruptly and smoothly changing environments. Nonetheless, in our setting, their result might lead to linear regret for specific instances. Notably, none of the above explicitly use assumptions on the monotonicity of the payoff over time.

Rising Bandits  The rising bandit problem has been tackled in its deterministic version by Heidari et al. (2016) and Li et al. (2020). In Heidari et al. (2016), the authors design an online algorithm to minimize the regret of selecting among a finite set of increasing and concave functions. This study assumes that the learner receives feedback about the true value of the reward function, i.e., no stochasticity is present. In Li et al. (2020), the authors model the problem of parameter optimization for machine learning models as a rising bandit setting. They propose an online algorithm having good empirical performance, still in the case of deterministic rewards. A case where the reward is increasing in expectation (or, equivalently, decreasing in loss), but no longer deterministic, is provided by Cella et al. (2021). However, the payoff follows a given parametric form known to the learner, who estimates its parameters in the best-arm identification and regret-minimization frameworks. The need for knowing the parametric form of the payoff makes these approaches hardly applicable to arbitrary increasing functions.

Corralling Bandits  It is also worth mentioning the corralling bandits (Agarwal et al., 2017; Pacchiano et al., 2020b; Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a; Arora et al., 2021), a setting in which the goal is to minimize the regret of a process choosing among a finite set of bandit algorithms.
This setting, close to online model selection, is characterized by particular assumptions. Indeed, each arm corresponds to a learning algorithm, operating on a bandit, endowed with a (possibly known) regret bound and sometimes requiring additional conditions (e.g., stability).

3. Problem Setting

A K-armed MAB (Lattimore & Szepesvári, 2020) is defined as a vector of probability distributions $\nu = (\nu_i)_{i \in [K]}$, where each $\nu_i$ maps a pair of parameters $(t,n) \in \mathbb{N}^2$ to a probability distribution over $\mathbb{R}$, for every $i \in [K]$, where $[K] = \{1, \dots, K\}$. Let $T \in \mathbb{N}$ be the optimization horizon; at each round $t \in [T]$, the agent selects an arm $I_t \in [K]$ and observes a reward $R_t \sim \nu_{I_t}(t, N_{I_t,t})$, where $N_{i,t} = \sum_{l=1}^{t} \mathbb{1}\{I_l = i\}$ is the number of times arm $i \in [K]$ was pulled up to round $t$. Thus, the reward depends, in general, on the current round $t$ and on the number of pulls $N_{I_t,t} = N_{I_t,t-1} + 1$ of arm $I_t$ up to $t$. For every arm $i \in [K]$, we define its payoff $\mu_i : \mathbb{N}^2 \to \mathbb{R}$ as the expectation of the reward, i.e., $\mu_i(t,n) = \mathbb{E}_{R \sim \nu_i(t,n)}[R]$, and denote the vector of payoffs as $\mu = (\mu_i)_{i \in [K]}$. We assume that the payoffs are bounded in $[0,1]$ and that the rewards are $\sigma^2$-subgaussian, i.e., $\mathbb{E}_{R \sim \nu_i(t,n)}\big[e^{\lambda (R - \mu_i(t,n))}\big] \le e^{\sigma^2 \lambda^2 / 2}$ for every $\lambda \in \mathbb{R}$.

Rested and Restless Arms  We revise the definition of rested and restless arms (Tekin & Liu, 2012). We refer to the definition of Levine et al. (2017) and Seznec et al. (2020), and not to the one of Tekin & Liu (2012), which assumes an underlying Markov chain governing the arms' distributions.

Definition 3.1 (Rested and Restless Arms). Let $\nu$ be a MAB and let $i \in [K]$ be an arm; we say that:
$i$ is a rested arm if, for every round $t \in [T]$ and number of pulls $n \in \mathbb{N}$, we have $\mu_i(t,n) = \mu_i(n)$;
$i$ is a restless arm if, for every round $t \in [T]$ and number of pulls $n \in \mathbb{N}$, we have $\mu_i(t,n) = \mu_i(t)$.
A K-armed bandit is rested (resp. restless) if all of its arms are rested (resp. restless).

Thus, the payoff of a rested arm changes when the arm is pulled and, therefore, it models phenomena that evolve as a consequence of the agent's intervention. Instead, a restless arm is in all regards a non-stationary arm (Besbes et al., 2014), and it is suitable for modeling a natural phenomenon that evolves as time passes, independently of the agent's intervention.

Rising Bandits  We revise the rising bandits notion, i.e., MABs with payoffs non-decreasing and concave as a function of $(t,n)$ (Heidari et al., 2016). Deterministic bandits with non-decreasing payoffs were introduced in (Heidari et al., 2016) with the term improving; in (Li et al., 2020), the term rising was used to denote the improving bandits with concave payoffs (concavity was already employed by Heidari et al. (2016)).

Assumption 3.1 (Non-Decreasing Payoff). Let $\nu$ be a MAB; for every arm $i \in [K]$, number of pulls $n \in \mathbb{N}$, and round $t \in [T]$, the functions $\mu_i(\cdot, n)$ and $\mu_i(t, \cdot)$ are non-decreasing. In particular, we define the increments:
Rested arm: $\gamma_i(n) := \mu_i(n+1) - \mu_i(n) \ge 0$;
Restless arm: $\gamma_i(t) := \mu_i(t+1) - \mu_i(t) \ge 0$.

From an economic perspective, $\gamma_i(\cdot)$ represents the increase of the total return (or payoff) we obtain by adding a factor of production, i.e., pulling the arm (rested) or letting time evolve by one unit (restless). In the next sections, we analyze how the following assumption defines a remarkable class of bandits with non-decreasing payoffs (Heidari et al., 2016).

Assumption 3.2 (Concave Payoff). Let $\nu$ be a MAB; for every arm $i \in [K]$, number of pulls $n \in \mathbb{N}$, and round $t \in [T]$, the functions $\mu_i(\cdot, n)$ and $\mu_i(t, \cdot)$ are concave, i.e.:
Rested arm: $\gamma_i(n+1) - \gamma_i(n) \le 0$;
Restless arm: $\gamma_i(t+1) - \gamma_i(t) \le 0$.

As pointed out by Heidari et al. (2016), the concavity assumption corresponds, in economics, to the decrease of marginal returns that emerges when adding a factor of production, i.e., pulling the arm (rested) or letting time evolve by one unit (restless).
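To make Definition 3.1 and the two assumptions concrete, the sketch below simulates a rested and a restless arm sharing the same non-decreasing concave payoff. The exponential payoff family (which reappears in Section 6), the noise level, and all class and function names are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

# Illustrative concave, non-decreasing payoff: mu(x) = c * (1 - exp(-a * x)).
# Any function satisfying Assumptions 3.1 (non-decreasing) and 3.2 (concave) would do.
def payoff(x, a=0.01, c=0.9):
    return c * (1.0 - np.exp(-a * x))

class RestedArm:
    """Payoff advances only when the arm is pulled: mu_i(t, N_{i,t}) = mu_i(N_{i,t})."""
    def __init__(self, sigma=0.1, rng=None):
        self.n_pulls = 0
        self.sigma = sigma
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, t):
        self.n_pulls += 1
        return payoff(self.n_pulls) + self.rng.normal(0.0, self.sigma)

class RestlessArm:
    """Payoff advances with the round index t, whether or not the arm is pulled."""
    def __init__(self, sigma=0.1, rng=None):
        self.sigma = sigma
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, t):
        return payoff(t) + self.rng.normal(0.0, self.sigma)

# Sanity check of Assumptions 3.1 and 3.2 on the increments gamma(l) = mu(l+1) - mu(l).
mu = payoff(np.arange(1, 1000))
gamma = np.diff(mu)
assert np.all(gamma >= 0)                 # non-decreasing payoff
assert np.all(np.diff(gamma) <= 1e-12)    # non-increasing increments (concavity)
```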
Formally, we call rising a stochastic MAB in which both Assumption 3.1 and Assumption 3.2 hold.

Learning Problem  Let $t \in [T]$ be a round; we denote by $H_t = (I_l, R_l)_{l=1}^{t}$ the history of observations up to $t$. A (non-stationary) deterministic policy is a function $\pi : H_{t-1} \mapsto I_t$ mapping a history to an arm, abbreviated as $\pi(t) := \pi(H_{t-1})$. The performance of a policy $\pi$ in a MAB with payoffs $\mu$ is the expected cumulative reward collected over the $T$ rounds, formally:
$$ J_{\mu}(\pi, T) = \mathbb{E}\Big[ \sum_{t \in [T]} \mu_{I_t}(t, N_{I_t,t}) \Big], $$
where the expectation is computed over the histories. A policy $\pi^*_{\mu,T}$ is optimal if it maximizes the expected cumulative reward: $\pi^*_{\mu,T} \in \operatorname{argmax}_\pi \{ J_{\mu}(\pi, T) \}$. Denoting by $J^*_{\mu}(T) = J_{\mu}(\pi^*_{\mu,T}, T)$ the expected cumulative reward of an optimal policy, the suboptimal policies $\pi$ are evaluated via the expected cumulative regret:
$$ R_{\mu}(\pi, T) = J^*_{\mu}(T) - J_{\mu}(\pi, T). \qquad (1) $$

Problem Characterization  To characterize the problem instance, we introduce the following quantity, namely the cumulative increment, defined for $q \in [0,1]$ and $M \in [T]$ as:
$$ \Upsilon_{\mu}(M, q) = \max_{i \in [K]} \Big\{ \sum_{l=1}^{M} \gamma_i(l)^q \Big\}. $$
The cumulative increment accounts for how fast the payoffs reach their asymptotic value, i.e., become stationary. Intuitively, small values of $\Upsilon_{\mu}(M, q)$ lead to simpler problems, as they are closer to stationary bandits. Table 1 reports some bounds on $\Upsilon_{\mu}(M, q)$ for particular choices of $\gamma_i(l)$ and $q$.

Table 1. $O$ rates of $\Upsilon_{\mu}(M,q)$ in the case $\gamma_i(l) \le f(l)$ for all $i \in [K]$ and $l \in \mathbb{N}$ (see also Lemma C.6).
  $f(l) = e^{-cl}$:  $\Upsilon_{\mu}(M,q) = O\big( e^{-cq} / (1 - e^{-cq}) \big)$
  $f(l) = l^{-c}$:   $\Upsilon_{\mu}(M,q) = O\big( cq/(cq-1) \big)$ if $cq > 1$;  $O(\log M)$ if $cq = 1$;  $O\big( M^{1-cq}/(1-cq) \big)$ if $cq < 1$

When $q = 1$, the cumulative increment resembles the bounded variation $V_T = \sum_{l=1}^{T-1} \max_{i \in [K]} \{\gamma_i(l)\}$ (Besbes et al., 2014), but $\Upsilon_{\mu}(T, 1)$ is smaller than $V_T$, as the maximization over the arms appears outside the summation. In the next sections, we devise and analyze learning algorithms for rested (Section 4) and restless (Section 5) rising bandits. We present optimistic algorithms, whose structure is summarized in Algorithm 1 and parametrized by an exploration index $B_i(t)$ that will be designed case by case.

Algorithm 1  R-l-UCB (l ∈ {less, ed})
  Input: $K$, $(B_i)_{i \in [K]}$
  Initialize $N_i \leftarrow 0$ for all $i \in [K]$
  for $t \in (1, \dots, T)$ do
    Pull $I_t \in \operatorname{argmax}_{i \in [K]} \{ B_i(t) \}$
    Observe $R_t \sim \nu_{I_t}(t, N_{I_t} + 1)$
    Update $B_{I_t}$ and $N_{I_t} \leftarrow N_{I_t} + 1$
  end for
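A minimal Python rendering of the optimistic scheme of Algorithm 1 is sketched below. It assumes arm objects exposing the `pull` interface of the earlier sketch and a user-supplied callback computing the exploration index $B_i(t)$; all names are illustrative.

```python
import numpy as np

def optimistic_bandit(arms, index, T):
    """Generic skeleton of Algorithm 1: at each round pull the arm maximizing B_i(t).

    arms  : list of objects with a pull(t) method returning a reward
    index : callable index(i, t, history) -> B_i(t), designed case by case
            (R-ed-UCB or R-less-UCB indices, defined in Sections 4 and 5)
    """
    K = len(arms)
    pulls = np.zeros(K, dtype=int)       # N_i, number of pulls of each arm
    history = [[] for _ in range(K)]     # rewards of each arm, in pull order
    for t in range(1, T + 1):
        indices = [index(i, t, history) for i in range(K)]
        it = int(np.argmax(indices))     # ties broken arbitrarily
        r = arms[it].pull(t)
        history[it].append(r)
        pulls[it] += 1
    return pulls, history
```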
4. Stochastic Rising Rested Bandits

In this section, we consider the Rising rested bandits (R-ed) setting, in which an arm's expected payoff increases only when the arm is pulled, i.e., $\mu_i(t, N_{i,t}) = \mu_i(N_{i,t})$. We are employing the original definition of rested arms of Levine et al. (2017), in which $\mu_i(n)$ is the payoff of arm $i$ when it is pulled for the $n$-th time.

Oracle Policy  We recall that the oracle constant policy, which always plays at each round $t \in [T]$ the arm maximizing the sum of the payoffs over the horizon $T$, is optimal for the non-decreasing rested bandits.

Theorem 4.1 (Heidari et al., 2016). Let $\pi^c_{\mu,T}$ be the oracle constant policy:
$$ \pi^c_{\mu,T}(t) \in \operatorname{argmax}_{i \in [K]} \Big\{ \sum_{l \in [T]} \mu_i(l) \Big\}, \quad \forall t \in [T]. $$
Then, $\pi^c_{\mu,T}$ is optimal for the rested non-decreasing bandits (i.e., under Assumption 3.1).

The result holds under the non-decreasing property (Assumption 3.1) only, without requiring concavity (Assumption 3.2). However, this policy cannot be used in practice, as it requires knowing the full function $\mu_i(\cdot)$ in advance.

4.1. Non-Learnability

We now prove a result highlighting the hardness of the non-decreasing rested bandits. We show that, with no assumptions on the payoff $\mu_i(n)$ (e.g., concavity), it is impossible to devise a no-regret algorithm.

Theorem 4.2 (Non-Learnability). There exists a 2-armed non-decreasing (non-concave) deterministic rested bandit with $\gamma_i(n) \le \gamma_{\max} \le 1$ for all $i \in [K]$ and $n \in \mathbb{N}$, such that any learning policy $\pi$ suffers regret linear in the horizon, i.e., $R_{\mu}(\pi, T) = \Omega(\gamma_{\max} T)$.

The intuition behind this result is that, if we enforce no condition on the increments $\gamma_i(n)$, we cannot predict how much the arm's payoff will increase in the future. Therefore, we face the dilemma of whether or not to pull an arm that is currently believed to be suboptimal, hoping its payoff will increase. If we decide to pull it and its payoff does not actually increase, or if we decide not to pull it and its payoff does actually increase, becoming optimal, we suffer linear regret. Thus, Theorem 4.2 highlights the importance of the concavity assumption (Assumption 3.2), providing an answer to an open question posed in (Heidari et al., 2016).

Remark 4.1 (About the Concavity Assumption). While, without additional structure, e.g., concavity, the non-decreasing rested bandits are non-learnable (Theorem 4.2), the assumption is not necessary in other related settings. In particular, non-decreasing restless bandits are in all regards non-stationary bandits, for which no-regret algorithms exist under different assumptions about the number of change points (Garivier & Moulines, 2011) or a bounded total variation (Besbes et al., 2014). Furthermore, for non-increasing rested (rotting) bandits (Levine et al., 2017), a bounded payoff decrement between consecutive pulls is sufficient to devise a no-regret algorithm.

4.2. Deterministic Setting

To progressively introduce the core ideas, we begin with the case of deterministic arms ($\sigma = 0$). We devise an optimistic estimator of $\mu_i(t)$, namely $\mu^{\text{R-ed}}_i(t)$, having observed the exact payoffs $(\mu_i(n))_{n=1}^{N_{i,t-1}}$. Differently from the rotting setting, these payoffs are an underestimation of $\mu_i(t)$. Therefore, we exploit the non-decreasing assumption (Assumption 3.1) to derive the identity:
$$ \mu_i(t) = \underbrace{\mu_i(N_{i,t-1})}_{\text{(most recent payoff)}} + \underbrace{\sum_{n=N_{i,t-1}}^{t-1} \gamma_i(n)}_{\text{(sum of future increments)}}. $$
By exploiting the concavity (Assumption 3.2), we upper bound the sum of future increments with the last experienced increment $\gamma_i(N_{i,t-1}-1)$, projected over the future $t - N_{i,t-1}$ pulls, leading to the following estimator:
$$ \mu^{\text{R-ed}}_i(t) := \underbrace{\mu_i(N_{i,t-1})}_{\text{(most recent payoff)}} + (t - N_{i,t-1})\, \underbrace{\gamma_i(N_{i,t-1}-1)}_{\text{(most recent increment)}}, \qquad (4) $$
if $N_{i,t-1} \ge 2$; otherwise $\mu^{\text{R-ed}}_i(t) := +\infty$.

Figure 1. Graphical representation of the estimator construction $\mu^{\text{R-ed}}_i(t)$ for the rested deterministic setting.

Figure 1 illustrates the construction of the estimator. The optimism of $\mu^{\text{R-ed}}_i$ and a bias bound are proved in Lemma A.2.
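The deterministic rested index translates directly into code. The following sketch assumes the exact payoffs observed so far are stored in pull order; the function name is illustrative.

```python
import numpy as np

def rested_det_index(mu_obs, t):
    """Optimistic index mu^{R-ed}_i(t) for the deterministic rested setting.

    mu_obs : exact payoffs observed so far for this arm, in pull order,
             so mu_obs[k] = mu_i(k + 1) and len(mu_obs) = N_{i,t-1}.
    Returns mu_i(N) + (t - N) * (mu_i(N) - mu_i(N - 1)), or +inf with fewer than 2 pulls.
    """
    n = len(mu_obs)
    if n < 2:
        return np.inf
    last_increment = mu_obs[-1] - mu_obs[-2]   # gamma_i(N_{i,t-1} - 1)
    return mu_obs[-1] + (t - n) * last_increment
```

In the deterministic case the stored rewards coincide with the exact payoffs, so this function can be passed to the Algorithm 1 skeleton above as `index = lambda i, t, history: rested_det_index(history[i], t)`.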
Regret Analysis  We are now ready to provide the regret analysis of R-ed-UCB, i.e., Algorithm 1 with exploration index $B_i(t) = \mu^{\text{R-ed}}_i(t)$.

Theorem 4.3. Let $T \in \mathbb{N}$; then R-ed-UCB (Algorithm 1) with $B_i(t) = \mu^{\text{R-ed}}_i(t)$ suffers an expected regret bounded, for every $q \in [0,1]$, as:
$$ R_{\mu}(\text{R-ed-UCB}, T) \le 2K + K T^{q}\, \Upsilon_{\mu}\big(\lceil T/K \rceil, q\big). $$

The regret depends on a parameter $q \in [0,1]$ that can be selected to tighten the bound, whose optimal value depends on $\Upsilon_{\mu}(\cdot, q)$, which is a function of the horizon $T$. Some examples, when $\gamma_i(l) \le l^{-c}$ for $c > 0$, are reported in Figure 2.

Figure 2. Regret bound $\widetilde{O}$ rates optimized over $q$ for the R-less and R-ed deterministic bandits when $\gamma_i(l) \le l^{-c}$ for $c > 0$.

4.3. Stochastic Setting

Moving to the R-ed stochastic setting ($\sigma > 0$), we cannot directly exploit the estimator in Equation (4). Indeed, we only observe the sequence of noisy rewards $(R_{t_{i,n}})_{n=1}^{N_{i,t-1}}$, where $t_{i,n} \in [T]$ is the round at which arm $i \in [K]$ was pulled for the $n$-th time. To cope with stochasticity, we need to employ an $h$-wide window made of the $h$ most recent samples, similarly to what has been proposed by Seznec et al. (2020). The choice of $h$ represents a bias-variance trade-off between employing few recent observations (less bias) and many past observations (less variance). For $h \in [N_{i,t-1}]$, the resulting estimator $\hat{\mu}^{\text{R-ed},h}_i(t)$ is given by:
$$ \hat{\mu}^{\text{R-ed},h}_i(t) := \frac{1}{h} \sum_{l = N_{i,t-1} - h + 1}^{N_{i,t-1}} \Bigg( \underbrace{R_{t_{i,l}}}_{\text{(estimated payoff)}} + (t - l)\, \underbrace{\frac{R_{t_{i,l}} - R_{t_{i,l-h}}}{h}}_{\text{(estimated increment)}} \Bigg), $$
if $h \le \lfloor N_{i,t-1}/2 \rfloor$; otherwise $\hat{\mu}^{\text{R-ed},h}_i(t) := +\infty$. The construction of the estimator is shown in Appendix A.1 and relies on the idea of averaging several estimators of the form of Equation (4), instanced using as starting points different numbers of pulls $N_{i,t-1} - l + 1$ for $l \in [h]$, and replacing the true payoffs with the corresponding reward samples. An efficient way to compute this estimator is reported in Appendix D.

Regret Analysis  By making use of the presented estimator, we build the following optimistic exploration index:
$$ B_i(t) = \hat{\mu}^{\text{R-ed},h_{i,t}}_i(t) + \beta^{\text{R-ed},h_{i,t}}_i(t, \delta_t), \quad \text{where} \quad \beta^{\text{R-ed},h}_i(t, \delta) := \sigma\, (t - N_{i,t-1} + h - 1) \sqrt{\frac{10 \log(1/\delta)}{h^3}}, $$
$h_{i,t}$ are arm-and-time-dependent window sizes, and $\delta_t$ is a time-dependent confidence parameter. By choosing the window size to depend linearly on the number of pulls, we are able to provide the following regret bound.

Theorem 4.4. Let $T \in \mathbb{N}$; then R-ed-UCB (Algorithm 1) with $B_i(t) = \hat{\mu}^{\text{R-ed},h_{i,t}}_i(t) + \beta^{\text{R-ed},h_{i,t}}_i(t, \delta_t)$, $h_{i,t} = \lfloor \epsilon N_{i,t-1} \rfloor$ for $\epsilon \in (0, 1/2)$, and $\delta_t = t^{-\alpha}$ for $\alpha > 2$, suffers an expected regret bounded, for every $q \in [0,1]$, as:
$$ R_{\mu}(\text{R-ed-UCB}, T) \le O\Bigg( \frac{(\sigma T)^{2/3} (\alpha \log T)^{1/3}}{\epsilon} + K T^{q} \Big\lceil \frac{1}{1 - 2\epsilon} \Big\rceil \Upsilon_{\mu}\Big( \Big\lceil \frac{(1 - 2\epsilon) T}{K} \Big\rceil, q \Big) \Bigg). $$

This result deserves some comments. First, compared with the corresponding deterministic R-ed regret bound (Theorem 4.3), it reflects a similar dependence on the cumulative increment $\Upsilon_{\mu}$, although it now involves the $\epsilon$ parameter defining the window size $h_{i,t} = \lfloor \epsilon N_{i,t-1} \rfloor$. Second, it includes an additional term of order $\widetilde{O}(T^{2/3})$ that is due to the presence of the noise $\sigma$ and that increases inversely w.r.t. $\epsilon$; in particular, when $\gamma_i(n)$ decreases sufficiently fast (see Table 1), the regret is dominated by the $\widetilde{O}(T^{2/3})$ component. Thus, we visualize a trade-off in the choice of $\epsilon$: larger windows ($\epsilon$ close to $1/2$) are beneficial for the first term, but they enlarge the constant $1/(1-2\epsilon)$ multiplying the second component.
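A sketch of the resulting R-ed-UCB index is given below, under the reconstruction of $\beta$ stated above and the $\delta_t = t^{-\alpha}$ schedule of Theorem 4.4; the constant inside the square root is an assumption of this sketch. It uses a plain O(h) loop rather than the efficient update of Appendix D, and names and default values are illustrative.

```python
import numpy as np

def rested_stoch_index(rewards, t, eps=0.25, sigma=0.1, alpha=2.1):
    """Optimistic index B_i(t) for R-ed-UCB (stochastic rested case).

    rewards : rewards of arm i in pull order, len(rewards) = N_{i,t-1}
    eps     : window fraction, h = floor(eps * N_{i,t-1}), with eps in (0, 1/2)
    Bonus: beta = sigma * (t - N + h - 1) * sqrt(10 * alpha * log(t) / h^3),
    i.e., log(1/delta_t) = alpha * log(t) as in Theorem 4.4.
    """
    n = len(rewards)
    h = int(np.floor(eps * n))
    if h < 1 or h > n // 2:
        return np.inf                            # estimator undefined: force exploration
    r = np.asarray(rewards, dtype=float)
    est = 0.0
    for l in range(n - h + 1, n + 1):            # the h most recent pulls (1-indexed l)
        pay = r[l - 1]                           # estimated payoff at pull l
        incr = (r[l - 1] - r[l - h - 1]) / h     # estimated average increment over {l-h, ..., l}
        est += pay + (t - l) * incr
    est /= h
    beta = sigma * (t - n + h - 1) * np.sqrt(10.0 * alpha * np.log(max(t, 2)) / h ** 3)
    return est + beta
```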
Remark 4.2 (Comparison with Adversarial Bandits). The R-ed setting can be mapped to an adversarial bandit (Auer et al., 2002) with an adaptive (i.e., non-oblivious) adversary. Indeed, the arm payoff $\mu_i(N_{i,t})$ can be thought of as selected by an adversary who has access to the previous learner choices (i.e., the history $H_{t-1}$), specifically to the number of pulls $N_{i,t}$. However, although adversarial bandit algorithms, such as EXP3 (Auer et al., 2002) and OSMD (Audibert et al., 2014), suffer $\widetilde{O}(\sqrt{T})$ regret, these results are not comparable with ours. Indeed, while those correspond to guarantees on the external regret, the regret definition we employ in Section 3 is a notion of policy regret (Dekel et al., 2012).

5. Stochastic Rising Restless Bandits

In this section, we consider the Rising restless bandits (R-less), in which the payoff increases at every round, regardless of whether the arm is pulled, i.e., $\mu_i(t, N_{i,t}) = \mu_i(t)$.

Oracle Policy  We start by recalling that the oracle greedy policy, i.e., the policy selecting at each round $t \in [T]$ the arm with the largest payoff, is optimal for the non-decreasing restless bandit setting.

Theorem 5.1 (Seznec et al., 2020). Let $\pi^g_{\mu}$ be the oracle greedy policy:
$$ \pi^g_{\mu}(t) \in \operatorname{argmax}_{i \in [K]} \{ \mu_i(t) \}, \quad \forall t \in [T]. $$
Then, $\pi^g_{\mu}$ is optimal for the restless non-decreasing bandits (i.e., under Assumption 3.1).

Notice that $\pi^g_{\mu}$ is optimal under the non-decreasing payoff assumption (Assumption 3.1) only, without requiring concavity (Assumption 3.2). We can now appreciate a first important difference between rising and rotting bandits. While for the rotting bandits the oracle greedy policy is optimal in both the rested and the restless settings, for the rising bandits it remains optimal in the restless case only. Indeed, for the rising rested case, as shown in Theorem 4.1, the oracle constant policy is needed to achieve optimality.

5.1. Deterministic Setting

We begin with the case of deterministic arms ($\sigma = 0$). Similarly to the rested case, we design an optimistic estimator of $\mu_i(t)$, namely $\mu^{\text{R-less}}_i(t)$, employing the exact payoffs $(\mu_i(t_{i,n}))_{n=1}^{N_{i,t-1}}$. To this end, we exploit the non-decreasing assumption (Assumption 3.1) to derive the identity:
$$ \mu_i(t) = \underbrace{\mu_i(t_{i,N_{i,t-1}})}_{\text{(most recent payoff)}} + \underbrace{\sum_{l = t_{i,N_{i,t-1}}}^{t-1} \gamma_i(l)}_{\text{(sum of future increments)}}. $$
Then, we leverage the concavity (Assumption 3.2) to upper bound the sum of future increments with the last experienced increment, projected in the future for $t - t_{i,N_{i,t-1}}$ rounds, leading to the estimator:
$$ \mu^{\text{R-less}}_i(t) := \underbrace{\mu_i(t_{i,N_{i,t-1}})}_{\text{(most recent payoff)}} + (t - t_{i,N_{i,t-1}})\, \underbrace{\frac{\mu_i(t_{i,N_{i,t-1}}) - \mu_i(t_{i,N_{i,t-1}-1})}{t_{i,N_{i,t-1}} - t_{i,N_{i,t-1}-1}}}_{\text{(most recent increment)}}, \qquad (5) $$
if $N_{i,t-1} \ge 2$; otherwise $\mu^{\text{R-less}}_i(t) := +\infty$. Lemma A.5 shows that $\mu^{\text{R-less}}_i$ is optimistic and provides a bias bound.

Regret Analysis  We now provide the regret analysis of R-less-UCB, obtained from Algorithm 1 by setting $B_i(t) = \mu^{\text{R-less}}_i(t)$.

Theorem 5.2. Let $T \in \mathbb{N}$; then R-less-UCB (Algorithm 1) with $B_i(t) = \mu^{\text{R-less}}_i(t)$ suffers an expected regret bounded, for every $q \in [0,1]$, as:
$$ R_{\mu}(\text{R-less-UCB}, T) \le 2K + K \Big( T^{q}\, \Upsilon_{\mu}\big(\lceil T/K \rceil, q\big) \Big)^{\frac{1}{q+1}}. $$

Similarly to Theorem 4.3, the result depends on the free parameter $q \in [0,1]$, which can be chosen to tighten the bound. It is worth noting that the regret bound of the R-less deterministic case (Theorem 5.2) is always smaller than that of the R-ed deterministic case (Theorem 4.3). Indeed, ignoring the dependence on $K$, we have $R_{\mu}(\text{R-less-UCB}, T) = O\big( R_{\mu}(\text{R-ed-UCB}, T)^{\frac{1}{q+1}} \big)$. The following example clarifies the role of $q$ for both the restless and the rested case.

Example 5.1. Suppose that for all $i \in [K]$ we have $\gamma_i(l) \le l^{-c}$ for $c > 0$. The expressions of the bounds on $\Upsilon_{\mu}(\cdot, q)$ have been shown in Table 1. Different values of $q \in [0,1]$ should be selected to tighten the regret bounds, depending on the value of $c$. Figure 2 reports the optimized bounds for the deterministic R-less and R-ed cases (derivation in Appendix B).
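The role of $q$ in Example 5.1 can also be explored numerically. The sketch below evaluates the (constant-free) bound shapes of Theorems 4.3 and 5.2 on a grid of $q$ values for the worst-case profile $\gamma(l) = l^{-c}$ and returns the minimizing values; the horizon, number of arms, and $c$ are illustrative choices.

```python
import numpy as np

def cumulative_increment(M, q, c):
    """Upsilon_mu(M, q) for the worst-case increment profile gamma(l) = l^(-c)."""
    l = np.arange(1, M + 1)
    return np.sum(l ** (-c * q))

def optimized_bounds(T=200_000, K=15, c=0.5, grid=None):
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    M = int(np.ceil(T / K))
    rested, restless = [], []
    for q in grid:
        ups = cumulative_increment(M, q, c)
        rested.append(K * T ** q * ups)                          # shape of Theorem 4.3
        restless.append(K * (T ** q * ups) ** (1.0 / (q + 1)))   # shape of Theorem 5.2
    rested, restless = np.array(rested), np.array(restless)
    return grid[rested.argmin()], grid[restless.argmin()]

q_ed, q_less = optimized_bounds()
print(f"minimizing q (rested, R-ed)    ~ {q_ed:.2f}")
print(f"minimizing q (restless, R-less) ~ {q_less:.2f}")
```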
5.2. Stochastic Setting

In the stochastic setting ($\sigma > 0$), we have access to noisy versions of $\mu_i$ only, i.e., $(R_{t_{i,n}})_{n=1}^{N_{i,t-1}}$. Intuitively, we might be tempted to straightforwardly extend the derivation of the rested case by averaging $h$ estimators like the one in Equation (5), instanced with different time instants $t_{i,N_{i,t-1}}$. Unfortunately, this approach is unsuccessful for technical reasons, since the increment term would include the difference of time instants $t_{i,N_{i,t-1}} - t_{i,N_{i,t-1}-1}$, which, in the stochastic setting, are random variables correlated with the observed rewards $R_{t_{i,n}}$. For this reason, at the price of a larger bias, we employ the same estimator used in the stochastic rested case, defined for $h \in [N_{i,t-1}]$:
$$ \hat{\mu}^{\text{R-less},h}_i(t) := \frac{1}{h} \sum_{l = N_{i,t-1} - h + 1}^{N_{i,t-1}} \Bigg( \underbrace{R_{t_{i,l}}}_{\text{(estimated payoff)}} + (t - l)\, \underbrace{\frac{R_{t_{i,l}} - R_{t_{i,l-h}}}{h}}_{\text{(estimated increment)}} \Bigg), $$
if $h \le \lfloor N_{i,t-1}/2 \rfloor$; otherwise $\hat{\mu}^{\text{R-less},h}_i(t) := +\infty$. Additional details on the estimator construction are reported in Appendix A.2, together with its analysis.

Regret Analysis  We provide the regret analysis of R-less-UCB when we employ the exploration index analogous to that of the rested case:
$$ B_i(t) = \hat{\mu}^{\text{R-less},h_{i,t}}_i(t) + \beta^{\text{R-less},h_{i,t}}_i(t, \delta_t), \quad \text{where} \quad \beta^{\text{R-less},h}_i(t, \delta) := \sigma\, (t - N_{i,t-1} + h - 1) \sqrt{\frac{10 \log(1/\delta)}{h^3}}, $$
$h_{i,t}$ are arm-and-time-dependent window sizes, and $\delta_t$ is a time-dependent confidence. The regret bound is given by the following result.

Theorem 5.3. Let $T \in \mathbb{N}$; then R-less-UCB (Algorithm 1) with $B_i(t) = \hat{\mu}^{\text{R-less},h_{i,t}}_i(t) + \beta^{\text{R-less},h_{i,t}}_i(t, \delta_t)$, $h_{i,t} = \lfloor \epsilon N_{i,t-1} \rfloor$ for $\epsilon \in (0, 1/2)$, and $\delta_t = t^{-\alpha}$ for $\alpha > 2$, suffers an expected regret bounded, for every $q \in [0,1]$, by the sum of a term of order $\widetilde{O}\big( (\sigma T)^{2/3} (\alpha \log T)^{1/3} / \epsilon \big)$ and an instance-dependent term involving $\Upsilon_{\mu}\big( \lceil (1-2\epsilon) T / K \rceil, q \big)$, whose dependence on $T$ is worse than that of the corresponding term of Theorem 5.2 and which grows as $\epsilon$ and $1 - 2\epsilon$ decrease.

Some observations are in order. First, compared to the bound for the rested case in Theorem 4.4, we note the same $\widetilde{O}(T^{2/3})$ dependence due to the presence of the noise $\sigma$. Concerning the second term, compared with the one of the deterministic case (Theorem 5.2), we worsen the dependence on $T$, and an inverse dependence on the $\epsilon$ and $1-2\epsilon$ parameters appears. This is due to the usage of the $h$-wide window instead of the last sample only and to the fact that, all other things being equal, the estimator employed for the stochastic case, as already discussed, is looser than the one for the deterministic case. Finally, our result is not fully comparable with (Besbes et al., 2014) for generic non-stationary bandits with bounded variation because, as already mentioned, $\Upsilon_{\mu}(\lceil T/K \rceil, q)$ may be smaller than $V_T$. Moreover, we achieve such a bound with no knowledge of $\Upsilon_{\mu}$, while the work by Besbes et al. (2014) requires knowing $V_T$.

6. Numerical Simulations

We numerically compare R-less-UCB and R-ed-UCB with state-of-the-art algorithms for non-stationary MABs in the restless and rested settings, respectively. Details of the experimental setting and additional results are provided in Appendix E; the code to reproduce the experiments is available at https://github.com/albertometelli/stochastic-rising-bandits.

Algorithms  We consider the following baseline algorithms: Rexp3 (Besbes et al., 2014), a non-stationary MAB algorithm based on the variation budget; KL-UCB (Garivier & Cappé, 2011), one of the most effective stationary MAB algorithms; Ser4 (Allesiardo et al., 2017), which considers best-arm switches during the process; and sliding-window algorithms such as SW-UCB (Garivier & Moulines, 2011), SW-KL-UCB (Combes & Proutiere, 2014), and SW-TS (Trovò et al., 2020), which are generally able to deal with non-stationary restless settings. The parameters of all the baseline algorithms have been set as recommended in the corresponding papers (see also Appendix E). For our algorithms, the window is set as $h_{i,t} = \lfloor \epsilon N_{i,t-1} \rfloor$ (as prescribed by Theorems 4.4 and 5.3).
We remark that, while the baseline algorithms are suited for the restless case, in the rested case no algorithm has been designed to cope with the stochastic rising setting when no knowledge of the payoff functions is available. We compare the algorithms in terms of the empirical cumulative regret $\hat{R}_{\mu}(\pi, t)$, which is the empirical counterpart of the expected cumulative regret $R_{\mu}(\pi, t)$ at round $t$, averaged over multiple independent runs.

6.1. Restless setting

To evaluate R-less-UCB in the restless setting, we run the aforementioned algorithms on a problem with $K = 15$ arms over a time horizon of $T = 200{,}000$ rounds, setting $\epsilon = 1/4$. The payoff functions $\mu_i(\cdot)$ are chosen from the families $\mathcal{F}_{\exp} = \{ f(t) = c(1 - e^{-at}) \}$ and $\mathcal{F}_{\text{poly}} = \{ f(t) = c\big(1 - b (t + b^{1/\rho})^{-\rho}\big) \}$, where $a, c, \rho \in (0,1]$ and $b \in \mathbb{R}_{\ge 0}$ are parameters whose values have been selected randomly. By construction, all functions $f \in \mathcal{F}_{\exp} \cup \mathcal{F}_{\text{poly}}$ satisfy Assumptions 3.1 and 3.2. The functions coming from $\mathcal{F}_{\exp}$ (exponential functions) have a sudden increase, while the ones from $\mathcal{F}_{\text{poly}}$ (polynomial functions) have a slower growth rate, leading to different cumulative increments $\Upsilon_{\mu}$. The stochasticity is realized by adding Gaussian noise with $\sigma = 0.1$. The generated functions are shown in Figure 3a.

Figure 3. 15-arm bandit setting: (a) first 6000 rounds/pulls of the payoff functions, (b) cumulative regret in the R-less scenario, (c) cumulative regret in the R-ed scenario (100 runs, 95% c.i.).

The empirical cumulative regret $\hat{R}_{\mu}(\pi, t)$ is provided in Figure 3b. The results show that SW-TS is the algorithm achieving the lowest regret at the horizon, even though its performance at the beginning is worse than that of the other algorithms. As commonly happens in practice, TS-based approaches tend to outperform UCB ones. Indeed, R-less-UCB displays the second-best curve overall and achieves the best performance among the UCB-like algorithms.
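The following sketch shows how payoff functions from $\mathcal{F}_{\exp}$ and $\mathcal{F}_{\text{poly}}$ can be generated for such an experiment. The sampling ranges and the split between the two families are assumptions of this sketch, since the paper only states that the parameters were drawn randomly within the stated domains.

```python
import numpy as np

def make_exp_payoff(rng):
    # F_exp: f(x) = c * (1 - exp(-a * x)), with a, c in (0, 1]
    a, c = rng.uniform(1e-3, 1.0, size=2)
    return lambda x: c * (1.0 - np.exp(-a * x))

def make_poly_payoff(rng):
    # F_poly: f(x) = c * (1 - b * (x + b^(1/rho))^(-rho)), with c, rho in (0, 1], b >= 0
    c, rho = rng.uniform(1e-3, 1.0, size=2)
    b = rng.uniform(1e-3, 5.0)                  # illustrative range for b
    return lambda x: c * (1.0 - b * (x + b ** (1.0 / rho)) ** (-rho))

rng = np.random.default_rng(0)
# Illustrative split of the K = 15 arms between the two families.
payoffs = [make_exp_payoff(rng) for _ in range(8)] + [make_poly_payoff(rng) for _ in range(7)]
sigma = 0.1                                     # Gaussian reward noise used in Section 6.1

def reward(arm, x, rng=rng):
    """Noisy reward of arm `arm` at argument x (round t if restless, pull count n if rested)."""
    return payoffs[arm](x) + rng.normal(0.0, sigma)
```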
6.2. Rested setting

We employ the same arms generated for the restless case to evaluate R-ed-UCB in the rested setting. We plot the empirical cumulative regret in Figure 3c. SW-TS is confirmed as the best algorithm at the end of the time horizon, although other algorithms (SW-UCB and SW-KL-UCB) suffer less regret at the beginning of learning. R-ed-UCB pays the price of the initial exploration but, at the end of the horizon, it manages to achieve the second-best performance. Notice that, besides R-ed-UCB, all the other baseline algorithms are designed for the restless setting and are not endowed with any guarantee on the regret in the rested scenario. To highlight this fact, we designed a particular 2-armed rising rested bandit in which the optimal arm reveals itself only when pulled a sufficient number of times (linear in $T$). The payoff functions, fulfilling Assumptions 3.1 and 3.2, are shown in Figure 4a, and the algorithms' empirical regrets in Figure 4b.

Figure 4. 2-armed R-ed bandit setting: (a) payoff functions, (b) cumulative regret (100 runs, 95% c.i.).

Note that in this setting the expected (instantaneous) regret may be negative for $t < \frac{19T}{400}$, and this is the case for most of the algorithms for $t < 20{,}000$. While for the first $20{,}000$ rounds R-ed-UCB is on par with the other algorithms, it outperforms all the other policies over a longer run. Note that the regret curves of Rexp3 and Ser4 decrease their slope for $t > 40{,}000$, meaning that they are somehow reacting to the change in the reward of the two arms. SW-TS starts reacting even later, at around $t = 100{,}000$. However, these algorithms are not prompt in detecting such a change in the rewards and, therefore, collect a large regret in the first part of the learning process. The other algorithms suffer a linear regret at the end of the time horizon, since they either do not employ forgetting mechanisms or their sliding window should be tuned knowing the characteristics of the expected reward.

6.3. IMDB dataset (rested)

We investigate the performance of R-ed-UCB on an online model selection task for a real-world dataset. We employ the IMDB dataset, made of $50{,}000$ reviews of movies (scores from 0 to 10). We preprocessed the data as done by Maas et al. (2011) to obtain a binary classification problem. Each review $x_t$ lies in a $d = 10{,}000$-dimensional feature space, where each feature is the frequency of one of the most common English words. Each arm corresponds to a different online optimization algorithm: two of them are Online Logistic Regression algorithms with different learning rate schemes, and the other five are Neural Networks with different topologies. We provide additional information on the arms of the bandit in Appendix E.2. At each round, a sample $x_t$ is randomly selected from the dataset, a reward of 1 is generated for a correct classification and 0 otherwise, and, finally, the online update step is performed for the chosen algorithm.

Figure 5. Cumulative regret in the online model selection on the IMDB dataset (30 runs, 95% c.i.).

The empirical regret is plotted in Figure 5. We can see that R-ed-UCB, with $\epsilon = 1/32$, outperforms the considered baselines. Compared to the synthetic simulations, the smaller window choice is justified by the fact that the average learning curves of the classification algorithms are not guaranteed to be non-decreasing nor concave on a single run. However, keeping the window linear in $N_{i,t-1}$ is crucial for the regret guarantees of Theorem 4.4.

7. Discussion and Conclusions

This paper studied the MAB problem when the payoffs are non-decreasing functions that evolve either when the corresponding arm is pulled (rested) or as time passes (restless). We showed that, for the rested case, an assumption on the payoff (e.g., concavity) is essential to make the problem learnable. We presented novel algorithms that suitably employ the concavity assumption to build proper estimators for both settings. These algorithms are proven to suffer a regret made of a first instance-independent component of order $\widetilde{O}(T^{2/3})$ and an instance-dependent component involving the cumulative increment function $\Upsilon_{\mu}(\cdot, q)$. For the rested setting, ours represents the first no-regret algorithm for the stochastic rising bandits. The experimental evaluation confirmed our theoretical findings, showing advantages over state-of-the-art algorithms designed for non-stationary bandits, especially in the rested setting. A natural future research direction consists of studying the complexity of the learning problem in stochastic rising rested and restless bandits, focusing on deriving suitable regret lower bounds.
Other future works include investigating the best-arm identification setting and, motivated by the online model selection application, analysing the alternative case in which the optimization algorithms associated with the arms act on a shared vector of parameters.

References

Abbasi-Yadkori, Y., Pacchiano, A., and Phan, M. Regret balancing for bandit and RL model selection. CoRR, abs/2006.05491, 2020.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Kale, S. and Shamir, O. (eds.), Proceedings of the Conference on Learning Theory (COLT), volume 65, pp. 12-38, 2017.

Allesiardo, R., Féraud, R., and Maillard, O.-A. The non-stationary stochastic multi-armed bandit problem. International Journal of Data Science and Analytics, 3(4):267-283, 2017.

Arora, R., Marinov, T. V., and Mohri, M. Corralling stochastic bandit algorithms. In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130, pp. 2116-2124, 2021.

Audibert, J., Bubeck, S., and Lugosi, G. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31-45, 2014.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 322-331, 1995.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Auer, P., Gajane, P., and Ortner, R. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Conference on Learning Theory (COLT), volume 99, pp. 138-158, 2019.

Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armed bandit problem with non-stationary rewards. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), volume 27, pp. 199-207, 2014.

Besson, L., Kaufmann, E., Maillard, O.-A., and Seznec, J. Efficient change-point detection for tackling piecewise-stationary bandits. arXiv preprint arXiv:1902.01575, 2019.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 21 (NIPS), pp. 201-208, 2008.

Cao, Y., Wen, Z., Kveton, B., and Xie, Y. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 418-427, 2019.

Cella, L., Pontil, M., and Gentile, C. Best model identification: A rested bandit formulation. In Meila, M. and Zhang, T. (eds.), Proceedings of the International Conference on Machine Learning (ICML), volume 139, pp. 1362-1372, 2021.

Chen, Y., Lee, C., Luo, H., and Wei, C. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Proceedings of the Conference on Learning Theory (COLT), volume 99, pp. 696-726, 2019.

Combes, R. and Proutiere, A. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pp. 521-529, 2014.

Dekel, O., Tewari, A., and Arora, R. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Doob, J. L. Stochastic Processes, volume 7. Wiley, New York, 1953.
Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the Conference on Learning Theory (COLT), pp. 359-376, 2011.

Garivier, A. and Moulines, E. On upper-confidence bound policies for switching bandit problems. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), pp. 174-188, 2011.

Heidari, H., Kearns, M. J., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In Kambhampati, S. (ed.), Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1562-1570, 2016.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Levine, N., Crammer, K., and Mannor, S. Rotting bandits. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), pp. 3074-3083, 2017.

Li, Y., Jiang, J., Gao, J., Shao, Y., Zhang, C., and Cui, B. Efficient automatic CASH via rising bandits. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 4763-4771, 2020.

Liu, F., Lee, J., and Shroff, N. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the Conference on Artificial Intelligence (AAAI), volume 32, 2018.

Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT), pp. 142-150, 2011.

Mintz, Y., Aswani, A., Kaminsky, P., Flowers, E., and Fukuoka, Y. Nonstationary bandits with habituation and recovery dynamics. Operations Research, 68(5):1493-1516, 2020.

Ortner, R., Ryabko, D., Auer, P., and Munos, R. Regret bounds for restless Markov bandits. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), pp. 214-228, 2012.

Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020a.

Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. Model selection in contextual stochastic bandit problems. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2020b.

Pike-Burke, C. and Grunewalder, S. Recovering bandits. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), volume 32, pp. 14122-14131, 2019.

Russac, Y., Vernade, C., and Cappé, O. Weighted linear bandits for non-stationary environments. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.

Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. Rotting bandits are no harder than stochastic ones. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89, pp. 2564-2572, 2019.

Seznec, J., Ménard, P., Lazaric, A., and Valko, M. A single algorithm for both restless and rested rotting bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, pp. 3784-3794, 2020.

Tekin, C. and Liu, M. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588-5611, 2012.

Trovò, F., Paladino, S., Restelli, M., and Gatti, N. Sliding-window Thompson sampling for non-stationary settings.
Stochastic Rising Bandits Journal of Artificial Intelligence Research, 68:311 364, 2020. Wang, S., Huang, L., and Lui, J. C. S. Restless-ucb, an efficient and low-complexity algorithm for online restless bandits. In Advances in Neural Information Processing Systems (Neur IPS), volume 33, pp. 11878 11889. Curran Associates, Inc., 2020. Stochastic Rising Bandits A. Proofs and Derivations In this section, we provide the proof of the results presented in the main paper. A.1. Proofs of Section 4 Theorem 4.1 (Heidari et al., 2016). Let πc µ,T be the oracle constant policy: πc µ,T ptq Pargmax i Pr Ks l Pr T s µiplq , @t Pr Ts. Then, πc µ,T is optimal for the rested non-decreasing bandits (i.e., under Assumption 3.1). Proof. The proof is reported in Proposition 1 of (Heidari et al., 2016). Lemma A.1. In the noiseless (σ 0) setting, there exists a 2-armed non-increasing non-concave bandit such that any learning policy π suffers regret: Rµpπ,Tqě Z T Proof. Let µA and µB be two non-concave non-decreasing rested bandits, defined as: µA 1 pnq µB 1 pnq 1 # 0 if nďt T 3 u 1 otherwise , µB 2 pnq 0. It is clear that for µA the optimal arm is 2, whereas for bandit µB the optimal arm is 1, having optimal performance respectively J µAp Tq r 2 3Ts and J µBp Tq T Let π be an arbitrary policy. Since the learner will receive the same rewards for both bandits until at least t T 3 u. Thus, we have: πp HtpµAqq πp HtpµBqq ùñ E µA Let us now compute the performance of policy π in the two bandits and the corresponding regrets. Let us start with µA: 2 E µAr N1,T s max " 0, E µAr N2,T s ZT 2 E µAr N1,T s max " 0, R2 3T V E µAr N1,T s * , (7) Stochastic Rising Bandits where Equation (6) follows from observing that we get reward from arm 2 only if we pull it more than t T 3 u times and Equation (7) derives from observing that T EµAr N1,T s EµAr N2,T s. Now, consider the two cases: Case (i) : EµAr N1,T sěr 2 2 E µAr N1,T s, that is maximized by taking EµAr N1,T s T. Case (ii) : EµAr N1,T săr 2 JµApπ,Tq R2 2 E µAr N1,T s, that is maximized by taking the minimum value of EµAr N1,T s possible, that is EµAr N1,T sěEµAr N1,t T 3 us n1. Putting all together, we have: JµApπ,Tqďmax "T having observed that n1ďt T 3 u. Let us now focus on the regret: RµApπ,Tq J µAp Tq JµApπ,Tq R2 Consider now bandit µB, we have: 2 E µBr N1,T sď n1 having observed that EµBr N1,T s n1 EµBr N1,T s EµBr N1,t T 3 usďn1 r 2 3Ts. Let us now compute the regret: RµBpπ,Tq J µBp Tq JµBpπ,Tq T Finally, the worst-case regret can be lower bounded as follows: inf π sup µ Rµpπ,Tqěinf π max RµApπ,Tq,RµBpπ,Tq ( inf n1Pr0,t T 3 us max "n1 having minimized over n1. Theorem 4.2 (Non-Learnability). There exists a 2-armed non-decreasing (non-concave) deterministic rested bandit with γipnqďγmaxď1 for all i Pr Ks and n PN, such that any learning policy π suffers regret: Rµpπ,Tqě Yγmax Proof. It is sufficient to rescale the mean function of the proof of Lemma A.1 by the quantity γmax. Lemma A.2. For every arm i Pr Ks and every round t Pr Ts, let us define: µR-ed i ptq: µip Ni,t 1q pt Ni,t 1qγip Ni,t 1 1q, Stochastic Rising Bandits if Ni,t 1ě2 else µR-ed i ptq: 8. Then, µR-ed i ptqěµiptq and, if Ni,t 1ě2, it holds that: µR-ed i ptq µip Ni,tqďpt Ni,t 1qγip Ni,t 1 1q. Proof. Let us consider the following derivation: µiptq µip Ni,t 1q n Ni,t 1 γipnqďµip Ni,t 1q pt Ni,t 1qγip Ni,t 1 1q :µR-ed i ptq, where the inequality holds thanks to Assumption 3.2, having observed that řt 1 n Ni,t 1 γipnqďpt Ni,t 1qγip Ni,t 1qď pt Ni,t 1qγip Ni,t 1 1q. 
For the bias bound, when Ni,t 1ě2, we consider the following derivation: µR-ed i ptq µip Ni,tq µip Ni,t 1q pt Ni,t 1qγip Ni,t 1 1q µip Ni,tqďpt Ni,t 1qγip Ni,t 1 1q. having observed that µip Ni,t 1qďµip Ni,tq by Assumption 3.1. Theorem 4.3. Let T PN, then R-ed-UCB (Algorithm 1) with Biptq µR-ed i ptq suffers an expected regret bounded, for every q Pr0,1s, as: Rµp R-ed-UCB,Tqď2K KT qΥµ Proof. We have to analyze the following expression: Rµp R-ed-UCB,Tq t 1 µi ptq µItp Ni,tq, where i Pargmaxi Pr Ks !ř l Pr T sµiplq ) . We consider a term at a time, use Biptq µR-ed i ptq, and we exploit the optimism, i.e., Bi ptqďBItptq: µi ptq µItp NIt,tq BItptq BItptqďmin %1,µi ptq Bi ptq looooooomooooooon BItptq µItp NIt,tq ďmint1,BItptq µItp NIt,tqu. Now we work on the term inside the minimum when NIt,t 1ě2: BItptq µItp NIt,tq µR-ed i ptq µItp NIt,tqďpt Ni,t 1qγItp Ni,t 1 1q, where the inequality follows from Lemma A.2. We are going to decompose the summation of this term over the K arms: Rµp R-ed-UCB,Tqď t 1 mint1,pt Ni,t 1qγItp Ni,t 1 1qu j 3 mint1,pti,j pj 1qqγipj 2qu, where ti,j Pr Ts is the round at which arm i Pr Ks was pulled for the j-th time. Now, q Pr0,1s, then for any xě0 it holds that Stochastic Rising Bandits mint1,xuďmint1,xuqďxq. By applying this latter inequality to the inner summation, we get: j 3 mint1,pti,j pj 1qqγipj 2quď j 3 mint1,Tγipj 2quďT q Ni,T ÿ j 3 γipj 2qq, having used ti,j pj 1qďT. Summing over the arms, we obtain: j 3 γipj 2qqďT q ÿ i Pr Ks Υµp Ni,T ,qqďT q KΥµ where the last inequality is obtained from Lemma C.2. Estimator Construction for the Stochastic Rising Rested Setting Before moving to the proofs, we provide some intuition behind the estimator construction. We start observing that for every l Pt2,...,Ni,t 1u, we have that: µiptq µiplq lomon (past payoff) j l γipjq looomooon (sum of future increments) ď µiplq lomon (past payoff) pt lq γipl 1q looomooon (past increment) where the inequality follows from Assumption 3.2.7 Since we do not have access to the exact payoffs µiplq and exact increments γipl 1q µiplq µipl 1q, one may be tempted to directly replace them with the corresponding point estimates Rti,l and Rti,l Rti,l 1 and average the resulting estimators for a window of the most recent h values of l. Unfortunately, while replacing µiplq with Rti,l is a viable option, replacing γipl 1q with Rti,l Rti,l 1 will prevent concentration since the estimate Ri,tl Ri,tl 1 is too unstable. To this end, before moving to the estimator, we need a further bounding step to get a more stable, although looser, quantity. Based on Lemma C.3, we bound for every l Pt2,...,Ni,t 1u and h Prl 1s: γipl 1q looomooon (past increment at l) ď µiplq µipl hq h loooooooomoooooooon (average past increment over tl h,...,lu) We can now introduce the optimistic approximation of µiptq, i.e., rµR-ed,h i ptq, and the corresponding estimator, i.e., pµR-ed,h i ptq, that are defined in terms of a window of size 1ďhďt Ni,t 1{2u: rµR-ed,h i ptq: 1 l Ni,t 1 h 1 µiplq lomon (past payoff) pt lq µiplq µipl hq h loooooooomoooooooon (average past increment) pµR-ed,h i ptq: 1 l Ni,t 1 h 1 Rti,l lomon (estimated past payoff) pt lq Rti,l Rti,l h h looooooomooooooon (estimated average past increment) The proof is composed of the following steps: (i) Lemma A.3 shows that rµR-ed,h i ptq is an upper-bound for µiptq and provides a bound to its bias w.r.t. 
µip Ni,tq for every value of h; (ii) Lemma A.4 analyzes the concentration of pµR-ed,h i ptq around rµR-ed,h i ptq for a specific choice of δt t α and when hi,h: hp Ni,t 1q is a function of the number of pulls Ni,t 1 only; (iii) Theorem 4.4 bounds the expected regret of R-ed-UCB when hi,h tϵNi,t 1u, for ϵPp0,1{2q. 7The estimator of the deterministic case in Equation (4) is obtained by setting l Ni,t 1. Stochastic Rising Bandits Lemma A.3. For every arm i Pr Ks, every round t Pr Ts, and window width 1ďhďt Ni,t 1{2u, let us define: rµR-ed,h i ptq: 1 l Ni,t 1 h 1 ˆ µiplq pt lqµiplq µipl hq otherwise if h 0, we set rµR-ed,h i ptq: 8. Then, rµR-ed,h i ptqěµiptq and, if Ni,t 1ě2, it holds that: rµR-ed,h i ptq µip Ni,tqď 1 2p2t 2Ni,t 1 h 1qγip Ni,t 1 2h 1q. Proof. Following the derivation provided above, we have for every l Pt2,...,Ni,t 1u: µiptq µiplq ďµiplq pt lqγipl 1q (8) ďµiplq pt lqµiplq µipl hq where line (8) follows from Assumption 3.2, line (9) is obtained from Lemma C.3. By averaging over the most recent 1ďhďt Ni,t 1{2u pulls, we obtain: l Ni,t 1 h 1 ˆ µiplq pt lqµiplq µipl hq :rµR-ed,h i ptq. For the bias bound, when Ni,t 1ě2, we have: rµR-ed,h i ptq µip Ni,tq 1 l Ni,t 1 h 1 ˆ µiplq pt lqµiplq µipl hq µip Ni,tq (10) l Ni,t 1 h 1 pt lqµiplq µipl hq l Ni,t 1 h 1 pt lq 1 j l h γjplq l Ni,t 1 h 1 pt lqγipl hq (11) 2p2t 2Ni,t 1 h 1qγip Ni,t 1 2h 1q. (12) where line (10) follows from Assumption 3.1 applied as µiplqďµip Ni,tq, line (11) follows from Assumption 3.2 and bounding 1 h řl 1 j l hγjplqďγipl hq and line (12) is derived still from Assumption 3.2, γipl hqďγip Ni,t 1 2h 1q and computing the summation. Lemma A.4. For every arm i Pr Ks, every round t Pr Ts, and window width 1ďhďt Ni,t 1{2u, let us define: pµR-ed,h i ptq: 1 l Ni,t 1 h 1 ˆ Rti,l pt lq Rti,l Rti,l h Stochastic Rising Bandits βR-ed,h i pt,δq: σpt Ni,t 1 h 1q otherwise if h 0, we set pµR-ed,h i ptq: 8 and βR-ed,h i pt,δq: 8 . Then, if the window size depends on the number of pulls only hi,t hp Ni,t 1q and if δt t α for some αą2, it holds for every round t Pr Ts that: Pr ˇˇˇpµR-ed,hi,t i ptq rµR-ed,hi,t i ptq ˇˇˇąβR-ed,hi,t i pt,δtq ď2t1 α. Proof. First of all, we observe under the event thi,t 0u, then pµR-ed,hi,t i ptq rµR-ed,hi,t i ptq βR-ed,hi,t i pt,δtq 8. By convening that p 8q p 8q 0, we have that 0ąβR-ed,hi,t i pt,δtq is not satisfied. Thus, we perform the analysis under the event thi,tě1u. We first get rid of the dependence on the random number of pulls Ni,t 1: Pr ˇˇˇpµR-ed,hi,t i ptq rµR-ed,hi,t i ptq ˇˇˇąβR-ed,hi,t i pt,δtq Pr ˇˇˇpµR-ed,hp Ni,t 1q i ptq rµR-ed,hp Ni,t 1q i ptq ˇˇˇąβR-ed,hp Ni,t 1q i pt,δtq (13) ďPr Dn Pt0,...,t 1u s.t. hpnqě1: ˇˇˇpµR-ed,hpnq i ptq rµR-ed,hpnq i ptq ˇˇˇąβR-ed,hpnq i pt,δtq n Pt0,...,t 1u:hpnqě1 Pr ˇˇˇpµR-ed,hpnq i ptq rµR-ed,hpnq i ptq ˇˇˇąβR-ed,hpnq i pt,δtq , (14) where line (13) derives from the definition of hi,t hp Ni,t 1q and line (14) follows from a union bound over the possible values of Ni,t 1. Now, having fixed the value of n, we rewrite the quantity to be bounded: hpnq pµR-ed,hpnq i ptq rµR-ed,hpnq i ptq ˆ Xl pt lq Xl Xl hpnq t l hpnq Xl hpnq, where Xl: Rti,l µiplq. It is worth noting that we can index Xl with the number of pulls l only as the distribution of Rti,l is fully determined by l and n (that are non-random quantities now) and, consequently, all variables Xl and Xl hpnq are independent. Now we apply Azuma-Ho effding s inequality of Lemma C.5 for weighted sums of subgaussian martingale difference sequences. 
To this purpose, we compute the sum of the square weights: ďhpnq ˆ 1 t n hpnq 1 2 hpnq ˆt n hpnq 1 ď 5pt n hpnq 1q2 hpnq , (16) where line (15) follows from bounding t lďt n hpnq 1 and line (16) from observing that t n hpnq 1 hpnq ě1. Thus, we have: Pr ˇˇˇpµR-ed,hpnq i ptq rµR-ed,hpnq i ptq ˇˇˇąβR-ed,hpnq i pt,δtq Stochastic Rising Bandits t l hpnq Xl hpnq ˇˇˇˇˇˇ ąhpnqβR-ed,hpnq i pt,δtq hpnqβR-ed,hpnq i pt,δtq 2 2σ2 5pt n hpnq 1q2 By replacing this result into Equation (14), and recalling the value of δt, we obtain: n Pt0,...,t 1u:hpnqě1 2δtď n 0 2t αď2t1 α. Theorem 4.4. Let T PN, then R-ed-UCB (Algorithm 1) with Biptq pµR-ed,hi,t i ptq βR-ed,hi,t i ptq, hi,t tϵNi,t 1u for ϵPp0,1{2q and δt t α for αą2, suffers an expected regret bounded, for every q Pr0,1s, as: Rµp R-ed-UCB,TqďO ϵ pσTq 2 3 pαlog Tq ˆR p1 2ϵq T Proof. Let us define the good events Et Ş i Pr Ks Ei,t that correspond to the event in which all confidence intervals hold: Ei,t: !ˇˇˇrµR-ed,hi,t i ptq pµR-ed,hi,t i ptq ˇˇˇďβR-ed,hi,t i ptq ) @i Pr Ts,i Pr Ks We have to analyze the following expression: Rµp R-ed-UCB,Tq E t 1 µi ptq µItp Ni,tq where i Pargmaxi Pr Ks !ř l Pr T sµiplq ) . We decompose the above expression according to the good events Et: Rµp R-ed-UCB,Tq t 1 Erpµi ptq µItp NIt,tqq1t Etus t 1 Erpµi ptq µItp NIt,tqq1t Etus (17) t 1 Erpµi ptq µItp NIt,tqq1t Etus t 1 Er1t Etus, (18) where we exploited µi ptq µItp NIt,tqď1 in line (18). Now, we bound the second summation, recalling that αą2: t 1 Er1t Etus t 1 Prp Etq 1 i Pr Ks Ei,t i Pr Ks Ei,t t 2 Prp Ei,tq, where the first inequality is obtained with Prp E1qď1 and the second with a union bound over r Ks. Recalling Prp Ei,tq Stochastic Rising Bandits was bounded in Lemma A.4, we bound the summation with the integral as in Lemma C.4 to get: t 2 Prp Ei,tqď ÿ t 2 2t1 αď2K ż 8 x 1 x1 αdx 2K From now on, we proceed the analysis under the good events Et, recalling that Biptq pµR-ed,hi,t i ptq βR-ed,hi,t i pt,δtq. We consider each addendum of the summation and we exploit the optimism, i.e., Bi ptqďBItptq: µi ptq µItp NIt,tq BItptq BItptqďmin %1,µi ptq Bi ptq looooooomooooooon BItptq µItp NIt,tq ďmint1,BItptq µItp NIt,tqu. Now, we work on the term inside the minimum: BItptq µItp NIt,tq pµ R-ed,h It,t It ptq β R-ed,h It,t It pt,δtq µItp NIt,tq (19) ďrµ R-ed,h It,t It ptq µItp NIt,tq looooooooooooooomooooooooooooooon 2β R-ed,h It,t It pt,δtq looooooooomooooooooon where line (19) follows from the definition of Biptq, and line (20) derives from the fact that we are under the good event Et. We now decompose over the arms and consider one term at a time. We start with (a): t 1 min ! 1,rµ R-ed,h It,t It ptq µItp NIt,tq ) ď2K ÿ j 3 min ! 1,rµ R-ed,hi,ti,j i pti,jq µipjq ) j 3 min " 1, 1 2p2ti,j 2pj 1q hi,ti,j 1qγippj 1q 2hi,ti,j 1q * j 3 mint1,Tγipj 2tϵpj 1ququ (22) j 3 mint1,Tγiptp1 2ϵqjuqu (23) j 3 γiptp1 2ϵqjuqq (24) ď2K T q R 1 1 2ϵ tp1 2ϵq Ni,T u ÿ j t3p1 2ϵqu γipjq (25) ď2K T q R 1 1 2ϵ i Pr Ks Υµptp1 2ϵq Ni,T u,qq (26) ď2K KT q R 1 1 2ϵ ˆR p1 2ϵq T V ,q , (27) where line (21) follows from Lemma A.3, line (22) is obtained by bounding 2ti,j 2pj 1q hi,ti,j 1ď2T and exploiting the definition of hi,t tϵNi,t 1u, line (23) follows from the observation j 2tϵpj 1quěj 2ϵpj 1qětp1 2ϵqju, line (24) is obtained from the already exploited inequality mint1,xuďmint1,xuqďxq for q Pr0,1s, line (25) is an application of Lemma C.1, line (26) applies the definition of Υµp ,qq, and line (27) follows from Lemma C.2 recalling that ř Stochastic Rising Bandits 2ϵq Ni,T uďp1 2ϵq T. 
Let us now move to the concentration term (b). We decompose over the arms as well, taking care of the pulls in which hi,j 0, that are at most 1 P 1 t 1 min ! 1,2β R-ed,h It,t It pt,δtq ) ďK K R1 1,2σpti,t pj 1q hi,ti,t 1q where line (28) follows from bounding tαďT α and from the definition of hi,t tϵNi,t 1u. To bound the summation, we compute the minimum integer value j (that turns out to be independent of i) of j such that the minimum is attained by its second argument: tϵpj 1qu3 ď1ùñtϵpj 1quěp2σTq 2 3 p10αlog Tq S 1 ϵ p2σTq 2 3 p10αlog Tq Thus, we have: V K ˆ j 1 R1 10αlogp Tq ż 8 pϵpx 1q 1q 3 2 dx (30) K Kj 4KσT a ϵpϵpj 1q 1q ϵ p2σTq 2 3 p10αlog Tq where line (29) is obtained by splitting the summation based on the value of j , line (30) comes from bounding the summation with the integral (Lemma C.4), and line (31) follows from substituting the value of j and simple algebraic manipulations. Putting all together, we obtain: Rµp R-ed-UCB,Tqď1 2K ϵ p2σTq 2 3 p10αlog Tq 1 3 KT q R 1 1 2ϵ ˆR p1 2ϵq T O ˆ KT q R 1 1 2ϵ ˆR p1 2ϵq T ϵ pσTq 2 3 pαlog Tq A.2. Proofs of Section 5 Theorem 5.1 (Seznec et al., 2020). Let πg µ be the oracle greedy policy: πg µptq Pargmax i Pr Ks tµiptqu, @t Pr Ts. Stochastic Rising Bandits Then, πg µ is optimal for the restless non-decreasing bandits (i.e., under Assumption 3.1). Proof. Trivially follows from the fact that the greedy policy at each round t is selecting the largest expected reward, therefore any optimal policy other than the greedy one should select a larger expected reward at least for a single round t1, which is in contradiction with the definition of greedy policy. Lemma A.5. For every arm i Pr Ks and every round t Pr Ts, let us define: µR-less i ptq: µipti,Ni,t 1q pt ti,Ni,t 1qµipti,Ni,t 1q µipti,Ni,t 1 1q ti,Ni,t 1 ti,Ni,t 1 1 , if Ni,t 1ě2 else µR-less i ptq: 8. Then, µR-less i ptqěµiptq and, if Ni,t 1ě2, it holds that: µR-less i ptq µiptqď t ti,Ni,t 1 γipti,Ni,t 1 1q. Proof. Let us consider the following derivation: µiptq µipti,Ni,t 1q l ti,Ni,t 1 γiplq ďµipti,Ni,t 1q pt ti,Ni,t 1qγipti,Ni,t 1q (32) ďµipti,Ni,t 1q pt ti,Ni,t 1qµipti,Ni,t 1q µipti,Ni,t 1 1q ti,Ni,t 1 ti,Ni,t 1 1 :µR-less i ptq, (33) where line (32) follows from Assumption 3.2 and line (33) from Lemma C.3. Moreover, if Ni,t 1ě2, we have: µR-less i ptq µiptq pt ti,Ni,t 1qµipti,Ni,t 1q µipti,Ni,t 1 1q ti,Ni,t 1 ti,Ni,t 1 1 µipti,Ni,t 1q µiptq loooooooooomoooooooooon ďpt ti,Ni,t 1qµipti,Ni,t 1q µipti,Ni,t 1 1q ti,Ni,t 1 ti,Ni,t 1 1 t ti,Ni,t 1 ti,Ni,t 1 ti,Ni,t 1 1 ti,Ni,t 1 1 ÿ l ti,Ni,t 1 1 γiplq, ď t ti,Ni,t 1 γipti,Ni,t 1 1q, where we employed Assumption 3.2 in the last line, noting 1 ti,Ni,t 1 ti,Ni,t 1 1 řti,Ni,t 1 1 l ti,Ni,t 1 1 γiplqďγipti,Ni,t 1 1q. Theorem 5.2. Let T PN, then R-less-UCB (Algorithm 1) with Biptq µR-less i ptq suffers an expected regret bounded, for every q Pr0,1s, as: Rµp R-less-UCB,Tqď2K KT V ,q 1 q 1 . Proof. We have to analyze the following expression: Rµp R-less-UCB,Tq t 1 µi t ptq µItptq, Stochastic Rising Bandits where i t Pargmaxi Pr Kstµiptqu for all t Pr Ts. We consider each round at a time, recalling that Biptq µR-less i ptq, and using optimism, i.e., Bi t ptqďBItptq, we have: µi ptq µItptq BItptq BItptqďmin % 1,µi t ptq Bi t ptq looooooomooooooon BItptq µItptq ďmint1,BItptq µItptqu. (34) Now we consider the term inside the minimum, when NIt,t 1ě2: BItptq µItptq µR-less It ptq µItptq (35) ďpt ti,Ni,t 1qγipti,Ni,t 1 1q, (36) where to get line (36) we applied Lemma A.5. 
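In words, the index of Lemma A.5 extrapolates the payoff observed at the most recent pull along the most recent per-round increment; by concavity (Assumption 3.2) this extrapolation can only over-estimate µ_i(t). A minimal sketch, assuming the expected payoffs at the pull rounds are directly observable (i.e., the deterministic restless setting; all names are illustrative):

```python
def rless_deterministic_index(pull_times, payoffs, t):
    """Minimal sketch of the optimistic index of Lemma A.5 (deterministic restless case).
    pull_times[j] is the round of the (j+1)-th pull of the arm and payoffs[j] the
    corresponding expected payoff mu_i(pull_times[j]), assumed directly observable."""
    if len(pull_times) < 2:
        return float("inf")                   # optimistic default when N_{i,t-1} < 2
    t_last, t_prev = pull_times[-1], pull_times[-2]
    mu_last, mu_prev = payoffs[-1], payoffs[-2]
    # extrapolate the latest payoff along the latest observed per-round increment;
    # by concavity (Assumption 3.2) this over-estimates mu_i(t)
    slope = (mu_last - mu_prev) / (t_last - t_prev)
    return mu_last + (t - t_last) * slope
```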
Let us plug the expression derived in Equation (34) and decompose the summation of this term w.r.t. the K arms: Rµp R-less-UCB,Tqď t 1 min 1,pt ti,Ni,t 1qγipti,Ni,t 1 1q, ( j 3 mint1,pti,j ti,j 1qγipti,j 2qu j 3 pti,j ti,j 1qyγipti,j 2qy (37) j 3 pti,j ti,j 1q j 3 γipti,j 2q j 3 γipj 2q ˆ Ni,T , y 1 y ˆ Ni,T , y 1 y ď2K T y KΥµ line (37) follows from the inequality mint1,xuďmint1,xuyďxy for y P 0, 1 2 , line (38) follows from H older s inequality with powers 1 y ě1 and 1 1 y ě1 (since y P 0, 1 2 ), line (39) is obtained from observing that řNi,T j 3 pti,j ti,j 1qďT and γipti,j 2qďγipj 2q from Assumption 3.2, line (40) follows from Jensen s inequality as y P 0, 1 2 and observing: ˆ Ni,T , y 1 y ˆ Ni,T , y 1 y ˆ Ni,T , y 1 y where line (41) is obtained from Lemma C.2. The final theorem statement is obtained by defining q: y 1 y Pr0,1s and substituting it to the above equation. Stochastic Rising Bandits Estimator Construction for the Stochastic Rising Restless Setting We provide the intuition behind the estimator construction and explain why it differs significantly from the one employed for the deterministic case. We start observing that for every l Pt2,...,Ni,t 1u, we have that: µiptq µipti,lq loomoon (past payoff) j tl γipjq looomooon (sum of future increments) ď µiplq lomon (past payoff) pt ti,lq γipti,l 1q looomooon (past increment) where the inequality follows from Assumption 3.2. Since do not have access to γipti,l 1q and we cannot directly estimate it, we need to perform a further bounding step. Specifically, based on Lemma C.3, we bound for every l Pt2,...,Ni,t 1u and h Prl 1s: γipti,l 1q looomooon (past increment at ti,l) ď µipti,lq µipti,l hq ti,l ti,l h loooooooooomoooooooooon (average past increment over tti,l h,...,ti,lu) We report a first proposal of optimistic approximation of µiptq, i.e., rrµR-ed,h i ptq, and the corresponding estimator, i.e., ppµR-ed,h i ptq, that are defined in terms of a window of size 1ďhďt Ni,t 1{2u: rrµR-less,h i ptq: 1 l Ni,t 1 h 1 µipti,lq loomoon (past payoff) pt ti,lq µipti,lq µipti,l hq ti,l ti,l h loooooooooomoooooooooon (average past increment) ppµR-less,h i ptq: 1 l Ni,t 1 h 1 Rti,l lomon (estimated past payoff) pt ti,lq Rti,l Rti,l h ti,l ti,l h looooooomooooooon (estimated average past increment) Unfortunately, this estimator, although intuitive, does not enjoy desirable concentration properties due to the presence of the denominator ti,l ti,l h that is inconveniently correlated with the numerator Rti,l Rti,l h. For this reason, we resort to different estimators, with better concentration properties but larger bias: rµR-less,h i ptq: 1 l Ni,t 1 h 1 µipti,lq loomoon (past payoff) pt lq µipti,lq µipti,l hq h loooooooooomoooooooooon (average past increment) pµR-less,h i ptq: 1 l Ni,t 1 h 1 Rti,l lomon (estimated past payoff) pt lq Rti,l Rti,l h h looooooomooooooon (estimated average past increment) These estimators are actually upper-bounds of the previous ones since ti,l ti,l hěh and ti,lěl. The proof is composed of the following steps: (i) Lemma A.6 shows that rµR-less,h i ptq is an upper-bound for µiptq and provides a bound to its bias for every value of h; (ii) Lemma A.7 analyzes the concentration of pµR-less,h i ptq around rµR-less,h i ptq for a specific choice of δt t α and when hi,h: hp Ni,t 1q is a function of the number of pulls Ni,t 1 only; (iii) Theorem 5.3 bounds the expected regret of R-less-UCB when hi,h tϵNi,t 1u for ϵPp0,1{2q. Stochastic Rising Bandits Lemma A.6. 
For every arm i Pr Ks, every round t Pr Ts, and window width 1ďhďt Ni,t 1{2u, let us define: rµR-less,h i ptq 1 l Ni,t 1 h 1 ˆ µipti,lq pt lqµipti,lq µipti,l hq otherwise if h 0, we set rµR-less,h i ptq: 8. Then, rµR-less,h i ptqěµipti,Ni,t 1q and, if Ni,t 1ě2 it holds that: rµR-less,h i ptq µiptqď p2t 2Ni,t 1 h 1qpti,Ni,t 1 ti,Ni,t 1 2h 1q 2h γipti,Ni,t 1 2h 1q. Proof. Let us start by observing the following equality holding for every l Pt2,...,Ni,t 1u: µiptq µipti,lq j ti,l γipjq. By averaging over a window of length h, we obtain: l Ni,t 1 h 1 j ti,l γipjq l Ni,t 1 h 1 pµipti,lq pt ti,lqγipti,l 1qq (42) l Ni,t 1 h 1 µipti,lq t ti,l ti,l ti,l h j ti,l h γipjq l Ni,t 1 h 1 ˆ µipti,lq pt ti,lqµipti,lq µipti,l hq ti,l ti,l h l Ni,t 1 h 1 ˆ µipti,lq pt lqµipti,lq µipti,l hq :rµR-less,h i ptq, (44) where lines (42) and (43) follow from Assumption 3.2, and line (44) is obtained from observing that ti,lěl and ti,l ti,l hěh. Concerning the bias, when Ni,t 1ě2, we have: rµR-less,h i ptq µiptq 1 l Ni,t 1 h 1 ˆ µipti,lq pt lqµipti,lq µipti,l hq l Ni,t 1 h 1 pt lqµipti,lq µipti,l hq l Ni,t 1 h 1 pt lqµipti,lq µipti,l hq ti,l ti,l h ti,l ti,l h l Ni,t 1 h 1 pt lqγipti,l hq ti,l ti,l h ď ti,Ni,t 1 ti,Ni,t 1 2h 1 h2 γipti,Ni,t 1 2h 1q l Ni,t 1 h 1 pt lq (47) Stochastic Rising Bandits p2t 2Ni,t 1 h 1qpti,Ni,t 1 ti,Ni,t 1 2h 1q 2h γipti,Ni,t 1 2h 1q, (48) where line (45) follows from observing that µipti,lqďµiptq, line (46) derives from Assumption 3.2 and bounding µipti,lq µipti,l hq ti,l ti,l h ďγipti,l hq, line (47) is obtained by bounding ti,l ti,l hďti,Ni,t 1 ti,Ni,t 1 2h 1 and γipti,l hqď γipti,Ni,t 1 2h 1q, and line (48) follows from computing the summation. Lemma A.7. For every arm i Pr Ks, every round t Pr Ts, and window width 1ďhďt Ni,t 1{2u, let us define: pµR-less,h i ptq: 1 l Ni,t 1 h 1 ˆ Rti,l pt lq Rti,l Rti,l h βR-less,h i pt,δq: σpt Ni,t 1 h 1q otherwise if h 0, we set pµR-less,h i ptq: 8 and βR-less,h i pt,δq: 8. Then, if the window size depends on the number of pulls only hi,t hp Ni,t 1q and if δt t α for some αą2, it holds for every round t Pr Ts that: Pr ˇˇˇpµR-less,hi,t i ptq rµR-less,hi,t i ptq ˇˇˇąβR-less,hi,t i pt,δtq ď2t1 α. Proof. Under the event thi,t 0u, we have that pµR-less,hi,t i ptq rµR-less,hi,t i ptq βR-less,hi,t i pt,δq 8 and, under the convention p 8q p 8q 0 the event 0ąβR-less,h i pt,δq does not hold. Therefore, we conduct the proof under the event thi,tě1u. Hence: Pr ˇˇˇpµR-less,hi,t i ptq rµR-less,hi,t i ptq ˇˇˇąβR-less,hi,t i pt,δtq (49) Pr ˇˇˇpµR-less,hp Ni,t 1q i ptq rµR-less,hp Ni,t 1q i ptq ˇˇˇąβR-less,hp Ni,t 1q i pt,δtq (50) ďPr Dn Pt0,...,t 1u s.t. hpnqě1: ˇˇˇpµR-less,hpnq i ptq rµR-less,hpnq i ptq ˇˇˇąβR-less,hpnq i pt,δtq n Pt0,...,t 1u:hpnqě1 Pr ˇˇˇpµR-less,hpnq i ptq rµR-less,hpnq i ptq ˇˇˇąβR-less,hpnq i pt,δtq , (51) where line (50) follows from the definition of hi,t hp Ni,t 1q, and line (51) derives from a union bound over n. Differently from the rested case, in which the distribution of all random variable involved is fully determined having fixed Ni,t 1, in the restless case this is no longer the case. Indeed, the distribution of the rewards does not depend on the number of pulls, but on the round in which the arm was pulled. Thus, we need a more articulated argument. We start rewriting the estimator with a summation over rounds: hpnq pµR-less,hpnq i ptq rµR-less,hpnq i ptq ˆ Xti,l pt lq Xti,l Xti,l hpnq s 1 ϵs Ys Xs, (52) ϵs 1t Is iu, Ys ˆ 1t Ni,s Ptn hpnq 1,...,nuu ˆ 1 t Ni,s 1t Ni,s Ptn 2hpnq 1,...,n hpnquu t Ni,s hpnq Xs Rs µipsq. 
The rationale behind this decomposition is to use random variable ϵs to select the pulls of arm i, Ys to define the quantity by which Xs is multiplied. In particular, if the pull belongs to the set of the most recent hpnq pulls, i.e., Stochastic Rising Bandits Ni,s Ptn hpnq 1,...,nu, we multiply Xs by the constant 1 t Ni,s hpnq . Instead, if the pull belongs to less recent hpnq pulls, i.e., Ni,s Ptn 2hpnq 1,...,n hpnqu, we multiply Xs by t Ni,s hpnq hpnq . Now, we define the sequence of random times at which arm i was pulled for the j-th time: ti,j: min t Pr T st Ni,t ju, j Prns, and we introduce the random variables r Xj: Xti,j and r Yj: Yti,j. To prove that r Yj r Xj is a martingale difference sequence w.r.t. to the filtration it generates, we apply a Doob s optional skipping argument (Doob, 1953; Bubeck et al., 2008). We introduce the filtration Fτ 1 σp I1,R1,...,Iτ 1,Rτ 1,Iτq and we need to show that: (i) Zτ řτ s 1ϵs Ys Xs is a martingale, and (ii) tti,j τu PFτ 1 for τ Prt 1s. Concerning (i), we have: Er Zτ|Fτ 1s Zτ 1 ϵτYτ Er Xτ|Fτ 1s Zτ 1, since ϵτYτ is fully determined by Fτ 1 and either ϵτ 0 or Iτ i, thus, ϵτ Er Xτ|Fτ 1s ϵτ Er Rτ µipτq|Fτ 1s 0. Concerning (ii), tti,j τu PFτ 1 is trivially verified. We recall that, since r Yj Yti,j we have that Ni,ti,j j: r Yj ˆ 1tj Ptn hpnq 1,...,nuu ˆ 1 t j 1tj Ptn 2hpnq 1,...,n hpnquu t j hpnq From which, by substituting into Equation (52) and properly solving the indicator functions, we have: j 1 r Xj r Yj j n 2hpnq 1 t j hpnq r Xj. We compute the square of the weights and apply a derivation similar to that of Lemma A.4: j n 2hpnq 1 2 ď 5pt n hpnq 1q2 Thus, we can now apply Azuma-H oeffding s inequality (Lemma C.5): Pr ˇˇˇpµR-less,hpnq i ptq rµR-less,hpnq i ptq ˇˇˇąβR-less,hpnq i pt,δtq s 1 ϵs Xs Ys ˇˇˇˇˇąhpnqβR-less,hpnq i pt,δtq j 1 r Xj r Yj ˇˇˇˇˇąhpnqβR-less,hpnq i pt,δtq hpnqβR-ed,hpnq i pt,δtq 2 2σ2 5pt n hpnq 1q2 By replacing into Equation (51) and summing over n, we obtain: n Pt0,...,t 1u:hpnqě1 2δtď n 0 2δt 2t1 α. Theorem 5.3. Let T PN, then R-less-UCB (Algorithm 1) with Biptq pµR-less,hi,t i ptq βR-less,hi,t i ptq, hi,t tϵNi,t 1u Stochastic Rising Bandits for ϵPp0,1{2q, and δt t α for αą2, suffers an expected regret bounded, for every q Pr0,1s, as: Rµp R-less-UCB,TqďO ϵ pσTq 2 3 pαlog Tq 2q 1 q plog Tq ˆR p1 2ϵq T Proof. Let us define the good events Et Ş i Pr Ks Ei,t that correspond to the event in which all confidence intervals hold: Ei,t: !ˇˇˇrµR-less,hi,t i ptq pµR-less,hi,t i ptq ˇˇˇďβR-less,hi,t i ptq ) @i Pr Ts,i Pr Ks. We have to analyze the following expression: Rµp R-less-UCB,Tq E t 1 µi t ptq µItptq where i t Pargmaxi Pr Kstµiptqu for all t Pr Ts. We decompose according to the good events Et: Rµp R-less-UCB,Tq t 1 E µi t ptq µItptq 1t Etu ı t 1 E µi t ptq µItptq 1t Etu ı t 1 E µi t ptq µItptq 1t Etu ı t 1 Er1t Etus, where we exploited µi t ptq µItptqď1 in the inequality. Now, we bound the second summation, as done in Theorem 4.4: t 1 Er1t Etusď1 2K From now on, we will proceed the analysis under the good event Et, recalling that Biptq pµR-less,hi,t i ptq βR-less,hi,t i ptq. Let t Pr Ts, and we exploit the optimism, i.e., Bi t ptqďBItptq: µi ptq µItptq BItptq BItptqďmin % 1,µi t ptq Bi t ptq looooooomooooooon BItptq µItptq ďmint1,BItptq µItptqu. 
Now, we work on the term inside the minimum: BItptq µItptq pµ R-less,h It,t It ptq β R-less,h It,t It ptq µItptq (53) ďrµ R-less,h It,t It ptq µItptq looooooooooooomooooooooooooon 2β R-less,h It,t It ptq loooooooomoooooooon where line (53) follows from the definition of Biptq and line (54) from the good event Et. We proceed decomposing over the Stochastic Rising Bandits arms, starting with (a): t 1 min ! 1,rµ R-less,h It,t It ptq µItptq ) ď2K ÿ j 3 min ! 1,rµ R-less,hi,ti,j i pti,jq µipti,jq ) j 3 min " 1, p2ti,j 2pj 1q hi,ti,j 1qpti,j 1 ti,j 2hi,t 1q 2hi,t γipti,pj 1q 2hi,ti,j 1q * (55) j 3 min " 1, T 2 tϵpj 1quγipti,j 2tϵpj 1quq * (56) j 3 min " 1, T 2 tϵpj 1quγiptp1 2ϵqjuq * (57) ˆγiptp1 2ϵqjuq j 3 γiptp1 2ϵqjuq z 1 z ď2K T 2z R1 tϵp Ni,T 1qu ÿ tp1 2ϵq Ni,T u ÿ j t3p1 2ϵqu γipjq z 1 z ď2K T 2zp1 logpϵTqqz R1 tp1 2ϵq Ni,T u ÿ j t3p1 2ϵqu γipjq z 1 z ď2K T 2zp1 logpϵTqqz R1 ˆ tp1 2ϵq Ni,T u, z 1 z ď2K T 2zp1 logpϵTqqz R1 ˆ tp1 2ϵq Ni,T u, z 1 z ď2K T 2zp1 logpϵTqqz R1 ˆR p1 2ϵq T where line (55) follows from the bias bound of Lemma A.6, line (56) is obtained from bounding p2ti,j 2pj 1q hi,ti,j 1qpti,j 1 ti,j 2hi,t 1qď2T 2 and using the definition of hi,t, line (57) derives from observing that γipti,jqďγipjq for Assumption 3.2 and having bounded the floor analogously as done in Theorem 4.4, line (58) from the inequality mint1,xuďmint1,xuzďxz for z Pr0,1{2s, line (59) is obtained from H older s inequality with exponents 1 z ě1 and 1 1 z ě1 respectively, line (60) is an application of Lemma C.1 to independently to both inner summations, line (61) derives from bounding the harmonic sum, i.e., řtϵp Ni,T 1qu t2ϵu 1 j ď1 logpϵp Ni,T 1qqď1 logpϵTq, line (62) follows from Jensen s inequality, line (63) is obtained from Lemma C.2. By recalling q z 1 z Pr0,1s, we obtain: 2q 1 q p1 logpϵTqq ˆR p1 2ϵq T V ,q 1 1 q . Concerning the term (b), we recall that β R-less,h It,t It ptq equals the bonus term used in the rested setting and, consequently Stochastic Rising Bandits from Theorem 4.4: t 1 min ! 1,2β R-ed,h It,t It pt,δtq ) ďK ˆ 3 1 ϵ p2σTq 2 3 p10αlog Tq Putting all together, we obtain: Rµp R-less-UCB,Tqď1 2K ϵ p2σTq 2 3 p10αlog Tq 2q 1 q p1 logpϵTqq ˆR p1 2ϵq T V ,q 1 1 q . Stochastic Rising Bandits B. Bounding the Cumulative Increment Let us consider the case in which γiplqďl c for all i Pr Ks and l Pr Ts. We bound the cumulative increment with the corresponding integral using Lemma C.4, depending on the value of cq: l 1 γiplqqď1 ż T K x 1 x cqdxď1 K 1 cq 1 1 cq if cqă1 log T K if cq 1 1 cq 1 if cqą1 Thus, depending on the value of c, there will be different optimal values for q in the rested and restless cases that optimize the regret upper bound. B.1. Rested Setting Let us start with the rested case. From Theorem 4.3, we have: Rµď2K T q KΥµ V ,q ď2K KT q K T 1 cq q K1 qcp1 cqq if cqă1 K if cq 1 T q cq 1 if cqą1 T 1 cq q K1 qcp1 cqq if cqă1 K if cq 1 T q mint1,cq 1u if cqą1 where we have highlighted the dominant term. For the case c Pp0,1q we consider the first case only and minimize over q: RµďO ˆ K min q Pr0,1s T 1 cq q K1 qcp1 cqq For the case c 1, we still obtain RµďOp Tq. Instead, for c Pp1, 8q, we have the three cases: Kminq Pr0,1{cq T 1 cq q K1 qcp1 cqq T 1 c log T K minq Pp1{c,1s T q mint1,cq 1u O ˆ KT 1 c log T B.2. Restless Setting Let us now move to the restless setting. From Theorem 5.2, we have: V ,q 1 1 q ď2K KT q 1 p1 cqq if cqă1 q q 1 log T K 1 q 1 if cq 1 T q q 1 cq 1 if cqą1 q 1 p1 cqq if cqă1 q q 1 log T K 1 q 1 if cq 1 T q q 1 mint1,cq 1u if cqą1 , @q Pr0,1s. 
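As a quick numerical sanity check of the three regimes above, one can compare the cumulative increment Σ_l γ_i(l)^q for γ_i(l) = l^{-c} against the corresponding integral bounds of Lemma C.4. The following sketch is purely illustrative; the chosen values of c, q, and T are arbitrary and not tied to the experiments.

```python
import numpy as np

def cumulative_increment(T, c, q):
    """Sum_{l=1}^{T} gamma(l)^q for gamma(l) = l^{-c}."""
    l = np.arange(1, T + 1, dtype=float)
    return float(np.sum(l ** (-c * q)))

T = 200_000
q = 1.0
for c in (0.5, 1.0, 2.0):
    s = cumulative_increment(T, c, q)
    if c * q < 1:
        bound = 1 + (T ** (1 - c * q) - 1) / (1 - c * q)   # ~ T^{1-cq}: sublinear but growing
    elif c * q == 1:
        bound = 1 + np.log(T)                              # ~ log T
    else:
        bound = 1 + 1 / (c * q - 1)                        # bounded by a constant
    print(f"c={c}, q={q}: sum={s:.2f}, integral bound={bound:.2f}")
```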
Stochastic Rising Bandits For the case c Pp0,1q, we consider the first case only and minimize over q: K min q Pr0,1s T for sufficiently large T "K. For the case c 1, it is simple to prove that the case cq 1 leads to the smallest regret: RµďKT 1 c 1 ˆ log T Finally, for the case c Pp1, 8q, we have to consider all the three cases: minq Pr0,1{cq T 1 cq q q 1 p1 cqq T 1 c 1 log T minq Pp1{c,1s T q q 1 mint1,cq 1u KT 1 c 1 ˆ log T Stochastic Rising Bandits C. Technical Lemmas Lemma C.1. Let M ě3, and let f :NÑR, and βPp0,1q. Then it holds that: j 3 fptβjuqď R 1 l t3βu fplq. Proof. We simply observe that the minimum value of tβju is t3βu and its maximum value is tβMu. Each element tβju changes value at least one time every Q 1 β U times. Lemma C.2. Under Assumption 3.2, it holds that: max p Ni,T qi Pr Ks Ni,T ě0,ř i Pr Ks Ni,T T i Pr Ks Υµp Ni,T ,qqďKΥµ Proof. We first claim that there exists an optimal assignment of N i,T are such that |N i,T N i1,T |ď1 for all i,i1Pr Ks. By contradiction, suppose that the only optimal assignments are such that there exists a pair i1,i2Pr Ks such that : N i2,T N i1,T ą1. In such a case, we have: Υµ N i1,T ,q Υµ N i2,T ,q 2Υµ N i1,T ,q j 1 γi p N i1,T l 1q ď2Υµ N i1,T ,q j 0 γi p N i1,T l 1q j 1 γi p N i1,T l 1q Υµ N i1,T r {2s,q Υµ N i1,T t {2u,q . where the inequality follows from Assumption 3.2. By redefining r N i1,T : N i1,T t {2u and r N i2,T : N i1,T r {2s, we have that r N i1,T r N i2,T N i1,T N i2,T and | r N i1,T r N i2,T |ď1. Thus, we have found a better solution to the optimization problem, contradicting the hypothesis. Since the optimal assignment fulfills |N i,T N i1,T |ď1, it must be that N i,T ď P T for all i Pr Ks. Lemma C.3. Under Assumptions 3.1 and 3.2, for every i Pr Ks, k,k1PN with k1ăk, for both rested and restless bandits, it holds that: γipkqď µipkq µipk1q Proof. Using Assumption 3.2, we have: γipkq 1 k k1 l k1 γipkqď 1 k k1 l k1 γiplq 1 k k1 l k1 pµipl 1q µiplqq µipkq µipk1q where the first inequality comes from the concavity of the reward function, and the second equality from the definition of increment. Lemma C.4. Let a,b PN and let f :ra,bsÑR. If f is monotonically non-decreasing function, then: n a fpnqď ż b x a fpxqdx fpbqď ż b 1 Stochastic Rising Bandits If f is monotonically non-increasing, then: n a fpnqďfpaq ż b x a fpxqdxď ż b x a 1 fpxqdx. Proof. Let us consider the intervals Ii rxi 1,xis with x0 a and xi xi 1 1 for i Prb as. If f is monotonically non-decreasing, we have that for all i Prb as and x PIi it holds that fpxqěfpxi 1q and consequently ş Ii fpxqdxě fpxi 1qvolp Iiq fpxi 1q. Thus: i 1 fpxi 1q fpbqď Ii fpxqdx fpbq ż b x a fpxqdx fpbq. Recalling that fpbqď şb 1 x bfpxqdx, we get the second inequality. Conversely, if f is monotonically non-increasing, then for all i Prb as and x PIi, it holds that fpxqěfpxiq and consequently ş Ii fpxqdxěfpxiq. Thus: n a fpnq fpaq i 1 fpxiqďfpaq Ii fpxqdx fpaq ż b x a fpxqdx. Recalling that fpaqď şa x a 1fpxqdx, we get the second inequality. Theorem C.5 (H oeffding-Azuma s inequality for weighted martingales). Let F1Ă ĂFn be a filtration and X1,...,Xn be real random variables such that Xt is Ft-measurable, Er Xt|Ft 1s 0 (i.e., a martingale difference sequence), and ErexppλXtq|Ft 1sďexp λ2σ2 2 for any λą0 (i.e., σ2-subgaussian). Let α1,...,αn be non-negative real numbers. Then, for every κě0 it holds that: 2σ2řn t 1α2 i Proof. It is a straightforward extension of Azuma-H oeffding inequality for subgaussian random variables. 
We apply the Chernoff s method for some są0: t 1 αt Xtąκ Pr esřn t 1αt Xt ąesκ ď E esřn t 1αt Xt where the last inequality follows from the application of Markov s inequality. We use the martingale property to deal with the expectation. By the law of total expectation, we have: E esřn t 1αt Xt ı E esřn 1 t 1 αt Xt E esαn Xn|Ft 1 ı . Using now the subgaussian property, we have: E esαn Xn|Ft 1 ďexp ˆs2α2 nσ2 An inductive argument, leads to: E esřn t 1αt Xt ı ďexp Stochastic Rising Bandits Thus, minimizing w.r.t. są0, we have: t 1 αt Xtąκ ďmin sě0 exp t 1 α2 n sκ 2σ2řn t 1α2n being the minimum attained by s κ σ2řn t 1α2n . The reverse inequality can be derived analogously. A union bound completes the proof. Lemma C.6. Let Υµp T,qq be as defined in Equation (2) for some q Pr0,1s. Then, for all i Pr Ks and l PN the following statements hold: if γiplqďbe cl, then Υµp T,qqďO bq e cq if γiplqďbl c with cqą1, then Υµp T,qqďO bq cq 1 ; if γiplqďbl c with cq 1, then Υµp T,qqďOpbqlog Tq; if γiplqďbl c with cqă1, then Υµp T,qqďO bq T 1 cq Proof. The proofs of all the statements are obtained by bounding the summation defining Υµp T,qq with the corresponding integrals, as in Lemma C.4. Let us start with γiplqďbe cl: l 1 γiplqqďbqe cq ż T x 1 bqe cqxdxďbqe cq bq cq e cq O ˆ bq e cq We now move to γiplqďbl c. If cqă1, we have: l 1 γiplqqďbq ż T x 1 bqx cqdx bq bq cq 1 O ˆ bq For cq 1, we obtain: l 1 γiplqqďbq ż T x dx bq bqlog T Opbqlog Tq. Finally, for cqă1, we have: l 1 γiplqqďbq ż T x 1 bqx cqdx bq bq T 1 cq 1 cq O ˆ bq T 1 cq The results of Table 1 are obtained by setting b 1. D. Efficient Update Under the assumption that the window size depends on the number of pulls only and that 0ďhpn 1q hpnqď1, we can employ the following efficient Op1q update for R-ed-UCB and R-less-UCB. Denoting with n the number of pulls of arm i, we update the estimator at every time step t Pr Ts as: pµhpnq i ptq 1 hpnq ˆ an tpan bnq Stochastic Rising Bandits where the following sequences are updated only when the arm is pulled: # an 1 ripnq ripn hpnqq if hpnq hpn 1q an 1 ripnq otherwise , # bn 1 ripn hpnqq ripn 2hpnqq if hpnq hpn 1q bn 1 ripn 2hpnq 1q otherwise , # cn 1 nripnq pn hpnqqripn hpnqq if hpnq hpn 1q cn 1 nripnq otherwise , # dn 1 pn hpnqqripn hpnqq pn 2hpnqqripn 2hpnqq if hpnq hpn 1q dn 1 pn 2hpnq 1qripn 2hpnq 1q otherwise , where we have abbreviated ripnq: Rti,n. E. Experimental Setting and Additional Results E.1. Parameter Setting The choices of the parameters of the algorithms we compared R-less/ed-UCB with are the following: Rexp3: VT K since in our experiments we consider the reward of each arm to evolve from 0 to 1, thus the maximum global variation possible is equal the number of arms of the bandit; γ min ! 1, b Klog K pe 1q T rp Klog Kq1{3p T{VT q2{3s as recommended by Besbes et al. (2014); KL-UCB: c 3 as required by the theoretical results on the regret provided by Garivier & Capp e (2011); Ser4: according to what suggested by Allesiardo et al. (2017) we selected δ 1{T, ϵ 1 KT , and φ b N T Klogp KT q; SW-UCB: as suggested by Garivier & Moulines (2011) we selected the sliding-window τ 4?T log T and the constant ξ 0.6; SW-KL-UCB as suggested by Garivier & Moulines (2011) we selected the sliding-window τ σ 4{5; SW-TS: as suggested by Trov o et al. (2020) for the smoothly changing environment we set β 1{2 and sliding-window τ T 1 β ? E.2. IMDB Experiment We created a bandit environment in which each of the classification algorithms is an arm of the bandit. 
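A minimal sketch of this interaction, which is detailed step by step right after, is reported below. It assumes scikit-learn-style online learners (exposing partial_fit/predict) as arms and a generic policy object with select/update methods; all names and interfaces are illustrative stand-ins, not the code used in the experiments.

```python
import numpy as np

def run_rested_model_selection(policy, learners, X, y, T, seed=0):
    """Rested online model selection: only the pulled classifier is trained."""
    rng = np.random.default_rng(seed)
    fitted = [False] * len(learners)
    rewards = []
    for t in range(1, T + 1):
        i = policy.select(t)                        # the agent pulls arm I_t
        idx = rng.integers(len(X))                  # a random point x_t of the dataset
        x_t, y_t = X[idx:idx + 1], y[idx:idx + 1]
        # the base algorithm classifies the sample (default prediction before any training)
        y_hat = int(learners[i].predict(x_t)[0]) if fitted[i] else 0
        r_t = 1 - abs(int(y_t[0]) - y_hat)          # reward R_t = 1 - |y_t - yhat_t|
        learners[i].partial_fit(x_t, y_t, classes=[0, 1])  # only arm I_t is updated (rested)
        fitted[i] = True
        policy.update(i, r_t)                       # feedback to the bandit policy
        rewards.append(r_t)
    return np.cumsum(rewards)
```

With this interface, online logistic regressions trained via stochastic gradient descent and small multi-layer perceptrons supporting partial_fit could stand in for the base algorithms described next.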
The interaction at each round t ∈ [T] of the real-world experiment consists of the following steps:
- the agent decides to pull arm I_t;
- a random point x_t of the IMDB dataset is selected and supplied to the classification algorithm associated with arm I_t;
- the base algorithm classifies the sample, i.e., it provides the prediction ŷ_t ∈ {0, 1} for the selected sample x_t;
- the environment generates the reward by comparing the prediction ŷ_t to the target class y_t through the function R_t = 1 − |y_t − ŷ_t|;
- the base algorithm is updated using (x_t, y_t).
Since the base algorithms are trained only when their arm is selected, this problem belongs to the rested scenario. For the classification task we decided to employ:
- 2 Online Logistic Regression (LR) methods with different schemes for the learning rate λ_t;
- 5 Neural Networks (NNs) differing in shape and number of neurons.
Figure 6. Empirical learning curves of the classification algorithms (arms) of the IMDB experiment.
Figure 7. Cumulative regret in a 2-arm online model selection on the IMDB dataset (30 runs, 95% c.i.).
Specifically, we adopt a decreasing scheme for the learning rate, λ_t = β/t (denoted by LR(t) from now on), and a constant learning rate, λ_t = β (denoted by LR from now on). Moreover, the NNs use the rectified linear unit, i.e., relu(x) = max(0, x), as activation function, a constant learning rate α = 0.001, and the Adam stochastic gradient optimizer for fitting. Two of the chosen nets have only one hidden layer, with 1 and 2 neurons, respectively; the third net has 2 hidden layers, with 2 neurons each; and two nets have 3 layers, with 2, 2, 2 and 1, 1, 2 neurons, respectively. We refer to a specific NN by denoting in round brackets the cardinalities of its layers, e.g., the one having 2 layers with 2 neurons each is denoted by NN(2, 2). We analyzed their global performance on the IMDB dataset by averaging 1,000 independent runs in which each strategy is sequentially fed with all the available 50,000 samples. The goal was to determine, at each step, the value of the payoff µ_i(n). Figure 6a provides the average learning curves of the selected algorithms. As expected, from a qualitative perspective, the average learning curves are increasing and concave; however, due to the limited number of simulations, Assumptions 3.1 and 3.2 are not globally satisfied. We also perform an experiment using only LR(t) and LR as arms. Figure 7a reports the result of a run of the MAB algorithms over the IMDB scenario. The analogy between this result and the one of the 2-arm synthetic rested bandit (Figure 4b) is clear: indeed, R-ed-UCB outperforms the other baselines when the learning curves of the base algorithms intersect one another at some point.
E.3. Pulls of each arm
Figure 8 presents the average number of pulls of each arm for each of the algorithms analysed in the synthetic experiments of Section 6. Figure 8a shows how R-less-UCB is able to identify and discard the majority of the suboptimal arms using a few pulls, and it is second only to SW-TS, which seems to commit to a single arm that turns out to be the optimal one (arm 13). Figure 8b shows that R-ed-UCB explored arms 13 and 1, respectively the best and the second-best, more than the others, and most likely needs a longer time horizon to determine which of the two is the best.
Figure 8c highlights that R-ed-UCB undoubtedly identified which arm is the best (arm 2), while KL-UCB, SW-UCB, and SW-KL-UCB do not identify the best arm. Ser4, Rexp3, and SW-TS pulled the best arm slightly more than 50% of the time, paying the cost of the already discussed initial learning phase.
Figure 8. Average number of pulls: (a) 15 arms R-less, (b) 15 arms R-ed, (c) 2 arms R-ed.
E.4. Additional Experimental Results
We evaluated the performance of the algorithms over 50 different bandits with K ∈ {2, ..., 15} randomly generated arms over a time horizon of T = 200,000 rounds. We averaged the run of each algorithm on a single scenario over 10 independent experiments and compared the expected ranking of the considered algorithms in order to draw up a leaderboard: in every scenario we ranked the algorithms based on their empirical regret, giving the first placement to the one with the lowest value. We report the ranks of the algorithms averaged over the 50 experiments in Table 2.

Table 2. Ranking of the algorithms (50 bandits, 10 runs, 95% c.i. in brackets).
Algorithm    Rested Setting           Restless Setting          Restless Setting (Heuristic)
Ours         R-ed-UCB 4.98 (0.34)     R-less-UCB 5.14 (0.38)    R-less-UCB-H 1.90 (0.30)
KL-UCB       2.56 (0.43)              2.54 (0.34)               2.46 (0.31)
Rexp3        5.10 (0.26)              5.20 (0.26)               6.08 (0.16)
SW-TS        2.84 (0.35)              2.86 (0.39)               4.76 (0.19)
SW-UCB       2.12 (0.44)              2.58 (0.47)               3.08 (0.30)
Ser4         6.84 (0.15)              6.60 (0.28)               6.66 (0.18)
SW-KL-UCB    3.56 (0.38)              3.08 (0.45)               3.06 (0.48)

In the rested case, R-ed-UCB is among the worst (4.98 on average); however, this is again because, on average, the algorithm is not superior to the baselines, which, conversely, do not provide any theoretical guarantees in some specific settings (see the 2-arm experiment). In the restless case, R-less-UCB achieves a worse-than-average performance, probably influenced by the characteristics of the randomly generated bandits. Due to these unsatisfactory results, we propose a slight modification of the R-less-UCB upper bound as follows:

$$\hat{\mu}^{\text{R-less},h}_i(t) := \frac{1}{h} \sum_{l=N_{i,t-1}-h+1}^{N_{i,t-1}} \left( R_{t_{i,l}} + (t - t_{i,l}) \, \frac{R_{t_{i,l}} - R_{t_{i,l-h}}}{t_{i,l} - t_{i,l-h}} \right),$$

which we call R-less-UCB-H to denote that it is a heuristic method, i.e., one without theoretical guarantees on the regret. While the performance of the heuristic is good in practice (it achieves the best overall result), its downside is that the theoretical guarantees on the regret have to be reconsidered.
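For completeness, the heuristic estimate just defined can be computed as in the following sketch. Only the point estimate is shown; the window rule h = ⌊εN_{i,t-1}⌋ and the value of ε are illustrative assumptions, as are all names.

```python
import numpy as np

def rless_ucb_h_estimate(pull_times, rewards, t, eps=0.1):
    """Heuristic R-less-UCB-H estimate: the windowed R-less construction with the
    increment rescaled by the elapsed rounds t_{i,l} - t_{i,l-h} instead of h.
    Window rule and eps are illustrative; only the point estimate is computed."""
    rewards = np.asarray(rewards, dtype=float)
    pull_times = np.asarray(pull_times, dtype=float)
    n = len(rewards)                          # N_{i,t-1}: pulls of the arm so far
    h = int(np.floor(eps * n))                # window width
    if h == 0:
        return np.inf                         # optimistic default before enough pulls
    l = np.arange(n - h + 1, n + 1)           # the h most recent pulls
    r_l, r_lh = rewards[l - 1], rewards[l - h - 1]
    t_l, t_lh = pull_times[l - 1], pull_times[l - h - 1]
    return float(np.mean(r_l + (t - t_l) * (r_l - r_lh) / (t_l - t_lh)))
```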