# Forced Exploration in Bandit Problems

Qi Han, Li Zhu*, Fei Guo
School of Software Engineering, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi, 710049, China
{qihan19,co.fly}@stu.xjtu.edu.cn, zhuli@xjtu.edu.cn

*Li Zhu is the corresponding author.
Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The multi-armed bandit (MAB) is a classical sequential decision problem. Most existing work requires assumptions about the reward distribution (e.g., boundedness), but practitioners may have difficulty obtaining the information about these distributions needed to design models for their problems, especially in non-stationary MAB problems. This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution while still achieving substantial regret upper bounds. To this end, we propose a novel algorithm that alternates between a greedy rule and forced exploration. Our method can be applied to Gaussian, Bernoulli and other subgaussian distributions, and its implementation does not require additional information. We employ a unified analysis method for different forced-exploration strategies and provide problem-dependent regret upper bounds for the stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.

## Introduction

The multi-armed bandit (MAB) is a classical reinforcement learning problem. It simulates the process of pulling different arms of a slot machine, where each arm has an unknown reward distribution and only the selected arm's reward is observed. The learner aims to find the optimal policy that maximizes the cumulative reward. To achieve better long-term rewards, the learner must balance exploration and exploitation. MAB has been widely used in many sequential decision tasks, such as online recommendation systems (Li et al. 2011; Li, Karatzoglou, and Gentile 2016), online advertising campaigns (Schwartz, Bradlow, and Fader 2017), and diagnosis and treatment experiments (Vermorel and Mohri 2005).

Most existing multi-armed bandit algorithms are based on the Upper Confidence Bound (UCB) (Auer, Cesa-Bianchi, and Fischer 2002) and Thompson Sampling (TS) (Thompson 1933), which often rely on assumptions about the reward distribution. For example, some studies assume that the reward distribution is sub-Gaussian, and the algorithm implementation requires knowledge of the variance parameter. Additionally, some works assume that the reward distribution is bounded, and the algorithm design requires specific upper-bound information.

In the standard MAB model, the reward distribution remains constant. However, in real-world scenarios, the distribution of rewards may vary over time. In the context of clinical trials, the target disease may undergo mutations, causing the initially optimal treatment to become less effective than another candidate (Gorre et al. 2001). Similarly, in online recommendation systems, user preferences are prone to evolve (Wu, Iyer, and Wang 2018), rendering collected data progressively outdated. Over the past ten years, several works have therefore studied non-stationary multi-armed bandit problems.
These approaches can be broadly classified into two categories: one category detects changes in the reward distribution using change-point detection algorithms (Liu, Lee, and Shroff 2018; Cao et al. 2019; Auer, Gajane, and Ortner 2019; Chen et al. 2019; Besson et al. 2022), while the other focuses on passively mitigating the impact of past observations (Garivier and Moulines 2011; Raj and Kalyani 2017; Trovò et al. 2020). Among them, Auer, Gajane, and Ortner (2019), Chen et al. (2019), and Besson et al. (2022) can derive regret bounds without knowing the number of changes. However, as in the stationary setting, these algorithms also require prior information about the reward distributions for their implementation.

Recently, non-parametric multi-armed bandit algorithms have garnered considerable attention (Lattimore 2017; Kveton et al. 2019a,b; Riou and Honda 2020; Baudry, Russac, and Cappé 2021; Liu et al. 2022). The implementation of these algorithms does not require strong parametric assumptions on the reward distributions; that is, a single implementation can be applied to several different reward distributions. Although the implementation of some of these algorithms (Kveton et al. 2019a,b; Riou and Honda 2020) does not need information about the reward distribution in advance, they require the reward distribution to be bounded and therefore cannot be applied to the wider class of unbounded distributions. Wang et al. (2020) propose a perturbation-based exploration method called ReBoot, which has been analyzed only for Gaussian distributions. Chan (2020) proposes SSMC, which compares subsample means of the leading arm with the sample means of its competitors for one-parameter exponential family distributions. However, as with existing subsampling algorithms (Baransi, Maillard, and Mannor 2014), it has to store the entire history of rewards for all arms. Baudry, Russac, and Cappé (2021) propose LB-SDA and LB-SDA-LM with a deterministic subsampling rule for one-parameter exponential family distributions; LB-SDA-LM is introduced to address the issues of high storage cost and computational complexity, and with a sliding window LB-SDA can be applied to non-stationary settings. However, it cannot be applied to Gaussian distributions with unknown variance, and the arms must come from the same one-parameter exponential family. Lattimore (2017) proposes an algorithm that can be applied to various common distributions but requires the learner to know a specific kurtosis bound. Liu et al. (2022) propose extended robust UCB policies that do not require knowledge of an upper bound on specific moments of the reward distributions; however, their algorithm needs a specific moment control coefficient as input.

Table 1 lists the required assumptions and preliminary information for some of the algorithms mentioned above. Our algorithm is applicable to sub-Gaussian distributions, an assumption that covers common distributions such as Gaussian, bounded, and Bernoulli distributions. The implementation of our algorithm does not require knowledge of the variance parameter of the sub-Gaussian distribution.

### Contributions

In this paper, we propose a bandit algorithm that achieves respectable upper bounds on regret without using the parameters of a distribution model in its implementation. Our method is simple and easy to implement, with forced exploration as its core idea.
At each time step, the algorithm either forces an arm to be explored or selects an arm by a greedy rule. Specifically, our algorithm takes a non-decreasing sequence {f(r)} as input; this sequence controls how each arm is forced to be pulled. In the stationary setting, we provide problem-dependent regret upper bounds. The upper bound depends on the input sequence: different input sequences lead to different bounds. For example, if {f(r)} is an exponential sequence, our algorithm guarantees a problem-dependent asymptotic upper bound of $O((\log T)^2)$ on the expected number of pulls of each suboptimal arm. In the piecewise-stationary setting, we use a sliding window $\tau$ together with a reset strategy to adapt to changes in the reward distribution. The reset strategy periodically resets the sequence {f(r)}, which ensures that the algorithm maintains its exploration ability after every $\tau$ time steps. Furthermore, we show that our algorithm achieves a problem-dependent asymptotic regret bound of order $O(\sqrt{TB_T})$ if the number of breakpoints $B_T$ is a constant independent of $T$. This asymptotic bound matches the finite-time lower bound of Garivier and Moulines (2011) up to logarithmic factors.

## Problem Formulation

**Stationary environments.** Consider a multi-armed bandit problem with finite time horizon $T$ and a set of arms $\mathcal{A} := \{1, \ldots, K\}$. At time step $t$, the learner chooses an arm $i_t \in \mathcal{A}$ and receives the corresponding reward $X_t(i_t)$. Each arm's reward comes from a distribution unknown to the learner, and the expectation of $X_t(i)$ is denoted by $\mu(i) = \mathbb{E}[X_t(i)]$. Let $i^*$ denote the optimal arm, i.e., $\mu(i^*) = \max_{i \in \{1,\ldots,K\}} \mu(i)$. The gap between the expected reward of the optimal arm and that of a suboptimal arm is denoted by $\Delta(i) = \mu(i^*) - \mu(i)$. The history $h_t$ is defined as the sequence of actions and rewards from the previous $t-1$ time steps. A policy $\pi$ is a function $\pi(h_t) = i_t$ that selects arm $i_t$ to play at time step $t$ based on the history $h_t$. The performance of a policy $\pi$ is measured by the cumulative expected regret

$$R^\pi_T = \mathbb{E}\Big[\sum_{t=1}^{T} \mu(i^*) - \mu(i_t)\Big], \qquad (1)$$

where $\mathbb{E}[\cdot]$ is the expectation with respect to the randomness of $\pi$. Let $k_T(i) = \sum_{t=1}^{T} \mathbb{1}\{i_t = i,\ \mu(i) \neq \mu(i^*)\}$; the regret can then be written as

$$R^\pi_T = \sum_{i=1}^{K} \Delta(i)\,\mathbb{E}[k_T(i)]. \qquad (2)$$

In later sections, we derive regret upper bounds for our method by bounding $\mathbb{E}[k_T(i)]$.

**Piecewise-stationary environments.** The piecewise-stationary bandit problem has been extensively studied. It is more challenging than the stationary problem because the learner must balance exploration and exploitation both within each stationary phase and across the changes between phases. In this setting, the reward distributions remain constant for a certain period and change at unknown time steps, called breakpoints. The number of breakpoints is $B_T = \sum_{t=1}^{T-1} \mathbb{1}\{\exists i \in \mathcal{A}: \mu_t(i) \neq \mu_{t+1}(i)\}$. The optimal arm may vary over time and is denoted by $\mu_t(i^*_t) = \max_{i \in \{1,\ldots,K\}} \mu_t(i)$. The performance of a policy is measured by

$$R^\pi_T = \mathbb{E}\Big[\sum_{t=1}^{T} \mu_t(i^*_t) - \mu_t(i_t)\Big]. \qquad (3)$$

As in the stationary environment, the regret can be analyzed by bounding the expected number of pulls of each suboptimal arm $i$ up to time step $T$.
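To make the regret definitions above concrete, the following minimal Python sketch (our illustration, not code from the paper; all function and variable names are ours) computes the empirical counterpart of the decomposition in Equation (2) for one simulated run, given the true means and the sequence of pulled arms.

```python
import numpy as np

def cumulative_regret(mu, pulls):
    """Empirical counterpart of Eq. (2): sum of the gaps Delta(i) over all pulls.

    mu    : true means mu(i), shape (K,)
    pulls : chosen arm indices i_t, shape (T,)
    """
    mu = np.asarray(mu, dtype=float)
    gaps = mu.max() - mu                     # Delta(i) = mu(i*) - mu(i)
    return gaps[np.asarray(pulls)].sum()

# Toy example: 3 arms and a uniformly random policy over T steps.
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9])
T = 10_000
pulls = rng.integers(0, len(mu), size=T)
print(cumulative_regret(mu, pulls))          # roughly T * mean(Delta) ~ 3.7e3
```

Averaging this quantity over many independent runs estimates the expected regret in Equation (1).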
## Stationary Environments

### Forced Exploration

Our algorithm pulls arms in two ways. The first is based on the greedy rule, which pulls the arm with the largest value of the estimator. The second is a forced exploration step. Specifically, the algorithm takes a sequence {f(1), f(2), ...} as input. Let $p(i)$ denote the number of consecutive time steps for which arm $i$ has not been pulled. At round $r$, if $p(i) \ge f(r)$, arm $i$ is forced to be pulled once and $p(i)$ is reset to 0. Once all arms have been pulled at least once in round $r$, we set $r = r + 1$ and repeat the process in the next round. Algorithm 1 shows the pseudocode of our method.

```
Algorithm 1: FE
Input: non-decreasing sequence {f(r)}, K arms, horizon T
Initialization: t = 1, r = 0, f(0) = 0; for all i in {1,...,K}: p(i) = 0, flag(i) = 0
 1: while t < T do
 2:   if p(i) < f(r) for all i in {1,...,K} then
 3:     Pull arm i_t = argmax_i mu_hat(i)
 4:   else
 5:     Pull arm i_t = argmax_i p(i)
 6:   end if
 7:   Update the estimate mu_hat(i_t); p(i_t) = 0; flag(i_t) = 1
 8:   p(i) = p(i) + 1 for all unpulled arms i
 9:   if flag(i) == 1 for all i in {1,...,K} then
10:     r = r + 1
11:     flag(i) = 0 for all i in {1,...,K}
12:   end if
13:   t = t + 1
14: end while
```

| Algorithm | Assumption | Known parameters |
|---|---|---|
| UCB, TS | support in $[a, b]$ or $\sigma$-subgaussian | $a, b$ or $\sigma$ |
| Robust-UCB | $\mathbb{E}[\lvert X-\mu\rvert^{1+\epsilon}] \le u$, $0 < \epsilon \le 1$ | $\epsilon, u$ |
| ReBoot (Wang et al. 2020) | Gaussian | - |
| SSMC, LB-SDA | one-parameter exponential families | - |
| Lattimore (2017) | bounded kurtosis $\mathbb{E}[(\frac{X-\mu}{\sigma})^4] \le \kappa$ | $\kappa$ |
| Liu et al. (2022) | - | moment control coefficient |
| Our method | subgaussian | - |

Table 1: Assumptions and implementation parameters.

We require the input sequence to be non-decreasing to prevent the algorithm from over-exploring. If $f(r)$ remains below a small constant $c$ after some time step $t$, the forced exploration alone pulls each suboptimal arm roughly once every $c$ steps during the remaining $T - t$ steps, so $R^\pi_T$ grows at least on the order of $\frac{(K-1)(T-t)}{c}$ times the smallest gap; that is, the algorithm can only obtain linear regret. Step 3 is the greedy rule. In Step 5, the algorithm pulls an arm that has not been pulled for at least $f(r)$ steps; to simplify notation, we write this choice as $\arg\max_i p(i)$.

**Related to Explore-Then-Commit (ETC).** Garivier, Lattimore, and Kaufmann (2016) study the ETC policy for two-armed bandit problems with Gaussian rewards. ETC explores until a stopping time $s$. If we let $f(r) = 1$ for $r \le s$ and let $f(r)$ exceed the horizon for $r > s$, our method is equivalent to ETC. The essential difference is that ETC is non-adaptive: for different problem instances, the set of all exploration rounds and the choice of arms therein is fixed before the first round. For common input sequences such as $f(r) = \sqrt{T}$ or $f(r) = r$, the exploration schedule of our method depends on the history of pulled arms, so our method is adaptive.

**Related to $\epsilon_t$-greedy.** The $\epsilon_t$-greedy strategy tosses a coin with success probability $\epsilon_t$ at time step $t$ and selects an arm uniformly at random on success; otherwise, the arm with the largest average reward is selected. If the exploration probability satisfies $\epsilon_t \propto t^{-1/3}$, this greedy strategy achieves a regret bound of order $O(T^{2/3})$. Similar to our method, this strategy must decide whether to explore at each time step. The difference is that $\epsilon_t$-greedy is non-adaptive, since it does not adapt its exploration schedule to the history.
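The following Python sketch mirrors Algorithm 1; it is a simplified reimplementation for illustration only. The Gaussian rewards in the usage example and the helper names (`fe`, `reward_fn`, `history`) are our own choices, not the paper's released code.

```python
import numpy as np

def fe(f, reward_fn, K, T, rng):
    """Forced Exploration (FE, Algorithm 1), simplified sketch.

    f         : non-decreasing sequence, called as f(r), with f(0) = 0
    reward_fn : reward_fn(i, rng) -> one sampled reward of arm i
    """
    mu_hat = np.zeros(K)            # empirical means
    n_pulls = np.zeros(K, dtype=int)
    p = np.zeros(K, dtype=int)      # steps since arm i was last pulled
    flag = np.zeros(K, dtype=bool)  # pulled at least once in the current round
    r = 0
    history = []
    for t in range(T):
        if np.all(p < f(r)):
            i_t = int(np.argmax(mu_hat))   # Step 3: greedy rule
        else:
            i_t = int(np.argmax(p))        # Step 5: forced exploration
        x = reward_fn(i_t, rng)
        n_pulls[i_t] += 1
        mu_hat[i_t] += (x - mu_hat[i_t]) / n_pulls[i_t]   # running mean update
        p[i_t] = 0
        flag[i_t] = True
        p[np.arange(K) != i_t] += 1        # arms not pulled wait one more step
        if flag.all():                     # every arm pulled in this round
            r += 1
            flag[:] = False
        history.append(i_t)
    return history

# Example: 5 Gaussian arms, exponential sequence f(r) = e^{r / log T}.
rng = np.random.default_rng(1)
mu = rng.uniform(0, 1, size=5)
T = 10_000
f = lambda r: 0 if r == 0 else np.exp(r / np.log(T))
pulls = fe(f, lambda i, g: g.normal(mu[i], 1.0), K=5, T=T, rng=rng)
```

With the exponential sequence used in the example, forced pulls become geometrically rarer over time, which is the behaviour analyzed in the next section.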
### Regret Analysis

In this section, we bound the expected number of pulls of the suboptimal arms. Detailed proofs of the theorem and corollaries are available at https://arxiv.org/abs/2312.07285. Let $h_t(i)$ denote the number of times arm $i$ is forced to be pulled (Step 5) up to time $t$. We give the following theorem and a proof sketch.

**Theorem 1.** Assume that the reward distributions are $\sigma$-subgaussian and let {f(r)} be a non-decreasing sequence. For any suboptimal arm $i$,

$$\mathbb{E}[k_T(i)] \le h_T(i) + 2m\,e^{\frac{1}{m}}\sum_{t=1}^{T} e^{-\frac{h_t(i)}{m}},$$

where $m = \frac{8\sigma^2}{\Delta(i)^2}$.

**Proof sketch.** We bound the number of pulls of a suboptimal arm in Step 3 and in Step 5 of Algorithm 1 separately. The number of pulls under the greedy rule can be estimated using the properties of subgaussian random variables. Summing the regret caused by the greedy rule and by the forced exploration (which can be denoted as $h_T(i)$) leads to the theorem.

Next, we give more specific upper bounds for different input sequences.

**Corollary 1 (Constant sequence).** Let $f(r) = \sqrt{T}$. We have $h_t(i) \ge \big\lfloor \frac{t}{\sqrt{T}+1} \big\rfloor$, and therefore

$$\mathbb{E}[k_T(i)] \le \sqrt{T}\big(1 + 2m^2 e^{\frac{2}{m}}\big) + 1. \qquad (6)$$

**Corollary 2 (Linear sequence).** Let $f(r) = r$. We have

$$\mathbb{E}[k_T(i)] \le \sqrt{2T} + K^2 + 6m^3 e^{\frac{3}{m}}. \qquad (7)$$

**Corollary 3 (Exponential sequence).** Let $f(r) = a^r$ with $a > 1$. We have

$$\mathbb{E}[k_T(i)] \le \frac{\log(T(a-1)+1)}{\log(a)} + (K+1)\log(K+1) + 2m\,e^{\frac{1}{m}}\sum_{t=1}^{T}\Big(1+\frac{a-1}{K+1}\,t\Big)^{-\frac{1}{m\log(a)}}. \qquad (8)$$

### Lower Bounds

Many studies have established regret lower bounds for the stationary multi-armed bandit problem (Slivkins et al. 2019; Lai, Robbins et al. 1985; Bubeck, Cesa-Bianchi, and Lugosi 2013). A commonly used problem-dependent lower bound for bounded rewards, or subgaussian rewards with variance parameter 1, is

$$\liminf_{T\to\infty} \frac{R^\pi_T}{\log(T)} \ge \sum_{i:\Delta(i)>0} \frac{C_I}{\Delta(i)}, \qquad (9)$$

where $C_I$ is a problem-dependent constant. Define $f^{-1}(r) := \min\{x : f(x) \ge r\}$ and let $t_0$ denote the number of time steps with $f(r) \le K$. Our method then has a simple problem-dependent lower bound:

$$\mathbb{E}[k_T(i)] \ge \max\Big\{ n : \sum_{r=f^{-1}(K+1)}^{n} f(r) \le T - t_0 \Big\}. \qquad (10)$$

Note that this lower bound only accounts for the forced exploration and ignores the regret incurred by the greedy rule; the true lower bound is therefore larger than the one computed by Equation (10). If $f(r)$ is the constant sequence ($f(r) = \sqrt{T}$) or the linear sequence ($f(r) = r$), this problem-dependent lower bound is $\Omega(\sqrt{T})$. According to Corollaries 1 and 2, the lower bound for the constant and linear sequences is of the same order as the upper bound. If $f(r)$ is the exponential sequence ($f(r) = a^r$), this lower bound is $\Omega\big(\frac{\log(T)}{\log(a)}\big)$.

### Upper Bound of the Exponential Sequence

Given the lower bounds for the three sequences above, one might expect the regret upper bound of the exponential sequence to reach the lower bound in Equation (9). However, it often fails to achieve this goal: lacking knowledge of $\sigma$ and $\Delta(i)$, we cannot tune the parameter $a$ to obtain an optimal upper bound in Equation (8). Without information about the reward distributions, there are two simple ways to set the exponential sequence:

- $a$ is a small constant independent of $T$. If the value of $m = \frac{8\sigma^2}{\Delta(i)^2}$ happens to satisfy
  $$m\log(a) < 1, \qquad (11)$$
  then $\mathbb{E}[k_T(i)] = O(m e^{1/m}\log(T))$. This upper bound is optimal with respect to the order of $T$. In general, however, we cannot guarantee that the parameter $a$ satisfies Equation (11).
- $a$ is associated with $T$, for example $a = e^{\frac{1}{\log(T)}}$. Then $f(r) = e^{\frac{r}{\log(T)}}$, and we obtain the asymptotic bound
  $$\mathbb{E}[k_T(i)] = O\big(m e^{\frac{1}{m}}(\log(T))^2\big). \qquad (12)$$
  Like the other two sequences, this problem-dependent asymptotic upper bound matches the order of its lower bound.

Our upper bound is not optimal, but there is work showing that optimality is impossible in this setting: Agrawal, Koolen, and Juneja (2021) and Ashutosh et al. (2021) have recently proved the impossibility of problem-dependent logarithmic regret for light-tailed distributions without further assumptions on the tail parameters.

**Remark 1.** We use an example to illustrate Corollary 3 and Equation (12). Assume that the reward distributions are Gaussian with variance 1. If $T$ is sufficiently large, it holds that $m\log(a) = \frac{m}{\log(T)} < 1$. If $m > 1$, we get

$$\mathbb{E}[k_T(i)] \le (K+2)(\log(T))^2 + 16e(K+1)(\log(T))^2; \qquad (13)$$

if $m \le 1$, the bound above holds trivially. We can then derive the following regret upper bound:

$$R^\pi_T \le \big[16e(K+1) + (K+2)\big](\log(T))^2 \sum_{i=1}^{K}\Delta(i). \qquad (14)$$

This regret bound matches the optimal problem-dependent bound $O(\log T)$ for stationary multi-armed bandits up to an extra $\log(T)$ factor and a factor depending on $K$, both of which are due to the forced exploration. Note that this regret upper bound also holds asymptotically.
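To illustrate how the three schedules differ in their exploration budgets, the snippet below (our own rough illustration, not from the paper) counts how many forced-exploration rounds each sequence completes within a horizon $T$, using a simplified cost model of roughly $K + f(r)$ time steps per round; this is only a crude proxy for the quantity bounded in Equation (10).

```python
import numpy as np

def completed_rounds(f, T, K):
    """Largest n such that n rounds of forced exploration fit into horizon T,
    assuming each round costs roughly K pulls plus a waiting period of f(r) steps
    (our simplified cost model). A crude proxy for the count in Equation (10)."""
    spent, n = 0, 0
    while True:
        spent += K + f(n + 1)
        if spent > T:
            return n
        n += 1

T, K = 100_000, 10
print(completed_rounds(lambda r: np.sqrt(T), T, K))             # ~ sqrt(T) rounds
print(completed_rounds(lambda r: r, T, K))                      # ~ sqrt(2T) rounds
print(completed_rounds(lambda r: np.exp(r / np.log(T)), T, K))  # ~ (log T)^2 rounds
```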
## Non-Stationary Environments

In this section, we consider the piecewise-stationary setting. We employ a method often used in non-stationary bandit problems: sliding windows. One might think that all we need is to add a sliding window to the mean estimator $\hat\mu$. However, this alone does not work in non-stationary situations. Consider an input sequence $f(r)$ that grows to $+\infty$. If $f(r) > T$ holds after some time step $t$, then during the time steps in $[t, T]$ the algorithm never forces the exploration of any arm and only pulls arms according to the mean estimator. If the reward distribution changes, the algorithm then suffers very poor performance. To keep the exploration ability alive when the reward distribution changes, we propose to periodically reset the exploration sequence.

### Reset of {f(r)}

Define

$$\hat\mu_t(\tau, i) = \frac{1}{N_t(\tau, i)}\sum_{s=t-\tau+1}^{t} X_s(i)\,\mathbb{1}\{i_s = i\}, \qquad N_t(\tau, i) = \sum_{s=t-\tau+1}^{t} \mathbb{1}\{i_s = i\}.$$

Here $\hat\mu_t(\tau, i)$ denotes the sliding-window estimator, which uses only the last $\tau$ pulls. We reset $r = 1$ every $\tau$ time steps so that the input sequence re-grows from $f(1)$. This simple reset strategy ensures that each arm is forced to be explored a certain number of times in every interval of length $\tau$. The size of the reset interval is set to $\tau$ for simplicity and ease of analysis; intervals of other sizes can be used. For our method, the chosen reset interval should ensure that each arm can be forced to explore a certain number of times within the interval $[t - \tau + 1, t]$. Algorithm 2 shows the pseudocode of our method for non-stationary settings.

```
Algorithm 2: SW-FE
Input: non-decreasing sequence {f(r)}, sliding window τ, K arms, horizon T
Initialization: t = 1, r = 0, f(0) = 0; for all i in {1,...,K}: p(i) = 0, flag(i) = 0
 1: while t < T do
 2:   if p(i) < f(r) for all i in {1,...,K} then
 3:     Pull arm i_t = argmax_i mu_hat_t(τ, i)
 4:   else
 5:     Pull arm i_t = argmax_i p(i)
 6:   end if
 7:   Update the estimate mu_hat_t(τ, i_t); p(i_t) = 0; flag(i_t) = 1
 8:   p(i) = p(i) + 1 for all unpulled arms i
 9:   if flag(i) == 1 for all i in {1,...,K} then
10:     r = r + 1
11:     flag(i) = 0 for all i in {1,...,K}
12:   end if
13:   if t mod τ == 0 then
14:     r = 1
15:   end if
16:   t = t + 1
17: end while
```

Compared to Algorithm 1, only a sliding window is added to the estimator, and {f(r)} is reset periodically.
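A minimal Python sketch of Algorithm 2 is given below; it is our simplified illustration, not the released implementation. The only changes relative to the FE sketch shown earlier are the sliding-window mean over the last $\tau$ observations and the periodic reset of the round counter $r$; the deque-based bookkeeping and the toy usage parameters are our own choices.

```python
import numpy as np
from collections import deque

def sw_fe(f, reward_fn, K, T, tau, rng):
    """Sliding-Window Forced Exploration (SW-FE, Algorithm 2), simplified sketch.

    f         : non-decreasing sequence with f(0) = 0
    reward_fn : reward_fn(i, t, rng) -> reward of arm i at time t
    """
    window = deque()                # (arm, reward) pairs of the last tau steps
    p = np.zeros(K, dtype=int)      # steps since arm i was last pulled
    flag = np.zeros(K, dtype=bool)
    r = 0
    history = []
    for t in range(1, T + 1):
        # sliding-window estimates mu_hat_t(tau, i), recomputed here for clarity
        sums, counts = np.zeros(K), np.zeros(K)
        for arm, x in window:
            sums[arm] += x
            counts[arm] += 1
        mu_hat = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        if np.all(p < f(r)):
            i_t = int(np.argmax(mu_hat))   # greedy on the windowed mean
        else:
            i_t = int(np.argmax(p))        # forced exploration
        x = reward_fn(i_t, t, rng)
        window.append((i_t, x))
        if len(window) > tau:
            window.popleft()
        p[i_t], flag[i_t] = 0, True
        p[np.arange(K) != i_t] += 1
        if flag.all():
            r += 1
            flag[:] = False
        if t % tau == 0:                   # periodic reset of the sequence
            r = 1
        history.append(i_t)
    return history

# Toy usage with the window size suggested in Remark 2 below.
rng = np.random.default_rng(0)
T, K, B_T = 10_000, 5, 5
tau = int(np.sqrt(T / B_T) * np.log(T))
f = lambda r: 0 if r == 0 else np.exp(r / np.log(tau))
pulls = sw_fe(f, lambda i, t, g: g.normal(i / K, 1.0), K, T, tau, rng)
```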
### Regret Analysis

In the non-stationary setting, more regret is incurred than in the stationary setting; this is the price that must be paid to adapt to changes in the reward distributions. For our approach, the additional regret compared to the stationary environment comes from two sources. First, due to the sliding window, the historical data used to estimate arm expectations are limited to at most $\tau$ observations. Second, and unique to our method, the number of pulls of suboptimal arms increases because the exploration sequence is periodically reset.

Let $h_t(\tau, i)$ denote the number of forced pulls of arm $i$ in the last $\tau$ plays. Since we use a sliding window and a reset strategy, $h_t(\tau, i)$ first increases and then varies within a certain range. Let

$$\Delta_T(i) = \min\{\Delta_t(i) : i \neq i^*_t,\ t \le T\}$$

be the minimum gap between the expected reward of the best arm $i^*_t$ and the expected reward of arm $i$ over all time steps $t \le T$ at which arm $i$ is not the best arm.

**Theorem 2.** Assume that the reward distributions are $\sigma$-subgaussian, let {f(r)} be a non-decreasing sequence, and let $\tau$ be the sliding window. For any suboptimal arm $i$,

$$\mathbb{E}[k_T(i)] \le \frac{T}{\tau}\,h_\tau(\tau, i) + m_T\,e^{\frac{1}{m_T}}\,\frac{T}{\tau}\big(1 + 2m_T\log(\tau)\big) + B_T\,\tau,$$

where $m_T = \frac{8\sigma^2}{\Delta_T(i)^2}$.

**Proof sketch.** The proof is adapted from the analysis of SW-UCB (Garivier and Moulines 2011). The regret comes from three sources: the greedy rule, the forced exploration, and the sliding-window analysis itself. The forced exploration incurs a regret of at most $\frac{T}{\tau}h_\tau(\tau, i)$, and the sliding-window analysis itself contributes at most $B_T\tau$. The regret caused by the greedy rule is estimated by a regret decomposition: $\sum_{t=1}^{T}\mathbb{1}\{i_t = i \neq i^*_t,\ N_t(\tau, i) < A(\tau)\}$ can be bounded by $\frac{T}{\tau}A(\tau)$ with $A(\tau) = 2m_T\log(\tau)$, and the remaining event can be decomposed as

$$\{i_t = i \neq i^*_t,\ N_t(\tau, i) \ge A(\tau)\} \subset \Big\{\hat\mu_t(\tau, i) > \mu_t(i) + \tfrac{\Delta_T(i)}{2},\ N_t(\tau, i) \ge A(\tau)\Big\} \cup \Big\{\hat\mu_t(\tau, i^*_t) \le \mu_t(i^*_t) - \tfrac{\Delta_T(i)}{2},\ N_t(\tau, i^*_t) \ge A(\tau)\Big\}.$$

The analysis of these two events is similar to the stationary setting. Summing over all terms leads to the theorem.

Similar to the stationary setting, we provide specific bounds for some exploration sequences. Our method requires that each arm can be forced to explore within $[t - \tau + 1, t]$. The constant sequence therefore cannot be set to the same value $\sqrt{T}$ as in the stationary scenario, since this could result in long stretches of time without exploring the optimal arm. Because the size of the sliding window is $\tau$, we set the constant sequence to $f(r) = \sqrt{\tau}$.

**Corollary 4.** Let $f(r) = \sqrt{\tau}$. We have

$$\mathbb{E}[k_T(i)] \le B_T\tau + \frac{T}{\tau}\Big(1 + 2m_T\log(\tau) + \sqrt{\tau}\,m_T^2\,e^{\frac{2}{m_T}}\Big). \qquad (16)$$

**Corollary 5.** Let $f(r) = r$. We have

$$\mathbb{E}[k_T(i)] \le B_T\tau + \frac{T}{\tau}\Big(\sqrt{2\tau} + 1 + 2m_T\log(\tau) + 3m_T^3\,e^{\frac{3}{m_T}}\Big). \qquad (17)$$

**Corollary 6.** Let $f(r) = a^r$ with $a > 1$. We have

$$\mathbb{E}[k_T(i)] \le B_T\tau + \frac{T}{\tau}\Big((K+2)\log(\tau+1) + 1 + 2m_T\log(\tau) + 2m_T\,e^{\frac{1}{m_T}}\sum_{t=1}^{\tau}\Big(1+\frac{a-1}{K+1}\,t\Big)^{-\frac{1}{m_T\log(a)}}\Big). \qquad (18)$$

**Remark 2.** Sliding-window methods (Garivier and Moulines 2011; Baudry, Russac, and Cappé 2021) generally suggest a window size of $\tau = \sqrt{T\log(T)/B_T}$. For the constant and linear sequences, we then get $\mathbb{E}[k_T(i)] = O(T^{3/4}\sqrt{B_T\log(T)})$. For the exponential sequence, we can take $a = e^{\frac{1}{\log(\tau)}}$, similar to the stationary setting. It can be observed that setting the sliding window to $\tau = \sqrt{T/B_T}\,\log(T)$ for the exponential sequence yields a smaller upper bound, effectively reducing it by $\sqrt{\log(T)}$. If the number of breakpoints is constant, we have the following asymptotic bound:

$$\mathbb{E}[k_T(i)] = O\big(\sqrt{TB_T}\,\log(T)\big). \qquad (19)$$
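Written out as a small helper (illustrative only; the square-root expressions follow Remark 2, while the function and variable names are ours and the natural logarithm is an implementation assumption), the suggested window sizes are:

```python
import numpy as np

def window_sizes(T, B_T):
    """Sliding-window sizes suggested in Remark 2 (natural logarithm used here)."""
    tau_const_linear = np.sqrt(T * np.log(T) / B_T)  # for the constant / linear f(r)
    tau_exp = np.sqrt(T / B_T) * np.log(T)           # for the exponential f(r)
    return int(tau_const_linear), int(tau_exp)

print(window_sizes(T=100_000, B_T=5))   # approximately (479, 1628)
```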
## Experiments

### Stationary Settings

In this section, we compare our method with other non-parametric bandit algorithms on Gaussian and Bernoulli reward distributions (code: https://github.com/qh1874/ForceExplor). Our method is instantiated with three different sequences: FE-Constant, FE-Linear, and FE-Exp, which use the constant ($f(r) = \sqrt{T}$), linear ($f(r) = r$), and exponential ($f(r) = e^{r/\log(T)}$) sequences, respectively. We compare these three instances of our method with two representative non-parametric algorithms: LB-SDA-LM and Lattimore (2017). Due to the potentially high time complexity of the LB-SDA algorithm, we compare against LB-SDA-LM, an alternative that achieves the same theoretical results with much lower complexity. Since an exact implementation of Lattimore (2017) is challenging, we use a roughly equivalent and efficiently computable alternative (Lattimore 2017).

The means and variances of the Gaussian distributions are randomly generated from uniform distributions: $\mu(i) \sim U(0, 1)$, $\sigma(i) \sim U(0, 1)$. The means of the Bernoulli distributions are also generated from $U(0, 1)$. The time horizon is set to $T = 100000$, and the number of arms is fixed to $K = 10$. We measure the performance of each algorithm by the cumulative expected regret defined in Equation (1). The expected regret is averaged over 100 independent runs, and the 95% confidence interval obtained from these runs is shown as a semi-transparent region in the figures.

Figure 1: Settings with K = 10, T = 100000. Gaussian rewards (a), Bernoulli rewards (b).

Figure 1 shows the results for Gaussian and Bernoulli rewards. The constant and linear sequences exhibit similar performance. The implementation of Lattimore (2017) via the approximation method can lead to significant variance in the experimental results, and its performance can be inferior to the other methods. LB-SDA-LM is applicable to one-parameter exponential family distributions and demonstrates the best performance on Bernoulli distributions; however, its performance is notably weaker on Gaussian distributions with unknown variance in (0, 1). Our approach FE-Exp, although its theoretical upper bound is asymptotic and not optimal, achieves remarkable performance on both Gaussian and Bernoulli rewards.

Figure 2: K = 5, B_T = 5, for Gaussian rewards.

### Non-stationary Settings

In this section, we compare our method with other non-stationary bandit algorithms. Specifically, our method employs four instances: the constant sequence with a sliding window (SW-FE-Constant), the linear sequence with a sliding window (SW-FE-Linear), the exponential sequence with a sliding window (SW-FE-Exp), and the exponential sequence without a sliding window (FE-Exp). We use FE-Exp to evaluate the improvement obtained from the sliding window. We also compare our method with another non-parametric algorithm, SW-LB-SDA (Baudry, Russac, and Cappé 2021). Furthermore, we compare with efficient change-point detection algorithms such as CUSUM (Liu, Lee, and Shroff 2018) and M-UCB (Cao et al. 2019), but only on Bernoulli rewards. Moreover, we compare with SW-TS (Trovò et al. 2020); this method requires information about the Gaussian rewards to be known in advance, and there is no theoretical guarantee for SW-TS except for Bernoulli rewards.

Figure 3: Settings with K = 5, B_T = 5, T = 100000. Gaussian rewards (a), Bernoulli rewards (b).

**Tuning parameters.** Following Remark 2, we set $\tau = \sqrt{T/B_T}\,\log(T)$ for SW-FE-Exp and $\tau = \sqrt{T\log(T)/B_T}$ for SW-FE-Constant and SW-FE-Linear. We set $\tau = \sqrt{T\log(T)/B_T}$ for SW-LB-SDA and SW-TS. For the change-point detection algorithm M-UCB, we set $w = 800$ and $b = \sqrt{(w/2)\log(2KT^2)}$ as suggested by Cao et al. (2019), but set the amount of exploration to $\gamma = \sqrt{KB_T\log(T)/T}$; in practice, using this value instead of the one guaranteed in Cao et al. (2019) improves empirical performance (Baudry, Russac, and Cappé 2021). For CUSUM, following Liu, Lee, and Shroff (2018), we set $\alpha = \sqrt{(B_T/T)\log(T/B_T)}$ and $h = \log(T/B_T)$; for our experimental settings, we choose $M = 50$ and $\epsilon = 0.05$.

The time horizon is set to $T = 100000$. We split the time horizon into 5 phases of equal length and fix the number of arms to $K = 5$. In each stationary phase, the reward distributions are regenerated in the same way as in the stationary setting.
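For illustration, a piecewise-stationary environment of this kind can be generated as follows (our own sketch, not the released code); the phase means and standard deviations are redrawn from U(0, 1) at the start of each equal-length phase.

```python
import numpy as np

def piecewise_gaussian_env(K=5, T=100_000, n_phases=5, seed=0):
    """Piecewise-stationary Gaussian environment: means and standard deviations
    of the K arms are redrawn from U(0, 1) at the start of each equal-length phase."""
    rng = np.random.default_rng(seed)
    phase_means = rng.uniform(0, 1, size=(n_phases, K))
    phase_stds = rng.uniform(0, 1, size=(n_phases, K))
    phase_len = T // n_phases

    def reward(i, t, g=rng):
        phase = min((t - 1) // phase_len, n_phases - 1)   # t is 1-indexed
        return g.normal(phase_means[phase, i], phase_stds[phase, i])

    return reward, phase_means

reward_fn, means = piecewise_gaussian_env()
print(means.round(2))   # expected reward of each arm per phase, cf. Figure 2
```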
Figure 2 depicts the expected rewards of the Gaussian arms with $K = 5$ and $B_T = 5$. The Gaussian and Bernoulli distributions are generated in the same way as in the stationary setting. The expected regret is averaged over 10 independent runs, and the 95% confidence interval obtained from these runs is shown as a semi-transparent region in the figure.

Figure 3 shows the results for Gaussian and Bernoulli rewards in the piecewise-stationary setting. M-UCB and CUSUM require bounded rewards, which is not the case for Gaussian rewards, so we only ran them on Bernoulli rewards. FE-Exp is an algorithm for stationary MAB problems, so its regret oscillates considerably at the breakpoints. SW-FE-Constant and SW-FE-Linear have similar performance, with SW-FE-Constant even performing better. This could be attributed to the significant impact of the problem-dependent term $m_T = \frac{8\sigma^2}{\Delta_T(i)^2}$ on performance: the regret upper bound of SW-FE-Constant is controlled by $m_T^2$, while that of SW-FE-Linear is controlled by $m_T^3$. Our algorithm SW-FE-Exp exhibits competitive results on both Gaussian and Bernoulli rewards.

## Conclusion

In this paper, we have developed a forced exploration algorithm for both stationary and non-stationary multi-armed bandit problems. The algorithm is applicable to a broad class of reward distributions, and its implementation does not require information about the reward distributions. We employ a unified analytical approach for different input sequences and provide regret upper bounds. Experimental results demonstrate that, despite the asymptotic nature of our regret upper bounds, our approach achieves performance comparable to currently popular algorithms.

## References

Agrawal, S.; Koolen, W. M.; and Juneja, S. 2021. Optimal best-arm identification methods for tail-risk measures. Advances in Neural Information Processing Systems, 34: 25578–25590.

Ashutosh, K.; Nair, J.; Kagrecha, A.; and Jagannathan, K. 2021. Bandit algorithms: Letting go of logarithmic regret for statistical robustness. In International Conference on Artificial Intelligence and Statistics, 622–630. PMLR.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47: 235–256.

Auer, P.; Gajane, P.; and Ortner, R. 2019. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, 138–158. PMLR.

Baransi, A.; Maillard, O.-A.; and Mannor, S. 2014. Subsampling for multi-armed bandits. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014, Proceedings, Part I, 115–131. Springer.

Baudry, D.; Russac, Y.; and Cappé, O. 2021. On limited-memory subsampling strategies for bandits. In International Conference on Machine Learning, 727–737. PMLR.

Besson, L.; Kaufmann, E.; Maillard, O.-A.; and Seznec, J. 2022. Efficient change-point detection for tackling piecewise-stationary bandits. The Journal of Machine Learning Research, 23(1): 3337–3376.

Bubeck, S.; Cesa-Bianchi, N.; and Lugosi, G. 2013. Bandits with heavy tail.
IEEE Transactions on Information Theory, 59(11): 7711–7717.

Cao, Y.; Wen, Z.; Kveton, B.; and Xie, Y. 2019. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In The 22nd International Conference on Artificial Intelligence and Statistics, 418–427. PMLR.

Chan, H. P. 2020. The multi-armed bandit problem: An efficient nonparametric solution.

Chen, Y.; Lee, C.-W.; Luo, H.; and Wei, C.-Y. 2019. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, 696–726. PMLR.

Garivier, A.; Lattimore, T.; and Kaufmann, E. 2016. On explore-then-commit strategies. Advances in Neural Information Processing Systems, 29.

Garivier, A.; and Moulines, E. 2011. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, 174–188. Springer.

Gorre, M. E.; Mohammed, M.; Ellwood, K.; Hsu, N.; Paquette, R.; Rao, P. N.; and Sawyers, C. L. 2001. Clinical resistance to STI-571 cancer therapy caused by BCR-ABL gene mutation or amplification. Science, 293(5531): 876–880.

Kveton, B.; Szepesvari, C.; Ghavamzadeh, M.; and Boutilier, C. 2019a. Perturbed-history exploration in stochastic multi-armed bandits. arXiv preprint arXiv:1902.10089.

Kveton, B.; Szepesvari, C.; Vaswani, S.; Wen, Z.; Lattimore, T.; and Ghavamzadeh, M. 2019b. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, 3601–3610. PMLR.

Lai, T. L.; Robbins, H.; et al. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4–22.

Lattimore, T. 2017. A scale free algorithm for stochastic bandits with bounded kurtosis. Advances in Neural Information Processing Systems, 30.

Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 297–306.

Li, S.; Karatzoglou, A.; and Gentile, C. 2016. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 539–548.

Liu, F.; Lee, J.; and Shroff, N. 2018. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Liu, K.; Chen, H.; Deng, W.; and Wu, T. 2022. The Extended UCB Policies for Frequentist Multi-armed Bandit Problems. arXiv:1112.1768.

Raj, V.; and Kalyani, S. 2017. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727.

Riou, C.; and Honda, J. 2020. Bandit algorithms based on Thompson sampling for bounded reward distributions. In Algorithmic Learning Theory, 777–826. PMLR.

Schwartz, E. M.; Bradlow, E. T.; and Fader, P. S. 2017. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4): 500–522.

Slivkins, A.; et al. 2019. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2): 1–286.

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4): 285–294.

Trovò, F.; Paladino, S.; Restelli, M.; and Gatti, N. 2020. Sliding-window Thompson sampling for non-stationary settings. Journal of Artificial Intelligence Research, 68: 311–364.

Vermorel, J.; and Mohri, M. 2005.
Multi-armed bandit algorithms and empirical evaluation. In Machine Learning: ECML 2005: 16th European Conference on Machine Learning, Porto, Portugal, October 3-7, 2005, Proceedings, 437–448. Springer.

Wang, C.-H.; Yu, Y.; Hao, B.; and Cheng, G. 2020. Residual bootstrap exploration for bandit algorithms. arXiv preprint arXiv:2002.08436.

Wu, Q.; Iyer, N.; and Wang, H. 2018. Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 495–504.