# piecewisestationary_dueling_bandits__a1920d05.pdf

Published in Transactions on Machine Learning Research (MM/YYYY)

Piecewise-Stationary Dueling Bandits

Patrick Kolpaczki patrick.kolpaczki@upb.de Department of Computer Science Paderborn University

Eyke Hüllermeier eyke@lmu.de Institute of Informatics, LMU Munich Munich Center for Machine Learning

Viktor Bengs viktor.bengs@lmu.de Institute of Informatics, LMU Munich Munich Center for Machine Learning CIFAR Fellow

Reviewed on Open Review: https: // openreview. net/ forum? id= Wh EHEDP7ZG

We study the piecewise-stationary dueling bandits problem with K arms, where the time horizon T consists of M stationary segments, each of which is associated with its own preference matrix. The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback. To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown. We propose the Beat the Winner Reset algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-art algorithms. We also show a regret bound for the non-stationary case, without requiring knowledge of M or T. We further propose and analyze two meta-algorithms, DETECT for weak regret and Monitored Dueling Bandits for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black-box algorithm. Finally, we prove a worst-case lower bound for expected weak regret in the non-stationary case.

1 Introduction

The stochastic multi-armed armed bandit (MAB) problem (Thompson, 1933; Robbins, 1952) is an online learning framework in which an agent (learner) chooses repeatedly from a set of K options called arms in the course of a sequential decision process with (discrete) time horizon T. Each arm ai is associated with an unknown reward distribution having finite mean and being stationary over the learning process. Choosing arm ai results in a random reward sampled from that distribution. The learner s goal is to choose the optimal arm, i.e., the one having highest mean reward, as often as possible, because a suboptimal arm with lower (expected) reward implies a positive regret (reward difference). Due to the absence of knowledge about the reward distributions, this task of cumulative regret minimization comes with the challenge of tackling the exploration-exploitation dilemma: As the learner requires reasonably certain estimates of the arms mean rewards, it must explore by choosing each arm sufficiently often. Otherwise, because rewards are random, it may believe that a suboptimal arm is optimal. On the other side, too much exploration is not good either, because exploration comes at the cost of exploitation (choosing the presumably optimal arm), thereby increasing cumulative regret.

In many practical applications, the assumption of stationary reward distributions is likely to be violated. In light of this, the non-stationary MAB problem has received increasing interest in the recent past (Hartland

Published in Transactions on Machine Learning Research (MM/YYYY)

et al., 2006; Liu et al., 2018; Auer et al., 2019). Here, two main types of non-stationarity are distinguished: conceptual drift, where the reward distributions change gradually over time, and conceptual shift, where the reward distributions change abruptly at a certain point in time, called changepoint. Consider music recommendation as an example: It is known that user preferences towards a certain genre can change depending on the time of the year (Pettijohn et al., 2010); while this is an example of drift, a shift might be caused by switching between users sharing a single account. Non-stationarity adds another dimension to the exploration-exploitation dilemma: not only must the learner balance exploration and exploitation to maximize the number of times the optimal arm is chosen, but additionally ensure that previous observations are still valid by conducting ancillary exploration to detect changes in the reward distributions. Algorithms tackling the problem can be divided into passively adaptive ones, which devalue older observations and base their strategies on observations made in more recent time steps, e.g. D-UCB (Garivier & Moulines, 2011) and SW-UCB (Garivier & Moulines, 2011), and actively adaptive ones trying to detect changepoints through suitable detection mechanisms (and then discarding observations prior to suspected changepoints), e.g. CUSUM-UCB (Liu et al., 2018), PHT-UCB (Liu et al., 2018), and M-UCB (Cao et al., 2019).

Another assumption of the MAB setting that is often difficult to meet in practice is the provision of feedback in the form of precise numerical rewards. In many cases, the learner is only able to observe feedback of a weaker kind, for example qualitative preferences over pairs of arms. This is especially true when the feedback is provided by a human or implicitly derived from human behavior, such as in clinical treatments (Sui & Burdick, 2014), information retrieval (Zoghi et al., 2016), or recommender systems (Hofmann et al., 2013). For example, if a playlist is recommended to a user, one may conjecture that those songs listened to are preferred to those not listened to. In light of this, a variant called dueling bandits (Yue & Joachims, 2009) has been proposed, in which the learner chooses a pair of arms in each time step, whereupon feedback in the form of a noisy preference is observed. This feedback is governed by an unknown preference matrix which determines for each pair of arms (ai, aj) the probability that ai wins against aj, or in other words, is preferred over aj. In order to define the notions of best arm and regret, a common assumption is the existence of a Condorcet winner (CW), i.e., an arm that is preferred over all other arms (in the sense of winning with probability > 1/2). The average or strong regret of a chosen pair is then defined as the average of the chosen arms calibrated probabilities, i.e., the magnitude of the probability that the CW is preferred over an arm. Obviously, this regret is only zero for the pair containing the best arm twice (full commitment). In contrast, the weak regret (Yue et al., 2012) is the minimum of the chosen arms calibrated probabilities, turning zero as soon as any of them is the CW. This is motivated by scenarios where the worse option does not have an impact on the pair s quality. In this case, a learner can conduct exploration and exploitation simultaneously by playing a reference arm the presumably best one together with another exploration arm .

Although the non-stationary variant of the MAB problem has been studied quite intensely in the recent past, the research on non-stationary dueling bandits is still in its infancy. This is somewhat surprising since changes of preferences are not uncommon in applications of dueling bandits.

Contribution. In this paper, we contribute to the theoretical understanding of non-stationary dueling bandits, in which preferences change M 1 times at unknown time points. We give a formal problem statement for the piecewise-stationary dueling bandits problem in Section 2, with emphasis on stationary segments being separated by changepoints and the Condorcet winner as the best arm. As a first algorithm to tackle the problem, we propose Beat the Winner Reset (Bt WR) for weak regret and show expected regret bounds of O( K log K

2 ) in the stationary setting and O( KM

2 log(K + T)) in the non-stationary setting (Section 3) under the assumption that the suboptimality gap is known. Bt WR enjoys a tighter regret bound than other state-of-the-art algorithms for the stationary case, and does not need to know the time horizon T or the number of segments M for the non-stationary case. Next, we propose the Monitored Dueling Bandits (MDB) meta-algorithm for strong regret based on a detection-window approach, which is parameterized with a black-box dueling bandits algorithm for the stationary setting (Section 4). We bound

its expected regret by O K q

MT log T MK plus the sum of regret that the black-box algorithm would have

incurred when being run on its own for each (stationary) segment. In Section 5, we additionally present with DETECT an adaptation of MDB towards weak regret and prove its expected regret to be in O(KM log T) plus the sum of regret that the black-box algorithm would have incurred when being run on its own for each

Published in Transactions on Machine Learning Research (MM/YYYY)

segment. Our regret bounds and the quantities that are required to be known for each bound are summarized in Table 1. Further, we prove a worst-case lower bound of Ω(

KMT) for the expected weak regret of any algorithm for the problem (Section 6). In a comparison with state-of-the-art methods, we provide empirical evidence for our algorithm s superiority.

Table 1: Overview of theoretical regret bounds derived for our proposed algorithms. The second column denotes whether we consider the stationary or the non-stationary case. The last four columns indicate whether the quantity needs to be known for parametrization. Algorithm Stationary Regret type Regret bound T M δ / δ

Bt WR bin. weak O K log K

Bt WR bin. weak O KM log(K+T )

MDB bin. strong O K

MT log(δT/KM)

+ R S M(Alg) 1

DETECT2 bin. weak O MK log T

δ2 + M log T + R W M (Alg) 1

1 The algorithm itself does not require , but the underlying black-box dueling bandits algorithm might do. 2 With Bt W as the underlying black-box dueling bandits algorithm.

Related Work. There is a wealth of work on the non-stationary bandit problem with numerical rewards (Hartland et al., 2006; Kocsis & Szepesvári, 2006; Garivier & Moulines, 2011; Liu et al., 2018; Auer et al., 2019). See Lu et al. (2021) for a good overview of this branch of literature. From a methodological point of view, the work by Cao et al. (2019) is closest to our approaches MDB and DETECT. For the dueling bandits problem (Yue & Joachims, 2009), a variety of algorithms for strong regret based on the Condorcet winner has been proposed: Beat the Mean (Yue & Joachims, 2011), Interleaved Filter (Yue et al., 2012), SAVAGE (Urvoy et al., 2013), RUCB (Zoghi et al., 2014a), RCS (Zoghi et al., 2014b), RMED (Komiyama et al., 2015), Merge RUCB (Zoghi et al., 2015), Merge DTS (Li et al., 2020). Although weak regret has been proposed by Yue et al. (2012), Winner Stays (Chen & Frazier, 2017) and Beat the Winner (Peköz et al., 2020) are the only algorithms specifically tailored to this regret that we are aware of. Remarkably, their expected regret bounds are constant w.r.t. T. Bengs et al. (2021) give a survey of dueling bandits.

Non-stationary dueling bandits have been considered by Saha & Gupta (2022); Buening & Saha (2023); Suk & Agarwal (2023). All of them consider dynamic strong regret, which is a stricter notion of regret as the learner s selection are compared against a dynamic benchmark arm every time step. Saha & Gupta (2022) propose with Dex3.S an algorithm inspired by EXP3 (Auer et al., 2002b) from the adversarial setting that achieves a high probability strong regret bound of O(

KMT log KT/ϵ) with at least 1 2ϵ probability if the number of stationary segments M is known. Buening & Saha (2023) rely on this previous work in order to propose with ANACONDA an algorithm that adapts to the environment without requiring the number of segments. It achieves an expected strong regret of O(K

MT). Suk & Agarwal (2023) also consider strong regret by counting Condorcet winner changes and achieve with METASWIFT O(

KMT) strong regret but under the relatively strong assumption of stochastic transivity and the stochastic triangle inequality.

Our main focus is on static weak regret, where the benchmark arm may only change on certain time segments, and the learner does not suffer regret as long as the segment-wise benchmark arm occurs in the duel. We ask ourselves the question to what extent the regret bounds change for weak regret if non-stationarity is now present. In particular, what the dependence on the time horizon T will be. In contrast to strong regret, the notion of weak regret in dynamic environments is so far left unexplored.

Published in Transactions on Machine Learning Research (MM/YYYY)

2 Problem Statement

The piecewise-stationary dueling bandits problem involves a finite set of K arms A = {a1, . . . , a K}, a time horizon T N, and M 1 changepoints ν1, . . . , νM 1 N unknown to the learner with 1 < ν1 < . . . < νM 1 T dividing the entire learning time into M stationary segments. We additionally define dummy changepoints ν0 = 1 and νM = T + 1, allowing us to express the m-th stationary segment as the set Sm := {νm 1, . . . , νm 1} spanning between the (m 1)-th and the m-th changepoint exclusively. For each stationary segment Sm the environment is characterized by a preference matrix P (m) [0, 1]K K unknown to the learner, with each entry P (m) i,j denoting the probability that the arm ai wins against aj in a duel.

From here on we write p(m) i,j := P (m) i,j . To be well-defined, we assume that p(m) i,j + p(m) j,i = 1 for all ai, aj, i.e., the probability of a draw is zero. For each segment Sm, we assume the existence of the Condorcet winner am that beats all other arms with probability greater than half, i.e., p(m) m ,i > 1

2 for all ai A \ {am }, and refer to it as the optimal arm of the m-th stationary segment.

To provide expressive notation for our analyses, we define

the change of a pair (ai, aj) at the m-th changepoint νm as δ(m) i,j := p(m+1) i,j p(m) i,j ,

the m-th segmental change δ(m) := maxi,j|δ(m) i,j |, the minimal segmental change δ := minm δ(m), the m-th Condorcet Winner segmental change δ(m) := maxj|δ(m) m ,j|,

and the minimal Condorcet Winner segmental change δ := minj δ(m) .

Furthermore, we define

calibrated preference probabilities (m) i,j := p(m) i,j 1

suboptimality gaps (m) i := (m) m ,i for each segment Sm,

and the minimal suboptimality gap := min 1 m M,i =m (m) i over all segments.

In each time step t Sm, the learner plays a pair of arms (a It, a Jt) and thereupon obverses a binary random

variable X(t) It,Jt Ber p(m) It,Jt

. All X(t) i,j are mutually independent. We distinguish between two types of instantaneous regret that the learner suffers in each time t Sm:

Strong regret given by r S t := 1

2 (m) It + (m) Jt

Weak regret given by r W t := min n (m) It , (m) Jt

Their respective binary versions are defined by r S t := r S t and r W t := r W t . It is essential to involve the current preference matrix P (m) into the regret calculation since the regret of the same pair is free to change between segments. The cumulative regret Rv(T) w.r.t. the considered version of instantaneous regret rv t that the learner aims to minimize is given by

t Sm rv t .

Although we give upper bounds on expected cumulative binary strong and weak regret, they are also valid for the non-binary versions, because r S r S and r W r W, and vice versa for lower bounds. A list of symbols used frequently throughout the paper is given by Table 2. For the remainder of the paper, we use the terms piecewise-stationary and non-stationary interchangeably.

Published in Transactions on Machine Learning Research (MM/YYYY)

3 Beat the Winner Reset

The first piecewise-stationary dueling bandits algorithm that we propose is the Beat the Winner Reset algorithm (Bt WR) (see Algorithm 1). It is a variant of the Beat the Winner (Bt W) algorithm (Peköz et al., 2020) and to be applied for weak regret. It proceeds in explicit rounds playing the same pair (a I, a J) until a certain termination condition is fulfilled. Bt WR starts by drawing an incumbent arm a I uniformly at random, puts the other K 1 arms in random order into a FIFO queue Q, and initializes the round counter c to be 1. It proceeds by iterating through rounds until the time horizon is reached. In each round, the challenger arm a J is dequeued from Q and the variables w I and w J counting wins of the incumbent and challenger, respectively, are initialized. The pair (a I, a J) is played until one of them has reached ℓc many wins, called the winner of that round. The winner of a round becomes the incumbent a I for the next round and the loser is enqueued in Q, being played again K 1 rounds later. At last, c (counting the number of successive rounds with the same incumbent a I) is incremented by one if the incumbent a I has won the round. Otherwise, if the challenger a J has won, it is reset to 1. An illustration of Bt WR is given in Fig. 1.

Figure 1: Scheme of Bt WR: a1 is chosen as the first incumbent and a2 as the challenger. After a1 has won ℓ1 many duels against a2 in the first round, a2 is put back to the end of the queue and a3 becomes the new challenger for the second round. The pair (a1, a3) is then played until one arm wins ℓ2 many times. Since the challenger arm a3 has won the second round, it becomes the new incumbent and the counter c is set back to 1 for the third round.

Bt WR differs to its predecessor Bt W in the way it updates the round length. More specifically, Bt W increments it by one in each round, independent of whether the current incumbent has lost the round. Setting it back to one, allows Bt WR to react to changepoints by resetting its internal state back to the time of initialization (aside from the order of arms in Q), whenever the new optimal arm has won a round for the first time against the current incumbent. We show in the upcoming analyses how to set ℓc in order to bound Bt WR s cumulative expected weak regret.

Bt WR Stationary Regret Analysis. In the following we sketch Bt WR s regret analysis for the stationary case, i.e., M = 1 (all proofs are given in Appendix B). We set

4 2 log c(c + 1)e

for some λ (0, 1) that we specify later. Note that this requires to be known. We denote by τn the first time step of the algorithm s n-th round and define it as τ1 := 1 and τn := inf{t > τn 1 | Jt 1 = Jt} for n 2. Further, let cn be the value of c during the n-th round. First, we bound the probability of the optimal arm (w.l.o.g. a1) to lose one of the remaining rounds with probability λ as soon as it becomes the incumbent.

Published in Transactions on Machine Learning Research (MM/YYYY)

Algorithm 1 Beat the Winner Reset (Bt WR) Require: K N

1: Draw I uniformly at random from {1, . . . , K} 2: Q queue of all arms from A \ {a I} in random order 3: c 1

4: while time steps left do 5: a J top arm dequeued from Q 6: w I, w J 0 7: while w I < ℓc and w J < ℓc do 8: Play (a I, a J) and observe X(t) I,J

9: w I w I + I{X(t) I,J = 1}

10: w J w J + I{X(t) I,J = 0} 11: end while

12: if w I = ℓc then 13: Enqueue a J in Q 14: c c + 1 15: else 16: Enqueue a I in Q 17: I J

18: c 1 19: end if 20: end while

Lemma 1. Given that a1 is the incumbent of the n-th round, the probability of a1 losing at least one of the remaining rounds is at most λ.

Next, we give a bound on the expected number of time steps it takes for a1 to become the incumbent given that it finds itself in some round n not as the current incumbent.

Lemma 2. Fix an arbitrary round n with Iτ n = 1. Let n = inf{n > n | Iτn = 1} be the first round after n in which a1 is the incumbent. The expected number of time steps needed for a1 to become the incumbent is bounded by

E[τn τ n] 6K

λ(K + c n 1) .

Lemma 1 allows us to bound the expected number of times that a1 loses a round as the current incumbent, growing with increasing λ. On the other hand, Lemma 2 implies a bound on the expected regret caused each time a1 loses a round as the current incumbent. A small λ implies longer round lengths ℓc and hence Bt WR incurs more regret during rounds in which a1 is placed in the queue. We set λ = 1/e to strike a balance and derive the following result.

Theorem 1. For λ = 1/e, the expected cumulative binary weak regret of Bt WR in the stationary setting is bounded by

E h R W(T) i 20K log K

The resulting bound of O (K log K/ 2) improves on the bound of O K2/ 3 for the Winner Stays algorithm (Chen & Frazier, 2017), where := mini =j (1) i,j , as well as the O exp( 2)K/(1 exp( 2))2 bound for Bt W. Although is unknown in our setting, we assume its knowledge to prove theoretical results. This is commonly made also by others (see e.g. Theorem 3 for the ϵ-Greedy algorithm in Auer et al. (2002a)). For completeness, we like to mention that WS and Bt W do not require to be known, though their dependencies are worse.

Published in Transactions on Machine Learning Research (MM/YYYY)

Bt WR Non-Stationary Regret Analysis. For the non-stationary case let am denote the optimal arm of the m-th segment and the round length ℓc be as in equation 1. For the proofs of the following theoretical results see Appendix C. Note that now takes the suboptimality gaps of all segments into account and is thus potentially smaller than in the stationary case. We can show the following non-stationary counterpart of Lemma 1 for any segment.

Lemma 3. Consider an arbitrary segment Sm. Given that am is the incumbent of the n-th round, the probability of am losing at least one of the remaining rounds is at most λ.

Using Lemma 3 we can adapt Theorem 1 for the case where Bt WR enters a new segment with c being some number c 1, providing us with a bound on the incurred expected regret in that particular segment which can be viewed as an instance of the stationary setting. Again, we set λ = 1/e.

Lemma 4. For λ = 1/e the expected cumulative binary weak regret of Bt WR starting with c1 = c in the stationary setting, i.e., M = 1, is bounded by

E h R W(T) i 20K

2 log(K + c 1) .

Partitioning the regret over the whole time horizon into regret incurred for each segment allows us to apply Lemma 4 per segment. Together with the fact that c T holds, we derive the following main result.

Theorem 2. For λ = 1/e the expected cumulative binary weak regret of Bt WR in the non-stationary setting is bounded by

E h R W(T) i 22KM

2 log(K + T) .

As in the stationary setting, Bt WR does not require the time horizon T. Additionally, the regret bounds hold with Bt WR being oblivious about the number of stationary segments M. It is worth mentioning that, unlike the stationary case, the regret bound for the non-stationary case depends now on the time horizon T. This is attributable to the regret caused by the delay to have the current segment s optimal arm as the incumbent.

4 Monitored Dueling Bandits

Next, we present Monitored Dueling Bandits (MDB) (see Algorithm 2), our first algorithm using a detectionwindow approach to tackle dueling bandits with strong regret. Its core idea is to use the same changepoint detection mechanism as MUCB (Cao et al., 2019) by scanning for changepoints within a window of fixed length w for each pair of distinct arms, thus using K(K 1)/2 windows. Each detection window for a distinct pair, say (ai, aj), monitors the absolute difference of the absolute win frequencies of ai over aj of the first and the second half of the window. Once this absolute difference exceeds some threshold b, the changepoint detection is triggered leading to a reinitialization of MDB. In contrast to MUCB, MDB does not incorporate a specific dueling bandits algorithm with the task to minimize regret in stationary segments. Instead, it requires a black-box algorithm Alg for the (stationary) dueling bandits problem as an input and periodically alternates between choosing pairs selected by Alg and exploring the preference matrix in a round-robin manner in order to fill all detection windows with the dueling outcomes. The exploration rate by which the preference matrix is palpated for changepoints is given by a parameter γ. An illustration is shown in Fig. 2.

To be more specific, MDB requires for its parameterization the time horizon T, an even window length w, a threshold b > 0, an exploration rate γ (0, 1], and a dueling bandits algorithm Alg. In particular, we assume that MDB can interact with the black-box dueling algorithm Alg by (i) running it for one time step leading to a concrete choice of a pair of arms and (ii) forwarding the corresponding dueling outcome for that pair to Alg leading to an update of its internal state. This interaction is represented by the Run And Feed procedure in Algorithm 2. Moreover, we assume that MDB can reset Alg to its initial state. MDB starts by setting τ to zero, which represents the last time step in which a changepoint was detected, and initializes ni,j for each distinct pair (ai, aj), counting the number of times the pair has been played since the last detected changepoint. An arbitrary ordering π of all distinct pairs (ai, aj) is fixed, starting to count at the index zero. At the beginning of each time step, MDB uses the value of r to decide whether to choose a pair of arms

Published in Transactions on Machine Learning Research (MM/YYYY)

selected by Alg or to evenly explore the next pair in π. Its value is guaranteed to represent the index of a pair in π in a proportion of γ of all time steps, thus conducting exploration at the desired rate. In case Alg is run, MDB forwards the dueling outcome to it and continues with the next time step without inserting the observation into the corresponding detection window. Otherwise, if MDB performs a detection step, n It,Jt is incremented and the observed outcome X(t) It,Jt is inserted into the corresponding detection window by means of XIt,Jt,n It,Jt, denoting the n It,Jt-th dueling outcome between a It and a Jt after τ. The changepoint detection is triggered, if (a It, a Jt) has been chosen at least w times after τ (i.e., its detection window is completely filled) and the window s monitored absolute difference of absolute win frequencies exceeds the threshold b. This induces a reset of MDB by updating τ to the current time step, setting all ni,j to zero, and resetting Alg.

Figure 2: Scheme of MDB s detection steps (red) grouped in phases: the pairs (ai, aj) and (ai , aj ) are the first two of K(K 1)/2 many pairs in the ordering π and thus played first in the detection phases. The time steps between the phases are used to play pairs selected by black-box dueling bandits algorithm Alg. All windows are filled after w many phases.

Non-Stationary Regret Analysis. Due to the similarity of MDB in its design to MUCB, our analysis for MDB is inspired by Cao et al. (2019). Nevertheless, we adapt the analysis to the dueling bandits setting, simplify parts, merge out soft spots in their proofs and exploit room for improvement (see Appendix D for the details). A key quantity is the number of time steps L needed to fill all windows completely, i.e., each distinct pair of arms (ai, aj) has been chosen at least w times. Hence, we define

L := w K(K 1)

We denote by R S(t1, t2) the cumulative binary strong regret that Alg would have incurred if it was run on its own from t1 to t2. Next, we impose the following assumptions for technical reasons similar to MUCB:

Assumption 1: |Sm| 2L for all m {1, . . . , M}

Assumption 2: δ 2b

w + c for some c > 0

Assumption 3: γ h K(K 1)

Assumption (1) is similar to (1) in Section 3.2.1 for MUCB (Cao et al., 2019) and requires a minimal length for all stationary segments depending on L. Thus, it can be viewed as an implicit upper bound on w and lower bound for γ. It is essential and guarantees MDB enough time steps to fill all windows before the next changepoint, even after a delay of L/2. Assumption (2) is required to guarantee short delay with certain probability and the variable c will play a role in optimizing the parameters. The difficulty of detecting a changepoint with the detection-window approach increases with shrinking δ. The bound is necessary to allow

Published in Transactions on Machine Learning Research (MM/YYYY)

for a formal analysis, and is also to be found in Cao et al. (2019). Assumption (3) restricts the exploration rate γ is solely of technical nature to derive a simpler bound in Lemma 5. It is not necessary from a conceptual standpoint. Our derived choice for γ in Corollary 1 lies far away from the assumed bounds.

Algorithm 2 Monitored Dueling Bandits (MDB) Require: K, T, even w N, b > 0, γ (0, 1]

1: τ 0 2: ni,j 0 ai, aj A with i < j 3: Fix any ordering π of all pairs (ai, aj) A2 with i < j

4: for t = 1, . . . , T do 5: r (t τ 1) mod K(K 1)/2γ

6: if r < K(K 1)

2 then 7: (a It, a Jt) r-th pair in π

8: Play the pair (a It, a Jt) and observe X(t) It,Jt 9: n It,Jt n It,Jt + 1

10: XIt,Jt,n It,Jt X(t) It,Jt

11: if n It,Jt w and |Pn It,Jt w/2 s=n It,Jt w+1 XIt,Jt,s Pn It,Jt s=n It,Jt w/2+1 XIt,Jt,s| > b then 12: τ t 13: ni,j 0 ai, aj A with i < j 14: Reset Alg 15: end if 16: else

17: Run And Feed Alg 18: end if 19: end for

First, we bound MDB s expected binary strong regret in the stationary setting. Lemma 5. Let τ1 be the first detection time. The expected cumulative binary strong regret of MDB in the stationary setting, i.e., M = 1, is bounded by

E h R S(T) i P(τ1 T) T + 2γKT

K 1 + E h R S(T) i .

Next, we bound the probability that MDB wrongly triggers the changepoint detection in the stationary case and additionally the probability of MDB missing the changepoint by more than L/2 time steps in the non-stationary case with exactly one changepoint. Lemma 6. Let τ1 be the first detection time. The probability of MDB raising a false alarm in the stationary setting, i.e., M = 1, is bounded by

P(τ1 T) 2T exp 2b2

Lemma 7. Consider the non-stationary setting with M = 2. Then it holds

P(τ1 ν1 + L/2 | τ1 ν1) exp wc2

We bound MDB s expected regret by an induction over M. More precisely, we assume for each inductive step that MDB is started at time step νm 1 for a particular segment Sm. Then we use the bounds on the probabilities that MDB raises a false alarm in Sm (via p) and that it misses to trigger the changepoint detection by more than L/2 time steps after passing the next changepoint νm (via q), respectively. For sake of convenience, we denote Alg s expected cumulative binary strong regret incurred over all stationary segments by

R S M(Alg) =

m=1 E h R S(νm 1, νm 1) i .

Published in Transactions on Machine Learning Research (MM/YYYY)

Theorem 3. Let τm be the time step at which the changepoint detection is triggered for the first time with MDB being initialized at νm 1. Let p, q [0, 1] such that P(τm < νm) p for all m M and P(τm νm + L/2 | τm νm) q for all m M 1. Then the expected cumulative binary strong regret of MDB is bounded by

E h R S(T) i ML

K 1 + p + q + R S M(Alg) .

Finally, we set MDB s parameters in accordance with Assumption 2 and 3, which results in the following bound.

Corollary 1. Setting γ = (K 1) p

Mw/8T, b = p

w C/2, c = p

2C/w, and w to the smallest even integer 8C/δ2 with C = log(

2T (2T +1)/

MK), it holds that

E h R S(T) i = O

MT log(δT/KM)

+ R S M(Alg) .

MDB requires M and T to be known and is sensible to δ, as smaller changes of entries in the preference matrix between segments impede its ability to detect a changepoint. Nevertheless, if a suitable dueling bandits algorithm is used for Alg such as RUCB or RMED it has a nearly optimal strong regret bound w.r.t. T and M in light of the lower bound shown by Saha & Gupta (2022).

Figure 3: Scheme of DETECT: in the first running phase of Alg that spans the first T time steps only pairs selected by Alg are played. It is followed by a detection phase filling the K 1 many detection windows. The arms aj and a j are the first two of K 1 many arms in the ordering π. It ends with the raise of an alarm in time step τ and is followed by a new running phase of Alg. All windows are filled after w(K 1) time steps in the detection phase.

The second algorithm utilizing detection-windows that we propose is the Dueling Explore-Then-Exploit Changepoint Test (DETECT) algorithm (see Algorithm 3). It is a specialization of MDB (see Section 4) for weak regret. DETECT is given a black-box dueling bandits algorithm Alg for the stationary setting and alternates between two phases: one in which it is running Alg for T many time steps, and a detection phase, in which it scans for the next changepoint and resets Alg upon detection. We take advantage of the opportunity to choose all pairs that contain the best arm without incurring a regret penalty, thus leaving one arm free to be chosen for exploration in the detection phases. DETECT starts by running Alg for a fixed number of time steps T, where we assume that interaction with Alg can be done by means of a Run And Feed procedure as in MDB. In addition, we assume that Alg can provide at any time step its suspected Condorcet winner (CW) a I (to which most dueling bandit algorithms can be extended effortlessly). After running Alg for T time steps and obtaining Alg s suspected CW a I, DETECT initializes a detection phase in which pairs

Published in Transactions on Machine Learning Research (MM/YYYY)

of the form (a I, aj) are played with aj alternating between all K 1 arms other than a I. The binary dueling outcomes in these detection steps are inserted into K 1 windows, one for each pair (a I, aj) having length w. The detection phase ends as soon as a changepoint is detected, which follows the same principle as in MDB but using the K 1 detection windows for the pairs (a I, aj) instead. The end of the detection phase is followed by a new running phase of a reinitialized instance of Alg. This cycle continues until the time horizon is reached. If a I is indeed the optimal arm of the current segment, we suffer zero regret in the part of the detection phase that overlaps with the stationary segment in which it started. Thus, we simultaneously perform regret minimization and changepoint detection. An illustration of DETECT is given in Fig. 3.

Algorithm 3 Dueling Explore-Then-Exploit Changepoint Test (DETECT)

Require: K, T, T, even w N, b > 0

1: τ 0 2: for t = 1, . . . , T do

3: if t τ T then 4: Run And Feed Alg 5: else 6: if t τ = T + 1 then 7: I index of suspected Condorcet winner by Alg 8: n I,j 0 aj = a I A

9: Fix any ordering π of arms aj = a I A 10: end if 11: r t τ T 1 mod (K 1) 12: a Jt r-th arm in π

13: Play (a I, a Jt) and observe X(t) I,Jt 14: n I,Jt n I,Jt + 1

15: XI,Jt,n I,Jt X(t) I,Jt

16: if n I,Jt w and |Pn I,Jt w/2 s=n I,Jt w+1 XI,Js,s Pn I,Jt s=n I,Jt w/2+1 XI,Js,s| > b then 17: τ t 18: Reset Alg 19: end if 20: end if

21: end for

Non-Stationary Regret Analysis. Our theoretical analysis of DETECT s expected binary weak regret follows a similar approach to MDB (see Appendix E for the proofs). However, due to its different mechanism, we redefine the number of time steps needed to fill all of DETECT s K 1 detection windows completely by

L := w(K 1) .

Due to its focus on weak regret, we modified the definition of δ to δ in Section 2 as well, since only changes in the preference matrix between stationary segments that relate to the winning probabilities of the CW are relevant. Note that DETECT s mechanism attempts to account for this very insight by tracking changes only at the entries related to a I, the suspected CW by Alg. Additionally, we define p(m) T as the probability that a I, the suspected CW returned by Alg after T time steps, is the actual CW of segment m. Based on that, let p T minm p(m) T be a lower bound valid for all segments. Again, we impose some technical assumptions:

Assumption 1: |Sm| T + 3

2L for all m {1, . . . , M}

Assumption 2: δ 2b

w + c for some c > 0

Assumption (1) requires a minimum length for all stationary segments exceeding T to guarantee that DETECT is able to detect changepoints. Otherwise, the number of observed dueling outcomes during the exploitation phase might be too small to give sufficient guarantees for the validity of the detection mechanism.

Published in Transactions on Machine Learning Research (MM/YYYY)

Analogously to Assumption (2) for MDB (and also to be found for MUCB in Cao et al. (2019)), Assumption (2) restricts δ such that a changepoint can be detected after a short delay with a certain probability. To begin with, we bound DETECT s expected binary weak regret in the stationary setting, which unlike Lemma 5 now depends on whether Alg returns with a I the true CW. Lemma 8. Let τ1 be the first detection time and a I be the first suspected CW returned by Alg. Given T T, the expected cumulative binary weak regret of DETECT in the stationary setting, i.e., M = 1, is bounded by

E h R W(T) i E h R W( T) i + P(τ1 T a I = a1 ) (T T) .

The following Lemmas are adequate to Lemma 6 and 7 with only minor changes incorporating a I. Lemma 9. Let τ1 be as in Lemma 8. The probability of DETECT raising a false alarm in the stationary setting, i.e., M = 1, given that a I = a1 is bounded by

P (τ1 T | a I = a1 ) 2(T T) P (a I = a1 ) exp 2b2

Lemma 10. Consider the non-stationary setting with M = 2. Let τ1 be as in Lemma 8. Then, the following holds

P (τ1 ν1 + L /2 | τ1 ν1, a I = a1 ) exp wc2

The proof of the following result is an adaption of Theorem 3 for MDB. We slightly change the meaning of p and q: for a considered segment Sm, p is now a bound on the probability that DETECT raises a false alarm given that it started at time step νm 1 and Alg returned the true CW. Similarly, q bounds the probability that DETECT misses to trigger the changepoint detection by more than L /2 time steps after passing the next changepoint νm. Further, let

R W M(Alg) =

m=1 E h R W(νm 1, νm 1 + T 1) i

be Alg s expected cumulative weak regret incurred over all T-capped stationary segments {νm 1, . . . , νm 1+ T 1}. Theorem 4. Let τm be the time step at which the changepoint detection is triggered for the first time with DETECT being initialized at νm 1. Let p, q [0, 1] be such that P (τm < νm | a Im = am ) p for all m M and P (τm νm + L /2 | τm νm, a Im = am ) q for all m M 1. Then,

E h R W(T) i ML

2 + (1 p T + pp T + q)MT + R W M(Alg).

Finally, we set DETECT s parameters in accordance with the assumptions above which results in the following bound. Corollary 2. Setting b = 2w log T, c = p

8 log T/w, and w to the lowest even integer 32 log T/δ2 , the expected cumulative binary weak regret of DETECT is bounded by

E h R W(T) i = O KM log T

+ (1 p T )MT + R W M(Alg) .

Note that the presented bound is not necessarily linear in T, since we still have the freedom to set T, effecting the probability p T with which the used black-box algorithm Alg returns the Condorcet Winner. We analyze p T for Bt W and Winner-Stays in Appendix F, and derive a suitable choice of T in Corollary 6, which leads to a bound of E h R W(T) i = O (log T/δ2 MK + M log T) + R W M(Alg).

Similar to our remark in Section 3 about , we would like to point out that the knowledge of δ (and δ for MDB) is necessary in order to derive parameterizations for which we can prove theoretical results. Again, this is common practice in the multi-armed bandit literature.

Published in Transactions on Machine Learning Research (MM/YYYY)

6 Weak Regret Lower Bounds

The following result provides worst-case lower bounds for the weak regret for non-stationary and stationary cases complementing the lower bound for the strong regret by Saha & Gupta (2022).

Theorem 5. (Proof in Appendix G) For every algorithm exists an instance of

(i) the piecewise-stationary dueling bandits problem with T, K, and M fulfilling M(K 1) 9T such that the algorithm s expected weak regret is in Ω(

(ii) the stationary dueling bandits problem with T and K fulfilling K 1 9T such that the algorithm s expected weak regret is in Ω(

Note that the presented lower bounds do not form a contradiction with Theorem 1, 2, or 4, since they are problem-independent and do not take quantities specific to each problem instance like and δ into account, while in turn, these are included in the latter ones.

7 Empirical Results

In the following we evaluate Bt WR, MDB, and DETECT empirically on synthetic problem instances by comparing them to dueling bandits algorithms for the stationary setting w.r.t. their cumulative regret. Besides the regret in dependence of T and M, we are also interested in the shape of the regret curves for a specific scenario in order to give a glimpse of how the algorithms cope with the challenge of non-stationarity. We have generated for each chosen scenario 500 random problem instances, each consisting of a sequence of M randomly generated preference matrices. For each matrix P (m) a Condorcet winner am exists with one of its winning probabilities being exactly 1/2 + and the others being drawn uniformly at random from [1/2 + , 1]. All winning probabilities between pairs that do not include am are drawn uniformly at random from [0, 1]. The next matrix P (m+1) is manipulated such that at least one of am winning probabilities changes by δ. We have used = 0.1 and δ = 0.6 for most scenarios, but also investigate the algorithms regret curves depending on these quantities within a range of reasonable values. The Condorcet winner changes with each new preference matrix. The changepoints are distributed evenly across the time horizon, implying equal segment lengths in each scenario. This is done with the intention to obtain meaningful regret averages for a single parameter specification. For further empirical results see Fig. 8-13.

7.1 Weak Regret

We compare Bt WR and DETECT (parameterized as given in Corollary 2) with Winner-Stays (WS) (Chen & Frazier, 2017) as the incorporated black-box algorithm against Bt W in Fig. 4. Both, Bt WR and DETECT improve on its stationary competitor Bt W: not only is the cumulative regret lower in single scenarios, but they also show a desirable dependency on T. In addition with the linear growth w.r.t. M, this confirms or theoretical results in Theorem 2 and 4. Bt W experiences difficulties adapting to its changing environment, visible by its increasing regret jumps at each new changepoint, caused by the time it takes to adapt to the new preference matrix. In comparison, Bt WR and DETECT surmount the difficulty of non-stationarity without an increasing regret penalty. Further of interest are the regret curves depending on and δ . The cumulative regret of DETECT and especially of Bt WR shrinks with increasing gap . The latter observation confirms our result in Theorem 2. In contrast to that, Bt W seems to have, at least in the context of nonstationarity, constant performance across a broad range of values. Inspecting the dependency on δ , it is striking how DETECT s regret curve drops with increasing δ , which is in accordance with our result in Corollary 2, while Bt W and Bt WR show constant performance.

Published in Transactions on Machine Learning Research (MM/YYYY)

Figure 4: Cumulative binary weak regret averaged over 500 random problem instances with shaded areas showing the standard deviation across 10 groups: regret over time for fixed T (Top left), dependencies on T (Top right), K (Center left), M (Center right), (Bottom left), δ (Bottom right).

7.2 Strong Regret

We compare MDB (parameterized as given in Corollary 1) with Winner Stays Strong (WSS) (Chen & Frazier, 2017) as the incorporated black-box algorithm against WSS itself and DEX3.S (Saha & Gupta, 2022) in Fig. 5. Both instances of WSS have exploitation parameter β = 1.05. MDB accumulates less regret than WSS and DEX3.S across single scenarios with similar sublinear dependencies on M and T. Unlike MDB and DEX3.S, the considered stationary algorithm WSS shows increasing regret jumps, indicating its maladjustment for the non-stationary setting. Further, we observe how MDB s cumulative regret decreases for larger δ , whereas the drop in regret depending on is rather negligible.

7.3 Sensitivity Analysis for Bt WR

Our satisfying regret bound in Theorem 2 and empirical findings in Section 7.1, which demonstrate Bt WR s superiority in the non-stationary setting over its competitors, rely on the assumption that the gap of the particular problem instance is known. But since this is not guaranteed in practice, one might ask by how much Bt WR s performance detoriates if the unknown is only estimated. In order to answer this question empirically, we conducted an evaluation of Bt WR with the same setup as described above. With the only difference being that we have run multiple instantiations of Bt WR, parameterized with different estimates of , on multiple problem classes that differ in their true gap . We consider for both the true gap and its estimate values ranging from 0.025 to 0.25, which we find reasonable, as values outside our chosen interval are unlikely to bet met in practice.

Published in Transactions on Machine Learning Research (MM/YYYY)

Figure 5: Cumulative binary strong regret averaged over 500 random problem instances with shaded areas showing the standard deviation across 10 groups: regret over time for fixed T (Top left), dependencies on T (Top right), K (Center left), M (Center right), (Bottom left), δ (Bottom right).

The performance of different Bt WR estimates, measured in cumulative regret, depending on the true is depicted in Fig. 6. The estimates = 0.05 and = 0.075 outperform the parameter-free Bt W algorithm significantly across all values in the considered range. Further, if one possesses additional domain knowledge related to the problem instance at hand, stating that is not lower than 0.1, then Bt WR parametrized with between 0.1 and 0.2 achieves even lower regret numbers.

Next, we would like to present our results from a slightly different perspective. Fig. 7 shows the cumulative regret achieved by Bt WR for a particular true gap depending on the chosen estimate Bt WR . Again, we observe that the estimates = 0.05 and = 0.075 perform well across the whole considered range. Interestingly, the lower the estimate is chosen to be, the less relevant the actual gap becomes.

Further, one can notice how each graph for a particular gap reaches its minimum at a point, at which Bt WR slightly overestimates the true gap. The attentive and sharp-eyed reader has already observed this phenomenon in Fig. 6. We are confident to assert that this does not occur just by chance, but has a theoretical background that we are able to unveil. Our main result in Theorem 2 and its proof in Appendix C allow us to draw the conclusion that parameterizing Bt WR with an estimate that underestimates , i.e. , leads to a regret bound of

E h R W(T) i 22KM

2 log(K + T).

Published in Transactions on Machine Learning Research (MM/YYYY)

in the non-stationary setting. This bound grows the further underestimates . From this observation we dare to infer, although we cannot prove it, that the cumulative regret should shrink if slightly overestimates , since then the denominator in the bound grows.

Figure 6: Cumulative binary weak regret in dependence of averaged over 100 random problem instances with shaded areas showing the standard deviation across 5 groups. Except for Bt W, each graph shows the performance of Bt WR initialized with an estimate of .

Figure 7: Cumulative binary weak regret averaged over 100 random problem instances with shaded areas showing the standard deviation across 5 groups. Each graph shows the performance of Bt WR initialized with different estimates depending on a fixed of the problem instance.

8 Conclusion

We considered the piecewise-stationary dueling bandits problem with M segments, for which we proposed and analyzed three actively adaptive algorithms. For Bt WR, we have shown satisfactory regret bounds for weak regret in the non-stationary and stationary setting. It stands out from previous non-stationary algorithms by not requiring the time horizon T or M to be given, which allows it to be applied when these quantities are not known upfront. With MDB and DETECT, we have presented two meta-algorithms based on detection-windows and incorporating black-box dueling bandit algorithms. Their regret bounds can be interpreted as the sum of the black-box algorithm s regret if it would know the changepoints plus a penalty of non-stationarity. In addition, we have proven worst-case lower bounds on the expected weak regret. Our empirical results clearly suggest that all three algorithms outperform standard dueling bandits algorithms in the non-stationary setting.

Published in Transactions on Machine Learning Research (MM/YYYY)

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on Computing, 32(1):48 77, 2002b.

Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of Annual Conference on Learning Theory (COLT), pp. 138 158, 2019.

Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference-based online learning with dueling bandits: A survey. Journal of Machine Learning Research, 22:7:1 7:108, 2021.

Thomas Kleine Buening and Aadirupa Saha. ANACONDA: an improved dynamic regret algorithm for adaptive non-stationary dueling bandits. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 3854 3878, 2023.

Yang Cao, Zheng Wen, Branislav Kveton, and Yao Xie. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 418 427, 2019.

Bangrui Chen and Peter I. Frazier. Dueling bandits with weak regret. In Proceedings of International Conference on Machine Learning (ICML), pp. 731 739, 2017.

Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), pp. 174 188, 2011.

Cédric Hartland, Sylvain Gelly, Nicolas Baskiotis, Olivier Teytaud, and Michèle Sebag. Multi-armed bandit, dynamic environments and meta-bandits. 2006.

Katja Hofmann, Anne Schuth, Shimon Whiteson, and Maarten de Rijke. Reusing historical interaction data for faster online learning to rank for IR. In Proceedings of ACM International Conference on Web Search and Data Mining (WSDM), pp. 183 192, 2013.

Levente Kocsis and Csaba Szepesvári. Discounted UCB. In 2nd PASCAL Challenges Workshop, 2006.

Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of Annual Conference on Learning Theory (COLT), pp. 1141 1154, 2015.

Chang Li, Ilya Markov, Maarten De Rijke, and Masrour Zoghi. Merge DTS: A Method for Effective Largescale Online Ranker Evaluation. ACM Transactions on Information Systems (TOIS), (4):1 28, 2020.

Fang Liu, Joohyun Lee, and Ness B. Shroff. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI), pp. 3651 3658, 2018.

Shiyin Lu, Yao Hu, and Lijun Zhang. Stochastic bandits with graph feedback in non-stationary environments. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI), pp. 8758 8766, 2021.

Erol Peköz, Sheldon M. Ross, and Zhengyu Zhang. Dueling bandit problems. Probability in the Engineering and Informational Sciences, pp. 1 12, 2020.

Terry F Pettijohn, Greg M Williams, and Tiffany C Carter. Music for the seasons: seasonal music preferences in college students. Current psychology, 29(4):328 345, 2010.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527 535, 1952.

Published in Transactions on Machine Learning Research (MM/YYYY)

Aadirupa Saha and Shubham Gupta. Optimal and efficient dynamic regret algorithms for non-stationary dueling bandits. In Proceedings of International Conference on Machine Learning (ICML), pp. 19027 19049, 2022.

Yanan Sui and Joel W. Burdick. Clinical online recommendation with subgroup rank feedback. In Proceedings of ACM Conference on Recommender Systems (Rec Sys), pp. 289 292, 2014.

Joe Suk and Arpit Agarwal. When can we track significant preference shifts in dueling bandits? In Proceedings of Advances in Neural Information Processing Systems (Neur IPS), pp. 38347 38369, 2023.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933.

Tanguy Urvoy, Fabrice Clérot, Raphaël Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In Proceedings of International Conference on Machine Learning (ICML), pp. 91 99, 2013.

Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of Annual Symposium on Foundations of Computer Science (SFCS), pp. 222 227, 1977.

Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of International Conference on Machine Learning (ICML), pp. 1201 1208, 2009.

Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of International Conference on Machine Learning, (ICML), pp. 241 248, 2011.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 1556, 2012.

Masrour Zoghi, Shimon Whiteson, Rémi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of International Conference on Machine Learning (ICML), pp. 10 18, 2014a.

Masrour Zoghi, Shimon A. Whiteson, Maarten de Rijke, and Remi Munos. Relative Confidence Sampling for Efficient On-line Ranker Evaluation. In Proceedings of ACM International Conference on Web Search and Data Mining (WSDM), pp. 73 82, 2014b.

Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. Mergerucb: A method for large-scale online ranker evaluation. In Proceedings of ACM International Conference on Web Search and Data Mining (WSDM), pp. 17 26, 2015.

Masrour Zoghi, Tomás Tunys, Lihong Li, Damien Jose, Junyan Chen, Chun Ming Chin, and Maarten de Rijke. Click-based hot fixes for underperforming torso queries. In Proceedings of International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 195 204, 2016.

Published in Transactions on Machine Learning Research (MM/YYYY)

A List of Symbols

Table 2: List of frequently symbols used throughout the paper. Problem setting of piecewise-stationary dueling bandits A set of arms K number of arms ai i-th arm T time horizon M number of stationary segments νm m-th changepoint Sm m-th stationary segment P (m) preference matrix of the m-th stationary segment p(m) i,j probability that arm ai is preferred over aj in the m-th segment X(t) It,Jt binary dueling outcome between a It and a Jt am Condorcet winner of the m-th segment (m) i,j calibrated pereference probability of ai over aj in the m-th segment (m) i suboptimality gap of ai in the m-th segment minimal suboptimality gap δ(m) i,j change between ai and aj at the m-th changepoint δ(m) m-th segmental change δ minimal segmental change δ(m) segmental change w.r.t. CW δ minimal segmental change w.r.t. CW

Regret measures r S t instantaneous strong regret r W t instantaneous weak regret r S t instantaneous binary strong regret r W t instantaneous binary weak regret Rv(T) cumulative regret for regret type v Rv(t1, t2) cumulative regret of type v of Alg run on its own from t1 to t2 Rv(T) cumulative regret of type v of Alg run on its own Rv M(Alg) expected cumulative regret of type v of Alg over all stationary segments

Algorithm-specific symbols ℓc number of required wins to end a round of Bt WR with round counter c λ probability of CW to lose a remaining round of Bt WR as current incumbent τn first time step of Bt WR s n-th round Alg utilized black-box algorithm by MDB and DETECT γ exploration rate of MDB w window length of MDB and DETECT b threshold of MDB and DETECT L number of time steps to fill all of MDB s detection windows L number of time steps to fill all of DETECT s detection windows T phase length in which DETECT runs Alg p(m) T probability that DETECT s suspected CW of the m-th segment is correct

Published in Transactions on Machine Learning Research (MM/YYYY)

B Bt WR Stationary Weak Regret Analysis

Let Ln be the index of the arm that lost the n-th round and cn the value of c during the n-th round. Finally, let N = sup{n N | τn T} be the last round to be started.

Lemma 1 Choosing ℓc = l 1 4 2 log c(c+1)e

2 m for any λ > 0 and given that a1 is the incumbent of the n-th round with cn = 1, the probability of a1 losing at least one of the remaining rounds is at most

n =n {Ln = 1} Iτn = 1, cn = 1

Proof. The n -th round with round counter cn can be seen as an experiment in which 2ℓcn 1 coins are thrown i.i.d. with the probability of head (representing a Iτn to win a duel against a Jτn ) being p Iτn ,Jτn . The round is lost by a Iτn if at most ℓcn heads are thrown. Define the binomially distributed random variables Bcn Bin(2ℓcn 1, p1,Jτn ) and B cn Bin(2ℓcn 1, 1

2 + ). We derive:

P(Ln = 1 | Iτn = 1) = P(Bcn ℓcn 1)

P(B cn ℓcn 1)

2(2ℓcn 1) 1

exp 2 2(1 2ℓcn ) ,

where we utilized Hoeffding s inequality and the fact that 1

2. Further, we derive:

n =n {Ln = 1} Iτn = 1, cn = 1

Ln = 1 {Iτn = 1, cn = 1}

i=n {Li = 1}

n =n {Ln = 1 | Iτn = 1, cn = n n + 1}

n =n P(Ln = 1 | Iτn = 1, cn = n n + 1)

n =1 e2 2(1 2ℓn )

where we used the independence of events for the second equality and the previously shown bound on P(Ln = 1 | Iτn = 1) for the second inequality.

Published in Transactions on Machine Learning Research (MM/YYYY)

Lemma 2 Consider the stationary setting and let n be an arbitrary round with Iτ n = 1. Let n = inf{n >

n | Iτn = 1} be the first round after n in which a1 is the incumbent. Choosing ℓc = l 1 4 2 log c(c+1)e

2 m for any λ > 0, the expected number of time steps needed for a1 to become the incumbent is bounded by

E[τn τ n] 6K

λ(K + c n 1).

Proof. Let n1 = inf{n n | Jτn = 1} be the first round from n onward in which a1 is the current challenger and ni = inf{n > ni 1 | Jτn = 1} be the i-th round for i 2. Let U be a random variable denoting the number of times a1 has to become challenger before winning a round, i.e., n U = n 1. Since the queue contains K 2 many arms, it takes at most K 2 rounds for a1, potentially sitting in last place of the queue, to become the challenger and one more round to get to the end of the queue or to get promoted to the incumbent. Consequently, we have n1 n K 2, and ni ni 1 K 1 for all i {2, . . . , U}. Hence, we obtain:

n n = (n1 n) + (n n U) +

i=2 ni ni 1 (K 1)U.

For the number of time steps depending on U we further conclude:

c=1 2ℓc n+c 1 1

1 2 2 log (c n + c 1)(c n + c)e

2 2 log (c n + (K 1)U 1)(c n + (K 1)U)e

λ(c n + KU 1)

λ(K + c n 1).

For all rounds i holds that P(Lni = 1 | Jτni = 1) 1

2 since the probability of a1 winning a duel against any other arm is at least 1

2 + . Thus, we can use the geometrically distributed random variable V with parameter p = 1

2 to bound the expected value of U 2 by E[U 2] E[V 2]. Finally, we derive the stated claim:

E[τn τ n | τn νm] E KU 2

λ(K + c n 1)

λ(K + c n 1)

λ(K + c n 1)

λ(K + c n 1),

where we used the fact that E[V 2] = 2 p

Theorem 1 Setting λ = 1

e and thus choosing ℓc = 1 4 2 log c(c + 1)e2 1

2 , the expected cumulative binary weak regret of Bt WR in the stationary setting is bounded by

E h R W(T) i 20K log K

Published in Transactions on Machine Learning Research (MM/YYYY)

Proof. Let n1 = inf{n N | Iτn = 1} be the first round in which a1 is the incumbent. Further define inductively ni = inf{n > mi 1 | Iτn = 1} for all i {2, . . . , N } and mi = inf{n > ni | Iτn = 1} for all i {1, . . . , M } with N being the number of times a1 becomes the incumbent (including the possibility of starting as such in the first round) and M the number of times a1 loses a round as the current incumbent. The sequence of rounds can be partitioned in alternating subsequences: the ones that have a1 as their current incumbent and those that not. This allows us to reformulate the incurred regret as:

R(τ1, τn1 1) + R(τn N , T)

i=1 R(τmi, τni+1 1) + N 1 P

i=1 R(τni, τmi 1) if M = N 1

R(τ1, τn1 1) + R(τm M , T)

i=1 R(τmi, τni+1 1) + N 1 P

i=1 R(τni, τmi 1) if M = N

τn1 τ1 + M P

i=1 τni+1 τmi if M = N 1

τn1 τ1 + T + 1 τm M + M 1 P

i=1 τni+1 τmi if M = N ,

where we used the fact that Bt WR suffers no regret in sequences of rounds with a1 being the current incumbent. By applying Lemma 2 with c n = 1 due to cmi = 1 for the sequences of rounds not having a1 as its incumbent, we obtain for the expected regret:

E h R W(T) i = EM h E h R W(T) | M ii

E[τn1 τ1] +

i=1 E[τni+1 τmi],

E[τn1 τ1] + E[T + 1 τm M ] +

i=1 E[τni+1 τmi]

λK E[M + 1].

Note that c1 = 1 and cmi = 1 holds for all i {1, . . . , M } because the incumbent Iτmi 1 = 1 has lost the previous round. Finally, we can bound the expectation of M by E[M ] E[X] where X is a discrete random variable with P(X = x) = (1 λ) λx for all x N0. This is justified by Lemma 1, guaranteeing that the probability of a1 losing a round once it is incumbent is at most λ. Let Y be a random variable with Y = X + 1, then it holds Y Geo(1 λ). Since E[Y ] = 1 1 λ, we have E[M + 1] 1 1 λ. We conclude

the result by assuming K 3, otherwise E[R W(T)] = 0 since no pair of arms would cause any regret, and plugging in λ = 1

E h R W(T) i 6K

λK E[M + 1]

= 6K (1 λ) 2 log r e

Published in Transactions on Machine Learning Research (MM/YYYY)

C Bt WR Regret Analysis Non-stationary Setting

As before, let Ln be the index of the arm that lost the n-th round and cn the value of c during the n-th round.

Lemma 3 Consider an arbitrary segment Sm. Let N = sup{n | {τn, τn+1 1} Sm} be the last round to

be started and finished within Sm. Choosing ℓc = l 1 4 2 log c(c+1)e

2 m for any λ > 0 and given that am is the incumbent of the n-th round with cn = 1, the probability of am losing at least one of the remaining rounds is at most

n =n {Ln = 1} Iτn = 1, cn = 1

Proof. In the case, where {n | {τn, τn+1 1} Sm} = the statement is trivial. Otherwise, one follows the lines of the proof of Lemma 1.

Lemma 4 Setting λ = 1

e and thus choosing ℓc = 1 4 2 log c(c + 1)e2 1

2 , the expected cumulative binary weak regret of Bt WR starting with c1 = c in the stationary setting is bounded by

E h R W(T) i 20K

2 log(K + c 1).

Proof. Analogous to the proof of Theorem 1 using Lemma 3 instead of Lemma 1.

Theorem 2 Choosing ℓc = 1 4 2 log c(c + 1)e2 1

2 , the expected cumulative binary weak regret of Bt WR in the non-stationary setting is bounded by:

E h R W(T) i 22KM

2 log(K + T).

Proof. First, we decompose the expected regret over the stationary segments:

E h R W(T) i = E h R W(1, ν1) i +

m=2 E h R W(νm 1, νm 1) i .

For each m 2 let Am = {{n N | τn Sm} = } be the event that there exists a round starting in Sm and m1 = inf{n N | τn Sm} be the first round starting in Sm given Am. Let m0 = sup{n N | τn < νm 1} be the last round that started before Sm, on the event of Am we have m0 = m1 1, but not necessarily τm0 Sm 1. The term in the sum can be further decomposed for all m 2:

E h R W(νm 1, νm 1) i

E h R W(νm 1, νm 1)I{Am} i + E h R W(νm 1, νm 1)I{AC m} i

= E h R W(νm 1, τm1 1)I{Am} i + E h R W(τm1, νm 1)I{Am} i

+ E h R W(νm 1, νm 1)I{AC m} i

4ℓcm0 4 + E h R W(τm1, νm 1)I{Am} i

1 2 log cm0(cm0 + 1)e2 2 + 20K

2 log(cm1 + K 1)

2 log(cm0 + K),

Published in Transactions on Machine Learning Research (MM/YYYY)

where we used Lemma 4 in the third inequality because E h R W(τm1, νm 1)I{Am} i can be seen as the expected regret in a stationary setting with Bt WR having cm0 as the round counter for its first round. Applying Theorem 1 for the first segment which can also be seen as an instance of the stationary setting and the obvious fact that cn T for all rounds n, we obtain:

E h R W(T) i = E h R W(1, ν1) i +

m=2 E h R W(νm 1, νm 1) i

2 log(cm0 + K)

2 log(K + T).

D MDB Regret Analysis

In the following we will utilize the Mc Diarmid s inequality which can be seen as a concentration inequality on a function g of n independent random variables.

Lemma 11. (Mc Diarmid s inequality) Let X1, . . . , Xn be independent random variables all taking values in the set X and c1, . . . , cn R. Further, let g : X n 7 R be a function that satisfies |g(x1, . . . , xi, . . . , xn) g(x1, . . . , x i, . . . , xn)| ci for all i and x1, . . . , xn, x i X. Then for all ϵ > 0 holds

P(g(x1, . . . , xn) E[g(x1, . . . , xn)] ϵ) exp 2ϵ2

Pn i=1 c2 i

P(g(x1, . . . , xn) E[g(x1, . . . , xn)] ϵ) exp 2ϵ2

Pn i=1 c2 i

Lemma 12. (similar to Cao et al. (2019) but errors corrected) Consider a detection window with even window length w that is filled with i.i.d. bits X1, . . . , Xw Ber(p),

for fixed p [0, 1]. Let A = w/2 P

i=1 Xi be the sum of bits in the older window half and B = w P

i=w/2+1 Xi the sum

in the newer half. Then for any b > 0 holds

P(|A B| > b) 2 exp 2b2

Proof. Obviously, A and B have expected values E[A] = E[B] = wp

2 and thus we derive:

P(|A B| > b) = P(A B > b) + P(B A > b)

= P(A B E[A B] > b) + P(B A E[B A] > b)

where we used Mc Diarmid s inequality (Lemma 11). The denominator results from the fact, that a single change of one random variable Xi can cause a maximum absolute difference of 1 in A or B. Thus, |A B| can differ no more than 1.

Lemma 13. (similar to Cao et al. (2019) but errors corrected) Consider a detection window with even window length w that is filled with independent bits X1, . . . , Xw/2

Ber(p) and Xw/2+1, . . . , Xw Ber(p + θ) with fixed p [0, 1] and θ [ p, 1 p]. Let A = w/2 P

i=1 Xi be the sum

Published in Transactions on Machine Learning Research (MM/YYYY)

of bits in the older window half and B = w P

i=w/2+1 Xi the sum in the newer half. Then for any b > 0 such

that some c > 0 exists with |θ| 2b

w + c, holds

P(|A B| > b) 1 exp wc2

Proof. The expected values of A and B are now given by E[A] = wp

2 and E[B] = w(p+θ)

2 and thus E[A B] = wθ

2 and E[B A] = wθ

2 . Consider the following case distinction for θ:

If θ 0 then due to |θ| 2b

w + c: E[B A] b + cw

2 , and thus:

P(|A B| > b) = P(A B > b) + P(B A > b)

= 1 P(B A b)

= 1 P(B A E[B A] b E[B A])

1 exp 2(E[B A] b)2

where we used Mc Diarmid s inequality (Lemma 11) for the second inequality. The necessary condition b E[B A] < 0 is implied by the existence of a c > 0 with |θ| 2b

w + c, assumed in the beginning. We justify the denominator with the same reasoning as in Lemma 12. Otherwise, if θ < 0 then due to |θ| 2b

w +c: E[A B] b + cw

2 , and thus:

P(|A B| > b) = P(A B > b) + P(B A > b)

= 1 P(A B b)

= 1 P(A B E[A B] b E[A B])

1 exp 2(E[A B] b)2

where we used Mc Diarmid s inequality (Lemma 11) for the third inequality. The necessary condition b E[A B] < 0 is implied by the existence of a c > 0 with |θ| 2b

w + c, assumed in the beginning. We justify the denominator with the same reasoning as in Lemma 12.

Lemma 14. The cumulative regret w.r.t. to any regret measure rv during the m-th stationary segment can be rewritten as

Rv(νm 1, νm 1) = X

i,j Ni,j(νm 1, νm 1) rv(ai, aj),

where Ni,j(t1, t2) = t2 P

t=t1 I{It = i, Jt = j} is the number of times an algorithm chooses to duel the arms ai

and aj within the time period {t1, . . . , t2}.

Published in Transactions on Machine Learning Research (MM/YYYY)

Rv(νm 1, νm 1) =

t=νm 1 rv (a It, a Jt)

i,j I{It = i, Jt = j} rv (ai, aj)

t=νm 1 I{It = i, Jt = j}

rv (ai, aj)

i,j Ni,j(νm 1, νm 1) rv (ai, aj) .

We impose the following assumptions on the problem statement and the parameters:

Assumption (1): |Sm| 2L for all m {1, . . . , M}

Assumption (2): δ 2b

w + c for some c > 0

Assumption (3): γ h K(K 1)

2 i and K 3

Lemma 5 Consider a scenario with M = 1 and let τ1 be the first detection time. Then the expected cumulative binary strong regret of MDB is bounded by

E h R S(T) i P(τ1 T) T + 2γKT

K 1 T + E h R S(T) i .

Proof. Let π(i, j) be the position of the pair (ai, aj) in the ordering π for i < j, starting to count at 0. In order to be well-defined for all i, j, let π(i, j) = 1 for all i j and define the event Ai,j,t := {(t τ 1) mod K(K 1)/2γ = π(i, j)} for all i, j, t. Given τ1 > T, we have τ = 0 guaranteed for the whole runtime of the algorithm and the number of times a pair (ai, aj) with i < j is played can be bounded by:

Ni,j(1, T) =

t=1 I{It = i, Jt = j}

t=1 I{Ai,j,t} + I{Ai,j,t, It = i, Jt = j}

t=1 I{Ai,j,t, It = i, Jt = j}

2T j K(K 1)

t=1 I{Ai,j,t, It = i, Jt = j}

4γT K(K 1) (K 1) +

t=1 I{Ai,j,t, It = i, Jt = j}

= 4γT (K 1)2 +

t=1 I{Ai,j,t, It = i, Jt = j},

Published in Transactions on Machine Learning Research (MM/YYYY)

where make use of Assumption (3) in the third and fourth inequality. Whereas for i j the event Ai,j,t will never occur such that

Ni,j(1, T) =

t=1 I{Ai,j,t, It = i, Jt = j}.

From that we derive a bound for E h R S(T) | τ1 > T i :

E h R S(T) | τ1 > T i

i<j Ni,j(1, T)

& (1) i + (1) j 2

i j Ni,j(1, T)

& (1) i + (1) j 2

t=1 I{Ai,j,t, It = i, Jt = j}

& (1) i + (1) j 2

K 1 + E h R S(T) i ,

where we use Lemma 14 in both equations. Thus, we obtain the desired bound:

E h R S(T) i = E h R S(T) | τ1 T i P(τ1 T) + E h R S(T) | τ1 > T i P(τ1 > T)

P(τ1 T) T + E h R S(T) | τ1 > T i

P(τ1 T) T + 2γKT

K 1 + E h R S(T) i .

Lemma 6 Consider a scenario with M = 1 and let τ1 be the first detection time. The probability of MDB raising a false alarm is bounded by

P(τ1 T) 2T exp 2b2

Proof. Let n It,Jt,t be the value of n It,Jt after its update in Line 9 at time step t, At :=

n It,Jt,t w/2 P

s=n It,Jt,t w+1 XIt,Jt,s,

n It,Jt,t P

s=n It,Jt,t w/2+1 XIt,Jt,s. Moreover, denote by r(t) the value of r in time step t. We derive:

P(τ1 T) = X

t : r(t)< K(K 1)

2 ,t T P(τ1 = t)

t : r(t)< K(K 1)

2 ,t T,n It,Jt,t w P(|At Bt| > b)

t=1 2 exp 2b2

= 2T exp 2b2

In the second inequality we used Lemma 12 with the observations of the duels being the bits filling the detection window. The requirement that all samples are drawn from the same Bernoulli distribution is

Published in Transactions on Machine Learning Research (MM/YYYY)

met since there is no changepoint and thus all samples from a pair (ai, aj) are drawn from a Bernoulli distribution with parameter p(1) i,j . Note that r(t), It, and Jt are not random variables because they are calculated deterministically.

Lemma 7 Consider a scenario with M = 2. Assume that ν1 + L/2 ν2 and the existence of arms ai and aj with i < j such that |δ(1) i,j | 2b

w + c for some c > 0. Then it holds

P(τ1 ν1 + L/2 | τ1 ν1) exp wc2

Proof. Assume that τ1 is large enough such that the pair (ai, aj) is played at least w/2 times from ν1 onward during detection steps. In that case, let t be the time step in which (ai, aj) is played for the w/2-th time from ν1 onward during the detection steps. Then it holds t < ν1 + L/2. As a consequence, we obtain τ1 < ν1 + L/2 if we deny the assumption. Let ni,j,t be the value of ni,j after its update in Line 9 at time

step t, Ai,j,t :=

ni,j,t w/2 P

s=ni,j,t w+1 Xi,j,s, and Bt :=

s=ni,j,t w/2+1 Xi,j,s. Since |Ai,j,t Bi,j,t| > b triggers the

changepoint-detection in time step t, it implies τ1 < ν1 + L/2 given τ1 ν1. Thus, we obtain

P(τ1 < ν1 + L/2 | τ1 ν1) P(|Ai,j,t Bi,j,t| > b),

giving us the opportunity to apply Lemma 13 with p = p(1) i,j and θ = δ(1) i,j to conclude:

P(τ1 ν1 + L/2 | τ1 ν1) = 1 P(τ1 < ν1 + L/2 | τ1 ν1)

1 P(|Ai,j,t Bi,j,t| > b)

The required distance between ν1 and ν2 of at least L/2 time steps is implied by Assumption (1), whereas the resulting condition on δ(1) i,j in order to apply Lemma 13 is given by Assumption (2).

Before stating our main result in Theorem 3, we introduce notation specifically tailored to the upcoming proof. Let Em h R S(t1, t2) i be the expected cumulative binary strong regret from time steps t1 to t2 inclusively

of MDB started at νm 1. Note that E1 h R S(ν0, T) i = E h R S(T) i . In order to capture false alarms and delay, we further define τm to be the time step of the first triggered changepoint-detection of MDB started at νm 1. Thus, we can express a false alarm in the m-th segment raised by MDB started at νm 1 as the event {τm < νm}. A large delay of L/2 or more time steps for νm in the absence of a false alarm is then given by {τm νm + L/2}.

Theorem 3 Let p, q [0, 1] with P(τm < νm) p for all m M and P(τm νm + L/2 | τm νm) q for all m M 1. Then the expected cumulative binary strong regret of MDB is bounded by

E h R S(T) i ML

K 1 + p + q +

m=1 E h R S(νm 1, νm 1) i .

In conjunction with the previous explanation on how to formalize false alarms and large delays, p represents an upper bound for the probability of MDB raising a false alarm in the m-th segment given MDB is started at νm 1 for each m. Similarly, q bounds the probability of MDB having a delay of at least L/2 for the

Published in Transactions on Machine Learning Research (MM/YYYY)

detection of νm for each m. Our proof is basically an adaption of its equivalent for MUCB in Cao et al. (2019) with some refinements made to quantify and lower the impact of p and q on the regret bound. In order to highlight these and provide the reader with a premature understanding of its concept, we first explain the structure used in Cao et al. (2019) and explain our improvements based on that. We can categorize the set of sample paths that MDB takes in good and bad ones. The good paths are the ones in which MDB raises no false alarms and has no large detection delay for any changepoint, meaning a delay of at least L/2. Then the bad paths are characterized by MDB raising at least one false alarm or having at least for one changepoint a large delay. We can divide each path into M pieces, with each piece corresponding to a stationary segment, and utilize this concept by recursively analyzing MDB in bottom-up fashion from the last segment SM to the first S1. This is done by an induction over M. For each segment Sm we distinguish whether the path taken by MDB is at this point still good or becomes bad. In case it is good, we can apply Lemma 5 to bound the regret suffered in Sm as a stationary setting and add the regret incurred in the remaining path from Sm+1 onward. Otherwise, if a false alarm occurs or the changepoint νm is detected with large delay, the regret can be naively bounded by T. In order to control the expected regret in these cases with the help of p and q, Lemma 6 and 7 come into play, which bound the probability of a false alarm and large delay, respectively. In Cao et al. (2019) both probabilities are set ad hoc to 1/T, reducing the additional expected regret to a constant. The bound T on the regret in case of MDB entering a bad path is penalizing MDB harsh and one could ask if there is a possibility for MDB to reenter into a good path by coming across a changepoint which it detects with delay shorter than L/2. Based on this idea, we extend the proof in Cao et al. (2019) substantially. Indeed, we remove these "dead ends" from the previous analysis and are able to analyze the expected regret entailed in the "detours" from false alarms or delays to the reentrance into a good path.

Proof. First, we prove the following statement for all m M by induction:

Em h R S(νm 1, T) i (M m + 1) L

i=m |Si| + (p + q)

i=m E h R S(νi 1, νi 1) i .

Note that T = M P

m=1 |Sm|. Then E h R S(T) i can be bounded by a simpler expression:

E h R S(T) i

= E1 h R S(ν0, T) i

m=1 |Sm| + (p + q)

m=1 E h R S(νm 1, νm 1) i

m=1 |Sm| + 2(p + q)

m=1 E h R S(νm 1, νm 1) i

K 1 + p + q +

m=1 E h R S(νm 1, νm 1) i .

Base case: m = M Since there is only one stationary segment from νM 1 to νM 1 = T, we can consider this as the stationary

Published in Transactions on Machine Learning Research (MM/YYYY)

setting starting at νM 1 and apply Lemma 5 in the second inequality:

EM h R S(νM 1, T) i |SM| P(τM < νM) + EM[R S(νM 1, νM) | τM νM]

|SM|p + 2γK|SM|

K 1 + E h R S(νM 1, νM 1) i

2 + 2γK|SM|

K 1 + (p + q)|SM| +

i=M E h R S(νi 1, νi 1) i .

Induction hypothesis: For any arbitrary but fixed m M 1, the expected cumulative binary strong regret from time steps νm to T of MDB started at νm is bounded by

Em+1 h R S(νm, T) i (M m) L

i=m+1 |Si| + (p + q)

i=m+1 E h R S(νi 1, νi 1) i .

Inductive step: m + 1 m: Let pm := P(τm < νm). We can decompose the regret based on the distinction whether a false alarm occurs:

Em h R S(νm 1, T) i

= Em h R S(νm 1, T) | τm νm i (1 pm) + Em h R S(νm 1, T) | τm < νm i pm.

We can divide the expected regret in the case of no false alarm into two parts: that incurred in the m-th stationary segment and that from the next segment onward to the time horizon, which allows us to apply the intermediate result in Lemma 5 for the m-th segment:

Em h R S(νm 1, T) | τm νm i

= Em h R S(νm 1, νm 1) | τm νm i + Em h R S(νm, T) | τm νm i

K 1 + E h R S(νm 1, νm 1) i + Em h R S(νm, T) | τm νm i .

Let qm := P(τm νm + L/2 | τm νm). We split the righthand side term into:

Em h R S(νm, T) | τm νm i = Em h R S(νm, T) | νm τm < νm + L/2 i (1 qm)

+ Em h R S(νm, T) | τm νm + L/2 i qm.

On the event that the changepoint is detected with short delay, i.e., νm τm < νm + L/2, we can rewrite the expected regret as:

Em h R S(νm, T) | νm τm < νm + L/2 i

= Em h R S(νm, τm) | νm τm < νm + L/2 i + Em h R S(τm + 1, T) | νm τm < νm + L/2 i

2 + Em+1 h R S(νm, T) i ,

where we used that after MDB being resetted in τm, no detection window contains any samples from the previous segment before νm and that there are still at least L time steps left before νm+1 (given by Assumption

Published in Transactions on Machine Learning Research (MM/YYYY)

(1)), such that νm+1 can be detected as if MDB was started at νm. On the other hand, for longer delay we can distinguish further:

Em h R S(νm, T) | τm νm + L/2 i

max n Em h R S(νm, T) | νm + L/2 τm < νm + L i , Em h R S(νm, T) | τm νm + L io .

In the case of νm + L/2 τm < νm + L and under Assumption (1), guaranteeing at least L time steps between τm and νm, we can make the same argument as above and derive:

Em h R S(νm, T) | νm + L/2 τm < νm + L i

= Em h R S(νm, τm) | νm + L/2 τm < νm + L i

+ Em h R S(τm + 1, T) | νm + L/2 τm < νm + L i

L + Em+1 h R S(νm, T) i .

In the case of τm νm + L the detection windows contain no samples from the previous preference matrix P (m) due to the fact that after L time steps all detection windows are filled with new samples from P (m+1). Thus, the detection of νm+1 applies as if MDB would have been started at νm, but Alg is still running with invalid observations from the previous segment Sm. Hence, it potentially suffers maximal regret for Sm+1:

Em h R S(νm, T) | τm νm + L i |Sm+1| + Em+1 h R S(νm, T) i .

Combining both cases on the event of τm νm + L/2, we obtain:

Em h R S(νm, T) | τm νm + L/2 i |Sm+1| + Em+1 h R S(νm, T) i .

Summarizing, on the event of τm νm we obtain:

Em h R S(νm, T) | τm νm i

= Em h R S(νm, T) | νm τm < νm + L/2 i (1 qm)

+ Em h R S(νm, T) | τm νm + L/2 i qm

2 + Em+1 h R S(νm, T) i (1 qm) + |Sm+1| + Em+1 h R S(νm, T) i qm

2 (1 qm) + |Sm+1| qm + Em+1 h R S(νm, T) i

2 + |Sm+1| qm + Em+1 h R S(νm, T) i .

Coming back to the case of τm < νm, we can make use of the same arguments as for the intermediate results of the previous cases with a case distinction depending on whether τm+1 < νm + L/2:

Em h R S(νm 1, T) | τm < νm i

= Em h R S(νm 1, νm 1) | τm < νm i + Em h R S(νm, T) | τm < νm i

|Sm| + max n Em h R S(νm, T) | τm+1 < νm + L/2, τm < νm i ,

Em h R S(νm, T) | τm+1 νm + L/2, τm < νm i o

|Sm| + max L

2 + Em+1 h R S(νm, T) i , |Sm+1| + Em+1 h R S(νm, T) i

= |Sm| + |Sm+1| + Em+1 h R S(νm, T) i .

Published in Transactions on Machine Learning Research (MM/YYYY)

Finally, we can combine the bounds on the expected regret in both major cases τm νm and τm < νm, and apply the induction hypothesis in the third inequality in order to derive:

Em h R S(νm 1, T) i

= Em h R S(νm 1, T) | τm νm i (1 pm) + Em h R S(νm 1, T) | τm < νm i pm

2 + 2γK|Sm|

K 1 + |Sm+1| qm + E h R S(νm 1, νm 1) i + Em+1 h R S(νm, T) i (1 pm)

+ |Sm| + |Sm+1| + Em+1 h R S(νm, T) i pm

2 + 2γK|Sm|

K 1 + E h R S(νm 1, νm 1) i (1 pm) + |Sm+1| (1 pm)qm

+ (|Sm+1| + |Sm|) pm + Em+1 h R S(νm, T) i

2 + 2γK|Sm|

K 1 + (|Sm+1| + |Sm|) (p + q) + E h R S(νm 1, νm 1) i + Em+1 h R S(νm, T) i

2 + 2γK|Sm|

K 1 + (|Sm+1| + |Sm|) (p + q) + E h R S(νm 1, νm 1) i + (M m) L

i=m+1 |Si| + (p + q)

i=m+1 E h R S(νi 1, νi 1) i

= (M m + 1) L

i=m |Si| + (p + q)

i=m E h R S(νi 1, νi 1) i ,

which proves the hypothesis. For the second inequality we used:

|Sm+1| (1 pm)qm + (|Sm+1| + |Sm|) pm (|Sm+1| + |Sm|) ((1 pm)qm + pm)

(|Sm+1| + |Sm|) (pm + qm)

(|Sm+1| + |Sm|) (p + q),

since p maxm pm and q maxm qm.

Theorem 3 bounds MDB s expected regret depending on p and q, but these are not explicitly given by the problem statement nor the choice of our parameters. A bound solely based on the given parameters is more desirable. We catch up on this by plugging in the bounds obtained in Lemma 6 and 7 directly.

Corollary 3. For any choice of parameters satisfying Assumptions (1) to (3) the expected cumulative binary strong regret of MDB is bounded by

E h R S(T) i ML

K 1 + p + q +

m=1 E h R S(νm 1, νm 1) i ,

with p = 2T exp 2b2

w and q = exp wc2

2 , while c > 0 stems from Assumption (2).

Being equipped with a bound depending on the chosen parameters, we are interested in an optimal parameterization in the sense that it minimizes MDB s expected regret. We start by considering γ in isolation.

Corollary 4. Let p, q [0, 1] with P(τm < νm) p for all m M and P(τm νm + L/2 | τm νm) q

for all m 1 M. Setting γ = q

8T (K 1), the expected cumulative binary strong regret of MDB is

Published in Transactions on Machine Learning Research (MM/YYYY)

E h R S(T) i

2w MTK + 2(p + q)T +

m=1 E h R S(νm, νm+1 1) i .

Proof. In order to minimize the bound in Theorem 3 depending on γ we only have to consider the global minimum of Mw j K(K 1)

for γ (0, 1], where we inserted the definition of L given earlier. To circumvent rounding problems caused by the floor, we instead consider the following function bounding that term:

g(γ) := Mw K(K 1)

The first derivative is

g (γ) = Mw K(K 1)

Setting g (γ ) = 0 in search of a local minimum, we obtain

This is indeed a local minimum because it holds g (γ ) > 0 for the second derivative given by

g (γ) = Mw K(K 1)

It is also a global minimum because g and its domain are convex. Plugging γ into g, we obtain the following bound:

For certain values of M, T, and K (and also w which is the only parameter of our choice) the derived formula for γ violates the condition γ 1. MDB could still be executed, but it would have maximal possible strong regret, same as for γ = 1, since detection steps are conducted perpetually without a break and thus only pairs containing two different arms are played. At last, we derive the optimal choices for the missing parameters w and b.

Corollary 1 Setting γ = q

8T (K 1), b = r

2T (2T +1)δ

2T (2T +1)δ

the lowest even integer greater or equal 8 δ2 log

2T (2T +1)δ

, the expected cumulative binary strong regret of MDB is bounded by

E h R S(T) i

v u u t8 log

2T(2T + 1)δ

m=1 E h R S(νm, νm+1 1) i .

Proof. Considering the third term in Corollary Corollary 4 to guarantee the bound 2(p+q)T C for C R+

of our choice, we need to have

2T and q (1 a)C

Published in Transactions on Machine Learning Research (MM/YYYY)

for some a [0, 1]. The intention is to choose C as high as possible without changing the asymptotic behavior of the bound given in Corollary 4. With greater C there is more space for the probabilities p and q to increase such that the window length w can be chosen smaller. We fulfill the demands towards p and q posed in Theorem 3 by ensuring that the above mentioned bounds are greater than those given by Lemma 6 and 7, and therefore obtain

2T and exp wc2

These inequalities are equivalent to

2 w log 2T (1 a)C

In order to simplify further calculations we set 4T 2

a C = 2T (1 a)C , from which we derive a = 2T 2T +1. Next, we

δ , which gives us the intermediate result:

E h R S(T) i w + 1

m=1 E h R S(νm, νm+1 1) i .

2T(2T + 1)δ

2T(2T + 1)δ

such that the lower bounds stated above are met. Then we plug this into Assumption (2) in order to obtain for w:

2T(2T + 1)δ

Finally, we set w to the lowest even integer above its lower bound.

E DETECT Regret Analysis

Again, the number of time steps L needed to fill all windows completely during a detection phase, i.e., each pair of arms (a I, aj) with a I = aj is played at least w times, remains a key quantity used in the regret analysis. Since it differs from MDB, we define:

L := w(K 1).

Another new parameter we have to take into consideration is the adaptation of δ to entries in the preference matrices that relate to winning probabilities of the Condorcet winner in each segment. The definition of δ in Section 4 is not appropriate anymore because DETECT cannot detect changes at any other entries than those being related to the by Alg suspected Condorcet winner. Hence, we define

δ(m) := maxj |δ(m) m ,j| and δ := minm δ(m) ,

where δ(m) i,j := p(m+1) i,j p(m) i,j denotes the magnitude of change of the preference probability for the pair of arms (ai, aj) between segment Sm and Sm+1. Continuing, we impose the following assumptions on the problem statement and the parameters:

Assumption (1): |Sm| T + 3

2L for all m {1, . . . , M}

Assumption (2): δ 2b

w + c for some c > 0

Published in Transactions on Machine Learning Research (MM/YYYY)

Assumption (1) requires a minimal length for all stationary segments depending on T to guarantee that DETECT is able to detect changepoints as touched on above. Assumption (2) is required to allow short delay with certain probability, analogously to Assumption (2) for MDB. It also implies δ(m) > 0 for each m, meaning that for every changepoint νm there is at least one entry p(m+1) m ,j in P (m+1) different to the winning

probability p(m) m ,j in the previous segment. Otherwise DETECT would automatically fail to detect these kind of changepoints.

We adopt the notation used in Appendix D, but assume rv to be the binary weak regret aggregation function (instead of binary strong regret) in the course of the following theoretical analysis.

Lemma 8 Consider a scenario with M = 1. Let τ1 be the first detection time and a I be the first suspected Condorcet winner returned by Alg. Given T T, the expected cumulative binary weak regret of DETECT is bounded by

E h R W(T) i E h R W( T) i + (T T) (1 P(τ1 > T, a I = a1 )).

Proof. We can split the expected regret by distinguishing whether a false alarm is raised or the optimal arm has not been found by Alg as follows:

E h R W(T) i = E h R W( T) i + E h R W( T + 1, T) i

= E h R W( T) i + E h R W( T + 1, T) | τ1 T i P(τ1 T)

+ E h R W( T + 1, T) | τ1 > T, a I = a1 i P (τ1 > T, a I = a1 )

+ E h R W( T + 1, T) | τ1 > T, a I = a1 i P (τ1 > T, a I = a1 )

E h R W( T) i + (T T) P(τ1 T) + (T T) P (τ1 > T, a I = a1 )

= E h R W( T) i + (T T) (1 P (τ1 > T, a I = a1 )) ,

where we rely on T T for the first equation and utilized for the first inequality the fact that in the case of τ1 > T and a I = a1 no regret is incurred in the detection phase due to properties of the binary weak regret.

The condition of T T is implied by Assumption (1) since we have |S1| = T due to the absence of changepoints. We can bound the probability of a false alarm given that Alg has identified the optimal arm successfully by using Lemma 12.

Lemma 9 Consider a scenario with M = 1. Let τ1 be the first detection time and a I be the first suspected Condorcet winner returned by Alg. The probability of DETECT raising a false alarm given that a I = a1 is bounded by

P (τ1 T | a I = a1 ) 2(T T) P (a I = a1 ) exp 2b2

Proof. Let n I,Jt,t be the value of n I,Jt after its update in Line 13 at time step t, AI,t :=

n I,Jt,t w/2 P

s=n I,Jt,t w+1 XI,Jt,s,

and BI,t :=

s=n I,Jt,t w/2+1 XI,Jt,s. Let r(t) be the value of r in time step t during a detection step. We

Published in Transactions on Machine Learning Research (MM/YYYY)

P (τ1 T | a I = a1 ) =

t= T +1 P (τ1 = t | a I = a1 )

t= T +1 : n I,Jt,t w P (|AI,t BI,t| > b | a I = a1 )

t= T +1 : n I,Jt,t w

P(|A1 ,t B1 ,t| > b)

P (a I = a1 )

1 P (a I = a1 )

t= T +1 2 exp 2b2

= 2(T T) P (a I = a1 ) exp 2b2

In the third inequality we used Lemma 12 with the observations of the duels between the pair (a I, a Jt) being the bits filling the detection window DI,Jt. The requirement that all samples are drawn from the same Bernoulli distribution is met since there is no changepoint and thus all samples from a pair (ai, aj) are drawn from a Bernoulli distribution with parameter p(1) i,j . Note that Jt is deterministic in each time step t later than T.

Next, we bound the probability of a delay of at least L/2 time steps given that the suspected Condorcet winner returned by Alg is indeed the true optimal arm.

Lemma 10 Consider a scenario with M = 2. Let τ be the first detection time and a I the first suspected Condorcet winner returned by Alg. Assume that ν1 + L /2 ν2 and the existence of an arm aj such that δ(1) 2b

w + c for some c > 0. Then it holds

P (τ1 ν1 + L /2 | τ1 ν1, a I = a1 ) exp wc2

Proof. Assume that τ1 is large enough such that the pair (a I, aj) is played at least w/2 times from ν1 onward during the detection phase. In that case, let t be the time step in which the pair (a1 , aj) is played for the w/2-th time from ν1 onward during the first detection phase. Then it holds t < ν1 +L /2. As a consequence, we obtain τ1 < ν1 + L /2 if we deny the assumption. Let n I,j,t be the value of n I,j after its update at

time step t, Ai,j,t :=

ni,j,t w/2 P

s=ni,j,t w+1 Xi,j,s, and Bt :=

s=ni,j,t w/2+1 Xi,j,s. Since |AI,j,t BI,j,t| > b triggers the

changepoint-detection in time step t, it implies τ1 < ν1 + L /2 given τ1 ν1. Thus, we obtain

P (τ1 < ν1 + L /2 | τ1 ν1, a I = a1 ) P (|AI,j,t BI,j,t| > b | a I = a1 ) .

We can apply Lemma 13 with p = p(1) 1 ,j and θ = δ(1) 1 ,j and conclude:

P (τ1 ν1 + L /2 | τ1 ν1, a I = a1 ) = 1 P (τ1 < ν1 + L /2 | τ1 ν1, a I = a1 )

1 P (|AI,j,t BI,j,t| > b | a I = a1 )

1 P(|A1 ,j,t B1 ,j,t| > b)

P (a I = a1 )

Published in Transactions on Machine Learning Research (MM/YYYY)

The required distance between ν1 and ν2 of at least L /2 time steps is implied by Assumption (1), whereas the condition on δ(1) is given by Assumption (2).

Before stating our main result in Theorem 4, we adopt again the notation used in Appendix D for using Em [R(t1, t2)] with f being the binary weak regret aggregation function. Furthermore, let a Im be the first suspected Condorcet winner returned by Alg when DETECT is started at νm 1. Since DETECT s success of detecting changepoints is highly depending on Alg s ability to find the true optimal arm after T time steps and our analysis capitalizes on that, we define p(m) T as the probability that the suspected Condorcet winner

a T returned by Alg after T time steps is the optimal arm, i.e., p(m) T := P ( a T = am ), given it is run in a

stationary setting with preference matrix P (m). Based on that, let p T minm p(m) T , stating a lower bound valid for all M preference matrices. Since Assumption (1) ensures that the first running phase ends before νm given that DETECT is started at νm 1, we can later relate p T to the results in Lemma 21 and 23 if we are given WS or Bt W as Alg, respectively.

Theorem 4 Let p, q [0, 1] with P (τm < νm | a Im = am ) p for all m M and P (τm νm + L /2 | τm νm, a Im = am ) q for all m M 1. Then the expected cumulative binary weak regret of DETECT is bounded by

E h R W(T) i ML

2 + (1 p T + pp T + q)MT +

m=1 E h R W(νm 1, νm 1 + T 1) i .

The proof is an induction over the number of segments M, same as its adequate for MDB in Appendix D, and thus also inspired by the one presented for MUCB in Cao et al. (2019).

In the case of an false alarm, DETECT could recover completely if the next changepoint is at least T + L

time steps away, such that the running phase is finished and all detection windows are filled and prepared to encounter the next changepoint. Since we cannot simply assume this to happen, we have to deal with the opposite event in which either the detection windows are not filled to a sufficient extent, or even worse, the running phase of Alg is not finished yet, leaving it in a corrupted state to enter the next segment, thus invalidating the lower bound p T .

Proof. First, we prove the following statement for all m M by induction:

Em h R W(νm 1, T) i

i=m+1 (i m)|Si| +

i=m E h R W(νi 1, νi 1 + T 1) i

+ (1 (1 p)p T )

i=m (i m + 1)|Si|.

Note that T = M P

m=1 |Sm|. Then E h R W(T) i can be bounded by a simpler expression:

E h R W(T) i

= E1 h R W(ν0, T) i

m=2 (m 1)|Sm| + (1 (1 p)p T )

Published in Transactions on Machine Learning Research (MM/YYYY)

m=1 E h R W(νm 1, νm 1 + T 1) i

+ (1 p T + pp T )

m=1 E h R W(νm 1, νm 1 + T 1) i

2 + q(M 1)T + (1 p T + pp T )MT +

m=1 E h R W(νm 1, νm 1 + T 1) i

2 + (1 p T + pp T + q)MT +

m=1 E h R W(νm 1, νm 1 + T 1) i .

Base case: m = M Since there is only one stationary segment from νM 1 to νM 1 = T, we can consider this as the stationary setting starting at νM 1 and apply Lemma 8 in the first inequality:

EM h R W(νM 1, T) i

E h R W(νM 1, νM 1 + T 1) i + |SM| T (1 P (τM νM, a IM = a M ))

= E h R W(νM 1, νM 1 + T 1) i

+ |SM| T (1 (1 P (τM < νM | a IM = a M )) P (a IM = a M ))

E h R W(νM 1, νM 1 + T 1) i + (1 (1 p)p T )|SM|

i=M+1 (i M)|Si| +

i=M E h R W(νi 1, νi 1 + T 1) i

+ (1 (1 p)p T )

i=M (i M + 1)|Si|.

Induction hypothesis: For any arbitrary but fixed m M 1, the expected cumulative binary weak regret from time steps νm to T of DETECT started at νm is bounded by

Em+1 h R W(νm 1, T) i

i=m+2 (i m 1)|Si| +

i=m+1 E h R W(νi 1, νi 1 + T 1) i

+ (1 (1 p)p T )

i=m+1 (i m)|Si|.

Inductive step: m + 1 m Let pm = P (τm < νm | a Im = am ). We can decompose the regret based on the distinction whether Alg returned the true optimal arm, i.e., a Im = am and then whether a false alarm, i.e., τm < νm, occurs:

Em h R W(νm 1, T) i

= Em h R W(νm 1, νm 1 + T 1) i + Em h R W(νm 1 + T, T) i

= E h R W(νm 1, νm 1 + T 1) i + Em h R W(νm 1 + T, T) | a Im = am i 1 p(m) T

Published in Transactions on Machine Learning Research (MM/YYYY)

+ Em h R W(νm 1 + T, T) | τm < νm, a Im = am i pm p(m) T

+ Em h R W(νm 1 + T, T) | τm νm, a Im = am i (1 pm)p(m) T

E h R W(νm 1, νm 1 + T 1) i + 1 (1 pm)p(m) T

+ Em h R W(νm 1 + T, T) | τm νm, a Im = am i .

We can reduce the expected regret from the first detection phase onward in the case of no false alarm and correctly returned Condorcet winner to the regret incurred from the next segment onward to the time horizon:

Em h R W(νm 1 + T, T) | τm νm, a Im = am i

= Em h R W(νm 1 + T, νm 1) | τm νm, a Im = am i

+ Em h R W(νm, T) | τm νm, a Im = am i

= Em h R W(νm, T) | τm νm, a Im = am i .

This is justified by the fact that in every time step of the first detection phase the true Condorcet winner returned by Alg is played given that τm νm and a Im = am . Thus, there is no weak regret incurred from νm 1 + T to νm 1, which we used for the second equation. We split the remaining term into:

Em h R W(νm, T) | τm νm, a Im = am i

Em h R W(νm, T) | νm τm < νm + L /2, a Im = am i

+ Em h R W(νm, T) | τm νm + L /2, a Im = am i q

Em h R W(νm, T) | νm τm < νm + L /2, a Im = am i + q

i=m+1 |Si|.

On the event that the changepoint is detected with short delay, i.e., νm τm < νm + L /2, we can rewrite the expected regret as:

Em h R W(νm, T) | νm τm < νm + L /2, a Im = am i

= Em h R W(νm, τm) | νm τm < νm + L /2, a Im = am i

+ Em h R W(τm + 1, T) | νm τm < νm + L /2, a Im = am i

2 + Em+1 h R W(νm, T) i ,

where we utilized Assumption (1) ensuring us that after DETECT being resetted in τm, the next detection phase has at least L time steps left before νm+1 such that νm+1 can be detected as if DETECT was started at νm. Summarizing, on the event of τm νm and a Im = am we obtain:

Em h R W(νm 1 + T, T) | τm νm, a Im = am i

= Em h R W(νm, T) | τm νm, a Im = am i

Em h R W(νm, T) | νm τm < νm + L /2, a Im = am i + q

Published in Transactions on Machine Learning Research (MM/YYYY)

2 + Em+1 h R W(νm, T) i + q

i=m+1 |Si|.

Finally, we can combine the intermediate results and apply the induction hypothesis in the fourth inequality to in order to derive:

Em h R W(νm 1, T) i

E h R W(νm 1, νm 1 + T 1) i + 1 (1 pm)p(m) T

+ Em h R W(νm 1 + T, T) | τm νm, a Im = am i

E h R W(νm 1, νm 1 + T 1) i + 1 (1 pm)p(m) T

2 + Em+1 h R W(νm, T) i + q

i=m+1 |Si| + (1 (1 p)p T )

i=m |Si| + E h R W(νm 1, νm 1 + T 1) i

+ Em+1 h R W(νm, T) i

i=m+1 |Si| + (1 (1 p)p T )

i=m |Si| + E h R W(νm 1, νm 1 + T 1) i

+ (M m 1) L

i=m+2 (i m 1)|Si| +

i=m+1 E h R W(νi 1, νi 1 + T 1) i

+ (1 (1 p)p T )

i=m+1 (i m)|Si|

i=m+1 (i m)|Si| +

i=m E h R W(νi 1, νi 1 + T 1) i

+ (1 (1 p)p T )

i=m (i m + 1)|Si|,

which proves the hypothesis.

Theorem 4 bounds DETECT s expected regret depending on p and q, but these are not explicitly given by the problem statement nor the choice of our parameters. A bound solely based on the given parameters is more desirable. We catch up on this by plugging in the bounds obtained in Lemma 9 and 10 directly.

Corollary 5. For any choice of parameters satisfying Assumptions (1) and (2) the expected cumulative binary weak regret of DETECT is bounded by

E h R W(T) i ML

2 + (1 p T + pp T + q)MT +

m=1 E h R W(νm 1, νm 1 + T 1) i ,

with p = 2(T T )

p T exp 2b2

w and q = exp wc2

2 , while c > 0 stems from Assumption (2).

We are again interested in an optimal parameterization that minimizes DETECT s expected regret and derive it to the best of our ability, neglecting some imprecision incurred in the previous intermediate results

Published in Transactions on Machine Learning Research (MM/YYYY)

and further technical difficulties. First, we consider the window length w and the threshold b independent of the used dueling bandits algorithm and its bound p T .

Corollary 2 Setting b = q

w , and w to the lowest even integer greater or equal 8C

δ2 with any

C R+ and C log M(2T 2+T )

C , the expected cumulative binary weak regret of DETECT is bounded by

E h R W(T) i 4C + 1

δ2 MK + (1 p T )MT + C +

m=1 E h R W(νm 1, νm 1 + T 1) i .

Proof. To guarantee the bound (pp T + q)MT C for C R+ of our choice of the term contained in Theorem 4, we need to have

p a C p T MT and q (1 a)C

for some a [0, 1]. The intention is to choose C as high as possible without changing the asymptotic behavior of the bound given in Theorem 4. With greater C there is more space for the probabilities p and q to increase such that the window length w can be chosen smaller. We fulfill the demands towards p and q posed in Theorem 4 by ensuring that the above mentioned bounds are greater than the ones given in Lemma 9 and 10 to obtain 2T p T exp 2b2

a C p T MT and exp wc2

These inequalities are equivalent to

2 log 2MT 2

2 w log MT (1 a)C

In order to simplify further calculations we want 2MT 2

a C = MT (1 a)C , from which we derive a = 2T 2T +1. Using

C log M(2T 2+T )

such that the lower bounds stated above are met. Then we plug this into Assumption (2) in order to obtain for w:

Finally, we set w to be the lowest even integer above its lower bound and hence due to L = w(K 1):

Finally, we want to close the analysis by setting T and consequently p T specifically for the usage of WS and Bt W. To this end, we derive T in Corollary 6 and 8 such that the summand (1 p T )MT in Corollary 2 is bounded by M log T and we set C = (1 p T )MT in order to keep the same asymptotic bounds. We tackle the trade-off between T and p T in Corollary 7 and 9 from the other side by assuming T =

T to be given and calculate the expected regret bounds on the basis of the resulting p T . Since we need a valid lower bound p T for all stationary segments, we define p min := minm p(m) min and plug it into the respective lower bounds in Lemma 21 and 23. Note that the value of T can exceed segment lengths for small enough p min.

Published in Transactions on Machine Learning Research (MM/YYYY)

Corollary 6. Setting T =

1 e (2p min 1)2

(2p min 1)2 + K 1

w log T + 3

1 w 4 log T + 3

2 , and w to the lowest even integer greater or equal 16 log T +6

δ2 by the usage of Corollary 2 with C = M log T the expected cumulative binary weak regret of DETECT using Bt W is bounded by

E h R W(T) i 8 log T + 4

δ2 MK + 2M log T +

m=1 E h R W νm 1, νm 1 + T 1 i .

Corollary 7. Let x := e (T 1/4 K+1)(2p min 1)2

1 e (2p min 1)2 . Setting T =

w to the lowest even integer greater or equal 8 δ2 log 2T

x by the usage of Corollary 2 with C = x MT the expected cumulative binary weak regret of DETECT using Bt W is bounded by

E h R W(T) i 4 log 2T

x + 1 δ2 MK + 2x MT +

m=1 E h R W νm 1, νm 1 +

Corollary 8. Setting r = max

1+ 1 p min 2p min 1

log p min 1 p min

, T = r3K3T

log T , b = q

w log T + 3

1 w log 4 log T + 3

2 , and w to the lowest even integer greater or equal 16 log T +6

δ2 by the usage of Corollary 2 with C = M log T the expected cumulative binary weak regret of DETECT using WS is bounded by

E h R W(T) i 8 log T + 4

δ2 MK + 2M log T +

m=1 E h R W(νm 1, νm 1 + T 1) i .

Corollary 9.

Let y := min r N

1 + 1 p min 2p min 1 1 p min p min

T . Setting T =

w to the lowest even integer greater or equal 8 δ2 log 2T

y by the usage of Corollary 2 with C = y MT the expected cumulative binary weak regret of DETECT using WS is bounded by

E h R W(T) i 4 log 2T

δ2 MK + 2y MT +

m=1 E h R W(νm 1, νm 1 +

Published in Transactions on Machine Learning Research (MM/YYYY)

F Condorcet Winner Identification Analysis

F.1 Winner Stays

Algorithm 4 Winner Stays (WS) Chen & Frazier (2017) Require: K N

Ci 0 ai A for t = 1, . . . , T do

if t > 1 and It 1 arg maxi Ci then

It It 1 else if t > 1 and Jt 1 arg maxi Ci then

It Jt 1 else

Draw It uniformly at random from arg maxi Ci end if if t > 1 and It 1 arg maxi =It Ci then

Jt It 1 else if thent > 1 and Jt 1 arg maxi =It Ci

Jt Jt 1 else

Draw Jt uniformly at random from arg maxi =It Ci end if Play (a It, a Jt) and observe X(t) It,Jt if X(t) It,Jt = 1 then CIt CIt + 1 CJt CJt 1 else

CIt CIt 1 CJt CJt + 1 end if end for

The arms scores depending on the round and iteration are generalized in Lemma 15 that we leave without proof.

Lemma 15. Fix any arbitrary round r N and iteration i {1, . . . , K 1} of the Winner Stays algorithm. Then in the beginning of iteration i in round r the incumbent has score (r 1)(K 1) + i 1 and the challenger 1 r.

Next, we will derive a bound for the probability that the winner of the last round is the Condorcet winner given a number of time steps T based on the connection to Gambler s ruin game. We use parts of the proof for the regret bound of WS provided in Peköz et al. (2020), but correct errors that the authors have made. In Gambler s ruin two players participate, one having a dollars at disposal, the other b. The game consists of consecutive rounds (not to confuse with rounds of WS) in which the first player wins with probability p. The winner of a round receives one dollar from the losing player. As soon as one player has run out of money, the game is over. Each iteration of WS can be viewed as such a game with the dueling arms being the players of the game and the difference between each arm s score to the score at which it loses the round being its money at hand. Throughout the analysis we utilize the following result:

Lemma 16. (Gambler s ruin) In a game of Gambler s ruin with the first player having a dollars and the second player having b dollars, given the winning probability p of the first player over the second player, the first player s probability to win

Published in Transactions on Machine Learning Research (MM/YYYY)

the game is 1 1 p

Next, we prove similar to Peköz et al. (2020) lower bounds for the probability of a winning a round given that it won the previous round or lost in the other case. For that purpose let Wr,i and Lr,i be the events that a wins iteration i in round r and loses iteration i in round r, respectively. Further define pmin := mini = p ,i and x := 1 pmin

pmin . Note that x [0, 1) since pmin 1

2, 1 , otherwise a would not be the Condorcet winner.

Lemma 17. For all rounds r 2 holds

P(Wr,K 1 | Wr 1,K 1) 1 x(r 1)K+1

P(Wr,K 1 | Lr 1,K 1) x 1 xr K .

Proof. Consider a fixed iteration i in round r with a being part of the played pair. Since the incumbent starts with score (r 1)(K 1) + i 1 and the challenger with score 1 r (Lemma 15), the iteration can be seen as a Gambler s ruin game between the incumbent having (r 1)K + i dollars and the challenger having one. Thus using Lemma 16 a wins this iteration given that it is the incumbent or the challenger, respectively, with probability

P(Wr,i | a starts iteration i in round r as the incumbent) 1 x(r 1)K+i

1 x(r 1)K+i+1 ,

P(Wr,i | a starts iteration i in round r as the challenger) 1 x 1 x(r 1)K+i+1 .

And hence we get for a whole round r 2:

P(Wr,K 1 | Wr 1,K 1) = P(Wr,1 | Wr 1,K 1)

i=2 P(Wr,i | Wr,i 1)

1 x(r 1)K+i

1 x(r 1)K+i+1

= 1 x(r 1)K+1

In the other case of a having lost the previous round r 1 we can upper bound its winning probability in round r similarly. It is chosen randomly as the challenger for some iteration j in which it has to beat the incumbent before winning in all remaining K 1 j iterations. Let Cr,i be the event that a is chosen as the challenger in iteration i of round r 2 given that Lr 1,K 1, obviously we have P(Cr,i) 1 K 1 for all r 2 and i. From which we derive for r 2:

P(Wr,K 1 | Lr 1,K 1) =

j=1 P(Cr,j) P(Wr,j | Cr,j)

i=j+1 P(Wr,i | Wr,i 1)

1 x 1 x(r 1)K+j+1

1 x(r 1)K+i

1 x(r 1)K+i+1

= 1 x 1 xr K .

Published in Transactions on Machine Learning Research (MM/YYYY)

With the probability given that a wins the next round, we continue by bounding the probability for it to lose the r-th round in Lemma 18. Let Wr := Wr,K 1 and Lr := Lr,K 1.

Lemma 18. For all rounds r 1 holds

P(Lr) 1 + x 1 x

Proof. In the special case of r = 1 we say that in the first iteration both randomly chosen arms are challengers such that P(C1,1) = 2 K and P(C1,j) 1 K for any iteration j 2. Thus for r = 1 we derive with applying Lemma 17:

P(L1) = 1 P(W1)

j=1 P(C1,j) P(W1,j | C1,j)

i=j+1 P(W1,i | Wr,i 1)

j=1 P(C1,j) 1 x 1 xj+1

j=1 P(C1,j) 1 x

For r 2 we can bound the probability recursively by using Lemma 17 again:

P(Lr) = P(Lr | Wr 1) P(Wr 1) + P(Lr | Lr 1) P(Lr 1)

= (1 P(Wr | Wr 1)) (1 P(Lr 1)) + (1 P(Wr | Lr 1)) P(Lr 1)

x(r 1)K+1 xr K

1 xr K (1 P(Lr 1)) + x xr K

1 xr K P(Lr 1)

= P(Lr 1) x x(r 1)K+1

1 xr K + x(r 1)K+1 xr K

P(Lr 1) x + x(r 1)K+1,

which leads to:

s=1 xs(K 1)+r,

as we prove by induction over r. We have already shown the base case for r = 1 above, hence only the inductive step for r + 1 is left to show:

P(Lr+1) P(Lr) x + xr K+1

s=1 xs(K 1)+r !

s=1 xs(K 1)+r+1.

We can bound the second term in the second last display by:

s=1 xs(K 1)+r

Published in Transactions on Machine Learning Research (MM/YYYY)

= xr+1 1 xr 1

We can now say with which probability a loses any round r, but we are not finished yet because we need to know which is the last round completed by WS after T many time steps. A key quantity for this is the length of a Gambler s ruin game, which can be bounded in expectation.

Lemma 19. Peköz et al. (2020) Let X be the number of plays in a game of Gambler s ruin with one player having m dollars and the other one. Then the expected game length is bounded by

E[X] < 2(m + 1).

Lemma 20. Let Xr ,i be the number of time steps played in iteration i in round r . For any a R+ and round r 2 holds

i=1 Xr ,i ar2K2 !

Proof. Due to Lemma 15 we can interpret each iteration i in any round r as a game of Gambler s ruin with one player having (r 1)K + i dollars and the other one. Thus we obtain

E[Xr ,i] < 2((r 1)K + i + 1).

Using Markov s inequality, we can state for any a R+ that:

P(Xr ,i 2a((r 1)K + i + 1)) < 1

and thus further conclude with the help of an average argument for the first inequality:

i=1 2a((r 1)K + i + 1)

i=1 {Xr ,i 2a((r 1)K + i + 1)}

i=1 P(Xr ,i 2a((r 1)K + i + 1))

The lower bound for the total number of time steps can be rewritten as:

i=1 2a((r 1)K + i + 1)

r =1 r K(K 1) + K 1 K(K 1)

Published in Transactions on Machine Learning Research (MM/YYYY)

= 2a r K 1 K(K 1)

+ K(K 1)r(r + 1)

= ar(K 1)(r K + 2),

which can be upper bounded by ar2K2 for r 2.

Next, by substituting a with T r2K2 , we obtain a probabilistic bound for the first r rounds to take at least T time steps, which gives us a minimal probability for the first r rounds to be completed.

Corollary 10. For any round r 2 and number of time steps T holds

i=1 Xr ,i T

Finally, what is left to show is with which probability a wins the last completed round by combining Lemma 18 and Corollary 10. Let a T be the winner of that round, meaning that it is the incumbent of the first iteration in the succeeding round. We call a T also the suspected optimal arm being the one that WS would return if we were to ask it for the optimal arm.

Lemma 21. Fix any r N with r 2. Given T, the suspected optimal arm a T returned by Winner Stays after T time steps is the true optimal arm a with probability

P( a T = a ) > 1 1 + x 1 x

Proof. Let U T be the random variable denoting the last round that has been completed after T time steps. We use both Lemma 18 and Corollary 10 in the fourth inequality to derive that:

P( a T = a ) =

s=0 P( a T = a | U T = s) P(U T = s)

s=r (1 P(Ls)) P(U T = s)

s=r P(U T = s)

(1 P(Lr)) (1 P(U T r))

1 1 + x 1 x

1 1 + x 1 x

F.2 Beat the Winner

Same as for WS we will derive a lower bound for the probability that after T time steps the suspected optimal arm by Bt W is indeed optimal. Therefore, we define its suspected optimal arm to be the current incumbent. As before, we use pmin := mini = p ,i. The authors in Peköz et al. (2020) have already given a useful property about the last round lost by the optimal arm:

Lemma 22. Peköz et al. (2020) For the last round U lost by the optimal arm and any r N holds

P(U r) e r(2pmin 1)2

1 e (2pmin 1)2 .

Published in Transactions on Machine Learning Research (MM/YYYY)

Algorithm 5 Beat the Winner (Bt W) Peköz et al. (2020) Require: K N

Draw I uniformly at random from {1, . . . , K} Q queue of all arms from A \ {a I} in random order r 1 while true do

a J top arm removed from Q w I, w J 0 while w I < r and w J < r do

Play (a I, a J) and observe X(t) I,J w I w I + I{X(t) I,J = 1}

w J w J + I{X(t) I,J = 0} end while if w I = r then

Put a J to the end of Q else

Put a I to the end of Q I J end if r r + 1 end while

Next, we can derive the desired probability bound in Lemma 23 by combining Lemma 22 together with an upper bound on the time steps that are needed for the optimal arm to win its first round after the round U in which it has lost for the last time.

Lemma 23. Given T K2, the suspected optimal arm a T returned by Bt W after T time steps is the true optimal arm a with probability

P( a T = a ) 1 e

T K+1 (2pmin 1)2

1 e (2pmin 1)2 .

Proof. Let U be the last round lost by a . Then a is played again in round U + K 1 and wins that round. Thus, Bt W will always suggest the true optimal arm as the suspected Condorcet winner after the end of round U + K 1. Since the r-th round consists of at most 2r 1 time steps, we get the following condition for T to guarantee that round U + K 1 is finished:

which is equivalent to U 2 + (2K 2)U + K2 2K T + 1 0.

Due to U 1, the inequality becomes sharp for U = T K + 1, which requires T K2. The lefthand side of the inequality decreases for smaller U 1, which means that U

T K + 1 implies the above mentioned condition for T. Now, we can use Lemma 22 to conclude:

P( a T = a ) P

Published in Transactions on Machine Learning Research (MM/YYYY)

T K+1 (2pmin 1)2

1 e (2pmin 1)2 .

G Non-stationary Dueling Bandits with Weak Regret Lower Bound

In this section, we assume without loss of generality that T is divisible by M. Let ε (0, 1). Define segment lengths |Sm| = T/M for all m {1, . . . , M}. Define preference matrices P1, . . . , PK as follows:

1 2 + ε if i = 1, j = 1 1 2 ε if i = 1, j = 1 1 2 otherwise ,

1 2 + ε if i = k, j = k 1 2 ε if i = k, j = k 1 2 + ε if i = 1, j = 1, j = k 1 2 ε if i = 1, i = j, j = 1 1 2 otherwise

Let Ht = {(a It , a Jt , X(t ) It ,Jt )}t t be the history of observations up to time step t. Let Qm k denote the probability distribution over histories induced by assuming that the preference matrix Pk is used for the mth segment, i.e., P (m) = Pk, and thus ak is the Condorcet winner in the m-th segment. The corresponding expectation is denoted by Em k [ ]. In the case of M = 1 we omit the superscripts in order to simplify notation.

Lemma 24. Saha & Gupta (2022) Let HT be the set of all possible histories HT for M = 1 and let f : HT [0, B] be a measurable function that maps a history HT to number in the interval [0, B]. Then for every k = 1 holds

Ek[f(HT )] E1[f(HT )] + B

ε ln 1 + 2ε

E1[N1,k + Nk],

where N1,k = TP

t=1 I{It = 1, Jt = k} + I{It = k, Jt = 1} is the number of times the algorithm chooses to duel

the arms a1 and ak, and Nk = TP

t=1 I{It = k, Jt = 1, Jt = k} + I{It = 1, It = k, Jt = k} is the number of times

ak is played with an arm other than itself and or a1.

Theorem 5 For every algorithm exists an instance of

(i) the piecewise-stationary dueling bandits problem with T, K, and M fulfilling M(K 1) 9T such that the algorithm s expected weak regret is in Ω(

(ii) the stationary dueling bandits problem with T and K fulfilling K 1 9T such that the algorithm s expected weak regret is in Ω(

Proof. Let t Sm and akm be the Condorcet winner in the m-th stationary segment, i.e., P (m) = Pkm. Define the following events:

Et a = {It = 1, Jt = km} {It = km, Jt = 1}

Et b = {It = km, Jt / {1, km}} {It / {1, km}, Jt = km}

Published in Transactions on Machine Learning Research (MM/YYYY)

Et c = {It = km, Jt = km}

Et d = {It = km, Jt = km}

Recall that

r W t := min{ (m) It , (m) Jt } = min n p(m) km,It, p(m) km,Jt

is the incurred weak regret at time step t Sm. For the expected regret at time step t Sm under the distribution Qm km we derive:

Em km[r W t ] = εQm km(Et d)

= ε(1 Qm km(Et a) Qm km(Et b) Qm km(Et c))

= ε ε(Qm km(Et a) + Qm km(Et b) + Qm km(Et c)).

Define N m 1,km = P

t Sm I{Et a}, N m km = P

t Sm I{Et b}, N m km = P

t Sm I{Et c}. We further obtain:

t Sm ε ε(Qm km(Et a) + Qm km(Et b) + Qm km(Et c))

= |Sm|ε εEm km

t Sm Qm km(Et a) + Qm km(Et b) + Qm km(Et c)

= |Sm|ε εEm km[N m 1,km + N m km + N m km].

Note that N m 1,km+N m km+ N m km |Sm| for all HT HT . Next, we can apply Lemma 24 since N m 1,km+N m km+ N m km is measurable for all histories:

Em km[N m 1,km + N m km + N m km]

Em 1 [N m 1,km + N m km + N m km] + |Sm|

ε ln 1 + 2ε

Em 1 [N m 1,km + N m km].

The switch from M = 1 as it is demanded in Lemma 24 to the non-stationary case is justified because for the measure N m 1,km + N m km + N m km only the part of a history containing the m-th segment is relevant. Using

Cauchy-Schwarz and ln 1+2ε 1 2ε 9ε for ε (0, 1/4) we get:

km=2 Em km[N m 1,km + N m km + N m km]

km=2 N m 1,km + N m km + N m km

ε ln 1 + 2ε

Em 1 [N m 1,km + N m km]

km=2 N m 1,km + N m km + N m km

v u u tε ln 1 + 2ε

km=2 Em 1 [N m 1,km + N m km]

Published in Transactions on Machine Learning Research (MM/YYYY)

|Sm| + |Sm|

ε ln 1 + 2ε

|Sm| + 3|Sm| p

ε2(K 1)|Sm|,

which results in the following bound:

km=2 |Sm|ε εEm km[N m 1,km + N m km + N m km]

= |Sm|ε(K 1) ε

km=2 Em km[N m 1,km + N m km + N m km]

|Sm|ε(K 1) ε(|Sm| + 3|Sm| p

ε2(K 1)|Sm|)

= |Sm|ε(K 1 1 3ε p

(K 1)|Sm|).

Recall that RW(T) = M P

t Sm r W t . Thus, assuming K 3, the expected cumulative weak regret of any

deterministic algorithm under a randomly sampled problem instance (with segment lengths |Sm| = T/M) with randomly chosen preference matrices P (m) = Pk with k = 1 for all m {1, . . . , M}, is

1 K 1|Sm|ε(K 1 1 3ε p

1 K 1 T M ε

K 1 ε 3ε2 s

We choose ε = 1 12

T 1/4 to maximize the expression and obtain non-stationary setting:

Using Yao s minimax principle (Yao, 1977), we can infer the latter lower bound also for randomized algorithms. Finally, we conclude for the stationary setting, i.e. M = 1:

Published in Transactions on Machine Learning Research (MM/YYYY)

H Further Empirical Results 1

H.1 Weak Regret

Figure 8: Cumulative binary weak regret averaged over 500 random problem instances with shaded areas showing standard deviation across 10 groups: T = 105, = 0.1, δ = 0.6.

Figure 9: Cumulative binary weak regret averaged over 500 random problem instances with shaded areas showing standard deviation across 10 groups: T = 106, = 0.1, δ = 0.6.

Published in Transactions on Machine Learning Research (MM/YYYY)

Figure 10: Cumulative binary weak regret averaged over 500 random problem instances with shaded areas showing standard deviation across 10 groups: T = 107, = 0.1, δ = 0.6.

H.2 Strong Regret

Figure 11: Cumulative binary strong regret averaged over 500 random problem instances with shaded areas showing standard deviation across 10 groups, WSS with exploitation parameter β = 1.05: T = 105, = 0.1, δ = 0.6.

Published in Transactions on Machine Learning Research (MM/YYYY)

Figure 12: Cumulative binary strong regret averaged over 500 random problem instances with shaded areas showing standard deviation across 10 groups, WSS with exploitation parameter β = 1.05: T = 106, = 0.1, δ = 0.6.

Figure 13: Cumulative binary strong regret averaged over 500 random problem instances with shaded areas showing standard deviation across 10 groups, WSS with exploitation parameter β = 1.05: T = 107, = 0.1, δ = 0.6.