# Non-Stationary Delayed Bandits with Intermediate Observations

Claire Vernade*¹, András György*¹, Timothy A. Mann¹

*Equal contribution. ¹DeepMind, London, UK. Correspondence to: Claire Vernade, András György.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics. While mitigating the effects of delays in learning is well understood in stationary environments, the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about the environment, since the available observations are already obsolete. However, the arising issues can be addressed if intermediate signals are available without delay, such that, given those signals, the long-term behavior of the system is stationary. To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. We develop a computationally efficient algorithm based on UCRL2, and prove sublinear regret guarantees for its performance. Experimental results demonstrate that our method is able to learn in non-stationary delayed environments where existing methods fail.

## 1. Introduction

Optimizing long-term metrics is an important challenge for online recommender systems (Wu et al., 2017). Consider, for instance, the media recommendation problem, where the final feedback on the content is only given after the customer has spent time watching or reading it. Another major case is that of conversion optimization in online marketing (Yoshikawa & Imai, 2018): here a user's click on an advertisement or recommendation might be observed almost immediately, but whether the click is converted into a long-term commitment, such as a purchase, can usually only be observed after some much longer delay (Chapelle, 2014), which causes modelling issues for robust inference.

Delayed feedback in online learning has been addressed both in the full-information setting (see, e.g., Joulani et al., 2013, and the references therein) and in the bandit setting (see, e.g., Mandel et al., 2015; Vernade et al., 2017; Cesa-Bianchi et al., 2019, and the references therein), assuming both stochastic and adversarial environments. The main take-away message from these studies is that, for bandits, the impact of a constant delay $D$ results in an extra additive $O(\sqrt{DT})$ term in the regret in adversarial settings, or an additive $O(D)$ term in stochastic settings.

The aforementioned approaches often provide efficient and provably optimal algorithms for delayed-feedback scenarios. Nonetheless, the algorithms naturally wait to receive enough feedback before actually learning something. This is, in general, sufficient in stationary environments, or when comparing with the best fixed action in the adversarial setting. Such methods, however, may be quite unsuitable in a non-stationary environment. For example, it is easy to see that in an environment that changes abruptly, every change results in an extra regret term of the kind mentioned above. An extreme example of a bad situation is when the environment changes (abruptly) every $D$ steps, in which case, by the time any feedback is received by the recommender system, its value has become obsolete, and thus no learning is possible.
Accordingly, the extra regret terms above sum up to order $T$; for example, in the stochastic case a regret of $D$ is suffered for each of the $T/D$ changes. Delays and non-stationarity are, in general, strongly entangled, which badly affects the performance of any online learning algorithm.

In fact, real-world problems are non-stationary in most cases. For example, trending or boredom phenomena may affect the conversion rates (Agarwal et al., 2009a), forcing the systems to keep their estimators as local as possible in time. As such, algorithms developed for non-stationary online learning (see, e.g., Auer et al., 2002a; Garivier & Moulines, 2011; György et al., 2012; Gajane et al., 2018) crucially depend on the availability of recent observations, which is often not possible in the presence of delays. To alleviate this problem, practical systems often monitor, and sometimes optimize for, some proxy information instead of the real target. Typically, the web industry optimizes for click-through rate instead of conversions (Agarwal et al., 2009b).

One step further in utilizing proxy information is to directly model the connection between the proxies and the final outcomes. This approach was formalized recently by Mann et al. (2019), who proposed a model where intermediate signals can be observed without any delay, while the final feedback is independent of the system's action given the intermediate feedback, disentangling the effects of delays and non-stationarity. The model is based on the simple observation that, even if the observation of the optimized metric is heavily delayed, the systems record many intermediate signals, such as clicks, labels or flags provided by the customers. In our media-recommendation example, this means that once a user has decided to download a book, the probability of reading the book stays the same. While this model is also only a crude approximation of real-world scenarios, it allows us to analyze the effect of such intermediate feedback signals in a principled way, and Mann et al. (2019) show both in theory and in practice that the proposed approach has significant benefits in full-information prediction problems under practical assumptions on the delay and the user population.

In this paper, we propose and analyze the bandit version of the model of Mann et al. (2019), which we call the stochastic non-stationary delayed bandit problem with intermediate observations (NSD for short). Here the system takes actions repeatedly, observes some intermediate feedback immediately, and observes the target metric after some (fixed) delay. It is assumed that the distribution of the intermediate feedback, called outcomes or signals in what follows, changes over time, while the final (delayed) metric, called the reward, depends on the observed signal in a stationary way, and is conditionally independent of the action given the signals. In this way, our model disentangles the effects of delays and non-stationarity, and hence algorithms leveraging intermediate outcomes can learn even in fast-changing environments with large observation delays. In the media-recommendation example, the actions of the system are recommending books, which a user may or may not download, depending on the time-varying popularity of the recommended book. However, once the book is downloaded, we assume that the user will read it with a fixed (but unknown) probability.
We propose and analyze an algorithm for the NSD problem, which can be thought of as a carefully derived variant of the UCRL2 algorithm of Jaksch et al. (2010) for our setting. Similarly to the sliding-window UCB (SW-UCB) algorithm for non-stationary stochastic bandits (Garivier & Moulines, 2011) and the sliding-window UCRL2 (SW-UCRL2) algorithm of Gajane et al. (2018), our method uses sliding-window estimates for the non-stationary parameters of the problem, but these are combined with delayed estimates for the stationary parameters. We show both theoretically (through regret bounds) and in simulations that the proposed algorithm indeed disentangles the effects of the delay and non-stationarity.

The paper is organized as follows. The NSD bandit framework is presented in Section 2. Our algorithm, NSD-UCRL2, is described in Section 3, and its performance is analyzed theoretically in Section 4. Experimental results are provided in Section 5, while related work and future directions are discussed in Sections 6 and 7.

## 2. The NSD Bandit Framework

**Notation.** For any integer $N > 0$, let $\Delta_N \subset \mathbb{R}^N$ denote the simplex in $N$ dimensions, and let $[N] := \{1, \ldots, N\}$. For any $x \in \mathbb{R}$, $(x)_+ = \max\{0, x\}$ is the positive part of $x$. An MDP $M$ is characterized, in the tabular case, by $M = (\mathcal{S}, \mathcal{A}, P, \rho)$, that is, a finite set of $S$ states $\mathcal{S} = [S]$, a finite set of $K$ actions $\mathcal{A} := [K]$, a transition probability tensor $P$, and a vector of reward parameters $\rho$. Each element will be precisely instantiated for our setting later.

**Learning scenario.** We consider a stochastic bandit setting with a finite set of $K$ actions and a finite set of $S$ outcomes, with $K, S \geq 2$. The learning procedure at each round $t$ is the following:

- The learner chooses an action $A_t \in [K]$.
- A categorical outcome $S_t \in [S]$ is revealed. It is assumed that $S_t \sim p_t(A_t)$, where $p_t(a) \in \Delta_S$; this probability vector may change with time.
- After a possibly random delay $D_t$, a reward $R_t$, drawn from some fixed (but unknown) distribution that depends on the outcome $S_t$, is revealed to the learner. It is assumed that, conditionally on $S_t$, the reward $R_t$ is independent of the action $A_t$, and it has mean $\rho_{S_t}$ (the vector of all means will be denoted by $\rho = (\rho_1, \ldots, \rho_S)^\top$).

For simplicity, we assume throughout the paper that the rewards are $[0,1]$-valued and the delays are constant, $D_t = D$; our results can be extended, e.g., to random delays and sub-Gaussian reward distributions.

Note that this factored bandit model can also be viewed as a simple episodic MDP $M = ([S+1], [K], (P_{a,t})_{a \in [K]}, \rho)$ with $S+1$ states and $K$ actions, as shown in Figure 1. In state 0, the learner takes an action $a \in [K]$ and transitions to a signal state $s \in [S]$ with probability $p_t(s \mid a)$ (we will use the notation $p_t(a)$ for the vector $(p_t(1 \mid a), \ldots, p_t(S \mid a))^\top$ of transition probabilities). The learner then falls back to state 0 for the next round, while a reward with mean $\rho_s$ is generated (but observed later).

**Non-stationary transitions.** We assume the environment is changing. There is a vast body of literature on non-stationary multi-armed bandits (Auer et al., 2002a; Garivier & Moulines, 2011; Besbes et al., 2014; Auer et al., 2019), contextual bandits (Chen et al., 2019) and (tabular) reinforcement learning (Jaksch et al., 2010; Gajane et al., 2018).

Figure 1. Schema of the MDP underlying the NSD bandit model. Action $A_t \in [K]$ is taken in the triangle state (0) and a transition to a round signal state is observed.
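To make the interaction protocol above concrete, the following minimal Python sketch simulates an NSD environment with abruptly switching signal distributions, fixed signal-level Bernoulli reward means, and a constant delay $D$. It is an illustration of the model only; the class and method names (`NSDBanditEnv`, `step`) are ours and not part of the paper.

```python
import numpy as np

class NSDBanditEnv:
    """Sketch of the NSD protocol: the signal distribution p_t(.|a) may switch
    abruptly, while the signal-level reward means rho stay fixed, and the reward
    generated at round t is only revealed at round t + D."""

    def __init__(self, transition_phases, change_points, rho, delay, rng=None):
        self.phases = transition_phases      # list of (K x S) row-stochastic matrices, one per phase
        self.change_points = change_points   # rounds at which the phase index advances
        self.rho = np.asarray(rho)           # fixed mean reward of each signal
        self.delay = delay                   # constant delay D
        self.rng = rng or np.random.default_rng(0)
        self.pending = []                    # rewards waiting to be revealed: (due_round, round, reward)
        self.t = 0

    def _current_p(self):
        phase = sum(self.t >= c for c in self.change_points)  # number of change points passed
        return self.phases[phase]

    def step(self, action):
        """Play `action`; return the immediate signal and all rewards that are now due."""
        p = self._current_p()[action]
        signal = self.rng.choice(len(self.rho), p=p)           # intermediate observation, no delay
        reward = float(self.rng.random() < self.rho[signal])   # Bernoulli(rho[signal]), independent of the action given the signal
        self.pending.append((self.t + self.delay, self.t, reward))
        revealed = [(u, r) for (due, u, r) in self.pending if due <= self.t]
        self.pending = [x for x in self.pending if x[0] > self.t]
        self.t += 1
        return signal, revealed
```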
In this work, we focus on switching environments: there exist $0 \leq \Gamma_T < T$ instants at which all transition probabilities abruptly change. At the same time, we assume that the parameter vector $\rho$ of the reward distributions remains fixed. This models the scenario described in the introduction: the popularity of the items to be recommended changes with time, and hence the likelihood $p_t$ of the intermediate observations changes; however, given the intermediate outcomes, the reward distribution does not change. For instance, if a book is downloaded (this is the intermediate outcome), the probability that the user will read it is constant. Note that we index $\Gamma_T$ with the horizon to emphasize that if the latter goes to infinity, the number of changes should do so as well, otherwise the impact of non-stationarity disappears in the asymptotic regime.

**Expected return and dynamic regret.** For a given action $a \in [K]$, the expected reward at round $t$ is a function of the instantaneous parameters: $\lambda_t(a) = \sum_{s \in [S]} p_t(s \mid a)\rho_s = p_t(a)^\top \rho$. The optimal action at round $t$ is the one with the highest expected reward under the current distribution over the outcomes,
$$a_t^* = \arg\max_{a \in [K]} \sum_{s \in [S]} p_t(s \mid a)\rho_s = \arg\max_{a \in [K]} p_t(a)^\top \rho ,$$
and for simplicity we assume it is unique. The goal of the learner is to minimize the dynamic expected regret, defined as
$$\mathcal{R}_T = \mathbb{E}\left[\sum_{t=1}^{T} \big( p_t(a_t^*)^\top \rho - p_t(A_t)^\top \rho \big)\right]. \qquad (1)$$
This metric is called dynamic because the baseline is the learner that sequentially adapts to environment changes, which is a different and arguably harder problem than competing with the best action in hindsight. Because of our assumption that the environment changes only $\Gamma_T$ times, the optimal action $a_t^*$ also changes at most $\Gamma_T$ times.

Note that the transition probabilities are not state-dependent, as opposed to usual reinforcement learning (RL) settings, because in our model there is only one state in which actions can be taken. Following the action, a transition is observed to a state where no action can be taken, and only a future reward, assimilated to the state's expected value, is observed. One can think of our model as a disentanglement of the usual RL setting into a transition-then-reward bandit model.

## 3. Algorithms

In this section, we present a version of SW-UCRL2 of Gajane et al. (2018), instantiated for our NSD bandit problem. Even though the general philosophy of the algorithm is fairly similar to the standard one, the simple structure of the MDP underlying our model does simplify things a bit.

Our method, presented in Algorithm 1, relies on two types of confidence regions, for transition probabilities and for rewards. Because of the non-stationarity of the transitions, we estimate them using a sliding window with a fixed window size $W > 0$, given as input to the learner. In contrast, the reward parameters are fixed, so no window is needed to estimate them. At every round $1 \leq t \leq T$, for each action $a \in [K]$, we define the windowed counts
$$N_t^W(a) = \max\Big\{1, \sum_{u=(t-W)_+}^{t-1} \mathbb{1}\{A_u = a\}\Big\},$$
and for all $s \in [S]$, we estimate the transition probability as
$$\hat{p}_t(s \mid a) = \frac{\sum_{u=(t-W)_+}^{t-1} \mathbb{1}\{A_u = a, S_u = s\}}{N_t^W(a)} . \qquad (2)$$
On the other hand, for each signal $s \in [S]$, we keep the usual counts over the rounds whose rewards have already been observed,
$$N_t^D(s) = \max\Big\{1, \sum_{u=1}^{(t-D-1)_+} \mathbb{1}\{S_u = s\}\Big\},$$
the corresponding empirical means $\hat{\rho}_t(s) = \frac{1}{N_t^D(s)}\sum_{u=1}^{(t-D-1)_+} R_u \mathbb{1}\{S_u = s\}$, and, for some $\delta > 0$, the upper confidence bounds
$$U_t(s) = \min\Big\{1, \ \hat{\rho}_t(s) + \sqrt{\frac{C_{T,\delta}}{2 N_t^D(s)}}\Big\}, \qquad (3)$$
where $C_{T,\delta} = 2\log(2TS/\delta)$.
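As a concrete illustration, the sketch below computes the sliding-window transition estimates of (2) and the reward upper confidence bounds of (3) from a raw interaction history. The confidence radius used here, $\sqrt{C_{T,\delta}/(2N_t^D(s))}$ with $C_{T,\delta} = 2\log(2TS/\delta)$, is an assumption on our part, and the function names are illustrative.

```python
import numpy as np

def windowed_transition_estimates(actions, signals, t, W, K, S):
    """Eq. (2): estimate p_t(s|a) using only the last W rounds."""
    lo = max(0, t - W)
    p_hat = np.zeros((K, S))
    n_w = np.ones(K)                          # N_t^W(a) = max{1, window count}
    for a in range(K):
        idx = [u for u in range(lo, t) if actions[u] == a]
        n_w[a] = max(1, len(idx))
        for u in idx:
            p_hat[a, signals[u]] += 1.0
        p_hat[a] /= n_w[a]
    return p_hat, n_w

def reward_ucbs(signals, rewards, t, D, S, T, delta):
    """Eq. (3): UCBs on rho(s), using only rewards already revealed (rounds u <= t - D - 1)."""
    c = 2.0 * np.log(2.0 * T * S / delta)     # C_{T,delta}
    U = np.ones(S)
    for s in range(S):
        idx = [u for u in range(max(0, t - D)) if signals[u] == s]
        n = max(1, len(idx))                  # N_t^D(s)
        rho_hat = sum(rewards[u] for u in idx) / n
        U[s] = min(1.0, rho_hat + np.sqrt(c / (2.0 * n)))   # assumed radius form
    return U
```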
The key step in the learning is to search for the most optimistic MDP in a high-probability parameter space: for some $\delta > 0$, we define $C_{W,T,\delta} = 2S\log(KWT/\delta)$, and for each action $a \in [K]$,
$$\rho_t^+(a) = \max\Big\{ q^\top U_t \ : \ q \in \Delta_S, \ \|\hat{p}_t(a) - q\|_1 \leq \sqrt{\tfrac{C_{W,T,\delta}}{N_t^W(a)}} \Big\}. \qquad (4)$$
Here $q$ ranges over the plausible candidate set of transition probabilities for $p_t(a)$, with estimated average reward $q^\top U_t = \sum_{s \in [S]} q(s) U_t(s)$. Note that since $U_t(s)$ is an optimistic (upper) estimate of $\rho_s$ (with high probability), $\rho_t^+(a)$ is an optimistic estimate of $\lambda_t(a)$. We denote the value of $q$ achieving the maximum above by $\tilde{p}_t(a)$. The algorithm then acts optimistically by choosing the action with the largest $\rho_t^+(a)$. This mechanism is reminiscent of the Extended Value Iteration algorithm of Jaksch et al. (2010), but we only need a simplified version, which is given in Appendix A. Further justification of the choices of confidence regions is provided in the next section, where they are used to prove high-probability upper bounds on the regret. Note that, as opposed to the general UCRL2 algorithm of Jaksch et al. (2010), our method does not require phases.

**Algorithm 1** NSD-UCRL2 for NSD bandits

- Input: $K$ (number of arms), $S$ (number of outcomes), $W$ (window size), $\delta$ (error probability).
- Initialization: for all $a \in [K]$ and $s \in [S]$, set the global counts $N_0(a) = N_0(a,s) = N_0^D(s) = 0$ and the local counts $N_0^W(a) = 0$, $\hat{p}_0(\cdot \mid a) = \frac{1}{S}\mathbf{1}_S$. Pull each action once.
- For $t \geq K+1$:
  1. For all $a \in [K]$, compute $\hat{p}_t(\cdot \mid a)$ as in (2); for all $s \in [S]$, compute $\hat{\rho}_t(s)$ and the high-probability upper bounds $U_t(s)$ as in (3).
  2. For all $a \in [K]$, compute $\rho_t^+(a)$ according to (4) (by Algorithm 2 in Appendix A).
  3. Pull action $A_t \in \arg\max_{a \in [K]} \rho_t^+(a)$.
  4. Action update: update $N_t^W(a)$ for all $a \in [K]$.
  5. Signal update: receive the intermediate signal $S_t$ (with no delay) and update the windowed transition counts.
  6. Receive the feedback $R_{t-D}$ for round $t-D$ and signal $S_{t-D}$.
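For illustration, here is a sketch of the inner maximization in (4) and of the resulting action selection, in the spirit of the simplified extended value iteration step: probability mass is greedily moved toward the signals with the largest optimistic rewards $U_t(s)$ within the $L_1$ ball around $\hat{p}_t(a)$. The $L_1$ radius $\sqrt{C_{W,T,\delta}/N_t^W(a)}$ is an assumption on our part, and this sketch is not necessarily identical to Algorithm 2 of Appendix A.

```python
import numpy as np

def optimistic_index(p_hat_a, U, beta):
    """Maximize q^T U over q in the simplex with ||q - p_hat_a||_1 <= beta (Eq. 4)."""
    order = np.argsort(-np.asarray(U))            # signals sorted by decreasing optimistic reward
    q = np.array(p_hat_a, dtype=float)
    best = order[0]
    q[best] = min(1.0, q[best] + beta / 2.0)      # push mass toward the most promising signal
    for s in reversed(order):                     # remove the surplus from the least promising signals
        if s == best:
            continue
        surplus = q.sum() - 1.0
        if surplus <= 0:
            break
        take = min(q[s], surplus)
        q[s] -= take
    return float(q @ np.asarray(U)), q

def choose_action(p_hat, n_w, U, K, W, T, delta):
    """Pick A_t in argmax_a rho_t^+(a), with radius sqrt(C_{W,T,delta} / N_t^W(a))."""
    c = 2.0 * len(U) * np.log(K * W * T / delta)  # C_{W,T,delta} = 2 S log(KWT/delta)
    indices = [optimistic_index(p_hat[a], U, np.sqrt(c / n_w[a]))[0] for a in range(K)]
    return int(np.argmax(indices))
```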
## 4. Analysis

In this section we present high-probability upper bounds on the regret of NSD-UCRL2.

### 4.1. Warm-up: Analysis in the stationary case

As a first step, we provide an analysis of NSD-UCRL2 when the transition probabilities are stationary, that is, $\Gamma_T = 0$. This initial result gives an insight on how much regret is suffered due to the use of a sliding-window estimator.

**Theorem 1.** Let $M = (S, K, P, \rho)$ define a stationary NSD problem with constant delay $D > 0$. With probability at least $1 - 2\delta$, the regret of NSD-UCRL2 with window size $W$ is bounded from above by a sublinear function of $T$; with the choice $W = \Theta(T)$, the bound recovers, up to logarithmic factors, the standard stationary rate, with only an additive dependence on the delay $D$ (see Section 4.3).

The proof of the theorem is deferred to Appendix B. It is based on standard techniques used in the literature to derive regret bounds for optimistic algorithms for bandit and MDP problems (mostly on the proof of the minimax bound of UCRL2). The non-standard parts of our derivation come from the fact that the transitions are estimated based only on the most recent window, but without delay, while the rewards can be estimated based on all observations, but these are delayed. Therefore, we need to disentangle the estimation of the transitions and the rewards.

**Remark 1** (Random delays). It is possible to extend the analysis used in the proof of Theorem 1 to random delays. When the delays are random, the number of missing observations takes the role of $D$. The only part of the derivation affected by this change (see the proof for details) is the term $(D_t(s)+1)/\big(2(N_t(s) - D_t(s) - 1)^{3/2}\big)$, where now $D_t(s)$ is the number of missing observations from signal $s$ at time $t$. Under mild assumptions on the delays, $D_t(s) \leq N_t(s)/2$ for $t > T_0$ with high probability, where $T_0$ is large enough. Then the effect of delays, in expectation, is $O(T_0 + \mathbb{E}[D(s)])$, where $\mathbb{E}[D(s)] \geq \mathbb{E}[D_t(s)]$ for $t > T_0$.

### 4.2. Main Results for NSD Bandits

We can now prove the main results of this section, high-probability upper bounds on the regret of NSD-UCRL2 in non-stationary and delayed environments. We start with a problem-independent bound, which is the extension of Theorem 1 to the non-stationary case.

**Theorem 2.** Let $M_T = (S, K, (P_t)_{t \leq T}, \rho)$ define an NSD problem with $\Gamma_T$ switches and constant delay $D > 0$. With probability at least $1 - 2\delta$, the regret of NSD-UCRL2 is bounded from above by a quantity which, choosing $W = (T/\Gamma_T)^{2/3}(KS\log(KT/\delta))^{1/3}$, is of order
$$T^{2/3}\big(\Gamma_T KS \log(KT/\delta)\big)^{1/3} + S(D+1) + \sqrt{T\big(S + \log(1/\delta)\big)\log(TS/\delta)} .$$

The proof of the theorem, presented in Appendix C, relies on the regret guarantees of NSD-UCRL2 proved in Theorem 1. In short, after each switch, the estimators are all wrong for $W$ rounds, resulting in a linear $O(W)$ regret per switch. In between those phases, the algorithm adapts and suffers the sublinear regret proved in Theorem 1. The subtlety of the analysis lies in the separate control of the reward estimation, which does not suffer from switches.

Next we present a bound on the number of rounds in which suboptimal actions are selected, as well as a problem-dependent regret bound. We start with a few more definitions. Let
$$\mathcal{T}(W) = \{1 \leq t \leq T : \forall k \in [K], \ \forall u \ \text{with} \ t-W \leq u \leq t, \ p_u(k) = p_t(k)\}$$
denote the round indices at which all arms have had stable transition probabilities for at least $W$ rounds. An action $a \in [K]$ is called $\varepsilon$-bad in time step $t$ if its gap is at least $\varepsilon$, that is, if $\Delta_{t,a} = p_t(a_t^*)^\top\rho - p_t(a)^\top\rho \geq \varepsilon$, and let $\mathcal{B}_t^\varepsilon$ denote the set of $\varepsilon$-bad actions in round $t$. The minimum $\varepsilon$-bad observation probability for each $s \in [S]$ is defined as $p_{\min}^\varepsilon(s) = \min\{p(s \mid a) : t \in [T], a \in \mathcal{B}_t^\varepsilon, p(s \mid a) > 0\}$, and let $p_{\min} = \min_{s \in [S]} p_{\min}^0(s) = \min\{p(s \mid a) : t \in [T], a \neq a_t^*, s \in [S], p(s \mid a) > 0\}$. Finally, let $\Delta_{\min} = \min_{a,t:\, \Delta_{t,a} > 0} \Delta_{t,a}$ denote the minimum gap of the actions over the time horizon.

**Theorem 3.** Under the assumptions of Theorem 2, with probability at least $1 - 3\delta$, the number of times NSD-UCRL2 selects $\varepsilon$-bad actions in $\mathcal{T}(W)$ is, up to lower-order terms, at most
$$\frac{SK\log(KWT/\delta)}{\varepsilon^2} + SD .$$
Furthermore, the regret of NSD-UCRL2 admits a problem-dependent bound which is essentially of order
$$\frac{T\log(KWT/\delta)}{W\Delta_{\min}}\,\log\frac{e}{p_{\min}} + W\Gamma_T + D .$$

The proof is deferred to Appendix D. It is based on a decomposition of the suboptimal actions taken into two groups, depending on whether the rewards of the actually reachable intermediate signals are estimated up to the required accuracy.

**Lower bound for standard, signal-agnostic methods.** Following the construction of Garivier & Moulines (2011), it is easy to show a lower bound for signal-agnostic algorithms (i.e., algorithms that do not use the intermediate signals). Indeed, if the reward distributions can be selected arbitrarily for every stationary segment, the regret of a signal-agnostic algorithm is bounded from below by the regret of the same algorithm receiving the information of the change points and restarting (optimally) after each of them. Then, on every stationary segment, the minimax regret lower bound for the $K$-armed bandit problem applies (see, e.g., Chapter 15 of Lattimore & Szepesvári, 2020), which, together with the effect of the delay $D$ on this segment, gives a lower bound of $\Omega(\sqrt{Kl} + D)$ on the regret of a segment of length $l$. Considering environments with stationary segments of equal length $l = T/(\Gamma_T + 1)$ gives the following lower bound:¹

**Proposition 1.**
For any $\Gamma_T < T$ and any signal-agnostic algorithm, there exists an NSD problem such that the regret is lower bounded as
$$\Omega\Big(\sqrt{K(\Gamma_T + 1)T} + (\Gamma_T + 1)D\Big).$$

¹ This argument can be made precise by carefully taking into account that the lengths of the segments are integers and by selecting the reward distribution of each segment independently from the minimax lower-bound construction.

### 4.3. Discussion

Proposition 1 shows that for bandit algorithms, delays and non-stationarity are entangled and have a combined impact on minimax regret bounds. In contrast, by leveraging intermediate outcomes, NSD-UCRL2 disentangles this effect and only pays the price of the delay once, and not every time a new stationary segment starts (see Theorems 2 and 3). This allows NSD-UCRL2 to achieve sublinear regret in situations where standard, signal-agnostic algorithms would suffer linear regret. For example, if $D = \Theta(T/(\Gamma_T + 1))$, the latter suffer linear regret (as the lower bound becomes $\Omega(T)$), while the bounds for our method guarantee sublinear regret as long as $S$ is small enough relative to $\Gamma_T$ (neglecting, of course, other conditions, like $KS\Gamma_T = o(T)$). This emphasizes the main advantage of our feedback model: by disentangling delay and non-stationarity, learning is possible in a much broader family of situations.

Our two upper bounds provide slightly different insights into our algorithm. Theorem 2 shows an $O(\Gamma_T^{1/3} T^{2/3} + D)$ upper bound on the regret. Theorem 3 provides a problem-dependent bound that is essentially $O\big(\frac{T\log T}{W\Delta_{\min}} + W\Gamma_T + D\big)$, which depends on the minimum gap of all the actions over the time horizon, and also on the minimum observation probability of the intermediate signals. It also shows that our method cannot suffer large losses too often, as it chooses $\varepsilon$-bad actions at most $O\big(\frac{SK\log T}{\varepsilon^2} + W\Gamma_T + D\big)$ times. In case the problem is stationary, setting the window size to $W = T$ essentially recovers the best possible rates in both theorems, including the additive effect of the delay (Joulani et al., 2013). The $T^{2/3}$ regret rate of Theorem 2 also agrees with the rates derived for MDPs in non-stationary environments (Jaksch et al., 2010; Gajane et al., 2018). The regret guarantee of Theorem 3 improves upon a similar bound of Garivier & Moulines (2011) for the non-stationary bandit problem (without delay), where the dependence on $\Delta_{\min}$ is quadratic. Similarly to them, we can also obtain a problem-dependent regret bound of $O(\sqrt{\Gamma_T T}/\Delta_{\min} + D)$, which requires setting the window size as $W = \Theta(\sqrt{T}/\Delta_{\min})$. The bound becomes a bit worse in $\Delta_{\min}$ if $W$ is tuned independently of this quantity. The main message of Theorem 3, however, is that if the gaps are not too small, the algorithm can learn really fast, while handling partially delayed observations.

On the other hand, an $O(\sqrt{\Gamma_T T})$ minimax regret is possible for non-stationary bandits, and such a result automatically extends to shortest-path problems with a fixed horizon, an instance of which we consider here. Nevertheless, to our knowledge, no practical algorithm exists for MDPs in general that achieves the minimax rate in the non-stationary, abruptly changing environments considered in this paper. The task is also not made easier by the combined effects of delays and non-stationarity in our specific problem, which need to be disentangled to utilize the full power of our model.
This comes with its own limitations; for example, we could not build on the UBEV algorithm of Dann et al. (2017) (designed for the stochastic shortest path problem), which improves upon the guarantees of UCRL2 by jointly estimating the transition probabilities and the rewards (the value function in their case).

Getting our $O(\Gamma_T^{1/3} T^{2/3} + D)$ or $O(\sqrt{\Gamma_T T}/\Delta_{\min} + D)$ regret rates requires setting the window parameter $W$ using some prior information on the number of switches and on the horizon. This is common practice in the literature (see, e.g., Auer et al., 2002b; Garivier & Moulines, 2011; Besbes et al., 2014), as in the face of non-stationarity it is challenging to design fully adaptive online learning algorithms that do not rely on restricting the length of the history they use. Indeed, in the absence of other assumptions on the underlying non-stationary function controlling the rewards, the best a learner can do is to forget the past so as to keep its estimators as little biased as possible. The optimal behavior should then be to forget the past adaptively. This is the recent approach taken by Auer et al. (2019) and Chen et al. (2019) for the standard and the contextual multi-armed bandit problems, respectively. They include careful change-detection procedures in their algorithms to detect (almost) stationary segments, and obtain the first fully adaptive $O(\sqrt{\Gamma_T T})$ bounds on the dynamic regret with $\Gamma_T$ switches. It seems straightforward to show that, in a delayed environment, the regret of these methods matches (up to logarithmic factors) the rate of the lower bound of Proposition 1. These algorithms pull all arms alternately until they identify the current best one, then commit to it while keeping a carefully tuned change-point safety check running in parallel. It would be interesting to adapt these methods for learning the transition probabilities in our algorithm, in order to exploit the fully adaptive ideas mentioned above. We conjecture that the optimal regret for NSD bandits with intermediate observations should be bounded by $O\big(\sqrt{(\Gamma_T + 1)KST} + D\big)$, but we leave this question as an open problem for future work on this topic.

In all our regret bounds, we treat $S$ as a constant term. Comparing our regret bounds (Theorems 2 and 3) to the rates achievable by standard signal-agnostic methods (Proposition 1) shows that NSD-UCRL2 behaves better in terms of the delay than signal-agnostic algorithms when $\Gamma_T$ is larger than $S$. However, a large value of $S$ may deteriorate the first, time-dependent term in our bounds, possibly making NSD-UCRL2 theoretically inferior to signal-agnostic methods in such cases. Deciding whether this is just an artifact of our analysis or a real phenomenon is left for future work. In particular, it would be interesting to derive a problem-dependent lower bound for this new family of structured bandit problems, similarly to those of Graves & Lai (1997) for the standard setting.

## 5. Experiments

In this section we present experimental results to provide more insight on the impact and potential usefulness of the intermediate signals for different regimes of delays with respect to $T/\Gamma_T$. We test our algorithm in different scenarios by comparing it to standard bandit algorithms that are agnostic to the intermediate signals, as well as to oracle bandit strategies that have access to various kinds of extra information.

Signal rewards: $\rho(1) = 0.8$, $\rho(2) = 0.4$, $\rho(3) = 0.2$.

| action $a$ | $p(1 \mid a)$ | $p(2 \mid a)$ | $p(3 \mid a)$ | $\lambda(a)$ |
|---|---|---|---|---|
| 1 | 0.8 | 0.1 | 0.1 | 0.70 |
| 2 | 0.1 | 0.8 | 0.1 | 0.42 |
| 3 | 0.1 | 0.1 | 0.8 | 0.28 |
| 4 | 0.1 | 0.4 | 0.5 | 0.34 |

Table 1. Experiment setting.
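As a quick check of the last column of Table 1, the expected reward of each action is $\lambda(a) = p(a)^\top \rho$; for instance:

```python
import numpy as np

rho = np.array([0.8, 0.4, 0.2])      # signal-level mean rewards from Table 1
P = np.array([[0.8, 0.1, 0.1],       # p(.|a) for actions a = 1..4
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.1, 0.4, 0.5]])

print(P @ rho)   # [0.7, 0.42, 0.28, 0.34], matching the last column of Table 1
```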
**Baselines.** First, natural but weak baselines are the standard UCB algorithm (Auer et al., 2002a) and its sliding-window version SW-UCB (Garivier & Moulines, 2011), which ignore the intermediate observations. These policies have sublinear regret in stationary and non-stationary environments, respectively. Thus, they should provide reasonable performance in our problem (when the delays are not too large), but are expected to be inferior to our method, showing that the signals at the very least do not hurt. As a Bayesian alternative to NSD-UCRL2, we implemented NSD-PSRL, based on the posterior sampling reinforcement learning (PSRL) algorithm of Osband & Van Roy (2017) (details are given in Appendix F). We expect this method to have performance similar to NSD-UCRL2. We also construct two stronger oracle policies. Oracle-UCB is a signal-agnostic UCB policy that receives the extra information of the change points and restarts optimally after each of them. Oracle-NSD is NSD-UCRL2 without windowing, but with oracle restarts as well. We also consider full oracles that do not suffer delays, called Oracle-NSD-nd and Oracle-UCB-nd, respectively. Restarting is the optimal behavior in a non-stationary environment, so these oracles should outperform all other algorithms in this setting.

**Experimental setup.** Throughout the experiments we use the transition probabilities and mean rewards reported in Table 1. The delay parameter $D$ and the window size $W$ of the algorithm are discussed in the dedicated sections. To simulate non-stationarity, we draw a random integer $r \in \{1, \ldots, K-1\}$ at each change point and we permute the action vectors of the transition matrix by shifting them $r$ times, so that arm $i$ becomes arm $((i + r - 1) \bmod K) + 1$. This has the advantage of ensuring that the best arm always changes after a change point. Note that a side effect of this construction is that, after a few changes, it might be the case that an arm that had better values on average so far suddenly becomes the best one, which allows agnostic policies like UCB to perform well on such a phase (typically the last one in our experiments). But this is an artifact of the experimental setup, and we tried our best to design experiments that do not suffer from it. Unless otherwise stated, we run experiments with time horizon $T = 8000$ and $\Gamma_T = 3$ change points, at rounds $\{2000, 4000, 6000\}$. All results presented are averaged over 50 independent runs. The shaded areas in the graphs correspond to the 95% confidence regions.
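The random cyclic shift applied at each change point can be written compactly as follows (an illustrative sketch of the described scheme, not the code used in the experiments):

```python
import numpy as np

def shift_transition_matrix(P, rng):
    """At a change point, draw r in {1,...,K-1} and move arm i to position
    ((i + r - 1) mod K) + 1, i.e., cyclically shift the rows of the K x S matrix by r."""
    K = P.shape[0]
    r = rng.integers(1, K)                     # r in {1, ..., K-1}, so the best arm always moves
    return np.roll(P, shift=r, axis=0), r
```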
Figure 2. Impact of the choice of the window parameter $W$ on the regret of NSD-UCRL2.

**Choice of W.** Our theoretical analysis (Theorem 2) suggests that for a problem with $T = 8000$ and $\Gamma_T = 3$, the optimal choice of the window parameter is of order $(T/\Gamma_T)^{2/3}(KS\log(KT/\delta))^{1/3} \approx 1243$. In Figure 2, we compare the regret of NSD-UCRL2 instantiated with $W \in \{400, 800, 2000\}$ to that of the oracle that knows the change points, in a changing environment with no delays ($D = 0$). We can make several observations on those results. First, when the window is too small, the algorithm never reaches the exploitation phase and the regret is linear. This corresponds to the regime where the term in $T\log(WT)/W$ (or $T\log T/(W\Delta_{\min})$ in Theorem 3) is dominant in the regret bound. Then, the larger the window gets, the better the exploitation and the smaller the regret. However, for $W > 1000$, the variance of the regret gets much larger, as the overlap between the phases creates bias in the estimators. This roughly corresponds to the second regime, where the term $W\Gamma_T$ becomes dominant. In the next experiments, we set $W = 800$, which is close to the value suggested by theory and works best in this calibration experiment.

**Comparison with all the baselines.** In this experiment, we set $T = 8000$, $\Gamma_T = 3$, and consider delays $D \in \{100, 500, 1000\}$. The largest value corresponds to the extreme case where $D$ is comparable to $T/\Gamma_T$. For this experiment, we fix the window parameter to $W = 800$, for both SW-UCB and NSD-UCRL2. Figure 3 shows the regret of all the policies in the benchmark, including the strong oracles. Our NSD-UCRL2 and NSD-PSRL outperform all other strategies, except for the oracle ones. Non-delayed oracles have a significantly reduced regret, but our policies have a similar behavior with a reasonable gap when delays are small. Signal-agnostic policies (UCB and SW-UCB) have linear regret (see Appendix E for linear-scale plots), as does the strong baseline Oracle-UCB when delays are large.

Figure 3. Cumulative regret of all policies in vertical log scale. $\Gamma_T = 3$ and $D \in \{100, 500, 1000\}$ from left to right. When a policy uses a window, $W = 800$. Oracles know the change points; non-delayed oracles (nd) receive rewards without delays.

**Ablation study: behavior under a misspecified model.** Our factored reward model is likely not correct in practice. However, if the model is approximately accurate, we expect our algorithm to perform well. To examine such situations, we construct a mixture reward model in which the factored model is correct only in some fraction of the time, while the rewards and the intermediate observations may behave differently at other times. For the latter, the rewards of the actions are controlled by an additional vector of parameters $\mu = (\mu_1, \ldots, \mu_K)$, and are assumed to be independent of the intermediate observations. Then, in each round $t$, given action $A_t \in [K]$, the environment either uses our factored model (with probability $1 - \alpha$, for some mixing parameter $\alpha \in [0,1]$), or returns a random intermediate observation while generating a reward with expectation $\mu_{A_t}$. Formally,

- with probability $\alpha$: $S_t \sim \mathcal{U}([S])$ and $R_t \sim \mathcal{B}(\mu_{A_t})$;
- with probability $1-\alpha$: $S_t \sim p_t(\cdot \mid A_t)$ and $R_t \sim \mathcal{B}(\rho_{S_t})$;

where $\mathcal{U}([S])$ denotes the uniform distribution over $[S]$ and $\mathcal{B}(\beta)$ denotes a Bernoulli distribution with parameter $\beta$. This means that the expected payoff of arm $a$ at round $t$ is $\lambda_t^\alpha(a) = \alpha\mu_a + (1-\alpha)\, p_t(a)^\top \rho$.
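A sketch of this mixture feedback model, assuming Bernoulli rewards as above (function and variable names are ours):

```python
import numpy as np

def mixture_step(a, p_t, rho, mu, alpha, rng):
    """With probability alpha: uniform signal and reward ~ B(mu[a]); otherwise the factored NSD model."""
    S = len(rho)
    if rng.random() < alpha:                        # misspecified branch
        signal = rng.integers(S)                    # S_t ~ U([S])
        reward = float(rng.random() < mu[a])        # R_t ~ B(mu_a), independent of the signal
    else:                                           # factored branch
        signal = rng.choice(S, p=p_t[a])            # S_t ~ p_t(.|a)
        reward = float(rng.random() < rho[signal])  # R_t ~ B(rho_{S_t})
    return signal, reward

# expected payoff: lambda_alpha(a) = alpha * mu[a] + (1 - alpha) * p_t[a] @ rho
```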
First we test how NSD-UCRL2 performs in this environment in the stationary non-delayed setting; then we analyze its performance in the delayed non-stationary case (note that as long as the algorithm learns reasonably well in the stationary non-delayed case, we can expect a similar performance boost from the intermediate observations in the delayed non-stationary scenario as in the case when the factored model is correct). In the stationary setting, we identify two types of regime. In a favorable situation, the best arm is not modified by the mixing, i.e., $\arg\max_a \lambda^\alpha(a) = \arg\max_a \lambda(a)$. In that case, we conjecture that NSD-UCRL2 should be able to achieve sublinear regret, as it tends to learn the right action. On the contrary, in a bad mixing case, the best arm is not the same and we expect NSD-UCRL2 to fail. In any case, UCB ignores the signals and should have logarithmic regret.

To test these properties, we set up a simple, stationary problem with only actions 1 and 2 from Table 1, such that action 1 is the best one. A favorable case is when $\mu = (0.9, 0.1)$, such that for any $\alpha$ the optimal action in the mixed model is action 1. A bad case is, for instance, when $\mu = (0.1, 0.9)$, in which case the best arm is preserved only for small values of $\alpha$. We show some of our results below; more can be found in Appendix E. In Figure 4, we show the results in both scenarios in the case of a low mixing level, that is, when the model is close to right. NSD-UCRL2 is on par with or better than UCB. UCB even suffers from the higher variance of the distribution of the rewards, especially in the bad mixing case. On the contrary, as expected, when the mixing becomes non-negligible ($\alpha = 0.3$, Figure 5), UCB learns in both the favorable and bad cases. NSD-UCRL2 also learns, though with higher regret in the favorable case, but it fails in the bad case.

Figure 4. Misspecified model with low mixing level $\alpha = 0.1$ in the stationary setting: in both the favorable and bad cases, NSD-UCRL2 is on par with UCB. The bad mixing case tends to increase the variance of the rewards and even increases UCB's regret.

Figure 5. Misspecified model with non-negligible mixing level $\alpha = 0.3$ in the stationary setting: NSD-UCRL2 is on par with UCB in the favorable case, and diverges in the bad case.

In the non-stationary and delayed setting considered before, the transition probabilities change along the horizon, together with the vector $\mu_t$ (now depending on time, using the same shifting scheme as for the transition probabilities), modifying the means of the arms in a consistent manner. We compare the performance of UCB, SW-UCB and NSD-UCRL2 when the model is misspecified. Unlike before, we do not expect NSD-UCRL2 to gracefully adapt to changes. Nonetheless, the results in Figure 6 show that NSD-UCRL2 does learn when $\alpha$ is small, and significantly outperforms both UCB and SW-UCB. For $\alpha \geq 0.5$, none of the considered algorithms is able to solve the problem.

Figure 6. Regret of NSD-UCRL2, SW-UCB and UCB in the same setting as Figure 3 when the model is misspecified. Delays are fixed to $D = 500$. From left to right, $\alpha \in \{0.1, 0.3, 0.5\}$, and $\mu = (0.1, 0.1, 0.1, 0.9)$ is chosen so that the best arm changes in the mixed model. Despite the mixing, even for non-negligible mixing levels, NSD-UCRL2 learns decently fast.

## 6. Other Related Work

As mentioned in the introduction, our model combines the challenges of non-stationary stochastic bandits (Garivier & Moulines, 2011; Auer et al., 2019), of delayed feedback in online learning (Joulani et al., 2013), and of online learning in episodic MDPs (Dann et al., 2017). The general tree structure of the problem, shown in Figure 1, is also reminiscent of combinatorial (semi-)bandits (Cesa-Bianchi & Lugosi, 2012; Combes et al., 2015; Kveton et al., 2015; Talebi et al., 2017). However, the difference is that, while in the aforementioned combinatorial bandit problems one can directly select which sub-actions to choose, in our model we cannot directly choose to observe the reward corresponding to a selected signal. This limits the adaptation of the techniques used in these algorithms, as the corresponding observation counts can only be controlled indirectly.

## 7. Conclusions

In this paper, we defined a new stochastic bandit model, NSD bandits with intermediate observations. It addresses real-world modelling issues when dealing with non-stationary environments and delayed feedback, and mitigates their joint effect by utilizing some non-delayed proxy feedback (similarly to the work of Mann et al.,
2019). From the theoretical point of view, it provides a mid-way model between the simple stochastic multi-armed bandit and tabular MDPs. A first possible future extension would be to consider the contextual version of this problem, where the transition probabilities of each action depend on a current context vector. Another potential extension would be to add several layers of signals that are sequentially sent to the learner before the final reward is revealed; this problem would then resemble combinatorial bandits and episodic MDPs with varying delays. While we presented sublinear regret guarantees, understanding the rate of the minimax regret and modifying the algorithm to achieve it is of obvious interest. Future work should also look into problem-dependent guarantees. For instance, it would be interesting to exploit ideas from Graves & Lai (1997) to obtain a gap-dependent asymptotic lower bound, and to seek optimal strategies.

## Acknowledgements

The authors are indebted to Tor Lattimore for several inspiring discussions.

## References

Agarwal, D., Chen, B.-C., and Elango, P. Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th International Conference on World Wide Web, pp. 21–30. ACM, 2009a.

Agarwal, D., Chen, B.-C., Elango, P., Motgi, N., Park, S.-T., Ramakrishnan, R., Roy, S., and Zachariah, J. Online models for content optimization. In Advances in Neural Information Processing Systems, pp. 17–24, 2009b.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

Auer, P., Chen, Y., Gajane, P., Lee, C.-W., Luo, H., Ortner, R., and Wei, C.-Y. Achieving optimal dynamic regret for non-stationary bandits without prior information. In Proceedings of the 32nd Conference on Learning Theory, pp. 159–163, 2019.

Azuma, K. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.

Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pp. 199–207, 2014.

Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 20(17):1–38, 2019.

Chapelle, O. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1097–1105. ACM, 2014.

Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.

Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. In Proceedings of the 32nd Conference on Learning Theory, pp. 696–726, 2019.

Combes, R., Shahi, M. S. T. M., Proutiere, A., et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pp. 2116–2124, 2015.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713–5723, 2017.
Gajane, P., Ortner, R., and Auer, P. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066, 2018.

Garivier, A. and Moulines, E. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, pp. 174–188, 2011.

Graves, T. L. and Lai, T. L. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.

György, A., Linder, T., and Lugosi, G. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58(11):6709–6725, 2012.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.

Joulani, P., György, A., and Szepesvári, Cs. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pp. 1453–1461, 2013.

Korda, N., Kaufmann, E., and Munos, R. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pp. 1448–1456, 2013.

Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, Cs. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 535–543, 2015.

Lattimore, T. and Szepesvári, Cs. Bandit Algorithms. Cambridge University Press, 2020.

Mandel, T., Liu, Y.-E., Brunskill, E., and Popović, Z. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

Mann, T. A., Gowal, S., György, A., Hu, H., Jiang, R., Lakshminarayanan, B., and Srinivasan, P. Learning from delayed outcomes via proxies with applications to recommender systems. In Proceedings of the 36th International Conference on Machine Learning, pp. 4324–4332, 2019.

Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, pp. 2701–2710, 2017.

Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Talebi, M. S., Zou, Z., Combes, R., Proutiere, A., and Johansson, M. Stochastic online shortest path routing: The value of feedback. IEEE Transactions on Automatic Control, 63(4):915–930, 2017.

Vernade, C., Cappé, O., and Perchet, V. Stochastic bandit models for delayed conversions. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.

Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the L1 deviation of the empirical distribution. Technical Report HPL-2003-97R1, Hewlett-Packard Labs, 2003.

Wu, Q., Wang, H., Hong, L., and Shi, Y. Returning is believing: Optimizing long-term user engagement in recommender systems. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, pp. 1927–1936. ACM, 2017.

Yoshikawa, Y. and Imai, Y. A nonparametric delayed feedback model for conversion rate prediction. arXiv preprint arXiv:1802.00255, 2018.