# Delayed Bandits: When Do Intermediate Observations Help?

Emmanuel Esposito *1 2, Saeed Masoudian *3, Hao Qiu 1, Dirk van der Hoeven 4, Nicolò Cesa-Bianchi 1 5, Yevgeny Seldin 3

Abstract

We study a K-armed bandit with delayed feedback and intermediate observations. We consider a model where intermediate observations have the form of a finite state, which is observed immediately after taking an action, whereas the loss is observed after an adversarially chosen delay. We show that the regime of the mapping of states to losses determines the complexity of the problem, irrespective of whether the mapping of actions to states is stochastic or adversarial. If the mapping of states to losses is adversarial, then the regret rate is of order √((K + d)T) (within log factors), where T is the time horizon and d is a fixed delay. This matches the regret rate of a K-armed bandit with delayed feedback and without intermediate observations, implying that intermediate observations are not helpful. However, if the mapping of states to losses is stochastic, we show that the regret grows at a rate of √((K + min{|S|, d})T) (within log factors), implying that if the number |S| of states is smaller than the delay, then intermediate observations help. We also provide refined high-probability regret upper bounds for non-uniform delays, together with experimental validation of our algorithms.

*Equal contribution. 1 Università degli Studi di Milano, Milan, Italy. 2 Istituto Italiano di Tecnologia, Genoa, Italy. 3 University of Copenhagen, Copenhagen, Denmark. 4 Korteweg-de Vries Institute for Mathematics, University of Amsterdam, Amsterdam, Netherlands. 5 Politecnico di Milano, Milan, Italy. Correspondence to: Emmanuel Esposito, Saeed Masoudian.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

Delay is a ubiquitous phenomenon that many sequential decision makers have to deal with. For example, outcomes of medical treatments are often observed with delay, purchase events happen with delay after advertisement impressions, and acceptance/rejection decisions for scientific papers are observed with delay after manuscript submissions. The impact of delay on the performance of sequential decision makers, measured by regret, has been extensively studied under full information and bandit feedback, and in stochastic and adversarial environments.

Yet, in many real-life situations intermediate observations may be available to the learner. For example, a health check-up might give a preliminary indication of the effect of a treatment, an advertisement click might be a precursor of an upcoming purchase, and preliminary reviews might provide some information about an upcoming acceptance or rejection decision. In this work we study when, and how, intermediate observations can be used to reduce the impact of delay in observing the final outcome of an action in a multi-armed bandit setting.

Online learning with delayed feedback and intermediate observations was studied by Mann et al. (2019) in a full-information setting, and then by Vernade et al. (2020) in a nonstationary stochastic bandit setting. In the paper of Vernade et al. (2020), at each time step the learner chooses an action and immediately observes a signal (also called state) belonging to a finite set.
The actual loss (i.e., feedback) incurred by the learner in that time step is only received with delay, which can be fixed or random. More formally, the observed state is drawn from a distribution that only depends on the chosen action, and the incurred loss is drawn from a distribution that only depends on the observed state (and not on the chosen action), forming a Markov chain:

Action At → State St = st(At) (no delay) → Loss ℓt(St) (observed after delay dt).

The work of Vernade et al. (2020) studies a setting where the mappings st are nonstationary and the losses ℓt are i.i.d. stochastic. In this work, we consider two possible regimes for the mappings st from actions to states (stochastic and adversarial) and two possible regimes for the mappings ℓt from states to losses (also stochastic and adversarial). Altogether, we study four different regimes, defined by the combination of the first and the second mapping type. We characterize (within logarithmic factors) the minimax regret rates for all of them, by giving upper and lower bounds. Similar to Vernade et al., we assume that the states are observed instantaneously, and we assume that the losses are observed with delay d. We show that the minimax regret rate is fully determined by the regime of the states-to-losses mapping, regardless of the regime of the actions-to-states mapping. The results are informally summarized in the following table, where K denotes the number of actions, S denotes the number of states, and T denotes the time horizon. It is assumed that the losses belong to the [0, 1] interval.

States to losses mapping    Regret (within log factors)
Adversarial                 √((K + d)T)
Stochastic                  √((K + min{S, d})T)

All of our upper bounds hold with high probability (with respect to the learner's internal randomization), irrespective of the regime of the actions-to-states mapping. We recall that (within logarithmic factors) the minimax regret rate in multi-armed bandits with delays and without intermediate observations is of order √((K + d)T) (Cesa-Bianchi et al., 2019). Therefore, we conclude that if the mapping from states to losses is adversarial, then intermediate observations do not help (in the minimax sense), because the regret rates are the same irrespective of whether the intermediate observations are used or not, and irrespective of whether the mapping from actions to states is stochastic or adversarial. However, if the mapping from states to losses is stochastic, and the number S of states is smaller than the delay d, then intermediate observations are helpful, and we provide an algorithm, Meta Ada BIO, which is able to exploit them. Our result improves on the Õ(√(KST)) regret bound obtained by Vernade et al. (2020) for the case of a stochastic and stationary actions-to-states mapping.

Our algorithm also applies to the more general setting of non-uniform delays (dt)_{t∈[T]}, where we achieve a high-probability regret bound of order √(KT + min{ST, DT}) (ignoring logarithmic factors). This improves upon the total delay term DT = d1 + ··· + dT similarly to the respective term in the fixed delay setting.
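To make the comparison in the table concrete, the following small Python snippet evaluates the two rates. It is purely illustrative: it ignores constants and logarithmic factors, and the parameter values are chosen to match Experiment 1 below.

```python
import math

def minimax_rate(K, T, d, S=None, stochastic_losses=False):
    """Regret rates from the table above, up to constants and log factors."""
    if stochastic_losses and S is not None:
        return math.sqrt((K + min(S, d)) * T)   # intermediate observations can help when S < d
    return math.sqrt((K + d) * T)               # same rate as delayed bandits without observations

K, T, d, S = 4, 10_000, 200, 3
print(minimax_rate(K, T, d))                               # adversarial states-to-losses mapping
print(minimax_rate(K, T, d, S=S, stochastic_losses=True))  # stochastic mapping: smaller, since S < d
```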
Related work. Adaptive clinical trials have served as an inspiration for the multi-armed bandit model (Thompson, 1933) and, interestingly, they have also pushed the field to study the effect of delayed feedback (Simon, 1977; Eick, 1988). In the bandit setting, Joulani et al. (2013) have studied a stochastic setting with random delays, whereas Neu et al. (2010; 2014) have studied a nonstochastic setting with constant delays. Cesa-Bianchi et al. (2019) have shown an Ω(max{√(KT), √(dT ln K)}) lower bound for nonstochastic bandits with uniformly delayed feedback, and an upper bound matching the lower bound within logarithmic factors by using an EXP3-style algorithm (Auer et al., 2002b), whereas Zimmert & Seldin (2020) have reduced the gap to the lower bound down to constants by using a Tsallis-INF approach (Zimmert & Seldin, 2021). Follow-up works have studied adversarial multi-armed bandits with non-uniform delays (Thune et al., 2019; Bistritz et al., 2019; 2022; György & Joulani, 2021; Van der Hoeven & Cesa-Bianchi, 2022), with Zimmert & Seldin (2020) providing a minimax optimal algorithm and Masoudian et al. (2022) deriving a matching lower bound and a best-of-both-worlds extension. Two key techniques for handling non-uniform delays are skipping, introduced by Thune et al. (2019), and parametrizing the algorithm by the number of outstanding observations (an observed quantity at action time), as opposed to the delays (an unobserved quantity at action time), introduced by Zimmert & Seldin (2020).

Paper structure. In Section 2 we provide a formal problem definition. In Section 3 we introduce two algorithms, Meta BIO and Meta Ada BIO, for the model of bandits with intermediate observations. In Section 4 we analyze both algorithms and prove high-probability regret bounds for the setting of adversarial action-state mappings and stochastic losses. In Section 5 we provide the lower bounds, and in Section 6 an experimental evaluation, concluding with a discussion in Section 7.

2. Problem definition

We consider an online learning setting with a finite set A = [K] of K ≥ 2 actions and a finite set S = [S] of S ≥ 2 states. In each round t = 1, 2, . . . the learner picks an action At ∈ A and receives a state St = st(At) ∈ S as an intermediate observation, according to some mapping st ∈ S^A. The learner also incurs a loss ℓt(St) ∈ [0, 1], which is only observed at the end of round t + dt, where the delay dt ≥ 0 is revealed to the learner only when the observation is received. The difficulty of this learning task depends on three elements, all initially unknown to the learner: the sequence of action-state mappings s1, . . . , sT ∈ S^A; the sequence of loss vectors ℓ1, . . . , ℓT ∈ [0, 1]^S; and the sequence of delays d1, . . . , dT ∈ N, where dt ≤ T − t for all t ∈ [T] without loss of generality.

Note that, unlike in standard bandits, here the losses are functions of the states instead of the actions. However, since actions are chosen without a priori information on the action-state mappings, learners have no direct control on the losses they will incur and, because of the delays, they also have no immediate feedback on the loss associated with the observed states. Note also that, for all t ≥ 1, the states st(a) for a ≠ At and the losses ℓt(s) for s ≠ St are never revealed to the algorithm. For brevity, we refer to this setting as (delayed) Bandits with Intermediate Observations (BIO).

In the setting of stochastic losses, we assume the loss vectors ℓt ∈ [0, 1]^S are sampled i.i.d. from some fixed but unknown distribution Q, and we let θ ∈ [0, 1]^S be the unknown vector of expected losses of the states. That is, ℓt(s) ∼ Q(· | s) has mean θ(s) for each t ∈ [T] and s ∈ S. Note that we allow dependencies between the stochastic losses of distinct states in the same round, but require losses to be independent across rounds.
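To fix ideas, here is a minimal Python sketch of the BIO interaction protocol with a stochastic action-state mapping, Bernoulli losses, and a fixed delay d. All names (`BIOEnvironment`, `observe_state`, `arrivals`) are illustrative and not from the paper, and Bernoulli losses are just one convenient instance of the stochastic-losses assumption.

```python
import numpy as np

class BIOEnvironment:
    """Sketch of the BIO protocol: the state of round t is revealed immediately,
    while the loss of round t only arrives at the end of round t + d."""

    def __init__(self, state_probs, theta, delay, seed=0):
        self.P = np.asarray(state_probs)      # K x S matrix with P[a, s] = P(s | a)
        self.theta = np.asarray(theta)        # mean loss of each state, in [0, 1]
        self.d = delay
        self.rng = np.random.default_rng(seed)
        self.pending = {}                     # round -> (state, realized loss), not yet delivered

    def observe_state(self, t, action):
        s = int(self.rng.choice(len(self.theta), p=self.P[action]))
        loss = float(self.rng.random() < self.theta[s])    # Bernoulli(theta[s]) loss
        self.pending[t] = (s, loss)
        return s                              # intermediate observation, no delay

    def arrivals(self, t):
        """Delayed bandit feedback whose delay expires at the end of round t (i.e., j + d = t)."""
        j = t - self.d
        if j in self.pending:
            s, loss = self.pending.pop(j)
            return [(j, s, loss)]
        return []

# Interaction loop with an arbitrary (here: round-robin) policy, for illustration only.
env = BIOEnvironment(state_probs=[[0.9, 0.1], [0.2, 0.8]], theta=[0.2, 0.7], delay=5)
for t in range(1, 21):
    a = t % 2                                  # placeholder for the learner's choice A_t
    s = env.observe_state(t, a)                # state S_t = s_t(A_t), observed immediately
    for j, s_j, loss in env.arrivals(t):       # loss of round j, observed after the delay
        pass                                    # a learner would update on (j, loss) here
```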
In the setting of stochastic action-state mappings, we assume that each observed state St is independently drawn from a fixed but unknown distribution P(· | At). If both the losses and the action-state mappings are stochastic, then ℓt(St) is independent of At given St. When the losses or the action-state mappings are adversarial, we always assume oblivious adversaries.

Our main quantity of interest is the regret, measured via the learner's cumulative loss Σt ℓt(St), where St = st(At) and (At)_{t≥1} is the sequence of the learner's actions. In the case of stochastic losses, we measure the learner's performance by Σt θ(St). In the case of stochastic action-state mappings, we average each instantaneous loss over the random choice of the state: Σs ℓt(s)P(s | At) for adversarial losses and Σs θ(s)P(s | At) for stochastic losses. Regret is always computed with respect to the best action under the appropriate notion of cumulative loss. In particular, for stochastic action-state mappings, the cumulative losses of the best action are

  min_{a∈A} Σ_{t=1}^{T} Σ_{s∈S} ℓt(s)P(s | a)   and   min_{a∈A} Σ_{t=1}^{T} Σ_{s∈S} θ(s)P(s | a),

respectively.

3. Algorithm

In this section we introduce Meta BIO (Algorithm 1), which transforms any algorithm B tailored for the delayed setting without intermediate observations into an algorithm for our setting. We then propose Meta Ada BIO, a modification of Meta BIO that delivers an improved regret bound for our setting.

The idea of Meta BIO is to reduce the impact of delays using the information we get from intermediate observations. More precisely, if we have enough observations for the current state St at time t, we immediately feed B the estimate of the mean loss of this state, as if it were the actual loss at time t; otherwise, we wait for dt time steps and refine our estimate using the additional loss observations. There are two key steps in the design of our algorithm: how we construct the mean estimate, and when we use it instead of waiting for the actual loss. They are the steps highlighted in green in Algorithm 1. For all t ∈ [T] and s ∈ S, we use θ̃t(s) to denote the mean estimate of θ(s) at round t, and nt(s) to denote the number of observations for state s that we want to collect before using θ̃t(s). We add a subscript t to L(s) in Algorithm 1 to denote the set of observations collected by the end of round t. Thus, θ̃t(s) uses Nt(s) = |Lt(s)| observations.

Algorithm 1: Meta BIO
Input: algorithm B for standard delayed bandits, confidence parameter δ ∈ (0, 1)
Initialize L(s) = ∅ for all s ∈ S
for t = 1, . . . , T do
    Get At from B
    Observe St = st(At)
    for j : j + dj = t do
        Receive (j, ℓj(Sj))
        Update L(Sj) = L(Sj) ∪ {(j, ℓj(Sj))}
    Initialize feedback set M = ∅
    Compute nt(St)
    if |L(St)| ≥ nt(St) then add t to M
    for j ∈ M ∪ {j : j + dj = t and Nj(Sj) < nj(Sj)} do
        Compute θ̃t(Sj) from L(Sj)    // using δ
        Feed (j, Aj, θ̃t(Sj)) to B

Fixed delay setting. When all rounds have delay d, we simply choose nt(s) = d for all s ∈ S and t ∈ [T]. In other words, if we have at least d observations for some state, then we can compensate for the effect of the delays and construct a mean estimate that is well concentrated around the actual mean. Let

  θ̂t(s) = (1 / Nt(s)) Σ_{(j, ℓj(s)) ∈ Lt(s)} ℓj(s).    (1)

Then our mean loss estimate is a lower confidence bound for θ(s), defined by

  θ̃t(s) = max{0, θ̂t(s) − (1/2)εt(s)},   where   εt(s) = √((2 / Nt(s)) ln(4ST/δ)).    (2)
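The following Python sketch puts Algorithm 1 and the estimator in (1)–(2) together for the fixed-delay choice nt(s) = d. It is a minimal illustration, not the authors' implementation: the base algorithm B is assumed to expose hypothetical `get_action()` and `feed(round, action, loss)` methods, and the environment follows the interface sketched in Section 2.

```python
import math
from collections import defaultdict

class MetaBIO:
    """Sketch of Meta BIO (Algorithm 1) with the fixed-delay choice n_t(s) = d."""

    def __init__(self, base, n_states, horizon, delay, delta):
        self.base = base                       # algorithm B for standard delayed bandits
        self.S, self.T, self.d, self.delta = n_states, horizon, delay, delta
        self.obs = defaultdict(list)           # L(s): losses observed so far for state s
        self.waiting = {}                      # rounds fed only when their own loss arrives: j -> (A_j, S_j)

    def _theta_tilde(self, s):
        # Lower confidence bound of Eq. (1)-(2): empirical mean minus half the radius, clipped at 0.
        n = len(self.obs[s])
        mean = sum(self.obs[s]) / n
        eps = math.sqrt((2.0 / n) * math.log(4 * self.S * self.T / self.delta))
        return max(0.0, mean - 0.5 * eps)

    def play_round(self, t, env):
        a = self.base.get_action()
        s = env.observe_state(t, a)            # intermediate observation, no delay
        for j, s_j, loss in env.arrivals(t):   # delayed losses arriving at the end of round t
            self.obs[s_j].append(loss)
            if j in self.waiting:              # round j had too few observations at time j: feed it now
                a_j, _ = self.waiting.pop(j)
                self.base.feed(j, a_j, self._theta_tilde(s_j))
        if len(self.obs[s]) >= self.d:         # enough observations: feed the estimate immediately
            self.base.feed(t, a, self._theta_tilde(s))
        else:                                  # otherwise wait for the actual (delayed) loss of round t
            self.waiting[t] = (a, s)
        return a, s
```

Note that when a waiting round is finally fed, its estimate already includes the loss that just arrived, matching the refinement step described above.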
Arbitrary delay setting. In the arbitrary delay setting, where we do not have preliminary knowledge of the delays, we cannot use the delays to set nt(s). Instead, at the end of round t, we have access to the number of outstanding observations

  σt = |{j ∈ [t] : j + dj > t}|,

which is different from prior works, which consider the outstanding observations at the beginning of the round. Then, for any s ∈ S, we may set nt(s) = σt. With this choice, incurring zero delay at some round implies that we have received at least half of all the observations we could have received in the no-delay setting (see Appendix B.4). In Section 4 we see that this ensures that our mean estimate is well concentrated around its mean.

Since Algorithm 1 waits for the actual loss of round t only if Nt(St) < σt, the actual delay incurred by the algorithm is d̃t = dt·1[Nt(St) < σt], and L_{t+d̃t}(s) is the set of observations used to compute the estimate of the mean loss at time t. Because some observations may arrive at the same time, the high-probability analysis of Meta BIO requires these observations to be ordered. More precisely, we construct our mean estimate at time t + d̃t for the feedback of round t using the set

  L*t(s) = { (j, ℓj(s)) ∈ L_{t+d̃t}(s) : j + d̃j = t + d̃t ⟹ j ≤ t }.

Meta Ada BIO initially runs the base algorithm B by itself, ignoring the intermediate observations, while keeping track of the cumulative number of outstanding observations via the update Dt = Dt−1 + σt. As soon as Dt(3 ln K + ln(6/δ)) > 49 S T ln(8ST/δ), it interrupts this phase and, if t < T, runs Meta BIO(B, δ/2) for the remaining rounds.
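A sketch of the switching logic just described: run the base algorithm B by itself while tracking Dt, and hand the remaining rounds to Meta BIO once the condition fires. The interfaces (`base_factory`, `env`, `MetaBIO`) are the hypothetical ones used in the earlier sketches, and the choice of confidence parameters (δ/3 before the switch, δ/2 afterwards) follows the analysis in Section 4.

```python
import math

def switch_now(D_t, K, S, T, delta):
    """Switching test of Meta Ada BIO: D_t = sigma_1 + ... + sigma_t is the accumulated
    number of outstanding observations at the ends of the first t rounds."""
    return D_t * (3 * math.log(K) + math.log(6 / delta)) > 49 * S * T * math.log(8 * S * T / delta)

def run_meta_ada_bio(base_factory, env, K, S, T, delta):
    base = base_factory(delta / 3)            # before the switch, B is run with confidence delta/3
    actions, outstanding, D = {}, set(), 0
    for t in range(1, T + 1):
        a = base.get_action()
        actions[t] = a
        env.observe_state(t, a)               # the intermediate observation is ignored in this phase
        outstanding.add(t)
        for j, _s, loss in env.arrivals(t):   # plain delayed bandit feedback
            outstanding.discard(j)
            base.feed(j, actions[j], loss)
        D += len(outstanding)                 # sigma_t, the outstanding observations at the end of round t
        if t < T and switch_now(D, K, S, T, delta):
            # Hand over to Meta BIO with confidence delta/2, reusing the same base instance.
            # For simplicity, feedback still outstanding at the switch is ignored in this sketch.
            meta = MetaBIO(base, n_states=S, horizon=T, delay=env.d, delta=delta / 2)
            for r in range(t + 1, T + 1):
                meta.play_round(r, env)
            return
```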
4. Analysis

In this section we analyze Meta BIO and Meta Ada BIO in the setting of adversarial action-state mappings and stochastic losses. Besides the regret RT defined in Section 2, our analysis uses the auxiliary regret R̄T, which considers the realized losses instead of their means. The two quantities are close with high probability: each inequality in

  −√(2T ln(2K/δ)) ≤ RT − R̄T ≤ √(2T ln(2/δ))    (3)

individually holds with probability at least 1 − δ for any given δ ∈ (0, 1) (see Lemma A.1).

Let DT = Σ_{t=1}^{T} dt be the total delay. We start by showing an upper bound on the total actual delay D̃T = Σ_{t=1}^{T} dt·1[Nt(St) < σt] ≤ DT incurred by Meta BIO. Then, we provide a high-probability regret analysis of both Meta BIO and Meta Ada BIO. More precisely, we can show that Meta BIO incurs the delays of at most min{2Sσmax, T} rounds, where σmax = max_{t∈[T]} σt. In the worst case, these rounds correspond to those in the set

  Φ ∈ argmax_{J⊆[T]} { DJ : |J| = min{2Sσmax, T} },    (4)

where we denote DJ = Σ_{t∈J} dt for any J ⊆ [T]. Note that the set Φ is fully determined by the delay sequence d1, . . . , dT. Moreover, the total delay incurred by Meta BIO cannot be worse than the sum of the delays corresponding to the rounds in Φ, as stated in the lemma below.

Lemma 4.1 (Total actual delay). If Meta BIO is run with any algorithm B on delays (dt)_{t∈[T]}, then D̃T ≤ DΦ.

Lemma 4.1 (proof in Appendix B.1) implies that, if all delays are bounded by dmax, then D̃T ≤ 2Sσmax·dmax, which does not depend on T. In the fixed-delay setting with delay d, for example, we get a total effective delay of at most 2Sd², rather than the total delay dT we would incur without access to intermediate observations (when T is large enough).

We now turn Meta BIO into a concrete algorithm by instantiating B. Specifically, we use DAda-Exp3 (György & Joulani, 2021), a variant of Exp3 which does not use intermediate observations and is robust to delays. DAda-Exp3 has the following regret bound.

Theorem 4.2 (György & Joulani (2021, Corollary 4.2)). For any δ ∈ (0, 1), the regret of DAda-Exp3 with respect to the realized losses in adversarial bandits with arbitrary delays satisfies, with probability at least 1 − δ,

  R̄T ≤ 2√(3(2KT + DT) ln K) + (√((2KT + DT)/(3 ln K)) + σmax) ln(1/δ).

While Theorem 4.2 shows a high-probability bound on R̄T, Equation (3) shows that a high-probability bound for one notion of regret ensures a high-probability bound for the other. Although the original bound by György & Joulani (2021) was stated with dmax instead of σmax, we can replace the former with the latter by observing that, in the analysis of György & Joulani (2021, Theorem 4.1), dmax is only used to upper bound the number of outstanding observations. Note that σmax is never larger than dmax, so it is a well-behaved quantity that is not inflated by a few large delays. See Masoudian et al. (2022, Lemma 3) for a refined quantification of the relation between σmax and dmax.

If we consider a fixed confidence level δ ∈ (0, 1), then we can make the learning rate ηt and the implicit exploration term γt in DAda-Exp3 depend on the specific value of δ so as to achieve an improved regret bound (see Appendix B.2). This allows us to show that in the BIO setting with adversarial action-state mappings and stochastic losses, the regret RT of DAda-Exp3 is upper bounded by

  2√(2KT·C_{K,6δ}) + 2√(DT·C_{K,6δ}) + (σmax + 2)√(2 ln(2/δ))    (5)

with probability at least 1 − δ, where C_{K,δ} = 3 ln K + ln(12/δ). Next, we state the regret bound for Meta BIO. We remark that we initialize DAda-Exp3 with confidence parameter δ/2 so as to guarantee the high-probability bound as in (5) with probability at least 1 − δ/2, as required.

Theorem 4.3. Let δ ∈ (0, 1). If we run Meta BIO using DAda-Exp3, then with probability at least 1 − δ the regret of Meta BIO in the BIO setting with adversarial action-state mappings and stochastic losses satisfies

  RT ≤ 2√(2KT·C_{K,3δ}) + 7√(ST ln(8ST/δ)) + 2√(DΦ·C_{K,3δ}) + (σmax + 2)√(2 ln(4/δ)).

We begin the analysis of Theorem 4.3 by decomposing the regret into two parts: (i) the regret R̂ᴮT of DAda-Exp3 with losses θ̃t(St), and (ii) the gap RT − R̂ᴮT, corresponding to the cumulative error of the estimates fed to DAda-Exp3. For the first part, we follow an approach similar to György & Joulani (2021) and apply Neu (2015, Lemma 1) to obtain a concentration bound for the loss estimates defined using importance weighting along with implicit exploration. When using the actual losses, the application of Neu (2015, Lemma 1) is straightforward. However, when the mean loss estimate θ̃t(St) is used rather than the actual loss, there is a potential dependency between the chosen action At and θ̃t(St). In Appendix B.3 we carefully design a filtration to show that we may indeed use the high-probability regret bound of DAda-Exp3 in order to upper bound the first part (the regret R̂ᴮT defined in terms of the estimates θ̃t). The second part requires bounding the cumulative error of our estimator in (2) for the observed states {St}_{t∈[T]}. To this end, we use the Azuma–Hoeffding inequality to control the error of these estimates. Doing so causes a Õ(√(ST)) term to appear in the regret bound. The detailed proof of this part is in Appendix B.4, together with the proof of Theorem 4.3.

The presence of the Õ(√(ST)) term in the regret bound implies that, when S ≥ max{DT/T, K}, using intermediate feedback leads to no advantage over ignoring it. So we ideally want to recover the original bound in (5) when this happens. Meta Ada BIO solves this issue and gives the following regret guarantee. The proof of this result is deferred to Appendix B.5. We remark that, to achieve this bound, before the eventual switch we use DAda-Exp3 with confidence parameter set to δ/3, so as to guarantee a high-probability bound on R_{t'} with probability at least 1 − δ/2 over the first t' rounds during which DAda-Exp3 runs by itself.
Theorem 4.4. Let δ ∈ (0, 1). If we run Meta Ada BIO with DAda-Exp3, then with probability at least 1 − δ the regret of Meta Ada BIO in the BIO setting with adversarial action-state mappings and stochastic losses satisfies

  RT ≤ 4√(KT·C_{K,2δ}) + 3 min{7√(ST ln(8ST/δ)), √(DT·C_{K,2δ})} + 2√(DΦ·C_{K,2δ}) + (σmax + 2) ln(8/δ).

If we consider any upper bound dmax on the delays (dt)_{t∈[T]}, we can further observe that the regret RT of Meta Ada BIO (with DAda-Exp3) satisfies RT = Õ(√((K + min{S, dmax})T) + dmax√S) with high probability. This also follows from the fact that, as previously mentioned, we can bound the total delay of Meta BIO by DΦ ≤ 2S·dmax².

Given the previous regret bounds, we observe that we may further improve the dependency on the delays by adopting the idea of skipping rounds with large delays when computing the learning rates. This skipping idea was introduced by Thune et al. (2019) and has been leveraged by György & Joulani (2021) to show that DAda-Exp3 can achieve a refined high-probability regret bound; see György & Joulani (2021, Theorem 5.1). As a consequence, we can indeed provide an improved bound in our setting by following similar steps as in the proof of Theorem 4.3. The only main change is the adoption of the version of DAda-Exp3 that uses the skipping procedure.

Corollary 4.5. Let δ ∈ (0, 1). If we run Meta BIO with DAda-Exp3 with skipping (György & Joulani, 2021, Theorem 5.1), then with probability at least 1 − δ the regret of Meta BIO in the BIO setting with adversarial action-state mappings and stochastic losses satisfies the bound of Theorem 4.3 with the delay term 2√(DΦ·C_{K,3δ}) replaced by

  √(C_{K,δ}/ln K) · min_{R⊆Φ} { |R| + 2√(D_{Φ\R} ln K) }.

This result could also be extended in a similar way to Meta Ada BIO, so as to obtain the best achievable guarantee in the presence of intermediate feedback.

So far, we have provided high-probability guarantees for the regret of both Meta BIO and Meta Ada BIO, from which expectation bounds can also be derived (e.g., by setting δ ≈ 1/T). However, using the empirical mean estimators θ̂t as the mean loss estimators at time t and working directly with the expected regret allows us to improve the achievable bound by a polylogarithmic factor. Hence, for the expected regret we use Tsallis-INF (Zimmert & Seldin, 2020), a learning algorithm for the standard delayed bandit problem that uses a hybrid regularizer to deal with delays and gives a minimax-optimal expected regret bound. The proof of this expected regret upper bound is in Appendix B.6.

Proposition 4.6. If we execute Meta Ada BIO with Tsallis-INF (Zimmert & Seldin, 2020), and use the switching condition √(8Dt ln K) > 6√(ST ln(2ST)) at each round t ∈ [T], where Dt = Σ_{j=1}^{t} σj, then the regret of Meta Ada BIO in the BIO setting with adversarial action-state mappings and stochastic losses satisfies

  E[RT] ≤ 4√(KT) + √(8DΦ ln K) + 2 min{6√(ST ln(2ST)), √(8DT ln K)}.

Remark 4.7. In Meta BIO, we can replace T by t² in the definition of the confidence intervals in (2), removing the need for prior knowledge of the time horizon T. In Meta Ada BIO, we could use a doubling trick to avoid prior knowledge of T in the switching condition. On the other hand, knowledge of the number of states S is not required for expectation bounds on the regret of Meta BIO. However, removing the prior knowledge of S in the high-probability regret bounds is challenging. Indeed, to the best of our knowledge, there is no result in BIO that avoids prior knowledge of the number of states. Lifting this requirement in the high-probability analysis is thus an interesting question for future work.
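As a small illustration of the first point in Remark 4.7, the confidence radius of (2) can be made anytime by replacing T with t². The helper below is a sketch under that substitution, keeping the same ln(4S·/δ) form as in (2).

```python
import math

def eps_anytime(n_obs, t, S, delta):
    """Anytime confidence radius: Eq. (2) with the horizon T replaced by t**2,
    so the estimator needs no prior knowledge of T (Remark 4.7)."""
    return math.sqrt((2.0 / n_obs) * math.log(4 * S * t * t / delta))
```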
5. Lower Bounds

The lower bounds in this section are for the expected regret E[RT]. Since our algorithms provide high-probability guarantees, the upper bounds also apply to the expected regret. Throughout this section we make use of constant delays, i.e., dt = d for all t ∈ [T]. We first prove a general √(KT) lower bound for all algorithms in BIO, after which we specialize to particular cases.

We start by proving an Ω(√(KT)) lower bound for any algorithm in our setting and for any combination of stochastic or adversarial action-state mappings and loss vectors. The construction is a reduction to the standard bandit lower bound construction.

Theorem 5.1. Irrespective of whether the action-state mappings and loss vectors are stochastic or adversarial, there exists a sequence of losses such that any (possibly randomized) algorithm in BIO suffers regret E[RT] = Ω(√(KT)).

Proof. Our construction only uses two states, h1 and h2. The loss vectors, which are deterministic and do not change over time, are defined as follows: ℓt(h1) = 1 and ℓt(h2) = 0 for all t ≥ 0. The stochastic action-state mapping, which is also constant over time, is given by

  st(a) = h1 with probability pa,   st(a) = h2 with probability 1 − pa,

for all a ∈ A and t ≥ 0, where the probabilities pa are to be determined. Thus, the loss of an arm a is ℓt(st(a)) = ℓt(h1) = 1 with probability pa and ℓt(st(a)) = ℓt(h2) = 0 with probability 1 − pa. Since the loss is determined by the state, the learner receives bandit feedback without delay. We can then choose pa for a ∈ A to mimic the standard Ω(√(KT)) distribution-free bandit lower bound; see, e.g., Slivkins et al. (2019, Chapter 2). By Yao's minimax principle, the same lower bound also applies to the case with adversarial action-state mappings. Since the loss vectors are deterministic, this covers all possible cases in BIO.

Adversarial action-state mapping and stochastic losses. We first prove an Ω(√(ST)) lower bound for any number K ≥ 2 of actions. However, we do need a minor generalization of our setting to allow correlation between unseen losses. Specifically, we allow all pairs of losses ℓj(s), ℓj'(s') of distinct states s ≠ s' to be correlated whenever j' > j and j' − j ≤ d, while we guarantee the i.i.d. nature of the losses of any fixed state. Since E[ℓt(St)] = E[θ(St)], this does not affect the analysis of the upper bound on the regret of our algorithms, because E[RT] ≤ E[R̄T] (see Lemma A.3). However, for a high-probability upper bound, we need to relate RT and R̄T, which now leads to an additive Õ(√(ST)) term rather than an additive Õ(√T) term as in Equation (3). In the proof of the Ω(√(ST)) lower bound, we leverage the fact that losses are independent only across time steps for a fixed state, while they may depend on the losses of the other states. Note that our lower bound holds even when the learner knows the action-state assignments beforehand.

Theorem 5.2. Suppose that the action-state mapping is adversarial, the losses are stochastic, and dt = d for all t ∈ [T]. If T ≥ min{S, d}, then there exists a distribution of losses and a sequence of action-state mappings such that any (possibly randomized) algorithm suffers regret E[RT] = Ω(√(min{S, d}·T)).

We provide a sketch of the proof of Theorem 5.2 (see Appendix C for the full proof). First, suppose that S ≤ 2d. For the construction of the lower bound we only consider two actions and equally split the states over these two actions. Then, we divide the T time steps into blocks of length S/2 ≤ d. In each block, each state has the same loss. Since the block length is smaller than the delay, we have effectively created a two-armed bandit problem with T' = T/(S/2) rounds and loss range [0, S/2], for which we can prove an Ω(√(ST)) lower bound by showing an equivalent lower bound for the full information setting. If S > 2d, we use the same construction with only 2d states, and obtain an Ω(√(dT)) lower bound.
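For concreteness, here is a Python sketch (illustrative only) of the hard instance behind this proof sketch: K = 2 actions, blocks of length S' = min{S/2, d}, and one shared Bernoulli loss draw per block and action, so no feedback from a block is available before the block ends. The bias `eps` is left as a parameter; the full proof in Appendix C picks it of order √(S'/T).

```python
import numpy as np

def hard_instance(S, d, T, eps, seed=0):
    """Lower-bound construction of Theorem 5.2 (sketch), with action 0 biased to be better."""
    rng = np.random.default_rng(seed)
    S_prime = min(S // 2, d)                        # block length
    m = T // S_prime                                # number of blocks
    theta = np.array([0.5 - eps, 0.5])              # mean loss of states assigned to actions 0 and 1
    block_draws = rng.binomial(1, theta, size=(m, 2))    # one Bernoulli loss draw per block and action
    state_of = np.empty((m * S_prime, 2), dtype=int)     # s_t(a): state assigned to action a at time t
    loss_of = np.empty((m * S_prime, 2))                 # realized losses of those two states at time t
    for i in range(m):
        for offset in range(S_prime):
            t = i * S_prime + offset
            state_of[t] = (2 * offset, 2 * offset + 1)   # a fresh pair of states at every step of a block
            loss_of[t] = block_draws[i]                  # all states of a block share the block's draw
    return state_of, loss_of
```

Within a block the learner observes fresh states but no losses, so each block roughly contributes a single full-information comparison between the two actions, which is what drives the Ω(√(min{S, d}·T)) rate.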
Finally, we can show the following lower bound, whose proof can be found in Appendix C.

Theorem 5.3. Suppose that the action-state mapping is adversarial, the losses are stochastic, and dt = d for all t ∈ [T]. If T ≥ d + 1, then there exists a distribution of losses and a sequence of action-state mappings such that any (possibly randomized) algorithm suffers regret E[RT] = Ω(min{d + 1, √((d + 1)T)}).

This term is also present in the dynamic regret bound of NSD-UCRL2, where it is necessarily incurred by their analysis even in the stationary case (Vernade et al., 2020, Theorem 1). This last lower bound implies that the regret of our algorithm is near-optimal. Since the lower bound of Theorem 5.1 applies to the case where the action-state mapping is adversarial and the losses are stochastic, we obtain the following result as a corollary of Theorem 5.1, Theorem 5.2, and Theorem 5.3.

Corollary 5.4. Suppose that the action-state mapping is adversarial, the losses are stochastic, and dt = d for all t ∈ [T]. If T ≥ 1 + min{S, d}, then there exists a distribution of losses and a sequence of action-state mappings such that any (possibly randomized) algorithm suffers regret E[RT] = Ω(max{√(KT), √(min{S, d}·T), d + 1}).

Stochastic action-state mappings and adversarial losses. In this case we recover the standard lower bound for adversarial bandits with bounded delay.

Theorem 5.5. Suppose that the action-state mapping is stochastic, the losses are adversarial, and dt = d for all t ∈ [T]. Then there exists a stochastic action-state mapping and a sequence of losses such that any (possibly randomized) algorithm suffers regret E[RT] = Ω(max{√(KT), √(dT)}).

Proof. Since by Theorem 5.1 we already know that any algorithm must suffer Ω(√(KT)) regret, we only need to show an Ω(√(dT)) lower bound. We use two states, h1 and h2. Our action-state mapping is deterministic and, for all t ≥ 0, assigns st(a) = h1 to all but one action a', to which the mapping assigns st(a') = h2. We have now constructed a two-armed bandit problem with delayed feedback and T rounds, for which an Ω(√(dT)) lower bound is known (Cesa-Bianchi et al., 2019).

Adversarial action-state mappings, adversarial losses. Since we can recover the construction of the lower bound in Theorem 5.5, we have the following result.

Corollary 5.6. Suppose that the action-state mapping is adversarial, the losses are adversarial, and dt = d for all t ∈ [T]. Then there exists an action-state mapping and a sequence of losses such that any (possibly randomized) algorithm suffers regret E[RT] = Ω(max{√(KT), √(dT)}).

6. Experiments

We empirically compare our algorithm Meta BIO with the following baselines: DAda-Exp3 (György & Joulani, 2021) for adversarial delayed bandits without intermediate observations (which we used to instantiate the algorithm B), the standard UCB1 algorithm (Auer et al., 2002a) for stochastic bandits without delays and intermediate observations, and NSD-UCRL2 (Vernade et al., 2020) for nonstationary stochastic action-state mappings and stochastic losses. We run all experiments with a time horizon of T = 10^4. All our plots show the cumulative regret of the algorithms as a function of time. The performance of each algorithm is averaged over 20 independent runs in every experiment, and the shaded areas show a range centered at the mean with half-width equal to the empirical standard deviation of these 20 repetitions.
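The experimental protocol can be summarized by the following sketch. It is illustrative only: the rounding and clipping of the Laplace delays to nonnegative integers is our assumption, and `env.step` stands for one interaction of a given algorithm with the environment.

```python
import numpy as np

def sample_delays(T, rng, fixed=None):
    """Delays used in the experiments: a fixed d, or i.i.d. Laplace(50, 25) draws."""
    if fixed is not None:
        return np.full(T, fixed, dtype=int)
    return np.clip(np.rint(rng.laplace(50, 25, size=T)), 0, None).astype(int)

def averaged_regret(make_algo, make_env, T=10_000, n_runs=20, seed=0):
    """Mean and standard deviation of the cumulative regret over n_runs independent runs,
    matching the curves and shaded areas in the figures."""
    curves = []
    for r in range(n_runs):
        rng = np.random.default_rng(seed + r)
        env, algo = make_env(rng), make_algo(rng)
        cum, curve = 0.0, np.empty(T)
        for t in range(T):
            cum += env.step(algo, t)          # one round; returns the instantaneous regret
            curve[t] = cum
        curves.append(curve)
    curves = np.stack(curves)
    return curves.mean(axis=0), curves.std(axis=0)
```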
In the first two experiments, we consider both fixed delays d ∈ {50, 100, 200} and random delays dt ∼ Laplace(50, 25), sampled i.i.d. from the Laplace distribution with E[dt] = 50.

[Figure 1: four panels for d = 50, d = 100, d = 200, and dt ∼ Laplace(50, 25), with curves for DAda-Exp3, Meta-BIO, and Meta-BIO with skipping.]
Figure 1. Cumulative regret over time for the stochastic action-state mapping when delays are fixed or random.

Experiment 1: stochastic action-state mappings. Here we use a stationary version of the experiments in Vernade et al. (2020); see Table 1 in Appendix D for details. We set K = 4 and S = 3, and we repeat this experiment for the previously mentioned values of the delays. Figure 1 shows that, across all delay regimes, Meta BIO largely improves on the performance of DAda-Exp3 by exploiting intermediate observations.

Experiment 2: adversarial action-state mappings. Here we simulate the adversarial mapping using a construction adapted from Zimmert & Seldin (2021): we alternate between two stochastic mappings while keeping the loss means fixed. We set K = 4 and S = 3, and we consider multiple instances for the different values of the delays, as in the previous experiment. The interval between two consecutive changes in the distribution of the action-state mappings grows exponentially. See Table 2 in Appendix D for details. Figure 2 shows that Meta BIO and Meta BIO with skipping outperform both UCB1 and NSD-UCRL2.

[Figure 2: four panels for d = 50, d = 100, d = 200, and dt ∼ Laplace(50, 25), with curves for DAda-Exp3, Meta-BIO, Meta-BIO with skipping, UCB1, and NSD-UCRL2.]
Figure 2. Cumulative regret over time for the adversarial action-state mapping when delays are fixed or random. All algorithms have small variance except for UCB1 and NSD-UCRL2.

Experiment 3: utility of intermediate observations. Here we set K = 8 and d = 100, and investigate how the performance of Meta BIO changes when the number S of states varies in {4, 6, 8, 10, 12}. The mean loss is always 0.2 for the optimal state and 1 for the others. The optimal action always maps to the optimal state. The suboptimal actions map to the optimal state with probability 0.6 and to a random suboptimal state with probability 0.4. This implies that the expected loss of each arm remains constant when the number of states changes. Figure 3 shows that the regret gap between Meta BIO and DAda-Exp3 shrinks as the number of states increases. This observation confirms our theoretical findings about the dependency of the regret on the number of states, which implies a larger improvement when there are fewer states.
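The action-state mapping of Experiment 3 can be written down directly. The snippet below builds the K × S matrix P(s | a) described above, under two assumptions of ours: the optimal action and state are indexed by 0, and the random suboptimal state is chosen uniformly. One can check that every suboptimal arm has expected loss 0.6 · 0.2 + 0.4 · 1 = 0.52 regardless of S.

```python
import numpy as np

def experiment3_mapping(K=8, S=10):
    """P[a, s] = probability that action a is mapped to state s (state 0 is the optimal state).
    Action 0 maps to state 0 deterministically; every other action maps to state 0 with
    probability 0.6 and to a uniformly random suboptimal state otherwise."""
    P = np.zeros((K, S))
    P[0, 0] = 1.0
    P[1:, 0] = 0.6
    P[1:, 1:] = 0.4 / (S - 1)
    theta = np.full(S, 1.0)
    theta[0] = 0.2                      # mean losses of the states
    expected_loss = P @ theta           # 0.2 for the optimal arm, 0.52 for the others, for any S
    return P, theta, expected_loss
```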
[Figure 3: cumulative regret curves for DAda-Exp3 and for Meta-BIO with S ∈ {4, 6, 8, 10, 12}.]
Figure 3. Cumulative regret over time of both DAda-Exp3 and Meta BIO with different numbers of states S ∈ {4, 6, 8, 10, 12}.

Experiment 4: performance of Meta Ada BIO when S < d. We use the same setting as in Experiment 1 with delay d = 20.² Figure 4 shows the performance of Meta Ada BIO compared with both DAda-Exp3 and Meta BIO. Before the switching point, Meta Ada BIO runs DAda-Exp3 (up to independent internal randomization). Afterwards, Meta Ada BIO switches to Meta BIO (which in turn runs DAda-Exp3 as a subroutine) and quickly aligns with its performance. Note that, at the switching time, Meta Ada BIO uses (via Meta BIO) the same instance of DAda-Exp3 that was already running, rather than starting a new instance. It can be shown that our analysis of Meta Ada BIO applies to this variant as well, without changes in the order of the bound.

² Compared to the switching condition used for the analysis of Meta Ada BIO, we replace 49ST ln(8ST/δ) with ST. This change allows the switching condition to be triggered more easily, to provide a better visualization of the behaviour of Meta Ada BIO, while it only introduces a polylogarithmic factor in its regret bound.

[Figure 4: cumulative regret curves for DAda-Exp3, Meta-BIO, and Meta-Ada-BIO.]
Figure 4. Cumulative regret over time of DAda-Exp3, Meta BIO and Meta Ada BIO. The vertical blue line marks the switching point of Meta Ada BIO.

Experiment 5: performance of Meta Ada BIO when S > d. We use a setting that is almost identical to that of Experiment 3, except that we set d = 4 and S = 14. The performance of the three algorithms is shown in Figure 5. We can observe that Meta Ada BIO does not switch to Meta BIO, and its performance is thus the same as that of DAda-Exp3, whereas Meta BIO incurs a larger regret.

[Figure 5: cumulative regret curves for DAda-Exp3, Meta-BIO, and Meta-Ada-BIO.]
Figure 5. Cumulative regret over time of DAda-Exp3, Meta BIO and Meta Ada BIO when S > d.

7. Future Work

The work of Vernade et al. (2020) also considers a nonstationary action-state mapping and derives bounds on the switching regret. Preliminary results suggest that, as long as there is an algorithm that can provide bounds on the switching regret with delayed feedback, our ideas also transfer to this setting. Unfortunately, there is currently no algorithm that can provide bounds on the switching regret with delayed feedback, and we leave this as a promising direction for future work.

Acknowledgements

EE, NCB, and HQ are partially supported by the MIUR PRIN grant Algorithms, Games, and Digital Markets (ALGADIMAR), by the EU Horizon 2020 ICT-48 research and innovation action under grant agreement 951847, project ELISE (European Learning and Intelligent Systems Excellence), and by the FAIR (Future Artificial Intelligence Research) project, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, investment 1.3, line on Artificial Intelligence). SM acknowledges funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801199. YS acknowledges partial support by the Independent Research Fund Denmark, grant number 9040-00361B. This work was mostly done while DvdH was at the University of Milan, partially supported by the MIUR PRIN grant Algorithms, Games, and Digital Markets (ALGADIMAR) and partially supported by the Netherlands Organization for Scientific Research (NWO), grant number VI.Vidi.192.095.

References

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.
The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002b. Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. H. Online EXP3 learning in adversarial bandits with delayed feedback. In Advances in Neural Information Processing Systems, 2019. Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. No weighted-regret learning in adversarial bandits with delays. Journal of Machine Learning Research, 23, 2022. Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 20, 2019. Eick, S. G. The two-armed bandit with delayed responses. The Annals of Statistics, 1988. Gyorgy, A. and Joulani, P. Adapting to delays and data in adversarial multi-armed bandits. In Proceedings of the 38th International Conference on Machine Learning, volume 139, 2021. Van der Hoeven, D. and Cesa-Bianchi, N. Nonstochastic bandits and experts with arm-dependent delays. In International Conference on Artificial Intelligence and Statistics, 2022. Joulani, P., Gyorgy, A., and Szepesv ari, C. Online learning under delayed feedback. In International Conference on Machine Learning, 2013. Mann, T. A., Gowal, S., Gy orgy, A., Hu, H., Jiang, R., Lakshminarayanan, B., and Srinivasan, P. Learning from delayed outcomes via proxies with applications to recommender systems. In International Conference on Machine Learning, 2019. Masoudian, S., Zimmert, J., and Seldin, Y. A best-ofboth-worlds algorithm for bandits with delayed feedback. In Advances in Neural Information Processing Systems, 2022. Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, 2015. Neu, G., Gy orgy, A., Szepesv ari, C., and Antos, A. Online markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, 2010. Neu, G., Gy orgy, A., Szepesv ari, C., and Antos, A. Online markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59, 2014. Simon, R. Adaptive treatment assignment methods and clinical trials. Biometrics, 33, 1977. Slivkins, A. et al. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(12), 2019. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 1933. Thune, T. S., Cesa-Bianchi, N., and Seldin, Y. Nonstochastic multiarmed bandits with unrestricted delays. Advances in Neural Information Processing Systems, 32, 2019. Vernade, C., Gy orgy, A., and Mann, T. A. Non-stationary delayed bandits with intermediate observations. In Proceedings of the 37th International Conference on Machine Learning, volume 119, 2020. Zimmert, J. and Seldin, Y. An optimal algorithm for adversarial bandits with arbitrary delays. In Proceedings on the International Conference on Artificial Intelligence and Statistics, 2020. Zimmert, J. and Seldin, Y. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22, 2021. Delayed Bandits: When Do Intermediate Observations Help? A. Auxiliary Results Lemma A.1. Consider any algorithm that picks actions (At)t [T ] in the adversarial delayed bandits problem with intermediate feedback with arbitrary action-state mappings (st)t [T ] and i.i.d. loss vectors (ℓt)t [T ]. Then, for any given δ (0, 1), RT RT p 2T ln(2/δ) and RT RT p 2T ln(2K/δ) individually hold with probability at least 1 δ. Proof. 
First, observe that we can relate the two notions of regret as θ(St) ℓt(St) + min a A t=1 ℓt(st(a)) min a A t=1 θ(st(a)) By Azuma-Hoeffding inequality, we can show that each side of θ(St) ℓt(St) holds with probability at least 1 δ . Now, define a ℓ arg min a A t=1 ℓt(st(a)) and a θ arg min a A t=1 θ(st(a)) . On the one hand, observe that t=1 ℓt(st(a θ)) t=1 θ(st(a θ)) where the last inequality holds with probability at least 1 δ by Azuma-Hoeffding inequality. On the other hand, we can show that t=1 ℓt(st(a ℓ)) t=1 θ(st(a ℓ)) =: ( ) . However, in this case a ℓdepends on the entire sequence ℓ1, . . . , ℓT . We thus need to use a union bound in order to show that t=1 ℓt(st(a)) t=1 θ(st(a)) where the last inequality follows by Azuma-Hoeffding inequality. We conclude the proof by setting δ = δ/2. Lemma A.2. The estimates (bθt)T t=1 defined in Equation (2) are such that |bθt(s) θ(s)| 1 2εt(s) simultaneously holds for all t [T] and all s S with probability at least 1 δ/2. Proof. In a similar way as in Vernade et al. (2020), define Xm(s) to be the empirical mean estimate for θ(s) which uses the first m [T] observed losses corresponding to state s S. Notice that bθt(s) = XN t(s)(s), while we define 2 m ln( 4ST δ ) so that εt(s) = ε N t(s)(s). We can additionally observe that E[Xm(s)] = θ(s). Then, we can use Azuma-Hoeffding inequality to show that |bθt(s) θ(s)| 1 |Xm(s) θ(s)| 1 Delayed Bandits: When Do Intermediate Observations Help? where we also used a union bound in the second inequality. Lemma A.3. Consider any algorithm that picks actions (At)t [T ] in the BIO setting with adversarial action-state mappings (st)t [T ] and stochastic loss vectors (ℓt)t [T ]. Assume that the losses for any fixed state are i.i.d., whereas pairs of losses ℓj(s), ℓj (s ) of distinct states s = s might be correlated when j > j and j j dj . Then, it holds that E[RT ] E[RT ], where the expectation is with respect to the stochasticity of the losses and the randomness of the algorithm. Proof. We know that E[ℓt(st(a))] = θ(st(a)) for any fixed a A and all t [T]. We further observe that E[ℓt(St)] = E h E[ℓt(st(At)) | At] i = E[θ(St)] holds for all t [T], as At is independent of losses that can be correlated with ℓt. Now, define a ℓ arg min a A t=1 ℓt(st(a)) and a θ arg min a A t=1 θ(st(a)) . Then, we conclude the proof by showing that t=1 E[ℓt(St)] E t=1 ℓt(st(a ℓ)) t=1 E[ℓt(St)] E t=1 ℓt(st(a θ)) t=1 E[θ(St)] t=1 θ(st(a θ)) = E[RT ] . B. High-Probability Regret Bound B.1. Total delay bound Lemma 4.1 (Total actual delay). If Meta BIO is run with any algorithm B on delays (dt)t [T ], then e DT DΦ. Proof of Lemma 4.1. For any s S, we define Ts = {t [T] : St = s} to be the set of all rounds when the state observed by the learner corresponds to s. Denote by ts the last time step t Ts such that Nt(s) < σt and let Cs = {t Ts : t ts} be those rounds in Ts that come no later than ts. According to the choice of ts, all the rounds in Ts for which learner waits for the respective delayed loss, must belong to Cs, while the learner incurs edt = 0 delay for rounds t Ts \Cs. Now we partition Cs into two sets: the observed set Cobs s = {t Cs : t + dt ts} and the outstanding set Cout s = {t Cs : t + dt > ts}. From the choice of ts, we can see that the number of rounds in Cobs s is |Cobs s | Nts(s) < σts σmax , and the number of rounds in Cout s is |Cout s | σts σmax . Therefore, we have |Cs| 2σmax. So if we define Call = S s S Cs, then |Call| min{2Sσmax, T} = |Φ|. This also implies that t Call dt X by definition of Φ. 
Delayed Bandits: When Do Intermediate Observations Help? B.2. Improved Regret for DAda-Exp3 for Fixed δ We follow the analysis of Theorem 4.1 in Gyorgy & Joulani (2021, Appendix A) and our goal is to use the knowledge of δ (0, 1) to tune the learning rates (ηt)t [T ] and the implicit exploration terms (γt)t [T ], accordingly. Let d1, . . . , d T be the sequence of delays perceived by DAda-Exp3, and let DT = PT t=1 dt be its total delay. Furthermore, let σt be the number of outstanding observations of DAda-Exp3 at the beginning of round t [T]. Suppose that we take γt = cηt with c > 0 for all t [T], then following the same analysis as in Gyorgy & Joulani (2021, Appendix A), we end up with the following regret bound that holds with probability at least 1 2δ for any δ (0, 1/2): t=1 ηt(σt + (c + 1)K) + ln(K/δ ) 2cηT + σmax + c + 1 2c ln(1/δ ) ln(K) + ln(K/δ ) t=1 ηt(σt 1 + (c + 1)K) + σmax + 1 2c ln(1/δ ) + ln(1/δ ) Therefore, by taking η 1 t = r (c+1)Kt+Pt j=1 σj 2 ln(K)+ 1 c ln(K/δ ), we get the following bound with probability at least 1 2δ : (c + 1)KT + ! 2 ln(K) + ln(K/δ ) 2c ln(1/δ ) + ln(1/δ ) We know that PT t=1 σt = DT by definition of σt. Then, we can set c = 1 to obtain that the regret RT (as per the original notion of regret used in Gyorgy & Joulani (2021)) is 2KT(3 ln(K) + ln(1/δ )) + 2 p DT (3 ln(K) + ln(1/δ )) + σmax + 2 2 ln(1/δ ) (9) with probability at least 1 2δ . From Lemma A.1, we have that RT RT + p 2T ln(2/δ ) (10) holds with probability at least 1 δ . So, combining Equations (9) and (10), and setting δ = 3δ , we can upper bound our notion of regret RT as 2KT(3 ln(K) + ln(3/δ)) + p 2T ln(6/δ) + 2 p DT (3 ln(K) + ln(3/δ)) + σmax + 2 2 ln(3/δ) (11) with probability at least 1 δ. B.3. Reduction to DAda-Exp3 via Meta BIO Based on the reduction via Meta BIO, we require that B guarantee a regret bound t=1 eθt(St) min a A t=1 eθt(st(a)) (12) that holds with high probability when the losses experienced by B are of the form eθt st(a) . Note that, even though the action-state mappings s1, . . . , s T are unknown to the learner, we can provide those losses as long as B requires bandit feedback only. Indeed, we can compute eθt(St) defined in Equations (1) and (2), while we cannot determine st(a) for all actions a A that are not At. As mentioned in Section 4, in this work we consider DAda-Exp3 (Gyorgy & Joulani, 2021) as algorithm B used by Meta BIO. In what follows, we refer to this specific choice for the algorithm B. The analysis of DAda-Exp3 for the high-probability bound (Theorem 4.2) is such that most steps only require that the loss of each action is bounded in [0, 1]. Then, those steps apply for any such sequence of loss vectors. However, the crucial part of that analysis that requires attention is the application of Lemma 1 from Neu (2015). We restate it below for reference. Delayed Bandits: When Do Intermediate Observations Help? Before that, we introduce the notation required for stating the result. We consider a learner choosing actions A1, . . . , AT according to probability distributions p1, . . . , p T over actions. We denote by Ft 1 the observation history of the learner until the beginning of round t. The result uses importance-weighted estimates for the losses ℓ1, . . . , ℓT with implicit exploration, where the implicit exploration parameter is γt 0 for each time t. These loss estimates are defined as eℓt(a) = 1[At = a] pt(a) + γt ℓt(a) t [T], a A . (13) Lemma B.1 (Neu (2015, Lemma 1)). 
Let γt and αt(a) be nonnegative Ft 1-measurable random variables such that αt(a) 2γt, for all t [T] and all a A. Let eℓt(a) be as in (13). Then, a=1 αt(a) eℓt(a) ℓt(a) ln(1/δ) holds with probability at least 1 δ for any δ (0, 1). In our case, we require an analogous result that work when loss vectors correspond with our estimates eθ1, . . . , eθT . However, these estimate have a dependency with the past actions chosen by the learner. This requires some nontrivial changes in the proof of Neu (2015, Lemma 1). Before that, we introduce some crucial definitions for this proof. Let ρ(t) = t + dt be the arrival time for the realized loss ℓt(St) of the state St observed at time t [T]. Let eρ(t) = t + edt be instead the arrival time perceived by algorithm B relative to its choice of At at time t, i.e., when B receives eθt(St). This also means that eθt(St) is only defined at time eρ(t) ρ(t). Let π: [T] [T] be the permutation of [T] that orders rounds according to their value of eρ. In other words, π satisfies the following property: π(r) < π(t) eρ(r) < eρ(t) (eρ(r) = eρ(t) r < t) r, t [T] . (14) This permutation allows us to sort rounds according to the order in which Meta BIO feeds B with a respective estimate for the mean loss. In particular, the r-th round in this order corresponds with the round tr = π 1(r), for any r [T]. Hence, we can equivalently define the round tr as the round such that its estimate eθtr(Str) for the mean loss θ(Str) is the r-th estimate received by B. Define Fr = {(j, Aj, Sj, ℓj(Sj)) | j [T], π(j) r} r [T] (15) as the information observed by B by the end to the time step when we feed it the estimate relative to round tr. Note that this defines a filtration, as Fr 1 Fr for all r [T], which has some desirable properties thanks to the ordering π we consider. In particular, we have that edtr, εtr, ptr, N tr are Fr 1-measurable random variables by the way we define them. This property is also due to the fact that Ntr and L tr are determined when conditioning on Fr 1. Moreover, we are now interested in the following importance-weighted loss estimates with implicit exploration: eℓt(a) = 1[At = a] pt(a) + γt eθt(st(a)) t [T], a A . (16) Corollary B.2. Let γtr and αtr(a) be non-negative Fr 1-measurable random variables such that αtr(a) 2γtr, for all r [T] and all a A. Let eℓt(a) be as in (16). Then, a=1 αt(a) eℓt(a) eθt(st(a)) ln(1/δ) holds with probability at least 1 δ for any δ (0, 1). Proof. We follow the proof of Neu (2015, Lemma 1) by considering any realization ℓ1, . . . , ℓT of the losses. The main difference is that, when defining the supermartingale as in the original proof, we need to consider the terms of the sum in the Delayed Bandits: When Do Intermediate Observations Help? order denoted by π instead of the increasing order of t. For this reason, we rewrite the sum from the statement by following the order given by π: T X a=1 αtr(a) eℓtr(a) eθtr(str(a)) . At this point, we need prove that E eℓtr(a) Fr 1 eθtr(str(a)), where we recall that tr = π 1(r). Also recall that εtr, ptr and γtr are Fr 1-measurable. This property allows us to prove the inequality with the conditional expectation of bθt instead of the one with the actual optimistic estimates eθt, by the definition of the latter. In other words, we now need to prove that E bℓtr(a) Fr 1 bθtr(str(a)), where bℓt(a) = 1[At=a] pt(a)+γt bθt(st(a)). We can consider two cases depending on whether edtr < dtr is true or not (and, thus, we are in the case edtr = dtr). 
In the first case, note that the realized losses used for computing bθtr(str(a)) correspond to time steps in L tr(str(a)), for which there is a corresponding tuple in Fr 1. Therefore, we have that bθtr(str(a)) is Fr 1-measurable, and we can show that E h bℓtr(a)1[edtr < dtr] Fr 1 i = E 1[Atr = a] ptr(a) + γtr 1[edtr < dtr] N tr(str(a)) j L tr (str (a)) ℓj(str(a)) . In the second case, we have that edtr = dtr, which implies that tr L tr(str(a)) in the case Atr = a. This means that we have a corresponding tuple in Fr 1 only for rounds in L tr(str(a)) \ {tr}. Nonetheless, this does not pose an issue since we have the indicator 1[Atr = a], and thus Str = st(a). Indeed, we have that E h bℓtr(a)1[edtr = dtr] Fr 1 i = E ptr(a) + γtr 1[edtr = dtr] N tr(str(a)) j L tr (str (a)) ℓj(str(a)) = E 1[Atr = a] ptr(a) + γtr 1[edtr = dtr] N tr(str(a)) j L tr (str (a)) j =tr + E 1[Atr = a] ptr(a) + γtr 1[edtr = dtr] N tr(str(a)) ℓtr(str(a)) = E 1[Atr = a] ptr(a) + γtr 1[edtr = dtr] N tr(str(a)) j L tr (str (a)) ℓj(str(a)) and therefore the inequality E h bℓtr(a) Fr 1 i = E 1[Atr = a] ptr(a) + γtr bθtr(str(a)) bθtr(str(a)) is true because 1[edt < dt] + 1[edt = dt] = 1 for all t [T], and by definition of bθt. As already mentioned, this is equivalent to proving that E eℓtr(a) Fr 1 eθtr(str(a)) holds. By using a notation similar to the original proof, if we define eλr = PK a=1 αtr(a)eℓtr(a) and λr = PK a=1 αtr(a)eθtr(str(a)), the process (Zr)r [T ] with Zr = exp(Pr j=1(eλj λj)) is a supermartingale with respect to (Fr)r [T ] which has the same properties as in the proof of Neu (2015, Lemma 1). This concludes the current proof by following a similar reasoning as in the original one. Thanks to this result, we can conclude that the adoption of DAda-Exp3 for the reduction via Meta BIO can guarantee a high-probability regret bound on b RB T as stated in Theorem 4.2, but with total delay e DT = PT t=1 edt instead of DT . B.4. Regret of Meta BIO By Lemma A.2, we have that t=1 eθt(St) min a A t=1 eθt(st(a)) + t=1 εt(St) = b RB T + t=1 εt(St) (17) Delayed Bandits: When Do Intermediate Observations Help? with probability at least 1 δ/2, where b RB T (Equation (12)) is the regret of algorithm B when fed with (eθt st)t [T ] as losses. Lemma B.3. Conditioning on the event as stated in Lemma A.2, the sum of errors suffered from Meta BIO by using the loss estimates (eθt)t [T ] from Equations (1) and (2) is t=1 εt(St) 4 + 2 Proof. First, observe that we can rewrite the sum of errors as t=1 εt(St) = t=1 εt(St)1[edt < dt] + t=1 εt(St)1[edt = dt] . We now provide an upper bound for the first sum of errors. For any s S, we define Ts = {t [T] : St = s} to be the set of all rounds when the state observed by the learner corresponds to s. We can bound it as t=1 εt(St)1[edt < dt] = X t Ts εt(s)1[edt < dt] 1 N t(s)1[edt < dt] 1 Mt(s)1[edt < dt] (because N t(s) 1 MT (s) (since Mt(s) is increasing over Ts) where the second inequality holds because N t(St) = Nt(St) 1 2Mt(St) when edt < dt since Mt(St) Nt(St) + σt, while the last one follows by Jensen s inequality and the fact that P s S MT (s) = T. As a last step, we provide an upper bound to the second sum. Let Js = {r Ts : edr = dr} and notice that |Js| |Ts| = MT (s). Observe that ρ(t) = eρ(t) for each round t such that edt = dt, and thus by Equation (14) we have that π(r) < π(t) ρ(r) < ρ(t) (ρ(r) = ρ(t) r < t) for all r, t [T] such that edr = dr and edt = dt. Define νs : Js |Js| by νs(t) = |{r Js : π(r) π(t)}| t Js . Observe that νs(t) N t(s) = |L t(s)| for all s S and all t Js. 
This is due to the fact that νs(t) counts a subset of L t(s); to be precise, we have that νs(t) = |L t(s) Js|. Moreover, notice that the condition π(r) π(t) defines a total order over Js. Hence, νs(t) counts the number of elements of Js preceding t Js (including t itself) in this total order. Delayed Bandits: When Do Intermediate Observations Help? This implies that νs is a bijection between Js and |Js| . Then, using a similar reasoning as before, we show that t=1 εt(St)1[edt = dt] = 1 N t(s)1[edt = dt] 1 N t(s) (by definition of Js) 1 νs(t) (since νs(t) N t(s) for t Js) |Js| (since νs(t) is bijective) MT (s) (since |Js| MT (s)) . (by Jensen s inequality) Theorem 4.3. Let δ (0, 1). If we run Meta BIO using DAda-Exp3, then the regret of Meta BIO in the BIO setting with adversarial action-state mappings and stochastic losses with probability at least 1 δ satisfies 2KTCK,3δ + 7 DΦCK,3δ + σmax + 2 Proof of Theorem 4.3. By Equation (17), the regret RT can be bounded as RT b RB T + t=1 εt(St) b RB T + 7 with probability at least 1 δ/2, where the last inequality follows by Lemma B.3. From what we argued in Appendix B.3, we can upper bound b RB T using the high-probability regret bound of DAda-Exp3. Notice that the delays incurred by DAda-Exp3 via Meta BIO are those given when providing the estimates (eθt)t [T ]. We denote these delays by ed1, . . . , ed T , and the total delay perceived by DAda-Exp3 is thus e DT = PT t=1 edt. Hence, from the improved bound for DAda-Exp3 in Equation (9), we have that 2KT(3 ln(K) + ln(4/δ)) + 2 q e DT (3 ln(K) + ln(4/δ)) + σmax + 2 holds with probability at least 1 δ/2. The combination of the above two inequalities, together with Lemma 4.1, concludes the proof. B.5. Regret of Meta Ada BIO Theorem 4.4. Let δ (0, 1). If we run Meta Ada BIO with DAda-Exp3, then the regret of Meta Ada BIO in the BIO setting with adversarial action-state mappings and stochastic losses with probability at least 1 δ satisfies KTCK,2δ + 2 p DΦCK,2δ + (σmax + 2) ln 8 Delayed Bandits: When Do Intermediate Observations Help? Proof of Theorem 4.4. Let t [T] be the last round before Meta Ada BIO switches from DAda-Exp3 to Meta BIO, i.e., the last round that satisfies Dt CK,4δ 49ST ln 8ST δ . Then, define a arg mina PT t=1 θ(st(a)). We may decompose regret as θ(St) θ(st(a )) + θ(St) θ(st(a )) t=1 θ(St) min a A t=1 θ(st(a)) t=t +1 θ(St) min a A t=t +1 θ(st(a)) | {z } Rt :T The incurred delay until time t is Dt . Thus, from Equation (11), we get that the following bound 2Kt CK,2δ + Dt CK,2δ + σmax + 2 holds with probability at least 1 δ/2, where we recall that CK,δ = 3 ln K + ln(12/δ). If our algorithm never switches, then t = T and we get the bound in (18) for RT . Note that this is no greater than the upper bound in the statement as p DT CK,2δ 7 p ST ln(8ST/δ) by definition of t in this case. Otherwise, we use the switching condition p Dt CK,2δ 7 p ST ln(8ST/δ) along with the fact that p t ln(12/δ) p Kt CK,2δ to get 2Kt CK,2δ + 14 δ + σmax + 2 Furthermore, Theorem 4.3 directly gives us an upper bound for Rt :T since Meta Ada BIO runs Meta BIO for t > t with the confidence parameter set to δ/2. We just need to bound the total incurred delays of these rounds, namely e Dt :T . Let σ t be the outstanding observations for any round t > t as perceived by the execution of Meta BIO starting after round t , that is, when considering only delays (dt)t>t . It is immediate to observe that σ t σt and thus maxt>t σ t maxt>t σt. 
Moreover, from Lemma 4.1 we have $\tilde D_{t^*:T}\le D_{\Phi'}$, where $\Phi'$ denotes a set of $\min\{T-t^*,\,2S\sigma'_{\max}\}$ rounds with the largest delays among $(d_t)_{t>t^*}$, with $\sigma'_{\max}=\max_{t>t^*}\sigma'_t$. We thus have $D_{\Phi'}\le D_\Phi$, due to the fact that $|\Phi'|=\min\{T-t^*,\,2S\sigma'_{\max}\}\le\min\{T,\,2S\sigma_{\max}\}=|\Phi|$. Therefore, from Theorem 4.3 we obtain
$$
R_{t^*:T}\le\sqrt{2K(T-t^*)\,C_{K,3\delta}}+7\sqrt{ST\ln\frac{8ST}{\delta}}+2\sqrt{D_\Phi\,C_{K,3\delta}}+\sigma_{\max}+2 \tag{20}
$$
with probability at least $1-\delta/2$. We conclude the proof by combining Equations (19) and (20), along with the fact that $\sqrt{t^*}+\sqrt{T-t^*}\le\sqrt{2T}$, to get that the bound
$$
R_T\le 2\sqrt{KT\,C_{K,2\delta}}+3\min\Big\{2\sqrt{D_\Phi\,C_{K,2\delta}},\ 7\sqrt{ST\ln\frac{8ST}{\delta}}\Big\}+2(\sigma_{\max}+2)
$$
holds with probability at least $1-\delta$.

B.6. Expected Regret Analysis of Meta Ada BIO with Tsallis-INF

Proposition 4.6. If we execute Meta Ada BIO with Tsallis-INF (Zimmert & Seldin, 2020), and use the switching condition $\sqrt{8D_t\ln K}>6\sqrt{ST\ln(2ST)}$ at each round $t\in[T]$, where $D_t=\sum_{j=1}^t\sigma_j$, then the expected regret of Meta Ada BIO in the BIO setting with adversarial action-state mappings and stochastic losses satisfies
$$
\mathbb{E}[R_T]\le 4\sqrt{2KT}+\sqrt{8D_\Phi\ln K}+2\min\Big\{6\sqrt{ST\ln(2ST)},\ \sqrt{8D_T\ln K}\Big\}\,.
$$

Proof of Proposition 4.6. We begin by studying the expected regret of Meta BIO, and we then give a regret analysis of Meta Ada BIO. When running Meta BIO, we use the unbiased empirical mean estimators $(\hat\theta_t)_{t\in[T]}$ as the mean loss estimates, rather than the lower confidence bounds $(\tilde\theta_t)_{t\in[T]}$. The expected regret is defined as
$$
\mathbb{E}[R_T]=\sum_{t=1}^T\mathbb{E}[\theta(S_t)]-\sum_{t=1}^T\theta(s_t(a^*))\,,
$$
where $a^*\in\arg\min_{a\in\mathcal A}\sum_{t=1}^T\theta(s_t(a))$. Here we use a version of Tsallis-INF that is tailored to the delayed bandits problem (Zimmert & Seldin, 2020), which guarantees a bound in expectation on the regret
$$
\hat R^{\mathrm{Tsallis}}_T(a)=\sum_{t=1}^T\hat\theta_t(S_t)-\sum_{t=1}^T\hat\theta_t(s_t(a))
$$
against any fixed action $a\in\mathcal A$, using the loss estimates $\{\hat\theta_t\}_{t\in[T]}$. Observe that this regret is defined in terms of our estimates, as required in our case. By Zimmert & Seldin (2020, Theorem 1), Tsallis-INF guarantees that its expected regret satisfies
$$
\mathbb{E}\big[\hat R^{\mathrm{Tsallis}}_T(a^*)\big]=\mathbb{E}\Bigg[\sum_{t=1}^T\hat\theta_t(S_t)-\sum_{t=1}^T\hat\theta_t(s_t(a^*))\Bigg]\le 4\sqrt{KT}+\sqrt{8\tilde D_T\ln K}\le 4\sqrt{KT}+\sqrt{8D_\Phi\ln K}\,,
$$
where the last inequality uses Lemma 4.1. Then, we can focus on our notion of regret and use the above regret bound to obtain
$$
\begin{aligned}
\mathbb{E}[R_T]&=\mathbb{E}\big[R_T-\hat R^{\mathrm{Tsallis}}_T(a^*)\big]+\mathbb{E}\big[\hat R^{\mathrm{Tsallis}}_T(a^*)\big]\\
&=\mathbb{E}\Bigg[\sum_{t=1}^T\big(\theta(S_t)-\hat\theta_t(S_t)\big)\Bigg]+\mathbb{E}\Bigg[\sum_{t=1}^T\big(\hat\theta_t(s_t(a^*))-\theta(s_t(a^*))\big)\Bigg]+\mathbb{E}\big[\hat R^{\mathrm{Tsallis}}_T(a^*)\big]\\
&\le\mathbb{E}\Bigg[\sum_{t=1}^T\big(\theta(S_t)-\hat\theta_t(S_t)\big)\Bigg]+\mathbb{E}\Bigg[\sum_{t=1}^T\big(\hat\theta_t(s_t(a^*))-\theta(s_t(a^*))\big)\Bigg]+4\sqrt{KT}+\sqrt{8D_\Phi\ln K}\,. \tag{21}
\end{aligned}
$$
We know that our mean estimator is unbiased. Therefore, we have that $\mathbb{E}[\hat\theta_t(s_t(a^*))]=\theta(s_t(a^*))$ for any $t\in[T]$, meaning that the second term on the right-hand side of (21) is equal to zero. On the other hand, we can apply Lemma A.2 to get the following bound on the first term, which we denote by $\Lambda=\sum_{t=1}^T\big(\theta(S_t)-\hat\theta_t(S_t)\big)$ and which holds with probability at least $1-\delta/2$ for any $\delta\in(0,1)$:
$$
\Lambda\le\min\Bigg\{\frac12\sum_{t=1}^T\varepsilon_t(S_t),\ T\Bigg\}\,, \tag{22}
$$
where we recall that $\varepsilon_t(s)=\sqrt{\frac{2}{\bar N_t(s)}\ln\frac{4ST}{\delta}}$. In particular, the inequality $\Lambda\le T$ is true in general, since each term of the sum is at most $1$. By Lemma B.3, we can bound the right-hand side of (22) as
$$
\frac12\sum_{t=1}^T\varepsilon_t(S_t)\le\frac72\sqrt{ST\ln\frac{4ST}{\delta}}
$$
when conditioning on the event in the statement of Lemma A.2. If we denote such an event by $E$, we have that $\mathbb{P}\big(\bar E\big)\le\delta/2$ and that $\mathbb{E}[\Lambda\mid E]\le\frac72\sqrt{ST\ln(4ST/\delta)}$. As a consequence, we notice that
$$
\mathbb{E}[\Lambda]=\mathbb{E}[\Lambda\mid E]\,\mathbb{P}(E)+\mathbb{E}\big[\Lambda\mid\bar E\big]\,\mathbb{P}\big(\bar E\big)\le\frac72\sqrt{ST\ln\frac{4ST}{\delta}}+T\cdot\frac\delta2\le\frac{7}{\sqrt2}\sqrt{ST\ln(2ST)}+1\,,
$$
where in the last inequality we set $\delta=2/T$. Since we assume that $S\ge2$, we can easily observe that $\mathbb{E}[\Lambda]\le 6\sqrt{ST\ln(2ST)}$. Plugging this into Equation (21) gives us
$$
\mathbb{E}[R_T]\le 4\sqrt{KT}+\sqrt{8D_\Phi\ln K}+6\sqrt{ST\ln(2ST)}\,. \tag{23}
$$
At this point, we can proceed to the proof of the overall bound on the expected regret of Meta Ada BIO. The behaviour of Meta Ada BIO follows the same principle as before, but the switching condition is different: $\sqrt{8D_t\ln K}>6\sqrt{ST\ln(2ST)}$.
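To make this switching rule concrete, the following minimal Python sketch computes the last round $t^*$ satisfying the Tsallis-INF condition above from the sequence of outstanding-observation counts $(\sigma_t)_{t\in[T]}$. This is an illustration under our own naming only (the function `last_round_before_switch` and the fixed-delay example are not part of the paper's code):

```python
import math

def last_round_before_switch(sigmas, K, S, T):
    """Return t*, the last round t with sqrt(8 * D_t * ln K) <= 6 * sqrt(S * T * ln(2 S T)),
    where D_t = sigma_1 + ... + sigma_t is the cumulative count of outstanding observations.
    After round t*, the meta-algorithm would hand control over to Meta BIO."""
    threshold = 6.0 * math.sqrt(S * T * math.log(2 * S * T))
    d_t, t_star = 0.0, 0
    for t, sigma_t in enumerate(sigmas, start=1):
        d_t += sigma_t
        if math.sqrt(8.0 * d_t * math.log(K)) <= threshold:
            t_star = t
        else:
            break  # D_t is non-decreasing, so the condition stays violated from here on
    return t_star

# Example with a fixed delay d: at round t there are min(t - 1, d) outstanding observations.
d, T, K, S = 50, 1000, 4, 3
sigmas = [min(t - 1, d) for t in range(1, T + 1)]
t_star = last_round_before_switch(sigmas, K, S, T)
```

Since $D_t$ only grows over time, the switch happens at most once, which is what allows the analysis below to split the horizon at the single round $t^*$.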
Similar to the analysis of Meta Ada BIO in Appendix B.5, we decompose the regret into
$$
\mathbb{E}[R_T]\le\underbrace{\sum_{t=1}^{t^*}\mathbb{E}[\theta(S_t)]-\min_{a\in\mathcal A}\sum_{t=1}^{t^*}\theta(s_t(a))}_{\mathbb{E}[R_{t^*}]}
+\underbrace{\sum_{t=t^*+1}^{T}\mathbb{E}[\theta(S_t)]-\min_{a\in\mathcal A}\sum_{t=t^*+1}^{T}\theta(s_t(a))}_{\mathbb{E}[R_{t^*:T}]}\,,
$$
where $t^*$ is the last round satisfying $\sqrt{8D_{t^*}\ln K}\le 6\sqrt{ST\ln(2ST)}$. Then, we have
$$
\mathbb{E}[R_{t^*}]\le 4\sqrt{Kt^*}+\sqrt{8D_{t^*}\ln K}\,. \tag{24}
$$
If $t^*=T$, then $R_{t^*}=R_T$ and we get the bound in (24), where we note that $\sqrt{8D_T\ln K}\le 6\sqrt{ST\ln(2ST)}$ by the definition of $t^*$ in this case, and we can replace $D_{t^*}$ by $D_T$. Otherwise, $t^*<T$ and we can apply the bound for Meta BIO from (23), along with the fact that the total delay incurred after round $t^*$ is upper bounded by $D_\Phi$, in order to derive the following upper bound for $\mathbb{E}[R_{t^*:T}]$:
$$
\mathbb{E}[R_{t^*:T}]\le 4\sqrt{K(T-t^*)}+\sqrt{8D_\Phi\ln K}+6\sqrt{ST\ln(2ST)}\,. \tag{25}
$$
Finally, if we use the fact that $\sqrt{8D_{t^*}\ln K}\le 6\sqrt{ST\ln(2ST)}$ (by the definition of $t^*$) in (24), and combine it with (25), we conclude that
$$
\mathbb{E}[R_T]\le 4\sqrt{2KT}+\sqrt{8D_\Phi\ln K}+2\min\Big\{6\sqrt{ST\ln(2ST)},\ \sqrt{8D_T\ln K}\Big\}\,,
$$
where we also used the fact that $\sqrt{Kt^*}+\sqrt{K(T-t^*)}\le\sqrt{2KT}$.

C. Proofs for the Lower Bounds

Theorem 5.2. Suppose that the action-state mapping is adversarial, that the losses are stochastic, and that $d_t=d$ for all $t\in[T]$. If $T\ge\min\{S,d\}$, then there exists a distribution of losses and a sequence of action-state mappings such that any (possibly randomized) algorithm suffers regret $\mathbb{E}[R_T]=\Omega\big(\sqrt{\min\{S,d\}\,T}\big)$.

Proof of Theorem 5.2. Assume without loss of generality that $K=2$ and let $\mathcal S=\{h_1,\dots,h_S\}$ be the finite set of possible states. Let $S'=\min\{\lfloor S/2\rfloor,\,d\}$ and let $I_1,\dots,I_T$ be the actions chosen by the considered algorithm. Split the $T$ time steps into $m=\lfloor T/S'\rfloor$ blocks $B_1,\dots,B_m$ of equal size $S'$, possibly leaving out at most $S'-1$ extra time steps. We assume with no loss of generality that the last time step corresponds to the end of the $m$-th block. The feedback formed by the losses of the actions chosen by the algorithm in a given block is received only after the last time step of that same block, since $S'\le d$. Define $b_i=(i-1)S'+1$ for all $i\in[m]$. We assume that the learner receives all the realized losses $\ell_t(s_t(A))$, for all $t\in B_i$ and all $A\in\{1,2\}$, at the end of each block, which means that we are in a full-information setting; this only helps the algorithm. Now, we define a specific sequence of assignments from actions to states, and we construct losses so that the expected regret becomes sufficiently large. Let $s_t(A)=h_{2(t-b_i)+A}$ for all $t\in B_i$, all $i\in[m]$, and all $A\in\{1,2\}$; this means that, in the first time step of any block, actions 1 and 2 are assigned to states $h_1$ and $h_2$ respectively, then to $h_3$ and $h_4$ respectively in the next time step of the same block, and so on. Let $\varepsilon=\frac14\sqrt{\frac{S'}{2T\ln(4/3)}}\in\big[0,\frac14\big]$ and let $\theta^{(A)}\in\mathbb{R}^2$ be a vector of mean losses such that $\theta^{(A)}_i=\frac12-\mathbb{I}\{i=A\}\,\varepsilon$ for each $A\in\{1,2\}$. We simplify the notation with $\mathbb{E}_A[\cdot]=\mathbb{E}[\cdot\mid\theta^{(A)}]$ and $\mathbb{P}_A(\cdot)=\mathbb{P}(\cdot\mid\theta^{(A)})$, where the conditioning on $\theta^{(A)}$ means that we sample the losses of each state assigned to action $i\in\{1,2\}$ as Bernoulli random variables with mean $\theta^{(A)}_i$. In particular, conditioning on $\theta^{(A)}$, we sample independent Bernoulli random variables $X^i_1,\dots,X^i_m$ with mean $\theta^{(A)}_i$, one for each block, for $i\in\{1,2\}$. Then, the losses are defined as $\ell_t(s_t(i))=X^i_j$ for each $t\in B_j$ and each $j\in[m]$. We can now proceed to show a lower bound on the expected pseudo-regret. Let $T_i$ be the number of times the learner chooses action $i$ over all $T$ time steps. The expected pseudo-regret over the two instances determined by $\theta^{(k)}$, for $k\in\{1,2\}$, adds up to
$$
\mathbb{E}_1[R_T]+\mathbb{E}_2[R_T]=\varepsilon\big(2T-\mathbb{E}_1[T_1]-\mathbb{E}_2[T_2]\big)\,.
$$
Following the standard analysis, we show that the difference $\mathbb{E}_2[T_2]-\mathbb{E}_1[T_2]$ satisfies
$$
\mathbb{E}_2[T_2]-\mathbb{E}_1[T_2]\le T\,d_{\mathrm{TV}}(\mathbb{P}_2,\mathbb{P}_1)\le T\sqrt{\tfrac12 D_{\mathrm{KL}}(\mathbb{P}_1\,\|\,\mathbb{P}_2)}\,,
$$
where the last step follows by Pinsker's inequality. Let $\lambda_i=\{(I_t,\ell_t(s_t(1)),\ell_t(s_t(2))):t\in B_i\}$ be the feedback set known to the learner by the end of block $B_i$, and let $\lambda^i=(\lambda_1,\dots,\lambda_i)$ be the tuple of all feedback sets up to the end of block $B_i$. Denote by $\mathbb{P}_{k,i}(\cdot)$ the probability measure of feedback tuples $\lambda^i$ conditioned on $\theta^{(k)}$. By the chain rule for the relative entropy, we can observe that
$$
D_{\mathrm{KL}}(\mathbb{P}_1\,\|\,\mathbb{P}_2)=\sum_{i=1}^m\sum_{\lambda^{i-1}}\mathbb{P}_1(\lambda^{i-1})\,D_{\mathrm{KL}}\big(\mathbb{P}_{1,i}(\cdot\mid\lambda^{i-1})\,\big\|\,\mathbb{P}_{2,i}(\cdot\mid\lambda^{i-1})\big)\le\sum_{i=1}^m\sum_{\lambda^{i-1}}\mathbb{P}_1(\lambda^{i-1})\cdot 16\varepsilon^2\ln(4/3)=16\,m\,\varepsilon^2\ln(4/3)\,,
$$
where we used the fact that each relative entropy $D_{\mathrm{KL}}(\mathbb{P}_{1,i}(\cdot\mid\lambda^{i-1})\,\|\,\mathbb{P}_{2,i}(\cdot\mid\lambda^{i-1}))$ corresponds to the sum of the relative entropy between two Bernoulli distributions with means $1/2$ and $1/2-\varepsilon$ and that between two Bernoulli distributions with means $1/2-\varepsilon$ and $1/2$, which is upper bounded by $16\varepsilon^2\ln(4/3)$ for $\varepsilon\in[0,1/4]$. This follows by an application of the chain rule for the relative entropy, as well as from the fact that the distribution of $I_t$ is the same under both $\mathbb{P}_{1,i}(\cdot\mid\lambda^{i-1})$ and $\mathbb{P}_{2,i}(\cdot\mid\lambda^{i-1})$, for all $t\in B_i$ and any $\lambda^{i-1}$. Therefore, we have that
$$
\mathbb{E}_2[T_2]-\mathbb{E}_1[T_2]\le 2\varepsilon T\sqrt{2m\ln(4/3)}\le\frac T2\,,
$$
which also implies that
$$
\mathbb{E}_1[R_T]+\mathbb{E}_2[R_T]\ge\frac{\varepsilon T}{2}=\frac18\sqrt{\frac{\lfloor S/2\rfloor\,T}{2\ln(4/3)}}\ge\frac18\sqrt{\frac{ST}{6\ln(4/3)}}\,,
$$
where we used the facts that $m\le T/S'$ and that $\lfloor S/2\rfloor\ge S/3$ for any integer $S\ge2$. This means that the expected pseudo-regret of the learner has to be at least $\frac1{16}\sqrt{\frac{ST}{6\ln(4/3)}}$ in at least one of the two instances. Now, for $S>2d$ we use the same construction, but we only use $2d$ states, which leads to the claimed $\Omega\big(\sqrt{\min\{S,d\}\,T}\big)$ lower bound.

Theorem 5.3. Suppose that the action-state mapping is adversarial, the losses are stochastic, and that $d_t=d$ for all $t\in[T]$. If $T\ge d+1$, then there exists a distribution of losses and a sequence of action-state mappings such that any (possibly randomized) algorithm suffers regret $\mathbb{E}[R_T]=\Omega\big(\min\big\{(d+1)\sqrt S,\ \sqrt{(d+1)T}\big\}\big)$.

Proof of Theorem 5.3. Let $S'=\min\big\{\lfloor S/2\rfloor,\ \lfloor T/(d+1)\rfloor\big\}\ge1$. We consider the first $(d+1)S'$ rounds of the game and divide them into $S'$ blocks $B_1,\dots,B_{S'}$ of the same length $d+1$. In this way, we ensure that the feedback for any time step of a given block is revealed to the learner only after its final round. Without loss of generality, we can assume that the learner observes all the losses of one block immediately after its last time step; this only helps the learner, since otherwise they would observe only the incurred losses, possibly at later rounds. We can further simplify the problem by assuming that the losses are deterministic functions of the states, i.e., $\ell_t\equiv\theta$ for every round $t$. This also means that the problem turns into an easier, full-information version of our problem with deterministic losses. Now, let the adversary choose the action-state mappings in such a way that, for each block index $i$ and each action $a\in\mathcal A$, we have $s_t(a)=s_{t'}(a)\in\{s_{2i-1},s_{2i}\}$ for all $t,t'\in B_i$. Furthermore, we assume that the losses are chosen such that $\theta(s_{2i-1})\in\{0,1\}$ and $\theta(s_{2i})=1-\theta(s_{2i-1})$ for all $i\in[S']$. In this construction, the learner cannot obtain any useful information from the states within a block because of the delays. Moreover, the states observed in one block are never observed again in the other blocks. It thus suffices to prove a lower bound for a standard full-information game with $S'$ rounds and losses with range $[0,d+1]$. Hence, we can conclude that the expected regret of any algorithm has to be
$$
\mathbb{E}[R_T]=\Omega\big((d+1)\sqrt{S'}\big)=\Omega\big(\min\big\{(d+1)\sqrt S,\ \sqrt{(d+1)T}\big\}\big)\,.
$$
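For completeness, the last equality can be unpacked as follows, using the definition of $S'$ from the proof above (this short derivation is ours, added only for clarity):
$$
(d+1)\sqrt{S'}=(d+1)\sqrt{\min\big\{\lfloor S/2\rfloor,\ \lfloor T/(d+1)\rfloor\big\}}
=\min\Big\{(d+1)\sqrt{\lfloor S/2\rfloor},\ (d+1)\sqrt{\lfloor T/(d+1)\rfloor}\Big\}
\ge\min\Big\{(d+1)\sqrt{S/3},\ \tfrac1{\sqrt2}\sqrt{(d+1)T}\Big\}\,,
$$
since $\lfloor S/2\rfloor\ge S/3$ for any integer $S\ge2$, and $(d+1)\sqrt{\lfloor T/(d+1)\rfloor}\ge\sqrt{(d+1)T}/\sqrt2$ whenever $T\ge d+1$. Both arguments of the minimum are within constant factors of $(d+1)\sqrt S$ and $\sqrt{(d+1)T}$, respectively, which yields the stated $\Omega\big(\min\{(d+1)\sqrt S,\sqrt{(d+1)T}\}\big)$ rate.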
D. Action-State Mappings and Loss Means Used in the Experiments

Table 1 and Table 2 describe the instances used to generate the data for the experiments of Section 6.

Table 1. Mean losses and stochastic action-state mapping for Experiment 1 in Section 6.

  Mean loss   s = 1   s = 2   s = 3
  θ(s)        0.2     0.4     0.8

  Mapping     P(1|a)  P(2|a)  P(3|a)
  a = 1       0.8     0.1     0.1
  a = 2       0.4     0.5     0.1
  a = 3       0.3     0.7     0.0
  a = 4       0.5     0.3     0.2

Table 2. Mean losses and stochastic action-state mappings for Experiment 2 in Section 6.

  Mean loss   s = 1   s = 2   s = 3
  θ(s)        0       1       1

  Environment 1
  Mapping     P(1|a)  P(2|a)  P(3|a)
  a = 1       0.06    0.47    0.47
  a = 2       0       0.50    0.50
  a = 3       0       0.50    0.50
  a = 4       0       0.50    0.50

  Environment 2
  Mapping     P(1|a)  P(2|a)  P(3|a)
  a = 1       1       0       0
  a = 2       0.94    0.03    0.03
  a = 3       0.94    0.03    0.03
  a = 4       0.94    0.03    0.03
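As a usage illustration only (this is not the authors' experiment code), the environments of Tables 1 and 2 can be instantiated as follows in Python. The class name, the choice of Bernoulli losses with the given means, and the particular delay value are our own assumptions:

```python
import numpy as np

THETA_EXP1 = np.array([0.2, 0.4, 0.8])   # mean loss of each state s = 1, 2, 3 (Table 1)
MAPPING_EXP1 = np.array([                 # row a, column s: P(s | a) (Table 1)
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.7, 0.0],
    [0.5, 0.3, 0.2],
])

THETA_EXP2 = np.array([0.0, 1.0, 1.0])    # mean losses for Experiment 2 (Table 2)
MAPPING_EXP2_ENV1 = np.array([
    [0.06, 0.47, 0.47],
    [0.00, 0.50, 0.50],
    [0.00, 0.50, 0.50],
    [0.00, 0.50, 0.50],
])
MAPPING_EXP2_ENV2 = np.array([
    [1.00, 0.00, 0.00],
    [0.94, 0.03, 0.03],
    [0.94, 0.03, 0.03],
    [0.94, 0.03, 0.03],
])

class BIOEnvironment:
    """Bandit with intermediate observations: the state is drawn from P(. | a) and
    revealed immediately, while a Bernoulli loss with mean theta[state] only becomes
    observable after a fixed delay d."""
    def __init__(self, mapping, theta, d, rng=None):
        self.mapping, self.theta, self.d = mapping, theta, d
        self.rng = rng or np.random.default_rng(0)

    def step(self, t, action):
        state = self.rng.choice(len(self.theta), p=self.mapping[action])
        loss = self.rng.binomial(1, self.theta[state])
        return state, (t + self.d, loss)   # the loss is revealed at round t + d

# Example: one round of Experiment 1 with an arbitrary delay d = 50.
env = BIOEnvironment(MAPPING_EXP1, THETA_EXP1, d=50)
state, (reveal_at, loss) = env.step(t=0, action=0)
```

Note how Environment 1 and Environment 2 of Table 2 differ only in how likely each action is to reach the zero-loss state $s=1$, which is what makes the comparison in Section 6 informative about the value of intermediate observations.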