# Bandits with Delayed, Aggregated Anonymous Feedback

Ciara Pike-Burke¹, Shipra Agrawal², Csaba Szepesvári³⁴, Steffen Grünewälder¹

¹Department of Mathematics and Statistics, Lancaster University, Lancaster, UK. ²Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, USA. ³DeepMind, London, UK. ⁴Department of Computing Science, University of Alberta, Edmonton, AB, Canada. Correspondence to: Ciara Pike-Burke.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We study a variant of the stochastic K-armed bandit problem, which we call bandits with delayed, aggregated anonymous feedback. In this problem, when the player pulls an arm, a reward is generated; however, it is not immediately observed. Instead, at the end of each round the player observes only the sum of a number of previously generated rewards which happen to arrive in the given round. The rewards are stochastically delayed and, due to the aggregated nature of the observations, the information of which arm led to a particular reward is lost. The question is: what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.

1. Introduction

The stochastic multi-armed bandit (MAB) problem is a prominent framework for capturing the exploration-exploitation tradeoff in online decision making and experiment design. The MAB problem proceeds in discrete sequential rounds, where in each round, the player pulls one of the K possible arms. In the classic stochastic MAB setting, the player immediately observes stochastic feedback from the pulled arm in the form of a reward which can be used to improve the decisions in subsequent rounds. One of the main application areas of MABs is in online advertising. Here, the arms correspond to adverts, and the feedback would correspond to conversions, that is, users buying a product after seeing an advert. However, in practice, these conversions may not necessarily happen immediately after the advert is shown, and it may not always be possible to assign the credit of a sale to a particular showing of an advert. A similar challenge is encountered in many other applications, e.g., in personalized treatment planning, where the effect of a treatment on a patient's health may be delayed, and it may be difficult to determine which out of several past treatments caused the change in the patient's health; or, in content design applications, where the effects of multiple changes in the website design on website traffic and footfall may be delayed and difficult to distinguish. In this paper, we propose a new bandit model to handle online problems with such delayed, aggregated and anonymous feedback. In our model, a player interacts with an environment of K actions (or arms) in a sequential fashion.
At each time step the player selects an action which leads to a reward generated at random from the underlying reward distribution. At the same time, a nonnegative random integer-valued delay is also generated i.i.d. from an underlying delay distribution. Denoting this delay by τ ≥ 0 and the index of the current round by t, the reward generated in round t will arrive at the end of the (t + τ)th round. At the end of each round, the player observes only the sum of all the rewards that arrive in that round. Crucially, the player does not know which of the past plays have contributed to this aggregated reward. We call this problem multi-armed bandits with delayed, aggregated anonymous feedback (MABDAAF).

As in the standard MAB problem, in MABDAAF, the goal is to maximize the cumulative reward from T plays of the bandit, or equivalently to minimize the regret. The regret is the total difference between the reward of the optimal action and the actions taken. If the delays are all zero, the MABDAAF problem reduces to the standard (stochastic) MAB problem, which has been studied considerably (e.g., Thompson, 1933; Lai & Robbins, 1985; Auer et al., 2002; Bubeck & Cesa-Bianchi, 2012).

Figure 1: The relative difficulties and problem independent regret bounds of the different problems: multi-armed bandits (e.g., Auer et al. (2002)), $O(\sqrt{KT \log T})$; delayed feedback bandits (e.g., Joulani et al. (2013)), $O(\sqrt{KT \log T} + K\mathbb{E}[\tau])$; bandits with delayed, aggregated anonymous feedback (this paper), $O(\sqrt{KT \log K} + K\mathbb{E}[\tau])$. For MABDAAF, our algorithm uses knowledge of E[τ] and a mild assumption on a delay bound, which is not required by Joulani et al. (2013).

Compared to the MAB problem, the job of the player in our problem appears to be significantly more difficult, since the player has to deal with (i) the fact that some feedback from the previous pulls may be missing due to the delays, and (ii) the fact that the feedback takes the form of the sum of an unknown number of rewards of unknown origin. An easier problem is when the observations are delayed, but they are non-aggregated and non-anonymous: that is, the player has to only deal with challenge (i) and not (ii). Here, the player receives delayed feedback in the form of action-reward pairs that inform the player of both the individual reward and which action generated it. This problem, which we shall call the (non-anonymous) delayed feedback bandit problem, has been studied by Joulani et al. (2013), and later followed up by Mandel et al. (2015) (for bounded delays). Remarkably, they show that compared to the standard (non-delayed) stochastic MAB setting, the regret will increase only additively by a factor that scales with the expected delay. For delay distributions with a finite expected delay, E[τ], the worst case regret scales with $O(\sqrt{KT \log T} + K\mathbb{E}[\tau])$. Hence, the price to pay for the delay in receiving the observations is negligible. QPM-D of Joulani et al. (2013) and SBD of Mandel et al. (2015) place received rewards into queues for each arm, taking one whenever a base bandit algorithm suggests playing the arm. Throughout, we take UCB1 (Auer et al., 2002) as the base algorithm in QPM-D. Joulani et al. (2013) also present a direct modification of the UCB1 algorithm. All of these algorithms achieve the stated regret. None of them require any knowledge of the delay distributions, but they all rely heavily upon the non-anonymous nature of the observations.
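To make the feedback model concrete, here is a minimal simulation sketch of the delayed, aggregated anonymous observation process described above. The Bernoulli rewards, the geometric delay distribution and the `DelayedAggregatedEnv` interface are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

class DelayedAggregatedEnv:
    """Illustrative sketch of the MABDAAF feedback model (not from the paper).

    Pulling an arm in round t generates a reward and an i.i.d. integer delay tau;
    the reward is added to the aggregated observation revealed at the end of
    round t + tau. Only these per-round sums are ever observed.
    """

    def __init__(self, means, expected_delay, rng=None):
        self.means = np.asarray(means)                 # Bernoulli reward means (assumption)
        self.p = 1.0 / (1.0 + expected_delay)          # geometric delay with this mean (assumption)
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.pending = {}                              # arrival round -> sum of rewards arriving then
        self.t = 0

    def pull(self, arm):
        """Play `arm` in the current round and return the aggregated observation X_t."""
        reward = float(self.rng.random() < self.means[arm])
        delay = int(self.rng.geometric(self.p)) - 1    # support {0, 1, 2, ...}, mean = expected_delay
        self.pending[self.t + delay] = self.pending.get(self.t + delay, 0.0) + reward
        x_t = self.pending.pop(self.t, 0.0)            # sum of all rewards arriving this round
        self.t += 1
        return x_t


env = DelayedAggregatedEnv(means=[0.5, 0.6], expected_delay=5.0)
print([env.pull(0) for _ in range(10)])                # anonymous per-round sums
```

Only the per-round sums are returned, so, exactly as in MABDAAF, the player never learns which past plays generated which rewards.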
While these results are encouraging, the assumption that the rewards are observed individually in a non-anonymous fashion is limiting for most practical applications with delays (e.g., recall the applications discussed earlier). How big is the price to be paid for receiving only aggregated anonymous feedback? Our main result is to prove that essentially there is no extra price to be paid, provided that the value of the expected delay (or a bound on it) is available. In particular, this means that detailed knowledge of which action led to a particular delayed reward can be replaced by the much weaker requirement that the expected delay, or a bound on it, is known. Fig. 1 summarizes the relationship between the non-delayed, the delayed and the new problem by showing the leading terms of the regret. In all cases, the dominant term is $\sqrt{KT}$ (up to logarithmic factors). Hence, asymptotically, the delayed, aggregated anonymous feedback problem is no more difficult than the standard multi-armed bandit problem.

1.1. Our Techniques and Results

We now consider what sort of algorithm will be able to achieve the aforementioned results for the MABDAAF problem. Since the player only observes delayed, aggregated anonymous rewards, the first problem we face is how to even estimate the mean reward of individual actions. Due to the delays and anonymity, it appears that to be able to estimate the mean reward of an action, the player wants to have played it consecutively for long stretches. Indeed, if the stretches are sufficiently long compared to the mean delay, the observations received during the stretch will mostly consist of rewards of the action played in that stretch. This naturally leads to considering algorithms that switch actions rarely, and this is indeed the basis of our approach.

Several popular MAB algorithms are based on choosing the action with the largest upper confidence bound (UCB) in each round (e.g., Auer et al., 2002; Cappé et al., 2013). UCB-style algorithms tend to switch arms frequently and will only play the optimal arm for long stretches if a unique optimal arm exists. Therefore, for MABDAAF, we will consider alternative algorithms where arm-switching is more tightly controlled. The design of such algorithms goes back at least to the work of Agrawal et al. (1988), where the problem of bandits with switching costs was studied. The general idea of these rarely switching algorithms is to gradually eliminate suboptimal arms by playing arms in phases and comparing each arm's upper confidence bound to the lower confidence bound of a leading arm at the end of each phase. Generally, this sort of rarely switching algorithm switches arms only O(log T) times. We base our approach on one such algorithm, the so-called Improved UCB algorithm of Auer & Ortner (2010). (The adjective "Improved" indicates that the algorithm improves upon the regret bounds achieved by UCB1; the improvement replaces $\log(T)/\Delta_j$ by $\log(T\Delta_j^2)/\Delta_j$ in the regret bound.)

Using a rarely switching algorithm alone will not be sufficient for MABDAAF. The remaining problem, and where the bulk of our contribution lies, is to construct appropriate confidence bounds and adjust the length of the periods of playing each arm to account for the delayed, aggregated anonymous feedback. In particular, in the confidence bounds attention must be paid to fine details: it turns out that unless the variance of the observations is dealt with, there is a blow-up by a multiplicative factor of K.
We avoid this by an improved analysis involving Freedman's inequality (Freedman, 1975). Further, to handle the dependencies between the number of plays of each arm and the past rewards, we combine Doob's optional skipping theorem (Doob, 1953) and Azuma-Hoeffding inequalities. Using a rarely switching algorithm for MABDAAF means we must also consider the dependencies between the elimination of arms in one phase and the corruption of observations in the next phase (i.e., past plays can influence both whether an arm is still active and the corruption of its next plays). We deal with this through careful algorithmic design.

Using the above, we provide an algorithm that achieves worst case regret of $O(\sqrt{KT \log K} + K\mathbb{E}[\tau] \log T)$ using only knowledge of the expected delay, E[τ]. We then show that this regret can be improved by using a more careful martingale argument that exploits the fact that our algorithm is designed to remove most of the dependence between the corruption of future observations and the elimination of arms. Particularly, if the delays are bounded with known bound $0 \le d \le \sqrt{T/K}$, we can recover worst case regret of $O(\sqrt{KT \log K} + K\mathbb{E}[\tau])$, matching that of Joulani et al. (2013). If the delays are unbounded but have known variance V(τ), we show that the problem independent regret can be reduced to $O(\sqrt{KT \log K} + K\mathbb{E}[\tau] + K\mathbb{V}(\tau))$.

1.2. Related Work

We have already discussed several of the most relevant works to our own. However, there has also been other work looking at different flavors of the bandit problem with delayed (non-anonymous) feedback. For example, Neu et al. (2010) and Cesa-Bianchi et al. (2016) consider nonstochastic bandits with fixed constant delays; Dudik et al. (2011) look at stochastic contextual bandits with a constant delay and Desautels et al. (2014) consider Gaussian Process bandits with a bounded stochastic delay. The general observation that delay causes an additive regret penalty in stochastic bandits and a multiplicative one in adversarial bandits is made in Joulani et al. (2013). The empirical performance of K-armed stochastic bandit algorithms in delayed settings was investigated in Chapelle & Li (2011). A further related problem is the batched bandit problem studied by Perchet et al. (2016). Here the player must fix a set of time points at which to collect feedback on all plays leading up to that point. Vernade et al. (2017) consider delayed Bernoulli bandits where some observations could also be censored (e.g., no conversion is ever actually observed if the delay exceeds some threshold), but require complete knowledge of the delay distribution. Crucially, here and in all the aforementioned works, the feedback is always assumed to take the form of arm-reward pairs, and knowledge of the assignment of rewards to arms underpins the suggested algorithms, rendering them unsuitable for MABDAAF. To the best of our knowledge, ours is the first work to develop algorithms to deal with delayed, aggregated anonymous feedback in the bandit setting.

1.3. Organization

The remainder of this paper is organized as follows: In the next section (Section 2) we give the formal problem definition. We present our algorithm in Section 3. In Section 4, we discuss the performance of our algorithm under various delay assumptions: known expectation, bounded support with known bound and expectation, and known variance and expectation. This is followed by a numerical illustration of our results in Section 5. We conclude in Section 6.

2. Problem Definition

There are K > 1 actions or arms in the set A.
Each action $j \in A$ is associated with a reward distribution $\zeta_j$ and a delay distribution $\delta_j$. The reward distribution is supported in [0, 1] and the delay distribution is supported on $\mathbb{N} := \{0, 1, \dots\}$. We denote by $\mu_j$ the mean of $\zeta_j$, let $\mu^* = \mu_{j^*} = \max_j \mu_j$, and define $\Delta_j = \mu^* - \mu_j$ to be the reward gap, that is, the expected loss of reward each time action j is chosen instead of an optimal action. Let $(R_{l,j}, \tau_{l,j})_{l \in \mathbb{N}, j \in A}$ be an infinite array of random variables defined on the probability space $(\Omega, \Sigma, \mathbb{P})$ which are mutually independent. Further, $R_{l,j}$ follows the distribution $\zeta_j$ and $\tau_{l,j}$ follows the distribution $\delta_j$. The meaning of these random variables is that if the player plays action j at time l, a payoff of $R_{l,j}$ will be added to the aggregated feedback that the player receives at the end of the $(l + \tau_{l,j})$th play. Formally, if $J_l \in A$ denotes the action chosen by the player at time l = 1, 2, ..., then the observation received at the end of the tth play is
$$X_t = \sum_{j=1}^{K} \sum_{l=1}^{t} R_{l,j}\, \mathbb{I}\{l + \tau_{l,j} = t,\ J_l = j\}.$$
For the remainder, we will consider i.i.d. delays across arms. We also assume discrete delay distributions, although most results hold for continuous delays by redefining the event $\{\tau_{l,j} = t - l\}$ as $\{t - l - 1 < \tau_{l,j} \le t - l\}$ in $X_t$. In our analysis, we will sum over stochastic index sets. For a stochastic index set I and random variables $\{Z_n\}_{n \in \mathbb{N}}$ we denote such sums as $\sum_{t \in I} Z_t := \sum_{t \in \mathbb{N}} \mathbb{I}\{t \in I\} Z_t$.

Regret definition. In most bandit problems, the regret is the cumulative loss due to not playing an optimal action. In the case of delayed feedback, there are several possible ways to define the regret. One option is to consider only the loss of the rewards received before horizon T (as in Vernade et al. (2017)). However, we will not use this definition. Instead, as in Joulani et al. (2013), we consider the loss of all generated rewards and define the (pseudo-)regret by
$$R_T = \sum_{t=1}^{T} (\mu^* - \mu_{J_t}) = T\mu^* - \sum_{t=1}^{T} \mu_{J_t}.$$
This includes the rewards received after the horizon T and does not penalize large delays as long as an optimal action is taken. This definition is natural since, in practice, the player should eventually receive all outstanding reward. Lai & Robbins (1985) showed that the regret of any algorithm for the standard MAB problem must satisfy
$$\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\log(T)} \ge \sum_{j: \Delta_j > 0} \frac{\Delta_j}{\mathrm{KL}(\zeta_j, \zeta^*)}, \quad (1)$$
where $\mathrm{KL}(\zeta_j, \zeta^*)$ is the KL-divergence between the reward distributions of arm j and an optimal arm. Theorem 4 of Vernade et al. (2017) shows that the lower bound in (1) also holds for delayed feedback bandits with no censoring and their alternative definition of regret. We therefore suspect (1) should hold for MABDAAF. However, due to the specific problem structure, finding a lower bound for MABDAAF is non-trivial and remains an open problem.

Assumptions on delay distribution. For our algorithm for MABDAAF, we need some assumptions on the delay distribution. We assume that the expected delay, E[τ], is bounded and known. This quantity is used in the algorithm.

Assumption 1. The expected delay E[τ] is bounded and known to the algorithm.

We then show that under some further mild assumptions on the delay, we can obtain better algorithms with even more efficient regret guarantees. We consider two settings: delay distributions with bounded support, and bounded variance.

Assumption 2 (Bounded support). There exists some constant d > 0 known to the algorithm such that the support of the delay distribution is bounded by d.

Assumption 3 (Bounded variance). The variance, V(τ), of the delay is bounded and known to the algorithm.

In fact, the known expected value and known variance assumptions can be replaced by known upper bounds on the expected value and variance respectively. However, for simplicity, in the remainder, we use E[τ] and V(τ) directly.
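As a sanity check on the notation, the observation formula and the pseudo-regret above translate directly into code. This is a minimal sketch: the array layout and 0-indexing are implementation choices, not part of the paper.

```python
import numpy as np

def aggregated_observations(R, tau, plays):
    """X_t = sum_{j,l} R[l, j] * 1{l + tau[l, j] == t, J_l == j}, for t = 0, ..., T-1.

    R and tau are (T, K) arrays of pre-drawn rewards R_{l,j} and integer delays
    tau_{l,j}; plays is the length-T sequence of chosen arms J_l (0-indexed here).
    """
    T = len(plays)
    X = np.zeros(T)
    for l, j in enumerate(plays):
        arrival = l + int(tau[l, j])
        if arrival < T:              # rewards arriving after the horizon are never observed
            X[arrival] += R[l, j]
    return X

def pseudo_regret(mu, plays):
    """R_T = T * mu_star - sum_t mu_{J_t}; counts every generated reward, observed or not."""
    mu = np.asarray(mu)
    return len(plays) * mu.max() - mu[np.asarray(plays)].sum()
```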
The next sections provide algorithms and regret analysis for different combinations of the above assumptions.

3. Our Algorithm

Our algorithm is a phase-based elimination algorithm based on the Improved UCB algorithm by Auer & Ortner (2010). The general structure is as follows. In each phase, each arm is played multiple times consecutively. At the end of the phase, the observations received are used to update mean estimates, and any arm with an estimated mean below the best estimated mean by a gap larger than a separation gap tolerance is eliminated. This separation tolerance is decreased exponentially over phases, so that it is very small in later phases, eliminating all but the best arm(s) with high probability. An alternative formulation of the algorithm is that at the end of a phase, any arm with an upper confidence bound lower than the best lower confidence bound is eliminated. These confidence bounds are computed so that with high probability they are more (less) than the true mean, but within the separation gap tolerance. The phase lengths are then carefully chosen to ensure that the confidence bounds hold. Here we assume that the horizon T is known, but we expect that this can be relaxed as in Auer & Ortner (2010).

Algorithm overview. Our algorithm, ODAAF, is given in Algorithm 1. It operates in phases m = 1, 2, .... Define $A_m$ to be the set of active arms in phase m. The algorithm takes parameter $n_m$, which defines the number of samples of each active arm required by the end of phase m. In Step 1 of phase m of the algorithm, each active arm j is played repeatedly for $n_m - n_{m-1}$ steps. We record all timesteps where arm j was played in the first m phases (excluding bridge periods) in the set $T_j(m)$. The active arms are played in any arbitrary but fixed order. In Step 2, the $n_m$ observations from timesteps in $T_j(m)$ are averaged to obtain a new estimate $\bar{X}_{m,j}$ of $\mu_j$. Arm j is eliminated if $\bar{X}_{m,j}$ is further than $\tilde{\Delta}_m$ below $\max_{j' \in A_m} \bar{X}_{m,j'}$.

A further nuance in the algorithm structure is the bridge period (see Figure 2). The algorithm picks an active arm $j \in A_{m+1}$ to play in this bridge period for $n_m - n_{m-1}$ steps. The observations received during the bridge period are discarded, and not used for computing confidence intervals. The significance of the bridge period is that it breaks the dependence between confidence intervals calculated in phase m and the delayed payoffs seeping into phase m+1. Without the bridge period this dependence would impair the validity of our confidence intervals. However, we suspect that, in practice, it may be possible to remove it.

Choice of $n_m$. A key element of our algorithm design is the careful choice of $n_m$. Since $n_m$ determines the number of times each active (possibly suboptimal) arm is played, it clearly has an impact on the regret. Furthermore, $n_m$ needs to be chosen so that the confidence bounds on the estimation error hold with the given probability.

Algorithm 1: Optimism for Delayed, Aggregated Anonymous Feedback (ODAAF)
Input: a set of arms, A; a horizon, T; a choice of $n_m$ for each phase m = 1, 2, ....
Initialization: set $\tilde\Delta_1 = 1/2$ (tolerance) and the set of active arms $A_1 = A$. Let $T_i(1) = \emptyset$ for all $i \in A$, m = 1 (phase index) and t = 1 (round index).
while t ≤ T do
  Step 1: Play arms.
    for each $j \in A_m$ do
      Let $T_j(m) = T_j(m-1)$.
      while $|T_j(m)| < n_m$ and $t \le T$ do
        Play arm j, receive $X_t$. Add t to $T_j(m)$. Increment t by 1.
      end while
    end for
  Step 2: Eliminate sub-optimal arms.
    For every arm $j \in A_m$, compute $\bar X_{m,j}$ as the average of the observations at time steps $t \in T_j(m)$, that is, $\bar X_{m,j} = \frac{1}{|T_j(m)|} \sum_{t \in T_j(m)} X_t$.
    Construct $A_{m+1}$ by eliminating actions $j \in A_m$ with $\bar X_{m,j} + \tilde\Delta_m < \max_{j' \in A_m} \bar X_{m,j'}$.
  Step 3: Decrease tolerance. Set $\tilde\Delta_{m+1} = \tilde\Delta_m / 2$.
  Step 4: Bridge period. Pick an arm $j \in A_{m+1}$ and play it $\nu_m = n_m - n_{m-1}$ times while incrementing $t \le T$. Discard all observations from this period and do not add these t to $T_j(m)$. Increment the phase index m.
end while

The main challenge is developing these confidence bounds from delayed, aggregated anonymous feedback. Handling this form of feedback involves a credit assignment problem of deciding which samples can be used for a given arm's mean estimation, since each sample is an aggregate of rewards from multiple previously played arms. This credit assignment problem would be hopeless in a passive learning setting without further information on how the samples were generated. Our algorithm utilizes the power of active learning to design the phases in such a way that the feedback can be effectively decensored without losing too many samples.

Figure 2: An example of phase i of our algorithm: the plays recorded in $T_j(i) \setminus T_j(i-1)$ for each active arm are followed by a bridge period.

A naive approach to defining the confidence bounds for delays bounded by a constant d ≥ 0 would be to observe that the sum of observations $\sum_{t \in T_j(m) \setminus T_j(m-1)} X_t$ differs from the sum of rewards $\sum_{t \in T_j(m) \setminus T_j(m-1)} R_{t,j}$ generated in this period by at most d, since all rewards are in [0, 1] and at most d rewards can enter or leave the period due to the bounded delay. Then we could use Hoeffding's inequality to bound $\sum_t R_{t,J_t}$ (see Appendix F) and select
$$n_m = C_1 \frac{\log(T \tilde\Delta_m^2)}{\tilde\Delta_m^2} + C_2\, m\, d$$
for some constants $C_1, C_2$. This corresponds to worst case regret of $O(\sqrt{KT \log K} + K \log(T)\, d)$. Since $d \ge \mathbb{E}[\tau]$, for large T this is significantly worse than that of Joulani et al. (2013). In Section 4, we show that, surprisingly, it is possible to recover the same rate of regret as Joulani et al. (2013), but this requires a significantly more nuanced argument to get tighter confidence bounds and smaller $n_m$.

In the next section, we describe this improved choice of $n_m$ for every phase $m \in \mathbb{N}$ and its implications on the regret, for each of the three cases mentioned previously: (i) known and bounded expected delay (Assumption 1), (ii) bounded delay with known bound and expected value (Assumptions 1 and 2), (iii) delay with known and bounded variance and expectation (Assumptions 1 and 3).

4. Regret Analysis

In this section, we specify the choice of parameters $n_m$ and provide regret guarantees for Algorithm 1 for each of the three previously mentioned cases.

4.1. Known and Bounded Expected Delay

First, we consider the setting with the weakest assumption on the delay distribution: we only assume that the expected delay, E[τ], is bounded and known. No assumption on the support or variance of the delay distribution is made. The regret analysis for this setting will not use the bridge period, so Step 4 of the algorithm could be omitted in this case.

Choice of $n_m$. Here, we use Algorithm 1 with
$$n_m = C_1 \frac{\log(T \tilde\Delta_m^2)}{\tilde\Delta_m^2} + C_2\, m\, \mathbb{E}[\tau] \quad (2)$$
for some large enough constants $C_1, C_2$. The exact value of $n_m$ is given in Equation (14) in Appendix B.
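The following is a compact Python sketch of Algorithm 1 in this known-E[τ] setting, using the schedule (2). The constants `c1`, `c2` are placeholders (the paper's exact values are in its Appendix B, Equation (14)), and `env.pull` is the hypothetical environment interface from the earlier sketch; none of these specifics are prescribed by the paper.

```python
import numpy as np

def odaaf(env, K, T, expected_delay, c1=2.0, c2=2.0):
    """Sketch of Algorithm 1 (ODAAF) under Assumption 1 with the schedule (2).

    The constants c1, c2 and the env.pull(arm) interface are illustrative
    assumptions; the paper's exact n_m is in its Appendix B, Eq. (14).
    """
    active = list(range(K))
    tol = 0.5                                   # separation tolerance, halved each phase
    sums = np.zeros(K)                          # running sums of observations per arm
    counts = np.zeros(K)                        # |T_j(m)| per arm (bridge plays excluded)
    n_prev, m, t = 0, 1, 0

    while t < T and len(active) > 1:
        # n_m = c1 * log(T * tol^2) / tol^2 + c2 * m * E[tau]   (cf. Eq. (2))
        n_m = int(np.ceil(c1 * max(np.log(T * tol ** 2), 1.0) / tol ** 2
                          + c2 * m * expected_delay))

        # Step 1: play each active arm until it has n_m recorded samples.
        for j in active:
            while counts[j] < n_m and t < T:
                sums[j] += env.pull(j)
                counts[j] += 1
                t += 1

        # Step 2: eliminate arms whose estimate is more than tol below the best.
        means = {j: sums[j] / max(counts[j], 1) for j in active}
        best = max(means.values())
        active = [j for j in active if means[j] + tol >= best]

        # Step 3: halve the tolerance.
        tol /= 2.0

        # Step 4: bridge period -- play one surviving arm n_m - n_{m-1} times, discard data.
        for _ in range(n_m - n_prev):
            if t >= T:
                break
            env.pull(active[0])                 # observation discarded, not added to T_j(m)
            t += 1

        n_prev, m = n_m, m + 1

    while t < T:                                # commit to a surviving arm for the rest of the horizon
        env.pull(active[0])
        t += 1
    return active
```

The bridge period is kept in the sketch for completeness; as noted above, the analysis of this known-E[τ] case does not actually require Step 4.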
Estimation of error bounds. We bound the error between $\bar X_{m,j}$ and $\mu_j$ by $\tilde\Delta_m / 2$. In order to do this we first bound the corruption of the observations received during timesteps $T_j(m)$ due to delays. Fix a phase m and arm $j \in A_m$. Then the observations $X_t$ in the period $t \in T_j(m) \setminus T_j(m-1)$ are composed of two types of rewards: a subset of rewards from plays of arm j in this period, and delayed rewards from some of the plays before this period. The expected value of observations from this period would be $(n_m - n_{m-1}) \mu_j$ but for the rewards entering and leaving this period due to delay. Since the reward is bounded by 1, a simple observation is that the expected discrepancy between the sum of observations in this period and the quantity $(n_m - n_{m-1}) \mu_j$ is bounded by the expected delay E[τ],
$$\Big| \mathbb{E}\Big[ \sum_{t \in T_j(m) \setminus T_j(m-1)} (X_t - \mu_j) \Big] \Big| \le \mathbb{E}[\tau]. \quad (3)$$
Summing this over phases $\ell = 1, \dots, m$ gives the bound
$$\big| \mathbb{E}[\bar X_{m,j}] - \mu_j \big| \le \frac{m\, \mathbb{E}[\tau]}{|T_j(m)|} = \frac{m\, \mathbb{E}[\tau]}{n_m}. \quad (4)$$
Note that given the choice of $n_m$ in (2), the above is smaller than $\tilde\Delta_m / 2$ when large enough constants are used. Using this, along with concentration inequalities and the choice of $n_m$ from (2), we can obtain the following high probability bound. A detailed proof is provided in Appendix B.1.

Lemma 1. Under Assumption 1 and the choice of $n_m$ given by (2), the estimates $\bar X_{m,j}$ constructed by Algorithm 1 satisfy the following: For every fixed arm j and phase m, with probability $1 - \frac{3}{T \tilde\Delta_m^2}$, either $j \notin A_m$, or $|\bar X_{m,j} - \mu_j| \le \tilde\Delta_m / 2$.

Regret bounds. Using Lemma 1, we derive the following regret bounds in the current setting.

Theorem 2. Under Assumption 1, the expected regret of Algorithm 1 is upper bounded as
$$\mathbb{E}[R_T] \le \sum_{j: \Delta_j > 0} O\Big( \frac{\log(T \Delta_j^2)}{\Delta_j} + \log(1/\Delta_j)\, \mathbb{E}[\tau] \Big). \quad (5)$$

Proof: Given Lemma 1, the proof of Theorem 2 closely follows the analysis of the Improved UCB algorithm of Auer & Ortner (2010). Lemma 1 and the elimination condition in Algorithm 1 ensure that, with high probability, any suboptimal arm j will be eliminated by phase $m_j = \lceil \log_2(1/\Delta_j) \rceil$, thus incurring regret at most $n_{m_j} \Delta_j$. We then substitute in $n_{m_j}$ from (2), and sum over all suboptimal arms. A detailed proof is in Appendix B.2. As in Auer & Ortner (2010), we avoid a union bound over all arms (which would result in an extra log K) by (i) reasoning about the regret of each arm individually, and (ii) bounding the regret resulting from erroneously eliminating the optimal arm by carefully controlling the probability it is eliminated in each phase.

Considering the worst-case values of $\Delta_j$ (roughly $\sqrt{K/T}$), we obtain the following problem independent bound.

Corollary 3. For any problem instance satisfying Assumption 1, the expected regret of Algorithm 1 satisfies $\mathbb{E}[R_T] \le O(\sqrt{KT \log K} + K \mathbb{E}[\tau] \log T)$.

4.2. Delay with Bounded Support

If the delay is bounded by some constant d ≥ 0 and a single arm is played repeatedly for long enough, we can restrict the number of arms corrupting the observation $X_t$ at a given time t. In fact, if each arm j is played consecutively for more than d rounds, then at any time $t \in T_j(m)$, the observation $X_t$ will be composed of the rewards from at most two arms: the current arm j, and the previously played arm j'. Further, from the elimination condition, with high probability, arm j' will have been eliminated if it is clearly suboptimal. We can then recursively use the confidence bounds for arms j and j' from the previous phase to bound $|\mu_j - \mu_{j'}|$. Below, we formalize this intuition to obtain a tighter bound on $|\bar X_{m,j} - \mu_j|$ for every arm j and phase m, when each active arm is played a specified number of times per phase.

Choice of $n_m$. Here, we define
$$n_m = C_1 \frac{\log(T \tilde\Delta_m^2)}{\tilde\Delta_m^2} + C_2\, \mathbb{E}[\tau] + \min\Big\{ m d,\ C_3 \frac{\log(T \tilde\Delta_m^2)}{\tilde\Delta_m^2} + C_4\, m\, \mathbb{E}[\tau] \Big\} \quad (6)$$
for some large enough constants $C_1, C_2, C_3, C_4$ (see Appendix C, Equation (18) for the exact values).
This choice of $n_m$ means that for large d, we essentially revert back to the choice of $n_m$ from (2) for the unbounded case, and we gain nothing by using the bound on the delay. However, if d is not large, the choice of $n_m$ in (6) is smaller than (2), since the second term now scales with E[τ] rather than m E[τ].

Estimation of error bounds. In this setting, by the elimination condition and bounded delays, the expectation of each reward entering $T_j(m)$ will be within $\tilde\Delta_{m-1}$ of $\mu_j$, with high probability. Then, using knowledge of the upper bound of the support of τ, we can obtain a tighter bound and get an error bound similar to Lemma 1 with the smaller value of $n_m$ in (6). We prove the following proposition. Since $\tilde\Delta_m = 2^{-m}$, this is considerably tighter than (3).

Proposition 4. Assume $n_i - n_{i-1} \ge d$ for phases i = 1, ..., m. Define $E_{m-1}$ as the event that all arms $j \in A_m$ satisfy the error bounds $|\bar X_{m-1,j} - \mu_j| \le \tilde\Delta_{m-1}/2$. Then, for every arm $j \in A_m$,
$$\Big| \mathbb{E}\Big[ \sum_{t \in T_j(m) \setminus T_j(m-1)} (X_t - \mu_j) \,\Big|\, E_{m-1} \Big] \Big| \le \tilde\Delta_{m-1}\, \mathbb{E}[\tau].$$

Proof: (Sketch). Consider a fixed arm $j \in A_m$. The expected value of the sum of observations $X_t$ for $t \in T_j(m) \setminus T_j(m-1)$ would be $(n_m - n_{m-1}) \mu_j$ were it not for some rewards entering and leaving this period due to the delays. Because of the i.i.d. assumption on the delay, in expectation, the number of rewards leaving the period is roughly the same as the number of rewards entering this period, i.e., E[τ]. (Conditioning on $E_{m-1}$ does not affect this due to the bridge period.) Since $n_m - n_{m-1} \ge d$, the reward coming into the period $T_j(m) \setminus T_j(m-1)$ can only be from the previously played arm j'. All rewards leaving the period are from arm j. Therefore the expected difference between rewards entering and leaving the period is $(\mu_{j'} - \mu_j)\, \mathbb{E}[\tau]$. Then, if $\mu_j$ is close to $\mu_{j'}$, the total reward leaving the period is compensated by the total reward entering. Due to the bridge period, even when j is the first arm played in phase m, we have $j' \in A_m$, so j' was not eliminated in phase m−1. By the elimination condition in Algorithm 1, if the error bounds $|\bar X_{m-1,j} - \mu_j| \le \tilde\Delta_{m-1}/2$ are satisfied for all arms in $A_m$, then $|\mu_j - \mu_{j'}| \le \tilde\Delta_{m-1}$. This gives the result.

Repeatedly using Proposition 4 we get
$$\Big| \sum_{i=1}^{m} \mathbb{E}\Big[ \sum_{t \in T_j(i) \setminus T_j(i-1)} (X_t - \mu_j) \,\Big|\, E_{i-1} \Big] \Big| \le \sum_{i=1}^{m} \tilde\Delta_{i-1}\, \mathbb{E}[\tau] \le 2\, \mathbb{E}[\tau],$$
since $\sum_{i=1}^{m} \tilde\Delta_{i-1} = \sum_{i=0}^{m-1} 2^{-i} \le 2$. Then, observe that $\mathbb{P}(E_i^C)$ is small. This bound is an improvement by a factor of m compared to (4). For the regret analysis, we derive a high probability version of the above result. Using this, and the choice of $n_m \ge \Omega\big( \frac{\log(T \tilde\Delta_m^2)}{\tilde\Delta_m^2} + \mathbb{E}[\tau] \big)$ from (6), for large enough constants, we derive the following lemma. A detailed proof is given in Appendix C.1.

Lemma 5. Under Assumption 1 of known expected delay and Assumption 2 of bounded delays, and the choice of $n_m$ given in (6), the estimates $\bar X_{m,j}$ obtained by Algorithm 1 satisfy the following: For any arm j and phase m, with probability at least $1 - \frac{12}{T \tilde\Delta_m^2}$, either $j \notin A_m$ or $|\bar X_{m,j} - \mu_j| \le \tilde\Delta_m / 2$.

Regret bounds. We now give regret bounds for this case.

Theorem 6. Under Assumption 1 and the bounded delay Assumption 2, the expected regret of Algorithm 1 satisfies
$$\mathbb{E}[R_T] \le \sum_{j: \Delta_j > 0} O\Big( \frac{\log(T \Delta_j^2)}{\Delta_j} + \mathbb{E}[\tau] + \min\Big\{ d,\ \frac{\log(T \Delta_j^2)}{\Delta_j} + \log(1/\Delta_j)\, \mathbb{E}[\tau] \Big\} \Big).$$

Proof: (Sketch). Given Lemma 5, the proof is similar to that of Theorem 2. The full proof is in Appendix C.2.

Then, if $d \le \sqrt{T/K} + \mathbb{E}[\tau]$, we get the following problem independent regret bound, which matches that of Joulani et al. (2013).

Corollary 7. For any problem instance satisfying Assumptions 1 and 2 with $d \le \sqrt{T/K} + \mathbb{E}[\tau]$, the expected regret of Algorithm 1 satisfies $\mathbb{E}[R_T] \le O(\sqrt{KT \log K} + K \mathbb{E}[\tau])$.
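Returning to the intuition behind Proposition 4, the following quick simulation checks that when an arm is played for a stretch longer than the delay bound, the expected corruption of that stretch's observation sum is roughly $(\mu_{j'} - \mu_j)\,\mathbb{E}[\tau]$, where j' is the previously played arm. The means, delay bound and stretch length are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_prev, mu_cur = 0.3, 0.6          # previously played arm j' and current arm j (illustrative)
d, stretch, reps = 20, 500, 2000    # bounded delays tau ~ Uniform{0, ..., d}, so E[tau] = 10

corruptions = []
for _ in range(reps):
    # rewards generated during a stretch of j' followed by a stretch of j
    rewards = np.concatenate([rng.random(stretch) < mu_prev,
                              rng.random(stretch) < mu_cur]).astype(float)
    delays = rng.integers(0, d + 1, size=2 * stretch)
    arrivals = np.arange(2 * stretch) + delays
    # sum of observations received while playing arm j (the second stretch)
    in_window = (arrivals >= stretch) & (arrivals < 2 * stretch)
    corruptions.append(rewards[in_window].sum() - stretch * mu_cur)

# should concentrate near (mu_prev - mu_cur) * E[tau] = (0.3 - 0.6) * 10 = -3.0
print(np.mean(corruptions))
```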
4.3. Delay with Bounded Variance

If the delay is unbounded but well behaved in the sense that we know (a bound on) the variance, then we can obtain similar regret bounds to the bounded delay case. Intuitively, rewards from the previous phase will only corrupt observations in the current phase if their delays exceed the length of the bridge period. We control this by using the bound on the variance to bound the tails of the delay distribution.

Choice of $n_m$. Let V(τ) be the known variance (or bound on the variance) of the delay, as in Assumption 3. Then, we use Algorithm 1 with the following value of $n_m$,
$$n_m = C_1 \frac{\log(T \tilde\Delta_m^2)}{\tilde\Delta_m^2} + C_2 \big( \mathbb{E}[\tau] + \mathbb{V}(\tau) \big) \quad (7)$$
for some large enough constants $C_1, C_2$. The exact value of $n_m$ is given in Appendix D, Equation (25).

Regret bounds. We get the following instance specific and problem independent regret bounds in this case.

Theorem 8. Under Assumption 1 and Assumption 3 of known (bounds on) the expectation and variance of the delay, and the choice of $n_m$ from (7), the expected regret of Algorithm 1 can be upper bounded by
$$\mathbb{E}[R_T] \le \sum_{j: \mu_j \ne \mu^*} O\Big( \frac{\log(T \Delta_j^2)}{\Delta_j} + \mathbb{E}[\tau] + \mathbb{V}(\tau) \Big).$$

Proof: (Sketch). See Appendix D.2. We use Chebyshev's inequality to get a result similar to Lemma 5 and then use a similar argument to the bounded delay case.

Corollary 9. For any problem instance satisfying Assumptions 1 and 3, the expected regret of Algorithm 1 satisfies $\mathbb{E}[R_T] \le O(\sqrt{KT \log K} + K \mathbb{E}[\tau] + K \mathbb{V}(\tau))$.

Remark. If E[τ] ≥ 1, then the delay penalty can be reduced to $O(K \mathbb{E}[\tau] + K \mathbb{V}(\tau)/\mathbb{E}[\tau])$ (see Appendix D). Thus, it is sufficient to know a bound on the variance to obtain regret bounds similar to those in the bounded delay case. Note that this approach is not possible just using knowledge of the expected delay, since we cannot guarantee that the reward entering phase i is from an arm active in phase i−1.

Figure 3: The ratios of regret of variants of our algorithm to that of QPM-D for different delay distributions (regret ratio plotted against time, up to T = 250000). (a) Bounded delays, with delay distributions Unif(0, 50), Unif(0, 100), and Exp with E[τ] = 50 (d = 75 and d = 250): ratios of regret of ODAAF (solid lines) and ODAAF-B (dotted lines) to that of QPM-D. (b) Unbounded delays, with delay distributions including Exp with E[τ] = 50 and Pois with E[τ] = 50: ratios of regret of ODAAF (solid lines) and ODAAF-V (dotted lines) to that of QPM-D.

5. Experimental Results

We compared the performance of our algorithm (under different assumptions) to QPM-D (Joulani et al., 2013) in various experimental settings. In these experiments, our aim was to investigate the effect of the delay on the performance of the algorithms. In order to focus on this, we used a simple setup of two arms with Bernoulli rewards and µ = (0.5, 0.6). In every experiment, we ran each algorithm to horizon T = 250000 and used UCB1 (Auer et al., 2002) as the base algorithm in QPM-D. The regret was averaged over 200 replications. For ease of reading, we define ODAAF to be our algorithm using only knowledge of the expected delay, with $n_m$ defined as in (2) and run without a bridge period, and ODAAF-B and ODAAF-V to be the versions of Algorithm 1 that use a bridge period and information on the bounded support and the finite variance of the delay to define $n_m$ as in (6) and (7) respectively. We tested the algorithms with different delay distributions. In the first case, we considered bounded delay distributions, whereas in the second case the delays were unbounded.
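A rough driver for this setup, reusing the hypothetical `DelayedAggregatedEnv` and `odaaf` sketches from earlier, is shown below. It substitutes the geometric delays of the earlier sketch for the uniform and exponential delays used in the paper and does not re-implement QPM-D, so it only illustrates running our algorithm, not the reported comparison.

```python
import numpy as np

T, K, E_TAU = 250_000, 2, 50.0
survivors = []
for rep in range(5):                       # the paper averages over 200 replications
    env = DelayedAggregatedEnv(means=[0.5, 0.6], expected_delay=E_TAU,
                               rng=np.random.default_rng(rep))
    survivors.append(odaaf(env, K=K, T=T, expected_delay=E_TAU))
print(survivors)                           # typically only arm 1 (mean 0.6) survives elimination
```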
In Fig. 3a, we plotted the ratios of the regret of ODAAF and ODAAF-B (with knowledge of d, the delay bound) to the regret of QPM-D. We see that in all cases the ratios converge to a constant. This shows that the regret of our algorithm is essentially of the same order as that of QPM-D. Our algorithm predetermines the number of times to play each active arm per phase (the randomness appears in whether an arm is active), so the jumps in the regret correspond to the algorithm changing arm. This occurs at the same points in all replications. Fig. 3b shows a similar story for unbounded delays with mean E[τ] = 50 (where N+ denotes the half-normal distribution). The ratios of the regret of ODAAF and ODAAF-V (with knowledge of the delay variance) to the regret of QPM-D again converge to constants. Note that in this case, these constants, and the location of the jumps, vary with the delay distribution and V(τ). When the variance of the delay is small, it can be seen that using the variance information leads to improved performance. However, for exponential delays where $\mathbb{V}(\tau) = \mathbb{E}[\tau]^2$, the large variance causes $n_m$ to be large and so the suboptimal arm is played more, increasing the regret. In this case ODAAF-V had only just eliminated the suboptimal arm at time T. It can also be illustrated experimentally that the regret of our algorithms and that of QPM-D all increase linearly in E[τ]. This is shown in Appendix E. We also provide an experimental comparison to Vernade et al. (2017) in Appendix E.

6. Conclusion

We have studied an extension of the multi-armed bandit problem to bandits with delayed, aggregated anonymous feedback. Here, a sum of observations is received after some stochastic delay and we do not learn which arms contributed to each observation. In this more difficult setting, we have proven that, surprisingly, it is possible to develop an algorithm that performs comparably to those for the simpler delayed feedback bandits problem, where the assignment of rewards to plays is known. Particularly, using only knowledge of the expected delay, our algorithm matches the worst case regret of Joulani et al. (2013) up to a logarithmic factor. This logarithmic factor can be removed using an improved analysis and slightly more information about the delay; if the delay is bounded, we achieve the same worst case regret as Joulani et al. (2013), and for unbounded delays with known finite variance, we have an extra additive V(τ) term. We supported these claims experimentally. Note that while our algorithm matches the order of regret of QPM-D, the constants are worse. Hence, it is an open problem to find algorithms with better constants.

Acknowledgments

CPB would like to thank the EPSRC funded EP/L015692/1 STOR-i centre for doctoral training and Sparx. We would like to thank the reviewers for their helpful comments.

References

Agrawal, R., Hedge, M., and Teneketzis, D. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10):899–906, 1988.

Auer, P. and Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
Bubeck, S. and Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012.

Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., and Stoltz, G. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Cesa-Bianchi, N., Gentile, C., Mansour, Y., and Minora, A. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pp. 605–622, 2016.

Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.

Desautels, T., Krause, A., and Burdick, J. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923, 2014.

Doob, J. L. Stochastic processes. John Wiley & Sons, 1953.

Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In Conference on Uncertainty in Artificial Intelligence, pp. 169–178, 2011.

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.

Joulani, P., György, A., and Szepesvári, C. Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453–1461, 2013.

Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Mandel, T., Liu, Y.-E., Brunskill, E., and Popovic, Z. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In AAAI, pp. 2849–2856, 2015.

Neu, G., Antos, A., György, A., and Szepesvári, C. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pp. 1804–1812, 2010.

Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.

Szita, I. and Szepesvári, C. Agnostic KWIK learning and efficient approximate reinforcement learning. In Conference on Learning Theory, pp. 739–772, 2011.

Thompson, W. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Vernade, C., Cappé, O., and Perchet, V. Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence, 2017.