Online Learning for Active Cache Synchronization

Andrey Kolobov 1, Sébastien Bubeck 1, Julian Zimmert 2

1 Microsoft Research, Redmond. 2 Google Research, Berlin; work on this paper was partially done during a visit to Microsoft Research, Redmond. Correspondence to: Andrey Kolobov. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Existing multi-armed bandit (MAB) models make two implicit assumptions: an arm generates a payoff only when it is played, and the agent observes every payoff that is generated. This paper introduces SYNCHRONIZATION BANDITS, a MAB variant where all arms generate costs at all times, but the agent observes an arm's instantaneous cost only when the arm is played. SYNCHRONIZATION MABs are inspired by online caching scenarios such as Web crawling, where an arm corresponds to a cached item and playing the arm means downloading its fresh copy from a server. We present MIRRORSYNC, an online learning algorithm for SYNCHRONIZATION BANDITS, establish an adversarial regret of O(T^{2/3}) for it, and show how to make it practical.

1. Introduction

Multi-armed bandits (MAB) (Robbins, 1952) have been widely applied in settings where an agent repeatedly faces K choices (arms), each associated with its own payoff distribution unknown to the agent at the start, and needs to eventually identify the arm with the highest mean payoff by pulling a subset of arms at a time and observing a payoff sampled from their distributions. MABs' defining property is that the agent observes an arm's instantaneous payoff when and only when the agent plays it. A key hidden assumption that goes hand-in-hand with it in existing bandit models is that each arm generates reward when and only when it is played, which, combined with the bandit feedback property, also implies that the agent observes all generated payoffs. In this paper, we go beyond these seemingly fundamental assumptions by identifying a class of practical settings that violate them and analyzing it using online learning theory.

Specifically, this paper formalizes scenarios that we call SYNCHRONIZATION MABs. In these settings, the agent can be thought of as holding copies of K files whose originals come from different remote sources. As time goes by, the files change at the sources, and their copies increasingly differ from the originals, becoming stale. The agent's task is to refresh these files by occasionally downloading their new copies from the remote sources, under a constraint B on the average number of downloads per time unit. For each file, the agent is continually penalized for its staleness. The expected penalty at each time step due to this file is a non-decreasing function of the time since the file's last refresh. Playing arm k here corresponds to refreshing file k: doing so temporarily reduces its staleness and thereby diminishes the cost incurred due to it per time unit. The goal is to find a synchronization policy that minimizes regret in terms of the average staleness penalties by refreshing files according to a well-chosen schedule. Crucially, at any moment the agent doesn't know how outdated its copy of a given file is, except at the time when it downloads its fresh copy, and therefore most of the time doesn't know the penalties it is incurring.
It observes the penalty only when it plays an arm, i.e., refreshes a file, and has a chance to see how different the cached copy was from the original right before the refresh. Even this action reveals only the instantaneous penalty due to this file, not the cumulative penalty the file has brought on since its last refresh.

SYNCHRONIZATION MABs are inspired by problems such as web crawl scheduling (Wolf et al., 2002; Cho & Garcia-Molina, 2003a; Azar et al., 2018; Kolobov et al., 2019a; Upadhyay et al., 2020) and database update management (Gal & Eckstein, 2001; Bright et al., 2006). All these settings involve a cache that must proactively initiate downloads to refresh its content. This is in contrast to, e.g., Web browser caches that passively monitor a stream of download requests initiated by another program. The few existing works on policy learning for active caching (Kolobov et al., 2019a; Upadhyay et al., 2020) apply only to specific penalty functions. In contrast, the theoretical results in this paper are independent of the penalties' functional form, and come with a practical online learning strategy for this model.

High-level analysis idea and paper outline. Online learning theory is a powerful tool for analyzing decision-making models where an agent operates in discrete-time instantaneous rounds by playing a candidate solution (arm), immediately getting feedback on it (a sample from the arm's payoff distribution), and using it to choose a candidate solution for the next round. Unfortunately, online learning's traditional assumptions clash with the properties of our setting. As Section 2 describes, SYNCHRONIZATION MAB is a continuous-time model with non-stationary, sparsely observable costs. Its candidate solutions are multi-arm policies (Section 3). Getting useful feedback on a policy, such as an estimate of its cost function or gradient, isn't instantaneous; it requires playing the policy for a non-trivial stretch of time. In Section 4, we present the MIRRORSYNC algorithm, which continuously plays a candidate policy along with exploratory arm pulls, periodically updating it with online mirror descent (Nemirovsky & Yudin, 1983; Bubeck, 2016). It uses a novel unbiased policy gradient estimator that operates in the face of sparse policy cost observations. Our regret analysis of MIRRORSYNC in Section 5 critically relies on the convexity of policy cost functions, a property we derive in Section 3 from minimal assumptions on SYNCHRONIZATION MABs' payoffs. The regret analysis treats time intervals between MIRRORSYNC's policy updates as learning rounds and thereby brings online learning theory to bear on SYNCHRONIZATION BANDITS. Section 6 introduces ASYNCMIRRORSYNC, a practical MIRRORSYNC variant that lifts MIRRORSYNC's idealizing assumptions. In Section 8, we compare the two algorithms empirically.

The contributions of this paper are thus as follows: (1) We cast active caching as an online learning problem with sparse feedback, enabling principled theoretical analysis of this setting under a variety of payoff distributions. (2) Based on this formulation, we propose a theoretically principled strategy for active caching under unknown payoff distributions and derive an adversarial regret bound of O(T^{2/3}) for it. In doing so, we overcome the challenges of sparse, temporally extended feedback inherent in this scenario, which existing online learning theory does not address.
(3) We present a practical variant of the above strategy that lifts the latter's assumptions and, as experiments demonstrate, has the same empirical convergence rate.

2. Model formalization

SYNCHRONIZATION BANDITS are a continuous-time MAB model with K arms. Besides operating in continuous time, it differs from existing MAB formalisms in the mechanism by which arms generate costs/rewards and in the observability of the generated costs from the agent's standpoint. In this section we detail both of these aspects, using the aforementioned cache update scenario as an illustrative example.

Cost-generating processes. In SYNCHRONIZATION MABs, every arm k incurs a stochastically generated cost ĉ_{k,t} at every time instant t, whether the arm is played or not. However, the distribution of arm k's possible instantaneous costs at time t depends on how much time has passed since the last time arm k was played to refresh the corresponding cached item. We denote the length of this time interval as τ_k(t) ∈ [0, ∞). Thus, arm k's cost generation process is described by a family of random variables {c_k(τ_k(t)) | t ≥ 0}, as stated in the following assumption:

Assumption 1. (Cost generation) At time t, each arm incurs a cost independently of being played by the agent. The instantaneous cost ĉ_{k,t} due to arm k at time t is sampled from a random variable c_{k,t} = c_k(τ_k(t)) s.t.: (1) τ_k(0) = 0; (2) c_k(0) = δ_0, the Dirac delta at 0; (3) there exists a bound U < ∞ s.t. supp(c_k(τ)) ⊆ [0, U] for every arm's cost generation process c_k and any time interval length τ ≥ 0.

By this assumption, for every arm and any amount of time τ since its latest play, its cost expectation is well-defined: c̄_k(τ) ≜ E[c_k(τ)].

Agent's knowledge and cost observability. While costs are generated by all arms continually, in our model the agent doesn't observe most of them, with an important exception:

Assumption 2. (Cost knowledge and observability) For each arm k, the agent observes a cost ĉ_{k,t} ∼ c_{k,t} at time t if and only if the agent plays arm k at that time. The agent doesn't know the distributions of the random variables c_{k,t}.

Assumption 2 is crucial in two ways. First, it means that our model provides only bandit feedback. Namely, the agent doesn't see arms' costs at all times, unlike in related models such as maintenance scheduling (Bar-Noy et al., 1998). Second, coupled with Assumption 1, it implies that there is no causal relationship between playing an arm and incurring a cost, even though such a relationship is an implicit assumption that standard bandit strategies rely on.

Arm play modes. At any time t, any of a SYNCHRONIZATION MAB's arms can be played in one of two modes:

Sync mode. Playing arm k in this mode at time t resets the arm's state, i.e., sets τ_k(t) ← 0. In addition, per Assumption 2, the agent observes the arm's instantaneous cost sample ĉ_{k,t} immediately before τ_k(t) is reset to 0. In the case of a cache, this means downloading a fresh copy of file k, estimating the difference between k's current original and the cached copy, and overwriting the cached copy with the new one.

Probe mode. By playing arm k in probe mode, the agent observes the arm's instantaneous cost, but the arm's state τ_k(t) is not reset. In caching settings, this corresponds to downloading a fresh copy of item k, but using it purely to estimate the difference between k's current original and the cached copy, without overwriting the cached copy.
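To make the interaction protocol concrete, here is a minimal simulator sketch of a SYNCHRONIZATION MAB environment satisfying Assumptions 1-3, written in Python (the language of the authors' released implementation). The class name, the polynomial mean-cost family, and the deterministic cost draw are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

class SyncBanditEnv:
    """Illustrative SYNCHRONIZATION MAB simulator (Assumptions 1-3).

    Arm k accrues an instantaneous cost c_k(tau_k(t)), where tau_k(t) is the
    time since arm k's last sync-mode play; the agent sees a cost sample only
    when it plays the arm (sync or probe), and only a sync play resets tau_k.
    """

    def __init__(self, mean_cost_fns, cost_cap):
        self.mean_cost_fns = mean_cost_fns             # one callable per arm: tau -> mean cost
        self.U = cost_cap                              # support bound U from Assumption 1
        self.last_sync = np.zeros(len(mean_cost_fns))  # tau_k(0) = 0 for every arm

    def observe(self, k, t):
        """Instantaneous cost of arm k at time t. Deterministic here for simplicity;
        the model allows any distribution on [0, U] with non-decreasing means and
        non-decreasing within-interval probe observations (Assumption 3)."""
        tau = t - self.last_sync[k]
        return min(self.mean_cost_fns[k](tau), self.U)

    def play(self, k, t, mode):
        """Play arm k at time t in 'sync' or 'probe' mode; return the observed cost."""
        cost = self.observe(k, t)                      # observed in both modes (Assumption 2)
        if mode == "sync":
            self.last_sync[k] = t                      # a sync play resets tau_k(t) to 0
        return cost

# Example: two arms with polynomial mean costs a_k * tau^{p_k}, capped at U = 40.
env = SyncBanditEnv([lambda tau: 1.0 * tau**0.5, lambda tau: 2.0 * tau**0.8], cost_cap=40.0)
c_probe = env.play(0, t=3.0, mode="probe")  # observe arm 0's cost; its state is not reset
c_sync = env.play(0, t=5.0, mode="sync")    # observe again, then reset tau_0 to 0
```

The only difference between the two play modes in this sketch is whether the arm's sync timestamp is reset, which mirrors the sync/probe distinction above.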
Since, by Assumption 1, c_k(0) = 0, playing an arm in sync mode gives the agent a way to temporarily reduce the expected rate at which the arm incurs costs. However, due to the following assumption, after a sync play the arm's cost generation rate starts growing again:

Assumption 3. (Cost monotonicity) For every arm k, the means c̄_k(τ_k(t)) of the instantaneous cost random variables c_k(τ_k(t)) are non-decreasing in the time τ_k(t) since the latest sync-mode play. If arm k was played in sync mode at time t_0, then any sequence of arm k's cost observations ĉ_{k,1}, ĉ_{k,2}, ... yielded by probe-mode plays after t_0 and until this arm's next sync-mode play at time t' > t_0 is non-decreasing.

Arm state τ_k(t) can be viewed as the amount of time that has passed since the arm's last sync by time point t; the more time has passed, the more cost the arm incurs per time unit. Playing an arm in sync mode simply resets this time counter. Thus, according to Assumption 3, not only does the total cost generated by arm k since its previous sync play grow as time goes by, which is to be expected, but so does the rate at which it accrues. Note that probe-mode arm plays don't help the agent reduce running costs directly. Instead, as we show in Section 4, they help the agent learn a good arm-playing policy faster.

Example. All of the above assumptions are natural in the real-world scenarios that inspired the SYNCHRONIZATION model. For instance, in Web crawling each online web page accumulates changes according to a temporal process D_k(t), which is widely assumed to be Poisson (i.e., memoryless) in the Web crawling literature (Wolf et al., 2002; Cho & Ntoulas, 2002; Cho & Garcia-Molina, 2003a;b; Azar et al., 2018; Kolobov et al., 2019a;b; Upadhyay et al., 2020). For each indexed page, the agent (the search engine) incurs a cost C_k(d) due to serving outdated search results, as a function of the total difference d between the indexed page copy and the online original. From this perspective, c_k(τ) = C_k(D_k(τ)), but at least one other approach models c_k(τ) directly as a function of a web page copy's age (Cho & Garcia-Molina, 2000). In either case, Assumption 3 holds: the more time passes since the page's last crawl, the higher the expected instantaneous penalty. Moreover, penalties don't decrease between two successive crawls of a page: e.g., if changes are generated by a Poisson process, their number can only grow with the time since the last refresh, and so can the penalty.

3. Policies and their cost functions

In order to derive a learning strategy for SYNCHRONIZATION BANDITS (Section 4) and its regret analysis (Section 5), we first derive the necessary building blocks: the cost of an arbitrary policy for this model, the class of policies that will serve as our algorithm's hypothesis space, and parameterized cost functions for policies of this class.

Policy costs. Our high-level aim is finding a SYNCHRONIZATION policy π that has a low expected average cost over an infinite time horizon. Whether a policy π is history-dependent, stochastic, or neither, executing it produces a schedule σ_k = ((t_1, l_1), (t_2, l_2), ...) for each arm k: a possibly infinite sequence of time points t_{n_k} when the arm is to be played and corresponding labels l_{n_k} specifying whether the arm should be played in probe or sync mode at that time.
For convenience, WLOG assume that t_0 always refers to t = 0, let τ_{n_k} ≜ t_{n_k} − t_{n_k−1}, and for any finite horizon H let N_k(H) be the index of schedule σ_k's largest time point not exceeding H:

$$N_k(H) \;\triangleq\; \begin{cases} \arg\max_{n\in\mathbb{N}}\{t_n \in \sigma_k \mid t_n \le H\} & \text{if such an } n \text{ exists,}\\ 0 & \text{otherwise.}\end{cases} \qquad (1)$$

Given this definition, let t_{N_k(H)+1} ≜ H. Recalling that each arm has a specific time-dependent cost distribution c_k(τ_k(t)) with mean c̄_k(τ_k(t)), we define the average infinite-horizon cost J^{σ_k}_k of arm k's schedule σ_k as

$$J^{\sigma_k}_k \;\triangleq\; \liminf_{H\to\infty}\frac{1}{H}\,\mathbb{E}\!\left[\int_0^{H} \hat c_{k,t}\,dt\right] \;=\; \liminf_{H\to\infty}\frac{1}{H}\sum_{n_k=1}^{N_k(H)+1}\int_0^{\tau_{n_k}} \bar c_k(\tau)\,d\tau. \qquad (2)$$

Denoting

$$C_k(\tau') \;\triangleq\; \int_0^{\tau'} \bar c_k(\tau)\,d\tau, \qquad (3)$$

we can rewrite J^{σ_k}_k's definition as

$$J^{\sigma_k}_k \;=\; \liminf_{H\to\infty}\frac{1}{H}\sum_{n_k=1}^{N_k(H)+1} C_k(\tau_{n_k}). \qquad (4)$$

Here, C_k(τ_{n_k}) is the total cost that arm k is expected to incur between the (n_k − 1)-th and n_k-th plays according to schedule σ_k. Thus, J^{σ_k}_k is just the average of these costs over the entire schedule. If the schedule stops playing arm k forever after some time t, J^{σ_k}_k may be infinite.

Running a policy π amounts to sampling a joint schedule σ = {σ_k}^K_{k=1}. Therefore, we define the policy cost J^π as

$$J^\pi \;\triangleq\; \mathbb{E}_{\sigma\sim\pi}\!\left[\liminf_{H\to\infty}\frac{1}{KH}\sum_{k=1}^{K}\sum_{n_k=1}^{N_k(H)+1} C_k(\tau_{n_k})\right]. \qquad (5)$$

Target policy class. Instead of considering all possible SYNCHRONIZATION policies as potential solutions, in this paper we focus on those whose sync-mode plays are periodic, with equal gaps between every two consecutive such plays of a given arm. For arm k, we denote the length of these gaps as 1/r_k > 0, r_k being a policy parameter for this arm and r ≜ (r_k)^K_{k=1} being the joint parameter vector for all arms. Importantly, our policies do allow probe-mode arm plays but don't restrict how the time points for these plays are chosen. In particular, they may be chosen stochastically, as long as the timings of sync-mode plays are deterministically periodic. Formally, for a scheduled arm pull time point t_{n_k} in schedule σ_k, let NextSync_{σ_k}(t_{n_k}) be the next sync-mode play of arm k in σ_k, i.e., t_{n'_k} = NextSync_{σ_k}(t_{n_k}) if t_{n'_k} = min{t_{n''_k} | n''_k > n_k, (t_{n''_k}, l_{n''_k}) ∈ σ_k, l_{n''_k} = sync}. Then our target policy class is

$$\Pi = \Big\{\pi(r)\;\Big|\; \forall\,\sigma\sim\pi(r),\ \forall k\in[K],\ \forall (t_{n_k}, l_{n_k})\in\sigma_k \text{ s.t. } l_{n_k}=\text{sync and } t_{n'_k}=\text{NextSync}_{\sigma_k}(t_{n_k}):\ t_{n'_k} - t_{n_k} = \tfrac{1}{r_k} \text{ for } r_k\in r\Big\}. \qquad (6)$$

Parameters r can be interpreted as the rates at which arms are played in sync mode. For π ∈ Π, policy costs are uniquely determined by the sync-mode play rates r: although these parameters ignore the timing of probe plays, probe plays don't affect cost generation and therefore don't affect policy cost. We let J(r) denote π(r) ∈ Π's policy cost. Equation 5 implies that these cost functions J(r) have a special form, critical for our regret analysis in Section 5: they are convex.

Lemma 1. For any policy π(r) ∈ Π, J(r) and each J_k(r_k) ≜ r_k C_k(1/r_k), k ∈ [K], are convex and monotonically decreasing for r > 0.

Proof. See the Appendix.

The convexity of the cost functions plays a crucial role in obtaining the regret bounds (Section 5) for the policy learning algorithm presented in Section 4.
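As a concrete illustration of Lemma 1, consider the polynomial mean-cost family that also appears in the experiments of Section 8 (this is an example, not an extra assumption of the analysis). For c̄_k(τ) = a_k τ^{p_k} with a_k > 0 and p_k ∈ (0, 1),

$$C_k(\tau') = \int_0^{\tau'} a_k \tau^{p_k}\,d\tau = \frac{a_k}{p_k+1}\,(\tau')^{\,p_k+1},
\qquad
J_k(r_k) = r_k\,C_k\!\Big(\frac{1}{r_k}\Big) = \frac{a_k}{p_k+1}\,r_k^{-p_k},$$

and therefore

$$J_k'(r_k) = -\frac{a_k p_k}{p_k+1}\,r_k^{-p_k-1} < 0,
\qquad
J_k''(r_k) = a_k p_k\,r_k^{-p_k-2} > 0 \quad \text{for } r_k > 0,$$

so J_k is monotonically decreasing and convex, exactly as Lemma 1 asserts in general.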
Policy constraints. Naturally, we would like to find a π(r) ∈ Π that minimizes J(r) (Equation 7). As described, however, Π contains no such policy: note that lim_{r→∞} J(r) = 0, but no finite r attains this limit. In practical applications, however, the rates r_k cannot be arbitrarily high, individually or in aggregate, and are subject to several constraints. Therefore, in this paper we regard feasible r_k as bounded from above for all k by a universal bound r_max. Moreover, we assume that the sum of all arms' play rates may not exceed some value B > 0. E.g., in Web crawling B is commonly interpreted as a limit on crawl rate imposed by physical network infrastructure (Azar et al., 2018; Kolobov et al., 2019a; Upadhyay et al., 2020). Last but not least, valid r_k values may not be 0, since a zero rate implies never playing the arm after a certain time point. In applications such as Web crawling, this means abandoning a cached item (e.g., an indexed webpage) to grow arbitrarily stale, which is unacceptable, so we impose a minimum sync-mode play rate r_min on every arm. Note that since, by Assumption 1, every c̄_k(τ) is bounded for τ ≥ 0, every J_k(r_k) (Lemma 1) is bounded as well.

Policy optimization. Thus, if we knew the cost processes c_k(·), we could use them to compute C_k(·) for all arms and would face the following optimization problem:

Problem 1 (SYNCHRONIZATION BANDIT instance).

Minimize: $J(r) = \frac{1}{K}\sum_{k=1}^{K} r_k\, C_k\!\left(\frac{1}{r_k}\right)$ (7)

subject to: $r \in \mathcal{K} \triangleq \big\{ r' \in [r_{\min}, r_{\max}]^K \;\big|\; \|r'\|_1 = B \big\}$.

Notice that this formulation implicitly assumes that the entire bandwidth B will be used for sync-mode arm plays; indeed, if the model is known, there is no need for probes.

As a side note, we remark that the class of periodic policies Π doesn't restrict Problem 1's solution quality compared to the broader class of deterministic open-loop policies. We state this here informally as a proposition, which we reformulate more precisely and prove in Appendix A:

Proposition 1. For a given K-armed SYNCHRONIZATION BANDIT instance (Problem 1), the optimal periodic policy π ∈ Π has the same cost J as the optimal general deterministic open-loop policy under the same constraints.
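For intuition, when the mean cost processes are known, Problem 1 is a small convex program that an off-the-shelf solver handles directly. The sketch below uses scipy.optimize.minimize (the same optimizer the released implementation relies on for rate updates) and assumes the illustrative polynomial costs from the example above; the function name and the parameter values are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def solve_problem1(a, p, r_min, r_max, B):
    """Minimize J(r) = (1/K) * sum_k r_k * C_k(1/r_k) over
    {r in [r_min, r_max]^K : ||r||_1 = B}, assuming polynomial mean costs
    c_k(tau) = a_k * tau^{p_k}, for which r_k * C_k(1/r_k) = a_k * r_k^{-p_k} / (p_k + 1)."""
    a, p = np.asarray(a, float), np.asarray(p, float)
    K = len(a)

    def J(r):
        return np.mean(a / (p + 1.0) * r ** (-p))   # np.mean supplies the 1/K factor

    r0 = np.full(K, B / K)                          # feasible start if r_min <= B/K <= r_max
    res = minimize(
        J, r0, method="SLSQP",
        bounds=[(r_min, r_max)] * K,
        constraints=[{"type": "eq", "fun": lambda r: np.sum(r) - B}],
    )
    return res.x

# Example usage (hypothetical instance with 3 arms and a budget of 6 plays per time unit):
# rates = solve_problem1(a=[1.0, 2.0, 0.5], p=[0.5, 0.7, 0.9], r_min=0.1, r_max=5.0, B=6.0)
```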
4. Online learning for cache synchronization

In reality we don't know the cost generation processes and can't solve the above optimization problem directly. Instead, we adopt an online perspective on SYNCHRONIZATION BANDITS. A key contribution of this paper, presented in this section, is MIRRORSYNC (Algorithm 1), an algorithm that treats a SYNCHRONIZATION MAB as an online learning problem. A MIRRORSYNC agent can be viewed as operating in rounds T = 1, 2, ..., T_max, in each round playing a candidate policy parameter vector r^T, suffering a loss Ĵ^T = J^T(r^T), and updating r^T to a new vector r^{T+1} as a result. As we show in Section 5, MIRRORSYNC has an adversarial regret of O(T_max^{2/3}).

MIRRORSYNC's novelty is due to the fact that, despite superficial similarities to standard online learning, our setting differs from it in crucial ways, and MIRRORSYNC circumvents these differences: (1) Although the agent can be viewed as suffering loss Ĵ^T, it doesn't observe this loss. By the SYNCHRONIZATION MAB's assumptions, it observes only samples of the instantaneous costs c_k(·), and only when it plays arms. Existing techniques don't offer a way to estimate the gradient ∇J in this case. Moreover, even these impoverished observations take real-world time to collect. (2) In online learning, regret analysis normally assumes ∇J to be bounded. This isn't quite the case in our model. While we could assume a bound on ∇J linear in 1/r_min, it would be detrimental to the regret bound when 1/r_min is large. We show how MIRRORSYNC addresses challenge (1) in this section, and circumvent challenge (2) in Section 5.

A note on infinite- vs. finite-horizon policies. The policy cost functions we derived in Section 3 describe the steady-state performance of a policy over an infinite time horizon. However, MIRRORSYNC's rounds are finite. Thus, the cost function J (Equation 7) that MIRRORSYNC uses as a basis for policy improvement is a proxy measure for the average costs that running MIRRORSYNC incurs in each round.

MIRRORSYNC operation. At a high level, MIRRORSYNC's main insight is allocating a small fraction of the available bandwidth B, determined by input parameter ε (Algorithm 1), to probe-mode arm plays, while using the bulk of the bandwidth, B/(1+ε) (line 2), to play in sync mode according to the current rates r. MIRRORSYNC uses instantaneous cost samples obtained from both to estimate the gradient ∇J (lines 11-19) by individually estimating its partial derivatives ∂J/∂r_k (lines 23-25), which we denote ∇_k J for short. At the end of each epoch, it performs online mirror descent on these estimates to get a new sync-mode policy r (lines 20, 42-43).

In more detail, in the spirit of online learning, MIRRORSYNC assumes that at the start of each round all arms' cost generation processes have just been reset to c_k(0), and restarts the time counter at t = 0 for every arm (line 9). (In practice, this assumption is unrealistic, and we lift it in another variant of MIRRORSYNC in Section 6.) Then, for every arm k, it schedules sync-mode plays at intervals of 1/r_k (lines 29, 37-38) until the end of the current round, and attempts to insert one probe-mode play into each such interval (lines 33-36), independently with probability ε (line 32), at a point chosen uniformly at random over the interval's length (line 34). Executing the constructed schedule (line 13) yields cost samples that are used in the aforementioned gradient estimation, which, crucially, is unbiased:

Lemma 2. For a rate vector r = (r_k)^K_{k=1} and a probability ε, suppose the agent plays each arm in sync mode 1/r_k time after that arm's previous sync-mode play, observing a sample of instantaneous cost ĉ_k ∼ c_k(1/r_k). Suppose also that, in addition, with probability ε independently for each arm k, the agent plays arm k in probe mode at time t ∼ Uniform[0, 1/r_k] after that arm's previous sync-mode play, observing a sample of instantaneous cost ĉ^(ε)_k ∼ c_k(t). Then, for each k,

$$\hat g_k \;=\; \begin{cases} 0 & \text{if the Bernoulli}(\varepsilon)\text{ draw failed,}\\[2pt] \dfrac{1}{\varepsilon r_k K}\big(\hat c^{(\varepsilon)}_k - \hat c_k\big) & \text{if the Bernoulli}(\varepsilon)\text{ draw succeeded,}\end{cases}$$

is an unbiased estimator of ∇_k J(r).

Proof. See the Appendix.

Algorithm 1: MIRRORSYNC
Input: r_min, lowest allowable arm play rate; r_max, highest allowable arm play rate; B, bandwidth; ε, probability of a probe-mode arm play; η, learning rate; T_max, number of rounds.
Output: r, arm play rates.
 1  T_round ← 1/r_min                                  // Round length
 2  K_ε ← {x ∈ [r_min, r_max/(1+ε)]^K | ||x||_1 = B/(1+ε)}
 3  r ← argmin_{x ∈ K_ε} BarrierF(x)                   // Initialize play rates
 4  foreach round T = 1, ..., T_max do
 5      // At the start of each round, all arms are assumed
 6      // to be synchronized and time re-starts at 0.
 7      foreach arm k ∈ [K] do
 8          // Construct a schedule σ^T_k for the T-th round.
 9          t_prev,k ← 0
10          σ^T_k, t_prev,k ← ScheduleArmPlays(t_prev,k, r_k, T_round)
11      foreach arm k ∈ [K] simultaneously do
12          // Sample costs by playing according to σ^T_k
13          (ĉ_{k,t_1}, ..., ĉ_{k,t_{|σ^T_k|}}) ← Play(σ^T_k)
14          foreach n = 1, ..., |σ^T_k| with (t_n, l_n) ∈ σ^T_k do
15              if l_n == sync then ĉ^(ε) ← none, ĉ ← ĉ_{k,t_n}
16              else ĉ^(ε) ← ĉ_{k,t_n}, ĉ ← ĉ_{k,t_{n+1}}, n ← n + 1
17              ĝ_k ← GradJSample(ĉ^(ε), ĉ, r_k, ε)
18              break
19      ĝ^T ← (ĝ_1, ..., ĝ_K)
20      r ← MirrorDescentStep(η, K_ε, ĝ^T, r)
21  Return r

23  GradJSample(ĉ^(ε)_k, ĉ_k, r_k, ε):
24      if ĉ^(ε)_k == none then Return 0
25      else Return (1/(ε r_k K)) (ĉ^(ε)_k − ĉ_k)

27  ScheduleArmPlays(t_prev,k, r_k, interval_len):
28      σ_k ← (), n_k ← 0, t_0 ← t_prev,k
29      while t_{n_k} + 1/r_k < interval_len do
30          t_prev_sync ← t_{n_k}
31          n_k ← n_k + 1
32          Probe ∼ Bernoulli(ε)
33          if Probe then
34              t_{n_k} ∼ Uniform(t_{n_k−1}, t_{n_k−1} + 1/r_k)
35              σ_k ← Append(σ_k, (t_{n_k}, probe))
36              n_k ← n_k + 1
37          t_{n_k} ← t_prev_sync + 1/r_k
38          σ_k ← Append(σ_k, (t_{n_k}, sync))
39      t_prev,k ← t_{n_k}
40      Return σ_k, t_prev,k

42  MirrorDescentStep(η, K, g, r):
43      Return argmin_{x ∈ K} {η⟨x, g⟩ + Div_F(x, r)}

45  Div_F(x, r): Return Σ^K_{k=1} [−log(x_k/r_k) + x_k/r_k − 1]

47  BarrierF(r): Return −Σ^K_{k=1} log(r_k)

In one round, MIRRORSYNC may get several gradient estimates for a given arm, in which case it takes the first one (line 18). To get a regret bound, however, it is crucial to ensure that for each arm MIRRORSYNC receives at least one such estimate per round, even if the estimate is 0 (line 18). This is why we set the length of each round to 1/r_min (line 1): it is the largest possible value of 1/r_k and hence the longest time that MIRRORSYNC may have to wait in order to get a cost sample at 1/r_k.

5. Regret analysis

We generalize our stochastic setting to an adversarial problem and prove an adversarial regret bound of order O(T_max^{2/3}). This means that the cost distributions c_k and all derived quantities (c̄_k, C_k, J_k) need not be stationary from one round to the next. The cost distributions and derived functions at round T are denoted by c^T_k, c̄^T_k, C^T_k, J^T_k and can be chosen by an oblivious adversary ahead of time.

Regret. We define the regret with respect to a fixed schedule r ∈ [0, ∞)^K by

$$\mathrm{Reg}(r) \;\triangleq\; \mathbb{E}\!\left[\sum_{T=1}^{T_{\max}} J^T(r^T)\right] - \sum_{T=1}^{T_{\max}} J^T(r),$$

where the expectation is over the randomness of the observed costs ĉ_k, and r^T is the choice of the algorithm in round T. Our goal is to compete with the best possible schedule r* using the full available budget:

$$\mathrm{Reg} \;\triangleq\; \mathrm{Reg}(r^*), \quad\text{where } r^* \in \arg\min_{r \in \mathcal{K}_0} \sum_{T=1}^{T_{\max}} J^T(r)$$

and K_0 is K_ε (line 2 of Algorithm 1) with ε = 0. Since we are not able to obtain any information on the function value or gradient of J^T without allocated exploration, we also define the best possible schedule r*_ε given exploration constrained by ε (lines 32-34 of Algorithm 1):

$$r^*_\varepsilon \;\in\; \arg\min_{r \in \mathcal{K}_\varepsilon} \sum_{T=1}^{T_{\max}} J^T(r).$$

The regret can be decomposed into

$$\mathrm{Reg} \;=\; \underbrace{\mathrm{Reg}(r^*_\varepsilon)}_{\text{in-policy regret}} \;+\; \underbrace{\sum_{T=1}^{T_{\max}} \big(J^T(r^*_\varepsilon) - J^T(r^*)\big)}_{\text{exploration gap}},$$

which we bound separately.

In-policy regret. The problem is an instance of online learning, but the online learning literature typically assumes that the gradients of the objective functions J^T are uniformly bounded w.r.t. some norm. Our setting differs in a key aspect: the gradients ∇_k J(r) scale proportionally to r_k^{−1}. A naive solution would be to bound ∇_k J(r) by a quantity proportional to r_min^{−1} and use gradient descent. However, the regret would then scale with r_min^{−1}, which might be prohibitively large. We show that mirror descent with a carefully chosen potential, namely the log barrier F(r) = −Σ^K_{k=1} log(r_k), adapts to the geometry of the gradients and replaces the polynomial dependency on r_min^{−1} with a logarithmic one.

Theorem 5.1. For any sequence of convex functions (J^T)^{T_max}_{T=1} and any learning rate 0 < η < Kε/(2U), the in-policy regret of MIRRORSYNC satisfies

$$\mathrm{Reg}(r^*_\varepsilon) \;=\; O\!\left(\frac{1}{\eta}\log\frac{B}{r_{\min}K} \;+\; \frac{\eta\,U^2\,T_{\max}}{\varepsilon}\right).$$

Proof. See the Appendix.
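The two computational pieces of MIRRORSYNC, the sparse gradient estimate of Lemma 2 and the log-barrier mirror descent step (lines 42-47 of Algorithm 1), are small. Below is a minimal Python sketch of both, with the constrained projection solved numerically via scipy as in the authors' released implementation; the function names and the solver choice are ours. MIRRORSYNC would call the step with the set K_ε, i.e., bounds [r_min, r_max/(1+ε)] and budget B/(1+ε); the practical variant in Section 6 reuses the same step on a subset of arms with a local budget.

```python
import numpy as np
from scipy.optimize import minimize

def grad_j_sample(c_probe, c_sync, r_k, eps, K):
    """Unbiased estimate of the k-th partial derivative of J (Lemma 2).
    c_probe is None if no probe-mode play happened in this inter-sync interval."""
    if c_probe is None:
        return 0.0
    return (c_probe - c_sync) / (eps * r_k * K)

def mirror_descent_step(eta, g, r, r_lo, r_hi, budget):
    """argmin_x { eta*<g, x> + Div_F(x, r) } over {x in [r_lo, r_hi]^K : ||x||_1 = budget},
    where Div_F is the Bregman divergence of the log barrier F(r) = -sum_k log(r_k)."""
    r, g = np.asarray(r, float), np.asarray(g, float)

    def objective(x):
        div = np.sum(-np.log(x / r) + x / r - 1.0)   # log-barrier Bregman divergence
        return eta * np.dot(g, x) + div

    res = minimize(
        objective, r, method="SLSQP",
        bounds=[(r_lo, r_hi)] * len(r),
        constraints=[{"type": "eq", "fun": lambda x: np.sum(x) - budget}],
    )
    return res.x
```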
Exploration gap. We show that the exploration gap scales proportionally to ε and is independent of r_min^{−1}. At first sight, this bound is surprising, because the exploration gap should be approximately Σ^{T_max}_{T=1} ⟨∇J^T(r*), r* − r*_ε⟩, and the gradients ∇_k J^T(r*) could be unbounded (i.e., of order r_min^{−1}). The high-level idea behind the following lemma is the observation that at the optimal point r* the gradients in all coordinates must coincide, and hence the gradient cannot be of order r_min^{−1} even if r*_k = r_min.

Lemma 3. The exploration gap is bounded by

$$\sum_{T=1}^{T_{\max}} \big(J^T(r^*_\varepsilon) - J^T(r^*)\big) \;\le\; 2\varepsilon U T_{\max}.$$

Proof. See the Appendix.

Finally, we are ready to present the main result.

Corollary 1. With ε = T_max^{−1/3} log^{1/3}(B/(r_min K)) and η chosen to balance the two terms of Theorem 5.1, the regret of MIRRORSYNC is bounded, for any T_max > 8 log(B/(r_min K)), by

$$\mathrm{Reg} \;=\; O\!\left(U\,T_{\max}^{2/3}\,\log^{1/3}\frac{B}{r_{\min}K}\right). \qquad (9)$$

Proof. The choice of η, ε, and the lower bound on T_max ensures that we can apply Theorem 5.1 to bound the in-policy regret, which simplifies to $\mathrm{Reg}(r^*_\varepsilon) = O\big(U\sqrt{T_{\max}\log\tfrac{B}{r_{\min}K}/\varepsilon}\big)$. Adding the exploration gap from Lemma 3 and substituting the value of ε completes the proof.

6. Making MIRRORSYNC practical

Although MIRRORSYNC lends itself to theoretical analysis, several design choices make its vanilla version impractical. (1) MIRRORSYNC assumes that all arms are synchronized for free at the start of each round, so that each round starts in the same state, which is unrealistic. (2) MIRRORSYNC waits until the end of each 1/r_min-long round to update all arms' play rates simultaneously, and such a round could last months in applications like Web crawling. (3) A further source of inefficiency is that even if arm k has produced several ∇_k J samples in a given round, MIRRORSYNC uses only one of them.

ASYNCMIRRORSYNC (Algorithm 2), which can be viewed as a practical implementation of MIRRORSYNC, mitigates these weaknesses. In contrast to MIRRORSYNC, which performs updates in rigidly defined rounds, ASYNCMIRRORSYNC updates the policy according to a user-specified schedule S (see Algorithm 2's inputs). The length of the inter-update periods is unimportant for ASYNCMIRRORSYNC, unlike for MIRRORSYNC (line 1, Alg. 1), due to a major difference between the two algorithms. MIRRORSYNC aims to update all arms' parameters synchronously at the end of each round and makes the rounds very long to guarantee that each arm has generated at least one gradient estimate by the end of each round. ASYNCMIRRORSYNC, in contrast, does its updates asynchronously, performing mirror descent at an update time t^(upd)_i ∈ S only on those arms that happen to have generated at least one new gradient sample since the previous update time t^(upd)_{i−1} (lines 26-33). ASYNCMIRRORSYNC does these local updates while respecting the global constraint B by using the sum of the current play rates of the arms that are about to be updated as a local constraint (line 32). Thus, ASYNCMIRRORSYNC doesn't need to make inter-update intervals excessively long and doesn't suffer from issue (2). As a side note, the reason MIRRORSYNC's regret bound in Corollary 1 has no linear dependence on 1/r_min is that it characterizes regret in terms of the number of rounds, not wall-clock time. Nonetheless, this dependency matters empirically, and obtaining a wall-clock-time regret bound that is free of it is an interesting theoretical problem.

ASYNCMIRRORSYNC's asynchronous update mechanism also removes the need for the free simultaneous sync-mode play of all arms after each round, i.e., issue (1).
Recall that before each sync-mode play of arm k, with probability ε we can play arm k one more time, and so far we have chosen to do so in probe mode. The ScheduleArmPlays routine (line 27, Alg. 1) that both MIRRORSYNC and ASYNCMIRRORSYNC rely on attempts this (lines 32-33, Alg. 1) for every sync-mode arm play except the first arm play of each round. ASYNCMIRRORSYNC takes advantage of these unused chances to schedule sync-mode plays, which reset cost generation for some fraction of arms (line 36, Alg. 2). For the remaining arms, it simply waits until their next sync-mode play (line 37, Alg. 2) to start estimating the new gradient.

Last but not least, ASYNCMIRRORSYNC improves the efficiency of the updates themselves. It employs all gradient samples obtained for an arm between updates and averages them to reduce estimation variance (lines 25, 28), thereby rectifying MIRRORSYNC's weakness (3).

Algorithm 2: ASYNCMIRRORSYNC
Input: r_min, lowest allowable arm play rate; r_max, highest allowable arm play rate; B, bandwidth; ε, probability of a probe-mode arm play; η, learning rate; T_max, (wall-clock) time horizon; S = (t^(upd)_1, t^(upd)_2, ...), update schedule.
Output: r, arm play rates.
 1  K_ε ← {x ∈ [r_min, r_max/(1+ε)]^K | ||x||_1 = B/(1+ε)}
 2  r ← argmin_{x ∈ K_ε} BarrierF(x)                  // Initialize play rates
 3  t_now ← 0                                          // Current time
 4  // Extend arms' schedules σ_k until the next update time
    foreach arm k ∈ [K] do
 5      t_prev,k ← t_now
 6      σ_k, t_prev,k ← ScheduleArmPlays(t_prev,k, r_k, t^(upd)_1)
 7  // t_now is incremented continuously
 8  while t_now ≤ T_max do
 9      // Play each arm k's current σ_k, record cost samples
10      foreach arm k ∈ [K] simultaneously do
11          (ĉ_{k,t_1}, ..., ĉ_{k,t_{|σ_k|}}) ← Play(σ_k)
12      // If now is the next update time i ...
13      if t_now == t^(upd)_i for some t^(upd)_i ∈ S then
14          A_i ← ∅                                    // Set of arms that will be updated now
15          // Collect cost samples since the previous update time
16          foreach arm k ∈ [K] do
17              foreach n, (t_n, l_n) ∈ σ_k s.t. t_n ≥ t^(upd)_{i−1} do
18                  if l_n == sync then
19                      ĉ^(ε) ← none, ĉ ← ĉ_{k,t_n}
20                  else                               // This was a probe play
21                      ĉ^(ε) ← ĉ_{k,t_n}, ĉ ← ĉ_{k,t_{n+1}}
22                      n ← n + 1
23                  // Estimate ∇_k J per Lemma 2 (Alg. 1,
24                  // lines 23-25) using the collected samples
25                  ĝ_{n,k} ← GradJSample(ĉ^(ε), ĉ, r_k, ε)
26              if we got at least one ĝ_{n,k} for arm k then
27                  A_i ← A_i ∪ {k}
28                  g_k ← Avg({ĝ_{n,k} | t_n ≥ t^(upd)_{i−1}})
29          // Normalize the new gradient estimates
30          g_i ← (1/|A_i|) (g_k)_{k ∈ A_i}
31          // Now, update play rates only for arms in A_i
32          K_{ε,i} ← {x ∈ [r_min, r_max/(1+ε)]^{|A_i|} | ||x||_1 = Σ_{k ∈ A_i} r_k}
33          r_{A_i} ← MirrorDescentStep(η, K_{ε,i}, g_i, r_{A_i})
34      foreach arm k ∈ [K] do
35          // Extend schedule σ_k until the next update time.
36          if Bernoulli(ε) then t_prev,k ← t_now
37          else t_prev,k ← max{t_prev,k + 1/r_k, t_now}
38          σ_k ← Append(σ_k, (t_prev,k, sync))
39          σ'_k, t_prev,k ← ScheduleArmPlays(t_prev,k, r_k, t^(upd)_{i+1}); σ_k ← Append(σ_k, σ'_k)
40  Return r

7. Related work

There are several existing models superficially related to but fundamentally different from SYNCHRONIZATION MABs. In maintenance job scheduling (Bar-Noy et al., 1998), as in our setting, each machine (arm) has an associated operating cost per time unit that increases since the previous maintenance, and performing a maintenance reduces this cost temporarily. However, the agent knows all arms' cost functions and always observes the machines' running costs. Upadhyay et al. (2018) describe a model for maximizing a long-term reward that is a function of two general marked temporal point processes.
This model is more general than the SYNCHRONIZATION MAB in some ways (e.g., not assuming cost process monotonicity) but allows controlling the policy process's rate only via a policy cost regularization term and provides no performance guarantees. Recharging bandits (Immorlica & Kleinberg, 2018), like SYNCHRONIZATION MABs, have arms with non-stationary payoffs: the expected arm reward is a convex increasing function of the time since the arm's last play. In spite of this similarity, recharging bandits and other MABs with time-dependent payoffs (Heidari et al., 2016; Levine & Crammer, 2017; Cella & Cesa-Bianchi, 2020) make the common assumptions that a reward is generated only when an arm is played and that the agent observes all generated rewards. As a result, their analysis is different from ours. In general, payoff non-stationarity has been widely studied in two broad bandit classes. Restless bandits (Whittle, 1988) allow an arm's reward distribution to change, but only independently of when the arm is played. Rested bandits (Gittins, 1979) also allow an arm's reward distribution to change, but only when the arm is played. SYNCHRONIZATION MABs belong to neither class, since their arms' instantaneous costs change both independently of and as a result of arms being played.

8. Empirical evaluation

The goal of our experiments was to evaluate (1) the relative performance of MIRRORSYNC and ASYNCMIRRORSYNC, given that MIRRORSYNC assumes free arm resets at the beginning of each round and ASYNCMIRRORSYNC doesn't, and (2) the relative performance of ASYNCMIRRORSYNC and its version with projected stochastic gradient descent (PSGD) instead of mirror descent, which we denote ASYNCPSGDSYNC. The choice of mirror descent instead of PSGD was motivated by the intuition that with mirror descent MIRRORSYNC would achieve lower regret than with PSGD (see Section 5). In the experiments, we verify this intuition empirically. Before analyzing the results, we describe the details of our experimental setup.

Problem instance generation. Our experiments in Figures 1 and 2 were performed on SYNCHRONIZATION MAB instances generated as follows. Recall from Sections 2 and 3 that a SYNCHRONIZATION MAB instance is defined by:

- r_min, the lowest allowed arm play rate;
- r_max, the highest allowed arm play rate;
- B, the highest allowed total arm play rate;
- K, the number of arms;
- {c_k(τ)}^K_{k=1}, a set of cost-generating processes, i.e., time-dependent distributions of instantaneous costs, one process for each arm k.

For all problem instances in the experiments, r_min, r_max, B, and K were set as in Table 1 in Appendix B. The set {c_k(τ)}^K_{k=1} of cost-generating processes was constructed randomly for each instance. In all problem instances, each arm had a distribution over time-dependent cost functions in the form of polynomials c_k(τ) = a_k τ^{p_k}, where p_k ∈ (0, 1) was chosen at instance creation time and a_k was sampled at run time as described in Appendix B. Note that MIRRORSYNC's regret (Equation 9) depends on the cost cap U. While our polynomial cost functions are unbounded in general, they are bounded within the [r_min, r_max] constraint region we are using (Table 1 in Appendix B); within this constraint region, these cost functions are equivalent to min{a_k τ^{p_k}, U}, where U = 40. In Appendix C we also describe a different, Poisson-process-based family of cost-generating processes, and present experimental results obtained on it in Figures 3 and 4 in that section. Despite that process family being very distinct from the polynomial one, the results are qualitatively similar to those in Figures 1 and 2.
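For reference, a sketch of such an instance generator is shown below; the sampling ranges for p_k and a_k, the moment at which a_k is re-drawn, and the example parameter values are placeholders of ours, since the actual protocol and the values of r_min, r_max, B, K, and U live in Appendices B and C, which are not reproduced here.

```python
import numpy as np

def make_polynomial_instance(K, r_min, r_max, B, U, seed=0):
    """One SYNCHRONIZATION MAB instance with capped polynomial costs
    c_k(tau) = min(a_k * tau^{p_k}, U). Exponents p_k are fixed at creation time;
    a_k is re-drawn on every observation as a stand-in for 'sampled at run time'."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.1, 0.9, size=K)        # placeholder range for p_k in (0, 1)

    def make_cost_fn(k):
        def cost(tau):
            a_k = rng.uniform(0.5, 2.0)      # placeholder range for a_k
            return min(a_k * tau ** p[k], U)
        return cost

    return {"r_min": r_min, "r_max": r_max, "B": B, "K": K, "U": U,
            "cost_fns": [make_cost_fn(k) for k in range(K)]}

# Example (parameter values are hypothetical, not those of Table 1):
# instance = make_polynomial_instance(K=100, r_min=0.01, r_max=5.0, B=20.0, U=40.0)
```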
Implementation details. We implemented MIRRORSYNC, ASYNCMIRRORSYNC, and ASYNCPSGDSYNC, along with a problem instance generator that constructs SYNCHRONIZATION MAB instances as above, in Python. The implementation, available at https://github.com/microsoft/Optimal-Freshness-Crawl-Scheduling, relies on scipy.optimize.minimize for convex constrained optimization in order to update the play rates r (lines 20, 42 of Alg. 1; line 33 of Alg. 2). Other convex optimizers are possible as well. The experiments were performed on a Windows 10 laptop with 32GB RAM and 8 Intel 2.3GHz i9-9980H CPU cores.

Figure 1. MIRRORSYNC's and ASYNCMIRRORSYNC's convergence (x-axis: equivalent number of MIRRORSYNC rounds; y-axis: policy cost J(r), lower is better; the reference line is the optimal policy cost with ε-exploration, ε = 0.05). The two algorithms exhibit nearly identical convergence behavior using tuned hyperparameters. However, ASYNCMIRRORSYNC works without assuming the free arm resets after arms' parameter updates that MIRRORSYNC relies on.

Figure 2. The asynchronous version of MIRRORSYNC with mirror descent vs. with projected SGD (same axes and reference line as Figure 1). The use of mirror descent with the log-barrier function in MIRRORSYNC was key to establishing the regret bounds in Section 5. Empirically, although ASYNCMIRRORSYNC and ASYNCPSGDSYNC eventually converge to similar-quality policies, ASYNCMIRRORSYNC discovers good policies faster, as MIRRORSYNC's theory predicts.

Hyperparameter tuning. Running the algorithms required choosing values for the following parameters:

- Learning rate η. As in other learning algorithms, choosing a good value of η for each of the three algorithms was critical for their convergence behavior.
- Length l^{upd}_{round} of the intervals between ASYNCMIRRORSYNC's and ASYNCPSGDSYNC's play rate updates. Recall that unlike MIRRORSYNC, which updates all play rates simultaneously every 1/r_min time units, ASYNCMIRRORSYNC and ASYNCPSGDSYNC update the model parameters according to a user-provided schedule. While the schedule doesn't necessarily have to be periodic, in the experiments it was, with l^{upd}_{round} being the inter-update interval length. Intuitively, l^{upd}_{round} influences the average number of arms updated during each update attempt and the variance of the gradient estimates: the larger l^{upd}_{round}, the more gradient samples ASYNCMIRRORSYNC and ASYNCPSGDSYNC can be expected to average for the upcoming update (line 28 of Alg. 2). In this respect, l^{upd}_{round}'s role resembles that of minibatch size in minibatch SGD.
- Exploration parameter ε. Theory provides horizon-dependent guidance for setting ε for MIRRORSYNC (Corollary 1) but not for ASYNCMIRRORSYNC and ASYNCPSGDSYNC. For comparing the relative convergence properties of MIRRORSYNC, ASYNCMIRRORSYNC, and ASYNCPSGDSYNC, we fixed ε = 0.05 for all of them.

ASYNCMIRRORSYNC's and ASYNCPSGDSYNC's performance is determined by a combination of η and l^{upd}_{round}, so we optimized them together using grid search. Please see Appendix B for more details.

Experiment results. Figures 1 and 2 compare the performance of MIRRORSYNC vs. ASYNCMIRRORSYNC and of ASYNCMIRRORSYNC vs. ASYNCPSGDSYNC, respectively. The figure captions highlight the important patterns we observed.
The plots were obtained by running the respective pairs of algorithms on 150 problem instances generated as above, measuring the policy cost J after every update, and averaging the resulting curves across these 150 trials. In each trial, all algorithms were run for an amount of simulated time equivalent to 240 MIRRORSYNC rounds (see the x-axes of Figures 1 and 2). However, note that the number of updates performed by ASYNCMIRRORSYNC and ASYNCPSGDSYNC was larger than 240. Specifically, a MIRRORSYNC update round is always 1/r_min time units long, whereas for ASYNCMIRRORSYNC and ASYNCPSGDSYNC it is l^{upd}_{round} units, so for every MIRRORSYNC update round they perform (1/r_min)/l^{upd}_{round} updates. Although more frequent model updates are themselves a strength of the asynchronous algorithms, their main practical advantage is independence from MIRRORSYNC's free-arm-resets assumption.

9. Conclusion

This paper presented SYNCHRONIZATION MABs, a bandit class where all arms generate costs continually, independently of being played, and the agent observes an arm's stochastic instantaneous cost only when it plays the arm. We proposed an online learning approach for this setting, called MIRRORSYNC, whose novelty is in estimating the policy cost gradient without directly observing the policy cost function and without having a closed-form expression for it. Moreover, we derived an O(T^{2/3}) adversarial regret bound for MIRRORSYNC without explicitly requiring the gradients to be bounded. We also presented ASYNCMIRRORSYNC, a practical version of MIRRORSYNC that lifts the latter's idealizing assumptions. The key insight behind all these contributions is that the use of mirror descent for policy updates in SYNCHRONIZATION MABs enables much faster convergence than gradient descent would. Our experiments confirmed this insight empirically.

Acknowledgements. We would like to thank Nicole Immorlica (Microsoft Research) and the anonymous reviewers for their helpful comments and suggestions regarding this work.

References

Azar, Y., Horvitz, E., Lubetzky, E., Peres, Y., and Shahaf, D. Tractable near-optimal policies for crawling. Proceedings of the National Academy of Sciences (PNAS), 2018.

Bar-Noy, A., Bhatia, R., Naor, J., and Schieber, B. Minimizing service and operation costs of periodic scheduling. In SODA, pp. 11-20, 1998.

Bright, L., Gal, A., and Raschid, L. Adaptive pull-based policies for wide area data delivery. ACM Transactions on Database Systems (TODS), 31(2):631-671, 2006.

Bubeck, S. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2016.

Cella, L. and Cesa-Bianchi, N. Stochastic bandits with delay-dependent payoffs. In AISTATS, 2020.

Cho, J. and Garcia-Molina, H. Synchronizing a database to improve freshness. In ACM SIGMOD International Conference on Management of Data, 2000.

Cho, J. and Garcia-Molina, H. Effective page refresh policies for web crawlers. ACM Transactions on Database Systems, 28(4):390-426, 2003a.

Cho, J. and Garcia-Molina, H. Estimating frequency of change. ACM Transactions on Internet Technology, 3(3):256-290, 2003b.

Cho, J. and Ntoulas, A. Effective change detection using sampling. In VLDB, 2002.

Gal, A. and Eckstein, J. Managing periodically updated data in relational databases. Journal of the ACM, 48:1141-1183, 2001.

Gittins, J. C. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2), 1979.
Heidari, H., Kearns, M., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In AISTATS, 2016.

Immorlica, N. and Kleinberg, R. Recharging bandits. In FOCS, 2018.

Kolobov, A., Peres, Y., Lu, C., and Horvitz, E. Staying up to date with online content changes using reinforcement learning for scheduling. In NeurIPS, 2019a.

Kolobov, A., Peres, Y., Lubetzky, E., and Horvitz, E. Optimal freshness crawl under politeness constraints. In SIGIR, 2019b.

Levine, N. and Crammer, K. Rotting bandits. In NIPS, 2017.

Nemirovsky, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.

Upadhyay, U., De, A., and Gomez-Rodriguez, M. Deep reinforcement learning of marked temporal point processes. In NeurIPS, 2018.

Upadhyay, U., Busa-Fekete, R., Kotlowski, W., Pal, D., and Szorenyi, B. Learning to crawl. In AAAI, 2020.

Whittle, P. Restless bandits: Activity allocation in a changing world. Applied Probability, 25(A):287-298, 1988.

Wolf, J. L., Squillante, M. S., Yu, P. S., Sethuraman, J., and Ozsen, L. Optimal crawling strategies for web search engines. In WWW, 2002.