# Learning to Crawl

*The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)*

- Utkarsh Upadhyay, Resonal, Berlin, Germany, utkarsh@reason.al
- Róbert Busa-Fekete, Google Research, NY, USA, busarobi@google.com
- Wojciech Kotłowski, Poznań University of Technology, Poland, wkotlowski@cs.put.poznan.pl
- Dávid Pál, Yahoo! Research, NY, USA, davidko.pal@gmail.com
- Balázs Szörényi, Yahoo! Research, NY, USA, szorenyi.balazs@gmail.com

## Abstract

Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. (2018) under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follow exponential distributions with known parameters, i.e., changes and requests are Poisson processes. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an O(√T) regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of parameters.

## 1 Introduction

As information dissemination in the world becomes near real-time, it becomes more and more important for search engines, like Bing and Google, and other knowledge repositories to keep their caches of information and knowledge fresh. In this paper, we consider the web-crawling problem of designing policies for refreshing webpages in a local cache with the objective of maximizing the number of incoming requests which are served with the latest version of the page.

Webpages are the simplest and most ubiquitous source of information on the internet. As items to be kept in a cache, they have two key properties: (i) they need to be polled, which uses bandwidth, and (ii) polling them only provides partial information about their change process, i.e., a single bit indicating whether the webpage has changed since it was last refreshed or not. Cho and Garcia-Molina (2003a) in their seminal work presented a formulation of the problem which was recently studied by Azar et al. (2018). Under the assumption that the changes to the webpages and the requests are Poisson processes with known rates, they describe an efficient algorithm to find the optimal refresh rates for the webpages. However, the change rates of the webpages are often not known in advance and need to be estimated. Since the web crawler cannot continuously monitor every page, only partial information about the change process is available. Cho and Garcia-Molina (2003b), and more recently Li, Cline, and Loguinov (2017), have proposed estimators of the rate of change given partial observations.
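To make this partial-observation setting concrete, the sketch below (ours, not from the paper; the function name `estimate_rate` and the use of SciPy are illustrative choices) estimates a single page's change rate by maximum likelihood from single-bit feedback: under a Poisson change process with rate ξ, a refresh made w time units after the previous observation reports a change with probability 1 − e^(−ξw). This is only an illustration in the spirit of Cho and Garcia-Molina (2003b); the estimator analysed in this paper and its confidence intervals are developed in Section 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_rate(waits, changed, rate_min=1e-3, rate_max=1e3):
    """MLE of a Poisson change rate from single-bit observations.

    waits:   elapsed times between consecutive observations of the page
    changed: 0/1 bits, 1 if the page changed during the corresponding interval
    """
    waits = np.asarray(waits, dtype=float)
    changed = np.asarray(changed, dtype=bool)

    def neg_log_likelihood(rate):
        # P(change observed in an interval of length w) = 1 - exp(-rate * w)
        p_change = np.clip(1.0 - np.exp(-rate * waits), 1e-12, 1.0)
        # log P(no change in interval of length w) = -rate * w
        ll = np.where(changed, np.log(p_change), -rate * waits).sum()
        return -ll

    res = minimize_scalar(neg_log_likelihood,
                          bounds=(rate_min, rate_max), method="bounded")
    return res.x

# Example: a page changing at rate 2.0, polled at irregular intervals.
rng = np.random.default_rng(0)
true_rate = 2.0
waits = rng.exponential(0.5, size=500)
changed = rng.random(500) < 1.0 - np.exp(-true_rate * waits)
print(estimate_rate(waits, changed))  # should be close to 2.0
```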
However, the problem of learning the refresh rates of items while also trying to keep the cache as up-to-date as possible for incoming requests appears very challenging. On the one hand, the optimal policy found by the algorithm of Azar et al. (2018) does not allocate bandwidth to pages that are changing very frequently; on the other hand, rate estimates with low precision, especially for pages that are changing frequently, may result in a policy with non-vanishing regret. We formulate this web-crawling problem with unknown change rates as an online optimization problem for which we define a natural notion of regret, describe the conditions under which the refresh rates of the webpages can be learned, and show that, using a simple explore-and-commit algorithm, one can obtain regret of order O(√T).

Though in this paper we primarily investigate the problem of web crawling, our notion of regret and the observations we make about the learning algorithms can also be applied to other control problems which model the actions of agents as Poisson processes and the policies as intensities, including alternative objectives and observation regimes for the web-crawling problem itself. Such an approach is seen in recent works which model or predict social activities (Farajtabar 2018; Du et al. 2016), control online social actions (Zarezade et al. 2017; Karimi et al. 2016; Wang et al. 2017; Upadhyay, De, and Gomez-Rodriguez 2018), or control the spacing of items for optimal learning (Tabibian et al. 2019). All such problems admit online versions where the parameters of the models (e.g., the difficulty of items estimated from recall events (Tabibian et al. 2019), or the rates at which other broadcasters post messages (Karimi et al. 2016)) need to be learned while also optimising the policy the agent follows.

In Section 2, we formally describe the problem setup and formulate the objective with bandwidth constraints. Section 3 takes a closer look at the objective function and the optimal policy to describe the properties any learning algorithm should have. We propose an estimator for learning the parameters of a Poisson process under partial observability and provide guarantees on its performance in Section 4. Leveraging the bound on the estimator's performance, we propose a simple explore-and-commit algorithm in Section 5 and show that it achieves O(√T) regret. In Section 6, we test our algorithm using real data to justify our theoretical findings, and we conclude with future research directions in Section 7.

## 2 Problem Formulation

In this section, we consider the problem of keeping a cache of m webpages up-to-date by modelling the changes to those webpages, the requests for the pages, and the bandwidth constraints placed on a standard web crawler. We assume that the cache is empty when all processes start at time 0.

We model the changes to each webpage as a Poisson process with a constant rate. The parameters of these change processes are denoted by ξ = [ξ_1, …, ξ_m], where ξ_i > 0 is the rate of changes made to webpage i. We assume that ξ is not known to us; we know only an upper bound ξ_max and a lower bound ξ_min on the change rates. The crawler will learn ξ by refreshing the pages and observing the single-bit feedback described below. We denote the time at which webpage i changes for the nth time by x_{i,n}.
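As a minimal illustration of this change model (our sketch, not from the paper; the function names are ours, and the convention that the first feedback bit covers the interval starting at time 0 is our assumption), the following code simulates the change times of one page as a homogeneous Poisson process and produces the single-bit observations a crawler would receive for a given refresh schedule.

```python
import numpy as np

def simulate_change_times(rate, horizon, rng):
    """Change times of one page on [0, horizon]: a homogeneous Poisson process,
    i.e., i.i.d. exponential inter-change times with mean 1/rate."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > horizon:
            return np.array(times)
        times.append(t)

def single_bit_feedback(change_times, refresh_times):
    """For each refresh, 1 iff the page changed since the previous observation
    (taken to be time 0 before the first refresh) -- all the crawler ever sees."""
    bits, prev = [], 0.0
    for y in refresh_times:
        bits.append(int(np.any((change_times > prev) & (change_times <= y))))
        prev = y
    return np.array(bits)

rng = np.random.default_rng(1)
changes = simulate_change_times(rate=2.0, horizon=10.0, rng=rng)
print(single_bit_feedback(changes, refresh_times=[2.0, 4.0, 6.0, 8.0, 10.0]))
```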
We model the incoming requests for each webpage also as Poisson processes with constant rates and denote these rates by ζ = [ζ_1, ζ_2, …, ζ_m]. We will assume that these rates, which can also be interpreted as the importance of each webpage in our cache, are known to the crawler. We will denote the time at which webpage i is requested for the nth time by z_{i,n}. The change process and the request process, given their parameters, are assumed to be independent of each other.

We denote the time points at which page i is refreshed by the crawler by the sequence (y_{i,n})_{n≥1}. The feedback which the crawler gets after refreshing webpage i at time y_{i,n} consists of a single bit indicating whether the webpage has changed since the last observation, if any, which was made at time y_{i,n−1}.

Let E_i[t_0, t] denote the event that neither a change nor a refresh of webpage i has happened between times t_0 and t. Define FRESH(i, t) as the event that the webpage is fresh in the cache at time t. Defining the maximum of an empty set to be −∞, we have:

FRESH(i, t) = 0 if E_i[0, t] occurs, and otherwise
FRESH(i, t) = 1( max{ x_{i,j} : x_{i,j} ≤ t } ≤ max{ y_{i,n} : y_{i,n} ≤ t } ),

i.e., the cached copy is fresh at time t if the page has been refreshed at least once and the most recent refresh before t is no earlier than the most recent change before t.
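The following short sketch (ours; the helper name `is_fresh` is illustrative) evaluates this definition for a single page, treating the maximum of an empty set as −∞ exactly as above.

```python
import math

def is_fresh(change_times, refresh_times, t):
    """FRESH(i, t): the cached copy is fresh at time t iff the page has been
    refreshed at least once and the last refresh before t is no earlier than
    the last change before t. The max of an empty set is -inf."""
    last_change = max((x for x in change_times if x <= t), default=-math.inf)
    last_refresh = max((y for y in refresh_times if y <= t), default=-math.inf)
    # E_i[0, t]: neither a change nor a refresh has happened yet -> cache is empty.
    if last_change == -math.inf and last_refresh == -math.inf:
        return 0
    return int(last_change <= last_refresh)

# Example: changes at 1.0 and 4.0, refreshes at 2.0 and 5.0.
print(is_fresh([1.0, 4.0], [2.0, 5.0], 3.0))  # 1: refreshed after the change at 1.0
print(is_fresh([1.0, 4.0], [2.0, 5.0], 4.5))  # 0: changed at 4.0, not yet re-refreshed
```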