# Learning to Crawl

*The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)*

- Utkarsh Upadhyay, Resonal, Berlin, Germany, utkarsh@reason.al
- Róbert Busa-Fekete, Google Research, NY, USA, busarobi@google.com
- Wojciech Kotłowski, Poznań University of Technology, Poland, wkotlowski@cs.put.poznan.pl
- Dávid Pál, Yahoo! Research, NY, USA, davidko.pal@gmail.com
- Balázs Szörényi, Yahoo! Research, NY, USA, szorenyi.balazs@gmail.com

## Abstract

Web crawling is the problem of keeping a cache of webpages fresh, i.e., having the most recent copy available when a page is requested. This problem is usually coupled with the natural restriction that the bandwidth available to the web crawler is limited. The corresponding optimization problem was solved optimally by Azar et al. (2018) under the assumption that, for each webpage, both the elapsed time between two changes and the elapsed time between two requests follow exponential distributions with known parameters, i.e., changes and requests are Poisson processes. In this paper, we study the same control problem but under the assumption that the change rates are unknown a priori, and thus we need to estimate them in an online fashion using only partial observations (i.e., single-bit signals indicating whether the page has changed since the last refresh). As a point of departure, we characterise the conditions under which one can solve the problem with such partial observability. Next, we propose a practical estimator and compute confidence intervals for it in terms of the elapsed time between the observations. Finally, we show that the explore-and-commit algorithm achieves an O(√T) regret with a carefully chosen exploration horizon. Our simulation study shows that our online policy scales well and achieves close to optimal performance for a wide range of parameters.

## 1 Introduction

As information dissemination in the world becomes near real-time, it becomes more and more important for search engines, like Bing and Google, and other knowledge repositories to keep their caches of information and knowledge fresh. In this paper, we consider the web-crawling problem of designing policies for refreshing webpages in a local cache with the objective of maximizing the number of incoming requests which are served with the latest version of the page.

Webpages are the simplest and most ubiquitous source of information on the internet. As items to be kept in a cache, they have two key properties: (i) they need to be polled, which uses bandwidth, and (ii) polling them only provides partial information about their change process, i.e., a single bit indicating whether the webpage has changed since it was last refreshed or not. Cho and Garcia-Molina (2003a) in their seminal work presented a formulation of the problem which was recently studied by Azar et al. (2018). Under the assumption that the changes to the webpages and the requests are Poisson processes with known rates, they describe an efficient algorithm to find the optimal refresh rates for the webpages. However, the change rates of the webpages are often not known in advance and need to be estimated. Since the web crawler cannot continuously monitor every page, only partial information about the change process is available. Cho and Garcia-Molina (2003b), and more recently Li, Cline, and Loguinov (2017), have proposed estimators of the rate of change given partial observations.
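To make this partial-observation setting concrete, the sketch below (ours, not from the paper; the function name `estimate_rate` and the use of SciPy are illustrative choices) estimates a single page's change rate by maximum likelihood from single-bit feedback: under a Poisson change process with rate ξ, a refresh made w time units after the previous observation reports a change with probability 1 − e^(−ξw). This is only an illustration in the spirit of Cho and Garcia-Molina (2003b); the estimator analysed in this paper and its confidence intervals are developed in Section 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_rate(waits, changed, rate_min=1e-3, rate_max=1e3):
    """MLE of a Poisson change rate from single-bit observations.

    waits:   elapsed times between consecutive observations of the page
    changed: 0/1 bits, 1 if the page changed during the corresponding interval
    """
    waits = np.asarray(waits, dtype=float)
    changed = np.asarray(changed, dtype=bool)

    def neg_log_likelihood(rate):
        # P(change observed in an interval of length w) = 1 - exp(-rate * w)
        p_change = np.clip(1.0 - np.exp(-rate * waits), 1e-12, 1.0)
        # log P(no change in interval of length w) = -rate * w
        ll = np.where(changed, np.log(p_change), -rate * waits).sum()
        return -ll

    res = minimize_scalar(neg_log_likelihood,
                          bounds=(rate_min, rate_max), method="bounded")
    return res.x

# Example: a page changing at rate 2.0, polled at irregular intervals.
rng = np.random.default_rng(0)
true_rate = 2.0
waits = rng.exponential(0.5, size=500)
changed = rng.random(500) < 1.0 - np.exp(-true_rate * waits)
print(estimate_rate(waits, changed))  # should be close to 2.0
```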
However, the problem of learning the refresh rates of items while also trying to keep the cache as up-to-date as possible for incoming requests appears very challenging. On the one hand, the optimal policy found by the algorithm of Azar et al. (2018) does not allocate bandwidth to pages that are changing very frequently; on the other hand, rate estimates with low precision, especially for pages that are changing frequently, may result in a policy with non-vanishing regret. We formulate this web-crawling problem with unknown change rates as an online optimization problem for which we define a natural notion of regret, describe the conditions under which the refresh rates of the webpages can be learned, and show that, using a simple explore-and-commit algorithm, one can obtain regret of order O(√T).

Though in this paper we primarily investigate the problem of web crawling, our notion of regret and the observations we make about the learning algorithms can also be applied to other control problems which model the actions of agents as Poisson processes and the policies as intensities, including alternative objectives and observation regimes for the web-crawling problem itself. Such an approach is seen in recent works which model or predict social activities (Farajtabar 2018; Du et al. 2016), control online social actions (Zarezade et al. 2017; Karimi et al. 2016; Wang et al. 2017; Upadhyay, De, and Gomez-Rodriguez 2018), or control the spacing of items for optimal learning (Tabibian et al. 2019). All such problems admit online versions where the parameters of the models (e.g., the difficulty of items estimated from recall events (Tabibian et al. 2019), or the rates at which other broadcasters post messages (Karimi et al. 2016)) need to be learned while also optimising the policy the agent follows.

In Section 2, we formally describe the problem setup and formulate the objective with bandwidth constraints. Section 3 takes a closer look at the objective function and the optimal policy to describe the properties any learning algorithm should have. We propose an estimator for learning the parameters of a Poisson process under partial observability and provide guarantees on its performance in Section 4. Leveraging the bound on the estimator's performance, we propose a simple explore-and-commit algorithm in Section 5 and show that it achieves O(√T) regret. In Section 6, we test our algorithm using real data to justify our theoretical findings, and we conclude with future research directions in Section 7.

## 2 Problem Formulation

In this section, we consider the problem of keeping a cache of m webpages up-to-date by modelling the changes to those webpages, the requests for the pages, and the bandwidth constraints placed on a standard web crawler. We assume that the cache is empty when all processes start at time 0.

We model the changes to each webpage as a Poisson process with a constant rate. The parameters of these change processes are denoted by ξ = [ξ_1, …, ξ_m], where ξ_i > 0 is the rate of changes made to webpage i. We assume that ξ is not known to us; we know only an upper bound ξ_max and a lower bound ξ_min on the change rates. The crawler will learn ξ by refreshing the pages and observing the single-bit feedback described below. We denote the time at which webpage i changes for the nth time by x_{i,n}.
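As a minimal illustration of this change model (our sketch, not from the paper; the function names are ours, and the convention that the first feedback bit covers the interval starting at time 0 is our assumption), the following code simulates the change times of one page as a homogeneous Poisson process and produces the single-bit observations a crawler would receive for a given refresh schedule.

```python
import numpy as np

def simulate_change_times(rate, horizon, rng):
    """Change times of one page on [0, horizon]: a homogeneous Poisson process,
    i.e., i.i.d. exponential inter-change times with mean 1/rate."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > horizon:
            return np.array(times)
        times.append(t)

def single_bit_feedback(change_times, refresh_times):
    """For each refresh, 1 iff the page changed since the previous observation
    (taken to be time 0 before the first refresh) -- all the crawler ever sees."""
    bits, prev = [], 0.0
    for y in refresh_times:
        bits.append(int(np.any((change_times > prev) & (change_times <= y))))
        prev = y
    return np.array(bits)

rng = np.random.default_rng(1)
changes = simulate_change_times(rate=2.0, horizon=10.0, rng=rng)
print(single_bit_feedback(changes, refresh_times=[2.0, 4.0, 6.0, 8.0, 10.0]))
```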
We model the incoming requests for each webpage also as Poisson processes with constant rates and denote these rates by ζ = [ζ_1, ζ_2, …, ζ_m]. We will assume that these rates, which can also be interpreted as the importance of each webpage in our cache, are known to the crawler. We will denote the time at which webpage i is requested for the nth time by z_{i,n}. The change process and the request process, given their parameters, are assumed to be independent of each other.

We denote the time points at which page i is refreshed by the crawler by the sequence (y_{i,n})_{n≥1}. The feedback which the crawler gets after refreshing webpage i at time y_{i,n} consists of a single bit indicating whether the webpage has changed since the last observation, if any, which was made at time y_{i,n−1}.

Let E_i[t_0, t] denote the event that neither a change nor a refresh of webpage i has happened between times t_0 and t. Define FRESH(i, t) as the event that the webpage is fresh in the cache at time t. Defining the maximum of an empty set to be −∞, we have:

FRESH(i, t) = 0 if E_i[0, t] occurs, and otherwise
FRESH(i, t) = 1( max{ x_{i,j} : x_{i,j} ≤ t } ≤ max{ y_{i,n} : y_{i,n} ≤ t } ),

i.e., the cached copy is fresh at time t if the page has been refreshed at least once and the most recent refresh before t is no earlier than the most recent change before t.
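The following short sketch (ours; the helper name `is_fresh` is illustrative) evaluates this definition for a single page, treating the maximum of an empty set as −∞ exactly as above.

```python
import math

def is_fresh(change_times, refresh_times, t):
    """FRESH(i, t): the cached copy is fresh at time t iff the page has been
    refreshed at least once and the last refresh before t is no earlier than
    the last change before t. The max of an empty set is -inf."""
    last_change = max((x for x in change_times if x <= t), default=-math.inf)
    last_refresh = max((y for y in refresh_times if y <= t), default=-math.inf)
    # E_i[0, t]: neither a change nor a refresh has happened yet -> cache is empty.
    if last_change == -math.inf and last_refresh == -math.inf:
        return 0
    return int(last_change <= last_refresh)

# Example: changes at 1.0 and 4.0, refreshes at 2.0 and 5.0.
print(is_fresh([1.0, 4.0], [2.0, 5.0], 3.0))  # 1: refreshed after the change at 1.0
print(is_fresh([1.0, 4.0], [2.0, 5.0], 4.5))  # 0: changed at 4.0, not yet re-refreshed
```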