# Parsimonious Learning-Augmented Caching

Sungjin Im (UC Merced), Ravi Kumar (Google), Aditya Petety (UC Merced), Manish Purohit (Google)

Abstract

Learning-augmented algorithms, in which traditional algorithms are augmented with machine-learned predictions, have emerged as a framework to go beyond worst-case analysis. The overarching goal is to design algorithms that perform near-optimally when the predictions are accurate yet retain certain worst-case guarantees irrespective of the accuracy of the predictions. This framework has been successfully applied to online problems such as caching, where the predictions can be used to alleviate uncertainties. In this paper we introduce and study the setting in which the learning-augmented algorithm can utilize the predictions parsimoniously. We consider the caching problem, which has been extensively studied in the learning-augmented setting, and show that one can achieve quantitatively similar results while using only a sublinear number of predictions.

1. Introduction

Learning-augmented algorithms have recently emerged as a framework to strengthen traditional algorithms with machine-learned predictions. Traditional algorithm design focuses on formal guarantees for all inputs; hence, it is often geared towards working well on worst-case inputs rather than on typical, real-world instances. In contrast, machine learning (ML) performs extremely well on typical instances but can occasionally fail on rare instances. The learning-augmented framework aims to design algorithms that can benefit from machine-learned predictions while retaining worst-case guarantees. This framework was initiated by Kraska et al. (2018), who demonstrated that indexed data structures can be improved using learned predictions. Inspired by their work, Lykouris & Vassilvitskii (2018) studied the classic online caching problem and obtained an algorithm whose performance guarantee gracefully degrades as the prediction quality worsens but remains robust regardless of the prediction quality. The learning-augmented framework has since found applications in streaming algorithms, data structures, and particularly online algorithms, where predictions can alleviate the uncertainty about unseen future inputs; see the survey by Mitzenmacher & Vassilvitskii (2020).

In this paper we focus on an important yet largely overlooked aspect of previous works: the cost of obtaining predictions. Predictions are typically obtained from a machine-learned model, which can be computationally expensive; this makes it highly desirable to use predictions parsimoniously. We study online caching in the learning-augmented framework in which hints are used sparingly.

Online caching. In the online caching problem, a sequence of page requests arrives at a cache of size k. If the requested page is in the cache, then it can be served at no extra cost; otherwise a cache miss occurs and the missing page must be fetched into the cache. The goal of an online algorithm is to minimize the number of cache misses. A number of randomized algorithms are known to be $\Theta(\log k)$-competitive (Achlioptas et al., 2000; Fiat et al., 1991), meaning that they incur at most $O(\log k)$ times more cache misses than the offline optimal solution for all inputs, and this is the best possible.
Belady's furthest-in-future algorithm (Belady, 1966), which always evicts the page whose next request is furthest in the future, is well known to be the optimal offline algorithm.

To exploit predictions, Lykouris & Vassilvitskii (2018) proposed an algorithm that assumes knowledge of the predicted next request time of every page in the cache. For the $\ell_1$-norm prediction error with respect to the actual arrival times, they showed that the algorithm's competitive ratio improves to O(1) as the error tends to 0 and remains $O(\log k)$ always. These bounds have recently been quantitatively improved by Rohatgi (2020) and Wei (2020).

Our contributions. We show that we can use significantly fewer predictions for online caching and still obtain results similar to the aforementioned work. More precisely, we allow our algorithm to query b pages in the cache to learn their predicted next arrival times on each cache miss. We show that we can obtain an $O(\log_{b+1} k)$-competitive ratio for good predictions while retaining the $O(\log k)$-competitive ratio always (Theorem 11). Thus, as long as the cache miss rate is $1/k^{\epsilon}$ for some constant $\epsilon > 0$, we can obtain a constant $O(\log \frac{1}{\epsilon})$-competitive ratio using a number of queries that is sublinear in the number of page requests. We also show that our trade-off is near-optimal (Theorem 12). Our experiments show that even with very few queries, e.g., two queries per cache miss, we can significantly improve upon traditional caching algorithms in practice. The experimental results also demonstrate that we can match (and even exceed) the performance of prior learning-augmented algorithms while querying only 11% of the page requests.

As is typical in unweighted caching, our algorithm is based on the randomized marking algorithm. However, instead of evicting a randomly chosen unmarked page on a cache miss, our algorithm queries b unmarked pages in the cache and evicts the one with the furthest predicted request time. At a high level, if there are k unmarked pages, then we can show that, provided all the predictions are correct, the evicted page is in expectation not requested before k/(b+1) other unmarked pages in the cache. Using this we can show how to reduce the number of cache misses. While this idea is easy to state, the analysis is delicate because the prediction error is defined only over the pages that were queried. To retain an $O(\log k)$-competitive ratio, we adapt the techniques of Lykouris & Vassilvitskii (2018) and switch to the randomized marking strategy once we detect that the algorithm has made too many mistakes. The lower bound is shown by an explicit but intricate construction.

Related work. Online caching has been extensively studied in the literature. For generalizations of caching, including the k-server problem, see (Koutsoupias & Papadimitriou, 1995; Bansal et al., 2015; Bubeck et al., 2018; Lee, 2018; Bansal et al., 2012; Adamaszek et al., 2012). In order to circumvent the pessimistic lower bounds in the worst-case setting, a number of alternative models have been proposed to capture properties of real-world instances, such as the access graph model (Borodin et al., 1991; Fiat & Mendel, 1997; Irani et al., 1996), Markov paging (Karlin et al., 2000), and interleaved paging (Barve et al., 2000; Kumar et al., 2020). The reader is referred to the book by Borodin & El-Yaniv (2005) for a general overview of online algorithms.
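Belady's rule is simple to implement offline. The following minimal Python sketch (our illustration; the function name and structure are ours, not the paper's) counts the misses incurred by the furthest-in-future policy, which serves as the OPT baseline in competitive-ratio calculations throughout.

```python
def belady_cost(requests, k):
    """Offline optimum (Belady, 1966): on a miss with a full cache,
    evict the cached page whose next request is furthest in the future.
    Returns the number of cache misses."""
    INF = float("inf")
    # Precompute, for each position i, the next time requests[i] appears again.
    next_use = [INF] * len(requests)
    last_seen = {}
    for i in range(len(requests) - 1, -1, -1):
        next_use[i] = last_seen.get(requests[i], INF)
        last_seen[requests[i]] = i

    cache = {}  # page -> time of that page's next request
    misses = 0
    for i, p in enumerate(requests):
        if p in cache:
            cache[p] = next_use[i]
            continue
        misses += 1
        if len(cache) >= k:
            victim = max(cache, key=cache.get)  # furthest next request
            del cache[victim]
        cache[p] = next_use[i]
    return misses
```

For instance, `belady_cost(seq, k)` provides the OPT value against which the online algorithms below are compared.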
Learning-augmented algorithms largely fall under the rubric of "beyond worst-case" algorithms; see (Roughgarden, 2020) for an extensive survey of the field. They have recently been extensively explored, particularly for online algorithms, including load balancing (Lattanzi et al., 2020; Li & Xian, 2021), rent-or-buy (Kumar et al., 2018), scheduling (Azar et al., 2021), online set cover (Bamas et al., 2020), metrical task systems (Antoniadis et al., 2020), and many others. For online caching, its weighted version has been studied in (Jiang et al., 2020; Bansal et al., 2022). The problem of learning-augmented algorithms with a sublinear number of queries was recently studied by Bhaskara et al. (2021), but in the regret setting for online linear optimization. Our paper studies an analogous question for caching, but in the competitive ratio setting.

The online algorithms with advice model (Boyar et al., 2017) is loosely related to learning-augmented algorithms. Here the goal is to quantify the amount of advice necessary to obtain a (near-)optimal solution; the model assumes the advice is error-free. For online caching, Dobrev et al. (2009) show that one bit of advice per page request suffices to obtain an optimal solution, where the advice is whether a page should be kept in the cache until it is requested next. Unfortunately, such advice is derived from the optimal solution itself and is arguably hard to learn; in contrast, predicting the next arrival time of each page, which is our setting, is a much easier learning task.

2. Problem definition

Let U denote a universe of pages and k be the number of distinct pages that can be held in the cache at any time. In the classical unweighted caching problem, a sequence $\Gamma = p_1, p_2, \ldots$, where each $p_i \in U$, of page requests arrives online and the algorithm is required to maintain a set of at most k pages in the cache at any time. At any time t, if the currently requested page $p_t$ is not in the cache, then the algorithm incurs a cache miss and must fetch the requested page into the cache (possibly by evicting some other page). The objective of the online algorithm is to minimize the total number of cache misses incurred. Note that an online algorithm has to choose the page to be evicted without knowing $\Gamma$. We measure its performance by comparing against the furthest-in-the-future (FiF) algorithm of Belady (1966), which is the optimal offline algorithm that knows $\Gamma$. Let $\mathrm{cost}_\Gamma(\cdot)$ be the total number of cache misses of an algorithm for the request sequence $\Gamma$ and let $\mathrm{OPT}_\Gamma = \mathrm{cost}_\Gamma(\mathrm{FiF})$ be the optimal offline cost. An online (randomized) algorithm A is said to be c-competitive if for all request sequences $\Gamma$, it holds that $E[\mathrm{cost}_\Gamma(A)] \le c \cdot \mathrm{OPT}_\Gamma + b$, where $b \ge 0$ is a constant independent of the length of $\Gamma$ and the expectation is over the randomness (if any) of A. For brevity, from now on we will work with a given $\Gamma$ and omit it from all subsequent notation.

In the usual learning-augmented setting, at each time t, along with the requested page $p_t$, the algorithm is presented with a (possibly noisy) prediction $\tau_t \in \mathbb{N}$ for the next time after t that the page $p_t$ will be requested again; hence the predicted arrival time of the next request is available for every page in the cache. In the learning-augmented setting with queries, at any time t and for any page p that is in the cache, the algorithm is allowed to query a possibly noisy (stochastic) oracle Q for the time, after t, of the next request for p. Let $\tau_{p,t} = Q(p, t)$ denote such a predicted arrival time of the next request to page p after time t, and let $a_{p,t} \ge t$ denote the actual arrival time of the next request to page p. Let $\mathcal{Q}$ be the set of queries made to Q. We define the error of the oracle to be $\eta = \sum_{(p,t) \in \mathcal{Q}} |\tau_{p,t} - a_{p,t}|$.
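To make the query model concrete, here is a minimal Python sketch (ours, not the paper's) of a noisy oracle that tallies the error η over exactly the queried pairs. The class name `QueryOracle` and the `noise` hook are illustrative assumptions.

```python
class QueryOracle:
    """Sketch of the noisy query oracle Q.  `arrivals[p]` is the sorted list
    of actual request times of page p; `noise()` models the oracle's error."""

    def __init__(self, arrivals, noise, horizon):
        self.arrivals = arrivals  # page -> sorted actual request times
        self.noise = noise        # callable returning a perturbation
        self.horizon = horizon    # stand-in arrival time for "never requested again"
        self.error = 0            # running l1 error, accumulated over queries only

    def next_arrival(self, p, t):
        # Actual next request time of page p after time t (horizon if none).
        return next((a for a in self.arrivals.get(p, []) if a > t), self.horizon)

    def query(self, p, t):
        a = self.next_arrival(p, t)
        tau = max(t + 1, round(a + self.noise()))  # noisy prediction tau_{p,t}
        self.error += abs(tau - a)                 # eta counts only queried pairs
        return tau
```

Note that `error` matches the definition of η above: pages that are never queried contribute nothing, which is exactly what makes the parsimonious analysis delicate.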
The learning-augmented setting with queries generalizes many well-studied caching problems. On one hand, if the algorithm makes no queries to the oracle, then it is the standard caching problem and we can get an $O(\log k)$-competitive solution, say, with the randomized marking algorithm (see Section 3.1). On the other hand, if the oracle is error-free and the algorithm queries it at every time step, Belady's algorithm yields the optimal solution. In recent work, Lykouris & Vassilvitskii (2018), Wei (2020), and Rohatgi (2020) designed learning-augmented caching algorithms for noisy oracles, showing a tight trade-off between the error of the oracle and the competitive ratio of the algorithm; their algorithms, however, query the oracle at every time step. The question we ask in this paper is: can we get similar trade-offs using far fewer queries?

3. Preliminaries

A pair $(p, t_1)$ and $(q, t_2)$ of queries in $\mathcal{Q}$ is called an inversion if $\tau_{p,t_1} \ge \tau_{q,t_2}$ but $a_{p,t_1} < a_{q,t_2}$, i.e., the next request of page p is earlier than that of q although the predictions indicated otherwise. Let $I = |\{((p, t_1), (q, t_2)) : \tau_{p,t_1} \ge \tau_{q,t_2} \text{ but } a_{p,t_1} < a_{q,t_2}\}|$ be the number of inversions. The following relates the number of inversions to the error.

Lemma 1 (Diaconis & Graham (1977); Rohatgi (2020)). For any request sequence $\Gamma$ and any set $\mathcal{Q}$ of queries, $I \le 2\eta$.

3.1. Marking algorithms

Marking algorithms are a class of caching algorithms that associate a marking bit with each page in the cache and, upon a cache miss, only evict an unmarked page from the cache. Formally, the algorithm first divides the request sequence into phases, where a phase is a maximal contiguous sequence of requests to only k distinct pages. At the beginning of each phase, all pages in the cache are unmarked. Pages that are requested during the phase get marked one by one, and upon any cache miss the algorithm only evicts some unmarked page. Once all the pages in the cache have been marked, a new phase begins and the process repeats. Algorithm 1 shows the pseudocode for a generic marking algorithm. It is well known that any marking algorithm is O(k)-competitive and that the randomized marking algorithm (Fiat et al., 1991), which evicts an unmarked page chosen uniformly at random, is O(H(k))-competitive, where $H(k) := 1 + \frac{1}{2} + \cdots + \frac{1}{k} = \Theta(\log k)$.

Algorithm 1: A generic marking algorithm.
  for each requested page p do
    if p in cache then
      Mark p
    else
      if all pages in cache are marked then
        Unmark all pages
      end
      Evict an unmarked page
      Fetch p into the cache and mark it
    end
  end

Consider any phase h and an arbitrary page p that is requested in phase h. We say that page p is clean if p was not requested in the previous phase (i.e., phase $h-1$), and we say p is stale otherwise. Note that once k is known, the phases of the sequence as well as the clean and stale pages are determined, independent of the algorithm. Let $\ell_h$ denote the total number of distinct clean pages requested in phase h. The following result bounds the number of cache misses incurred by the optimal offline algorithm in terms of the number of distinct clean pages.

Lemma 2 (Fiat et al. (1991)). $\frac{1}{2} \sum_h \ell_h \le \mathrm{OPT} \le \sum_h \ell_h$.
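For concreteness, the randomized marking algorithm can be sketched as follows. This is our illustrative Python rendering of Algorithm 1 with uniformly random evictions, not code from the paper.

```python
import random

def randomized_marking(requests, k, rng=random):
    """Sketch of randomized marking (Fiat et al., 1991): on each miss,
    evict a uniformly random unmarked page.  O(H(k))-competitive."""
    cache, marked = set(), set()
    misses = 0
    for p in requests:
        if p in cache:
            marked.add(p)
            continue
        misses += 1
        if len(cache) >= k:
            if marked >= cache:      # every cached page is marked: new phase
                marked.clear()
            victim = rng.choice(sorted(cache - marked))  # random unmarked page
            cache.discard(victim)
        cache.add(p)
        marked.add(p)
    return misses
```

The `marked >= cache` test implements the phase boundary of Algorithm 1: all pages in the cache are marked, so everything is unmarked before the next eviction.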
4. Warm-up: Modified marking algorithm

We first show how the classic randomized marking algorithm (Fiat et al., 1991) can be modified to effectively use predictions while making fewer queries. For ease of exposition, we assume for now that the oracle is error-free; we extend the analysis to handle noisy predictions in Section 4.1.

We consider the following modification to the marking algorithm: whenever a page needs to be evicted, if there are at least $\epsilon k$ unmarked pages in the cache, then evict an unmarked page chosen uniformly at random; otherwise, query Q(p, t) for all unmarked pages p and evict the page whose next request appears furthest in the future (i.e., apply Belady's rule). We remark that once we have queried all the remaining unmarked pages in a phase, we can simply reuse these predictions for any further cache misses, and hence we make at most $\epsilon k$ queries in any phase. Algorithm 2 describes this naive eviction policy formally.

Algorithm 2: Naive eviction.
  Function Evict():
    Data: U, the set of unmarked pages in the cache
    Result: the page to be evicted
    if $|U| \ge \epsilon k$ then
      return a uniformly random page from U
    else
      if we have not already queried the pages in U in this phase then
        foreach page p in U do let $\tau_p \leftarrow Q(p, t)$
      return $\arg\max_{p \in U} \tau_p$

Theorem 3. For any $\epsilon > 0$ and any request sequence $\Gamma$, there is an $O(\log(1/\epsilon))$-competitive algorithm for caching that makes at most $\epsilon|\Gamma|$ queries.

Proof. Consider any phase h of the marking algorithm and let $\ell_h$ be the number of clean pages in that phase. Let $p_1, \ldots, p_k$ denote the k pages in the cache at the beginning of the phase and further suppose that the pages are sorted in order of the arrival time of the first request to each page in this phase (breaking ties arbitrarily). In other words, pages $p_1, \ldots, p_{k-\ell_h}$ are the $k-\ell_h$ stale pages requested in this phase and, further, the first request to page $p_i$ is earlier than that to page $p_j$ for any $i < j$.

Consider any stale page $p_i$ where $1 \le i \le k - \epsilon k$, and let $\ell(i) \le \ell_h$ be the number of clean pages that have been requested before the first request to page $p_i$. When the first request to page $p_i$ arrives, there are exactly $k-i+1$ unmarked stale pages, of which $\ell(i)$ pages have been evicted from the cache uniformly at random. Hence, the algorithm incurs a cache miss for page $p_i$ with probability $\frac{\ell(i)}{k-i+1} \le \frac{\ell_h}{k-i+1}$.

Finally, consider the first request to page $p_{k-\epsilon k}$. If the algorithm incurs any cache miss after this time, then it queries all the $\epsilon k$ remaining unmarked pages and evicts the page whose next request is furthest in the future, i.e., it evicts a page from the set $\{p_{k-\ell_h+1}, \ldots, p_k\}$ of pages that are not requested in this phase. Thus, for any $i > k - \epsilon k$, page $p_i$ incurs a cache miss only if it has already been evicted by the time page $p_{k-\epsilon k}$ is first requested. Thus, any such page $p_i$ incurs a cache miss with probability at most $\min\{1, \frac{\ell_h}{\epsilon k}\}$.

By the linearity of expectation, summing over all pages $p_i$, the expected number of cache misses incurred by the algorithm for stale pages is at most

  $\sum_{i=1}^{k-\epsilon k} \frac{\ell_h}{k-i+1} + \sum_{i=k-\epsilon k+1}^{k-\ell_h} \frac{\ell_h}{\epsilon k} \;\le\; \ell_h (H_k - H_{\epsilon k}) + \ell_h \;=\; O\big(\ell_h \log \tfrac{1}{\epsilon}\big).$

In addition, the algorithm also incurs $\ell_h$ additional cache misses for the clean pages. Hence, the total expected number of cache misses incurred in phase h is $O(\ell_h \log \frac{1}{\epsilon})$. The desired competitive ratio now follows from Lemma 2. To bound the total number of queries, we observe that the algorithm makes at most $\epsilon k$ queries in each phase. Since each phase has at least k requests, any request sequence $\Gamma$ has at most $|\Gamma|/k$ phases, and thus the total number of queries is at most $\epsilon|\Gamma|$.
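The eviction rule of Algorithm 2 can be sketched in Python as follows, reusing the illustrative `QueryOracle` from Section 2. The function signature is our assumption; the caller is expected to reset `cached_preds` at each phase boundary so that at most $\epsilon k$ queries are made per phase.

```python
import random

def naive_evict(unmarked, oracle, t, eps, k, cached_preds, rng=random):
    """Sketch of Algorithm 2 (naive eviction).  `unmarked` is the set of
    unmarked cached pages; `cached_preds` caches this phase's predictions
    so each remaining page is queried at most once per phase."""
    if len(unmarked) >= eps * k:
        return rng.choice(sorted(unmarked))   # many unmarked pages: evict at random
    for p in unmarked:
        if p not in cached_preds:             # query each page at most once
            cached_preds[p] = oracle.query(p, t)
    return max(unmarked, key=lambda p: cached_preds[p])  # furthest predicted request
```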
4.1. Handling prediction errors

Let us now consider the case where the oracle can give erroneous predictions. Since Algorithm 2 does not utilize predictions as long as there are at least $\epsilon k$ unmarked pages left in the cache, we only need to reconsider the cache misses that occur after there are fewer than $\epsilon k$ unmarked pages left. Consider any page $p_i$ for $i > k - \epsilon k$ and let t denote the time when the algorithm queries all the remaining unmarked pages. Suppose the algorithm incurs a cache miss on page $p_i$ and evicts page $q = \arg\max_{p \in U} \tau_{p,t}$. If the predictions are correct, then q belongs to the set $\{p_{k-\ell_h+1}, \ldots, p_k\}$ of pages that are not requested in this phase. However, suppose the predictions are incorrect and page q is requested in this phase; then the algorithm incurs an additional cache miss. But in this case the pair (q, t) and $(p_k, t)$ of queries is an inversion, and we can charge the additional cache miss to this inversion. Since we incur at most one cache miss per page, it can be easily verified that we charge at most one cache miss to any specific inversion.

Let $I_h$ be the total number of inversions for queries made in phase h. From the above discussion, the expected number of cache misses incurred by the algorithm in phase h is at most $O(\ell_h \log \frac{1}{\epsilon} + E[I_h])$. Hence, the total cost incurred over all phases is $O\big((\sum_h \ell_h) \log \frac{1}{\epsilon} + E[I]\big)$, where the expectation is over the randomness in the pages evicted by the algorithm. Using Lemma 1 and Lemma 2, we conclude that the total cost incurred by Algorithm 2 is at most $O(\mathrm{OPT} \log(1/\epsilon) + E[\eta])$ and obtain the following:

Theorem 4. For any $\epsilon > 0$, there is an $O(\log(1/\epsilon) + E[\eta]/\mathrm{OPT})$-competitive algorithm for caching that makes at most $\epsilon|\Gamma|$ queries.

While this warm-up result is a proof of concept for the parsimonious use of predictions, to achieve a constant competitive ratio we still need a number of queries that is linear in the request sequence length. To overcome this weakness, we propose a new algorithm in the following section that is more adaptive in deciding which pages to query.

5. Adaptive query algorithm

In this section we present a new algorithm that queries b unmarked pages chosen uniformly at random on each cache miss and evicts the one that is predicted to be requested furthest in the future. We call this the adaptive query algorithm (AdaptiveQuery-b); see Algorithm 3. Here, b is a parameter that governs the trade-off between the desired competitive ratio and the number of queries we are willing to make per cache miss.

Algorithm 3: Adaptive query eviction.
  Function Evict():
    Data: U, the set of unmarked pages in the cache, and an integer b > 0
    Result: the page to be evicted
    S <- sample of b pages from U, uniform without replacement
    let $\tau_p \leftarrow Q(p, t)$ for all p in S
    return $\arg\max_{p \in S} \tau_p$

As before, we first analyze the algorithm assuming the predictions are all correct. We will show the following trade-off in Section 5.1.

Theorem 5. Under the assumption that the oracle is error-free, for any integer b > 0, the adaptive query algorithm is $2(\log_{b+1} k + 3)$-competitive and makes at most $2b(\log_{b+1} k + 3) \cdot \mathrm{OPT}$ queries in expectation.

This bound is shown to be nearly tight in Section 6; see Theorem 12. We extend Theorem 5 in Section 5.2 to handle error-prone predictions.
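A minimal sketch of Algorithm 3's eviction rule, again using the illustrative `QueryOracle`; sampling fewer than b pages when fewer remain is our (natural) assumption.

```python
import random

def adaptive_query_evict(unmarked, oracle, t, b, rng=random):
    """Sketch of Algorithm 3 (AdaptiveQuery-b): query b random unmarked
    pages and evict the one predicted to be requested furthest in the future."""
    sample = rng.sample(sorted(unmarked), min(b, len(unmarked)))
    return max(sample, key=lambda p: oracle.query(p, t))
```

Note the query accounting built into the design: the algorithm issues at most b queries per miss, so a c-competitive execution makes at most $cb \cdot \mathrm{OPT}$ queries, which is exactly the bookkeeping used in Theorem 5.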
5.1. Analysis

If we show that the adaptive query algorithm is c-competitive, then it immediately follows that the number of queries made is at most $cb \cdot \mathrm{OPT}$. Thus, we only need to establish the desired competitive ratio.

Consider any fixed phase h of the marking algorithm and let $f_1, \ldots, f_{\ell_h}$ be the clean pages requested in that phase. We consider the following notion of eviction chains (Lykouris & Vassilvitskii, 2018) for the sake of analysis. An eviction chain $C_i = (q_{i,0} := f_i, q_{i,1}, \ldots, q_{i,M_i})$ is a sequence of pages constructed as follows: $q_{i,1}$ is the stale page that is evicted by the algorithm when it serves the clean page $f_i$; similarly, for all $j \ge 1$, $q_{i,j+1}$ is the stale page that gets evicted when the algorithm serves the request to page $q_{i,j}$. Eventually, a stale page $q_{i,M_i}$ that is not requested in the phase gets evicted and the sequence ends. We note that each eviction chain starts with a distinct clean page and ends with a stale page that is not requested in the phase. Further, the $\ell_h$ eviction chains are disjoint and each cache miss incurred by the algorithm is encoded in these chains. The ith eviction chain $C_i$ leads to $M_i$ cache misses, where $M_i$ is a random variable. Our goal is to bound the expected total number of cache misses, i.e., $E[\sum_{i=1}^{\ell_h} M_i]$.

Page ranks. We first order all clean pages and stale pages in the cache by the arrival time of the first request to that page in this phase (the $\ell_h$ stale pages that are not requested in the phase appear last in the ordering, in an arbitrary order). For each stale page p evicted by the algorithm, we define its rank r(p) as the number of stale pages after page p in the above ordering that have not yet been evicted (at the time p was evicted). By construction of the eviction chains, page $q_{i,j}$ is evicted when page $q_{i,j-1}$ is requested, and hence $q_{i,j}$ is after $q_{i,j-1}$ in the ordering (since all pages before $q_{i,j-1}$ in the ordering have already been marked). Hence, we always have $r(q_{i,j}) \le r(q_{i,j-1})$. Similarly, for each clean page $f_i$, we define its rank $r(f_i) = r(q_{i,0})$ to be the number of stale pages after $f_i$ that have not yet been evicted when $f_i$ was requested. Note that $r(f_i) \le k$ for all $1 \le i \le \ell_h$.

We first show the following simple statement that uses the order statistics of the uniform distribution. We defer its proof to the Supplementary Material.

Lemma 6. If $S = \{s_1, \ldots, s_b\}$ is a set sampled uniformly at random without replacement from $\{0, 1, \ldots, r\}$, then $E[\min_{t \in [b]} s_t] \le \frac{r}{b+1}$.
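Lemma 6 is easy to check empirically. The following Monte Carlo sketch (ours, for illustration only) estimates $E[\min S]$ and compares it against the bound $r/(b+1)$.

```python
import random

def min_of_sample(r, b, trials=200_000, rng=random.Random(0)):
    """Estimate E[min S] for S sampled without replacement from {0, ..., r};
    Lemma 6 bounds this by r / (b + 1)."""
    total = sum(min(rng.sample(range(r + 1), b)) for _ in range(trials))
    return total / trials

# e.g., r = 100, b = 4: the exact expectation is (r + 1 - b)/(b + 1) = 19.4,
# and the estimate lands near it, below the Lemma 6 bound 100 / 5 = 20.
```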
The following two lemmas are used to bound the expected length of an eviction chain.

Lemma 7. Consider any eviction chain $C_i$ and suppose it evicts page $q_{i,j+1}$ to serve a request to page $q_{i,j}$. Then we have $E[r(q_{i,j+1}) \mid r(q_{i,j})] \le \frac{r(q_{i,j})}{b+1}$.

Proof. When a cache miss occurs for page $q_{i,j}$, note that all pages that appear before $q_{i,j}$ (when ordered by the arrival time of their first request in the phase) have already been marked. Thus, all the queried stale pages must appear after $q_{i,j}$. Suppose there are $r \le r(q_{i,j})$ unmarked stale pages left. When the predictions are all correct, Algorithm 3 randomly samples b pages from all unmarked stale pages and evicts, as $q_{i,j+1}$, the one that is latest in the ordering. In other words, $r(q_{i,j+1})$ is the minimum of b uniform samples from $\{0, 1, \ldots, r-1\}$. Thus, from Lemma 6, we have $E[r(q_{i,j+1}) \mid r(q_{i,j})] \le \frac{r-1}{b+1} \le \frac{r(q_{i,j})}{b+1}$.

Lemma 8. For every $1 \le i \le \ell_h$, we have $E[M_i] \le \log_{b+1} k + 3$, where $M_i$ is the length of the eviction chain beginning with the clean page $f_i$.

Proof. An eviction chain ends when it evicts one of the $\ell_h$ stale pages that are not requested in the phase. Fix a particular chain $C_i$ and, for brevity, let $r_j := r(q_{i,j})$. Since $r_0 \le k$, using Lemma 7 and the law of iterated expectations, we have $E[r_j] \le \frac{k}{(b+1)^j}$. By Markov's inequality, we have $\Pr[r_j \ge 1] \le \frac{k}{(b+1)^j}$. Note that if $M_i > j$, then it must be the case that $r_j \ge 1$; indeed, if $r_j = 0$, then no page $q_{i,j+1}$ gets evicted and the chain $C_i$ must have length j. Let $c := \lceil \log_{b+1} k \rceil$. We can now bound the expected length of the chain as follows:

  $E[M_i] = \sum_{j \ge 0} \Pr[M_i > j] \le c + \sum_{j \ge 0} \Pr[M_i > c+j] \le c + \sum_{j \ge 0} \Pr[r_{c+j} \ge 1] \le c + \sum_{j \ge 0} \frac{k}{(b+1)^{c+j}} \le c + \sum_{j \ge 0} \frac{1}{(b+1)^j} \le \log_{b+1} k + 3.$

Proof of Theorem 5. Fix a phase h of the marking algorithm. Since every cache miss incurred by the algorithm is recorded in exactly one eviction chain, the expected total number of cache misses incurred by the algorithm in this phase is $E[\sum_{i=1}^{\ell_h} M_i]$. Using Lemma 8, this is at most $\ell_h(\log_{b+1} k + 3)$. The desired competitive ratio now follows from Lemma 2. Further, the algorithm makes at most b queries for each cache miss it incurs, and thus we have Theorem 5.

5.2. Handling prediction errors

In this section we extend the analysis to allow for oracles that make erroneous predictions. In this scenario, since we evict a page that is only predicted to arrive furthest in the future (and may not actually be the one to arrive latest), Lemma 7 fails. However, as we show below, in this case the oracle has a large error, and we can bound the expected cost of the algorithm in terms of the prediction error. We first show the following technical statement that relates the rank of the page evicted by the algorithm to the rank of the page that actually arrives furthest in the future among the sampled pages (while processing any cache miss).

Lemma 9. Let S be any set of b pages, let $a_1 < \cdots < a_b$ denote their actual next arrival times, and let $\tau_1, \ldots, \tau_b$ be the corresponding predicted arrival times. Let $\eta_S = \sum_{i=1}^{b} |a_i - \tau_i|$ be the $\ell_1$-error of the predictions for the set S. If $\hat{b} = \arg\max_\alpha \tau_\alpha$ is the index of the page with the furthest predicted arrival time, then we have $r(\hat{b}) \le r(b) + \eta_S$.

Proof. We assume that $\hat{b} \ne b$ since otherwise the lemma is trivial. By the definition of $\hat{b}$ we have $\tau_{\hat{b}} \ge \tau_b$ and $a_{\hat{b}} < a_b$. For convenience, let $\hat{p}$ and p denote the corresponding pages. Since the number of unmarked pages between $\hat{p}$ and p, when ordered by the arrival time of their first request, is at most $a_b - a_{\hat{b}}$, by the definition of rank we have $r(\hat{b}) \le r(b) + a_b - a_{\hat{b}}$. We now show $\eta_S \ge a_b - a_{\hat{b}}$, which will complete the proof. To show this, we observe that $\eta_S \ge |a_{\hat{b}} - \tau_{\hat{b}}| + |a_b - \tau_b|$. We consider three cases.

Case 1: $\tau_{\hat{b}} \le a_{\hat{b}}$. In this case we have $\eta_S \ge |a_b - \tau_b| = (a_b - a_{\hat{b}}) + (a_{\hat{b}} - \tau_b) \ge a_b - a_{\hat{b}}$.

Case 2: $a_b \ge \tau_{\hat{b}} > a_{\hat{b}}$. Since $\tau_b \le \tau_{\hat{b}}$, we have $\eta_S \ge |a_{\hat{b}} - \tau_{\hat{b}}| + |a_b - \tau_b| = (\tau_{\hat{b}} - a_{\hat{b}}) + (a_b - \tau_b) \ge a_b - a_{\hat{b}}$.

Case 3: $\tau_{\hat{b}} > a_b$. Here we have $\eta_S \ge |a_{\hat{b}} - \tau_{\hat{b}}| = \tau_{\hat{b}} - a_{\hat{b}} > a_b - a_{\hat{b}}$.

Lemma 9 lets us prove the following analog of Lemma 8.

Lemma 10. For every $1 \le i \le \ell_h$, we have $E[M_i] \le \log_{b+1} k + 3 + 2E[\eta_{S_i}]$, where $M_i$ is the length of the eviction chain beginning with the clean page $f_i$ and $S_i$ is the set of queries made when pages on the chain $C_i$ are evicted.

Proof. Fix a particular chain $C_i$. Let $r(q_{i,j})$ be the number of stale unmarked pages left in the cache when the algorithm incurs a cache miss for page $q_{i,j}$ at some time t. In this case, we sample a set S of b of those pages uniformly at random and set $q_{i,j+1} = \arg\max_{p \in S} Q(p, t)$. Let $q'_{i,j+1} = \arg\max_{p \in S} a_{p,t}$ be the sampled page that actually arrives furthest in the future.
Then by Lemma 7, we have $E[r(q'_{i,j+1}) \mid r(q_{i,j})] \le \frac{r(q_{i,j})}{b+1}$. Further, for any queried set S of pages, by Lemma 9 we have $r(q_{i,j+1}) \le r(q'_{i,j+1}) + \eta_S$, where $\eta_S$ is the $\ell_1$-error of the predictions for the set S. Thus we obtain the following, where $\eta_{i,j+1}$ is defined to be the prediction error of the oracle for the pages queried while evicting page $q_{i,j+1}$:

  $E[r(q_{i,j+1}) \mid r(q_{i,j})] \le \frac{r(q_{i,j})}{b+1} + E[\eta_{i,j+1}].$

Now, since we have $r(q_{i,0}) \le k$, using the law of iterated expectations we have

  $E[r(q_{i,j})] \le \frac{k}{(b+1)^j} + \sum_{j' \le j} \frac{E[\eta_{i,j'}]}{(b+1)^{j-j'}}.$

Finally, using Markov's inequality, we have

  $\Pr[M_i > j] \le \Pr[r(q_{i,j}) \ge 1] \le \frac{k}{(b+1)^j} + \sum_{j' \le j} \frac{E[\eta_{i,j'}]}{(b+1)^{j-j'}}.$

As earlier, let $c = \lceil \log_{b+1} k \rceil$. We can now bound the expected length of the chain:

  $E[M_i] \le c + \sum_{j \ge 0} \Pr[M_i > c+j] \le c + \sum_{j \ge 0} \Big( \frac{k}{(b+1)^{c+j}} + \sum_{j' \le c+j} \frac{E[\eta_{i,j'}]}{(b+1)^{c+j-j'}} \Big) \le c + \sum_{j \ge 0} \frac{1}{(b+1)^j} + \sum_{j' \ge 1} E[\eta_{i,j'}] \sum_{j \ge 0} \frac{1}{(b+1)^j} \le \log_{b+1} k + 3 + 2\sum_{j' \ge 1} E[\eta_{i,j'}].$

Finally, since the algorithm makes a distinct set of queries when evicting each page, we have $\sum_{j' \ge 1} E[\eta_{i,j'}] = E[\eta_{S_i}]$ and the lemma follows.

5.3. Adding worst-case guarantees

In this section we show how a simple modification to the algorithm allows us to obtain an $O(\log k)$-competitive ratio even when the prediction error is arbitrarily large. In order to obtain this worst-case guarantee, we make the following modification: when processing a cache miss for the jth page $q_{i,j}$ on chain $C_i$, if $j > \log k$, then the algorithm switches to evicting an unmarked stale page uniformly at random (as opposed to querying b pages and evicting the one with the furthest predicted arrival).

Theorem 11. For any integer b > 0, there is an $O(\min\{\log_{b+1} k + E[\eta]/\mathrm{OPT},\, \log k\})$-competitive algorithm for caching that makes at most b queries per cache miss.

Proof. When pages are evicted according to Algorithm 3, Lemma 10 shows that the expected length of any eviction chain is at most $O(\log_{b+1} k + E[\eta_{S_i}])$. We consider the modified algorithm that switches to evicting a uniformly random unmarked page once the chain length exceeds $\log k$. Following the traditional analysis of the randomized marking algorithm (Fiat et al., 1991; Lykouris & Vassilvitskii, 2018), we observe that once the algorithm switches to random evictions, the length of the chain increases by at most $O(\log k)$ in expectation. Consequently, the modified algorithm incurs at most $O(1) \cdot \min\{\log_{b+1} k + 3 + 2E[\eta_{S_i}],\, 2\log k\}$ cache misses in expectation on each eviction chain. Summing over all clean pages seen in the phase, the expected number of cache misses incurred in any phase h is at most $O(\min\{\ell_h \log_{b+1} k + E[\eta_h],\, \ell_h \log k\})$, where $\eta_h$ is defined to be the $\ell_1$-error of all the queries made in this phase. The desired competitive ratio now follows from Lemma 2.
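The modified eviction rule can be sketched as follows. This is our illustrative rendering of the Section 5.3 modification, with the per-chain length bookkeeping left to the caller.

```python
import math
import random

def robust_evict(unmarked, oracle, t, b, chain_len, k, rng=random):
    """Sketch of the Section 5.3 rule: follow AdaptiveQuery-b while the
    current eviction chain is short, then fall back to a uniformly random
    unmarked page once the chain exceeds log k, capping the worst case at
    O(log k).  `chain_len` is the length of the eviction chain so far."""
    if chain_len > math.log(k):
        return rng.choice(sorted(unmarked))      # worst-case fallback, no queries
    sample = rng.sample(sorted(unmarked), min(b, len(unmarked)))
    return max(sample, key=lambda p: oracle.query(p, t))
```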
5.4. Comparison to existing algorithms

There are three main learning-augmented algorithms for (unweighted) caching with access to full predictions: the works of (i) Lykouris & Vassilvitskii (2018), (ii) Rohatgi (2020), and (iii) Wei (2020). Both (i) and (ii) are very similar: they consider a marking algorithm that evicts a page with the furthest predicted next arrival time until the eviction chain becomes long enough. The only difference is that (ii) caps each eviction chain's length at O(1) whereas (i) caps it at $O(\log k)$; the algorithmic difference and the resulting improvement in the competitive ratio are relatively minor, particularly for parsimonious predictions. On the other hand, (iii) shows that the competitive ratio of the algorithm that follows the predictions blindly (called BlindOracle; see Section 7) can be bounded in terms of the prediction error. In our work, we cap the chain length at $O(\log k)$ for simplicity. Unfortunately, we cannot directly use (iii), since we do not have predictions for all pages and hence have to use marking-based algorithms (i.e., akin to (i) and (ii)). Further, we observe experimentally that bounding the length of an eviction chain in a marking algorithm provides better practical performance, while also providing worst-case guarantees, than the black-box combination with a robust algorithm suggested by (iii).

6. Lower bound

The lower bound instance is fairly simple. Each phase starts with a request for a clean page that has never been requested before. It is then followed by requests to k-1 of the k stale pages, chosen uniformly at random. We provide a formal description below.

Lower bound instance. The page requests proceed in phases. Let $P_h$ denote the pages that are requested in phase h for $h \in [H]$, where H is a sufficiently large integer. We will have $|P_h| = k$ for all $h \in [H]$, and $|P_{h+1} \setminus P_h| = |P_h \setminus P_{h+1}| = 1$ for all $h \in [H-1]$. First, $P_1$ is an arbitrary set of k pages and there is one request for each page in $P_1$. We now iteratively construct $P_{h+1}$ from $P_h$ as follows. Let $f_{h+1}$ be a clean page that has never been requested before, and let $p_1, \ldots, p_k$ be a uniformly random permutation of the set of pages in $P_h$. Then the request sequence for phase h+1 is

  $f_{h+1},\; (f_{h+1})^k\, p_k,\; (f_{h+1} p_k)^k\, p_{k-1},\; \ldots,\; (f_{h+1} p_k \cdots p_3)^k\, p_2$

in this order, where $S^k$ denotes k repetitions of the sequence S. Focusing on the page arriving after each repeated block, we will say that pages are requested in the order $f_{h+1}, p_k, \ldots, p_2$.

The proof of Theorem 12 requires care to impose constraints on the structure of candidate algorithms, and to formally demonstrate that a learning-augmented algorithm for caching can do no better than querying unmarked stale pages and always evicting the one that arrives furthest in the future. Unlike in the analysis of the upper bound, the algorithm can make varying numbers of queries per cache miss, even stochastically, which renders the analysis considerably more challenging. We defer the full proof to the Supplementary Material.

Theorem 12. For any integer $c \le \ln k$, any (c+4)-competitive algorithm must make at least $\frac{1}{12 \ln(k+1)}\, c\, k^{1/c} \cdot \mathrm{OPT}$ queries (with no error).
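The phase structure above is mechanical to generate. The following Python sketch (ours; pages are assumed to be hashable identifiers) produces the request sequence of one phase of the lower-bound instance.

```python
import random

def lower_bound_phase(prev_pages, fresh_page, k, rng=random):
    """Sketch of one phase of the Section 6 lower-bound instance: a never-seen
    clean page f, followed by blocks (f p_k ... p_{i+1})^k p_i for a uniformly
    random permutation p_1, ..., p_k of the previous phase's pages."""
    perm = list(prev_pages)
    rng.shuffle(perm)                     # perm[i-1] plays the role of p_i
    seq, prefix = [fresh_page], [fresh_page]
    for i in range(k - 1, 0, -1):         # reveals p_k down to p_2; p_1 never requested
        seq += prefix * k + [perm[i]]
        prefix.append(perm[i])
    return seq
```

Each repeated prefix forces a lazy algorithm to keep the already-revealed pages cached, so the only real decision is which of the remaining stale pages to sacrifice, exactly the situation the proof exploits.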
7. Experiments

We experimentally evaluate our algorithm on a real-world dataset and demonstrate the empirical dependence of the competitive ratio on the number of queries made as well as on the prediction errors.

Input dataset. We use the Citi Bike dataset as in Lykouris & Vassilvitskii (2018). The dataset comes from a publicly available bike sharing platform in New York City (https://www.citibikenyc.com/system-data). For each month of 2018, we construct one instance where each page request corresponds to the starting point of a bike trip. We truncate each month's data to the first 25,000 events, and thus each input sequence has length 25,000. Finally, we set the cache size k = 500 and obtain seven non-trivial instances (the other five sequences have fewer than 500 distinct pages, for which the caching problem becomes trivial). We use a bigger cache than (Lykouris & Vassilvitskii, 2018) to illustrate our algorithm's trade-off between the number of queries and the competitive ratio.

Predictions. To demonstrate the empirical dependence of different algorithms on the prediction error, we generate the following synthetic predictions. For each page p in the cache, its predicted next request time is set to its actual next request time plus a noise term drawn i.i.d. from a lognormal distribution whose underlying normal distribution has mean 0 and standard deviation σ. If the page is never requested in the future, we pretend its actual request time is the sequence length plus 1, i.e., 25,001. We also use a very simple prediction model to demonstrate the efficacy of easy off-the-shelf predictors. For each page, we compute the average time $\mu_p$ elapsed between consecutive requests for that page. For any page p at time t, we set the predicted arrival time as $Q(p, t) = t_p + \mu_p$, where $t_p$ is the last time before t at which page p was requested. We refer to these predictions as Mean Predictions in Table 1.
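Both prediction models are one-liners; the following sketch (ours, not the paper's experimental code) shows the two rules.

```python
import random

def synthetic_prediction(actual_next, sigma, rng=random):
    """Sketch of the synthetic predictor: actual next request time plus
    lognormal noise (underlying normal has mean 0 and std sigma)."""
    return actual_next + rng.lognormvariate(0.0, sigma)

def mean_prediction(last_request, mean_gap):
    """Sketch of Mean Predictions: Q(p, t) = t_p + mu_p, where t_p is the
    last request time of p and mu_p its average inter-request gap."""
    return last_request + mean_gap
```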
Algorithms. We implement the following algorithms.

- RandomMarker (randomized marking; Fiat et al., 1991). Evicts a randomly chosen unmarked page; $\Theta(\log k)$-competitive.
- LRU (least recently used). A widely used heuristic that evicts the least recently used page.
- BlindOracle. Evicts the page with the latest predicted next request time.
- LVMarker (Lykouris & Vassilvitskii, 2018). A learning-augmented marking algorithm that evicts the page with the furthest predicted arrival until the length of the eviction chain is $O(\log k)$, and then switches to evicting a randomly chosen unmarked page.
- RohatgiMarker (Rohatgi, 2020). Identical to LVMarker except the switch occurs once the chain length exceeds one.
- RobustOracle (Wei, 2020). Uses the combiner of (Fiat et al., 1994) to combine BlindOracle and RandomMarker.
- AdaptiveQuery-b. Our algorithm (with worst-case guarantees), parameterized by b, the number of queries made per cache miss.

Results. Table 1 shows the competitive ratios of all the implemented algorithms averaged over the seven instances.

Table 1. Average competitive ratio of algorithms for different error parameters (smaller values mean better performance). The σ columns use the synthetic predictions with noise parameter σ.

  Algorithm        | Mean Predictions | σ = 0 | σ = 2 | σ = 4 | σ = 6
  -----------------|------------------|-------|-------|-------|------
  RandomMarker     | 3.14             | 3.14  | 3.14  | 3.14  | 3.14
  LRU              | 2.86             | 2.86  | 2.86  | 2.86  | 2.86
  BlindOracle      | 1.92             | 1.00  | 1.02  | 3.92  | 4.15
  LVMarker         | 2.49             | 1.77  | 1.81  | 2.94  | 3.11
  RohatgiMarker    | 2.54             | 1.77  | 1.83  | 3.15  | 3.29
  RobustOracle     | 4.29             | 1.80  | 1.83  | 4.48  | 4.51
  AdaptiveQuery-2  | 2.91             | 2.46  | 2.46  | 2.52  | 2.65
  AdaptiveQuery-4  | 2.71             | 2.07  | 2.07  | 2.20  | 2.49
  AdaptiveQuery-8  | 2.59             | 1.86  | 1.86  | 2.07  | 2.54

We observe that our AdaptiveQuery algorithm performs significantly better than RandomMarker and LRU (which do not use any predictions), even while using very few predictions, e.g., making b = 2 queries on each cache miss. Since our algorithm is equivalent to RandomMarker when b = 1, this demonstrates that even minimal predictions can considerably help online algorithms. At the same time, Table 1 also demonstrates that AdaptiveQuery performs as well as the three other learning-augmented algorithms even for relatively small values of b, with both the synthetic predictions and the simple mean predictions. In fact, for high prediction errors, the AdaptiveQuery algorithm is less affected by these errors and outperforms the other learning-augmented algorithms. For comparison, with b = 8, the AdaptiveQuery algorithm uses only 2839 queries per instance on average; that is, it utilizes predictions for about 11% of the requests in the sequence.

We also examine the dependence of our algorithm on the number of queries and the prediction error. Figure 1 shows the competitive ratio of AdaptiveQuery-b for different values of b and different error parameters.

Figure 1. Average competitive ratio of AdaptiveQuery for different values of b (the number of queries per cache miss; b = 1, 2, 4, 8, 16, 32, 64, 128) and the error parameter σ.

Unsurprisingly, we observe that when predictions are (near-)perfect, the competitive ratio of the algorithm improves with the number of queries it is allowed to make. Surprisingly, however, when the prediction error is large, using more queries actually leads to a worse competitive ratio; indeed, in this case, more queries can lead to poor eviction decisions.

8. Conclusions

In this paper we initiate the study of online algorithms augmented with parsimonious learned predictions. Both the theory and the experimental results suggest that the performance of online algorithms can be significantly improved by judiciously using just a few predictions. Such an approach can make learning-augmented algorithms more practically appealing, since obtaining predictions is often computationally expensive. An interesting future direction is to further explore this parsimonious model for other online problems. For example, consider problems that involve predictions of locations, such as metrical task systems and online matching (Antoniadis et al., 2020). It is conceivable that one could use fewer predictions via spatial interpolation.

References

Achlioptas, D., Chrobak, M., and Noga, J. Competitive analysis of randomized paging algorithms. TCS, 234(1-2):203-218, 2000.

Adamaszek, A., Czumaj, A., Englert, M., and Räcke, H. An O(log k)-competitive algorithm for generalized caching. In SODA, pp. 1681-1689, 2012.

Antoniadis, A., Coester, C., Elias, M., Polak, A., and Simon, B. Online metric algorithms with untrusted predictions. In ICML, pp. 345-355, 2020.

Azar, Y., Leonardi, S., and Touitou, N. Flow time scheduling with uncertain processing time. In STOC, pp. 1070-1080, 2021.

Bamas, E., Maggiori, A., and Svensson, O. The primal-dual method for learning augmented algorithms. In NeurIPS, 2020.

Bansal, N., Buchbinder, N., and Naor, J. S. A primal-dual randomized algorithm for weighted paging. JACM, 59(4):19, 2012.

Bansal, N., Buchbinder, N., Madry, A., and Naor, J. A polylogarithmic-competitive algorithm for the k-server problem. JACM, 62(5):1-49, 2015.

Bansal, N., Coester, C., Kumar, R., Purohit, M., and Vee, E. Learning-augmented weighted paging. In SODA, 2022.

Barve, R. D., Grove, E. F., and Vitter, J. S. Application-controlled paging for a shared cache. SICOMP, 29(4):1290-1303, 2000.

Belady, L. A study of replacement algorithms for a virtual-storage computer. IBM Systems J., 5(2):78-101, 1966.

Bhaskara, A., Cutkosky, A., Kumar, R., and Purohit, M. Logarithmic regret from sublinear hints. In NeurIPS, 2021.

Borodin, A. and El-Yaniv, R. Online Computation and Competitive Analysis. Cambridge U. Press, 2005.

Borodin, A., Raghavan, P., Irani, S., and Schieber, B. Competitive paging with locality of reference. In STOC, pp. 249-259, 1991.

Boyar, J., Favrholdt, L. M., Kudahl, C., Larsen, K. S., and Mikkelsen, J. W. Online algorithms with advice: a survey. ACM Computing Surveys, 50(2):1-34, 2017.

Bubeck, S., Cohen, M. B., Lee, Y. T., Lee, J. R., and Madry, A. k-server via multiscale entropic regularization. In STOC, pp. 3-16, 2018.

Diaconis, P. and Graham, R. L. Spearman's footrule as a measure of disarray. JRSS Series B, 39:262-268, 1977.
Dobrev, S., Královič, R., and Pardubská, D. Measuring the problem-relevant information in input. RAIRO Theoretical Informatics and Applications, 43(3):585-613, 2009.

Fiat, A. and Mendel, M. Truly online paging with locality of reference. In FOCS, pp. 326-335, 1997.

Fiat, A., Karp, R. M., Luby, M., McGeoch, L. A., Sleator, D. D., and Young, N. E. Competitive paging algorithms. J. Algorithms, 12(4):685-699, 1991.

Fiat, A., Rabani, Y., and Ravid, Y. Competitive k-server algorithms. JCSS, 48(3):410-428, 1994.

Irani, S., Karlin, A. R., and Phillips, S. Strongly competitive algorithms for paging with locality of reference. SICOMP, 25(3):477-497, 1996.

Jiang, Z., Panigrahi, D., and Sun, K. Online algorithms for weighted paging with predictions. In ICALP, pp. 69:1-69:18, 2020.

Karlin, A. R., Phillips, S. J., and Raghavan, P. Markov paging. SICOMP, 30(3):906-922, 2000.

Koutsoupias, E. and Papadimitriou, C. H. On the k-server conjecture. JACM, 42(5):971-983, 1995.

Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In SIGMOD, pp. 489-504, 2018.

Kumar, R., Purohit, M., and Svitkina, Z. Improving online algorithms via ML predictions. In NeurIPS, pp. 9661-9670, 2018.

Kumar, R., Purohit, M., Svitkina, Z., and Vee, E. Interleaved caching with access graphs. In SODA, pp. 1846-1858, 2020.

Lattanzi, S., Lavastida, T., Moseley, B., and Vassilvitskii, S. Online scheduling via learned weights. In SODA, pp. 1859-1877, 2020.

Lee, J. R. Fusible HSTs and the randomized k-server conjecture. In FOCS, pp. 438-449, 2018.

Li, S. and Xian, J. Online unrelated machine load balancing with predictions revisited. In ICML, pp. 6523-6532, 2021.

Lykouris, T. and Vassilvitskii, S. Competitive caching with machine learned advice. In ICML, pp. 3296-3305, 2018.

Mitzenmacher, M. and Vassilvitskii, S. Algorithms with predictions. In Roughgarden, T. (ed.), Beyond the Worst-Case Analysis of Algorithms, pp. 646-662. Cambridge U. Press, 2020.

Rohatgi, D. Near-optimal bounds for online caching with machine learned advice. In SODA, pp. 1834-1845, 2020.

Roughgarden, T. (ed.). Beyond the Worst-Case Analysis of Algorithms. Cambridge U. Press, 2020.

Wei, A. Better and simpler learning-augmented online caching. In APPROX/RANDOM, 2020.

Supplementary Material

A. Proof of Lemma 6

Proof. We proceed assuming that replacement is allowed, since this only increases $E[\min_{t \in [b]} s_t]$. The sampling process can then be simulated by drawing $s'_1, \ldots, s'_b \sim \mathrm{Unif}(0, r)$ from the continuous uniform distribution and taking floors; thus, we have $E[\min_{t \in [b]} s_t] \le E[\min_{t \in [b]} s'_t]$. It is well known that if $x_1, \ldots, x_b \sim \mathrm{Unif}(0, 1)$, then $E[\min_{t \in [b]} x_t] = \frac{1}{b+1}$. (Indeed, this can be easily verified by observing that $\Pr[\min_{t \in [b]} x_t > x] = (1-x)^b$ and simple calculus.) Since we can set $s'_t = r \cdot x_t$, we have $E[\min_{t \in [b]} s_t] \le E[\min_{t \in [b]} s'_t] = \frac{r}{b+1}$.

B. Lower bound

We repeat the lower bound instance here for clarity.

Lower bound instance. The page requests proceed in phases. Let $P_h$ denote the pages that are requested in phase h for $h \in [H]$, where H is a sufficiently large integer. We will have $|P_h| = k$ for all $h \in [H]$, and $|P_{h+1} \setminus P_h| = |P_h \setminus P_{h+1}| = 1$ for all $h \in [H-1]$. First, $P_1$ is an arbitrary set of k pages and there is one request for each page in $P_1$. We now iteratively construct $P_{h+1}$ from $P_h$ as follows. Let $f_{h+1}$ be a clean page that has never been requested before, and let $\pi : [k] \to [k]$ denote a uniformly random permutation.
Then the request sequence for phase h+1 is

  $f_{h+1},\; (f_{h+1})^k\, p_{\pi(k)},\; (f_{h+1} p_{\pi(k)})^k\, p_{\pi(k-1)},\; \ldots,\; (f_{h+1} p_{\pi(k)} \cdots p_{\pi(3)})^k\, p_{\pi(2)}$

in this order, where $S^k$ denotes k repetitions of the sequence S. Focusing on the page arriving after each repeated block, we will say that pages are requested in the order $f_{h+1}, p_{\pi(k)}, \ldots, p_{\pi(2)}$. For ease of notation, we let $p_{\pi(k+1)} := f_{h+1}$.

We first observe that the optimum offline solution for such an instance always incurs a cache miss on the first, clean page of each phase (except the first phase) and evicts the unique page in $P_{h-1} \setminus P_h$. By construction, the optimum algorithm incurs no other cache misses in the phase. Finally, any algorithm must incur k cache misses in the first phase, and we have the following claim.

Claim 13. There is an offline solution that incurs exactly k + H - 1 cache misses in total.

We would like to lower bound the number of cache misses incurred by any c-competitive algorithm. Consider a fixed optimum online algorithm A. (In this context, an optimal online algorithm is one that obtains a competitive ratio of c while making the fewest number of queries.)

Claim 14. At the beginning of each phase $h \ge 2$, we can assume without loss of generality that A has all pages of $P_{h-1}$ in its cache.

Proof. Assume that the universe of pages is infinite. Then, knowing that we cannot guess the clean page that will be requested in phase h and that all the other requests are for stale pages, the claim follows.

Thanks to Claim 14 and the identical structure of the lower bound instance in every phase, we can assume without loss of generality that the same optimum online algorithm, which we call A, is used in all phases except the first. If c' is the expected number of cache misses A incurs in each phase $h \ge 2$, then for A to be c-competitive, it must be the case that $\frac{k + c'(H-1)}{k + H - 1} \le c$ by Claim 13; here, k in the numerator is the number of cache misses incurred by A in the first phase. Thus, we must have $c' \le c$ as $H \to \infty$. Therefore, we can focus on one phase and lower bound the expected number of queries made by A assuming that it incurs at most c cache misses in expectation. Henceforth we drop indices referring to phases from the notation.

We will say that a page is marked in the phase if it has been requested in the phase. For simplicity, we will assume that A makes at least one query before each page eviction; this has no effect on the asymptotic lower bound we aim to prove.

Lemma 15. In a phase that is not the first, with a given limit c > 0 on the expected number of cache misses, there is an algorithm that makes the minimum number of queries in expectation and simultaneously satisfies the following: (i) it evicts a page only when it is forced to do so; (ii) it never evicts marked pages, and therefore only needs to query unmarked pages; (iii) it only queries pages just before a page eviction; (iv) it evicts the queried page with the furthest arrival time, if it makes any queries.

We first prove (i)-(iii). After setting up additional notation that will be used throughout the analysis, we will prove (iv).

Proof of Lemma 15 (i)-(iii). As argued in Fiat et al. (1991), we can assume without loss of generality that the algorithm is lazy, i.e., it only evicts a page when it incurs a cache miss; this implies (i). We now show (ii). Suppose algorithm A evicts a page $p_{\pi(i)}$ for some i > j to service $p_{\pi(j)}$ in the request subsequence $(p_{\pi(k+1)} p_{\pi(k)} \cdots p_{\pi(j+1)})^k\, p_{\pi(j)}$.
Then, for every repetition of $p_{\pi(k+1)} p_{\pi(k)} \cdots p_{\pi(j)}$ in the subsequence $(p_{\pi(k+1)} p_{\pi(k)} \cdots p_{\pi(j)})^k\, p_{\pi(j-1)}$, until A fetches $p_{\pi(i)}$ and evicts an unmarked page, it incurs another cache miss. If it incurs a cache miss for every repetition, we can make A better or no worse by instead evicting an arbitrary unmarked page (such an algorithm incurs at most k cache misses even without using randomization). Otherwise, if A ever replaces $p_{\pi(i)}$ with an unmarked page before $p_{\pi(j-1)}$ is requested, A could have instead done so earlier to reduce cache misses; this incurs no cache misses during the repetitions $(p_{\pi(k+1)} p_{\pi(k)} \cdots p_{\pi(j)})^k$. We have thus shown that we can assume without loss of generality that A never evicts marked pages; thus, we have (ii). Now (iii) follows since once a page gets marked it stays marked for the rest of the phase; thus, by deferring its queries until it is forced to evict a page, A can only avoid querying pages that would be marked in the meantime.

For the remaining analysis, we take the eviction chain view used in the analysis of the adaptive query algorithm. As our analysis will require careful conditioning and deconditioning, we slightly override the notation. From the above reasoning, particularly from Lemma 15 (i)-(iii), we now have the following problem: we see a request sequence for a clean page $p_{\pi(k+1)} = f_{h+1}$ and k-1 stale pages $p_{\pi(k)}, p_{\pi(k-1)}, \ldots, p_{\pi(2)}$. Note that $p_{\pi(1)}$ is the stale page that is not requested. We consider the eviction chain C that starts with $p_{\pi(k+1)}$ and ends with $p_{\pi(1)}$; recall that an edge from q to q' means that we evict page q' to service page q. Our goal is to lower bound the number of queries made under the requirement that the expected length (number of cache misses) M of C is at most c. Page $p_{\pi(j)}$ is defined to have rank j; this definition is slightly simpler than the one in Section 5.1 since we have only one clean page and thus only one chain. With this notation set up, we are now ready to prove Lemma 15(iv).

Proof of Lemma 15(iv). Our goal is to consider any algorithm A satisfying (i)-(iii) and to construct another algorithm A' that also satisfies (iv), incurring no more cache misses and making no more queries. In the execution of A, suppose the ith page evicted by A has rank $R_i$ and A makes $q_i$ queries just before the eviction. Now A' makes the same number $q_i$ of queries just before evicting its ith page, but it instead evicts the queried page with the smallest rank, i.e., it satisfies property (iv). By a simple induction on i, we can show that A' generates a sequence that stochastically dominates the one A generates. More precisely, suppose A has made q queries and the queried pages have ranks $S_1 < \cdots < S_q$. Then, by definition, A' has also made q queries (if there are not enough pages left to query, this is only better for A'); let $S'_1 < \cdots < S'_q$ be the ranks of the pages queried by A'. We say that the sequence S' stochastically dominates the sequence S if $S'_j \le S_j$ for all $j \in [q]$. Because the chain C ends once the page of rank 1 is evicted (we may assume we evict only queried pages, under the assumption that at least one page is queried before each eviction), A' makes no more cache misses than A in expectation. Further, by construction, A' can only make fewer queries than A in expectation.

Henceforth, we consider an algorithm A that satisfies Lemma 15(i)-(iv). Let $R_0 := k+1$ be the rank of the clean page and let $R_i$ denote the rank of the ith evicted page. As mentioned before, C eventually ends with $p_{\pi(1)}$. For notational convenience, once $R_i$ becomes 1, we define all the subsequent $R_{i+1}, R_{i+2}, \ldots$ to be 1.
Let $Q_i$ denote the number of queries A makes when we witness the ith cache miss. For the analysis, we will assume that we do not make too many queries per page eviction; this is essentially without loss, because otherwise we could find the page $p_{\pi(1)}$ without making many more queries.

Lemma 16. Suppose we show that any algorithm that incurs at most c cache misses in expectation makes at least d queries, under the assumption that $R_i > Q_{i+1}^2$ for all i. Then the following lower bound holds: any algorithm that incurs at most c+4 cache misses in expectation makes at least d/5 queries.

Proof. If the fixed algorithm considered in the lower bound makes $Q_{i+1}$ queries such that $R_i \le Q_{i+1}^2$ for the first time, then we instead let it make $\sqrt{R_i}$ queries per cache miss until the phase ends. This is equivalent to a problem where the cache size is $R_i$ and we make $\sqrt{R_i}$ queries per cache miss. Thus, by Lemma 8, we know that the number of cache misses is at most 5 in expectation, and we make at most $5\sqrt{R_i} \le 5Q_{i+1}$ queries in expectation. In summary, this change incurs at most 4 extra cache misses in expectation and increases the number of queries by a factor of at most 5.

Let us fix i and let $R_i = r$ and $Q_{i+1} = q$. If we choose q points uniformly at random from [0, r], the expected minimum of the samples is well known to be $\frac{r}{q+1}$. But we want to know the expected value of $\frac{r}{R_{i+1}}$, where the only randomness comes from $\pi(1), \ldots, \pi(r-1)$, which we denote by $\pi([r-1])$ for short. Bounding this quantity needs more care. Our concern is to upper bound $E_{\pi([r-1])}[\frac{r}{S}]$, where $S_1, \ldots, S_q$ are sampled uniformly from $[r-1]$ without replacement and $S := \min_{j \in [q]} S_j$. This corresponds to making queries about q pages among the r-1 unmarked ones and evicting the one with the minimum rank, i.e., the furthest request time in the future. For ease of analysis, we pretend that samples $S'_1, \ldots, S'_q$ are drawn from $[1/r, 1]$ independently and uniformly at random, and we upper bound $E[\frac{1}{S'}]$, where $S' := \min_{j \in [q]} S'_j$. Here, we relax the random selection by making the sampling domain continuous. We first show that this relaxation does not change the expectation by much.

Lemma 17. If $1 \le q^2 < r$, we have $E_{\pi([r-1])}\big[\frac{r}{S}\big] \le 6\, E\big[\frac{1}{S'}\big]$.

Proof. We first scale down $S_1, \ldots, S_q$ by a factor of r, so we can pretend that they are sampled from $D = \{\frac{1}{r}, \ldots, \frac{r-1}{r}\}$ without replacement; we then want to upper bound $E[\frac{1}{S/r}]$. We observe that sampling without replacement from D can be simulated as follows: round each $S'_i$ down to the nearest multiple of 1/r, i.e., let $\tilde{S}_i := \lfloor r S'_i \rfloor / r$, and keep the resulting tuple $\tilde{S} = (\tilde{S}_1, \ldots, \tilde{S}_q)$ if all its entries are distinct, an event we denote by $\tilde{S} \in \tilde{D}$. This is an equivalent way of generating the samples $S_1/r, \ldots, S_q/r$. Further, the rounding changes the expectation of the reciprocal of the minimum by a factor of at most 2. Therefore, writing $\tilde{S}_{\min} := \min_i \tilde{S}_i$, we have

  $E\big[\tfrac{1}{S/r}\big] = E\big[\tfrac{1}{\tilde{S}_{\min}} \,\big|\, \tilde{S} \in \tilde{D}\big] \le 2\, E\big[\tfrac{1}{S'} \,\big|\, \tilde{S} \in \tilde{D}\big].$

We would like to decondition on $\tilde{S} \in \tilde{D}$:

  $E\big[\tfrac{1}{S'} \,\big|\, \tilde{S} \in \tilde{D}\big] \le \frac{E[\frac{1}{S'}]}{\Pr[\tilde{S} \in \tilde{D}]}.$

As $\tilde{S}$ is a uniform sample with replacement, we have $\Pr[\tilde{S} \in \tilde{D}] = \frac{r(r-1)\cdots(r-q+1)}{r^q} \ge (1 - q/r)^q \ge (1 - q/(q^2+1))^q \ge 1/3$, where the last inequality follows from a simple calculation. Combining the above equations yields the lemma.

Lemma 18. If $1 \le q^2 < r$, we have $E[\frac{1}{S'}] \le 2q \ln(k+1)$.

Proof. Observe that $\Pr[S' \ge x] = \big(\frac{1-x}{1-1/r}\big)^q$. Thus, the pdf of S' is $\frac{q(1-x)^{q-1}}{(1-1/r)^q}$ for $x \in [1/r, 1]$. For brevity, we omit the denominator $(1-1/r)^q$ in the following calculation and bring it back at the end:

  $\int_{1/r}^{1} \frac{1}{x}\, q(1-x)^{q-1}\, dx = \int_{1/r}^{1/q} \frac{1}{x}\, q(1-x)^{q-1}\, dx + \int_{1/q}^{1} \frac{1}{x}\, q(1-x)^{q-1}\, dx \le \int_{1/r}^{1/q} \frac{q}{x}\, dx + \int_{1/q}^{1} q \cdot q(1-x)^{q-1}\, dx \le q \ln(r/q) + q \le q \ln r \le q \ln(k+1).$

Factoring the denominator back in, $\frac{1}{(1-1/r)^q} \le \frac{1}{(1-1/(q^2+1))^q} \le 2$, we obtain the lemma.
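Lemma 18's bound can be sanity-checked numerically. The following Monte Carlo sketch (ours, for illustration only) estimates $E[1/S']$.

```python
import random

def expected_inverse_min(r, q, trials=100_000, rng=random.Random(0)):
    """Estimate E[1/S'] where S' is the minimum of q i.i.d. Unif[1/r, 1]
    samples; Lemma 18 bounds this by 2q ln(k+1), with r <= k+1."""
    total = 0.0
    for _ in range(trials):
        total += 1.0 / min(rng.uniform(1.0 / r, 1.0) for _ in range(q))
    return total / trials

# e.g., r = 1000, q = 5: the estimate comes out around 24, well below
# the bound 2 * 5 * ln(1000 + 1), roughly 69, when taking r = k + 1.
```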
Corollary 19. $E\big[\frac{R_i}{R_{i+1}}\big] \le 12 \ln(k+1)\, E[Q_{i+1}]$.

Proof. From Lemmas 17 and 18, we know that $E_{\pi([r-1])}\big[\frac{r}{S(r, q, \pi)}\big] \le 12\, q \ln(k+1)$; here the parameters r, q, π are written out to make the dependency of S explicit. As this holds for any $R_i = r$ and $Q_{i+1} = q$, deconditioning gives the desired result.

Lemma 20. If C has length M, then for any integer c > 0 we have $E\big[\sum_{i=1}^{c} Q_i \mid M = c\big] \ge \frac{1}{12 \ln(k+1)}\, c\, k^{1/c}$.

Proof. Using the linearity of expectation and Corollary 19,

  $12 \ln(k+1) \sum_{i=1}^{c} E[Q_i \mid M = c] \;\ge\; \sum_{i=1}^{c} E\Big[\frac{R_{i-1}}{R_i} \,\Big|\, R_c = 1, R_{c-1} > 1\Big] \;\ge\; c\, E\Big[\prod_{i=1}^{c} \Big(\frac{R_{i-1}}{R_i}\Big)^{1/c} \,\Big|\, R_c = 1, R_{c-1} > 1\Big] \;=\; c\,(k+1)^{1/c},$

where the middle step follows from the AM-GM inequality and the last step follows from a telescoping product, $R_0 = k+1$, and $R_c = 1$.

By Lemma 20, if algorithm A makes at most c cache misses in expectation, then the number of queries it makes is lower bounded by the optimum objective of the following LP:

  $\frac{1}{12 \ln(k+1)} \min \sum_{i \ge 1} i\, k^{1/i} x_i$  subject to  $\sum_{i \ge 1} i\, x_i \le c$ (1),  $\sum_{i \ge 1} x_i = 1$,  $x_i \ge 0$.

Here, $x_i := \Pr[M = i]$, i.e., the probability that the chain C has length i, or equivalently that A makes i cache misses. The last two constraints say that the $x_i$ define a probability distribution over the values M can take, and constraint (1) says that we can afford to make at most c cache misses in expectation.

Lemma 21. If $c \le \ln k$, then the above LP's optimum objective is at least $\frac{1}{12 \ln(k+1)}\, c\, k^{1/c}$.

Proof. Let $f(y) := y\, k^{1/y}$. By simple calculus, we have $f'(y) = k^{1/y}\big(1 - \frac{\ln k}{y}\big)$ and $f''(y) = \frac{\ln^2 k}{y^3}\, k^{1/y}$. Thus, f decreases in y for $y \in [1, \ln k]$ and is convex. The LP objective is $\frac{1}{12 \ln(k+1)} \sum_{i \ge 1} f(i) x_i$. By convexity, we have $\sum_{i \ge 1} f(i) x_i \ge f\big(\sum_{i \ge 1} i x_i\big)$. Then, by constraint (1) and the fact that f is decreasing on this range, we have $f\big(\sum_{i \ge 1} i x_i\big) \ge f(c)$.

To summarize, we have shown that any c-competitive algorithm must make at least $\frac{1}{12 \ln(k+1)}\, c\, k^{1/c}$ queries, but under the assumption stated in Lemma 16, i.e., $R_i > Q_{i+1}^2$ for all i. Thus, by the lemma, we have the following.

Theorem 22. For any integer $c \le \ln k$, any (c+4)-competitive algorithm must make at least $\frac{1}{12 \ln(k+1)}\, c\, k^{1/c} \cdot \mathrm{OPT}$ queries.