LEARNED INDEX WITH DYNAMIC ϵ

Daoyuan Chen1, Wuchao Li2, Yaliang Li1, Bolin Ding1, Kai Zeng3, Defu Lian2, Jingren Zhou1
1 Alibaba Group, 2 University of Science and Technology of China, 3 Huawei
{daoyuanchen.cdy, yaliang.li, bolin.ding, jingren.zhou}@alibaba-inc.com
liwuchao@mail.ustc.edu.cn, kai.zeng@huawei.com, liandefu@ustc.edu.cn

ABSTRACT

Index structure is a fundamental component in databases and facilitates broad data retrieval applications. Recent learned index methods show superior performance by learning hidden yet useful data distributions with the help of machine learning, and provide a guarantee that the prediction error is no more than a pre-defined ϵ. However, existing learned index methods adopt a fixed ϵ for all the learned segments, neglecting the diverse characteristics of different data localities. In this paper, we propose a mathematically-grounded learned index framework with dynamic ϵ, which is efficient and pluggable into existing learned index methods. We theoretically analyze prediction error bounds that link ϵ with data characteristics for an illustrative learned index method. Under the guidance of the derived bounds, we learn how to vary ϵ and improve the index performance with a better space-time trade-off. Experiments with real-world datasets and several state-of-the-art methods demonstrate the efficiency, effectiveness and usability of the proposed framework.

1 INTRODUCTION

Data indexing (Graefe & Kuno, 2011; Wang et al., 2018; Luo & Carey, 2020; Zhou et al., 2020), which stores keys and corresponding payloads with designed structures, supports efficient query operations over data and benefits various data retrieval applications. Recently, Machine Learning (ML) models have been incorporated into the design of index structures, leading to substantial improvements in terms of both storage space and querying efficiency (Kipf et al., 2019; Ferragina & Vinciguerra, 2020a; Vaidya et al., 2021).
The key insight behind this trending topic of learned indexes is that the data to be indexed contain useful distribution information, and such information can be utilized by trainable ML models that map the keys {x} to their stored positions {y}. To approximate the data distribution, state-of-the-art (SOTA) learned index methods (Galakatos et al., 2019; Kipf et al., 2020; Ferragina & Vinciguerra, 2020b; Stoian et al., 2021) propose to learn piece-wise linear segments S = [S1, ..., Si, ..., SN], where Si : y = ai·x + bi is the linear segment parameterized by (ai, bi) and N is the total number of learned segments. These methods introduce an important pre-defined parameter ϵ ∈ Z>1 to guarantee the worst-case preciseness: |Si(x) − y| ≤ ϵ for all i ∈ [N]. By tuning ϵ, various space-time preferences from users can be met. For example, a relatively large ϵ results in a small index size at the cost of large prediction errors, while a relatively small ϵ provides users with small prediction errors at the cost of more learned segments and thus a larger index size. Existing learned index methods implicitly assume that the whole dataset to be indexed shares the same characteristics across different localities and thus adopt the same ϵ for all the learned segments. However, varied local data distributions are very common, leading to sub-optimal index performance for existing methods. For example, the real-world Weblogs dataset used in our experiments exhibits non-linear temporal patterns caused by online campus transactions such as class schedule arrangements, weekends and holidays. More importantly, the impact of ϵ on index performance is intrinsically linked to data characteristics, which are not fully explored and utilized by existing learned index methods. Motivated by these observations, in this paper we theoretically analyze the impact of ϵ on index performance, and link the characteristics of data localities with the dynamic adjustments of ϵ.
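To make the ϵ guarantee concrete, here is a minimal Python sketch of an ϵ-bounded piece-wise linear fit in the spirit of the fixed-slope strategy analyzed later in the paper; the function name and segment representation are illustrative, not the authors' implementation:

```python
def eps_bounded_segments(keys, eps):
    """Greedy piece-wise linear fit with an eps guarantee: every segment uses
    the fixed slope 1/mu (mu = mean key interval), starts at the first
    uncovered point, and is closed as soon as a point would violate
    |prediction - position| <= eps."""
    n = len(keys)
    mu = (keys[-1] - keys[0]) / (n - 1)   # mean key interval of the dataset
    a = 1.0 / mu                          # fixed slope shared by all segments
    segments, start = [], 0
    while start < n:
        b = start - a * keys[start]       # pass through (keys[start], start)
        end = start + 1
        while end < n and abs(a * keys[end] + b - end) <= eps:
            end += 1
        segments.append((a, b, start, end))  # covers positions [start, end)
        start = end
    return segments
```

Larger ϵ lets each segment absorb more points (fewer segments, larger errors); smaller ϵ does the opposite, which is exactly the space-time knob described above.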
Based on the derived theoretical results, we propose an efficient and pluggable learned index framework that dynamically adjusts ϵ in a principled way. To be specific, under the setting of an illustrative learned index method MET (Ferragina et al., 2020), we present novel analyses of the prediction error bounds of each segment that link ϵ with the mean and variance of data localities. The segment-wise prediction error embeds the space-time trade-off, as it is the product of the number of covered keys and the mean absolute error, which determine the index size and the preciseness respectively. The derived mathematical relationships enable our framework to fully explore diverse data localities with an ϵ-learner module, which learns to predict the impact of ϵ on the index performance and adaptively chooses a suitable ϵ to achieve a better space-time trade-off. We apply the proposed framework to several SOTA learned index methods, and conduct a series of experiments on three widely adopted real-world datasets. Compared with the original learned index methods with fixed ϵ, our dynamic ϵ versions achieve significant index performance improvements with better space-time trade-offs. We also conduct various experiments to verify the necessity and effectiveness of the proposed framework, and provide both an ablation study and a case study to understand how the proposed framework works. Our contributions can be summarized as follows:

- We make the first step to exploit the potential of dynamically adjusting ϵ for learned indexes, and propose an efficient and pluggable framework that can be applied to a broad class of piece-wise approximation algorithms.
- We provide theoretical analysis for a proxy task modeling the index space-time trade-off, which establishes our ϵ-learner based on the data characteristics and the derived bounds.

(Footnote: The first two authors contributed equally to this work. Corresponding author. Work was done at Alibaba.)
- We achieve significant index performance improvements over several SOTA learned index methods on real-world datasets. To facilitate further studies, we make our codes and datasets public at https://github.com/yxdyc/Learned-Index-Dynamic-Epsilon.

2 BACKGROUND

Learned Index. Given a dataset D = {(x, y) | x ∈ X, y ∈ Y}, X is the set of keys over a universe U such as reals or integers, and Y is the set of positions where the keys and corresponding payloads are stored. An index such as the B+-tree (Abel, 1984) aims to build a compact structure to support efficient query operations over D. Typically, the keys are assumed to be sorted in ascending order to satisfy the key-position monotonicity, i.e., for any two keys, xi > xj iff their positions yi > yj, such that range queries (X ∩ [xlow, xhigh]) can be handled. Recently, learned index methods (Kraska et al., 2018; Li et al., 2019; Tang et al., 2020; Dai et al., 2020; Crotty, 2021) leverage ML models to mine useful distribution information from D, and incorporate such information to boost the index performance. To look up a given key x, the learned index first predicts a position ŷ using the learned models, and subsequently finds the stored true position y based on ŷ with a binary search or an exponential search. By modeling the data distribution information, learned indexes achieve faster query speed and much smaller storage cost than the traditional B+-tree index, with different optimization aspects such as ϵ-bounded linear approximation (Galakatos et al., 2019; Ferragina & Vinciguerra, 2020b; Kipf et al., 2020; Marcus et al., 2020; Li et al., 2021b) and data layout (Ding et al., 2020; Wu et al., 2021; Zhang & Gao, 2022; Wu et al., 2022; Li et al., 2021a).

ϵ-bounded Linear Approximation.
Many existing learned index methods adopt piece-wise linear segments to approximate the distribution of D due to their effectiveness and low computing cost, and introduce the parameter ϵ to provide a worst-case preciseness guarantee and a tunable knob to meet various space-time trade-off preferences. Here we briefly introduce the SOTA ϵ-bounded learned index methods that are most closely related to our work, and refer readers to the literature (Ferragina & Vinciguerra, 2020a; Marcus et al., 2020; Stoian et al., 2021) for details of other methods. We first describe an illustrative learned index algorithm, MET (Ferragina et al., 2020). Specifically, for any two consecutive keys of D, suppose their key interval (xi − xi−1) is drawn according to a random process {Gi}i∈N, where Gi is a positive independent and identically distributed (i.i.d.) random variable whose mean is µ and variance is σ². MET learns linear segments {Si : y = ai·x + bi} via a simple deterministic strategy: the current segment fixes the slope ai = 1/µ and goes through the first available data point, which determines bi. Then Si covers the remaining data points one by one until a data point (x', y') incurs a prediction error larger than ϵ. The violation triggers a new linear segment that begins from (x', y'), and the process repeats until D has been traversed. Other ϵ-bounded learned index methods learn linear segments in a similar manner to MET while having different mechanisms to determine the parameters of {Si}, such as FITing-Tree (Galakatos et al., 2019), PGM (Ferragina & Vinciguerra, 2020b) and Radix-Spline (Kipf et al., 2020). However, they all constrain all learned segments with the same ϵ. In this paper, we study how to enhance existing learned index methods from a new perspective: dynamic adjustment of ϵ accounting for the diversity of different data localities, and present new theoretical results about the effect of ϵ. In Appx.
A, we provide a more detailed description of and comparison to existing learned index methods.

3 LEARN TO VARY ϵ

3.1 PROBLEM FORMULATION AND MOTIVATION

Before introducing the proposed framework, we first formulate the task of learning an index from data with an ϵ guarantee, and discuss why we need to vary ϵ. Given a dataset D to be indexed and an ϵ-bounded learned index algorithm A, we aim to learn linear segments S = [S1, ..., Si, ..., SN] with segment-wise varied [ϵi]i∈[N], such that a better trade-off between storage cost (size in KB) and query efficiency (time in ns) can be achieved than with a fixed ϵ. Let Di ⊆ D be the data whose keys are covered by Si, and D \ ∪j≤i Dj be the remaining data.

Theorem 1. Given an ϵ ∈ Z>1, consider the setting of the MET algorithm (Ferragina et al., 2020), in which the key intervals of D are drawn from a random process consisting of positive i.i.d. random variables with mean µ and variance σ², and ϵ ≫ σ/µ. For a learned segment Si and its covered data Di, denote Seg_Erri = Σ_{(x,y)∈Di} |y − Si(x)|. Then the expectation of Seg_Erri satisfies:

√(1/π)·(µ/σ)·ϵ² < E[Seg_Erri] < (2/3)·√(2/π)·(5/3)^(3/4)·(µ/σ)²·ϵ³.

Note that the average length of segments obtained by the MET algorithm is (µ/σ)²·ϵ². We thus get a constant (2/3)·√(2/π)·(5/3)^(3/4) ≈ 0.78 that is tighter than the trivial constant 1 corresponding to the case where each data point reaches the largest error ϵ. This theorem reveals that the prediction error Seg_Erri depends on both ϵ and the data characteristics (µ, σ). Recall that CV = σ/µ is the coefficient of variation, a classical statistical measure of the relative dispersion of data points. In the context of linear approximation, the data statistic 1/CV = µ/σ in our bounds intrinsically corresponds to the linearity degree of the data. With this, we find that when µ/σ is large, the data is easy to fit with linear segments, and thus we can choose a small ϵ to achieve precise predictions.
On the other hand, when µ/σ is small, it becomes harder to fit the data with a linear segment, and thus ϵ should be increased to absorb some non-linear data localities. In this way, we can make the total prediction error consistent across different learned segments and achieve a better N-MAE trade-off. This analysis also confirms the motivation for varying ϵ: the local linearity degrees of the indexed data can be diverse, and we should adjust ϵ according to the local characteristics of the data, such that the learned index can fit and leverage the data distribution better. In the rest of this section, we provide a proof sketch of this theorem due to the space limitation; for the detailed proof, please refer to Appx. E.

TRANSFORMED RANDOM WALK. The main idea is to model the learning process of the ϵ-guaranteed linear approximation as a random walk, and to consider that the absolute prediction error of each data point follows a folded normal distribution. Specifically, given a learned segment Si : y = ai·x + bi, we can calculate the expectation of Seg_Erri for this segment as:

E[Seg_Erri] = ai · Σ_{n≥1} Σ_{j=0}^{n−1} E[|Zj|] · Pr(j* = n),   (1)

where Zj is the j-th position of a transformed random walk {Zj}j∈N, j* = max{j ∈ N | −ϵ/ai ≤ Zj ≤ ϵ/ai} is the random variable indicating the maximal position at which the random walk is still within the strip of boundary ±ϵ/ai, and the last equality is due to the definition of expectation.

PROOF OF UPPER BOUND. Under the MET setting, where ai = 1/µ and ϵ ≫ σ/µ, the increments of the transformed random walk {Zj} have zero mean and variance σ², and many steps are necessary to reach the random walk boundary. With the Central Limit Theorem, we assume that Zj follows a normal distribution with mean µ_zj = 0 and variance σ²_zj = j·σ², and thus |Zj| follows the folded normal distribution with expectation E[|Zj|] = √(2/π)·σ·√j. Thus Eq. (1) becomes:

E[Seg_Erri] < (1/µ) · Σ_{n≥1} Σ_{j=0}^{n−1} E[|Zj|] · Pr(j* = n) = √(2/π)·(σ/µ) · Σ_{n≥1} (Σ_{j=0}^{n−1} √j) · Pr(j* = n).

Using E[j*] = (µ²/σ²)·ϵ² and Var[j*] = (2/3)·(µ⁴/σ⁴)·ϵ⁴ as derived in Ferragina et al.
(2020), we get E[(j*)²] = (5/3)·(µ⁴/σ⁴)·ϵ⁴. With the inequality Σ_{j=0}^{n−1} √j < (2/3)·n·√n and Jensen's inequality E[X^(3/4)] ≤ (E[X])^(3/4), we get the upper bound:

E[Seg_Erri] < (2/3)·√(2/π)·(σ/µ)·E[(j*)^(3/2)] ≤ (2/3)·√(2/π)·(σ/µ)·(E[(j*)²])^(3/4) = (2/3)·√(2/π)·(5/3)^(3/4)·(µ/σ)²·ϵ³.

PROOF OF LOWER BOUND. Applying the triangle inequality to Eq. (1), we get E[Seg_Erri] > (1/µ)·Σ_{n≥1} E[|Z̄|]·Pr(j* = n), where Z̄ = Σ_{j=0}^{n−1} Zj; Z̄ follows a normal distribution since Zj ∼ N(0, σ²_zj). We can prove that |Z̄| follows the folded normal distribution with expectation E[|Z̄|] > σ·(n−1)/√π. Thus the lower bound is:

E[Seg_Erri] > (σ/(µ·√π))·Σ_{n≥1} (n−1)·Pr(j* = n) = (σ/(µ·√π))·E[j* − 1] = √(1/π)·(µ/σ)·ϵ² − √(1/π)·(σ/µ).

Since ϵ ≫ σ/µ, we can omit the right term √(1/π)·(σ/µ) and finish the proof. Although the derivations are based on the MET algorithm, whose slope is the reciprocal of µ, we find that the mathematical forms relating ϵ, µ/σ and Seg_Erri remain applicable to other ϵ-bounded methods; we further prove that the learned segment slopes of these methods are similar, with bounded differences, in Appx. F.

3.4 ϵ-LEARNER

Given an ϵ, we have obtained closed-form bounds on Seg_Err in Theorem 1, and both the upper and the lower bound are of the form w1·(µ/σ)^w2·ϵ^w3, where w1,2,3 are coefficients. As the concrete values of these coefficients can differ across datasets and methods, we propose to learn the following trainable estimator to make the error prediction more precise:

Seg_Err = f(ϵ, µ, σ) = w1·(µ/σ)^w2·ϵ^w3, with w1 initialized to (2/3)·√(2/π)·(5/3)^(3/4), 1 ≤ w2 ≤ 2, 2 ≤ w3 ≤ 3.   (2)

With this learnable estimator, we feed the data characteristic µ/σ of the look-ahead data and the transformed Seg_Err into it, and find a suitable ϵ' as (Seg_Err / (w1·(µ/σ)^w2))^(1/w3). We will discuss the look-ahead data and the transformed Seg_Err in the following paragraphs. Let us now discuss why this adjustment can achieve better index performance. The ϵ-learner proactively plans the allocation of the total prediction error indicated by the user (i.e., ϵ·|D|) and calculates the tolerated Seg_Err for the next segment.
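As a sketch of this closed form under the stated initialization (the fixed w2 = 2, w3 = 3 below are assumptions for illustration; the paper additionally refits w1,2,3 online), the estimator and its inversion can be written as:

```python
import math

# w1 initialized to the upper-bound constant of Theorem 1 (approximately 0.78);
# w2, w3 fixed at the upper ends of their ranges for this illustration.
W1 = (2.0 / 3.0) * math.sqrt(2.0 / math.pi) * (5.0 / 3.0) ** 0.75
W2, W3 = 2.0, 3.0

def predict_seg_err(eps, mu, sigma, w=(W1, W2, W3)):
    """Estimate Seg_Err = w1 * (mu/sigma)^w2 * eps^w3 for a data locality."""
    w1, w2, w3 = w
    return w1 * (mu / sigma) ** w2 * eps ** w3

def choose_eps(target_err, mu, sigma, w=(W1, W2, W3)):
    """Invert the estimator: pick eps' so the next segment's error hits target_err."""
    w1, w2, w3 = w
    return (target_err / (w1 * (mu / sigma) ** w2)) ** (1.0 / w3)
```

Inverting the estimator is exactly the adjustment step: an easy locality (large µ/σ) yields a small ϵ', a hard one a large ϵ'.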
By adjusting the current ϵ to ϵ', the following learned segment can fully utilize the distribution information of the data and achieve better performance in terms of the N-MAE trade-off. To be specific, when µ/σ is large, the local data has clear linearity, and thus we can adjust ϵ to a relatively small value to gain precise predictions; although the number of data points covered by this segment may decrease and the total number of segments may increase, such a cost paid in terms of space is not larger than the benefit we gain in terms of precise predictions. Similarly, when µ/σ is small, ϵ should be adjusted to a relatively large value to lower the learning difficulty and absorb some non-linear data localities; in this case, we gain in terms of space while paying some cost in terms of prediction accuracy. The segment-wise adjustment of ϵ improves the overall index performance by continually and data-dependently balancing the costs of space and preciseness.

Look-ahead Data. To make the training and inference of the ϵ-learner light-weight, we propose to look ahead at a few data points D' that reflect the characteristics of the following data localities. Specifically, we leverage a small subset D' of the remaining data D \ ∪j≤i Dj.

E DETAILED PROOF OF THEOREM 1

Consider a learned segment Si : y = ai·x + bi, and denote the keys it covers as X0, X1, ..., with Xj = X0 + Σ_{k=1}^{j} Gk (j > 0), where Gk is the key increment variable whose mean and variance are µ and σ² respectively. Denote Zj = Xj − j/ai + (bi − c_{i−1} − 1)/ai as the j-th position of the transformed random walk {Zj}j∈N, where c_{i−1} is the last position covered by the previous segment, and j* = max{j ∈ N | −ϵ/ai ≤ Zj ≤ ϵ/ai} as the random variable indicating the maximal position at which the random walk is still within the strip of boundary ±ϵ/ai. The expectation can be rewritten as:

E[Seg_Erri] = E[Σ_{j=0}^{j*−1} |ai·Xj − j + (bi − c_{i−1} − 1)|] = ai · Σ_{n≥1} Σ_{j=0}^{n−1} E[|Zj|] · Pr(j* = n).   (3)

The last equality in Eq. (3) is due to the definition of expectation. Following the MET algorithm, Si goes through the point (X0, Y0 = c_{i−1} + 1), so bi = −ai·X0 + c_{i−1} + 1 and we can rewrite Zj in the following form:

Z0 = 0,  Zj (j > 0) = Xj − X0 − j/ai = Σ_{k=1}^{j} Gk − j/ai = Σ_{k=1}^{j} (Gk − 1/ai),

where Wk = Gk − 1/ai is the walk increment variable of Zj, with E[Wk] = µ − 1/ai and Var[Wk] = σ².
Under the MET algorithm setting, where ai = 1/µ and ϵ ≫ σ/µ, the transformed random walk {Zj} has increments with zero mean and variance σ², and many steps are necessary to reach the random walk boundary. With the Central Limit Theorem, we can assume that Zj follows a normal distribution with mean µ_zj and variance σ²_zj, and thus |Zj| follows the folded normal distribution:

Zj ∼ N((µ − 1/ai)·j, j·σ²),  E[|Zj|] = µ_zj·[1 − 2Φ(−µ_zj/σ_zj)] + σ_zj·√(2/π)·exp(−µ²_zj/(2·σ²_zj)),

where Φ is the normal cumulative distribution function. For the MET algorithm, ai = 1/µ, and thus µ_zj = 0, σ_zj = σ·√j, and E[|Zj|] = √(2/π)·σ·√j. Then Eq. (3) can be written as

E[Seg_Erri] < (1/µ) · Σ_{n≥1} Σ_{j=0}^{n−1} E[|Zj|] · Pr(j* = n) = √(2/π)·(σ/µ) · Σ_{n≥1} (Σ_{j=0}^{n−1} √j) · Pr(j* = n).   (4)

For the inner sum term in Eq. (4), we have Σ_{j=0}^{n−1} √j < (2/3)·n·√n, and then the result in Eq. (4) becomes

E[Seg_Erri] < (2/3)·√(2/π)·(σ/µ) · Σ_{n≥1} n·√n · Pr(j* = n) = (2/3)·√(2/π)·(σ/µ)·E[(j*)^(3/2)] ≤ (2/3)·√(2/π)·(σ/µ)·(E[(j*)²])^(3/4),

where the last inequality holds due to the Jensen inequality E[X^(3/4)] ≤ (E[X])^(3/4). Using E[j*] = (µ²/σ²)·ϵ² and Var[j*] = (2/3)·(µ⁴/σ⁴)·ϵ⁴ derived for the MET algorithm in Ferragina et al. (2020), we get E[(j*)²] = (5/3)·(µ⁴/σ⁴)·ϵ⁴, which yields the following upper bound:

E[Seg_Erri] < (2/3)·√(2/π)·(5/3)^(3/4)·(µ/σ)²·ϵ³.

For the lower bound, applying the triangle inequality to Eq. (3), we have

E[Seg_Erri] > (1/µ) · Σ_{n≥1} E[|Z̄|] · Pr(j* = n),   (5)

where Z̄ = Σ_{j=0}^{n−1} Zj. Since Zj ∼ N(0, σ²_zj), Z̄ follows the normal distribution:

Z̄ ∼ N(µ_Z̄ = 0, σ²_Z̄ = Σ_{j=0}^{n−1} σ²_zj + Σ_j Σ_{k≠j} r_jk·σ_zj·σ_zk),

where r_jk is the correlation between Zj and Zk. Since µ_Z̄ = 0, |Z̄| follows the folded normal distribution with E[|Z̄|] = σ_Z̄·√(2/π). Since the random walk {Zj} is a process with i.i.d. increments, the correlation r_jk ≥ 0. With σ_zj = σ·√j > 0 and r_jk ≥ 0, we have E[|Z̄|] = √(2/π)·σ_Z̄ ≥ √(2/π)·σ·√(n(n−1)/2) = σ·√(n(n−1)/π) > σ·(n−1)/√π, and the result in Eq. (5) becomes:

E[Seg_Erri] > (σ/(µ·√π)) · Σ_{n≥1} (n−1) · Pr(j* = n) = (σ/(µ·√π))·E[j* − 1] = √(1/π)·(µ/σ)·ϵ² − √(1/π)·(σ/µ).

Since ϵ ≫ σ/µ, we can omit the right term √(1/π)·(σ/µ) and finish the proof.
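The transformed random walk in this proof can be simulated directly. A minimal sketch (hypothetical names) that, given key increments Gk, tracks Zj = Σk (Gk − 1/ai) with ai = 1/µ and returns the exit step j* of the strip ±ϵ/ai:

```python
def exit_step(gaps, mu, eps):
    """Simulate the transformed random walk Z_j = sum_{k<=j} (G_k - mu).
    With slope a_i = 1/mu, the segment keeps covering points while
    -eps/a_i <= Z_j <= eps/a_i; returns j*, the last in-strip position."""
    bound = eps * mu          # eps / a_i with a_i = 1/mu
    z, j = 0.0, 0
    for g in gaps:
        z += g - mu           # walk increment W_k = G_k - 1/a_i
        if not (-bound <= z <= bound):
            return j          # the walk left the strip: segment ends here
        j += 1
    return j
```

Perfectly regular gaps keep Z at zero (the walk never exits, so one segment suffices), while a burst of irregular gaps drives the walk out of the strip quickly, matching the E[j*] = (µ/σ)²·ϵ² intuition.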
F LEARNED SLOPES OF OTHER ϵ-BOUNDED METHODS

As shown in Theorem 1, we know how ϵ impacts the Seg_Erri of each segment learned by the MET algorithm, where the theoretical derivations largely rely on the slope condition ai = 1/µ. Here we prove that for other ϵ-bounded methods, the learned slope of each segment (i.e., ai of Si) concentrates on the reciprocal of the expected key interval, as shown in the following theorem.

Theorem 2. Given an ϵ ∈ Z>1 and an ϵ-bounded learned index algorithm A, for a linear segment Si : y = ai·x + bi learned by A, denote its covered data and the number of covered keys as Di and Len(Di) respectively. Assuming the expected key interval of Di is µi, the learned slope ai concentrates on ā = 1/µi with bounded relative difference:

(1 − 2ϵ/(E[Len(Di)] − 1))·ā ≤ E[ai] ≤ (1 + 2ϵ/(E[Len(Di)] − 1))·ā.

Proof. For the learned linear segment Si, denote its first and last predicted positions as y'0 and y'n respectively; its slope is ai = (y'n − y'0)/(xn − x0). Noticing that y0 − ϵ ≤ y'0 ≤ y0 + ϵ and yn − ϵ ≤ y'n ≤ yn + ϵ due to the ϵ guarantee, we have yn − y0 − 2ϵ ≤ y'n − y'0 ≤ yn − y0 + 2ϵ, and the expectation of ai can be bounded as

E[(yn − y0 − 2ϵ)/(xn − x0)] ≤ E[ai] ≤ E[(yn − y0 + 2ϵ)/(xn − x0)].

Note that for any learned segment Si whose first covered data point is (x0, y0) and last covered data point is (xn, yn), we have E[(xn − x0)/(yn − y0)] = µi, and thus the inequalities become

1/µi − E[2ϵ/(xn − x0)] ≤ E[ai] ≤ 1/µi + E[2ϵ/(xn − x0)].

Since ā = 1/µi and E[xn − x0] = (E[Len(Di)] − 1)·µi, we finish the proof.

Theorem 2 shows that the relative deviation between the learned slope ai and ā is within 2ϵ/(E[Len(Di)] − 1). For the MET and PGM learned index methods, we have the following corollary that depicts more precise deviations without the expectation term E[Len(Di)].

Corollary 2.1. For the MET method (Ferragina et al., 2020) and the optimal ϵ-bounded linear approximation method used in PGM (Ferragina & Vinciguerra, 2020b), which learns the largest segment length, the slope relative differences are at O(1/ϵ). Proof.
We note that the length of a learned segment is at O(ϵ²) for the MET algorithm, as proved in Theorem 1 of Ferragina et al. (2020). Since PGM achieves the largest learned segment length, which is no smaller than that of the MET algorithm, we finish the proof.

G THE ALGORITHM OF DYNAMIC ϵ ADJUSTMENT

We summarize the proposed algorithm below. In Section 3.4, we provide detailed descriptions of the initialization and adjustment sub-procedures. The lookahead() and optimize() procedures are described in the Look-ahead Data paragraph and the Seg_Err and Optimization paragraph respectively.

H IMPLEMENTATION DETAILS

Baselines. All the experiments are conducted on a Linux server with an Intel Xeon Platinum 8163 2.50GHz CPU. We first introduce more details and the implementations of the baseline learned index methods. MET (Ferragina et al., 2020) fixes the segment slope as the reciprocal of the expected key interval, and each segment goes through its first available data point. FITing-Tree (Galakatos et al., 2019) adopts a greedy shrinking cone algorithm, and the learned segments are organized with a B+-tree. Here we use the stx::btree (v0.9) implementation (Bingmann, 2013) and set the filling factors of inner nodes and leaf nodes to 100%, i.e., we adopt the full-paged filling manner.
Algorithm: Dynamic ϵ Adjustment with Pluggable ϵ-Learner
Input: D: data to be indexed, A: learned index algorithm, ϵ: expected ϵ, ρ: length percentage for look-ahead data
Output: S: learned segments with varied ϵs
1: Initialize the parameters w1,2,3 of the learned function f(ϵ, µ, σ) = w1·(µ/σ)^w2·ϵ^w3
2: Initialize the mean length of learned segments so far: Len(DS) ← 404
3: S ← ∅, (µ̂/σ̂) ← 0
4: repeat
5:   Get the data statistic:
6:     (µ/σ) ← lookahead(D, Len(DS)·ρ)
7:   Adjust ϵ based on the learner:
8:     ϵ' ← (Seg_Err / (w1·(µ/σ)^w2))^(1/w3)
9:   Learn a new segment Si using the adjusted ϵ':
10:    [Si, Di] ← A(D, ϵ')
11:    S ← S ∪ Si
12:    D ← D \ Di, DS ← DS ∪ Di
13:  Online update of Len(DS) and (µ̂/σ̂):
14:    Len(DS) ← running-mean(Len(DS), Len(Di))
15:    (µ̂/σ̂) ← running-mean((µ̂/σ̂), (µ/σ))
16:  Train the learner with the ground truth:
17:    w1,2,3 ← optimize(f, Si, Seg_Erri)
18:    Seg_Err ← w1·(µ̂/σ̂)^w2·ϵ^w3
19: until D = ∅

Radix-Spline (Kipf et al., 2020) adopts a greedy spline interpolation algorithm to learn spline points, and the learned spline segments are organized with a flat radix table. We set the number of radix bits to r = 16 for the Radix-Spline method, which means that the leveraged radix table contains 2^16 entries. PGM (Ferragina & Vinciguerra, 2020b) adopts a convex-hull-based algorithm to achieve the minimum number of learned segments, and the segments can be organized with the help of binary search, a CSS-Tree (Rao & Ross, 1999), or a recursive structure. Here we implement the recursive version since it beats the other two variants in terms of indexing performance. For all the baselines and our method, we adopt exponential search to better leverage the predictive models, since the query complexity of exponential search corresponds to the preciseness of the models (MAE), as analyzed in Appendix B.

Evaluation Metrics. We evaluate the index performance in terms of its size, prediction preciseness, and total querying time.
Specifically, we report the number of learned segments N, the index size in bytes, the MAE computed as (1/|D|)·Σ_{(x,y)∈D} |y − S(x)|, and the querying time per query in ns (i.e., we perform query operations for all the indexed data, record the total time of getting the payloads given the keys, and report the time averaged over all queries). For a quantitative comparison w.r.t. the trade-off improvements, we calculate the Area Under the N-MAE Curve (AUNEC), where the x-axis and y-axis indicate N and MAE respectively. For the AUNEC metric, the smaller, the better.

Hyper-parameters. We describe a few additional details of the proposed framework in terms of the ϵ-learner initialization and the hyper-parameter setting. The w1,2,3 of the ϵ-learner are initialized as shown in Eq. (2). We empirically found that this light-weight initialization leads to better index performance than versions with random parameter initialization, and that it benefits the exploration of diverse ϵ', i.e., it leads to a larger variance of the dynamic ϵ sequence [ϵ1, ..., ϵi, ..., ϵN]. As for the hyper-parameter ρ (described in Section 3.4), we conduct a grid search over ρ ∈ {0.1, 0.4, 0.7, 1.0} on the Map and IoT datasets. We found that all these ρ values achieve a better N-MAE trade-off (i.e., smaller AUNEC results) than the fixed-ϵ versions. Since the setting ρ = 0.4 achieves the best results on average over the two datasets, we set ρ to 0.4 for the other datasets.

I DATASET DETAILS

Our framework is verified on several widely adopted datasets with different data scales and distributions. Weblogs Kraska et al. (2018); Galakatos et al. (2019); Ferragina & Vinciguerra (2020b) contains about 715M log entries for the requests to a university web server, and the keys are log timestamps. IoT Galakatos et al. (2019); Ferragina & Vinciguerra (2020b) contains about 26M event entries from different IoT sensors in a building, and the keys are recording timestamps. Map dataset Kraska et al.
(2018); Galakatos et al. (2019); Ding et al. (2020); Ferragina & Vinciguerra (2020b); Li et al. (2021b) contains the location coordinates of 200M places collected around the world from OpenStreetMap (OpenStreetMap contributors, 2017), and the keys are the longitudes of these places. Lognormal Ferragina & Vinciguerra (2020b) is a synthetic dataset whose key intervals follow the lognormal distribution: ln(Gi) ∼ N(µlg, σ²lg). To simulate varied data characteristics among different localities, we generate 20M keys with 40 partitions by setting µlg = 1 and setting σlg to a random number within [0.1, 1] for each partition. We normalize the positions of the stored data into the range [0, 1], and thus the key-position distribution can be modeled as a Cumulative Distribution Function (CDF). We plot the CDFs and zoomed-in CDFs of the experimental datasets in Figure 7 and Figure 8 respectively, which intuitively illustrate the diversity of the adopted datasets. For example, the CDF visualization of the Map dataset shows a fairly shifted distribution across different data localities, verifying the necessity of dynamically adapting and adjusting the learned index algorithms, just as considered in this paper.

Figure 7: CDFs of adopted datasets.

Figure 8: Zoomed-in CDFs of adopted datasets.

J ADDITIONAL EXPERIMENTAL RESULTS

J.1 OVERALL INDEX PERFORMANCE

For the N-MAE trade-off improvements and the actual querying efficiency improvements brought by the proposed framework, we illustrate more N-MAE trade-off curves in Figure 9 and querying time results in Figure 10. We also mark the 99th percentile (P99) latency as the right bar, which is a useful metric in industrial-scale practical systems.
Recall that the N-MAE trade-off curve adequately reflects the index size and querying time: (1) the segment size in bytes and N differ only by a constant factor, e.g., the size of a segment is 128 bits if it consists of two double-precision floating-point parameters (slope and intercept); (2) the querying operation can be done in O(log(N) + log(|y − ŷ|)) as mentioned in Section 3.1; thus a better N-MAE trade-off indicates better querying efficiency. From these figures, we can see that the dynamic ϵ versions of all the baseline methods achieve a better N-MAE trade-off and better querying efficiency, verifying the effectiveness and the wide applicability of the proposed framework. Regarding the P99 metric, we can see that the dynamic versions achieve comparable or even better P99 results than the static versions, showing that the proposed method not only improves the average lookup time, but also has good robustness. This is because our method can effectively adjust ϵ based on the expected ϵ and the data characteristics, making the {ϵi} fluctuate within a moderate range.

Figure 9: The additional N-MAE trade-off curves for learned index methods.

J.2 ABLATION STUDY

To examine the necessity and the effectiveness of the proposed framework, in Section 4.3 we compare the proposed framework with three dynamic ϵ variants for the FITing-Tree method. Here we demonstrate the AUNEC relative changes for the Radix-Spline method with the same three variants in Table 6; similar conclusions can be drawn.

Table 6: The AUNEC relative changes of dynamic ϵ variants compared to the Radix-Spline method with the proposed framework.

          | Random ϵ | Polynomial Learner | Least Square Learner
Weblogs   | +81.23%  | +56.20%            | -9.56%
IoT       | +74.78%  | +53.28%            | +9.81%
Map       | +60.67%  | +7.34%             | +0.45%
Lognormal | +83.16%  | +55.01%            | -11.23%
Average   | +74.96%  | +42.96%            | -2.63%

Figure 10: Improvements in terms of querying times for learned index methods with dynamic ϵ.
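The O(log(|y − ŷ|)) term above corresponds to exponential search around the predicted position; below is a self-contained Python sketch of that lookup step (an illustration, not the paper's implementation):

```python
from bisect import bisect_left

def exponential_search(keys, key, pred):
    """Locate `key` in sorted `keys`, starting from predicted index `pred`.
    The bracket around `pred` is doubled until it contains `key`, then a
    binary search finishes; cost is O(log |true_pos - pred|), so a more
    precise model (smaller MAE) directly means a faster lookup.
    Returns the leftmost insertion point if `key` is absent."""
    n = len(keys)
    pred = max(0, min(pred, n - 1))
    if keys[pred] == key:
        return pred
    step = 1
    if keys[pred] < key:
        lo, hi = pred, min(pred + step, n - 1)
        while keys[hi] < key and hi < n - 1:   # grow bracket to the right
            lo = hi
            step *= 2
            hi = min(pred + step, n - 1)
    else:
        hi, lo = pred, max(pred - step, 0)
        while keys[lo] > key and lo > 0:       # grow bracket to the left
            hi = lo
            step *= 2
            lo = max(pred - step, 0)
    return bisect_left(keys, key, lo, hi + 1)  # binary search inside bracket
```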
J.3 THEORETICAL RESULTS VALIDATION

We study the impact of ϵ on Seg_Erri for the MET algorithm in Theorem 1, where the derivations are based on the slope condition ai = 1/µ. To confirm that the proposed framework also works well with other ϵ-bounded learned index methods, we analyze the learned slopes of other ϵ-bounded methods in Theorem 2. In summary, we prove that for a segment Si : y = ai·x + bi whose covered data is Di with expected key interval µi, the slope ai concentrates on 1/µi within a relative deviation of 2ϵ/(E[Len(Di)] − 1). Here we plot the learned slopes of the baseline learned index methods in Figure 11 (Map dataset) and in Figure 13 (IoT, Weblogs and Lognormal datasets). We can see that the learned slopes of the other methods indeed center along the line ai = 1/µi, showing the close connections among these methods and confirming that the proposed framework can work well with other ϵ-bounded learned index methods.

Figure 11: Learned slopes.

Figure 12: Illustration of the derived bounds.

We further compare the theoretical bounds with the actual Seg_Erri for all the adopted learned index methods. We show the results on the Lognormal dataset in Figure 12, and the results on another two datasets, Gamma and Uniform, in Figure 14, where the key intervals of the latter two datasets follow the gamma distribution and the uniform distribution respectively. As expected, we can see that the actual Seg_Erri of the MET method lies within the derived bounds, verifying the correctness of Theorem 1.
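For reference when reading these bound curves, the two bounds of Theorem 1 can be computed as follows (a sketch; valid only in the regime ϵ ≫ σ/µ assumed by the theorem):

```python
import math

def seg_err_bounds(eps, mu, sigma):
    """Lower and upper bounds on E[Seg_Err] from Theorem 1:
    lower = sqrt(1/pi) * (mu/sigma) * eps^2,
    upper = (2/3) * sqrt(2/pi) * (5/3)^(3/4) * (mu/sigma)^2 * eps^3."""
    lo = math.sqrt(1.0 / math.pi) * (mu / sigma) * eps ** 2
    hi = ((2.0 / 3.0) * math.sqrt(2.0 / math.pi) * (5.0 / 3.0) ** 0.75
          * (mu / sigma) ** 2 * eps ** 3)
    return lo, hi
```

Because the upper-bound constant is about 0.78, the curve always lies below the trivial bound of (µ/σ)²·ϵ³ obtained when every covered point reaches the worst-case error ϵ.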
Besides, the other ϵ-bounded methods show the same trends as the MET method, providing evidence that these methods follow the same mathematical forms as derived, and thus the ϵ-learner also works well with them.

Figure 13: Learned slopes on the IoT, Weblogs and Lognormal datasets.

Figure 14: Illustrations of the derived bounds on Gamma and Uniform datasets.

J.4 CV AS AN INDICATIVE QUANTITY

The coefficient of variation (CV), i.e., CV = σ/µ, is an important factor in our bounds that reflects the linearity degree of the data. As shown in our experiments, CV effectively helps to dynamically adjust ϵ in our framework. Here we explore whether the CV value can be an indicative quantity that sheds light on what types of data will benefit from our dynamic adjustment. To be specific, we calculate the CV values of the experimental datasets and compare them with the trade-off improvements.
The global CV values of IoT, Map, Lognormal, and Weblogs are 65.24, 11.12, 0.85, and 0.013 respectively, while their AUNEC values are improved by 20.71%, 6.47%, 21.89%, and 26.96% respectively. With the exception of IoT, the results show that the smaller the CV value, the greater the trade-off improvement that dynamic ϵ brings. We find that IoT is a locally linear but globally fluctuant dataset. We then divide the data into 5000 segments and calculate their average CV values. The local CV values of IoT, Map, Lognormal, and Weblogs are 0.95, 2.18, 0.63, and 0.005 respectively, which is consistent with the improvement trends. Intuitively, when the local CV value is small, the local data is hard to fit with a few linear segments if we adopt an improper ϵ, and we need more fine-grained ϵ adjustment rather than the fixed setting; thus we can expect more performance improvements in this case. The calculation of the actual CV values of real-world datasets helps to validate our ϵ analysis based on the CV values, and provides further insight into the scenarios where the proposed method has strong potential to outperform existing methods.
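The global and local CV statistics used here can be computed as in the following sketch (the equal-size partitioning and helper names are illustrative; the paper uses 5000 segments for the local CV):

```python
import statistics

def cv(gaps):
    """Coefficient of variation sigma/mu of a list of key intervals."""
    return statistics.pstdev(gaps) / statistics.fmean(gaps)

def local_cv(keys, n_parts):
    """Average CV of key intervals over (roughly) equal-size partitions;
    a small value indicates locally linear data."""
    gaps = [b - a for a, b in zip(keys, keys[1:])]
    size = max(1, len(gaps) // n_parts)
    chunks = [gaps[i:i + size] for i in range(0, len(gaps), size)]
    return statistics.fmean(cv(c) for c in chunks if len(c) > 1)
```

A dataset like IoT can then show a large global CV yet a small local CV, because each partition is nearly linear while the gap scale shifts between partitions.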