Sorting with Predictions

Xingjian Bai
Department of Computer Science, University of Oxford, UK
xingjian.bai@sjc.ox.ac.uk

Christian Coester
Department of Computer Science, University of Oxford, UK
christian.coester@cs.ox.ac.uk

Abstract

We explore the fundamental problem of sorting through the lens of learning-augmented algorithms, where algorithms can leverage possibly erroneous predictions to improve their efficiency. We consider two different settings: In the first setting, each item is provided a prediction of its position in the sorted list. In the second setting, we assume there is a quick-and-dirty way of comparing items, in addition to slow-and-exact comparisons. For both settings, we design new and simple algorithms using only $O(\sum_i \log \eta_i)$ exact comparisons, where $\eta_i$ is a suitably defined prediction error for the $i$th element. In particular, as the quality of predictions deteriorates, the number of comparisons degrades smoothly from $O(n)$ to $O(n \log n)$. We prove that this comparison complexity is theoretically optimal with respect to the examined error measures. An experimental evaluation against existing adaptive and non-adaptive sorting algorithms demonstrates the potential of applying learning-augmented algorithms in sorting tasks.

1 Introduction

Sorting is one of the most basic algorithmic problems, commonly featured as one of the initial topics in computer science education, and with a vast array of applications spanning various domains. In recent years, the emerging field of algorithms with predictions (Lykouris and Vassilvitskii, 2021; Mitzenmacher and Vassilvitskii, 2022), also known as learning-augmented algorithms, has opened up new possibilities for algorithmic improvement, where algorithms aim to leverage predictions (possibly generated through machine learning, or otherwise) to improve their performance. However, the classical sorting problem with predictions, along with the discussion of different types of predictors for sorting, appears to have been largely overlooked by this recent movement. This paper explores the problem of sorting through the lens of algorithms with predictions in two settings, aiming to overcome the classical $\Omega(n \log n)$ barrier with the aid of various types of predictors.

The first setting involves each item having a prediction of its position in the sorted list. This type of predictor is commonly found in real-world scenarios. For instance, empirical estimations of the element distribution can generate positional predictions. Another example is a fixed set of items whose ranking evolves over time, with minor changes at each timestep. Here, an outdated ranking can serve as a natural prediction for the current ranking. As re-evaluating the true relation between items can be costly, the provided positional predictions offer useful information. The positional prediction setting is closely related to adaptive sorting of inputs with existing presortedness (Estivill-Castro and Wood, 1992), but we consider different measures of error on the predictor, resulting in algorithms with a more fine-grained complexity.

In the second setting, a dirty comparison function is provided to assist sorting. In biological experiments, for instance, some indicating factors might be used to approximately compare two molecules or drugs. Despite potential errors due to the oversight of minor factors, these comparisons can still offer preliminary insights into the properties of the subjects.
More broadly, in experimental science, researchers often carry out costly experiments to compare subject behaviours. By utilizing a proficient sorting algorithm that capitalizes on dirty comparisons, the need for costly experiments can be reduced, substituting them with less expensive, albeit noisier, experiments.

We propose sorting algorithms that leverage either type of predictor. In the positional prediction setting, we design two deterministic algorithms with different complexity bounds, while in the dirty comparison setting, we develop a randomized algorithm. In all settings, we provide bounds of the form $O(\sum_{i=1}^n \log(\eta_i + 2))$ on the number of exact comparisons, for different notions of element-wise prediction errors $\eta_i \in [0, n]$. In particular, all three proposed algorithms require only $O(n)$ exact comparisons if predictions are accurate (consistency), never use more than $O(n \log n)$ comparisons regardless of prediction quality (robustness), and their performance degrades slowly as a function of the prediction error (smoothness). Moreover, we show that all algorithms have optimal comparison complexity with respect to the error measures examined. Finally, through experiments on both synthetic and real-world data, we evaluate the proposed algorithms against existing (adaptive and non-adaptive) algorithms. The results demonstrate their superiority over the baselines in multiple scenarios.

1.1 Preliminaries

Let $A = \langle a_1, \dots, a_n \rangle$ be an array of $n$ items, equipped with a strict linear order $<$. Let $p \colon [n] \to [n]$ be the permutation that maps each index $i$ to the position of $a_i$ in the sorted list; that is, $a_{p^{-1}(1)} < a_{p^{-1}(2)} < \dots < a_{p^{-1}(n)}$. We consider two settings of sorting with predictions.

Sorting with Positional Predictions. In sorting with positional predictions, the algorithm receives for each item $a_i$ a prediction $\hat p(i)$ of its position $p(i)$ in the sorted list. We allow $\hat p$ to be any function $[n] \to [n]$, which need not be a permutation (i.e., it is possible that $\hat p(i) = \hat p(j)$ for some $i \ne j$). Positional predictions can be generated by models that roughly sort the items, e.g., by focusing on major factors while neglecting minor ones. Or they can stem from past item rankings, while the properties of the items evolve over time; in such cases, the objective is to obtain the latest ranking of the items.

The error of a positional prediction can be naturally quantified by the displacement of each element's prediction, that is, the absolute difference between the predicted ranking and the true ranking. We define the displacement error of item $a_i$ as $\eta^{\mathrm{dis}}_i := |\hat p(i) - p(i)|$. The following notion of one-sided error provides an alternative perspective to evaluate the complexity of algorithms with positional predictions. We denote the left-error and right-error of item $a_i$ as
$$\eta^{\mathrm{l}}_i := \big|\{j \in [n] \colon \hat p(j) \le \hat p(i) \wedge p(j) > p(i)\}\big|, \qquad \eta^{\mathrm{r}}_i := \big|\{j \in [n] \colon \hat p(j) \ge \hat p(i) \wedge p(j) < p(i)\}\big|.$$
In certain contexts, it may be impossible to obtain a predictor with small displacement error, but possible to obtain one with a small one-sided error. By developing algorithms for this setting, we expand the space of problems where sorting algorithms with predictions can be applied.
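To make these definitions concrete, the three error measures can be computed directly from the true and predicted positions. The following Python sketch is our own illustration (not part of the paper); it assumes 0-indexed position arrays and uses a straightforward $O(n^2)$ computation.

```python
def positional_errors(p, p_hat):
    """Compute the displacement, left, and right errors of each item.

    p[i]     -- true position of item a_i in the sorted order (0-indexed)
    p_hat[i] -- predicted position of a_i (need not be a permutation)
    Returns three lists (eta_dis, eta_l, eta_r), one entry per item.
    """
    n = len(p)
    eta_dis = [abs(p_hat[i] - p[i]) for i in range(n)]
    # Left error: items predicted at or before i whose true position is after i.
    eta_l = [sum(1 for j in range(n) if p_hat[j] <= p_hat[i] and p[j] > p[i])
             for i in range(n)]
    # Right error: items predicted at or after i whose true position is before i.
    eta_r = [sum(1 for j in range(n) if p_hat[j] >= p_hat[i] and p[j] < p[i])
             for i in range(n)]
    return eta_dis, eta_l, eta_r
```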
Sorting with Dirty and Clean Comparisons. The other setting we consider involves a predictor that estimates which of two elements is larger without conducting a proper comparison, providing a faster but possibly inaccurate result. This type of predictor is applicable in scenarios where exact comparisons are costly but a rough estimate of the comparison outcome can be obtained more easily. Formally, in sorting with dirty comparisons, the algorithm has access to a complete, asymmetric relation $\hat<$ on the items, while still also having access to the exact comparisons $<$. That is, for any two distinct items $a_i$ and $a_j$, either $a_i \mathrel{\hat<} a_j$ or $a_j \mathrel{\hat<} a_i$. We think of $\hat<$ as an unreliable, but much faster to evaluate, prediction of $<$. We also refer to the (fast) comparisons according to $\hat<$ as dirty, whereas the (slow) comparisons according to $<$ are clean. We emphasize that the relation $\hat<$ need not be transitive, so it is not necessarily a linear order. Instead, $\hat<$ induces a tournament graph on the items of $A$, containing a directed edge $(a_i, a_j)$ if and only if $a_i \mathrel{\hat<} a_j$, and in many applications we expect this graph to have cycles. We denote by $\eta_i$ the number of incorrect dirty comparisons involving $a_i$, that is,
$$\eta_i := \big|\{j \in [n] \colon (a_i < a_j) \ne (a_i \mathrel{\hat<} a_j)\}\big|.$$

1.2 Main Results

Our main result for the dirty comparison setting is given by the following theorem:

Theorem 1.1. Augmented with dirty comparisons, there is a randomized algorithm that sorts an array within $O(n \log n)$ running time, $O(n \log n)$ queries to dirty comparisons, and $O\big(\sum_{i=1}^n \log(\eta_i + 2)\big)$ clean comparisons in expectation.

By the classical lower bound, the total number of dirty plus clean comparisons must be at least $\Omega(n \log n)$ regardless of prediction quality, but for sufficiently good predictions the theorem allows us to replace most clean comparisons with dirty ones. The case where dirty comparisons are probabilistic is discussed in Section B.1. The following theorem generalizes the previous one to the case that there are $k$ different dirty comparison operators:

Theorem 1.2. Augmented with $k \le 2^{O(n/\log n)}$ dirty comparison predictors, where the error of the $p$th predictor is denoted by $\eta^p$, there is a randomized algorithm that sorts an array with at most $O\big(\min_p \sum_{i=1}^n \log(\eta^p_i + 2)\big)$ clean comparisons.

In other words, the number of clean comparisons is as good as if we knew in advance which of the $k$ predictors is best. The bound on $k$ is almost tight, since already $k = 2^{n \log n}$ would mean there could be one predictor for each of the $n!$ possible sorting outcomes, which would render them useless. The proof of Theorem 1.2 is given in Appendix B.2.

The next two theorems capture our algorithms for the positional prediction setting:

Theorem 1.3. Augmented with a positional predictor, there is a deterministic algorithm that sorts an array within $O\big(\sum_{i=1}^n \log(\eta^{\mathrm{dis}}_i + 2)\big)$ running time and comparisons.

Theorem 1.4. Augmented with a positional predictor, there is a deterministic algorithm that sorts an array within $O\big(\sum_{i=1}^n \log(\min\{\eta^{\mathrm{l}}_i, \eta^{\mathrm{r}}_i\} + 2)\big)$ comparisons.

We remark that there exist instances where the bound of Theorem 1.3 is stronger than that of Theorem 1.4 and vice versa.[1] The following theorem, proved in Appendix E, shows tightness of the aforementioned upper bounds.

Theorem 1.5. Augmented with dirty comparisons, no sorting algorithm uses $o\big(\sum_{i=1}^n \log(\eta_i + 2)\big)$ clean comparisons. Augmented with a positional predictor, no sorting algorithm uses $o\big(\sum_{i=1}^n \log(\eta^{\mathrm{dis}}_i + 2)\big)$ or $o\big(\sum_{i=1}^n \log(\min\{\eta^{\mathrm{l}}_i, \eta^{\mathrm{r}}_i\} + 2)\big)$ comparisons.

[1] If predictions are correct except that the positions of the $\sqrt{n}$ smallest and $\sqrt{n}$ largest items are swapped, then $\sum_i \log(\eta^{\mathrm{dis}}_i + 2) = \Theta(n)$, but $\sum_i \log(\min\{\eta^{\mathrm{l}}_i, \eta^{\mathrm{r}}_i\} + 2) = \Theta(n \log n)$. Conversely, if $\hat p(i) = (p(i) + \sqrt{n}) \bmod n$, then $\sum_i \log(\eta^{\mathrm{dis}}_i + 2) = \Theta(n \log n)$, but $\sum_i \log(\min\{\eta^{\mathrm{l}}_i, \eta^{\mathrm{r}}_i\} + 2) = \Theta(n)$.

Bounds in Terms of Global Error. One may wonder how the above bounds translate to a global error measure such as the number of item pairs where the larger one is incorrectly predicted to be no larger than the smaller one. Writing $D$ for this error measure, in the dirty comparison setting we simply have $D = \frac{1}{2}\sum_i \eta_i$, and in the positional prediction setting $D \ge \frac{1}{2}\sum_i \eta^{\mathrm{dis}}_i$ by (Rohatgi, 2020, Lemma 11). Thus, concavity of the logarithm and Jensen's inequality yield an upper bound of $O\big(n \log(\frac{D}{n} + 2)\big)$ for both settings. This bound is tight[2] as a function of $D$ and corresponds to the optimal complexity of adaptive sorting as a function of the number of inversions (Mannila, 1985). However, our guarantees in terms of element-wise error are strictly stronger whenever Jensen's inequality is not tight, i.e., when the $\eta_i$ are non-uniform. Furthermore, it is reasonable to expect predictors to exhibit varying levels of error for different items, especially when the error originates from element-wise noise.

[2] Indeed, our proof of Theorem 1.5 constructs a family of instances where each $\eta_i$ is bounded by the same quantity, so Jensen's inequality is tight for these instances.
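For completeness, the translation from element-wise to global error is a single application of Jensen's inequality; using $\sum_i \eta_i \le 2D$, which holds in both settings as noted above, the step reads:
$$\sum_{i=1}^n \log(\eta_i + 2) \;=\; n \cdot \frac{1}{n}\sum_{i=1}^n \log(\eta_i + 2) \;\le\; n \log\Big(\frac{1}{n}\sum_{i=1}^n (\eta_i + 2)\Big) \;\le\; n \log\Big(\frac{2D}{n} + 2\Big) \;=\; O\Big(n \log\Big(\frac{D}{n} + 2\Big)\Big).$$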
1.3 Related Works

Algorithms with Predictions. Our study aligns with the broader field of learning-augmented algorithms, also known as algorithms with predictions. The majority of research has focused on classical online problems such as caching (Lykouris and Vassilvitskii, 2021; Rohatgi, 2020; Wei, 2020; Bansal et al., 2022), rent-or-buy problems (Purohit et al., 2018; Gollapudi and Panigrahi, 2019; Angelopoulos et al., 2020; Wang et al., 2020; Antoniadis et al., 2021), scheduling (Purohit et al., 2018; Lattanzi et al., 2020; Mitzenmacher, 2020; Azar et al., 2021, 2022; Lindermayr and Megow, 2022), and many others. In comparison, research on learning-augmented algorithms that improve running time for offline problems is relatively sparser, but examples include matching (Dinitz et al., 2021; Sakaue and Oki, 2022), clustering (Ergun et al., 2022), and graph algorithms (Chen et al., 2022; Davies et al., 2023). Motivated by (Kraska et al., 2018), (Lykouris and Vassilvitskii, 2021) describes a simple method to speed up binary search with predictions, which has been inspirational for our work. Learning-augmented algorithms have also been applied to data structures such as binary search trees (Lin et al., 2022; Cao et al., 2023), and empirical works demonstrate the benefits of ML-augmentation for index structures (Kraska et al., 2018) and database systems (Kraska et al., 2019). There has also been increasing interest in settings where algorithms have access to multiple predictors (Gollapudi and Panigrahi, 2019; Wang et al., 2020; Bhaskara et al., 2020; Almanza et al., 2021; Emek et al., 2021; Dinitz et al., 2022; Anand et al., 2022; Antoniadis et al., 2023).

Related to sorting, Lu et al. (2021) studied learning-augmented generalized sorting, a variant of sorting where some comparisons are allowed while others are forbidden. The predictions they consider are similar to our dirty comparisons. They proposed two algorithms with comparison complexities $O(n \log n + w)$ and $O(nw)$, where $w$ is the total number of incorrect dirty comparisons.
In the classical (non-generalized) setting with all comparisons allowed, only the second bound theoretically improves upon $O(n \log n)$, but the dependence on the error is exponentially worse than for our algorithms; even with a single incorrect dirty comparison per item, the $O(nw)$ bound becomes $O(n^2)$, whereas ours is $O(n)$. A recent work of Erlebach et al. (2023) studies sorting under explorable uncertainty with predictions, where initially only an interval around the value of each item is known, the exact values can be queried, a prediction of these values is given, and the goal is to minimize the number of queries needed to sort the list.

Deep Learning-Based Sorting. The thriving development of deep learning has inspired research into new sorting paradigms. A DeepMind study (Mankowitz et al., 2023) recast sorting as a single-player game, training agents to perform effectively and thus uncovering faster sorting routines for short sequences. Kristo et al. (2020) proposed a sorting algorithm that uses a learning component to improve the empirical performance of sorting numerical values. Their algorithm tries to approximate the empirical CDF of the input by applying ML techniques to a small subset of the input. This setting diverges from ours: If inputs are non-numeric (and no monotone mapping to numbers is known), then one has to rely on a comparison function, and the approach of Kristo et al. (2020) would not be well-defined, whereas our algorithms can sort arbitrary data types. Conversely, the input in (Kristo et al., 2020) is solely the item list, without any additional predictions. Note that predictions (or other assumptions) are necessary to beat the entropic $\Omega(n \log n)$ lower bound.[3]

[3] The theoretical guarantee of their algorithm is $O(n^2)$, although they observe much better empirical performance. Indeed, this makes sense for numerical inputs drawn from a sufficiently nice distribution, since then one can extrapolate from a small part of the input to the rest.

Noisy Sorting. Noisy sorting contemplates scenarios where comparison results may be incorrect. This model is useful for simulating potential faults in large systems. Two noisy sorting settings have primarily been considered: In the independent noisy setting, each query's result is independently flipped with probability $p \in (0, \frac{1}{2})$. Recently, Gu and Xu (2023) provided optimal bounds on the number of queries to sort $n$ elements with high probability. The recurrent noisy setting (Braverman and Mossel, 2008) further assumes that any repeated comparisons will yield consistent results. Geissmann et al. (2019) present an optimal algorithm that guarantees $O(n \log n)$ time, $O(\log n)$ maximum dislocation, and $O(n)$ total dislocation with high probability. While the recurrent noisy setting is closely related to dirty comparisons, studies in that field focus primarily on approximate sorting; to the best of our knowledge, no exact sorting algorithms that use both dirty and clean comparisons have been studied.

Adaptive Sorting. Adaptive sorting algorithms leverage various types of existing order in the input, achieving better complexity for partially sorted data. Examples include Timsort (Peters, 2002), which is the standard sorting algorithm in a variety of programming languages, Cook-Kim division (Cook and Kim, 1980), and Powersort (Munro and Wild, 2018), which recently replaced Timsort in Python's standard library. We refer to the survey of Estivill-Castro and Wood (1992) for a broader overview.
The concept of pre-sortedness is closely related to positional predictions; however, without the motivation from predictors, the complexity bounds of adaptive sorting algorithms were typically considered under error measures on the entire array, instead of on each element. In contrast, our error measure is element-wise, allowing algorithms with stronger complexity bounds.

2 Sorting with Dirty Comparisons

Given a dirty predictor $\hat<$, our goal is to sort $A$ with the least possible number of clean comparisons. Note that, if $\hat<$ is accurate, $A$ can be sorted using only $O(n)$ clean comparisons and $O(n \log n)$ dirty comparisons. This could be achieved, for example, by performing Merge Sort with dirty comparisons and then validating the result through clean comparisons between adjacent elements. This observation motivates us to consider that not all $O(n^2)$ dirty comparisons are necessary, and we should devise an algorithm minimizing the number of both clean and dirty comparisons.

We propose a randomized algorithm that sorts $A$ with expected $O(\sum_i \log(2 + \eta_i))$ clean comparisons, expected $O(n \log n)$ dirty comparisons, and expected $O(n \log n)$ running time. The key idea consists of three parts: 1) sequentially insert each element of $A$ into a binary search tree, following a uniformly random order; 2) guide each insertion primarily with dirty comparisons, while verifying its correctness using a minimal number of clean comparisons; 3) correct the mistake induced by the dirty insertion, ensuring that the number of clean comparisons needed for correction is $O(\log \eta_i)$ in expectation. We describe the algorithm in Section 2.1 and prove its performance guarantees in Section 2.2. In Appendix B we discuss extensions of the dirty comparison setting.

2.1 Algorithm

We now describe the sorting algorithm with dirty comparisons in detail (Algorithm 1). We initialize $B$ as an empty binary search tree (BST). For any vertex $v$, we denote by $\mathrm{left}(v)$ and $\mathrm{right}(v)$ its left and right children, and by $\mathrm{root}(B)$ the root of $B$. Slightly abusing notation, we write $v$ also for the item stored at vertex $v$. If any of these vertices is missing, the respective variable has value NIL.

Algorithm 1: Sorting with dirty and clean comparisons
Input: A = ⟨a_1, ..., a_n⟩, dirty comparator <̂, clean comparator <
 1  B ← empty binary tree
 2  for i ∈ [n] in uniformly random order do
 3      (L_1, C_1, R_1) ← (−∞, root(B), +∞)
 4      t ← 1
 5      while C_t ≠ NIL do                                              ▷ Dirty search
 6          if a_i <̂ C_t then (L_{t+1}, C_{t+1}, R_{t+1}) ← (L_t, left(C_t), C_t)
 7          else (L_{t+1}, C_{t+1}, R_{t+1}) ← (C_t, right(C_t), R_t)
 8          t ← t + 1
 9      C ← C_{t*}, where t* ≤ t is maximal s.t. L_{t*} < a_i < R_{t*}  ▷ Verification
10      while C ≠ NIL do                                                ▷ Clean search
11          if a_i < C then C ← left(C)
12          else C ← right(C)
13      Insert a_i at C
14  return inorder traversal of B

Within each iteration of the for-loop starting in line 2, we select one item $a_i$ of $A$ uniformly at random from the items that have not been processed yet. Then, we insert this item $a_i$ into $B$, while maintaining the invariant that $B$ remains a BST with respect to clean comparisons $<$. Inserting item $a_i$ into $B$ requires three phases, as illustrated in Figure 2.1. The first phase involves performing a search for the insertion position using dirty comparisons $\hat<$, keeping track of the search path. Here, we denote by $C_t$ the $t$th vertex on this path, and by $L_t$ and $R_t$ the lower and upper bounds on items that can be inserted in the subtree rooted at $C_t$ without violating the BST property with respect to $<$.

[Figure 2.1: The insertion process in dirty comparison sorting.]

Correctness of the choice of $L_t$ and $R_t$ follows from the fact that $B$ was a BST with respect to $<$ before the current insertion. This dirty procedure stops when the search path reaches a NIL-leaf, which is regarded as the predicted position for $a_i$'s insertion. However, since we used dirty comparisons to trace the path, $a_i$ might violate one of the boundary conditions $L_t < a_i$ or $a_i < R_t$ at some recursion step $t$. We call a recursion step $t$ valid for $a_i$ if $L_t < a_i < R_t$. In the verification phase in line 9, we traverse the dirty search path in reverse order to locate the last valid step $t^*$. A naive method to do this (which is sufficient for our asymptotic guarantees) is to repeatedly decrease $t$ by 1 until $t$ is valid; we discuss an alternative, more efficient method in Remark A.6, which yields a better constant factor. The final phase involves performing a clean search starting from $C_{t^*}$ to determine the correct insertion position for $a_i$. After inserting $a_i$ into that position, $B$ remains a BST with respect to $<$. Once all items of $A$ are inserted into $B$, we obtain the sorted order through an inorder traversal of $B$.
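To illustrate the three phases, here is a compact Python rendering of Algorithm 1. It is a simplified sketch rather than the authors' implementation: the BST is a plain unbalanced tree of dictionaries, the verification walks the path backwards naively, and the clean search restarts from the last valid vertex (which costs at most a constant number of extra comparisons compared to inserting directly at a valid NIL position). All names are our own.

```python
import random

def sort_with_dirty_comparisons(items, dirty_less, clean_less):
    """Sketch of Algorithm 1. dirty_less(x, y) is the fast, possibly wrong
    comparator; clean_less(x, y) is the exact one."""
    root = None  # nodes are dicts: {'item': ..., 'left': ..., 'right': ...}

    for x in random.sample(items, len(items)):  # uniformly random order
        if root is None:
            root = {'item': x, 'left': None, 'right': None}
            continue

        # Phase 1: dirty search from the root, recording each visited node
        # together with the bounds (lo, hi) of its subtree's admissible range.
        path, node, lo, hi = [], root, None, None
        while node is not None:
            path.append((node, lo, hi))
            if dirty_less(x, node['item']):
                node, hi = node['left'], node['item']
            else:
                node, lo = node['right'], node['item']

        # Phase 2: verification with clean comparisons; since the admissible
        # ranges are nested, the valid steps form a prefix of the path.
        t = len(path) - 1
        while t > 0:
            _, lo, hi = path[t]
            if (lo is None or clean_less(lo, x)) and (hi is None or clean_less(x, hi)):
                break
            t -= 1

        # Phase 3: clean search downwards from the last valid node.
        node = path[t][0]
        while True:
            side = 'left' if clean_less(x, node['item']) else 'right'
            if node[side] is None:
                node[side] = {'item': x, 'left': None, 'right': None}
                break
            node = node[side]

    # In-order traversal yields the sorted order.
    out, stack, node = [], [], root
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node['left']
        node = stack.pop()
        out.append(node['item'])
        node = node['right']
    return out
```

If every dirty comparison on a search path is correct, all recorded steps are valid, so the verification stops immediately and the clean search costs only $O(1)$ comparisons per insertion, recovering the $O(n)$ clean-comparison consistency bound.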
2.2 Complexity Analysis

The idea is to show that the dirty search path has expected depth $O(\log n)$, whereas the verification path and the clean search path only have expected depth $O(\log \eta_i)$. Therefore, the algorithm needs only $O(\log n)$ dirty and $O(\log \eta_i)$ clean comparisons for inserting each $a_i$; Theorem 1.1 is established consequently. The full proof is given in Appendix A.

3 Sorting with Positional Predictions

In this section, we propose two algorithms that are capable of leveraging positional predictions. The first one has complexity bounds in terms of the displacement error measure, and the second one has complexity bounds in terms of $\eta^{\mathrm{l}}$ and $\eta^{\mathrm{r}}$, the one-sided error measures. Each algorithm is effective in tasks where the given prediction is accurate with respect to its corresponding error measure.

3.1 Displacement Sort

We present a sorting algorithm with positional predictions whose comparison and time complexity rely solely on the displacement error of the predictor. The algorithm is adapted from Local Insertion Sort (Mannila, 1985), an adaptive sorting algorithm proven to be optimal for various measures of presortedness. We use a classic data structure called a finger tree (Guibas et al., 1977), which is a balanced binary search tree (BBST) equipped with a pointer (the "finger") pointing towards the last inserted vertex. When a new value $v$ is to be inserted, rather than searching for the insertion position from the root, the insertion position is found by moving the finger from the last inserted vertex $u$ to the suitable new position. By the balance property of the BBST, the insertion can be performed in $O(\log d(u, v))$ amortized time, where $d(u, v)$ is the number of vertices in the tree whose value lies in the closed interval from $u$ to $v$.

Algorithm 2 details the proposed method. We first bucket sort (in time $O(n)$) the items in $A$ based on their predicted positions, such that we may assume for all $i < j$ that $\hat p(i) \le \hat p(j)$. Following the rearranged order, the items of $A$ are sequentially inserted into an initially empty finger tree $T$. After all insertions, we obtain the exactly sorted array by an inorder traversal of $T$.

Algorithm 2: Sorting with complexity in terms of η^dis
Input: A = ⟨a_1, ..., a_n⟩, prediction p̂
 1  BucketSort(A, p̂)        ▷ bucket sort A according to p̂, so that p̂(1) ≤ p̂(2) ≤ ... ≤ p̂(n)
 2  T ← an empty one-finger tree
 3  for i = 1, ..., n do
 4      Insert a_i into T
 5  return nodes in T in sorted order (via inorder traversal)

The full proof of the running time and comparison complexity of Algorithm 2 is given in Appendix C.
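The following Python sketch mirrors Algorithm 2 but replaces the finger tree with a plain sorted list and an expanding-window search around the previous insertion point; this preserves the $O(\log d(u, v))$ comparison count per insertion, though not the finger tree's time bound (list insertion is linear). Function and variable names are ours, and predictions are assumed to be 0-indexed.

```python
def displacement_sort(items, p_hat, less):
    """Sketch of Displacement Sort with a sorted list in place of a finger tree."""
    n = len(items)
    # Stable bucket sort by predicted position (predictions in {0, ..., n-1}).
    buckets = [[] for _ in range(n)]
    for i in range(n):
        buckets[p_hat[i]].append(items[i])
    order = [x for b in buckets for x in b]

    a, finger = [], 0
    for x in order:
        # Expand a window around the finger until it brackets x's position.
        r = 1
        while True:
            lo, hi = max(0, finger - r), min(len(a), finger + r)
            lo_ok = (lo == 0) or less(a[lo - 1], x)
            hi_ok = (hi == len(a)) or less(x, a[hi])
            if lo_ok and hi_ok:
                break
            r *= 2
        # Binary search for the exact insertion point inside [lo, hi].
        while lo < hi:
            mid = (lo + hi) // 2
            if less(a[mid], x):
                lo = mid + 1
            else:
                hi = mid
        a.insert(lo, x)
        finger = lo
    return a
```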
3.2 Double-Hoover Sort

Now we turn our focus to settings where the one-sided errors are small. We first describe a simple algorithm with comparison complexity as a function of either $\eta^{\mathrm{l}}$ or $\eta^{\mathrm{r}}$. Following this, we introduce our algorithm Double-Hoover Sort, which has the comparison complexity claimed in Theorem 1.4. Both algorithms begin by bucket sorting $A$ with respect to the positional prediction in $O(n)$ time. Thus, we may subsequently assume that $A$ is rearranged such that for all $i < j$, $\hat p(i) \le \hat p(j)$.

A First Approach. A left-sided sorting complexity of $O(\sum_{i=1}^n \log(2 + \eta^{\mathrm{l}}_i))$ can be easily achieved using the standard technique of learning-augmented binary search, as described in (Lykouris and Vassilvitskii, 2021; Mitzenmacher and Vassilvitskii, 2022). Specifically, a sorted array $L$ is maintained, and $a_1, \dots, a_n$ are sequentially inserted into $L$. During each insertion, we perform a learning-augmented binary search starting from the rightmost position of $L$, taking $O(\log(\eta^{\mathrm{l}}_i + 2))$ comparisons to find the correct insertion position. In total, $O(\sum_i \log(\eta^{\mathrm{l}}_i + 2))$ comparisons are taken over all insertions. By replacing the array $L$ with an appropriate data structure (e.g., a BBST with a finger that always returns to the rightmost element), one can achieve the same bound also for the time complexity. A reversed, right-sided version of this algorithm achieves complexity $O(\sum_i \log(\eta^{\mathrm{r}}_i + 2))$. By simultaneously running the left-sided and right-sided algorithms, one can achieve the complexity bound of $O\big(\min\big\{\sum_{i=1}^n \log(2 + \eta^{\mathrm{l}}_i), \sum_{i=1}^n \log(2 + \eta^{\mathrm{r}}_i)\big\}\big)$. However, moving the min operator inside the summation requires a more elaborate approach.
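As an illustration of this one-sided building block (our own sketch, not the paper's code): after bucket sorting by prediction, each item is inserted into a sorted Python list via an exponential search anchored at the right end, followed by a binary search inside the resulting bracket, so item $a_i$ costs $O(\log(\eta^{\mathrm{l}}_i + 2))$ comparisons (array insertion time is ignored).

```python
def one_sided_sort(items, p_hat, less):
    """Left-sided insertion sort guided by positional predictions (0-indexed)."""
    n = len(items)
    buckets = [[] for _ in range(n)]
    for i in range(n):
        buckets[p_hat[i]].append(items[i])

    a = []
    for x in (y for b in buckets for y in b):
        # Gallop from the right end: double r while the r-th largest element exceeds x.
        r = 1
        while r <= len(a) and less(x, a[len(a) - r]):
            r *= 2
        lo, hi = max(0, len(a) - r), len(a) - r // 2   # bracket containing the insertion point
        while lo < hi:                                  # binary search inside the bracket
            mid = (lo + hi) // 2
            if less(a[mid], x):
                lo = mid + 1
            else:
                hi = mid
        a.insert(lo, x)
    return a
```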
Double-Hoover Sort. The basic idea is that, to utilize an insertion scheme similar to that employed in the one-sided algorithm, we maintain two sorted structures $L$ and $R$ at the same time, and insert each item into one of them, depending on which operation is faster, thereby achieving a per-item complexity bound of $O(\log(\min\{\eta^{\mathrm{l}}_i, \eta^{\mathrm{r}}_i\} + 2))$. Then, a final sorted list is attained by merging $L$ and $R$ in linear time. However, a significant issue yet to be addressed is how to decide the insertion order of different items. Consider two items $a_u$ and $a_v$ with $u < v$ (so $\hat p(u) \le \hat p(v)$ by the bucket sort). In our algorithm, if both are to be inserted into $L$, it is crucial that $a_u$ is inserted prior to $a_v$; otherwise, the insertion complexity of $a_u$ could exceed the bound of $\log(\eta^{\mathrm{l}}_u + 2)$. Conversely, if both $a_u$ and $a_v$ are to be inserted into $R$, then $a_v$ should be inserted prior to $a_u$. Since we cannot predict whether an item will be inserted into $L$ or $R$, formulating an appropriate insertion order that respects the constraints on both sides is impossible.

We tackle this issue of insertion order with a strength-based, $O(\log n)$-round insertion scheme. Intuitively, we think of $L$ and $R$ as two hoovers, with their suction power increasing simultaneously over time. Each item (as dust) is extracted from the array and inserted into one hoover once the suction power reaches the required strength: a hoover with suction power $\delta$ is able to absorb items that can be inserted into it with $O(\log \delta)$ comparisons. Details are illustrated in Algorithm 3. The sorted structures $L$ and $R$ can be implemented by arrays, though alternative data structures could provide better time complexity. We conduct insertions in rounds, setting $\delta$ to be $1, 2, 4, \dots, 2^{\lceil \log n \rceil}$. In each round, we iterate over $a_1, \dots, a_n$ to decide if they should be inserted into $L$ in the current round. Then, we iterate over $a_n, \dots, a_1$ in reverse, to decide if they should be inserted into $R$ in the current round. Note that if an item is inserted into either $L$ or $R$, it is omitted in later rounds.

Algorithm 3: Double-Hoover Sort
Input: A = ⟨a_1, ..., a_n⟩, prediction p̂
 1  BucketSort(A, p̂)        ▷ bucket sort A according to p̂, so that p̂(1) ≤ p̂(2) ≤ ... ≤ p̂(n)
 2  L, R ← empty sorted arrays
 3  for δ = 2^0, 2^1, ..., 2^⌈log n⌉ do
 4      for i = 1, ..., n if a_i has not been inserted do
 5          L_{<i} ← {a_j ∈ L : j < i}
 6          l_i^δ ← −∞ if |L_{<i}| < δ, else the δth largest item in L_{<i}
 7          if a_i > l_i^δ then
 8              insert a_i into L by binary search, starting on the interval {x ∈ L : l_i^δ ≤ x ≤ l_i},
 9              where l_i := min_{δ' < δ} l_i^{δ'}
10      for i = n, ..., 1 if a_i has not been inserted do
11          R_{>i} ← {a_j ∈ R : j > i}
12          r_i^δ ← +∞ if |R_{>i}| < δ, else the δth smallest item in R_{>i}
13          if a_i < r_i^δ then
14              insert a_i into R by binary search, starting on the interval {x ∈ R : r_i ≤ x ≤ r_i^δ},
15              where r_i := max_{δ' < δ} r_i^{δ'}
16  return merge(L, R)

[Figure 3.1: An example of the insertion process in Double-Hoover Sort.]

The insertion process of an item in one round is depicted in Figure 3.1. To decide whether $a_i$ should be inserted into $L$ with strength $\delta$, let $l_i^\delta$ denote the $\delta$th largest item in $L$ with index smaller than $i$, representing the boundary value in this round. If there are fewer than $\delta$ eligible items in $L$, set $l_i^\delta$ to $-\infty$. Let $l_i$ denote $\min_{\delta' < \delta} l_i^{\delta'}$, the minimum boundary value over previous rounds. If $l_i^\delta$ is smaller than $a_i$, we employ binary search to insert $a_i$ into $L$, starting with the interval $\{x \in L \colon l_i^\delta \le x \le l_i\}$. Conversely, to decide whether $a_i$ should be inserted into $R$, we adopt a symmetric approach, as depicted in lines 11 to 15 of Algorithm 3. After all insertion rounds, we merge $L$ and $R$ in linear time to obtain the sorted result.

Correctness. The correctness of Double-Hoover Sort arises from the invariant that both $L$ and $R$ remain sorted after all insertions. It suffices to show that the initial insertion intervals at line 8 and line 14 always cover the value of $a_i$. At line 8, $l_i^\delta < a_i$ holds trivially by the conditioning. Since $a_i$ was not inserted into $L$ in any previous round, $a_i < l_i^{\delta'}$ for all $\delta' < \delta$. Since the minimum operation preserves the inequality, $a_i < l_i$ also holds. A similar argument applies to the interval at line 14.

To discern the comparison complexity of the proposed algorithm, we rely on the following lemma.

Lemma 3.1. Upon the insertion of an item $a_i$ into $L$ (resp. $R$), all items in $L \setminus L_{<i}$ are larger than $l_i$ (resp. all items in $R \setminus R_{>i}$ are smaller than $r_i$). As a result, the initial intervals at line 8 and line 14 are subsets of $L_{<i}$ and $R_{>i}$, respectively.

The proof of Lemma 3.1 and the resulting comparison bound of Theorem 1.4 are given in Appendix D.

A Proof of Theorem 1.1

Throughout this appendix, fix an item $a_i$ and consider its insertion into $B$. Let $T$ denote the number of steps of the dirty search, let $s_t$ denote the number of vertices of the active subtree at step $t$ (the subtree rooted at $C_t$), and recall that $t^*$ is the last valid step. We refer to the item at the root of the active subtree as the pivot of the current step. The analysis rests on the following lemma.

Lemma A.1. Consider the dirty search or the clean search of an insertion, and let $s \ge 1$. Once the active subtree has at most $s$ vertices, the search reaches an active subtree with fewer than $s/2$ vertices after $O(1)$ further steps in expectation; in particular, the expected number of steps during which the active subtree has more than $s/2$ and at most $s$ vertices is $O(1)$. Moreover, the same bounds hold conditioned on the value of $s_{t^*}$.

Proof. We employ a percentile argument. Consider the first step of either while-loop where the active subtree has at most $s$ vertices, and let $V$ be the set of these vertices. Note that the pivot in this step is the first element of $V$ inserted into the tree. Conditioned on the vertex set of the active subtree being $V$, all elements of $V$ are equally likely to be the pivot, since their insertion order is uniformly random. This is true even when conditioned on $s_{t^*} = s'$, since reordering the elements within the set $V$ does not change the value of $s_{t^*}$, which is determined at higher vertices of the tree. Thus, with probability at least 1/2 the pivot lies between the 25th percentile and the 75th percentile of $V$, in which case both child subtrees contain at most $\frac{3}{4}|V|$ vertices each. Hence, the size of the active subtree shrinks by a factor of $\frac{3}{4}$ after at most 2 steps in expectation, and it shrinks to a size smaller than $s/2$ after at most $O(1)$ steps in expectation.
Based on this lemma, we are able to characterize the expected length of the dirty search, the clean search, and the dirty search after time $t^*$.

Lemma A.2. A dirty search takes $O(\log n)$ steps in expectation, i.e., $\mathbb{E}[T] = O(\log n)$.

Proof. In each dirty search, the initial largest subtree has size at most $n$. We can apply Lemma A.1 repeatedly on $s = n, \frac{n}{2}, \dots, 1$. By linearity of expectation, we derive that a dirty search takes $\lceil \log n \rceil \cdot O(1) = O(\log n)$ steps in expectation.

Lemma A.3. A clean search takes $O(\mathbb{E}[\log(s_{t^*} + 1)])$ steps in expectation; similarly, the dirty search after reaching $t^*$ takes $O(\mathbb{E}[\log(s_{t^*} + 1)])$ steps in expectation, i.e., $\mathbb{E}[T - t^*] = O(\mathbb{E}[\log(s_{t^*} + 1)])$.

Proof. Assume $s_{t^*} = s'$. Then, in the dirty search after $t^*$, the initial largest subtree has size $s'$, and the rest of the active subtrees all have size at most $s' - 1$. By applying the second part of Lemma A.1 on $s = s'-1, \frac{s'-1}{2}, \dots, 1$ and summing, we obtain $\mathbb{E}[T - t^* \mid s_{t^*} = s'] = O(\log(s' + 1))$. Thus,
$$\mathbb{E}[T - t^*] = \sum_{s'} \mathbb{E}[T - t^* \mid s_{t^*} = s'] \cdot \Pr[s_{t^*} = s'] = O\big(\mathbb{E}[\log(s_{t^*} + 1)]\big).$$
The bound on the clean search follows in the same way.

In order to relate $\mathbb{E}[\log(s_{t^*} + 1)]$ to the prediction error $\eta_i$, the next lemma first characterizes the probability of a given time step $t$ being $t^*$ as a function of the subtree size $s_t$.

Lemma A.4. For any $t$ and $k$, $\Pr\big[t^* = t \mid s_t \in (2^{k-1}, 2^k]\big] \le \eta_i / 2^{k-1}$.

Proof. Recall that $t^*$ is the last valid time step. A shift from a valid time step to an invalid time step occurs only if the dirty comparison between the pivot and $a_i$ is wrong. Among all $s_t$ potential pivots, at most $\eta_i$ can have mistaken dirty comparisons with $a_i$, and they are all equally likely to be the pivot (depending on which of them was inserted first). Hence, given $s_t \in (2^{k-1}, 2^k]$, the probability of a pivot with a mistaken comparison is at most $\eta_i / s_t \le \eta_i / 2^{k-1}$.

Based on Lemma A.4, we present the central claim bridging prediction error with comparison complexity.

Lemma A.5. $\mathbb{E}[\log(s_{t^*} + 1)] = O(\log(\eta_i + 1))$.

Proof. We have
$$\Pr\big[s_{t^*} \in (2^{k-1}, 2^k]\big] = \sum_t \Pr\big[t^* = t \wedge s_t \in (2^{k-1}, 2^k]\big] \le \frac{\eta_i}{2^{k-1}} \sum_t \Pr\big[s_t \in (2^{k-1}, 2^k]\big] = \frac{\eta_i}{2^{k-1}} \cdot \mathbb{E}\big[m^{\mathrm{dirty}}_{2^k}\big] = O\Big(\frac{\eta_i}{2^{k-1}}\Big),$$
where $m^{\mathrm{dirty}}_{2^k}$ denotes the number of dirty search steps $t$ with $s_t \in (2^{k-1}, 2^k]$; the inequality uses Lemma A.4, and $\mathbb{E}[m^{\mathrm{dirty}}_{2^k}] = O(1)$ by Lemma A.1. Thus,
$$\mathbb{E}[\log(s_{t^*} + 1)] \le \log(\eta_i + 1) + 2 + \sum_{k \ge \log(\eta_i + 1)} (k + 1) \cdot \Pr\big[s_{t^*} \in (2^{k-1}, 2^k]\big] \le \log(\eta_i + 1) + 2 + \eta_i \sum_{k \ge \log(\eta_i + 1)} O\big(k \, 2^{-k}\big) = O(\log(\eta_i + 1)).$$

Theorem 1.1 is subsequently deduced.

Proof of Theorem 1.1. Dirty comparisons are only conducted during the dirty search, when the recursion step is incremented by one. By Lemma A.2, the total number of dirty comparisons is bounded by the sum of the numbers of steps across all insertions, which is $n \cdot O(\log n)$ in expectation. In each verification phase, as we traverse the dirty search path in reverse order to locate $t^*$, at most $T - t^* + 2$ clean comparisons suffice. In each clean search phase, the expected number of clean comparisons equals the expected number of steps of the clean search. Therefore, based on Lemma A.3 and Lemma A.5, the total number of clean comparisons performed in both phases is $O(\sum_i \log(\eta_i + 2))$. Additionally, the running time is dominated by the number of dirty and clean comparisons.

Remark A.6. The algorithm can be implemented such that the number of clean comparisons is at most that of quicksort plus $O(n \log \log n)$, regardless of the prediction error. Thus, even with terrible predictions our algorithm matches the performance of quicksort up to a factor that tends to 1 as $n \to \infty$. To achieve this, we can implement the verification step by decreasing $t$ in geometrically increasing step sizes until a valid $t$ has been found, and then perform a binary search for $t^*$ between the last two attempted values of $t$.
This reduces the number of clean comparisons in line 9 from $O(T - t^*)$ to $O(\log(T - t^*))$, which is at most $O(\log \log n)$ in expectation by Lemma A.2, and at most $O(n \log \log n)$ for all verification steps together. The remaining clean comparisons are performed during the clean searches. In the worst case (when all dirty comparisons are incorrect) all clean searches start from the root, and together they perform exactly the same set of comparisons as quicksort (by a coupling argument between the random choices of the two algorithms: e.g., the root of the search tree corresponds to the initial uniformly random pivot of quicksort).

B Extensions of the Dirty Comparison Setting

We consider two extensions of the dirty comparison setting. The first one assumes that dirty comparisons are probabilistic; the second one discusses the setting where multiple predictors are available. We briefly discuss the extension of our algorithms and results to these new settings.

B.1 Probabilistic Dirty Comparisons

Our algorithm extends to the case where dirty comparisons are probabilistic. Assume that for each pair of objects $i, j$, the dirty comparison between them yields an incorrect result with probability $\eta_{ij}$. Algorithm 1 can be directly applied in this setting, achieving the same guarantees by defining $\eta_i = \sum_j \eta_{ij}$; the proof remains unchanged. If repeatedly querying the same dirty comparison yields independent results, the number of clean comparisons can be further reduced: Let $\epsilon_{ij} := \min\{\eta_{ij}, 1/2\}$. When querying a dirty comparison $2k$ times, the probability that the correct answer fails to secure a majority vote is at most $(4\epsilon_{ij}(1 - \epsilon_{ij}))^k$: For $\eta_{ij} \ge 1/2$, this bound is trivial. Otherwise, there are $2^{2k}$ strings of length $2k$ over the alphabet {correct, incorrect}, and each string that is at least half incorrect has probability at most $(\epsilon_{ij}(1 - \epsilon_{ij}))^k$. So by repeating each dirty comparison query $2k$ times, we obtain an algorithm that performs $O(kn \log n)$ dirty comparisons and $O\big(\sum_i \log\big(2 + \sum_j (4\epsilon_{ij}(1 - \epsilon_{ij}))^k\big)\big)$ clean comparisons.
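A minimal sketch of this boosting step (our own illustration; it uses an odd number of repetitions to avoid ties, which only strengthens the guarantee):

```python
import random

def boosted_dirty_less(dirty_less, k):
    """Query an independent probabilistic dirty comparator 2k + 1 times and
    return the majority answer. If a single query errs with probability eps,
    the majority errs with probability at most roughly (4*eps*(1-eps))**k,
    in the spirit of Appendix B.1."""
    def majority_less(x, y):
        votes = sum(1 for _ in range(2 * k + 1) if dirty_less(x, y))
        return votes > k
    return majority_less

# Example with a comparator that errs independently with probability 0.3.
def noisy_less(x, y, err=0.3):
    truth = x < y
    return (not truth) if random.random() < err else truth

boosted = boosted_dirty_less(noisy_less, k=3)
```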
B.2 Multiple Predictors

We now discuss the setting where multiple predictors are available and prove Theorem 1.2. Suppose we have $k$ different dirty comparison predictors, and let $\eta^p_i$ denote the number of incorrect comparisons by predictor $p$ for item $i$. We prove Theorem 1.2 by reduction to the problem of prediction with expert advice: In this problem, there are $k$ experts, and each incurs a loss in the range $[0, 1]$ per time step. An algorithm must select an expert in each round before the losses are revealed and then incurs the loss of the chosen expert. According to (Freund and Schapire, 1997, Equation (9)), their algorithm HEDGE has an expected loss of $O(L + \log k)$, where $L$ is the total loss of the best expert in hindsight. In our case, the experts correspond to the predictors, and time steps correspond to the $n$ iterations of the for-loop of Algorithm 1 where an item is inserted into the BST. We define the loss of expert $p$ in the time step where $a_i$ is inserted as $\ell^p_i = \log(1 + \tilde\eta^p_i)/\log(n)$, where $\tilde\eta^p_i \le \eta^p_i$ is the number of incorrect comparisons of predictor $p$ between $a_i$ and the items already in the BST at this time. Note that the value of $\tilde\eta^p_i$ can be determined at the end of the time step (once $a_i$ is correctly placed in the BST) without any additional clean comparisons; the division by $\log n$ ensures that $\ell^p_i \in [0, 1]$ as required.

The algorithm for multiple predictors proceeds as Algorithm 1, querying for dirty comparisons at a given time step the predictor $p$ corresponding to the expert chosen by HEDGE at that time step. Recall from the analysis in Section 2.2 that the expected number of clean comparisons in this time step is then $O(\log(1 + \tilde\eta^p_i))$, which is an $O(\log n)$ factor larger than the loss suffered by HEDGE. Consequently, by the $O(L + \log k)$ bound on the cost of HEDGE, the total expected number of clean comparisons is $O\big(\min_p \sum_i \log(1 + \eta^p_i) + \log(k)\log(n)\big)$. Since $k \le 2^{O(n/\log n)}$, the term $\log(k)\log(n) = O(n)$ is negligible, and Theorem 1.2 follows.

C Proof of Comparison Complexity of Displacement Sort

Proof of Theorem 1.3. We focus on the insertion process of each item $a_i$ in Algorithm 2. Let $d_i$ denote the number of nodes between $a_i$ and $a_{i-1}$ in an inorder traversal of the tree after inserting $a_i$, including themselves. These nodes must have their correct ranking in the final sorted list between $p(i-1)$ and $p(i)$; hence $d_i \le |p(i) - p(i-1)| + 1$ for all $i = 2, \dots, n$. Therefore, the running time and the number of comparisons over all insertions are bounded by
$$\sum_{i=2}^n O(\log d_i) \le \sum_{i=2}^n O\big(\log(|p(i) - p(i-1)| + 1)\big) \le \sum_{i=2}^n O\big(\log(|p(i) - \hat p(i)| + |p(i-1) - \hat p(i-1)| + \hat p(i) - \hat p(i-1) + 1)\big)$$
$$\le \sum_{i=2}^n O\big(\log\big(3 \max\{\eta^{\mathrm{dis}}_i + 1,\; \eta^{\mathrm{dis}}_{i-1} + 1,\; \hat p(i) - \hat p(i-1) + 1\}\big)\big) \le \sum_{i=1}^n O\big(\log(\eta^{\mathrm{dis}}_i + 1)\big) + O(n) = O\Big(\sum_{i=1}^n \log(\eta^{\mathrm{dis}}_i + 2)\Big).$$
The second inequality is by $\hat p(i-1) \le \hat p(i)$ and the triangle inequality; the third inequality is by monotonicity of the logarithm. The penultimate inequality is justified by $\log(\max\{x, y, z\}) \le \log x + \log y + \log z$ for all $x, y, z \ge 1$ and
$$\sum_{i=2}^n \log(\hat p(i) - \hat p(i-1) + 1) \le \sum_{i=2}^n (\hat p(i) - \hat p(i-1)) \le n.$$
This concludes the proof of Theorem 1.3.

D Proof of Comparison Complexity of Double-Hoover Sort

Lemma D.1. Upon the insertion of an item $a_i$ into $L$ (resp. $R$), all items in $L \setminus L_{<i}$ are larger than $l_i$ (resp. all items in $R \setminus R_{>i}$ are smaller than $r_i$). As a result, the initial intervals at line 8 and line 14 are subsets of $L_{<i}$ and $R_{>i}$, respectively.

Proof. Consider an item $a_j \in L$ with $j > i$ at the time $a_i$ is inserted into $L$ in round $\delta$. Then $a_j$ was inserted into $L$ in an earlier round $\delta' < \delta$, at which point $a_j > l_j^{\delta'}$. Since $L_{<i} \subseteq L_{<j}$, we have $l_j^{\delta'} \ge l_i^{\delta'} \ge l_i$. Hence, the interval $\{x \in L \mid l_i^\delta \le x \le l_i\}$ only contains items in $L_{<i}$; the argument for $R$ is symmetric.

Let $\delta^{\mathrm{l}}_i$ be the smallest power of two with $\delta^{\mathrm{l}}_i > \eta^{\mathrm{l}}_i$, and suppose $a_i$ has not been inserted before round $\delta^{\mathrm{l}}_i$. The number of items in $L_{<i}$ that are larger than $a_i$ satisfies
$$|\{a_j \in L : j < i \wedge a_j > a_i\}| \le |\{j \in [n] : \hat p(j) \le \hat p(i) \wedge p(j) > p(i)\}| = \eta^{\mathrm{l}}_i < \delta^{\mathrm{l}}_i.$$
Hence, in round $\delta^{\mathrm{l}}_i$, $a_i$ must be larger than the boundary value, which is the $\delta^{\mathrm{l}}_i$th largest item in $L_{<i}$