# Bagged Regularized k-Distances for Anomaly Detection

Journal of Machine Learning Research 26 (2025) 1-59. Submitted 11/23; Revised 3/25; Published 7/25.

Yuchao Cai (yccai@nus.edu.sg), Department of Statistics and Data Science, National University of Singapore, 117546, Singapore.
Hanfang Yang (hyang@ruc.edu.cn), Center for Applied Statistics and School of Statistics, Renmin University of China, 100872 Beijing, China.
Yuheng Ma (yma@ruc.edu.cn), School of Statistics, Renmin University of China, 100872 Beijing, China.
Hanyuan Hang (hanyuan0725@gmail.com), Hong Kong Research Institute, Contemporary Amperex Technology (Hong Kong) Limited, Hong Kong Science Park, New Territories, Hong Kong.

Editor: Aryeh Kontorovich

We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from sensitivity to the choice of the number of nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized k-distances for anomaly detection (BRDAD), which converts the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite-sample bound of the empirical risk of the bagged weighted k-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the bagging technique incorporated in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity.
On the practical side, we conduct numerical experiments to illustrate the insensitivity of our algorithm to parameter selection compared with other state-of-the-art distance-based methods. Furthermore, with the introduced bagging technique, our method achieves superior performance on real-world datasets compared to other approaches.

Keywords: Unsupervised learning, density estimation, anomaly detection, weighted k-distances, regularization, surrogate risk minimization (SRM), bagging, ensemble learning, optimal convergence rates, learning theory

©2025 Yuchao Cai, Hanfang Yang, Yuheng Ma, Hanyuan Hang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1519.html.

1. Introduction

Anomaly detection refers to the process of identifying patterns or instances that deviate significantly from the expected behavior within a dataset (Chandola et al., 2009). It has been widely and carefully studied within diverse research areas and application domains, including industrial engineering (Fahim and Sillitti, 2019; Wang et al., 2021), medicine (Fernando et al., 2021; Tschuchnig and Gadermayr, 2021), cyber security (Folino et al., 2023; Ravinder and Kulkarni, 2023), earth science (Luz et al., 2022; Chen et al., 2022), and finance (Lokanan et al., 2019; Hilal et al., 2022). For further discussion of anomaly detection techniques and applications, we refer readers to the survey of Nassif et al. (2021). Based on the availability of labeled data, anomaly detection problems can be classified into three main paradigms. The first is the supervised paradigm, where both the normal and anomalous instances are labeled. As mentioned in Aggarwal (2016a) and Vargaftik et al. (2021), researchers often employ existing binary classifiers in this case.
The second is the semi-supervised paradigm, where the training data consists only of normal samples, and the goal is to identify anomalies that deviate from the normal samples (Akcay et al., 2018; Zhou et al., 2023). Perhaps the most flexible yet challenging paradigm is the unsupervised paradigm (Aggarwal, 2016a; Gu et al., 2019), where no labeled examples are available to train an anomaly detector. For the remainder of this paper, we focus only on the unsupervised paradigm, where we do not assume any prior knowledge of labeled data. The existing algorithms in the literature on unsupervised anomaly detection can be roughly grouped into three main categories. The first category is distance-based methods, which determine an anomaly score based on the distance between data points and their neighboring points. For example, k-nearest neighbors (k-NN) (Ramaswamy et al., 2000) calculates the anomaly score of an instance based on the distance to its k-th nearest neighbor, distance-to-measure (DTM) (Gu et al., 2019) introduces a novel distance metric based on the distances of the first k nearest neighbors, and local outlier factor (LOF) (Breunig et al., 2000) computes the anomaly score by quantifying the deviation of the instance from the local density of its neighboring data points. The second category is forest-based methods, which compute anomaly scores based on tree structures. For instance, isolation forest (iForest) (Liu et al., 2008) constructs an ensemble of trees to isolate data points and quantifies the anomaly score of each instance based on the distance from its leaf node to the root in the constructed tree, and partial identification forest (PIDForest) (Gopalan et al., 2019) computes the anomaly score of a data point by identifying the minimum density of data points across all subcubes partitioned by decision trees.
The third category is kernel-based methods such as the one-class SVM (OCSVM) (Schölkopf et al., 1999), which defines a hyperplane to maximize the margin between the origin and normal samples. It has been empirically shown (Aggarwal and Sathe, 2015; Aggarwal, 2016b; Gu et al., 2019) that distance-based and forest-based methods are the top-performing methods across a broad range of real-world datasets. Moreover, experiments in Gu et al. (2019) suggest that distance-based methods show their advantage on high-dimensional datasets, as forest-based methods are likely to neglect a substantial number of features when dealing with high-dimensional data. Unfortunately, it is widely acknowledged that distance-based methods suffer from sensitivity to the choice of the hyperparameter k (Aggarwal, 2012). This problem is particularly severe in unsupervised learning tasks because the absence of labeled data makes it difficult to guide the selection of hyperparameters. To the best of our knowledge, no algorithm in the literature effectively solves this sensitivity problem. Besides, while distance-based methods are crucial and efficient for identifying anomalies, they pose a challenge in scenarios with a high volume of data samples, owing to the considerable expansion of the nearest-neighbor search, which leads to a notable increase in computational overhead. Therefore, improving the computational efficiency of distance-based algorithms also remains a great challenge. In this paper, we propose a distance-based algorithm, bagged regularized k-distances for anomaly detection (BRDAD), which formulates the weight selection problem in unsupervised anomaly detection as a minimization problem. Specifically, we first establish the surrogate risk, a finite-sample bound on the empirical risk of the bagged weighted k-distances for density estimation (BWDDE).
At each bagging round, we determine the weights by minimizing the surrogate risk on a subsampled dataset. Then, using an independently drawn subsample of the same size, we compute the corresponding k-distances. By combining the learned weights and these k-distances, we obtain the regularized k-distance. The final anomaly scores are derived by averaging these regularized k-distances, referred to as bagged regularized k-distances. BRDAD ranks the data in descending order of these scores and identifies the top m instances as anomalies. BRDAD offers two key advantages. First, the surrogate risk minimization (SRM) approach effectively mitigates the sensitivity of parameter choices in distance-based methods. Second, the incorporation of bagging enhances computational efficiency, making the method scalable to large datasets. The contributions of this paper are summarized as follows. (i) We propose a new distance-based algorithm, BRDAD, that avoids the sensitivity of hyperparameter selection in unsupervised anomaly detection problems by formulating the selection as a convex optimization problem. Moreover, the bagging technique incorporated in BRDAD improves the computational efficiency of our distance-based algorithm. (ii) From the theoretical perspective, we establish fast convergence rates of the AUC regret of BRDAD. Moreover, we show that with relatively few bagging rounds B, the number of iterations in the optimization problem at each bagging round can be reduced substantially. This demonstrates that the bagging technique significantly reduces computational complexity. (iii) From an experimental perspective, we conduct numerical experiments to evaluate the effectiveness of our proposed BRDAD method. First, we empirically validate the reasonableness of SRM by demonstrating similar convergence behaviors for both SR and MAE. Next, we compare BRDAD with distance-based, forest-based, and kernel-based methods on anomaly detection benchmarks, highlighting its superior performance.
Finally, we perform a parameter analysis on the number of bagging rounds B, showing that selecting an appropriate B based on the sample size leads to improved performance. The remainder of this paper is organized as follows. In Section 2, we introduce some preliminaries related to anomaly detection and propose our BRDAD algorithm. We provide basic assumptions and theoretical results on the convergence rates of BRDDE and BRDAD in Section 3. Some comments and discussions concerning the theoretical results will also be provided in this section. We present the error and complexity analysis of our algorithm in Section 4, together with some comments concerning the time complexity. We verify the theoretical findings of our algorithm by conducting numerical experiments in Section 5, where we also compare our algorithm with other state-of-the-art anomaly detection algorithms on real-world datasets. All the proofs of Sections 2, 3, and 4 can be found in Section 6. We conclude this paper in Section 7.

2. Methodology

We present our methodology in this section. Section 2.1 introduces basic notations and concepts. In Section 2.2, we propose the bagged weighted k-distances for density estimation (BWDDE) to demonstrate how bagged weighted k-distances can be applied to anomaly detection. Section 2.3 reformulates the weight selection problem for density estimation as a surrogate risk minimization problem, aiming to minimize the finite-sample bound of the empirical risk of BWDDE. Finally, the weights obtained by solving the SRM problem are utilized to construct our main algorithm, named bagged regularized k-distances for anomaly detection (BRDAD).

2.1 Preliminaries

We begin by introducing some fundamental notations that will frequently appear. Suppose that the data $D_n := \{X_1, \ldots, X_n\}$ are independent and identically distributed (i.i.d.)
and drawn from an unknown distribution $P$ that is absolutely continuous with respect to the Lebesgue measure $\mu$ and admits a unique density function $f$. In this paper, we assume that $f$ is supported on $[0,1]^d$, which we denote as $\mathcal{X}$. Recall that for $1 \le p < \infty$ and a vector $x \in \mathbb{R}^d$, the $\ell_p$-norm is defined as $\|x\|_p := (x_1^p + \cdots + x_d^p)^{1/p}$, and the $\ell_\infty$-norm is defined as $\|x\|_\infty := \max_{i=1,\ldots,d} |x_i|$. For a measurable function $g : \mathcal{X} \to \mathbb{R}$, we define the $L_p$-norm as $\|g\|_p := \big(\int_{\mathcal{X}} |g(x)|^p \, dx\big)^{1/p}$. Let $B(x, r) := \{x' \in \mathbb{R}^d : \|x - x'\|_2 \le r\}$ denote the ball in Euclidean space $\mathbb{R}^d$ centered at $x \in \mathbb{R}^d$ with radius $r \in (0, +\infty)$. We use $V_d$ to denote the volume of the $d$-dimensional closed unit ball. In addition, for $n \in \mathbb{N}_+$, we write $[n] := \{1, \ldots, n\}$ for the set containing the integers from 1 to $n$ and $\mathcal{W}_n := \{(w_1, \ldots, w_n) \in \mathbb{R}^n : \sum_{i=1}^n w_i = 1, \ w_i \ge 0, \ i \in [n]\}$. For any $x \in \mathbb{R}$, let $\lfloor x \rfloor$ be the largest integer less than or equal to $x$ and $\lceil x \rceil$ be the smallest integer larger than or equal to $x$. Throughout this paper, we use $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. Moreover, we use the following notations to compare the magnitudes of quantities: $a_n \lesssim b_n$ or $a_n = O(b_n)$ indicates that there exists a positive constant $c > 0$, independent of $n$, such that $a_n \le c b_n$; $a_n \gtrsim b_n$ implies that there exists a positive constant $c > 0$ such that $a_n \ge c b_n$; and $a_n \asymp b_n$ means that $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. In this paper, we focus on the case of a fixed dimension, allowing the constant $c$ to depend on the dimension $d$. Finally, we use $C$, $C'$, $c$, and $c'$ to represent positive constants.

2.2 Bagged Weighted k-Distances for Anomaly Detection

The learning goal of anomaly detection is to identify observations that deviate significantly from the majority of the data. Anomalies are typically rare and differ from the expected behavior of the data set. In this paper, we adopt the contamination model proposed by Huber (1965) and further discussed in Huber (1992, Section 1):

Assumption 1 The data $D_n$ consists of i.i.d.
samples drawn from a distribution $P$ satisfying the contamination model

$$P = (1 - \Pi) P_0 + \Pi P_1, \qquad (1)$$

where $P_0$ and $P_1$ denote the distributions of normal samples and anomalies, respectively, and $\Pi \in (0, 1)$ is the contamination proportion. Additionally, we assume that $P_0$ has a probability density function $f_0$, while $P_1$ is uniformly distributed over $[0,1]^d$ with density function $f_1$.

Assumption 1 describes a contamination model where the observed data is drawn from a mixture of normal and anomalous distributions. Assuming $P_1$ to be uniform on $[0,1]^d$ is a common choice, reflecting an uninformative prior on anomalies (Steinwart et al., 2005). Within this framework, anomaly detection reduces to identifying low-density regions, as the density of the normal data shares the same family of level sets as the density of the contaminated model. This classical assumption facilitates theoretical analysis and underpins several well-known unsupervised methods, including OC-SVM (Schölkopf et al., 2001) and deep learning-based approaches (Ruff et al., 2021). Building on this framework, we explore the distance-based method for unsupervised anomaly detection, motivated by the connection between distance functions and k-nearest neighbor (k-NN) density estimation highlighted by Biau et al. (2011). This method leverages the distances between data points and their nearest neighbors to assess anomalies. For any $x \in \mathbb{R}^d$ and a dataset $D_n$, we denote by $X_{(k)}(x; D_n)$ the $k$-th nearest neighbor of $x$ in $D_n$. We then define $R_{n,(k)}(x) := \|x - X_{(k)}(x; D_n)\|_2$ as the distance between $x$ and $X_{(k)}(x; D_n)$, referred to as the $k$-nearest neighbor distance, or $k$-distance, of $x$ in $D_n$. Distance-based methods are important and effective for anomaly detection. However, when handling large datasets, the number of nearest neighbors that need to be searched grows substantially, leading to significant computational overhead.
To mitigate this issue, we incorporate the bagging technique by averaging the weighted k-distances computed on multiple disjoint sub-datasets randomly drawn from the original dataset $D_n$ without replacement. Let $B$ be the number of bagging rounds pre-specified by the user, and let $\{D_s^b\}_{b=1}^B$ be $B$ disjoint subsets of $D_n$, each of size $s$. Since the sub-samples are disjoint and the data $D_n$ are assumed to be i.i.d., this procedure is mathematically equivalent to taking the first $s$ samples for bag 1, the following $s$ samples for bag 2, and so on. In each subset $D_s^b$, $b \in [B]$, let $R_{s,(k)}^b(x) := \|x - X_{(k)}(x; D_s^b)\|_2$ be the $k$-distance of $x$ in $D_s^b$ for any integer $k \le s$, and let the weighted $k$-distance be defined as $R_s^{w,b}(x) := \sum_{i=1}^s w_i^b R_{s,(i)}^b(x)$ with $w^b \in \mathcal{W}_s$. The bagged weighted $k$-distances are obtained by averaging these weighted $k$-distances across the $B$ sub-datasets:

$$R_n^B(x) := \frac{1}{B} \sum_{b=1}^B R_s^{w,b}(x).$$

Then we introduce the bagged weighted $k$-distances for density estimation (BWDDE) as

$$f_n^B(x) := \frac{1}{V_d \, \big(R_n^B(x)\big)^d} \left( \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^s w_i^b \gamma_{s,i} \right)^d, \qquad (2)$$

where $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ is the volume of the unit ball and

$$\gamma_{s,i} := \frac{\Gamma(i + 1/d)\,\Gamma(s + 1)}{\Gamma(i)\,\Gamma(s + 1 + 1/d)}. \qquad (3)$$

Here, $\gamma_{s,i}$ represents the expected value of the $d$-th root of the beta distribution $\mathrm{Beta}(i, s + 1 - i)$, which is associated with the probability of a ball having a radius of $R_{s,(i)}^b(x)$. The choice of $\gamma_{s,i}$ facilitates the derivation of the concentration inequality for BWDDE. A detailed discussion is provided in Section 4.1. Potential anomalies can be identified in regions of low density using BWDDE. More specifically, the dataset $D_n = \{X_1, \ldots, X_n\}$ can be sorted in ascending order of their BWDDE values, denoted as $\{X_{(1)}, \ldots, X_{(n)}\}$, such that $f_n^B(X_{(1)}) \le \cdots \le f_n^B(X_{(n)})$. If the number of anomalies is predetermined as $m$, then the $m$ data points with the smallest BWDDE values are identified as anomalies.
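As a quick numerical sanity check (our own illustration, not part of the paper), the constant $\gamma_{s,i}$ in (3) can be evaluated stably in log-space; for $d = 1$ it reduces to $i/(s+1)$, the mean of $\mathrm{Beta}(i, s+1-i)$, which provides a convenient test case. The function name `gamma_si` is ours:

```python
from math import exp, lgamma

def gamma_si(s: int, i: int, d: int) -> float:
    """Evaluate gamma_{s,i} = Gamma(i + 1/d) Gamma(s + 1) / (Gamma(i) Gamma(s + 1 + 1/d)).

    Working with log-gamma avoids overflow of Gamma(s + 1) for large s.
    """
    return exp(lgamma(i + 1.0 / d) + lgamma(s + 1) - lgamma(i) - lgamma(s + 1 + 1.0 / d))
```

As expected for the $d$-th root of a $\mathrm{Beta}(i, s+1-i)$ random variable, the values lie in $(0, 1)$ and increase with $i$, so nearer neighbors receive smaller calibration constants.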
2.3 Bagged Regularized k-Distances for Anomaly Detection

A key challenge in using the bagged weighted k-distance for density estimation (2) is selecting appropriate weights for the nearest neighbors. These weights play a crucial role in determining the accuracy of the density estimate and, consequently, the precision of anomaly detection. The simplest way is to take $B = 1$ and, for a $k$ fixed in advance, set $w_k = 1$ and $w_i = 0$ for $i \in [n] \setminus \{k\}$. In this case, BWDDE reverts to the standard k-NN density estimation (Moore and Yackel, 1977; Devroye and Wagner, 1977; Dasgupta and Kpotufe, 2014). Notably, the standard k-NN density estimation relies solely on the distance to the k-th nearest neighbor, disregarding information from other neighbors. To address this limitation, a more general approach was proposed by Biau et al. (2011), which investigated general weighted k-nearest neighbor density estimation by associating the weights with a given probability measure on $[0,1]$. More specifically, for a given probability measure $\nu$ on $[0,1]$ and a sequence of positive integers $\{k_n\}$, the weights are defined as $w_i = \int_{((i-1)/k_n,\, i/k_n]} \nu(dt)$ for $1 \le i \le k_n$, with $w_i = 0$ otherwise.

Challenges for the weight selection. Since density estimation is an unsupervised problem, we lack access to the true density function for direct hyperparameter selection. Existing literature has proposed two common approaches to address this challenge:

1. ANLL-based approach: A common approach is to optimize hyperparameters by minimizing the Average Negative Log Likelihood (ANLL) (Chow et al., 1983; López-Rubio, 2013; Silverman, 2018), which is equivalent to minimizing the KL divergence. This approach assumes that the density estimate integrates to one over $\mathbb{R}^d$. However, this assumption does not hold for our BWDDE. Given the dataset $D_n$, let $M := \max_{i \in [n]} \|X_i\|_2$. By the definition of the bagged weighted k-distances, for any $x \in \mathbb{R}^d$, we have

$$R_n^B(x) = \frac{1}{B} \sum_{b=1}^B R_s^{w,b}(x) \le \frac{1}{B} \sum_{b=1}^B R_{s,(s)}^b(x) \le \frac{1}{B} \sum_{b=1}^B \big(\|x\|_2 + M\big) = \|x\|_2 + M.$$
Therefore, the density estimate satisfies

$$f_n^B(x) \ge \frac{1}{V_d \, \big(\|x\|_2 + M\big)^d} \left( \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^s w_i^b \gamma_{s,i} \right)^d$$

for any $x \in \mathbb{R}^d$. Notably, the integral of $f_n^B(x)$ over $\mathbb{R}^d$ diverges, rendering the ANLL approach unsuitable.

2. $L_2$ risk approach: An alternative approach is leave-one-out cross-validation based on the $L_2$ risk of the density estimate (Tsybakov, 2009; Biau et al., 2011), defined as

$$\int_{\mathbb{R}^d} \big(f_n^B(x) - f(x)\big)^2 \, dx = \int_{\mathbb{R}^d} f_n^B(x)^2 \, dx - 2 \int_{\mathbb{R}^d} f_n^B(x) f(x) \, dx + \int_{\mathbb{R}^d} f(x)^2 \, dx.$$

The last term is independent of $f_n^B$ and can be ignored in the optimization. The second term can be estimated by $\frac{2}{n} \sum_{i=1}^n f_{n,(-i)}^B(X_i)$, where $f_{n,(-i)}^B(x)$ is the density estimator with the $i$-th observation removed. Thus, we define the cross-validation loss as

$$L_{\mathrm{CV}}(f_n^B) := \int_{\mathbb{R}^d} f_n^B(x)^2 \, dx - \frac{2}{n} \sum_{i=1}^n f_{n,(-i)}^B(X_i).$$

However, computing $\int_{\mathbb{R}^d} f_n^B(x)^2 \, dx$ requires Monte Carlo methods, which become infeasible in high dimensions due to the absence of a closed-form expression for this integral.

In summary, existing hyperparameter selection methods are not well-suited for BWDDE in high-dimensional cases due to the lack of closed-form integrals. To address this, we propose the Surrogate Risk Minimization (SRM) approach, which allows automatic weight selection through a more computationally feasible optimization, distinguishing our method from other nearest-neighbor-based density estimators.

Surrogate risk. In the context of density estimation, we consider the absolute loss function $L : \mathcal{X} \times \mathbb{R} \to [0, \infty)$, defined as $L(x, t) := |f(x) - t|$, to measure the discrepancy between an estimate and the true density function $f$. Let $D_s^B = \cup_{b=1}^B D_s^b$ denote the union of the $B$ sub-datasets, where each $D_s^b = \{X_1^b, \ldots, X_s^b\}$. The empirical risk of $f_n^B$ with respect to $D_s^B$ is given by

$$\mathcal{R}_{L, D_s^B}(f_n^B) := \frac{1}{Bs} \sum_{b=1}^B \sum_{i=1}^s \big| f_n^B(X_i^b) - f(X_i^b) \big|. \qquad (4)$$

As noted in Devroye and Lugosi (2001); Hang et al.
(2018), the absolute loss is a reasonable choice for density estimation due to its invariance under monotone transformations. Moreover, it is proportional to the total variation metric, providing a more interpretable measure of proximity to the true density. Since the underlying density function $f$ in (4) is unknown, standard optimization techniques for parameter selection cannot be directly applied to weight selection in density estimation. To address this, we seek a surrogate for the empirical risk in (4) and minimize it to determine the nearest neighbor weights. To proceed, we introduce the following regularity assumptions on the underlying probability distribution $P$.

Assumption 2 Assume that $P$ has a Lebesgue density $f$ with support $\mathcal{X} = [0,1]^d$.

(i) [Lipschitz Continuity] The density $f$ is Lipschitz continuous on $[0,1]^d$, i.e., there exists a constant $c_L > 0$ such that $|f(x) - f(y)| \le c_L \|x - y\|_2$ for all $x, y \in [0,1]^d$.

(ii) [Boundedness] There exist constants $\overline{c} \ge \underline{c} > 0$ such that $\underline{c} \le f(x) \le \overline{c}$ for all $x \in \mathcal{X}$.

The smoothness assumption is necessary for bounding the variation of the density function and is a common approach in density estimation (Dasgupta and Kpotufe, 2014; Jiang, 2017). This assumption helps prevent overfitting and provides a more stable density estimate. The assumption of lower boundedness of the density has been commonly employed in prior work, such as (Dasgupta and Kpotufe, 2014; Zhao and Lai, 2022), to establish finite-sample rates for k-NN density estimation. This assumption simplifies deriving finite-sample bounds for k-distances (see Lemma 12), thereby providing a clearer characterization of the surrogate risk in the following proposition. We emphasize that although this assumption aids theoretical analysis, our proposed algorithm remains applicable in broader settings. Further discussions can be found after Theorem 3.
Under these assumptions, along with additional conditions on the weights, the next proposition presents a surrogate for the empirical risk (4).

Proposition 1 (Surrogate Risk) Let Assumption 2 hold, and let $L$ denote the absolute value loss. Let $\{D_s^b\}_{b=1}^B$ be $B$ disjoint subsets randomly drawn from $D_n$, with $D_s^b = \{X_1^b, \ldots, X_s^b\}$, and define $\bar{R}_{s,(i)}^b := \frac{1}{s} \sum_{j=1}^s R_{s,(i)}^b(X_j^b)$ as the average $i$-distance for any integer $i \le s$ on the subset $D_s^b$. Furthermore, let $f$ be the true density function and $f_n^B$ be the BWDDE as in (2). Moreover, let $k^b := k(w^b) := \sup\{i \in [s] : w_i^b \neq 0\}$, $\underline{k} := \min_{b \in [B]} k^b$, and $\overline{k} := \max_{b \in [B]} k^b$. Finally, suppose that the following four conditions hold:

(i) there exists a sequence $c_n \asymp \log n$ such that $\sum_{i=1}^{c_n} w_i^b \lesssim (\log n)/k^b$ for all $b \in [B]$;

(ii) $\underline{k} \gtrsim (\log n)^2$, $\overline{k} \asymp \underline{k}$, $B \ge 2(d^2 + 4)(\log n)/3$, $B \lesssim (\underline{k}/\log n)^{1+2/d}$, $\log s \asymp \log n$, and $s \ge \max\{c_1, 2\overline{k}\}$, where $c_1$ is a constant defined in Lemma 12;

(iii) $\|w^b\|_2 \asymp (k^b)^{-1/2}$ and $\sum_{i=1}^s i^{1/d} w_i^b \asymp (k^b)^{1/d}$ for $b \in [B]$;

(iv) there exist constants $C_{n,i}$ such that $\max_{b \in [B]} w_i^b \le C_{n,i}$ for $i \in [s]$ and $\sum_{i=1}^s i^{1/d - 1/2} C_{n,i} \lesssim \underline{k}^{1/d - 1/2}$.

Then there exists $N_1 \in \mathbb{N}$, specified in the proof, such that for all $n \ge N_1$ and all $X_i^b$ satisfying $B(X_i^b, R_{s,(k^b)}^b(X_i^b)) \subset [0,1]^d$ for all $b \in [B]$, there holds

$$L\big(X_i^b, f_n^B(X_i^b)\big) \lesssim \sqrt{(\log s)/B}\, \|w^b\|_2 + R_s^{w,b}(X_i^b), \quad i \in [s], \ b \in [B], \qquad (5)$$

with probability $P^{Bs}$ at least $1 - 4/n^2$. Furthermore, we have

$$\mathcal{R}_{L, D_s^B}(f_n^B) \lesssim \mathcal{R}_{L, D_s^B}^{\mathrm{sur}}(f_n^B) := \frac{1}{B} \sum_{b=1}^B \left( \sqrt{(\log s)/B}\, \|w^b\|_2 + \frac{1}{s} \sum_{i=1}^s R_s^{w,b}(X_i^b) \right) = \frac{1}{B} \sum_{b=1}^B \left( \sqrt{(\log s)/B}\, \|w^b\|_2 + \sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b \right). \qquad (6)$$

The term on the right-hand side of (6) is referred to as the surrogate risk. A smaller surrogate risk clearly corresponds to higher accuracy in BWDDE. Conditions (i) and (ii) imply that $\sum_{i=1}^{c_n} w_i^b \to 0$ as $n \to \infty$ for all $b \in [B]$. This ensures that the weights are not overly concentrated in the first $c_n$ nearest neighbors, promoting a more balanced distribution of weights across the data points.
The first requirement in (ii) ensures that the number of nearest neighbors at each bagging round is at least of the order $(\log n)^2$, which aligns with the condition $k \to \infty$ in Moore and Yackel (1977); Dasgupta and Kpotufe (2014). The second requirement mandates that the number of non-zero nearest neighbors across different subsets remains of the same order. The third and fourth requirements impose lower and upper bounds on the number of bagging rounds $B$. Since the subsets are drawn without replacement, a very large $B$ can result in a small sample size in each subset, which would increase the estimation error. Conversely, if $B$ is too small, the density estimator may not benefit sufficiently from bagging. The last two conditions in (ii) impose bounds on $s$ for similar reasons. Condition (iii) on the relationship between $w^b$ and $k^b$ is satisfied for commonly used weight choices for nearest neighbors. Finally, condition (iv) requires that the moments of the weights be bounded by powers of $\underline{k}$. The condition $B(X_i^b, R_{s,(k^b)}^b(X_i^b)) \subset [0,1]^d$ for all $b \in [B]$ ensures that the $k^b$-distance ball is fully contained within the cube $[0,1]^d$ for $b \in [B]$. This is crucial because if the ball extends beyond the cube, the density estimator may not be consistent under Assumption 2. While the conditions in Proposition 1 may appear complex, they encompass commonly used weight choices. For example, under a uniform weight distribution for the nearest neighbors, we have $w_i^b = \mathbf{1}\{i \le k\}/k$ for $i \in [s]$ and $b \in [B]$, where $k \in [s]$ is fixed. Under this setting, condition (i) is directly satisfied, and condition (ii) holds with appropriately chosen parameters. Regarding condition (iii), we obtain $\|w^b\|_2 = (k^b)^{-1/2} = 1/\sqrt{k}$ and $\sum_{i=1}^s i^{1/d} w_i^b = \sum_{i=1}^k i^{1/d}/k \asymp k^{1/d}$ for all $b \in [B]$. If we set $C_{n,i} = \mathbf{1}\{i \le k\}/k$ for $i \in [s]$, it follows that $\sum_{i=1}^s i^{1/d - 1/2} C_{n,i} = \sum_{i=1}^k i^{1/d - 1/2}/k \asymp k^{1/d - 1/2}$, implying that all conditions hold for bagged uniformly weighted k-distance density estimation.
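The uniform-weight verification above is easy to reproduce numerically. The following sketch (our own illustration; variable names are ours) checks condition (iii) for $w_i^b = \mathbf{1}\{i \le k\}/k$ with $k = 64$ and $d = 3$: the $\ell_2$-norm equals $k^{-1/2}$ exactly, and $\sum_i i^{1/d} w_i$ is within a constant factor (roughly $d/(d+1)$) of $k^{1/d}$:

```python
import math

k, d, s = 64, 3, 1000

# uniform weights over the first k nearest neighbors
w = [1.0 / k if i <= k else 0.0 for i in range(1, s + 1)]

# condition (iii), first part: ||w||_2 = k^{-1/2} holds exactly
l2 = math.sqrt(sum(wi * wi for wi in w))

# condition (iii), second part: sum_i i^{1/d} w_i should be of order k^{1/d}
m1 = sum(i ** (1.0 / d) * w[i - 1] for i in range(1, s + 1))
ratio = m1 / k ** (1.0 / d)   # roughly d/(d+1) = 0.75 for d = 3
```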
With minor modifications, these arguments extend to the more general case $w_i = \int_{((i-1)/k,\, i/k]} \nu(dt)$ for $1 \le i \le k$, with $w_i = 0$ otherwise, where $\nu$ is the probability measure associated with $\mathrm{Beta}(\alpha, 1)$ for $\alpha \ge 1$. These conditions are primarily used to derive our surrogate risk, which leads to a new and effective algorithm without hyperparameter tuning. In fact, Proposition 5 in Section 4.2.1 ensures that our proposed method satisfies these conditions with high probability, making it unnecessary to verify them in practical applications.

Surrogate risk minimization (SRM). From the expression of the surrogate risk in (6), minimizing the surrogate risk is equivalent to solving the following optimization problems:

$$w^{b,*} := \operatorname*{arg\,min}_{w^b \in \mathcal{W}_s} \ \sqrt{(\log s)/B}\, \|w^b\|_2 + \sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b, \quad b \in [B]. \qquad (7)$$

A closer examination of the optimization problems in (7) reveals that each consists of two components. The first term is proportional to the $\ell_2$-norm of the weights $w^b$, while the second term represents a linear combination of these weights. Without the first term, the optimization objective in (7) reduces to the second term, $\sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b$, which attains its minimum when $w^b = (1, 0, \ldots, 0)$. In this case, the weighted $k$-distance simplifies to $R_s^{w,b}(x) = R_{s,(1)}^b(x)$, the distance from $x$ to its nearest neighbor. This often leads to overfitting in density estimation, as it fails to incorporate information from other nearest neighbors. By introducing the $\|w^b\|_2$ term into the minimization problem (7), we mitigate the overfitting issue. The $\ell_2$-norm $\|w^b\|_2$ attains its maximum value of 1 when all weight is assigned to a single nearest neighbor, i.e., when $w_i^b = 1$ for some $i \in [s]$ and $w_j^b = 0$ for all $j \neq i$. Conversely, it reaches its minimum value of $s^{-1/2}$ when the weights are uniformly distributed as $w^b = (1/s, \ldots, 1/s)$.
Consequently, incorporating this term encourages the weights to be distributed across multiple nearest neighbors, thereby preventing overfitting. As a result, $\|w^b\|_2$ serves as a regularization term in the minimization problem (7).

Solution to SRM. Notice that (7) is a convex optimization problem that can be solved efficiently from the data. For a fixed $b \in [B]$, considering the constrained Lagrangian, we have

$$\mathcal{L}(w^b, \mu^b, \nu^b) := \sqrt{(\log s)/B}\, \|w^b\|_2 + \sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b + \mu^b \Big( 1 - \sum_{i=1}^s w_i^b \Big) - \sum_{i=1}^s \nu_i^b w_i^b,$$

where $\mu^b \in \mathbb{R}$ and $\nu_1^b, \ldots, \nu_s^b \ge 0$ are the Lagrange multipliers. Since (7) is a convex optimization problem, a solution satisfying the KKT conditions is a global minimum. Setting the partial derivative of $\mathcal{L}(w^b, \mu^b, \nu^b)$ with respect to $w_i^b$ to zero gives

$$\sqrt{(\log s)/B}\; w_i^b / \|w^b\|_2 = \mu^b + \nu_i^b - \bar{R}_{s,(i)}^b. \qquad (8)$$

Since $w^{b,*}$ is the optimal solution of (7), the KKT conditions imply that if $w_i^{b,*} > 0$, then $\nu_i^b = 0$. Conversely, if $w_i^{b,*} = 0$, then $\nu_i^b \ge 0$, which implies $\bar{R}_{s,(i)}^b \ge \mu^b$. Therefore, $w_i^{b,*}$ is proportional to $\mu^b - \bar{R}_{s,(i)}^b$ for all nonzero entries. Together with the equality constraint $\sum_{i=1}^s w_i^{b,*} = 1$, this yields that $w_i^{b,*}$ has the form

$$w_i^{b,*} = \frac{\big(\mu^b - \bar{R}_{s,(i)}^b\big)\,\mathbf{1}\{\bar{R}_{s,(i)}^b < \mu^b\}}{\sum_{j=1}^s \big(\mu^b - \bar{R}_{s,(j)}^b\big)\,\mathbf{1}\{\bar{R}_{s,(j)}^b < \mu^b\}}$$

for $i \in [s]$. Since $\bar{R}_{s,(i)}^b$ becomes larger as $i$ increases, the formulation above shows that $w_i^{b,*}$ becomes smaller as $i$ increases. Moreover, the optimal weights have a cut-off effect: only nearest neighbors close to $x$, i.e., those with $\bar{R}_{s,(i)}^b < \mu^b$, are considered in the solution, while the weights for the remaining nearest neighbors are all set to zero. This is consistent with our usual judgment: the closer the neighbor, the greater the impact on the density estimation. There are many efficient methods to solve the convex optimization problem (7). Here, we follow the method developed in Anava and Levy (2016); Dong et al. (2020); Sheng and Yu (2023). The key idea is to add nearest neighbors in a greedy manner based on their distance from $x$ until a stopping criterion is met. We present it in Algorithm 1.
Algorithm 1: Surrogate Risk Minimization (SRM)
Input: Average $i$-distances $\bar{R}_{s,(i)}^b$, $1 \le i \le s$.
Compute $r_i = \sqrt{B / \log s}\; \bar{R}_{s,(i)}^b$, $1 \le i \le s$.
Set $\mu_0 = r_1 + 1$ and $k = 0$.
while $\mu_k > r_{k+1}$ and $k \le s - 1$ do
  $k = k + 1$;
  $\mu_k = \Big( \sum_{j=1}^k r_j + \sqrt{k + \big( \sum_{j=1}^k r_j \big)^2 - k \sum_{j=1}^k r_j^2} \Big) \Big/ k$.
end
Compute $A = \sum_{i=1}^s (\mu_k - r_i)\,\mathbf{1}(r_i < \mu_k)$.
Compute $w_i^{b,*} = (\mu_k - r_i)\,\mathbf{1}(r_i < \mu_k)/A$, $1 \le i \le s$.
Output: Weights $w^{b,*}$.

Density estimation. The discussions above indicate that the minimization problem (7) provides a practical method for determining the weights of nearest neighbors in density estimation. However, these weights depend on the data, introducing statistical dependence between the weights $w_i^{b,*}$ and the $i$-distance $R_{s,(i)}^b(x)$ for $i \in [s]$ in each bag $b \in [B]$. Such dependence complicates the theoretical analysis of weighted k-distances and density estimation within the framework of statistical learning theory. To address this issue, we employ distinct subsets of the data for computing the k-distances, ensuring their independence from the optimized weights. For simplicity, we assume that $n = 2Bs$, which facilitates a more straightforward theoretical derivation without loss of generality. This assumption ensures that each subset is of equal size; if $n$ is not exactly divisible by $2B$, minor modifications to the partitioning scheme can be applied without affecting the theoretical conclusions. We randomly partition $D_n$ into $B$ disjoint pairs of subsets $\{(D_s^b, \tilde{D}_s^b)\}_{b=1}^B$, where each subset has size $s$ and $D_s^b \cap \tilde{D}_s^b = \emptyset$ for each $b$. Let $w^{b,*}$ represent the weights obtained by solving the optimization problem (7) using $\{D_s^b\}_{b=1}^B$, and let $\tilde{R}_{s,(i)}^b(x)$ denote the $i$-distance of $x$ computed using $\{\tilde{D}_s^b\}_{b=1}^B$ for $i \in [s]$. Then, we define the regularized k-distances as follows:

$$R_s^{b,*}(x) := R_s^{w^{b,*}, b}(x) = \sum_{i=1}^s w_i^{b,*} \tilde{R}_{s,(i)}^b(x). \qquad (9)$$

We refer to the average of these regularized k-distances as the bagged regularized k-distance

$$R_n^{B,*}(x) := \frac{1}{B} \sum_{b=1}^B R_s^{b,*}(x). \qquad (10)$$
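Algorithm 1 admits a compact implementation. The sketch below (the function name `srm_weights` is ours) follows the greedy scheme of Anava and Levy (2016) as presented in Algorithm 1: neighbors are added while $\mu_k$ exceeds the next rescaled average distance, after which the weights are the normalized positive parts $(\mu_k - r_i)_+$:

```python
import math

def srm_weights(rbar, B):
    """Solve the SRM problem (7): minimize sqrt(log(s)/B) * ||w||_2 + sum_i w_i * rbar_i
    over the probability simplex, where rbar holds the sorted average i-distances."""
    s = len(rbar)
    scale = math.sqrt(B / math.log(s))
    r = [scale * x for x in rbar]              # rescaled average i-distances
    k, sum_r, sum_r2 = 0, 0.0, 0.0
    mu = r[0] + 1.0                            # mu_0 = r_1 + 1 admits the first neighbor
    while k < s and mu > r[k]:                 # greedy stopping rule: mu_k > r_{k+1}
        k += 1
        sum_r += r[k - 1]
        sum_r2 += r[k - 1] ** 2
        disc = k + sum_r ** 2 - k * sum_r2
        mu = (sum_r + math.sqrt(max(disc, 0.0))) / k
    w = [max(mu - ri, 0.0) for ri in r]        # cut-off effect: zero weight beyond mu
    total = sum(w)
    return [wi / total for wi in w]
```

The returned weights are nonincreasing, sum to one, and vanish for neighbors whose rescaled average distance exceeds the threshold $\mu_k$, matching the closed-form solution derived from the KKT conditions.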
By incorporating $w^{b,*}$ and $R^{B,*}_n(x)$ into the BWDDE formula (2), we obtain a new nearest-neighbor-based density estimator, called bagged regularized $k$-distances for density estimation (BRDDE), expressed as
$$f^{B,*}_n(x) := \frac{1}{V_d\, R^{B,*}_n(x)^d} \Big(\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^{b,*}_i \gamma_{s,i}\Big)^d. \quad (11)$$

Algorithm 2: Bagged Regularized k-Distances for Anomaly Detection (BRDAD)
Input: Data $D_n = \{X_1, \ldots, X_n\}$; number of anomalies $m$; bagging rounds $B$; subsampling size $s$.
Randomly partition $D_n$ into $B$ disjoint pairs of subsets $\{(D^b_s, \widetilde{D}^b_s)\}_{b=1}^B$ such that $D^b_s \cap \widetilde{D}^b_s = \emptyset$ for each $b$.
for $b \in [B]$ do
  Compute the weights $w^{b,*}$ using $D^b_s$ by (7);
  Compute the regularized $k$-distances $R^{b,*}_s(X_i)$ for $1 \leq i \leq n$ using (9), based on $w^{b,*}$ and $\widetilde{D}^b_s$.
end
Compute the bagged regularized $k$-distances $R^{B,*}_n(X_i)$ by (10) for $1 \leq i \leq n$.
Sort the data $D_n = \{X_1, \ldots, X_n\}$ as $\{X_{(1)}, \ldots, X_{(n)}\}$ in descending order of their bagged regularized $k$-distances, i.e., $R^{B,*}_n(X_{(1)}) \geq \cdots \geq R^{B,*}_n(X_{(n)})$.
Output: Anomalies $\{X_{(i)}\}_{i=1}^m$.

This minimization approach distinguishes our BRDDE from existing nearest-neighbor-based density estimators. Specifically, those estimators suffer from sensitivity to the choice of the hyperparameter $k$, whose selection is inherently difficult due to the lack of supervised information. By contrast, when the number of bagging rounds $B$ is fixed, SRM enables the calculation of the weights of nearest neighbors in each subset $D^b_s$ by solving the convex optimization problem based on the average $i$-distances $\bar{R}^b_{s,(i)}$ as in equation (7). As a result, we successfully address the hyperparameter selection challenge without changing the unsupervised nature of the problem. Anomaly detection. By applying BRDDE to all samples, anomalies can be identified as the instances with the lowest BRDDE values. However, explicit density estimation is not necessary for anomaly detection.
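To make the pipeline of Algorithm 2 concrete, here is a minimal end-to-end sketch in Python with NumPy. It is our own illustration, not the authors' implementation: the function names are hypothetical, distances are computed by brute force rather than with a k-d tree, and the tied-sample edge cases of the paper's exact partitioning are ignored.

```python
import numpy as np

def srm_weights(r_bar, B):
    """Greedy SRM solver (Algorithm 1) for one bag: minimizes
    sqrt(log(s)/B) * ||w||_2 + sum_i w_i * r_bar[i] over the simplex."""
    s = len(r_bar)
    r = np.sqrt(B / np.log(s)) * np.asarray(r_bar, dtype=float)
    k, mu = 0, r[0] + 1.0
    while k < s and mu > r[k]:
        k += 1
        s1, s2 = r[:k].sum(), (r[:k] ** 2).sum()
        mu = (s1 + np.sqrt(max(k + s1 * s1 - k * s2, 0.0))) / k
    w = np.maximum(mu - r, 0.0)
    return w / w.sum()

def brdad_scores(X, B):
    """Brute-force sketch of Algorithm 2: returns the bagged regularized
    k-distance of every sample (a larger score means more anomalous)."""
    n = len(X)
    parts = np.array_split(np.random.permutation(n), 2 * B)  # 2B disjoint subsets
    scores = np.zeros(n)
    for b in range(B):
        Ds, Dts = X[parts[2 * b]], X[parts[2 * b + 1]]
        # average i-distances within D_s (drop the zero self-distance column)
        dists = np.sort(np.linalg.norm(Ds[:, None] - Ds[None], axis=-1), axis=1)[:, 1:]
        w = srm_weights(dists.mean(axis=0), B)
        # weighted i-distances of all samples to the held-out half D~_s
        q = np.sort(np.linalg.norm(X[:, None] - Dts[None], axis=-1), axis=1)
        scores += q[:, : len(w)] @ w / B
    return scores
```

Sorting `brdad_scores(X, B)` in descending order and taking the top $m$ indices yields the flagged anomalies, as in the output step of Algorithm 2.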
Since density estimates serve as anomaly scores, any monotone transformation preserves their ranking (possibly in reverse) and maintains the same family of level sets, leading to identical AUC values. Thus, bagged regularized $k$-distances provide a sufficient and practical alternative for anomaly detection. They can be accurately computed in high-dimensional spaces, and their associated weights can be efficiently optimized from (7), making them an effective approach to density-based anomaly detection. We now introduce our anomaly detection algorithm, bagged regularized $k$-distances for anomaly detection (BRDAD). The dataset $D_n$ is sorted into the sequence $\{X_{(1)}, \ldots, X_{(n)}\}$ in descending order of the bagged regularized $k$-distances, i.e., $R^{B,*}_n(X_{(1)}) \geq \cdots \geq R^{B,*}_n(X_{(n)})$. Given the pre-specified number of anomalies $m$, the first $m$ instances $\{X_{(i)}\}_{i=1}^m$ are flagged as the $m$ anomalies. The complete procedure of our BRDAD algorithm is presented in Algorithm 2. As illustrated above, SRM mitigates the challenge of hyperparameter selection in density estimation. Consequently, BRDAD inherits the advantages of BRDDE by using the same weights to address the sensitivity of hyperparameter selection in nearest-neighbor-based methods for unsupervised anomaly detection. Algorithm 2 is based on the assumption that $P_1$ follows a uniform distribution. However, when prior knowledge about the density of the anomalous distribution is available, anomalies can still be detected using the bagged regularized $k$-distances $R^{B,*}_n(x)$. Consider the Huber model in (1), where $P$, $P_0$, and $P_1$ have densities $f$, $f_0$, and $f_1$, respectively. Defining $h := f_0/f_1$ and $\rho := \Pi/(1-\Pi)$, Steinwart et al. (2005, Corollary 3) establish that instances in the set $\{X_i : h(X_i) \leq \rho\}$ can be recognized as anomalies with theoretical guarantees. This set can be equivalently written as $\{X_i : f(X_i)/f_1(X_i) \leq 2\Pi\}$, based on the relationship between the densities.
Since $R^{B,*}_n(x)^d$ is inversely proportional to the density estimate $f^{B,*}_n(x)$, anomalies can be identified by sorting the data according to
$$f_1(X_{(1)})^{1/d} R^{B,*}_n(X_{(1)}) \geq \cdots \geq f_1(X_{(n)})^{1/d} R^{B,*}_n(X_{(n)}),$$
where the top $m$ instances $\{X_{(i)}\}_{i=1}^m$ are considered anomalies. Furthermore, the theoretical results in Section 3 can be extended to this setting with minor modifications, ensuring the generality of our approach.

3. Theoretical Results

In this section, we present theoretical results for our BRDAD algorithm. We first investigate the Huber contamination model in Section 3.1, under which we can analyze the performance of the bagged regularized $k$-distances from a learning theory perspective. We then present the convergence rates of BRDDE and BRDAD in Sections 3.2 and 3.3, respectively. Finally, we provide comments and discussion on our algorithms and theoretical results in Section 3.4, where we also compare our theoretical findings on the convergence of both BRDDE and BRDAD with other nearest-neighbor-based methods.

3.1 Huber Contamination Model

In the Huber contamination model (HCM) in Assumption 1, for every instance $X$ from $P$, we can use a latent variable $Y \in \{0, 1\}$ to indicate which distribution it comes from. More specifically, $Y = 0$ and $Y = 1$ indicate that the instance is drawn from the normal and the anomalous distribution, respectively. As a result, the anomaly detection problem can be converted into a bipartite ranking problem in which instances are implicitly labeled positive or negative according to whether they are normal. Let $\widetilde{P}$ represent the joint probability distribution of $(X, Y)$. In this case, our learning goal is to learn a score function that minimizes the probability of mis-ranking a pair of normal and anomalous instances, i.e., that maximizes the area under the ROC curve (AUC). Therefore, we can study regret bounds for the AUC of the bagged regularized $k$-distances to evaluate their performance from the learning theory perspective.
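The AUC formalized next has a simple empirical counterpart: average over all anomaly/normal pairs, counting ties as $1/2$. A minimal self-contained sketch (the function name `pairwise_auc` is our own):

```python
def pairwise_auc(scores, labels):
    """Empirical AUC: fraction of (anomaly, normal) pairs ranked correctly
    by the score function, with ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # anomalies
    neg = [s for s, y in zip(scores, labels) if y == 0]   # normal instances
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because the value depends only on pairwise order comparisons, any strictly increasing transformation of the scores leaves it unchanged, which is exactly the level-set invariance noted above.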
Let $r : \mathcal{X} \to \mathbb{R}$ be a score function. Then the AUC of $r$ can be written as
$$\mathrm{AUC}(r) = \mathbb{E}\big[\mathbf{1}\{(Y - Y')(r(X) - r(X')) > 0\} + \tfrac{1}{2}\mathbf{1}\{r(X) = r(X')\} \,\big|\, Y \neq Y'\big],$$
where $(X, Y)$ and $(X', Y')$ are assumed to be drawn i.i.d. from $\widetilde{P}$. In other words, the AUC of $r$ is the probability that a randomly drawn anomaly is ranked higher than a randomly drawn normal instance by the score function $r$. Given the HCM in (1) and the assumption that $P_1$ is uniformly distributed over $[0,1]^d$ in Assumption 1, the posterior probability function with respect to $\widetilde{P}$ is given by
$$\eta(x) := \widetilde{P}(Y = 1 \mid X = x) = \frac{\Pi f_1(x)}{(1-\Pi) f_0(x) + \Pi f_1(x)} = \Pi f(x)^{-1}. \quad (12)$$
Then the optimal AUC is defined as
$$\mathrm{AUC}^* := \sup_{r : \mathcal{X} \to \mathbb{R}} \mathrm{AUC}(r) = 1 - \frac{1}{2\Pi(1-\Pi)}\, \mathbb{E}_{X, X'} \min\big\{\eta(X)(1 - \eta(X')),\, \eta(X')(1 - \eta(X))\big\}.$$
Finally, the AUC regret of a score function $r$ is defined as $\mathrm{Reg}^{\mathrm{AUC}}(r) := \mathrm{AUC}^* - \mathrm{AUC}(r)$. As discussed in Section 2.2, BRDAD is a density-based anomaly detection method. To establish its convergence rates under the Huber contamination model, we first derive the theoretical convergence rates of BRDDE in (11), which are presented in the next subsection.

3.2 Convergence Rates of BRDDE

The convergence rates of BRDDE are presented in the following theorem.

Theorem 2 Let Assumption 2 hold. Suppose that the dataset $D_n$ is randomly partitioned into $B$ disjoint pairs of subsets $\{(D^b_s, \widetilde{D}^b_s)\}_{b=1}^B$ such that $D^b_s \cap \widetilde{D}^b_s = \emptyset$ for each $b$. Let $f$ be the true density function, and let $f^{B,*}_n$ be the BRDDE defined in (11). If we choose
$$s \asymp (n/\log n)^{(d+1)/(d+2)} \quad \text{and} \quad B \asymp n^{1/(d+2)} (\log n)^{(d+1)/(d+2)}, \quad (13)$$
then there exists $N^*_2 \in \mathbb{N}$, which will be specified in the proof, such that for all $n > N^*_2$, with probability $P^n$ at least $1 - 4/n^2$, we have
$$\int_{\mathcal{X}} |f^{B,*}_n(x) - f(x)| \, dx \lesssim n^{-1/(2+d)} (\log n)^{(d+3)/(d+2)}.$$
The convergence rate of the $L_1$-error of BRDDE in the above theorem matches the minimax lower bound established in Zhao and Lai (2021) when the density function is Lipschitz continuous. Therefore, BRDDE attains the optimal convergence rate for density estimation.
As a result, the SRM procedure in Section 2.3 turns out to be a promising approach for determining the weights of nearest neighbors for BWDDE. Moreover, notice that the number of iterations required in the optimization problem (7) at each bagging round depends on the subsample size $s$. In Theorem 2, the choice of $s$ is significantly smaller than $n$, indicating that fewer iterations are required at each bagging round. This explains the computational efficiency of incorporating the bagging technique when parallel computation is employed. However, due to the dependence on $d$, this improvement becomes less significant in higher dimensions. Further discussion of the complexity is presented in Section 4.3.

3.3 Convergence Rates of BRDAD

The next theorem provides the convergence rates for BRDAD.

Theorem 3 Let Assumptions 1 and 2 hold. Suppose the conditions in Theorem 2 hold, including the dataset partitioning and the choice of the parameters $s$ and $B$. Let $R^{B,*}_n$ be the bagged regularized $k$-distances returned by Algorithm 2. Then there exists $N^*_2 \in \mathbb{N}$, as specified in Theorem 2, such that for all $n > N^*_2$, with probability $P^n$ at least $1 - 4/n^2$, we have
$$\mathrm{Reg}^{\mathrm{AUC}}(R^{B,*}_n) \lesssim n^{-1/(2+d)} (\log n)^{(d+3)/(d+2)}.$$
Theorem 3 establishes that, up to a logarithmic factor, the AUC regret of BRDAD converges at a rate of $O(n^{-1/(d+2)})$, provided that the number of bagging rounds $B$ and the subsample size $s$ are chosen as in (13). Notably, the parameter choices and convergence rates in Theorem 3 align with those in Theorem 2 for BRDDE. This follows from the fact that BRDAD is a density-based anomaly detection method built upon BRDDE. Although surrogate risk minimization is formulated under Assumption 2, we note that these assumptions can be relaxed, and similar convergence rates for our BRDDE and BRDAD remain valid under more general conditions.
In particular, Lipschitz continuity naturally extends to manifolds: if the data is supported on a lower-dimensional manifold $M$ with density $f$ absolutely continuous with respect to the manifold's volume measure, then $f$ is Lipschitz continuous on $M$ if there exists a constant $c_L > 0$ such that $|f(x) - f(y)| \leq c_L\, d_M(x, y)$ for all $x, y \in M$, where $d_M(x, y)$ denotes the geodesic distance. This assumption aligns with the prior literature (Berenfeld and Hoffmann, 2021). Additionally, the assumption of lower boundedness can be relaxed under suitable conditions on the tail behavior of the density, as discussed in Zhao and Lai (2022). Since geodesic distances on manifolds are well approximated by Euclidean distances within a small neighborhood (Niyogi et al., 2008; Berenfeld and Hoffmann, 2021), our theoretical results for SRM and the convergence rates of BRDDE and BRDAD can be extended to cases where both assumptions are relaxed. However, our current focus remains on hyperparameter sensitivity in density estimation; therefore, a detailed exploration of these extensions is beyond the scope of this paper.

3.4 Comments and Discussions

By reformulating the analysis of bagged regularized $k$-distances in terms of BRDDE within a statistical learning framework (van der Vaart and Wellner, 1996), we establish convergence rates for the AUC regret of bagged regularized $k$-distances under the Huber contamination model, assuming mild regularity conditions on the density function (Theorem 3). Notably, our findings reveal that the convergence rate of the AUC regret of BRDAD matches that of density estimation, indicating the effectiveness of BRDAD. In contrast, previous theoretical studies on distance-based methods for unsupervised anomaly detection did not establish a connection between distance-based algorithms and density estimation, leaving the convergence rates unaddressed.
For instance, Sugiyama and Borgwardt (2013) introduced a sampling-based outlier detection method and analyzed its effectiveness compared to traditional $k$-nearest-neighbor approaches, but without a rigorous theoretical foundation. More recently, Gu et al. (2019) conducted a statistical analysis of distance-to-measure (DTM) for anomaly detection under the Huber contamination model, assuming specific regularity conditions on the distribution, and showed that anomalies can be identified with high probability. However, since these studies did not derive convergence rates for the AUC regret, their results are not directly comparable to ours.

4. Error and Complexity Analysis

In this section, we present the error analysis of the AUC regret and the complexity analysis of our algorithm. In detail, in Section 4.1 we provide the error decomposition of the surrogate risk, which leads to the derivation of the surrogate risk in Proposition 1 in Section 2.3. Furthermore, in Section 4.2 we illustrate the three building blocks in learning the AUC regret, which indicate the way to establish the convergence rates of both BRDDE and BRDAD in Theorems 2 and 3. Finally, we analyze the time complexity of BRDAD and illustrate its computational efficiency compared to other distance-based methods for anomaly detection in Section 4.3.

4.1 Error Analysis for the Surrogate Risk

In this section, we first provide the error decomposition for the BWDDE $f^B_n(x)$ in (2), and then present upper bounds for the resulting error terms. Let the term (I) be defined as
$$(\mathrm{I}) := \frac{1}{V_d R^B_n(x)^d} \sum_{j=0}^{d-1} \Big(\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \gamma_{s,i}\Big)^j \big(V_d^{1/d} f(x)^{1/d} R^B_n(x)\big)^{d-1-j}. \quad (14)$$
Using the triangle inequality and the identity
$$x^d - y^d = (x - y) \sum_{j=0}^{d-1} x^j y^{d-1-j}, \quad (15)$$
we obtain
$$f^B_n(x) - f(x) = \frac{1}{V_d R^B_n(x)^d} \bigg(\Big(\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \gamma_{s,i}\Big)^d - V_d f(x) R^B_n(x)^d\bigg) = (\mathrm{I}) \cdot \frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \big(\gamma_{s,i} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x)\big). \quad (16)$$
If the terms (II) and (III) are defined respectively as
$$(\mathrm{II}) := \Big|\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \big(\gamma_{s,i} - P(B(x, R^b_{s,(i)}(x)))^{1/d}\big)\Big|, \quad (17)$$
$$(\mathrm{III}) := \Big|\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \big(P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x)\big)\Big|, \quad (18)$$
then applying the triangle inequality to (16) yields the error decomposition
$$|f^B_n(x) - f(x)| \leq (\mathrm{I}) \cdot (\mathrm{II}) + (\mathrm{I}) \cdot (\mathrm{III}). \quad (19)$$
We emphasize that $\gamma_{s,i} = \mathbb{E}[P(B(x, R^b_{s,(i)}(x)))^{1/d}]$ for fixed $x \in \mathcal{X}$, $b \in [B]$, and $i \in [s]$. Therefore, standard concentration inequalities can be applied to establish the convergence rates for term (II). The derivation of this equality is as follows. Since $P$ has a density over $[0,1]^d$ with respect to the Lebesgue measure, the random variable $\|X - x\|_2$ is continuous for fixed $x \in \mathcal{X}$. By the probability integral transform, $P(B(x, \|X - x\|_2))$ is uniformly distributed over $[0,1]$. For any $b \in [B]$, note that $X^b_1, \ldots, X^b_s$ are i.i.d. with distribution $P$. Let $U^b_1, \ldots, U^b_s$ be i.i.d. uniform $[0,1]$ random variables. Then
$$\big(P(B(x, \|X^b_1 - x\|_2)), \ldots, P(B(x, \|X^b_s - x\|_2))\big) \overset{D}{=} \big(U^b_1, \ldots, U^b_s\big).$$
Reordering the samples such that $\|X^b_{(1)}(x) - x\|_2 \leq \cdots \leq \|X^b_{(s)}(x) - x\|_2$, we obtain
$$\big(P(B(x, R^b_{s,(1)}(x))), \ldots, P(B(x, R^b_{s,(s)}(x)))\big) \overset{D}{=} \big(U^b_{(1)}, \ldots, U^b_{(s)}\big), \quad (20)$$
where $U^b_{(i)}$ is the $i$-th order statistic of $U^b_1, \ldots, U^b_s$. This reduces the study of $P(B(x, R^b_{s,(i)}(x)))$ to the study of $U^b_{(i)}$. By Corollary 1.2 in Biau and Devroye (2015), $U^b_{(i)} \sim \mathrm{Beta}(i, s + 1 - i)$. Hence, $\mathbb{E}[P(B(x, R^b_{s,(i)}(x)))^{1/d}] = \mathbb{E}[(U^b_{(i)})^{1/d}] = \gamma_{s,i}$. In the literature on weighted nearest-neighbor density estimation (Biau et al., 2011; Biau and Devroye, 2015), the expression $(i/s)^{1/d}$ is often used in place of $\gamma_{s,i}$, introducing an additional error term $|\gamma_{s,i} - (i/s)^{1/d}|$. While this error is negligible in the absence of bagging, it slows the convergence rate of term (II) when bagging is employed. Therefore, using $\gamma_{s,i}$ enables a clearer demonstration of the benefits of bagging in density estimation.
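Since $U^b_{(i)} \sim \mathrm{Beta}(i, s+1-i)$, the constant $\gamma_{s,i} = \mathbb{E}[(U^b_{(i)})^{1/d}]$ has the closed form of a Beta moment, $\Gamma(i + 1/d)\,\Gamma(s+1) / (\Gamma(i)\,\Gamma(s + 1 + 1/d))$. A small sketch evaluating it via log-gamma for numerical stability (the function name is our own):

```python
import math

def gamma_si(s, i, d):
    """gamma_{s,i} = E[(U_(i))^{1/d}] for U_(i) ~ Beta(i, s + 1 - i),
    computed as the Beta moment
    Gamma(i + 1/d) * Gamma(s + 1) / (Gamma(i) * Gamma(s + 1 + 1/d))
    in log space to avoid overflow for large s."""
    a = 1.0 / d
    return math.exp(math.lgamma(i + a) + math.lgamma(s + 1)
                    - math.lgamma(i) - math.lgamma(s + 1 + a))
```

For $d = 1$ this reduces to $\mathbb{E}[U_{(i)}] = i/(s+1)$, and for moderate $s$ it stays close to, but is not equal to, the surrogate $(i/s)^{1/d}$ discussed above.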
The following proposition provides upper bounds for the error terms (I), (II), and (III).

Proposition 4 Let Assumption 2 hold. Furthermore, let (I), (II), and (III) be defined as in (14), (17), and (18), respectively. Let $k^b$, $\underline{k}$, and $\overline{k}$ be defined as in Proposition 1, and suppose that conditions (i)-(iv) in Proposition 1 hold. Then there exists $N_1 \in \mathbb{N}$, which will be specified in the proof, such that for all $n > N_1$ and all $x$ satisfying $B(x, R^b_{s,(k^b)}(x)) \subset [0,1]^d$ for all $b \in [B]$, the following bounds hold with probability $P^{Bs}$ at least $1 - 2/n^2$:
$$(\mathrm{II}) \lesssim (\overline{k}/s)^{1/d} \big((\log n)/(\underline{k}\, B)\big)^{1/2}; \qquad (\mathrm{III}) \lesssim (\log n)^{1+1/d}/(s^{1/d}\, \underline{k}) + (\overline{k}/s)^{2/d}.$$

4.2 Learning the AUC Regret: Three Building Blocks

Recall that the central concern of statistical learning theory is the convergence rates of learning algorithms under various settings. In Section 3.1, we show that when the probability distribution $P$ follows the Huber contamination model (HCM) in Assumption 1, we can use a latent variable $Y$ to indicate whether an instance comes from the anomalous distribution. Moreover, the posterior probability in (12) implies that under the HCM, anomalies can be identified by the Bayes classifier with respect to the classification loss, resulting in the set of anomalies
$$S := \{x \in \mathbb{R}^d : \eta(x) > 1/2\} = \{x \in \mathbb{R}^d : \Pi f(x)^{-1} > 1/2\} = \{x \in \mathbb{R}^d : f(x) < 2\Pi\}.$$
This set can be estimated by the lower level set of the BRDDE at the threshold $2\Pi$, i.e., $\widehat{S} := \{x \in \mathbb{R}^d : f^{B,*}_n(x) < 2\Pi\}$ with $f^{B,*}_n(x)$ as defined in (11). If we choose
$$\theta = \frac{1}{(2 V_d \Pi)^{1/d}} \cdot \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^s w^{b,*}_i \gamma_{s,i},$$
then we have $\{x \in \mathbb{R}^d : R^{B,*}_n(x) > \theta\} = \{x \in \mathbb{R}^d : f^{B,*}_n(x) < 2\Pi\} = \widehat{S}$. This implies that the upper level set of the bagged regularized $k$-distances, i.e., $\{x \in \mathbb{R}^d : R^{B,*}_n(x) > \theta\}$, equals the estimate $\widehat{S}$ under the properly chosen threshold. As a result, the unsupervised anomaly detection problem is converted into an implicit binary classification problem.
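The level-set identity above is elementary algebra: writing $W := \frac{1}{B}\sum_b\sum_i w^{b,*}_i\gamma_{s,i}$, the BRDDE is $W^d/(V_d R^d)$, so $f^{B,*}_n(x) < 2\Pi$ if and only if $R^{B,*}_n(x) > \theta = W/(2V_d\Pi)^{1/d}$. A throwaway numerical check, with all constants arbitrary and function names our own:

```python
def brdde(R, W, V_d, d):
    # BRDDE as a function of the bagged regularized k-distance R, cf. (11)
    return W ** d / (V_d * R ** d)

def threshold(W, V_d, Pi, d):
    # theta = W / (2 * V_d * Pi)^(1/d)
    return W / (2 * V_d * Pi) ** (1.0 / d)

# arbitrary illustrative constants
W, V_d, Pi, d = 0.37, 3.14159, 0.05, 3
theta = threshold(W, V_d, Pi, d)
for R in [0.1, 0.5, theta * 0.999, theta * 1.001, 2.0, 10.0]:
    # the upper level set {R > theta} coincides with the lower level set {f < 2*Pi}
    assert (R > theta) == (brdde(R, W, V_d, d) < 2 * Pi)
```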
Therefore, we are able to analyze the performance of $R^{B,*}_n(x)$ in anomaly detection by applying analytical tools for classification. Since the posterior probability estimate is inversely proportional to the BRDDE, as shown in (12) in Section 3.1, the problem of analyzing the posterior probability estimate can be further converted to analyzing the BRDDE. Therefore, it is natural and necessary to investigate the following three problems:
(i) The finite sample bounds of the optimized weights $w^{b,*}$ obtained by solving the SRM problems.
(ii) The convergence of the BRDDE as stated in Theorem 2, that is, whether $f^{B,*}_n$ converges to $f$ in the $L_1$-norm.
(iii) The convergence of the AUC regret of $R^{B,*}_n$, i.e., whether the convergence of the BRDDE $f^{B,*}_n$ implies the convergence of the AUC regret of $R^{B,*}_n$.

Figure 1: An illustration of the three fundamental components of the AUC regret: $\mathrm{Reg}^{\mathrm{AUC}}(R^{B,*}_n) \to 0 \Leftarrow \|f^{B,*}_n - f\|_{L_1(\mathcal{X})} \to 0$ (Theorem 2) $\Leftarrow$ SRM. The left block represents the consistency of the AUC regret, the middle block the consistency of BRDDE, and the right block the statistical analysis of SRM, corresponding to Problems (iii), (ii), and (i), respectively.

The above three problems form the foundation for a learning theory analysis of bagged regularized $k$-distances and serve as its three main building blocks. Notice that Problem (ii) is already addressed by Theorem 2 in Section 3.2. Detailed explorations of the other two problems, (i) and (iii), are given in the following subsections.

4.2.1 Analysis for the Surrogate Risk Minimization

The following proposition ensures that the solution to the surrogate risk minimization satisfies the conditions in Proposition 1 with high probability, thereby addressing Problem (i) and validating the effectiveness of our algorithm.

Proposition 5 Let Assumption 2 hold, and let $\{D^b_s\}_{b=1}^B$ be $B$ disjoint subsets of size $s$ randomly drawn from the dataset $D_n$.
Moreover, let $w^{b,*}$ be defined as in (7), and let $k^{b,*} := k(w^{b,*}) := \sup\{i \in [s] : w^{b,*}_i \neq 0\}$. If we choose $s$ and $B$ as in (13) in Theorem 2, then there exists $N_2 \in \mathbb{N}$, which will be specified in the proof, such that for all $n > N_2$, the following two statements hold with probability $P^{Bs}$ at least $1 - 1/n^2$:
1. Conditions (i)-(iv) in Proposition 1 hold for $w^{b,*}$ and $k^{b,*}$.
2. Furthermore, we have $k^{b,*} \lesssim (n/\log n)^{1/(d+2)}$ for $b \in [B]$.

4.2.2 Analysis for the AUC Regret

Problem (iii), the left block of Figure 1, is solved by the next proposition, which shows that bounding the AUC regret of the bagged regularized $k$-distances can be reduced to bounding the $L_1$-error of the BRDDE.

Proposition 6 Let Assumptions 1 and 2 hold, and let $f$ be the true density function. Let $R^{B,*}_n$ be the bagged regularized $k$-distances as in (10) and $f^{B,*}_n(x)$ be the BRDDE as in (11). Suppose that there exists a constant $c > 0$ such that $f^{B,*}_n \geq c$. Then we have
$$\mathrm{Reg}^{\mathrm{AUC}}(R^{B,*}_n) \lesssim \int_{\mathcal{X}} \big|f^{B,*}_n(x) - f(x)\big| \, dx.$$
The results in Proposition 6 apply broadly to any density estimator that is lower bounded and whose weights satisfy the conditions outlined in Propositions 4 and 5. Indeed, this means that any sufficiently good (weighted) $k$-NN density estimator meeting these conditions would achieve similar theoretical guarantees. Therefore, our proposed estimator does not claim a faster convergence rate than existing methods. Instead, its primary advantage lies in reducing sensitivity to hyperparameter selection in practice, as thoroughly discussed in Section 2.3. The theoretical guarantees established in Proposition 6 play a crucial role in validating the consistency of our estimator in comparison to standard approaches.
4.3 Complexity Analysis

To address the efficiency issue of distance-based anomaly detection methods on large-scale datasets, Wu and Jermaine (2006) proposed iterative subsampling: for each test sample, they first randomly select a portion of the data and then compute the $k$-distance over the subsample. They provided a probabilistic analysis of the quality of the subsampled distance compared to the $k$-distance over the whole dataset. Furthermore, Sugiyama and Borgwardt (2013) proposed one-time sampling, which computes the $k$-distances over a single subsample for all test samples and is shown to be more efficient than iterative sampling. Although these subsampling methods improve computational efficiency, the resulting distance-based methods fail to fully utilize the information in the dataset, since a large portion of the samples is dropped. By contrast, the bagging technique incorporated in our BRDAD not only addresses the efficiency issue on large-scale datasets but also makes full use of the data. In the following, we conduct a detailed complexity analysis of BRDAD to demonstrate its computational efficiency. As a widely used data structure, the $k$-d tree (Friedman et al., 1977) is commonly employed in nearest-neighbor-based methods to search for nearest neighbors. Given $n$ data points in $d$ dimensions, Friedman et al. (1977) showed that constructing a $k$-d tree requires $O(dn \log n)$ time, while searching for the $k$ nearest neighbors of a query point takes $O(k \log n)$ time. Below, we analyze the time complexities of the construction and search stages in BRDAD to demonstrate how bagging reduces the computational complexity. (i) In each bagging round, BRDAD constructs a $k$-d tree using $s$ data points. Taking $s$ of the order in Theorem 2, the construction time per $k$-d tree is $O(ds \log s) = O(d n^{(d+1)/(d+2)} (\log n)^{1/(d+2)})$ when parallelism is applied.
In contrast, without bagging, constructing a single $k$-d tree requires $O(dn \log n)$ time. Thus, bagging reduces the construction time complexity. (ii) The time complexity of computing the regularized $k$-distances at each bagging round consists of two main components: (a) computing the average $i$-distances and (b) solving the SRM problem. The first part, querying $k^{b,*}$ neighbors, takes $O(k^{b,*} \log s)$ time. For the second part, Anava and Levy (2016, Theorem 3.3) show that Algorithm 1 finds the solution in $O(k^{b,*})$ time. Consequently, the search stage requires at most $O(k^{b,*} \log s)$ time. From the orders of $k^{b,*}$ and $s$ in Proposition 5 and Theorem 3, the search complexity with parallel computation across the subsets $\{D^b_s\}_{b=1}^B$ is $O(n^{1/(d+2)} (\log n)^{(d+1)/(d+2)})$. In contrast, without bagging, the search complexity is $O(n^{2/(d+2)} (\log n)^{(2d+2)/(d+2)})$, from the order of $k$ in Lemma 20 in Section 6.2. This shows that bagging also improves the search efficiency. In summary, with proper parallelization, the overall complexity is dominated by the construction stage at $O(d n^{(d+1)/(d+2)} (\log n)^{1/(d+2)})$, compared to $O(dn \log n)$ without bagging. Thus, bagging enhances computational efficiency when parallel computation is fully leveraged. However, due to the dependence on $d$, the computational improvement becomes less significant in higher dimensions. For popular distance-based anomaly detection methods such as standard $k$-NN and DTM (Gu et al., 2019), the primary computational cost comes from constructing a $k$-d tree and searching for the $k$ nearest neighbors. If $k$ is set to the optimal order $O(n^{2/(d+2)} (\log n)^{d/(d+2)})$ for standard $k$-NN density estimation, the construction stage takes $O(dn \log n)$ time, while the search stage requires $O(n^{2/(d+2)} (\log n)^{(2d+2)/(d+2)})$ time. For another distance-based method, LOF, in addition to constructing a $k$-d tree and searching for the $k$ nearest neighbors, there is an extra step of computing scores for all samples, which adds a time complexity of $O(n)$ (Breunig et al., 2000).
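The parameter choices in (13) and the resulting construction costs can be checked numerically. A sketch with all constants set to 1 (function names our own); note that with these exponents $s \cdot B$ equals $n$ exactly, mirroring the assumption $n = 2Bs$ up to constants:

```python
import math

def bagging_params(n, d):
    """s and B from (13) with all constants set to 1:
    s = (n / log n)^((d+1)/(d+2)),  B = n^(1/(d+2)) * (log n)^((d+1)/(d+2))."""
    s = (n / math.log(n)) ** ((d + 1) / (d + 2))
    B = n ** (1 / (d + 2)) * math.log(n) ** ((d + 1) / (d + 2))
    return s, B

def kd_build_costs(n, d):
    """Per-bag k-d tree build cost d*s*log(s) under parallel bagging,
    vs. d*n*log(n) for one tree on the full dataset (Friedman et al., 1977)."""
    s, _ = bagging_params(n, d)
    return d * s * math.log(s), d * n * math.log(n)
```

The ratio of the bagged to the unbagged construction cost shrinks as $n$ grows, which is the quantitative content of the comparison above.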
A straightforward comparison shows that these methods are significantly more computationally intensive than BRDAD, whose construction and search stages have much lower complexities of $O(d n^{(d+1)/(d+2)} (\log n)^{1/(d+2)})$ and $O(n^{1/(d+2)} (\log n)^{(d+1)/(d+2)})$, respectively.

5. Experiments

This section presents numerical experiments. In Section 5.1, we perform synthetic data experiments on density estimation to illustrate the convergence of the surrogate risk and the mean absolute error of BRDDE as the sample size grows. Section 5.2 focuses on real-world anomaly detection benchmarks, where we compare BRDAD with various methods and explain its advantages. Our empirical results demonstrate that bagging enhances the algorithm's performance.

5.1 Synthetic Data Experiments on Density Estimation

In this section, we empirically demonstrate the convergence of both the surrogate risk (SR) and the mean absolute error (MAE) of BRDDE as the sample size $n$ increases. The results are presented in Figures 2(a) and 2(b). As established in Theorem 1, the surrogate risk is expected to exhibit convergence behavior similar to that of the MAE for BRDDE. To investigate this, we sample $n$ data points from the $N(0, 1)$ distribution, with $n$ set to $\{300, 1000, 3000, 5000, 10000\}$ for training. The surrogate risk for each $n$ is computed using Algorithm 1 with $B = 1$. Additionally, we randomly sample 10,000 instances to calculate the MAE and assess BRDDE's performance. Each experiment is repeated 20 times for every sample size $n$. The results in Figure 2(a) show that the surrogate risk decreases monotonically as $n$ increases, while Figure 2(b) exhibits a similar convergence pattern for the MAE. Furthermore, we plot the ratio of SR to MAE for each sample size $n$ in Figure 2(c), which indicates that the ratio stabilizes once $n$ exceeds 3000, further confirming the similarity of their convergence behaviors.
To extend this analysis to more complex distributions and higher dimensions, we conduct additional experiments using the mixture distribution $0.4\, N(0.3 \cdot \mathbf{1}_d,\, 0.01 I_d) + 0.6\, N(0.7 \cdot \mathbf{1}_d,\, 0.0025 I_d)$ across dimensions $d \in \{1, 3, 5, 7, 9\}$. The ratio of SR to MAE for each sample size $n$ is plotted in Figure 2(d), revealing convergence toward a dimension-dependent constant. This result reinforces the generalizability of our optimized weights across diverse settings. We also provide an illustrative example on a synthetic dataset to demonstrate the sensitivity of the hyperparameter $k$ in other nearest-neighbor-based density estimation methods, namely the $k$-NN density estimator ($k$-NN) and the weighted $k$-NN density estimator (Wk-NN) (Biau et al., 2011). Following Biau et al. (2011), we take $\nu$ to be the distribution of $U^\alpha$, where $U$ is uniform on $[0,1]$ and $\alpha > 0$ is a parameter. Given an integer $k$, the weights of Wk-NN are defined by
$$w_i = \int_{((i-1)/k,\; i/k]} \nu(dt) = (i/k)^{1/\alpha} - ((i-1)/k)^{1/\alpha} \quad \text{for } i \in [k],$$
with $w_i = 0$ otherwise. We generate 1000 data points to train the density estimators and an additional 10,000 points to compute the MAE, drawn from a Gaussian mixture model with density $0.5\, N(0.3, 0.01) + 0.5\, N(0.7, 0.0025)$. We vary the hyperparameter $k$ from 3 to 500 to observe its effect on the MAE for both $k$-NN and Wk-NN. The results in Figure 3 illustrate that the performance of these density estimators is significantly influenced by the choice of $k$, regardless of $\alpha$: only a narrow range of $k$ values leads to optimal results. In contrast, our proposed estimator, BRDDE, mitigates this sensitivity. The black dashed line in Figure 3 represents the MAE of BRDDE, demonstrating that it achieves near-optimal results comparable to the best-tuned nearest-neighbor-based methods without requiring fine-tuning of $k$.
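The Wk-NN weights used in the comparison above are straightforward to compute; a short sketch (the function name is our own):

```python
def wknn_weights(k, alpha):
    """Weights of the weighted k-NN density estimator of Biau et al. (2011):
    w_i = (i/k)^(1/alpha) - ((i-1)/k)^(1/alpha) for i = 1, ..., k.
    The sum telescopes to 1, so the weights always form a probability vector."""
    return [(i / k) ** (1 / alpha) - ((i - 1) / k) ** (1 / alpha)
            for i in range(1, k + 1)]
```

Setting `alpha = 1` recovers the uniform weights $1/k$ of the plain $k$-NN estimator, while `alpha > 1` places more mass on the closest neighbors; in every case the single hyperparameter $k$ still has to be tuned, which is exactly the sensitivity the experiment illustrates.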
5.2 Real-world Data Experiments on Anomaly Detection

5.2.1 Dataset Descriptions

To provide an extensive experimental evaluation, we use the latest anomaly detection benchmark repository, ADBench, established by Han et al. (2022). The repository includes 47 tabular datasets, ranging from 80 to 619,326 instances and from 3 to 1,555 features. Descriptions of these datasets are provided in Table 1.

Table 1: Descriptions of ADBench Datasets (Number, Data, # Samples, # Features, # Anomaly, % Anomaly, Category)
1 ALOI 49534 27 1508 3.04 Image
2 annthyroid 7200 6 534 7.42 Healthcare
3 backdoor 95329 196 2329 2.44 Network
4 breastw 683 9 239 34.99 Healthcare
5 campaign 41188 62 4640 11.27 Finance
6 cardio 1831 21 176 9.61 Healthcare
7 Cardiotocography 2114 21 466 22.04 Healthcare
8 celeba 202599 39 4547 2.24 Image
9 census 299285 500 18568 6.20 Sociology
10 cover 286048 10 2747 0.96 Botany
11 donors 619326 10 36710 5.93 Sociology
12 fault 1941 27 673 34.67 Physical
13 fraud 284807 29 492 0.17 Finance
14 glass 214 7 9 4.21 Forensic
15 Hepatitis 80 19 13 16.25 Healthcare
16 http 567498 3 2211 0.39 Web
17 Internet Ads 1966 1555 368 18.72 Image
18 Ionosphere 351 32 126 35.90 Oryctognosy
19 landsat 6435 36 1333 20.71 Astronautics
20 letter 1600 32 100 6.25 Image
21 Lymphography 148 18 6 4.05 Healthcare
22 magic.gamma 19020 10 6688 35.16 Physical
23 mammography 11183 6 260 2.32 Healthcare
24 mnist 7603 100 700 9.21 Image
25 musk 3062 166 97 3.17 Chemistry
26 optdigits 5216 64 150 2.88 Image
27 Page Blocks 5393 10 510 9.46 Document
28 pendigits 6870 16 156 2.27 Image
29 Pima 768 8 268 34.90 Healthcare
30 satellite 6435 36 2036 31.64 Astronautics
31 satimage-2 5803 36 71 1.22 Astronautics
32 shuttle 49097 9 3511 7.15 Astronautics
33 skin 245057 3 50859 20.75 Image
34 smtp 95156 3 30 0.03 Web
35 Spam Base 4207 57 1679 39.91 Document
36 speech 3686 400 61 1.65 Linguistics
37 Stamps 340 9 31 9.12 Document
38 thyroid 3772 6 93 2.47 Healthcare
39 vertebral 240 6 30 12.50 Biology
40 vowels 1456 12 50 3.43 Linguistics
41 Waveform 3443 21 100 2.90 Physics
42 WBC 223 9 10 4.48 Healthcare
43 WDBC 367 30 10 2.72 Healthcare
44 Wilt 4819 5 257 5.33 Botany
45 wine 129 13 10 7.75 Chemistry
46 WPBC 198 33 47 23.74 Healthcare
47 yeast 1484 8 507 34.16 Biology

Figure 2: (a)(b) show that SRM leads to the convergence of both the surrogate risk (SR) and the mean absolute error (MAE). Furthermore, (c) shows that as the sample size $n$ increases, the ratio of SR to MAE becomes stable, indicating similar convergence behaviors for both SR and MAE when applying Algorithm 1. (d) shows the log ratio of SR to MAE for dimensions $d \in \{1, 3, 5, 7, 9\}$.

Figure 3: Illustration of the sensitivity of the parameter $k$ (curves for $\alpha = 10, 2, 1, 0.5, 0.2$).

5.2.2 Methods for Comparison

We conduct experiments on the following anomaly detection algorithms.
(i) BRDAD is our proposed algorithm, with details provided in Algorithm 2. The choice of $B$ depends on the sample size: for $n \in (0, 10000]$, $(10000, 50000]$, and $(50000, +\infty)$, we set $B = 1$, $5$, and $10$, respectively. In practice, when $B$ is fixed, we randomly divide the data into $B$ subsets, each containing either $\lfloor n/B \rfloor$ or $\lfloor n/B \rfloor + 1$ samples. Each subset is then further split into two parts whose sizes are equal or differ by at most 1. This process ensures that the full dataset is partitioned as evenly as possible.
(ii) Distance-To-Measure (DTM) (Gu et al., 2019) is a distance-based algorithm that employs a generalization of the $k$ nearest neighbors called the distance-to-measure. We use the authors' implementation. As suggested by the authors, the number of neighbors is fixed at $k = 0.03 \times$ sample size.
(iii) k-Nearest Neighbors (k-NN) (Ramaswamy et al., 2000) is a distance-based algorithm that uses the distance of a point from its k-th nearest neighbor to distinguish anomalies. We use the implementation of the Python package PyOD with its default parameters.

(iv) Local Outlier Factor (LOF) (Breunig et al., 2000) is a distance-based algorithm that measures the local deviation of the density of a given data point with respect to its neighbors. We also use PyOD with its default parameters.

(v) Partial Identification Forest (PIDForest) (Gopalan et al., 2019) is a forest-based algorithm that computes the anomaly score of a point by determining the minimum density of data points across all subcubes partitioned by decision trees. We use the authors' implementation with the number of trees T = 50, the number of buckets B = 5, and the depth of trees p = 10, as suggested by the authors.

(vi) Isolation Forest (iForest) (Liu et al., 2008) is a forest-based algorithm that works by randomly partitioning features of the data into smaller subsets and distinguishing between normal and anomalous points based on the number of splits required to isolate them, with anomalies requiring fewer splits. We use the implementation of the Python package PyOD with its default parameters.

(vii) One-class SVM (OCSVM) (Schölkopf et al., 1999) is a kernel-based algorithm which tries to separate data from the origin in the transformed high-dimensional predictor space. We also use PyOD with its default parameters.

Note that since BRDAD, iForest, and PIDForest are randomized algorithms, we repeat these three algorithms for 10 runs and report the averaged AUC performance. DTM, k-NN, LOF, and OCSVM are deterministic, and hence we report a single AUC number for them.

5.2.3 Experimental Results

Table 2 presents the performance of seven anomaly detection methods on the ADBench benchmark, evaluated using the AUC metric.
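The rank-sum and first-place statistics reported in the last two rows of Table 2 are straightforward to compute from a matrix of AUC scores. The following sketch is our own illustration, not part of the paper's code; the helper name rank_sum and the toy AUC values are hypothetical:

```python
import numpy as np

def rank_sum(auc_matrix):
    """Given an (n_datasets, n_methods) AUC matrix, return each method's
    rank sum (lower is better) and its number of first-place finishes.
    Ties are broken arbitrarily by argsort."""
    # Rank methods per dataset: the highest AUC receives rank 1.
    order = (-auc_matrix).argsort(axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(auc_matrix.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, auc_matrix.shape[1] + 1)
    return ranks.sum(axis=0), (ranks == 1).sum(axis=0)

# Toy example with 3 datasets and 3 hypothetical methods.
aucs = np.array([[0.90, 0.80, 0.70],
                 [0.60, 0.95, 0.50],
                 [0.85, 0.75, 0.65]])
sums, firsts = rank_sum(aucs)
print(sums, firsts)  # method 0 gets ranks 1, 2, 1 -> rank sum 4, two firsts
```

A lower rank sum combined with more first-place finishes is the criterion the section uses to compare methods.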
The last two rows summarize each algorithm's rank sum and the number of top-ranking performances. A lower rank sum and a higher number of first-place rankings indicate better performance. BRDAD demonstrates exceptional results across both evaluation metrics. Specifically, it achieves the lowest rank sum of 145, significantly outperforming other methods, with DTM and iForest scoring 160 and 159, respectively. In terms of first-place rankings across datasets, BRDAD ranks first in 11 out of 47 tabular datasets, outperforming PIDForest (9/47) and DTM (8/47). Overall, BRDAD exhibits outstanding performance, excelling over previous distance-based methods while competing effectively with forest-based approaches.

Table 2: Experimental Comparisons on ADBench Datasets
Data  BRDAD  DTM  k-NN  LOF  PIDForest  iForest  OCSVM
ALOI  0.5427  0.5440  0.6942  0.7681  0.5061  0.5411  0.5326
annthyroid  0.6516  0.6772  0.7343  0.7076  0.8781  0.8138  0.5842
backdoor  0.8490  0.9216  0.6682  0.7135  0.6965  0.7238  0.8465
breastw  0.9883  0.9799  0.9765  0.3907  0.9750  0.9871  0.8052
campaign  0.6711  0.6908  0.7202  0.5366  0.7945  0.7182  0.6630
cardio  0.9142  0.8879  0.7330  0.6372  0.8258  0.9271  0.9286
Cardiotocography  0.6302  0.6043  0.5449  0.5705  0.5587  0.6973  0.7872
celeba  0.5896  0.6929  0.5666  0.4332  0.6732  0.6955  0.6962
census  0.6394  0.6435  0.6465  0.5501  0.5543  0.6116  0.5336
cover  0.9301  0.9277  0.7961  0.5262  0.8065  0.8784  0.9141
donors  0.7858  0.8000  0.6117  0.5977  0.6945  0.7810  0.7323
fault  0.7591  0.7587  0.7286  0.5827  0.5437  0.5714  0.5074
fraud  0.9552  0.9583  0.9342  0.4750  0.9489  0.9493  0.9477
glass  0.7993  0.8688  0.8640  0.8114  0.7913  0.7933  0.4407
Hepatitis  0.6954  0.6303  0.6745  0.6429  0.7186  0.6944  0.6418
http  0.9943  0.0507  0.2311  0.3550  0.9870  0.9999  0.9949
Internet Ads  0.7274  0.7063  0.7110  0.6485  0.6754  0.6913  0.6890
Ionosphere  0.9113  0.9237  0.9259  0.8609  0.6820  0.8493  0.7395
landsat  0.6176  0.6184  0.5773  0.5497  0.5245  0.4833  0.3660
letter  0.8426  0.8417  0.8950  0.8872  0.6636  0.6318  0.4843
Lymphography  0.9988  0.9965  0.9988  0.9953  0.9656  0.9993  0.9977
magic.gamma  0.8205  0.8214  0.8323  0.6712  0.7252  0.7316  0.5947
mammography  0.8132  0.8301  0.8424  0.7398  0.8453  0.8592  0.8412
mnist  0.8335  0.8630  0.8041  0.6498  0.5366  0.7997  0.8204
musk  0.7583  0.9987  0.6604  0.4271  0.9997  0.9995  0.8094
optdigits  0.3912  0.5474  0.4189  0.5831  0.8248  0.6970  0.5336
Page Blocks  0.8889  0.8859  0.7813  0.7345  0.8154  0.8980  0.8903
pendigits  0.9174  0.9581  0.7127  0.4821  0.9214  0.9515  0.9354
Pima  0.7291  0.7224  0.7137  0.5978  0.6842  0.6803  0.6022
satellite  0.7449  0.7375  0.6489  0.5436  0.7122  0.7043  0.5972
satimage-2  0.9991  0.9991  0.9164  0.5514  0.9919  0.9935  0.9747
shuttle  0.9898  0.9442  0.6317  0.5239  0.9885  0.9968  0.9823
skin  0.7570  0.7177  0.5881  0.5756  0.7071  0.6664  0.4857
smtp  0.8506  0.8854  0.8953  0.9023  0.9203  0.9077  0.7674
Spam Base  0.5687  0.5663  0.4977  0.4581  0.6941  0.6212  0.5251
speech  0.4834  0.4810  0.4832  0.5067  0.4739  0.4648  0.4639
Stamps  0.8980  0.8594  0.8362  0.7269  0.8883  0.8911  0.8179
thyroid  0.9353  0.9470  0.9508  0.8075  0.9687  0.9771  0.8437
vertebral  0.3236  0.3663  0.3768  0.4208  0.2857  0.3515  0.3852
vowels  0.9489  0.9667  0.9797  0.9443  0.7817  0.7590  0.5507
Waveform  0.7783  0.7685  0.7457  0.7133  0.7263  0.7144  0.5393
WBC  0.9972  0.9930  0.9925  0.8399  0.9904  0.9959  0.9967
WDBC  0.9841  0.9773  0.9782  0.9796  0.9916  0.9850  0.9877
Wilt  0.3138  0.3545  0.4917  0.5394  0.5012  0.4477  0.3491
wine  0.8788  0.4277  0.4992  0.8756  0.8221  0.7987  0.6941
WPBC  0.5188  0.5101  0.5208  0.5184  0.5283  0.4942  0.4743
yeast  0.3717  0.3876  0.3936  0.4571  0.4019  0.3964  0.4141
Rank Sum  145  160  192  243  187  159  228
Num. No. 1  11  8  5  5  9  6  3

On the one hand, BRDAD outperforms distance-based methods such as DTM and k-NN in some datasets. For example, on the Satellite dataset, while DTM achieves a high AUC of 0.7375, BRDAD further improves it to 0.7449. Similarly, on the Internet Ads dataset, BRDAD achieves an AUC of 0.7274, outperforming k-NN's score of 0.7110.
On the other hand, BRDAD remains competitive even in datasets where distance-based methods perform poorly while forest-based methods excel, such as the Stamps and Wine datasets. On Stamps, DTM and k-NN achieve AUC scores of 0.8594 and 0.8362, respectively, whereas forest-based methods like PIDForest and iForest attain 0.8883 and 0.8911. Surprisingly, despite being a distance-based method, BRDAD surpasses them all with an AUC of 0.8980. Similarly, on the Wine dataset, PIDForest and iForest achieve AUC scores of 0.8221 and 0.7987, respectively, while distance-based methods DTM and k-NN perform significantly worse, with scores of 0.4277 and 0.4992. Remarkably, BRDAD not only outperforms other distance-based methods but also surpasses forest-based methods, achieving the highest AUC of 0.8788. These results empirically demonstrate BRDAD's superiority over both distance-based and forest-based anomaly detection algorithms.

5.2.4 Parameter Analysis

In this section, we analyze the impact of the number of bagging rounds B on BRDAD's performance using the ADBench datasets. Following the sample size categorization in Section 5.2.2, we classify the datasets into three groups: small datasets with n ∈ (0, 10000], medium datasets with n ∈ (10000, 50000], and large datasets with n ∈ (50000, +∞). We evaluate BRDAD with different values of B ∈ {1, 5, 10, 20} and record the rank sum for each setting.

Table 3: Experimental Comparisons for BRDAD with different B on ADBench Datasets
        B = 1  B = 5  B = 10  B = 20
Small   95     112    106     102
Medium  25     25     26      26
Large   26     25     25      26

Table 3 indicates that the effect of B on the rank sum metric varies with sample size. For small datasets, B = 1 significantly outperforms other choices, as bagging reduces the already limited sample size. In contrast, for medium and large datasets, a slightly larger B yields marginally better performance. This aligns with our theoretical findings in Theorem 3, which suggest that B should increase with sample size.
These results also validate the effectiveness of our chosen values of B in real-world datasets, as discussed in the previous subsection.

6. Proofs

In this section, we present proofs of the theoretical results in this paper. More precisely, we first provide proofs related to the surrogate risk in Section 6.1. The proofs related to the convergence rates of BRDDE and BRDAD are provided in Sections 6.2 and 6.3, respectively.

6.1 Proofs Related to the Surrogate Risk

In this section, we first provide proofs related to the error analysis of BWDDE in Section 6.1.1. Then in Section 6.1.2, we present the proof of Proposition 2.3 concerning the surrogate risk.

6.1.1 Proofs Related to Section 4.1

Before we proceed, we present Bernstein's inequality (Bernstein, 1946), which will be frequently applied within the subsequent proofs. This concentration inequality is extensively featured in the statistical learning literature, such as Massart (2007); Cucker and Zhou (2007); Steinwart and Christmann (2008); Cai et al. (2023).

Lemma 7 (Bernstein's inequality) Let B > 0 and σ > 0 be real numbers, and n ≥ 1 be an integer. Furthermore, let ξ1, . . . , ξn be independent random variables satisfying E_P ξi = 0, ∥ξi∥∞ ≤ B, and E_P ξi² ≤ σ² for all i = 1, . . . , n. Then for all τ > 0, we have

P( (1/n) Σ_{i=1}^n ξi ≥ √(2σ²τ/n) + 2Bτ/(3n) ) ≤ e^{−τ}.

Although Bernstein's inequality is a powerful tool for establishing finite-sample bounds for random variables, the resulting bounds may not be tight enough for random variables with vanishing variance. To address this limitation, we employ the sub-exponential tail bound, which provides sharper concentration results for such random variables. We begin by introducing the definition of sub-exponential random variables as stated in Wainwright (2019, Definition 2.7) as follows.

Definition 8 (Sub-exponential variables) A random variable X with mean µ = EX is sub-exponential if there exist non-negative parameters (ζ, α) such that

E[e^{λ(X−µ)}] ≤ e^{ζ²λ²/2} for all |λ| < 1/α.
This condition on the moment-generating function, together with the Chernoff technique, leads to concentration inequalities that bound the deviations of sub-exponential random variables. The following lemma gives this result (Wainwright, 2019, Proposition 2.9).

Lemma 9 (Sub-exponential tail bound) Suppose that X is sub-exponential with parameters (ζ, α). Then for all 0 ≤ τ ≤ ζ²/(2α²), we have

P( |X − µ| ≥ ζ√(2τ) ) ≤ 2e^{−τ}.

To measure the complexity of the functional space, we first recall the definition of the covering number in van der Vaart and Wellner (1996).

Definition 10 (Covering Number) Let (X, d) be a metric space and A ⊂ X. For ε > 0, the ε-covering number of A is defined as

N(A, d, ε) := min{ n ≥ 1 : ∃ x1, . . . , xn ∈ X such that A ⊂ ∪_{i=1}^n B(xi, ε) },

where B(x, ε) := { x' ∈ X : d(x, x') ≤ ε }.

The following lemma, which is taken from Hang et al. (2022) and needed in the proof of Lemma 12, provides the covering number of the indicator functions on the collection of balls in R^d.

Lemma 11 Let B := { B(x, r) : x ∈ R^d, r > 0 } and 1_B := { 1_B : B ∈ B }. Then for any ε ∈ (0, 1), there exists a universal constant C such that

N(1_B, L1(Q), ε) ≤ C(d + 2)(4e)^{d+2} ε^{−(d+1)}

holds for any probability measure Q.

The following lemma, which will be used several times in the sequel, provides a uniform bound on the distance between any point and its k-th nearest neighbor with high probability when the distribution has bounded support.

Lemma 12 Let Assumption 2 hold. Furthermore, let {D^b_s}_{b=1}^B be B disjoint subsets of size s randomly drawn from the data set D_n, where D^b_s = {X^b_1, . . . , X^b_s} and R^b_{s,(i)}(x) denotes the i-distance of x in the subset D^b_s for i ∈ [s]. Additionally, let the sequence c_n := 48(2d + 9) log n + 48 log 2 + 144. Suppose that s ≥ c'_1 := max{4e, d + 2, C}, where C is the constant specified in Lemma 11.
Then, there exist constants 0 < c'_2 < c'_3 such that, with probability P^{Bs} at least 1 − 1/n², for all x ∈ X, b ∈ [B], and c_n ≤ i ≤ s, the following inequalities hold:

c'_2 (i/s)^{1/d} ≤ R^b_{s,(i)}(x) ≤ c'_3 (i/s)^{1/d},   (21)
P(B(x, R^b_{s,(i)}(x))) ≍ i/s.   (22)

Proof [of Lemma 12] For x ∈ X and q ∈ [0, 1], under the continuity of the density function, as stated in Assumption 2, the intermediate value theorem guarantees the existence of a unique ρ_x(q) ≥ 0 such that P(B(x, ρ_x(q))) = q. Let τ := (2d + 9) log n + log 2 + 3 and c_n := 48τ = 48(2d + 9) log n + 48 log 2 + 144. Let i be an integer fixed in the sequel with i ≥ c_n, which ensures that i > 3τ. Accordingly, we consider the set

B⁻_i := { B(x, ρ_x((i − √(3τi))/s)) : x ∈ X } ⊂ B.

Lemma 11 implies that for any probability measure Q and ε ∈ (0, 1), there holds

N(1_{B⁻_i}, L1(Q), ε) ≤ N(1_B, L1(Q), ε) ≤ C(d + 2)(4e)^{d+2} ε^{−(d+1)}.   (23)

By the definition of the covering number, there exists an ε-net {A*_j}_{j=1}^J of 1_{B⁻_i} with J := ⌈C(d + 2)(4e)^{d+2} ε^{−(d+1)}⌉, and for any x ∈ X, there exists some j ∈ {1, . . . , J} such that

∥ 1_{B(x, ρ_x((i−√(3τi))/s))} − A*_j ∥_{L1(D^b_s)} ≤ ε.   (24)

Let j ∈ [J] and b ∈ [B] be fixed for now. For any ℓ ∈ [s], define the random variables ξ_{ℓ,b} := 1_{A*_j}(X^b_ℓ) − (i − √(3τi))/s. We have E_P ξ_{ℓ,b} = 0, ∥ξ_{ℓ,b}∥∞ ≤ 1, and E_P ξ²_{ℓ,b} ≤ (i − √(3τi))/s. Applying Bernstein's inequality in Lemma 7, we obtain

(1/s) Σ_{ℓ=1}^s 1_{A*_j}(X^b_ℓ) ≤ (i − √(3τi))/s + √(2τ(i − √(3τi)))/s + 2τ/(3s)   (25)

for A*_j with probability P^s at least 1 − e^{−τ}. The union bound then implies that this inequality holds for all A*_j, j = 1, . . . , J, with probability P^s at least 1 − Je^{−τ}. Combining this result with (24), we obtain the following bound:

(1/s) Σ_{ℓ=1}^s 1{ X^b_ℓ ∈ B(x, ρ_x((i − √(3τi))/s)) } ≤ (i − √(3τi))/s + √(2τ(i − √(3τi)))/s + 2τ/(3s) + ε   (26)

for all x ∈ X with probability P^s at least 1 − Je^{−τ}. Let ε = 1/s. Since s ≥ max{4e, d + 2, C}, it follows that

log J ≤ log C + log(d + 2) + (d + 2) log(4e) + (d + 1) log s ≤ (2d + 5) log s.

Since τ = (2d + 9) log n + log 2 + 3 and s ≤ n, we have the following lower bound on the probability: 1 − Je^{−τ} ≥ 1 − s^{2d+5}/(2e³n^{2d+9}) > 1 − 1/(2n⁴).
Therefore, for fixed c_n ≤ i ≤ s and b ∈ [B], inequality (26) holds for all x ∈ X with probability P^s at least 1 − 1/(2n⁴). Next, note that since τ > 3 and i ≥ 48τ, a straightforward calculation gives

√(2τ(i − √(3τi)))/s + 2τ/(3s) + ε ≤ √(2τi)/s + (2τ/3 + 1)/s ≤ √(2τi)/s + τ/s ≤ √(2τi)/s + √(τi/48)/s ≤ √(3τi)/s.   (27)

The first inequality holds trivially without requiring i ≥ 48τ. However, when considering the set B⁺_i := { B(x, ρ_x((i + √(3τi))/s)) : x ∈ X }, the condition i ≥ 48τ is necessary to establish i + √(3τi) ≤ 5i/4, allowing us to proceed with a similar argument. Combining (27) with (26), we get

(1/s) Σ_{ℓ=1}^s 1{ X^b_ℓ ∈ B(x, ρ_x((i − √(3τi))/s)) } ≤ i/s

for all x ∈ X with probability P^s at least 1 − 1/(2n⁴). By the definition of R^b_{s,(i)}(x), there holds

R^b_{s,(i)}(x) ≥ ρ_x((i − √(3τi))/s).   (28)

Now, let V_d denote the volume of the d-dimensional closed unit ball and µ denote the Lebesgue measure. By the definition of ρ_x and the condition f(x) ≤ c̄ for all x ∈ X, as stated in Assumption 2, we have

(i − √(3τi))/s = P(B(x, ρ_x((i − √(3τi))/s))) = ∫_{B(x, ρ_x((i−√(3τi))/s))} f(u) du ≤ c̄ µ(B(x, ρ_x((i − √(3τi))/s))) = c̄ V_d ρ_x((i − √(3τi))/s)^d.   (29)

Since i ≥ 48τ, we have (i − √(3τi))/s ≥ (i − √(3i²/48))/s = 3i/(4s). Combining this with inequality (29), we get c̄ V_d ρ_x((i − √(3τi))/s)^d ≥ 3i/(4s). Therefore, we obtain

ρ_x((i − √(3τi))/s) ≥ (3/(4V_d c̄))^{1/d} (i/s)^{1/d}.

Combining this with (28), we have R^b_{s,(i)}(x) ≥ ρ_x((i − √(3τi))/s) ≳ (i/s)^{1/d} for all x ∈ X and fixed i and b with probability P^s at least 1 − 1/(2n⁴). Therefore, by a union bound argument over i and b, for all i ≥ c_n, x ∈ X, and b ∈ [B], we have

R^b_{s,(i)}(x) ≥ ρ_x((i − √(3τi))/s) ≳ (i/s)^{1/d}   (30)

with probability P^{Bs} at least 1 − 1/(2n²). On the other hand, to obtain the upper bound for R^b_{s,(i)}(x), we consider the set B⁺_i = { B(x, ρ_x((i + √(3τi))/s)) : x ∈ X }. Using similar arguments, we can show that for all c_n ≤ i ≤ s, x ∈ X, and b ∈ [B], it holds that

R^b_{s,(i)}(x) ≤ ρ_x((i + √(3τi))/s) ≲ (i/s)^{1/d}   (31)

with probability P^{Bs} at least 1 − 1/(2n²). Combining (30) and (31), we obtain that (21) holds for all c_n ≤ i ≤ s, x ∈ X, and b ∈ [B] with probability P^{Bs} at least 1 − 1/n².
Furthermore, (30) and (31) imply

(i − √(3τi))/s = P(B(x, ρ_x((i − √(3τi))/s))) ≤ P(B(x, R^b_{s,(i)}(x))) ≤ P(B(x, ρ_x((i + √(3τi))/s))) = (i + √(3τi))/s.

Since √(3τi) ≤ √(3i²/48) = i/4 for c_n ≤ i ≤ s, we get 3i/(4s) ≤ P(B(x, R^b_{s,(i)}(x))) ≤ 5i/(4s), that is, P(B(x, R^b_{s,(i)}(x))) ≍ i/s. This completes the proof of Lemma 12.

The following lemma, which is needed in the proof of Lemma 17, shows that the probability P(B(x, R^b_{s,(i)}(x))) is a Lipschitz continuous function of x for fixed i ∈ [s] and b ∈ [B]. This lemma is necessary to establish the uniform concentration inequality in Lemma 17.

Lemma 13 Suppose that Assumption 2 holds. For i ∈ [s] and b ∈ [B], let R^b_{s,(i)}(x) denote the i-distance of x in the subset D^b_s = {X^b_1, . . . , X^b_s}. Then for any x, x' ∈ X and all b ∈ [B], we have

| P(B(x, R^b_{s,(i)}(x))) − P(B(x', R^b_{s,(i)}(x'))) | ≤ 2c̄ V_d 3^d d^{(d+1)/2} ∥x − x'∥₂,

where V_d denotes the volume of the d-dimensional closed unit ball.

Proof [of Lemma 13] We first show that R^b_{s,(i)}(x) is a Lipschitz continuous function of x. Let t = R^b_{s,(i)}(x). By the triangle inequality, the i nearest neighbors of x are at distance at most d(x, x') + t from x'. Therefore, there are at least i points at most this distance from x'; hence the i-th nearest neighbor of x' is also at distance less than d(x, x') + t. In other words, R^b_{s,(i)}(x') ≤ R^b_{s,(i)}(x) + d(x, x'). By symmetry, we also have R^b_{s,(i)}(x) ≤ R^b_{s,(i)}(x') + d(x, x'), implying that R^b_{s,(i)}(x) is a Lipschitz continuous function. Next, we aim to show the continuity of P(B(x, R^b_{s,(i)}(x))). For any x, x' ∈ X, given f ≤ c̄ from Assumption 2, we have

| P(B(x, R^b_{s,(i)}(x))) − P(B(x', R^b_{s,(i)}(x'))) | ≤ c̄ µ( B(x, R^b_{s,(i)}(x)) △ B(x', R^b_{s,(i)}(x')) ),   (32)

where the notation △ represents the symmetric difference between two sets, i.e., A △ B := (A \ B) ∪ (B \ A), and µ denotes the Lebesgue measure. Using geometric considerations, the Hausdorff distance between B(x, R^b_{s,(i)}(x)) and B(x', R^b_{s,(i)}(x')) can be bounded by δ := ∥x − x'∥₂ + |R^b_{s,(i)}(x) − R^b_{s,(i)}(x')|.
Therefore, we have

µ( B(x, R^b_{s,(i)}(x)) \ B(x', R^b_{s,(i)}(x')) ) ≤ V_d ( (R^b_{s,(i)}(x) + δ)^d − R^b_{s,(i)}(x)^d ).

Applying the Lagrange mean value theorem, there exists ξ ∈ [R^b_{s,(i)}(x), R^b_{s,(i)}(x) + δ] such that

(R^b_{s,(i)}(x) + δ)^d − R^b_{s,(i)}(x)^d = d ξ^{d−1} δ.

From the Lipschitz continuity of R^b_{s,(i)}(x) and the assumption X = [0, 1]^d, we know that δ ≤ 2∥x − x'∥₂ ≤ 2d^{1/2} and R^b_{s,(i)}(x) ≤ d^{1/2}. This implies ξ ≤ R^b_{s,(i)}(x) + δ ≤ 3d^{1/2}. Substituting this into the above inequalities, we get

µ( B(x, R^b_{s,(i)}(x)) \ B(x', R^b_{s,(i)}(x')) ) ≤ V_d d (3d^{1/2})^{d−1} δ.

Since δ ≤ 2∥x − x'∥₂, this yields µ( B(x, R^b_{s,(i)}(x)) \ B(x', R^b_{s,(i)}(x')) ) ≤ V_d 3^d d^{(d+1)/2} ∥x − x'∥₂. By symmetry, we can similarly bound µ( B(x', R^b_{s,(i)}(x')) \ B(x, R^b_{s,(i)}(x)) ). Combining these bounds, we obtain

µ( B(x, R^b_{s,(i)}(x)) △ B(x', R^b_{s,(i)}(x')) ) ≤ 2V_d 3^d d^{(d+1)/2} ∥x − x'∥₂.

Finally, substituting this into (32), we obtain the desired assertion.

By (20), the study of P(B(x, R^b_{s,(i)}(x))) reduces to analyzing U^b_{(i)}, where U^b_{(i)} ~ Beta(i, s + 1 − i). Consequently, term (II) in (17) represents a sum of sample mean deviations, where the sample refers to P(B(x, R^b_{s,(i)}(x)))^{1/d} across b ∈ [B], and our objective is to establish finite-sample bounds for each i. To achieve this, we leverage the sub-exponential property of the d-th root of the beta distribution. However, this problem is challenging due to the complexity of the integral in its moment-generating function. To the best of our knowledge, no existing results provide such bounds for this distribution. To address this gap, we introduce a new approach to derive an upper bound for the moment-generating function. Specifically, we establish upper bounds on the p-th absolute central moment of the beta distribution for p ≥ 1 in Lemma 14, following Skorski (2023). We then extend these results to its d-th root in Lemma 15, leading to the sub-exponential property established in Lemma 16. This property serves as the foundation for our finite-sample bounds in Lemma 17.

Lemma 14 Let s and i be integers with s ≥ 2i.
Let X ~ Beta(i, s + 1 − i) and µ = EX = i/(s + 1). Define

u := 2(s + 1 − 2i)/((s + 1)(s + 3)),  v := i(s + 1 − i)/((s + 1)²(s + 2)).   (33)

Then X is sub-exponential with parameters (2√v, 2u). Furthermore, for all p ≥ 1, we have E|X − µ|^p ≤ 4^{p+1}(p√v)^p.

Proof [of Lemma 14] We define the moment generating function φ_X(λ) := E[exp(λ(X − EX))] and its logarithm ψ(λ) := log φ_X(λ). Since s ≥ 2i, we have s + 1 − i ≥ i. Therefore, for the Beta distribution Beta(i, s + 1 − i) with v and u as defined above, the first case in Skorski (2023, Claim 43) provides the bound:

ψ(λ) ≤ −(v/u²)( uλ + log(1 − uλ) ),  0 ≤ λ < 1/u.   (34)

Define g(x) := log(1 − x) + x + 2x² for 0 ≤ x ≤ 1/2. Then we have g'(x) = x(3 − 4x)/(1 − x) > 0. Therefore, g(x) is increasing, and thus g(x) ≥ g(0) = 0 for 0 ≤ x ≤ 1/2. Hence, for 0 ≤ λ ≤ 1/(2u), we have −log(1 − uλ) ≤ uλ + 2u²λ². Substituting this into (34), we get

ψ(λ) ≤ 2vλ²,  0 ≤ λ ≤ 1/(2u).   (35)

Next, we extend the bound to λ < 0. Since X ~ Beta(i, s + 1 − i), it follows that 1 − X ~ Beta(s + 1 − i, i). Applying the second case in Skorski (2023, Claim 43), we have

ψ(λ) = log E[exp( (−λ)(1 − X − E[1 − X]) )] ≤ vλ²/2,  −1/(2u) ≤ λ < 0.

Combining this with (35), we conclude ψ(λ) ≤ 2vλ² for |λ| ≤ 1/(2u). Therefore, X is sub-exponential with parameters (2√v, 2u).

Next, we turn to bound the moments. It is clear that for x ≥ 0 and a > 0, we have e^{ax} ≥ ax. For x ≥ µ and p ≥ 1, this implies

x − µ ≤ 4p√v e^{(x−µ)/(4p√v)} ≤ 4p√v ( e^{(x−µ)/(4√v)} + e^{−(x−µ)/(4√v)} )^{1/p}.

By symmetry, this yields |x − µ|^p ≤ (4p√v)^p ( e^{(x−µ)/(4√v)} + e^{−(x−µ)/(4√v)} ). Given s ≥ 2i, we have

1/(4√v) = (s + 1)√(s + 2)/(4√(i(s + 1 − i))) ≤ (s + 1)(s + 2)/(4(s + 1 − i)) < (s + 1)(s + 3)/(4(s + 1 − 2i)) = 1/(2u).

Taking expectation with respect to X and using the sub-exponential property, we obtain

E[|X − µ|^p] ≤ (4p√v)^p ( E[e^{(X−µ)/(4√v)}] + E[e^{−(X−µ)/(4√v)}] ) ≤ 2e^{1/8}(4p√v)^p ≤ 4^{p+1}(p√v)^p.

This completes the proof.

Lemma 15 Let i, s, and d be integers with s ≥ 2i. Let X ~ Beta(i, s + 1 − i) and γ_{s,i} be defined as in (3). Then for all p ≥ 1, we have

E|X^{1/d} − γ_{s,i}|^p ≤ 3 ( 8p i^{−1/2+1/d} s^{−1/d} )^p.
Proof [of Lemma 15] For all x, y ∈ R and p ≥ 1, by Jensen's inequality, we have (|x| + |y|)^p ≤ 2^{p−1}(|x|^p + |y|^p). Therefore,

E|X^{1/d} − γ_{s,i}|^p ≤ 2^{p−1} E|X^{1/d} − (i/(s + 1))^{1/d}|^p + 2^{p−1} |(i/(s + 1))^{1/d} − γ_{s,i}|^p.   (36)

Bounding the first term: By equality (15), we have

|X − i/(s + 1)| = |X^{1/d} − (i/(s + 1))^{1/d}| Σ_{j=0}^{d−1} X^{j/d} (i/(s + 1))^{(d−1−j)/d} ≥ (i/(s + 1))^{(d−1)/d} |X^{1/d} − (i/(s + 1))^{1/d}|.

Therefore, we get

E|X^{1/d} − (i/(s + 1))^{1/d}|^p ≤ (i/(s + 1))^{−p(d−1)/d} E|X − i/(s + 1)|^p.   (37)

Lemma 14 implies that E|X − i/(s + 1)|^p ≤ 4^{p+1}(p√v)^p, where v is specified in (33). Since v < i/(s + 1)², combining this with (37) gives

2^{p−1} E|X^{1/d} − (i/(s + 1))^{1/d}|^p ≤ 2^{p−1} (i/(s + 1))^{−p(d−1)/d} 4^{p+1} p^p (i^{1/2}/(s + 1))^p = 2 ( 8p i^{−1/2+1/d} (s + 1)^{−1/d} )^p.

Bounding the second term: Using Gautschi's inequality (Gautschi, 1959) for Gamma functions, we have

(i + 1/d − 1)^{1/d} < Γ(i + 1/d)/Γ(i) < (i + 1/d)^{1/d}  and  (s + 1/d)^{1/d} < Γ(s + 1 + 1/d)/Γ(s + 1) < (s + 1 + 1/d)^{1/d}.

By the definition of γ_{s,i} in (3), we conclude that

( (i + 1/d − 1)/(s + 1 + 1/d) )^{1/d} < γ_{s,i} < ( (i + 1/d)/(s + 1/d) )^{1/d}.   (38)

Rewriting the left bound:

( (i + 1/d − 1)/(s + 1 + 1/d) )^{1/d} = ( i/(s + 1) )^{1/d} ( (1 + (1/d − 1)/i)/(1 + (1/d)/(s + 1)) )^{1/d}.

Since s ≥ 2i, the quantity x := ((1/d − 1)/i − (1/d)/(s + 1))/(1 + (1/d)/(s + 1)) satisfies |x| ≤ 1/i ≤ 1 − 1/(2d). Define g(x) := (1 + x)^{1/d} for x ∈ [−1 + 1/(2d), 1 − 1/(2d)]. Then its derivative satisfies g'(x) = (1 + x)^{1/d−1}/d ≤ 2^{1−1/d}/d^{1/d}? < 2. By the mean value theorem, we have |(1 + x)^{1/d} − 1| ≤ 2|x| for x ∈ [−1 + 1/(2d), 1 − 1/(2d)]. This implies

| ( (i + 1/d − 1)/(s + 1 + 1/d) )^{1/d} − ( i/(s + 1) )^{1/d} | ≤ 2 i^{−1+1/d} (s + 1)^{−1/d}.   (40)

Next, we rewrite the right bound in (38) as

( (i + 1/d)/(s + 1/d) )^{1/d} = ( i/(s + 1) )^{1/d} ( (1 + (1/d)/i)/(1 − (1 − 1/d)/(s + 1)) )^{1/d}.

For s ≥ 2i ≥ 2, we derive 1 − (1 − 1/d)/(s + 1) ≥ 1 − 1/(s + 1) = s/(s + 1) ≥ 2/3 and (1/d)/i + (1 − 1/d)/(s + 1) ≥ 0. Using the inequality (1 + x)^{1/d} ≤ 1 + x/d for x ≥ 0, we obtain

( (1 + (1/d)/i)/(1 − (1 − 1/d)/(s + 1)) )^{1/d} ≤ 1 + (1/d)( (1/d)/i + (1 − 1/d)/(s + 1) )/(1 − (1 − 1/d)/(s + 1)) ≤ 1 + 3(1/d + 1)/(4di) ≤ 1 + 2/i,

where the second inequality follows from 1 − (1 − 1/d)/(s + 1) ≥ 2/3 and s ≥ 2i.
Therefore, we obtain

( (i + 1/d)/(s + 1/d) )^{1/d} ≤ (1 + 2/i)( i/(s + 1) )^{1/d}.

Combining this with (40), we conclude that

|(i/(s + 1))^{1/d} − γ_{s,i}| ≤ 2 i^{−1+1/d}(s + 1)^{−1/d} ≤ 2 i^{−1/2+1/d}(s + 1)^{−1/d}.

Therefore, we have 2^{p−1} |(i/(s + 1))^{1/d} − γ_{s,i}|^p ≤ ( 8p i^{−1/2+1/d}(s + 1)^{−1/d} )^p. Finally, combining the bounds for both terms and using (36), we obtain

E|X^{1/d} − γ_{s,i}|^p ≤ 3 ( 8p i^{−1/2+1/d} s^{−1/d} )^p.

This concludes the proof.

Lemma 16 Let s and i be integers with s ≥ 2i. Let X ~ Beta(i, s + 1 − i). Then X^{1/d} is sub-exponential with parameters (16e√3 i^{−1/2+1/d} s^{−1/d}, 16e i^{−1/2+1/d} s^{−1/d}).

Proof [of Lemma 16] Define the moment generating function (MGF) φ(λ) := E[exp(λ(X^{1/d} − γ_{s,i}))] and its logarithm ψ(λ) := log φ(λ), where γ_{s,i} is defined in (3). To verify the sub-exponential property, we evaluate the MGF. Using the Taylor expansion of the exponential function, we write

E[exp(λ(X^{1/d} − γ_{s,i}))] = E[ 1 + λ(X^{1/d} − γ_{s,i}) + Σ_{p≥2} λ^p (X^{1/d} − γ_{s,i})^p / p! ].

Since E[X^{1/d} − γ_{s,i}] = 0, the linear term vanishes, and we bound the higher-order terms using Lemma 15 as follows:

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ 1 + Σ_{p≥2} |λ|^p E|X^{1/d} − γ_{s,i}|^p / p! ≤ 1 + 3 Σ_{p≥2} (8 i^{−1/2+1/d} s^{−1/d} |λ|)^p p^p / p!.

Using Stirling's approximation p! ≥ (p/e)^p for p ≥ 1, we obtain

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ 1 + 3 Σ_{p≥2} (8e i^{−1/2+1/d} s^{−1/d} |λ|)^p.

The geometric series sums to

Σ_{p≥2} (8e i^{−1/2+1/d} s^{−1/d} |λ|)^p = 64e² i^{−1+2/d} s^{−2/d} λ² / (1 − 8e i^{−1/2+1/d} s^{−1/d} |λ|).

For |λ| ≤ i^{1/2−1/d} s^{1/d}/(16e), we have 8e i^{−1/2+1/d} s^{−1/d} |λ| ≤ 1/2, ensuring convergence. Substituting this bound, we get

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ 1 + 384e² i^{−1+2/d} s^{−2/d} λ².

Since 1 + x ≤ e^x for x ∈ R, we get 1 + 384e² i^{−1+2/d} s^{−2/d} λ² ≤ exp(384e² i^{−1+2/d} s^{−2/d} λ²). Therefore, we have

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ exp(384e² i^{−1+2/d} s^{−2/d} λ²),  |λ| ≤ i^{1/2−1/d} s^{1/d}/(16e).

This establishes that X^{1/d} is sub-exponential with parameters ζ = 16e√3 i^{−1/2+1/d} s^{−1/d} and α = 16e i^{−1/2+1/d} s^{−1/d}. This completes the proof.

Lemma 17 Let Assumption 2 hold. Additionally, let {D^b_s}_{b=1}^B be B disjoint subsets of size s randomly drawn from the data D_n. Let R^b_{s,(i)}(x) be the i-distance of x in the subset D^b_s. Define k̄ as in Proposition 1.
Suppose that B ≥ 2(d² + 4)(log n)/3 and s ≥ 2k̄. Furthermore, assume that there exist constants C'_{n,i} such that w^b_i ≤ C'_{n,i} for all b ∈ [B] and i ∈ [s]. Then, there exists n₁ := 2d^d + 1 such that for all x ∈ X, 1 ≤ i ≤ k̄, and all n > n₁, the following statement holds with probability P^{Bs} at least 1 − 1/n²:

| (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((log n)/B).   (41)

Proof [of Lemma 17] We first prove that (41) holds for a fixed x ∈ X and a fixed 1 ≤ i ≤ k̄. For any b ∈ [B], by (20), we have

( P(B(x, R^b_{s,(1)}(x))), . . . , P(B(x, R^b_{s,(s)}(x))) ) =_D ( U^b_{(1)}, . . . , U^b_{(s)} ),

where U^b_{(i)} is the i-th order statistic of i.i.d. uniform [0, 1] random variables. By Biau and Devroye (2015, Corollary 1.2), U^b_{(i)} ~ Beta(i, s + 1 − i). Let ξ_b := P(B(x, R^b_{s,(i)}(x))). Since {D^b_s}_{b=1}^B are independent, {ξ_b}_{b=1}^B are independent random variables following Beta(i, s + 1 − i). The desired inequality (41) then reduces to a concentration inequality for (1/B) Σ_b w^b_i ξ_b^{1/d}. To prove this, we begin by showing that (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) is a sub-exponential random variable. Since the ξ_b are independent, we have

E exp( λ (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) ) = Π_{b=1}^B E exp( (λ w^b_i / B)(ξ_b^{1/d} − γ_{s,i}) ).

Since s ≥ 2k̄, it follows that s + 1 − i ≥ i for 1 ≤ i ≤ k̄. Given the condition w^b_i ≤ C'_{n,i}, we have |λ w^b_i / B| ≤ i^{1/2−1/d} s^{1/d}/(16e) for all |λ| ≤ i^{1/2−1/d} s^{1/d} B/(16e C'_{n,i}). By leveraging the sub-exponential property of ξ_b^{1/d} stated in Lemma 16, we obtain

E exp( λ (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) ) ≤ exp( 384e² C'²_{n,i} i^{−1+2/d} s^{−2/d} λ²/B ),  |λ| ≤ i^{1/2−1/d} s^{1/d} B/(16e C'_{n,i}).

This shows that (1/B) Σ_b w^b_i (ξ_b^{1/d} − γ_{s,i}) is sub-exponential. Applying the sub-exponential tail bound in Lemma 9, we have

P( | (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) | ≥ 16e√6 C'_{n,i} i^{−1/2+1/d} s^{−1/d} √(τ/B) ) ≤ 2e^{−τ}

for all 0 ≤ τ ≤ 3B/2. Since B ≥ 2(d² + 4)(log n)/3, it follows that 3B/2 ≥ (d² + 4) log n. Taking τ := (d² + 4) log n and replacing ξ_b with P(B(x, R^b_{s,(i)}(x))), for a fixed x ∈ X and a fixed 1 ≤ i ≤ k̄, we have

P( | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≥ 16e√6 C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((d² + 4)(log n)/B) ) ≤ 2/n^{d²+4}.   (42)

To extend the upper bound to all x ∈ X, consider a 1/n^d-net {z_j}_{j=1}^J of [0, 1]^d.
A 1/n^d-net is a finite subset of X such that for any x ∈ X, there exists z_j in the net with ∥x − z_j∥₂ ≤ 1/n^d. The construction of such a net can be done by placing grid points spaced 1/(d n^d) apart in each of the d dimensions. This results in at most d n^d grid points per dimension, and thus the total number of grid points satisfies J ≤ (d n^d)^d ≤ d^d n^{d²}. By (42), the bound holds for each z_j with j ∈ [J], and using the union bound, it holds for all j ∈ [J] with probability P^{Bs} at least 1 − 2d^d/n⁴. Since {z_j}_{j=1}^J is a 1/n^d-net, for any x ∈ X, there exists z_j such that ∥x − z_j∥₂ ≤ 1/n^d. By Lemma 13, we have

| P(B(x, R^b_{s,(i)}(x))) − P(B(z_j, R^b_{s,(i)}(z_j))) | ≲ 1/n^d

for all b ∈ [B]. Using the Minkowski inequality, this implies

| P(B(x, R^b_{s,(i)}(x)))^{1/d} − P(B(z_j, R^b_{s,(i)}(z_j)))^{1/d} | ≤ | P(B(x, R^b_{s,(i)}(x))) − P(B(z_j, R^b_{s,(i)}(z_j))) |^{1/d} ≲ 1/n.

Combining this with the triangle inequality and the error bound for z_j, we obtain

| (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≤ (1/B) Σ_{b=1}^B w^b_i | P(B(x, R^b_{s,(i)}(x)))^{1/d} − P(B(z_j, R^b_{s,(i)}(z_j)))^{1/d} | + | (1/B) Σ_{b=1}^B w^b_i ( P(B(z_j, R^b_{s,(i)}(z_j)))^{1/d} − γ_{s,i} ) | ≲ C'_{n,i} ( i^{−1/2+1/d} s^{−1/d} √((log n)/B) + 1/n )

for all x ∈ X and a fixed i with probability P^{Bs} at least 1 − 2d^d/n⁴. By applying the union bound over i, the same holds for all x ∈ X and 1 ≤ i ≤ k̄ with probability P^{Bs} at least 1 − 2d^d/n³. For n > n₁ := 2d^d + 1, we note that i ≤ s ≤ n, B ≤ n, and log n ≥ 1 imply i^{−1/2+1/d} s^{−1/d} √((log n)/B) ≥ 1/n for all 1 ≤ i ≤ k̄, so the 1/n term can be absorbed into the first term. Combining this with the previous inequality, we conclude that

| (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((log n)/B)

for all x ∈ X, 1 ≤ i ≤ k̄, and n > n₁, with probability P^{Bs} at least 1 − 2d^d/n³ ≥ 1 − 1/n². This completes the proof.

Proof [of Proposition 4] Let Ω₁ denote the event defined by (21) and (22) in Lemma 12, and let Ω₂ denote the event defined by (41) in Lemma 17. By applying the union bound, the event Ω₁ ∩ Ω₂ holds with probability P^{Bs} at least 1 − 2/n² for all n > n₁, where n₁ = 2d^d + 1. The condition Σ_{i=1}^s w^b_i i^{1/d} ≳ k_b^{1/d} in (iii) implies the existence of a constant c'₄ > 0 such that Σ_{i=1}^s w^b_i (i/s)^{1/d} ≥ c'₄ (k_b/s)^{1/d} for all b ∈ [B].
On the other hand, the condition Σ_{i=1}^{c_n} w^b_i ≲ (log n)/k_b in (i) and the bound k̲ ≥ (log n)² together imply the existence of n₂ ∈ N such that for all n ≥ n₂, we have Σ_{i=1}^{c_n} w^b_i ≤ c'₄/2 and k_b ≥ c_n for all b ∈ [B], where c_n is the sequence specified in Lemma 12. Hence, we consider the subsequent arguments under the assumptions that the event Ω₁ ∩ Ω₂ holds and that n > N₁ := max{n₁, n₂}.

Proof of Bounding (I). Let (IV) and (V) be defined as follows:

(IV) := Σ_{j=0}^{d−1} ( Σ_{i=1}^s w^b_i (i/s)^{1/d} )^j ( V_d^{1/d} f(x)^{1/d} R^B_n(x) )^{d−1−j}  and  (V) := V_d R^B_n(x)^d.

By the equality (14), in order to derive the upper bound of (I), it suffices to derive the upper bound of (IV) and the lower bound of (V). Let us first consider (IV). By condition (iii), we have Σ_{i=1}^s w^b_i (i/s)^{1/d} ≲ (k_b/s)^{1/d} ≤ (k̄/s)^{1/d} for b ∈ [B]. Consequently we get

( Σ_{i=1}^s w^b_i (i/s)^{1/d} )^j ≲ (k̄/s)^{j/d}.   (43)

On the event Ω₁, we have

R^{w,b}_s(x) = Σ_{i=1}^s w^b_i R^b_{s,(i)}(x) = Σ_{i=1}^{c_n} w^b_i R^b_{s,(i)}(x) + Σ_{i=c_n+1}^s w^b_i R^b_{s,(i)}(x) ≤ R^b_{s,(c_n)}(x) + Σ_{i=c_n+1}^s w^b_i R^b_{s,(i)}(x) ≲ (c_n/s)^{1/d} + Σ_{i=1}^s w^b_i (i/s)^{1/d} ≲ (k̄/s)^{1/d}

for all x ∈ X, where the last inequality follows from Σ_{i=1}^s w^b_i i^{1/d} ≲ k̄^{1/d} in condition (iii) and k̲ ≥ (log n)² in condition (i). Consequently we obtain

R^B_n(x) = (1/B) Σ_{b=1}^B R^{w,b}_s(x) ≲ (k̄/s)^{1/d}   (44)

for all x ∈ X. Combining this with (43) and f ≤ c̄ from Assumption 2, we derive

(IV) ≲ Σ_{j=0}^{d−1} (k̄/s)^{j/d} (k̄/s)^{(d−1−j)/d} ≲ (k̄/s)^{(d−1)/d}.   (45)

Next, let us consider (V). On the event Ω₁, for any b ∈ [B], we have

Σ_{i=c_n+1}^s w^b_i R^b_{s,(i)}(x) ≳ Σ_{i=c_n+1}^s w^b_i (i/s)^{1/d} = Σ_{i=1}^s w^b_i (i/s)^{1/d} − Σ_{i=1}^{c_n} w^b_i (i/s)^{1/d}

for all x ∈ X. From earlier arguments in the second paragraph of this proof, we have Σ_{i=1}^s w^b_i (i/s)^{1/d} ≥ c'₄ (k_b/s)^{1/d} and Σ_{i=1}^{c_n} w^b_i (i/s)^{1/d} ≤ (c_n/s)^{1/d} Σ_{i=1}^{c_n} w^b_i ≤ c'₄ (k_b/s)^{1/d}/2 for all n > N₁. Therefore, we have

Σ_{i=1}^s w^b_i (i/s)^{1/d} − Σ_{i=1}^{c_n} w^b_i (i/s)^{1/d} ≥ c'₄ (k_b/s)^{1/d}/2   (46)

for all x ∈ X. This implies

R^B_n(x) = (1/B) Σ_{b=1}^B R^{w,b}_s(x) ≳ (k̲/s)^{1/d}   (47)

for all x ∈ X. Therefore, we get (V) = V_d R^B_n(x)^d ≳ k̲/s. Combining with (45) and k̲ ≍ k̄ in condition (ii), we conclude that (I) = (IV)/(V) ≲ (k̄/s)^{−1/d}.
This completes the proof of bounding (I).

Proof of Bounding (II). Since w^b_i = 0 for all i > k̄ and b ∈ [B], it follows that

Σ_{i=1}^s | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | = Σ_{i=1}^{k̄} | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) |.

By applying (41) on the event Ω₂, for all x ∈ X, we have

Σ_{i=1}^{k̄} | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ Σ_{i=1}^{k̄} C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((log n)/B).

Given Σ_{i=1}^s C'_{n,i} i^{1/d−1/2} ≲ k̄^{1/d−1/2} in condition (iv), we obtain

Σ_{i=1}^{k̄} | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ (k̄/s)^{1/d} ( (log n)/(k̄B) )^{1/2}.

This completes the proof of bounding (II).

Proof of Bounding (III). Let b ∈ [B] be fixed for now. We analyze two cases separately: 1 ≤ i < c_n and c_n ≤ i ≤ k_b. We begin with the case 1 ≤ i < c_n. On the event Ω₁, we have R^b_{s,(i)}(x) ≤ R^b_{s,(c_n)}(x) ≲ (c_n/s)^{1/d} and P(B(x, R^b_{s,(i)}(x))) ≤ P(B(x, R^b_{s,(c_n)}(x))) ≲ c_n/s for all x ∈ X by using (21) and (22). These bounds imply

| P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ (c_n/s)^{1/d} ≲ ((log n)/s)^{1/d}.

Therefore, we have

Σ_{i=1}^{c_n} w^b_i | P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ ((log n)/s)^{1/d} Σ_{i=1}^{c_n} w^b_i.

Using Σ_{i=1}^{c_n} w^b_i ≲ (log n)/k_b from condition (i) and k̲ ≍ k̄ from condition (ii), we obtain

Σ_{i=1}^{c_n} w^b_i | P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ (log n)^{1+1/d} / (s^{1/d} k̲)   (48)

for all x ∈ X. Now, we consider the case c_n ≤ i ≤ k_b. Using (15) and the condition f ≥ c from Assumption 2, we get

| P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ | P(B(x, R^b_{s,(i)}(x))) − V_d f(x) R^b_{s,(i)}(x)^d | / Σ_{j=0}^{d−1} P(B(x, R^b_{s,(i)}(x)))^{j/d} R^b_{s,(i)}(x)^{d−1−j}   (49)

for all x ∈ X. Next, consider x such that B(x, R^b_{s,(k_b)}(x)) ⊂ [0, 1]^d for all b ∈ [B]. Using the Lipschitz smoothness from Assumption 2, we obtain

| P(B(x, R^b_{s,(i)}(x))) − V_d f(x) R^b_{s,(i)}(x)^d | = | ∫_{B(x, R^b_{s,(i)}(x))} f(y) dy − ∫_{B(x, R^b_{s,(i)}(x))} f(x) dy | ≤ ∫_{B(x, R^b_{s,(i)}(x))} |f(y) − f(x)| dy ≲ ∫_{B(x, R^b_{s,(i)}(x))} ∥y − x∥₂ dy ≲ R^b_{s,(i)}(x)^{d+1}.

On the event Ω₁, we have R^b_{s,(i)}(x) ≲ (i/s)^{1/d} for c_n ≤ i ≤ k_b. This implies

| P(B(x, R^b_{s,(i)}(x))) − V_d f(x) R^b_{s,(i)}(x)^d | ≲ (i/s)^{1+1/d}.   (50)

Moreover, using f ≥ c in Assumption 2 and R^b_{s,(i)}(x) ≳ (i/s)^{1/d} on the event Ω₁, we have P(B(x, R^b_{s,(i)}(x))) ≳ i/s for c_n ≤ i ≤ k_b.
Consequently, we obtain
$$\sum_{j=0}^{d-1} P(B(x, R^b_{s,(i)}(x)))^{j/d} R^b_{s,(i)}(x)^{d-1-j} \asymp \sum_{j=0}^{d-1} (i/s)^{j/d} (i/s)^{(d-1-j)/d} \asymp (i/s)^{(d-1)/d}.$$
This together with (49) and (50) implies
$$\big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim (i/s)^{2/d} \qquad (51)$$
for $c_n \le i \le k^b$. Therefore, using $\sum_{i=1}^{s} w^b_i i^{1/d} \lesssim (k^b)^{1/d}$ in condition (iii), we have
$$\sum_{i=c_n}^{s} w^b_i \big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim \sum_{i=c_n}^{s} w^b_i (i/s)^{2/d} \le (k/s)^{1/d} \sum_{i=1}^{s} w^b_i (i/s)^{1/d} \lesssim (k/s)^{1/d} (k^b/s)^{1/d} \lesssim (k/s)^{2/d}.$$
Combining this with (48), we obtain
$$\sum_{i=1}^{s} w^b_i \big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim \frac{(\log n)^{1+1/d}}{s^{1/d} k} + (k/s)^{2/d}$$
for all $x \in \mathcal{X}$ and a fixed $b \in [B]$. Averaging over $b \in [B]$, we have
$$\frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{s} w^b_i \big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim \frac{(\log n)^{1+1/d}}{s^{1/d} k} + (k/s)^{2/d}.$$
This completes the proof of bounding (III), and hence the proof of Proposition 4.

6.1.2 Proofs Related to Section 2.3

To prove Proposition 1, we require the following lemma, which establishes an upper bound on the number of instances near the boundary.

Lemma 18 Let $\{D^b_s\}_{b=1}^B$ be $B$ disjoint subsets of size $s$ randomly drawn from the data $D_n$, with $D^b_s = \{X^b_1, \ldots, X^b_s\}$ for $b \in [B]$. Define $\Delta_n := [c_3 (k/s)^{1/d},\, 1 - c_3 (k/s)^{1/d}]^d$, where $c_3$ is the constant from Lemma 12 and $k$ is as defined in Proposition 1. Assume that the condition $k \gtrsim (\log n)^2$ holds. Define $I^b := \{ i \in [s] : X^b_i \in \Delta_n \}$ and $n_b := |I^b|$. Then, for all $n > N_1$ with $N_1$ specified in Proposition 4, there holds $1 - n_b/s \lesssim (k/s)^{1/d}$ for all $b \in [B]$ with probability $P^{Bs}$ at least $1 - 1/n^2$.

Proof [of Lemma 18] Let $\Delta_n^c := [0,1]^d \setminus \Delta_n$ and let $b \in [B]$ be fixed. For $\ell \in [s]$, we define $\xi_{\ell,b} := \mathbf{1}_{\Delta_n^c}(X^b_\ell) - P(X \in \Delta_n^c)$. Then we have $\mathbb{E}_P \xi_{\ell,b} = 0$ and $\mathbb{E}_P [\xi_{\ell,b}]^2 \le P(X \in \Delta_n^c)$. Given $f \le \overline{c}$ in Assumption 2, we have $P(X \in \Delta_n^c) \le \overline{c}\, \mu(\Delta_n^c) \lesssim (k/s)^{1/d}$. Therefore, we have $\mathbb{E}_P [\xi_{\ell,b}]^2 \lesssim (k/s)^{1/d}$. Applying Bernstein's inequality in Lemma 7, we obtain
$$\frac{1}{s} \sum_{\ell=1}^{s} \mathbf{1}_{\Delta_n^c}(X^b_\ell) - P(X \in \Delta_n^c) \lesssim \sqrt{2 (k/s)^{1/d} \tau / s} + 2\tau/(3s)$$
with probability $P^s$ at least $1 - e^{-\tau}$.
Setting $\tau := 3 \log n$ and using $P(X \in \Delta_n^c) \lesssim (k/s)^{1/d}$, we obtain
$$\frac{1}{s} \sum_{\ell=1}^{s} \mathbf{1}_{\Delta_n^c}(X^b_\ell) \lesssim (k/s)^{1/d} + \sqrt{(k/s)^{1/d} (\log n)/s} + (\log n)/s$$
with probability $P^s$ at least $1 - 1/n^3$. By the definition of $N_1$ in Proposition 4, we have $k \ge c_n$ for all $n > N_1$, where $c_n$ is the sequence specified in Lemma 12. Therefore, we obtain $(k/s)^{1/d} \ge k/s \ge c_n/s$. This yields
$$(k/s)^{1/d} + \sqrt{(k/s)^{1/d} (\log n)/s} + (\log n)/s \lesssim (k/s)^{1/d}$$
with probability $P^s$ at least $1 - 1/n^3$. Using the union bound, this inequality holds for all $b \in [B]$ with probability $P^{Bs}$ at least $1 - 1/n^2$. This completes the proof.

Proof [of Proposition 1] Let $\Omega_1$ denote the event defined by (21) and (22) in Lemma 12. Furthermore, let $\Omega_3$ be the event defined by the inequalities for the upper bounds of (I), (II), and (III) in Proposition 4, and let $\Omega_4$ be the event defined by the inequality for the upper bound of $1 - n_b/s$ in Lemma 18. By applying the union bound to Lemma 12, Lemma 18, and Proposition 4, the event $\Omega_1 \cap \Omega_3 \cap \Omega_4$ holds with probability $P^{Bs}$ at least $1 - 1/n^2 - 1/n^2 - 2/n^2 = 1 - 4/n^2$ for all $n > N_1' := N_1$, where $N_1$ is specified in Proposition 4. The subsequent arguments assume that $\Omega_1 \cap \Omega_3 \cap \Omega_4$ holds and $n > N_1'$.

For $X^b_i$ satisfying $B(X^b_i, R^b_{s,(k^b)}(X^b_i)) \subset [0,1]^d$ for all $b \in [B]$, using the bounds of (I), (II), and (III) on the event $\Omega_3$ together with (19), we get
$$L(X^b_i, f^B_n) = |f^B_n(X^b_i) - f(X^b_i)| \lesssim ((\log n)/k)^{1+1/d} + ((\log n)/(k B))^{1/2} + (k/s)^{1/d}. \qquad (52)$$
The condition $B \lesssim (k/(\log n))^{1+2/d}$ implies that
$$((\log n)/k)^{1+1/d} \lesssim ((\log n)/(k B))^{1/2}. \qquad (53)$$
The conditions $\|w^b\|_2 \asymp (k^b)^{-1/2}$ for $b \in [B]$ and $k \lesssim k^b$ yield that $\|w^b\|_2 \gtrsim (k^b)^{-1/2} \gtrsim k^{-1/2}$. Combining this with the condition $\log s \asymp \log n$ in (ii), we obtain
$$((\log n)/(k B))^{1/2} \lesssim \sqrt{(\log s)/B}\, \|w^b\|_2. \qquad (54)$$
By (46) in the proof of Proposition 4 and the condition $k \lesssim k^b$ in (ii), we obtain that for all $n > N_1$,
$$(k/s)^{1/d} \lesssim (k^b/s)^{1/d} \lesssim R^{w,b}_s(X^b_i)$$
on the event $\Omega_1$. Combining this with (52), (53), and (54), it follows that
$$|f^B_n(X^b_i) - f(X^b_i)| \lesssim \sqrt{(\log s)/B}\, \|w^b\|_2 + R^{w,b}_s(X^b_i).$$
This completes the proof of (5).
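The bagged weighted k-distances $R^B_n(x) = \frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i R^b_{s,(i)}(x)$ analyzed throughout these proofs admit a direct computation. The following minimal sketch (not the authors' implementation; the brute-force distance computation and the function name are our assumptions, and the weights are supplied externally rather than by SRM) illustrates the quantity:

```python
import numpy as np

def bagged_weighted_kdistance(X, queries, weights, B, s, seed=0):
    """Bagged weighted k-distance R^B_n(x) = (1/B) sum_b sum_i w_i^b R^b_{s,(i)}(x),
    where R^b_{s,(i)}(x) is the distance from x to its i-th nearest neighbor
    in the b-th disjoint subsample of size s.  Larger scores indicate lower
    local density, i.e. more anomalous points."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))          # split D_n into B disjoint subsets of size s
    scores = np.zeros(len(queries))
    for b in range(B):
        Db = X[perm[b * s:(b + 1) * s]]
        w = np.asarray(weights[b])          # w_i^b >= 0, sums to 1, zero beyond k^b
        # sorted distances R^b_{s,(1)}(x) <= ... <= R^b_{s,(len(w))}(x)
        dists = np.linalg.norm(queries[:, None, :] - Db[None, :, :], axis=2)
        scores += np.sort(dists, axis=1)[:, :len(w)] @ w
    return scores / B
```

In practice the sorted-neighbor distances would be obtained from a k-d tree rather than a full distance matrix; the brute-force version above only serves to make the definition of $R^B_n$ concrete.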
Next, let us turn to the proof of (6). Let $\Delta_n$, $I^b$, and $n_b$ be defined as in Lemma 18. Then we have the decomposition
$$\mathcal{R}_{L, D^B_s}(f^B_n) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{s} \sum_{i=1}^{s} |f^B_n(X^b_i) - f(X^b_i)| = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{s} \bigg( \sum_{i \in I^b} |f^B_n(X^b_i) - f(X^b_i)| + \sum_{i \notin I^b} |f^B_n(X^b_i) - f(X^b_i)| \bigg). \qquad (55)$$
Let $b \in [B]$ be fixed for now. We consider the first term on the right-hand side of (55). For any $i \in I^b$, we have $X^b_i \in \Delta_n$. Consequently, for any $b \in [B]$ and $y \in B(X^b_i, R^b_{s,(k^b)}(X^b_i))$, we obtain $d(y, \mathbb{R}^d \setminus [0,1]^d) \ge c_3 (k/s)^{1/d} - R^b_{s,(k^b)}(X^b_i) \ge 0$ on the event $\Omega_1$. This implies that $y \in [0,1]^d$ and thus $B(X^b_i, R^b_{s,(k^b)}(X^b_i)) \subset [0,1]^d$ for all $i \in I^b$ and $b \in [B]$. Therefore, by (5), we have
$$\frac{1}{s} \sum_{i \in I^b} |f^B_n(X^b_i) - f(X^b_i)| \lesssim \frac{1}{s} \sum_{i \in I^b} \Big( \sqrt{(\log s)/B}\, \|w^b\|_2 + R^{w,b}_s(X^b_i) \Big) \le \sqrt{(\log s)/B}\, \|w^b\|_2 + \frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i). \qquad (56)$$
Now, we consider the second term on the right-hand side of (55). The condition $\sum_{i=1}^{s} i^{1/d} w^b_i \lesssim (k^b)^{1/d}$ for $b \in [B]$, together with (47) in the proof of Proposition 4, implies that for all $x \in [0,1]^d$, there holds
$$f^B_n(x) = \frac{1}{V_d R^B_n(x)^d} \bigg( \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{s} w^b_i (i/s)^{1/d} \bigg)^d \lesssim \frac{k/s}{R^B_n(x)^d} \lesssim 1$$
on the event $\Omega_1$. Combining this with $f \le \overline{c}$ in Assumption 2, on the event $\Omega_3$, we get
$$\sum_{i \notin I^b} |f^B_n(X^b_i) - f(X^b_i)| \lesssim \#\{ i \in [s] : X^b_i \in \Delta_n^c \} = s - n_b \lesssim s (k^b/s)^{1/d} \lesssim s (k/s)^{1/d}.$$
By (46) in the proof of Proposition 4, we have $(k/s)^{1/d} \lesssim R^{w,b}_s(X^b_i)$ for all $i \in [s]$ on the event $\Omega_1$. Consequently, we obtain
$$\frac{1}{s} \sum_{i \notin I^b} |f^B_n(X^b_i) - f(X^b_i)| \lesssim (k/s)^{1/d} \lesssim \frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i).$$
Combining this with (55) and (56), we obtain
$$\mathcal{R}_{L, D^B_s}(f^B_n) \lesssim \mathcal{R}^{\mathrm{sur}}_{L, D^B_s}(f^B_n) := \frac{1}{B} \sum_{b=1}^{B} \bigg( \sqrt{(\log s)/B}\, \|w^b\|_2 + \frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i) \bigg).$$
Since $\frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i) = \sum_{i=1}^{s} w^b_i \overline{R}^b_{s,(i)}$, we obtain the desired assertion.

6.2 Proofs Related to the Convergence Rates of BRDDE

We present the proofs related to the results concerning the surrogate risk minimization in Section 6.2.1. Additionally, the proof of Theorem 2 is provided in Section 6.2.2.

6.2.1 Proofs Related to Section 4.2.1

The following lemma, which will be used several times in the sequel, supplies the key to the proof of Proposition 5.
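Lemma 19 below characterizes the SRM solution in closed form, following Anava and Levy (2016): the optimal weights decay linearly in the average $i$-distances up to a data-driven cutoff $k^{b,*}$, with the KKT conditions forcing $\sum_{i \le k^{b,*}} (\mu^b - \overline{R}^b_{s,(i)})^2 = (\log s)/B$. A minimal sketch of this computation (the function name and the nonnegativity guard on the discriminant are our assumptions, not the paper's code):

```python
import numpy as np

def srm_weights(R_sorted, lam2):
    """Solve  min_w  sqrt(lam2) * ||w||_2 + sum_i w_i * R_(i)  over the simplex.

    Closed form (Anava & Levy, 2016): w_i is proportional to (mu - R_(i))_+,
    where mu solves  sum_{i<=k} (mu - R_(i))^2 = lam2  and  R_(k) < mu <= R_(k+1).
    Here lam2 plays the role of (log s)/B in the paper's SRM objective."""
    R = np.asarray(R_sorted, dtype=float)   # sorted: R_(1) <= R_(2) <= ...
    s = len(R)
    S = Q = 0.0
    for k in range(1, s + 1):
        S += R[k - 1]                       # running sum of the k smallest distances
        Q += R[k - 1] ** 2                  # running sum of squares
        # k*mu^2 - 2*S*mu + (Q - lam2) = 0; take the larger root
        disc = S * S - k * (Q - lam2)
        mu = (S + np.sqrt(max(disc, 0.0))) / k
        if k == s or mu <= R[k]:            # stopping rule: mu <= R_(k+1)
            break
    w = np.maximum(mu - R, 0.0)             # linear decay, zero beyond the cutoff
    return w / w.sum(), k
```

For instance, with distances (0, 0, 10) and `lam2 = 1`, the third distance is too large to receive weight, and the first two share it equally.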
Cai, Yang, Ma, Hang Lemma 19 Let R b s,(i) be defined as in Proposition 1, and let wb, and kb, be defined as in Proposition 5. Then, for each b [B], there exists a constant µb > 0 such that R b s,(kb, ) < µb R b s,(kb, +1). (For simplicity, we set R b s,(i) = for all i > s.) Moreover, the optimal weights satisfy wb, i = µb R b s,(i) Pkb, i=1 µb R b s,(i) , 1 i kb, . (57) Moreover, the weights wb, i are bounded as follows: R b s,(kb, ) R b s,(i) Pkb, i=1 R b s,(kb, +1) R b s,(i) wb, i R b s,(kb, +1) R b s,(i) Pkb, i=1 R b s,(kb, ) R b s,(i) , 1 i kb, . (58) Additionally, the following inequality holds: R b s,(kb, ) R b s,(i) 2 < log s R b s,(kb, +1) R b s,(i) 2. (59) Proof [of Lemma 19] By Theorem 3.1 in Anava and Levy (2016), for each b [B], there exists a constant µb > 0 such that the weights satisfy the formula in (57). This inequality, together with (57), implies that the bound in (58) holds. Moreover, by (8), we have p (log s)/B wb, i / wb, 2 = µb + νb i R b s,(i). For 1 i kb, , we have wb, i > 0, this implies that νb i = 0 by the KKT condition. Therefore, we have p (log s)/B wb, i / wb, 2 = µb R b s,(i), 1 i kb, . Now, summing over i from 1 to kb, , we get µb R b s,(i) 2 = log s i=1 (wb, i )2/ wb, 2 2 = log s This, along with the inequality R b s,(kb, ) < µb R b s,(kb, +1), establishes (59). Proof [of Proposition 5] Let c 2 and c 3 be the constants defined in (21), and let {cn} be the sequence from Lemma 12. Define c 4 := (c 2/c 3)d < 1. Since s (n/ log n)(d+1)/(d+2) and B n1/(d+2)(log n)(d+1)/(d+2), there exists constants c 5, c 6, and c 7 such that c 5 s (n/ log n)(d+1)/(d+2) c 6 and B c 7n1/(d+2)(log n)(d+1)/(d+2). (61) The choice of s implies that log s log(n) log(log n) log n. Consequently, we have s2/dc 1 2/d n log s n 2(d+1) d(d+2) (log n) 4d+6 Bagged Regularized k-Distances for Anomaly Detection Using the order of B, we know that there exists n3 N such that for all n > n3, there holds s2/dc 1 2/d n (log s)/(32/d+1(c 3)2c 1 2/d 4 ) > B. 
(62) Furthermore, from the order of s and B, we see that there exists n4 N such that for all n > n4, the following three inequalities hold: c 5(n/ log n)(d+1)/(d+2) c 1, (63) c 7n1/(d+2)(log n)(d+1)/(d+2) 2(d2 + 4)(log n)/3, (64) c 5(n/ log n)d/(d+2) c 8 := 4(c 6) 2 d+2 c 4((c 3)2c 7) d 2+d 1/2 (1 t1/d)2 dt d d+2 , (65) where c 1 is the constant specified in Lemma 12. As we will demonstrate in the second part of this argument, the inequalities (63) and (65), together with the condition n > (c 6)2+d , guarantee that the lower bound for s is satisfied. Similarly, inequality (64) ensures that the lower bound for B holds. Additionally, by the divergence of the sequence cn, there exists n5 N such that for all n n5, we have 1/2 t1/d(1 t1/d) dt. (66) Let Ω1 denote the event defined by (21) and (22) in Lemma 12. By Lemma 12, the event Ω1 holds with probability PBs at least 1 1/n2. For the remainder of the proof, we assume that Ω1 holds and that n > N2 := max{n3, n4, n5, (c 6)2+d }. We proceed with the proof of statement 1. Verification of Condition (i): We first show that kb, 2cn/c 4 for all b [B] by contradiction. Suppose that kb, < 2cn/c 4 for some b [B]. Since c 4 < 1, it follows that cn/c 4 > cn > 2. Therefore, we have kb, + 1 < 2cn/c 4 + 1 2cn/c 4 + 2 3cn/c 4. This leads to the bound R b s,(kb, +1) c 3((kb, + 1)/s)1/d 31/dc 3c 1/d 4 (cn/s)1/d on the event Ω1. Therefore, we obtain R b s,(kb, +1) R b s,(i) 2 kb, R b s,(kb, +1) 2 32/d+1(c 3)2c 1 2/d 4 c1+2/d n s 2/d. Combining this with (62), we conclude that Pkb, i=1 R b s,(kb, +1) R b s,(i) 2 < (log s)/B. How- ever, by (59) in Lemma 19, we know that (log s)/B Pkb, i=1 R b s,(kb, +1) R b s,(i) 2. This leads to a contradiction, implying that kb, 2cn/c 4 > cn, b [B]. (67) Let b [B] be fixed. By Lemma 19, we have R b s,(kb, +1) R b s,(i) R b s,(kb, ) R b s,(i) . (68) Cai, Yang, Ma, Hang On the event Ω1, we have the following upper bound for the numerator: R b s,(kb, +1) R b s,(i) cn R b s,(kb, +1) cn(kb, /s)1/d. 
(69) Next, we establish the lower bound for Pkb, i=1 R b s,(kb, ) R b s,(i) . Since kb, > cn by (67), we have R b s,(kb, ) c 2 kb, /s 1/d = c 3 c 4kb, /s 1/d c 3 c 4kb, /s 1/d and R b s,(i) c 3(i/s)1/d for cn i s on the event Ω1. Therefore, we obtain R b s,(kb, ) R b s,(i) c 3 c 4kb, /s 1/d c 3(i/s)1/d. (70) for cn i s. Since c 4kb, 2cn by (67), we obtain R b s,(kb, ) R b s,(i) R b s,(kb, ) R b s,(i) (kb, )1/d+1 s1/d 1 c 4kb, Since g1(t) := 1 t1/d is a monotonically decreasing function for 0 t 1, we have R b s,(kb, ) R b s,(i) (kb, )1/d+1 cn/ c 4kb, g1(t) dt (kb, )1/d+1 1/2 g1(t) dt (kb, )1/d+1s 1/d. (71) Combining this with the upper bound in (69) and inequality (68), we get Pcn i=1 wb, i (log n)/kb, for b [B]. Hence, we verify condition (i) in Proposition 1. Verification of Condition (ii): Let b [B] be fixed. Since kb, > cn by (67), we have R b s,(kb, +1) R b s,(i) 2 R b s,(kb, +1) 2 (kb, )2/d+1s 2/d. on the event Ω1. Combining this with (59) in Lemma 19, we have (log s)/B (kb, )2/d+1s 2/d, which implies the lower bound kb, s2/(2+d)((log s)/B)d/(2+d). Combining this with the choice of s and B, we get the lower bound kb, (n/ log n)1/(d+2). Next, we derive the upper bound of kb, . Using the lower bound in (70) again, we have R b s,(kb, ) R b s,(i) 2 R b s,(kb, ) R b s,(i) 2 (c 3)2 c 4kb, X c 4kb, /s 1/d (i/s)1/d 2 Bagged Regularized k-Distances for Anomaly Detection = (c 3)2 c 4kb, 2/d+1 s2/d 1 c 4kb, Since c 4kb, 2cn > 144 by (67), we have c 4kb, c 4kb, /2. Therefore, we get R b s,(kb, ) R b s,(i) 2 (c 3)2(c 4)2/d+1 1/2 (1 t1/d)2 dt (kb, )2/d+1s 2/d. Combining this with (59) in Lemma 19, we obtain B > (c 3)2(c 4)2/d+1 1/2 (1 t1/d)2 dt (kb, )2/d+1s 2/d. (72) Since n N2 (c 6)2+d, using the choice of s in (61), we get log s log c 6 + d + 1 d + 2 log n log n. 
Using this bound with the choices of B and s in (61), and the inequality (72), we derive kb, 2 c 4(c 3)2d/(2+d) 1/2 (1 t1/d)2 dt d d+2 log s d d+2 s 2 d+2 2(c 6) 2 d+2 c 4((c 3)2c 7) d d+2 1/2 (1 t1/d)2 dt d d+2 n log n 1 d+2 = c 8 2 1 d+2 . (73) Combining the lower and upper bounds of kb, , we conclude kb, (n/ log n)1/(d+2), b [B]. (74) The inequalities (61) and (63) ensure that s c 1 for all n > N2. Combining the upper bound of kb, in (73) with the choice of s in (61) and the inequality (65) yields s 2k for all n > N2. Therefore, we have s max{c 1, 2k} for all n > N2 and it is straightforward to verify that log s log n. Additionally, the choice of B in (61) combined with (64) ensures that B 2(d2 + 4)(log n)/3 for all n > N2. Moreover, using (74) and the choices of B and s, we obtain B (k/(log n))1+2/d. This complete the verification of condition (ii) in Proposition 1. Verification of Condition (iii): Fix b [B]. By inequality (58) in Lemma 19, we have Pkb, i=1 i1/d R b s,(kb, ) R b s,(i) R b s,(kb, +1) R b s,(i) i=1 i1/dwb, i Pkb, i=1 i1/d R b s,(kb, +1) R b s,(i) R b s,(kb, ) R b s,(i) . (75) We first evaluate the numerator on the left-hand side. Since c 4 < 1 and c 4kb, 2cn by (67), we have R b s,(kb, ) R b s,(i) R b s,(kb, ) R b s,(i) Cai, Yang, Ma, Hang Applying inequality (70), we obtain R b s,(kb, ) R b s,(i) s 1/d c 4kb, X i=cn i1/d c 4kb, 1/d i1/d s 1/d(kb, )2/d+1 1 c 4kb, 1/d 1 i c 4kb, Define g2(t) := t1/d(1 t1/d) for t [0, 1]. Since g2(t) 1 for t [0, 1], we obtain the lower bound R b s,(kb, ) R b s,(i) (kb, )2/d+1 (cn 1)/ c 4kb, g2(t) dt 2 c 4kb, From (66) and (67), we have c 4kb, 1 1/(2cn) R 1 1/2 g2(t) dt/4 and (cn 1)/ c 4kb, 1/2 for all n > N2. Therefore, we obtain the following lower bound: R b s,(kb, ) R b s,(i) (kb, )2/d+1 1/2 g2(t) dt (kb, )2/d+1 On the other hand, on the event Ω1, since kb, > cn by (67), we have the upper bound: R b s,(kb, +1) R b s,(i) kb, R b s,(kb, +1) (kb, )1+1/ds 1/d. 
Combining these bounds with (75), we obtain Pkb, i=1 i1/dwb, i (kb, )1/d. By similar arguments, we can also derive the upper bound Pkb, i=1 i1/dwb, i (kb, )1/d from the right-hand term of inequality (75). Hence, we verify the first part of condition (iii). Using the formula for wb, i in (57), along with the equality (60), and noting that R b s,(kb, ) < µb R b s,(kb, +1) from Lemma 19, we obtain q Pkb, i=1 µb R b s,(i) 2 Pkb, i=1 µb R b s,(i) = ((log s)/B)1/2 Pkb, i=1 µb R b s,(i) < ((log s)/B)1/2 Pkb, i=1(R b s,(kb, ) R b s,(i)) . Combining this with (71), the choice of B and s, and the order of kb, in (74), we derive wb, 2 ((log s)/B)1/2/ (kb, )1/d+1s 1/d (kb, ) 1/2. Finally, using the Cauchy Schwarz inequality, we obtain 1 = wb, 2 1 kb, wb, 2 2. This implies the lower bound wb, 2 (kb, ) 1/2. Therefore, combining the upper and lower bounds for wb, 2, we verify the second part of condition (iii). This completes the verification of Condition (iii) in Proposition 1. Bagged Regularized k-Distances for Anomaly Detection Verification of Condition (iv): Let c 3 be the constant specified in Lemma 12, and let c 8 be the constant defined in (65). We define hn := c 8(n/(log n))1/(d+2)/2 + 1, and define the following sequence ( c 3(hn/s)1/d, 1 i (hn s); 0, otherwise. By (73), we have kb, + 1 c 8(n/ log n)1/(d+2)/2 + 1 hn for all b [B]. Therefore, for 1 i kb, , we have i (hn s), and thus Cn,i = c 3(hn/s)1/d. By (67), we have kb, > cn for all b [B]. Therefore, on the event Ω1, we have Cn,i = c 3(hn/s)1/d c 3((kb, + 1)/s)1/d R b s,(kb, +1) (77) for 1 i kb, and b [B]. Using the expression for kb, in (74), the choice of s, and the inequality (71), we derive R b s,(kb, ) R b s,(i) n log n d+2 n log n Consequently, using (58) from Lemma 19 and inequality (77), we obtain R b s,(kb, +1) R b s,(i) Pkb, i=1 R b s,(kb, ) R b s,(i) R b s,(kb, +1) R b s,(i) R b s,(kb, +1) Cn,i for 1 i kb, and b [B]. Since wb, i = 0 for i > kb, and b [B], we conclude that wb, i Cn,i for all b [B] and i [s]. 
We now analyze the summation $\sum_{i=1}^{s} i^{1/d - 1/2} C_{n,i}$. Since $\lceil h_n \rceil \wedge s \le h_n$, we have
$$\sum_{i=1}^{s} i^{1/d - 1/2} C_{n,i} = c_3 \sum_{i=1}^{\lceil h_n \rceil \wedge s} i^{1/d - 1/2} (h_n/s)^{1/d} \lesssim h_n^{1/d + 1/2} (h_n/s)^{1/d} \lesssim (n/\log n)^{\frac{2-d}{2d(d+2)}}.$$
On the other hand, from (74), we know that $k \lesssim (n/\log n)^{1/(d+2)}$, which implies $k^{1/d - 1/2} \gtrsim (n/\log n)^{\frac{2-d}{2d(d+2)}}$. Therefore, we conclude that $\sum_{i=1}^{s} i^{1/d - 1/2} C_{n,i} \lesssim k^{1/d - 1/2}$. This completes the verification of condition (iv) in Proposition 1. Finally, statement 2 has been verified in (74), completing the proof of Proposition 5.

Within the same theoretical framework as the preceding proof, the following lemma establishes a finite sample bound on the number of nonzero weights returned by SRM without bagging. This result is essential for the comparison of search complexity in Section 4.3. Following the approach in Section 2.3, we randomly partition the data into two disjoint subsets: one for weight selection and the other for computing k-distances. For convenience, we assume without loss of generality that $n$ is an odd number in the following lemma.

Lemma 20 Let $D_s$ be a subset of size $s = n/2$ randomly drawn from the dataset $D_n$. Define $\overline{R}_{s,(i)}$ as the average $i$-distance for any integer $i \le s$ on the subset $D_s$, following the definition in Proposition 1. Furthermore, let $w^*$ be the solution to the following SRM problem:
$$w^* := \arg\min_{w \in \mathcal{W}_s} \sqrt{\log s}\, \|w\|_2 + \sum_{i=1}^{s} w_i \overline{R}_{s,(i)},$$
and define $k^* := \sup\{ i \in [s] : w^*_i \ne 0 \}$. Then, there exists an integer $N_3 \in \mathbb{N}$ such that for all $n > N_3$, with probability $P^s$ at least $1 - 1/n^2$, we have $k^* \lesssim n^{2/(d+2)} (\log n)^{d/(d+2)}$.

Proof [of Lemma 20] Let $c_3$ denote the constant specified in Lemma 12, $c_4$ denote the constant introduced in the proof of Proposition 5, and $\{c_n\}$ denote the sequence defined in Lemma 12. By the definition of $c_n$, there exists $n_6 \in \mathbb{N}$ such that for all $n \ge n_6$, we have
$$(n/2)^{2/d} c_n^{-1-2/d} \log(n/2) / \big( 3^{2/d+1} c_3^2 c_4^{-1-2/d} \big) > 1. \qquad (78)$$
Define $N_3 := \max\{ 2 c_1 + 2, n_6 \}$, where $c_1$ is as specified in Lemma 12. Let $\Omega_1$ denote the event defined by (21) and (22) in Lemma 12 with $B = 1$ and $s = n/2$.
By Lemma 12, for all n > N3, we have s c 1, ensuring that the event Ω1 holds with probability Ps at least 1 1/n2. In the subsequent argument, we assume that Ω1 holds and n N3. Since (78) is analogous to (62) in the proof of Proposition 5, a similar argument as in the Verification of Condition (i) part of that proof shows that k 2cn/c 4 by contradiction. Following the reasoning in the Verification of Condition (ii) part, we obtain log s (k )2/d+1s 2/d. Substituting s = n/2 into the above inequality yields k n2/(d+2)(log n)d/(d+2). This completes the proof of Lemma 20. 6.2.2 Proofs Related to Section 3.2 In this section, we present the proof related to BRDDE. The weights wb, are derived using SRM based on the data for BRDDE, whereas Proposition 4 assumes that the weights are fixed and independent of the data. As a result, Proposition 4 cannot be directly applied to establish the convergence rate of our density estimator. However, Proposition 5 ensures that the weights returned by SRM satisfy the required conditions in Proposition 1 with high probability. Therefore, by making slight modifications to the proof of Proposition 4, we can establish the error decomposition of BRDDE as stated in Proposition 21. Before proceeding, we introduce additional notation. Regarding the expression of f B, (x) in (11), we define the error term (I ), (II ), and (III ) as follows, corresponding to (I), (II), and (III) in (14), (17), and (18): Vd RB, n (x)d i=1 wb, i γs,i j V 1/d d f(x)1/d RB, n (x) d 1 j. (79) b=1 wb, i γs,i P(B(x, e Rb s,(i)(x)))1/d , (80) b=1 wb, i P(B(x, e Rb s,(i)(x)))1/d V 1/d d f(x)1/d e Rb s,(i)(x) . (81) Bagged Regularized k-Distances for Anomaly Detection Following a similar derivation as in (19), we obtain the error decomposition: f B, n (x) f(x) (I ) (II ) + (I ) (III ). (82) Proposition 21 Let Assumption 2 hold. Let (I ), (II ), and (III ) be defined in (79), (80), and (81), respectively. 
Then, there exists an integer N4 N such that for all n > N4 and any x satisfying B(x, e Rb s,(kb, )(x)) [0, 1]d for all b [B], with probability Pn at least 1 2/n2, (I ), (II ), and (III ) have upper bounds of the same asymptotic order as (I), (II), and (III) in Proposition 4. Proof [of Proposition 21] Let Ω5 denote the event defined by the statements in Proposition 5. Applying Lemma 12 to the subset { e Db s}B b=1, we obtain that, with probability at least 1 1/n2, the following holds: c 2(i/s)1/d e Rb s,(i)(x) c 3(i/s)1/d and P(B(x, e Rb s,(i)(x))) i/s for all x X, b [B], and cn i s. We define this event as Ω6. Since the datasets Db s and e Db s are independent for b [B] and wb, is the solution to the SRM in (7), we have E wb, i ((P(B(x, e Rb s,(i)(x))))1/d γs,i) Ω5 = E wb, i Ω5 E (P(B(x, e Rb s,(i)(x))))1/d γs,i = 0 for a fixed x X, where the last equality follows from (20) and the expectation is taken with respect to the empirical measure e Db s, conditional on the event Ω5. Following the proof of Lemma 17 and conditioning on the event Ω5, for all n > n1 = 2dd + 1, we have sup x X,i [ k ] b=1 wb, i P(B(x, e Rb s,(i)(x)))1/d γs,i Cn,i Since Proposition 5 ensures that P(Ω5) 1 1/n2, applying the conditional probability formula yields sup x X,i [ k ] b=1 wb, i P(B(x, e Rb s,(i)(x)))1/d γs,i Cn,i for all n > n1 N2, where N2 is the integer specified in Proposition 5. Denote this event as Ω7. By the union bound, the event Ω5 Ω6 Ω7 holds with probability Pn at least 1 4/n2 for all n > n1 N2. Since the events Ω6 and Ω7 correspond to the events Ω1 and Ω2 in the proof of Proposition 4, respectively, and given that conditions (i) (iv) in Proposition 1 hold for wb, and kb, on Ω5, there exists N4 N such that the upper bound for (I ), (II ), and (III ) follow for all n > N4 by similar arguments as in the proof of Proposition 4. (Note that N4 may differ N1 in that proposition.) The details are omitted. 
Proof [of Theorem 2] Let Ω5, Ω6, and Ω7 be the events defined in the proof of Proposition 21. Following the arguments therein, the event Ω5 Ω6 Ω7 holds with probability Pn at least 1 4/n2 for all n N4. For the remainder of the proof, we assume that the event Ω5 Ω6 Ω7 holds and that n > N 2 := N4. Cai, Yang, Ma, Hang From (73) in the proof of Proposition 5, we have k c 8(n/ log n)1/(d+2)/2. Define n := [c 3(c 8(n/ log n)1/(d+2)/(2s))1/d, 1 c 3(c 8(n/ log n)1/(d+2)/(2s))1/d]d. On the event Ω6, for all x n and any b [B], we have e Rb s,(kb, )(x) c 3(kb, /s)1/d c 3(c 8(n/ log n)1/(d+2)/(2s))1/d. Thus, for any y B x, e Rb s,(kb, )(x) , we have d(y, Rd \ [0, 1]d) c 3(c 8(n/ log n)1/(d+2)/(2s))1/d e Rb s,(kb, )(x) 0. This implies that B x, e Rb s,(kb, )(x) [0, 1]d for all x n and b [B]. Therefore, from Proposition 21 and inequality (82), we obtain f B, n (x) f(x) ((log n)/k)1+1/d + (log n)/(k B) 1/2 + (k/s)1/d, x n. Using kb, (n/ log n)1/(d+2) from Proposition 5 and substituting the choices of B and s, we obtain f B, n (x) f(x) n 1/(d+2)(log n)(d+3)/(d+2), x n Integrating over n, we get Z n |f B, n (x) f(x)| dx n 1/(d+2)(log n)(d+3)/(d+2). (83) On the other hand, on the event Ω5, condition (iii) in Proposition 1 holds for wb, and kb, , which implies that Ps i=1 wb, i i1/d (kb, )1/d for all b [B]. Since γs,i < ((i+1/d)/(s+ 1/d))1/d (2i/s)1/d by (38), we have i=1 wb, i γs,i 1 i=1 wb, i (i/s)1/d 1 b=1 (kb, /s)1/d (k/s)1/d. Following similar arguments as in (47) from Proposition 4, we obtain RB, n (x) (k/s)1/d for all x X on the event Ω5 Ω6. This implies f B, n (x) = 1 Vd RB, n (x)d i=1 wb, i γs,i d k/k 1, x X. Since f c from Assumption 2, we conclude that f B, n f is bounded by a constant. Therefore, we have Z X\ n |f B, n (x) f(x)|dx µ(X \ n) ((n/ log n)1/(d+2)/s)1/d n 1/(d+2)(log n)(d+3)/(d+2). Finally, combining this with (83), we have Z f B, n (x) f(x) dx = Z f B, n (x) f(x) dx n 1/(d+2)(log n)(d+3)/(d+2), which completes the proof. 
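The AUC analysis in the next subsection uses the convention that a tied pair (one normal and one anomalous point with equal scores) contributes 1/2. As a concrete illustration, the empirical counterpart of this definition can be computed directly (a hypothetical helper written for clarity in $O(n_+ n_-)$ time, not the paper's evaluation code):

```python
import numpy as np

def auc_with_ties(scores, labels):
    """Empirical AUC: P(score of anomaly > score of normal) over all
    anomaly/normal pairs, with tied pairs counted as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]               # anomalous instances
    neg = scores[labels == 0]               # normal instances
    gt = np.sum(pos[:, None] > neg[None, :])
    eq = np.sum(pos[:, None] == neg[None, :])
    return (gt + 0.5 * eq) / (len(pos) * len(neg))
```

A perfect ranking yields 1, a random one 1/2, which matches the population quantity whose regret is bounded below.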
Bagged Regularized k-Distances for Anomaly Detection 6.3 Proofs Related to the Convergence Rates of BRDAD In this subsection, we first present the proofs for learning the AUC regret in Section 6.3.1. Then we provide the proof of Theorem 3 in Section 6.3.2. 6.3.1 Proofs Related to Section 4.2.2 Proof [of Proposition 6] Under the Huber contamination model in Assumption 1, let η(x) be defined as in (12), and define bη(x) = Π f B, n (x) 1. By the expression of f B, n (x) in (11), it follows that 1{RB, n (X) RB, n (X ) > 0} = 1{bη(X) bη(X ) > 0} and 1{RB, n (X) = RB, n (X )} = 1{bη(X) = bη(X )}. Consequently, we obtain AUC(RB, n ) = E 1{(Y Y )(RB, n (X) RB, n (X ) > 0)} + 1{RB, n (X) = RB, n (X )}/2|Y = Y = E 1{(Y Y )(bη(X) bη(X ) > 0)} + 1{bη(X) = bη(X )}/2|Y = Y = AUC(bη). Therefore, we have Reg AUC(RB, n ) = Reg AUC(bη). Applying Agarwal (2013, Corollary 11), we obtain Reg AUC(RB, n ) = Reg AUC(bη) 1 Π(1 Π) X |bη(x) η(x)|d PX(x). (84) From Assumption 2, we have f c, and given that f B, n c, it follows that |bη(x) η(x)| = Π f B, n (x) f(x) f B, n (x)f(x) f B, n (x) f(x) . Combining this with (84) and the condition f c from Assumption 2, we establish the desired result. 6.3.2 Proofs Related to Section 3.3 Proof [of Theorem 3] Let Ω5, Ω6, and Ω7 be the events defined in the proof of Proposition 21. Following the arguments therein, the event Ω5 Ω6 Ω7 holds with probability Pn at least 1 4/n2 for all n N4. For the subsequent arguments, we assume that Ω5 Ω6 Ω7 holds and that n > N 2 = N4. Let {cn} denote the sequence from Lemma 12. On the event Ω6, for all x X, we have Rb, s (x) = i=1 wb, i e Rb s,(i)(x) = i=1 wb, i e Rb s,(i)(x) + i=cn+1 wb, i e Rb s,(i)(x) e Rb, s,(cn)(x) + i=cn+1 wb, i e Rb s,(i)(x) (cn/s)1/d + i=1 wb, i (i/s)1/d, By Proposition 5, on the event Ω5, we have Ps i=1 wb, i i1/d k 1/d and k (log n)2. This implies that Rb, s (x) (cn/s)1/d + (k/s)1/d (k/s)1/d. 
Averaging over $b \in [B]$, we obtain
$$R^{B,*}_n(x) = \frac{1}{B} \sum_{b=1}^{B} R^{b,*}_s(x) \lesssim (k/s)^{1/d}. \qquad (85)$$
On the other hand, using (38) together with the inequality $(i + 1/d - 1)/(s + 1 + 1/d) \ge (i-1)/(s+2) \ge (i-1)/(2s)$, we have
$$\sum_{i=1}^{s} w^{b,*}_i \gamma_{s,i} \gtrsim \sum_{i=1}^{s} w^{b,*}_i ((i-1)/s)^{1/d}.$$
Applying Proposition 5 again, we get $\sum_{i=1}^{s} w^{b,*}_i i^{1/d} \gtrsim k^{1/d}$. Averaging over $b \in [B]$, we have
$$\frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{s} w^{b,*}_i \gamma_{s,i} \gtrsim (k/s)^{1/d}.$$
Combining this with (85), we conclude that $f^{B,*}_n(x)$ is lower bounded by a constant for $x \in \mathcal{X}$. Consequently, by Theorem 2 and Proposition 6, we obtain the desired assertion.

7. Conclusion

In this paper, we propose a distance-based algorithm, Bagged Regularized k-Distances for Anomaly Detection (BRDAD), to address challenges in unsupervised anomaly detection. BRDAD mitigates the sensitivity of hyperparameter selection by formulating the problem as a convex optimization task and incorporates bagging to enhance computational efficiency. From a theoretical perspective, we establish fast convergence rates for the AUC regret of BRDAD and show that the bagging technique substantially reduces computational complexity. As a by-product, we derive optimal convergence rates for the L1-error of Bagged Regularized k-Distances for Density Estimation (BRDDE), which shares the same weights as BRDAD, further validating the effectiveness of the Surrogate Risk Minimization (SRM) framework for density estimation. On the experimental side, BRDAD is evaluated against distance-based, forest-based, and kernel-based methods on various anomaly detection benchmarks, demonstrating superior performance. Additionally, parameter analysis reveals that choosing an appropriate number of bagging rounds improves performance, making the method well-suited for practical applications.

Acknowledgments

The authors would like to thank the reviewers and the action editor for their help and advice, which led to a significant improvement of the article. Hanfang Yang and Yuheng Ma are corresponding authors.
The research is supported by the Special Funds of the National Natural Science Foundation of China (Grant No. 72342010). Yuheng Ma is supported by the Outstanding Innovative Talents Cultivation Funded Programs 2024 of Renmin University of China. This research is also supported by Public Computing Cloud, Renmin University of China.

References

Shivani Agarwal. Surrogate regret bounds for the area under the ROC curve via strongly proper losses. In Conference on Learning Theory, pages 338-353. PMLR, 2013.

Charu C. Aggarwal. Applications of outlier analysis. In Outlier Analysis, pages 373-400. Springer, 2012.

Charu C. Aggarwal. An introduction to outlier analysis. In Outlier Analysis, pages 1-34. Springer, 2016a.

Charu C. Aggarwal. Outlier ensembles. In Outlier Analysis, pages 185-218. Springer, 2016b.

Charu C. Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24-47, 2015.

Samet Akcay, Amir Atapour-Abarghouei, and Toby P. Breckon. GANomaly: Semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision, pages 622-637. Springer, 2018.

Oren Anava and Kfir Levy. k*-nearest neighbors: From global to local. Advances in Neural Information Processing Systems, 29, 2016.

Clément Berenfeld and Marc Hoffmann. Density estimation on an unknown submanifold. Electronic Journal of Statistics, 15:2178-2223, 2021.

Sergei N. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.

Gérard Biau and Luc Devroye. Lectures on the Nearest Neighbor Method, volume 246. Springer, 2015.

Gérard Biau, Frédéric Chazal, David Cohen-Steiner, Luc Devroye, and Carlos Rodríguez. A weighted k-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204-237, 2011.

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers.
In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104, 2000.

Yuchao Cai, Yuheng Ma, Yiwei Dong, and Hanfang Yang. Extrapolated random tree for regression. In International Conference on Machine Learning, pages 3442-3468. PMLR, 2023.

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1-58, 2009.

Pengfei Chen, Huabing Huang, and Wenzhong Shi. Reference-free method for investigating classification uncertainty in large-scale land cover datasets. International Journal of Applied Earth Observation and Geoinformation, 107:102673, 2022.

Y.-S. Chow, Stuart Geman, and L.-D. Wu. Consistent cross-validated density estimation. The Annals of Statistics, 11(1):25-38, 1983.

Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.

Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. Advances in Neural Information Processing Systems, 27, 2014.

Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2001.

Luc P. Devroye and Terry J. Wagner. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, pages 536-540, 1977.

Yixiang Dong, Minnan Luo, Jundong Li, Deng Cai, and Qinghua Zheng. LookCom: Learning optimal network for community detection. IEEE Transactions on Knowledge and Data Engineering, 34(2):764-775, 2020.

Muhammad Fahim and Alberto Sillitti. Anomaly detection, analysis and prediction techniques in IoT environment: A systematic literature review. IEEE Access, 7:81664-81681, 2019.

Tharindu Fernando, Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Deep learning for medical anomaly detection: a survey. ACM Computing Surveys (CSUR), 54(7):1-37, 2021.

Gianluigi Folino, Carla Otranto Godano, and Francesco Sergio Pisani.
An ensemble-based framework for user behaviour anomaly detection and classification for cybersecurity. The Journal of Supercomputing, pages 1-24, 2023.

Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209-226, 1977.

Walter Gautschi. Some elementary inequalities relating to the gamma and incomplete gamma function. J. Math. Phys., 38(1):77-81, 1959.

Parikshit Gopalan, Vatsal Sharan, and Udi Wieder. PIDForest: anomaly detection via partial identification. Advances in Neural Information Processing Systems, 32, 2019.

Xiaoyi Gu, Leman Akoglu, and Alessandro Rinaldo. Statistical analysis of nearest neighbor methods for anomaly detection. Advances in Neural Information Processing Systems, 32, 2019.

Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang, and Yue Zhao. ADBench: Anomaly detection benchmark. Advances in Neural Information Processing Systems, 35:32142-32159, 2022.

Hanyuan Hang, Ingo Steinwart, Yunlong Feng, and Johan A. K. Suykens. Kernel density estimation for dynamical systems. The Journal of Machine Learning Research, 19(1):1260-1308, 2018.

Hanyuan Hang, Yuchao Cai, Hanfang Yang, and Zhouchen Lin. Under-bagging nearest neighbors for imbalanced classification. The Journal of Machine Learning Research, 23(1):5135-5197, 2022.

Waleed Hilal, S. Andrew Gadsden, and John Yawney. Financial fraud: a review of anomaly detection techniques and recent advances. Expert Systems With Applications, 193:116429, 2022.

Peter J. Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753-1758, 1965.

Peter J. Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution, pages 492-518. Springer, 1992.

Heinrich Jiang. Uniform convergence rates for kernel density estimation.
In International Conference on Machine Learning, pages 1694-1703. PMLR, 2017.
Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413-422. IEEE, 2008.
Mark Lokanan, Vincent Tran, and Nam Hoai Vuong. Detecting anomalies in financial statements using machine learning algorithm: The case of Vietnamese listed firms. Asian Journal of Accounting Research, 4(2):181-201, 2019.
Ezequiel López-Rubio. A histogram transform for probability density function estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):644-656, 2013.
Andréa Eliza O Luz, Rogério G Negri, Klécia G Massi, Marilaine Colnago, Erivaldo A Silva, and Wallace Casaca. Mapping fire susceptibility in the Brazilian Amazon forests using multitemporal remote sensing and time-varying unsupervised anomaly detection. Remote Sensing, 14(10):2429, 2022.
Pascal Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007.
David S Moore and James W Yackel. Large sample properties of nearest neighbor density function estimators. In Statistical Decision Theory and Related Topics, pages 269-279. Elsevier, 1977.
Ali Bou Nassif, Manar Abu Talib, Qassim Nasir, and Fatima Mohamad Dakalbab. Machine learning for anomaly detection: A systematic review. IEEE Access, 9:78658-78700, 2021.
Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39:419-441, 2008.
Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438, 2000.
M Ravinder and Vikram Kulkarni. A review on cyber security and anomaly detection perspectives of smart grid.
In 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), pages 692-697. IEEE, 2023.
Lukas Ruff, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5):756-795, 2021.
Bernhard Schölkopf, Robert C Williamson, Alex Smola, John Shawe-Taylor, and John Platt. Support vector method for novelty detection. Advances in Neural Information Processing Systems, 12, 1999.
Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.
Haiyang Sheng and Guan Yu. TNN: A transfer learning classifier based on weighted nearest neighbors. Journal of Multivariate Analysis, 193:105126, 2023.
Bernard W Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.
Maciej Skorski. Bernstein-type bounds for beta distribution. Modern Stochastics: Theory and Applications, 10(2):211-228, 2023.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(2), 2005.
Mahito Sugiyama and Karsten Borgwardt. Rapid distance-based outlier detection via sampling. Advances in Neural Information Processing Systems, 26, 2013.
Maximilian E Tschuchnig and Michael Gadermayr. Anomaly detection in medical imaging - a mini review. In International Data Science Conference, pages 33-38. Springer, 2021.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.
Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996.
Shay Vargaftik, Isaac Keslassy, Ariel Orda, and Yaniv Ben-Itzhak. RADE: Resource-efficient supervised anomaly detection using decision tree-based ensemble methods. Machine Learning, 110(10):2835-2866, 2021.
Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Weiping Wang, Zhaorong Wang, Zhanfan Zhou, Haixia Deng, Weiliang Zhao, Chunyang Wang, and Yongzhen Guo. Anomaly detection of industrial control systems based on transfer learning. Tsinghua Science and Technology, 26(6):821-832, 2021.
Mingxi Wu and Christopher Jermaine. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 767-772, 2006.
Puning Zhao and Lifeng Lai. On the convergence rates of KNN density estimation. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2840-2845. IEEE, 2021.
Puning Zhao and Lifeng Lai. Analysis of KNN density estimation. IEEE Transactions on Information Theory, 68(12):7971-7995, 2022.
Fan Zhou, Guanyu Wang, Kunpeng Zhang, Siyuan Liu, and Ting Zhong. Semi-supervised anomaly detection via neural process. IEEE Transactions on Knowledge and Data Engineering, 2023.