# Bagged Regularized k-Distances for Anomaly Detection

Journal of Machine Learning Research 26 (2025) 1-59. Submitted 11/23; Revised 3/25; Published 7/25.

Yuchao Cai (yccai@nus.edu.sg), Department of Statistics and Data Science, National University of Singapore, 117546, Singapore.
Hanfang Yang (hyang@ruc.edu.cn), Center for Applied Statistics and School of Statistics, Renmin University of China, 100872 Beijing, China.
Yuheng Ma (yma@ruc.edu.cn), School of Statistics, Renmin University of China, 100872 Beijing, China.
Hanyuan Hang (hanyuan0725@gmail.com), Hong Kong Research Institute, Contemporary Amperex Technology (Hong Kong) Limited, Hong Kong Science Park, New Territories, Hong Kong.

Editor: Aryeh Kontorovich

We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from sensitivity to the choice of the number of nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized k-distances for anomaly detection (BRDAD), which converts the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite-sample bound of the empirical risk of the bagged weighted k-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the bagging technique incorporated in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity.
On the practical side, we conduct numerical experiments to illustrate the insensitivity of our algorithm to parameter selection compared with other state-of-the-art distance-based methods. Furthermore, with the introduced bagging technique, our method achieves superior performance on real-world datasets compared to other approaches.

Keywords: Unsupervised learning, density estimation, anomaly detection, weighted k-distances, regularization, surrogate risk minimization (SRM), bagging, ensemble learning, optimal convergence rates, learning theory

©2025 Yuchao Cai, Hanfang Yang, Yuheng Ma, Hanyuan Hang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1519.html.

1. Introduction

Anomaly detection refers to the process of identifying patterns or instances that deviate significantly from the expected behavior within a dataset (Chandola et al., 2009). It has been widely and carefully studied within diverse research areas and application domains, including industrial engineering (Fahim and Sillitti, 2019; Wang et al., 2021), medicine (Fernando et al., 2021; Tschuchnig and Gadermayr, 2021), cyber security (Folino et al., 2023; Ravinder and Kulkarni, 2023), earth science (Luz et al., 2022; Chen et al., 2022), and finance (Lokanan et al., 2019; Hilal et al., 2022). For further discussion of anomaly detection techniques and applications, we refer readers to the survey of Nassif et al. (2021). Based on the availability of labeled data, anomaly detection problems can be classified into three main paradigms. The first is the supervised paradigm, where both the normal and anomalous instances are labeled. As mentioned in Aggarwal (2016a) and Vargaftik et al. (2021), researchers often employ existing binary classifiers in this case.
The second is the semi-supervised paradigm, where the training data consists only of normal samples, and the goal is to identify anomalies that deviate from the normal samples (Akcay et al., 2018; Zhou et al., 2023). Perhaps the most flexible yet challenging paradigm is the unsupervised paradigm (Aggarwal, 2016a; Gu et al., 2019), where no labeled examples are available to train an anomaly detector. For the remainder of this paper, we focus only on the unsupervised paradigm, where we do not assume any prior knowledge of labeled data. The existing algorithms in the literature on unsupervised anomaly detection can be roughly grouped into three main categories. The first category is distance-based methods, which determine an anomaly score based on the distance between data points and their neighboring points. For example, k-nearest neighbors (k-NN) (Ramaswamy et al., 2000) calculates the anomaly score of an instance based on the distance to its k-th nearest neighbor, distance-to-measure (DTM) (Gu et al., 2019) introduces a novel distance metric based on the distances of the first k nearest neighbors, and local outlier factor (LOF) (Breunig et al., 2000) computes the anomaly score by quantifying the deviation of the instance from the local density of its neighboring data points. The second category is forest-based methods, which compute anomaly scores based on tree structures. For instance, isolation forest (iForest) (Liu et al., 2008) constructs an ensemble of trees to isolate data points and quantifies the anomaly score of each instance based on the distance from its leaf node to the root in the constructed tree, and partial identification forest (PIDForest) (Gopalan et al., 2019) computes the anomaly score of a data point by identifying the minimum density of data points across all subcubes partitioned by decision trees.
The third category is kernel-based methods such as the one-class SVM (OCSVM) (Schölkopf et al., 1999), which defines a hyperplane to maximize the margin between the origin and normal samples. It has been empirically shown (Aggarwal and Sathe, 2015; Aggarwal, 2016b; Gu et al., 2019) that distance-based and forest-based methods are the top-performing methods across a broad range of real-world datasets. Moreover, experiments in Gu et al. (2019) suggest that distance-based methods show their advantage on high-dimensional datasets, as forest-based methods are likely to neglect a substantial number of features when dealing with high-dimensional data. Unfortunately, it is widely acknowledged that distance-based methods suffer from sensitivity to the choice of the hyperparameter k (Aggarwal, 2012). This problem is particularly severe in unsupervised learning tasks because the absence of labeled data makes it difficult to guide the selection of hyperparameters. To the best of our knowledge, no algorithm in the literature effectively solves this sensitivity problem. Besides, while distance-based methods are crucial and efficient for identifying anomalies, they pose a challenge in scenarios with a high volume of data samples, owing to the considerable expansion of the nearest-neighbor search, which leads to a notable increase in computational overhead. Therefore, improving the computational efficiency of distance-based algorithms also remains a great challenge. In this paper, we propose a distance-based algorithm, bagged regularized k-distances for anomaly detection (BRDAD), which formulates the weight selection problem in unsupervised anomaly detection as a minimization problem. Specifically, we first establish the surrogate risk, a finite-sample bound on the empirical risk of the bagged weighted k-distances for density estimation (BWDDE).
At each bagging round, we determine the weights by minimizing the surrogate risk on a subsampled dataset. Then, using an independently drawn subsample of the same size, we compute the corresponding k-distances. By combining the learned weights and these k-distances, we obtain the regularized k-distance. The final anomaly scores are derived by averaging these regularized k-distances, referred to as bagged regularized k-distances. BRDAD ranks the data in descending order of these scores and identifies the top m instances as anomalies. BRDAD offers two key advantages. First, the surrogate risk minimization (SRM) approach effectively mitigates the sensitivity of parameter choices in distance-based methods. Second, the incorporation of bagging enhances computational efficiency, making the method scalable to large datasets. The contributions of this paper are summarized as follows. (i) We propose a new distance-based algorithm, BRDAD, that avoids the sensitivity of hyperparameter selection in unsupervised anomaly detection problems by formulating the selection as a convex optimization problem. Moreover, the bagging technique incorporated in BRDAD improves the computational efficiency of our distance-based algorithm. (ii) From the theoretical perspective, we establish fast convergence rates of the AUC regret of BRDAD. Moreover, we show that with relatively few bagging rounds B, the number of iterations in the optimization problem at each bagging round can be reduced substantially. This demonstrates that the bagging technique significantly reduces computational complexity. (iii) From an experimental perspective, we conduct numerical experiments to evaluate the effectiveness of our proposed BRDAD method. First, we empirically validate the reasonableness of SRM by demonstrating similar convergence behaviors for both SR and MAE. Next, we compare BRDAD with distance-based, forest-based, and kernel-based methods on anomaly detection benchmarks, highlighting its superior performance.
Finally, we perform a parameter analysis on the number of bagging rounds B, showing that selecting an appropriate B based on the sample size leads to improved performance. The remainder of this paper is organized as follows. In Section 2, we introduce some preliminaries related to anomaly detection and propose our BRDAD algorithm. We provide basic assumptions and theoretical results on the convergence rates of BRDDE and BRDAD in Section 3. Some comments and discussions concerning the theoretical results will also be provided in this section. We present the error and complexity analysis of our algorithm in Section 4, together with some comments concerning the time complexity. We verify the theoretical findings of our algorithm by conducting numerical experiments in Section 5, where we also compare our algorithm with other state-of-the-art anomaly detection algorithms on real-world datasets. All the proofs of Sections 2, 3, and 4 can be found in Section 6. We conclude this paper in Section 7.

2. Methodology

We present our methodology in this section. Section 2.1 introduces basic notations and concepts. In Section 2.2, we propose the bagged weighted k-distances for density estimation (BWDDE) to demonstrate how bagged weighted k-distances can be applied to anomaly detection. Section 2.3 reformulates the weight selection problem for density estimation as a surrogate risk minimization problem, aiming to minimize the finite-sample bound of the empirical risk of BWDDE. Finally, the weights obtained by solving the SRM problem are utilized to construct our main algorithm, named bagged regularized k-distances for anomaly detection (BRDAD).

2.1 Preliminaries

We begin by introducing some fundamental notations that will frequently appear. Suppose that the data $D_n := \{X_1, \ldots, X_n\}$ are independent and identically distributed (i.i.d.)
and drawn from an unknown distribution $P$ that is absolutely continuous with respect to the Lebesgue measure $\mu$ and admits a unique density function $f$. In this paper, we assume that $f$ is supported on $[0,1]^d$, which we denote as $\mathcal{X}$. Recall that for $1 \le p < \infty$ and a vector $x \in \mathbb{R}^d$, the $\ell_p$-norm is defined as $\|x\|_p := (x_1^p + \cdots + x_d^p)^{1/p}$, and the $\ell_\infty$-norm is defined as $\|x\|_\infty := \max_{i=1,\ldots,d} |x_i|$. For a measurable function $g : \mathcal{X} \to \mathbb{R}$, we define the $L_p$-norm as $\|g\|_p := \big(\int_{\mathcal{X}} |g(x)|^p \, dx\big)^{1/p}$. Let $B(x, r) := \{x' \in \mathbb{R}^d : \|x - x'\|_2 \le r\}$ denote the ball in Euclidean space $\mathbb{R}^d$ centered at $x \in \mathbb{R}^d$ with radius $r \in (0, +\infty)$. We use $V_d$ to denote the volume of the $d$-dimensional closed unit ball. In addition, for $n \in \mathbb{N}_+$, we write $[n] := \{1, \ldots, n\}$ for the set containing the integers from 1 to $n$ and $\mathcal{W}_n := \{(w_1, \ldots, w_n) \in \mathbb{R}^n : \sum_{i=1}^n w_i = 1, \ w_i \ge 0, \ i \in [n]\}$. For any $x \in \mathbb{R}$, let $\lfloor x \rfloor$ be the largest integer less than or equal to $x$ and $\lceil x \rceil$ be the smallest integer larger than or equal to $x$. Throughout this paper, we use $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. Moreover, we use the following notations to compare the magnitudes of quantities: $a_n \lesssim b_n$ or $a_n = O(b_n)$ indicates that there exists a positive constant $c > 0$, independent of $n$, such that $a_n \le c b_n$; $a_n \gtrsim b_n$ implies that there exists a positive constant $c > 0$ such that $a_n \ge c b_n$; and $a_n \asymp b_n$ means that $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. In this paper, we focus on the case of a fixed dimension, allowing the constant $c$ to depend on the dimension $d$. Finally, we use $C$, $C'$, $c$, and $c'$ to represent positive constants.

2.2 Bagged Weighted k-Distances for Anomaly Detection

The learning goal of anomaly detection is to identify observations that deviate significantly from the majority of the data. Anomalies are typically rare and differ from the expected behavior of the data set. In this paper, we adopt the contamination model proposed by Huber (1965) and further discussed in Huber (1992, Section 1):

Assumption 1 The data $D_n$ consists of i.i.d.
samples drawn from a distribution $P$ satisfying the contamination model

$$P = (1 - \Pi) P_0 + \Pi P_1, \qquad (1)$$

where $P_0$ and $P_1$ denote the distributions of normal samples and anomalies, respectively, and $\Pi \in (0, 1)$ is the contamination proportion. Additionally, we assume that $P_0$ has a probability density function $f_0$, while $P_1$ is uniformly distributed over $[0,1]^d$ with density function $f_1$.

Assumption 1 describes a contamination model where the observed data is drawn from a mixture of normal and anomalous distributions. Assuming $P_1$ to be uniform on $[0,1]^d$ is a common choice, reflecting an uninformative prior on anomalies (Steinwart et al., 2005). Within this framework, anomaly detection reduces to identifying low-density regions, as the density of the normal data shares the same family of level sets as the density of the contaminated model. This classical assumption facilitates theoretical analysis and underpins several well-known unsupervised methods, including OC-SVM (Schölkopf et al., 2001) and deep learning-based approaches (Ruff et al., 2021). Building on this framework, we explore the distance-based method for unsupervised anomaly detection, motivated by the connection between distance functions and k-nearest neighbor (k-NN) density estimation highlighted by Biau et al. (2011). This method leverages the distances between data points and their nearest neighbors to assess anomalies. For any $x \in \mathbb{R}^d$ and a dataset $D_n$, we denote by $X_{(k)}(x; D_n)$ the $k$-th nearest neighbor of $x$ in $D_n$. We then define $R_{n,(k)}(x) := \|x - X_{(k)}(x; D_n)\|_2$ as the distance between $x$ and $X_{(k)}(x; D_n)$, referred to as the $k$-nearest neighbor distance, or $k$-distance, of $x$ in $D_n$. Distance-based methods are important and effective for anomaly detection. However, when handling large datasets, the number of nearest neighbors that need to be searched grows substantially, leading to significant computational overhead.
To mitigate this issue, we incorporate the bagging technique by averaging the weighted k-distances computed on multiple disjoint sub-datasets randomly drawn from the original dataset $D_n$ without replacement. Let $B$ be the number of bagging rounds pre-specified by the user, and let $\{D_s^b\}_{b=1}^B$ be $B$ disjoint subsets of $D_n$, each of size $s$. Since the sub-samples are disjoint and the data $D_n$ are assumed to be i.i.d., this procedure is mathematically equivalent to taking the first $s$ samples for bag 1, the following $s$ samples for bag 2, and so on. In each subset $D_s^b$, $b \in [B]$, let $R_{s,(k)}^b(x) := \|x - X_{(k)}(x; D_s^b)\|_2$ be the $k$-distance of $x$ in $D_s^b$ for any integer $k \le s$, and let the weighted $k$-distance be defined as $R_s^{w,b}(x) := \sum_{i=1}^s w_i^b R_{s,(i)}^b(x)$ with $w^b \in \mathcal{W}_s$. The bagged weighted $k$-distances are obtained by averaging these weighted $k$-distances across the $B$ sub-datasets:

$$R_n^B(x) := \frac{1}{B} \sum_{b=1}^B R_s^{w,b}(x).$$

Then we introduce the bagged weighted $k$-distances for density estimation (BWDDE) as

$$f_n^B(x) := \frac{1}{V_d \, \big(R_n^B(x)\big)^d} \left( \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^s w_i^b \gamma_{s,i} \right)^d, \qquad (2)$$

where $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ is the volume of the unit ball and

$$\gamma_{s,i} := \frac{\Gamma(i + 1/d)\,\Gamma(s + 1)}{\Gamma(i)\,\Gamma(s + 1 + 1/d)}. \qquad (3)$$

Here, $\gamma_{s,i}$ represents the expected value of the $d$-th root of the beta distribution $\mathrm{Beta}(i, s + 1 - i)$, which is associated with the probability of a ball having a radius of $R_{s,(i)}^b(x)$. The choice of $\gamma_{s,i}$ facilitates the derivation of the concentration inequality for BWDDE. A detailed discussion is provided in Section 4.1. Potential anomalies can be identified in regions of low density using BWDDE. More specifically, the dataset $D_n = \{X_1, \ldots, X_n\}$ can be sorted in ascending order of their BWDDE values, denoted as $\{X_{(1)}, \ldots, X_{(n)}\}$, such that $f_n^B(X_{(1)}) \le \cdots \le f_n^B(X_{(n)})$. If the number of anomalies is predetermined as $m$, then the $m$ data points with the smallest BWDDE values are identified as anomalies.
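As a quick numerical sanity check (our own illustration, not part of the paper), the constant $\gamma_{s,i}$ in (3) can be evaluated stably in log-space; for $d = 1$ it reduces to $i/(s+1)$, the mean of $\mathrm{Beta}(i, s+1-i)$, which provides a convenient test case. The function name `gamma_si` is ours:

```python
from math import exp, lgamma

def gamma_si(s: int, i: int, d: int) -> float:
    """Evaluate gamma_{s,i} = Gamma(i + 1/d) Gamma(s + 1) / (Gamma(i) Gamma(s + 1 + 1/d)).

    Working with log-gamma avoids overflow of Gamma(s + 1) for large s.
    """
    return exp(lgamma(i + 1.0 / d) + lgamma(s + 1) - lgamma(i) - lgamma(s + 1 + 1.0 / d))
```

As expected for the $d$-th root of a $\mathrm{Beta}(i, s+1-i)$ random variable, the values lie in $(0, 1)$ and increase with $i$, so nearer neighbors receive smaller calibration constants.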
2.3 Bagged Regularized k-Distances for Anomaly Detection

A key challenge in using the bagged weighted k-distance for density estimation (2) is selecting appropriate weights for the nearest neighbors. These weights play a crucial role in determining the accuracy of the density estimate and, consequently, the precision of anomaly detection. The simplest way is to take $B = 1$ and, for a $k$ fixed in advance, set $w_k = 1$ and $w_i = 0$ for $i \in [n] \setminus \{k\}$. In this case, BWDDE reverts to the standard k-NN density estimation (Moore and Yackel, 1977; Devroye and Wagner, 1977; Dasgupta and Kpotufe, 2014). Notably, the standard k-NN density estimation relies solely on the distance to the k-th nearest neighbor, disregarding information from other neighbors. To address this limitation, a more general approach was proposed by Biau et al. (2011), which investigated general weighted k-nearest neighbor density estimation by associating the weights with a given probability measure on $[0,1]$. More specifically, for a given probability measure $\nu$ on $[0,1]$ and a sequence of positive integers $\{k_n\}$, the weights are defined as $w_i = \int_{((i-1)/k_n,\, i/k_n]} \nu(dt)$ for $1 \le i \le k_n$, with $w_i = 0$ otherwise.

Challenges for the weight selection. Since density estimation is an unsupervised problem, we lack access to the true density function for direct hyperparameter selection. Existing literature has proposed two common approaches to address this challenge:

1. ANLL-based approach: A common approach is to optimize hyperparameters by minimizing the Average Negative Log Likelihood (ANLL) (Chow et al., 1983; López-Rubio, 2013; Silverman, 2018), which is equivalent to minimizing the KL divergence. This approach assumes that the density estimate integrates to one over $\mathbb{R}^d$. However, this assumption does not hold for our BWDDE. Given the dataset $D_n$, let $M := \max_{i \in [n]} \|X_i\|_2$. By the definition of the bagged weighted k-distances, for any $x \in \mathbb{R}^d$, we have

$$R_n^B(x) = \frac{1}{B} \sum_{b=1}^B R_s^{w,b}(x) \le \frac{1}{B} \sum_{b=1}^B R_{s,(s)}^b(x) \le \frac{1}{B} \sum_{b=1}^B \big(\|x\|_2 + M\big) = \|x\|_2 + M.$$
Therefore, the density estimate satisfies

$$f_n^B(x) \ge \frac{1}{V_d \, \big(\|x\|_2 + M\big)^d} \left( \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^s w_i^b \gamma_{s,i} \right)^d$$

for any $x \in \mathbb{R}^d$. Notably, the integral of $f_n^B(x)$ over $\mathbb{R}^d$ diverges, rendering the ANLL approach unsuitable.

2. $L_2$ risk approach: An alternative approach is leave-one-out cross-validation based on the $L_2$ risk of the density estimate (Tsybakov, 2009; Biau et al., 2011), defined as

$$\int_{\mathbb{R}^d} \big(f_n^B(x) - f(x)\big)^2 \, dx = \int_{\mathbb{R}^d} f_n^B(x)^2 \, dx - 2 \int_{\mathbb{R}^d} f_n^B(x) f(x) \, dx + \int_{\mathbb{R}^d} f(x)^2 \, dx.$$

The last term is independent of $f_n^B$ and can be ignored in the optimization. The second term can be estimated by $\frac{2}{n} \sum_{i=1}^n f_{n,(-i)}^B(X_i)$, where $f_{n,(-i)}^B(x)$ is the density estimator with the $i$-th observation removed. Thus, we define the cross-validation loss as

$$L_{\mathrm{CV}}(f_n^B) := \int_{\mathbb{R}^d} f_n^B(x)^2 \, dx - \frac{2}{n} \sum_{i=1}^n f_{n,(-i)}^B(X_i).$$

However, computing $\int_{\mathbb{R}^d} f_n^B(x)^2 \, dx$ requires Monte Carlo methods, which become infeasible in high dimensions due to the absence of a closed-form expression for this integral.

In summary, existing hyperparameter selection methods are not well-suited for BWDDE in high-dimensional cases due to the lack of closed-form integrals. To address this, we propose the Surrogate Risk Minimization (SRM) approach, which allows automatic weight selection through a more computationally feasible optimization, distinguishing our method from other nearest-neighbor-based density estimators.

Surrogate risk. In the context of density estimation, we consider the absolute loss function $L : \mathcal{X} \times \mathbb{R} \to [0, \infty)$, defined as $L(x, t) := |f(x) - t|$, to measure the discrepancy between an estimate and the true density function $f$. Let $D_s^B = \cup_{b=1}^B D_s^b$ denote the union of the $B$ sub-datasets, where each $D_s^b = \{X_1^b, \ldots, X_s^b\}$. The empirical risk of $f_n^B$ with respect to $D_s^B$ is given by

$$\mathcal{R}_{L, D_s^B}(f_n^B) := \frac{1}{Bs} \sum_{b=1}^B \sum_{i=1}^s \big| f_n^B(X_i^b) - f(X_i^b) \big|. \qquad (4)$$

As noted in Devroye and Lugosi (2001); Hang et al.
(2018), the absolute loss is a reasonable choice for density estimation due to its invariance under monotone transformations. Moreover, it is proportional to the total variation metric, providing a more interpretable measure of proximity to the true density. Since the underlying density function $f$ in (4) is unknown, standard optimization techniques for parameter selection cannot be directly applied to weight selection in density estimation. To address this, we seek a surrogate for the empirical risk in (4) and minimize it to determine the nearest neighbor weights. To proceed, we introduce the following regularity assumptions on the underlying probability distribution $P$.

Assumption 2 Assume that $P$ has a Lebesgue density $f$ with support $\mathcal{X} = [0,1]^d$.

(i) [Lipschitz Continuity] The density $f$ is Lipschitz continuous on $[0,1]^d$, i.e., there exists a constant $c_L > 0$ such that $|f(x) - f(y)| \le c_L \|x - y\|_2$ for all $x, y \in [0,1]^d$.

(ii) [Boundedness] There exist constants $\overline{c} \ge \underline{c} > 0$ such that $\underline{c} \le f(x) \le \overline{c}$ for all $x \in \mathcal{X}$.

The smoothness assumption is necessary for bounding the variation of the density function and is a common approach in density estimation (Dasgupta and Kpotufe, 2014; Jiang, 2017). This assumption helps prevent overfitting and provides a more stable density estimate. The assumption of lower boundedness of the density has been commonly employed in prior work, such as (Dasgupta and Kpotufe, 2014; Zhao and Lai, 2022), to establish finite-sample rates for k-NN density estimation. This assumption simplifies deriving finite-sample bounds for k-distances (see Lemma 12), thereby providing a clearer characterization of the surrogate risk in the following proposition. We emphasize that although this assumption aids theoretical analysis, our proposed algorithm remains applicable in broader settings. Further discussions can be found after Theorem 3.
Under these assumptions, along with additional conditions on the weights, the next proposition presents a surrogate for the empirical risk (4).

Proposition 1 (Surrogate Risk) Let Assumption 2 hold, and let $L$ denote the absolute value loss. Let $\{D_s^b\}_{b=1}^B$ be $B$ disjoint subsets randomly drawn from $D_n$, with $D_s^b = \{X_1^b, \ldots, X_s^b\}$, and define $\bar{R}_{s,(i)}^b := \frac{1}{s} \sum_{j=1}^s R_{s,(i)}^b(X_j^b)$ as the average $i$-distance for any integer $i \le s$ on the subset $D_s^b$. Furthermore, let $f$ be the true density function and $f_n^B$ be the BWDDE as in (2). Moreover, let $k^b := k(w^b) := \sup\{i \in [s] : w_i^b \neq 0\}$, $\underline{k} := \min_{b \in [B]} k^b$, and $\overline{k} := \max_{b \in [B]} k^b$. Finally, suppose that the following four conditions hold:

(i) there exists a sequence $c_n \asymp \log n$ such that $\sum_{i=1}^{c_n} w_i^b \lesssim (\log n)/k^b$ for all $b \in [B]$;

(ii) $\underline{k} \gtrsim (\log n)^2$, $\overline{k} \asymp \underline{k}$, $B \ge 2(d^2 + 4)(\log n)/3$, $B \lesssim (\underline{k}/\log n)^{1+2/d}$, $\log s \asymp \log n$, and $s \ge \max\{c_1, 2\overline{k}\}$, where $c_1$ is a constant defined in Lemma 12;

(iii) $\|w^b\|_2 \asymp (k^b)^{-1/2}$ and $\sum_{i=1}^s i^{1/d} w_i^b \asymp (k^b)^{1/d}$ for $b \in [B]$;

(iv) there exist constants $C_{n,i}$ such that $\max_{b \in [B]} w_i^b \le C_{n,i}$ for $i \in [s]$ and $\sum_{i=1}^s i^{1/d - 1/2} C_{n,i} \lesssim \underline{k}^{1/d - 1/2}$.

Then there exists $N_1 \in \mathbb{N}$, specified in the proof, such that for all $n \ge N_1$ and all $X_i^b$ satisfying $B(X_i^b, R_{s,(k^b)}^b(X_i^b)) \subset [0,1]^d$ for all $b \in [B]$, there holds

$$L\big(X_i^b, f_n^B(X_i^b)\big) \lesssim \sqrt{(\log s)/B}\, \|w^b\|_2 + R_s^{w,b}(X_i^b), \quad i \in [s], \ b \in [B], \qquad (5)$$

with probability $P^{Bs}$ at least $1 - 4/n^2$. Furthermore, we have

$$\mathcal{R}_{L, D_s^B}(f_n^B) \lesssim \mathcal{R}_{L, D_s^B}^{\mathrm{sur}}(f_n^B) := \frac{1}{B} \sum_{b=1}^B \left( \sqrt{(\log s)/B}\, \|w^b\|_2 + \frac{1}{s} \sum_{i=1}^s R_s^{w,b}(X_i^b) \right) = \frac{1}{B} \sum_{b=1}^B \left( \sqrt{(\log s)/B}\, \|w^b\|_2 + \sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b \right). \qquad (6)$$

The term on the right-hand side of (6) is referred to as the surrogate risk. A smaller surrogate risk clearly corresponds to higher accuracy in BWDDE. Conditions (i) and (ii) imply that $\sum_{i=1}^{c_n} w_i^b \to 0$ as $n \to \infty$ for all $b \in [B]$. This ensures that the weights are not overly concentrated in the first $c_n$ nearest neighbors, promoting a more balanced distribution of weights across the data points.
The first requirement in (ii) ensures that the number of nearest neighbors at each bagging round is at least of the order $(\log n)^2$, which aligns with the condition $k \to \infty$ in Moore and Yackel (1977); Dasgupta and Kpotufe (2014). The second requirement mandates that the number of non-zero nearest neighbors across different subsets remains of the same order. The third and fourth requirements impose lower and upper bounds on the number of bagging rounds $B$. Since the subsets are drawn without replacement, a very large $B$ can result in a small sample size in each subset, which would increase the estimation error. Conversely, if $B$ is too small, the density estimator may not benefit sufficiently from bagging. The last two conditions in (ii) impose bounds on $s$ for similar reasons. Condition (iii) on the relationship between $w^b$ and $k^b$ is satisfied for commonly used weight choices for nearest neighbors. Finally, condition (iv) requires that the moments of the weights be bounded by powers of $\underline{k}$. The condition $B(X_i^b, R_{s,(k^b)}^b(X_i^b)) \subset [0,1]^d$ for all $b \in [B]$ ensures that the $k^b$-distance ball is fully contained within the cube $[0,1]^d$ for $b \in [B]$. This is crucial because if the ball extends beyond the cube, the density estimator may not be consistent under Assumption 2. While the conditions in Proposition 1 may appear complex, they encompass commonly used weight choices. For example, under a uniform weight distribution for the nearest neighbors, we have $w_i^b = \mathbf{1}\{i \le k\}/k$ for $i \in [s]$ and $b \in [B]$, where $k \in [s]$ is fixed. Under this setting, condition (i) is directly satisfied, and condition (ii) holds with appropriately chosen parameters. Regarding condition (iii), we obtain $\|w^b\|_2 = (k^b)^{-1/2} = 1/\sqrt{k}$ and $\sum_{i=1}^s i^{1/d} w_i^b = \sum_{i=1}^k i^{1/d}/k \asymp k^{1/d}$ for all $b \in [B]$. If we set $C_{n,i} = \mathbf{1}\{i \le k\}/k$ for $i \in [s]$, it follows that $\sum_{i=1}^s i^{1/d - 1/2} C_{n,i} = \sum_{i=1}^k i^{1/d - 1/2}/k \asymp k^{1/d - 1/2}$, implying that all conditions hold for bagged uniformly weighted k-distance density estimation.
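The uniform-weight verification above is easy to reproduce numerically. The following sketch (our own illustration; variable names are ours) checks condition (iii) for $w_i^b = \mathbf{1}\{i \le k\}/k$ with $k = 64$ and $d = 3$: the $\ell_2$-norm equals $k^{-1/2}$ exactly, and $\sum_i i^{1/d} w_i$ is within a constant factor (roughly $d/(d+1)$) of $k^{1/d}$:

```python
import math

k, d, s = 64, 3, 1000

# uniform weights over the first k nearest neighbors
w = [1.0 / k if i <= k else 0.0 for i in range(1, s + 1)]

# condition (iii), first part: ||w||_2 = k^{-1/2} holds exactly
l2 = math.sqrt(sum(wi * wi for wi in w))

# condition (iii), second part: sum_i i^{1/d} w_i should be of order k^{1/d}
m1 = sum(i ** (1.0 / d) * w[i - 1] for i in range(1, s + 1))
ratio = m1 / k ** (1.0 / d)   # roughly d/(d+1) = 0.75 for d = 3
```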
With minor modifications, these arguments extend to the more general case $w_i = \int_{((i-1)/k,\, i/k]} \nu(dt)$ for $1 \le i \le k$, with $w_i = 0$ otherwise, where $\nu$ is the probability measure associated with $\mathrm{Beta}(\alpha, 1)$ for $\alpha \ge 1$. These conditions are primarily used to derive our surrogate risk, which leads to a new and effective algorithm without hyperparameter tuning. In fact, Proposition 5 in Section 4.2.1 ensures that our proposed method satisfies these conditions with high probability, making it unnecessary to verify them in practical applications.

Surrogate risk minimization (SRM). From the expression of the surrogate risk in (6), minimizing the surrogate risk is equivalent to solving the following optimization problems:

$$w^{b,*} := \operatorname*{arg\,min}_{w^b \in \mathcal{W}_s} \ \sqrt{(\log s)/B}\, \|w^b\|_2 + \sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b, \quad b \in [B]. \qquad (7)$$

A closer examination of the optimization problems in (7) reveals that each consists of two components. The first term is proportional to the $\ell_2$-norm of the weights $w^b$, while the second term represents a linear combination of these weights. Without the first term, the optimization objective in (7) reduces to the second term, $\sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b$, which attains its minimum when $w^b = (1, 0, \ldots, 0)$. In this case, the weighted $k$-distance simplifies to $R_s^{w,b}(x) = R_{s,(1)}^b(x)$, the distance from $x$ to its nearest neighbor. This often leads to overfitting in density estimation, as it fails to incorporate information from other nearest neighbors. By introducing the $\|w^b\|_2$ term into the minimization problem (7), we mitigate the overfitting issue. The $\ell_2$-norm $\|w^b\|_2$ attains its maximum value of 1 when all weight is assigned to a single nearest neighbor, i.e., when $w_i^b = 1$ for some $i \in [s]$ and $w_j^b = 0$ for all $j \neq i$. Conversely, it reaches its minimum value of $s^{-1/2}$ when the weights are uniformly distributed as $w^b = (1/s, \ldots, 1/s)$.
Consequently, incorporating this term encourages the weights to be distributed across multiple nearest neighbors, thereby preventing overfitting. As a result, $\|w^b\|_2$ serves as a regularization term in the minimization problem (7).

Solution to SRM. Notice that (7) is a convex optimization problem that can be solved efficiently from the data. For a fixed $b \in [B]$, considering the constrained Lagrangian, we have

$$\mathcal{L}(w^b, \mu^b, \nu^b) := \sqrt{(\log s)/B}\, \|w^b\|_2 + \sum_{i=1}^s w_i^b \bar{R}_{s,(i)}^b + \mu^b \Big( 1 - \sum_{i=1}^s w_i^b \Big) - \sum_{i=1}^s \nu_i^b w_i^b,$$

where $\mu^b \in \mathbb{R}$ and $\nu_1^b, \ldots, \nu_s^b \ge 0$ are the Lagrange multipliers. Since (7) is a convex optimization problem, a solution satisfying the KKT conditions is a global minimum. Setting the partial derivative of $\mathcal{L}(w^b, \mu^b, \nu^b)$ with respect to $w_i^b$ to zero gives

$$\sqrt{(\log s)/B}\; w_i^b / \|w^b\|_2 = \mu^b + \nu_i^b - \bar{R}_{s,(i)}^b. \qquad (8)$$

Since $w^{b,*}$ is the optimal solution of (7), the KKT conditions imply that if $w_i^{b,*} > 0$, then $\nu_i^b = 0$. Conversely, if $w_i^{b,*} = 0$, then $\nu_i^b \ge 0$, which implies $\bar{R}_{s,(i)}^b \ge \mu^b$. Therefore, $w_i^{b,*}$ is proportional to $\mu^b - \bar{R}_{s,(i)}^b$ for all nonzero entries. Together with the equality constraint $\sum_{i=1}^s w_i^{b,*} = 1$, this yields that $w_i^{b,*}$ has the form

$$w_i^{b,*} = \frac{\big(\mu^b - \bar{R}_{s,(i)}^b\big)\,\mathbf{1}\{\bar{R}_{s,(i)}^b < \mu^b\}}{\sum_{j=1}^s \big(\mu^b - \bar{R}_{s,(j)}^b\big)\,\mathbf{1}\{\bar{R}_{s,(j)}^b < \mu^b\}}$$

for $i \in [s]$. Since $\bar{R}_{s,(i)}^b$ becomes larger as $i$ increases, the formulation above shows that $w_i^{b,*}$ becomes smaller as $i$ increases. Moreover, the optimal weights have a cut-off effect: only nearest neighbors close to $x$, i.e., those with $\bar{R}_{s,(i)}^b < \mu^b$, are considered in the solution, while the weights for the remaining nearest neighbors are all set to zero. This is consistent with our usual judgment: the closer the neighbor, the greater the impact on the density estimation. There are many efficient methods to solve the convex optimization problem (7). Here, we follow the method developed in Anava and Levy (2016); Dong et al. (2020); Sheng and Yu (2023). The key idea is to add nearest neighbors in a greedy manner based on their distance from $x$ until a stopping criterion is met. We present it in Algorithm 1.
Algorithm 1: Surrogate Risk Minimization (SRM)
Input: Average $i$-distances $\bar{R}_{s,(i)}^b$, $1 \le i \le s$.
Compute $r_i = \sqrt{B / \log s}\; \bar{R}_{s,(i)}^b$, $1 \le i \le s$.
Set $\mu_0 = r_1 + 1$ and $k = 0$.
while $\mu_k > r_{k+1}$ and $k \le s - 1$ do
  $k = k + 1$;
  $\mu_k = \Big( \sum_{j=1}^k r_j + \sqrt{k + \big( \sum_{j=1}^k r_j \big)^2 - k \sum_{j=1}^k r_j^2} \Big) \Big/ k$.
end
Compute $A = \sum_{i=1}^s (\mu_k - r_i)\,\mathbf{1}(r_i < \mu_k)$.
Compute $w_i^{b,*} = (\mu_k - r_i)\,\mathbf{1}(r_i < \mu_k)/A$, $1 \le i \le s$.
Output: Weights $w^{b,*}$.

Density estimation. The discussions above indicate that the minimization problem (7) provides a practical method for determining the weights of nearest neighbors in density estimation. However, these weights depend on the data, introducing statistical dependence between the weights $w_i^{b,*}$ and the $i$-distance $R_{s,(i)}^b(x)$ for $i \in [s]$ in each bag $b \in [B]$. Such dependence complicates the theoretical analysis of weighted k-distances and density estimation within the framework of statistical learning theory. To address this issue, we employ distinct subsets of the data for computing the k-distances, ensuring their independence from the optimized weights. For simplicity, we assume that $n = 2Bs$, which facilitates a more straightforward theoretical derivation without loss of generality. This assumption ensures that each subset is of equal size; if $n$ is not exactly divisible by $2B$, minor modifications to the partitioning scheme can be applied without affecting the theoretical conclusions. We randomly partition $D_n$ into $B$ disjoint pairs of subsets $\{(D_s^b, \tilde{D}_s^b)\}_{b=1}^B$, where each subset has size $s$ and $D_s^b \cap \tilde{D}_s^b = \emptyset$ for each $b$. Let $w^{b,*}$ represent the weights obtained by solving the optimization problem (7) using $\{D_s^b\}_{b=1}^B$, and let $\tilde{R}_{s,(i)}^b(x)$ denote the $i$-distance of $x$ computed using $\{\tilde{D}_s^b\}_{b=1}^B$ for $i \in [s]$. Then, we define the regularized k-distances as follows:

$$R_s^{b,*}(x) := R_s^{w^{b,*}, b}(x) = \sum_{i=1}^s w_i^{b,*} \tilde{R}_{s,(i)}^b(x). \qquad (9)$$

We refer to the average of these regularized k-distances as the bagged regularized k-distance

$$R_n^{B,*}(x) := \frac{1}{B} \sum_{b=1}^B R_s^{b,*}(x). \qquad (10)$$
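Algorithm 1 admits a compact implementation. The sketch below (the function name `srm_weights` is ours) follows the greedy scheme of Anava and Levy (2016) as presented in Algorithm 1: neighbors are added while $\mu_k$ exceeds the next rescaled average distance, after which the weights are the normalized positive parts $(\mu_k - r_i)_+$:

```python
import math

def srm_weights(rbar, B):
    """Solve the SRM problem (7): minimize sqrt(log(s)/B) * ||w||_2 + sum_i w_i * rbar_i
    over the probability simplex, where rbar holds the sorted average i-distances."""
    s = len(rbar)
    scale = math.sqrt(B / math.log(s))
    r = [scale * x for x in rbar]              # rescaled average i-distances
    k, sum_r, sum_r2 = 0, 0.0, 0.0
    mu = r[0] + 1.0                            # mu_0 = r_1 + 1 admits the first neighbor
    while k < s and mu > r[k]:                 # greedy stopping rule: mu_k > r_{k+1}
        k += 1
        sum_r += r[k - 1]
        sum_r2 += r[k - 1] ** 2
        disc = k + sum_r ** 2 - k * sum_r2
        mu = (sum_r + math.sqrt(max(disc, 0.0))) / k
    w = [max(mu - ri, 0.0) for ri in r]        # cut-off effect: zero weight beyond mu
    total = sum(w)
    return [wi / total for wi in w]
```

The returned weights are nonincreasing, sum to one, and vanish for neighbors whose rescaled average distance exceeds the threshold $\mu_k$, matching the closed-form solution derived from the KKT conditions.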
By incorporating $w^{b,*}$ and $R^{B,*}_n(x)$ into the BWDDE formula (2), we obtain a new nearest-neighbor-based density estimator, called bagged regularized $k$-distances for density estimation (BRDDE), expressed as
$$f^{B,*}_n(x) := \frac{1}{V_d\, R^{B,*}_n(x)^d} \Big(\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^{b,*}_i \gamma_{s,i}\Big)^d. \quad (11)$$

Algorithm 2: Bagged Regularized k-Distances for Anomaly Detection (BRDAD)
Input: Data $D_n = \{X_1, \ldots, X_n\}$; number of anomalies $m$; bagging rounds $B$; subsampling size $s$.
Randomly partition $D_n$ into $B$ disjoint pairs of subsets $\{(D^b_s, \widetilde{D}^b_s)\}_{b=1}^B$ such that $D^b_s \cap \widetilde{D}^b_s = \emptyset$ for each $b$.
for $b \in [B]$ do
  Compute the weights $w^{b,*}$ using $D^b_s$ by (7);
  Compute the regularized $k$-distances $R^{b,*}_s(X_i)$ for $1 \leq i \leq n$ using (9), based on $w^{b,*}$ and $\widetilde{D}^b_s$.
end
Compute the bagged regularized $k$-distances $R^{B,*}_n(X_i)$ by (10) for $1 \leq i \leq n$.
Sort the data $D_n = \{X_1, \ldots, X_n\}$ as $\{X_{(1)}, \ldots, X_{(n)}\}$ in descending order of their bagged regularized $k$-distances, i.e., $R^{B,*}_n(X_{(1)}) \geq \cdots \geq R^{B,*}_n(X_{(n)})$.
Output: Anomalies $\{X_{(i)}\}_{i=1}^m$.

This minimization approach distinguishes our BRDDE from existing nearest-neighbor-based density estimators. Specifically, those estimators suffer from sensitivity to the choice of the hyperparameter $k$, whose selection is inherently difficult due to the lack of supervised information. By contrast, when the number of bagging rounds $B$ is fixed, SRM enables the calculation of the weights of nearest neighbors in each subset $D^b_s$ by solving the convex optimization problem based on the average $i$-distances $\bar{R}^b_{s,(i)}$ as in equation (7). As a result, we successfully address the hyperparameter selection challenge without changing the unsupervised nature of the problem. Anomaly detection. By applying BRDDE to all samples, anomalies can be identified as the instances with the lowest BRDDE values. However, explicit density estimation is not necessary for anomaly detection.
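To make the pipeline of Algorithm 2 concrete, here is a minimal end-to-end sketch in Python with NumPy. It is our own illustration, not the authors' implementation: the function names are hypothetical, distances are computed by brute force rather than with a k-d tree, and the tied-sample edge cases of the paper's exact partitioning are ignored.

```python
import numpy as np

def srm_weights(r_bar, B):
    """Greedy SRM solver (Algorithm 1) for one bag: minimizes
    sqrt(log(s)/B) * ||w||_2 + sum_i w_i * r_bar[i] over the simplex."""
    s = len(r_bar)
    r = np.sqrt(B / np.log(s)) * np.asarray(r_bar, dtype=float)
    k, mu = 0, r[0] + 1.0
    while k < s and mu > r[k]:
        k += 1
        s1, s2 = r[:k].sum(), (r[:k] ** 2).sum()
        mu = (s1 + np.sqrt(max(k + s1 * s1 - k * s2, 0.0))) / k
    w = np.maximum(mu - r, 0.0)
    return w / w.sum()

def brdad_scores(X, B):
    """Brute-force sketch of Algorithm 2: returns the bagged regularized
    k-distance of every sample (a larger score means more anomalous)."""
    n = len(X)
    parts = np.array_split(np.random.permutation(n), 2 * B)  # 2B disjoint subsets
    scores = np.zeros(n)
    for b in range(B):
        Ds, Dts = X[parts[2 * b]], X[parts[2 * b + 1]]
        # average i-distances within D_s (drop the zero self-distance column)
        dists = np.sort(np.linalg.norm(Ds[:, None] - Ds[None], axis=-1), axis=1)[:, 1:]
        w = srm_weights(dists.mean(axis=0), B)
        # weighted i-distances of all samples to the held-out half D~_s
        q = np.sort(np.linalg.norm(X[:, None] - Dts[None], axis=-1), axis=1)
        scores += q[:, : len(w)] @ w / B
    return scores
```

Sorting `brdad_scores(X, B)` in descending order and taking the top $m$ indices yields the flagged anomalies, as in the output step of Algorithm 2.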
Since density estimates serve as anomaly scores, any monotone transformation preserves their ranking (possibly in reverse) and maintains the same family of level sets, leading to identical AUC values. Thus, bagged regularized $k$-distances provide a sufficient and practical alternative for anomaly detection. They can be accurately computed in high-dimensional spaces, and their associated weights can be efficiently optimized from (7), making them an effective approach to density-based anomaly detection. We now introduce our anomaly detection algorithm, bagged regularized $k$-distances for anomaly detection (BRDAD). The dataset $D_n$ is sorted into the sequence $\{X_{(1)}, \ldots, X_{(n)}\}$ in descending order of the bagged regularized $k$-distances, i.e., $R^{B,*}_n(X_{(1)}) \geq \cdots \geq R^{B,*}_n(X_{(n)})$. Given the pre-specified number of anomalies $m$, the first $m$ instances $\{X_{(i)}\}_{i=1}^m$ are flagged as the $m$ anomalies. The complete procedure of our BRDAD algorithm is presented in Algorithm 2. As illustrated above, SRM mitigates the challenge of hyperparameter selection in density estimation. Consequently, BRDAD inherits the advantages of BRDDE by using the same weights to address the sensitivity of hyperparameter selection in nearest-neighbor-based methods for unsupervised anomaly detection. Algorithm 2 is based on the assumption that $P_1$ follows a uniform distribution. However, when prior knowledge about the density of the anomalous distribution is available, anomalies can still be detected using the bagged regularized $k$-distances $R^{B,*}_n(x)$. Consider the Huber model in (1), where $P$, $P_0$, and $P_1$ have densities $f$, $f_0$, and $f_1$, respectively. Defining $h := f_0/f_1$ and $\rho := \Pi/(1-\Pi)$, Steinwart et al. (2005, Corollary 3) establish that instances in the set $\{X_i : h(X_i) \leq \rho\}$ can be recognized as anomalies with theoretical guarantees. This set can be equivalently written as $\{X_i : f(X_i)/f_1(X_i) \leq 2\Pi\}$, based on the relationship between the densities.
Since $R^{B,*}_n(x)^d$ is inversely proportional to the density estimate $f^{B,*}_n(x)$, anomalies can be identified by sorting the data according to
$$f_1(X_{(1)})^{1/d} R^{B,*}_n(X_{(1)}) \geq \cdots \geq f_1(X_{(n)})^{1/d} R^{B,*}_n(X_{(n)}),$$
where the top $m$ instances $\{X_{(i)}\}_{i=1}^m$ are considered anomalies. Furthermore, the theoretical results in Section 3 can be extended to this setting with minor modifications, ensuring the generality of our approach.

3. Theoretical Results

In this section, we present theoretical results for our BRDAD algorithm. We first investigate the Huber contamination model in Section 3.1, under which we can analyze the performance of the bagged regularized $k$-distances from a learning theory perspective. We then present the convergence rates of BRDDE and BRDAD in Sections 3.2 and 3.3, respectively. Finally, we provide comments and discussion on our algorithms and theoretical results in Section 3.4, where we also compare our theoretical findings on the convergence of both BRDDE and BRDAD with other nearest-neighbor-based methods.

3.1 Huber Contamination Model

In the Huber contamination model (HCM) in Assumption 1, for every instance $X$ from $P$, we can use a latent variable $Y \in \{0, 1\}$ to indicate which distribution it comes from. More specifically, $Y = 0$ and $Y = 1$ indicate that the instance is drawn from the normal and the anomalous distribution, respectively. As a result, the anomaly detection problem can be converted into a bipartite ranking problem in which instances are implicitly labeled positive or negative according to whether they are normal. Let $\widetilde{P}$ represent the joint probability distribution of $(X, Y)$. In this case, our learning goal is to learn a score function that minimizes the probability of mis-ranking a pair of normal and anomalous instances, i.e., that maximizes the area under the ROC curve (AUC). Therefore, we can study regret bounds for the AUC of the bagged regularized $k$-distances to evaluate their performance from the learning theory perspective.
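The AUC formalized next has a simple empirical counterpart: average over all anomaly/normal pairs, counting ties as $1/2$. A minimal self-contained sketch (the function name `pairwise_auc` is our own):

```python
def pairwise_auc(scores, labels):
    """Empirical AUC: fraction of (anomaly, normal) pairs ranked correctly
    by the score function, with ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # anomalies
    neg = [s for s, y in zip(scores, labels) if y == 0]   # normal instances
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because the value depends only on pairwise order comparisons, any strictly increasing transformation of the scores leaves it unchanged, which is exactly the level-set invariance noted above.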
Let $r : \mathcal{X} \to \mathbb{R}$ be a score function. Then the AUC of $r$ can be written as
$$\mathrm{AUC}(r) = \mathbb{E}\big[\mathbf{1}\{(Y - Y')(r(X) - r(X')) > 0\} + \tfrac{1}{2}\mathbf{1}\{r(X) = r(X')\} \,\big|\, Y \neq Y'\big],$$
where $(X, Y)$ and $(X', Y')$ are assumed to be drawn i.i.d. from $\widetilde{P}$. In other words, the AUC of $r$ is the probability that a randomly drawn anomaly is ranked higher than a randomly drawn normal instance by the score function $r$. Given the HCM in (1) and the assumption that $P_1$ is uniformly distributed over $[0,1]^d$ in Assumption 1, the posterior probability function with respect to $\widetilde{P}$ is given by
$$\eta(x) := \widetilde{P}(Y = 1 \mid X = x) = \frac{\Pi f_1(x)}{(1-\Pi) f_0(x) + \Pi f_1(x)} = \Pi f(x)^{-1}. \quad (12)$$
Then the optimal AUC is defined as
$$\mathrm{AUC}^* := \sup_{r : \mathcal{X} \to \mathbb{R}} \mathrm{AUC}(r) = 1 - \frac{1}{2\Pi(1-\Pi)}\, \mathbb{E}_{X, X'} \min\big\{\eta(X)(1 - \eta(X')),\, \eta(X')(1 - \eta(X))\big\}.$$
Finally, the AUC regret of a score function $r$ is defined as $\mathrm{Reg}^{\mathrm{AUC}}(r) := \mathrm{AUC}^* - \mathrm{AUC}(r)$. As discussed in Section 2.2, BRDAD is a density-based anomaly detection method. To establish its convergence rates under the Huber contamination model, we first derive the theoretical convergence rates of BRDDE in (11), which are presented in the next subsection.

3.2 Convergence Rates of BRDDE

The convergence rates of BRDDE are presented in the following theorem.

Theorem 2 Let Assumption 2 hold. Suppose that the dataset $D_n$ is randomly partitioned into $B$ disjoint pairs of subsets $\{(D^b_s, \widetilde{D}^b_s)\}_{b=1}^B$ such that $D^b_s \cap \widetilde{D}^b_s = \emptyset$ for each $b$. Let $f$ be the true density function, and let $f^{B,*}_n$ be the BRDDE defined in (11). If we choose
$$s \asymp (n/\log n)^{(d+1)/(d+2)} \quad \text{and} \quad B \asymp n^{1/(d+2)} (\log n)^{(d+1)/(d+2)}, \quad (13)$$
then there exists $N^*_2 \in \mathbb{N}$, which will be specified in the proof, such that for all $n > N^*_2$, with probability $P^n$ at least $1 - 4/n^2$, we have
$$\int_{\mathcal{X}} |f^{B,*}_n(x) - f(x)| \, dx \lesssim n^{-1/(2+d)} (\log n)^{(d+3)/(d+2)}.$$
The convergence rate of the $L_1$-error of BRDDE in the above theorem matches the minimax lower bound established in Zhao and Lai (2021) when the density function is Lipschitz continuous. Therefore, BRDDE attains the optimal convergence rate for density estimation.
As a result, the SRM procedure in Section 2.3 turns out to be a promising approach for determining the weights of nearest neighbors for BWDDE. Moreover, notice that the number of iterations required in the optimization problem (7) at each bagging round depends on the subsample size $s$. In Theorem 2, the choice of $s$ is significantly smaller than $n$, indicating that fewer iterations are required at each bagging round. This explains the computational efficiency of incorporating the bagging technique when parallel computation is employed. However, due to the dependence on $d$, this improvement becomes less significant in higher dimensions. Further discussion of the complexity is presented in Section 4.3.

3.3 Convergence Rates of BRDAD

The next theorem provides the convergence rates for BRDAD.

Theorem 3 Let Assumptions 1 and 2 hold. Suppose the conditions in Theorem 2 hold, including the dataset partitioning and the choice of the parameters $s$ and $B$. Let $R^{B,*}_n$ be the bagged regularized $k$-distances returned by Algorithm 2. Then there exists $N^*_2 \in \mathbb{N}$, as specified in Theorem 2, such that for all $n > N^*_2$, with probability $P^n$ at least $1 - 4/n^2$, we have
$$\mathrm{Reg}^{\mathrm{AUC}}(R^{B,*}_n) \lesssim n^{-1/(2+d)} (\log n)^{(d+3)/(d+2)}.$$
Theorem 3 establishes that, up to a logarithmic factor, the AUC regret of BRDAD converges at a rate of $O(n^{-1/(d+2)})$, provided that the number of bagging rounds $B$ and the subsample size $s$ are chosen as in (13). Notably, the parameter choices and convergence rates in Theorem 3 align with those in Theorem 2 for BRDDE. This follows from the fact that BRDAD is a density-based anomaly detection method built upon BRDDE. Although surrogate risk minimization is formulated under Assumption 2, we note that these assumptions can be relaxed, and similar convergence rates for our BRDDE and BRDAD remain valid under more general conditions.
In particular, Lipschitz continuity naturally extends to manifolds: if the data is supported on a lower-dimensional manifold $M$ with density $f$ absolutely continuous with respect to the manifold's volume measure, then $f$ is Lipschitz continuous on $M$ if there exists a constant $c_L > 0$ such that $|f(x) - f(y)| \leq c_L\, d_M(x, y)$ for all $x, y \in M$, where $d_M(x, y)$ denotes the geodesic distance. This assumption aligns with the prior literature (Berenfeld and Hoffmann, 2021). Additionally, the assumption of lower boundedness can be relaxed under suitable conditions on the tail behavior of the density, as discussed in Zhao and Lai (2022). Since geodesic distances on manifolds are well approximated by Euclidean distances within a small neighborhood (Niyogi et al., 2008; Berenfeld and Hoffmann, 2021), our theoretical results for SRM and the convergence rates of BRDDE and BRDAD can be extended to cases where both assumptions are relaxed. However, our current focus remains on hyperparameter sensitivity in density estimation; therefore, a detailed exploration of these extensions is beyond the scope of this paper.

3.4 Comments and Discussions

By reformulating the analysis of bagged regularized $k$-distances in terms of BRDDE within a statistical learning framework (van der Vaart and Wellner, 1996), we establish convergence rates for the AUC regret of bagged regularized $k$-distances under the Huber contamination model, assuming mild regularity conditions on the density function (Theorem 3). Notably, our findings reveal that the convergence rate of the AUC regret of BRDAD matches that of density estimation, indicating the effectiveness of BRDAD. In contrast, previous theoretical studies on distance-based methods for unsupervised anomaly detection did not establish a connection between distance-based algorithms and density estimation, leaving the convergence rates unaddressed.
For instance, Sugiyama and Borgwardt (2013) introduced a sampling-based outlier detection method and analyzed its effectiveness compared to traditional $k$-nearest-neighbor approaches, but without a rigorous theoretical foundation. More recently, Gu et al. (2019) conducted a statistical analysis of distance-to-measure (DTM) for anomaly detection under the Huber contamination model, assuming specific regularity conditions on the distribution, and showed that anomalies can be identified with high probability. However, since these studies did not derive convergence rates for the AUC regret, their results are not directly comparable to ours.

4. Error and Complexity Analysis

In this section, we present the error analysis of the AUC regret and the complexity analysis of our algorithm. In detail, in Section 4.1 we provide the error decomposition of the surrogate risk, which leads to the derivation of the surrogate risk in Proposition 1 in Section 2.3. Furthermore, in Section 4.2 we illustrate the three building blocks in learning the AUC regret, which indicate the way to establish the convergence rates of both BRDDE and BRDAD in Theorems 2 and 3. Finally, we analyze the time complexity of BRDAD and illustrate its computational efficiency compared to other distance-based methods for anomaly detection in Section 4.3.

4.1 Error Analysis for the Surrogate Risk

In this section, we first provide the error decomposition for the BWDDE $f^B_n(x)$ in (2), and then present upper bounds for the resulting error terms. Let the term (I) be defined as
$$(\mathrm{I}) := \frac{1}{V_d R^B_n(x)^d} \sum_{j=0}^{d-1} \Big(\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \gamma_{s,i}\Big)^j \big(V_d^{1/d} f(x)^{1/d} R^B_n(x)\big)^{d-1-j}. \quad (14)$$
Using the triangle inequality and the identity
$$x^d - y^d = (x - y) \sum_{j=0}^{d-1} x^j y^{d-1-j}, \quad (15)$$
we obtain
$$f^B_n(x) - f(x) = \frac{1}{V_d R^B_n(x)^d} \bigg(\Big(\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \gamma_{s,i}\Big)^d - V_d f(x) R^B_n(x)^d\bigg) = (\mathrm{I}) \cdot \frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \big(\gamma_{s,i} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x)\big). \quad (16)$$
If the terms (II) and (III) are defined respectively as
$$(\mathrm{II}) := \Big|\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \big(\gamma_{s,i} - P(B(x, R^b_{s,(i)}(x)))^{1/d}\big)\Big|, \quad (17)$$
$$(\mathrm{III}) := \Big|\frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i \big(P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x)\big)\Big|, \quad (18)$$
then applying the triangle inequality to (16) yields the error decomposition
$$|f^B_n(x) - f(x)| \leq (\mathrm{I}) \cdot (\mathrm{II}) + (\mathrm{I}) \cdot (\mathrm{III}). \quad (19)$$
We emphasize that $\gamma_{s,i} = \mathbb{E}[P(B(x, R^b_{s,(i)}(x)))^{1/d}]$ for fixed $x \in \mathcal{X}$, $b \in [B]$, and $i \in [s]$. Therefore, standard concentration inequalities can be applied to establish the convergence rates for term (II). The derivation of this equality is as follows. Since $P$ has a density over $[0,1]^d$ with respect to the Lebesgue measure, the random variable $\|X - x\|_2$ is continuous for fixed $x \in \mathcal{X}$. By the probability integral transform, $P(B(x, \|X - x\|_2))$ is uniformly distributed over $[0,1]$. For any $b \in [B]$, note that $X^b_1, \ldots, X^b_s$ are i.i.d. with distribution $P$. Let $U^b_1, \ldots, U^b_s$ be i.i.d. uniform $[0,1]$ random variables. Then
$$\big(P(B(x, \|X^b_1 - x\|_2)), \ldots, P(B(x, \|X^b_s - x\|_2))\big) \overset{D}{=} \big(U^b_1, \ldots, U^b_s\big).$$
Reordering the samples such that $\|X^b_{(1)}(x) - x\|_2 \leq \cdots \leq \|X^b_{(s)}(x) - x\|_2$, we obtain
$$\big(P(B(x, R^b_{s,(1)}(x))), \ldots, P(B(x, R^b_{s,(s)}(x)))\big) \overset{D}{=} \big(U^b_{(1)}, \ldots, U^b_{(s)}\big), \quad (20)$$
where $U^b_{(i)}$ is the $i$-th order statistic of $U^b_1, \ldots, U^b_s$. This reduces the study of $P(B(x, R^b_{s,(i)}(x)))$ to the study of $U^b_{(i)}$. By Corollary 1.2 in Biau and Devroye (2015), $U^b_{(i)} \sim \mathrm{Beta}(i, s + 1 - i)$. Hence, $\mathbb{E}[P(B(x, R^b_{s,(i)}(x)))^{1/d}] = \mathbb{E}[(U^b_{(i)})^{1/d}] = \gamma_{s,i}$. In the literature on weighted nearest-neighbor density estimation (Biau et al., 2011; Biau and Devroye, 2015), the expression $(i/s)^{1/d}$ is often used in place of $\gamma_{s,i}$, introducing an additional error term $|\gamma_{s,i} - (i/s)^{1/d}|$. While this error is negligible in the absence of bagging, it slows the convergence rate of term (II) when bagging is employed. Therefore, using $\gamma_{s,i}$ enables a clearer demonstration of the benefits of bagging in density estimation.
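Since $U^b_{(i)} \sim \mathrm{Beta}(i, s+1-i)$, the constant $\gamma_{s,i} = \mathbb{E}[(U^b_{(i)})^{1/d}]$ has the closed form of a Beta moment, $\Gamma(i + 1/d)\,\Gamma(s+1) / (\Gamma(i)\,\Gamma(s + 1 + 1/d))$. A small sketch evaluating it via log-gamma for numerical stability (the function name is our own):

```python
import math

def gamma_si(s, i, d):
    """gamma_{s,i} = E[(U_(i))^{1/d}] for U_(i) ~ Beta(i, s + 1 - i),
    computed as the Beta moment
    Gamma(i + 1/d) * Gamma(s + 1) / (Gamma(i) * Gamma(s + 1 + 1/d))
    in log space to avoid overflow for large s."""
    a = 1.0 / d
    return math.exp(math.lgamma(i + a) + math.lgamma(s + 1)
                    - math.lgamma(i) - math.lgamma(s + 1 + a))
```

For $d = 1$ this reduces to $\mathbb{E}[U_{(i)}] = i/(s+1)$, and for moderate $s$ it stays close to, but is not equal to, the surrogate $(i/s)^{1/d}$ discussed above.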
The following proposition provides upper bounds for the error terms (I), (II), and (III).

Proposition 4 Let Assumption 2 hold. Furthermore, let (I), (II), and (III) be defined as in (14), (17), and (18), respectively. Let $k^b$, $\underline{k}$, and $\overline{k}$ be defined as in Proposition 1, and suppose that conditions (i)-(iv) in Proposition 1 hold. Then there exists $N_1 \in \mathbb{N}$, which will be specified in the proof, such that for all $n > N_1$ and all $x$ satisfying $B(x, R^b_{s,(k^b)}(x)) \subset [0,1]^d$ for all $b \in [B]$, the following bounds hold with probability $P^{Bs}$ at least $1 - 2/n^2$:
$$(\mathrm{II}) \lesssim (\overline{k}/s)^{1/d} \big((\log n)/(\underline{k}\, B)\big)^{1/2}; \qquad (\mathrm{III}) \lesssim (\log n)^{1+1/d}/(s^{1/d}\, \underline{k}) + (\overline{k}/s)^{2/d}.$$

4.2 Learning the AUC Regret: Three Building Blocks

Recall that the central concern of statistical learning theory is the convergence rates of learning algorithms under various settings. In Section 3.1, we show that when the probability distribution $P$ follows the Huber contamination model (HCM) in Assumption 1, we can use a latent variable $Y$ to indicate whether an instance comes from the anomalous distribution. Moreover, the posterior probability in (12) implies that under the HCM, anomalies can be identified by the Bayes classifier with respect to the classification loss, resulting in the set of anomalies
$$S := \{x \in \mathbb{R}^d : \eta(x) > 1/2\} = \{x \in \mathbb{R}^d : \Pi f(x)^{-1} > 1/2\} = \{x \in \mathbb{R}^d : f(x) < 2\Pi\}.$$
This set can be estimated by the lower level set of the BRDDE at the threshold $2\Pi$, i.e., $\widehat{S} := \{x \in \mathbb{R}^d : f^{B,*}_n(x) < 2\Pi\}$ with $f^{B,*}_n(x)$ as defined in (11). If we choose
$$\theta = \frac{1}{(2 V_d \Pi)^{1/d}} \cdot \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^s w^{b,*}_i \gamma_{s,i},$$
then we have $\{x \in \mathbb{R}^d : R^{B,*}_n(x) > \theta\} = \{x \in \mathbb{R}^d : f^{B,*}_n(x) < 2\Pi\} = \widehat{S}$. This implies that the upper level set of the bagged regularized $k$-distances, i.e., $\{x \in \mathbb{R}^d : R^{B,*}_n(x) > \theta\}$, equals the estimate $\widehat{S}$ under the properly chosen threshold. As a result, the unsupervised anomaly detection problem is converted into an implicit binary classification problem.
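The level-set identity above is elementary algebra: writing $W := \frac{1}{B}\sum_b\sum_i w^{b,*}_i\gamma_{s,i}$, the BRDDE is $W^d/(V_d R^d)$, so $f^{B,*}_n(x) < 2\Pi$ if and only if $R^{B,*}_n(x) > \theta = W/(2V_d\Pi)^{1/d}$. A throwaway numerical check, with all constants arbitrary and function names our own:

```python
def brdde(R, W, V_d, d):
    # BRDDE as a function of the bagged regularized k-distance R, cf. (11)
    return W ** d / (V_d * R ** d)

def threshold(W, V_d, Pi, d):
    # theta = W / (2 * V_d * Pi)^(1/d)
    return W / (2 * V_d * Pi) ** (1.0 / d)

# arbitrary illustrative constants
W, V_d, Pi, d = 0.37, 3.14159, 0.05, 3
theta = threshold(W, V_d, Pi, d)
for R in [0.1, 0.5, theta * 0.999, theta * 1.001, 2.0, 10.0]:
    # the upper level set {R > theta} coincides with the lower level set {f < 2*Pi}
    assert (R > theta) == (brdde(R, W, V_d, d) < 2 * Pi)
```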
Therefore, we are able to analyze the performance of $R^{B,*}_n(x)$ in anomaly detection by applying analytical tools for classification. Since the posterior probability estimate is inversely proportional to the BRDDE, as shown in (12) in Section 3.1, the problem of analyzing the posterior probability estimate can be further converted to analyzing the BRDDE. Therefore, it is natural and necessary to investigate the following three problems:
(i) The finite sample bounds of the optimized weights $w^{b,*}$ obtained by solving the SRM problems.
(ii) The convergence of the BRDDE as stated in Theorem 2, that is, whether $f^{B,*}_n$ converges to $f$ in the $L_1$-norm.
(iii) The convergence of the AUC regret of $R^{B,*}_n$, i.e., whether the convergence of the BRDDE $f^{B,*}_n$ implies the convergence of the AUC regret of $R^{B,*}_n$.

Figure 1: An illustration of the three fundamental components of the AUC regret: $\mathrm{Reg}^{\mathrm{AUC}}(R^{B,*}_n) \to 0 \Leftarrow \|f^{B,*}_n - f\|_{L_1(\mathcal{X})} \to 0$ (Theorem 2) $\Leftarrow$ SRM. The left block represents the consistency of the AUC regret, the middle block the consistency of BRDDE, and the right block the statistical analysis of SRM, corresponding to Problems (iii), (ii), and (i), respectively.

The above three problems form the foundation for a learning theory analysis of bagged regularized $k$-distances and serve as its three main building blocks. Notice that Problem (ii) is already addressed by Theorem 2 in Section 3.2. Detailed explorations of the other two problems, (i) and (iii), are given in the following subsections.

4.2.1 Analysis for the Surrogate Risk Minimization

The following proposition ensures that the solution to the surrogate risk minimization satisfies the conditions in Proposition 1 with high probability, thereby addressing Problem (i) and validating the effectiveness of our algorithm.

Proposition 5 Let Assumption 2 hold, and let $\{D^b_s\}_{b=1}^B$ be $B$ disjoint subsets of size $s$ randomly drawn from the dataset $D_n$.
Moreover, let $w^{b,*}$ be defined as in (7), and let $k^{b,*} := k(w^{b,*}) := \sup\{i \in [s] : w^{b,*}_i \neq 0\}$. If we choose $s$ and $B$ as in (13) in Theorem 2, then there exists $N_2 \in \mathbb{N}$, which will be specified in the proof, such that for all $n > N_2$, the following two statements hold with probability $P^{Bs}$ at least $1 - 1/n^2$:
1. Conditions (i)-(iv) in Proposition 1 hold for $w^{b,*}$ and $k^{b,*}$.
2. Furthermore, we have $k^{b,*} \lesssim (n/\log n)^{1/(d+2)}$ for $b \in [B]$.

4.2.2 Analysis for the AUC Regret

Problem (iii), the left block of Figure 1, is solved by the next proposition, which shows that bounding the AUC regret of the bagged regularized $k$-distances can be reduced to bounding the $L_1$-error of the BRDDE.

Proposition 6 Let Assumptions 1 and 2 hold, and let $f$ be the true density function. Let $R^{B,*}_n$ be the bagged regularized $k$-distances as in (10) and $f^{B,*}_n(x)$ be the BRDDE as in (11). Suppose that there exists a constant $c > 0$ such that $f^{B,*}_n \geq c$. Then we have
$$\mathrm{Reg}^{\mathrm{AUC}}(R^{B,*}_n) \lesssim \int_{\mathcal{X}} \big|f^{B,*}_n(x) - f(x)\big| \, dx.$$
The results in Proposition 6 apply broadly to any density estimator that is lower bounded and whose weights satisfy the conditions outlined in Propositions 4 and 5. Indeed, this means that any sufficiently good (weighted) $k$-NN density estimator meeting these conditions would achieve similar theoretical guarantees. Therefore, our proposed estimator does not claim a faster convergence rate than existing methods. Instead, its primary advantage lies in reducing sensitivity to hyperparameter selection in practice, as thoroughly discussed in Section 2.3. The theoretical guarantees established in Proposition 6 play a crucial role in validating the consistency of our estimator in comparison to standard approaches.
4.3 Complexity Analysis

To address the efficiency issue of distance-based anomaly detection methods on large-scale datasets, Wu and Jermaine (2006) proposed iterative subsampling: for each test sample, they first randomly select a portion of the data and then compute the $k$-distance over the subsample. They provided a probabilistic analysis of the quality of the subsampled distance compared to the $k$-distance over the whole dataset. Furthermore, Sugiyama and Borgwardt (2013) proposed one-time sampling, which computes the $k$-distances over a single subsample for all test samples and is shown to be more efficient than iterative sampling. Although these subsampling methods improve computational efficiency, the resulting distance-based methods fail to fully utilize the information in the dataset, since a large portion of the samples is dropped. By contrast, the bagging technique incorporated in our BRDAD not only addresses the efficiency issue on large-scale datasets but also makes full use of the data. In the following, we conduct a detailed complexity analysis of BRDAD to demonstrate its computational efficiency. As a widely used data structure, the $k$-d tree (Friedman et al., 1977) is commonly employed in nearest-neighbor-based methods to search for nearest neighbors. Given $n$ data points in $d$ dimensions, Friedman et al. (1977) showed that constructing a $k$-d tree requires $O(dn \log n)$ time, while searching for the $k$ nearest neighbors of a query point takes $O(k \log n)$ time. Below, we analyze the time complexities of the construction and search stages in BRDAD to demonstrate how bagging reduces the computational complexity. (i) In each bagging round, BRDAD constructs a $k$-d tree using $s$ data points. Taking $s$ of the order in Theorem 2, the construction time per $k$-d tree is $O(ds \log s) = O(d n^{(d+1)/(d+2)} (\log n)^{1/(d+2)})$ when parallelism is applied.
In contrast, without bagging, constructing a single $k$-d tree requires $O(dn \log n)$ time. Thus, bagging reduces the construction time complexity. (ii) The time complexity of computing the regularized $k$-distances at each bagging round consists of two main components: (a) computing the average $i$-distances and (b) solving the SRM problem. The first part, querying $k^{b,*}$ neighbors, takes $O(k^{b,*} \log s)$ time. For the second part, Anava and Levy (2016, Theorem 3.3) show that Algorithm 1 finds the solution in $O(k^{b,*})$ time. Consequently, the search stage requires at most $O(k^{b,*} \log s)$ time. From the orders of $k^{b,*}$ and $s$ in Proposition 5 and Theorem 3, the search complexity with parallel computation across the subsets $\{D^b_s\}_{b=1}^B$ is $O(n^{1/(d+2)} (\log n)^{(d+1)/(d+2)})$. In contrast, without bagging, the search complexity is $O(n^{2/(d+2)} (\log n)^{(2d+2)/(d+2)})$, from the order of $k$ in Lemma 20 in Section 6.2. This shows that bagging also improves the search efficiency. In summary, with proper parallelization, the overall complexity is dominated by the construction stage at $O(d n^{(d+1)/(d+2)} (\log n)^{1/(d+2)})$, compared to $O(dn \log n)$ without bagging. Thus, bagging enhances computational efficiency when parallel computation is fully leveraged. However, due to the dependence on $d$, the computational improvement becomes less significant in higher dimensions. For popular distance-based anomaly detection methods such as standard $k$-NN and DTM (Gu et al., 2019), the primary computational cost comes from constructing a $k$-d tree and searching for the $k$ nearest neighbors. If $k$ is set to the optimal order $O(n^{2/(d+2)} (\log n)^{d/(d+2)})$ for standard $k$-NN density estimation, the construction stage takes $O(dn \log n)$ time, while the search stage requires $O(n^{2/(d+2)} (\log n)^{(2d+2)/(d+2)})$ time. For another distance-based method, LOF, in addition to constructing a $k$-d tree and searching for the $k$ nearest neighbors, there is an extra step of computing scores for all samples, which adds a time complexity of $O(n)$ (Breunig et al., 2000).
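The parameter choices in (13) and the resulting construction costs can be checked numerically. A sketch with all constants set to 1 (function names our own); note that with these exponents $s \cdot B$ equals $n$ exactly, mirroring the assumption $n = 2Bs$ up to constants:

```python
import math

def bagging_params(n, d):
    """s and B from (13) with all constants set to 1:
    s = (n / log n)^((d+1)/(d+2)),  B = n^(1/(d+2)) * (log n)^((d+1)/(d+2))."""
    s = (n / math.log(n)) ** ((d + 1) / (d + 2))
    B = n ** (1 / (d + 2)) * math.log(n) ** ((d + 1) / (d + 2))
    return s, B

def kd_build_costs(n, d):
    """Per-bag k-d tree build cost d*s*log(s) under parallel bagging,
    vs. d*n*log(n) for one tree on the full dataset (Friedman et al., 1977)."""
    s, _ = bagging_params(n, d)
    return d * s * math.log(s), d * n * math.log(n)
```

The ratio of the bagged to the unbagged construction cost shrinks as $n$ grows, which is the quantitative content of the comparison above.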
A straightforward comparison shows that these methods are significantly more computationally intensive than BRDAD, whose construction and search stages have much lower complexities of $O(d n^{(d+1)/(d+2)} (\log n)^{1/(d+2)})$ and $O(n^{1/(d+2)} (\log n)^{(d+1)/(d+2)})$, respectively.

5. Experiments

This section presents numerical experiments. In Section 5.1, we perform synthetic data experiments on density estimation to illustrate the convergence of the surrogate risk and the mean absolute error of BRDDE as the sample size grows. Section 5.2 focuses on real-world anomaly detection benchmarks, where we compare BRDAD with various methods and explain its advantages. Our empirical results demonstrate that bagging enhances the algorithm's performance.

5.1 Synthetic Data Experiments on Density Estimation

In this section, we empirically demonstrate the convergence of both the surrogate risk (SR) and the mean absolute error (MAE) of BRDDE as the sample size $n$ increases. The results are presented in Figures 2(a) and 2(b). As established in Theorem 1, the surrogate risk is expected to exhibit convergence behavior similar to that of the MAE for BRDDE. To investigate this, we sample $n$ data points from the $N(0, 1)$ distribution, with $n$ set to $\{300, 1000, 3000, 5000, 10000\}$ for training. The surrogate risk for each $n$ is computed using Algorithm 1 with $B = 1$. Additionally, we randomly sample 10,000 instances to calculate the MAE and assess BRDDE's performance. Each experiment is repeated 20 times for every sample size $n$. The results in Figure 2(a) show that the surrogate risk decreases monotonically as $n$ increases, while Figure 2(b) exhibits a similar convergence pattern for the MAE. Furthermore, we plot the ratio of SR to MAE for each sample size $n$ in Figure 2(c), which indicates that the ratio stabilizes once $n$ exceeds 3000, further confirming the similarity of their convergence behaviors.
To extend this analysis to more complex distributions and higher dimensions, we conduct additional experiments using the mixture distribution $0.4\, N(0.3 \cdot \mathbf{1}_d,\, 0.01 I_d) + 0.6\, N(0.7 \cdot \mathbf{1}_d,\, 0.0025 I_d)$ across dimensions $d \in \{1, 3, 5, 7, 9\}$. The ratio of SR to MAE for each sample size $n$ is plotted in Figure 2(d), revealing convergence toward a dimension-dependent constant. This result reinforces the generalizability of our optimized weights across diverse settings. We also provide an illustrative example on a synthetic dataset to demonstrate the sensitivity of the hyperparameter $k$ in other nearest-neighbor-based density estimation methods, namely the $k$-NN density estimator ($k$-NN) and the weighted $k$-NN density estimator (Wk-NN) (Biau et al., 2011). Following Biau et al. (2011), we take $\nu$ to be the distribution of $U^\alpha$, where $U$ is uniform on $[0,1]$ and $\alpha > 0$ is a parameter. Given an integer $k$, the weights of Wk-NN are defined by
$$w_i = \int_{((i-1)/k,\; i/k]} \nu(dt) = (i/k)^{1/\alpha} - ((i-1)/k)^{1/\alpha} \quad \text{for } i \in [k],$$
with $w_i = 0$ otherwise. We generate 1000 data points to train the density estimators and an additional 10,000 points to compute the MAE, drawn from a Gaussian mixture model with density $0.5\, N(0.3, 0.01) + 0.5\, N(0.7, 0.0025)$. We vary the hyperparameter $k$ from 3 to 500 to observe its effect on the MAE for both $k$-NN and Wk-NN. The results in Figure 3 illustrate that the performance of these density estimators is significantly influenced by the choice of $k$, regardless of $\alpha$: only a narrow range of $k$ values leads to optimal results. In contrast, our proposed estimator, BRDDE, mitigates this sensitivity. The black dashed line in Figure 3 represents the MAE of BRDDE, demonstrating that it achieves near-optimal results comparable to the best-tuned nearest-neighbor-based methods without requiring fine-tuning of $k$.
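The Wk-NN weights used in the comparison above are straightforward to compute; a short sketch (the function name is our own):

```python
def wknn_weights(k, alpha):
    """Weights of the weighted k-NN density estimator of Biau et al. (2011):
    w_i = (i/k)^(1/alpha) - ((i-1)/k)^(1/alpha) for i = 1, ..., k.
    The sum telescopes to 1, so the weights always form a probability vector."""
    return [(i / k) ** (1 / alpha) - ((i - 1) / k) ** (1 / alpha)
            for i in range(1, k + 1)]
```

Setting `alpha = 1` recovers the uniform weights $1/k$ of the plain $k$-NN estimator, while `alpha > 1` places more mass on the closest neighbors; in every case the single hyperparameter $k$ still has to be tuned, which is exactly the sensitivity the experiment illustrates.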
5.2 Real-world Data Experiments on Anomaly Detection

5.2.1 Dataset Descriptions

To provide an extensive experimental evaluation, we use the latest anomaly detection benchmark repository, ADBench, established by Han et al. (2022). The repository includes 47 tabular datasets, ranging from 80 to 619,326 instances and from 3 to 1,555 features. Descriptions of these datasets are provided in Table 1.

Table 1: Descriptions of ADBench Datasets (Number, Data, # Samples, # Features, # Anomaly, % Anomaly, Category)
1 ALOI 49534 27 1508 3.04 Image
2 annthyroid 7200 6 534 7.42 Healthcare
3 backdoor 95329 196 2329 2.44 Network
4 breastw 683 9 239 34.99 Healthcare
5 campaign 41188 62 4640 11.27 Finance
6 cardio 1831 21 176 9.61 Healthcare
7 Cardiotocography 2114 21 466 22.04 Healthcare
8 celeba 202599 39 4547 2.24 Image
9 census 299285 500 18568 6.20 Sociology
10 cover 286048 10 2747 0.96 Botany
11 donors 619326 10 36710 5.93 Sociology
12 fault 1941 27 673 34.67 Physical
13 fraud 284807 29 492 0.17 Finance
14 glass 214 7 9 4.21 Forensic
15 Hepatitis 80 19 13 16.25 Healthcare
16 http 567498 3 2211 0.39 Web
17 Internet Ads 1966 1555 368 18.72 Image
18 Ionosphere 351 32 126 35.90 Oryctognosy
19 landsat 6435 36 1333 20.71 Astronautics
20 letter 1600 32 100 6.25 Image
21 Lymphography 148 18 6 4.05 Healthcare
22 magic.gamma 19020 10 6688 35.16 Physical
23 mammography 11183 6 260 2.32 Healthcare
24 mnist 7603 100 700 9.21 Image
25 musk 3062 166 97 3.17 Chemistry
26 optdigits 5216 64 150 2.88 Image
27 Page Blocks 5393 10 510 9.46 Document
28 pendigits 6870 16 156 2.27 Image
29 Pima 768 8 268 34.90 Healthcare
30 satellite 6435 36 2036 31.64 Astronautics
31 satimage-2 5803 36 71 1.22 Astronautics
32 shuttle 49097 9 3511 7.15 Astronautics
33 skin 245057 3 50859 20.75 Image
34 smtp 95156 3 30 0.03 Web
35 Spam Base 4207 57 1679 39.91 Document
36 speech 3686 400 61 1.65 Linguistics
37 Stamps 340 9 31 9.12 Document
38 thyroid 3772 6 93 2.47 Healthcare
39 vertebral 240 6 30 12.50 Biology
40 vowels 1456 12 50 3.43 Linguistics
41 Waveform 3443 21 100 2.90 Physics
42 WBC 223 9 10 4.48 Healthcare
43 WDBC 367 30 10 2.72 Healthcare
44 Wilt 4819 5 257 5.33 Botany
45 wine 129 13 10 7.75 Chemistry
46 WPBC 198 33 47 23.74 Healthcare
47 yeast 1484 8 507 34.16 Biology

Figure 2: (a)(b) show that SRM leads to the convergence of both the surrogate risk (SR) and the mean absolute error (MAE). Furthermore, (c) shows that as the sample size $n$ increases, the ratio of SR to MAE becomes stable, indicating similar convergence behaviors for both SR and MAE when applying Algorithm 1. (d) shows the log ratio of SR to MAE for dimensions $d \in \{1, 3, 5, 7, 9\}$.

Figure 3: Illustration of the sensitivity of the parameter $k$ (curves for $\alpha = 10, 2, 1, 0.5, 0.2$).

5.2.2 Methods for Comparison

We conduct experiments on the following anomaly detection algorithms.
(i) BRDAD is our proposed algorithm, with details provided in Algorithm 2. The choice of $B$ depends on the sample size: for $n \in (0, 10000]$, $(10000, 50000]$, and $(50000, +\infty)$, we set $B = 1$, $5$, and $10$, respectively. In practice, when $B$ is fixed, we randomly divide the data into $B$ subsets, each containing either $\lfloor n/B \rfloor$ or $\lfloor n/B \rfloor + 1$ samples. Each subset is then further split into two parts whose sizes are equal or differ by at most 1. This process ensures that the full dataset is partitioned as evenly as possible.
(ii) Distance-To-Measure (DTM) (Gu et al., 2019) is a distance-based algorithm that employs a generalization of the $k$ nearest neighbors called the distance-to-measure. We use the authors' implementation. As suggested by the authors, the number of neighbors is fixed at $k = 0.03 \times$ sample size.
(iii) k-Nearest Neighbors (k-NN) (Ramaswamy et al., 2000) is a distance-based algorithm that uses the distance of a point from its k-th nearest neighbor to distinguish anomalies. We use the implementation of the Python package PyOD with its default parameters.

(iv) Local Outlier Factor (LOF) (Breunig et al., 2000) is a distance-based algorithm that measures the local deviation of the density of a given data point with respect to its neighbors. We also use PyOD with its default parameters.

(v) Partial Identification Forest (PIDForest) (Gopalan et al., 2019) is a forest-based algorithm that computes the anomaly score of a point by determining the minimum density of data points across all subcubes partitioned by decision trees. We use the authors' implementation with the number of trees T = 50, the number of buckets B = 5, and the depth of trees p = 10, as suggested by the authors.

(vi) Isolation Forest (iForest) (Liu et al., 2008) is a forest-based algorithm that works by randomly partitioning features of the data into smaller subsets and distinguishing between normal and anomalous points based on the number of splits required to isolate them, with anomalies requiring fewer splits. We use the implementation of the Python package PyOD with its default parameters.

(vii) One-class SVM (OCSVM) (Schölkopf et al., 1999) is a kernel-based algorithm which tries to separate data from the origin in the transformed high-dimensional predictor space. We also use PyOD with its default parameters.

Note that since BRDAD, iForest, and PIDForest are randomized algorithms, we repeat these three algorithms for 10 runs and report the averaged AUC performance. DTM, k-NN, LOF, and OCSVM are deterministic, and hence we report a single AUC number for them.

5.2.3 Experimental Results

Table 2 presents the performance of seven anomaly detection methods on the ADBench benchmark, evaluated using the AUC metric.
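The rank-sum and first-place statistics reported in the last two rows of Table 2 are straightforward to compute from a matrix of AUC scores. The following sketch is our own illustration, not part of the paper's code; the helper name rank_sum and the toy AUC values are hypothetical:

```python
import numpy as np

def rank_sum(auc_matrix):
    """Given an (n_datasets, n_methods) AUC matrix, return each method's
    rank sum (lower is better) and its number of first-place finishes.
    Ties are broken arbitrarily by argsort."""
    # Rank methods per dataset: the highest AUC receives rank 1.
    order = (-auc_matrix).argsort(axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(auc_matrix.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, auc_matrix.shape[1] + 1)
    return ranks.sum(axis=0), (ranks == 1).sum(axis=0)

# Toy example with 3 datasets and 3 hypothetical methods.
aucs = np.array([[0.90, 0.80, 0.70],
                 [0.60, 0.95, 0.50],
                 [0.85, 0.75, 0.65]])
sums, firsts = rank_sum(aucs)
print(sums, firsts)  # method 0 gets ranks 1, 2, 1 -> rank sum 4, two firsts
```

A lower rank sum combined with more first-place finishes is the criterion the section uses to compare methods.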
The last two rows summarize each algorithm's rank sum and the number of top-ranking performances. A lower rank sum and a higher number of first-place rankings indicate better performance. BRDAD demonstrates exceptional results across both evaluation metrics. Specifically, it achieves the lowest rank sum of 145, significantly outperforming other methods, with DTM and iForest scoring 160 and 159, respectively. In terms of first-place rankings across datasets, BRDAD ranks first in 11 out of 47 tabular datasets, outperforming PIDForest (9/47) and DTM (8/47). Overall, BRDAD exhibits outstanding performance, excelling over previous distance-based methods while competing effectively with forest-based approaches.

Table 2: Experimental Comparisons on ADBench Datasets
Data  BRDAD  DTM  k-NN  LOF  PIDForest  iForest  OCSVM
ALOI  0.5427  0.5440  0.6942  0.7681  0.5061  0.5411  0.5326
annthyroid  0.6516  0.6772  0.7343  0.7076  0.8781  0.8138  0.5842
backdoor  0.8490  0.9216  0.6682  0.7135  0.6965  0.7238  0.8465
breastw  0.9883  0.9799  0.9765  0.3907  0.9750  0.9871  0.8052
campaign  0.6711  0.6908  0.7202  0.5366  0.7945  0.7182  0.6630
cardio  0.9142  0.8879  0.7330  0.6372  0.8258  0.9271  0.9286
Cardiotocography  0.6302  0.6043  0.5449  0.5705  0.5587  0.6973  0.7872
celeba  0.5896  0.6929  0.5666  0.4332  0.6732  0.6955  0.6962
census  0.6394  0.6435  0.6465  0.5501  0.5543  0.6116  0.5336
cover  0.9301  0.9277  0.7961  0.5262  0.8065  0.8784  0.9141
donors  0.7858  0.8000  0.6117  0.5977  0.6945  0.7810  0.7323
fault  0.7591  0.7587  0.7286  0.5827  0.5437  0.5714  0.5074
fraud  0.9552  0.9583  0.9342  0.4750  0.9489  0.9493  0.9477
glass  0.7993  0.8688  0.8640  0.8114  0.7913  0.7933  0.4407
Hepatitis  0.6954  0.6303  0.6745  0.6429  0.7186  0.6944  0.6418
http  0.9943  0.0507  0.2311  0.3550  0.9870  0.9999  0.9949
Internet Ads  0.7274  0.7063  0.7110  0.6485  0.6754  0.6913  0.6890
Ionosphere  0.9113  0.9237  0.9259  0.8609  0.6820  0.8493  0.7395
landsat  0.6176  0.6184  0.5773  0.5497  0.5245  0.4833  0.3660
letter  0.8426  0.8417  0.8950  0.8872  0.6636  0.6318  0.4843
Lymphography  0.9988  0.9965  0.9988  0.9953  0.9656  0.9993  0.9977
magic.gamma  0.8205  0.8214  0.8323  0.6712  0.7252  0.7316  0.5947
mammography  0.8132  0.8301  0.8424  0.7398  0.8453  0.8592  0.8412
mnist  0.8335  0.8630  0.8041  0.6498  0.5366  0.7997  0.8204
musk  0.7583  0.9987  0.6604  0.4271  0.9997  0.9995  0.8094
optdigits  0.3912  0.5474  0.4189  0.5831  0.8248  0.6970  0.5336
Page Blocks  0.8889  0.8859  0.7813  0.7345  0.8154  0.8980  0.8903
pendigits  0.9174  0.9581  0.7127  0.4821  0.9214  0.9515  0.9354
Pima  0.7291  0.7224  0.7137  0.5978  0.6842  0.6803  0.6022
satellite  0.7449  0.7375  0.6489  0.5436  0.7122  0.7043  0.5972
satimage-2  0.9991  0.9991  0.9164  0.5514  0.9919  0.9935  0.9747
shuttle  0.9898  0.9442  0.6317  0.5239  0.9885  0.9968  0.9823
skin  0.7570  0.7177  0.5881  0.5756  0.7071  0.6664  0.4857
smtp  0.8506  0.8854  0.8953  0.9023  0.9203  0.9077  0.7674
Spam Base  0.5687  0.5663  0.4977  0.4581  0.6941  0.6212  0.5251
speech  0.4834  0.4810  0.4832  0.5067  0.4739  0.4648  0.4639
Stamps  0.8980  0.8594  0.8362  0.7269  0.8883  0.8911  0.8179
thyroid  0.9353  0.9470  0.9508  0.8075  0.9687  0.9771  0.8437
vertebral  0.3236  0.3663  0.3768  0.4208  0.2857  0.3515  0.3852
vowels  0.9489  0.9667  0.9797  0.9443  0.7817  0.7590  0.5507
Waveform  0.7783  0.7685  0.7457  0.7133  0.7263  0.7144  0.5393
WBC  0.9972  0.9930  0.9925  0.8399  0.9904  0.9959  0.9967
WDBC  0.9841  0.9773  0.9782  0.9796  0.9916  0.9850  0.9877
Wilt  0.3138  0.3545  0.4917  0.5394  0.5012  0.4477  0.3491
wine  0.8788  0.4277  0.4992  0.8756  0.8221  0.7987  0.6941
WPBC  0.5188  0.5101  0.5208  0.5184  0.5283  0.4942  0.4743
yeast  0.3717  0.3876  0.3936  0.4571  0.4019  0.3964  0.4141
Rank Sum  145  160  192  243  187  159  228
Num. No. 1  11  8  5  5  9  6  3

On the one hand, BRDAD outperforms distance-based methods such as DTM and k-NN in some datasets. For example, on the Satellite dataset, while DTM achieves a high AUC of 0.7375, BRDAD further improves it to 0.7449. Similarly, on the Internet Ads dataset, BRDAD achieves an AUC of 0.7274, outperforming k-NN's score of 0.7110.
On the other hand, BRDAD remains competitive even in datasets where distance-based methods perform poorly while forest-based methods excel, such as the Stamps and Wine datasets. On Stamps, DTM and k-NN achieve AUC scores of 0.8594 and 0.8362, respectively, whereas forest-based methods like PIDForest and iForest attain 0.8883 and 0.8911. Surprisingly, despite being a distance-based method, BRDAD surpasses them all with an AUC of 0.8980. Similarly, on the Wine dataset, PIDForest and iForest achieve AUC scores of 0.8221 and 0.7987, respectively, while distance-based methods DTM and k-NN perform significantly worse, with scores of 0.4277 and 0.4992. Remarkably, BRDAD not only outperforms other distance-based methods but also surpasses forest-based methods, achieving the highest AUC of 0.8788. These results empirically demonstrate BRDAD's superiority over both distance-based and forest-based anomaly detection algorithms.

5.2.4 Parameter Analysis

In this section, we analyze the impact of the number of bagging rounds B on BRDAD's performance using the ADBench datasets. Following the sample size categorization in Section 5.2.2, we classify the datasets into three groups: small datasets with n ∈ (0, 10000], medium datasets with n ∈ (10000, 50000], and large datasets with n ∈ (50000, +∞). We evaluate BRDAD with different values of B ∈ {1, 5, 10, 20} and record the rank sum for each setting.

Table 3: Experimental Comparisons for BRDAD with different B on ADBench Datasets
        B = 1  B = 5  B = 10  B = 20
Small   95     112    106     102
Medium  25     25     26      26
Large   26     25     25      26

Table 3 indicates that the effect of B on the rank sum metric varies with sample size. For small datasets, B = 1 significantly outperforms other choices, as bagging reduces the already limited sample size. In contrast, for medium and large datasets, a slightly larger B yields marginally better performance. This aligns with our theoretical findings in Theorem 3, which suggest that B should increase with sample size.
These results also validate the effectiveness of our chosen values of B in real-world datasets, as discussed in the previous subsection.

6. Proofs

In this section, we present proofs of the theoretical results in this paper. More precisely, we first provide proofs related to the surrogate risk in Section 6.1. The proofs related to the convergence rates of BRDDE and BRDAD are provided in Sections 6.2 and 6.3, respectively.

6.1 Proofs Related to the Surrogate Risk

In this section, we first provide proofs related to the error analysis of BWDDE in Section 6.1.1. Then in Section 6.1.2, we present the proof of Proposition 2.3 concerning the surrogate risk.

6.1.1 Proofs Related to Section 4.1

Before we proceed, we present Bernstein's inequality (Bernstein, 1946), which will be frequently applied within the subsequent proofs. This concentration inequality is extensively featured in the statistical learning literature, such as Massart (2007); Cucker and Zhou (2007); Steinwart and Christmann (2008); Cai et al. (2023).

Lemma 7 (Bernstein's inequality) Let B > 0 and σ > 0 be real numbers, and n ≥ 1 be an integer. Furthermore, let ξ1, . . . , ξn be independent random variables satisfying E_P ξi = 0, ∥ξi∥∞ ≤ B, and E_P ξi² ≤ σ² for all i = 1, . . . , n. Then for all τ > 0, we have

P( (1/n) Σ_{i=1}^n ξi ≥ √(2σ²τ/n) + 2Bτ/(3n) ) ≤ e^{−τ}.

Although Bernstein's inequality is a powerful tool for establishing finite-sample bounds for random variables, the resulting bounds may not be tight enough for random variables with vanishing variance. To address this limitation, we employ the sub-exponential tail bound, which provides sharper concentration results for such random variables. We begin by introducing the definition of sub-exponential random variables as stated in Wainwright (2019, Definition 2.7) as follows.

Definition 8 (Sub-exponential variables) A random variable X with mean µ = EX is sub-exponential if there exist non-negative parameters (ζ, α) such that

E[e^{λ(X−µ)}] ≤ e^{ζ²λ²/2} for all |λ| < 1/α.
This condition on the moment-generating function, together with the Chernoff technique, leads to concentration inequalities that bound the deviations of sub-exponential random variables. The following lemma gives this result (Wainwright, 2019, Proposition 2.9).

Lemma 9 (Sub-exponential tail bound) Suppose that X is sub-exponential with parameters (ζ, α). Then for all 0 ≤ τ ≤ ζ²/(2α²), we have

P( |X − µ| ≥ ζ√(2τ) ) ≤ 2e^{−τ}.

To measure the complexity of the functional space, we first recall the definition of the covering number in van der Vaart and Wellner (1996).

Definition 10 (Covering Number) Let (X, d) be a metric space and A ⊂ X. For ε > 0, the ε-covering number of A is defined as

N(A, d, ε) := min{ n ≥ 1 : ∃ x1, . . . , xn ∈ X such that A ⊂ ∪_{i=1}^n B(xi, ε) },

where B(x, ε) := { x' ∈ X : d(x, x') ≤ ε }.

The following lemma, which is taken from Hang et al. (2022) and needed in the proof of Lemma 12, provides the covering number of the indicator functions on the collection of balls in R^d.

Lemma 11 Let B := { B(x, r) : x ∈ R^d, r > 0 } and 1_B := { 1_B : B ∈ B }. Then for any ε ∈ (0, 1), there exists a universal constant C such that

N(1_B, L1(Q), ε) ≤ C(d + 2)(4e)^{d+2} ε^{−(d+1)}

holds for any probability measure Q.

The following lemma, which will be used several times in the sequel, provides a uniform bound on the distance between any point and its k-th nearest neighbor with high probability when the distribution has bounded support.

Lemma 12 Let Assumption 2 hold. Furthermore, let {D^b_s}_{b=1}^B be B disjoint subsets of size s randomly drawn from the data set D_n, where D^b_s = {X^b_1, . . . , X^b_s} and R^b_{s,(i)}(x) denotes the i-distance of x in the subset D^b_s for i ∈ [s]. Additionally, let the sequence c_n := 48(2d + 9) log n + 48 log 2 + 144. Suppose that s ≥ c'_1 := max{4e, d + 2, C}, where C is the constant specified in Lemma 11.
Then, there exist constants 0 < c'_2 < c'_3 such that, with probability P^{Bs} at least 1 − 1/n², for all x ∈ X, b ∈ [B], and c_n ≤ i ≤ s, the following inequalities hold:

c'_2 (i/s)^{1/d} ≤ R^b_{s,(i)}(x) ≤ c'_3 (i/s)^{1/d},   (21)
P(B(x, R^b_{s,(i)}(x))) ≍ i/s.   (22)

Proof [of Lemma 12] For x ∈ X and q ∈ [0, 1], under the continuity of the density function, as stated in Assumption 2, the intermediate value theorem guarantees the existence of a unique ρ_x(q) ≥ 0 such that P(B(x, ρ_x(q))) = q. Let τ := (2d + 9) log n + log 2 + 3 and c_n := 48τ = 48(2d + 9) log n + 48 log 2 + 144. Let i be an integer fixed in the sequel with i ≥ c_n, which ensures that i > 3τ. Accordingly, we consider the set

B⁻_i := { B(x, ρ_x((i − √(3τi))/s)) : x ∈ X } ⊂ B.

Lemma 11 implies that for any probability measure Q and ε ∈ (0, 1), there holds

N(1_{B⁻_i}, L1(Q), ε) ≤ N(1_B, L1(Q), ε) ≤ C(d + 2)(4e)^{d+2} ε^{−(d+1)}.   (23)

By the definition of the covering number, there exists an ε-net {A*_j}_{j=1}^J of 1_{B⁻_i} with J := ⌈C(d + 2)(4e)^{d+2} ε^{−(d+1)}⌉, and for any x ∈ X, there exists some j ∈ {1, . . . , J} such that

∥ 1_{B(x, ρ_x((i−√(3τi))/s))} − A*_j ∥_{L1(D^b_s)} ≤ ε.   (24)

Let j ∈ [J] and b ∈ [B] be fixed for now. For any ℓ ∈ [s], define the random variables ξ_{ℓ,b} := 1_{A*_j}(X^b_ℓ) − (i − √(3τi))/s. We have E_P ξ_{ℓ,b} = 0, ∥ξ_{ℓ,b}∥∞ ≤ 1, and E_P ξ²_{ℓ,b} ≤ (i − √(3τi))/s. Applying Bernstein's inequality in Lemma 7, we obtain

(1/s) Σ_{ℓ=1}^s 1_{A*_j}(X^b_ℓ) ≤ (i − √(3τi))/s + √(2τ(i − √(3τi)))/s + 2τ/(3s)   (25)

for A*_j with probability P^s at least 1 − e^{−τ}. The union bound then implies that this inequality holds for all A*_j, j = 1, . . . , J, with probability P^s at least 1 − Je^{−τ}. Combining this result with (24), we obtain the following bound:

(1/s) Σ_{ℓ=1}^s 1{ X^b_ℓ ∈ B(x, ρ_x((i − √(3τi))/s)) } ≤ (i − √(3τi))/s + √(2τ(i − √(3τi)))/s + 2τ/(3s) + ε   (26)

for all x ∈ X with probability P^s at least 1 − Je^{−τ}. Let ε = 1/s. Since s ≥ max{4e, d + 2, C}, it follows that

log J ≤ log C + log(d + 2) + (d + 2) log(4e) + (d + 1) log s ≤ (2d + 5) log s.

Since τ = (2d + 9) log n + log 2 + 3 and s ≤ n, we have the following lower bound on the probability: 1 − Je^{−τ} ≥ 1 − s^{2d+5}/(2e³n^{2d+9}) > 1 − 1/(2n⁴).
Therefore, for fixed c_n ≤ i ≤ s and b ∈ [B], inequality (26) holds for all x ∈ X with probability P^s at least 1 − 1/(2n⁴). Next, note that since τ > 3 and i ≥ 48τ, a straightforward calculation gives

√(2τ(i − √(3τi)))/s + 2τ/(3s) + ε ≤ √(2τi)/s + (2τ/3 + 1)/s ≤ √(2τi)/s + τ/s ≤ √(2τi)/s + √(τi/48)/s ≤ √(3τi)/s.   (27)

The first inequality holds trivially without requiring i ≥ 48τ. However, when considering the set B⁺_i := { B(x, ρ_x((i + √(3τi))/s)) : x ∈ X }, the condition i ≥ 48τ is necessary to establish i + √(3τi) ≤ 5i/4, allowing us to proceed with a similar argument. Combining (27) with (26), we get

(1/s) Σ_{ℓ=1}^s 1{ X^b_ℓ ∈ B(x, ρ_x((i − √(3τi))/s)) } ≤ i/s

for all x ∈ X with probability P^s at least 1 − 1/(2n⁴). By the definition of R^b_{s,(i)}(x), there holds

R^b_{s,(i)}(x) ≥ ρ_x((i − √(3τi))/s).   (28)

Now, let V_d denote the volume of the d-dimensional closed unit ball and µ denote the Lebesgue measure. By the definition of ρ_x and the condition f(x) ≤ c̄ for all x ∈ X, as stated in Assumption 2, we have

(i − √(3τi))/s = P(B(x, ρ_x((i − √(3τi))/s))) = ∫_{B(x, ρ_x((i−√(3τi))/s))} f(u) du ≤ c̄ µ(B(x, ρ_x((i − √(3τi))/s))) = c̄ V_d ρ_x((i − √(3τi))/s)^d.   (29)

Since i ≥ 48τ, we have (i − √(3τi))/s ≥ (i − √(3i²/48))/s = 3i/(4s). Combining this with inequality (29), we get c̄ V_d ρ_x((i − √(3τi))/s)^d ≥ 3i/(4s). Therefore, we obtain

ρ_x((i − √(3τi))/s) ≥ (3/(4V_d c̄))^{1/d} (i/s)^{1/d}.

Combining this with (28), we have R^b_{s,(i)}(x) ≥ ρ_x((i − √(3τi))/s) ≳ (i/s)^{1/d} for all x ∈ X and fixed i and b with probability P^s at least 1 − 1/(2n⁴). Therefore, by a union bound argument over i and b, for all i ≥ c_n, x ∈ X, and b ∈ [B], we have

R^b_{s,(i)}(x) ≥ ρ_x((i − √(3τi))/s) ≳ (i/s)^{1/d}   (30)

with probability P^{Bs} at least 1 − 1/(2n²). On the other hand, to obtain the upper bound for R^b_{s,(i)}(x), we consider the set B⁺_i = { B(x, ρ_x((i + √(3τi))/s)) : x ∈ X }. Using similar arguments, we can show that for all c_n ≤ i ≤ s, x ∈ X, and b ∈ [B], it holds that

R^b_{s,(i)}(x) ≤ ρ_x((i + √(3τi))/s) ≲ (i/s)^{1/d}   (31)

with probability P^{Bs} at least 1 − 1/(2n²). Combining (30) and (31), we obtain that (21) holds for all c_n ≤ i ≤ s, x ∈ X, and b ∈ [B] with probability P^{Bs} at least 1 − 1/n².
Furthermore, (30) and (31) imply

(i − √(3τi))/s = P(B(x, ρ_x((i − √(3τi))/s))) ≤ P(B(x, R^b_{s,(i)}(x))) ≤ P(B(x, ρ_x((i + √(3τi))/s))) = (i + √(3τi))/s.

Since √(3τi) ≤ √(3i²/48) = i/4 for c_n ≤ i ≤ s, we get 3i/(4s) ≤ P(B(x, R^b_{s,(i)}(x))) ≤ 5i/(4s), that is, P(B(x, R^b_{s,(i)}(x))) ≍ i/s. This completes the proof of Lemma 12.

The following lemma, which is needed in the proof of Lemma 17, shows that the probability P(B(x, R^b_{s,(i)}(x))) is a Lipschitz continuous function of x for fixed i ∈ [s] and b ∈ [B]. This lemma is necessary to establish the uniform concentration inequality in Lemma 17.

Lemma 13 Suppose that Assumption 2 holds. For i ∈ [s] and b ∈ [B], let R^b_{s,(i)}(x) denote the i-distance of x in the subset D^b_s = {X^b_1, . . . , X^b_s}. Then for any x, x' ∈ X and all b ∈ [B], we have

| P(B(x, R^b_{s,(i)}(x))) − P(B(x', R^b_{s,(i)}(x'))) | ≤ 2c̄ V_d 3^d d^{(d+1)/2} ∥x − x'∥₂,

where V_d denotes the volume of the d-dimensional closed unit ball.

Proof [of Lemma 13] We first show that R^b_{s,(i)}(x) is a Lipschitz continuous function of x. Let t = R^b_{s,(i)}(x). By the triangle inequality, the i nearest neighbors of x are at distance at most d(x, x') + t from x'. Therefore, there are at least i points at most this distance from x'; hence the i-th nearest neighbor of x' is also at distance less than d(x, x') + t. In other words, R^b_{s,(i)}(x') ≤ R^b_{s,(i)}(x) + d(x, x'). By symmetry, we also have R^b_{s,(i)}(x) ≤ R^b_{s,(i)}(x') + d(x, x'), implying that R^b_{s,(i)}(x) is a Lipschitz continuous function. Next, we aim to show the continuity of P(B(x, R^b_{s,(i)}(x))). For any x, x' ∈ X, given f ≤ c̄ from Assumption 2, we have

| P(B(x, R^b_{s,(i)}(x))) − P(B(x', R^b_{s,(i)}(x'))) | ≤ c̄ µ( B(x, R^b_{s,(i)}(x)) △ B(x', R^b_{s,(i)}(x')) ),   (32)

where the notation △ represents the symmetric difference between two sets, i.e., A △ B := (A \ B) ∪ (B \ A), and µ denotes the Lebesgue measure. Using geometric considerations, the Hausdorff distance between B(x, R^b_{s,(i)}(x)) and B(x', R^b_{s,(i)}(x')) can be bounded by δ := ∥x − x'∥₂ + |R^b_{s,(i)}(x) − R^b_{s,(i)}(x')|.
Therefore, we have

µ( B(x, R^b_{s,(i)}(x)) \ B(x', R^b_{s,(i)}(x')) ) ≤ V_d ( (R^b_{s,(i)}(x) + δ)^d − R^b_{s,(i)}(x)^d ).

Applying the Lagrange mean value theorem, there exists ξ ∈ [R^b_{s,(i)}(x), R^b_{s,(i)}(x) + δ] such that

(R^b_{s,(i)}(x) + δ)^d − R^b_{s,(i)}(x)^d = d ξ^{d−1} δ.

From the Lipschitz continuity of R^b_{s,(i)}(x) and the assumption X = [0, 1]^d, we know that δ ≤ 2∥x − x'∥₂ ≤ 2d^{1/2} and R^b_{s,(i)}(x) ≤ d^{1/2}. This implies ξ ≤ R^b_{s,(i)}(x) + δ ≤ 3d^{1/2}. Substituting this into the above inequalities, we get

µ( B(x, R^b_{s,(i)}(x)) \ B(x', R^b_{s,(i)}(x')) ) ≤ V_d d (3d^{1/2})^{d−1} δ.

Since δ ≤ 2∥x − x'∥₂, this yields µ( B(x, R^b_{s,(i)}(x)) \ B(x', R^b_{s,(i)}(x')) ) ≤ V_d 3^d d^{(d+1)/2} ∥x − x'∥₂. By symmetry, we can similarly bound µ( B(x', R^b_{s,(i)}(x')) \ B(x, R^b_{s,(i)}(x)) ). Combining these bounds, we obtain

µ( B(x, R^b_{s,(i)}(x)) △ B(x', R^b_{s,(i)}(x')) ) ≤ 2V_d 3^d d^{(d+1)/2} ∥x − x'∥₂.

Finally, substituting this into (32), we obtain the desired assertion.

By (20), the study of P(B(x, R^b_{s,(i)}(x))) reduces to analyzing U^b_{(i)}, where U^b_{(i)} ~ Beta(i, s + 1 − i). Consequently, term (II) in (17) represents a sum of sample mean deviations, where the sample refers to P(B(x, R^b_{s,(i)}(x)))^{1/d} across b ∈ [B], and our objective is to establish finite-sample bounds for each i. To achieve this, we leverage the sub-exponential property of the d-th root of the beta distribution. However, this problem is challenging due to the complexity of the integral in its moment-generating function. To the best of our knowledge, no existing results provide such bounds for this distribution. To address this gap, we introduce a new approach to derive an upper bound for the moment-generating function. Specifically, we establish upper bounds on the p-th absolute central moment of the beta distribution for p ≥ 1 in Lemma 14, following Skorski (2023). We then extend these results to its d-th root in Lemma 15, leading to the sub-exponential property established in Lemma 16. This property serves as the foundation for our finite-sample bounds in Lemma 17.

Lemma 14 Let s and i be integers with s ≥ 2i.
Let X ~ Beta(i, s + 1 − i) and µ = EX = i/(s + 1). Define

u := 2(s + 1 − 2i)/((s + 1)(s + 3)),  v := i(s + 1 − i)/((s + 1)²(s + 2)).   (33)

Then X is sub-exponential with parameters (2√v, 2u). Furthermore, for all p ≥ 1, we have E|X − µ|^p ≤ 4^{p+1}(p√v)^p.

Proof [of Lemma 14] We define the moment generating function φ_X(λ) := E[exp(λ(X − EX))] and its logarithm ψ(λ) := log φ_X(λ). Since s ≥ 2i, we have s + 1 − i ≥ i. Therefore, for the Beta distribution Beta(i, s + 1 − i) with v and u as defined above, the first case in Skorski (2023, Claim 43) provides the bound:

ψ(λ) ≤ −(v/u²)( uλ + log(1 − uλ) ),  0 ≤ λ < 1/u.   (34)

Define g(x) := log(1 − x) + x + 2x² for 0 ≤ x ≤ 1/2. Then we have g'(x) = x(3 − 4x)/(1 − x) > 0. Therefore, g(x) is increasing, and thus g(x) ≥ g(0) = 0 for 0 ≤ x ≤ 1/2. Hence, for 0 ≤ λ ≤ 1/(2u), we have −log(1 − uλ) ≤ uλ + 2u²λ². Substituting this into (34), we get

ψ(λ) ≤ 2vλ²,  0 ≤ λ ≤ 1/(2u).   (35)

Next, we extend the bound to λ < 0. Since X ~ Beta(i, s + 1 − i), it follows that 1 − X ~ Beta(s + 1 − i, i). Applying the second case in Skorski (2023, Claim 43), we have

ψ(λ) = log E[exp( (−λ)(1 − X − E[1 − X]) )] ≤ vλ²/2,  −1/(2u) ≤ λ < 0.

Combining this with (35), we conclude ψ(λ) ≤ 2vλ² for |λ| ≤ 1/(2u). Therefore, X is sub-exponential with parameters (2√v, 2u).

Next, we turn to bound the moments. It is clear that for x ≥ 0 and a > 0, we have e^{ax} ≥ ax. For x ≥ µ and p ≥ 1, this implies

x − µ ≤ 4p√v e^{(x−µ)/(4p√v)} ≤ 4p√v ( e^{(x−µ)/(4√v)} + e^{−(x−µ)/(4√v)} )^{1/p}.

By symmetry, this yields |x − µ|^p ≤ (4p√v)^p ( e^{(x−µ)/(4√v)} + e^{−(x−µ)/(4√v)} ). Given s ≥ 2i, we have

1/(4√v) = (s + 1)√(s + 2)/(4√(i(s + 1 − i))) ≤ (s + 1)(s + 2)/(4(s + 1 − i)) < (s + 1)(s + 3)/(4(s + 1 − 2i)) = 1/(2u).

Taking expectation with respect to X and using the sub-exponential property, we obtain

E[|X − µ|^p] ≤ (4p√v)^p ( E[e^{(X−µ)/(4√v)}] + E[e^{−(X−µ)/(4√v)}] ) ≤ 2e^{1/8}(4p√v)^p ≤ 4^{p+1}(p√v)^p.

This completes the proof.

Lemma 15 Let i, s, and d be integers with s ≥ 2i. Let X ~ Beta(i, s + 1 − i) and γ_{s,i} be defined as in (3). Then for all p ≥ 1, we have

E|X^{1/d} − γ_{s,i}|^p ≤ 3 ( 8p i^{−1/2+1/d} s^{−1/d} )^p.
Proof [of Lemma 15] For all x, y ∈ R and p ≥ 1, by Jensen's inequality, we have (|x| + |y|)^p ≤ 2^{p−1}(|x|^p + |y|^p). Therefore,

E|X^{1/d} − γ_{s,i}|^p ≤ 2^{p−1} E|X^{1/d} − (i/(s + 1))^{1/d}|^p + 2^{p−1} |(i/(s + 1))^{1/d} − γ_{s,i}|^p.   (36)

Bounding the first term: By equality (15), we have

|X − i/(s + 1)| = |X^{1/d} − (i/(s + 1))^{1/d}| Σ_{j=0}^{d−1} X^{j/d} (i/(s + 1))^{(d−1−j)/d} ≥ (i/(s + 1))^{(d−1)/d} |X^{1/d} − (i/(s + 1))^{1/d}|.

Therefore, we get

E|X^{1/d} − (i/(s + 1))^{1/d}|^p ≤ (i/(s + 1))^{−p(d−1)/d} E|X − i/(s + 1)|^p.   (37)

Lemma 14 implies that E|X − i/(s + 1)|^p ≤ 4^{p+1}(p√v)^p, where v is specified in (33). Since v < i/(s + 1)², combining this with (37) gives

2^{p−1} E|X^{1/d} − (i/(s + 1))^{1/d}|^p ≤ 2^{p−1} (i/(s + 1))^{−p(d−1)/d} 4^{p+1} p^p (i^{1/2}/(s + 1))^p = 2 ( 8p i^{−1/2+1/d} (s + 1)^{−1/d} )^p.

Bounding the second term: Using Gautschi's inequality (Gautschi, 1959) for Gamma functions, we have

(i + 1/d − 1)^{1/d} < Γ(i + 1/d)/Γ(i) < (i + 1/d)^{1/d}  and  (s + 1/d)^{1/d} < Γ(s + 1 + 1/d)/Γ(s + 1) < (s + 1 + 1/d)^{1/d}.

By the definition of γ_{s,i} in (3), we conclude that

( (i + 1/d − 1)/(s + 1 + 1/d) )^{1/d} < γ_{s,i} < ( (i + 1/d)/(s + 1/d) )^{1/d}.   (38)

Rewriting the left bound:

( (i + 1/d − 1)/(s + 1 + 1/d) )^{1/d} = ( i/(s + 1) )^{1/d} ( (1 + (1/d − 1)/i)/(1 + (1/d)/(s + 1)) )^{1/d}.

Since s ≥ 2i, the quantity x := ((1/d − 1)/i − (1/d)/(s + 1))/(1 + (1/d)/(s + 1)) satisfies |x| ≤ 1/i ≤ 1 − 1/(2d). Define g(x) := (1 + x)^{1/d} for x ∈ [−1 + 1/(2d), 1 − 1/(2d)]. Then its derivative satisfies g'(x) = (1 + x)^{1/d−1}/d ≤ 2^{1−1/d}/d^{1/d}? < 2. By the mean value theorem, we have |(1 + x)^{1/d} − 1| ≤ 2|x| for x ∈ [−1 + 1/(2d), 1 − 1/(2d)]. This implies

| ( (i + 1/d − 1)/(s + 1 + 1/d) )^{1/d} − ( i/(s + 1) )^{1/d} | ≤ 2 i^{−1+1/d} (s + 1)^{−1/d}.   (40)

Next, we rewrite the right bound in (38) as

( (i + 1/d)/(s + 1/d) )^{1/d} = ( i/(s + 1) )^{1/d} ( (1 + (1/d)/i)/(1 − (1 − 1/d)/(s + 1)) )^{1/d}.

For s ≥ 2i ≥ 2, we derive 1 − (1 − 1/d)/(s + 1) ≥ 1 − 1/(s + 1) = s/(s + 1) ≥ 2/3 and (1/d)/i + (1 − 1/d)/(s + 1) ≥ 0. Using the inequality (1 + x)^{1/d} ≤ 1 + x/d for x ≥ 0, we obtain

( (1 + (1/d)/i)/(1 − (1 − 1/d)/(s + 1)) )^{1/d} ≤ 1 + (1/d)( (1/d)/i + (1 − 1/d)/(s + 1) )/(1 − (1 − 1/d)/(s + 1)) ≤ 1 + 3(1/d + 1)/(4di) ≤ 1 + 2/i,

where the second inequality follows from 1 − (1 − 1/d)/(s + 1) ≥ 2/3 and s ≥ 2i.
Therefore, we obtain

( (i + 1/d)/(s + 1/d) )^{1/d} ≤ (1 + 2/i)( i/(s + 1) )^{1/d}.

Combining this with (40), we conclude that

|(i/(s + 1))^{1/d} − γ_{s,i}| ≤ 2 i^{−1+1/d}(s + 1)^{−1/d} ≤ 2 i^{−1/2+1/d}(s + 1)^{−1/d}.

Therefore, we have 2^{p−1} |(i/(s + 1))^{1/d} − γ_{s,i}|^p ≤ ( 8p i^{−1/2+1/d}(s + 1)^{−1/d} )^p. Finally, combining the bounds for both terms and using (36), we obtain

E|X^{1/d} − γ_{s,i}|^p ≤ 3 ( 8p i^{−1/2+1/d} s^{−1/d} )^p.

This concludes the proof.

Lemma 16 Let s and i be integers with s ≥ 2i. Let X ~ Beta(i, s + 1 − i). Then X^{1/d} is sub-exponential with parameters (16e√3 i^{−1/2+1/d} s^{−1/d}, 16e i^{−1/2+1/d} s^{−1/d}).

Proof [of Lemma 16] Define the moment generating function (MGF) φ(λ) := E[exp(λ(X^{1/d} − γ_{s,i}))] and its logarithm ψ(λ) := log φ(λ), where γ_{s,i} is defined in (3). To verify the sub-exponential property, we evaluate the MGF. Using the Taylor expansion of the exponential function, we write

E[exp(λ(X^{1/d} − γ_{s,i}))] = E[ 1 + λ(X^{1/d} − γ_{s,i}) + Σ_{p≥2} λ^p (X^{1/d} − γ_{s,i})^p / p! ].

Since E[X^{1/d} − γ_{s,i}] = 0, the linear term vanishes, and we bound the higher-order terms using Lemma 15 as follows:

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ 1 + Σ_{p≥2} |λ|^p E|X^{1/d} − γ_{s,i}|^p / p! ≤ 1 + 3 Σ_{p≥2} (8 i^{−1/2+1/d} s^{−1/d} |λ|)^p p^p / p!.

Using Stirling's approximation p! ≥ (p/e)^p for p ≥ 1, we obtain

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ 1 + 3 Σ_{p≥2} (8e i^{−1/2+1/d} s^{−1/d} |λ|)^p.

The geometric series sums to

Σ_{p≥2} (8e i^{−1/2+1/d} s^{−1/d} |λ|)^p = 64e² i^{−1+2/d} s^{−2/d} λ² / (1 − 8e i^{−1/2+1/d} s^{−1/d} |λ|).

For |λ| ≤ i^{1/2−1/d} s^{1/d}/(16e), we have 8e i^{−1/2+1/d} s^{−1/d} |λ| ≤ 1/2, ensuring convergence. Substituting this bound, we get

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ 1 + 384e² i^{−1+2/d} s^{−2/d} λ².

Since 1 + x ≤ e^x for x ∈ R, we get 1 + 384e² i^{−1+2/d} s^{−2/d} λ² ≤ exp(384e² i^{−1+2/d} s^{−2/d} λ²). Therefore, we have

E[exp(λ(X^{1/d} − γ_{s,i}))] ≤ exp(384e² i^{−1+2/d} s^{−2/d} λ²),  |λ| ≤ i^{1/2−1/d} s^{1/d}/(16e).

This establishes that X^{1/d} is sub-exponential with parameters ζ = 16e√3 i^{−1/2+1/d} s^{−1/d} and α = 16e i^{−1/2+1/d} s^{−1/d}. This completes the proof.

Lemma 17 Let Assumption 2 hold. Additionally, let {D^b_s}_{b=1}^B be B disjoint subsets of size s randomly drawn from the data D_n. Let R^b_{s,(i)}(x) be the i-distance of x in the subset D^b_s. Define k̄ as in Proposition 1.
Suppose that B ≥ 2(d² + 4)(log n)/3 and s ≥ 2k̄. Furthermore, assume that there exist constants C'_{n,i} such that w^b_i ≤ C'_{n,i} for all b ∈ [B] and i ∈ [s]. Then, there exists n₁ := 2d^d + 1 such that for all x ∈ X, 1 ≤ i ≤ k̄, and all n > n₁, the following statement holds with probability P^{Bs} at least 1 − 1/n²:

| (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((log n)/B).   (41)

Proof [of Lemma 17] We first prove that (41) holds for a fixed x ∈ X and a fixed 1 ≤ i ≤ k̄. For any b ∈ [B], by (20), we have

( P(B(x, R^b_{s,(1)}(x))), . . . , P(B(x, R^b_{s,(s)}(x))) ) =_D ( U^b_{(1)}, . . . , U^b_{(s)} ),

where U^b_{(i)} is the i-th order statistic of i.i.d. uniform [0, 1] random variables. By Biau and Devroye (2015, Corollary 1.2), U^b_{(i)} ~ Beta(i, s + 1 − i). Let ξ_b := P(B(x, R^b_{s,(i)}(x))). Since {D^b_s}_{b=1}^B are independent, {ξ_b}_{b=1}^B are independent random variables following Beta(i, s + 1 − i). The desired inequality (41) then reduces to a concentration inequality for (1/B) Σ_b w^b_i ξ_b^{1/d}. To prove this, we begin by showing that (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) is a sub-exponential random variable. Since the ξ_b are independent, we have

E exp( λ (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) ) = Π_{b=1}^B E exp( (λ w^b_i / B)(ξ_b^{1/d} − γ_{s,i}) ).

Since s ≥ 2k̄, it follows that s + 1 − i ≥ i for 1 ≤ i ≤ k̄. Given the condition w^b_i ≤ C'_{n,i}, we have |λ w^b_i / B| ≤ i^{1/2−1/d} s^{1/d}/(16e) for all |λ| ≤ i^{1/2−1/d} s^{1/d} B/(16e C'_{n,i}). By leveraging the sub-exponential property of ξ_b^{1/d} stated in Lemma 16, we obtain

E exp( λ (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) ) ≤ exp( 384e² C'²_{n,i} i^{−1+2/d} s^{−2/d} λ²/B ),  |λ| ≤ i^{1/2−1/d} s^{1/d} B/(16e C'_{n,i}).

This shows that (1/B) Σ_b w^b_i (ξ_b^{1/d} − γ_{s,i}) is sub-exponential. Applying the sub-exponential tail bound in Lemma 9, we have

P( | (1/B) Σ_{b=1}^B w^b_i (ξ_b^{1/d} − γ_{s,i}) | ≥ 16e√6 C'_{n,i} i^{−1/2+1/d} s^{−1/d} √(τ/B) ) ≤ 2e^{−τ}

for all 0 ≤ τ ≤ 3B/2. Since B ≥ 2(d² + 4)(log n)/3, it follows that 3B/2 ≥ (d² + 4) log n. Taking τ := (d² + 4) log n and replacing ξ_b with P(B(x, R^b_{s,(i)}(x))), for a fixed x ∈ X and a fixed 1 ≤ i ≤ k̄, we have

P( | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≥ 16e√6 C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((d² + 4)(log n)/B) ) ≤ 2/n^{d²+4}.   (42)

To extend the upper bound to all x ∈ X, consider a 1/n^d-net {z_j}_{j=1}^J of [0, 1]^d.
A 1/n^d-net is a finite subset of X such that for any x ∈ X, there exists z_j in the net with ∥x − z_j∥₂ ≤ 1/n^d. The construction of such a net can be done by placing grid points spaced 1/(d n^d) apart in each of the d dimensions. This results in at most d n^d grid points per dimension, and thus the total number of grid points satisfies J ≤ (d n^d)^d ≤ d^d n^{d²}. By (42), the bound holds for each z_j with j ∈ [J], and using the union bound, it holds for all j ∈ [J] with probability P^{Bs} at least 1 − 2d^d/n⁴. Since {z_j}_{j=1}^J is a 1/n^d-net, for any x ∈ X, there exists z_j such that ∥x − z_j∥₂ ≤ 1/n^d. By Lemma 13, we have

| P(B(x, R^b_{s,(i)}(x))) − P(B(z_j, R^b_{s,(i)}(z_j))) | ≲ 1/n^d

for all b ∈ [B]. Using the Minkowski inequality, this implies

| P(B(x, R^b_{s,(i)}(x)))^{1/d} − P(B(z_j, R^b_{s,(i)}(z_j)))^{1/d} | ≤ | P(B(x, R^b_{s,(i)}(x))) − P(B(z_j, R^b_{s,(i)}(z_j))) |^{1/d} ≲ 1/n.

Combining this with the triangle inequality and the error bound for z_j, we obtain

| (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≤ (1/B) Σ_{b=1}^B w^b_i | P(B(x, R^b_{s,(i)}(x)))^{1/d} − P(B(z_j, R^b_{s,(i)}(z_j)))^{1/d} | + | (1/B) Σ_{b=1}^B w^b_i ( P(B(z_j, R^b_{s,(i)}(z_j)))^{1/d} − γ_{s,i} ) | ≲ C'_{n,i} ( i^{−1/2+1/d} s^{−1/d} √((log n)/B) + 1/n )

for all x ∈ X and a fixed i with probability P^{Bs} at least 1 − 2d^d/n⁴. By applying the union bound over i, the same holds for all x ∈ X and 1 ≤ i ≤ k̄ with probability P^{Bs} at least 1 − 2d^d/n³. For n > n₁ := 2d^d + 1, we note that i ≤ s ≤ n, B ≤ n, and log n ≥ 1 imply i^{−1/2+1/d} s^{−1/d} √((log n)/B) ≥ 1/n for all 1 ≤ i ≤ k̄, so the 1/n term can be absorbed into the first term. Combining this with the previous inequality, we conclude that

| (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((log n)/B)

for all x ∈ X, 1 ≤ i ≤ k̄, and n > n₁, with probability P^{Bs} at least 1 − 2d^d/n³ ≥ 1 − 1/n². This completes the proof.

Proof [of Proposition 4] Let Ω₁ denote the event defined by (21) and (22) in Lemma 12, and let Ω₂ denote the event defined by (41) in Lemma 17. By applying the union bound, the event Ω₁ ∩ Ω₂ holds with probability P^{Bs} at least 1 − 2/n² for all n > n₁, where n₁ = 2d^d + 1. The condition Σ_{i=1}^s w^b_i i^{1/d} ≳ k_b^{1/d} in (iii) implies the existence of a constant c'₄ > 0 such that Σ_{i=1}^s w^b_i (i/s)^{1/d} ≥ c'₄ (k_b/s)^{1/d} for all b ∈ [B].
On the other hand, the condition Σ_{i=1}^{c_n} w^b_i ≲ (log n)/k_b in (i) and the bound k̲ ≥ (log n)² together imply the existence of n₂ ∈ N such that for all n ≥ n₂, we have Σ_{i=1}^{c_n} w^b_i ≤ c'₄/2 and k_b ≥ c_n for all b ∈ [B], where c_n is the sequence specified in Lemma 12. Hence, we consider the subsequent arguments under the assumptions that the event Ω₁ ∩ Ω₂ holds and that n > N₁ := max{n₁, n₂}.

Proof of Bounding (I). Let (IV) and (V) be defined as follows:

(IV) := Σ_{j=0}^{d−1} ( Σ_{i=1}^s w^b_i (i/s)^{1/d} )^j ( V_d^{1/d} f(x)^{1/d} R^B_n(x) )^{d−1−j}  and  (V) := V_d R^B_n(x)^d.

By the equality (14), in order to derive the upper bound of (I), it suffices to derive the upper bound of (IV) and the lower bound of (V). Let us first consider (IV). By condition (iii), we have Σ_{i=1}^s w^b_i (i/s)^{1/d} ≲ (k_b/s)^{1/d} ≤ (k̄/s)^{1/d} for b ∈ [B]. Consequently we get

( Σ_{i=1}^s w^b_i (i/s)^{1/d} )^j ≲ (k̄/s)^{j/d}.   (43)

On the event Ω₁, we have

R^{w,b}_s(x) = Σ_{i=1}^s w^b_i R^b_{s,(i)}(x) = Σ_{i=1}^{c_n} w^b_i R^b_{s,(i)}(x) + Σ_{i=c_n+1}^s w^b_i R^b_{s,(i)}(x) ≤ R^b_{s,(c_n)}(x) + Σ_{i=c_n+1}^s w^b_i R^b_{s,(i)}(x) ≲ (c_n/s)^{1/d} + Σ_{i=1}^s w^b_i (i/s)^{1/d} ≲ (k̄/s)^{1/d}

for all x ∈ X, where the last inequality follows from Σ_{i=1}^s w^b_i i^{1/d} ≲ k̄^{1/d} in condition (iii) and k̲ ≥ (log n)² in condition (i). Consequently we obtain

R^B_n(x) = (1/B) Σ_{b=1}^B R^{w,b}_s(x) ≲ (k̄/s)^{1/d}   (44)

for all x ∈ X. Combining this with (43) and f ≤ c̄ from Assumption 2, we derive

(IV) ≲ Σ_{j=0}^{d−1} (k̄/s)^{j/d} (k̄/s)^{(d−1−j)/d} ≲ (k̄/s)^{(d−1)/d}.   (45)

Next, let us consider (V). On the event Ω₁, for any b ∈ [B], we have

Σ_{i=c_n+1}^s w^b_i R^b_{s,(i)}(x) ≳ Σ_{i=c_n+1}^s w^b_i (i/s)^{1/d} = Σ_{i=1}^s w^b_i (i/s)^{1/d} − Σ_{i=1}^{c_n} w^b_i (i/s)^{1/d}

for all x ∈ X. From earlier arguments in the second paragraph of this proof, we have Σ_{i=1}^s w^b_i (i/s)^{1/d} ≥ c'₄ (k_b/s)^{1/d} and Σ_{i=1}^{c_n} w^b_i (i/s)^{1/d} ≤ (c_n/s)^{1/d} Σ_{i=1}^{c_n} w^b_i ≤ c'₄ (k_b/s)^{1/d}/2 for all n > N₁. Therefore, we have

Σ_{i=1}^s w^b_i (i/s)^{1/d} − Σ_{i=1}^{c_n} w^b_i (i/s)^{1/d} ≥ c'₄ (k_b/s)^{1/d}/2   (46)

for all x ∈ X. This implies

R^B_n(x) = (1/B) Σ_{b=1}^B R^{w,b}_s(x) ≳ (k̲/s)^{1/d}   (47)

for all x ∈ X. Therefore, we get (V) = V_d R^B_n(x)^d ≳ k̲/s. Combining with (45) and k̲ ≍ k̄ in condition (ii), we conclude that (I) = (IV)/(V) ≲ (k̄/s)^{−1/d}.
This completes the proof of bounding (I).

Proof of Bounding (II). Since w^b_i = 0 for all i > k̄ and b ∈ [B], it follows that

Σ_{i=1}^s | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | = Σ_{i=1}^{k̄} | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) |.

By applying (41) on the event Ω₂, for all x ∈ X, we have

Σ_{i=1}^{k̄} | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ Σ_{i=1}^{k̄} C'_{n,i} i^{−1/2+1/d} s^{−1/d} √((log n)/B).

Given Σ_{i=1}^s C'_{n,i} i^{1/d−1/2} ≲ k̄^{1/d−1/2} in condition (iv), we obtain

Σ_{i=1}^{k̄} | (1/B) Σ_{b=1}^B w^b_i ( P(B(x, R^b_{s,(i)}(x)))^{1/d} − γ_{s,i} ) | ≲ (k̄/s)^{1/d} ( (log n)/(k̄B) )^{1/2}.

This completes the proof of bounding (II).

Proof of Bounding (III). Let b ∈ [B] be fixed for now. We analyze two cases separately: 1 ≤ i < c_n and c_n ≤ i ≤ k_b. We begin with the case 1 ≤ i < c_n. On the event Ω₁, we have R^b_{s,(i)}(x) ≤ R^b_{s,(c_n)}(x) ≲ (c_n/s)^{1/d} and P(B(x, R^b_{s,(i)}(x))) ≤ P(B(x, R^b_{s,(c_n)}(x))) ≲ c_n/s for all x ∈ X by using (21) and (22). These bounds imply

| P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ (c_n/s)^{1/d} ≲ ((log n)/s)^{1/d}.

Therefore, we have

Σ_{i=1}^{c_n} w^b_i | P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ ((log n)/s)^{1/d} Σ_{i=1}^{c_n} w^b_i.

Using Σ_{i=1}^{c_n} w^b_i ≲ (log n)/k_b from condition (i) and k̲ ≍ k̄ from condition (ii), we obtain

Σ_{i=1}^{c_n} w^b_i | P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ (log n)^{1+1/d} / (s^{1/d} k̲)   (48)

for all x ∈ X. Now, we consider the case c_n ≤ i ≤ k_b. Using (15) and the condition f ≥ c from Assumption 2, we get

| P(B(x, R^b_{s,(i)}(x)))^{1/d} − V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) | ≲ | P(B(x, R^b_{s,(i)}(x))) − V_d f(x) R^b_{s,(i)}(x)^d | / Σ_{j=0}^{d−1} P(B(x, R^b_{s,(i)}(x)))^{j/d} R^b_{s,(i)}(x)^{d−1−j}   (49)

for all x ∈ X. Next, consider x such that B(x, R^b_{s,(k_b)}(x)) ⊂ [0, 1]^d for all b ∈ [B]. Using the Lipschitz smoothness from Assumption 2, we obtain

| P(B(x, R^b_{s,(i)}(x))) − V_d f(x) R^b_{s,(i)}(x)^d | = | ∫_{B(x, R^b_{s,(i)}(x))} f(y) dy − ∫_{B(x, R^b_{s,(i)}(x))} f(x) dy | ≤ ∫_{B(x, R^b_{s,(i)}(x))} |f(y) − f(x)| dy ≲ ∫_{B(x, R^b_{s,(i)}(x))} ∥y − x∥₂ dy ≲ R^b_{s,(i)}(x)^{d+1}.

On the event Ω₁, we have R^b_{s,(i)}(x) ≲ (i/s)^{1/d} for c_n ≤ i ≤ k_b. This implies

| P(B(x, R^b_{s,(i)}(x))) − V_d f(x) R^b_{s,(i)}(x)^d | ≲ (i/s)^{1+1/d}.   (50)

Moreover, using f ≥ c in Assumption 2 and R^b_{s,(i)}(x) ≳ (i/s)^{1/d} on the event Ω₁, we have P(B(x, R^b_{s,(i)}(x))) ≳ i/s for c_n ≤ i ≤ k_b.
Consequently, we obtain
$$\sum_{j=0}^{d-1} P(B(x, R^b_{s,(i)}(x)))^{j/d} R^b_{s,(i)}(x)^{d-1-j} \asymp \sum_{j=0}^{d-1} (i/s)^{j/d} (i/s)^{(d-1-j)/d} \asymp (i/s)^{(d-1)/d}.$$
This together with (49) and (50) implies
$$\big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim (i/s)^{2/d} \qquad (51)$$
for $c_n \le i \le k^b$. Therefore, using $\sum_{i=1}^{s} w^b_i i^{1/d} \lesssim (k^b)^{1/d}$ in condition (iii), we have
$$\sum_{i=c_n}^{s} w^b_i \big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim \sum_{i=c_n}^{s} w^b_i (i/s)^{2/d} \le (k/s)^{1/d} \sum_{i=1}^{s} w^b_i (i/s)^{1/d} \lesssim (k/s)^{1/d} (k^b/s)^{1/d} \lesssim (k/s)^{2/d}.$$
Combining this with (48), we obtain
$$\sum_{i=1}^{s} w^b_i \big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim \frac{(\log n)^{1+1/d}}{s^{1/d} k} + (k/s)^{2/d}$$
for all $x \in \mathcal{X}$ and a fixed $b \in [B]$. Averaging over $b \in [B]$, we have
$$\frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{s} w^b_i \big| P(B(x, R^b_{s,(i)}(x)))^{1/d} - V_d^{1/d} f(x)^{1/d} R^b_{s,(i)}(x) \big| \lesssim \frac{(\log n)^{1+1/d}}{s^{1/d} k} + (k/s)^{2/d}.$$
This completes the proof of bounding (III), and hence the proof of Proposition 4.

6.1.2 Proofs Related to Section 2.3

To prove Proposition 1, we require the following lemma, which establishes an upper bound on the number of instances near the boundary.

Lemma 18 Let $\{D^b_s\}_{b=1}^B$ be $B$ disjoint subsets of size $s$ randomly drawn from the data $D_n$, with $D^b_s = \{X^b_1, \ldots, X^b_s\}$ for $b \in [B]$. Define $\Delta_n := [c_3 (k/s)^{1/d},\, 1 - c_3 (k/s)^{1/d}]^d$, where $c_3$ is the constant from Lemma 12 and $k$ is as defined in Proposition 1. Assume that the condition $k \gtrsim (\log n)^2$ holds. Define $I^b := \{ i \in [s] : X^b_i \in \Delta_n \}$ and $n_b := |I^b|$. Then, for all $n > N_1$ with $N_1$ specified in Proposition 4, there holds $1 - n_b/s \lesssim (k/s)^{1/d}$ for all $b \in [B]$ with probability $P^{Bs}$ at least $1 - 1/n^2$.

Proof [of Lemma 18] Let $\Delta_n^c := [0,1]^d \setminus \Delta_n$ and let $b \in [B]$ be fixed. For $\ell \in [s]$, we define $\xi_{\ell,b} := \mathbf{1}_{\Delta_n^c}(X^b_\ell) - P(X \in \Delta_n^c)$. Then we have $\mathbb{E}_P \xi_{\ell,b} = 0$ and $\mathbb{E}_P [\xi_{\ell,b}]^2 \le P(X \in \Delta_n^c)$. Given $f \le \overline{c}$ in Assumption 2, we have $P(X \in \Delta_n^c) \le \overline{c}\, \mu(\Delta_n^c) \lesssim (k/s)^{1/d}$. Therefore, we have $\mathbb{E}_P [\xi_{\ell,b}]^2 \lesssim (k/s)^{1/d}$. Applying Bernstein's inequality in Lemma 7, we obtain
$$\frac{1}{s} \sum_{\ell=1}^{s} \mathbf{1}_{\Delta_n^c}(X^b_\ell) - P(X \in \Delta_n^c) \lesssim \sqrt{2 (k/s)^{1/d} \tau / s} + 2\tau/(3s)$$
with probability $P^s$ at least $1 - e^{-\tau}$.
Setting $\tau := 3 \log n$ and using $P(X \in \Delta_n^c) \lesssim (k/s)^{1/d}$, we obtain
$$\frac{1}{s} \sum_{\ell=1}^{s} \mathbf{1}_{\Delta_n^c}(X^b_\ell) \lesssim (k/s)^{1/d} + \sqrt{(k/s)^{1/d} (\log n)/s} + (\log n)/s$$
with probability $P^s$ at least $1 - 1/n^3$. By the definition of $N_1$ in Proposition 4, we have $k \ge c_n$ for all $n > N_1$, where $c_n$ is the sequence specified in Lemma 12. Therefore, we obtain $(k/s)^{1/d} \ge k/s \ge c_n/s$. This yields
$$(k/s)^{1/d} + \sqrt{(k/s)^{1/d} (\log n)/s} + (\log n)/s \lesssim (k/s)^{1/d}$$
with probability $P^s$ at least $1 - 1/n^3$. Using the union bound, this inequality holds for all $b \in [B]$ with probability $P^{Bs}$ at least $1 - 1/n^2$. This completes the proof.

Proof [of Proposition 1] Let $\Omega_1$ denote the event defined by (21) and (22) in Lemma 12. Furthermore, let $\Omega_3$ be the event defined by the inequalities for the upper bounds of (I), (II), and (III) in Proposition 4, and let $\Omega_4$ be the event defined by the inequality for the upper bound of $1 - n_b/s$ in Lemma 18. By applying the union bound to Lemma 12, Lemma 18, and Proposition 4, the event $\Omega_1 \cap \Omega_3 \cap \Omega_4$ holds with probability $P^{Bs}$ at least $1 - 1/n^2 - 1/n^2 - 2/n^2 = 1 - 4/n^2$ for all $n > N_1' := N_1$, where $N_1$ is specified in Proposition 4. The subsequent arguments assume that $\Omega_1 \cap \Omega_3 \cap \Omega_4$ holds and $n > N_1'$.

For $X^b_i$ satisfying $B(X^b_i, R^b_{s,(k^b)}(X^b_i)) \subset [0,1]^d$ for all $b \in [B]$, using the bounds of (I), (II), and (III) on the event $\Omega_3$ together with (19), we get
$$L(X^b_i, f^B_n) = |f^B_n(X^b_i) - f(X^b_i)| \lesssim ((\log n)/k)^{1+1/d} + ((\log n)/(k B))^{1/2} + (k/s)^{1/d}. \qquad (52)$$
The condition $B \lesssim (k/(\log n))^{1+2/d}$ implies that
$$((\log n)/k)^{1+1/d} \lesssim ((\log n)/(k B))^{1/2}. \qquad (53)$$
The conditions $\|w^b\|_2 \asymp (k^b)^{-1/2}$ for $b \in [B]$ and $k \lesssim k^b$ yield that $\|w^b\|_2 \gtrsim (k^b)^{-1/2} \gtrsim k^{-1/2}$. Combining this with the condition $\log s \asymp \log n$ in (ii), we obtain
$$((\log n)/(k B))^{1/2} \lesssim \sqrt{(\log s)/B}\, \|w^b\|_2. \qquad (54)$$
By (46) in the proof of Proposition 4 and the condition $k \lesssim k^b$ in (ii), we obtain that for all $n > N_1$,
$$(k/s)^{1/d} \lesssim (k^b/s)^{1/d} \lesssim R^{w,b}_s(X^b_i)$$
on the event $\Omega_1$. Combining this with (52), (53), and (54), it follows that
$$|f^B_n(X^b_i) - f(X^b_i)| \lesssim \sqrt{(\log s)/B}\, \|w^b\|_2 + R^{w,b}_s(X^b_i).$$
This completes the proof of (5).
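The bagged weighted k-distances $R^B_n(x) = \frac{1}{B}\sum_{b=1}^B \sum_{i=1}^s w^b_i R^b_{s,(i)}(x)$ analyzed throughout these proofs admit a direct computation. The following minimal sketch (not the authors' implementation; the brute-force distance computation and the function name are our assumptions, and the weights are supplied externally rather than by SRM) illustrates the quantity:

```python
import numpy as np

def bagged_weighted_kdistance(X, queries, weights, B, s, seed=0):
    """Bagged weighted k-distance R^B_n(x) = (1/B) sum_b sum_i w_i^b R^b_{s,(i)}(x),
    where R^b_{s,(i)}(x) is the distance from x to its i-th nearest neighbor
    in the b-th disjoint subsample of size s.  Larger scores indicate lower
    local density, i.e. more anomalous points."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))          # split D_n into B disjoint subsets of size s
    scores = np.zeros(len(queries))
    for b in range(B):
        Db = X[perm[b * s:(b + 1) * s]]
        w = np.asarray(weights[b])          # w_i^b >= 0, sums to 1, zero beyond k^b
        # sorted distances R^b_{s,(1)}(x) <= ... <= R^b_{s,(len(w))}(x)
        dists = np.linalg.norm(queries[:, None, :] - Db[None, :, :], axis=2)
        scores += np.sort(dists, axis=1)[:, :len(w)] @ w
    return scores / B
```

In practice the sorted-neighbor distances would be obtained from a k-d tree rather than a full distance matrix; the brute-force version above only serves to make the definition of $R^B_n$ concrete.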
Next, let us turn to the proof of (6). Let $\Delta_n$, $I^b$, and $n_b$ be defined as in Lemma 18. Then we have the decomposition
$$\mathcal{R}_{L, D^B_s}(f^B_n) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{s} \sum_{i=1}^{s} |f^B_n(X^b_i) - f(X^b_i)| = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{s} \bigg( \sum_{i \in I^b} |f^B_n(X^b_i) - f(X^b_i)| + \sum_{i \notin I^b} |f^B_n(X^b_i) - f(X^b_i)| \bigg). \qquad (55)$$
Let $b \in [B]$ be fixed for now. We consider the first term on the right-hand side of (55). For any $i \in I^b$, we have $X^b_i \in \Delta_n$. Consequently, for any $b \in [B]$ and $y \in B(X^b_i, R^b_{s,(k^b)}(X^b_i))$, we obtain $d(y, \mathbb{R}^d \setminus [0,1]^d) \ge c_3 (k/s)^{1/d} - R^b_{s,(k^b)}(X^b_i) \ge 0$ on the event $\Omega_1$. This implies that $y \in [0,1]^d$ and thus $B(X^b_i, R^b_{s,(k^b)}(X^b_i)) \subset [0,1]^d$ for all $i \in I^b$ and $b \in [B]$. Therefore, by (5), we have
$$\frac{1}{s} \sum_{i \in I^b} |f^B_n(X^b_i) - f(X^b_i)| \lesssim \frac{1}{s} \sum_{i \in I^b} \Big( \sqrt{(\log s)/B}\, \|w^b\|_2 + R^{w,b}_s(X^b_i) \Big) \le \sqrt{(\log s)/B}\, \|w^b\|_2 + \frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i). \qquad (56)$$
Now, we consider the second term on the right-hand side of (55). The condition $\sum_{i=1}^{s} i^{1/d} w^b_i \lesssim (k^b)^{1/d}$ for $b \in [B]$, together with (47) in the proof of Proposition 4, implies that for all $x \in [0,1]^d$, there holds
$$f^B_n(x) = \frac{1}{V_d R^B_n(x)^d} \bigg( \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{s} w^b_i (i/s)^{1/d} \bigg)^d \lesssim \frac{k/s}{R^B_n(x)^d} \lesssim 1$$
on the event $\Omega_1$. Combining this with $f \le \overline{c}$ in Assumption 2, on the event $\Omega_3$, we get
$$\sum_{i \notin I^b} |f^B_n(X^b_i) - f(X^b_i)| \lesssim \#\{ i \in [s] : X^b_i \in \Delta_n^c \} = s - n_b \lesssim s (k^b/s)^{1/d} \lesssim s (k/s)^{1/d}.$$
By (46) in the proof of Proposition 4, we have $(k/s)^{1/d} \lesssim R^{w,b}_s(X^b_i)$ for all $i \in [s]$ on the event $\Omega_1$. Consequently, we obtain
$$\frac{1}{s} \sum_{i \notin I^b} |f^B_n(X^b_i) - f(X^b_i)| \lesssim (k/s)^{1/d} \lesssim \frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i).$$
Combining this with (55) and (56), we obtain
$$\mathcal{R}_{L, D^B_s}(f^B_n) \lesssim \mathcal{R}^{\mathrm{sur}}_{L, D^B_s}(f^B_n) := \frac{1}{B} \sum_{b=1}^{B} \bigg( \sqrt{(\log s)/B}\, \|w^b\|_2 + \frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i) \bigg).$$
Since $\frac{1}{s} \sum_{i=1}^{s} R^{w,b}_s(X^b_i) = \sum_{i=1}^{s} w^b_i \overline{R}^b_{s,(i)}$, we obtain the desired assertion.

6.2 Proofs Related to the Convergence Rates of BRDDE

We present the proofs related to the results concerning the surrogate risk minimization in Section 6.2.1. Additionally, the proof of Theorem 2 is provided in Section 6.2.2.

6.2.1 Proofs Related to Section 4.2.1

The following lemma, which will be used several times in the sequel, supplies the key to the proof of Proposition 5.
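Lemma 19 below characterizes the SRM solution in closed form, following Anava and Levy (2016): the optimal weights decay linearly in the average $i$-distances up to a data-driven cutoff $k^{b,*}$, with the KKT conditions forcing $\sum_{i \le k^{b,*}} (\mu^b - \overline{R}^b_{s,(i)})^2 = (\log s)/B$. A minimal sketch of this computation (the function name and the nonnegativity guard on the discriminant are our assumptions, not the paper's code):

```python
import numpy as np

def srm_weights(R_sorted, lam2):
    """Solve  min_w  sqrt(lam2) * ||w||_2 + sum_i w_i * R_(i)  over the simplex.

    Closed form (Anava & Levy, 2016): w_i is proportional to (mu - R_(i))_+,
    where mu solves  sum_{i<=k} (mu - R_(i))^2 = lam2  and  R_(k) < mu <= R_(k+1).
    Here lam2 plays the role of (log s)/B in the paper's SRM objective."""
    R = np.asarray(R_sorted, dtype=float)   # sorted: R_(1) <= R_(2) <= ...
    s = len(R)
    S = Q = 0.0
    for k in range(1, s + 1):
        S += R[k - 1]                       # running sum of the k smallest distances
        Q += R[k - 1] ** 2                  # running sum of squares
        # k*mu^2 - 2*S*mu + (Q - lam2) = 0; take the larger root
        disc = S * S - k * (Q - lam2)
        mu = (S + np.sqrt(max(disc, 0.0))) / k
        if k == s or mu <= R[k]:            # stopping rule: mu <= R_(k+1)
            break
    w = np.maximum(mu - R, 0.0)             # linear decay, zero beyond the cutoff
    return w / w.sum(), k
```

For instance, with distances (0, 0, 10) and `lam2 = 1`, the third distance is too large to receive weight, and the first two share it equally.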
Cai, Yang, Ma, Hang Lemma 19 Let R b s,(i) be defined as in Proposition 1, and let wb, and kb, be defined as in Proposition 5. Then, for each b [B], there exists a constant µb > 0 such that R b s,(kb, ) < µb R b s,(kb, +1). (For simplicity, we set R b s,(i) = for all i > s.) Moreover, the optimal weights satisfy wb, i = µb R b s,(i) Pkb, i=1 µb R b s,(i) , 1 i kb, . (57) Moreover, the weights wb, i are bounded as follows: R b s,(kb, ) R b s,(i) Pkb, i=1 R b s,(kb, +1) R b s,(i) wb, i R b s,(kb, +1) R b s,(i) Pkb, i=1 R b s,(kb, ) R b s,(i) , 1 i kb, . (58) Additionally, the following inequality holds: R b s,(kb, ) R b s,(i) 2 < log s R b s,(kb, +1) R b s,(i) 2. (59) Proof [of Lemma 19] By Theorem 3.1 in Anava and Levy (2016), for each b [B], there exists a constant µb > 0 such that the weights satisfy the formula in (57). This inequality, together with (57), implies that the bound in (58) holds. Moreover, by (8), we have p (log s)/B wb, i / wb, 2 = µb + νb i R b s,(i). For 1 i kb, , we have wb, i > 0, this implies that νb i = 0 by the KKT condition. Therefore, we have p (log s)/B wb, i / wb, 2 = µb R b s,(i), 1 i kb, . Now, summing over i from 1 to kb, , we get µb R b s,(i) 2 = log s i=1 (wb, i )2/ wb, 2 2 = log s This, along with the inequality R b s,(kb, ) < µb R b s,(kb, +1), establishes (59). Proof [of Proposition 5] Let c 2 and c 3 be the constants defined in (21), and let {cn} be the sequence from Lemma 12. Define c 4 := (c 2/c 3)d < 1. Since s (n/ log n)(d+1)/(d+2) and B n1/(d+2)(log n)(d+1)/(d+2), there exists constants c 5, c 6, and c 7 such that c 5 s (n/ log n)(d+1)/(d+2) c 6 and B c 7n1/(d+2)(log n)(d+1)/(d+2). (61) The choice of s implies that log s log(n) log(log n) log n. Consequently, we have s2/dc 1 2/d n log s n 2(d+1) d(d+2) (log n) 4d+6 Bagged Regularized k-Distances for Anomaly Detection Using the order of B, we know that there exists n3 N such that for all n > n3, there holds s2/dc 1 2/d n (log s)/(32/d+1(c 3)2c 1 2/d 4 ) > B. 
(62) Furthermore, from the order of s and B, we see that there exists n4 N such that for all n > n4, the following three inequalities hold: c 5(n/ log n)(d+1)/(d+2) c 1, (63) c 7n1/(d+2)(log n)(d+1)/(d+2) 2(d2 + 4)(log n)/3, (64) c 5(n/ log n)d/(d+2) c 8 := 4(c 6) 2 d+2 c 4((c 3)2c 7) d 2+d 1/2 (1 t1/d)2 dt d d+2 , (65) where c 1 is the constant specified in Lemma 12. As we will demonstrate in the second part of this argument, the inequalities (63) and (65), together with the condition n > (c 6)2+d , guarantee that the lower bound for s is satisfied. Similarly, inequality (64) ensures that the lower bound for B holds. Additionally, by the divergence of the sequence cn, there exists n5 N such that for all n n5, we have 1/2 t1/d(1 t1/d) dt. (66) Let Ω1 denote the event defined by (21) and (22) in Lemma 12. By Lemma 12, the event Ω1 holds with probability PBs at least 1 1/n2. For the remainder of the proof, we assume that Ω1 holds and that n > N2 := max{n3, n4, n5, (c 6)2+d }. We proceed with the proof of statement 1. Verification of Condition (i): We first show that kb, 2cn/c 4 for all b [B] by contradiction. Suppose that kb, < 2cn/c 4 for some b [B]. Since c 4 < 1, it follows that cn/c 4 > cn > 2. Therefore, we have kb, + 1 < 2cn/c 4 + 1 2cn/c 4 + 2 3cn/c 4. This leads to the bound R b s,(kb, +1) c 3((kb, + 1)/s)1/d 31/dc 3c 1/d 4 (cn/s)1/d on the event Ω1. Therefore, we obtain R b s,(kb, +1) R b s,(i) 2 kb, R b s,(kb, +1) 2 32/d+1(c 3)2c 1 2/d 4 c1+2/d n s 2/d. Combining this with (62), we conclude that Pkb, i=1 R b s,(kb, +1) R b s,(i) 2 < (log s)/B. How- ever, by (59) in Lemma 19, we know that (log s)/B Pkb, i=1 R b s,(kb, +1) R b s,(i) 2. This leads to a contradiction, implying that kb, 2cn/c 4 > cn, b [B]. (67) Let b [B] be fixed. By Lemma 19, we have R b s,(kb, +1) R b s,(i) R b s,(kb, ) R b s,(i) . (68) Cai, Yang, Ma, Hang On the event Ω1, we have the following upper bound for the numerator: R b s,(kb, +1) R b s,(i) cn R b s,(kb, +1) cn(kb, /s)1/d. 
(69) Next, we establish the lower bound for Pkb, i=1 R b s,(kb, ) R b s,(i) . Since kb, > cn by (67), we have R b s,(kb, ) c 2 kb, /s 1/d = c 3 c 4kb, /s 1/d c 3 c 4kb, /s 1/d and R b s,(i) c 3(i/s)1/d for cn i s on the event Ω1. Therefore, we obtain R b s,(kb, ) R b s,(i) c 3 c 4kb, /s 1/d c 3(i/s)1/d. (70) for cn i s. Since c 4kb, 2cn by (67), we obtain R b s,(kb, ) R b s,(i) R b s,(kb, ) R b s,(i) (kb, )1/d+1 s1/d 1 c 4kb, Since g1(t) := 1 t1/d is a monotonically decreasing function for 0 t 1, we have R b s,(kb, ) R b s,(i) (kb, )1/d+1 cn/ c 4kb, g1(t) dt (kb, )1/d+1 1/2 g1(t) dt (kb, )1/d+1s 1/d. (71) Combining this with the upper bound in (69) and inequality (68), we get Pcn i=1 wb, i (log n)/kb, for b [B]. Hence, we verify condition (i) in Proposition 1. Verification of Condition (ii): Let b [B] be fixed. Since kb, > cn by (67), we have R b s,(kb, +1) R b s,(i) 2 R b s,(kb, +1) 2 (kb, )2/d+1s 2/d. on the event Ω1. Combining this with (59) in Lemma 19, we have (log s)/B (kb, )2/d+1s 2/d, which implies the lower bound kb, s2/(2+d)((log s)/B)d/(2+d). Combining this with the choice of s and B, we get the lower bound kb, (n/ log n)1/(d+2). Next, we derive the upper bound of kb, . Using the lower bound in (70) again, we have R b s,(kb, ) R b s,(i) 2 R b s,(kb, ) R b s,(i) 2 (c 3)2 c 4kb, X c 4kb, /s 1/d (i/s)1/d 2 Bagged Regularized k-Distances for Anomaly Detection = (c 3)2 c 4kb, 2/d+1 s2/d 1 c 4kb, Since c 4kb, 2cn > 144 by (67), we have c 4kb, c 4kb, /2. Therefore, we get R b s,(kb, ) R b s,(i) 2 (c 3)2(c 4)2/d+1 1/2 (1 t1/d)2 dt (kb, )2/d+1s 2/d. Combining this with (59) in Lemma 19, we obtain B > (c 3)2(c 4)2/d+1 1/2 (1 t1/d)2 dt (kb, )2/d+1s 2/d. (72) Since n N2 (c 6)2+d, using the choice of s in (61), we get log s log c 6 + d + 1 d + 2 log n log n. 
Using this bound with the choices of B and s in (61), and the inequality (72), we derive kb, 2 c 4(c 3)2d/(2+d) 1/2 (1 t1/d)2 dt d d+2 log s d d+2 s 2 d+2 2(c 6) 2 d+2 c 4((c 3)2c 7) d d+2 1/2 (1 t1/d)2 dt d d+2 n log n 1 d+2 = c 8 2 1 d+2 . (73) Combining the lower and upper bounds of kb, , we conclude kb, (n/ log n)1/(d+2), b [B]. (74) The inequalities (61) and (63) ensure that s c 1 for all n > N2. Combining the upper bound of kb, in (73) with the choice of s in (61) and the inequality (65) yields s 2k for all n > N2. Therefore, we have s max{c 1, 2k} for all n > N2 and it is straightforward to verify that log s log n. Additionally, the choice of B in (61) combined with (64) ensures that B 2(d2 + 4)(log n)/3 for all n > N2. Moreover, using (74) and the choices of B and s, we obtain B (k/(log n))1+2/d. This complete the verification of condition (ii) in Proposition 1. Verification of Condition (iii): Fix b [B]. By inequality (58) in Lemma 19, we have Pkb, i=1 i1/d R b s,(kb, ) R b s,(i) R b s,(kb, +1) R b s,(i) i=1 i1/dwb, i Pkb, i=1 i1/d R b s,(kb, +1) R b s,(i) R b s,(kb, ) R b s,(i) . (75) We first evaluate the numerator on the left-hand side. Since c 4 < 1 and c 4kb, 2cn by (67), we have R b s,(kb, ) R b s,(i) R b s,(kb, ) R b s,(i) Cai, Yang, Ma, Hang Applying inequality (70), we obtain R b s,(kb, ) R b s,(i) s 1/d c 4kb, X i=cn i1/d c 4kb, 1/d i1/d s 1/d(kb, )2/d+1 1 c 4kb, 1/d 1 i c 4kb, Define g2(t) := t1/d(1 t1/d) for t [0, 1]. Since g2(t) 1 for t [0, 1], we obtain the lower bound R b s,(kb, ) R b s,(i) (kb, )2/d+1 (cn 1)/ c 4kb, g2(t) dt 2 c 4kb, From (66) and (67), we have c 4kb, 1 1/(2cn) R 1 1/2 g2(t) dt/4 and (cn 1)/ c 4kb, 1/2 for all n > N2. Therefore, we obtain the following lower bound: R b s,(kb, ) R b s,(i) (kb, )2/d+1 1/2 g2(t) dt (kb, )2/d+1 On the other hand, on the event Ω1, since kb, > cn by (67), we have the upper bound: R b s,(kb, +1) R b s,(i) kb, R b s,(kb, +1) (kb, )1+1/ds 1/d. 
Combining these bounds with (75), we obtain Pkb, i=1 i1/dwb, i (kb, )1/d. By similar arguments, we can also derive the upper bound Pkb, i=1 i1/dwb, i (kb, )1/d from the right-hand term of inequality (75). Hence, we verify the first part of condition (iii). Using the formula for wb, i in (57), along with the equality (60), and noting that R b s,(kb, ) < µb R b s,(kb, +1) from Lemma 19, we obtain q Pkb, i=1 µb R b s,(i) 2 Pkb, i=1 µb R b s,(i) = ((log s)/B)1/2 Pkb, i=1 µb R b s,(i) < ((log s)/B)1/2 Pkb, i=1(R b s,(kb, ) R b s,(i)) . Combining this with (71), the choice of B and s, and the order of kb, in (74), we derive wb, 2 ((log s)/B)1/2/ (kb, )1/d+1s 1/d (kb, ) 1/2. Finally, using the Cauchy Schwarz inequality, we obtain 1 = wb, 2 1 kb, wb, 2 2. This implies the lower bound wb, 2 (kb, ) 1/2. Therefore, combining the upper and lower bounds for wb, 2, we verify the second part of condition (iii). This completes the verification of Condition (iii) in Proposition 1. Bagged Regularized k-Distances for Anomaly Detection Verification of Condition (iv): Let c 3 be the constant specified in Lemma 12, and let c 8 be the constant defined in (65). We define hn := c 8(n/(log n))1/(d+2)/2 + 1, and define the following sequence ( c 3(hn/s)1/d, 1 i (hn s); 0, otherwise. By (73), we have kb, + 1 c 8(n/ log n)1/(d+2)/2 + 1 hn for all b [B]. Therefore, for 1 i kb, , we have i (hn s), and thus Cn,i = c 3(hn/s)1/d. By (67), we have kb, > cn for all b [B]. Therefore, on the event Ω1, we have Cn,i = c 3(hn/s)1/d c 3((kb, + 1)/s)1/d R b s,(kb, +1) (77) for 1 i kb, and b [B]. Using the expression for kb, in (74), the choice of s, and the inequality (71), we derive R b s,(kb, ) R b s,(i) n log n d+2 n log n Consequently, using (58) from Lemma 19 and inequality (77), we obtain R b s,(kb, +1) R b s,(i) Pkb, i=1 R b s,(kb, ) R b s,(i) R b s,(kb, +1) R b s,(i) R b s,(kb, +1) Cn,i for 1 i kb, and b [B]. Since wb, i = 0 for i > kb, and b [B], we conclude that wb, i Cn,i for all b [B] and i [s]. 
We now analyze the summation $\sum_{i=1}^{s} i^{1/d - 1/2} C_{n,i}$. Since $\lceil h_n \rceil \wedge s \le h_n$, we have
$$\sum_{i=1}^{s} i^{1/d - 1/2} C_{n,i} = c_3 \sum_{i=1}^{\lceil h_n \rceil \wedge s} i^{1/d - 1/2} (h_n/s)^{1/d} \lesssim h_n^{1/d + 1/2} (h_n/s)^{1/d} \lesssim (n/\log n)^{\frac{2-d}{2d(d+2)}}.$$
On the other hand, from (74), we know that $k \lesssim (n/\log n)^{1/(d+2)}$, which implies $k^{1/d - 1/2} \gtrsim (n/\log n)^{\frac{2-d}{2d(d+2)}}$. Therefore, we conclude that $\sum_{i=1}^{s} i^{1/d - 1/2} C_{n,i} \lesssim k^{1/d - 1/2}$. This completes the verification of condition (iv) in Proposition 1. Finally, statement 2 has been verified in (74), completing the proof of Proposition 5.

Within the same theoretical framework as the preceding proof, the following lemma establishes a finite sample bound on the number of nonzero weights returned by SRM without bagging. This result is essential for the comparison of search complexity in Section 4.3. Following the approach in Section 2.3, we randomly partition the data into two disjoint subsets: one for weight selection and the other for computing k-distances. For convenience, we assume without loss of generality that $n$ is an odd number in the following lemma.

Lemma 20 Let $D_s$ be a subset of size $s = n/2$ randomly drawn from the dataset $D_n$. Define $\overline{R}_{s,(i)}$ as the average $i$-distance for any integer $i \le s$ on the subset $D_s$, following the definition in Proposition 1. Furthermore, let $w^*$ be the solution to the following SRM problem:
$$w^* := \arg\min_{w \in \mathcal{W}_s} \sqrt{\log s}\, \|w\|_2 + \sum_{i=1}^{s} w_i \overline{R}_{s,(i)},$$
and define $k^* := \sup\{ i \in [s] : w^*_i \ne 0 \}$. Then, there exists an integer $N_3 \in \mathbb{N}$ such that for all $n > N_3$, with probability $P^s$ at least $1 - 1/n^2$, we have $k^* \lesssim n^{2/(d+2)} (\log n)^{d/(d+2)}$.

Proof [of Lemma 20] Let $c_3$ denote the constant specified in Lemma 12, $c_4$ denote the constant introduced in the proof of Proposition 5, and $\{c_n\}$ denote the sequence defined in Lemma 12. By the definition of $c_n$, there exists $n_6 \in \mathbb{N}$ such that for all $n \ge n_6$, we have
$$(n/2)^{2/d} c_n^{-1-2/d} \log(n/2) / \big( 3^{2/d+1} c_3^2 c_4^{-1-2/d} \big) > 1. \qquad (78)$$
Define $N_3 := \max\{ 2 c_1 + 2, n_6 \}$, where $c_1$ is as specified in Lemma 12. Let $\Omega_1$ denote the event defined by (21) and (22) in Lemma 12 with $B = 1$ and $s = n/2$.
By Lemma 12, for all n > N3, we have s c 1, ensuring that the event Ω1 holds with probability Ps at least 1 1/n2. In the subsequent argument, we assume that Ω1 holds and n N3. Since (78) is analogous to (62) in the proof of Proposition 5, a similar argument as in the Verification of Condition (i) part of that proof shows that k 2cn/c 4 by contradiction. Following the reasoning in the Verification of Condition (ii) part, we obtain log s (k )2/d+1s 2/d. Substituting s = n/2 into the above inequality yields k n2/(d+2)(log n)d/(d+2). This completes the proof of Lemma 20. 6.2.2 Proofs Related to Section 3.2 In this section, we present the proof related to BRDDE. The weights wb, are derived using SRM based on the data for BRDDE, whereas Proposition 4 assumes that the weights are fixed and independent of the data. As a result, Proposition 4 cannot be directly applied to establish the convergence rate of our density estimator. However, Proposition 5 ensures that the weights returned by SRM satisfy the required conditions in Proposition 1 with high probability. Therefore, by making slight modifications to the proof of Proposition 4, we can establish the error decomposition of BRDDE as stated in Proposition 21. Before proceeding, we introduce additional notation. Regarding the expression of f B, (x) in (11), we define the error term (I ), (II ), and (III ) as follows, corresponding to (I), (II), and (III) in (14), (17), and (18): Vd RB, n (x)d i=1 wb, i γs,i j V 1/d d f(x)1/d RB, n (x) d 1 j. (79) b=1 wb, i γs,i P(B(x, e Rb s,(i)(x)))1/d , (80) b=1 wb, i P(B(x, e Rb s,(i)(x)))1/d V 1/d d f(x)1/d e Rb s,(i)(x) . (81) Bagged Regularized k-Distances for Anomaly Detection Following a similar derivation as in (19), we obtain the error decomposition: f B, n (x) f(x) (I ) (II ) + (I ) (III ). (82) Proposition 21 Let Assumption 2 hold. Let (I ), (II ), and (III ) be defined in (79), (80), and (81), respectively. 
Then, there exists an integer N4 N such that for all n > N4 and any x satisfying B(x, e Rb s,(kb, )(x)) [0, 1]d for all b [B], with probability Pn at least 1 2/n2, (I ), (II ), and (III ) have upper bounds of the same asymptotic order as (I), (II), and (III) in Proposition 4. Proof [of Proposition 21] Let Ω5 denote the event defined by the statements in Proposition 5. Applying Lemma 12 to the subset { e Db s}B b=1, we obtain that, with probability at least 1 1/n2, the following holds: c 2(i/s)1/d e Rb s,(i)(x) c 3(i/s)1/d and P(B(x, e Rb s,(i)(x))) i/s for all x X, b [B], and cn i s. We define this event as Ω6. Since the datasets Db s and e Db s are independent for b [B] and wb, is the solution to the SRM in (7), we have E wb, i ((P(B(x, e Rb s,(i)(x))))1/d γs,i) Ω5 = E wb, i Ω5 E (P(B(x, e Rb s,(i)(x))))1/d γs,i = 0 for a fixed x X, where the last equality follows from (20) and the expectation is taken with respect to the empirical measure e Db s, conditional on the event Ω5. Following the proof of Lemma 17 and conditioning on the event Ω5, for all n > n1 = 2dd + 1, we have sup x X,i [ k ] b=1 wb, i P(B(x, e Rb s,(i)(x)))1/d γs,i Cn,i Since Proposition 5 ensures that P(Ω5) 1 1/n2, applying the conditional probability formula yields sup x X,i [ k ] b=1 wb, i P(B(x, e Rb s,(i)(x)))1/d γs,i Cn,i for all n > n1 N2, where N2 is the integer specified in Proposition 5. Denote this event as Ω7. By the union bound, the event Ω5 Ω6 Ω7 holds with probability Pn at least 1 4/n2 for all n > n1 N2. Since the events Ω6 and Ω7 correspond to the events Ω1 and Ω2 in the proof of Proposition 4, respectively, and given that conditions (i) (iv) in Proposition 1 hold for wb, and kb, on Ω5, there exists N4 N such that the upper bound for (I ), (II ), and (III ) follow for all n > N4 by similar arguments as in the proof of Proposition 4. (Note that N4 may differ N1 in that proposition.) The details are omitted. 
Proof [of Theorem 2] Let Ω5, Ω6, and Ω7 be the events defined in the proof of Proposition 21. Following the arguments therein, the event Ω5 Ω6 Ω7 holds with probability Pn at least 1 4/n2 for all n N4. For the remainder of the proof, we assume that the event Ω5 Ω6 Ω7 holds and that n > N 2 := N4. Cai, Yang, Ma, Hang From (73) in the proof of Proposition 5, we have k c 8(n/ log n)1/(d+2)/2. Define n := [c 3(c 8(n/ log n)1/(d+2)/(2s))1/d, 1 c 3(c 8(n/ log n)1/(d+2)/(2s))1/d]d. On the event Ω6, for all x n and any b [B], we have e Rb s,(kb, )(x) c 3(kb, /s)1/d c 3(c 8(n/ log n)1/(d+2)/(2s))1/d. Thus, for any y B x, e Rb s,(kb, )(x) , we have d(y, Rd \ [0, 1]d) c 3(c 8(n/ log n)1/(d+2)/(2s))1/d e Rb s,(kb, )(x) 0. This implies that B x, e Rb s,(kb, )(x) [0, 1]d for all x n and b [B]. Therefore, from Proposition 21 and inequality (82), we obtain f B, n (x) f(x) ((log n)/k)1+1/d + (log n)/(k B) 1/2 + (k/s)1/d, x n. Using kb, (n/ log n)1/(d+2) from Proposition 5 and substituting the choices of B and s, we obtain f B, n (x) f(x) n 1/(d+2)(log n)(d+3)/(d+2), x n Integrating over n, we get Z n |f B, n (x) f(x)| dx n 1/(d+2)(log n)(d+3)/(d+2). (83) On the other hand, on the event Ω5, condition (iii) in Proposition 1 holds for wb, and kb, , which implies that Ps i=1 wb, i i1/d (kb, )1/d for all b [B]. Since γs,i < ((i+1/d)/(s+ 1/d))1/d (2i/s)1/d by (38), we have i=1 wb, i γs,i 1 i=1 wb, i (i/s)1/d 1 b=1 (kb, /s)1/d (k/s)1/d. Following similar arguments as in (47) from Proposition 4, we obtain RB, n (x) (k/s)1/d for all x X on the event Ω5 Ω6. This implies f B, n (x) = 1 Vd RB, n (x)d i=1 wb, i γs,i d k/k 1, x X. Since f c from Assumption 2, we conclude that f B, n f is bounded by a constant. Therefore, we have Z X\ n |f B, n (x) f(x)|dx µ(X \ n) ((n/ log n)1/(d+2)/s)1/d n 1/(d+2)(log n)(d+3)/(d+2). Finally, combining this with (83), we have Z f B, n (x) f(x) dx = Z f B, n (x) f(x) dx n 1/(d+2)(log n)(d+3)/(d+2), which completes the proof. 
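The AUC analysis in the next subsection uses the convention that a tied pair (one normal and one anomalous point with equal scores) contributes 1/2. As a concrete illustration, the empirical counterpart of this definition can be computed directly (a hypothetical helper written for clarity in $O(n_+ n_-)$ time, not the paper's evaluation code):

```python
import numpy as np

def auc_with_ties(scores, labels):
    """Empirical AUC: P(score of anomaly > score of normal) over all
    anomaly/normal pairs, with tied pairs counted as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]               # anomalous instances
    neg = scores[labels == 0]               # normal instances
    gt = np.sum(pos[:, None] > neg[None, :])
    eq = np.sum(pos[:, None] == neg[None, :])
    return (gt + 0.5 * eq) / (len(pos) * len(neg))
```

A perfect ranking yields 1, a random one 1/2, which matches the population quantity whose regret is bounded below.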
Bagged Regularized k-Distances for Anomaly Detection 6.3 Proofs Related to the Convergence Rates of BRDAD In this subsection, we first present the proofs for learning the AUC regret in Section 6.3.1. Then we provide the proof of Theorem 3 in Section 6.3.2. 6.3.1 Proofs Related to Section 4.2.2 Proof [of Proposition 6] Under the Huber contamination model in Assumption 1, let η(x) be defined as in (12), and define bη(x) = Π f B, n (x) 1. By the expression of f B, n (x) in (11), it follows that 1{RB, n (X) RB, n (X ) > 0} = 1{bη(X) bη(X ) > 0} and 1{RB, n (X) = RB, n (X )} = 1{bη(X) = bη(X )}. Consequently, we obtain AUC(RB, n ) = E 1{(Y Y )(RB, n (X) RB, n (X ) > 0)} + 1{RB, n (X) = RB, n (X )}/2|Y = Y = E 1{(Y Y )(bη(X) bη(X ) > 0)} + 1{bη(X) = bη(X )}/2|Y = Y = AUC(bη). Therefore, we have Reg AUC(RB, n ) = Reg AUC(bη). Applying Agarwal (2013, Corollary 11), we obtain Reg AUC(RB, n ) = Reg AUC(bη) 1 Π(1 Π) X |bη(x) η(x)|d PX(x). (84) From Assumption 2, we have f c, and given that f B, n c, it follows that |bη(x) η(x)| = Π f B, n (x) f(x) f B, n (x)f(x) f B, n (x) f(x) . Combining this with (84) and the condition f c from Assumption 2, we establish the desired result. 6.3.2 Proofs Related to Section 3.3 Proof [of Theorem 3] Let Ω5, Ω6, and Ω7 be the events defined in the proof of Proposition 21. Following the arguments therein, the event Ω5 Ω6 Ω7 holds with probability Pn at least 1 4/n2 for all n N4. For the subsequent arguments, we assume that Ω5 Ω6 Ω7 holds and that n > N 2 = N4. Let {cn} denote the sequence from Lemma 12. On the event Ω6, for all x X, we have Rb, s (x) = i=1 wb, i e Rb s,(i)(x) = i=1 wb, i e Rb s,(i)(x) + i=cn+1 wb, i e Rb s,(i)(x) e Rb, s,(cn)(x) + i=cn+1 wb, i e Rb s,(i)(x) (cn/s)1/d + i=1 wb, i (i/s)1/d, By Proposition 5, on the event Ω5, we have Ps i=1 wb, i i1/d k 1/d and k (log n)2. This implies that Rb, s (x) (cn/s)1/d + (k/s)1/d (k/s)1/d. 
Averaging over $b \in [B]$, we obtain
$$R^{B,*}_n(x) = \frac{1}{B} \sum_{b=1}^{B} R^{b,*}_s(x) \lesssim (k/s)^{1/d}. \qquad (85)$$
On the other hand, using (38) together with the inequality $(i + 1/d - 1)/(s + 1 + 1/d) \ge (i-1)/(s+2) \ge (i-1)/(2s)$, we have
$$\sum_{i=1}^{s} w^{b,*}_i \gamma_{s,i} \gtrsim \sum_{i=1}^{s} w^{b,*}_i ((i-1)/s)^{1/d}.$$
Applying Proposition 5 again, we get $\sum_{i=1}^{s} w^{b,*}_i i^{1/d} \gtrsim k^{1/d}$. Averaging over $b \in [B]$, we have
$$\frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{s} w^{b,*}_i \gamma_{s,i} \gtrsim (k/s)^{1/d}.$$
Combining this with (85), we conclude that $f^{B,*}_n(x)$ is lower bounded by a constant for $x \in \mathcal{X}$. Consequently, by Theorem 2 and Proposition 6, we obtain the desired assertion.

7. Conclusion

In this paper, we propose a distance-based algorithm, Bagged Regularized k-Distances for Anomaly Detection (BRDAD), to address challenges in unsupervised anomaly detection. BRDAD mitigates the sensitivity of hyperparameter selection by formulating the problem as a convex optimization task and incorporates bagging to enhance computational efficiency. From a theoretical perspective, we establish fast convergence rates for the AUC regret of BRDAD and show that the bagging technique substantially reduces computational complexity. As a by-product, we derive optimal convergence rates for the L1-error of Bagged Regularized k-Distances for Density Estimation (BRDDE), which shares the same weights as BRDAD, further validating the effectiveness of the Surrogate Risk Minimization (SRM) framework for density estimation. On the experimental side, BRDAD is evaluated against distance-based, forest-based, and kernel-based methods on various anomaly detection benchmarks, demonstrating superior performance. Additionally, parameter analysis reveals that choosing an appropriate number of bagging rounds improves performance, making the method well-suited for practical applications.

Acknowledgments

The authors would like to thank the reviewers and the action editor for their help and advice, which led to a significant improvement of the article. Hanfang Yang and Yuheng Ma are corresponding authors.
The research is supported by the Special Funds of the National Natural Science Foundation of China (Grant No. 72342010). Yuheng Ma is supported by the Outstanding Innovative Talents Cultivation Funded Programs 2024 of Renmin University of China. This research is also supported by Public Computing Cloud, Renmin University of China.

References

Shivani Agarwal. Surrogate regret bounds for the area under the ROC curve via strongly proper losses. In Conference on Learning Theory, pages 338-353. PMLR, 2013.

Charu C. Aggarwal. Applications of outlier analysis. In Outlier Analysis, pages 373-400. Springer, 2012.

Charu C. Aggarwal. An introduction to outlier analysis. In Outlier Analysis, pages 1-34. Springer, 2016a.

Charu C. Aggarwal. Outlier ensembles. In Outlier Analysis, pages 185-218. Springer, 2016b.

Charu C. Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24-47, 2015.

Samet Akcay, Amir Atapour-Abarghouei, and Toby P. Breckon. GANomaly: Semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision, pages 622-637. Springer, 2018.

Oren Anava and Kfir Levy. k*-nearest neighbors: From global to local. Advances in Neural Information Processing Systems, 29, 2016.

Clément Berenfeld and Marc Hoffmann. Density estimation on an unknown submanifold. Electronic Journal of Statistics, 15:2178-2223, 2021.

Sergei N. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.

Gérard Biau and Luc Devroye. Lectures on the Nearest Neighbor Method, volume 246. Springer, 2015.

Gérard Biau, Frédéric Chazal, David Cohen-Steiner, Luc Devroye, and Carlos Rodríguez. A weighted k-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204-237, 2011.

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers.
In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104, 2000.

Yuchao Cai, Yuheng Ma, Yiwei Dong, and Hanfang Yang. Extrapolated random tree for regression. In International Conference on Machine Learning, pages 3442-3468. PMLR, 2023.

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1-58, 2009.

Pengfei Chen, Huabing Huang, and Wenzhong Shi. Reference-free method for investigating classification uncertainty in large-scale land cover datasets. International Journal of Applied Earth Observation and Geoinformation, 107:102673, 2022.

Y.-S. Chow, Stuart Geman, and L.-D. Wu. Consistent cross-validated density estimation. The Annals of Statistics, 11(1):25-38, 1983.

Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.

Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. Advances in Neural Information Processing Systems, 27, 2014.

Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2001.

Luc P. Devroye and Terry J. Wagner. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, pages 536-540, 1977.

Yixiang Dong, Minnan Luo, Jundong Li, Deng Cai, and Qinghua Zheng. LookCom: Learning optimal network for community detection. IEEE Transactions on Knowledge and Data Engineering, 34(2):764-775, 2020.

Muhammad Fahim and Alberto Sillitti. Anomaly detection, analysis and prediction techniques in IoT environment: A systematic literature review. IEEE Access, 7:81664-81681, 2019.

Tharindu Fernando, Harshala Gammulle, Simon Denman, Sridha Sridharan, and Clinton Fookes. Deep learning for medical anomaly detection: a survey. ACM Computing Surveys (CSUR), 54(7):1-37, 2021.

Gianluigi Folino, Carla Otranto Godano, and Francesco Sergio Pisani.
An ensemble-based framework for user behaviour anomaly detection and classification for cybersecurity. The Journal of Supercomputing, pages 1-24, 2023.

Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209-226, 1977.

Walter Gautschi. Some elementary inequalities relating to the gamma and incomplete gamma function. J. Math. Phys., 38(1):77-81, 1959.

Parikshit Gopalan, Vatsal Sharan, and Udi Wieder. PIDForest: anomaly detection via partial identification. Advances in Neural Information Processing Systems, 32, 2019.

Xiaoyi Gu, Leman Akoglu, and Alessandro Rinaldo. Statistical analysis of nearest neighbor methods for anomaly detection. Advances in Neural Information Processing Systems, 32, 2019.

Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang, and Yue Zhao. ADBench: Anomaly detection benchmark. Advances in Neural Information Processing Systems, 35:32142-32159, 2022.

Hanyuan Hang, Ingo Steinwart, Yunlong Feng, and Johan A. K. Suykens. Kernel density estimation for dynamical systems. The Journal of Machine Learning Research, 19(1):1260-1308, 2018.

Hanyuan Hang, Yuchao Cai, Hanfang Yang, and Zhouchen Lin. Under-bagging nearest neighbors for imbalanced classification. The Journal of Machine Learning Research, 23(1):5135-5197, 2022.

Waleed Hilal, S. Andrew Gadsden, and John Yawney. Financial fraud: a review of anomaly detection techniques and recent advances. Expert Systems With Applications, 193:116429, 2022.

Peter J. Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753-1758, 1965.

Peter J. Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution, pages 492-518. Springer, 1992.

Heinrich Jiang. Uniform convergence rates for kernel density estimation.
In International Conference on Machine Learning, pages 1694-1703. PMLR, 2017.
Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413-422. IEEE, 2008.
Mark Lokanan, Vincent Tran, and Nam Hoai Vuong. Detecting anomalies in financial statements using machine learning algorithm: The case of Vietnamese listed firms. Asian Journal of Accounting Research, 4(2):181-201, 2019.
Ezequiel López-Rubio. A histogram transform for probability density function estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):644-656, 2013.
Andréa Eliza O Luz, Rogério G Negri, Klécia G Massi, Marilaine Colnago, Erivaldo A Silva, and Wallace Casaca. Mapping fire susceptibility in the Brazilian Amazon forests using multitemporal remote sensing and time-varying unsupervised anomaly detection. Remote Sensing, 14(10):2429, 2022.
Pascal Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007.
David S Moore and James W Yackel. Large sample properties of nearest neighbor density function estimators. In Statistical Decision Theory and Related Topics, pages 269-279. Elsevier, 1977.
Ali Bou Nassif, Manar Abu Talib, Qassim Nasir, and Fatima Mohamad Dakalbab. Machine learning for anomaly detection: A systematic review. IEEE Access, 9:78658-78700, 2021.
Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39:419-441, 2008.
Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438, 2000.
M Ravinder and Vikram Kulkarni. A review on cyber security and anomaly detection perspectives of smart grid.
In 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), pages 692-697. IEEE, 2023.
Lukas Ruff, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5):756-795, 2021.
Bernhard Schölkopf, Robert C Williamson, Alex Smola, John Shawe-Taylor, and John Platt. Support vector method for novelty detection. Advances in Neural Information Processing Systems, 12, 1999.
Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.
Haiyang Sheng and Guan Yu. TNN: A transfer learning classifier based on weighted nearest neighbors. Journal of Multivariate Analysis, 193:105126, 2023.
Bernard W Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 2018.
Maciej Skorski. Bernstein-type bounds for beta distribution. Modern Stochastics: Theory and Applications, 10(2):211-228, 2023.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(2), 2005.
Mahito Sugiyama and Karsten Borgwardt. Rapid distance-based outlier detection via sampling. Advances in Neural Information Processing Systems, 26, 2013.
Maximilian E Tschuchnig and Michael Gadermayr. Anomaly detection in medical imaging - a mini review. In International Data Science Conference, pages 33-38. Springer, 2021.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.
Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996.
Shay Vargaftik, Isaac Keslassy, Ariel Orda, and Yaniv Ben-Itzhak. RADE: Resource-efficient supervised anomaly detection using decision tree-based ensemble methods. Machine Learning, 110(10):2835-2866, 2021.
Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Weiping Wang, Zhaorong Wang, Zhanfan Zhou, Haixia Deng, Weiliang Zhao, Chunyang Wang, and Yongzhen Guo. Anomaly detection of industrial control systems based on transfer learning. Tsinghua Science and Technology, 26(6):821-832, 2021.
Mingxi Wu and Christopher Jermaine. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 767-772, 2006.
Puning Zhao and Lifeng Lai. On the convergence rates of KNN density estimation. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2840-2845. IEEE, 2021.
Puning Zhao and Lifeng Lai. Analysis of KNN density estimation. IEEE Transactions on Information Theory, 68(12):7971-7995, 2022.
Fan Zhou, Guanyu Wang, Kunpeng Zhang, Siyuan Liu, and Ting Zhong. Semi-supervised anomaly detection via neural process. IEEE Transactions on Knowledge and Data Engineering, 2023.