# deep_anomaly_detection_under_labeling_budget_constraints__e8047941.pdf Deep Anomaly Detection under Labeling Budget Constraints Aodong Li * 1 Chen Qiu * 2 Marius Kloft 3 Padhraic Smyth 1 Stephan Mandt 1 Maja Rudolph 2 Selecting informative data points for expert feedback can significantly improve the performance of anomaly detection (AD) in various contexts, such as medical diagnostics or fraud detection. In this paper, we determine a set of theoretical conditions under which anomaly scores generalize from labeled queries to unlabeled data. Motivated by these results, we propose a data labeling strategy with optimal data coverage under labeling budget constraints. In addition, we propose a new learning framework for semi-supervised AD. Extensive experiments on image, tabular, and video data sets show that our approach results in stateof-the-art semi-supervised AD performance under labeling budget constraints. 1. Introduction Detecting anomalies in data is a fundamental task in machine learning with applications across multiple domains, from industrial fault detection to medical diagnosis. The main idea is to train a model (such as a neural network) on a data set of normal samples to minimize the loss of an auxiliary (e.g., self-supervised) task. Using the loss function to score test data, one hopes to obtain low scores for normal data and high scores for anomalies (Ruff et al., 2021). In practice, the training data is often contaminated with unlabeled anomalies that differ in unknown ways from the i.i.d. samples of normal data. No access to a binary anomaly label (indicating whether a sample is normal or not) makes learning the anomaly scoring function from contaminated data challenging; the training signal has to come exclusively from the input features (typically real-valued vectors). Many *Equal contribution Joint supervision 1Department of Computer Science, University of California, Irvine, USA 2Bosch Center for Artificial Intelligence, Pittsburgh, USA 3Department of Computer Science, TU Kaiserslautern, Germany. Correspondence to: Aodong Li , Stephan Mandt , Maja Rudolph . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). approaches either assume that the unlabeled anomalies are too rarely encountered during training to affect learning (Wang et al., 2019) or try to detect and exploit the anomalies in the training data (e.g., Qiu et al. (2022a)). While AD is typically an unsupervised training task, sometimes expert feedback is available to check if individual samples are normal or not. For example, in a medical setting, one may ask a medical doctor to confirm whether a given image reflects normal or abnormal cellular tissue. Other application areas include detecting network intrusions or machine failures. Anomaly labels are usually expensive to obtain but are very valuable to guide an anomaly detector during training. For example, in Fig. 1, we can see that our method, with only one labeled query (Fig. 1 d) is almost on par with supervised AD (Fig. 1 a). However, the supervised setting is unrealistic, since expert feedback is typically expensive. Instead, it is essential to develop effective strategies for querying informative data points. Previous work on AD under a labeling budget primarily involves domain-specific applications and/or ad hoc architectures, making it hard to disentangle modeling choices from querying strategies (Trittenbach et al., 2021). 
In contrast, this paper theoretically and empirically studies generalization performance using various labeling budgets, querying strategies, and losses. In summary, our main contributions are as follows: 1. We prove that the ranking of anomaly scores generalizes from labeled queries to unlabeled data under certain conditions that characterize how well the queries cover the data. Based on this theory, we propose a diverse querying strategy for deep AD under labeling budget constraints. 2. We propose semi-supervised outlier exposure with a limited labeling budget (SOEL), a semi-supervised learning framework compatible with a large number of deep AD losses. We show how all major hyperparameters can be eliminated, making SOEL easy to use. To this end, we provide an estimate for the anomaly ratio in the data. 3. We provide an extensive benchmark for deep AD with a limited labeling budget. Our experiments on image, tabular, and video data provide evidence that SOEL outperforms existing methods significantly. Comprehensive ablations disentangle the benefits of each component. Deep Anomaly Detection under Labeling Budget Constraints Our paper is structured as follows. Sec. 2 introduces the problem setting we address and our main algorithm. Sec. 3 discusses related work in deep AD. Sec. 4 discusses experimental results on each of image, video, and tabular data. Finally, we conclude this work in Section 5. 2.1. Notation and Problem Statement Consider a dataset {xi}N i=1 where the datapoints xi are i.i.d. samples from a mixture distribution p(x) = (1 α)p0(x) + αp1(x). The distribution p0(x) corresponds to the normal data, while p1(x) corresponds to anomalous data. We assume that 0 α < 0.5, i.e., that the anomalous data is non-dominant in the mixture; in practice, α 0.5. In the AD problem, we wish to use the data to train an anomaly detector in the form of a parametric anomaly score function S(x; θ). Once trained this score function is thresholded to determine whether a datapoint xi is anomalous, as indicated by the binary anomaly label yi := y(xi) {0 := normal , 1 := abnormal }. We focus on the situation where the training data is unlabeled (only xi is known, not yi), but where we have access to an oracle (e.g., a human expert) that is able to provide labels yi for a budgeted number K of the N training points. 2.2. Outline of the Technical Approach Our work addresses the following questions: How to best select informative data points for labeling this is called the querying strategy, how to best learn an anomaly detector from both the labeled and unlabeled data in a semisupervised fashion, and how to make the approach easy to use by eliminating a crucial hyper-parameter. Querying Strategy. A successful approach for deep AD under labeling budget constraints will require a strategy for selecting the most beneficial set of queries. We choose a theoretically-grounded approach based on generalization performance. For this, we exploit that at test-time an AD method will threshold the anomaly scores to distinguish between normal samples and anomalies. This means that the quality of a scoring function is not determined by the absolute anomaly scores but only by their relative ranking. In Sec. 2.4, we characterize a favorable property of the query set which can guarantee that the ranking of anomaly scores generalizes from the labeled data to unlabeled samples. 
Since this is desirable, we derive a querying strategy that, under a limited labeling budget, best fulfills the favorable properties put forth by our analysis.

Semi-supervised Outlier Exposure. As a second contribution, we propose a semi-supervised learning framework that best exploits both the labeled query set and the unlabeled data. It builds on supervised AD and latent outlier exposure (LOE), which we review in Sec. 2.3. We present SOEL in Sec. 2.5. The SOEL training objective is designed to receive opposing training signals from the normal samples and the anomalies. An EM-style algorithm alternates between estimating the anomaly labels of the unlabeled data and improving the anomaly scoring function using the data samples and their given or estimated labels.

Hyperparameter Elimination. Like related methods discussed in Sec. 3, SOEL has an important hyperparameter α, which corresponds to the expected fraction of anomalies in the data. While previous work has to assume that α is known (Qiu et al., 2022a), our proposed method presents an opportunity to estimate it. The estimate has to account for the fact that the optimal querying strategy derived from our theory in Sec. 2.4 is not i.i.d. In Sec. 2.6, we provide an estimate of α for any stochastic querying strategy.

2.3. Background: Deep AD

In deep AD, auxiliary losses help learn the anomaly scoring function S(x; θ). Popular losses include autoencoder-based losses (Zhou and Paffenroth, 2017), the deep SVDD loss (Ruff et al., 2018), or the neural transformation learning loss (Qiu et al., 2021). It is assumed that minimizing such a loss $L_0^\theta(x) := L_0(S(x; \theta))$ over normal data leads to a desirable scoring function that assigns low scores to normal samples and high scores to anomalies. Most deep AD methods optimize such an objective over an entire unlabeled data set, even if it contains unknown anomalies. It is assumed that the anomalies are rare enough that they will not dilute the training signal provided by the normal samples (inlier priority, Wang et al. (2019)).

Building on the idea of Ruff et al. (2019) that synthetic anomalies can provide a valuable training signal, Qiu et al. (2022a) show how to discover and exploit anomalies by treating the anomaly labels as latent variables in training. The key idea of Ruff et al. (2019) is to construct a complementary loss $L_1^\theta(x) := L_1(S(x; \theta))$ for anomalies that has an opposing effect to the normal loss $L_0^\theta(x)$. For example, the deep SVDD loss $L_0^\theta(x) = \|f_\theta(x) - c\|^2$, with feature extractor $f_\theta$, pulls normal data points towards a fixed center $c$ (Ruff et al., 2018). The opposing loss for anomalies, defined in Ruff et al. (2019) as $L_1^\theta(x) = 1/L_0^\theta(x)$, pushes abnormal data away from the center.

Supervised AD. Using only the labeled data indexed by Q, one could train S(x; θ) with a supervised loss (Hendrycks et al., 2018; Görnitz et al., 2013)

$$\mathcal{L}_Q(\theta) = \frac{1}{|Q|}\sum_{j \in Q}\big[\,y_j L_1^\theta(x_j) + (1 - y_j)\, L_0^\theta(x_j)\,\big]. \qquad (1)$$
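To make the loss structure concrete, here is a minimal sketch of the supervised outlier-exposure loss in Eq. (1), instantiated with the deep SVDD-style pair $L_0^\theta$, $L_1^\theta$ described above. The function and variable names (`dsvdd_losses`, `f_theta`, `x_q`, `y_q`, `c`) are ours, and the small constant added for numerical stability is an implementation choice, not part of the paper.

```python
import torch

def dsvdd_losses(f_theta, x, c):
    # L0^theta(x) = ||f_theta(x) - c||^2 pulls normal points to the center c;
    # L1^theta(x) = 1 / L0^theta(x) pushes anomalies away (Ruff et al., 2018; 2019).
    l0 = ((f_theta(x) - c) ** 2).sum(dim=1)
    l1 = 1.0 / (l0 + 1e-6)  # small epsilon only for numerical stability
    return l0, l1

def supervised_loss(f_theta, x_q, y_q, c):
    # Eq. (1): average outlier-exposure loss over the queried set Q.
    l0, l1 = dsvdd_losses(f_theta, x_q, c)
    return (y_q * l1 + (1.0 - y_q) * l0).mean()
```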
Latent Outlier Exposure. Latent outlier exposure (LOE; Qiu et al., 2022a) is an unsupervised AD framework that uses the same loss as Eq. (1) but treats the labels y as latent variables. An EM-style algorithm alternates between optimizing the model w.r.t. θ and inferring the labels y.

Figure 1. Anomaly score contour plots on 2D toy data demonstrate that SOEL [ours, (d)] with only one labeled sample can achieve detection accuracy that is competitive with a fully supervised approach (a). Binary classification (b) is problematic for AD since it cannot detect new anomalies, e.g., in the upper right corner of the plot. Subplot (c) demonstrates that unsupervised AD is challenging with contaminated data. Even a single labeled query, in combination with our approach, can significantly improve AD.

In this work, we propose semi-supervised outlier exposure with a limited labeling budget (SOEL), which builds on these ideas. We next present the querying strategy and when it leads to a correct ranking of anomaly scores (Sec. 2.4), the SOEL loss (Sec. 2.5), and how the hyperparameter α can be eliminated (Sec. 2.6).

2.4. Querying Strategies for AD

The first ingredient of SOEL is a querying strategy for selecting informative data points to be labeled, which we derive from theoretical considerations. An important property of the querying strategy is how well it covers the unlabeled data. The quality of a querying strategy is determined by the smallest radius δ such that all unlabeled points are within distance δ of one queried sample of the same type. In this paper, we prove that if the queries cover both the normal data and the anomalies well (i.e., if δ is small), a learned anomaly detector that satisfies certain conditions is guaranteed to generalize correctly to the unlabeled data (the exact statement and its conditions are provided in Thm. 1). Based on this insight, we propose a querying strategy that is better suited for deep AD than previous work.

Theorem 1. Let $Q_0$ be the index set of datapoints labeled normal and $Q_1$ the index set of datapoints labeled abnormal. Let $\delta \in \mathbb{R}_+$ be the smallest radius such that for each unlabeled anomaly $u_a$ and each unlabeled normal datum $u_n$ there exist labeled data points $x_a$, $a \in Q_1$, and $x_n$, $n \in Q_0$, such that $u_a$ is within the δ-ball of $x_a$ and $u_n$ is within the δ-ball around $x_n$. If a $\lambda_S$-Lipschitz continuous function $S$ ranks the labeled data correctly, with a large enough margin, i.e., $S(x_a) - S(x_n) \geq 2\delta\lambda_S$, then $S$ ranks the unlabeled points correctly, too, and $S(u_a) \geq S(u_n)$.

In Supp. A, we prove Thm. 1 and discuss the assumptions. An implication of this theorem is that a smaller δ, corresponding to a tighter cover of the data, leads to better-generalized ranking performance. As detailed in Supp. A, there is a connection between correct anomaly score ranking and high AUROC performance, a common evaluation metric for AD. Existing methods use querying strategies that do not have good coverage and are therefore not optimal under Thm. 1. For a limited querying budget, random querying puts too much weight on high-density areas of the data space, while other strategies only query locally, e.g., close to an estimated decision boundary between normal and abnormal data.

Proposed Querying Strategy. Based on Thm. 1, we propose a querying strategy that encourages tight coverage: diverse querying. In practice, we use the seeding algorithm of k-means++, which is usually used to initialize diverse clusters.¹ It iteratively samples another data point to be added to the query set Q until the labeling budget is reached. Given the existing queried samples, the probability of drawing another query from the unlabeled set U is proportional to its distance to the closest sample already in the query set Q:

$$p_{\text{query}}(x_i) = \operatorname{softmax}_{i \in U}\big(h(x_i)/\tau\big). \qquad (2)$$

The temperature parameter τ controls the diversity of the sampling procedure, and $h(x_i) = \min_{x_j \in Q} d(x_i, x_j)$ is the distance of a sample $x_i$ to the query set Q. For a meaningful notion of distance, we define d in an embedding space as $d(x, x') = \|\phi(x) - \phi(x')\|_2$, where φ is a neural feature map.

¹This has complexity O(KN), which can be reduced to O(K log N) using scalable alternatives (Bahmani et al., 2012).
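A minimal sketch of this diverse querying procedure is shown below, assuming the embeddings φ(x) have been pre-computed as a NumPy array; the function name and the incremental update of h are our implementation choices.

```python
import numpy as np

def diverse_queries(phi, budget, tau=1e-2, seed=0):
    """k-means++-style seeding (Eq. (2)): sample each new query with probability
    softmax(h(x_i)/tau) over the unlabeled pool, where h(x_i) is the distance of
    x_i to the closest already-queried point in the embedding space."""
    rng = np.random.default_rng(seed)
    n = len(phi)
    queries = [int(rng.integers(n))]                    # first query drawn uniformly
    h = np.linalg.norm(phi - phi[queries[0]], axis=1)   # h(x_i) = min_{x_j in Q} d(x_i, x_j)
    while len(queries) < budget:
        logits = h / tau
        logits[queries] = -np.inf                       # already-labeled points are excluded
        p = np.exp(logits - logits[np.isfinite(logits)].max())
        p /= p.sum()
        new = int(rng.choice(n, p=p))
        queries.append(new)
        # update h with distances to the newly queried point
        h = np.minimum(h, np.linalg.norm(phi - phi[new], axis=1))
    return queries
```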
We stress that all deep methods considered in this paper have an associated feature map that we can use. The fact that the L2 distance is used in the querying strategy is not an ad-hoc choice but is aligned with the δ-ball radius definition (Eq. (5) in Supp. A) in Thm. 1. In Supp. A, we discuss the cover radius and empirically validate that diverse querying leads to a smaller δ than other strategies and is hence advantageous for AD.

2.5. Semi-supervised Outlier Exposure Loss (SOEL)

We next consider how to use both labeled and unlabeled samples in training. We propose SOEL, whose loss combines the unsupervised AD loss of LOE (Qiu et al., 2022a) for the unlabeled data with the supervised loss (Eq. (1)) for the labeled samples. For all queried data (with index set Q), we assume that ground-truth labels $y_i$ are available, while for unqueried data (with index set U), the labels $y_i$ are unknown. Adding both losses together yields

$$\mathcal{L}(\theta, y) = \frac{1}{|Q|}\sum_{j \in Q}\big[\,y_j L_1^\theta(x_j) + (1 - y_j)\, L_0^\theta(x_j)\,\big] + \frac{1}{|U|}\sum_{i \in U}\big[\,y_i L_1^\theta(x_i) + (1 - y_i)\, L_0^\theta(x_i)\,\big]. \qquad (3)$$

Similar to Qiu et al. (2022a), optimizing this loss involves a block coordinate ascent scheme that alternates between inferring the unknown labels and taking gradient steps to minimize Eq. (3) with the inferred labels. In each iteration, the pseudo labels $y_i$ for $i \in U$ are obtained by minimizing Eq. (3) subject to the constraint $\sum_{i \in U} y_i + \sum_{j \in Q} y_j = \alpha N$. The constraint ensures that the inferred anomaly labels respect a certain contamination ratio α. To be specific, let $\bar\alpha$ denote the fraction of anomalies among the unqueried set U, so that $\bar\alpha|U| + \sum_{j \in Q} y_j = \alpha N$. The constrained optimization problem is then solved by using the current anomaly score function S to rank the unlabeled samples and assigning the labels $y_i$ in the top $\bar\alpha$-quantile the value 1 and the remaining ones the value 0.

We illustrate SOEL's effect on a 2D toy data example in Fig. 1, where SOEL (d) almost achieves the same performance as supervised AD (a) with only one queried point.

In theory, α could be treated as a hyperparameter, but eliminating hyperparameters is important in AD: in many practical applications of AD, there is no labeled data that can be used for validation. While Qiu et al. (2022a) have to assume that the contamination ratio is given, SOEL provides an opportunity to estimate α. In Sec. 2.6, we develop an importance-sampling-based approach to estimate α from the labeled data. Estimating this ratio can be beneficial for many AD algorithms, including OC-SVM (Schölkopf et al., 2001), kNN (Ramaswamy et al., 2000), Robust PCA/Autoencoder (Zhou and Paffenroth, 2017), and soft-boundary deep SVDD (Ruff et al., 2018). When working with contaminated data, these algorithms require a decent estimate of the contamination ratio for good performance.

Another noteworthy aspect of the SOEL loss is that it weighs the two averaged losses equally. In Supp. E.9, we empirically show that equal weighting yields the best results among a large range of weights. This gives every queried data point more weight than an unqueried one, because we expect the labeled samples to be more informative. On the other hand, it ensures that neither loss component will dominate the learning task. Our equal weighting scheme is also practical because it avoids a hyperparameter.
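The following is a minimal sketch of one block-coordinate iteration under the notation above: infer pseudo labels for the unqueried samples by ranking their current scores, then evaluate Eq. (3). The per-sample losses (l0_*, l1_*) would come from the chosen backbone (e.g., the deep SVDD sketch earlier); the hard assignment $y_i \in \{0, 1\}$ follows the main text, while the appendix uses a soft variant with $y_i = 0.5$ for the inferred anomalies. All names are ours.

```python
import torch

def assign_pseudo_labels(scores_u, alpha_bar):
    # Rank the unqueried samples by anomaly score and label the top
    # alpha_bar fraction as anomalies (Sec. 2.5).
    k = int(alpha_bar * len(scores_u))
    y_u = torch.zeros_like(scores_u)
    if k > 0:
        y_u[torch.topk(scores_u, k).indices] = 1.0
    return y_u

def soel_loss(l0_q, l1_q, y_q, l0_u, l1_u, y_u):
    # Eq. (3): equally weighted averages of the supervised term over Q
    # and the latent-outlier-exposure term over U.
    loss_q = (y_q * l1_q + (1 - y_q) * l0_q).mean()
    loss_u = (y_u * l1_u + (1 - y_u) * l0_u).mean()
    return loss_q + loss_u
```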
2.6. Contamination Ratio Estimation

To eliminate a critical hyperparameter in our approach, we estimate the contamination ratio α, i.e., the fraction of anomalies in the dataset. Under a few assumptions, we show how to estimate this parameter using mini-batches composed of non-i.i.d. samples.

We consider the contamination ratio α as the fraction of anomalies in the data. We draw on the notation from Sec. 2.1 to define y(x) as an oracle, outputting 1 if x is an anomaly and 0 otherwise (e.g., upon querying x). We can now write $\alpha = \mathbb{E}_{p(x)}[y(x)]$. Estimating α would be trivial given an unlimited querying budget of i.i.d. data samples. The difficulty arises due to the fact that (1) our querying budget is limited, and (2) we query data in a non-i.i.d. fashion, so that the sample average is not representative of the anomaly ratio of the full data set. Since the queried data points are not independently sampled, we cannot straightforwardly estimate α based on the empirical frequency of anomalies in the query Q. More precisely, our querying procedure results in a chain of indices $Q = \{i_1, i_2, ..., i_{|Q|}\}$, where $i_1 \sim \mathrm{Unif}(1:N)$ and each conditional distribution $i_k \mid i_{1:k-1}$ depends on the previously queried points.

Rank generalization is measured by $P_{n \in U_0,\, a \in U_1}\big(S(u_a) > S(u_n)\big)$, the probability of ranking unlabeled samples $u_a$ higher than $u_n$ in terms of their scores. Here $U = U_0 \cup U_1$ is the set of un-queried data indices, $U_0$ and $U_1$ are the disjoint un-queried normal and abnormal data sets, respectively, and $u_n$ and $u_a$ are instances of each kind. We conducted experiments on CIFAR-10 and F-MNIST, where we trained an anomaly detector (NTL) on the queried data for 30 epochs and then computed the AUC on the remaining un-queried data. The results of four querying strategies are reported in Fig. 5, which shows that our proposed diverse querying strategy generalizes the anomaly score ranking best to the unqueried data among the compared strategies, supporting our analysis in the main paper. A consequence is that diverse querying can provide accurate assignments of the latent anomaly labels, which further helps learn a high-quality anomaly detector through the unsupervised loss term in Eq. (3).

Optimality of Cover Radius. Although k-means++ greedily samples the queries, which may result in a sub-optimal cover radius, greedy sampling strategies for selecting a diverse set of datapoints in a multi-dimensional space are known to produce nearly optimal solutions (Krause and Golovin, 2014), with significant runtime savings over more sophisticated search methods. As a result, we follow common practice (e.g., Arthur and Vassilvitskii (2007)) and also use the greedy approach. We check the diversity of the resulting query set by comparing all sampling strategies considered in the paper in terms of data coverage. Figure 4 shows that the greedy strategy we use achieves the best coverage, i.e., results in the most diverse query set.

On the Assumptions of Thm. 1. In the proof, we assume a Lipschitz continuous S and a large margin between S(x_a) and S(x_n). Lipschitz continuity serves as a working assumption and is a common assumption when analyzing optimization landscapes of deep learning; it can be controlled by the strength of regularization on the model parameters. The large margin condition is achieved by optimizing our loss function: the supervised anomaly detection loss encourages a large margin as it minimizes the anomaly score of queried normal data and maximizes the score of the queried abnormal data. If the anomaly score function does not do well on the queried samples, it should be optimized further.
Our empirical results also show this is a reasonable condition.

B. Theorem 2

In this section, we will empirically justify the assumptions we made in Sec. 2.6 that are used to build an unbiased estimator of the anomaly ratio α (Eq. (4)). We will also demonstrate the robustness of the estimation under varying α.

B.1. Proof

Proof. Let A1 and A2 denote Assumption 1 and 2, respectively. Furthermore, let $q(x_1, ..., x_{|Q|})$ and $q_s(s_1, ..., s_{|Q|})$ denote the query distribution in the data and anomaly score spaces, respectively. A2 assumes $y_s(s) := y_s(S(x)) = y(x)$ for all $x$. So the expectation of Eq. (4) is

$$
\begin{aligned}
\mathbb{E}[\hat\alpha]
&= \mathbb{E}_{q(x_1,...,x_{|Q|})}\Big[\tfrac{1}{|Q|}\textstyle\sum_{i=1}^{|Q|} \tfrac{p_s(S(x_i))}{q_s(S(x_i))}\, y(x_i)\Big]
\overset{A2}{=} \mathbb{E}_{q_s(s_1,...,s_{|Q|})}\Big[\tfrac{1}{|Q|}\textstyle\sum_{i=1}^{|Q|} \tfrac{p_s(s_i)}{q_s(s_i)}\, y_s(s_i)\Big] \\
&\overset{A1}{=} \mathbb{E}_{\prod_{i=1}^{|Q|} q_s(s_i)}\Big[\tfrac{1}{|Q|}\textstyle\sum_{i=1}^{|Q|} \tfrac{p_s(s_i)}{q_s(s_i)}\, y_s(s_i)\Big]
= \tfrac{1}{|Q|}\textstyle\sum_{i=1}^{|Q|} \mathbb{E}_{q_s(s_i)}\Big[\tfrac{p_s(s_i)}{q_s(s_i)}\, y_s(s_i)\Big] \\
&= \mathbb{E}_{p_s(s)}[y_s(s)] = \mathbb{E}_{p(x)}[y_s(S(x))] \overset{A2}{=} \mathbb{E}_{p(x)}[y(x)] = \alpha,
\end{aligned}
$$

where the change of variables makes the necessary assumptions, including the existence of density functions.

Figure 6. Anomaly score correlation matrices of the scores S(x_i), S(x_j), where x_i and x_j are jointly sampled in the same query set (panels: CIFAR-10 Class 1 as Normal, CIFAR-10 Class 2 as Normal, F-MNIST Class 1 as Normal, F-MNIST Class 2 as Normal). The result indicates that the anomaly scores can be considered approximately independent random variables.

B.2. Assumption 1

We verify Assumption 1 by showing the correlation matrix in Fig. 6, where we jointly queried 20 points with the diversified querying strategy and repeated this 1000 times on two classes of CIFAR-10 and F-MNIST. The correlation between each pair of points is then computed and placed in the off-diagonal entries. For each matrix, we show the average, maximum, and minimum of the off-diagonal terms:

CIFAR-10 Class 1: -0.001, 0.103, -0.086
CIFAR-10 Class 2: -0.001, 0.085, -0.094
F-MNIST Class 1: -0.001, 0.081, -0.075
F-MNIST Class 2: -0.005, 0.087, -0.067

This shows that the correlations between S(x_i) and S(x_j) are negligible, and the anomaly scores can be considered approximately independent random variables.

B.3. Assumption 2

We verify Assumption 2 by counting the violations, i.e., pairs with $S(x_i) = S(x_j)$ but $y(x_i) \neq y(x_j)$ (because Assumption 2 states $y_s(s_i) = y(x_i)$ and $y_s(s_j) = y(x_j)$, $S(x_i) = S(x_j)$ implies $y(x_i) = y_s(s_i) = y_s(s_j) = y(x_j)$; the negation is $S(x_i) = S(x_j)$ and $y(x_i) \neq y(x_j)$). We run the experiments on both CIFAR-10 and F-MNIST. We apply the "one-vs.-rest" setup for both datasets and set the first class as normal and all the other classes as abnormal. We set the ground-truth anomaly ratio to 0.1. After the initial training, we count the pairs of data points that satisfy $S(x_i) = S(x_j)$ but $y(x_i) \neq y(x_j)$ for $i \neq j$. Our validation shows that on F-MNIST, among 6666 training data points, there are 38 pairs of matching scores, and none of them have opposite labels; on CIFAR-10, among 5555 training data points, the numbers are 21 and 3, respectively.

B.4. Contamination Ratio Estimation

Table 4. Estimated contamination ratios on CIFAR-10 and F-MNIST when |Q| = 40 and the backbone model is NTL. The first row shows the true contamination ratio, ranging from 1% to 45%. The estimations are repeated 50 times.

| True ratio | 1% | 5% | 10% | 15% | 20% |
|---|---|---|---|---|---|
| CIFAR-10 | 0.5% ± 1.2% | 6.0% ± 3.3% | 12.0% ± 4.4% | 15.3% ± 4.5% | 18.9% ± 5.4% |
| F-MNIST | 1.0% ± 1.5% | 3.8% ± 2.3% | 8.7% ± 4.1% | 12.8% ± 5.3% | 19.3% ± 5.1% |

| True ratio | 25% | 30% | 35% | 40% | 45% |
|---|---|---|---|---|---|
| CIFAR-10 | 26.2% ± 6.0% | 30.6% ± 5.5% | 35.8% ± 6.9% | 42.0% ± 7.7% | 47.2% ± 6.7% |
| F-MNIST | 27.9% ± 6.4% | 31.8% ± 6.1% | 38.3% ± 6.5% | 43.1% ± 5.7% | 48.9% ± 5.6% |

We estimate the contamination ratio by Eq. (4) under varying true ratios.
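For illustration, below is a sketch of the estimator whose expectation is analyzed in B.1, i.e. $\hat\alpha = \frac{1}{|Q|}\sum_i \frac{p_s(s_i)}{q_s(s_i)}\, y(x_i)$. Approximating the score densities $p_s$ and $q_s$ with Gaussian kernel density estimates is our own modeling choice and is not prescribed by the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_alpha(scores_all, scores_q, labels_q):
    """Importance-weighted contamination-ratio estimate (the form analyzed in B.1):
    alpha_hat = (1/|Q|) * sum_i ps(s_i)/qs(s_i) * y_i, where ps is the score density
    under the full data distribution and qs the density induced by the querying
    strategy.  Both densities are approximated here with KDEs (our assumption)."""
    ps = gaussian_kde(scores_all)                        # density of scores over all data
    qs = gaussian_kde(scores_q)                          # density of scores over the queried data
    w = ps(scores_q) / np.maximum(qs(scores_q), 1e-12)   # importance weights ps/qs
    return float(np.mean(w * labels_q))
```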
Tab. 4 shows the contamination ratios estimated by the backbone model NTL when the query budget is |Q| = 40. The first row contains the ground-truth contamination rate, and the second and third rows indicate the values inferred by our approach for the two datasets. Most estimates are within the error bars and hence accurate. The estimation errors for low ground-truth contamination ratios are acceptable, as confirmed by the sensitivity study in Qiu et al. (2022a), which concludes that the LOE approach still works well if the anomaly ratio is mis-specified within 5 percentage points. Interestingly, we find that the estimation error increases somewhat with the contamination ratio. However, a contamination ratio larger than 40% is rare in practice (most datasets should be fairly clean and would otherwise require additional preprocessing). In an anomaly detection benchmark (https://github.com/Minqi824/ADBench), none of the datasets have an anomaly ratio larger than 40%.

C. Baselines Details

In this section, we describe the details of the baselines in Tab. 1 in the main paper. For each baseline method, we explain the query strategy and post-query training strategy we implement in our experiments. Please also refer to our codebase for practical implementation details.

Rand1. This strategy, used by Ruff et al. (2019), selects queries by sampling uniformly without replacement across the training set, resulting in the queried index set $Q = \{i_q \sim \mathrm{Unif}(1, ..., N) \mid 1 \le q \le |Q|\}$. After the querying, models are trained with a supervised loss function based on outlier exposure on the labeled data and with a one-class classification loss function on the unlabeled data,

$$\mathcal{L}_{\mathrm{Rand1}}(\theta) = \frac{1}{|Q|}\sum_{j \in Q}\big[\,y_j L_1^\theta(x_j) + (1 - y_j)\, L_0^\theta(x_j)\,\big] + \frac{1}{|U|}\sum_{i \in U} L_0^\theta(x_i). \qquad (6)$$

As in SOEL, both loss contributions are weighted equally. $\mathcal{L}_{\mathrm{Rand1}}(\theta)$ is minimized with respect to the backbone model parameters θ.

Rand2. The querying strategy of Trittenbach et al. (2021) samples uniformly, without replacement, among the top 50% of the data ranked by anomaly scores. This leads to a random set of positive queries. After the queries are labeled, the training loss function is the same as $\mathcal{L}_{\mathrm{Rand1}}(\theta)$ (Eq. (6)).

Mar. After training the backbone model for one epoch, this querying strategy by Görnitz et al. (2013) uses the α-quantile ($s_\alpha$) of the training data anomaly scores to define a "normality region". Then the |Q| samples closest to the margin $s_\alpha$ are selected to be queried. After the queries are labeled, the training loss function is the same as $\mathcal{L}_{\mathrm{Rand1}}(\theta)$ (Eq. (6)). Note that in practice we do not know the true anomaly ratio for the α-quantile. In all experiments, we provide this querying strategy with the true contamination ratio of the dataset. Even with the true ratio, the Mar strategy is still outperformed by SOEL.

Hybr1. This hybrid strategy, also used by Görnitz et al. (2013), combines the Mar query with neighborhood-based diversification. The neighborhood-based strategy selects samples with fewer neighbors covered by the queried set, to ensure the samples' diversity in the feature space. We start by selecting the data index $\arg\min_{1 \le i \le N} |s_i - s_\alpha|$ into Q. Then the samples are selected sequentially without replacement by the criterion

$$\arg\min_{1 \le i \le N}\; 0.5 + \frac{|\{j \in \mathrm{NN}_k(\phi(x_i)) : j \in Q\}|}{2k} + \beta\,\frac{|s_i - s_\alpha| - \min_i |s_i - s_\alpha|}{\max_i |s_i - s_\alpha| - \min_i |s_i - s_\alpha|},$$

where the inter-sample distance is measured in the feature space and the number of nearest neighbors is k = N/|Q|. We set β = 1 for an equal contribution of both terms; a sketch of this selection rule is given below.
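For concreteness, here is a sketch of a single selection step under our reading of the criterion above; the pairwise-distance computation and all names are ours, and the exact normalization of the neighborhood term is a best-effort reconstruction of the flattened formula.

```python
import numpy as np

def hybr1_select(scores, phi, s_alpha, queried, k, beta=1.0):
    # Pick the next query: prefer points whose k nearest neighbors are not yet
    # covered by the queried set (diversity) and whose score is close to the
    # margin s_alpha (Mar criterion), as described above.
    n = len(scores)
    margin = np.abs(scores - s_alpha)
    margin_norm = (margin - margin.min()) / (margin.max() - margin.min() + 1e-12)
    dists = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=-1)
    crit = np.full(n, np.inf)
    for i in range(n):
        if i in queried:
            continue
        nn = np.argsort(dists[i])[1:k + 1]            # k nearest neighbors of x_i (excluding itself)
        covered = np.intersect1d(nn, queried).size    # how many neighbors are already queried
        crit[i] = 0.5 + covered / (2 * k) + beta * margin_norm[i]
    return int(np.argmin(crit))
```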
After the queries are labeled, the training loss function is the same as LRand1(θ) (Eq. (6)). Pos1. This querying strategy by Pimentel et al. (2020) always selects the top-ranked samples ordered by their anomaly scores, arg max1 i N si. After the queries are labeled, the training loss only involves the labeled data LPos1(θ) = 1 |Q| yj Lθ 1(xj) + (1 yj)Lθ 0(xj) . Pimentel et al. (2020) use the logistic loss but we use the supervised outlier exposure loss. The supervised outlier exposure loss is shown to be better than the logistic loss in learning anomaly detection models (Ruff et al., 2019; Hendrycks et al., 2018). Pos2. This approach of (Barnabé-Lortie et al., 2015) uses the same querying strategy as Pos1, but the training is different. Pos2 also uses the unlabeled data during training. After the queries are labeled, the training loss function is the same as LRand1(θ) (Eq. (6)). Hybr2. This hybrid strategy by Das et al. (2019) makes positive diverse queries. It combines querying according to anomaly scores with distance-based diversification. Hybr2 selects the initial query arg max1 i N si into Q. Then the samples are selected sequentially without replacement by the criterion arg max 1 i N si mini si maxi si mini si + β min j Q d(xi, xj) mina =b d(xa, xb) maxa =b d(xa, xb) mina =b d(xa, xb) where d(xi, xj) = ||ϕ(xi) ϕ(xj)||2. We set β = 1 for equal contribution of both terms. After the queries are labeled, Das et al. (2019) use the labeled set to learn a set of weights for the components of an ensemble of detectors. For a fair comparison of active learning strategies, we use the labeled set to update an individual anomaly detector with parameters θ by optimizing the loss LHybr2(θ) = 1 |Q| yj Lθ 1(xj) + (1 yj)Lθ 0(xj) . Hybr3. This baseline by (Ning et al., 2022) uses the same query strategy as Hybr2, but differs in the training loss function, LHybr3(θ) = 1 |Q| + |U| j Q wj(1 yj)Lθ 0(xj) + 1 |Q| + |U| i U ˆwi Lθ 0(xi), where wj = 2σ(dj) and ˆwi = 2 2σ(di) where σ( ) is the Sigmoid function and di = 10cd ||ϕ(xi) c0||2 ||ϕ(xi) c1||2 where c0 is the center of the queried normal samples and c1 is the center of the queried abnormal samples in the feature space, and cd is the min-max normalization factor. We make three observations for the loss function. First, LHybr3(θ) filters out all labeled anomalies in the supervised learning part and puts a large weight (but only as large as 2 at most) to the true normal data that has a high anomaly score. Second, in the unlabeled data, LHybr3(θ) puts smaller weight (less than 1) to the seemingly abnormal data. Third, overall, the weight of the labeled data is similar to the weight of the unlabeled data. This is unlike SOEL, which weighs labeled data |U|/|Q| times higher than unlabeled data. Deep Anomaly Detection under Labeling Budget Constraints Algorithm 1 Training Procedure of SOEL Input: Unlabeled training dataset D, querying budget K Procedure: Train the model on D for one epoch as if all data were normal; Query K data points from D diversely resulting in a labeled set Q and an unlabeled set U; Estimate the contamination ratio α based on Q; Finally train the model with {Q, U} until convergence: For each iteration: We construct a mini-batch with Q and a subsampled mini-batch of U The sample in Q is up-weighted with 1/|Q| and the sample in U is down-weighted with weight 1/|U| The training strategy for Q is supervised learning; the training strategy for U is LOE with the estimated anomaly ratio α. D. 
Implementation Details In this section, we present the implementation details in the experiments. They include an overall description of the experimental procedure for all datasets, model architecture, data split, and details about the optimization algorithm. D.1. Experimental Procedure We apply the same experimental procedure for each dataset and each compared method. The experiment starts with an unlabeled, contaminated training dataset with index set U. We first train the anomaly detector on U for one epoch as if all data were normal. Then we conduct the diverse active queries at once and estimate the contamination ratio α by the importance sampling estimator Eq. (4). Lastly, we optimize the post-query training losses until convergence. The obtained anomaly detectors are evaluated on a held-out test set. The training procedure of SOEL is shown in Alg. 1. D.2. Data Split Image Data. For the image data including both natural (CIFAR-10 (Krizhevsky et al., 2009) and F-MNIST (Xiao et al., 2017)) and medical (Med MNIST (Yang et al., 2021)) images, we use the original training, validation (if any), and test split. When contaminating the training data of one class, we randomly sample images from other classes training data and leave the validation and test set untouched. Specifically for Derma MNIST in Med MNIST, we only consider the classes that have more than 500 images in the training data as normal data candidates, which include benign keratosis-like lesions, melanoma, and melanocytic nevi. We view all other classes as abnormal data. Different experiment runs have different randomness. Tabular Data. Our study includes the four multi-dimensional tabular datasets from the ODDS repository3 which have an outlier ratio of at least 30%. . To form the training and test set for tabular data, we first split the data into normal and abnormal categories. We randomly sub-sample half the normal data as the training data and treat the other half as the test data. To contaminate training data, we randomly sub-sample the abnormal data into the training set to reach the desired 10% contamination ratio; the remaining abnormal data goes into the test set. Different experiment runs have different randomness. Video Data. We use UCSD Peds14, a benchmark dataset for video anomaly detection. UCSD Peds1 contains 70 surveillance video clips 34 training clips and 36 testing clips. Each frame is labeled to be abnormal if it has non-pedestrian objects and labeled normal otherwise. Making the same assumption as (Pang et al., 2020), we treat each frame independent and mix the original training and testing clips together. This results in a dataset of 9955 normal frames and 4045 abnormal frames. We then randomly sub-sample 6800 frames out of the normal frames and 2914 frames out of the abnormal frames without replacement to form a contaminated training dataset with 30% anomaly ratio. A same ratio is also used in the literature (Pang et al., 2020) that uses this dataset. The remaining data after sampling is used for the testing set, whose about 30% data is anomalous. Like the other data types, different experiment runs have different randomness for the training dataset construction. 3http://odds.cs.stonybrook.edu/ 4http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm Deep Anomaly Detection under Labeling Budget Constraints D.3. Model Architecture The experiments involve two anomaly detectors, NTL and multi-head Rot Net (MHRot), and three data types. NTL on Image Data and Video Data. 
For all images (either natural or medical) and video frames, we extract features by feeding them into a ResNet152 pre-trained on ImageNet and taking the penultimate-layer output. The features are kept fixed during training. We then train an NTL on those features. We apply the same number of transformations, the same network components, and the same anomaly loss function $L_1^\theta(x)$ as Qiu et al. (2022a) when they apply NTL to image data.

NTL on Tabular Data. We directly use the tabular data as the input of NTL. We apply the same number of transformations, the same network components, and the same anomaly loss function $L_1^\theta(x)$ as Qiu et al. (2022a) when they apply NTL to tabular data.

MHRot on Image Data. We use the raw images as input for MHRot. We use the same transformations, MHRot architecture, and anomaly loss function as Qiu et al. (2022a) when they apply MHRot to image data.

DSVDD on Image Data. For all images (either natural or medical), we build DSVDD on the features from the penultimate layer of a ResNet152 pre-trained on ImageNet. The features are kept fixed during training. The neural network of DSVDD is a three-layer MLP with intermediate batch normalization layers and ReLU activations. The hidden sizes are [1024, 512, 128].

D.4. Optimization Algorithm

Table 5. A summary of optimization parameters for all methods.

| Model | Dataset | Learning Rate | Epochs | Minibatch Size | τ |
|---|---|---|---|---|---|
| NTL | CIFAR-10 | 1e-4 | 30 | 512 | 1e-2 |
| NTL | F-MNIST | 1e-4 | 30 | 512 | 1e-2 |
| NTL | MedMNIST | 1e-4 | 30 | 512 | 1e-2 |
| NTL | ODDS | 1e-3 | 100 | N/5 | 1e-2 |
| NTL | UCSD Peds1 | 1e-4 | 3 | 192 | 1e-2 |
| MHRot | CIFAR-10 | 1e-3 | 15 | 10 | N/A |
| MHRot | F-MNIST | 1e-4 | 15 | 10 | N/A |
| MHRot | MedMNIST | 1e-4 | 15 | 10 | N/A |
| Deep SVDD | CIFAR-10 | 1e-4 | 30 | 512 | 1e-2 |
| Deep SVDD | F-MNIST | 1e-4 | 30 | 512 | 1e-2 |
| Deep SVDD | MedMNIST | 1e-4 | 30 | 512 | 1e-2 |

Hybr2, Hybr3, Pos1, and Pos2 train for 30 epochs; all other methods, including SOEL, train for 3 epochs.

In the experiments, we use Adam (Kingma and Ba, 2014) to optimize the objective function and find locally optimal anomaly scorer parameters θ. For Adam, we set β1 = 0.9, β2 = 0.999 and use no weight decay in all experiments. To set the learning rate, number of training epochs, and minibatch size for MedMNIST, we find the best-performing hyperparameters by evaluating the method on the validation dataset; we use the same hyperparameters on the other image data. For video data and tabular data, the optimization hyperparameters are set as recommended by Qiu et al. (2022a). In order to choose τ (in Eq. (2)), we constructed a validation dataset from CIFAR-10 to select τ among {1, 1e-1, 1e-2, 1e-3} and applied the validated value (1e-2) on all the other datasets in our experiments. Specifically, we split the original CIFAR-10 training data into a training set and a validation set. After validation, we train the model on the original training set again. We summarize all optimization hyperparameters in Tab. 5.

When training models with SOEL, we resort to a block coordinate descent scheme that updates the model parameters θ and the pseudo labels y of the unlabeled data in turn. In particular, we take the following two update steps iteratively:

1. update θ by optimizing Eq. (3) with y fixed;
2. update y by solving the constrained optimization in Sec. 2.5 with θ fixed.

Upon updating y, we use the soft variant LOE_S (Qiu et al., 2022a) for the unlabeled data. We set the pseudo labels y by performing the optimization

$$\min_{y \in \{0, 0.5\}^{|U|}}\; \frac{1}{|U|}\sum_{i \in U}\big[\,y_i L_1^\theta(x_i) + (1 - y_i)\, L_0^\theta(x_i)\,\big] \quad \text{s.t.}\;\; \sum_{i \in U} y_i = \bar\alpha|U|,$$

where $\bar\alpha$ is the updated contamination ratio of U after the querying round, $\bar\alpha = \big(\alpha N - \sum_{j \in Q} y(x_j)\big)/|U|$, and α is computed by Eq.
(4) given Q. The solution is to rank the data by Lθ 0(x) Lθ 1(x) and label the top α data abnormal (equivalently setting y = 0.5) and all the other data normal (equivalently y = 0). When we compute the Euclidean distance in the feature space, we construct the feature vector of a sample by concatenating all its encoder representations of different transformations. For example, if the encoder representation has 500 dimensions and the model has 10 transformations, then the final feature representation has 10 500 = 5000 dimensions. D.5. Time Complexity Regarding the time complexity, the optimization uses stochastic gradient descent. The complexity of our querying strategy is O(KN) where K is the number of queries and N is the size of the training data. This complexity can be further reduced to O(K log N) with a scalable extension of k-means++ (Bahmani et al., 2012). E. Additional Experiments and Ablation Study The goal of this ablation study is to show the generality of SOEL, to better understand the success of SOEL, and to disentangle the benefits of the training objective and the querying strategy. To this end, we applied SOEL to different backbone models and different data forms (raw input and embedding input), performed specialized experiments to compare the querying strategies, to demonstrate the optimality of the proposed weighting scheme in Eq. (3), and to validate the detection performance of the estimated ratio by Eq. (4). We also compared SOEL against additional baselines including semi-supervised learning frameworks and shallow anomaly detectors. E.1. Randomness of Initialization Random Initialization affects both the queried samples and downstream performance. To evaluate the effects, we ran all experiments 5 times with different random seeds and reported all results with error bars. In Fig. 4 we can see that the radius of the cover (a smaller radius means the queries are more diverse) does have some variance due to the random initialization. However, the corresponding results in terms of detection accuracy in Fig. 2 do have very low variance. Our interpretation is that for the CIFAR10 and F-MNIST experiments, the random initialization has little effect on detection performance. E.2. Results with Other Backbone Models Table 6. |Q| = 20. AUC (%) with standard deviation for anomaly detection on six datasets (CIFAR-10, F-MNIST, Blood, Organ A, Organ C, Organ S). The backbone models are MHRot (Hendrycks et al., 2019) and Deep SVDD (Ruff et al., 2018). For all experiments, we set the contamination ratio as 10%. SOEL consistently outperforms two best-performing baselines on all six datasets. MHRot Deep SVDD SOEL Hybr1 Hybr2 SOEL Hybr1 Hybr2 CIFAR-10 86.9 0.7 83.9 0.1 49.1 2.0 93.1 0.2 89.0 0.6 91.3 1.0 F-MNIST 92.6 0.1 87.1 0.2 58.9 5.7 91.4 0.5 90.9 0.4 82.5 2.9 Blood 83.3 0.2 81.1 2.5 61.8 2.1 80.2 1.1 79.7 1.2 77.2 3.0 Organ A 96.5 0.3 94.1 0.3 61.1 4.8 89.5 0.3 87.1 0.7 71.3 3.8 Organ C 92.1 0.2 91.6 0.1 70.9 0.8 87.5 0.7 85.3 0.8 84.2 0.9 Organ S 89.3 0.2 88.3 0.3 68.2 0.1 85.5 0.7 83.4 0.3 81.2 1.3 We are interested whether SOEL works for different backbone models. To that end, we repeat part of the experiments in Tab. 2 but using an self-supervised learning model MHRot (Hendrycks et al., 2019) and a one class classification model Deep SVDD (Ruff et al., 2018) as the backbone model. 
We compare SOEL to two best performing baselines Hybr1 and Deep Anomaly Detection under Labeling Budget Constraints 50 100 150 #Queried samples CIFAR-10 (1%) SOEL (ours) Hybr2 50 100 150 #Queried samples FMNIST (1%) SOEL (ours) Hybr2 50 100 150 #Queried samples CIFAR-10 (5%) SOEL (ours) Hybr2 50 100 150 #Queried samples FMNIST (5%) SOEL (ours) Hybr2 50 100 150 #Queried samples CIFAR-10 (20%) SOEL (ours) Hybr2 50 100 150 #Queried samples FMNIST (20%) SOEL (ours) Hybr2 Figure 7. Running AUCs (%) with different query budgets and data contamination ratios (1%-top row, 5%-middle row, 20%-bottom row). Models are evaluated at 20, 40, 80, 160 queries. SOEL performs the best on all three contamination ratio setups. Hybr2. In this experiment, MHRot and Deep SVDD take different input types: while MHRot takes raw images as input, Deep SVDD uses pre-trained image features. We also set the query budget to be |Q| = 20. We report the results in Tab. 6. It showcases the superiority of SOEL compared to the baselines. On all datasets, SOEL significantly outperforms the two best performing baselines, Hybr1 and Hybr2, thus demonstrating the wide applicability of SOEL across anomaly detection model types. E.3. Robustness to Anomaly Ratios Our method works for both low anomaly ratios and high anomaly ratios. In Fig. 7, we compare SOEL against the bestperforming baseline Hybr2on CIFAR-10 and FMNIST benchmarks. We vary the anomaly ratio among 1%, 5%, and 20%. On all these three anomaly ratio settings, SOEL has significantly better performance than the baseline by over 2 percentage points on average. E.4. Disentanglement of SOEL We disentangle the benefits of each component of SOEL and compare it to unsupervised anomaly detection with latent outlier exposure (LOE) (Qiu et al., 2022a), and to supervised active anomaly detection with k-means++ querying strategy. Both active approaches (k-means++ and SOEL) are evaluated with |Q| = 20 labeled samples. The unsupervised approach LOE requires an hyperparameter of the assumed data contamination ratio, which we set to the ground truth value 10%. Deep Anomaly Detection under Labeling Budget Constraints Table 7. |Q| = 20. AUC (%) with standard deviation for anomaly detection on CIFAR-10 and F-MNIST. For all experiments, we set the contamination ratio as 10%. SOEL mitigates the performance drop when NTL and MHRot trained on the contaminated datasets. Results of the unsupervised method LOE are borrowed from Qiu et al. (2022a). LOE k-means++ SOEL LOE k-means++ SOEL CIFAR-10 94.9 0.1 95.6 0.3 96.3 0.3 86.3 0.2 64.0 0.2 86.9 0.7 F-MNIST 92.5 0.1 94.3 0.2 94.8 0.4 91.2 0.4 91.5 0.1 92.6 0.1 20 40 80 160 #Queried samples 20 40 80 160 #Queried samples 20 40 80 160 #Queried samples 20 40 80 160 #Queried samples SOEL (ours) SOEL-BCE Figure 8. Running AUCs (%) with different query budgets. Models are evaluated at 20, 40, 80, 160 queries. Deep AD model (NTL) performs significantly better than a binary classifier. Comparing SOEL to LOE reveals the benefits of the k-means++ active approach5; comparing SOEL to k-means++ reveals the benefits of the unsupervised loss function in SOEL. Results in Tab. 7 show that SOEL leads to improvements for both ablation models. E.5. Comparison to Binary Classifier In the semi-supervised AD setup, the labeled points can be seen as an imbalanced binary classification dataset. We, therefore, perform an ablation study where we only replace deep AD backbone models with a binary classifier. All the other training and querying procedures are the same. 
We report the results on four different querying budget situations in Fig. 8. The figure shows that a binary classifier on all 11 image datasets falls far short of the NTL, a deep AD model. The results prove that the inductive bias (learning compact representations for normal data) used by AD models are useful for AD tasks. However, such inductive bias is lacking for binary classifiers. Especially when only querying as few as 20 points, the model can t see all anomalies. The decision boundary learned by the classifier based on the queried anomalies possibly doesn t generalize to the unseen anomalies. E.6. Comparison to a Batch Sequential Setup In Fig. 9, we extend our proposed method SOEL to a sequential batch active AD setup. This sequential extension is possible because our querying strategy k-means++ is also a sequential one. At each round, we query 20 points and update 5Notice that while LOE uses the true contamination ratio (an oracle information), SOEL only uses the estimated contamination ratio by the 20 queries. Deep Anomaly Detection under Labeling Budget Constraints 50 100 150 200 #Queried samples SOEL (ours) SOEL-seq 50 100 150 200 #Queried samples SOEL (ours) SOEL-seq Figure 9. Running AUCs (%) with different query budgets. Models are evaluated at 20, 40, 80, 160 queries. SOEL performs better than a sequential version. the estimated contamination ratio. We plot this sequential version of SOEL and the original SOEL in Fig. 9 and make comparisons. The sequential version is not as effective as a single batch query of SOEL. E.7. Comparisons of Querying Strategies Figure 10. Ablation study on the query strategy. K-Means++ significantly outperforms other strategies for active anomaly detection on most of the datasets. To understand the benefit of sampling diverse queries with k-means++ and to examine the generalization ability (stated in Thm. 1) of different querying strategies, we run the following experiment: We use a supervised loss on labeled samples to train various anomaly detectors. The only difference between them is the querying strategy used to select the samples. We evaluate them on all image data sets we study for varying number of queries |Q| between 20 and 160. Results are in Fig. 10. On all datasets except OCT, k-means++ consistently outperforms all other querying strategies from previous work on active anomaly detection. The difference is particularly large when only few samples are queried. This also confirms that diverse querying generalizes better on the test data than other querying strategies (see additional results in Supp. A). Deep Anomaly Detection under Labeling Budget Constraints E.8. Ablation on Estimated Contamination Ratio 20 40 80 160 #Queried samples Figure 11. Model using the estimated ratio is indistinguishable from the one using the true ratio. To see how the estimated ratio affects the detection performance, we compare SOEL to the counterpart with the true anomaly ratio. We experiment on all 11 image datasets. In Fig. 11, we report the average results for all datasets when querying |Q| = 20, 40, 80, 160 samples. It shows that SOEL with either true ratio or estimated ratio performs similar given all query budgets. Therefore, the estimated ratio can be applied safely. This is very important in practice, since in many applications the true anomaly ratio is not known. E.9. Ablations on Weighting Scheme 0.01 0.10 1.00 10.00 Weights of queried samples 0.01 0.10 1.00 10.00 Weights of queried samples Figure 12. Ablation study on the weighting scheme in Eq. 
(3). With different query budgets |Q|, the performance on image datasets degrades both upon down-weighting (0.01, 0.1) or up-weighting (10.0) the queried samples. In contrast, equal weighting yields optimal results. We make the implicit assumption that the averaged losses over queried and unqueried data should be equally weighted (Eq. (3)). That means, if a fraction ϵ of the data is queried, every queried data point weights 1/ϵ as much as an unqueried datum. As a consequence, neither the queried nor the unqueried data points can dominate the result. To test whether this heuristic is indeed optimal, we added a scalar prefactor to the supervised loss in Eq. (3) (the first term) and reported the results on the CIFAR-10 and F-MNIST datasets with different query budgets (Fig. 12). A weight<1 corresponds to downweighting the queried term, while a weight>1 corresponds to upweighting it. We use the same experimental setup and backbone (NTL) as in the paper. The results are shown in Fig. 12. We see that the performance degrades both upon down-weighting (0.01, 0.1) or up-weighting (10.0) the queried samples. In contrast, equal weighting yields optimal results. Deep Anomaly Detection under Labeling Budget Constraints Table 8. Performance of ablation study on τ. AUROCs (%) on CIFAR-10 and F-MNIST when |Q| = 20, the ground-truth contamination ratio is 0.1, and the backbone model is NTL. τ 1 0.1 0.01 0.001 CIFAR-10 93.2 1.7 94.5 0.8 96.3 0.3 95.9 0.4 F-MNIST 91.8 1.4 92.7 1.1 94.8 0.6 94.9 0.2 E.10. Ablations on Temperature τ τ (in Eq. (2)) affects the querying procedure and smaller τ makes the querying procedure more deterministic and diverse because the softmax function (in Eq. (2)) can eventually become a maximum function. We add an ablation study on different values of τ. We did experiments under the ground truth contamination ratio being 0.1 and |Q| = 20. As Tab. 8 shows, the smaller τ results in better AUROC results (more diverse) and smaller errors (more deterministic). E.11. Ablations on Pseudo-label Values y Table 9. Performance of ablation study on y. AUROC (%) on CIFAR-10 and F-MNIST when |Q| = 20, the ground-truth contamination ratio is 0.1, and the backbone model is NTL. y 1.0 0.875 0.75 0.625 0.5 0.25 CIFAR-10 95.3 0.6 95.7 0.4 95.8 0.4 96.0 0.5 96.3 0.3 94.5 0.3 F-MNIST 94.5 0.5 94.5 0.4 94.6 0.4 94.6 0.3 94.8 0.6 94.0 0.4 Analyzing the effects of the pseudo-label values y is an interesting ablation study. Therefore, we perform the following experiments to illustrate the influence of different y values. We set the ground truth contamination ratio being 0.1 and |Q| = 20. We vary the y from 0.25 to 1.0 and conduct experiments. For each y value, we run 5 experiments with different random seeds and report the AUROC results with standard deviation. It shows that y = 0.5 performs the best. While the performance of CIFAR-10 degrades slightly as y deviates from 0.5, F-MNIST is pretty robust to y. All tested y outperform the best baseline reported in Tab. 2. E.12. Comparisons with Semi-supervised Learning Frameworks 20 40 80 160 #Queried samples Fix Match w/ k-means++ Fix Match w/ random k NN (k=10) Gaussian Process Gaussian Process with Anomaly Ratio 20 40 80 160 #Queried samples Fix Match w/ k-means++ Fix Match w/ random k NN (k=10) Gaussian Process Gaussian Process with Anomaly Ratio Figure 13. Comparison with semi-supervised learning fraemworks, Fix Match (Sohn et al., 2020a), k-nearest neighbors (Iscen et al., 2019), and Gaussian process (Li et al., 2018). 
On F-MNIST, SOEL outperforms all baselines, while on CIFAR-10, SOEL has a comparable performance with Fix Match with k-means++ querying. SOEL exploits the unlabeled data to improve the model performance. This shares the same spirit of semi-supervised learning. We are curious about how a semi-supervised learning method performs in our active anomaly detection setup. To this end, we adapted an existing semi-supervised learning framework Fix Match (Sohn et al., 2020a) to our setup and compared with our method in Fig. 13. As follows, we will first describe the experiment results and then state the adaptation of Fix Match to Deep Anomaly Detection under Labeling Budget Constraints anomaly detection we made. Fix Match, as a semi-supervised learning algorithm, regularizes the image classifier on a large amount of unlabeled data. The regularization, usually referred to consistency regularization, requires the classifier to have consistent predictions on different views of unlabeled data, thus improves the classifier s performance. Fix Match generates various data views through image augmentations followed by Cutout (De Vries and Taylor, 2017). We noticed that, although Fix Match focuses on making use of the unlabeled data, its performance is highly affected by the quality of the labeled data subset. We investigated two variants depending on how we acquire the labeled data. One is the original semi-supervised learning setting, i.e., assuming the labeled data is a random subset of the whole dataset. The other one utilizes the same diversified data querying strategy k-means++ as SOEL to acquire the labeled part. In Fig. 13, we compared the performance of the two variants with SOEL. It shows that, on natural images CIFAR10 for which Fix Match is developped, while the original Fix Match with random labeled data is still outperformed by SOEL, Fix Match with our proposed querying strategy k-means++ has a comparable performance with SOEL. However, such advantage of Fix Match diminishes for the gray image dataset F-MNIST, where both variants are beat by SOEL on all querying budgets. In addition, the Fix Match framework is restrictive and may not be applicable for tabular data and medical data, as the augmentations are specially designed for natural images. Fix Match is designed for classification. To make it suit for anomaly detection, we adapted the original algorithm6 and adopted the following procedure and loss function. 1. Label all training data as normal and train the anomaly detector for one epoch; 2. Actively query a subset of data with size |Q|, resulting in Q and the remaining data U; 3. Finetune the detector in a supervised manner on non-augmented Q for 5 epochs; 4. Train the detector with the Fix Match loss Eq. (7) on augmented {U, Q} until convergence. We denote weak augmentation of input x by α(x) and the strong augmentation by A(x). The training objective function we used is LFix Match(θ) = 1 |Q| yj Lθ 1(α(xj)) + (1 yj)Lθ 0(α(xj)) i U 1(S(α(xi)) < q0.7 or S(α(xi)) > q0.05) yi Lθ 1(A(xi)) + (1 yi)Lθ 0(A(xi)) (7) where pseudo labels yi = 1(S(α(xi)) > q0.05)) and qn is the n-quantile of the anomaly scores {S(α(xi))}i U. In the loss function, we only use the unlabeled samples with confidently predicted pseudo labels. This is controlled by the indicator function 1(S(α(xi)) < q0.7 or S(α(xi)) > q0.05). We apply this loss function for mini-batches on a stochastic optimization basis. 
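Below is a sketch of the adapted consistency objective (Eq. (7)) as we read the description above. Here S returns anomaly scores, l0/l1 return per-sample losses, and weak_aug/strong_aug are placeholder augmentation functions; interpreting q_n as the score threshold above which a fraction n of the weakly augmented scores lies is our assumption, not something the text states explicitly.

```python
import torch

def fixmatch_ad_loss(S, l0, l1, x_q, y_q, x_u, weak_aug, strong_aug):
    # Supervised outlier exposure on weakly augmented queried data.
    xq_w = weak_aug(x_q)
    sup = (y_q * l1(xq_w) + (1 - y_q) * l0(xq_w)).mean()

    # Consistency term: pseudo-label confidently scored unlabeled samples on the
    # weak view and train the strong view toward those pseudo labels.
    xu_w, xu_s = weak_aug(x_u), strong_aug(x_u)
    s_w = S(xu_w).detach()
    q05 = torch.quantile(s_w, 0.95)               # top 5% of weak-view scores lie above q05
    q70 = torch.quantile(s_w, 0.30)               # top 70% of weak-view scores lie above q70
    pseudo = (s_w > q05).float()                  # confident anomalies
    mask = ((s_w > q05) | (s_w < q70)).float()    # drop the uncertain middle band
    cons = (mask * (pseudo * l1(xu_s) + (1 - pseudo) * l0(xu_s))).mean()
    return sup + cons
```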
We also extend the semi-supervised learning methods using non-parametric algorithms to our active anomaly detection framework. We applied k-nearest neighbors and Gaussian process for inferring the latent anomaly labels (Iscen et al., 2019; Li et al., 2018) because these algorithms are unbiased in the sense that if the queried sample size is large enough, the inferred latent anomaly labels approach to the true anomaly labels. For these baselines, we also queried a few labeled data with k-means++ -based diverse querying strategy and then annotate the unqueried samples with k-nearest neighbor classifier or Gaussian process classifer trained on the queried data. Both methods become ablations of SOEL. We compare SOEL with them on CIFAR-10 and F-MNIST under various query budgets and report their results in Fig. 13. On both datasets, SOEL improves over the variant of using only queried samples for training. On F-MNIST, SOEL outperforms all ablations clearly under all query budgets, while on CIFAR-10, SOEL outperforms all ablations except for Fix Match when query budget is low. In conclusion, SOEL boosts the performance by utilizing the unlabeled samples properly, while other labeling strategies are less effective. E.13. More Comparisons Comparisons to k NN (Ramaswamy et al., 2000) We compared against k NN in two ways. First we confirmed that our baseline backbone model NTL is competitive with k NN, which is shown to have a strong performance on tabular 6We adapted the Fix Match implementation https://github.com/kekmodel/Fix Match-pytorch Deep Anomaly Detection under Labeling Budget Constraints Table 10. Comparisons with k NN method. We reported the F1-score (%) with standard error for anomaly detection on tabular datasets when the query budget K = 10. SOEL outperforms the k NN baseline. kth NN ALOE Breast W 92.5 2.1 93.9 0.5 Ionosphere 88.1 1.3 91.8 1.1 Pima 40.5 4.7 55.5 1.2 Satellite 61.1 2.2 71.1 1.7 Average 70.6 78.1 20 40 80 160 #Queried samples Representation Diversity Gradient Diversity 20 40 80 160 #Queried samples Representation Diversity Gradient Diversity Figure 14. Comparison with gradient diversity querying strategy (BADGE) (Ash et al., 2020). The gradients wrt. the penultimate layer representation don t provide as informative queries as the representation itself, thus outperformed by our querying strategy SOEL. The true contamination ratio is 10%. data (Shenkar and Wolf, 2022). To this end, NTL has been shown to yield 95.7% AUC on clean CIFAR-10 data, see Shenkar and Wolf, 2022, Table 1 column 1. In contrast, Qiu et al. (2022a) reported 96.2% AUC in Table 2, which is very close. Second, we tested the performance of the k NN method on our corrupted training data set. We gave k NN the advantage of using the ground truth contamination ratio (otherwise when under-estimating this value, we saw the method degrade severely in performance). KNN has two key hyperparameters: the number of nearest neighbors k and the assumed contamination ratio of the training set. The method uses this assumed contamination ratio when fitting to define the threshold on the decision function. In our experiments, we tried multiple values of k and reported the best testing results. Although the ground truth anomaly rate is unknown and our proposed methods don t have access to it, we gave k NN the competitive advantage of knowing the ground truth contamination ratio. We studied the same tabular data sets as in our paper: Breast W, Ionosphere, Pima, and Satellite. 
We repeated the experiments 10 times and report the mean and standard deviation of the F1 scores in Tab. 10. We find that our active learning framework outperforms the kNN baseline. In more detail, the F1 scores for k = 1, 2, 5, 10, 15, and 20 are:

BreastW: 84.3 ± 7.6, 86.5 ± 3.1, 89.9 ± 3.9, 90.7 ± 3.4, 92.5 ± 2.1, 91.9 ± 1.5
Ionosphere: 88.1 ± 1.3, 87.6 ± 2.6, 84.5 ± 3.9, 75.2 ± 2.5, 70.4 ± 3.6, 67.4 ± 3.4
Pima: 34.4 ± 3.6, 32.3 ± 3.4, 36.9 ± 6.4, 40.5 ± 4.7, 35.3 ± 3.6, 35.5 ± 4.5
Satellite: 51.0 ± 1.1, 53.5 ± 0.7, 54.7 ± 1.3, 57.4 ± 1.8, 59.3 ± 1.3, 61.1 ± 2.2

Comparisons to the Gradient Diversity Querying Strategy BADGE (Ash et al., 2020). We also compared against BADGE (Ash et al., 2020), a popular diversity-driven active learning method that exploits sample-wise gradient diversity. We first observe in Fig. 14 that BADGE does not work well for anomaly detection; for this comparison, we only replaced the objects on which k-means++ operates in SOEL with the gradients required by BADGE, keeping all other settings fixed. This variant is referred to as "Gradient Diversity", while ours is denoted "Representation Diversity". Fig. 14 shows that Gradient Diversity is outperformed by a large margin; it fails to query samples as informative as those selected by Representation Diversity.

To understand which part of BADGE breaks for anomaly detection tasks, we examine the gradients BADGE would use in an anomaly detection model. We first describe how BADGE works. BADGE was developed for active learning in classification tasks. Given a pre-trained classifier, it first predicts the most likely label ŷ (the pseudo label) for each unlabeled training point x. The pseudo label is then used to form a cross-entropy loss l_CE(x, ŷ). BADGE uses the gradient of each data point's loss with respect to the final-layer weights as that point's representation. Upon active querying, a subset of data points is selected such that these representations are diverse. In particular, the gradient with respect to the class-specific weight W_k is

$$\nabla_{W_k} l_{\text{CE}}(x, \hat{y}) = \big(p_k - \mathbb{1}(\hat{y} = k)\big)\, \phi(x),$$

where p_k is the predicted probability of class k and ϕ(x) is the output of the penultimate layer. Proposition 1 of Ash et al. (2020) shows that the norm of the gradient under pseudo labels lower-bounds the norm under true labels. In addition, note that the gradient is a scaling of the penultimate-layer output, where the scaling factor reflects the predictive uncertainty and is upper-bounded by 1. Therefore, the gradients are informative surrogates of the penultimate-layer output of the network, as expressed by the inequality

$$\|\nabla_{W_k} l_{\text{CE}}(x, \hat{y})\|_2 \le \|\nabla_{W_k} l_{\text{CE}}(x, y)\|_2 \le \|\phi(x)\|_2. \tag{8}$$

However, these properties hinge on the use of a softmax output layer. In anomaly detection, models and losses are diverse and go beyond softmax outputs, so such gradients are no longer a good basis for constructing active queries.
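As an illustration of the two querying variants compared in Fig. 14, the sketch below contrasts BADGE's gradient embedding, (p − onehot(ŷ)) ⊗ ϕ(x), with using the penultimate-layer representation ϕ(x) directly, feeding either one to the same k-means++ seeding. The array names `probs` and `feats` and the use of scikit-learn's `kmeans_plusplus` are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

def badge_gradient_embedding(probs, feats):
    """BADGE-style embedding: gradient of the pseudo-label cross-entropy loss
    w.r.t. the last-layer weights, i.e. (p - onehot(argmax p)) outer phi(x)."""
    pseudo = probs.argmax(axis=1)
    onehot = np.eye(probs.shape[1])[pseudo]
    scale = probs - onehot                                 # shape (n, n_classes)
    return (scale[:, :, None] * feats[:, None, :]).reshape(len(feats), -1)

def diverse_query(embeddings, budget, seed=0):
    """Select a diverse query set via k-means++ seeding over the embeddings."""
    _, idx = kmeans_plusplus(embeddings, n_clusters=budget, random_state=seed)
    return idx

# "Gradient Diversity" (BADGE) and "Representation Diversity" (ours) differ only
# in which embedding is handed to the same k-means++ selection:
#   queries_badge = diverse_query(badge_gradient_embedding(probs, feats), K)
#   queries_ours  = diverse_query(feats, K)
```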
As a concrete example of such a loss, the supervised deep SVDD (Ruff et al., 2019) uses the contrastive loss

$$ l(x, y) = \frac{y}{\big(W\phi(x) - c\big)^2} + (1 - y)\big(W\phi(x) - c\big)^2 $$

to concentrate the representations of normal samples around the center c. However, the gradient

$$ \nabla_W l(x, y) = \Big[ 2(1 - y)\big(W\phi(x) - c\big) - 2y\big(W\phi(x) - c\big)^{-3} \Big]\, \phi(x) $$

is no longer a bounded scaling of ϕ(x) and is therefore not an informative surrogate for the data point x.

E.14. NTL as a Unified Backbone Model

In Section 4 of the main paper, we empirically compared SOEL to active-learning strategies known from various existing papers, where these strategies were originally proposed with different backbone architectures (either shallow methods or simple neural architectures, such as autoencoders). However, several recent benchmarks have revealed that these backbones are no longer competitive with modern self-supervised ones (Alvarez et al., 2022). For a fair empirical comparison of SOEL to modern baselines, we upgraded the previously proposed active-learning methods by replacing their respective simple backbones with a modern self-supervised backbone: NTL (Qiu et al., 2021), the same backbone that is also used in SOEL.

We motivate our choice of NTL as the unified backbone in our experiments as follows. Fig. 15 shows the results of ten shallow and deep anomaly detection methods (Qiu et al., 2022a; Ruff et al., 2018; Deecke et al., 2018; Golan and El-Yaniv, 2018; Hendrycks et al., 2019; Tax and Duin, 2004; Liu et al., 2008; Kingma, 2014; Makhzani and Frey, 2015; Sohn et al., 2020b) on the CIFAR-10 one-vs.-rest anomaly detection task. NTL performs best (by a large margin) among the compared methods, including many classic backbone models known from the active anomaly detection literature (Trittenbach et al., 2021; Ruff et al., 2019; Görnitz et al., 2013; Das et al., 2019; Pimentel et al., 2020; Ning et al., 2022; Barnabé-Lortie et al., 2015). An independent benchmark comparison of 13 methods (including nine deep methods proposed between 2018 and 2022) (Alvarez et al., 2022) recently identified NTL as the leading anomaly-detection method on tabular data. In their summary, the authors write: "NeuTraLAD, the transformation-based approach, offers consistently above-average performance across all datasets. The data-augmentation strategy is particularly efficient on small-scale datasets where samples are scarce." Note that the latter is also the scenario where active learning is thought to be the most promising. We show the results from Alvarez et al. (2022) in Tab. 11.

Figure 15. Error (1 − AUROC, in %) of ten methods on CIFAR-10: two shallow methods (SVDD (Tax and Duin, 2004) and IF (Liu et al., 2008)) and eight deep methods (VAE (Kingma, 2014), DCAE (Makhzani and Frey, 2015), ADGAN (Deecke et al., 2018), DSVDD (Ruff et al., 2018), GT (Golan and El-Yaniv, 2018), MHRot (Hendrycks et al., 2019), Contrastive (Sohn et al., 2020b), and NTL (Qiu et al., 2022a)). NTL achieves the best anomaly detection performance on CIFAR-10.

Table 11. F1-scores (in %) and their standard deviations for 13 anomaly detection methods on tabular data. Results are taken from Alvarez et al. (2022) and indicate that NTL is the state of the art for tabular anomaly detection.
             KDDCUP10      NSL-KDD       IDS2018       Arrhythmia    Thyroid       Avg.
ALAD         95.9 ± 0.7    92.1 ± 1.5    59.0 ± 0.0    57.4 ± 0.4    68.6 ± 0.5    74.6
DAE          93.2 ± 2.0    96.1 ± 0.1    71.5 ± 0.5    61.5 ± 2.5    59.0 ± 1.5    76.3
DAGMM        95.9 ± 1.4    85.3 ± 7.4    55.8 ± 5.3    50.6 ± 4.7    48.6 ± 8.0    67.2
DeepSVDD     89.1 ± 2.0    89.3 ± 2.0    20.8 ± 11     55.5 ± 3.0    13.1 ± 13     53.6
DROCC        91.1 ± 0.0    90.4 ± 0.0    45.6 ± 0.0    35.8 ± 2.6    62.1 ± 10     65.0
DSEBM-e      96.6 ± 0.1    94.6 ± 0.1    43.9 ± 0.8    59.9 ± 1.0    23.8 ± 0.7    63.8
DSEBM-r      98.0 ± 0.1    95.5 ± 0.1    40.7 ± 0.1    60.1 ± 1.0    23.6 ± 0.4    63.6
DUAD         96.5 ± 1.0    94.5 ± 0.2    71.8 ± 2.7    60.8 ± 0.4    14.9 ± 5.5    67.7
MemAE        95.0 ± 1.7    95.6 ± 0.0    59.9 ± 0.1    62.6 ± 1.6    56.1 ± 0.9    73.8
SOM-DAGMM    97.7 ± 0.3    95.6 ± 0.3    44.1 ± 1.1    51.9 ± 5.9    52.7 ± 12     68.4
LOF          95.1 ± 0.0    91.1 ± 0.0    63.8 ± 0.0    61.5 ± 0.0    68.6 ± 0.0    76.0
OC-SVM       96.7 ± 0.0    93.0 ± 0.0    45.4 ± 0.0    63.5 ± 0.0    68.1 ± 0.0    73.3
NTL          96.4 ± 0.2    96.0 ± 0.1    59.5 ± 8.9    60.7 ± 3.7    73.4 ± 0.6    77.2