Published in Transactions on Machine Learning Research (04/2023)

TimeSeAD: Benchmarking Deep Multivariate Time-Series Anomaly Detection

Dennis Wagner dwagner@cs.uni-kl.de RPTU Kaiserslautern-Landau
Tobias Michels tmichels@cs.uni-kl.de RPTU Kaiserslautern-Landau
Florian C.F. Schulz florian.cf.schulz@tu-berlin.de TU Berlin
Arjun Nair naira@rptu.de RPTU Kaiserslautern-Landau
Maja Rudolph maja.rudolph@us.bosch.com Bosch AI
Marius Kloft kloft@cs.uni-kl.de RPTU Kaiserslautern-Landau

Reviewed on OpenReview: https://openreview.net/forum?id=iMmsCI0JsS

Abstract

Developing new methods for detecting anomalies in time series is of great practical significance, but progress is hindered by the difficulty of assessing the benefit of new methods, for the following reasons. (1) Public benchmarks are flawed (e.g., due to potentially erroneous anomaly labels), (2) there is no widely accepted standard evaluation metric, and (3) evaluation protocols are mostly inconsistent. In this work, we address all three issues: (1) We critically analyze several of the most widely used multivariate datasets, identify a number of significant issues, and select the best candidates for evaluation. (2) We introduce a new evaluation metric for time-series anomaly detection which, in contrast to previous metrics, is recall consistent and takes temporal correlations into account. (3) We analyze and overhaul existing evaluation protocols and provide the largest benchmark of deep multivariate time-series anomaly detection methods to date. We focus on deep-learning based methods and multivariate data, a common setting in modern anomaly detection. We provide all implementations and analysis tools in a new comprehensive library for Time Series Anomaly Detection, called TimeSeAD¹.
1 Introduction

Anomaly detection (AD) on time series is a fundamental problem in machine learning with significance in various applications, from monitoring patients and uncovering financial fraud to detecting faults in manufacturing and critical process conditions in chemical plants (Ruff et al., 2021). The aim of AD is to automatically identify significant deviations from the norm, so-called anomalies. There are two principal approaches to AD on time series: an anomaly detector assigns a score either to each time step separately (point-wise) or to the entire time series (globally). This work focuses on unsupervised AD in the point-wise setting, which has become the standard in the literature and can easily be adapted to the global setting by aggregating the local labels. Moreover, point-wise methods lend themselves naturally to real-time prediction and anomaly localization, which are of practical importance in many applications.

Evaluating the accuracy of an anomaly detector for time series is not straightforward. Most authors commonly rely on point-wise metrics, particularly the F1-score (Choi et al., 2021; Audibert et al., 2022). However, by definition, point-wise metrics ignore any temporal dependencies in time series and, consequently, fail to distinguish predictive patterns, such as early and late predictions. The few prior attempts to introduce specialized evaluation metrics (Lavin & Ahmad, 2015; Tatbul et al., 2018) for time-series AD have not caught on in the community, primarily due to their complexity and the counterintuitive results they can produce (Huet et al., 2022). We explicitly discuss existing metrics for time-series anomaly detection in Section 3.2. Another major problem in time-series AD is the datasets.

These authors contributed equally. ¹https://github.com/wagner-d/TimeSeAD
Recently, Wu & Keogh (2021) have exposed significant flaws in many widely used univariate datasets, ranging from surface-level issues (such as mislabeled points and an unrealistic density of anomalies) to deep-rooted problems (such as positional bias and trivial features). These problems raise doubts about the reliability of existing evaluations, obscuring the actual progress in the field. The analysis by Wu & Keogh (2021) does not address all critical aspects of time-series AD, such as distributional shift. Furthermore, many modern applications of AD reside in the multivariate regime. To the best of our knowledge, this is the first comprehensive analysis of many popular multivariate time-series AD datasets.

The third crucial problem impeding progress in the field is the diverse and often incompatible evaluation protocols across publications. Many recent methods compare to previously reported results while using different evaluation protocols. Differences range from the way methods are trained to how they are optimized, tuned, and evaluated, making the results inherently incomparable. The general lack of detailed specifications of evaluation protocols and methods, as well as of official implementations, additionally burdens fair comparisons. Furthermore, some practices common to many evaluations have already been proven to introduce bias and skew the results in favor of random predictions (Kim et al., 2022; Doshi et al., 2022). The distinct flaws of datasets, metrics, and evaluation protocols are pervasive problems subverting evaluations in time-series AD, making it hard to determine the actual progress in the field. This work examines the most popular multivariate time-series datasets, evaluation metrics, and protocols in detail and proposes a general evaluation framework to address the identified problems. We have created a detailed, extendable, and user-friendly library, where we implemented 28 deep-learning based multivariate time-series AD methods.
This library is an unprecedented asset enabling researchers to quickly and reliably develop, test, and evaluate new methods. It provides a set of tools to analyze datasets and methods alike. In our evaluation, we focus entirely on multivariate datasets and deep-learning based methods, a common setting in modern applications (Darban et al., 2022). Multivariate data are more complex than univariate data, allowing for complex dependencies between features, and deep-learning based methods have been shown to outperform shallow baselines on multivariate data in multiple settings (LeCun et al., 2015). Our main contributions are the following: We conduct a thorough analysis of the most widely used datasets, metrics, and evaluation protocols for multivariate time-series AD, revealing significant problems with all three. We propose a new evaluation metric that is provably recall consistent (a property we define in Section 3.2) and empirically provides a reasonable ordering of evaluated methods. We present the largest comprehensive benchmark so far for multivariate time-series AD, comparing 28 deep-learning methods on 21 datasets.

2 Related Work

Several papers have attempted to summarize the vast number of time-series AD approaches (Darban et al., 2022). However, most prior work focuses either on a subclass of network architectures (Lindemann et al., 2021; Lee et al., 2021; Wen et al., 2022) or a specific application domain and the methods specifically applied therein (Luo et al., 2021). Others discuss multiple methods and concepts in a high-level overview (Blázquez-García et al., 2021), with a strong focus on application. Choi et al. (2021) and Audibert et al. (2022) use point-wise metrics to selectively evaluate nine and 14 methods, respectively, on three and five datasets, in which we identified several problems (see Section 3.1). Similarly, Lai et al.
(2021) evaluate nine methods on four other datasets, one of which is no longer available, using point-wise metrics. Schmidl et al. (2022) evaluate a large collection of more than 20 deep and several shallow methods, primarily on univariate or low-dimensional datasets. Their evaluation relies on a slow (quadratic in time) implementation of time-series precision and recall (Tatbul et al., 2018). Consequently, they had to exclude results where the computation took too long. Other libraries, such as (Bhatnagar et al., 2021), mainly focus on shallow or basic deep methods.

We expand the work of Wu & Keogh (2021) to analyze multivariate datasets. They thoroughly analyzed several of the most popular univariate time-series datasets, identified multiple flaws, and concluded that many datasets do not guarantee a fair evaluation of AD algorithms. Following their work, we find several similar problems with the most widely used multivariate time-series datasets.

Overcoming the inherent problems of point-wise metrics on time-series data is not an easy task. Huet et al. (2022) provide an overview of existing attempts and introduce a metric based on the distance of predicted anomaly windows to the nearest anomaly. However, they report their results on datasets in which we identify several problems. Others modify the predictions before evaluation (Xu et al., 2018; Scharwächter & Müller, 2020; Kim et al., 2022) or consider only the beginning of anomaly windows (Doshi et al., 2022). Some metrics are clearly biased towards extreme cases of anomaly detectors (Hundman et al., 2018). Lavin & Ahmad (2015) introduced the first metric to directly address the problems of point-wise evaluations by penalizing late predictions in anomaly windows. With its numerous hyperparameters, the metric was too complex and variable to be widely adopted (Xu et al., 2018). Tatbul et al. (2018) proposed time-series precision and recall, a generalization of previous concepts in many ways.
For a long time, the only publicly available implementations were too slow and cumbersome to use in practice. Lastly, Garg et al. (2021) propose a variation of time-series precision and recall that ignores any overlap between prediction and anomalies in the recall.

3 The Illusion of Progress

In this section, we uncover several issues that make evaluations in (multivariate) time-series AD unreliable and thus create an illusion of progress in the field. Our analysis first examines some of the most commonly used datasets. These datasets are the backbone of time-series AD evaluation and have been used in virtually all major comparisons in the field (Schmidl et al., 2022; Garg et al., 2021; Choi et al., 2021; Jacob et al., 2020). Our analysis reveals several significant flaws in these datasets. Second, we investigate the shortcomings of frequently used evaluation metrics, particularly the point-wise F1-score and its adaptations. Lastly, we examine the inconsistencies and other problems within established evaluation protocols.

3.1 Datasets

A good dataset for benchmarking deep time-series AD methods should be as unbiased as possible and adhere to the underlying assumptions of AD (Ruff et al., 2021), but should also be complex enough to represent a significant challenge. It is impossible to determine exact statistics of a good benchmark dataset, but we can look at similar settings to gain some intuition. Time-series data contain strong temporal dependencies, similar to spatial dependencies in images, where deep-learning based methods have outclassed most traditional approaches [cite]. Thus, a good benchmark dataset should contain a similarly large number of samples to facilitate the training of deep models. To present a significant challenge, a good benchmark dataset should contain enough features to allow for complex inter-feature dependencies.
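Such suitability checks reduce to a handful of statistics over the raw arrays. The sketch below is our own illustration (the helper name `dataset_summary` is hypothetical, not part of the TimeSeAD API); it assumes one (time × features) array per split and binary test labels:

```python
import numpy as np

def dataset_summary(train, test, test_labels):
    """Basic suitability statistics for a candidate benchmark dataset:
    sample and feature counts, anomaly density, and features that
    stay constant across both splits."""
    constant = np.flatnonzero(
        (train.std(axis=0) == 0) & (test.std(axis=0) == 0)
    )
    return {
        "train_steps": train.shape[0],
        "test_steps": test.shape[0],
        "features": train.shape[1],
        "anomaly_density": float(np.asarray(test_labels).mean()),
        "constant_features": constant.tolist(),
    }

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))
test = rng.normal(size=(200, 3))
train[:, 2] = 0.0   # feature 2 is constant in both splits
test[:, 2] = 0.0
labels = np.zeros(200, dtype=int)
labels[150:170] = 1  # 20 of 200 steps are anomalous
print(dataset_summary(train, test, labels))
# -> anomaly_density 0.1, constant_features [2]
```

Statistics like these are cheap to compute and already reveal several of the problems discussed in the following.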
In the following, we highlight several problems found in time-series datasets and summarize our findings on SWaT (Goh et al., 2016), WADI (Ahmed et al., 2017), SMAP, and MSL (Hundman et al., 2018).

Anomaly density refers to the fraction of anomalies in the test set. Anomalies are usually considered rare deviations, and their density in the test set should reflect this characterization. However, except for WADI, all considered datasets contain more than 10% anomalies, which might already be too high to be considered rare deviations from the norm.

Positional bias is introduced when the distribution of the relative positions of anomalies deviates significantly from a uniform distribution. For example, anomalies can be biased towards the end when an anomaly means a fatal error for the generating process, known as run-to-failure bias (Wu & Keogh, 2021). Algorithms accounting for this shift have an immediate advantage over any competitors. To investigate this bias, we examine the relative positions of anomalous time steps in each time series in the test sets and find clear evidence of positional bias in both SMAP and MSL (for example, see Figure 1a).

Figure 1: (a) The relative positions of anomalies in the test set of SMAP show clear positional bias towards the latter half of the time series. (b) The distribution of anomaly window lengths in SWaT shows the existence of exceptionally long anomaly windows. (c) Comparing the distribution of normal time steps of feature AIT201 in SWaT reveals clear signs of distributional shift.

Long anomalies can introduce problems in the evaluation. Some methods may rely on normal context in each window to predict subsequent anomalies and are thus at a disadvantage when long anomaly windows occur.
Long anomalies also interact with adapted evaluation protocols, which we discuss in Section 3.3. Although not inherently negative, both effects should be kept in mind when using data containing long anomaly windows for evaluation. We found that the vast majority of anomalous time steps in all datasets belong to one or several long anomalies (for example, see Figure 1b).

Constant features appear in all considered datasets. Features that are constant only in the training or test set are generally desirable. However, some datasets contain features that remain constant across training and test set. While such features may be valuable in practical applications, they add unnecessary complexity to the benchmark.

Distributional shift occurs when the underlying process that generated normal training and test data is not the same. It breaks one of the fundamental assumptions of AD. For several datasets, we can examine distributional shift by simply inspecting their feature means and standard deviations (for example, see Figure 1c). Furthermore, anomalies should be labeled consistently where they occur in the data to ensure an unbiased and fair evaluation. Effects that show in a sensor only after the labeled anomaly (for example, see Figure 2a) pose an impossible problem for any anomaly detector. This holds especially true for long-lasting changes in the data. Some anomalies seem to permanently change the distribution of the system, causing clear distributional shift (for example, see Figure 2b).

3.1.1 Analyzing SWaT, WADI, SMAP, and MSL

In the following, we summarize our analysis of four of the most widely used datasets for multivariate time-series AD. We provide descriptions, detailed examples, and discussions for all datasets in Appendix B. SWaT and WADI contain the clearest examples of delayed and long-term effects in the data. In both datasets, the distribution changes drastically in the second half of the test set.
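Drastic distribution changes like these can be surfaced with a simple per-feature comparison of the training set against the normal part of the test set, as described above. The sketch below is our own illustration (`shift_report` is a hypothetical helper, not part of the TimeSeAD library):

```python
import numpy as np

def shift_report(train, test, test_labels):
    """Compare per-feature mean/std of the training set against the
    *normal* part of the test set; large gaps hint at distributional shift."""
    normal_test = test[test_labels == 0]
    return {
        "mean_gap": np.abs(train.mean(axis=0) - normal_test.mean(axis=0)),
        "std_gap": np.abs(train.std(axis=0) - normal_test.std(axis=0)),
    }

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))
test = rng.normal([0.0, 5.0], 1.0, size=(1000, 2))  # feature 1 is shifted
labels = np.zeros(1000, dtype=int)
report = shift_report(train, test, labels)
print(report["mean_gap"])  # feature 1 shows a gap of roughly 5
```

A large gap on normal test steps, as for feature 1 here, cannot be explained by anomalies and indicates that training and test data follow different distributions.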
Additionally, we found exceptionally long anomalies in both datasets, especially in SWaT, where one anomaly spans nearly 36,000 time steps. Even if the former issue were addressed by experts, this could introduce even more anomalous time steps, longer anomaly windows, and positional bias. Thus, we conclude that evaluations on these two datasets are highly unreliable and that these datasets are not suited for multivariate time-series AD evaluation.

Figure 2: Two features from the test set of WADI, (a) 2_P_004_SPEED and (b) 2A_AIT_002_PV, show where anomalies seem to cause (a) delayed or (b) long-term effects in the data. Red-shaded areas are ground truth anomalies. The feature in (b), normalized to range in [0, 1] on the training set, jumps to unprecedented values on the test set.

SMAP and MSL contain time series with one feature representing a sensor measurement, while the rest represent binary encoded commands. The command features are often constant, particularly in sections where anomalies occur. Furthermore, since several sensors have been used to construct the dataset, each time series in both datasets should be considered independently. SMAP contains a clear positional bias towards the end, and both seem to contain significant distributional shifts caused by anomalies. Thus, we conclude that both MSL and SMAP are also not suited for general time-series AD evaluation.

3.2 Metrics

An anomaly detector produces an anomaly score for each time step in a time series. The higher the score at time t, the more confident the detector is that the point at that time is an anomaly. Anomalies are then predicted by thresholding these scores. Given predictions and labels, an evaluation metric produces a score based on their agreement. Different algorithms can then be compared based on the scores produced by their predictions.
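The thresholding step, which turns point-wise scores into predicted anomaly windows, can be sketched in a few lines (our own illustrative helper; windows are [start, end) pairs, not the library's API):

```python
import numpy as np

def predicted_windows(scores, threshold):
    """Turn point-wise anomaly scores into a list of predicted
    anomaly windows [start, end) by thresholding."""
    flags = np.asarray(scores) > threshold
    # Pad with zeros so windows touching the boundaries are closed,
    # then locate rising (+1) and falling (-1) edges.
    padded = np.diff(np.concatenate(([0], flags.astype(int), [0])))
    starts = np.flatnonzero(padded == 1)
    ends = np.flatnonzero(padded == -1)
    return list(zip(starts.tolist(), ends.tolist()))

scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
print(predicted_windows(scores, 0.5))  # -> [(1, 3), (4, 5)]
```

Varying the threshold sweeps out the precision-recall trade-off on which the metrics discussed below operate.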
A good metric should not be unintentionally biased towards a specific group of bad algorithms, for example, random predictions. Evaluation on time-series data is complicated due to the temporal dependencies between time steps, as anomalies often appear in intervals and two predictors might differ in the pattern of their predictions inside these anomaly windows. In the following, we examine the shortcomings of point-wise metrics and existing attempts to explicitly include the temporal dependency in evaluation metrics.

Predictive patterns describe the patterns of the predictions of an anomaly detector. Two methods making the same number of predictions can still differ significantly depending on their predictive pattern. Predictive patterns matter for separating early and late or consistent and fragmented predictions, and their differences should be reflected in the metric used to compare them. Many proposed metrics ignore the predictive patterns in their computation. This is particularly true for point-wise metrics that consider each prediction separately. As a consequence, any two methods that differ only in their predictive pattern on anomalies are indistinguishable for any point-wise metric (see Figure 3a for an example). Nevertheless, most papers rely on the point-wise F1-score for their evaluation, oftentimes reported alongside precision and recall.

Recall consistency refers to the monotonicity of point-wise recall with respect to the threshold of the evaluated anomaly detector, that is, the point-wise recall is monotonically decreasing with an increasing threshold. We argue that any derived metric replicating the intuition of recall should be recall consistent to avoid unexpected and unintuitive behavior. Such behavior can even lead to problems when computing aggregated metrics that assume recall consistency.

Time-series precision and recall (Tatbul et al., 2018) is an attempt to incorporate predictive patterns into the computation of recall and precision.
Consider the set of anomaly windows 𝒜 in a dataset, the set of predicted windows 𝒫, and, for an anomaly window A ∈ 𝒜, the set of predicted windows overlapping with it, 𝒫_A = {P ∈ 𝒫 : |A ∩ P| > 0}. Then time-series recall is defined as

TRec(𝒜, 𝒫) = (1/|𝒜|) Σ_{A∈𝒜} [ α·1(|𝒫_A| > 0) + (1 − α)·γ(|𝒫_A|) · ( Σ_{t∈(∪𝒫_A)∩A} δ(t − min A, |A|) ) / ( Σ_{t∈A} δ(t − min A, |A|) ) ]   (1)

with weight 0 ≤ α ≤ 1, a monotonically decreasing cardinality function γ with γ(1) = 1, and a bias function δ ≥ 1.

(a) Two predictions of equal size with distinct predictive patterns on an anomaly. (b) Anomaly score and predictions P₁, P₂, where only the larger persists for both thresholds t₁, t₂.

Figure 3: (a) A point-wise metric cannot distinguish between two methods that differ only in their predictive pattern on anomalies. (b) Counterintuitively, TRec with a constant bias and γ(x) = x⁻¹ increases when the threshold increases from t₁ to t₂.

This metric is not recall consistent in general, in particular for the recommended default parameter choice γ(x) = x⁻¹ (Tatbul et al., 2018). Increasing the threshold and removing an entire window from the predictions will increase γ(|𝒫_A|), which may override the decrease in the subsequent sum that quantifies the overlap if γ is not chosen carefully. For example, consider two disjoint predictions P₁, P₂ ⊆ A ∈ 𝒜 for a threshold λ such that Σ_{t∈P₁∩A} δ(t − min A, |A|) > Σ_{t∈P₂∩A} δ(t − min A, |A|). Then, if there exists a threshold greater than λ such that P₁ is kept intact while P₂ vanishes, TRec increases (see Figure 3b for an illustration).

Implicit bias is at the core of any evaluation metric, defining the ideal behavior to strive for. Thus, each metric needs to be carefully designed so as not to encourage unwanted behavior by accident or introduce conflicting goals. Consider the precision associated with TRec, which is computed by interchanging the roles of anomalies and predictions, i.e., TPrec(𝒜, 𝒫) = TRec(𝒫, 𝒜).
This choice encourages algorithms to predict many small anomaly windows, such that the negative impact of falsely anomalous predictions is diminished, since all predictions are weighted equally. The resulting behavior conflicts with the choice of a decreasing cardinality function, which encourages the opposite.

Compensating flaws generally introduces further, often subtle, bias into the evaluation. For many applications, precise predictions are important, for example, where false positives cause severe overhead. Introducing soft boundaries for anomalies to account for predictions that only nearly miss ground-truth anomaly windows erodes the required preciseness of predictions. Depending on the particular setting, such leeway might be desired behavior, but even then it should be introduced with awareness of its implicit biases. In a well-labeled dataset, each anomaly window should indicate where alarms are expected and acceptable. Thus, such issues should not be addressed in the metric, but rather in the dataset.

3.2.1 Analyzing evaluation metrics for time series

We discuss and analyze recent evaluation metrics for time-series data with respect to the identified flaws. Soft-boundary metrics, such as distance-based precision and recall (Huet et al., 2022) and the range-based volume-under-surface metrics by Paparrizos et al. (2022), compensate for predictions outside of anomaly windows by relying on the distance between predictions and anomalies or by extending anomaly windows. While the former introduces implicit bias on correct predictions towards the center of anomaly windows and makes no explicit distinction between predictions before or after anomalies, the latter carries over the indifference to predictive patterns of point-wise methods. Similarly, Scharwächter & Müller (2020) extend anomaly windows and predicted windows in the computation of point-wise precision and recall, respectively.
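For concreteness, time-series recall as defined in Equation (1) can be implemented directly. The sketch below is our own unoptimized (quadratic) version, with windows represented as [start, end) pairs and γ, δ passed as plain functions; it is not the linear-time implementation shipped with TimeSeAD:

```python
def trec(anomalies, predictions, alpha=0.0,
         gamma=lambda n: 1.0 / n, delta=lambda pos, length: 1.0):
    """Time-series recall (Tatbul et al., 2018), Equation (1).
    `anomalies` and `predictions` are lists of [start, end) windows.
    Defaults: alpha = 0, cardinality gamma(n) = 1/n, constant bias."""
    if not anomalies:
        return 0.0
    total = 0.0
    for a_start, a_end in anomalies:
        anomaly = set(range(a_start, a_end))
        # Predicted windows overlapping this anomaly (the set P_A).
        overlapping = [set(range(s, e)) & anomaly
                       for s, e in predictions
                       if set(range(s, e)) & anomaly]
        covered = set().union(*overlapping) if overlapping else set()
        num = sum(delta(t - a_start, len(anomaly)) for t in covered)
        den = sum(delta(t - a_start, len(anomaly)) for t in anomaly)
        overlap = gamma(len(overlapping)) * num / den if overlapping else 0.0
        existence = 1.0 if overlapping else 0.0
        total += alpha * existence + (1.0 - alpha) * overlap
    return total / len(anomalies)

print(trec([(2, 8)], [(2, 8)]))          # one window, full overlap -> 1.0
print(trec([(2, 8)], [(2, 4), (5, 7)]))  # 4/6 overlap discounted by gamma(2) = 1/2
```

The second call illustrates the cardinality discount: the same coverage split into two windows scores (1/2)·(4/6) = 1/3 instead of 2/3.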
Early prediction metrics, such as the NAB score (Lavin & Ahmad, 2015) and sequence precision delay (Doshi et al., 2022), only consider the first anomalous prediction after the beginning of an anomaly window. This emphasis on early predictions is largely motivated by specific applications where early detection of anomalies is vital. However, by ignoring the remaining predictions, these metrics cannot distinguish predictive patterns and thus miss subtle differences between methods.

Time-series precision and recall and their derived metrics solve most problems of point-wise metrics and provide an intuitive set of adjustable parameters. However, they suffer from recall inconsistency and conflicting implicit bias. Even though clearly flawed, they constitute a reasonable attempt to include the temporal structure of time series in an evaluation metric. Other attempts to adjust precision and recall for time series contain clear bias towards constant predictors (Hundman et al., 2018) or fail to account for predictive patterns, similar to point-wise metrics (Garg et al., 2021).

3.3 Evaluation Protocol

An evaluation protocol comprises specifications of how the experiments are conducted, including the preprocessing of datasets, feature elimination, and parameter selection heuristics. Consistent evaluation protocols are necessary to guarantee fair comparison across different models. In the following, we outline some of the problems we identify in and around evaluation protocols in the literature.

Point adjustment is a technique modifying the predictions to complement the point-wise F1-score. Any anomaly window with at least one correctly predicted time step is considered predicted correctly.
However, even random methods have a decent chance to predict at least one point in larger anomaly windows, where they can easily reach the performance of most complex methods or even outperform them (Kim et al., 2022; Doshi et al., 2022). Despite these flaws, this technique was adopted by many papers (Su et al., 2019; Audibert et al., 2020; Zhao et al., 2020; Zhang et al., 2021; Xiao et al., 2021; Chen et al., 2021; Wang et al., 2021; Challu et al., 2022; Hua et al., 2022; Chambaret et al., 2022; Zhang et al., 2022b;a). In light of our discussion on evaluation metrics and their apparent flaws, this technique should generally be abandoned for evaluations of time-series AD. Implementations and specifications provided by the original authors are an important tool to ensure reproducibility. This becomes especially important when the evaluation protocols need to be adapted, or the methods need to be evaluated on new datasets or with different metrics. However, several works do not specify their evaluation protocol, hyperparameters, and architecture in enough detail to reproduce their results and do not publish any source code that contains them (e.g., Homayouni et al., 2020; Pereira & Silveira, 2018). As the employed evaluation protocol can significantly affect the final performance, a functioning implementation should be completely disclosed alongside an evaluation. Other potential inconsistencies can be found across multiple evaluation protocols. Some papers seem to tune the model parameters on the test set, introducing more bias into the evaluation (Zhan et al., 2022). Others report aggregated metrics over datasets consisting of samples from different distributions or aggregated over multiple datasets (Su et al., 2019). Such metrics can be of interest in their own right, however, without a clear definition of how these aggregated values are computed or additional analysis, these evaluations often lack clarity, comparability, and reproducibility. 
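Among these practices, the effect of point adjustment is easy to reproduce. In the sketch below (our own illustrative code, not a specific paper's implementation), a single lucky hit inside a 100-step anomaly window lifts recall from 0.01 to 1.0 after adjustment:

```python
def point_adjust(labels, preds):
    """Point adjustment: if any step of a ground-truth anomaly window is
    predicted, mark the whole window as correctly predicted."""
    adjusted = list(preds)
    start = None
    for i, l in enumerate(labels + [0]):  # sentinel closes a trailing window
        if l and start is None:
            start = i
        elif not l and start is not None:
            if any(preds[start:i]):
                adjusted[start:i] = [1] * (i - start)
            start = None
    return adjusted

labels = [0] * 50 + [1] * 100 + [0] * 50          # one long anomaly window
preds = [0] * 50 + [1] + [0] * 99 + [0] * 50      # a single lucky hit
adjusted = point_adjust(labels, preds)
raw_recall = sum(a and b for a, b in zip(labels, preds)) / sum(labels)
adj_recall = sum(a and b for a, b in zip(labels, adjusted)) / sum(labels)
print(raw_recall, adj_recall)  # -> 0.01 1.0
```

A random predictor enjoys the same amplification: the longer the anomaly window, the more likely at least one random alarm falls inside it, after which the entire window counts as detected.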
We encountered many more minor inconsistencies, highlighting the importance of official implementations and thorough specifications.

4 TimeSeAD: Benchmarking Deep Multivariate Time-Series AD

In this section, we propose how to benchmark time-series AD methods in a way that mitigates the issues discussed in Section 3. We discuss the strengths and weaknesses of two recent datasets and how their flaws can be mitigated. Further, we introduce modified versions of time-series precision and recall to alleviate the biases of the precision and ensure recall consistency of the recall. Finally, we discuss our evaluation protocol and implementation.

4.1 Datasets

In the previous section, we uncovered flaws in several commonly used benchmark datasets, making them unfit for evaluation. However, there are also datasets we find more suited for benchmarking, namely SMD (Su et al., 2019) and Exathlon (Jacob et al., 2020). SMD contains 28 time series generated from different processes and thus comprises 28 datasets with 38 features each. Some of its datasets suffer from distributional shift and have been removed from evaluations in the past (Li et al., 2021b). Detecting distributional shift in time-series data is no trivial task in itself. Therefore, we rely on manual inspection of all datasets in this study. We exclude several datasets from the final evaluation where we suspect delayed or long-term effects caused by anomalies, and only report those results in Appendix E. In total, we remove 13 datasets, leaving 15 datasets for evaluation.

Exathlon comprises eight datasets collected from applications run on a cluster. The time series in Exathlon suffer from missing values, which the creators suggest replacing with default values. This inadvertently injects unlabeled anomalies into the data, where the default values follow a different distribution.
Instead, we replace any missing values with the respective preceding value. We omit two applications, one for which we identify a severe distributional shift and one with a too-small test set, leaving six datasets. Overall, we find several more instances of possible delayed effects and distributional shift, which might be attributed to background effects. Nonetheless, we strongly encourage further careful inspection by application experts, especially to address the high anomaly density in all datasets in Exathlon.

All datasets in SMD and Exathlon are far from ideal benchmark datasets. Considering each time series individually leaves each dataset with fewer samples compared to other datasets, and the high anomaly density in Exathlon is worrying. While not perfect, these datasets are significantly better alternatives to SWaT, WADI, SMAP, and MSL for evaluating time-series AD methods.

4.2 Metrics

In Section 3.2, we discussed the potential and shortcomings of TRec and TPrec. We propose new default parameters for TRec and a variation of TPrec to address their flaws. Let us first note the discrepancy between the two terms in Equation (1). The first term counts the number of anomaly windows for which at least one point was predicted correctly. In contrast, the second term is entirely concerned with the predictive structure within each anomaly window. Since the first term is completely oblivious to the size of the anomalies, the range of both terms could vary wildly between tasks, and the terms would need to be balanced for each task individually. Furthermore, the second term already implicitly acknowledges the existence of anomalies in their overlap. Thus, we suggest using α = 0. To prevent unintuitive results caused by recall inconsistency, we further require the cardinality function to guarantee recall consistency. Thus, we define a suitable class of cardinality functions for which recall consistency always holds.
Theorem 1 TRec is recall consistent for any cardinality function of the form

γ(1, A) = 1,    γ(n, A) = max{ 0, γ(n − 1, A) − ( Σ_{t∈A} δ(t − min A, |A|) )⁻¹ }.

Proof - Sketch²: Let 𝒫′_A ⊆ 𝒫_A denote the predicted windows overlapping A that persist after increasing the threshold, so that |𝒫′_A| ≤ |𝒫_A|. Then the term in Equation (1) corresponding to A is non-increasing if the following holds:

( Σ_{t∈(∪𝒫_A)∩A} δ(t − min A, |A|) − Σ_{t∈(∪𝒫′_A)∩A} δ(t − min A, |A|) ) / Σ_{t∈A} δ(t − min A, |A|) ≥ ( |𝒫_A| − |𝒫′_A| ) / Σ_{t∈A} δ(t − min A, |A|) ≥ γ(|𝒫′_A|, A) − γ(|𝒫_A|, A).

If this holds for all cardinalities smaller than the initial prediction, the statement holds, since recall consistency with respect to the threshold is equivalent to the term in Equation (1) corresponding to A being non-increasing with respect to the cardinality of the predictions.

²We provide the detailed proof in Appendix A.

This definition of suitable cardinality functions depends on the choice of bias function. A particularly important choice is the constant bias, allowing the metric to be readily applied to most settings. Thus, the following theorem shows the closed-form solution of its corresponding cardinality function.

Theorem 2 With constant bias, the cardinality function has the closed-form solution

γ*(n, A) = max{ 0, (|A| − n + 1) · |A|⁻¹ }.

Proof - Sketch²: By an inductive argument, using the fact that the maximum over a set is larger than or equal to all its elements, we can show that γ is lower bounded by γ* for all cardinalities. By a similar argument, rewriting the maximum using a technical lemma, we can show γ is upper bounded by γ* for all cardinalities. Combining both facts yields the statement.

We call TRec with cardinality function γ* and constant bias TRec*. While this gives an easy-to-compute metric, the general formulation preserves the bias function as a tunable parameter. It is important to retain this degree of generality in the definition, such that we can still adapt the metric to specific use cases, such as early prediction. Finally, we address the bias of time-series precision.
In the definition of TPrec, each predicted range is weighted equally by the inverse of the number of predictions, $|\mathcal{P}|^{-1}$. Instead of using equal weights, we propose to weight the term for each predicted range $P \in \mathcal{P}$ by its share of all predicted points, $|P| \left(\sum_{P' \in \mathcal{P}} |P'|\right)^{-1}$. This choice penalizes fragmented predictions by eliminating their global effect on the total precision. Using these implementations of precision and recall, we can compute an F1-score and the area under the precision-recall curve (AUPRC).

To further justify this choice, we examine the anomaly scores produced by different methods. We compare the ordering induced by the point-wise F1-score with that produced by the F1-score using TRec and the adjusted TPrec and find that the latter more closely matches our intuition. For example, while LSTM-P produces scores that fluctuate within anomaly windows and spike near the end of, and even outside, anomaly windows, likely resulting in fragmented predictions, TCN-AE produces scores that smoothly increase and decrease over the duration of anomaly windows, resulting in continuous predictions with a spike at the terminal failure (see Figure 4).

Figure 4: Scores from two methods on a test time series of Exathlon 2: (a) LSTM-P scores, (b) TCN-AE scores. Method (b) performs better according to our score but worse according to the point-wise F1-score; our score thus provides an ordering aligned with our intuition.

4.3 Evaluation

To address the inconsistencies in evaluation protocols and provide the necessary tools for consistent evaluations, we introduce our TimeSeAD library. The library consists of a general training and evaluation framework geared towards deep-learning based methods, several analysis tools for datasets and methods, as well as a large collection of architectural elements, methods, and baselines. The architectural elements are implemented on top of PyTorch (Paszke et al., 2019) to provide reusable building blocks that allow extensive customization.
This setup allows researchers to prototype ideas quickly and users to adjust individual elements to any setting. Using these elements, we implemented 28 deep methods. Furthermore, the library provides a common interface for all datasets considered in this study, alongside several analysis tools that we used in the evaluation in Section 3.1. Additionally, we provide a fast (linear-time) implementation of time-series precision and recall along with the extensions proposed in the previous section. All elements of this library are specifically designed to work well with time-series data.

The general framework for training and evaluation provides the foundation for a unified evaluation and enables integration into customized experiment management systems. We implemented a separate plugin based on sacred (Greff et al., 2017) to run and manage the experiments for the evaluation of methods and datasets. We used this setup to conduct our analysis of the datasets in Section 3.1 and to create a benchmark of 28 methods on 21 datasets. Since the performance of a given set of hyperparameters can vary greatly between datasets, we adapt grid search to tune the hyperparameters over a preselected set of parameter choices. To perform grid search without introducing significant bias into the evaluation, we remove part of the test set to tune the parameters on, before evaluating with the best-performing parameters on the rest. Because of distributional changes within the test set, a fixed, arbitrary split can introduce further bias. To mitigate these effects, we instead perform cross-validation on the test set, splitting it into multiple folds and using each fold once as a validation set. Finally, to mitigate the impact of temporal dependencies between folds, we remove the folds neighboring each validation set.
To ensure a fair evaluation, we choose a maximum training time and adjust the size of the parameter grid such that each method can be fully evaluated within this time frame. We use this evaluation protocol for all methods³.

4.4 Benchmark results

See Table 1 for the main results of our evaluation on SMD and Exathlon. Here, we show the best F1-scores based on TRec* and TPrec*, introduced in Section 4.2. For readability, we only report the induced ranking, where 1 corresponds to the highest and 28 to the lowest score. Interested readers can find the raw scores alongside evaluations with different metrics in Appendix E. On SMD, we see consistently strong performance by older (less complex) methods, such as LSTM-AE (Malhotra et al., 2016) and LSTM-P (Malhotra et al., 2015). In contrast, several modern approaches, such as the group of GAN-based methods, often perform poorly on the SMD datasets. Interestingly, SMD and Exathlon do not share any methods among their top three performers. In fact, methods performing well on SMD generally struggle on Exathlon and vice versa, indicating that there is currently no dominant architecture for multivariate time-series AD. However, the autoencoder- and prediction-based methods perform consistently across multiple datasets, whereas the methods collected in the other group have the most difficulties across datasets. Our benchmark reveals that the variational autoencoder-based method GMM-GRU-VAE (Zhang et al., 2021), the prediction-based method GDN (Deng & Hooi, 2021), and the reconstruction-based STGAT-MAD (Zhan et al., 2022) perform the most consistently across SMD and Exathlon. There are several reasons why our results might not reflect the advances promised in numerous papers. First, we use a standardized evaluation protocol with a fixed training time during hyperparameter searches to guarantee a fair comparison.
As more complex models usually take longer to train, we shrink the grid of possible hyperparameters to fit our limited time budget (see Section 4.3 for details). Second, we tune the models by maximizing our novel F1-score based on TRec* and TPrec*. Lastly, authors do not always provide an official implementation, meaning we had to replicate some methods solely based on the corresponding paper.

5 Discussion

In this section, we discuss current and future challenges of time-series AD. Quality datasets are the backbone of any evaluation. The analysis in Section 3.1 revealed multiple severe flaws in many widely used multivariate time-series datasets. Other datasets used for evaluation are sometimes not publicly available (Audibert et al., 2020; Park et al., 2018), so assessing their quality is virtually impossible. This poses a huge problem for the field going forward. Publicly available high-quality datasets, such as CIFAR (Krizhevsky et al., 2009) or ImageNet (Deng et al., 2009) in computer vision, provide great platforms for comparative evaluations, propelling their respective fields forward. Such a dataset is sorely needed for multivariate time-series AD. Our analysis shows that any new dataset needs to undergo careful scrutiny. The analysis tools in our TimeSeAD library provide a solid baseline, but future discussion will likely reveal more potential flaws and pitfalls. Automated detection of distributional shift, in particular, holds great potential for future analysis.

³We provide a detailed description in Appendix D.

Table 1: Cross-validation results on Exathlon and SMD. We report the ranks according to the best F1-score based on TRec* and TPrec*, averaged over all test folds. µ_Exa and µ_SMD are the ranked average scores over all datasets in Exathlon and SMD, respectively. µ_all shows the ranked order of the weighted average scores over all datasets from both Exathlon and SMD.
We weight µ_all by the number of datasets in Exathlon and SMD in order to treat both collections equally. (In the typeset version, bold marks the top 3, normal size the top 9, and small print all other methods for each dataset.) Full results are provided in Appendix E.

Method | Exathlon: 1 2 4 5 6 9 | µ_Exa | SMD: 1 6 8 9 10 11 13 14 16 17 20 21 24 26 27 | µ_SMD | µ_all
LSTM-AE | 24 2 22 21 22 24 | 22 | 4 3 5 10 1 8 4 1 1 3 3 1 2 4 3 | 1 | 4
LSTM-Max-AE | 5 23 24 19 21 8 | 20 | 3 24 22 12 21 17 6 19 20 11 21 25 12 7 11 | 17 | 22
MSCRED | 10 1 2 14 4 1 | 1 | 9 19 1 20 19 26 21 26 13 9 24 23 21 22 18 | 20 | 12
FC-AE | 4 20 11 15 14 11 | 10 | 7 13 8 9 11 13 9 8 6 6 14 6 9 8 2 | 7 | 7
USAD | 7 18 10 11 17 20 | 15 | 24 21 20 18 20 11 10 10 24 13 12 15 20 23 15 | 15 | 16
TCN-AE | 8 5 15 9 19 2 | 3 | 21 15 6 23 22 25 23 23 16 16 5 20 22 1 23 | 21 | 17
GenAD | 3 25 4 10 1 28 | 4 | 20 28 24 25 24 28 8 20 23 18 28 28 18 14 13 | 24 | 23
STGAT-MAD | 14 17 9 20 2 16 | 14 | 13 7 14 4 4 10 14 4 3 7 4 10 8 2 5 | 5 | 3
Anomaly Transformer | 27 27 19 27 3 25 | 27 | 26 20 19 28 25 22 27 17 22 26 25 14 19 19 28 | 27 | 27
LSTM-P | 19 12 23 8 27 13 | 25 | 1 1 2 14 6 9 2 2 4 2 11 7 11 12 4 | 2 | 10
LSTM-S2S-P | 13 19 1 24 13 6 | 11 | 6 16 3 19 23 23 24 25 10 15 17 18 15 13 21 | 18 | 18
DeepAnT | 12 10 7 12 12 14 | 7 | 10 12 12 5 17 15 15 9 7 14 10 13 6 10 20 | 12 | 11
TCN-S2S-P | 15 7 13 13 24 17 | 19 | 16 2 4 6 9 7 16 12 2 1 2 5 5 11 1 | 3 | 5
GDN | 2 6 17 16 9 5 | 2 | 2 14 7 11 7 14 13 14 15 10 14 11 4 20 10 | 10 | 2
LSTM-VAE | 20 14 14 2 8 18 | 6 | 15 11 17 21 2 5 7 13 9 20 8 2 7 24 14 | 9 | 8
Donut | 23 22 8 4 20 10 | 16 | 17 6 9 3 3 6 19 5 18 5 1 9 1 26 9 | 6 | 6
LSTM-DVAE | 18 24 18 3 23 19 | 17 | 25 10 15 22 8 4 18 3 12 23 9 3 10 21 24 | 13 | 13
GMM-GRU-VAE | 21 11 20 6 6 4 | 5 | 11 5 11 2 5 1 17 6 14 4 6 8 3 15 6 | 4 | 1
OmniAnomaly | 25 21 27 1 5 12 | 21 | 18 4 16 8 16 2 1 15 5 21 18 16 14 27 7 | 11 | 14
SIS-VAE | 17 16 6 7 7 22 | 12 | 5 9 10 7 12 12 11 7 8 8 7 12 13 3 8 | 8 | 9
BeatGAN | 6 3 16 18 15 15 | 8 | 19 18 18 15 13 16 12 18 21 12 16 17 23 17 17 | 16 | 15
MAD-GAN | 9 15 12 23 10 23 | 18 | 22 23 13 17 18 24 22 16 17 24 13 26 27 18 26 | 23 | 24
LSTM-VAE-GAN | 11 8 5 25 16 7 | 13 | 14 17 21 1 15 19 3 22 25 22 21 19 25 6 12 | 19 | 20
TadGAN | 1 4 21 17 18 21 | 9 | 12 26 27 16 14 21 5 21 19 17 19 21 17 5 25 | 22 | 21
LSTM-AE OC-SVM | 16 9 25 26 26 26 | 26 | 27 25 25 26 27 18 26 27 27 19 23 27 24 9 16 | 26 | 26
MTAD-GAT | 22 13 26 5 11 9 | 24 | 23 8 26 13 10 3 20 11 11 25 20 4 16 16 22 | 14 | 19
NCAD | 28 28 28 28 28 27 | 28 | 28 27 28 27 28 27 28 28 28 28 26 24 28 28 27 | 28 | 28
THOC | 26 26 3 22 25 3 | 23 | 8 22 23 24 26 20 25 24 26 27 27 22 26 25 19 | 25 | 25

Generating data, normal or anomalous, could be used to address certain shortcomings of datasets. Creating large real-world datasets of high quality is difficult for many reasons, often involving complex systems with many interactions. Recent advances in data generation could thus be used to augment small datasets. Even fully artificial datasets could be used to evaluate algorithms with respect to specific aspects of the data (Schmidl et al., 2022). On the other hand, generated anomalies could be purposefully injected into datasets to expand the range of anomalous situations, or to simulate and recreate anomalies without having to observe them in the system. Anomalies in large systems are often expensive to induce or tied to critical system failures, making the latter option appealing for data generation outside simulations.

A common evaluation metric is another critical tool for making evaluations comparable, eliminating the need to reevaluate many methods repeatedly. In Section 4.2 we presented a recall-consistent implementation of recall for time-series data and an accompanying precision with adjusted bias. We illustrate its capabilities experimentally through examples and the benchmark presented in Section 4.3, and justify its definition theoretically in Section 4.2. However, an in-depth experimental comparison to its alternatives on a wide range of use cases could help steer the community in a unified direction.
An analysis beyond the standard setting, for example early detection, could be particularly interesting. The TimeSeAD library provides, at the time of writing, a shared evaluation protocol, a collection of 28 methods, and several analysis tools. New methods are proposed constantly, and the library and future benchmarks need to adjust accordingly. Thus, if adopted by the community, we will continue to expand the library with more methods, metrics, and datasets. Of particular interest are shallow baselines: to truly justify using large deep models, a solid collection of easier-to-train shallow methods beyond the trivial baselines currently implemented is needed for a well-rounded benchmark. Especially if datasets become more complex and high-dimensional, we expect shallow methods to quickly fall behind in performance, as has been the case in other settings. We encourage researchers to contribute their methods and experiments to grow the library.

Explanation and robustness play an essential role in safety-critical applications, such as self-driving cars. The correlations and dependencies between features in multivariate time series, in particular, offer great opportunities and challenges for explanation and robustness. Some methods offer the necessary mechanisms to enable explanations, such as feature-based attention or graph-based structures (Zhao et al., 2020; Deng & Hooi, 2021; Hua et al., 2022; Zhan et al., 2022), leaving ample room to explore these concepts further. Robustness to corrupted training samples can be important when large-scale data collection is noisy or unreliable. Recently, Li et al. (2022) analyzed robustness in time-series AD; however, they consider only four methods on four datasets, relying on point-wise metrics. Thus, much more research is needed to explore robustness further. Settings beyond the one presented in this paper appear in many applications.
Some require anomalies to be detected as early as possible; here, our metric can be adapted by changing the bias function. Other settings require anomalies to be detected even before an anomalous event occurs. This requires more than adjusted metrics: at the very least, the datasets need to include anomalies that are detectable ahead of time. Another interesting question is that of generalization. Our benchmark revealed that no method consistently performs best or worst across all datasets. Simply training on multiple datasets at once seems infeasible, since datasets can contain different sets of features. Moreover, even if they contain the same number of features, individual samples can differ remarkably, spreading a model thin. Another emerging setting considers unequally spaced time series (Jeong et al., 2022), where classical methods can technically be applied without much effort. However, developing alternatives that explicitly address temporal irregularities could be an interesting line of research.

6 Conclusion

Many datasets are severely flawed and form a shaky foundation for AD evaluations. Even carefully constructed datasets (such as those in Exathlon) reveal flaws under careful scrutiny. In addition, despite their well-known problems, point-wise metrics are still the de-facto standard in most evaluations. Together with inconsistent evaluation protocols, these three main issues create an illusion of progress in time-series AD. We have proposed TimeSeAD, a library for anomaly detection on multivariate time-series data specialized in deep-learning based methods. TimeSeAD contains a new metric that considers temporal dependencies and produces reliable results, as we demonstrate, a collection of analysis tools for datasets and methods, implementations of 28 methods, and a general evaluation framework. The metric is provably recall-consistent and allows for customization through the bias function.
Using our library, we created a substantial benchmark revealing that no method consistently outperforms its competitors. We found that modern approaches often struggle to reach the performance of older methods. We hope that our comprehensive TimeSeAD library aids the community in measuring the gains of new algorithms in the future and thus helps to shed some light on the actual progress in (deep) multivariate time-series AD.

Acknowledgments

Part of this work was conducted within the DFG research unit FOR 5359 on Deep Learning on Sparse Chemical Process Data (KL 2698/6-1 and KL 2698/7-1). FCFS acknowledges support from TU Berlin and BASF SE under the BASLEARN - TU Berlin (BASF Joint Lab for Machine Learning) project. MK acknowledges support by the Carl-Zeiss Foundation, the DFG awards KL 2698/2-1, KL 2698/5-1, KL 2698/6-1, and KL 2698/7-1, and the BMBF awards 03|B0770E and 01|S21010C.

References

Chuadhry Mujeeb Ahmed, Venkata Reddy Palleti, and Aditya P Mathur. WADI: A water distribution testbed for research in the design of secure cyber physical systems. In Proceedings of the 3rd International Workshop on Cyber-Physical Systems for Smart Water Networks, pp. 25-28, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 214-223. PMLR, 2017. URL https://proceedings.mlr.press/v70/arjovsky17a.html.

Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A Zuluaga. USAD: Unsupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3395-3404, 2020.

Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A Zuluaga.
Do deep neural networks contribute to multivariate time series anomaly detection? arXiv preprint arXiv:2204.01637, 2022.

Aadyot Bhatnagar, Paul Kassianik, Chenghao Liu, Tian Lan, Wenzhuo Yang, Rowan Cassius, Doyen Sahoo, Devansh Arpit, Sri Subramanian, Gerald Woo, et al. Merlion: A machine learning library for time series. arXiv preprint arXiv:2109.09265, 2021.

Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano. A review on outlier/anomaly detection in time series data. ACM Computing Surveys (CSUR), 54(3):1-33, 2021.

Chris U Carmona, François-Xavier Aubet, Valentin Flunkert, and Jan Gasthaus. Neural contextual anomaly detection for time series. arXiv preprint arXiv:2107.07702, 2021.

Cristian I Challu, Peihong Jiang, Ying Nian Wu, and Laurent Callot. Deep generative model with hierarchical latent factors for time series anomaly detection. In International Conference on Artificial Intelligence and Statistics, pp. 1643-1654. PMLR, 2022.

Guillaume Chambaret, Laure Berti-Equille, Frédéric Bouchara, Emmanuel Bruno, Vincent Martin, and Fabien Chaillan. Stochastic pairing for contrastive anomaly detection on time series. In International Conference on Pattern Recognition and Artificial Intelligence, pp. 306-317. Springer, 2022.

Wenchao Chen, Long Tian, Bo Chen, Liang Dai, Zhibin Duan, and Mingyuan Zhou. Deep variational graph convolutional recurrent network for multivariate time series anomaly detection. In International Conference on Machine Learning, pp. 3621-3633. PMLR, 2022.

Zekai Chen, Dingshuo Chen, Xiao Zhang, Zixuan Yuan, and Xiuzhen Cheng. Learning graph structures with transformer for multivariate time series anomaly detection in IoT. IEEE Internet of Things Journal, 2021.

Kukjin Choi, Jihun Yi, Changhwa Park, and Sungroh Yoon. Deep learning for anomaly detection in time-series data: Review, analysis, and guidelines. IEEE Access, 2021.

Zahra Zamanzadeh Darban, Geoffrey I Webb, Shirui Pan, Charu C Aggarwal, and Mahsa Salehi.
Deep learning for time series anomaly detection: A survey. arXiv preprint arXiv:2211.05244, 2022.

Ailin Deng and Bryan Hooi. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 4027-4035, 2021.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Keval Doshi, Shatha Abudalou, and Yasin Yilmaz. TiSAT: Time series anomaly transformer. arXiv preprint arXiv:2203.05167, 2022.

Kamil Faber, Marcin Pietron, and Dominik Zurek. Ensemble neuroevolution-based approach for multivariate time series anomaly detection. Entropy, 23(11):1466, 2021.

Daniel Fährmann, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Lightweight long short-term memory variational auto-encoder for multivariate time series anomaly detection in industrial control systems. Sensors, 22(8):2886, 2022.

Pavel Filonov, Andrey Lavrentyev, and Artem Vorontsov. Multivariate industrial time series with cyber-attack simulation: Fault detection using an LSTM-based predictive data model. arXiv preprint arXiv:1612.06676, 2016.

Astha Garg, Wenyu Zhang, Jules Samaran, Ramasamy Savitha, and Chuan-Sheng Foo. An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Transactions on Neural Networks and Learning Systems, 33(6):2508-2517, 2021.

Alexander Geiger, Dongyu Liu, Sarah Alnegheimish, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. TadGAN: Time series anomaly detection using generative adversarial networks. In 2020 IEEE International Conference on Big Data (Big Data), pp. 33-43. IEEE, 2020.

Jonathan Goh, Sridhar Adepu, Khurum Nazir Junejo, and Aditya Mathur. A dataset to support research in the design of secure water treatment systems.
In International Conference on Critical Information Infrastructures Security, pp. 88-99. Springer, 2016.

Klaus Greff, Aaron Klein, Martin Chovanec, Frank Hutter, and Jürgen Schmidhuber. The sacred infrastructure for computational research. In Proceedings of the 16th Python in Science Conference, volume 28, pp. 49-56, 2017.

Yifan Guo, Weixian Liao, Qianlong Wang, Lixing Yu, Tianxi Ji, and Pan Li. Multidimensional time series anomaly detection: A GRU-based Gaussian mixture variational autoencoder approach. In Asian Conference on Machine Learning, pp. 97-112. PMLR, 2018.

Yangdong He and Jiabao Zhao. Temporal convolutional networks for anomaly detection in time series. In Journal of Physics: Conference Series, volume 1213, pp. 042050. IOP Publishing, 2019.

Hajar Homayouni, Sudipto Ghosh, Indrakshi Ray, Shlok Gondalia, Jerry Duggan, and Michael G Kahn. An autocorrelation-based LSTM-autoencoder for anomaly detection on time-series data. In 2020 IEEE International Conference on Big Data (Big Data), pp. 5068-5077. IEEE, 2020.

Xiaolei Hua, Lin Zhu, Shenglin Zhang, Zeyan Li, Su Wang, Dong Zhou, Shuo Wang, and Chao Deng. GenAD: General representations of multivariate time series for anomaly detection. arXiv preprint arXiv:2202.04250, 2022.

Alexis Huet, Jose Manuel Navarro, and Dario Rossi. Local evaluation of time series anomaly detection algorithms. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 635-645, 2022.

Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 387-395, 2018.

Vincent Jacob, Fei Song, Arnaud Stiegler, Bijan Rad, Yanlei Diao, and Nesime Tatbul. Exathlon: A benchmark for explainable anomaly detection over time series. arXiv preprint arXiv:2010.05073, 2020.
Kyeong-Joong Jeong, Jin-Duk Park, Kyusoon Hwang, Seong-Lyun Kim, and Won-Yong Shin. Two-stage deep anomaly detection with heterogeneous time series data. IEEE Access, 10:13704-13714, 2022.

Wenqian Jiang, Yang Hong, Beitong Zhou, Xin He, and Cheng Cheng. A GAN-based anomaly detection approach for imbalanced industrial time series. IEEE Access, 7:143608-143619, 2019.

Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon. Towards a rigorous evaluation of time-series anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7194-7201, 2022.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Kwei-Herng Lai, Daochen Zha, Junjie Xu, Yue Zhao, Guanchu Wang, and Xia Hu. Revisiting time series outlier detection: Definitions and benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.

Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms: The Numenta anomaly benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 38-44. IEEE, 2015.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

Chang-Ki Lee, Yu-Jeong Cheon, and Wook-Yeon Hwang. Studies on the GAN-based anomaly detection methods for the time series data. IEEE Access, 9:73201-73215, 2021.

Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng. Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758, 2018a.

Dan Li, Dacheng Chen, Baihong Jin, Lei Shi, Jonathan Goh, and See-Kiong Ng. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. In International Conference on Artificial Neural Networks, pp. 703-716. Springer, 2019.
Longyuan Li, Junchi Yan, Haiyang Wang, and Yaohui Jin. Anomaly detection of time series with smoothness-inducing sequential variational auto-encoder, 2021a.

Wenkai Li, Cheng Feng, Ting Chen, and Jun Zhu. Robust learning of deep time series anomaly detection models with contaminated training data. arXiv preprint arXiv:2208.01841, 2022.

Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations, 2018b. URL https://openreview.net/forum?id=SJiHXGWAZ.

Zhihan Li, Youjian Zhao, Jiaqi Han, Ya Su, Rui Jiao, Xidao Wen, and Dan Pei. Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3220-3230, 2021b.

Benjamin Lindemann, Benjamin Maschler, Nada Sahlab, and Michael Weyrich. A survey on anomaly detection for technical systems using LSTM networks. Computers in Industry, 131:103498, 2021.

Yuan Luo, Ya Xiao, Long Cheng, Guojun Peng, and Danfeng Yao. Deep learning-based anomaly detection in cyber-physical systems: Progress and opportunities. ACM Computing Surveys (CSUR), 54(5):1-36, 2021.

Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. Long short term memory networks for anomaly detection in time series. In 23rd European Symposium on Artificial Neural Networks, ESANN 2015, Bruges, Belgium, April 22-24, 2015, 2015. URL http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2015-56.pdf.

Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148, 2016.

Ali H Mirza and Selin Cosan. Computer network intrusion detection using sequential LSTM neural networks autoencoders.
In 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1-4. IEEE, 2018.

Mohsin Munir, Shoaib Ahmed Siddiqui, Andreas Dengel, and Sheraz Ahmed. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. IEEE Access, 7:1991-2005, 2018.

Zijian Niu, Ke Yu, and Xiaofei Wu. LSTM-based VAE-GAN for time-series anomaly detection. Sensors, 20(13):3738, 2020.

John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S Tsay, Aaron Elmore, and Michael J Franklin. Volume under the surface: A new accuracy evaluation measure for time-series anomaly detection. Proceedings of the VLDB Endowment, 15(11):2774-2787, 2022.

Daehyung Park, Yuuna Hoshi, and Charles C Kemp. A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robotics and Automation Letters, 3(3):1544-1551, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Joao Pereira and Margarida Silveira. Unsupervised anomaly detection in energy time series data using variational recurrent autoencoders with attention. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1275-1282. IEEE, 2018.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1530-1538, Lille, France, 2015. PMLR. URL https://proceedings.mlr.press/v37/rezende15.html.

Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification.
In International Conference on Machine Learning, pp. 4393-4402. PMLR, 2018.

Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Alexander Binder, Emmanuel Müller, Klaus-Robert Müller, and Marius Kloft. Deep semi-supervised anomaly detection. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HkgH0TEYwH.

Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, and Klaus-Robert Müller. A unifying review of deep and shallow anomaly detection. Proc. IEEE, 109(5):756-795, 2021.

Mahmoud Said Elsayed, Nhien-An Le-Khac, Soumyabrata Dev, and Anca Delia Jurcut. Network anomaly detection using LSTM based autoencoder. In Proceedings of the 16th ACM Symposium on QoS and Security for Wireless and Mobile Networks, pp. 37-45, 2020.

Erik Scharwächter and Emmanuel Müller. Statistical evaluation of anomaly detectors for sequences. arXiv preprint arXiv:2008.05788, 2020.

Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly detection in time series: A comprehensive evaluation. Proceedings of the VLDB Endowment, 15(9):1779-1797, 2022.

Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001. ISSN 0899-7667. doi: 10.1162/089976601750264965. URL https://doi.org/10.1162/089976601750264965.

Lifeng Shen, Zhuocong Li, and James Kwok. Timeseries anomaly detection using temporal hierarchical one-class network. Advances in Neural Information Processing Systems, 33:13016-13026, 2020.

Maximilian Sölch, Justin Bayer, Marvin Ludersdorfer, and Patrick van der Smagt. Variational inference for on-line anomaly detection in high-dimensional time series. arXiv preprint arXiv:1602.07109, 2016.

Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei.
Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2828-2837, 2019.

Nesime Tatbul, Tae Jun Lee, Stan Zdonik, Mejbah Alam, and Justin Gottschlich. Precision and recall for time series. Advances in Neural Information Processing Systems, 31, 2018.

Markus Thill, Wolfgang Konen, and Thomas Bäck. Time series encodings with temporal convolutional networks. In International Conference on Bioinspired Methods and Their Applications, pp. 161-173. Springer, 2020.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

Taras K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52-57, January 1968. ISSN 1573-8337. doi: 10.1007/BF01074755. URL https://doi.org/10.1007/BF01074755.

Lan Wang, Yusan Lin, Yuhang Wu, Huiyuan Chen, Fei Wang, and Hao Yang. Forecast-based multi-aspect framework for multivariate time-series anomaly detection. In 2021 IEEE International Conference on Big Data (Big Data), pp. 938-947. IEEE, 2021.

Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.

Renjie Wu and Eamonn Keogh. Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Transactions on Knowledge and Data Engineering, 2021.

Qinfeng Xiao, Shikuan Shao, and Jing Wang. Memory-augmented adversarial autoencoders for multivariate time-series anomaly detection with deep reconstruction and prediction. arXiv preprint arXiv:2110.08306, 2021.

Haowen Xu, Yang Feng, Jie Chen, Zhaogang Wang, Honglin Qiao, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, et al.
Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference on World Wide Web – WWW '18, 2018. doi: 10.1145/3178876.3185996. URL http://dx.doi.org/10.1145/3178876.3185996.

Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Anomaly transformer: Time series anomaly detection with association discrepancy. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=LzQQ89U1qm_.

Jun Zhan, Siqi Wang, Xiandong Ma, Chengkun Wu, Canqun Yang, Detian Zeng, and Shilin Wang. Stgat-mad: Spatial-temporal graph attention network for multivariate time series anomaly detection. In ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3568–3572. IEEE, 2022.

Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V. Chawla. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1409–1416, 2019.

Hongjing Zhang, Fangzhou Cheng, and Aparna Pandey. One-class predictive autoencoder towards unsupervised anomaly detection on industrial time series. In ANDEA '22, 2022a.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.

Kai Zhang, Yushan Jiang, Lee Seversky, Chengtao Xu, Dahai Liu, and Houbing Song. Federated variational learning for anomaly detection in multivariate time series. In 2021 IEEE International Performance, Computing, and Communications Conference (IPCCC), pp. 1–9. IEEE, 2021.

Weiqi Zhang, Chen Zhang, and Fugee Tsung.
Grelen: Multivariate time series anomaly detection from the perspective of graph relational learning. In Lud De Raedt (ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 2390–2397. International Joint Conferences on Artificial Intelligence Organization, July 2022b. doi: 10.24963/ijcai.2022/332. Main Track.

Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. Multivariate time-series anomaly detection via graph attention network. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 841–850. IEEE, 2020.

Bin Zhou, Shenghua Liu, Bryan Hooi, Xueqi Cheng, and Jing Ye. Beatgan: Anomalous rhythm detection using adversarially generated time series. In IJCAI, pp. 4433–4439, 2019.

Let $X \in \mathbb{R}^{T \times F}$ be a time series of length $T$ and dimension $F$, and let $y \in \{0, 1\}^T$ be the corresponding point-wise labels. Consider an online anomaly detector $s: \mathbb{R}^{T \times F} \to \mathbb{R}^T$ that computes a score for each time step in $X$ based only on the points that came before it. The set of anomalies is
$$\mathcal{A} = \{[a, b] \subseteq [T] \mid \forall t \in [a, b]: y[t] = 1;\ \nexists [a', b'] \supsetneq [a, b]: \forall t \in [a', b']: y[t] = 1\},$$
and the set of all predictions for a threshold $\lambda \in \mathbb{R}$ is
$$\mathcal{P}_\lambda = \{[a, b] \subseteq [T] \mid \forall t \in [a, b]: s(X)[t] \geq \lambda;\ \nexists [a', b'] \supsetneq [a, b]: \forall t \in [a', b']: s(X)[t] \geq \lambda\}.$$
Given a cardinality function $\gamma: \mathbb{N} \times \mathcal{P}([T]) \to \mathbb{R}_{\geq 0}$ and a bias function $\delta: \mathbb{N} \times \mathbb{N} \to \mathbb{R}_{\geq 0}$, where $\mathcal{P}([T])$ is the power set of $[T]$, the time-series recall is given by
$$\mathrm{TRec}(\mathcal{A}, \mathcal{P}) = \frac{1}{|\mathcal{A}|} \sum_{A \in \mathcal{A}} \left[ \alpha\, \mathbb{1}(|\mathcal{P}_A| > 0) + (1 - \alpha)\, \gamma(|\mathcal{P}_A|, A)\, \frac{\sum_{t \in (\bigcup \mathcal{P}_A) \cap A} \delta(t - \min A, |A|)}{\sum_{t \in A} \delta(t - \min A, |A|)} \right]$$
with $\mathcal{P}_A = \{P \in \mathcal{P} \mid |A \cap P| > 0\}$. The cardinality function is monotone decreasing in its first argument and satisfies $\gamma(1, \cdot) = 1$.

Proof of Theorem 1: It is straightforward to see that $\gamma$ is monotone decreasing, as the maximum is taken over values with smaller inputs, multiplied by a factor smaller than one. It remains to show that the resulting TRec is recall consistent.
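As an aside, the quantities defined above are simple to compute. The following is a minimal illustrative sketch (not the TimeSeAD library implementation) that extracts the maximal interval sets $\mathcal{A}$ and $\mathcal{P}_\lambda$ and evaluates TRec for the special case of a constant bias $\delta \equiv 1$ and the trivial cardinality function $\gamma \equiv 1$; the function names and the default $\alpha = 0.5$ are assumptions for illustration only.

```python
# Sketch of TRec with delta ≡ 1 and gamma ≡ 1 (hypothetical helper names,
# not the TimeSeAD API). Intervals are inclusive pairs (a, b).

def maximal_intervals(mask):
    """Maximal intervals [a, b] of consecutive True entries in mask."""
    intervals, start = [], None
    for t, m in enumerate(mask):
        if m and start is None:
            start = t
        elif not m and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(mask) - 1))
    return intervals

def trec(y, scores, lam, alpha=0.5):
    """TRec(A, P_lambda); with delta ≡ 1 and gamma ≡ 1 the overlap term
    reduces to |A ∩ (union of predictions)| / |A|."""
    anomalies = maximal_intervals([yt == 1 for yt in y])
    preds = maximal_intervals([s >= lam for s in scores])
    pred_points = {t for a, b in preds for t in range(a, b + 1)}
    total = 0.0
    for a, b in anomalies:
        A = set(range(a, b + 1))
        overlap = A & pred_points
        existence = 1.0 if overlap else 0.0  # indicator 1(|P_A| > 0)
        total += alpha * existence + (1 - alpha) * len(overlap) / len(A)
    return total / len(anomalies)
```

For example, with labels `[0, 1, 1, 1, 0, 0, 1, 1, 0]` the anomaly set is `[(1, 3), (6, 7)]`, and a partially overlapping prediction yields a recall strictly between the pure existence reward and full detection.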
Since the terms within the sum are all non-negative, it suffices to show that each individual term only ever decreases. Consider two thresholds $\lambda, \lambda' \in \mathbb{R}$ with $\lambda' > \lambda$ and an anomaly $A \in \mathcal{A}$ such that $\mathcal{P}_{\lambda', A} \neq \mathcal{P}_{\lambda, A}$. Note that $|\mathcal{P}_{\lambda, A}| = 0$ implies $|\mathcal{P}_{\lambda', A}| = 0$. If $|\mathcal{P}_{\lambda', A}| = 0$, the inner sum is zero, and the statement is true. Thus, we assume $|\mathcal{P}_{\lambda', A}| > 0$ and can therefore ignore the first term inside the outer sum, since $\mathbb{1}(|\mathcal{P}_{\lambda, A}| > 0) = \mathbb{1}(|\mathcal{P}_{\lambda', A}| > 0)$ always holds.

First, we consider the case $|\mathcal{P}_{\lambda', A}| \geq |\mathcal{P}_{\lambda, A}|$. Since $\gamma$ is monotone decreasing in its first argument and the inner sum loses at least one non-negative term, the second term can only decrease or stay the same.

Next, we consider the case $|\mathcal{P}_{\lambda', A}| < |\mathcal{P}_{\lambda, A}|$. We want to show that each term only ever decreases with an increasing threshold, i.e.,
$$\gamma(|\mathcal{P}_{\lambda, A}|, A)\, \frac{\sum_{t \in (\bigcup \mathcal{P}_{\lambda, A}) \cap A} \delta(t - \min A, |A|)}{\sum_{t \in A} \delta(t - \min A, |A|)} \geq \gamma(|\mathcal{P}_{\lambda', A}|, A)\, \frac{\sum_{t \in (\bigcup \mathcal{P}_{\lambda', A}) \cap A} \delta(t - \min A, |A|)}{\sum_{t \in A} \delta(t - \min A, |A|)}.$$
If $\gamma(|\mathcal{P}_{\lambda', A}|, A) = 0$, then $\gamma(|\mathcal{P}_{\lambda, A}|, A) = 0$ as well, because $\gamma$ is monotone decreasing, and the recall does not change. Thus we assume $\gamma(|\mathcal{P}_{\lambda', A}|, A) > 0$, in which case the inequality above holds if and only if
$$\frac{\gamma(|\mathcal{P}_{\lambda, A}|, A)}{\gamma(|\mathcal{P}_{\lambda', A}|, A)} \geq \frac{\sum_{t \in (\bigcup \mathcal{P}_{\lambda', A}) \cap A} \delta(t - \min A, |A|)}{\sum_{t \in (\bigcup \mathcal{P}_{\lambda, A}) \cap A} \delta(t - \min A, |A|)}.$$
Consider
$$\Delta\delta = \sum_{t \in (\bigcup \mathcal{P}_{\lambda, A}) \cap A} \delta(t - \min A, |A|) - \sum_{t \in (\bigcup \mathcal{P}_{\lambda', A}) \cap A} \delta(t - \min A, |A|) > 0.$$
Then it holds that every $P' \in \mathcal{P}_{\lambda', A}$ is contained in some $P \in \mathcal{P}_{\lambda, A}$, so at least $|\mathcal{P}_{\lambda, A}| - |\mathcal{P}_{\lambda', A}|$ intervals from $\mathcal{P}_{\lambda, A}$ vanish completely. Since $(\bigcup \mathcal{P}_{\lambda, A}) \cap A \subseteq A$, we also know
$$\frac{\sum_{t \in (\bigcup \mathcal{P}_{\lambda', A}) \cap A} \delta(t - \min A, |A|)}{\sum_{t \in (\bigcup \mathcal{P}_{\lambda, A}) \cap A} \delta(t - \min A, |A|)} = \frac{\sum_{t \in (\bigcup \mathcal{P}_{\lambda, A}) \cap A} \delta(t - \min A, |A|) - \Delta\delta}{\sum_{t \in (\bigcup \mathcal{P}_{\lambda, A}) \cap A} \delta(t - \min A, |A|)} \leq \frac{\sum_{t \in A} \delta(t - \min A, |A|) - \Delta\delta}{\sum_{t \in A} \delta(t - \min A, |A|)} \leq 1 - \frac{|\mathcal{P}_{\lambda, A}| - |\mathcal{P}_{\lambda', A}|}{\sum_{t \in A} \delta(t - \min A, |A|)} \leq \frac{\gamma(|\mathcal{P}_{\lambda, A}|, A)}{\gamma(|\mathcal{P}_{\lambda', A}|, A)}.$$
Since this holds true for all $0 < |\mathcal{P}_{\lambda', A}| < |\mathcal{P}_{\lambda, A}|$, the resulting recall is recall consistent.

Lemma 1: For any $x \in \mathbb{R}_{\geq 1}$ it holds that $x\left(1 - \frac{1}{x}\right)^n \geq x - n$ for all $n \in \mathbb{N}$.

Proof: We know $x\left(1 - \frac{1}{x}\right) = x - 1$. By induction over $n$, it holds
$$x\left(1 - \frac{1}{x}\right)^{n+1} \geq (x - n)\left(1 - \frac{1}{x}\right) = x - (n + 1) + \frac{n}{x} \geq x - (n + 1).$$

Proof of Theorem 2: We show the proposition by induction.
First, note that $\gamma^*(1, A) = \left(\frac{|A| - 1}{|A|}\right)^{0} = 1$. Now assume $\gamma^*(m, A) = \left(\frac{|A| - 1}{|A|}\right)^{m - 1}$ for all $m \leq n$. Then it holds $\gamma^*(n + 1, A) = \max\{0,$