Few-Shot Fast-Adaptive Anomaly Detection

Ze Wang, Yipin Zhou, Rui Wang, Tsung-Yu Lin, Ashish Shah, and Ser-Nam Lim
Purdue University; Meta AI

Abstract: The ability to detect anomalies has long been recognized as an inherent human capability, yet to date, practical AI solutions that mimic it have been lacking. This lack of progress can be attributed to several factors. To begin with, the distribution of abnormalities is intractable: anything outside of a given normal population is by definition an anomaly. This explains why a large volume of work in this area has been dedicated to modeling the normal distribution of a given task and then detecting deviations from it. This direction is, however, unsatisfying, as it requires modeling the normal distribution of every task that comes along, which entails tedious data collection. In this paper, we report our work aiming to handle these issues. To deal with the intractability of the abnormal distribution, we leverage Energy-Based Models (EBMs). EBMs learn to assign low energies to correct values and higher energies to incorrect values. At its core, our EBM employs Langevin Dynamics (LD) to generate these incorrect samples through an iterative optimization procedure, alleviating the intractable problem of modeling the world of anomalies. Then, in order to avoid training an anomaly detector for every task, we utilize an adaptive sparse coding layer. Our intention is to design a plug-and-play component that can be used to quickly update what is normal at inference time. Lastly, to avoid tedious data collection, this update of the sparse coding layer needs to be achievable with just a few shots. Here, we employ a meta-learning scheme that simulates such a few-shot setting during training. We support our findings with strong empirical evidence.

1 Introduction

Anomaly detection is an important area of study in the field of artificial intelligence.
It has found utility in computer vision applications such as industrial inspection [6] and video surveillance [28, 61, 39], in abuse prevention contexts such as misinformation, fraud, and network intrusion detection [60, 8, 35], and in others such as system health monitoring and fault detection [4, 42]. In this paper, we propose an approach for detecting anomalies in images, with carefully designed steps to handle some of the bigger issues that have prevented the deployment of image anomaly detection in the real world. Image anomaly detection can generally be defined as the identification of abnormalities in a given image. An exact definition of abnormality is elusive because abnormality can be derived from any unknown distribution outside of a normal population. Many studies have hence focused on modeling the normal population instead of learning irregularities, where the goal is to capture the concept shared among all of the normal data as one or several reference models. This process usually requires investing significant effort in curating a large set of normal samples for each task, after which anomalies are detected as deviations from the reference model(s) [1, 58]. Recent work from [50] provides algorithms that utilize only a few normal samples to train models from scratch. However, the models still have to be provisioned for each new task, which requires considerable human effort and expertise, and thus lack the fast-deployment property that is often time-critical for real-world applications. In view of these challenges, our goals for this work are threefold.

(Work done as an intern at Meta AI. Contact: zewang@purdue.edu)
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
We are interested in designing an anomaly detection system that is capable of: (G1) modeling the normal population while at the same time taking a principled approach towards modeling the abnormalities; (G2) quickly adapting to a new task at inference time; and (G3) requiring only a few normal shots to update itself to the new task at hand. For (G1), we introduce the class of Energy-Based Models (EBMs), an important family of generative models [62, 17, 57]. EBMs have demonstrated superior capability in modeling data density and localizing anomalies [20]. For our purpose, the EBM we adopt learns to assign low energy to normal samples and high energy to abnormal samples. More importantly, the abnormal samples are generated with a procedure known as Langevin Dynamics (LD) [54], which, in its original form, starts with a noise image and gradually samples from the distribution along the direction of lower energy. This lends itself gracefully to utilizing the generated intermediate samples as negative/abnormal examples. The LD procedure is then coupled with a maximum likelihood loss [24] that aims to maximize the energy differences between the normal and abnormal samples. To achieve (G2), we propose an adaptive sparse coding layer that is attached to the deep feature extractor in the EBM, as Figure 1 shows. The extracted deep feature is forwarded to the sparse coding layer, where the dictionary is constructed from the features of a few normal samples of the given task. In essence, the input representation is decomposed into a linear combination of normal features under a sparsity constraint. The final energy score is measured by the distance between the original and the reconstructed features (after the sparse coding layer). Under this scheme, the dictionary for a particular task is not obtained by learning, but instead is constructed from the feature representations of a few normal samples during inference.
As a result, this simple plug-and-play trick allows the model to be adapted to novel tasks promptly without re-training. Further, we expect that the dictionary, which is formed from normal features, will not be able to explain abnormal samples well, causing relatively high reconstruction errors that lend themselves to subsequent detection. As a bonus, a backward pass of energy-score minimization can be used to localize abnormal regions; we show that using gradients to localize anomalies yields superior robustness. Towards (G3), we utilize meta-learning [52, 18] to simulate the scenario of being given a new task with a few normal shots to update the dictionary, followed by training the EBM. This is accomplished by episodic training, where in each episode the model is adapted to a held-back task given a few normal samples. To accelerate the EBM training, we introduce learning from inpainting, a simple yet effective strategy for synthesizing hard abnormal samples more quickly by starting the LD procedure from a normal sample with an injected noise patch, as opposed to the noise image that is traditionally used. We show that the proposed few-shot fast-adaptive anomaly detection and localization framework is able to efficiently adapt to a novel task (e.g., a new object category or scenes from a new camera) with a few normal samples and without re-training, on both industrial inspection and video surveillance. Compared with previous methods that adapt to a new task either by few-shot training from scratch [50, 55] or by a few steps of gradient descent in a few-shot setting [32], the proposed framework is the first to perform task adaptation with a single forward pass and without any gradient descent. Despite the fast adaptation, we provide both qualitative and quantitative results demonstrating that our method outperforms other adaptive frameworks and is comparable to methods that rely on large amounts of normal samples.
2 Background

In this section, we briefly introduce two key ingredients of the proposed method: EBMs and sparse coding.

Energy-based models. In EBMs, the goal is to learn an energy function E_θ(x) : R^d → R which parametrizes the data density p_θ(x) as:

p_θ(x) = exp(−E_θ(x)) / Z_θ,  (1)

where θ is the parameter of the energy function and Z_θ = ∫_x exp(−E_θ(x)) dx is the partition function. Approximating the true data distribution p_data(x) is equivalent to minimizing the expected negative log-likelihood over the data distribution, defined by the loss function:

L_ML = E_{x∼p_data(x)}[−log p_θ(x)] = E_{x∼p_data(x)}[E_θ(x) + log Z_θ].  (2)

As the computation of L_ML involves the intractable term Z_θ, the common practice is to represent the gradient of L_ML as:

∇_θ L_ML = E_{x+∼p_data(x)}[∇_θ E_θ(x+)] − E_{x−∼p_θ(x)}[∇_θ E_θ(x−)].  (3)

This objective decreases the energy of positive samples x+ from the true distribution (normal samples in our use case) and increases the energy of negative samples x− from the model p_θ (synthesized abnormal samples). In practice, the negative samples are synthesized through Langevin dynamics [54], in which a J-step sampling along the direction of energy minimization is given by:

x_j = x_{j−1} − (β/2) ∇_x E_θ(x_{j−1}) + ω_j,  ω_j ∼ N(0, βI),  j = 1, . . . , J,  (4)

where β is the step size and the initialization x_0 is sampled from a predefined prior distribution. The synthesizing ability of EBMs enables generating abnormal samples that help in learning a more accurate data density, and is often touted as one of the advantages of using an EBM.

Sparse coding. Approximating a signal z ∈ R^d by a sparse linear combination over a dictionary D ∈ R^{d×k} can be expressed as:

min_α (1/2) ||z − Dα||²₂ + λ ||α||₁,  (5)

where α is the vector of sparse coefficients, ||α||₁ is its sparsity-inducing l1 norm, and λ is the weight of the sparsity constraint. Dα is a sparse approximation to the original signal z. In practice, finding the dictionary atoms and the sparse coefficients is usually formulated as an optimization problem.
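As an illustration, the Langevin update of Eqn. 4 can be run on a toy quadratic energy whose gradient is available in closed form. The following is a minimal numpy sketch under assumed values of β and J; the function names are ours, not from the paper:

```python
import numpy as np

# Toy quadratic energy E(x) = 0.5 * ||x - mu||^2, with gradient (x - mu).
def energy_grad(x, mu):
    return x - mu

def langevin_sample(x0, mu, beta=0.05, J=500, rng=None):
    """J steps of Langevin dynamics (Eqn. 4): a gradient step on the
    energy plus Gaussian noise with variance beta."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = x0.copy()
    for _ in range(J):
        x = (x - 0.5 * beta * energy_grad(x, mu)
             + rng.normal(0.0, np.sqrt(beta), size=x.shape))
    return x

mu = np.array([2.0, -1.0])
x0 = np.zeros(2)           # a "noise" initialization away from the mode
x = langevin_sample(x0, mu)
# after many steps, samples concentrate in the low-energy region around mu
```

For a real EBM, `energy_grad` would instead be the gradient of the learned network's energy with respect to the input, obtained by back-propagation.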
In this paper, we adopt Iterative Soft Thresholding Continuation (ISTC) [25] to convert this optimization problem into linear operations with a non-linear shrinkage function, which allows sparse coding to be seamlessly integrated into deep neural networks. To compute a sparse coefficient vector α, ISTC performs iterations of gradient steps on the reconstruction term ||z − Dα||² and a proximal projection step to increase coefficient sparsity. Formally, initializing the coefficients α_0 with all zeros, each step of ISTC refines the sparse code with descending values of λ from λ_max to λ*:

α_{n+1} = σ(α_n + D^T(z − Dα_n), λ_n),  with λ_n decayed from λ_max to λ*,  (6)

where σ(·, λ) is a shrinkage function that truncates small values (lower than λ) of the coefficients to 0 to enforce sparsity, and can be easily implemented by a customized ReLU activation function:

σ(z, λ) = sgn(z) max(|z| − λ, 0) = sgn(z) ReLU(|z| − λ).  (7)

3 Proposed Method

In this section, we describe the proposed fast-adaptive anomaly detection framework in detail. In Section 3.1, we introduce the adaptive EBM, which consists of a deep feature extractor followed by an adaptive sparse coding layer. From there, we further show that utilizing a larger receptive field in the sparse coding improves training robustness (Section 3.1.1), and that applying smoothed shrinkage functions helps speed up convergence (Section 3.1.2). In Section 3.2, we describe the episodic training regime over various anomaly detection tasks, which mimics few-shot adaptation in the meta-testing stage while learning common knowledge across tasks. Instead of synthesizing negative samples (anomalies) directly from noise, we introduce a simple but effective learning-from-inpainting operation to accelerate training in Section 3.3. Finally, we summarize the training steps of the proposed method in Algorithm 1, and the inference steps in Algorithm 2.
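The ISTC iteration of Eqns. 6–7 can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the 1/L step size and the geometric λ schedule are standard choices we assume here, since this chunk of the paper leaves them unspecified:

```python
import numpy as np

def relu_shrink(z, lam):
    # Eqn. 7: sgn(z) * ReLU(|z| - lam)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def istc(z, D, n_steps=50, lam_max=1.0, lam_min=0.1):
    """Gradient steps on ||z - D a||^2 followed by shrinkage (Eqn. 6),
    with the threshold decayed geometrically from lam_max to lam_min.
    The step size 1/L, with L the largest eigenvalue of D^T D, keeps
    the iteration stable; it is our assumption, not from the paper."""
    a = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2
    for lam in np.geomspace(lam_max, lam_min, n_steps):
        a = relu_shrink(a + D.T @ (z - D @ a) / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(16, 32))
D /= np.linalg.norm(D, axis=0)     # unit-norm dictionary atoms
a_true = np.zeros(32)
a_true[[3, 17]] = [1.5, -2.0]      # a 2-sparse ground-truth code
z = D @ a_true
a = istc(z, D)
# the recovered code is sparse and reconstructs z closely
```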
Figure 1: Overview of the inference stage on a new task. (a) Adapting the task-specific dictionary with K normal samples from the new task (dictionary plug-and-play). (b) Three iterations of sparse coding based on Eqn. 6, followed by energy estimation E_θ(x; D) = MSE(z, z′) on testing normal and abnormal samples. We also show a backward pass from the reconstruction error to localize the abnormal regions.

3.1 Adaptive Energy-based Model

An EBM is a form of generative model widely used for modeling data density and sampling. While there has been recent work [20] applying EBMs to anomaly detection, it still requires retraining for each new task. To efficiently adapt the EBM to novel tasks, we introduce an adaptive sparse coding layer which is conditioned on a dictionary constructed from the features of normal samples. Specifically, as illustrated in Fig 1, given an input image x ∈ R^{3×h×w}, we first obtain the corresponding feature z ∈ R^{d×h′×w′} from the deep feature extractor with parameters θ, so that z = φ(x; θ). All feature vectors along the spatial axes of z are then sparsely decomposed through the sparse coding layer over a task-specific dictionary D ∈ R^{d×Kh′w′}, which contains the features of K normal samples of the current task, as shown in Fig 1(a). Each feature vector of the normal-sample features is directly used as an atom in the task dictionary. The decomposed coefficients are α = S(z; D), where α ∈ R^{Kh′w′×h′w′} and S denotes the iterative sparse decomposition process of (6). By multiplying the coefficients with the dictionary D, we obtain the reconstructed features z′ = Dα. The sparsity regularization on α is important, as it encourages input features to be reconstructed by simple combinations of dictionary atoms (normal features), so that features of abnormal samples are difficult to approximate well and therefore produce higher reconstruction errors, which makes them conducive to anomaly detection.
From here, the final energy score is formulated as the mean squared error (MSE) between the original and the reconstructed features:

E_θ(x; D) = MSE(z, z′) = ||φ(x; θ) − D S(z; D)||²₂.  (8)

In effect, Eqn. 8 depicts a conditional EBM, conditioned on the task-specific dictionary D formed from normal features. With the energy score, we can obtain pixel-wise anomaly localization maps through ∇_x E_θ(x; D), i.e., the gradients of the pixels along the direction of energy minimization. High gradient magnitudes indicate regions that cannot be well explained by the dictionary D; modifications to these regions can potentially remove the anomaly and reduce the energy, as in Eqn. 4. In Section 4.1 and Appendix Section B.5, we show that using the gradient (a natural ingredient of EBMs with LD) is more robust than auto-encoder and reconstruction-based methods at generalizing to unseen tasks (Appendix Figure C). In the following sections, we discuss how to make the training of this adaptive structure more robust.

3.1.1 Sparse Coding with Receptive Field

As discussed in Section 3.1, the input feature z consists of h′ × w′ d-dimensional feature vectors, which are treated independently while passing through the sparse coding layer. The region of the input image that affects one feature vector is determined by the receptive field of the feature extractor. The trade-off is that a small receptive field may not capture enough contextual information, while a large receptive field makes the feature maps spatially coarse and makes it hard to

Figure 2: Illustration of episodic training and (a) learning by inpainting. In each episode, a support set is constructed with normal samples, whose features are plugged into the adaptive sparse coding layer as the dictionary (dictionary plug-and-play).
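A toy version of the conditional energy of Eqn. 8 can be sketched with raw vectors standing in for extracted features (an identity "feature extractor") and a plain iterative soft-thresholding loop standing in for the sparse coding layer. Treating the sparse code as fixed, the gradient of the MSE with respect to the feature is proportional to the residual z − Dα, so in this simplified sketch the residual itself serves as the anomaly map; all names and numbers are illustrative:

```python
import numpy as np

def shrink(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def energy_and_map(z, D, lam=0.05, n_steps=100):
    """Conditional energy (Eqn. 8) of a query feature z against a
    dictionary D of normal features. Returns the MSE energy and the
    absolute residual, which doubles as a crude anomaly map here."""
    a = np.zeros(D.shape[1])
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    for _ in range(n_steps):
        a = shrink(a + step * D.T @ (z - D @ a), step * lam)
    resid = z - D @ a
    return np.mean(resid ** 2), np.abs(resid)

rng = np.random.default_rng(1)
normals = rng.normal(size=(64, 8))           # 8 "normal" feature vectors
D = normals / np.linalg.norm(normals, axis=0)
code = np.zeros(8)
code[[1, 4]] = [1.0, 0.5]
normal_q = D @ code                          # lies in the span of normal features
abnormal_q = normal_q.copy()
abnormal_q[:8] += 2.0                        # localized perturbation
e_norm, _ = energy_and_map(normal_q, D)
e_abn, amap = energy_and_map(abnormal_q, D)
# the abnormal query receives a higher energy, and the map peaks
# around the perturbed coordinates
```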
Synthesized abnormal samples are corrected by gradient steps along the direction of energy minimization (the five steps of the Langevin dynamics process are shown).

spot small anomaly regions. To solve this dilemma, instead of carefully tuning the receptive field of each layer of the feature extractor, we introduce a simple yet effective technique: applying the receptive field in the sparse coding layer. Specifically, as illustrated in Appendix Fig A, rather than performing sparse coding on each individual d-dimensional feature vector, we apply it to d × l × l volumes centered around each feature vector, where l is the receptive field. This is equivalent to applying an l × l sliding window along the spatial axes of the feature map, and can be easily implemented by an image-to-column (Im2Col) operation. We then flatten the feature volumes into d·l²-dimensional vectors and adjust the shape of the dictionary accordingly. In this way, we capture contextual information without needing to carefully tune the architecture of the feature extractor, and we show in later experiments that this technique improves the robustness of the network on different types of objects.

3.1.2 Shrinkage Function

The effectiveness of training the EBM for localizing anomaly regions depends heavily on gradient propagation from later to earlier layers. It is shown in [15] that smooth activation functions like Swish [45] can be beneficial here. Notably, the gradients of the dictionary D are determined by the sparse coding coefficients α, as shown in Eqn. 6. However, the sparsity constraint on α would turn off the gradient computation of many elements in D, which can be detrimental during the early stage of training. To alleviate this sparse-gradient issue, we replace the ReLU-like shrinkage function in Eqn. 7 with a smoothed counterpart, the Sigmoid-based shrinkage function (SigShrink).
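The l × l sliding-window extraction of Sec. 3.1.1 can be sketched with a plain-loop Im2Col. Zero padding (so the spatial size is preserved) and stride 1 are our assumptions for the sketch:

```python
import numpy as np

def im2col(feat, l):
    """Extract l x l sliding windows (stride 1, zero-padded to keep the
    spatial size) from a d x h x w feature map and flatten each window
    into a d*l*l column, as in Sec. 3.1.1. A plain-loop sketch, not the
    paper's implementation."""
    d, h, w = feat.shape
    p = l // 2
    padded = np.pad(feat, ((0, 0), (p, p), (p, p)))
    cols = np.empty((d * l * l, h * w))
    for i in range(h):
        for j in range(w):
            cols[:, i * w + j] = padded[:, i:i + l, j:j + l].ravel()
    return cols

feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
cols = im2col(feat, 3)
print(cols.shape)   # (18, 16): each column is a flattened 2x3x3 window
```

Each column can then be sparsely coded against a dictionary whose atoms are flattened d·l² windows extracted from the normal samples.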
The SigShrink function was originally proposed for non-parametric signal estimation in [3], and can be defined as:

σ_γ(z, λ) = z / (1 + exp(−γ(|z| − λ))),  (9)

where γ is the smoothness hyperparameter. We present visualizations of the hard shrinkage function of Eqn. 7 and SigShrink with different values of γ in Fig B. Compared with the hard shrinkage function, which truncates small values to exactly zero, SigShrink with a large γ sharply forces small values towards (but not exactly to) zero. SigShrink therefore guarantees non-zero gradients everywhere.

3.2 Episodic Training

To train the proposed adaptive EBM, we perform episodic training, as widely adopted by meta-learning few-shot learners [18, 51]. Following the terminology of few-shot learning, in each training episode the model is adapted to and tested on a task sampled from the underlying task distribution. Specifically, the model is adapted with a support set of the given task, and a query set with ground-truth labels is then used to evaluate the adaptation and update the model parameters. As shown in Fig 2, the support set of the i-th episode's task contains a small number of K normal samples {s^i_k}_{k=1}^K. The features z^i_k = φ(s^i_k; θ) of these normal samples are plugged into the dictionary D^i ∈ R^{d×Kh′w′} corresponding to the i-th task to adapt the dictionary. After that, the adapted model is measured on a query set consisting of M normal samples {q^i_m}_{m=1}^M and M abnormal samples {q̂^i_m}_{m=1}^M. Note that no actual abnormal samples are given during training; instead, they are iteratively sampled from the EBM, as discussed in detail in Section 3.3. Recall that training an EBM with contrastive divergence as in Eqn. 3 requires estimating the energy scores of both positive samples from the true data distribution and negative samples from the modeled distribution. The positive energy can be estimated empirically with the normal query-set samples.
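For illustration, the hard shrinkage of Eqn. 7 and SigShrink of Eqn. 9 side by side in numpy; the value γ = 20 is an arbitrary choice for the demonstration:

```python
import numpy as np

def hard_shrink(z, lam):
    # Eqn. 7: truncates values below lam to exactly zero (zero gradient there)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sig_shrink(z, lam, gamma=20.0):
    # Eqn. 9: pushes small values towards zero but keeps non-zero gradients
    return z / (1.0 + np.exp(-gamma * (np.abs(z) - lam)))

z = np.linspace(-1.0, 1.0, 9)
print(hard_shrink(z, 0.5))   # zero below the threshold 0.5
print(sig_shrink(z, 0.5))    # near-zero (but not exactly zero) below 0.5
```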
The negative energy can be estimated by performing an MCMC-based sampling technique [37, 54], typically the Langevin Dynamics described in Eqn. 4. Denoting the output of Langevin dynamics (the sampled abnormal samples) initialized with q̂^i_m as LD(q̂^i_m), we have the empirical estimate of the contrastive divergence for the i-th episode:

L_cd = (1/M) Σ_{m=1}^M [ E_θ(q^i_m; D^i) − E_θ(LD(q̂^i_m); D^i) ].  (10)

With the energy score equivalent to the feature reconstruction error in Eqn. 8, minimizing L_cd encourages normal features to be well reconstructed by a sparse linear combination of dictionary atoms, while features from abnormal samples tend to produce relatively higher reconstruction errors, so that they can be easily spotted.

3.3 Synthesizing Negative Samples

Typical EBM training with contrastive divergence conducts negative sampling from the modeled density using techniques such as Langevin Dynamics, which applies gradient descent to a noise initialization with a small step size and a large number of steps [17]. Such negative sampling can be costly, and we argue that it is unnecessary in our case. Instead, we introduce a new strategy of learning by inpainting. Starting from a positive query sample q^i_m, we synthesize the corresponding negative sample q̂^i_m by randomly placing a small uniform-noise patch on the image. The Langevin Dynamics procedure is then initialized with the resulting image instead of a noise image. As the Langevin Dynamics proceeds, the synthesized abnormal samples LD(q̂^i_m) are inpainted towards the corresponding normal sample q^i_m, and we introduce the following reconstruction loss:

L_rec = (1/M) Σ_{m=1}^M || LD(q̂^i_m) − q^i_m ||²₂.  (11)

We show in Fig 2(a) that, starting from a synthesized abnormal sample, only 5 steps of Langevin dynamics are sufficient to make it visually close to the corresponding normal sample during training, serving as hard negatives that further facilitate learning. The final loss of the episodic training is simply:

L = γ₀ L_rec + γ₁ L_cd,  (12)

where γ₀ and γ₁ balance the two loss terms.
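The patch-corruption step of learning by inpainting can be sketched as follows; the patch size and noise range are illustrative choices, and the function name is ours:

```python
import numpy as np

def corrupt_with_patch(img, patch_size=8, rng=None):
    """Negative synthesis for learning by inpainting (Sec. 3.3): place a
    random uniform-noise patch on a normal image. The result is used to
    initialize the Langevin dynamics of Eqn. 4 instead of a noise image."""
    rng = np.random.default_rng() if rng is None else rng
    c, h, w = img.shape
    out = img.copy()
    i = rng.integers(0, h - patch_size + 1)
    j = rng.integers(0, w - patch_size + 1)
    out[:, i:i + patch_size, j:j + patch_size] = rng.uniform(
        0.0, 1.0, size=(c, patch_size, patch_size))
    return out

rng = np.random.default_rng(0)
img = np.full((3, 32, 32), 0.5)                    # stand-in "normal" image
neg = corrupt_with_patch(img, patch_size=8, rng=rng)
changed = np.any(neg != img, axis=0)               # True only inside the patch
```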
We summarize the overall training and inference procedures in Alg. 1 and Alg. 2, respectively.

4 Experiments

In this section, we conduct an evaluation on the industrial inspection task with the MVTec-AD dataset [5, 6] (Section 4.1). Even though our proposed framework is image-based, we further demonstrate its efficacy on the video anomaly detection task in Section 4.2. In Section 4.3, we show ablations and insights relating to the adaptive sparse coding components. We show additional ablations, including the superiority of using gradients of EBMs over pixel-wise reconstruction to localize anomalies, in App. B, and we provide implementation details in App. A.

4.1 Industrial Inspection

The goal of this anomaly detection task is to predict whether a manufactured component contains any defects. The MVTec-AD dataset includes 15 categories of objects. To demonstrate the fast-adaptation capability of the proposed method, we adopt a leave-one-out training strategy. Specifically, samples of each target category are reserved for testing only, and the episodic training is performed on the

Algorithm 1 Training procedure.
1: Given: A feature extractor φ with parameters θ; a training dataset of multiple tasks with positive (normal) samples only.
2: Given: Number of shots K; number of query samples M; step size β of Langevin dynamics; total training episodes I; and learning rate η.
3: Initialize the feature extractor φ.
4: for Episode i = 1 : I do
5:   Sample the i-th task from the dataset, and randomly pick K + M samples to form the support set {s^i_k}_{k=1}^K and the query set {q^i_m}_{m=1}^M.
6:   Generate corrupted query samples {q̂^i_m}_{m=1}^M by placing random noise patches on {q^i_m}_{m=1}^M.
7:   Extract the support and query sample features with φ, and update the adaptive sparse coding layer with the i-th task dictionary D^i, constructed from the support-sample features. The energy function of the i-th task is now parametrized as E_θ(·; D^i).
8:   Obtain synthesized negative samples {LD(q̂^i_m)}_{m=1}^M with the updated energy function, using the Langevin dynamics of (4).
9:   Obtain the final loss L from L_cd (10) and L_rec (11).
10:  Update the parameters: θ ← θ − η ∇_θ L.
11: end for
12: Return the feature extractor with parameters θ.

Algorithm 2 Inference procedure on a task indexed by i.
1: Given: Feature extractor φ with trained parameters θ.
2: Given: The support set {s^i_k}_{k=1}^K and query samples {q^i_m}_{m=1}^M to be tested.
3: Extract the normal feature tensor Z^i ∈ R^{K×d×h′×w′}.
4: Reshape the normal feature tensor into a matrix, and use it as the task-specific dictionary D^i ∈ R^{d×Kh′w′}.
5: Estimate the anomaly score of a test sample q^i_m by its energy score from Eqn. 8, E_θ(q^i_m; D^i) = ||φ(q^i_m; θ) − D^i S(z; D^i)||²₂, where a higher energy score indicates that q^i_m is more likely to be abnormal.
6: The pixel-wise anomaly map of a test sample q^i_m can be obtained by visualizing ∇_{q^i_m} E_θ(q^i_m; D^i), where abnormal regions show higher magnitudes.

remaining categories. During the training stage, the model does not see any samples from the target category. During testing, we first adapt the model to the target category with 10 randomly selected normal samples, then measure performance on the entire testing set. We run the test 5 times; each time the model is adapted to a random set of 10 normal samples from the target category. In Table 1, we first show the performance of upper-bound methods, which train on each category from scratch with massive numbers of normal samples. Specifically, [7, 6] train auto-encoders (AE) with normal samples and measure reconstruction errors during inference; AnoGAN [48] adopts a generative adversarial network (GAN) to learn the manifold of normal data; VE-VAE [29] presents a visually explainable variational auto-encoder via gradient-based attention. For an apples-to-apples comparison, we create a strong baseline by applying model-agnostic meta-learning [18] to an AE (denoted MAML-AE; more details in App. Sec. A.3).
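Steps 3–4 of Alg. 2 (building the dictionary from the normal feature tensor) amount to a reshape; a numpy sketch with made-up dimensions:

```python
import numpy as np

def build_dictionary(normal_feats):
    """Alg. 2, steps 3-4: reshape a K x d x h' x w' tensor of normal-sample
    features into a d x (K*h'*w') dictionary whose columns are individual
    feature vectors (one atom per spatial position per sample)."""
    K, d, h, w = normal_feats.shape
    return normal_feats.transpose(1, 0, 2, 3).reshape(d, K * h * w)

# made-up dimensions: K=2 normal samples, d=4 channels, 3x3 spatial grid
feats = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(2, 4, 3, 3)
D = build_dictionary(feats)
print(D.shape)   # (4, 18)
```

Because the dictionary is just this reshaped tensor, adapting to a new task is a single forward pass through the feature extractor followed by a reshape, with no gradient descent.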
To the best of our knowledge, the proposed method is the first capable of producing strong performance on new tasks using just a single forward pass and no further training. This strongly suggests that the learned parameters are effectively shared across tasks, greatly helping to accelerate a model deployment process that is typically cumbersome otherwise. All results from our method are obtained with flipping and 90° rotation as augmentations. The proposed method outperforms MAML-AE by a large margin, and our results are even competitive with the upper bounds in some categories. We show the localized anomaly regions from our method in Fig 3; additional visualizations are in App. Fig D.

4.2 Video Surveillance

In video anomaly detection, a common goal is to detect abnormal events captured by surveillance cameras (e.g., a motorcycle on the sidewalk). A model trained on videos from one camera might not generalize well to other cameras due to different locations, mounting heights, and lighting conditions, and it is not feasible in practice to train one model for every new camera. The ability to quickly adapt to new scenes is thus a significant contribution to the task of video surveillance. We are only aware of the

Figure 3: Visualizations of anomalies localized by our method (input, ground truth, prediction).
Category | AE (SSIM) | AE (MSE) | AnoGAN | VE-VAE | MAML-AE | Ours
Carpet | 0.69 / 0.87 | 0.38 / 0.59 | 0.34 / 0.54 | 0.10 / 0.78 | 0.20 / 0.68 | 0.26 / 0.84
Grid | 0.88 / 0.94 | 0.83 / 0.90 | 0.04 / 0.58 | 0.02 / 0.73 | 0.01 / 0.53 | 0.12 / 0.82
Leather | 0.71 / 0.78 | 0.67 / 0.75 | 0.34 / 0.64 | 0.74 / 0.87 | 0.12 / 0.77 | 0.40 / 0.95
Tile | 0.04 / 0.59 | 0.23 / 0.51 | 0.08 / 0.50 | 0.14 / 0.93 | 0.14 / 0.52 | 0.26 / 0.76
Wood | 0.36 / 0.73 | 0.29 / 0.73 | 0.14 / 0.62 | 0.47 / 0.91 | 0.11 / 0.68 | 0.23 / 0.78
Bottle | 0.15 / 0.93 | 0.22 / 0.86 | 0.05 / 0.86 | 0.07 / 0.78 | 0.02 / 0.56 | 0.23 / 0.82
Cable | 0.01 / 0.82 | 0.05 / 0.86 | 0.01 / 0.78 | 0.18 / 0.90 | 0.04 / 0.74 | 0.18 / 0.80
Capsule | 0.09 / 0.94 | 0.11 / 0.88 | 0.04 / 0.84 | 0.11 / 0.74 | 0.03 / 0.68 | 0.10 / 0.90
Hazelnut | 0.00 / 0.97 | 0.41 / 0.95 | 0.02 / 0.87 | 0.44 / 0.98 | 0.11 / 0.72 | 0.40 / 0.94
Metal nut | 0.01 / 0.89 | 0.26 / 0.86 | 0.00 / 0.76 | 0.49 / 0.94 | 0.10 / 0.78 | 0.28 / 0.78
Pill | 0.07 / 0.91 | 0.25 / 0.85 | 0.17 / 0.87 | 0.18 / 0.83 | 0.10 / 0.62 | 0.11 / 0.88
Screw | 0.03 / 0.96 | 0.34 / 0.96 | 0.01 / 0.80 | 0.17 / 0.97 | 0.02 / 0.55 | 0.08 / 0.86
Toothbrush | 0.08 / 0.92 | 0.51 / 0.93 | 0.07 / 0.90 | 0.14 / 0.94 | 0.06 / 0.80 | 0.18 / 0.85
Transistor | 0.01 / 0.90 | 0.22 / 0.86 | 0.08 / 0.80 | 0.30 / 0.93 | 0.02 / 0.76 | 0.18 / 0.80
Zipper | 0.10 / 0.88 | 0.13 / 0.77 | 0.01 / 0.78 | 0.06 / 0.78 | 0.04 / 0.68 | 0.15 / 0.86

Table 1: Numerical evaluation of anomaly localization on MVTec-AD. Each cell reports mIoU / AUC-ROC. Columns 2-5 are upper-bound methods trained with massive numbers of normal samples.

work in [32] (r-GAN) that has such adaptation capability. Specifically, the model adapts to a new scene using gradient descent on several beginning frames of a query video, after which a GAN is applied to generate future frames. Anomalies are then detected via the discrepancy between the predicted future frames and the original frames. Note that the MAML-AE baseline from Section 4.1 can be seen as an ablation of r-GAN on single frames without temporal information. We follow the same evaluation regime as r-GAN by training with normal samples from all 13 scenes of SH-Tech [28] and testing on UCSD Pedestrian 1, UCSD Pedestrian 2 [34], and CUHK Avenue [30].
Note that since our method is image-based, it scores video frames independently, without leveraging any temporal information as r-GAN does. In each episode, we adapt our model with a support set containing a few normal frames randomly sampled from the target scene. In Table 2, we compare our method against r-GAN pre-trained on SH-Tech only (r-GAN Pre-train), fine-tuned on the target datasets (r-GAN Fine-tune), and adapted with one step of gradient descent via meta-learning (r-GAN MAML). We also show the performance of MAML-AE as a baseline image-based meta-learning method. In the last section of Table 2, we present intra-dataset results as well, training on 6 scenes of SH-Tech and testing on the remaining 7. We follow the common evaluation protocol and measure frame-level AUC-ROC. Without leveraging temporal information or re-training (gradient descent), our method achieves results comparable to r-GAN MAML and outperforms the image-based meta-learning baseline by a large margin. In App B.4, we show that incorporating simple temporal information can further improve performance.

4.3 Ablation Studies

Sparse coding receptive fields. To evaluate the effectiveness of large receptive fields in the sparse coding layer, we conduct additional experiments on the MVTec-AD dataset, selecting 5 representative categories with different levels of difficulty to compare l = 1 and l = 3 (Sec. 3.1.1) in Table 3. Sparse coding with a large receptive field clearly benefits the more complex

Figure 4: Visualizations of anomaly localization for video anomaly detection (input, ground truth, prediction).
Target dataset | Method | 1-shot | 5-shot | 10-shot
UCSD Ped 1 | r-GAN Pre-train | 73.10 | 73.10 | 73.10
UCSD Ped 1 | r-GAN Fine-tune | 76.99 | 77.85 | 78.23
UCSD Ped 1 | r-GAN MAML | 80.60 | 81.42 | 82.38
UCSD Ped 1 | MAML-AE | 64.12 | 66.88 | 67.34
UCSD Ped 1 | Ours | 77.42 | 78.12 | 78.65
UCSD Ped 2 | r-GAN Pre-train | 81.95 | 81.95 | 81.95
UCSD Ped 2 | r-GAN Fine-tune | 85.64 | 89.66 | 91.11
UCSD Ped 2 | r-GAN MAML | 91.19 | 91.80 | 92.80
UCSD Ped 2 | MAML-AE | 78.24 | 82.04 | 83.30
UCSD Ped 2 | Ours | 91.22 | 92.00 | 92.45
CUHK Avenue | r-GAN Pre-train | 71.43 | 71.43 | 71.43
CUHK Avenue | r-GAN Fine-tune | 75.43 | 76.52 | 77.77
CUHK Avenue | r-GAN MAML | 76.58 | 77.10 | 78.79
CUHK Avenue | MAML-AE | 68.72 | 69.67 | 70.01
CUHK Avenue | Ours | 80.68 | 83.41 | 84.46
SH-Tech (intra-dataset) | r-GAN Pre-train | 70.11 | 70.11 | 70.11
SH-Tech (intra-dataset) | r-GAN Fine-tune | 71.61 | 70.47 | 71.59
SH-Tech (intra-dataset) | r-GAN MAML | 74.51 | 75.28 | 77.36
SH-Tech (intra-dataset) | MAML-AE | 66.62 | 67.12 | 68.04
SH-Tech (intra-dataset) | Ours | 75.32 | 79.64 | 81.28

Table 2: Frame-level AUC-ROC for the video anomaly detection tasks.

Figure 5: Loss curves with smooth (SigShrink) and non-smooth (hard-shrink, ReLU-like) shrinkage functions.

Category | Leather | Grid | Hazelnut | Cable | Zipper
l = 1 | 0.40 / 0.95 | 0.11 / 0.80 | 0.36 / 0.91 | 0.17 / 0.76 | 0.12 / 0.84
l = 3 | 0.40 / 0.95 | 0.12 / 0.81 | 0.40 / 0.94 | 0.18 / 0.80 | 0.15 / 0.86

Table 3: Comparison of different sparse coding receptive fields. Each cell reports mIoU / AUC-ROC.

Category | Leather | Hazelnut | Cable
Ours | 0.40 / 0.95 / 1.6e-4 | 0.40 / 0.94 / 2.4e-4 | 0.18 / 0.80 / 2.0e-4
No sparsity | 0.32 / 0.90 / 0.9e-4 | 0.24 / 0.80 / 1.7e-4 | 0.12 / 0.68 / 1.5e-4

Table 4: Performance with and without the sparsity constraint. Each cell reports, from left to right: mIoU; AUC-ROC; the difference of averaged reconstruction errors between abnormal and normal samples.

structural objects (hazelnut, cable, and zipper), while the improvements are limited for the texture objects (leather and grid), where contextual regularization is intuitively less important.

Shrinkage functions. To show the benefits of a smooth shrinkage function, we plot the loss curves of models trained with the smooth SigShrink (Eqn. 9) and the non-smooth ReLU-like shrinkage (Eqn. 7) in Fig 5. The model with the smooth shrinkage function converges notably faster in the early training stage and achieves a lower loss.

Sparsity constraint.
As discussed in Section 3.1, we impose a sparsity constraint on the feature decomposition in the adaptive sparse coding layer in order to prevent abnormal features from being well approximated by linear combinations of normal features, so that reconstruction errors remain effective for detecting anomalies. To validate this, we conduct experiments in which the shrinkage function σ is removed from the sparse coding stage (Eqn. 6). Table 4 compares mIoU, AUC-ROC, and the difference of averaged reconstruction errors between abnormal and normal samples. Without sparsity, performance drops dramatically, and the reconstruction errors of normal and abnormal samples become closer.

5 Related Work

Anomaly detection with sparse coding. Early efforts on adopting sparse coding for anomaly detection are optimization-based (with an L1 penalty) [30, 61]. Recent advances in iterative sparse thresholding algorithms [11, 25] allow seamless integration of online sparse coding with deep neural networks, and [33] formulates sparse coding as stacked RNNs for video anomaly detection.

Anomaly detection with generative models. One of the core challenges in anomaly detection is that abnormal samples are usually unavailable at training time. Generative models are widely used in anomaly detection owing to their capability of modeling the density of the desired data distribution. Early variational autoencoder (VAE) based methods [1, 58] arguably have a hard time calibrating uncertainties on novel samples [36] and accurately localizing abnormal regions through reconstruction errors [12]. Recent efforts have explored alternative generative architectures such as energy-based models (EBMs) [20], GANs [50], and combinations of VAEs with EBMs [12]. Various methods also exploit intra-image structure [10, 7], cross-frame consistency [31], and motion-appearance consistency in videos [39] for detecting anomalies.

Few-shot learning.
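The episodic few-shot protocol used throughout the paper can be sketched as follows. This is a toy illustration with assumed names and sizes; scoring a query by its distance to the nearest support (normal) feature is a deliberately simplified stand-in for the paper's dictionary-based sparse reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature banks for three hypothetical tasks (e.g., scenes/categories):
# each task's normal features cluster around a task-specific mean.
tasks = {t: rng.standard_normal((40, 16)) + 4.0 * t for t in range(3)}

def sample_episode(task_id, k_shot=5, n_query=8):
    # Simulate the few-shot setting at training time: draw a small support
    # set of normal samples and a disjoint query set from the same task.
    pool = tasks[task_id]
    idx = rng.permutation(len(pool))
    return pool[idx[:k_shot]], pool[idx[k_shot:k_shot + n_query]]

def score(queries, support):
    # Simplified stand-in for dictionary-based reconstruction error:
    # distance from each query feature to its nearest support feature.
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return d.min(axis=1)

support, query = sample_episode(task_id=0)
pseudo_anomaly = tasks[1][:8]              # features from a different task

# Normal queries sit close to the support set; off-task features do not.
assert score(query, support).mean() < score(pseudo_anomaly, support).mean()
```

Meta-training repeats such episodes over many tasks so that, at test time, a handful of normal support samples from an unseen task suffices for adaptation.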
Few-shot learning has been extensively explored for classification tasks. It leverages common knowledge extracted from a distribution of tasks to induce an adaptive model that fits a new classification task with as little as one sample per class. Proposed methods are based on optimization [18, 47, 19, 59, 46], metric learning [51, 53], and parameter prediction [22, 43, 21]. These techniques have further been applied to other tasks such as image generation [9, 27] and out-of-distribution detection [49].

Energy-based models. As a family of generative models, studies on EBMs [26] mainly focus on probabilistic modeling of, and sampling from, data, either unconditionally [38, 44, 40, 17, 15, 63, 57, 56, 2] or conditionally [13, 14]. This has recently been followed by extensions to other applications, including reasoning [16], latent-space modeling of generative models [41], and anomaly detection [12].

6 Conclusion

In this paper, we introduced few-shot fast-adaptive anomaly detection. We formulated our model as an energy-based model with an adaptive sparse coding layer whose dictionary is formed directly from the normal features of a target task. We adopted episodic meta-learning to learn common knowledge across tasks, which enables adaptation from only a few shots. We further introduced smooth shrinkage functions, sparse coding with large receptive fields, and learning by inpainting to improve and accelerate training. Notably, on industrial inspection and video anomaly detection, our method achieves comparable, and in some cases better, performance than methods trained with a large amount of normal samples. Through this work, we hope to have contributed to the important problem of anomaly detection by showing that it can indeed be generalized to new tasks with only a few normal samples.

Social Impact and Ethics.
As a general framework for few-shot anomaly detection, the proposed method does not raise particular ethical concerns or negative societal impacts. All datasets used are public, and we have blurred all human faces in the qualitative visualizations.

References

[1] Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1):1–18, 2015.
[2] Michael Arbel, Liang Zhou, and Arthur Gretton. Generalized energy based models. ICLR, 2021.
[3] Abdourrahmane M Atto, Dominique Pastor, and Gregoire Mercier. Smooth sigmoid wavelet shrinkage for non-parametric estimation. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3265–3268. IEEE, 2008.
[4] Yuequan Bao, Zhiyi Tang, Hui Li, and Yufeng Zhang. Computer vision and deep learning based data anomaly detection method for structural health monitoring. Structural Health Monitoring, 18(2):401–421, 2019.
[5] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, and Carsten Steger. The MVTec anomaly detection dataset: a comprehensive real-world dataset for unsupervised anomaly detection. International Journal of Computer Vision, 129(4):1038–1059, 2021.
[6] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD: a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
[7] Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, and Carsten Steger. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv preprint arXiv:1807.02011, 2018.
[8] Richard J Bolton and David J Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–255.
[9] Louis Clouâtre and Marc Demers. FIGR: Few-shot image generation with Reptile. arXiv preprint arXiv:1901.02199, 2019.
[10] Niv Cohen and Yedid Hoshen.
Sub-image anomaly detection with deep pyramid correspondences. arXiv preprint arXiv:2005.02357, 2020.
[11] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 57(11):1413–1457, 2004.
[12] David Dehaene, Oriel Frigo, Sébastien Combrexelle, and Pierre Eline. Iterative energy-based projection on a normal data manifold for anomaly localization. ICLR, 2020.
[13] Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation with energy based models. NeurIPS, 33:6637–6647, 2020.
[14] Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts. NeurIPS, 34:15608–15620, 2021.
[15] Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy based models. International Conference on Machine Learning, 2021.
[16] Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Learning iterative reasoning through energy minimization. In ICML, pages 5570–5582. PMLR, 2022.
[17] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. NeurIPS.
[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
[19] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.
[20] Ergin Utku Genc, Nilesh Ahuja, Ibrahima J Ndiour, and Omesh Tickoo. Energy-based anomaly detection and localization. arXiv preprint arXiv:2105.03270, 2021.
[21] Spyros Gidaris and Nikos Komodakis. Generating classification weights with GNN denoising autoencoders for few-shot learning.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 21–30, 2019.
[22] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E Turner. Meta-learning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921, 2018.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[24] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[25] Yuling Jiao, Bangti Jin, and Xiliang Lu. Iterative soft/hard thresholding with homotopy continuation for sparse recovery. IEEE Signal Processing Letters, 24(6):784–788, 2017.
[26] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[27] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10551–10560, 2019.
[28] W. Liu, W. Luo, D. Lian, and S. Gao. Future frame prediction for anomaly detection: a new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[29] Wenqian Liu, Runze Li, Meng Zheng, Srikrishna Karanam, Ziyan Wu, Bir Bhanu, Richard J Radke, and Octavia Camps. Towards visually explaining variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8642–8651, 2020.
[30] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 FPS in MATLAB. In Proceedings of the IEEE International Conference on Computer Vision, pages 2720–2727, 2013.
[31] Yiwei Lu, K Mahesh Kumar, Seyed shahabeddin Nabavi, and Yang Wang. Future frame prediction using convolutional VRNN for anomaly detection.
In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, 2019.
[32] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. In European Conference on Computer Vision, pages 125–141. Springer, 2020.
[33] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked RNN framework. In Proceedings of the IEEE International Conference on Computer Vision, pages 341–349, 2017.
[34] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1975–1981. IEEE, 2010.
[35] Biswanath Mukherjee, L Todd Heberlein, and Karl N Levitt. Network intrusion detection. IEEE Network, 8(3):26–41, 1994.
[36] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.
[37] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[38] Jiquan Ngiam, Zhenghao Chen, Pang W Koh, and Andrew Y Ng. Learning deep energy models. In ICML.
[39] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1273–1283, 2019.
[40] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. In AAAI, 2020.
[41] Bo Pang, Tian Han, Erik Nijkamp, Song-Chun Zhu, and Ying Nian Wu. Learning latent space energy-based prior model. NeurIPS, 33:21994–22008, 2020.
[42] Afrooz Purarjomandlangrudi, Amir Hossein Ghapanchi, and Mohammad Esmalifalak. A data mining approach for fault diagnosis: An application of anomaly detection algorithm.
Measurement, 55:343–352, 2014.
[43] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7229–7238, 2018.
[44] Yixuan Qiu, Lingsong Zhang, and Xiao Wang. Unbiased contrastive divergence algorithm for training energy-based latent variable models. In ICLR, 2019.
[45] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[46] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. International Conference on Learning Representations, 2016.
[47] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. International Conference on Learning Representations, 2019.
[48] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
[49] Vikash Sehwag, Mung Chiang, and Prateek Mittal. SSD: A unified framework for self-supervised outlier detection. arXiv preprint arXiv:2103.12051, 2021.
[50] Shelly Sheynin, Sagie Benaim, and Lior Wolf. A hierarchical transformation-discriminating generative model for few shot anomaly detection. arXiv preprint arXiv:2104.14535, 2021.
[51] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 2017.
[52] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
[53] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning.
Advances in Neural Information Processing Systems, 2016.
[54] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. Citeseer, 2011.
[55] Jhih-Ciang Wu, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Learning unsupervised metaformer for anomaly detection. In International Conference on Computer Vision, pages 4369–4378, 2021.
[56] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. ICLR, 2020.
[57] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative ConvNet. In International Conference on Machine Learning, pages 2635–2644. PMLR, 2016.
[58] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference, pages 187–196, 2018.
[59] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7343–7353, 2018.
[60] Qiang Zhang, Aldo Lipani, Shangsong Liang, and Emine Yilmaz. Reply-aided detection of misinformation via Bayesian deep learning. In The World Wide Web Conference, pages 2333–2343, 2019.
[61] Bin Zhao, Li Fei-Fei, and Eric P Xing. Online detection of unusual events in videos via dynamic sparse coding. In CVPR 2011, pages 3313–3320. IEEE, 2011.
[62] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
[63] Yang Zhao, Jianwen Xie, and Ping Li. Learning energy-based generative models via coarse-to-fine expanding and sampling.
In International Conference on Learning Representations, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [No]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable?
   [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]