# Anomaly Attribution with Likelihood Compensation

Tsuyoshi Idé, Amit Dhurandhar, Jiří Navrátil, Moninder Singh, Naoki Abe
IBM Research, T. J. Watson Research Center
{tide, adhuran, jiri, moninder, nabe}@us.ibm.com

Abstract

This paper addresses the task of explaining anomalous predictions of a black-box regression model. When using a black-box model, such as one to predict building energy consumption from many sensor measurements, we often have a situation where some observed samples significantly deviate from their predictions. This may be due to a sub-optimal black-box model, or simply because those samples are outliers. In either case, one would ideally want to compute a responsibility score indicative of the extent to which an input variable is responsible for the anomalous output. In this work, we formalize this task as a statistical inverse problem: given the model's deviation from the expected value, infer the responsibility score of each input variable. We propose a new method called likelihood compensation (LC), which is founded on the likelihood principle and computes a correction to each input variable. To the best of our knowledge, this is the first principled framework that computes a responsibility score for real-valued anomalous model deviations. We apply our approach to a real-world building energy prediction task and confirm its utility based on expert feedback.

1 Introduction

With the rapid development of Internet-of-Things technologies, anomaly detection has played a critical role in modern industrial applications of artificial intelligence (AI). One of the recent technology trends is to create a digital twin using a highly flexible machine learning model, typically a deep neural network, for monitoring the health of the production system (Tao et al. 2018). However, the more representational power the model has, the more difficult it is to understand its behavior. In particular, explaining deviations between predictions and true/expected measurements is one of the main pain points. A large deviation from the truth may be due to sub-optimal model training, or simply because the observed samples are outliers. If the model is black-box and the training dataset is not available, it is hard to determine which of these two situations has occurred. Nonetheless, we would still want to provide information to help end-users in their decision making. As such, in this paper we propose a method that can compute a responsibility score for each variable of a given input. We refer to this task as anomaly attribution. Specifically, we are concerned with model-agnostic anomaly attribution for black-box regression models, where we want to explain the deviation between the model's prediction and the true/expected output in as concise a manner as possible.

As a concrete example, consider the scenario of monitoring building energy consumption as the target variable y (see Section 5 for details). The input to the model is a multivariate sensor measurement x that is typically real-valued and noisy. Under the assumption that the model is black-box and the training data are not available, our goal is to compute a numerical score for each of the input variables, quantifying the extent to which they are responsible for the judgment that a given test sample is anomalous.
To date, anomaly attribution has typically been studied as a subtask of anomaly detection. For instance, in subspace-based anomaly detection, computing each variable's responsibility has been part of the standard procedure for years (Chandola, Banerjee, and Kumar 2009). However, there is little work on how anomaly attribution can be done when the model is black-box and the training data set is not available. In the XAI (explainable AI) community, on the other hand, growing attention has been paid to post-hoc explanations of black-box prediction models. Examples of the techniques include feature subset selection, feature importance scoring, and sample importance scoring (Costabello et al. 2019; Molnar 2019; Samek et al. 2019). For anomaly attribution, there are at least two post-hoc explanation approaches that are potentially applicable: (1) those based on the expected conditional deviation, best known as the Shapley value, which was first introduced to the machine learning community by Štrumbelj and Kononenko (2010), and (2) those based on local linear models, best known under the name of LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro, Singh, and Guestrin 2016). In spite of their popularity, these two approaches may not be directly useful for anomaly attribution: given a test sample $(\boldsymbol{x}^t, y^t)$, these methods may explain what the value of $f(\boldsymbol{x}^t)$ itself can be attributed to. However, what is more relevant is explaining the deviation of $f(\boldsymbol{x}^t)$ from the actual $y^t$.

To address this limitation, we propose likelihood compensation (LC), a new local anomaly attribution approach for black-box regression models. We formalize the task of anomaly attribution as a statistical inverse problem that infers a perturbation to the test input $\boldsymbol{x}^t$ from the deviation $y^t - f(\boldsymbol{x}^t)$, conversely to the forward problem that computes the deviation (or its variants) from $(\boldsymbol{x}^t, y^t)$. As illustrated in Fig. 1, LC can be viewed intuitively as the deviation measured horizontally if certain conditions are met. This admits direct interpretation, as it suggests an action that might be taken to bring the outlying sample back to normalcy. Importantly, LC does not use any problem-specific assumptions but is built upon the maximum likelihood principle, the basic principle in statistical machine learning. To the best of our knowledge, this is the first principled framework for model-agnostic anomaly attribution in the regression setting.

[Figure 1: Illustration of the likelihood compensation along the i-th axis ($\delta_i$). For a given test sample $(y^t, \boldsymbol{x}^t)$, LC seeks a perturbation that achieves the best possible fit with the black-box regression model $y = f(\boldsymbol{x})$, which could be non-smooth (the red curves). For more details, please refer to Section 4.]

2 Related Work

Although the machine learning community had not paid much attention to explainability of AI (XAI) in the black-box setting until recently, the last few years have seen a surge of interest in XAI research. For general background, Gade et al. (2020) provides a useful summary of major research issues in XAI for industry. In the more specific context of industrial anomaly detection, Langone et al. (2020) and Amarasinghe et al. (2018) give useful summaries of the practical requirements of XAI in the deep learning era. An extensive survey of various XAI methodologies is given in (Costabello et al. 2019; Molnar 2019; Samek et al. 2019).
To date, most model-agnostic post-hoc explanation studies have been designed for classification, often restricted to image classification. As discussed above, two approaches are potentially applicable to the task of anomaly attribution in the black-box regression setting, namely the Shapley value (SV) (Štrumbelj and Kononenko 2010, 2014; Casalicchio, Molnar, and Bischl 2018) and LIME (Ribeiro, Singh, and Guestrin 2016, 2018). The relationship between the two was discussed by Lundberg and Lee (2017), assuming binary inputs. In the context of anomaly detection from noisy, real-valued data, two recent studies, Zhang et al. (2019) and Giurgiu and Schumann (2019), proposed methods built on LIME and SV, respectively. While these belong to the earliest model-agnostic XAI studies for anomaly detection, they naturally inherit the limitations of the existing approaches mentioned in the introduction. Recently, Lucic et al. (2020) proposed a LIME-based approach for identifying a variable-wise normal range, which, although related, is different from our formulation of the anomaly attribution problem. Zemicheal and Dietterich (2019) proposed an SV-like feature scoring method in the context of outlier detection; however, it does not apply to the regression setting.

One of the main contributions of this work is the proposal of a generic XAI framework for input attribution built upon the likelihood principle. The method of integrated gradients (Sundararajan, Taly, and Yan 2017) is another generic input attribution approach applicable to the black-box setting. Sipple (2020) recently applied it to anomaly detection and explanation. However, it is applicable only to the classification setting and, as pointed out by Sipple (2020), the need for a baseline input makes it less useful in practice. Layer-wise relevance propagation (Bach et al. 2015) is another well-known input attribution method and has been applied to real-world anomaly attribution tasks (Amarasinghe, Kenney, and Manic 2018). However, it is deep-learning-specific and assumes white-box access to the model.

Another research thread relevant to our work revolves around the counterfactual approach, which focuses on what is missing in the model (or training data) rather than what exists. In the context of image classification, the idea of counterfactuals is naturally translated into perturbation-based explanation (Fong and Vedaldi 2017). Recent works (Dhurandhar et al. 2018; Wachter, Mittelstadt, and Russell 2017) proposed the idea of contrastive explanation, which attempts to find a perturbation best characterizing a classification instance such that the probability of choosing a different class supersedes that of the original prediction. Our approach is similar in spirit but, as mentioned above, is designed for regression and uses a very different objective function. Moreover, both of these methods (Dhurandhar et al. 2018; Wachter, Mittelstadt, and Russell 2017) require white-box access, while ours is a black-box approach.

3 Problem Setting

As mentioned before, we focus on the task of anomaly attribution in the regression setting rather than classification or unsupervised settings. Throughout the paper, the input variable $\boldsymbol{x}$ is assumed to be noisy, multivariate, and real-valued in general. Our task is formally stated as follows:

Definition 1 (Anomaly detection and attribution).
Given a black-box regression model $y = f(\boldsymbol{x})$ and a test data set $\mathcal{D}_{\mathrm{test}}$: (1) compute a score representing the degree of anomaly of the prediction on $\mathcal{D}_{\mathrm{test}}$; (2) compute the responsibility score of each input variable for the prediction being anomalous.

The black-box regression model is assumed to be deterministic with $y \in \mathbb{R}$ and $\boldsymbol{x} \in \mathbb{R}^M$, where $M \geq 2$ is the dimensionality of the input random variable $\boldsymbol{x}$. The functional form of $f(\cdot)$ and its dependency on model parameters are not available to us. The training data on which the model was trained are not available either. The only interface to the model we are given is $\boldsymbol{x}$, which follows an unknown distribution $P(\boldsymbol{x})$. Queries to obtain the response $f(\boldsymbol{x})$ can be performed cheaply at any $\boldsymbol{x}$. The test data set is denoted as $\mathcal{D}_{\mathrm{test}} = \{(\boldsymbol{x}^t, y^t) \mid t = 1, \ldots, N_{\mathrm{test}}\}$, where $t$ is the index of the $t$-th test sample and $N_{\mathrm{test}}$ is the number of test samples. When $N_{\mathrm{test}} = 1$, the task may be called outlier detection and attribution.

Anomaly detection as forward problem

Assume that, from the deterministic regression model, we can somehow obtain $p(y \mid \boldsymbol{x})$, a probability density over $y$ given the input signal $\boldsymbol{x}$ (see Sec. 4.2 for a proposed approach to do this). The standard approach to quantifying the degree of anomaly is to use the negative log-likelihood of the test samples. Under the i.i.d. assumption, it can be written as

$$a(\mathcal{D}_{\mathrm{test}}) = -\frac{1}{N_{\mathrm{test}}} \sum_{t \in \mathcal{D}_{\mathrm{test}}} \ln p(y^t \mid \boldsymbol{x}^t), \qquad (1)$$

which is called the anomaly score for $\mathcal{D}_{\mathrm{test}}$ (or the outlier score for a single-sample dataset). An anomaly is declared when $a(\mathcal{D}_{\mathrm{test}})$ exceeds a predefined threshold.

Anomaly attribution as inverse problem

The above anomaly/outlier detection formulation is standard. However, anomaly/outlier attribution is more challenging when the underlying model is black-box. This is in some sense an inverse problem: the function $f(\boldsymbol{x})$ readily gives an estimate of $y$ from $\boldsymbol{x}$, but, in general, there is no obvious way to do the reverse in the multivariate case. When an estimate $f(\boldsymbol{x}^t)$ looks bad in light of an observed $y^t$, what can we say about the contribution, or responsibility, of the input variables? Section 4 below gives our proposed answer to this question.

Notation

We use boldface to denote vectors. The $i$-th dimension of a vector $\boldsymbol{\delta}$ is denoted by $\delta_i$. The $\ell_1$ and $\ell_2$ norms of a vector are denoted by $\|\cdot\|_1$ and $\|\cdot\|_2$, respectively, and are defined as $\|\boldsymbol{\delta}\|_1 \triangleq \sum_i |\delta_i|$ and $\|\boldsymbol{\delta}\|_2^2 \triangleq \sum_i \delta_i^2$. The sign function $\mathrm{sign}(\delta_i)$ is defined as being $+1$ for $\delta_i > 0$ and $-1$ for $\delta_i < 0$; for $\delta_i = 0$, the function takes a value in $[-1, 1]$. For a vector input, the definition applies element-wise, giving a vector of the same size as the input. We distinguish between a random variable and its realizations with a superscript. For notational simplicity, we symbolically use $p(\cdot)$ to represent different probability distributions whenever there is no confusion. For instance, $p(\boldsymbol{x})$ is used to represent the probability density of a random variable $\boldsymbol{x}$, while $p(y \mid \boldsymbol{x})$ is a different distribution of another random variable $y$ conditioned on $\boldsymbol{x}$. The Gaussian distribution of a random variable $y$ is denoted by $\mathcal{N}(y \mid \mu, \sigma^2)$, where the first and second arguments after the bar are the mean and the variance, respectively. The multivariate Gaussian distribution is defined in a similar way.
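To make the forward problem concrete, the following is a minimal Python sketch (not the authors' code) of the anomaly score of Eq. (1) under a Gaussian observation model $p(y \mid \boldsymbol{x}) = \mathcal{N}(y \mid f(\boldsymbol{x}), \sigma_t^2)$. The function names are illustrative, and the per-sample variances `sigma2` are assumed to be given here; Section 4.2 describes how they can be estimated.

```python
import numpy as np

def gaussian_outlier_scores(f, X, y, sigma2):
    """Per-sample negative log-likelihoods under p(y | x) = N(y | f(x), sigma2)."""
    resid = y - np.array([f(x) for x in X])                 # y^t - f(x^t)
    return 0.5 * (np.log(2.0 * np.pi * sigma2) + resid**2 / sigma2)

def anomaly_score(f, X, y, sigma2):
    """Anomaly score of Eq. (1): average negative log-likelihood over the test set."""
    return float(np.mean(gaussian_outlier_scores(f, X, y, sigma2)))
```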
4 The Method of Likelihood Compensation

This section presents the key idea of likelihood compensation, as illustrated in Fig. 1. We start with a likelihood-based interpretation of LIME to highlight the idea.

4.1 Improving Likelihood via Corrected Input

For a given test sample $(y^t, \boldsymbol{x}^t)$, LIME minimizes a lasso objective to let the sparse regression estimation process select a subset of the variables. From a Bayesian perspective, it can be rewritten as a MAP (maximum a posteriori) problem:

$$\text{LIME:}\quad \max_{\boldsymbol{\beta}} \; \Big\langle \ln\{ p(y \mid \boldsymbol{x}, \boldsymbol{\beta})\, p(\boldsymbol{\beta}) \} \Big\rangle_{\mathrm{vic}(\boldsymbol{x}^t)} \quad \text{subject to } y = f(\boldsymbol{x}), \qquad (2)$$

where $\langle \cdot \rangle_{\mathrm{vic}(\boldsymbol{x}^t)}$ denotes the expectation over random samples generated from an assumed local distribution in the vicinity of $\boldsymbol{x}^t$. For the $p$'s above, LIME uses the Gaussian observation model $p(y \mid \boldsymbol{x}, \boldsymbol{\beta}) = \mathcal{N}(y \mid \beta_0 + \boldsymbol{\beta}^\top \boldsymbol{x}, \sigma^2)$ and the Laplace prior $p(\boldsymbol{\beta}) \propto \exp(-\nu \|\boldsymbol{\beta}\|_1)$, where $\sigma^2$ and $\nu$ are hyperparameters. The regression coefficient vector $\boldsymbol{\beta}$, as well as the intercept $\beta_0$, captures the local linear structure of $f$ and is interpreted as the sensitivity of $f$ at $\boldsymbol{x}^t$.

From the viewpoint of actionability, however, the slope can be less useful than $\boldsymbol{x}$ itself, particularly for the purpose of outlier attribution. If $(\boldsymbol{x}^t, y^t)$ is an outlier far from the population, how can we expect to obtain actionable insights from the local slope? Another issue is that $y^t$ plays no role in this formulation. Notice the constraint of the maximization: LIME amounts to assuming that the model is always right and is not sensitive to the question of whether $(y^t, \boldsymbol{x}^t)$ is an outlier or not.

Keeping this in mind, we propose to introduce a directly interpretable parameter $\boldsymbol{\delta}$ as a correction term to $\boldsymbol{x}$, rather than the slope as in LIME:

$$\text{Proposed:}\quad \max_{\boldsymbol{\delta}} \; \big[ \ln\{ p(y^t \mid f(\boldsymbol{x}^t + \boldsymbol{\delta}))\, p(\boldsymbol{\delta}) \} \big], \qquad (3)$$

$$p(y \mid f(\boldsymbol{x} + \boldsymbol{\delta})) = \mathcal{N}\big(y \mid f(\boldsymbol{x} + \boldsymbol{\delta}), \sigma^2(\boldsymbol{x})\big). \qquad (4)$$

The prior $p(\boldsymbol{\delta})$ can be designed to reflect problem-specific constraints such as infeasible regions, so that the resultant $\boldsymbol{x} + \boldsymbol{\delta}$ is a realistic or high-probability input. Considering the well-known issue of the lasso that, in the presence of multiple correlated explanatory variables, it tends to pick one at random (Roy, Chakraborty et al. 2017), we employ the elastic-net prior $p(\boldsymbol{\delta}) \propto \exp\big(-\tfrac{1}{2}\lambda \|\boldsymbol{\delta}\|_2^2 - \nu \|\boldsymbol{\delta}\|_1\big)$. Here $\sigma^2(\boldsymbol{x})$ is the local variance representing the uncertainty of prediction (see Sec. 4.2), and $\lambda, \nu$ are hyperparameters controlling the sparsity and the overall scale of $\boldsymbol{\delta}$ (see Sec. 4.4 for typical values). We call $\boldsymbol{\delta}$ the likelihood compensation (LC), as it compensates for the loss in likelihood incurred by an anomalous prediction. Note that, unlike LIME, our explainability model is neither linear nor additive, and is thus free from the masking effect (Hastie, Tibshirani, and Friedman 2009) observed in linear XAI models.

We can naturally extend the point-wise definition of Eq. (3) to a collection of test samples. For the Gaussian observation model and the elastic-net prior, we have the following optimization problem for the LC for $\mathcal{D}_{\mathrm{test}}$:

$$\min_{\boldsymbol{\delta}} \left\{ \frac{1}{N_{\mathrm{test}}} \sum_{t=1}^{N_{\mathrm{test}}} \frac{1}{2\sigma_t^2} \big[ y^t - f(\boldsymbol{x}^t + \boldsymbol{\delta}) \big]^2 + \frac{\lambda}{2} \|\boldsymbol{\delta}\|_2^2 + \nu \|\boldsymbol{\delta}\|_1 \right\}, \qquad (5)$$

where $\sigma_t^2$ is the local variance evaluated at $\boldsymbol{x}^t$. This is the main problem studied in this paper.

4.2 Deriving Probabilistic Prediction Model

So far we have assumed the predictive distribution $p(y \mid \boldsymbol{x})$ is given. Now let us think about how to derive it from the deterministic black-box regression model $y = f(\boldsymbol{x})$. If there are too few test samples, we have no choice but to set $\sigma_t^2$ to a constant using prior knowledge. Otherwise, we can obtain an estimate of $\sigma_t^2$ using a subset of $\mathcal{D}_{\mathrm{test}}$ in a cross-validation (CV)-like fashion. Let $\mathcal{D}^t_{\mathrm{ho}} = \{(\boldsymbol{x}^{(n)}, y^{(n)}) \mid n = 1, \ldots, N_{\mathrm{ho}}\} \subset \mathcal{D}_{\mathrm{test}}$ be a held-out data set that does not include the given test sample $(\boldsymbol{x}^t, y^t)$. For the observation model of Eq. (4) and the test sample $\boldsymbol{x}^t$, we consider a locally weighted version of the maximum likelihood:

$$\max_{\sigma^2} \; \sum_{n=1}^{N_{\mathrm{ho}}} w_n(\boldsymbol{x}^t) \left\{ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{\big(y^{(n)} - f(\boldsymbol{x}^{(n)})\big)^2}{2\sigma^2} \right\}, \qquad (6)$$

where $w_n(\boldsymbol{x}^t)$ is the similarity between $\boldsymbol{x}^t$ and $\boldsymbol{x}^{(n)}$. A reasonable choice for $w_n$ is the Gaussian kernel:

$$w_n(\boldsymbol{x}^t) = \mathcal{N}\big(\boldsymbol{x}^{(n)} \mid \boldsymbol{x}^t, \mathrm{diag}(\boldsymbol{\eta})\big), \qquad (7)$$

where $\mathrm{diag}(\boldsymbol{\eta})$ is a diagonal matrix whose $i$-th diagonal entry is $\eta_i$, which can be of the same order as the sample variance of $x_i$ evaluated on $\mathcal{D}_{\mathrm{ho}}$. The maximizer of Eq. (6) can be found by differentiating with respect to $\sigma^{-2}$. The solution is given by

$$\sigma_t^2 = \frac{1}{\sum_m w_m(\boldsymbol{x}^t)} \sum_{n=1}^{N_{\mathrm{ho}}} w_n(\boldsymbol{x}^t) \big[ y^{(n)} - f(\boldsymbol{x}^{(n)}) \big]^2. \qquad (8)$$

This has to be computed for each $\boldsymbol{x}^t \in \mathcal{D}_{\mathrm{test}}$.
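A minimal sketch of this local variance estimate (Eqs. (7)-(8)) follows; it is an illustration under simplifying assumptions, not the authors' implementation, and the function and argument names are our own.

```python
import numpy as np

def local_variance(f, x_t, X_ho, y_ho, eta):
    """Locally weighted residual variance sigma_t^2 of Eq. (8) at a test point x_t.

    X_ho, y_ho: held-out samples of Eq. (6); eta: per-dimension kernel variances of Eq. (7).
    """
    resid2 = (y_ho - np.array([f(x) for x in X_ho])) ** 2        # [y^(n) - f(x^(n))]^2
    # Gaussian kernel weights w_n(x_t); the normalization constant cancels in Eq. (8).
    w = np.exp(-0.5 * np.sum((X_ho - x_t) ** 2 / eta, axis=1))
    return float(np.sum(w * resid2) / np.sum(w))
```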
4.3 Solving the Optimization Problem

Although seemingly simple, solving the optimization problem (5) is generally challenging. Due to the black-box nature of $f$, we do not have access to the parametric form of $f$, let alone its gradient. In addition, as is the case with deep neural networks, $f$ can be non-smooth (see the red curves in Fig. 1), which makes numerical estimation of the gradient tricky.

To derive an optimization algorithm, we first note that there are two origins of non-smoothness in the objective function in (5). One is inherent to $f$, while the other is due to the added $\ell_1$ penalty. To separate them, let us denote the objective function in Eq. (5) as $J(\boldsymbol{\delta}) + \nu\|\boldsymbol{\delta}\|_1$, where $J$ contains the first and second terms. Since we are interested only in a local solution in the vicinity of $\boldsymbol{\delta} = \boldsymbol{0}$, it is natural to adopt an iterative update algorithm starting from $\boldsymbol{\delta} \approx \boldsymbol{0}$. Suppose that we have an estimate $\boldsymbol{\delta} = \boldsymbol{\delta}_{\mathrm{old}}$ that we wish to update. If we have a reasonable approximation of the gradient in its vicinity, denoted by $\widehat{\nabla J(\boldsymbol{\delta}_{\mathrm{old}})}$, the next estimate can be found by

$$\boldsymbol{\delta}_{\mathrm{new}} = \arg\min_{\boldsymbol{\delta}} \left\{ J(\boldsymbol{\delta}_{\mathrm{old}}) + (\boldsymbol{\delta} - \boldsymbol{\delta}_{\mathrm{old}})^\top \widehat{\nabla J(\boldsymbol{\delta}_{\mathrm{old}})} + \frac{1}{2\kappa}\|\boldsymbol{\delta} - \boldsymbol{\delta}_{\mathrm{old}}\|_2^2 + \nu\|\boldsymbol{\delta}\|_1 \right\} \qquad (9)$$

in the same spirit as the proximal gradient method (Parikh, Boyd et al. 2014), where $\kappa$ is a hyperparameter representing the learning rate. Notice that the first three terms in the curly brackets correspond to a second-order approximation of $J(\boldsymbol{\delta})$ in the vicinity of $\boldsymbol{\delta}_{\mathrm{old}}$. We find the best estimate under this approximation.

The r.h.s. has an analytic solution. Define $\boldsymbol{\phi} \triangleq \boldsymbol{\delta}_{\mathrm{old}} - \kappa \widehat{\nabla J(\boldsymbol{\delta}_{\mathrm{old}})}$. The optimality condition is $\boldsymbol{\delta} - \boldsymbol{\phi} + \kappa\nu\,\mathrm{sign}(\boldsymbol{\delta}) = \boldsymbol{0}$. If $\phi_i > \kappa\nu$ holds for the $i$-th dimension, then, by $\phi_i - \kappa\nu > 0$, we have $\delta_i = \phi_i - \kappa\nu\,\mathrm{sign}(\delta_i) = \phi_i - \kappa\nu$. Similar arguments straightforwardly verify the following solution:

$$\delta_i = \begin{cases} \phi_i - \kappa\nu, & \phi_i > \kappa\nu \\ 0, & |\phi_i| \leq \kappa\nu \\ \phi_i + \kappa\nu, & \phi_i < -\kappa\nu \end{cases} \qquad (10)$$

Performing the differentiation, we see that $\boldsymbol{\phi}$ is given by

$$\boldsymbol{\phi} = (1 - \kappa\lambda)\,\boldsymbol{\delta}_{\mathrm{old}} + \frac{\kappa}{N_{\mathrm{test}}} \sum_{t=1}^{N_{\mathrm{test}}} \frac{y^t - f(\boldsymbol{x}^t + \boldsymbol{\delta}_{\mathrm{old}})}{\sigma_t^2}\; \widehat{\frac{\partial f(\boldsymbol{x}^t + \boldsymbol{\delta})}{\partial \boldsymbol{\delta}}}\Bigg|_{\boldsymbol{\delta} = \boldsymbol{\delta}_{\mathrm{old}}}. \qquad (11)$$

Note that $f(\boldsymbol{x}^t + \boldsymbol{\delta})$ is readily available at any $\boldsymbol{\delta}$ without approximation.

Here we provide some intuition behind the updating equation (11). Convergence is achieved when either the deviation $y^t - f$ or the gradient $\partial f / \partial \boldsymbol{\delta}$ vanishes at $\boldsymbol{x}^t + \boldsymbol{\delta}$. The former and the latter correspond, respectively, to the situations illustrated in Fig. 1 (a) and (b). As shown in the figure, $\delta_i$ corresponds to the horizontal deviation along the $x_i$ axis between the test sample and the regression function. If there is no horizontal intersection with the regression surface, the algorithm seeks the zero-gradient point based on a smooth surrogate of the gradient.

To find $\widehat{\partial f / \partial \boldsymbol{\delta}}$, a smooth surrogate of the gradient, we propose a simple sampling-based procedure. Specifically, we draw $N_s$ samples from a local distribution at $\boldsymbol{x}^t + \boldsymbol{\delta}$ as

$$\boldsymbol{x}^{[m]} \sim \mathcal{N}\big(\cdot \mid \boldsymbol{x}^t + \boldsymbol{\delta}, \mathrm{diag}(\boldsymbol{\eta})\big), \qquad (12)$$

and fit a linear regression model $f \approx \beta_0 + \boldsymbol{\beta}^\top \boldsymbol{x}$ on the populated local data set $\{(\boldsymbol{x}^{[m]}, f^{[m]}) \mid m = 1, \ldots, N_s\}$, where $f^{[m]} = f(\boldsymbol{x}^{[m]})$. Solving the least squares problem, we have

$$\widehat{\frac{\partial f}{\partial \boldsymbol{\delta}}} = \boldsymbol{\beta} = \big[ \Psi_s \Psi_s^\top + \epsilon I_M \big]^{-1} \Psi_s \boldsymbol{f}_s, \qquad (13)$$

where $\epsilon > 0$ is a small positive constant added to the diagonal for numerical stability. In Eq. (13), we have defined $\boldsymbol{f}_s \triangleq [f^{[1]} - \bar{f}, \ldots, f^{[N_s]} - \bar{f}]^\top$ and $\Psi_s \triangleq [\boldsymbol{x}^{[1]} - \bar{\boldsymbol{x}}, \ldots, \boldsymbol{x}^{[N_s]} - \bar{\boldsymbol{x}}]$. As usual, the population means are defined as $\bar{f} \triangleq \frac{1}{N_s}\sum_m f^{[m]}$ and $\bar{\boldsymbol{x}} \triangleq \frac{1}{N_s}\sum_m \boldsymbol{x}^{[m]}$.
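The two building blocks of the update, namely the sampling-based gradient surrogate of Eqs. (12)-(13) and the soft-thresholding step of Eq. (10), can be sketched as follows. This is a non-authoritative Python sketch with our own function names; `eta`, `n_samples`, and `eps` play the roles of $\boldsymbol{\eta}$, $N_s$, and $\epsilon$.

```python
import numpy as np

def local_gradient(f, x_center, eta, n_samples=1000, eps=1e-6, rng=None):
    """Smooth gradient surrogate of Eqs. (12)-(13): ridge-regularized local linear fit."""
    rng = np.random.default_rng() if rng is None else rng
    M = x_center.shape[0]
    X = x_center + rng.normal(scale=np.sqrt(eta), size=(n_samples, M))   # Eq. (12)
    fvals = np.array([f(x) for x in X])
    Psi = (X - X.mean(axis=0)).T                                         # M x N_s
    fs = fvals - fvals.mean()                                            # length N_s
    return np.linalg.solve(Psi @ Psi.T + eps * np.eye(M), Psi @ fs)      # Eq. (13)

def soft_threshold(phi, thr):
    """Element-wise soft-thresholding of Eq. (10) with threshold thr = kappa * nu."""
    return np.sign(phi) * np.maximum(np.abs(phi) - thr, 0.0)
```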
4.4 Algorithm Summary

Algorithm 1 summarizes the iterative procedure for finding $\boldsymbol{\delta}$. The most important parameter is the $\ell_1$ regularization strength $\nu$, which has to be hand-tuned depending on the business requirements of the application of interest. On the other hand, the $\ell_2$ strength $\lambda$ controls the overall scale of $\boldsymbol{\delta}$ and can be fixed to some value between 0 and 1. In our experiments, it was adjusted so that the scale of $\boldsymbol{\delta}$ is on the same order as LIME's output, for consistency. It is generally recommended to rescale the input variables to have zero mean and unit variance before starting the iteration (assuming $N_{\mathrm{test}} \gg 1$), and to retrieve the scale factors after convergence. For the learning rate $\kappa$, in our experiments, we fixed $\kappa = 0.1$ and shrank it geometrically by a factor of 0.98 in every iteration. In addition to the parameters listed in Algorithm 1, the sampling-based gradient estimation of Eq. (13) requires two minor parameters, $N_s$ and $\boldsymbol{\eta}$. In the experiments, we fixed $N_s = 1000$ following (Ribeiro, Singh, and Guestrin 2016) and $\eta_i = 1$ for all $i$ after standardization. The same $\boldsymbol{\eta}$ was used for Eq. (7).

Algorithm 1: Likelihood Compensation
Input: $f(\boldsymbol{x})$, $\mathcal{D}_{\mathrm{test}}$. Parameters: $\lambda, \nu, \kappa$.
1: for all $\boldsymbol{x}^t \in \mathcal{D}_{\mathrm{test}}$ do
2:   Compute $\sigma_t^2$ with Eq. (8).
3: end for
4: Randomly initialize $\boldsymbol{\delta} \approx \boldsymbol{0}$.
5: repeat
6:   Set $\boldsymbol{g} = \boldsymbol{0}$.
7:   for all $\boldsymbol{x}^t \in \mathcal{D}_{\mathrm{test}}$ do
8:     Compute $\boldsymbol{\beta}$ with Eq. (13).
9:     Update $\boldsymbol{g} \leftarrow \boldsymbol{g} + \boldsymbol{\beta}\, [y^t - f(\boldsymbol{x}^t + \boldsymbol{\delta})] / (N_{\mathrm{test}}\,\sigma_t^2)$.
10:  end for
11:  $\boldsymbol{\phi} = (1 - \kappa\lambda)\boldsymbol{\delta} + \kappa\boldsymbol{g}$.
12:  Find $\boldsymbol{\delta}$ with Eq. (10).
13: until convergence
14: return $\boldsymbol{\delta}$
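As a complement to the pseudocode, here is a compact end-to-end Python sketch of Algorithm 1 under simplifying assumptions: standardized inputs, a fixed number of iterations in place of a convergence test, and the helper functions `local_variance`, `local_gradient`, and `soft_threshold` sketched above. It is meant as an illustration of the procedure, not the authors' implementation; all names and default values are ours.

```python
import numpy as np

def likelihood_compensation(f, X_test, y_test, lam=0.5, nu=0.1, kappa=0.1,
                            eta=None, sigma2=None, n_iter=100, shrink=0.98, rng=None):
    """Sketch of Algorithm 1: LC vector delta for a test set (X_test, y_test)."""
    rng = np.random.default_rng() if rng is None else rng
    N_test, M = X_test.shape
    eta = np.ones(M) if eta is None else eta

    if sigma2 is None:
        # Lines 1-3: local variance at each test point, leaving that point out (Eq. (8)).
        sigma2 = np.empty(N_test)
        for t in range(N_test):
            mask = np.arange(N_test) != t
            sigma2[t] = local_variance(f, X_test[t], X_test[mask], y_test[mask], eta)
    else:
        sigma2 = np.full(N_test, float(sigma2))   # user-supplied constant variance

    delta = 0.01 * rng.standard_normal(M)         # Line 4: small random initialization

    for _ in range(n_iter):                       # Lines 5-13
        g = np.zeros(M)                           # Line 6
        for t in range(N_test):                   # Lines 7-10
            beta = local_gradient(f, X_test[t] + delta, eta, rng=rng)        # Eq. (13)
            g += beta * (y_test[t] - f(X_test[t] + delta)) / (N_test * sigma2[t])
        phi = (1.0 - kappa * lam) * delta + kappa * g                        # Line 11
        delta = soft_threshold(phi, kappa * nu)                              # Line 12
        kappa *= shrink                           # geometric learning-rate decay (Sec. 4.4)
    return delta
```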
5 Experiments

We now describe our experimental design and the baselines we compare against in the empirical studies that follow.

Evaluation strategy

Explainability of AI is generally evaluated from three major perspectives (Costabello et al. 2019): decomposability, simulatability, and algorithmic transparency. In post-hoc explanations of black-box models, decomposability and simulatability are most important. We thus design our experiments to answer the following questions: a) whether LC can provide extra information on specific anomalous samples beyond the baseline methods (decomposability), b) whether LC can robustly compute the responsibility score under heavy noise (simulatability), and c) whether LC can provide actionable insights in a real-world business scenario (simulatability). Regarding the third question, we validated our approach with feedback from domain experts, as opposed to crowd-sourced studies with lay users. In industrial applications, the end-user's interests can be highly specific to particular business needs, and the system's inner workings tend to be difficult for non-experts to understand and simulate.

Baselines

We compare LC with three possible alternatives: (1) the Z-score and extended versions of (2) Shapley values (SV) and (3) LIME. The Z-score is the standard univariate outlier detection method in the unsupervised setting; the Z-score of $x^t_i$ is defined as $(x^t_i - m_i)/\sigma_i$, where $m_i, \sigma_i$ are the mean and the standard deviation of $x_i$ in $\mathcal{D}_{\mathrm{test}}$, respectively. Shapley values (SV) and LIME are used as a proxy for the prior works (Zhang et al. 2019; Giurgiu and Schumann 2019; Lucic, Haned, and de Rijke 2020), which used SV or LIME in tasks similar to ours. For a fair comparison, we extended these methods to be applied to the deviation $f - y$ instead of $f$ itself, and name them SV+ and LIME+, respectively. We dropped SV+ in the building energy experiment, as the training data were not available to compute the null/base values for each variable that SV requires. Note that contrastive and counterfactual methods such as (Dhurandhar et al. 2018; Wachter, Mittelstadt, and Russell 2017) are not valid competitors here, as they require white-box access to the model and are predominantly used in classification settings.

Two-Dimensional Mexican Hat

One of the major features of LC is its capability to provide explanations relevant to specific anomalous samples. To illustrate this, we used the two-dimensional Mexican hat $f(\boldsymbol{x}) = (1 - \tfrac{1}{2}\|\boldsymbol{x}\|_2^2)\exp(-\tfrac{1}{2}\|\boldsymbol{x}\|_2^2)$ as the regression function, as shown in Fig. 2 (a). Suppose we have obtained a test sample at $\boldsymbol{x}^t = (1, 0)^\top$. By symmetry, LIME+ has only the $x_1$ component, which can be analytically calculated to be 0.29 at this $\boldsymbol{x}^t$ when $\nu \to 0^+$. Similarly, LC has only the $x_1$ component, and it is computed through iterative updates with the aid of an analytic expression of the gradient. For SV+, we used uniform sampling from $[-4, 4]^2$ to evaluate the expectations. Figure 2 (b) shows the calculated values of $\delta_1$ as a function of $y^t$ with $\nu = 0, \lambda = 0.01$.

[Figure 2: Mexican Hat: (a) The $x_2 = 0$ slice of $f(\boldsymbol{x})$. (b) Computed $\delta_1$ as a function of $y^t$.]

Figure 3 compares the Z-score, SV+, LIME+, and LC for two particular values of $y^t$, corresponding to the $f > y^t$ and $f < y^t$ cases. As shown, the Z-score, SV+, and LIME+ are not able to distinguish between the two cases, demonstrating their limited utility in anomaly attribution. In contrast, LC's value of $\delta_1$ corresponds to the horizontal distance between the test point and the curve of $f$, as shown in Fig. 2. Hence we can think of it as a measure of horizontal deviation, as illustrated earlier in Fig. 1.

[Figure 3: Mexican Hat: Comparison of the responsibility scores evaluated at $y^t = 0.2$ (upper) and $0$ (lower).]
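For illustration, the Mexican hat case could be exercised with the sketch given after Section 4.4 roughly as follows. This is a hypothetical call, not a reproduction of the experiment above (which used an analytic gradient rather than the sampled surrogate); a fixed `sigma2` stands in for the held-out estimate of Eq. (8), which is undefined when there is only a single test sample.

```python
import numpy as np

def mexican_hat(x):
    """Two-dimensional Mexican hat: f(x) = (1 - ||x||^2 / 2) * exp(-||x||^2 / 2)."""
    r2 = float(np.dot(x, x))
    return (1.0 - 0.5 * r2) * np.exp(-0.5 * r2)

# Single test sample at x^t = (1, 0)^T with y^t = 0.2, using nu = 0 and lambda = 0.01.
delta = likelihood_compensation(mexican_hat, np.array([[1.0, 0.0]]), np.array([0.2]),
                                lam=0.01, nu=0.0, sigma2=1.0)
print(delta)  # by symmetry, essentially only the x_1 component should be appreciable
```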
Boston Housing

Next we used the Boston Housing data (Belsley 1980) to test robustness to noise. The task is to predict the median home price (MEDV) of districts in Boston from $M = 13$ input variables, such as the percentage of lower-status population (LSTAT) and the average number of rooms (RM). As one might expect, the data are very noisy. As an illustrative example, Fig. 4 shows scatter plots between $y$ (MEDV) and two selected input variables (LSTAT, RM), which have the highest correlations with $y$.

[Figure 4: Boston Housing: Pairwise scatter plots between $y$ (MEDV) and two selected input variables (LSTAT, RM); samples #3 and #7 are highlighted.]

We held out 20% of the data as $\mathcal{D}_{\mathrm{test}}$ ($N_{\mathrm{test}} = 101$) and trained a random forest on the rest. Then we picked the top two outliers, highlighted as #3 and #7 in Fig. 4. These are the two samples with the highest outlier scores of Eq. (1), to which not only LSTAT and RM but also all the other variables contributed. Figure 5 compares the results of LC with the baselines for these outliers. For the $\ell_1$ parameter, we set $\nu = 0.1$ for LC and then chose $\nu = 0.005$ for LIME+, so that LIME+ has, on average, the same number of nonzero elements as LC. The $\ell_2$ parameter $\lambda$ was chosen as 0.5 for both LC and LIME+ to give approximately the same scale. For SV+, all $2^{M-1} M = 53248$ combinations were evaluated with the empirical distribution of the training samples (which are actually supposed to be unavailable in our setting), requiring about an hour per test sample on a laptop PC (Core i7-8850H), while LC required only several seconds.

[Figure 5: Boston Housing: Comparison of the responsibility scores for the top two outliers (#3 and #7).]

From the figure, we see that overall SV+, LIME+, and LC are consistent in the sense that most of the weight appears on a few common variables, including LSTAT. The Z-score behaves quite differently, reflecting the fact that it is agnostic to the y-x relationship. For these outliers, LC gave positive and negative scores for LSTAT and RM in Fig. 5, respectively. Checking the scatter plots in Fig. 4, we can confirm LC's characterization as the horizontal deviation: a positive (negative) score means a positive (negative) shift will give a better fit. In contrast, LIME+ simply indicates whether the local slope is positive or negative, independently of how the test samples deviate. In fact, one can mathematically show that LIME+ is invariant to the value of $y$, meaning that LIME cannot be a useful tool for instance-specific anomaly attribution. With SV+, the situation is more subtle. It does not allow simple interpretations like LC or LIME+. The sign of the scores unpredictably becomes negative or positive, probably due to complicated effects of higher-order correlations. This suggests SV's tendency to be unstable under noise. In fact, our bootstrap analysis (not included due to the page limitation) shows that the SV+ scores are vulnerable to noise: the top three variables with the highest absolute SV+ scores exhibited a 35.3% variability relative to the mean. In addition, SV+ needs training data or the true distribution of $\boldsymbol{x}$ for Monte Carlo evaluation; the Z-score, LIME+, and LC do not have such a requirement. Along with the prohibitive computational cost, these limitations make it impractical to apply SV+ to real-world system monitoring scenarios of the type presented below.

Real-World Application: Building Energy Management

Finally, we applied LC to a building administration task. Collaborating with an entity offering building management services and products, we obtained energy consumption data for an office building in India. The total wattage is predicted by a black-box model as a function of weather-related variables (temperature, humidity, etc.) and time-related variables (time of day, day of week, month, etc.). There are two intended usages of the predictive model. One is near-future prediction with short time windows for optimizing HVAC (heating, ventilating, and air conditioning) system control. The other is retrospective analysis over the last few months for the purpose of planning long-term improvements of the building facility and its management policies. In the retrospective analysis, it is critical to get a clear explanation of unusual events. At the beginning of the project, we interviewed 10 professionals on what kind of model explainability would be most useful for them. Their top-priority capabilities were uncertainty quantification in forecasting and anomaly diagnosis in retrospective analysis. Our choices in the current research reflect these business requirements.

We obtained one month's worth of test data with $M = 12$ input variables recorded hourly. We first computed $\sigma_t^2$ according to Eq. (8), in which we leave $(y^t, \boldsymbol{x}^t)$ out for each $t$.
For each of the test samples, we computed the outlier score by Eq. (1) under the Gaussian observation model, which resulted in a few conspicuous anomalies, as shown in Fig. 6. An important business question was who or what may be responsible for those anomalies.

[Figure 6: Building Energy: Outlier score computed with Eq. (1) for the test data.]

To obtain insights regarding the detected anomalies, we computed the LC score as shown in Fig. 7, where we computed $\boldsymbol{\delta}$ for each day with $N_{\mathrm{test}} = 24$ in Eq. (5) and visualized $\|\boldsymbol{\delta}\|_2^2$. For the Z-score, we visualized the daily mean of the absolute values. For LIME+, we computed regression coefficients for every sample and visualized the $\ell_2$ norm of their daily mean. We used $(\nu, \lambda) = (0.1, 0.5)$, which was determined by the level of sparsity and scale preferred by the domain experts.

[Figure 7: Building Energy: Comparison of the responsibility scores computed for the test data (Z-score, LIME+, and LC over variables such as timeofday, temperature, dewpoint, humidity, sunrad, and the one-hot daytype variables).]

As shown in the plot, the LC score clearly highlights a few variables whenever the outlier score in Fig. 6 is exceptionally high, while the Z-score and LIME+ do not provide much information beyond the trivial weekly patterns. The pattern of LIME+ was very stable over $0 < \nu \leq 1$, giving empirical evidence of its insensitivity to outliers. As mentioned before, one can mathematically prove this important fact: LIME+ as well as SV+ is invariant to a translation in $f$. On the other hand, the Z-score sensitively captures the variability in the weather-related variables, but it fails to explain the deviations in Fig. 6. This is understandable because the Z-score does not reflect the relationship between $y$ and $\boldsymbol{x}$. The artifact seen in the daytype variables is due to the one-hot encoding of the day of week. Finally, with LC, the variables highlighted around October 19 (a Thursday) are "timeofday", "daytype_Sa", and "daytype_Su", implying that those days had an unusual daily wattage pattern for a weekday and looked more like weekend days. Interestingly, it turned out that the 19th was a national holiday in India and many workers were off on and around that date. Thus we conclude that the anomaly is most likely not due to any faulty building facility, but rather due to a model limitation caused by the lack of full calendar information. Though simple, such pointed insights made possible by our method were highly appreciated by the professionals.

6 Conclusions

We have proposed a new method for model-agnostic explainability in the context of regression-based anomaly attribution. To the best of our knowledge, the proposed method provides the first principled framework for contrastive explainability in regression. The recommended responsibility score, likelihood compensation, is built upon the maximum likelihood principle. This is very different from the objectives used to obtain contrastive explanations in the classification setting. We demonstrated the advantages of the proposed method on synthetic and real data, as well as on a real-world use case of building energy management where we sought expert feedback.

Acknowledgements

The authors thank Dr. Kaoutar El Maghraoui for her support and technical suggestions. T.I. is partially supported by the Department of Energy National Energy Technology Laboratory under Award Number DE-OE0000911. A part of this report was prepared as an account of work sponsored by an agency of the United States Government.
Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

References

Amarasinghe, K.; Kenney, K.; and Manic, M. 2018. Toward explainable deep neural network based anomaly detection. In Proc. Intl. Conf. Human System Interaction (HSI), 311-317. IEEE.

Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; and Samek, W. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7): e0130140.

Belsley, D. A.; Kuh, E.; and Welsch, R. E. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley.

Casalicchio, G.; Molnar, C.; and Bischl, B. 2018. Visualizing the feature importance for black box models. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 655-670. Springer.

Chandola, V.; Banerjee, A.; and Kumar, V. 2009. Anomaly Detection: A Survey. ACM Computing Surveys 41(3): 1-58.

Costabello, L.; Giannotti, F.; Guidotti, R.; Hitzler, P.; Lécué, F.; Minervini, P.; and Sarker, K. 2019. On Explainable AI: From Theory to Motivation, Applications and Limitations. Tutorial, AAAI Conference on Artificial Intelligence.

Dhurandhar, A.; Chen, P.-Y.; Luss, R.; Tu, C.-C.; Ting, P.; Shanmugam, K.; and Das, P. 2018. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, 592-603.

Fong, R. C.; and Vedaldi, A. 2017. Interpretable explanations of black boxes by meaningful perturbation. In Proc. IEEE Intl. Conf. Computer Vision, 3429-3437.

Gade, K.; Geyik, S.; Kenthapadi, K.; Mithal, V.; and Taly, A. 2020. Explainable AI in Industry: Practical Challenges and Lessons Learned. In Companion Proceedings of the Web Conference 2020, 303-304.

Giurgiu, I.; and Schumann, A. 2019. Additive Explanations for Anomalies Detected from Multivariate Temporal Data. In Proc. Intl. Conf. Information and Knowledge Management, 2245-2248. ACM.

Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.

Langone, R.; Cuzzocrea, A.; and Skantzos, N. 2020. Interpretable Anomaly Prediction: Predicting anomalous behavior in industry 4.0 settings via regularized logistic regression tools. Data & Knowledge Engineering 101850.

Lucic, A.; Haned, H.; and de Rijke, M. 2020. Why does my model fail? Contrastive local explanations for retail forecasting. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 90-98.

Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765-4774.

Molnar, C. 2019. Interpretable Machine Learning. Lulu.
Parikh, N.; Boyd, S.; et al. 2014. Proximal algorithms. Foundations and Trends in Optimization 1(3): 127-239.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, 1135-1144. ACM.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Proc. AAAI Conference on Artificial Intelligence.

Roy, V.; Chakraborty, S.; et al. 2017. Selection of tuning parameters, solution paths and standard errors for Bayesian lassos. Bayesian Analysis 12(3): 753-778.

Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L. K.; and Müller, K.-R. 2019. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700. Springer Nature.

Sipple, J. 2020. Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure. In Proc. Intl. Conf. Machine Learning, 9016-9025.

Štrumbelj, E.; and Kononenko, I. 2010. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11(Jan): 1-18.

Štrumbelj, E.; and Kononenko, I. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41(3): 647-665.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proc. Intl. Conf. Machine Learning, 3319-3328.

Tao, F.; Cheng, J.; Qi, Q.; Zhang, M.; Zhang, H.; and Sui, F. 2018. Digital twin-driven product design, manufacturing and service with big data. The International Journal of Advanced Manufacturing Technology 94(9-12): 3563-3576.

Wachter, S.; Mittelstadt, B.; and Russell, C. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology 31: 841.

Zemicheal, T.; and Dietterich, T. G. 2019. Anomaly detection in the presence of missing values for weather data quality control. In Proc. ACM SIGCAS Conf. Computing and Sustainable Societies, 65-73.

Zhang, X.; Marwah, M.; Lee, I.-t.; Arlitt, M.; and Goldwasser, D. 2019. ACE: An Anomaly Contribution Explainer for Cyber-Security Applications. In Proc. IEEE Intl. Conf. Big Data, 1991-2000.