# multiscale_anomaly_detection_on_attributed_networks__04522bc2.pdf The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Multi-Scale Anomaly Detection on Attributed Networks Leonardo Guti errez-G omez,1,3 Alexandre Bovet,1 Jean-Charles Delvenne1,2 1Institute for Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM) 2Center for Operations Research and Econometrics (CORE) Universit e catholique de Louvain, Louvain-la-Neuve, Belgium {alexandre.bovet, jean-charles.delvenne}@uclouvain.be 3Luxembourg Institute of Science and Technology (LIST), Esch-sur-Alzette, Luxembourg leonardo.gutierrez@list.lu Many social and economic systems can be represented as attributed networks encoding the relations between entities who are themselves described by different node attributes. Finding anomalies in these systems is crucial for detecting abuses such as credit card frauds, web spams or network intrusions. Intuitively, anomalous nodes are defined as nodes whose attributes differ starkly from the attributes of a certain set of nodes of reference, called the context of the anomaly. While some methods have proposed to spot anomalies locally, globally or within a community context, the problem remain challenging due to the multi-scale composition of real networks and the heterogeneity of node metadata. Here, we propose a principled way to uncover outlier nodes simultaneously with the context with respect to which they are anomalous, at all relevant scales of the network. We characterize anomalous nodes in terms of the concentration retained for each node after smoothing specific signals localized on the vertices of the graph. Besides, we introduce a graph signal processing formulation of the Markov stability framework used in community detection, in order to find the context of anomalies. The performance of our method is assessed on synthetic and realworld attributed networks and shows superior results concerning state of the art algorithms. Finally, we show the scalability of our approach in large networks employing Chebychev polynomial approximations. Anomaly detection is an important problem in data mining with multiple applications in diverse domains (Aggarwal 2013). Examples include credit card fraud detection, intrusion detection for network systems, identifying anomalous users in communication networks, web spam detection, and so on. Anomalous data can be understood as noteworthy objects with patterns or behaviors that deviate significantly from a background property. Investigating anomalies in networks is especially interesting because many real world systems are well modeled by a network consisting of nodes with specific attributes and edges representing the relations between nodes. For instance, in a social network where nodes represent persons and edges social ties, node attributes could be demographic information. In a co-purchase Corresponding author Copyright c 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. O3 O3 1st scale 2nd scale 3rd scale Age: 26 Gender: male Income: 1000 Education: master Productivity: 7.5 Role: comercial Experience: 3 Nationality: Belgian Status: single Kids: 1 Node attributes (metadata) O ces Departments Company Figure 1: A toy example of work relation network. Nodes have attributes describing individual features. Node attributes define structural clusters in multiple scales. At the 1st scale outlier nodes (O1, O2, O3) lie within a local context, i.e, offices. In a 2nd scale, departments emerge as new contexts where O2 is not defined. Finally, at a larger scale O3 remains as a global anomaly in context of the whole company. network, nodes represent items and might contain information about them as attributes, while an edge is present between two items when the same person has bought them. Because the mechanisms generating anomalies in real world networks are usually unknown, defining ground truth anomalies, is problematic and measuring the degree for which a node is anomalous is a challenging problem. For this reason, anomalies must be defined in opposition to a background of normal nodes, i.e., the context relevant for an anomaly. So far, the issue of finding anomalous nodes, or outliers (we consider these words as synonyms in this paper) on networks together with their relevant contexts has not been entirely solved. Existing approaches consider outliers within local or community contexts (Liu, Huang, and Hu 2017; Gao et al. 2010) sometimes with strong assumptions about the distribution of the attributes (Gao et al. 2010), or separate the context identification from outlier detection (S anchez et al. 2014). Other methods detect anomalies without considering any context (Li et al. 2017). However, real world networks representing complex economic, social, or biological systems are characterized by modular, multiscalar, and often hierarchical structures (Boccaletti et al. 2006). Taking into account the node attributes spanning across the multi-scale structure of networks to correctly define the contexts of anomalies is therefore crucial. In addition, anomalous nodes might emerge within a specific context and later disappear in different contexts. Hence, anomaly detection is linked naturally to the problem of community detection, consisting of finding clusters of nodes that are densely linked together compared to the rest of the network (Newman and Girvan 2004). By performing multiscalar community detection on a network where related nodes are close to each other through edge weights expressing the similarity of nodes attributes, we discover all the relevant contexts for anomalous nodes. In this sense, we consider that a node can be an outlier only at certain scales and not necessarily at all scales. Consider the toy example of the co-working network in Fig. 1. If an employee, e.g., O2, works mainly with employees from a different office and not with the employees of his office, it may appear as an anomaly at the scale of offices. But if his office and the office of the employees he is working with, both belong to the same department, this employee will not be an anomaly at the department scale. At an even larger scale however, a company may form a large context with anomalies persisting as global outliers, e.g, O3. Anomalies are scale dependent, hence considering the multiscale nature of real world systems together with the node attributes is essential for the detection of anomalies. In summary, the contributions of this paper are: We propose a novel algorithm called MADAN (Multiscale Anomaly Detection in Attributed Networks) providing a principled mechanism to rank and localize outlier nodes within their context at all scales in a network. We conduct experiments on synthetic and real world benchmarks showing that our method allows to not only recover and rank the so called ground truth anomalies, but also to discover new anomalies jointly with their contexts. Finally, we show that our method is parallelizable and scales to large networks thanks to the Chebychev approximations of the exponential of the graph Laplacian. As side benefit, it also provides a faster methodology for the continuous-time Markov stability framework (Lambiotte, Delvenne, and Barahona 2014) for community detection. Related work As the amount of attributed network data has increased during the last decade, anomaly detection on attributed networks has attracted more attention. Analyzing static graphs for anomaly detection can be grouped in two general classes (Akoglu, Tong, and Koutra 2015): anomalies on plain (unlabeled) graphs and anomalies on attributed networks. The former one relates to detecting nodes having anomalous connectivity features in the network without exploiting any node attribute information (Akoglu, Mc Glohon, and Faloutsos 2010), i.e. isolated or bridge nodes. The second one, into which our method falls, aims to detect anomalous nodes regarding the attributes of nodes within a given local context, i.e community of a node or attribute subspace. CODA (Gao et al. 2010) is a community based algorithm which uses Hidden Markov Random Fields to characterize both data and links simultaneously turning the outlier detection in an inference problem. Consub+Dis Out (S anchez et al. 2014) is a two step method performing statistical selection of congruent subspace defined by subset of node attributes and subsequently rank anomalous nodes introducing a distance based outlier model. ALAD (Liu, Huang, and Hu 2017) is a method to retrieve outlier nodes exploiting network structure and node attribute information jointly considered in a non-negative matrix factorization framework. The RADAR (Li et al. 2017) approach propose a learning framework to identify anomalies via residual analysis exploring the coherence between attribute information and the network information. AMEN (Perozzi and Akoglu 2018) considers egonetworks to characterize anomalous neighbors in attributed networks. More recent deep learning approaches (Ding, Li, and Liu 2019) introduce an interactive approach incorporating the feedback form the end user. Unlike the previous methods, our MADAN approach allows to spot anomalous nodes across all scales of the network uncovering the relevant scales spanned by the nodes attributes and the graph structure. Problem statement and framework Creating a weighted graph Considering a graph G = (V, E) composed by a set V with |V | = N nodes or vertices, and a set of edges E. For simplicity, we will restrict this work to undirected, connected, simple graphs. Considering multidimensional node attributes, we associate to each vertex u V a d-dimensional vector f(u) = f1(u), . . . , fd(u) where fk(u) R represents the k th attribute of the vertex u. Our problem is, informally speaking, to discover anomalous nodes and the context with respect to which they are anomalous, for all possible ranges of context size. In order to do so, we introduce weights to the edges of the graph, expressing the similarity of the nodes attributes linked by the edge. Here we use the Gaussian weighting function: exp f(u) f(v) 2 2σ2 if (u, v) E 0 otherwise (1) where σ is a scaling parameter. This creates a weighted adjacency matrix W over the network, or similarity matrix, where two nodes are similar if they are neighbours with close values of the attributes. This weighting function is commonly used in image processing to represent similarity between pixel intensities (Haralick and Shapiro 1992). Here we use it to translate the node attributes into the structure of the network. The heat kernel This adjacency matrix allows us to define a heat equation for any graph signal x RN, i.e. any function V R assigning a scalar to every node. The heat equation is expressed as x(t) = (W D)x(t) = Lx(t) where D is the diagonal matrix of node strengths, defined by Duu = du = v w(u, v). The matrix L is called the Laplacian of the weighted graph. This equation, solved by x(t) = e t Lx(0), takes its name from a physical analogy, where xu(t) is the temperature or internal energy of node u (assuming unit heat capacity for each node), which is being redistributed across the graph following Fourier s heat conduction law along edges with conductivity w(., .). Importantly, the dynamics preserves the average of the signal. This is seen as the sum of every column of L is zero. In the physical analogy, it corresponds to the fact that the total energy on the graph is neither created nor destroyed, only diffused. In terms of signal processing (Shuman et al. 2013), the heat kernel e t L is often seen as a smoothing filter acting on the initial signal x(0), parametrized by the time t. It replaces every entry of x(0) with a weighted average of other nodes signals (with a greater emphasis on neighboring nodes). In the limit of large t, the smoothed signal is just constant over the network. When the graph is a discrete line with constant weights (modelling for example a discrete unidimensional space), the filter is translation-invariant. This means that for any two Kronecker delta signals δu (taking unit value on node u and zero value elsewhere) or δv (for node v), the corresponding smoothed signals e t Lδu and e t Lδv are translations of one another, and therefore have the same shape. This translation-invariance, typical in standard signal processing, disappears in arbitrary networks, where e t Lδu will strongly depend on the structure and weights of the network around u. This property allows us to characterize the structure of the graph through the properties of its kernel. We use this fact first to detect the anomalies and then to detect the contexts in which these anomalies are relevant. Finding the anomalous nodes The concentration of a node u V at scale t is defined by the L2 norm of the filtered δu signal with a heat kernel at time t (Perraudin et al. 2018): cu(t) = e t Lδu 2. (2) This measure is useful because the filter preserves the sum (or mean) of the signal, and therefore the Kronecker delta has maximum norm of one while the completely smoothed signal has norm 1/N, which is the smallest possible norm for a signal of sum one spread over N nodes. A node with high concentration at time t tends to be poorly connected with its neighborhood, through few edges or edges of low weight. The time t selects the size of the neighborhood of reference, in a way that is discussed below. With the weighting proposed in Eq. 1, a highconcentration node indicates a node that is very dissimilar in its attributes with its neighbors. In this way, we use the concentration as way to quantify the degree of deviation of a given node with respect to its context at a given time scale and provides a scoring for ranking potential outliers. To identify outliers, we use the following standard thresholding rule used in general outlier detection methods (Iglewicz and Hoaglin 1993) which works well in our experiments. However, other criteria for identifying from the concentration distribution could be developed. Let c(t) = [c1(t), . . . , c N(t)] be the overall graph concentration. A node u is considered as anomalous at a time scale t if the following thresholding condition holds: cu(t) c(t) + 2s(c(t)) (3) with c(t) as the average concentration and s(c(t)) its standard deviation across nodes. Finding the contexts We now make sense of the choice of t, in that it selects the context with respect to which a node is anomalous. In a large t limit, a delta signal δu will be smoothed to a constant over the whole graph, as the heat kernel will have the time to mingle the signal of all nodes together. A node that stands as an outlier for a large t is therefore an outlier globally, with respect to the whole graph. Conversely, a small t only allows heat diffusion with immediate neighbors, and thus a small t outlier is anomalously dissimilar to a small context. More generally, we consider that a set of nodes S is a suitable context for all potentially anomalous nodes lying in it, if this set is relatively poorly connected to the rest of the network, in a manner akin to community detection. Here, poorly connected is defined in reference to a given time scale t, in that the internal energy contained in S essentially remains in S within time t. Consider h S the characteristic signal of S, i.e. the node signal taking unit values in S and zero values outside. The initial total energy of S is |h S|1, the number of nodes in S. After smoothing, the energy remaining in S in excess of the energy 1 N |h S|1 that will remain at t = is: r(t; h S) = h T Se t Lh S 1 N |h S|1, (4) which we want to be as high as possible. As we want to be able to provide a context for potentially each node of the graph, we look for a partition of the nodes, encoded by its N K characteristic matrix H, where K is the number of sets and every column h1, . . . , h K is the characteristic vector of a set of nodes. An optimal partition into contexts is given by the matrix H maximizing: h T i e t Lhi 1 This is essentially a particular case of the Markov stability (Delvenne, Yaliraki, and Barahona 2010; Lambiotte, Delvenne, and Barahona 2014), designed as a general framework for multi-scale community detection. It has the expected behaviour to provide a fine partition of small sets of nodes for low t, and a few large sets for large t. As some nodes could be assigned to their own isolated context, in practice, we join them with the closest one. Although it is known to be NP-hard to optimize the maximization of Eq. (5) (Brandes et al. 2008) it can be recoded (Lambiotte, Delvenne, and Barahona 2014) into a modularity maximization problem for which the Louvain algorithm (Blondel et al. 2008) is known to provide good results in practice. For each time step, we run the Louvain algorithm 100 times with different random initializations and use Eq. 5 as a scoring function to rank partitions H. In order to assess the robustness of the retrieved partition at a given time scale, we follow the methodology of (Delvenne et al. 2013) Concentration O3: Rich O1: Medium O2: Very poor Concentration Concentration 1st scale 2nd scale 3rd scale mean + 2*std mean + 2*std mean + 2*std Figure 2: Synthetic social network of 160 nodes with a planted partition of four communities. Each node (representing a person) has a scalar attribute representing the income of a person, partitioning the network in four clusters labeled as rich, medium, poor, and very poor. Three outlier nodes are injected: O1 is a medium income person within the rich cluster, O2 a very poor person within the cluster of medium income people and O3 is a rich person in the very poor cluster. (A) We perform a scanning of relevant contexts along the attribute subspace: for each time t we find the number of clusters (blue curve) and compute the average variation of information (VI(t))) of the retrieved partitions (red curve). The VI(t) is obtained from 100 runs of the Louvain algorithm. The background represents the VI(t, t ) between optimized partitions across times. As can be seen in (A), starting at t 0.65, the partitions of 4, 3 and 2 clusters are found as persistent according with the low VI(t). Subsequently, we compute the concentration over all nodes of the graph (bar plots) with Eq. (2). (B) At the 1st scale, the concentration of nodes at t = 1.5 reveals three local outliers with their context given by the colored clusters. (C) At the 2nd scale, at t = 10.5 the context of O3 enlarges whereas the concentration of O2 decreases. (D) At the 3rd scale given by t = 578, O2 is not an outlier and the context for O3 enlarges more. (E) The dendrogram shows the hierarchy of partitions and the outlier nodes at each scale. and compute the variation of information among the ensemble of found partitions for a given time. To be more precise, the normalized variation of information of two partitions P1 and P2 reads: V I(P1, P2)(t) = H(P1|P2) + H(P2|P1) where H(P1|P2) is the conditional entropy of P1 given P2, i.e the additional information needed to describe P1 when P2 is known assuming a uniform probability on the nodes. To simplify the notation we will write V I(P1, P2)(t) as V I(t). To find the range of scales relevant for a given partition, we also compute the variation of information between the ensemble of partitions found between each pair of times V I(t, t ). Relevant partitions H are finally identified by times t, t where V I(t) shows a dip and V I(t, t ) shows a continuous plateau (Lambiotte, Delvenne, and Barahona 2014). Summary of our approach Our multi-scale anomaly detection method (MADAN) is summarized and illustrated in Fig. 2. Roughly speaking, we first scan the network to find the most relevant scales and finding the contexts, as described in the previous section. For the times that are identified by this context-finding methodology, we simultaneously detect the outliers satisfying Eq. 3. Experiments We evaluate the performance of our method in detecting anomalous nodes in synthetic and real life attributed networks. MADAN1 is compared against state-of-the-art and baseline methods. Experimental setup Synthetic dataset. We generate a series of undirected synthetic networks with ground truth community structure, adding node attributes and injecting node anomalies in different hierarchies. In order to simulate real networks, we generate networks with the Lanchinetti-Fornunato-Radicchi (LFR) benchmark (Lancichinetti and Fortunato 2009). This algorithm generates networks with diverse community structures, community sizes and power law degree distributions. Networks with N = 1000 nodes are generated with the mixing parameter μ = 0.1 generating partitions of well defined communities containing between 50 to 200 nodes. The average node degree is ranging from 10 to 100. The size of the communities are taken from a power law distribution with exponent 1. Similarly, the degree distribution for the nodes is drawn from a power law distribution with exponent equal to 2. 1MADAN Python code can be found at https://github.com/leoguti85/MADAN Percentage of anomalies Percentage of anomalies 1 5 10 15 20 25 30 0.0 1 5 10 15 20 25 30 MADAN LOF AMEN ALAD Random RADAR MADAN LOF AMEN ALAD Random RADAR Figure 3: Synthetic experiments. Performance on anomaly detection when adding increasing levels of anomalous nodes. Colors correspond to different baseline techniques, including our MADAN approach. We report the mean and standard deviation (vertical lines) over 50 LFR networks. Nodes attributes are generated as follows: to each node, we associate a vector of attributes of dimension d = 20. To make the problem more realistic, we introduce heterogeneity in the node attribute values, so that within each community C {0, 1, 2, . . . , cmax} nodes attributes are sampled from different probability distributions. For any attribute k of a node belonging to a community C, its values are drawn from a normal distribution with mean C and standard deviation ˆσ, a uniform distribution in the interval [C ˆσ, C + ˆσ] and a logistic distribution with mean C and scale ˆσ. As nodes within a same cluster are deemed to have similar values, we chose ˆσ = 0.1 for all distributions. In addition, we force some clusters to have similar node attributes, i.e. drawn from the same distributions, in order to introduce attribute hierarchies. Thus, the ground truth community structure defines structural clusters, and node attributes on top introduces clusters similarity, i.e., clusters of clusters. To generate anomalous nodes, a percentage of randomly chosen nodes (1%, 5%, 10%, 15%, 20%, 25%, 30%) is perturbed by replacing 30% of their attributes with attributes from a distinct, randomly chosen cluster. The performance of anomaly detection is measured over 50 random networks, reporting the average and standard deviation of the performance measures. Real life datasets. We also evaluate our approach on three real life attributed networks, see Table 1. They were labeled with the so called ground truth anomalies used as benchmark datasets in the literature. Table 1: Real life datasets DATASET # nodes # edges # node attr # anomalies Disney 124 334 28 6 Books 1418 3695 28 28 Enron 13533 176987 20 5 The Disney and Books datasets (M uller et al. 2013) are co-purchase networks extracted from Amazon. They contain 28 attributes per node describing properties about online items, i.e., rating, selling price, etc. The ground truth anomalies for Disney DVDs movies were tagged manually by high school students in Germany. Ground truth anomalies were defined as follow (M uller et al. 2013). A parti- tion of the network was obtained after applying the Louvain algorithm. Students were then asked to tag anomalous nodes with a high deviation of the attribute values within each cluster. Outliers nodes were defined as the ones tagged as anomalous by at least 50% of the students. In the Book dataset, ground truth anomalies were defined as nodes having the tag amazonfail (M uller et al. 2013). Enron (Metsis and et al. 2006) is a communication network with edges indicating email transmission between people. Each node contains 20 attributes describing metadata of the message i.e., content length, number of recipients, etc. This dataset has been extensively used as benchmark for spam detection (Metsis and et al. 2006). Spammers were labeled as ground truth for anomaly detection (M uller et al. 2013). In all cases, we set the parameter σ in Eq. 1 as the standard deviation of the distribution of pairwise distances between node attributes. Evaluation metric. Following the evaluation setup of (Li et al. 2017; M uller et al. 2013), we assess the anomaly detection performance with two well known metrics for evaluation of anomaly detection systems (Aggarwal 2013): the area under the receiver operating characteristic curve (ROCAUC) and the area under the precision/recall curve (PRAUC). The former allows to quantify the trade-off between true positive rate (tpr) and false positive rate (fpr) across different thresholds. The tpr is defined as the detection rate, i.e. the rate of true anomalous nodes correctly identified as anomalous, whereas the fpr is the false alarm rate, i.e. rate of normal nodes identified as anomalous. The second metric quantifies the trade-off between the precision, i.e., the predictive power of the method and the recall, defined as the ratio between true positive over true positives plus false negatives. In both cases, the more the AUC approaches to 1 the better the performance of the method. Baseline methods We compare our MADAN approach with five baseline methods of the literature having their source code available: Local Outlier Factor (LOF) (Breunig et al. 2000), Accelerated Local Anomaly Detector (ALAD) (Liu, Huang, and Hu 2017), Anomaly Mining of Entity Neighborhoods (AMEN) (Perozzi and Akoglu 2018), Residual Analysis for Anomaly Detection (RADAR) (Li et al. 2017) and a simple Random classifier (Random) for anomaly detection, i.e., assigning random ranking scores to each node. We did not experiment with CODA (Gao et al. 2010) and Consub+Dis Out (C+D) (S anchez et al. 2014) on our synthetic dataset because the authors did not provide their scripts. However, we compare against them on the real life benchmarks by taking their performance values from their papers. Each method output an anomaly score or a ranking of nodes according with their anomaly level. We compute the ROC and PR curves and report their AUC score. Performance evaluation Synthetic results. Results on synthetic attributed networks are shown in Fig. 3. As can be seen, our MADAN approach outperforms the baseline methods in all cases. Here the scale t was chosen as the one giving the best Table 2: Results real life datasets (ROC-AUC) DATASET LOF C+D RADAR ALAD AMEN MADAN Disney 0.61 0.81 0.87 0.70 0.52 0.93 Books 0.49 0.60 0.58 0.43 0.47 0.68 Enron 0.44 0.74 0.65 0.72 0.47 0.66 performance. Indeed our approach can deal with hierarchical composition of attributes across communities together with heterogeneous distribution of node attributes. Injecting a few amount of anomalies, i.e., 1% of perturbed nodes, MADAN is comparable with RADAR and LOF in both ROC-AUC and PR-AUC metrics. However, when the number of anomalies increases, LOF performance fall down because it does not exploit the graph structure. It can also be seen that AMEN performance is precarious comparable to a random classifier because this method was designed to detect anomalous nodes on ego-networks discovering anomalous neighborhoods. MADAN also shows the lowest variance across identifying outliers across different random networks. Real life results. Results on real life datasets are shown in Table 2. We see that our approach (MADAN) achieves the best performance compared to other baseline methods on the Disney and Books networks while C+D shows the best AUC on the Enron dataset. The advantage of our method that it allows us to find the right scale for which the anomalous nodes are well defined regarding the nodes attributes of the network. In general it performs better, or at least with comparable AUC, than the baseline methods, in particular compared to LOF, which ignores the underlying structure of the network, and AMEN, which focuses only anomalous neighbors. Results of precision, recall and F12 scores (thresholded with Eq. 3) in Table 3 highlight the performance of our method, except on Enron dataset. This shows that our method is not well suited for spammer detection. It is worth mentioning that, when assessing the performance of anomaly detection algorithms on empirical datasets, one should keep in mind that the set of manually annotated, so-called, ground truth anomalies does not represent an objective truth as different annotators may label different nodes as anomalous, choosing different criteria to make their decisions. We believe that our method, that allows to measure the degree of anomaly of each node as a function of the size of its attribute context, offers a framework to explore systematically anomalies in a principled way. We will present later an example of such anomaly exploration in a real world network.. Computational issues The computational bottleneck of our approach relies in the computation of the exponential of the Laplacian for large graphs (more than 8000 nodes) used for computing the node concentration (Eq. 2), and finding the optimal context for anomalous nodes (Eq. 5). Previously, (Hammond, Vandergheynst, and Gribonval 2011) introduce an efficient way 2The F1-score is weighted by support (the number of true instances for each label), accounting for label imbalance. Table 3: Results real life datasets (PR/RC/F1-score) DATASET LOF RADAR ALAD AMEN MADAN Disney 0.55 0.47 0.53 0.48 0.82 Books 0.50 0.50 0.50 0.50 0.52 Enron 0.49 0.50 0.50 0.50 0.50 Disney 0.62 0.50 0.67 0.41 0.82 Books 0.50 0.52 0.46 0.44 0.60 Enron 0.44 0.55 0.55 0.45 0.60 Disney 0.57 0.48 0.41 0.35 0.83 Books 0.48 0.09 0.34 0.34 0.53 Enron 0.47 0.10 0.33 0.33 0.17 to approximate kernel evaluations of general filters employing Chebychev polynomials approximation. Here3 we propose a way to compute the heat kernel evaluation at a given scale, by approximating the entire exponential of the Laplacian as: e t L = [e t Lδ1, . . . , e t LδN] (7) where each column component is approximated with a Chebychev polynomial, being easily parallelized across columns. The error of the approximation is controlled by the degree of the Chebychev polynomial m. The larger the m the better the approximation with a cost of a longer running time. We chose m = 30 as in (Hammond, Vandergheynst, and Gribonval 2011). We test the scalability of this approach in the computation of the exponential of the Laplacian, compared with the classical Fourier approach, i.e., Cholesky decomposition and Pad e approximations. We generate Barab asi-Albert networks with power-law degree distribution of varying number of nodes N and number of edges equal to 3N. Results are depicted in Fig. 4. Figure 4: Running time for the computation of the exponential of the graph Laplacian for t = 1 varying the network size. Our parallel setting run across 8 cores. Multi-scale anomaly detection: a case study We consider the Disney network as a case study. This dataset provides manually labeled ground-truth anomalies (M uller et al. 2013) and has been largely used as benchmark for outlier detection in networks. Here, we use Min Price Private Seller and Number of reviews as node attributes. 3All computations were done on a standard computer Intel(R) Core(TM) i7-4790CPU, 3.60GHz I with 16G of RAM. F 4th scale 3rd scale O2 O3 O4 O5 O6 O7 O3 O4O5 O6 O2 O1 Concentration Concentration Concentration Concentration 2nd scale 1st scale 2nd scale 3rd scale 4th scale mean + 2*std mean + 2*std mean + 2*std mean + 2*std Figure 5: (A) Number of clusters (blue curve) found by maximizing Eq. 5, at each time step, the variation of information V I(t) (red curve) between the ensemble of optimal partitions at each time and the variation of information (V I(t, t )) between optimal partitions across times (background contour plot). Relevant partitions are determined by dips of V I(t) and extended plateaus of V (t, t ). Visualization of four robust partitions and the node heat concentration (bar plots), indicating outlier nodes (blue bars) when evaluating the node concentration at (B) t = 1.18, (C) t = 7.25, (D) t = 14.89 and (E) t = 204.8. In all bar plots, the upper horizontal line (blue) indicates the detection threshold defined in Eq. 3. (F) Dendrogram showing the hierarchy of clusters at each time with the contextual outliers. (G) Anomalous Disney DVD movies. As is shown in Fig. 5, we start scanning to look at relevant scales taken into account both the node metadata and the graph structure. For each intrinsic scales found in Fig. 5A, we display on the network the corresponding optimal partition (scales) in the right-hand side of Fig. 5. The contour plot in Fig. 5A shows the variation of information V I(t, t ) between the sets of optimal partitions at time t and t and reveals blocks of low VI (dark blue blocks) corresponding to the different scales uncovering different contexts. The number of clusters at each scale (blue curve) displays plateaus corresponding to the blocks in V I(t, t ). The scales are also visible as dips in the V I(t) among the set of partitions found by iterating the Louvain algorithm at a given time scale (red curve). We identify four relevant scales with 9, 6, 4 and 2 number of clusters (or contexts) respectively. Simultaneously, the concentration on each node Eq. 2 is evaluated at the same t, spotting anomalies at the given scale and context. As shown in the right hand side of Fig. 5, (bar plots in B,C,D,E), outliers nodes at a given scale are detected as the ones having a concentration of energy more than two standard deviation above the mean concentration (Eq. 3). At the 1st scale (the finest one) the Fig. 5B shows the partitioning of the network in 9 clusters. Node O5 (The jungle book) appears as an anomalous node within this context, that corresponds to all the read along movies, having both high selling price and number of reviews. An interesting case is found in O3 (The Nightmare Before Christmas / James and the Giant Peach) which has low reviews but is a very expensive article because it is the only one in its context consisting in a two DVD pack. The cluster containing O2 and O4 corresponds to Pixar movies, where again O4 is a three pack DVD promotion (Toy story 1-2 and A Bug s Life), whereas O2 corresponds to the poorly reviewed but very expensive Buzz Lightyear of Star Command: The Adventure Begins film. The cluster containing O6 consists in many Winnie-The-Pooh related movies with some independent Tigger sagas. The anomalous node O6 (The Many Adventures of Winnie the Pooh) is the most expensive film with the highest rate of reviews in this category. The threenodes cluster containing O7 (The Nightmare Before Christmas) flagged O7 as an anomaly because its selling price is 3 and 6 times higher than the two other nodes. Going up in the hierarchy of clusters allows us to detect anomalous nodes within coarser contexts and discarding more local outliers. As can be seen in the 3rd scale, the Fig. 5D, the outliers remaining within the pink cluster, O3 and O4, correspond to the two packs DVD promotions with very high prices, whereas more local anomalies disappear at this scale, e.g., O7. Finally we see in Fig. 5E that at the 4th scale, O3 (The Nightmare Before Christmas / James and the Giant Peach) remains as a global outlier within the coarser pink cluster, being a bad connected node with high attribute values. Conclusions In this work we introduce an effective algorithm (MADAN) to perform multi-scale anomaly detection on attributed networks. Anomalous nodes are characterized as those remaining highly concentrated after smoothing unit impulse signals around each vertex of the graph. Using the heat kernel as filtering operator allows us to exploit the link with the Markov stability to find the context for outlier nodes at all relevant scales of the network. Extensive empirical studies on synthetic and real world benchmarks demonstrated the superiority of MADAN over many state of the art approaches. In addition, we show that our method is highly efficient, being easily parallelized scaling with networks with up to at least 100k nodes. Acknowledgments This work was supported by the Flagship European Research Area Network (FLAG-ERA) Joint Transnational Call Futur ICT 2.0 and the Swiss National Science Foundation Fellowship P300P2 177793 (A.B) which are gratefully acknowledged as well as Leto Peel for helpful suggestions. References Aggarwal, C. C. 2013. Outlier Analysis. Springer Publishing Company, Incorporated. Akoglu, L.; Mc Glohon, M.; and Faloutsos, C. 2010. oddball: Spotting anomalies in weighted graphs. In Zaki, M. J.; Yu, J. X.; Ravindran, B.; and Pudi, V., eds., Advances in Knowledge Discovery and Data Mining, 410 421. Akoglu, L.; Tong, H.; and Koutra, D. 2015. Graph based anomaly detection and description: A survey. Data Min. Knowl. Discov. 29(3):626 688. Blondel, V. D.; Guillaume, J.-L.; Lambiotte, R.; and Lefebvre, E. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10):P10008. Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; and Hwang, D. U. 2006. Complex networks: Structure and dynamics. Physics Reports 424(4-5):175 308. Brandes, U.; Delling, D.; Gaertler, M.; Gorke, R.; Hoefer, M.; Nikoloski, Z.; and Wagner, D. 2008. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering 20(2):172 188. Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; and Sander, J. 2000. Lof: Identifying density-based local outliers. SIGMOD Rec. 29(2):93 104. Delvenne, J.-C.; Schaub, M. T.; Yaliraki, S. N.; and Barahona, M. 2013. The Stability of a Graph Partition: A Dynamics-Based Framework for Community Detection. New York, NY: Springer New York. 221 242. Delvenne, J.-C.; Yaliraki, S. N.; and Barahona, M. 2010. Stability of graph communities across time scales. Proceedings of the National Academy of Sciences 107(29):12755 12760. Ding, K.; Li, J.; and Liu, H. 2019. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 19, 357 365. New York, NY, USA: ACM. Gao, J.; Liang, F.; Fan, W.; Wang, C.; Sun, Y.; and Han, J. 2010. On community outliers and their efficient detection in information networks. KDD 10, 813 822. ACM. Hammond, D. K.; Vandergheynst, P.; and Gribonval, R. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30(2):129 150. Haralick, R. M., and Shapiro, L. G. 1992. Computer and Robot Vision. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1st edition. Iglewicz, B., and Hoaglin, David C. (David Caster), . 1993. How to detect and handle outliers. Milwaukee, Wis. : ASQC Quality Press. Lambiotte, R.; Delvenne, J.; and Barahona, M. 2014. Random walks, markov processes and the multiscale modular organization of complex networks. IEEE Transactions on Network Science and Engineering 1(2):76 90. Lancichinetti, A., and Fortunato, S. 2009. Community detection algorithms: A comparative analysis. Phys. Rev. E 80:056117. Li, J.; Dani, H.; Hu, X.; and Liu, H. 2017. Radar: Residual analysis for anomaly detection in attributed networks. In IJCAI17, 2152 2158. Liu, N.; Huang, X.; and Hu, X. 2017. Accelerated local anomaly detection via resolving attributed networks. In IJCAI17, 2337 2343. Metsis, V., and et al. 2006. Spam filtering with naive bayes which naive bayes? In CEAS. M uller, E.; S anchez, P. I.; M ulle, Y.; and B ohm, K. 2013. Ranking outlier nodes in subspaces of attributed graphs. In ICDEW, 216 222. Newman, M. E. J., and Girvan, M. 2004. Finding and evaluating community structure in networks. Physical Review E 69(2):026113. Perozzi, B., and Akoglu, L. 2018. Discovering communities and anomalies in attributed graphs: Interactive visual exploration and summarization. ACM Trans. Knowl. Discov. Data 12(2):24:1 24:40. Perraudin, N.; Ricaud, B.; Shuman, D. I.; and Vandergheynst, P. 2018. Global and local uncertainty principles for signals on graphs. APSIPA Transactions on Signal and Information Processing 7:e3. S anchez, P. I.; M uller, E.; Irmler, O.; and B ohm, K. 2014. Local context selection for outlier ranking in graphs with multiple numeric node attributes. In SSDBM, 16:1 16:12. ACM. Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30(3):83 98.