# Distributed Spectrum-Based Fault Localization

Avraham Natan, Roni Stern, Meir Kalech
Ben-Gurion University of the Negev, Israel
avinat123@gmail.com, roni.stern@gmail.com, kalech@bgu.ac.il

**Abstract.** Spectrum-Based Fault Localization (SFL) is a popular approach for diagnosing faulty systems. SFL algorithms are inherently centralized: observations are collected and analyzed by a single diagnoser. Applying SFL to diagnose distributed systems is challenging, especially when communication is costly and there are privacy concerns. We propose two SFL-based algorithms that are designed for distributed systems: one for diagnosing a single faulty component and one for diagnosing multiple faults. We analyze these algorithms theoretically and empirically. Our analysis shows that the distributed SFL algorithms we developed output diagnoses identical to those of centralized SFL while preserving privacy.

## 1 Introduction

Spectrum-Based Fault Localization (SFL) is a popular approach for diagnosing faulty systems. Primarily used in software diagnosis (Abreu, Zoeteweij, and Van Gemund 2009), SFL aims to identify faults in programs. To achieve this, SFL models the program as a set of components and runs tests, each of which executes some of the components. This activity is recorded in an activity matrix (the spectrum), and the test outcomes are recorded in an error vector. Some SFL algorithms diagnose single faults, while others diagnose multiple faults.

SFL algorithms are centralized. However, many systems are inherently distributed, such as software systems, multi-agent systems, and communication networks. Applying centralized SFL techniques to diagnose such distributed systems is challenging, especially when communication is costly and privacy must be considered: first, information on the activity of a component may be hard to obtain or restricted.
Second, privacy restrictions or communication load might challenge the performance of a centralized algorithm.

The contribution of this paper is: (1) We define the problem of Distributed SFL (DSFL). (2) We propose two SFL-based diagnosis algorithms for distributed systems: one for single faults (DSFLA-SINGLE) and one for multiple faults (DSFLA-MULTI). (3) We provide empirical and theoretical analysis of soundness, completeness, privacy, communication load, and runtime. We show that the distributed algorithms output diagnoses identical to those of the centralized ones while preserving privacy. We also evaluate three variations of DSFLA-MULTI, and show that with a small reduction in diagnosis quality, DSFLA-MULTI performs better in terms of privacy, communication load, and runtime.

Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## 2 Background and Related Work

Our work builds on prior work on SFL, so we provide here a brief background on SFL. For a more comprehensive background see Abreu, Zoeteweij, and Van Gemund (2009).

**Definition 1 (SFL Problem).** An SFL problem is defined by a tuple $\langle C, R, S, E \rangle$ where $C$ is a set of components, each of which may be faulty; $R$ is a set of runs executed on the system; $S$ is a binary $|R| \times |C|$ matrix where $S_{i,j} = 1$ denotes that component $c_j$ participated in run $r_i$; and $E$ is a vector of length $|R|$ where $E_i = 1$ denotes that $r_i$ failed, and $E_i = 0$ otherwise. An SFL problem arises when $\exists i : E_i = 1$.

The matrix $S$ is called the spectrum and the vector $E$ is called the error vector. Table 1a shows the spectrum and error vector for a system with 3 components ($c_1, c_2, c_3$) that is executed 4 times ($r_1, r_2, r_3, r_4$). Consider the second row, for example: components $c_2$ and $c_3$ were involved, and the run failed.

A solution to an SFL problem is a set of diagnoses and a ranking function to rank them. A diagnosis is a set of components that explains all the failed runs.
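Definition 1 maps directly onto code. The following is a minimal sketch (the data is Table 1a; the predicate mirrors the condition $\exists i : E_i = 1$ under which an SFL problem arises; all names are illustrative):

```python
# Spectrum S (rows r1..r4, columns c1..c3) and error vector E from Table 1a.
# S is a binary |R| x |C| matrix; E is a 0/1 vector of run outcomes.
S = [
    [1, 1, 0],  # r1: c1, c2 involved; failed
    [0, 1, 1],  # r2: c2, c3 involved; failed
    [1, 0, 0],  # r3: c1 involved; failed
    [1, 0, 1],  # r4: c1, c3 involved; passed
]
E = [1, 1, 1, 0]

def is_sfl_problem(E):
    """An SFL problem arises when at least one run failed (some E_i = 1)."""
    return any(e == 1 for e in E)

print(is_sfl_problem(E))  # True
```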
A diagnosis $\Delta$ explains a failed run if at least one of the components in $\Delta$ participated in the failed run according to the spectrum $S$. Formally:

**Definition 2 (SFL Diagnosis).** A set of components $\Delta \subseteq \{1, \ldots, |C|\}$ is a diagnosis for an SFL problem $\langle C, R, S, E \rangle$ if $\forall i : E_i = 1 \rightarrow \exists j \in \Delta : S_{i,j} = 1$.

Some SFL algorithms diagnose single faults, while others diagnose multiple faults.

**Single fault SFL:** In this case, a component $c_j$ is a diagnosis if it participated in at least one failed run. Diagnosis ranking uses similarity coefficients (Hofer et al. 2015), which are evaluated using four similarity counters $n_{pq}(j)$, $p, q \in \{0, 1\}$, defined for every $c_j$ as $n_{pq}(j) = |\{i \mid S_{i,j} = p \wedge E_i = q\}|$. Table 1b shows these counters with respect to Table 1a. For example, $n_{11}(2) = 2$ since the number of runs where $S_{i,2} = 1 \wedge E_i = 1$ is 2. Using the Ochiai similarity coefficient (Abreu, Zoeteweij, and Van Gemund 2007) with those values yields the probabilities $P(c_1) = 0.67$, $P(c_2) = 0.82$, $P(c_3) = 0.41$. The diagnosis set ranked by these probabilities is: $D = \{\langle \{c_2\}, 0.82 \rangle, \langle \{c_1\}, 0.67 \rangle, \langle \{c_3\}, 0.41 \rangle\}$.

*The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)*

(a)

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | 1 | 1 | 0 | 1 |
| $r_2$ | 0 | 1 | 1 | 1 |
| $r_3$ | 1 | 0 | 0 | 1 |
| $r_4$ | 1 | 0 | 1 | 0 |

(b)

|             | $c_1$ | $c_2$ | $c_3$ |
|-------------|-------|-------|-------|
| $n_{11}(j)$ | 2 | 2 | 1 |
| $n_{10}(j)$ | 1 | 0 | 1 |
| $n_{01}(j)$ | 1 | 1 | 2 |
| $n_{00}(j)$ | 0 | 1 | 0 |

Table 1: (a) Spectrum and error vector for a system of 3 components that is run 4 times, and (b) their similarity counters.

**SFL for multiple faults:** Barinel (Abreu, Zoeteweij, and Van Gemund 2009) is an SFL-based algorithm that addresses this case by combining SFL with model-based diagnosis (MBD). Barinel defines the set of components that participated in a failed run as a conflict. It then outputs diagnoses by computing a Minimal Hitting Set (MHS) of the set of conflicts. This approach is known in MBD as the conflict-directed approach (De Kleer and Williams 1987; Williams and Ragno 2007; De Kleer 2011). Diagnosis ranking uses a Bayesian approach, computing $P(\Delta|E) = \frac{P(E|\Delta) \cdot P(\Delta)}{P(E)}$, where $P(\Delta)$ is the prior of $\Delta$ and $P(E)$ is a normalization factor.
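The conflict extraction and MHS step just described can be sketched with a brute-force minimal-hitting-set computation. This exhaustive version is only for illustration; Barinel itself uses a dedicated staged MHS algorithm:

```python
from itertools import combinations

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]

# Each failed run contributes a conflict: the set of components it involved.
conflicts = [frozenset(j for j, s in enumerate(row, 1) if s == 1)
             for row, e in zip(S, E) if e == 1]

def minimal_hitting_sets(conflicts):
    """Brute-force MHS: enumerate candidate sets by size, keep those that
    hit every conflict and contain no smaller hitting set."""
    comps = sorted(set().union(*conflicts))
    hits = []
    for k in range(1, len(comps) + 1):
        for cand in combinations(comps, k):
            cs = frozenset(cand)
            if all(cs & c for c in conflicts) and not any(h <= cs for h in hits):
                hits.append(cs)
    return hits

diagnoses = minimal_hitting_sets(conflicts)
# diagnoses are {1, 2} and {1, 3}, i.e. {c1, c2} and {c1, c3}
```

On Table 1a the conflicts are {c1, c2}, {c2, c3}, and {c1}, so the minimal hitting sets are exactly the two diagnoses used in the examples that follow.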
To compute the likelihood function $P(E|\Delta)$, $h_j$ is defined as the likelihood of faulty component $c_j$ behaving normally. Formally:

**Definition 3 (Likelihood Function).** Given a diagnosis $\Delta$, spectrum $S$ and error vector $E$, the likelihood function $L$ is defined as $L = P(E|\Delta) = \prod_{E_i \in E} L_i$, with each term $L_i$ defined for a row in the spectrum as the probability of all involved components behaving normally:

$$
L_i = P(E_i|\Delta) = \begin{cases} \prod_{j \in \Delta,\, S_{i,j}=1} h_j & \text{if } E_i = 0 \\ 1 - \prod_{j \in \Delta,\, S_{i,j}=1} h_j & \text{if } E_i = 1 \end{cases} \quad (1)
$$

Since $h_j$ is not known, a maximization algorithm such as gradient descent is applied to maximize $L$.

**Example 1.** In Table 1a the conflicts for runs $r_1, r_2, r_3$ are $\{c_1, c_2\}$, $\{c_2, c_3\}$, $\{c_1\}$, respectively. The minimal hitting sets $\Delta_1 = \{c_1, c_2\}$ and $\Delta_2 = \{c_1, c_3\}$ are the diagnoses. As an example, for $\Delta_2 = \{c_1, c_3\}$ the terms corresponding to the rows are:

$$
L_1 = P(E_1|\Delta_2) = (1 - h_1) \\
L_2 = P(E_2|\Delta_2) = (1 - h_3) \\
L_3 = P(E_3|\Delta_2) = (1 - h_1) \\
L_4 = P(E_4|\Delta_2) = h_1 h_3
$$

And their product is:

$$
L = (1 - h_1) \cdot (1 - h_3) \cdot (1 - h_1) \cdot (h_1 h_3) \quad (2)
$$

Maximizing $L$ with gradient descent yields $L = 0.161$.

### 2.1 Related Work

SFL has been studied extensively in recent years, with different works improving it and demonstrating its usefulness. One work addresses scalability challenges that arise due to the computational overhead of computing the spectrum (Perez, Abreu, and Riboira 2014). The authors propose dynamic code coverage (DCC), which dynamically increases the granularity of the components; their results show the usefulness of this approach. A different improvement to SFL incorporates a software fault prediction model (Elmishali, Stern, and Kalech 2016) to improve diagnosis ranking, with results showing a significant improvement in diagnosis accuracy and troubleshooting efficiency. Another work (Elmishali, Stern, and Kalech 2018) shows how SFL can serve as the diagnosis part of a novel Learn, Diagnose and Plan (LDP) paradigm, and how SFL improves when it is part of LDP. A later study shows SFL usage in diagnosing system exploits (Elmishali, Stern, and Kalech 2020).
There, the components are blocks of code, and a trace of the tests using them is recorded. The authors use a fuzz testing tool and an under-tracing technique to enhance this method and improve diagnosis accuracy; the method's validity is shown on different software projects. Another work demonstrates SFL for multi-agent systems (Passos, Abreu, and Rossetti 2015). The authors extend SFL to address agent-specific features such as agent autonomy, and show promising results of roughly 96.96% diagnosis accuracy. Some work addresses distributed diagnosis with different approaches such as inter-level communication (Pérez Zúñiga et al. 2018), fuzzy fault isolation (Syfert, Bartyś, and Kościelny 2018) and structural analysis (Pérez-Zúñiga et al. 2022). A distributed approach to SFL, however, has not been proposed previously.

## 3 Methodology

In this section we address the problem of Distributed Spectrum-Based Fault Localization (DSFL).

### 3.1 Problem Definition

The approaches presented above rely on a central solver to generate and rank the diagnoses. Applying them to distributed systems raises challenges such as a single point of failure, communication load, and privacy. To address the latter two, we assume no central solver. Instead, every component has visibility only of the execution runs in which it participates, captured by a local spectrum and a local error vector. Formally:

**Definition 4 (Local Spectrum and Local Error Vector).** Given spectrum $S$, error vector $E$ and component $c_j$, the local spectrum $S^j$ of $c_j$ is $S^j = \{S_{i,\cdot} \mid S_{i,j} = 1\}$, and the local error vector $E^j$ of $c_j$ is $E^j = \{E_i \mid S_{i,j} = 1\}$.

(a) $S^1, E^1$

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | 1 | 1 | 0 | 1 |
| $r_2$ | - | - | - | - |
| $r_3$ | 1 | 0 | 0 | 1 |
| $r_4$ | 1 | 0 | 1 | 0 |

(b) $S^2, E^2$

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | 1 | 1 | 0 | 1 |
| $r_2$ | 0 | 1 | 1 | 1 |
| $r_3$ | - | - | - | - |
| $r_4$ | - | - | - | - |

(c) $S^3, E^3$

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | - | - | - | - |
| $r_2$ | 0 | 1 | 1 | 1 |
| $r_3$ | - | - | - | - |
| $r_4$ | 1 | 0 | 1 | 0 |

Table 2: Local spectra $S^1, S^2, S^3$ and error vectors $E^1, E^2, E^3$ corresponding to components $c_1, c_2, c_3$.

Tables 2a, 2b, 2c show the local spectra and error vectors of components $c_1, c_2, c_3$. Solving the SFL problem using a single local spectrum may output wrong diagnoses.
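Definition 4 can be sketched by filtering the global spectrum. In this sketch (helper names are illustrative), run indices are kept so the result can be compared against Table 2:

```python
S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]

def local_view(S, E, j):
    """Local spectrum and error vector of component j:
    the rows i with S[i][j] == 1, keyed by run index."""
    Sj = {i: row for i, row in enumerate(S) if row[j] == 1}
    Ej = {i: E[i] for i in Sj}
    return Sj, Ej

S2, E2 = local_view(S, E, 1)  # component c2 (0-based column 1)
# c2 sees only runs r1 and r2, both failed, matching Table 2b
```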
To that end we define the Distributed SFL (DSFL) problem:

**Definition 5 (DSFL problem).** A DSFL problem is defined as $\langle C, R, \{S^j\}_{j=1}^{|C|}, \{E^j\}_{j=1}^{|C|} \rangle$, where $C$ is a set of components, $R$ is a set of runs, and $S^j$ and $E^j$ are the local spectrum and error vector of component $c_j$. A DSFL problem arises when $\exists j, i : E^j_i = 1$.

A solution to DSFL is a diagnosis for the joint SFL problem defined by the union of the local spectra and local error vectors. A naive solution collects the information from the components at a single solver. Such an approach may have high communication costs, and it neglects privacy. We discuss the topic of privacy loss in distributed SFL in Section 4. In the following sections we describe distributed versions of the SFL algorithms that address privacy for single faults and for multiple faults. The components share information in order to reach the same output as the centralized algorithms, while minimizing the private information revealed.

### 3.2 Distributed SFL for Single Fault Problems

Here we present our distributed algorithm for diagnosing single faults (DSFLA-SINGLE). Since the algorithm handles single faults, each component is a potential diagnosis and we only need to present the ranking. Each component $c_j$ should calculate its similarity counters $n_{11}(j), n_{10}(j), n_{01}(j), n_{00}(j)$ and use them with a similarity coefficient. The challenge is that a component knows only the outcomes of the runs it participates in, as shown in Table 2, making it able to calculate only $n_{11}(j)$ and $n_{10}(j)$. To address this challenge, each component $c_j$ requests from the other components information about $n_{01}(j)$ and $n_{00}(j)$, as indicated by their spectra.

```
Algorithm 1: DSFLA-SINGLE
Input: cj - the current component
Result: P - probability of diagnosis Δ = {cj}
1  n11(j), n10(j), n01(j), n00(j) ← 0, 0, 0, 0
2  for i ∈ S^j do
3      q ← E^j_i ; n1q(j) ← n1q(j) + 1
4  for j' ∈ [1, ..., |C|] s.t. j' ≠ j do
5      nn01, nn00 ← request_missing(cj, cj')
6      n01(j), n00(j) ← n01(j) + nn01, n00(j) + nn00
7  P ← similarity(n11(j), n10(j), n01(j), n00(j))
```
This information is summed up by $c_j$ to obtain the true values of $n_{01}(j)$ and $n_{00}(j)$. Algorithm 1 presents the process of ranking component $c_j$. The component initializes $n_{pq}(j)$ (Line 1) and calculates $n_{1q}(j)$ using the rows in its local spectrum (Lines 2-3). Then, the component requests information from the other components (Lines 4-6) by calling the request_missing procedure (detailed in Algorithm 2). Note that these request messages are sent to the other components in order of their appearance in the spectrum. To allow this, we assume the components have a predefined order. This allows a component to avoid sending data it knows was already sent by previous components. Finally, a similarity coefficient is used to calculate the rank.

```
Algorithm 2: request_missing
Input: cj - the component requesting the data
Input: cj' - the component returning the data
Result: nn01, nn00 - counter data
1  nn01, nn00 ← 0, 0
2  for i ∈ S^j' do
3      if ¬(∃j'' < j' s.t. S^j'_{i,j''} = 1) ∧ S^j'_{i,j} = 0 then
4          q ← E^j'_i ; nn0q ← nn0q + 1
5  return nn01, nn00
```

Algorithm 2 describes the request_missing method, in which component $c_{j'}$ provides the data requested by $c_j$. For every row in its local spectrum, $c_{j'}$ checks whether $c_j$ has already received the information about this row from a previous component, or whether $c_j$ has this row in its own local spectrum (Line 3). If neither is the case, $c_{j'}$ updates either $nn_{01}$ or $nn_{00}$ according to the error vector (Line 4). The reason $c_{j'}$ handles only rows for which it is the first component to observe the row is to avoid duplicate reports by different components on the same rows.

**Example 2.** We demonstrate how component $c_2$ calculates its similarity counters using its local spectrum, presented in Table 2b. After executing Lines 2-3 of Algorithm 1, $c_2$ has $n_{11}(2) = 2$ and $n_{10}(2) = 0$. Next, it requests information from $c_1$ and $c_3$, in that order. $c_1$ considers runs $r_1, r_3, r_4$ of $S^1$ (Table 2a). Since $S^1_{1,2} = 1$, run $r_1$ is skipped, since it means that $c_2$ has the same run in $S^2$.
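The counter exchange of Algorithms 1 and 2 can be simulated centrally. In the following sketch, request_missing is an ordinary function call reading the responder's local view (in the protocol it is a message exchange), Ochiai is assumed as the similarity coefficient of Line 7, and all other names are illustrative:

```python
from math import sqrt

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]

def local_runs(j):
    """Run indices in component j's local spectrum."""
    return [i for i, row in enumerate(S) if row[j] == 1]

def request_missing(j, jp):
    """Component jp reports to j the runs that j does not observe and that
    no component earlier than jp in the fixed order has observed."""
    nn01 = nn00 = 0
    for i in local_runs(jp):
        seen_earlier = any(S[i][k] == 1 for k in range(jp))
        if not seen_earlier and S[i][j] == 0:
            if E[i] == 1:
                nn01 += 1
            else:
                nn00 += 1
    return nn01, nn00

def rank(j):
    """Algorithm 1: local counters, then requests, then Ochiai."""
    n11 = sum(E[i] for i in local_runs(j))
    n10 = len(local_runs(j)) - n11
    n01 = n00 = 0
    for jp in range(len(S[0])):
        if jp != j:
            d01, d00 = request_missing(j, jp)
            n01, n00 = n01 + d01, n00 + d00
    return n11 / sqrt((n11 + n01) * (n11 + n10))

print([round(rank(j), 2) for j in range(3)])  # [0.67, 0.82, 0.41]
```

The ranks reproduce the centralized probabilities of the single-fault example, and component $c_2$'s request to $c_1$ yields $nn_{01} = 1$, $nn_{00} = 1$ while its request to $c_3$ yields nothing, as in Example 2.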
Runs $r_3$ and $r_4$ show that $c_2$ does not have them, so $c_1$ uses them to calculate $nn_{01} = 1$, $nn_{00} = 1$. The same request is sent to $c_3$. $c_3$ considers its local spectrum and finds that each of its runs is either already in $c_2$'s local spectrum or has already been addressed by $c_1$, so it returns nothing. At the end of the process, $c_2$ has the same counter values as shown in Table 1b, and it uses them to calculate the likelihood of it being faulty with a similarity coefficient.

### 3.3 Distributed SFL for Multiple Fault Problems

Here we present our algorithm for diagnosing a distributed system with multiple faults (DSFLA-MULTI). It first generates diagnoses and then ranks them.

**Diagnosis Generation.** The components generate diagnoses in two stages. First, each component calculates local diagnoses by applying a Minimal Hitting Set (MHS) algorithm (De Kleer 2011; Rodler 2022) to its local spectrum. This is done in parallel by all components. Then, $c_1$ sends the set of local diagnoses it computed to $c_2$. Subsequently, $c_2$ refines its set of local diagnoses based on the diagnoses it received from $c_1$. This continues sequentially, with the last component holding a complete set of diagnoses. This set is identical to the one generated by the centralized version of the algorithm.

Algorithm 3 lists the pseudo-code for the algorithm described above, from the perspective of component $c_j$. First, $c_j$ computes the local diagnosis set LD by executing an MHS algorithm on its local spectrum (Lines 1-2). Then, it receives a set of previously calculated diagnoses PD from $c_{j-1}$ (Line 3); if $j = 1$ then PD $= \emptyset$. Next, $c_j$ adds PD and LD as elements of a set and computes its hitting set to get a combined global diagnosis set GD (Line 4). GD is then refined by removal of duplicate diagnoses and supersets (Line 5). The generated diagnoses are sent to the next component, or output by the last component (Lines 6-8). Before continuing with the running example, we demonstrate how Lines 4-5 of Algorithm 3 work.
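One way to realize these two lines in code is the following sketch. Since MHS over the two-element set {PD, LD} amounts to choosing one diagnosis from each set, the unify-and-refine step reduces to pairwise unions, followed by dropping duplicates and supersets (the helper name is illustrative):

```python
from itertools import product

def combine(PD, LD):
    """Lines 4-5 of Algorithm 3: MHS({PD, LD}) followed by refine."""
    if not PD:                      # first component: nothing received yet
        candidates = {frozenset(d) for d in LD}
    else:
        candidates = {frozenset(a) | frozenset(b)
                      for a, b in product(PD, LD)}
    GD = []                         # refine: drop duplicates and supersets
    for c in sorted(candidates, key=len):
        if not any(g < c for g in GD):
            GD.append(c)
    return GD

GD = combine([{1}], [{2}, {1, 3}])  # -> {1, 2} and {1, 3}
```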
Line 4 applies MHS to a two-element set whose elements are the sets LD and PD, producing a set of sets of sets; Line 5 refines the result by unifying the inner sets and then filtering out supersets. We demonstrate this with a short example. Suppose that LD = {{2}, {1, 3}} and PD = {{1}}. MHS({LD, PD}) returns the set GD = {{{2}, {1}}, {{1, 3}, {1}}}. After unifying and removing duplicates we obtain GD = {{1, 2}, {1, 3}}.

```
Algorithm 3: Diagnose
Input: cj - the current component
Result: GD - a set of global diagnoses
1  LC ← local_conflicts(S^j)
2  LD ← MHS(LC)
3  PD ← receive_diagnoses(cj-1)
4  GD ← MHS({PD, LD})
5  GD ← refine(GD)
6  if j = |C| then
7      return GD
8  diagnose(cj+1, GD)
```

| $j$ | LD | PD | GD |
|-----|----|----|----|
| 1 | {{1}} | $\emptyset$ | {{1}} |
| 2 | {{2}, {1, 3}} | {{1}} | {{1, 2}, {1, 3}} |
| 3 | {{2}, {3}} | {{1, 2}, {1, 3}} | {{1, 2}, {1, 3}} |

Table 3: Example of one diagnosis generation process.

**Example 3.** Table 3 demonstrates the iterative diagnosis generation process described in Algorithm 3. Each of the three rows is executed by the component with the corresponding number, shown in column 1. Column 2 shows the local diagnoses (LD), computed in parallel. Columns 3-4 present the sequential process of receiving the previous diagnoses (PD) and refining them (GD) at each component. The example is run on 3 components, each of which has one of the spectra shown in Table 2. First, considering its local spectrum, each component calculates its local diagnoses $LD_j$ (column 2). Then $c_1$ starts with a previous diagnosis set $PD_1 = \emptyset$, which leads to $GD_1 = LD_1$. Next, $c_1$ sends $GD_1$ to $c_2$ as $PD_2$. $c_2$ then executes MHS on the set $\{PD_2, LD_2\}$ to get $GD_2 = \{\{1, 2\}, \{1, 3\}\}$, and sends it to $c_3$ as $PD_3$. $c_3$ then executes MHS on the set $\{PD_3, LD_3\}$ to get $GD_3 = \{\{1, 2\}, \{1, 3\}\}$. Here the diagnosis process stops; the diagnoses are $\Delta_1 = \{c_1, c_2\}$ and $\Delta_2 = \{c_1, c_3\}$.

**Ranking Diagnoses.** Ranking is also challenging, since the components do not possess the entire likelihood function $L$ (Def.
3), so none of them can maximize it by itself. To address this, we define the local likelihood function for component $c_j$:

**Definition 6 (Local Likelihood Function).** Given a diagnosis $\Delta$, component $c_j$, local spectrum $S^j$ and local error vector $E^j$, the local likelihood function $L^j$ is: $L^j = P(E^j|\Delta) = \prod_{E^j_i \in E^j} L^j_i$, with $L^j_i$ defined as $L_i$ in Def. 3, but with respect to $S^j$ and $E^j$.

Consider, for instance, $\Delta_2 = \{c_1, c_3\}$ (Ex. 3) and the spectra and error vectors in Table 2. The local likelihood functions of $c_1, c_2, c_3$ are presented in column 3 of Table 4. This raises a challenge in performing gradient descent as done in Barinel, since for each component $c_j$ some terms are missing ($L^j_{missing}$). As a result, a component cannot maximize $L$ by itself. On the other hand, reconstructing $L^j$ into the complete $L$ by exchanging the missing terms would allow the components to reconstruct the complete spectrum and thus reveal private information, since there is a bijective relation between $L_i$ and row $r_i$. To that end we propose a distributed version of gradient descent, in which the components share only two kinds of values: (1) the value of $L$, and (2) the values of $h_j \in H$. This ensures that the complete function $L$ is never known to any component. It is worth noting that $L$, $L^j$ and $L^j_{missing}$ are functions; throughout the demonstration of our algorithms, we denote the numerical values of these functions as $l$, $l^j$ and $l^j_{missing}$, respectively.

Given a diagnosis $\Delta$ to be ranked, each component $c_j$ initializes the group $H = \{h_j = 1/2\}_{j=1}^{|C|}$ and its local likelihood function $L^j$. Next, a sequential process starts with $c_1$ and ends with $c_{|C|}$, during which the value $l$ is computed, with each component $c_j$ updating it in turn. Then $c_{|C|}$ broadcasts the final value $l$ to the other components, and a parallel process follows, in which each component calculates its partial gradient and updates $h_j$ accordingly. The updated value of $h_j$ is broadcast to all components, to ensure the values in $H$ are the same for all components.
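Before turning to the pseudocode, Definition 6 can be checked numerically on the running example. The following sketch evaluates each component's local likelihood for $\Delta_2 = \{c_1, c_3\}$ with all $h_j$ initialized to $1/2$ (function names are illustrative):

```python
from math import prod

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]
delta = {0, 2}                 # diagnosis Δ2 = {c1, c3}, 0-based
h = [0.5, 0.5, 0.5]

def term(i):
    """L_i per Eq. (1), restricted to the diagnosis components."""
    p = prod(h[j] for j in delta if S[i][j] == 1)
    return p if E[i] == 0 else 1 - p

def local_likelihood(j):
    """L^j: product of the terms of the rows component j participates in."""
    return prod(term(i) for i in range(len(S)) if S[i][j] == 1)

print([local_likelihood(j) for j in range(3)])  # [0.0625, 0.25, 0.125]
```

The three values are 1/16, 1/4, and 1/8, which are the $l^j$ entries of Table 4.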
Algorithm 4 details the ranking algorithm from the perspective of component $c_j$, which receives as input a global diagnosis $\Delta$. First, $c_j$ initializes the array of health values $H = \{h_j = 1/2\}_{j=1}^{|C|}$, its local likelihood function $L^j$, and a threshold $\epsilon$ for process termination (Line 1). At the beginning, component $c_1$ sets the initial values of $l$ and $l_{prev}$ (Lines 2-3). Then a gradient descent loop follows (Lines 4-16), which halts when $l$ has converged (Line 4). In each iteration, $c_j$ waits to receive the updated value $l$ from the previous component (Line 7). $c_j$ then extends it by multiplying it with all the terms $L^j_k$ of its local likelihood function $L^j$ that were not observed by previous components, and evaluates the resulting function (Lines 8-9). Then $c_j$ sends $l$ to $c_{j+1}$ and waits to receive the final $l$ from $c_{|C|}$ (Line 13). In case $j = |C|$, $c_j$ broadcasts the updated $l$ to all the components (Lines 10-11). This concludes the sequential stage of calculating the value $l$. $c_j$ then uses $l$ to calculate its own $h_j$ (Lines 14-15). It does so by dividing $l$ by $l^j$ to obtain the value of $l^j_{missing}$.

| $j$ | $H$ | $L^j$ | $l^j$ | $derivative_j$ | extended $l$ | $l$ | $l^j_{missing}$ | $gradient_j$ | next $H$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1/2, 1/2, 1/2 | $(1-h_1)(1-h_1)(h_1 h_3)$ | 1/16 | -1/8 | 1/16 | 1/32 | 1/2 | -1/16 | 7/16, 1/2, 1/2 |
| 2 | 1/2, 1/2, 1/2 | $(1-h_1)(1-h_3)$ | 1/4 | 0 | 1/32 | 1/32 | 1/8 | 0 | 1/2, 1/2, 1/2 |
| 3 | 1/2, 1/2, 1/2 | $(1-h_3)(h_1 h_3)$ | 1/8 | 0 | 1/32 | 1/32 | 1/4 | 0 | 1/2, 1/2, 1/2 |

Table 4: Example showing a single iteration of Algorithm 4 for ranking the diagnosis $\Delta_2 = \{c_1, c_3\}$. Each row shows the different values that each component possesses. In the beginning the components have columns 2-5. Column 6 is calculated sequentially, and once it is done, each component uses it to compute the values in columns 7-10 in parallel.

```
Algorithm 4: Rank
Input: cj - the current component
Input: Δ - a global diagnosis
Result: p - the probability of Δ
1  H, L^j ← init(Δ, |C|) ; ϵ ← 0.005
2  if j = 1 then
3      l ← 0, lprev ← 1
4  while |l − lprev| > ϵ do
6      if j ≠ 1 then
7          l ← receive_prev(cj−1)
8      L^j_extended ← l · ∏_{k: L^j_k unobserved by c1..cj−1} L^j_k
```
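To make the numbers in Table 4 concrete, the following sketch simulates a single iteration of this ranking step for $\Delta_2 = \{c_1, c_3\}$, with the assumptions made explicit: the sequential message pass is a plain loop, the derivative of $L^j$ is taken numerically, and the update uses a unit step ($h_j \leftarrow h_j + derivative_j \cdot l^j_{missing}$); all names are illustrative:

```python
from math import prod

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]
delta = {0, 2}                      # diagnosis Δ2 = {c1, c3}, 0-based
h = [0.5, 0.5, 0.5]

def term(i, hv):
    """L_i per Eq. (1), restricted to the diagnosis components."""
    p = prod(hv[j] for j in delta if S[i][j] == 1)
    return p if E[i] == 0 else 1 - p

def local_l(j, hv):
    """l^j: value of component j's local likelihood function."""
    return prod(term(i, hv) for i in range(len(S)) if S[i][j] == 1)

# Sequential stage: each component multiplies in the terms of the rows it
# is the first (in the fixed order) to observe; the result is broadcast.
l = 1.0
for j in range(3):
    for i in range(len(S)):
        if S[i][j] == 1 and not any(S[i][k] == 1 for k in range(j)):
            l *= term(i, h)
# l == 1/32, matching column "l" of Table 4

# Parallel stage: each component takes a gradient step on its own h_j.
new_h = h[:]
for j in range(3):
    hp = h[:]
    hp[j] += 1e-6
    derivative = (local_l(j, hp) - local_l(j, h)) / 1e-6
    missing = l / local_l(j, h)     # l^j_missing = l / l^j
    new_h[j] = h[j] + derivative * missing
# new_h is approximately [7/16, 1/2, 1/2], Table 4's "next H"
```

Only $h_1$ moves in this iteration, because the local likelihoods of $c_2$ and $c_3$ have zero partial derivative with respect to their own health values at $h = (1/2, 1/2, 1/2)$.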