# Distributed Spectrum-Based Fault Localization

Avraham Natan, Roni Stern, Meir Kalech
Ben-Gurion University of the Negev, Israel
avinat123@gmail.com, roni.stern@gmail.com, kalech@bgu.ac.il

**Abstract.** Spectrum-Based Fault Localization (SFL) is a popular approach for diagnosing faulty systems. SFL algorithms are inherently centralized: observations are collected and analyzed by a single diagnoser. Applying SFL to diagnose distributed systems is challenging, especially when communication is costly and there are privacy concerns. We propose two SFL-based algorithms that are designed for distributed systems: one for diagnosing a single faulty component and one for diagnosing multiple faults. We analyze these algorithms theoretically and empirically. Our analysis shows that the distributed SFL algorithms we developed output diagnoses identical to those of centralized SFL while preserving privacy.

## 1 Introduction

Spectrum-Based Fault Localization (SFL) is a popular approach for diagnosing faulty systems. Primarily used in software diagnosis (Abreu, Zoeteweij, and Van Gemund 2009), SFL aims to identify faults in programs. To achieve this, SFL models the program as a set of components and runs tests, each of which executes some of the components. This activity is recorded in an activity matrix (the spectrum), and the test outcomes are recorded in an error vector. Some SFL algorithms diagnose single faults, while others diagnose multiple faults.

SFL algorithms are centralized. However, many systems are inherently distributed, such as software systems, multi-agent systems, and communication networks. Applying centralized SFL techniques to diagnose such distributed systems is challenging, especially when communication is costly and privacy must be considered: first, information on the activity of a component may be hard to obtain or restricted.
Second, privacy restrictions or communication load might challenge the performance of a centralized algorithm.

The contribution of this paper is: (1) We define the problem of Distributed SFL (DSFL). (2) We propose two SFL-based diagnosis algorithms for distributed systems: one for single faults (DSFLA-SINGLE) and one for multiple faults (DSFLA-MULTI). (3) We provide empirical and theoretical analysis of soundness, completeness, privacy, communication load, and runtime. We show that the distributed algorithms output diagnoses identical to those of the centralized ones while preserving privacy. We also evaluate three variations of DSFLA-MULTI, and show that with a small reduction in diagnosis quality, DSFLA-MULTI performs better in terms of privacy, communication load, and runtime.

Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## 2 Background and Related Work

Our work builds on prior work on SFL, so we provide here a brief background on SFL. For a more comprehensive background see Abreu, Zoeteweij, and Van Gemund (2009).

**Definition 1 (SFL Problem).** An SFL problem is defined by a tuple $\langle C, R, S, E \rangle$ where $C$ is a set of components, each of which may be faulty; $R$ is a set of runs executed on the system; $S$ is a binary $|R| \times |C|$ matrix where $S_{i,j} = 1$ denotes that component $c_j$ participated in run $r_i$; and $E$ is a vector of length $|R|$ where $E_i = 1$ denotes that $r_i$ failed, and $E_i = 0$ otherwise. An SFL problem arises when $\exists i : E_i = 1$.

The matrix $S$ is called the spectrum and the vector $E$ is called the error vector. Table 1a shows the spectrum and error vector for a system with 3 components ($c_1, c_2, c_3$) that is executed 4 times ($r_1, r_2, r_3, r_4$). Consider the second row, for example: components $c_2$ and $c_3$ were involved, and the run failed.

A solution to an SFL problem is a set of diagnoses and a ranking function to rank them. A diagnosis is a set of components that explains all the failed runs.
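Definition 1 maps directly onto code. The following is a minimal sketch (the data is Table 1a; the predicate mirrors the condition $\exists i : E_i = 1$ under which an SFL problem arises; all names are illustrative):

```python
# Spectrum S (rows r1..r4, columns c1..c3) and error vector E from Table 1a.
# S is a binary |R| x |C| matrix; E is a 0/1 vector of run outcomes.
S = [
    [1, 1, 0],  # r1: c1, c2 involved; failed
    [0, 1, 1],  # r2: c2, c3 involved; failed
    [1, 0, 0],  # r3: c1 involved; failed
    [1, 0, 1],  # r4: c1, c3 involved; passed
]
E = [1, 1, 1, 0]

def is_sfl_problem(E):
    """An SFL problem arises when at least one run failed (some E_i = 1)."""
    return any(e == 1 for e in E)

print(is_sfl_problem(E))  # True
```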
A diagnosis $\Delta$ explains a failed run if at least one of the components in $\Delta$ participated in the failed run according to the spectrum $S$. Formally:

**Definition 2 (SFL Diagnosis).** A set of components $\Delta \subseteq \{1, \ldots, |C|\}$ is a diagnosis for an SFL problem $\langle C, R, S, E \rangle$ if $\forall i : E_i = 1 \rightarrow \exists j \in \Delta : S_{i,j} = 1$.

Some SFL algorithms diagnose single faults, while others diagnose multiple faults.

**Single fault SFL:** In this case, a component $c_j$ is a diagnosis if it participated in at least one failed run. Diagnosis ranking uses similarity coefficients (Hofer et al. 2015), which are evaluated using four similarity counters $n_{pq}(j)$, $p, q \in \{0, 1\}$, defined for every $c_j$ as $n_{pq}(j) = |\{i \mid S_{i,j} = p \wedge E_i = q\}|$. Table 1b shows these counters with respect to Table 1a. For example, $n_{11}(2) = 2$ since the number of runs where $S_{i,2} = 1 \wedge E_i = 1$ is 2. Using the Ochiai similarity coefficient (Abreu, Zoeteweij, and Van Gemund 2007) with those values yields the probabilities $P(c_1) = 0.67$, $P(c_2) = 0.82$, $P(c_3) = 0.41$. The diagnosis set ranked by these probabilities is: $D = \{\langle \{c_2\}, 0.82 \rangle, \langle \{c_1\}, 0.67 \rangle, \langle \{c_3\}, 0.41 \rangle\}$.

*The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)*

(a)

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | 1 | 1 | 0 | 1 |
| $r_2$ | 0 | 1 | 1 | 1 |
| $r_3$ | 1 | 0 | 0 | 1 |
| $r_4$ | 1 | 0 | 1 | 0 |

(b)

|             | $c_1$ | $c_2$ | $c_3$ |
|-------------|-------|-------|-------|
| $n_{11}(j)$ | 2 | 2 | 1 |
| $n_{10}(j)$ | 1 | 0 | 1 |
| $n_{01}(j)$ | 1 | 1 | 2 |
| $n_{00}(j)$ | 0 | 1 | 0 |

Table 1: (a) Spectrum and error vector for a system of 3 components that is run 4 times, and (b) their similarity counters.

**SFL for multiple faults:** Barinel (Abreu, Zoeteweij, and Van Gemund 2009) is an SFL-based algorithm that addresses this case by combining SFL with model-based diagnosis (MBD). Barinel defines the set of components that participated in a failed run as a conflict. It then outputs diagnoses by computing a Minimal Hitting Set (MHS) of the set of conflicts. This approach is known in MBD as the conflict-directed approach (De Kleer and Williams 1987; Williams and Ragno 2007; De Kleer 2011). Diagnosis ranking uses a Bayesian approach, computing $P(\Delta|E) = \frac{P(E|\Delta) \cdot P(\Delta)}{P(E)}$, where $P(\Delta)$ is the prior of $\Delta$ and $P(E)$ is a normalization factor.
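The conflict extraction and MHS step just described can be sketched with a brute-force minimal-hitting-set computation. This exhaustive version is only for illustration; Barinel itself uses a dedicated staged MHS algorithm:

```python
from itertools import combinations

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]

# Each failed run contributes a conflict: the set of components it involved.
conflicts = [frozenset(j for j, s in enumerate(row, 1) if s == 1)
             for row, e in zip(S, E) if e == 1]

def minimal_hitting_sets(conflicts):
    """Brute-force MHS: enumerate candidate sets by size, keep those that
    hit every conflict and contain no smaller hitting set."""
    comps = sorted(set().union(*conflicts))
    hits = []
    for k in range(1, len(comps) + 1):
        for cand in combinations(comps, k):
            cs = frozenset(cand)
            if all(cs & c for c in conflicts) and not any(h <= cs for h in hits):
                hits.append(cs)
    return hits

diagnoses = minimal_hitting_sets(conflicts)
# diagnoses are {1, 2} and {1, 3}, i.e. {c1, c2} and {c1, c3}
```

On Table 1a the conflicts are {c1, c2}, {c2, c3}, and {c1}, so the minimal hitting sets are exactly the two diagnoses used in the examples that follow.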
To compute the likelihood function $P(E|\Delta)$, $h_j$ is defined as the likelihood of faulty component $c_j$ behaving normally. Formally:

**Definition 3 (Likelihood Function).** Given a diagnosis $\Delta$, spectrum $S$ and error vector $E$, the likelihood function $L$ is defined as $L = P(E|\Delta) = \prod_{E_i \in E} L_i$, with each term $L_i$ defined for a row in the spectrum as the probability of all involved components behaving normally:

$$
L_i = P(E_i|\Delta) = \begin{cases} \prod_{j \in \Delta,\, S_{i,j}=1} h_j & \text{if } E_i = 0 \\ 1 - \prod_{j \in \Delta,\, S_{i,j}=1} h_j & \text{if } E_i = 1 \end{cases} \quad (1)
$$

Since $h_j$ is not known, a maximization algorithm such as gradient descent is applied to maximize $L$.

**Example 1.** In Table 1a the conflicts for runs $r_1, r_2, r_3$ are $\{c_1, c_2\}$, $\{c_2, c_3\}$, $\{c_1\}$, respectively. The minimal hitting sets $\Delta_1 = \{c_1, c_2\}$ and $\Delta_2 = \{c_1, c_3\}$ are the diagnoses. As an example, for $\Delta_2 = \{c_1, c_3\}$ the terms corresponding to the rows are:

$$
L_1 = P(E_1|\Delta_2) = (1 - h_1) \\
L_2 = P(E_2|\Delta_2) = (1 - h_3) \\
L_3 = P(E_3|\Delta_2) = (1 - h_1) \\
L_4 = P(E_4|\Delta_2) = h_1 h_3
$$

And their product is:

$$
L = (1 - h_1) \cdot (1 - h_3) \cdot (1 - h_1) \cdot (h_1 h_3) \quad (2)
$$

Maximizing $L$ with gradient descent yields $L = 0.161$.

### 2.1 Related Work

SFL has been studied extensively in recent years, with different works improving it and demonstrating its usefulness. One work addresses scalability challenges that arise due to the computational overhead of computing the spectrum (Perez, Abreu, and Riboira 2014). The authors propose dynamic code coverage (DCC), which dynamically increases the granularity of the components; their results show the usefulness of this approach. A different improvement to SFL incorporates a software fault prediction model (Elmishali, Stern, and Kalech 2016) to improve diagnosis ranking, with results showing a significant improvement in diagnosis accuracy and troubleshooting efficiency. Another work (Elmishali, Stern, and Kalech 2018) shows how SFL can serve as the diagnosis part of a novel Learn, Diagnose and Plan (LDP) paradigm, and how SFL improves when it is part of LDP. A later study shows SFL usage in diagnosing system exploits (Elmishali, Stern, and Kalech 2020).
There, the components are blocks of code, and a trace of the tests using them is recorded. The authors use a fuzz testing tool and an under-tracing technique to enhance this method and improve diagnosis accuracy; the method's validity is shown on different software projects. Another work demonstrates SFL for multi-agent systems (Passos, Abreu, and Rossetti 2015). The authors extend SFL to address agent-specific features such as agent autonomy, and show promising results of roughly 96.96% diagnosis accuracy. Some work addresses distributed diagnosis with different approaches such as inter-level communication (Pérez Zúñiga et al. 2018), fuzzy fault isolation (Syfert, Bartyś, and Kościelny 2018) and structural analysis (Pérez-Zúñiga et al. 2022). A distributed approach to SFL, however, has not been proposed previously.

## 3 Methodology

In this section we address the problem of Distributed Spectrum-Based Fault Localization (DSFL).

### 3.1 Problem Definition

The approaches presented above rely on a central solver to generate and rank the diagnoses. Applying them to distributed systems raises challenges such as a single point of failure, communication load, and privacy. To address the latter two, we assume no central solver. Instead, every component has visibility only of the execution runs in which it participates, captured by a local spectrum and a local error vector. Formally:

**Definition 4 (Local Spectrum and Local Error Vector).** Given spectrum $S$, error vector $E$ and component $c_j$, the local spectrum $S^j$ of $c_j$ is $S^j = \{S_{i,\cdot} \mid S_{i,j} = 1\}$, and the local error vector $E^j$ of $c_j$ is $E^j = \{E_i \mid S_{i,j} = 1\}$.

(a) $S^1, E^1$

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | 1 | 1 | 0 | 1 |
| $r_2$ | - | - | - | - |
| $r_3$ | 1 | 0 | 0 | 1 |
| $r_4$ | 1 | 0 | 1 | 0 |

(b) $S^2, E^2$

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | 1 | 1 | 0 | 1 |
| $r_2$ | 0 | 1 | 1 | 1 |
| $r_3$ | - | - | - | - |
| $r_4$ | - | - | - | - |

(c) $S^3, E^3$

|       | $c_1$ | $c_2$ | $c_3$ | $E$ |
|-------|-------|-------|-------|-----|
| $r_1$ | - | - | - | - |
| $r_2$ | 0 | 1 | 1 | 1 |
| $r_3$ | - | - | - | - |
| $r_4$ | 1 | 0 | 1 | 0 |

Table 2: Local spectra $S^1, S^2, S^3$ and error vectors $E^1, E^2, E^3$ corresponding to components $c_1, c_2, c_3$.

Tables 2a, 2b, 2c show the local spectra and error vectors of components $c_1, c_2, c_3$. Solving the SFL problem using a single local spectrum may output wrong diagnoses.
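Definition 4 can be sketched by filtering the global spectrum. In this sketch (helper names are illustrative), run indices are kept so the result can be compared against Table 2:

```python
S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]

def local_view(S, E, j):
    """Local spectrum and error vector of component j:
    the rows i with S[i][j] == 1, keyed by run index."""
    Sj = {i: row for i, row in enumerate(S) if row[j] == 1}
    Ej = {i: E[i] for i in Sj}
    return Sj, Ej

S2, E2 = local_view(S, E, 1)  # component c2 (0-based column 1)
# c2 sees only runs r1 and r2, both failed, matching Table 2b
```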
To that end we define the Distributed SFL (DSFL) problem:

**Definition 5 (DSFL problem).** A DSFL problem is defined as $\langle C, R, \{S^j\}_{j=1}^{|C|}, \{E^j\}_{j=1}^{|C|} \rangle$, where $C$ is a set of components, $R$ is a set of runs, and $S^j$ and $E^j$ are the local spectrum and error vector of component $c_j$. A DSFL problem arises when $\exists j, i : E^j_i = 1$.

A solution to DSFL is a diagnosis for the joint SFL problem defined by the union of the local spectra and local error vectors. A naive solution collects the information from the components at a single solver. Such an approach may have high communication costs, and it neglects privacy. We discuss the topic of privacy loss in distributed SFL in Section 4. In the following sections we describe distributed versions of the SFL algorithms that address privacy for single faults and for multiple faults. The components share information in order to reach the same output as the centralized algorithms, while minimizing the private information revealed.

### 3.2 Distributed SFL for Single Fault Problems

Here we present our distributed algorithm for diagnosing single faults (DSFLA-SINGLE). Since the algorithm handles single faults, each component is a potential diagnosis and we only need to present the ranking. Each component $c_j$ should calculate its similarity counters $n_{11}(j), n_{10}(j), n_{01}(j), n_{00}(j)$ and use them with a similarity coefficient. The challenge is that a component knows only the outcomes of the runs it participates in, as shown in Table 2, making it able to calculate only $n_{11}(j)$ and $n_{10}(j)$. To address this challenge, each component $c_j$ requests from the other components information about $n_{01}(j)$ and $n_{00}(j)$, as indicated by their spectra.

```
Algorithm 1: DSFLA-SINGLE
Input: cj - the current component
Result: P - probability of diagnosis Δ = {cj}
1  n11(j), n10(j), n01(j), n00(j) ← 0, 0, 0, 0
2  for i ∈ S^j do
3      q ← E^j_i ; n1q(j) ← n1q(j) + 1
4  for j' ∈ [1, ..., |C|] s.t. j' ≠ j do
5      nn01, nn00 ← request_missing(cj, cj')
6      n01(j), n00(j) ← n01(j) + nn01, n00(j) + nn00
7  P ← similarity(n11(j), n10(j), n01(j), n00(j))
```
This information is summed up by $c_j$ to obtain the true values of $n_{01}(j)$ and $n_{00}(j)$. Algorithm 1 presents the process of ranking component $c_j$. The component initializes $n_{pq}(j)$ (Line 1) and calculates $n_{1q}(j)$ using the rows in its local spectrum (Lines 2-3). Then, the component requests information from the other components (Lines 4-6) by calling the request_missing procedure (detailed in Algorithm 2). Note that these request messages are sent to the other components in order of their appearance in the spectrum. To allow this, we assume the components have a predefined order. This allows a component to avoid sending data it knows was already sent by previous components. Finally, a similarity coefficient is used to calculate the rank.

```
Algorithm 2: request_missing
Input: cj - the component requesting the data
Input: cj' - the component returning the data
Result: nn01, nn00 - counter data
1  nn01, nn00 ← 0, 0
2  for i ∈ S^j' do
3      if ¬(∃j'' < j' s.t. S^j'_{i,j''} = 1) ∧ S^j'_{i,j} = 0 then
4          q ← E^j'_i ; nn0q ← nn0q + 1
5  return nn01, nn00
```

Algorithm 2 describes the request_missing method, in which component $c_{j'}$ provides the data requested by $c_j$. For every row in its local spectrum, $c_{j'}$ checks whether $c_j$ has already received the information about this row from a previous component, or whether $c_j$ has this row in its own local spectrum (Line 3). If neither is the case, $c_{j'}$ updates either $nn_{01}$ or $nn_{00}$ according to the error vector (Line 4). The reason $c_{j'}$ handles only rows for which it is the first component to observe the row is to avoid duplicate reports by different components on the same rows.

**Example 2.** We demonstrate how component $c_2$ calculates its similarity counters using its local spectrum, presented in Table 2b. After executing Lines 2-3 of Algorithm 1, $c_2$ has $n_{11}(2) = 2$ and $n_{10}(2) = 0$. Next, it requests information from $c_1$ and $c_3$, in that order. $c_1$ considers runs $r_1, r_3, r_4$ of $S^1$ (Table 2a). Since $S^1_{1,2} = 1$, run $r_1$ is skipped, since it means that $c_2$ has the same run in $S^2$.
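The counter exchange of Algorithms 1 and 2 can be simulated centrally. In the following sketch, request_missing is an ordinary function call reading the responder's local view (in the protocol it is a message exchange), Ochiai is assumed as the similarity coefficient of Line 7, and all other names are illustrative:

```python
from math import sqrt

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]

def local_runs(j):
    """Run indices in component j's local spectrum."""
    return [i for i, row in enumerate(S) if row[j] == 1]

def request_missing(j, jp):
    """Component jp reports to j the runs that j does not observe and that
    no component earlier than jp in the fixed order has observed."""
    nn01 = nn00 = 0
    for i in local_runs(jp):
        seen_earlier = any(S[i][k] == 1 for k in range(jp))
        if not seen_earlier and S[i][j] == 0:
            if E[i] == 1:
                nn01 += 1
            else:
                nn00 += 1
    return nn01, nn00

def rank(j):
    """Algorithm 1: local counters, then requests, then Ochiai."""
    n11 = sum(E[i] for i in local_runs(j))
    n10 = len(local_runs(j)) - n11
    n01 = n00 = 0
    for jp in range(len(S[0])):
        if jp != j:
            d01, d00 = request_missing(j, jp)
            n01, n00 = n01 + d01, n00 + d00
    return n11 / sqrt((n11 + n01) * (n11 + n10))

print([round(rank(j), 2) for j in range(3)])  # [0.67, 0.82, 0.41]
```

The ranks reproduce the centralized probabilities of the single-fault example, and component $c_2$'s request to $c_1$ yields $nn_{01} = 1$, $nn_{00} = 1$ while its request to $c_3$ yields nothing, as in Example 2.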
Runs $r_3$ and $r_4$ show that $c_2$ does not have them, so $c_1$ uses them to calculate $nn_{01} = 1$, $nn_{00} = 1$. The same request is sent to $c_3$. $c_3$ considers its local spectrum and finds that each of its runs is either already in $c_2$'s local spectrum or has already been addressed by $c_1$, so it returns nothing. At the end of the process, $c_2$ has the same counter values as shown in Table 1b, and it uses them to calculate the likelihood of it being faulty with a similarity coefficient.

### 3.3 Distributed SFL for Multiple Fault Problems

Here we present our algorithm for diagnosing a distributed system with multiple faults (DSFLA-MULTI). It first generates diagnoses and then ranks them.

**Diagnosis Generation.** The components generate diagnoses in two stages. First, each component calculates local diagnoses by applying a Minimal Hitting Set (MHS) algorithm (De Kleer 2011; Rodler 2022) to its local spectrum. This is done in parallel by all components. Then, $c_1$ sends the set of local diagnoses it computed to $c_2$. Subsequently, $c_2$ refines its set of local diagnoses based on the diagnoses it received from $c_1$. This continues sequentially, with the last component holding a complete set of diagnoses. This set is identical to the one generated by the centralized version of the algorithm.

Algorithm 3 lists the pseudo-code for the algorithm described above, from the perspective of component $c_j$. First, $c_j$ computes the local diagnosis set LD by executing an MHS algorithm on its local spectrum (Lines 1-2). Then, it receives a set of previously calculated diagnoses PD from $c_{j-1}$ (Line 3); if $j = 1$ then PD $= \emptyset$. Next, $c_j$ adds PD and LD as elements of a set and computes its hitting set to get a combined global diagnosis set GD (Line 4). GD is then refined by removal of duplicate diagnoses and supersets (Line 5). The generated diagnoses are sent to the next component, or output by the last component (Lines 6-8). Before continuing with the running example, we demonstrate how Lines 4-5 of Algorithm 3 work.
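One way to realize these two lines in code is the following sketch. Since MHS over the two-element set {PD, LD} amounts to choosing one diagnosis from each set, the unify-and-refine step reduces to pairwise unions, followed by dropping duplicates and supersets (the helper name is illustrative):

```python
from itertools import product

def combine(PD, LD):
    """Lines 4-5 of Algorithm 3: MHS({PD, LD}) followed by refine."""
    if not PD:                      # first component: nothing received yet
        candidates = {frozenset(d) for d in LD}
    else:
        candidates = {frozenset(a) | frozenset(b)
                      for a, b in product(PD, LD)}
    GD = []                         # refine: drop duplicates and supersets
    for c in sorted(candidates, key=len):
        if not any(g < c for g in GD):
            GD.append(c)
    return GD

GD = combine([{1}], [{2}, {1, 3}])  # -> {1, 2} and {1, 3}
```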
Line 4 applies MHS to a two-element set whose elements are the sets LD and PD, producing a set of sets of sets; Line 5 refines the result by unifying the inner sets and then filtering out supersets. We demonstrate this with a short example. Suppose that LD = {{2}, {1, 3}} and PD = {{1}}. MHS({LD, PD}) returns the set GD = {{{2}, {1}}, {{1, 3}, {1}}}. After unifying and removing duplicates we obtain GD = {{1, 2}, {1, 3}}.

```
Algorithm 3: Diagnose
Input: cj - the current component
Result: GD - a set of global diagnoses
1  LC ← local_conflicts(S^j)
2  LD ← MHS(LC)
3  PD ← receive_diagnoses(cj-1)
4  GD ← MHS({PD, LD})
5  GD ← refine(GD)
6  if j = |C| then
7      return GD
8  diagnose(cj+1, GD)
```

| $j$ | LD | PD | GD |
|-----|----|----|----|
| 1 | {{1}} | $\emptyset$ | {{1}} |
| 2 | {{2}, {1, 3}} | {{1}} | {{1, 2}, {1, 3}} |
| 3 | {{2}, {3}} | {{1, 2}, {1, 3}} | {{1, 2}, {1, 3}} |

Table 3: Example of one diagnosis generation process.

**Example 3.** Table 3 demonstrates the iterative diagnosis generation process described in Algorithm 3. Each of the three rows is executed by the component with the corresponding number, shown in column 1. Column 2 shows the local diagnoses (LD), computed in parallel. Columns 3-4 present the sequential process of receiving the previous diagnoses (PD) and refining them (GD) at each component. The example is run on 3 components, each of which has one of the spectra shown in Table 2. First, considering its local spectrum, each component calculates its local diagnoses $LD_j$ (column 2). Then $c_1$ starts with a previous diagnosis set $PD_1 = \emptyset$, which leads to $GD_1 = LD_1$. Next, $c_1$ sends $GD_1$ to $c_2$ as $PD_2$. $c_2$ then executes MHS on the set $\{PD_2, LD_2\}$ to get $GD_2 = \{\{1, 2\}, \{1, 3\}\}$, and sends it to $c_3$ as $PD_3$. $c_3$ then executes MHS on the set $\{PD_3, LD_3\}$ to get $GD_3 = \{\{1, 2\}, \{1, 3\}\}$. Here the diagnosis process stops; the diagnoses are $\Delta_1 = \{c_1, c_2\}$ and $\Delta_2 = \{c_1, c_3\}$.

**Ranking Diagnoses.** Ranking is also challenging, since the components do not possess the entire likelihood function $L$ (Def.
3), so none of them can maximize it by itself. To address this, we define the local likelihood function for component $c_j$:

**Definition 6 (Local Likelihood Function).** Given a diagnosis $\Delta$, component $c_j$, local spectrum $S^j$ and local error vector $E^j$, the local likelihood function $L^j$ is: $L^j = P(E^j|\Delta) = \prod_{E^j_i \in E^j} L^j_i$, with $L^j_i$ defined as $L_i$ in Def. 3, but with respect to $S^j$ and $E^j$.

Consider, for instance, $\Delta_2 = \{c_1, c_3\}$ (Ex. 3) and the spectra and error vectors in Table 2. The local likelihood functions of $c_1, c_2, c_3$ are presented in column 3 of Table 4. This raises a challenge in performing gradient descent as done in Barinel, since for each component $c_j$ some terms are missing ($L^j_{missing}$). As a result, a component cannot maximize $L$ by itself. On the other hand, reconstructing $L^j$ into the complete $L$ by exchanging the missing terms would allow the components to reconstruct the complete spectrum and thus reveal private information, since there is a bijective relation between $L_i$ and row $r_i$. To that end we propose a distributed version of gradient descent, in which the components share only two kinds of values: (1) the value of $L$, and (2) the values of $h_j \in H$. This ensures that the complete function $L$ is never known to any component. It is worth noting that $L$, $L^j$ and $L^j_{missing}$ are functions; throughout the demonstration of our algorithms, we denote the numerical values of these functions as $l$, $l^j$ and $l^j_{missing}$, respectively.

Given a diagnosis $\Delta$ to be ranked, each component $c_j$ initializes the group $H = \{h_j = 1/2\}_{j=1}^{|C|}$ and its local likelihood function $L^j$. Next, a sequential process starts with $c_1$ and ends with $c_{|C|}$, during which the value $l$ is computed, with each component $c_j$ updating it in turn. Then $c_{|C|}$ broadcasts the final value $l$ to the other components, and a parallel process follows, in which each component calculates its partial gradient and updates $h_j$ accordingly. The updated value of $h_j$ is broadcast to all components, to ensure the values in $H$ are the same for all components.
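Before turning to the pseudocode, Definition 6 can be checked numerically on the running example. The following sketch evaluates each component's local likelihood for $\Delta_2 = \{c_1, c_3\}$ with all $h_j$ initialized to $1/2$ (function names are illustrative):

```python
from math import prod

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]
delta = {0, 2}                 # diagnosis Δ2 = {c1, c3}, 0-based
h = [0.5, 0.5, 0.5]

def term(i):
    """L_i per Eq. (1), restricted to the diagnosis components."""
    p = prod(h[j] for j in delta if S[i][j] == 1)
    return p if E[i] == 0 else 1 - p

def local_likelihood(j):
    """L^j: product of the terms of the rows component j participates in."""
    return prod(term(i) for i in range(len(S)) if S[i][j] == 1)

print([local_likelihood(j) for j in range(3)])  # [0.0625, 0.25, 0.125]
```

The three values are 1/16, 1/4, and 1/8, which are the $l^j$ entries of Table 4.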
Algorithm 4 details the ranking algorithm from the perspective of component $c_j$, which receives as input a global diagnosis $\Delta$. First, $c_j$ initializes the array of health values $H = \{h_j = 1/2\}_{j=1}^{|C|}$, its local likelihood function $L^j$, and a threshold $\epsilon$ for process termination (Line 1). At the beginning, component $c_1$ sets the initial values of $l$ and $l_{prev}$ (Lines 2-3). Then a gradient descent loop follows (Lines 4-16), which halts when $l$ has converged (Line 4). In each iteration, $c_j$ waits to receive the updated value $l$ from the previous component (Line 7). $c_j$ then extends it by multiplying it with all the terms $L^j_k$ of its local likelihood function $L^j$ that were not observed by previous components, and evaluates the resulting function (Lines 8-9). Then $c_j$ sends $l$ to $c_{j+1}$ and waits to receive the final $l$ from $c_{|C|}$ (Line 13). In case $j = |C|$, $c_j$ broadcasts the updated $l$ to all the components (Lines 10-11). This concludes the sequential stage of calculating the value $l$. $c_j$ then uses $l$ to calculate its own $h_j$ (Lines 14-15). It does so by dividing $l$ by $l^j$ to obtain the value of $l^j_{missing}$.

| $j$ | $H$ | $L^j$ | $l^j$ | $derivative_j$ | extended $l$ | $l$ | $l^j_{missing}$ | $gradient_j$ | next $H$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1/2, 1/2, 1/2 | $(1-h_1)(1-h_1)(h_1 h_3)$ | 1/16 | -1/8 | 1/16 | 1/32 | 1/2 | -1/16 | 7/16, 1/2, 1/2 |
| 2 | 1/2, 1/2, 1/2 | $(1-h_1)(1-h_3)$ | 1/4 | 0 | 1/32 | 1/32 | 1/8 | 0 | 1/2, 1/2, 1/2 |
| 3 | 1/2, 1/2, 1/2 | $(1-h_3)(h_1 h_3)$ | 1/8 | 0 | 1/32 | 1/32 | 1/4 | 0 | 1/2, 1/2, 1/2 |

Table 4: Example showing a single iteration of Algorithm 4 for ranking the diagnosis $\Delta_2 = \{c_1, c_3\}$. Each row shows the different values that each component possesses. In the beginning the components have columns 2-5. Column 6 is calculated sequentially, and once it is done, each component uses it to compute the values in columns 7-10 in parallel.

```
Algorithm 4: Rank
Input: cj - the current component
Input: Δ - a global diagnosis
Result: p - the probability of Δ
1  H, L^j ← init(Δ, |C|) ; ϵ ← 0.005
2  if j = 1 then
3      l ← 0, lprev ← 1
4  while |l − lprev| > ϵ do
6      if j ≠ 1 then
7          l ← receive_prev(cj−1)
8      L^j_extended ← l · ∏_{k: L^j_k unobserved by c1..cj−1} L^j_k
```
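To make the numbers in Table 4 concrete, the following sketch simulates a single iteration of this ranking step for $\Delta_2 = \{c_1, c_3\}$, with the assumptions made explicit: the sequential message pass is a plain loop, the derivative of $L^j$ is taken numerically, and the update uses a unit step ($h_j \leftarrow h_j + derivative_j \cdot l^j_{missing}$); all names are illustrative:

```python
from math import prod

S = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]  # Table 1a
E = [1, 1, 1, 0]
delta = {0, 2}                      # diagnosis Δ2 = {c1, c3}, 0-based
h = [0.5, 0.5, 0.5]

def term(i, hv):
    """L_i per Eq. (1), restricted to the diagnosis components."""
    p = prod(hv[j] for j in delta if S[i][j] == 1)
    return p if E[i] == 0 else 1 - p

def local_l(j, hv):
    """l^j: value of component j's local likelihood function."""
    return prod(term(i, hv) for i in range(len(S)) if S[i][j] == 1)

# Sequential stage: each component multiplies in the terms of the rows it
# is the first (in the fixed order) to observe; the result is broadcast.
l = 1.0
for j in range(3):
    for i in range(len(S)):
        if S[i][j] == 1 and not any(S[i][k] == 1 for k in range(j)):
            l *= term(i, h)
# l == 1/32, matching column "l" of Table 4

# Parallel stage: each component takes a gradient step on its own h_j.
new_h = h[:]
for j in range(3):
    hp = h[:]
    hp[j] += 1e-6
    derivative = (local_l(j, hp) - local_l(j, h)) / 1e-6
    missing = l / local_l(j, h)     # l^j_missing = l / l^j
    new_h[j] = h[j] + derivative * missing
# new_h is approximately [7/16, 1/2, 1/2], Table 4's "next H"
```

Only $h_1$ moves in this iteration, because the local likelihoods of $c_2$ and $c_3$ have zero partial derivative with respect to their own health values at $h = (1/2, 1/2, 1/2)$.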