# Anticipatory Troubleshooting

Netanel Hasidi, Roni Stern, Meir Kalech
Ben Gurion University of the Negev, Be'er Sheva, Israel
{hasidi,sternron,kalech}@bgu.ac.il

Shulamit Reches
Jerusalem College of Technology, Jerusalem, Israel
shulamit.reches@gmail.com

## Abstract

Troubleshooting is the process of diagnosing and repairing a system that is behaving abnormally. Diagnostic and repair actions may incur costs, and traditional troubleshooting algorithms aim to minimize the costs incurred until the system is fixed. We propose an anticipatory troubleshooting algorithm, which is able to reason about both current and future failures. To reason about failures over time, we incorporate statistical tools from survival analysis that enable predicting when a failure is likely to occur. Incorporating this prognostic information in a troubleshooting algorithm enables (1) better fault isolation and (2) more intelligent decision making about which repair actions to employ to minimize troubleshooting costs over time.

## 1 Introduction

System failures are prevalent in practically all engineering fields, including automobiles, robots, information systems, and computer hardware. As systems become more complex, failures often become more common and maintenance costs tend to increase. Thus, automated diagnosis has been studied in the artificial intelligence field for several decades, with substantial progress and successful applications in spacecraft [Williams and Nayak, 1996], satellite decision support systems [Feldman et al., 2013], the automotive industry [Struss and Price, 2003], and spreadsheets [Jannach and Schmitz, 2014]. The output of diagnosis algorithms is a set of possible diagnoses, where each possible diagnosis is an explanation of the observed system failure.

While diagnosis, and in particular root-cause analysis, is the task of understanding what has happened in the past that has caused an observed failure, prognosis is the task of predicting what will happen in the future, and in our context when future failures will occur. In parallel to the vast literature on automated diagnosis, there has been much work on developing prognosis techniques for estimating the remaining useful life of components in a system. In particular, survival analysis is a sub-field of statistics in which various methods have been developed to generate survival curves of components, which are curves that plot the likelihood of a component to survive (not fail) as a function of the component's usage or age [Miller Jr, 2011]. Sophisticated machine learning techniques have been shown to be able to learn such survival curves from data [Eyal et al., 2014].

Consider the following illustrative example, in which a car does not start. The mechanic inspecting the car observes that the water level in the radiator is low. This suggests that the radiator is not functioning well, which is a possible explanation (a diagnosis) for why the car does not start. However, alternative diagnoses are that the ignition system is faulty or the battery is empty. Clearly, considering the age of the battery and the survival curve of batteries of the same type can provide valuable input to the mechanic in deciding on the most likely diagnosis and the consequent next troubleshooting action. The first main contribution of this work is in showing how prognosis and diagnosis, and in particular survival curves and automated diagnosis algorithms, can be integrated effectively.
Practically, we propose an improved diagnosis algorithm that considers both diagnostic information about the relation between sensor data and faults, as well as the likelihood of each component to fail given its age, obtained from the corresponding survival curves. Beyond diagnosis, we show that integrating survival curves into the model also assists the troubleshooting process. Troubleshooting is the process of diagnosing and repairing an observed failure. Diagnostic and repair actions may incur costs, such as the time spent observing internal components and the monetary cost of purchasing a new component to replace a faulty one. Troubleshooting algorithms aim to minimize the costs incurred until the system is fixed.

The second main contribution is to use prognosis tools, and in particular survival curves, to develop a troubleshooting algorithm that aims to minimize current troubleshooting costs and future maintenance costs. These maintenance costs include costs due to future failures, which would require additional troubleshooting sessions and perhaps system downtime. We refer to this type of troubleshooting, where future costs are also considered, as anticipatory troubleshooting, and propose an effective anticipatory troubleshooting algorithm. In particular, the specific dilemma our troubleshooting algorithm addresses is how to choose the most appropriate repair action, given a component that is identified as faulty. For example, fixing a faulty component may be cheaper than replacing it with a new one. On the other hand, a new component is less likely to fail in the near future. Our anticipatory troubleshooting algorithm leverages available survival curves to choose the appropriate repair action intelligently. We demonstrate experimentally the benefit of our approach on benchmark systems modeled by a Bayesian network, showing significant gains.

## 2 Related Work

Many works on automated troubleshooting are based on the seminal work of Heckerman et al. [1995] on Decision Theoretic Troubleshooting (DTT). Our paper can be viewed as a generalization of DTT by adding prognosis considerations. A decision theoretic approach integrating planning and diagnosis was applied to a real-world troubleshooting application [Warnquist et al., 2009; Pernestål et al., 2012]. In their setting a sequence of actions may be needed to perform repairs. For example, a vehicle may need to be disassembled to gain access to its internal parts. To address this problem, they used a Bayesian network for diagnosis and the AO* algorithm [Nilsson, 1982] as the planner. Torta et al. [2014] proposed the use of abstractions to improve the efficiency of troubleshooting. Friedrich and Nejdl [1992] proposed a troubleshooting algorithm aimed at minimizing the breakdown costs, a concept that roughly corresponds to a penalty incurred for every faulty output in the system and for every time step until the system is fixed.

The novelty of our work is two-fold. First, none of these previous works incorporated prognosis estimates into the troubleshooting algorithm (as we do in Section 4.1). Second, our anticipatory troubleshooting model and algorithm directly attempt to minimize costs incurred due to current and future failures. A related line of work is in prognostics, where diagnostic information is used to provide more accurate remaining useful life estimations [Ferreiro et al., 2012; Tobon-Mejia et al., 2012].
There is also work in prognostics on prognostic decision making, where maintenance activities are scheduled intelligently by using prognostic information [Balaban and Alonso, 2012]. By contrast, we aim to improve decision making for fixing a current fault, while also considering future faults. Somewhat similar is the work of Mc Naught and Zagorecki [2009], who considered several possible repair actions and their impact on future faults by unfolding a dynamic Bayesian network. However, they did not propose an actual method to choose which action to perform, as we do.

## 3 Troubleshooting

A system is composed of a set of components, denoted COMPS. A component C ∈ COMPS is either healthy or faulty, denoted by the health predicate h(C) or ¬h(C), respectively. The state of a system, denoted σ, is a conjunction of health literals defining for every component whether it is healthy or not.

A troubleshooting agent is an agent capable of performing sensing and repair actions. The agent's belief about the state of the system, denoted B, is a conjunction of health literals. We assume that the agent's knowledge is correct, i.e., if h(C) ∈ B then h(C) ∈ σ, and if ¬h(C) ∈ B then ¬h(C) ∈ σ. The agent's belief, however, may be incomplete, i.e., there may exist a C ∈ COMPS such that neither h(C) nor ¬h(C) is in B. A troubleshooting problem arises if the system is identified as faulty, e.g., by some fault detection mechanism.

| Action | System state (σ) | Agent's belief (B) |
|---|---|---|
| Start | {¬h(C1), h(C2), h(C3)} | {h(C3)} |
| Sense C2 | {¬h(C1), h(C2), h(C3)} | {h(C3), h(C2)} |
| Sense C1 | {¬h(C1), h(C2), h(C3)} | {h(C3), h(C2), ¬h(C1)} |
| Repair C1 | {h(C1), h(C2), h(C3)} | {h(C3), h(C2), h(C1)} |

Table 1: An example of a troubleshooting process.

We assume that such a mechanism exists, revealing to the agent whether the system is faulty or not. In this work we focus on the single-fault case, i.e., where a single component is faulty.

An action of the troubleshooting agent is a transition function, accepting and potentially modifying both the system state σ and the agent's belief B. We consider two types of actions: sense and repair. Each action is parametrized by a single component, where Sense C checks if C is healthy or not, and Repair C results in C being healthy.¹ Formally, applying Sense C does not modify σ, and updates B by adding h(C) if h(C) ∈ σ or adding ¬h(C) otherwise. Similarly, applying Repair C adds h(C) to both B and σ, and removes ¬h(C) from B and σ if it was there.

Definition 1 (Troubleshooting Problem (TP)) A TP is defined by the tuple P = ⟨COMPS, σ, B, A⟩ where (1) COMPS is the set of components in the system, (2) σ is the state of the system, (3) B is the agent's belief about the system state, and (4) A is the set of actions the troubleshooting agent is able to perform.

A TP arises if ∃C: ¬h(C) ∈ σ. A solution to a TP is a sequence of actions that results in a system state in which all components are healthy. A troubleshooting algorithm (TA) is an algorithm for guiding a troubleshooting agent faced with a TP. TAs are iterative: in every iteration the TA accepts the agent's current belief B as input and outputs a sense or repair action for the troubleshooting agent to perform. A TA halts when the sequence of actions it outputted forms a solution to the TP, i.e., when the system is fixed. The solution outputted by a TA τ for a TP P is denoted by τ(P). Both sense and repair actions incur a cost. The cost of an action a is denoted by cost(a). The cost of solving P using τ, denoted by cost(τ, P), is the sum of the costs of all actions in τ(P): cost(τ, P) = Σ_{a ∈ τ(P)} cost(a). TAs aim to minimize this cost.
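To make the model above concrete, here is a minimal Python sketch of the system state σ, the agent's belief B, the Sense/Repair transitions, and the cost accounting cost(τ, P) = Σ cost(a). It is an illustrative reconstruction, not the authors' implementation; all names (e.g., `TroubleshootingProblem`, `troubleshoot`, `rank`) are assumptions.

```python
# Minimal sketch of the troubleshooting model of Section 3 (illustrative only).
from dataclasses import dataclass, field

@dataclass
class TroubleshootingProblem:
    comps: list                                   # COMPS
    state: dict                                   # sigma: component -> True (healthy) / False (faulty)
    belief: dict = field(default_factory=dict)    # B: partial map, component -> known health value

    def sense(self, c):
        """Sense C: does not modify sigma, reveals C's health to the agent."""
        self.belief[c] = self.state[c]

    def repair(self, c):
        """Repair C: C becomes healthy in sigma and the agent knows it."""
        self.state[c] = True
        self.belief[c] = True

def troubleshoot(problem, rank, cost):
    """Generic single-fault TA loop: sense the unknown components (most
    suspicious first) until the faulty one is found, repair it, and
    accumulate cost(tau, P) = sum of the costs of the performed actions."""
    total = 0.0
    unknown = [c for c in problem.comps if c not in problem.belief]
    for c in sorted(unknown, key=rank):           # rank(c): lower = more suspicious
        problem.sense(c)
        total += cost["sense"]
        if problem.belief[c] is False:            # faulty component found
            problem.repair(c)
            total += cost["repair"]
            break
    return total
```

For instance, with a sense cost of 1 and a repair cost of 5, a ranking that tries C2 before C1 on the scenario of Table 1 accumulates 1 + 1 + 5 = 7, as in the worked example that follows.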
Consider the car diagnosis example mentioned earlier, in which there are three relevant components that may be faulty: the radiator (C1), the ignition system (C2), and the battery (C3). Assume that the radiator is the correct diagnosis, meaning the radiator is really faulty, and that the agent knows that the battery is not faulty. The corresponding system state σ and agent's belief B are: σ = {¬h(C1), h(C2), h(C3)} and B = {h(C3)}. Table 1 lists a solution to this TP, in which the agent first senses the ignition system, then the radiator, and finally repairs the radiator. Formally, τ(P) = {Sense C2, Sense C1, Repair C1}. If the cost of sense is one and the cost of repair is five, then the troubleshooting cost of this solution is 1+1+5=7.

¹ See Stern et al. [2016] for a discussion of actions that apply to a batch of components.

[Figure 1: An example of exponential survival curves.]

## 4 Troubleshooting with Survival Functions

If the cost of sense actions is much smaller than the cost of repair actions, then an intelligent troubleshooting algorithm would only repair components that were first identified as faulty as a result of a sense action. This simplifies the troubleshooting process: perform sense actions on components until a faulty component is found, and then repair it. The challenge is which component to sense first. To address this challenge, efficient troubleshooting algorithms use a diagnosis algorithm (DA). A DA outputs one or more diagnoses, where a diagnosis is a hypothesis about which components are faulty. Moreover, many DAs output for each diagnosis ω the likelihood that it is correct, denoted p(ω). These diagnosis likelihoods can be aggregated to provide an estimate of the likelihood that each component is faulty, denoted p(C) [Stern et al., 2015]. A reasonable troubleshooter can then choose to sense first the component most likely to be faulty.

Many DAs have been proposed in the literature. Most effective DAs use some prior knowledge about the diagnosed system to provide accurate diagnoses. Model-based diagnosis (MBD) is a classical approach to diagnosis in which an existing model of the system, together with observations of the system behavior, is used to infer diagnoses. Some MBD algorithms assume a system model that represents the system behavior using propositional logic and use logical reasoning to infer diagnoses that are consistent with the system model and observations [Feldman et al., 2010; Williams and Ragno, 2007; De Kleer and Williams, 1987]. In general, most MBD algorithms implicitly assume that the system model represents the relation between the system inputs (including sensors) and outputs, and the components' behavior. In this work we use a DA based on a Bayesian network (BN) that represents the probabilistic dependency between observations and the system health state. Next, we show how techniques from survival analysis allow augmenting such models with information about the age of each component and its implication on the likelihood of components to be faulty.

### 4.1 Integrating Survival Analysis into a DA

Every component C is associated with an age Age_C. Let T_C be a random variable representing the age at which C will fail. A survival function for C, denoted S_C(t), is the probability that C will survive until age t, meaning it will not fail before age t. Formally: S_C(t) = Pr(T_C ≥ t).
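As a concrete illustration, the short sketch below evaluates survival functions of the exponential form e^(−λt) plotted in Figure 1 and used later in the paper's experiments; the function name and the specific λ values are assumptions for illustration only.

```python
import math

def exponential_survival(lam):
    """Return S_C(t) = Pr(T_C >= t) under an exponential decay model e^(-lam * t)."""
    return lambda t: math.exp(-lam * t)

# Survival curves similar in shape to those plotted in Figure 1 (one curve per parameter value).
for lam in (0.09, 0.18, 0.45):                     # illustrative parameter values
    S = exponential_survival(lam)
    print(lam, [round(S(t), 2) for t in (1, 6, 11, 16, 21, 26, 31)])
```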
Survival functions can be obtained by analysis of the physics of the corresponding system or learned from past data [Eyal et al., 2014]. Figure 1 illustrates three possible survival curves generated by an exponential decay function e^(−λt), where λ is a parameter and t is the age (the x-axis). The y-axis represents the probability that a component will survive (= not fail) t time units (e.g., months). The three curves plotted in Figure 1 correspond to three values of the λ parameter.

The challenge is how to compute the probability that a component C caused a system failure given its age and survival function. In most systems, faulty components fail intermittently, i.e., a component may be faulty but act normally. Thus, the faulty component that caused the system to fail may have become faulty at some earlier time. Therefore, we estimate the probability that a component C of age Age_C caused the system failure by the probability that it has failed anytime before the current time. This probability is directly given by 1 − S_C(Age_C), denoted by F_C(Age_C).

Thus, for a given component C we have two estimates of the likelihood that it is faulty: one from the MBD algorithm (p(C)) and one from its survival curve (F_C(Age_C)). The MBD algorithm's estimate is derived from the currently observed system behavior and knowledge about the system's structure. The survival curve estimate is derived from knowledge about how such components tend to fail over time. We propose to combine these estimates to provide a more accurate and more informed diagnostic report.

A simplistic approach to combining these fault likelihood estimates is to use some weighted linear combination, such that the weights are positive and sum up to one. However, we argue that these estimates are fundamentally different: F_C(Age_C) is an estimate given a priori to the actual fault, while p(C) is computed by the MBD algorithm for the specific fault at hand, taking into consideration the currently observed system behavior. Indeed, MBD algorithms often require information about the prior probability of each component to be faulty when computing their likelihood estimates [De Kleer and Williams, 1987; González-Sanchez et al., 2011; Zamir et al., 2014; Mengshoel et al., 2010]. However, these priors are often set to be uniform, although it has been shown that setting such priors more intelligently can significantly improve diagnostic accuracy [Elmishali et al., 2016]. Therefore, we propose to use the fault likelihood estimation given by the survival curves as priors within the likelihood estimation computation done by the MBD algorithm.

Specifically, we experimented with an MBD that computes diagnoses by applying inference on a BN [Mengshoel et al., 2010]. The BN contains both health variables and other variables such as sensor readings. The values of the observable variables are set, and then the marginal of each health variable is computed by applying an inference algorithm on the BN. The Bayesian reasoning done by the inference algorithm requires a prior probability for each health variable. We propose to use S_C(Age_C) as this prior probability of being healthy (and normalize the fault probability over the remaining probability sum). Other ways to integrate survival curves into an MBD are also possible; the key contribution is that doing so is beneficial.

Consider our running example of a car that does not start. Figure 2 depicts a possible BN representing this example. Nodes Ig, B, and R correspond to the health variables for the ignition, battery, and radiator, respectively.
Node W corresponds to the water level variable, and node C corresponds to the observation that the car is not starting. The Conditional Probability Tables (CPTs) for all nodes except C are also displayed in Figure 2. The value of C deterministically depends on Ig, B, and R: only if all components are healthy can the car start. Modeling such a dependency (a logical OR) in a BN is trivial. Note that we disallow multiple faults (these are mapped to an N/A value of C).

[Figure 2: A simple BN for our car diagnosis running example, with the following CPTs.]

| R | P(R) | S(5) |
|---|---|---|
| OK | 0.99 | 0.76 |
| F | 0.01 | 0.24 |

| R | W | P(W\|R) |
|---|---|---|
| OK | High | 0.7 |
| OK | Low | 0.3 |
| F | High | 0.05 |
| F | Low | 0.95 |

| B | P(B) | S(12) |
|---|---|---|
| OK | 0.98 | 0.34 |
| F | 0.02 | 0.66 |

| Ig | P(Ig) | S(3) |
|---|---|---|
| OK | 0.99 | 0.64 |
| F | 0.01 | 0.36 |

Now, assume that the car does not start (C=False) and the water level is low (W=Low). Given this evidence we apply Bayesian reasoning to get the likelihood of each component to be faulty. In our case, the likelihood of Ig, B, or R to be faulty is 0.16, 0.33, and 0.52, respectively. Thus, a troubleshooter would sense R first. However, assume that the ages of the ignition (Ig), battery (B), and radiator (R) are 3, 12, and 5, respectively, and that they all follow an exponential survival curve e^(−0.09·t). Thus, according to the components' ages and survival curves the probability of Ig, B, and R to be faulty is 0.24, 0.66, and 0.36, respectively. Setting these probabilities instead of the original priors of the health nodes is shown in Figure 2 in the S(X) columns of the CPTs. Setting these priors dramatically affects the result of the Bayesian reasoning: now the probability of Ig, B, and R to be faulty is 0.16, 0.56, and 0.28, respectively. As a result, a troubleshooter that is aware of both the BN and the survival curves would choose to sense the battery (rather than the radiator).
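The posteriors in this example can be reproduced with a small enumeration over the single-fault hypotheses, consistent with the paper's single-fault assumption. The sketch below is an illustrative computation, not the authors' code; the fault priors are taken either from the default P(·) columns or from the S(X) columns of Figure 2 (the S(X) assignment is the one that reproduces the reported posteriors), and the resulting values match the text up to small rounding differences.

```python
# Enumeration-based posterior over single-fault hypotheses for the Figure 2 example.
# Evidence: the car does not start (C = False) and the water level is low (W = Low).

def fault_posteriors(fault_prior):
    """fault_prior: dict mapping component -> prior probability of being faulty."""
    p_w_low = {"R_faulty": 0.95, "R_ok": 0.3}      # P(W=Low | R) from Figure 2
    score = {}
    for faulty in ("Ig", "B", "R"):
        # Exactly one component is faulty; the others are healthy.
        s = fault_prior[faulty]
        for other in ("Ig", "B", "R"):
            if other != faulty:
                s *= 1.0 - fault_prior[other]
        # C=False is deterministic given a single fault, so only W contributes.
        s *= p_w_low["R_faulty"] if faulty == "R" else p_w_low["R_ok"]
        score[faulty] = s
    z = sum(score.values())
    return {c: round(v / z, 2) for c, v in score.items()}

# Default priors (P columns): {'Ig': 0.16, 'B': 0.33, 'R': 0.51} (text reports R as 0.52).
print(fault_posteriors({"Ig": 0.01, "B": 0.02, "R": 0.01}))
# Survival-based priors (S(X) columns): {'Ig': 0.16, 'B': 0.55, 'R': 0.29}
# (text reports 0.16, 0.56, 0.28), so the battery becomes the component to sense first.
print(fault_posteriors({"Ig": 0.36, "B": 0.66, "R": 0.24}))
```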
## 5 Anticipatory Troubleshooting

The main benefit of using survival functions in the context of troubleshooting is the ability to reason about future failures, with the goal of minimizing troubleshooting costs over time. Formally, let [0, T_limit] be the time range in which we aim to minimize troubleshooting costs. During this time range, components in the system may fail. When the system fails, a troubleshooting process is initiated, performing sense and repair actions until the system is fixed. The target function we wish to minimize is the sum of costs incurred due to actions performed by the troubleshooting agent in the time range [0, T_limit].² We refer to this sum of troubleshooting costs as the long-term troubleshooting cost, and call a troubleshooting algorithm that aims to minimize this cost an anticipatory troubleshooting algorithm.

² Note that we exclude from this process the task of fault detection, i.e., we assume that the troubleshooter is given a system that is not working properly and needs to be repaired.

When there is only a single sense action and a single repair action, there is no difference between an anticipatory troubleshooting algorithm and a troubleshooting algorithm only aiming to minimize the current troubleshooting costs. The difference between traditional troubleshooting and anticipatory troubleshooting is meaningful when there are multiple repair actions. In other words, after the troubleshooting algorithm has identified which component is faulty, the troubleshooter needs to decide which repair action to use to repair it. Next, we describe such a setting, where there are two possible repair actions: Fix and Replace.

### 5.1 Fix vs. Replace

Applying a Replace(C) action means that the troubleshooting agent replaces C with a new one. Applying a Fix(C) action means that the troubleshooting agent fixes C without replacing it. Both fix and replace are repair actions, in the sense that after performing them the component is healthy and the agent knows it, i.e., ¬h(C) is replaced with h(C) in both the system state and the agent's belief. However, we expect Fix and Replace to differ in two realistic aspects. First, Fix is expected to be cheaper than Replace. Second, after replacing a component its ability to survive is expected to be significantly higher than after it has been fixed, since the replaced component is new. To formalize this, let S_C(t, Age_C) be the survival curve of C after it was fixed at age Age_C, i.e., the probability of C to survive t time units after it was fixed, given that it was fixed at age Age_C:

S_C(t, Age_C) = Pr(T_C ≥ t + Age_C | C fixed at age Age_C)

We call such a survival function an after-fix survival function. Formally, the expected relations between fix and replace are:

∀C ∈ COMPS: cost(Fix(C)) < cost(Replace(C))   (1)

∀t ∈ [0, T_limit], ∀C ∈ COMPS: S_C(t, Age_C) < S_C(t)   (2)

Intuitively, fixing a faulty component is cheaper, but may result in future faults being more frequent. This embodies the main dilemma in anticipatory troubleshooting: weighing current troubleshooting costs (Fix is preferable) against potential future troubleshooting costs (Replace is preferable).

### 5.2 Choosing the Appropriate Repair Action

An exhaustive and almost optimal approach to choosing which repair action to perform is to discretize the time range [0, T_limit], model the problem as a Markov Decision Process (MDP), and apply an off-the-shelf MDP solver, as follows.

**Discretization.** The time range [0, T_limit] is partitioned into a non-overlapping set of equal-sized time ranges T = {T_0, ..., T_n}. Each T_i is referred to as a time step, and let Δt be the size of each time step.

**MDP modeling.** An MDP is defined by a state space S, a set of actions A, a reward function R(s, a), and a transition function Tr(s, a, s′). A state in our state space is defined by a tuple s = ⟨T_i, C, Curves, Ages⟩, representing a state in which component C was diagnosed as faulty at time step T_i, where Curves and Ages are vectors representing the survival curves and ages of all components in COMPS. C can be null, representing a state in which no component was faulty at time T_i. Recall that we only consider single-fault scenarios, i.e., at most one component is faulty at every time step. States for time T_{n+1} are terminal states. The set of actions A consists of three actions: Replace(C), Fix(C), and no-op, where no-op represents not doing any action. The reward function R(s, a) is minus the cost of the executed action, where the no-op action costs zero.

The state transition function is as follows. After any action, a state for time step T_i will transition to a state for T_{i+1}. The MDP transition function Tr(s, a, s′), which returns the probability of reaching state s′ after performing action a at state s, is defined as follows. Let s = ⟨T_i, C, Curves, Ages⟩ and s′ = ⟨T_j, C′, Curves′, Ages′⟩. The values of T_j, Curves′, and Ages′ are set deterministically by s and a: T_j = T_{i+1}, Curves′ is only updated after a Fix(C) action (replacing C's survival function with its after-fix curve), and Ages′ consists of all components being older by one time step, except for when C is replaced (in which case the age of C is set to zero). The uncertainty in the state transition is which component, if any, will be faulty in the next time step.
Let S_C′ and Age_C′ be the survival curve and age of C′ according to Curves′ and Ages′. The probability that C′ will fail during a specific time range T_j, given its survival curve, is:

Pr(T_C′ ∈ T_j) = S_C′(Age_C′ − Δt) − S_C′(Age_C′)

which is a standard computation in survival analysis: the probability of surviving before T_j (when the age of C′ was Age_C′ − Δt) minus the probability of surviving until T_j (when the age of C′ is Age_C′).

**Solving the MDP.** The state space of this MDP is exponential in the number of time steps reasoned about (n). Textbook MDP algorithms, such as Value Iteration, Policy Iteration, and AO* [Nilsson, 1982], will not work at this scale. Sophisticated sampling-based approaches, such as real-time dynamic programming (RTDP) [Barto et al., 1995], may be applicable. In our experiments we implemented a simple decision rule that roughly corresponds to reasoning about a single level of this MDP state space. We call this decision rule Decision Rule 1 (DR1), and explain it next.

Let C_replace = cost(Replace(C)), C_fix = cost(Fix(C)), and T_left be the time left until T_limit. Following DR1 means replacing a faulty component C iff the following inequality holds:

C_replace + (1 − S_C(T_left)) · C_replace ≤ C_fix + (1 − S_C(T_left, Age_C)) · C_replace   (3)

DR1 has the following property.

Proposition 1 DR1 is optimal if the following holds: (1) a component will not fail more than twice in the time range [0, T_limit], (2) a component can be fixed at most once, (3) a replaced component will not be fixed in the future, and (4) components fail independently.

The proof of Proposition 1 is omitted due to space constraints. DR1 is not optimal when the assumptions in Proposition 1 do not hold, but it can still be effective, as shown below.
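Before turning to the experiments, here is a minimal sketch of DR1 (Inequality 3). It assumes an exponential survival curve and models the after-fix curve with the punish-factor form S_C(t, Age_C) = (S_C(t))^P that the paper devises later for its experiments; the function names and the specific parameter values are illustrative assumptions.

```python
import math

def dr1_replace(c_replace, c_fix, t_left, survival, after_fix_survival):
    """Decision Rule 1: replace a faulty component iff the expected cost of
    replacing now (plus a possible second replacement if the new component
    fails before T_limit) is at most the expected cost of fixing now (plus a
    possible replacement if the fixed component fails before T_limit)."""
    replace_side = c_replace + (1.0 - survival(t_left)) * c_replace
    fix_side = c_fix + (1.0 - after_fix_survival(t_left)) * c_replace
    return replace_side <= fix_side

# Illustrative instantiation with an exponential curve and punish factor P = 2.
lam, P = 0.09, 2
survival = lambda t: math.exp(-lam * t)        # S_C(t) for a new (replaced) component
after_fix = lambda t: survival(t) ** P         # assumed after-fix curve (S_C(t))^P
print(dr1_replace(c_replace=5.0, c_fix=1.0, t_left=10,
                  survival=survival, after_fix_survival=after_fix))
# -> False: for this cost ratio and horizon, fixing is preferred.
```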
## 6 Experimental Results

To evaluate the proposed algorithms, we performed two sets of experiments: one-shot experiments, in which a single TP is solved (evaluating the method of Section 4.1), and long-term experiments, in which troubleshooting costs are accumulated over time (evaluating the method of Section 5.2).

### 6.1 Benchmark Systems

The experiments were performed over two systems, modeled using a Bayesian network (BN) following the standard use of BNs for diagnosis [Choi et al., 2011]. The first system, denoted S1, represents a real-world electrical power system. Its BN was generated automatically from a formal design [Mengshoel et al., 2010] and is publicly available (reasoning.cs.ucla.edu/samiam/tutorial/system.net). It has 26 nodes, of which 6 are health nodes. The second system, denoted S2, is the CAR DIAGNOSIS 2 network from the library of benchmark BNs made available by Norsys (www.norsys.com/netlib/Car Diagnosis2.dnet). The system represents a network for diagnosing a car that does not start, based on spark plugs, headlights, main fuse, etc. It contains 18 nodes, of which 7 are health nodes. Figure 3 shows the graphical representation of S2.

[Figure 3: BN for system S2.]

### 6.2 Survival Curves and Component Ages

Unfortunately, we did not have real data for the above systems from which to generate survival curves. Therefore, we used a standard exponential curve (defined earlier in this paper and illustrated in Figure 1) with λ = 0.09. Exponential curves are fundamental parametric models used in the survival analysis literature [Ibrahim et al., 2005]. We set the age of each component to be Age_init plus a random number between zero and Age_diff, where Age_init is a constant set arbitrarily to 0.3 and Age_diff is a parameter we varied in our experiments. The purpose of the Age_diff parameter is to control the possible impact of considering the components' survival functions: a small Age_diff results in all components having almost the same age, and thus the survival curves do not provide significant information to distinguish which component is more likely to be faulty.

### 6.3 One-Shot Experiments

In this set of experiments we generated random TPs (details below) and compared the performance of four TAs: (1) Random, which chooses randomly which component to sense; (2) BN-based, which chooses to sense the component most likely to be faulty according to the BN; (3) Survival-based, which chooses to sense the component most likely to be faulty according to its survival curve and age; and (4) Hybrid, which chooses to sense the component most likely to be faulty taking into consideration both the BN and the survival curves, as proposed in Section 4.1. Performance of a TA was measured by the troubleshooting costs incurred until the system is fixed. Since we only considered single-fault scenarios, we omitted the cost of the single repair action performed in each of these experiments, as all algorithms incur this cost.

Each TP was generated with a single faulty health node as follows. The values of non-health nodes in the BN that do not depend on any other node were set randomly according to their priors. We refer to these nodes as control nodes. Then, the age of each component was set as mentioned above, i.e., by sampling uniformly in the range [Age_init, Age_init + Age_diff]. Then, the CPT of every health node was modified to take the survival curve into account as described in Section 4.1, i.e., the prior of being healthy was set to S_C(Age_C). Next, we computed the marginal probability of each component to be faulty in this modified BN, and chose a single component to be faulty according to these computed probabilities. Finally, we sampled from the BN values for all remaining nodes (nodes that are neither control nor health nodes), given the values of the already set nodes. These nodes are called the sensor nodes, and a subset of them was revealed to the DA.

[Figure 4: One-shot experiment results, for S1 (left) and S2 (right).]

Figure 4 shows the troubleshooting cost for each of the algorithms, for different values of the Age_diff parameter, for S1 and S2. All results are averaged over 50 instances. Several trends can be observed. First, the proposed Hybrid TA outperforms all baseline TAs, thus demonstrating the importance of considering both survival curves and MBD. Second, as Age_diff grows, the performance of Survival improves, since the components' ages differ more, and thus considering them is more valuable. When Age_diff is minimal, the performance of Survival is similar to Random and worse than BN. BN is better since we provide it with evidence, namely the values of some sensor nodes (in the case of S1 we revealed 9 sensor nodes and for S2 we revealed 2 sensor nodes). We also experimented with different numbers of revealed nodes. As expected, revealing more nodes improves the performance of both BN and Hybrid. In conclusion, these results demonstrate that Hybrid is more robust than both Survival and BN, and either equals or outperforms them across all varied parameters.
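To summarize the setup of Sections 6.2-6.3 in code form, the sketch below shows how component ages and survival-based health priors could be generated. It is an illustrative reconstruction under stated assumptions, not the authors' code: the sampling of control and sensor nodes from the BN is omitted, and the faulty component is drawn directly from the survival-based fault weights, whereas the paper draws it from the marginals of the modified BN.

```python
import math
import random

LAMBDA, AGE_INIT = 0.09, 0.3

def survival(t):
    """Exponential survival curve S(t) = e^(-lambda * t) used in the experiments."""
    return math.exp(-LAMBDA * t)

def generate_one_shot_instance(components, age_diff):
    """Sample component ages, derive survival-based health priors S_C(Age_C),
    and draw a single faulty component (simplified, cf. Section 6.3)."""
    ages = {c: AGE_INIT + random.uniform(0.0, age_diff) for c in components}
    healthy_prior = {c: survival(a) for c, a in ages.items()}       # prior of h(C)
    fault_weight = {c: 1.0 - p for c, p in healthy_prior.items()}   # F_C(Age_C)
    faulty = random.choices(list(fault_weight), weights=list(fault_weight.values()))[0]
    return ages, healthy_prior, faulty

# Example: three components with Age_diff = 2.5 (one of the values shown in Figure 4).
print(generate_one_shot_instance(["Ig", "B", "R"], age_diff=2.5))
```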
### 6.4 Long-Term Experiments

In this set of experiments we generated random TPs over a period of 28 months (i.e., T_limit = 28), choosing when each component fails according to its survival function (details below). In each experiment we used one of the following TAs to solve the TPs that arise: (1) Always Fix (AF), in which faulty components are repaired using the Fix action; (2) Always Replace (AR), in which faulty components are repaired using the Replace action; and (3) Hybrid, in which DR1 (Section 5.2) was used to choose the appropriate repair action. The performance of each algorithm is measured by the sum of troubleshooting costs incurred when solving all the TPs that arose. Since the focus of these experiments is to study the Fix vs. Replace dilemma, we omitted the costs incurred due to Sense actions, and only measured the cost of the repair action used in every troubleshooting session, i.e., C_replace or C_fix.

To sample when a component will fail after it was fixed, and to compute the Hybrid TA, we require an after-fix survival function S_C(t, Age_C). Such functions can be given by domain experts or learned from past data, but neither was available to us. Therefore, we devised the following after-fix survival function: S_C(t, Age_C) = (S_C(t))^P, where P is a parameter we call the fix punish factor. This after-fix survival curve satisfies the intuitive requirement that a replaced component is more likely to survive longer than a component that was fixed (Equation 2). The punish-factor parameter P controls the difference between the after-fix and the regular survival function. Figure 1 shows the survival curves after applying a punish factor of 2 and 5. Another important parameter in this set of experiments is the ratio between C_replace and C_fix. We refer to this parameter as the cost ratio. We experimented with a range of values for these two parameters and studied their impact on the long-term costs.

[Figure 5: Long-term experiment results, for S1.]

Figure 5 shows the results of the long-term experiments on system S1. The x-axis shows different cost ratios, in buckets of punish-factor values. The y-axis shows the long-term troubleshooting costs. The impact of both parameters is clear. When the cost ratio is small, Fix is significantly cheaper than Replace, and thus the Always Fix (AF) algorithm performs best. Similarly, when the punish factor is very high, a fixed component is much more likely to fail than a replaced one, and thus Always Replace (AR) performs best. The Hybrid algorithm successfully chooses when to replace or fix in most parameter combinations. The same trends were also observed for system S2. Thus, even though the assumptions under which DR1 (= Hybrid) is optimal do not hold in our experiments (e.g., a component may fail more than twice), we see that using it allows an effective balance between AF and AR.

## 7 Conclusion

We suggested the use of prognosis tools, and in particular survival curves, to lower troubleshooting costs. Two scenarios are presented in which this integration of prognosis and diagnosis is useful: first, improving troubleshooting costs by using fault predictions from survival curves as priors in an MBD algorithm; second, developing an anticipatory troubleshooter that chooses whether a faulty component should be fixed or replaced by considering possible future troubleshooting costs. Experimental results over two BN-based system models show the benefits of the developed survival-curve-aware TAs. This work opens the possibility of future research on anticipatory troubleshooting.
For example, one direction is to consider a decaying model of long-term costs instead of pre-defining a time range [0, T_limit] in which to minimize costs.

## 8 Acknowledgments

This research was funded in part by ISF grant #363/12 to Meir Kalech.

## References

[Balaban and Alonso, 2012] Edward Balaban and Juan J Alonso. An approach to prognostic decision making in the aerospace domain. In Annual Conference of the PHM Society, 2012.

[Barto et al., 1995] Andrew G Barto, Steven J Bradtke, and Satinder P Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.

[Choi et al., 2011] Arthur Choi, A Darwiche, L Zheng, and Ole J Mengshoel. A tutorial on Bayesian networks for system health management. Machine Learning and Knowledge Discovery for Engineering Systems Health Management, 10(1):1–29, 2011.

[De Kleer and Williams, 1987] Johan De Kleer and Brian C Williams. Diagnosing multiple faults. AIJ, 32(1), 1987.

[Elmishali et al., 2016] Amir Elmishali, Roni Stern, and Meir Kalech. Data-augmented software diagnosis. In IAAI/AAAI, 2016.

[Eyal et al., 2014] Ayelet Eyal, Lior Rokach, Meir Kalech, Ofra Amir, Rahul Chougule, Rajkumar Vaidyanathan, and Kallappa Pattada. Survival analysis of automobile components using mutually exclusive forests. IEEE T. Systems, Man, and Cybernetics: Systems, 44(2):246–253, 2014.

[Feldman et al., 2010] Alexander Feldman, Gregory Provan, and Arjan van Gemund. Approximate model-based diagnosis using greedy stochastic search. JAIR, 38:371–413, 2010.

[Feldman et al., 2013] Alexander Feldman, Helena Vicente de Castro, Arjan van Gemund, and Gregory Provan. Model-based diagnostic decision-support system for satellites. In IEEE Aerospace Conference, pages 1–14, 2013.

[Ferreiro et al., 2012] Susana Ferreiro, Aitor Arnaiz, Basilio Sierra, and Itziar Irigoien. Application of Bayesian networks in prognostics for a new integrated vehicle health management concept. Expert Systems with Applications, 39(7):6402–6418, 2012.

[Friedrich and Nejdl, 1992] Gerhard Friedrich and Wolfgang Nejdl. Choosing observations and actions in model-based diagnosis/repair systems. KR, 92:489–498, 1992.

[González-Sanchez et al., 2011] Alberto González-Sanchez, Rui Abreu, Hans-Gerhard Groß, and Arjan J. C. van Gemund. Spectrum-based sequential diagnosis. In AAAI, 2011.

[Heckerman et al., 1995] David Heckerman, John S Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–57, 1995.

[Ibrahim et al., 2005] Joseph G Ibrahim, Ming-Hui Chen, and Debajyoti Sinha. Bayesian survival analysis. Wiley Online Library, 2005.

[Jannach and Schmitz, 2014] Dietmar Jannach and Thomas Schmitz. Model-based diagnosis of spreadsheet programs: a constraint-based debugging approach. Automated Software Engineering, 1:1–40, 2014.

[Mc Naught and Zagorecki, 2009] Ken R Mc Naught and Adam Zagorecki. Using dynamic Bayesian networks for prognostic modelling to inform maintenance decision making. In IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pages 1155–1159, 2009.

[Mengshoel et al., 2010] O.J. Mengshoel, M. Chavira, K. Cascio, S. Poll, A. Darwiche, and S. Uckun. Probabilistic model-based diagnosis: An electrical power system case study. IEEE T. on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(5):874–885, 2010.

[Miller Jr, 2011] Rupert G Miller Jr. Survival analysis, volume 66. John Wiley & Sons, 2011.

[Nilsson, 1982] Nils J Nilsson. Principles of artificial intelligence. Springer, 1982.
[Pernestål et al., 2012] Anna Pernestål, Mattias Nyberg, and Håkan Warnquist. Modeling and inference for troubleshooting with interventions applied to a heavy truck auxiliary braking system. Engineering Applications of Artificial Intelligence, 25(4), 2012.

[Stern et al., 2015] Roni Stern, Meir Kalech, Shelly Rogov, and Alexander Feldman. How many diagnoses do we need? In AAAI, 2015.

[Stern et al., 2016] Roni Stern, Meir Kalech, and Hilla Shinitzky. Implementing troubleshooting with batch repair. In AAAI, 2016.

[Struss and Price, 2003] Peter Struss and Chris Price. Model-based systems in the automotive industry. AI Magazine, 24(4), 2003.

[Tobon-Mejia et al., 2012] DA Tobon-Mejia, Kamal Medjaher, and Noureddine Zerhouni. CNC machine tool's wear diagnostic and prognostic by using dynamic Bayesian networks. Mechanical Systems and Signal Processing, 28:167–182, 2012.

[Torta et al., 2014] Gianluca Torta, Luca Anselma, and Daniele Theseider Dupré. Exploiting abstractions in cost-sensitive abductive problem solving with observations and actions. AI Commun., 27(3):245–262, 2014.

[Warnquist et al., 2009] Håkan Warnquist, Jonas Kvarnström, and Patrick Doherty. Planning as heuristic search for incremental fault diagnosis and repair. In Scheduling and Planning Applications Workshop (SPARK) at ICAPS, 2009.

[Williams and Nayak, 1996] Brian C. Williams and P. Pandurang Nayak. A model-based approach to reactive self-configuring systems. In AAAI, pages 971–978, 1996.

[Williams and Ragno, 2007] Brian C Williams and Robert J Ragno. Conflict-directed A* and its role in model-based embedded systems. Discrete Applied Mathematics, 155(12):1562–1595, 2007.

[Zamir et al., 2014] Tom Zamir, Roni Stern, and Meir Kalech. Using model-based diagnosis to improve software testing. In AAAI, 2014.