# A Theory of Fault-Tolerant Learning

**Changlong Wu, Yifan Wang, Ananth Grama**
Center for Science of Information, Department of Computer Science, Purdue University, West Lafayette, USA. Correspondence to: Changlong Wu.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

**Abstract.** Developing machine learning models that account for potential faults encountered in real-world environments presents a fundamental challenge for mission-critical applications. In this paper, we introduce a novel theoretical framework grounded in learning theory for dealing with faults. In particular, we propose a framework called *fault-tolerant PAC learning*, aimed at identifying the most fault-tolerant models from a given hypothesis class (such as neural networks). We show that if faults occur randomly, fault-tolerant learning is equivalent to regular PAC learning. For adversarial faults, however, we show that the sample complexity of fault-tolerant PAC learning can grow linearly with the number of perturbing functions induced by the faults, even for a hypothesis class of VC-dimension 1. We then provide a matching upper bound obtained by restricting the number of perturbing functions. Finally, we show that the linear dependency on the number of perturbing functions can be substantially improved for deletion faults in neural networks. Our work provides a powerful formal framework and avenues for a number of future investigations on the precise characterization of fault-tolerant learning.

## 1. Introduction

Learning reliable and trustworthy models is an important challenge in contemporary machine learning. Mission-critical applications like autonomous vehicles (Ma et al., 2020; Hengstler et al., 2016), spacecraft (Izzo et al., 2019; Tipaldi et al., 2020), learning-enabled industrial control systems (Al Shahrani et al., 2022; Kershaw et al., 2021), and those operating in extreme environments (Verma et al., 2023; Khan, 2020; Lu, 2023) demand high performance, reliability, and graceful degradation. Constructing models that are resilient to hardware faults is a critical component of such systems.

Experimental results on fault tolerance highlight the fragility of modern machine learning systems to even minor faults. For example, recent work (Nguyen et al., 2023) shows that stuck-at faults (Phatak & Koren, 1995; Syed et al., 2023) in only a few critical intermediate outputs can significantly reduce the accuracy of deep neural networks like ResNet18, rendering them ineffective. Adversarial faults in the form of bit-flip attacks (BFA) (Rakin et al., 2019) that alter small fractions of the weights can massively impact performance. These results underscore concerns about fault tolerance in machine learning systems operating in mission-critical settings. Due to the diversity of experimental settings, there has been a lack of theoretical formulations, models, and associated analyses that can inform practical solutions. This paper introduces a novel theoretical framework for analyzing fault tolerance, proposing rigorous analytical tools to guide the development of reliable models in diverse faulty environments.

Formally, let $\mathcal{X}$ be a set of features and $\mathcal{Y}$ be the set of outcomes. A model is a function $h_w : \mathcal{X} \to \mathcal{Y}$ indexed by $w$. This could be, for instance, a function represented by a neural network with parameters (i.e., weights) $w$.
For any function $h_w$, we associate a set of functions $\mathcal{U}(h_w) \subseteq \mathcal{Y}^{\mathcal{X}}$ that includes all possible perturbed functions when a fault occurs during the execution of $h_w$. For any given $(x, y)$ and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$, we define the *adversarial faulty-loss* as:
$$\ell_{\mathcal{U}}(h_w; x, y) = \sup_{h \in \mathcal{U}(h_w)} \ell(h(x), y).$$
Note that the adversarial faulty-loss measures the worst loss incurred by executing $h_w$ on $x$ when a fault occurs. Similarly, for any given distribution $p$ over $\mathcal{U}(h_w)$, we define the *random fault-loss* as $\ell_{\mathcal{U}}(h_w; x, y) = \mathbb{E}_{h \sim p}[\ell(h(x), y)]$.

Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ and $H = \{h_w : w \in W\}$ be a class of models. Our goal is to learn a model $\hat{h} \in H$ that achieves the minimal *faulty-regret* on $D$ under fault-type $\mathcal{U}$:
$$\mathrm{Reg}_n(\hat{h}) = \mathbb{E}_{(x,y) \sim D}\big[\ell_{\mathcal{U}}(\hat{h}; x, y)\big] - \inf_{h_w \in H} \mathbb{E}_{(x,y) \sim D}\big[\ell_{\mathcal{U}}(h_w; x, y)\big],$$
by accessing a sample $S_n = (x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $D$. Intuitively, we aim to learn the model $\hat{h}$ that is most tolerant to fault-type $\mathcal{U}$ when deployed in an environment where data is generated from $D$. We say that a pair $(H, \mathcal{U})$ is $(\epsilon, \delta)$-fault-tolerant PAC learnable if there exist a learning rule $\mathcal{A}$ and a number $n$ such that, for any distribution $D$, $\mathrm{Reg}_n(\mathcal{A}(S_n)) \leq \epsilon$ with probability at least $1 - \delta$ over an i.i.d. sample $S_n$ of length $n$ from $D$. The minimal such $n$ is referred to as the *sample complexity*.

**Our contributions.** We establish tight upper and lower bounds on the sample complexity of fault-tolerant PAC learning across a wide range of model classes $H$ and fault-types $\mathcal{U}$. Specifically, we show that for any random fault-type, the sample complexity of fault-tolerant learning is equivalent to that of (regular) PAC learning of the class $H$, provided that $\mathcal{U}(h_w) \subseteq H$ (a very mild assumption, automatically satisfied for faults occurring in the parameters of a neural network). For adversarial faults, we demonstrate that the sample complexity of fault-tolerant PAC learning can be arbitrarily larger than that of (regular) PAC learning of $H$. In particular, we show that there exists a class $H$ of VC-dimension 1 such that, for any number $B$, there exists a fault-type $\mathcal{U}$ with $|\mathcal{U}(h_w)| \leq B$ for all $h_w \in H$ whose fault-tolerant sample complexity grows linearly with $B$. We then provide a matching upper bound by showing that for any adversarial fault-type satisfying $|\mathcal{U}(h_w)| \leq B$ and $\mathcal{U}(h_w) \subseteq H$ for all $h_w \in H$, the sample complexity grows as $O\!\big(\frac{B \cdot \mathrm{VC}(H) + \log(1/\delta)}{\epsilon^2}\big)$, where $\mathrm{VC}(H)$ denotes the VC-dimension of $H$. Finally, we show that for neural networks with threshold activation functions and *deletion faults* (i.e., faults that delete a subset of the edges of the network), the sample complexity grows polynomially with the network size. This is surprising: the deletion fault-type can be exponentially large compared to the network size (any subset of edges may be deleted), yet a polynomial sample complexity is achievable even for adversarial fault-types.

In summary, our main contributions are: (i) a fundamentally new formulation of fault-tolerant PAC learning; (ii) tight characterizations of fault-tolerant PAC learnability for both random and adversarial faults; (iii) a case study of the fault-tolerant PAC learnability of feed-forward neural networks with deletion faults, demonstrating potential real-world applications; and (iv) a novel formal model, and algorithmic and analytic techniques that are of independent interest.
**Related work.** A number of recent efforts focus on enhancing fault tolerance in machine learning. Effective abstractions of various fault types that occur in real-world settings have been developed (Torres-Huitzil & Girau, 2017), with stuck-at (Phatak & Koren, 1995; Eghbal et al., 2015; Syed et al., 2023; Abraham & Fuchs, 1986) and random bit-flip (Dodd & Massengill, 2003) faults being particularly common, effectively representing permanent and transient (Sosnowski, 1994) faults. Investigations into the vulnerability of neural networks (Piuri, 2001; Rakin et al., 2019; Liu et al., 2017; Protzel et al., 1993; Nguyen et al., 2023) have revealed significant drops in accuracy due to even a small number of faults. Common strategies for developing fault-tolerant learning include: (i) adding redundancy (e.g., replication of critical nodes/weights) (Chu & Wah, 1990; Emmerson & Damper, 1993; Chin et al., 1994); (ii) modifying learning/training algorithms (e.g., fault injection during training) (Arad & El-Amawy, 1997; Wei et al., 1996; Edwards & Murray, 1997); and (iii) optimization under fault-tolerance constraints (Deodhare et al., 1998; Zhou & Chen, 2003). There have also been efforts focused on active fault tolerance, including error detection and recovery (Khunasaraphan et al., 1994; Hashmi et al., 2011; Deng et al., 2015). In contrast to these efforts, our work focuses on establishing the foundations of achievable performance in terms of tight constructive bounds for different learning models. To this end, our contributions complement these prior results.

## 2. Problem Formulation

Let $\mathcal{X}$ be the instance (feature) space, $\mathcal{Y} = [0, 1]$ be the label space, and $W$ be an index set of hypotheses. For any function $f : W \times \mathcal{X} \to \mathcal{Y}$, we define the hypothesis class induced by $f$ as $H_f = \{h_w(x) = f(w, x) : w \in W\}$. Note that for any distinct $w_1, w_2 \in W$, we treat $h_{w_1}$ and $h_{w_2}$ as different hypotheses, even if they instantiate the same function in $[0, 1]^{\mathcal{X}}$. This is unlike most of the classical learning theory literature, where properties of the hypothesis class depend only on its functionality. In our setup, we also distinguish hypotheses according to their representation (i.e., the index $w$). This distinction can arise, for example, when two sets of parameters within a fixed neural network architecture represent the same function but have different fault-tolerance properties for a particular fault type.

**Fault-types.** A *fault-type* of the function class $H_f$ is a map $\mathcal{U} : W \to 2^{\Delta(W)}$, where $\Delta(W)$ denotes the class of all probability distributions over $W$, and $2^{\Delta(W)}$ is the power set (i.e., the class of all subsets) of $\Delta(W)$. In words, a fault-type $\mathcal{U}$ maps each $w \in W$ to a set $\mathcal{U}(w)$ of probability distributions over $W$. (In this paper, we consider only *proper* fault-types, where the perturbed functions must also lie in $H_f$; improper fault-types can be defined similarly.) For any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $w \in W$, and loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, we define the *faulty-loss* as:
$$\ell_{\mathcal{U}}(h_w; x, y) = \sup_{p \in \mathcal{U}(w)} \mathbb{E}_{w' \sim p}[\ell(h_{w'}(x), y)]. \tag{2}$$
The faulty-loss is interpreted as follows: given $(x, y)$ and $w$, an adversary first selects the worst distribution $p \in \mathcal{U}(w)$ and perturbs $w$ by sampling $w' \sim p$; the faulty-loss is then the expected loss of $h_{w'}$ on $(x, y)$. This formulation simultaneously covers random and adversarial faults by appropriately defining the map $\mathcal{U}$, as illustrated in Examples 1 and 2 below.
**Example 1 (Random Faults).** Suppose $|\mathcal{U}(w)| = 1$ for every $w \in W$, and denote by $p_w$ the single element of $\mathcal{U}(w)$. Then the faulty-loss $\ell_{\mathcal{U}}$ reduces to the case of random faults: for any sample $(x, y)$, the loss is incurred by replacing $w$ with a random sample $w' \sim p_w$. This can happen, for instance, with a random flip of the output of a single neuron in a neural network.

**Example 2 (Adversarial Faults).** Suppose that for every $w \in W$ the set $\mathcal{U}(w)$ contains only singleton distributions (i.e., distributions that assign probability 1 to a single $w'$). Then one may view $\mathcal{U}(w)$ as a subset of $W$, and the faulty-loss $\ell_{\mathcal{U}}$ reduces to the loss incurred by adversarially perturbing $w$ to an element $w' \in \mathcal{U}(w)$.

It is worth noting that our definition of the faulty-loss in (2) also models scenarios that interpolate between the purely random and purely adversarial fault-types.

**Fault-tolerant PAC learning.** We now formulate the main learning paradigm of this paper. Let $D$ be an unknown distribution over $\mathcal{X} \times \mathcal{Y}$. For any hypothesis class $H_f$, loss $\ell$, fault-type $\mathcal{U}$, and index $w \in W$, we define the *fault-tolerant risk* of $h_w$ as:
$$R_{\mathcal{U}}(h_w; D) = \mathbb{E}_{(x,y) \sim D}\big[\ell_{\mathcal{U}}(h_w; x, y)\big], \tag{3}$$
where $\ell_{\mathcal{U}}$ is the faulty-loss defined in (2). Let $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a sample of size $n$ drawn i.i.d. from $D$. Our goal is to find an algorithm $\mathcal{A} : S_n \mapsto H_f$ that minimizes the fault-tolerant risk $R_{\mathcal{U}}(\mathcal{A}(S_n); D)$ as in (3). We say a pair $(H_f, \mathcal{U})$ is *fault-tolerant PAC-learnable* under loss $\ell$ if there exists an algorithm $\mathcal{A}$ such that for any $\epsilon, \delta > 0$ there exists a number $n := n(\epsilon, \delta)$ with:
$$\Pr\Big[R_{\mathcal{U}}(\mathcal{A}(S_n); D) - \inf_{h_w \in H_f} R_{\mathcal{U}}(h_w; D) \geq \epsilon\Big] \leq \delta. \tag{4}$$
For any $\epsilon, \delta > 0$, the minimal number $n$ achievable by an algorithm $\mathcal{A}$ satisfying (4) is defined to be the *sample complexity* of $(\epsilon, \delta)$-fault-tolerant PAC learning of $(H_f, \mathcal{U})$. We denote this number by $M(\epsilon, \delta; H_f, \mathcal{U})$. Fault-tolerant PAC learning aims to find the most robust hypothesis $\hat{h} \in H_f$, i.e., the one achieving the minimal fault-tolerant risk $R_{\mathcal{U}}(\hat{h}; D)$ under the perturbations of $\mathcal{U}$.

Analogous to the classical PAC learning framework, we can also define the following fault-tolerant Empirical Risk Minimization (ERM) rule. Let $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be any sample of size $n$; the *empirical fault-tolerant risk* of any $h_w$ is defined as:
$$\hat{R}_{\mathcal{U}}(h_w; S_n) = \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathcal{U}}(h_w; x_i, y_i). \tag{5}$$
The fault-tolerant ERM rule corresponds to finding $\hat{h} \in H_f$ that satisfies:
$$\hat{R}_{\mathcal{U}}(\hat{h}; S_n) = \inf_{h_w \in H_f} \hat{R}_{\mathcal{U}}(h_w; S_n). \tag{6}$$

**Robustness vs. fault-tolerance.** Our fault-tolerance framework is closely related to adversarially robust PAC learning, as introduced in (Montasser et al., 2019). Instead of perturbing the index $w$, the adversarially robust setting perturbs any input $x$ to an adversarially selected sample $x'$ chosen from a set $\mathcal{V}(x) \subseteq \mathcal{X}$. Here, $\mathcal{V} : \mathcal{X} \to 2^{\mathcal{X}}$ is a function specifying the possible perturbations. The goal is to find a function $\hat{h}$ (not necessarily in $H_f$) that minimizes the robust risk:
$$R_{\mathcal{V}}(\hat{h}; D) = \mathbb{E}_{(x,y) \sim D}\Big[\sup_{x' \in \mathcal{V}(x)} \ell(\hat{h}(x'), y)\Big].$$
Given any such $\mathcal{V}$ and a function $h_w \in H_f$, we can define the set of functions $H_w = \{h \in \mathcal{Y}^{\mathcal{X}} : \forall x \in \mathcal{X},\ \exists x' \in \mathcal{V}(x) \text{ s.t. } h(x) = h_w(x')\}$. If we now define a fault-type $\mathcal{U}$ by mapping each $w$ to the set of all singleton distributions over functions in $H_w$, then robust PAC learning reduces to our fault-tolerant PAC learning. Observe, however, that the fault-type $\mathcal{U}$ constructed here is improper, since $H_w$ need not be a subset of $H_f$.

The goal of this paper is to characterize how the structures of $H_f$ and $\mathcal{U}$ affect fault-tolerant PAC learnability, to find learning rules, and to derive upper and lower bounds on the sample complexity for selected natural classes.
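To make the formulation concrete, the following minimal sketch (our own illustration, not code from the paper; the threshold class, fault sets, and all names are hypothetical) implements the faulty-loss (2) and the fault-tolerant ERM rule (5)-(6) for a finite index set with an adversarial fault-type, where each $\mathcal{U}(w)$ is simply a finite set of alternative indices.

```python
import numpy as np

# Sketch: adversarial faulty-loss (2) and fault-tolerant ERM (5)-(6) over a
# finite index set W.  U maps each index w to a finite set of indices the
# adversary may substitute (singleton distributions, as in Example 2).

def faulty_loss(f, U, w, x, y, loss):
    """l_U(h_w; x, y) = max over w' in U(w) of loss(f(w', x), y)."""
    return max(loss(f(w2, x), y) for w2 in U[w])

def fault_tolerant_erm(f, U, W, sample, loss):
    """Return the index minimizing the empirical fault-tolerant risk (5)."""
    def emp_risk(w):
        return np.mean([faulty_loss(f, U, w, x, y, loss) for x, y in sample])
    return min(W, key=emp_risk)

# Toy instance (hypothetical): thresholds h_w(x) = 1{x >= w} on a grid, where
# a fault may shift the stored threshold index by one step in either direction.
W = list(range(10))
f = lambda w, x: int(x >= w)
U = {w: {w, max(w - 1, 0), min(w + 1, 9)} for w in W}   # |U(w)| <= 3
zero_one = lambda y_hat, y: int(y_hat != y)

rng = np.random.default_rng(0)
xs = rng.integers(0, 10, size=200)
sample = [(int(x), int(x >= 4)) for x in xs]            # labels from w* = 4

print("fault-tolerant ERM picks w =", fault_tolerant_erm(f, U, W, sample, zero_one))
```

Note that the ERM here pays the worst-case loss over the shifted thresholds, so it may prefer an index whose faulty neighbors are also consistent with the data, rather than the index that best fits the sample in the fault-free sense.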
## 3. Main Results

In this section, we provide tight characterizations of the fault-tolerant PAC learnability of finite-VC-dimension classes under various fault-types.

### 3.1. General Characterizations

We initiate our discussion with some basic properties of the fault-tolerant PAC learning framework, focusing on binary-valued hypothesis classes. For clarity of presentation, we consider the case where $\mathcal{Y} = \{0, 1\}$ and $\ell(y, y') = \mathbb{1}\{y \neq y'\}$ is the misclassification loss. The following fact is easy to observe:

**Fact 1.** For any probability distribution $p$ over $W$ and the misclassification loss $\ell$, we have for any $(x, y)$:
$$\mathbb{E}_{w' \sim p}[\ell(h_{w'}(x), y)] = \big|\mathbb{E}_{w' \sim p}[h_{w'}(x)] - y\big|.$$

*Proof.* We argue by cases: if $y = 0$, then $\ell(h_{w'}(x), y) = h_{w'}(x)$ and the result trivially holds; if $y = 1$, then $\ell(h_{w'}(x), y) = 1 - h_{w'}(x)$, and the result follows by taking the expectation $\mathbb{E}_{w'}$ on both sides. ∎

**Warm-up: random fault-types.** We start with the simpler random fault-types introduced in Example 1. Let $\mathcal{U}$ be a random fault-type, i.e., $|\mathcal{U}(w)| = 1$ for every $w \in W$. For any $w \in W$ define the function
$$g_w(x) = \mathbb{E}_{w' \sim p_w}[h_{w'}(x)],$$
where $p_w$ is the single distribution in $\mathcal{U}(w)$. By Fact 1, the fault-tolerant PAC learnability of the pair $(H_f, \mathcal{U})$ under misclassification loss is equivalent to (regular) PAC learning of $G = \{g_w : w \in W\}$ under absolute loss. Therefore, we have:

**Proposition 1.** Let $H_f$ be a binary-valued class, $\mathcal{U}$ be any fault-type such that $|\mathcal{U}(w)| = 1$ for all $w \in W$, and $G \subseteq [0,1]^{\mathcal{X}}$ be as defined above. Then $(H_f, \mathcal{U})$ is fault-tolerant PAC learnable under misclassification loss if and only if $G$ is PAC learnable under absolute loss. Moreover, the two learning problems have the same sample complexity.

Although Proposition 1 provides a complete characterization of fault-tolerant PAC learnability by reducing it to PAC learning of the class $G$, the structure of $G$ may still be complicated. Fortunately, by the definition of $g_w$, the class $G$ is contained in the convex hull of $H_f$. To proceed, we first recall the definition of the Rademacher complexity of a hypothesis class $H$ with horizon $n$:
$$\mathrm{Rad}_n(H) = \sup_{x^n \in \mathcal{X}^n} \mathbb{E}_{\epsilon^n}\Big[\sup_{h \in H} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i h(x_i)\Big],$$
where $\epsilon^n$ is sampled from the uniform distribution over $\{-1, +1\}^n$. The following key lemma relates the Rademacher complexities of $H_f$ and $G$:

**Lemma 1.** We have $\mathrm{Rad}_n(G) \leq \mathrm{Rad}_n(H_f)$.

*Proof.* By (Shalev-Shwartz & Ben-David, 2014, Lem. 26.7), the Rademacher complexity of a class equals the Rademacher complexity of its convex hull. The lemma follows since $G$ lies within the convex hull of $H_f$. ∎

We now arrive at our first main result, which completely characterizes the sample complexity of fault-tolerant learning of a binary-valued hypothesis class with a random fault-type.

**Theorem 1.** Let $H_f$ be a binary-valued class with finite VC-dimension $\mathrm{VC}(H_f)$. Then, for any fault-type $\mathcal{U}$ with $|\mathcal{U}(w)| = 1$ for all $w \in W$, the sample complexity satisfies
$$M(\epsilon, \delta; H_f, \mathcal{U}) \leq O\Big(\frac{\mathrm{VC}(H_f) + \log(1/\delta)}{\epsilon^2}\Big).$$
Moreover, this bound is tight when $\mathcal{U}$ introduces no faults.

*Proof.* By Proposition 1, $M(\epsilon, \delta; H_f, \mathcal{U})$ equals the sample complexity of PAC learning $G$ under absolute loss. Using a standard symmetrization argument as in (Shalev-Shwartz & Ben-David, 2014, Thm. 26.5) and the 1-Lipschitz property of the absolute loss, the generalization error $\epsilon$ of the ERM rule on $G$ with a sample of size $n$ is upper bounded by $O(\mathrm{Rad}_n(G) + \sqrt{\log(1/\delta)/n})$. The theorem then follows from Lemma 1, the fact that $\mathrm{Rad}_n(H_f) \leq O(\sqrt{\mathrm{VC}(H_f)/n})$ (Wainwright, 2019, Example 5.24), and rewriting $n$ in terms of $\epsilon, \delta$. ∎

**Remark 1.** Theorem 1 shows that, statistically, introducing random faults does not make learning harder. Moreover, the optimal sample complexity is achieved by the fault-tolerant ERM rule in (6).

**Example 3.** Let $f(w, x) = \mathbb{1}\{\langle w, x\rangle \geq 0\}$ with $w, x \in \mathbb{R}^d$ be the linear threshold function. Consider the fault-type that flips each coordinate $w[i]$ to $-w[i]$ independently with some probability $\sigma > 0$. Although such flip faults can alter the original function in complicated ways, Theorem 1 shows that this pair can be fault-tolerant PAC learned with sample complexity $O\big(\frac{d + \log(1/\delta)}{\epsilon^2}\big)$.
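As an illustration of Example 3 and the reduction in Proposition 1, the following sketch (our own; the dimensions, seeds, and names are hypothetical) Monte-Carlo estimates $g_w(x) = \mathbb{E}_{w' \sim p_w}[h_{w'}(x)]$ under independent sign flips, and hence the random faulty-loss $|g_w(x) - y|$ from Fact 1.

```python
import numpy as np

# Sketch: estimate the random faulty-loss of a linear threshold
# h_w(x) = 1{<w, x> >= 0} when every coordinate of w flips sign independently
# with probability sigma.  With |U(w)| = 1 (a single, known flip distribution
# p_w), Proposition 1 says that learning g_w suffices.

def g_w(w, x, sigma, n_mc=2000, rng=None):
    """Monte-Carlo estimate of g_w(x) = E_{w'~p_w}[1{<w', x> >= 0}]."""
    rng = rng or np.random.default_rng(0)
    flips = rng.random((n_mc, len(w))) < sigma     # which coordinates flip
    w_pert = np.where(flips, -w, w)                # draws w' ~ p_w
    return np.mean(w_pert @ x >= 0)

d, sigma = 5, 0.1
rng = np.random.default_rng(1)
w = rng.standard_normal(d)
x = rng.standard_normal(d)
y = 1
# Random faulty-loss on (x, y) equals |g_w(x) - y| by Fact 1.
print("faulty-loss estimate:", abs(g_w(w, x, sigma, rng=rng) - y))
```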
**Characterization of general fault-types.** We now turn to general fault-types $\mathcal{U}$ with $|\mathcal{U}(w)| \geq 1$ for all $w \in W$. Recall that $\mathcal{U}(w)$ is a set of distributions over $W$; this can model, for instance, all possible random perturbations when the exact perturbing distribution is not known a priori. If the distributions in $\mathcal{U}(w)$ are singleton distributions, this reduces to the fully adversarial case of Example 2. Now, for any $H_f$, $\mathcal{U}$, and $w \in W$, we define two functions:
$$g^{\min}_w(x) = \inf_{p \in \mathcal{U}(w)} \mathbb{E}_{w' \sim p}[h_{w'}(x)], \qquad g^{\max}_w(x) = \sup_{p \in \mathcal{U}(w)} \mathbb{E}_{w' \sim p}[h_{w'}(x)].$$
We observe the following fact:

**Fact 2.** For the misclassification loss $\ell$, we have:
$$\ell_{\mathcal{U}}(h_w; x, y) = \begin{cases} g^{\max}_w(x), & \text{if } y = 0,\\ 1 - g^{\min}_w(x), & \text{if } y = 1,\end{cases}$$
where $\ell_{\mathcal{U}}(h_w; x, y)$ is the faulty-loss as in (2).

*Proof.* This follows directly from Fact 1 and the definition in (2) by arguing on the cases of $y$. ∎

Let $g_w(x) = (g^{\min}_w(x), g^{\max}_w(x))$ be a function mapping $\mathcal{X} \to [0,1]^2$ and let $G = \{g_w : w \in W\}$. The fault-tolerant PAC learnability of $(H_f, \mathcal{U})$ under misclassification loss is then equivalent to PAC learning of $G$ under the loss
$$\tilde{\ell}(g_w; x, y) = (1 - y)\, g^{\max}_w(x) + y\,\big(1 - g^{\min}_w(x)\big).$$
Observe, however, that the functions $g^{\min}_w$ and $g^{\max}_w$ need not be convex combinations of functions in $H_f$; the class $G$ can therefore have a complicated structure. Specifically, for an *adversarial* fault-type $\mathcal{U}$ (so that $\mathcal{U}(w)$ can be viewed as a subset of $W$), the function $g_w$ can be compactly expressed as $g_w : \mathcal{X} \to \{*, 0, 1\}$ such that $g_w(x) = 0$ if $h_{w'}(x) = 0$ for all $w' \in \mathcal{U}(w)$; $g_w(x) = 1$ if $h_{w'}(x) = 1$ for all $w' \in \mathcal{U}(w)$; and $g_w(x) = *$ otherwise. Moreover, the loss satisfies $\tilde{\ell}(*, y) = 1$ for all $y \in \{0, 1\}$ and $\tilde{\ell}(y', y) = \mathbb{1}\{y' \neq y\}$ for all $y' \neq *$.
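The compact $\{*, 0, 1\}$ representation is easy to compute for a finite adversarial fault-type. The sketch below (our notation, with `None` standing in for $*$ and a hypothetical threshold class) evaluates $g_w$ and the induced loss $\tilde{\ell}$ from Fact 2.

```python
# Sketch: the {*, 0, 1} representation of g_w for an adversarial fault-type,
# where U(w) is a finite set of alternative indices.  g_w(x) is 0 or 1 when
# all perturbed hypotheses agree on x, and * (here None) when they disagree,
# in which case the faulty-loss is 1 regardless of the label.

def g(f, U, w, x):
    vals = {f(w2, x) for w2 in U[w]}
    return vals.pop() if len(vals) == 1 else None   # None plays the role of *

def loss_tilde(gx, y):
    """Faulty-loss via Fact 2: 1 on disagreement, 0/1 loss otherwise."""
    return 1 if gx is None else int(gx != y)

# Toy instance: thresholds on the integers with +-1 index perturbations.
f = lambda w, x: int(x >= w)
U = {w: {w - 1, w, w + 1} for w in range(10)}
for x in [2, 3, 4, 5]:
    print(x, g(f, U, 4, x), loss_tilde(g(f, U, 4, x), y=1))
```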
The following lemma demonstrates that even for a class $H_f$ of VC-dimension 1, the sample complexity of fault-tolerant learning of $(H_f, \mathcal{U})$ can be arbitrarily large:

**Lemma 2.** There exists a class $H_f$ of VC-dimension 1 such that for any $B > 0$ there exists a fault-type $\mathcal{U}$ with sample complexity $M(\tfrac{1}{8}, \tfrac{1}{7}; H_f, \mathcal{U}) \geq \Omega(B)$.

*Proof.* Let $\mathcal{X}, W := \mathbb{N}$, the set of natural numbers. For any $w \in W$, define the function $h_w(x) = \mathbb{1}\{x = w\}$ and let $H_f = \{h_w : w \in W\}$. It is easy to observe that $\mathrm{VC}(H_f) = 1$. We now construct, for any natural number $B > 0$, an adversarial fault-type $\mathcal{U}$ that attains the claimed $\Omega(B)$ lower bound for any learning rule. Let $A_1, \ldots, A_N$ be an enumeration of all subsets $A_i \subseteq \{1, \ldots, 3B\}$ of size $|A_i| = B$, where $N = \binom{3B}{B}$. For $w \in \{1, \ldots, N\}$, define $\mathcal{U}(w)$ as the class of all singleton distributions over $A_w$ (i.e., $\mathcal{U}(w)$ is effectively $A_w$); for any other $w$, set $\mathcal{U}(w) = \mathcal{U}(1)$. Note that $|\mathcal{U}(w)| = B$ for all $w \in W$. Moreover, for $w \in \{1, \ldots, N\}$ and $y = 0$, we have $\ell_{\mathcal{U}}(h_w; x, y) = 1$ if and only if $x \in A_w$ (by the definitions of $H_f$, $\mathcal{U}$ and Fact 2).

We now construct the distribution $D$ using a probabilistic argument resembling the one in (Montasser et al., 2019). For any $A \subseteq \{1, \ldots, 3B\}$, define $D_A$ as the distribution with $x$ uniform over $A$ and $y$ being constant 0. Let $\mu$ be the uniform distribution over all subsets of $\{1, \ldots, 3B\}$ of size $2B$, and let $\mathcal{A}$ be any learning rule. Writing $\hat{h} := \mathcal{A}(S_B)$ and letting $A_S$ denote the set of features appearing in $S_B$, we have:
$$\sup_{A} \Big( \mathbb{E}_{S_B}\big[R_{\mathcal{U}}(\mathcal{A}(S_B); D_A)\big] - \inf_{h_w} R_{\mathcal{U}}(h_w; D_A) \Big) \overset{(a)}{\geq} \mathbb{E}_{A \sim \mu}\Big[\mathbb{E}_{S_B \sim D_A}\big[R_{\mathcal{U}}(\mathcal{A}(S_B); D_A)\big]\Big]$$
$$\overset{(b)}{=} \frac{1}{2B}\, \mathbb{E}_{A, S_B}\Big[\sum_{x \in A} \ell_{\mathcal{U}}(\hat{h}; x, 0)\Big] \overset{(c)}{=} \frac{1}{2B}\, \mathbb{E}_{A, S_B}\Big[C + \sum_{x \in A \setminus A_S} \ell_{\mathcal{U}}(\hat{h}; x, 0)\Big] \overset{(d)}{\geq} \frac{1}{2B}\, \mathbb{E}\big[C + (B - C)/2\big] \geq \frac{1}{4},$$
where (a) follows from the fact that the hypothesis whose index corresponds to the subset $\{1, \ldots, 3B\} \setminus A$ attains zero fault-tolerant risk on $D_A$ (by the construction of $\mathcal{U}$ and $y \equiv 0$); (b) follows by switching the order of expectations; (c) follows by setting $C$ to be the loss incurred by $\hat{h}$ on $A_S$; and (d) follows from the fact that the loss incurred by $\hat{h}$ equals $|A_w \cap A|$, where $w$ is the index of $\hat{h} \in H_f$ (taking $A_w$ to be $A_1$ if $w \geq N$): conditioned on $A_S$, the set $A \setminus A_S$ is uniformly distributed over $\{1, \ldots, 3B\} \setminus A_S$ with size $2B - |A_S|$, so the expected size of $|A_w \cap A|$ equals
$$C + (|A_w| - C)\,\frac{2B - |A_S|}{3B - |A_S|} \geq C + (|A_w| - C)\cdot\frac{1}{2} = C + (B - C)/2,$$
since $|A_S| \leq B$ and $|A_w| = B$. Therefore, for any learning rule $\mathcal{A}$ there must exist a distribution $D_A$ such that the expected fault-tolerant regret is lower bounded by $\frac{1}{4}$ with sample size $B$. The lemma then follows by the standard argument converting an expected lower bound into a high-probability lower bound, as in (Shalev-Shwartz & Ben-David, 2014, Thm. 5.1). ∎

Lemma 2 demonstrates that one cannot hope to obtain a uniform sample complexity bound that is independent of the fault-type $\mathcal{U}$, even for classes of VC-dimension 1. This is in contrast to the random-faults case established in Theorem 1.
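For intuition, the construction in Lemma 2 can be simulated at small scale. The following sketch (ours, with hypothetical parameters) builds the fault sets $A_1, \ldots, A_N$ and verifies that the hypothesis whose fault set is the complement of $A$ attains zero fault-tolerant risk, while a learner observing only $B$ of the $2B$ support points has no way to identify it.

```python
import itertools
import numpy as np

# Sketch of the hard instance from Lemma 2: h_w(x) = 1{x = w}, and for each
# index w the adversary may replace w by any element of a size-B set A_w.
# Under a label-0 distribution supported on A, the faulty-loss of h_w on x is
# 1 iff x lies in A_w.

B = 3
universe = list(range(1, 3 * B + 1))
A_list = list(itertools.combinations(universe, B))   # N = C(3B, B) fault sets

def faulty_loss(w, x):
    return int(x in A_list[w])                       # y = 0 throughout

rng = np.random.default_rng(0)
A = rng.choice(universe, size=2 * B, replace=False)  # support of D_A
risk = lambda w: np.mean([faulty_loss(w, x) for x in A])
print("best achievable risk:", min(risk(w) for w in range(len(A_list))))
# The minimizer is the w with A_w equal to the complement of A (risk 0);
# a sample of size B reveals at most half of A, hiding that complement.
```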
One must therefore introduce certain constraints on $\mathcal{U}$ in order to obtain meaningful results. Inspired by Theorem 1, a natural constraint is to bound the size of $\mathcal{U}(w)$. Doing so allows us to obtain our second main result:

**Theorem 2.** Let $H_f$ be a binary-valued class of finite VC-dimension. Then, for any fault-type $\mathcal{U}$ with $|\mathcal{U}(w)| \leq B$ for all $w \in W$, we have:
$$M(\epsilon, \delta; H_f, \mathcal{U}) \leq O\Big(\frac{B^2\, \mathrm{VC}(H_f) + \log(1/\delta)}{\epsilon^2}\Big).$$
If, in addition, $\mathcal{U}$ is an adversarial fault-type, then:
$$M(\epsilon, \delta; H_f, \mathcal{U}) \leq O\Big(\frac{B\, \mathrm{VC}(H_f) \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\Big),$$
and the linear dependency on $B$ cannot be improved.

*Proof.* The lower bound on the linear dependency on $B$ follows directly from Lemma 2. We now prove the upper bounds. By Fact 2 and the discussion following it, the fault-tolerant sample complexity equals the sample complexity of PAC learning $G$ under the loss $\tilde{\ell}$. Using a standard symmetrization argument as in (Shalev-Shwartz & Ben-David, 2014, Thm. 26.5), the generalization error of the fault-tolerant ERM rule with sample size $n$ is upper bounded (w.p. $\geq 1 - \delta$) by $O\big(\mathrm{Rad}_n(\tilde{\ell} \circ G) + \sqrt{\log(1/\delta)/n}\big)$, where (for $\epsilon^n$ uniform over $\{-1, +1\}^n$):
$$\mathrm{Rad}_n(\tilde{\ell} \circ G) = \sup_{x^n, y^n} \mathbb{E}_{\epsilon^n}\Big[\sup_{g_w} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i\, \tilde{\ell}(g_w; x_i, y_i)\Big].$$
Fix any $x^n$ and $y^n$. If $y_i = 0$ then $\tilde{\ell}(g_w; x_i, y_i) = g^{\max}_w(x_i)$; otherwise $\tilde{\ell}(g_w; x_i, y_i) = 1 - g^{\min}_w(x_i)$. Therefore:
$$\sum_{i=1}^{n} \epsilon_i\, \tilde{\ell}(g_w; x_i, y_i) = \sum_{i : y_i = 0} \epsilon_i\, g^{\max}_w(x_i) + \sum_{i : y_i = 1} \epsilon_i\,\big(1 - g^{\min}_w(x_i)\big).$$
Taking the operator $\mathbb{E}_{\epsilon^n} \sup_{g_w}$ on both sides, noticing that the constant 1 vanishes, and using $\sup(A + B) \leq \sup A + \sup B$ together with the symmetry of the $\epsilon_i$, we obtain:
$$\mathbb{E}_{\epsilon^n}\Big[\sup_{g_w} \sum_{i=1}^{n} \epsilon_i\, \tilde{\ell}(g_w; x_i, y_i)\Big] \leq \mathbb{E}_{\epsilon^n}\Big[\sup_{g_w} \sum_{i : y_i = 0} \epsilon_i\, g^{\max}_w(x_i)\Big] + \mathbb{E}_{\epsilon^n}\Big[\sup_{g_w} \sum_{i : y_i = 1} \epsilon_i\, g^{\min}_w(x_i)\Big]. \tag{7}$$
We now upper bound the first term on the RHS of (7); the second term can be bounded similarly. For any $w \in W$, let $H_w = \{h_p(x) = \mathbb{E}_{w' \sim p}[h_{w'}(x)] : p \in \mathcal{U}(w)\}$, so that $g^{\max}_w(x) = \sup_{h \in H_w} h(x)$ and $|H_w| \leq B$ for all $w \in W$. To proceed, we establish the following key inequality: for any function $F : W \to \mathbb{R}$ and any $i$ with $y_i = 0$,
$$\mathbb{E}_{\epsilon_i}\Big[\sup_w \big(\epsilon_i\, g^{\max}_w(x_i) + F(w)\big)\Big] \leq \mathbb{E}_{\epsilon'^B}\Big[\sup_w \Big(\sum_{h_j \in H_w} \epsilon'_j\, h_j(x_i) + F(w)\Big)\Big], \tag{8}$$
where $\epsilon'^B$ is a fresh sample uniform over $\{-1, +1\}^B$. To see this, observe that $g^{\max}_w(x) = \sup_{h \in H_w} h(x)$, so there exists $h_{j'} \in H_w$ such that $g^{\max}_w(x_i) = h_{j'}(x_i)$. We now fix $\epsilon_i$ and couple $\epsilon'_{j'} = \epsilon_i$. Let $\hat{w}$ attain the $\sup_w$ on the LHS of (8). We have:
$$\sum_{h_j \in H_{\hat{w}}} \epsilon'_j\, h_j(x_i) + F(\hat{w}) \leq \sup_w \Big(\sum_{h_j \in H_w} \epsilon'_j\, h_j(x_i) + F(w)\Big).$$
Taking the expectation over the $\epsilon'_j$ with $j \neq j'$, the LHS equals $\epsilon'_{j'}\, h_{j'}(x_i) + F(\hat{w})$. By the definition of $\hat{w}$ and the coupling, $\mathbb{E}_{\epsilon_i}[\epsilon'_{j'}\, h_{j'}(x_i) + F(\hat{w})]$ equals the LHS of (8), and the inequality (8) follows. Repeatedly applying (8), we have:
$$\mathbb{E}_{\epsilon}\Big[\sup_w \sum_{i : y_i = 0} \epsilon_i\, g^{\max}_w(x_i)\Big] \leq \mathbb{E}_{\epsilon'}\Big[\sup_w \sum_{j=1}^{B} \sum_{i=1}^{n} \epsilon'_{(j-1)B+i}\, h_j(x_i)\Big] \leq \sum_{j=1}^{B} \mathbb{E}_{\epsilon'}\Big[\sup_{h_j \in \mathrm{co}(H_f)} \sum_{i=1}^{n} \epsilon'_{(j-1)B+i}\, h_j(x_i)\Big] \leq B\,n\, \mathrm{Rad}_n(\mathrm{co}(H_f)) = B\,n\, \mathrm{Rad}_n(H_f),$$
where $\mathrm{co}(H_f)$ denotes the convex hull of $H_f$ and the last equality follows from (Shalev-Shwartz & Ben-David, 2014, Lem. 26.7). Since $\mathrm{Rad}_n(H_f) \leq O(\sqrt{\mathrm{VC}(H_f)/n})$, the generalization error is upper bounded by $O(B\sqrt{\mathrm{VC}(H_f)/n}) = O(\sqrt{B^2\, \mathrm{VC}(H_f)/n})$. The first upper bound then follows by rewriting $n$ in terms of $\epsilon, \delta$.

To prove the second upper bound, observe that $H_w \subseteq H_f$ for adversarial fault-types. By Sauer's lemma (Shalev-Shwartz & Ben-David, 2014), the number of such sets $H_w$ restricted to any $x^n$ is at most $n^{B\, \mathrm{VC}(H_f)}$. Since $g^{\max}_w$ and $g^{\min}_w$ are completely determined by $H_w$, the restriction of $G$ to $x^n$ has size at most $n^{B\, \mathrm{VC}(H_f)}$. Applying Massart's lemma (Shalev-Shwartz & Ben-David, 2014), we arrive at an $O(\sqrt{B\, \mathrm{VC}(H_f) \log n / n})$ upper bound on $\mathrm{Rad}_n(\tilde{\ell} \circ G)$. This completes the proof. ∎

### 3.2. Neural Networks with Deletion Faults

In this section, we focus on an important hypothesis class: feed-forward neural networks with the threshold activation function. For simplicity, we assume that there are $L$ layers, each with $N$ neurons (except for the last layer, which has a single neuron), and that adjacent layers are fully connected (general architectures can be analyzed similarly). We consider specifically the threshold activation $\sigma(x) = \mathbb{1}\{x \geq 0\}$. Let $\mathcal{X} = \mathbb{R}^d$ be the input space and $E$ be the set of all edges (weights), where $|E| = dN + (L-2)N^2 + N$. The index space is therefore $W = \mathbb{R}^{|E|}$, which assigns each coordinate of $w$ to an edge in $E$. Note that, for any $w \in W$, the architecture defines a hypothesis $h_w : \mathcal{X} \to \{0, 1\}$; the hypothesis class we consider is therefore $H_f = \{h_w : w \in W\}$. It is well known that $\mathrm{VC}(H_f) \leq O(|E| \log |E|)$ (Bartlett & Maass, 2003).

We now consider the following *deletion faults*. Let $\mathcal{S}$ be a set of subsets of $E$; a fault occurs by (adversarially) selecting a subset $A \in \mathcal{S}$ and setting $w[i] = 0$ for all $i \in A$. In other words, the adversary effectively deletes the edges in $A$. We denote by $w_A$ the vector obtained from $w$ by setting all coordinates in $A$ to 0. Formally, for any given $\mathcal{S}$, the (adversarial) deletion fault-type is the map $\mathcal{U}_{\mathcal{S}}$ such that $\mathcal{U}_{\mathcal{S}}(w)$ is the class of all singleton distributions over $\{w_A : A \in \mathcal{S}\}$. For notational convenience, we simply write $\mathcal{U}_{\mathcal{S}} = \{w_A : A \in \mathcal{S}\}$. The following corollary follows directly from Theorem 2:

**Corollary 1.** Let $H_f$ and $\mathcal{U}_{\mathcal{S}}$ be defined as above. We have
$$M(\epsilon, \delta; H_f, \mathcal{U}_{\mathcal{S}}) \leq \tilde{O}\Big(\frac{|\mathcal{S}|\,|E| + \log(1/\delta)}{\epsilon^2}\Big),$$
where $\tilde{O}$ hides poly-logarithmic factors in $|E|$ and $1/\epsilon$.

Note that if we allow the adversary to delete any $k$ (arbitrary) edges, then $|\mathcal{S}| = \binom{|E|}{k} \leq |E|^k$. Therefore, even for moderately large $k$, the sample complexity provided in Corollary 1 can be exponentially large.
Our main result of this section is the following theorem, which establishes a sample complexity independent of $|\mathcal{S}|$.

**Theorem 3.** Let $H_f$ and $\mathcal{U}_{\mathcal{S}}$ be as defined above. Then the sample complexity $M(\epsilon, \delta; H_f, \mathcal{U}_{\mathcal{S}})$ is upper bounded by
$$\tilde{O}\Big(\frac{dLN^3 + L^2N^4 + d^2N^2 + \log(1/\delta)}{\epsilon^2}\Big),$$
where $\tilde{O}$ hides poly-logarithmic factors in $1/\epsilon$.

Note that the hypothesis class constructed in the proof of Lemma 2 can be instantiated by a neural network as well. Therefore, the hard fault-type constructed there also applies to the class $H_f$ of Theorem 3; i.e., in the worst case the sample complexity must scale linearly with the fault-type size. Perhaps surprisingly, Theorem 3 demonstrates that for certain natural fault-types the sample complexity can be significantly improved.

**Analysis for a single neuron.** Before giving the formal proof of Theorem 3, we first consider the simpler case of a single neuron, i.e., the linear threshold class $f(w, x) = \mathbb{1}\{\langle w, x\rangle \geq 0\}$. Let $\mathcal{U}_{\mathcal{S}}$ be an arbitrary deletion fault-type. By the proof of Theorem 2, the generalization error of the fault-tolerant ERM rule with sample size $n$ is upper bounded by $O\big(\mathrm{Rad}_n(\tilde{\ell} \circ G) + \sqrt{\log(1/\delta)/n}\big)$, where $G$ and $\tilde{\ell}$ are as in Fact 2 and the discussion following it. By Massart's lemma (Shalev-Shwartz & Ben-David, 2014), for any given $x^n$:
$$\mathrm{Rad}_n(\tilde{\ell} \circ G) \leq O\Big(\sqrt{\log |G|_{x^n}| / n}\Big), \tag{9}$$
where $G|_{x^n}$ is the restriction of $G$ to $x^n$. It therefore suffices to bound the size of $G|_{x^n}$. By Sauer's lemma, the restriction of $H_f$ to any $x^n$ has size at most $n^d$. However, it is possible that two hypotheses $h_{w_1}$ and $h_{w_2}$ with the same restriction on $x^n$ have different restrictions of $g_{w_1}$ and $g_{w_2}$; to bound $|G|_{x^n}|$, we must therefore leverage the structural properties of $G$. Our key observation is that, by linearity, a deletion on $w$ is equivalent to a deletion on $x$:
$$\langle w_A, x\rangle = \langle w, x_A\rangle, \quad \text{for any } A \subseteq \{1, \ldots, d\}.$$
Therefore, for any $w$, the restriction of $H_w = \{h_{w'} : w' \in \mathcal{U}_{\mathcal{S}}(w)\} = \{h_{w_A} : A \in \mathcal{S}\}$ to $x^n$ can be uniquely recovered from the values of $h_w$ on $\mathcal{L} \stackrel{\text{def}}{=} \{x_A : x \in \{x_1, \ldots, x_n\},\ A \in \mathcal{S}\}$. Since $|\mathcal{S}| \leq 2^d$, we have $|\mathcal{L}| \leq n 2^d$. Invoking Sauer's lemma again, the number of possible value patterns of functions in $H_f$ on $\mathcal{L}$ is at most $(n 2^d)^d = n^d 2^{d^2}$. Since any $g_w \in G$ is completely determined by $H_w$, this implies $|G|_{x^n}| \leq n^d 2^{d^2}$, so the generalization error is upper bounded by $O\big(\sqrt{d^2 \log n / n} + \sqrt{\log(1/\delta)/n}\big)$. Rewriting $n$ in terms of $\epsilon, \delta$, we arrive at the sample complexity
$$O\Big(\frac{d^2 \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\Big).$$
This proves Theorem 3 for $L = N = 1$.
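The deletion identity $\langle w_A, x\rangle = \langle w, x_A\rangle$ driving the single-neuron analysis is easy to check numerically. The following sketch (ours, with hypothetical names and a small dimension) verifies it over all $2^d$ deletion sets.

```python
import itertools
import numpy as np

# Numerical check of the key observation above: zeroing coordinates of w is
# the same as zeroing them in x, so the perturbed class {h_{w_A}} on the
# sample is determined by the values of h_w on the deleted inputs x_A.

def delete(v, A):
    out = v.copy()
    out[list(A)] = 0.0
    return out

rng = np.random.default_rng(0)
d = 6
w, x = rng.standard_normal(d), rng.standard_normal(d)
for k in range(d + 1):
    for A in itertools.combinations(range(d), k):
        assert np.isclose(delete(w, A) @ x, w @ delete(x, A))
print("<w_A, x> == <w, x_A> holds for all", 2 ** d, "deletion sets A")
```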
**Proof of Theorem 3.** We now provide the complete proof of Theorem 3. By the same argument as for a single neuron, the sample complexity reduces to the Rademacher complexity bound (9), which in turn reduces to an upper bound on the size of $G|_{x^n}$. For any fixed $x^n$, we have
$$G|_{x^n} = \{(g_w(x_1), \ldots, g_w(x_n)) : w \in W\},$$
where $g_w(x) = (g^{\min}_w(x), g^{\max}_w(x))$ with $g^{\min}_w(x) = \inf_{A \in \mathcal{S}} h_{w_A}(x)$ and $g^{\max}_w(x) = \sup_{A \in \mathcal{S}} h_{w_A}(x)$. Note that, for any $w$, the tuple $(g_w(x_1), \ldots, g_w(x_n))$ is completely determined by $V_w = \{(h_{w_A}(x_1), \ldots, h_{w_A}(x_n)) : A \in \mathcal{S}\}$. Therefore:
$$|G|_{x^n}| \leq |\{V_w : w \in W\}|. \tag{10}$$
Our key idea is to recursively construct a sequence of sets $\mathcal{L}_1, \ldots, \mathcal{L}_L$ such that there exists an injection $F : \{V_w : w \in W\} \to \mathcal{L}_L$, i.e., $|\{V_w : w \in W\}| \leq |\mathcal{L}_L|$. To this end, denote by $h^l_w$ the transition function specified by the neurons of layer $l \in [L]$ with weights $w$ (here we assume that each neuron has different inputs). Note that for $l = 1$, $h^1_w$ maps $\mathbb{R}^{d \cdot N} \to \{0,1\}^N$; for $2 \leq l \leq L - 1$, $h^l_w$ maps $\{0,1\}^{N^2} \to \{0,1\}^N$; and for $l = L$, $h^L_w$ maps $\{0,1\}^N \to \{0,1\}$. Let
$$\mathcal{L}_0 = \{(x^1_A, \ldots, x^N_A) : x \in \{x_1, \ldots, x_n\},\ A \in \mathcal{S}\},$$
where $x^i_A$ denotes the deletion applied to $x$ by restricting the deletion type $A$ to the $i$-th neuron of the first layer. Denote by $W_\ell$ the weight space at layer $\ell \in [L]$. We recursively define:

1. $\mathcal{L}_1 = \{(h^1_w(v))_{v \in \mathcal{L}_0} : w \in W_1\}$;
2. Assume that for $\ell \leq L - 1$ we have constructed the set $\mathcal{L}_l$, whose elements $v \in \mathcal{L}_l$ are arrays with coordinates $v[i] \in \{0,1\}^N$. For any $v \in \mathcal{L}_l$, define the expanded array $v' = \{(v[i]^1_A, \ldots, v[i]^N_A) : i \leq \mathrm{len}(v),\ A \in \mathcal{S}\}$, where $v[i]^j_A$ denotes the deletion applied to $v[i]$ by restricting the deletion type $A$ to the $j$-th neuron of layer $l+1$. Let $\mathcal{L}'_l = \{v' : v \in \mathcal{L}_l\}$ and define:
$$\mathcal{L}_{l+1} = \big\{(h^{l+1}_w(v'[i]))_{i \leq \mathrm{len}(v')} : v' \in \mathcal{L}'_l,\ w \in W_{l+1}\big\}. \tag{11}$$

We now observe the following fact about the sets $\mathcal{L}_l$:

**Fact 3.** We have $|\mathcal{L}_1| \leq n^{dN}\, 2^{d^2N^2}$, and for any $\ell \leq L - 1$, $|\mathcal{L}_{l+1}| \leq |\mathcal{L}_l| \cdot n^{N^2}\, 2^{N^3 d + \ell N^4}$.

*Proof.* Note that $|\mathcal{L}_0| \leq n 2^{dN}$, and $\mathcal{L}_1$ is essentially the restriction of all first-layer transition functions to $\mathcal{L}_0$. Since the VC-dimension of linear threshold functions with input dimension $d$ is at most $d$ and there are $N$ neurons in each layer, Sauer's lemma gives $|\mathcal{L}_1| \leq (n 2^{dN})^{dN} \leq n^{dN}\, 2^{d^2N^2}$. To prove the second bound, let $\mathrm{len}_l$ denote the maximum length of the arrays in $\mathcal{L}_l$. We have $\mathrm{len}_1 \leq n 2^{dN}$ and $\mathrm{len}_{l+1} \leq \mathrm{len}_l \cdot 2^{N^2}$; this follows from the construction, since $\mathrm{len}(v') \leq \mathrm{len}(v) \cdot 2^{N^2}$ and there are at most $2^{N^2}$ deletion types on each layer. Therefore $\mathrm{len}_l \leq n\, 2^{dN}\, 2^{l N^2}$. Note that
$$\mathcal{L}_{l+1} \subseteq \bigcup_{v' \in \mathcal{L}'_l} \big\{(h^{l+1}_w(v'[i]))_{i \leq \mathrm{len}(v')} : w \in W_{l+1}\big\},$$
where each set in the union can be viewed as the restriction of all layer-$(l+1)$ transition functions to $v'$. By Sauer's lemma again,
$$|\mathcal{L}_{l+1}| \leq |\mathcal{L}'_l| \cdot \mathrm{len}_{l+1}^{N^2} \leq |\mathcal{L}_l| \cdot n^{N^2}\, 2^{N^3 d + \ell N^4},$$
where we have used the fact that $|\mathcal{L}'_l| = |\mathcal{L}_l|$. ∎

We now prove the main property of our construction:

**Fact 4.** There exists an injection $F : \{V_w : w \in W\} \to \mathcal{L}_L$; i.e., $|\{V_w : w \in W\}| \leq |\mathcal{L}_L|$.

*Proof.* For any weight vector $w$, we show that the set $V_w$ can be reconstructed from an element of $\mathcal{L}_L$ using a universal recovery rule. To see this, define $v^l_w$ to be the element of $\mathcal{L}_l$ that corresponds to $w$ (note that $w$ uniquely identifies an element of $\mathcal{L}_\ell$ by restricting the weights to the first $\ell$ layers; see the construction in (11)). We claim that $v^L_w \in \mathcal{L}_L$ is the desired element reconstructing $V_w$. Let $A \in \mathcal{S}$; we describe a strategy that recovers $(h_{w_A}(x_1), \ldots, h_{w_A}(x_n))$. For any $x_i$ and $l \leq L - 1$, we identify an index $i_l$ into $v^l_w$. Initially, set $i_1$ to be the index in $\mathcal{L}_0$ corresponding to deletion type $A$ and $x_i$. Assuming $i_l$ has been constructed, since $v^{l+1}_w$ is constructed from $v^l_w$ by expanding over the deletion types in $\mathcal{S}$, define $i_{l+1}$ to be the index in $v^{l+1}_w$ corresponding to $v^l_w[i_l]$ and deletion type $A$ (see the definition of $v'$ in our construction). It is easy to verify that $v^L_w[i_L]$ is exactly $h_{w_A}(x_i)$. Note that this recovery rule depends only on the deletion type $A$ and is independent of $w$. Therefore, two distinct sets $V_{w_1}$ and $V_{w_2}$ cannot be recovered from the same element of $\mathcal{L}_L$, and mapping each $V_w$ to an element of $\mathcal{L}_L$ that recovers it gives the desired injection. ∎

Now, by Fact 3, we have $|\mathcal{L}_L| \leq n^{dN + LN^2}\, 2^{dLN^3 + L^2N^4 + d^2N^2}$. Therefore, by Fact 4, (9), and (10), the sample complexity is upper bounded by
$$\tilde{O}\Big(\frac{dLN^3 + L^2N^4 + d^2N^2 + \log(1/\delta)}{\epsilon^2}\Big),$$
where $\tilde{O}$ hides poly-logarithmic factors in $1/\epsilon$. This completes the proof of Theorem 3. ∎
## 4. Conclusion and Discussion

In this paper, we introduced a novel theoretical framework for analyzing fault tolerance in machine learning models, offering tight characterizations of fault-tolerant PAC learnability for both random and adversarial faults. We showed that while random faults do not increase learning complexity, the sample complexity under adversarial faults scales linearly with the size of the fault space. Specifically, for neural networks with deletion faults, we demonstrated that the sample complexity can, surprisingly, be independent of the number of possible deletions, scaling instead with the network parameters; this indicates that certain natural fault-types can be well tolerated. Our framework paves the way for significant future research, including extending the analysis to other activation functions such as ReLU and exploring additional fault types, such as neuron-activation faults or precision errors. This work lays a rigorous foundation for developing more reliable and trustworthy machine learning models for diverse applications.

**Computational aspects.** While our work primarily focuses on statistical sample complexity, it would also be interesting to investigate computationally efficient methods for computing the fault-tolerant ERM rule in (6). A heuristic approach would work as follows: (i) estimate the faulty-loss by making (sampling) oracle calls to the fault-type $\mathcal{U}$, which implements a value-oracle for the faulty-loss; (ii) implement gradients of the faulty-loss w.r.t. $w$ from the value-oracle via, e.g., discrete differences; and (iii) run standard SGD to compute the fault-tolerant ERM (see the sketch below). We believe it would be an interesting future direction to investigate how the structure of the class $G$ (see Proposition 1) affects the convergence rate of such an SGD algorithm.
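A minimal sketch of this heuristic pipeline is given below; the sigmoid unit, the squared loss, the single-coordinate deletion fault-type, and all names are our illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

# Heuristic sketch of steps (i)-(iii): a Monte-Carlo value-oracle for the
# faulty-loss, finite-difference gradients, and SGD.  Toy model: a sigmoid
# unit under squared loss, with a fault that zeroes one random coordinate.

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_faults(w, n_mc, rng):
    """Sampling oracle for the fault-type: delete one random coordinate."""
    for _ in range(n_mc):
        wf = w.copy()
        wf[rng.integers(len(w))] = 0.0
        yield wf

def faulty_loss(w, x, y, n_mc=16, seed=0):
    """(i) Value-oracle: worst loss over sampled faults, approximating the sup."""
    rng = np.random.default_rng(seed)
    return max((sigmoid(wf @ x) - y) ** 2 for wf in sample_faults(w, n_mc, rng))

def grad_fd(w, x, y, eps=1e-3, seed=0):
    """(ii) Central differences; sharing the seed reuses the same fault draws
    at w + e and w - e (common random numbers), keeping the estimate stable."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (faulty_loss(w + e, x, y, seed=seed)
                - faulty_loss(w - e, x, y, seed=seed)) / (2 * eps)
    return g

def fault_tolerant_sgd(data, d, lr=0.5, epochs=10, seed=0):
    """(iii) SGD on the empirical fault-tolerant risk (6)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    for epoch in range(epochs):
        for x, y in data:
            w -= lr * grad_fd(w, x, y, seed=epoch)   # fresh fault draws per epoch
    return w
```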
**Other variations.** Besides the random and adversarial fault scenarios investigated in this paper, many other formulations fit under the concept of fault-tolerant PAC learning. For instance, one may assume that the adversary alters the function only once at the beginning, instead of altering it with each sample as in our setting. In such a scenario, the sample complexity would be controlled by that of $\bigcup_{h \in H} \mathcal{U}(h)$ via a standard uniform convergence argument. We believe investigating such alternative formulations would also be interesting for future research.

**Acknowledgements.** The authors would like to thank the anonymous reviewers for their helpful suggestions. This work was partially supported by the NSF Center for Science of Information (CSoI) Grant CCF-0939370, and also by NSF Grants CCF-2006440, CCF-2007238, and CCF-2211423.

**Impact Statement.** This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Abraham, J. A. and Fuchs, W. K. Fault and error models for VLSI. Proceedings of the IEEE, 74(5):639-654, 1986.

Al Shahrani, A. M., Alomar, M. A., Alqahtani, K. N., Basingab, M. S., Sharma, B., and Rizwan, A. Machine learning-enabled smart industrial automation systems using internet of things. Sensors, 23(1):324, 2022.

Arad, B. S. and El-Amawy, A. On fault tolerant training of feedforward neural networks. Neural Networks, 10(3):539-553, 1997.

Bartlett, P. L. and Maass, W. Vapnik-Chervonenkis dimension of neural nets. The Handbook of Brain Theory and Neural Networks, pp. 1188-1192, 2003.

Chin, C.-T., Mehrotra, K., Mohan, C. K., and Ranka, S. Training techniques to obtain fault-tolerant neural networks. In Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing, pp. 360-369. IEEE, 1994.

Chu, L.-C. and Wah, B. W. Fault tolerant neural networks with hybrid redundancy. In 1990 IJCNN International Joint Conference on Neural Networks, pp. 639-649. IEEE, 1990.

Deng, J., Fang, Y., Du, Z., Wang, Y., Li, H., Temam, O., Ienne, P., Novo, D., Li, X., Chen, Y., et al. Retraining-based timing error mitigation for hardware neural networks. In 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 593-596. IEEE, 2015.

Deodhare, D., Vidyasagar, M., and Keerthi, S. S. Synthesis of fault-tolerant feedforward neural networks using minimax optimization. IEEE Transactions on Neural Networks, 9(5):891-900, 1998.

Dodd, P. E. and Massengill, L. W. Basic mechanisms and modeling of single-event upset in digital microelectronics. IEEE Transactions on Nuclear Science, 50(3):583-602, 2003.

Edwards, P. J. and Murray, A. F. Penalty terms for fault tolerance. In Proceedings of International Conference on Neural Networks (ICNN'97), volume 2, pp. 943-947. IEEE, 1997.

Eghbal, A., Yaghini, P. M., Bagherzadeh, N., and Khayambashi, M. Analytical fault tolerance assessment and metrics for TSV-based 3D network-on-chip. IEEE Transactions on Computers, 64(12):3591-3604, 2015.

Emmerson, M. D. and Damper, R. I. Determining and improving the fault tolerance of multilayer perceptrons in a pattern-recognition application. IEEE Transactions on Neural Networks, 4(5):788-793, 1993.

Hashmi, A., Berry, H., Temam, O., and Lipasti, M. Automatic abstraction and fault tolerance in cortical microarchitectures. ACM SIGARCH Computer Architecture News, 39(3):1-10, 2011.

Hengstler, M., Enkel, E., and Duelli, S. Applied artificial intelligence and trust: the case of autonomous vehicles and medical assistance devices. Technological Forecasting and Social Change, 105:105-120, 2016.

Izzo, D., Märtens, M., and Pan, B. A survey on artificial intelligence trends in spacecraft guidance dynamics and control. Astrodynamics, 3:287-299, 2019.

Kershaw, J., Yu, R., Zhang, Y., and Wang, P. Hybrid machine learning-enabled adaptive welding speed control. Journal of Manufacturing Processes, 71:374-383, 2021.

Khan, F. Safety and integrity management of operations in harsh environments, 2020.

Khunasaraphan, C., Vanapipat, K., and Lursinsap, C. Weight shifting techniques for self-recovery neural networks. IEEE Transactions on Neural Networks, 5(4):651-658, 1994.

Liu, Y., Wei, L., Luo, B., and Xu, Q. Fault injection attack on deep neural network. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 131-138. IEEE, 2017.

Lu, Y. Adaptive Intelligent Systems for Extreme Environments. PhD thesis, University of Essex, 2023.

Ma, Y., Wang, Z., Yang, H., and Yang, L. Artificial intelligence applications in the development of autonomous vehicles: A survey. IEEE/CAA Journal of Automatica Sinica, 7(2):315-329, 2020.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 2512-2530. PMLR, 2019. URL https://proceedings.mlr.press/v99/montasser19a.html.

Nguyen, T.-H., Imran, M., Choi, J., and Yang, J.-S. CRAFT: Criticality-aware fault-tolerance enhancement techniques for emerging memories-based deep neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023.
Phatak, D. S. and Koren, I. Complete and partial fault tolerance of feedforward neural nets. IEEE Transactions on Neural Networks, 6(2):446-456, 1995.

Piuri, V. Analysis of fault tolerance in artificial neural networks. Journal of Parallel and Distributed Computing, 61(1):18-48, 2001.

Protzel, P., Palumbo, D., and Arras, M. Performance and fault-tolerance of neural networks for optimization. IEEE Transactions on Neural Networks, 4(4):600-614, 1993. doi: 10.1109/72.238315.

Rakin, A. S., He, Z., and Fan, D. Bit-flip attack: Crushing neural network with progressive bit search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1211-1220, 2019.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Sosnowski, J. Transient fault tolerance in digital systems. IEEE Micro, 14(1):24-35, 1994.

Syed, R. T., Ulbricht, M., Piotrowski, K., and Krstic, M. A survey on fault-tolerant methodologies for deep neural networks. Pomiary Automatyka Robotyka, 27(2):89-98, 2023.

Tipaldi, M., Feruglio, L., Denis, P., and D'Angelo, G. On applying AI-driven flight data analysis for operational spacecraft model-based diagnostics. Annual Reviews in Control, 49:197-211, 2020.

Torres-Huitzil, C. and Girau, B. Fault and error tolerance in neural networks: A review. IEEE Access, 5:17322-17341, 2017. doi: 10.1109/ACCESS.2017.2742698.

Verma, S., Kaur, S., Garg, S., Sharma, A. K., and Alrashoud, M. AGRIC: Artificial intelligence-based green routing for industrial cyber-physical system pertaining to extreme environment. IEEE Internet of Things Journal, 2023.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Wei, N., Yang, S., and Tong, S. A modified learning algorithm for improving the fault tolerance of BP networks. In Proceedings of International Conference on Neural Networks (ICNN'96), volume 1, pp. 247-252. IEEE, 1996.

Zhou, Z.-H. and Chen, S.-F. Evolving fault-tolerant neural networks. Neural Computing & Applications, 11:156-160, 2003.