# RNNRepair: Automatic RNN Repair via Model-based Analysis

Xiaofei Xie 1 2, Wenbo Guo 3, Lei Ma 4 5 2, Wei Le 6, Jian Wang 1, Lingjun Zhou 7, Xinyu Xing 3, Yang Liu 1

## Abstract

Deep neural networks are vulnerable to adversarial attacks. Due to their black-box nature, it is rather challenging to interpret and properly repair these incorrect behaviors. This paper focuses on interpreting and repairing the incorrect behaviors of Recurrent Neural Networks (RNNs). We propose a lightweight model-based approach (RNNRepair) to help understand and repair incorrect behaviors of an RNN. Specifically, we build an influence model to characterize the stateful and statistical behaviors of an RNN over all the training data and to perform influence analysis for the errors. Compared with existing techniques based on influence functions, our method can efficiently estimate the influence of existing or newly added training samples on a given prediction at both the sample level and the segment level. Our empirical evaluation shows that the proposed influence model is able to extract accurate and understandable features. Based on the influence model, our technique can effectively infer the influential instances not only for an entire testing sequence but also for a segment within that sequence. Moreover, with the sample-level and segment-level influence relations, RNNRepair can further remediate two types of incorrect predictions, at the sample level and the segment level.

*1 Nanyang Technological University, Singapore; 2 Kyushu University, Japan; 3 College of Information Sciences and Technology, The Pennsylvania State University, State College, PA, USA; 4 University of Alberta, Canada; 5 Alberta Machine Intelligence Institute, Canada; 6 Iowa State University, USA; 7 Tianjin University, China. Correspondence to: Xiaofei Xie.*

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

## 1. Introduction

In spite of many state-of-the-art applications and high test accuracy, Deep Neural Networks (DNNs) still make mistakes and output wrong predictions. To fix an incorrect prediction, it is important to understand the root cause of the wrong prediction (Koh & Liang, 2017). Once the root cause is identified, users may fix the errors by removing harmful training data or adding specific data to improve model accuracy (Hara et al., 2019). However, due to the black-box nature of DNNs, it is challenging to identify the most responsible training samples and explain the wrong predictions. As a result, wrong predictions are difficult to correct.

Recently, influence functions have been widely studied for interpreting the predictions of DNNs by estimating the effect of removing training samples (Koh & Liang, 2017; Khanna et al., 2019; Koh et al., 2019; Hara et al., 2019). However, using them to remediate misclassifications or wrong predictions remains challenging. First, existing influence-function-based methods are mostly designed for feed-forward neural networks (FNNs). When applied to Recurrent Neural Networks (RNNs), they usually suffer from vanishing gradients and long-distance dependencies. As a result, existing techniques cannot be easily applied to RNNs. Second, different from FNNs, RNNs often come with stateful structures for processing sequential inputs (e.g., audio, natural language). For a sequential test input, we need to study the effect of its segments more precisely.
For example, in automatic speech recognition, we want to identify which training samples are most responsible for the poor recognition of a specific pronunciation (i.e., a segment). Existing methods mainly perform influence analysis on the whole test input, not at the segment level. Third, to use influence-analysis-based interpretation for repairing a wrong prediction, one needs to select helpful samples from a large number of collected or generated new samples. However, existing influence-analysis-based methods inevitably introduce intensive computation, making the selection of useful samples inefficient. As a result, this greatly limits their potential to be used as a mechanism for repairing the errors of a model.

In this work, we propose a lightweight model-based influence analysis for RNNs, named RNNRepair (code available at https://bitbucket.org/xiaofeixie/rnnrepair). To capture the stateful behaviors of training data, we first construct an influence model from concrete prediction traces of all training data. This model extracts accurate features of inputs from RNNs via state clustering. We then calculate the influential training samples for a segment of an input under a state (i.e., the context of the input). As part of this work, we also demonstrate the utility of the proposed influence analysis in multiple applications, such as understanding the behaviors of the RNN, fixing influential mislabeled data, pinpointing Trojan backdoors in an RNN model, and repairing incorrect predictions.

## 2. Related Work

**Model Extraction.** Existing research has developed various approaches to extract a DFA (Deterministic Finite Automaton) from known RNN architectures. Specifically, early-stage explorations focused on extracting DFAs from second-order RNNs (Omlin & Giles, 1996; Giles et al., 1990; 1992). More recent works extend these preliminary techniques to GRUs and LSTMs, which have higher practicality than second-order RNNs (Weiss et al., 2018; Cho et al., 2014; Chung et al., 2014; Weiss et al., 2019; Okudono et al., 2019; Ayache et al., 2018; Zhang et al., 2021). In terms of the state vector partition strategy, existing techniques mainly follow two different methods: (1) the equipartition-based approach (Omlin & Giles, 1996; Weiss et al., 2018), which divides each dimension of the latent representations into k equal intervals, and (2) the unsupervised-learning-based approach (Zeng et al., 1993; Cechin et al., 2003), which applies existing clustering methods (e.g., K-means, GMM) to cluster the state vectors into different groups. DeepStellar (Du et al., 2019) extracts a discrete-time Markov chain (DTMC) from an RNN; the DTMC is then used for testing and adversarial example detection. As introduced later in Section 3, we follow the second strategy and apply GMM for state partition. Despite using the same partition strategy, our method is fundamentally different from the existing DFA extraction techniques in that none of the existing methods can derive the influence of training samples upon a given testing sample (i.e., influence relations). In addition, to precisely simulate the behaviors of a target RNN, most existing methods need to exhaustively search the state transitions of the target RNN, which limits their scalability in some applications. In contrast, our method only requires a coarse-grained approximation for deriving the influence relations, which is much lighter-weight than the existing model extraction methods.
**Influence Analysis.** Koh and Liang (Koh & Liang, 2017) studied the influence of training samples upon a given testing sample for DNNs. Specifically, they utilized the influence function to identify the most representative training samples for a given testing sample. Following (Koh & Liang, 2017), recent efforts have been made to either enable influence analysis for non-optimal models trained with non-convex losses (Hara et al., 2019) or analyze the influence of a group of training samples upon a given prediction (Koh et al., 2019). Despite deriving meaningful influence relations for feed-forward networks (i.e., MLPs and CNNs), the existing methods might not be effective for RNNs due to the infamous gradient vanishing/explosion problem. In this work, our proposed method does not depend on gradient calculation and can capture the stateful behaviors of the RNN accurately via state clustering. In addition, our method is much lighter-weight, as shown in Section 4. Furthermore, our method can provide a finer-grained influence relation than the existing methods; in other words, we can, at the segment level, pinpoint the training samples most influential to the segments within a testing sample.

**DNN Repair.** Wang et al. offset errors made by logistic regressions by integrating an additional layer into the model to pre-process the error inputs (Wang et al., 2019). Different from this work, our method does not modify the model architecture and targets more complex networks, i.e., RNNs. Yu et al. proposed a style-guided repair for unknown failure patterns in DNNs with a style transfer method (Yu et al., 2020). Some works (Sotoudeh & Thakur, 2019; Zhang & Chan, 2019) propose to repair the model by changing the network weights. Differently, our method focuses on repairing the specific failed samples of an RNN with a model-based influence analysis. It should be noted that our method is different from the techniques on adversarial defenses (Boopathy et al., 2019; Weng et al., 2018; Singh et al., 2018) and noisy learning/data cleaning (Zhang et al., 2018) in that these techniques aim to improve robustness against adversarial attacks or data poisoning attacks, whereas our remediation mechanism locally offsets testing errors of an RNN.

## 3. Approach

**Overview.** At a high level, we first adopt clustering to capture the stateful behaviors of all training data. Based on the state transitions, we then identify the influential training samples of a given testing input or a segment of the test input. Specifically, we first extract the abstract states by grouping the state vectors (i.e., hidden representations of the RNN) of all training data (Section 3.1). Then, based on the state abstraction, we extract the trace of a given input, construct the influence function from the transitions in the traces, and perform the segment-level and sample-level influence analysis (Section 3.2). Last but not least, based on the influence analysis, we develop a remediation mechanism to analyze and offset the test errors (Section 3.3).

### 3.1. Semantic-guided State Abstraction

**Definition 1 (RNN)** An RNN is defined as a 5-tuple $R = (G_R, d, m, h_0, Y_R)$: $G_R$ is a recursive function $h_t = G_R(x_t, h_{t-1})$, where $h_t \in \mathbb{R}^d$ is a $d$-dimensional state vector and $x_t \in \mathbb{R}^m$ is the $m$-dimensional input vector at time $t$; $d$ and $m$ are the dimensions of the state vector and the input vector, respectively; $h_0 \in \mathbb{R}^d$ is the initial state; the output function $Y_R : \mathbb{R}^d \to \mathbb{R}$ maps an internal state vector to the output value.
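To ground Definition 1, the following is a minimal sketch assuming a plain tanh recurrence with a linear classifier head; the class, weights, and names here are illustrative assumptions for exposition, not the paper's implementation.

```python
# A minimal sketch of Definition 1: R = (G_R, d, m, h0, Y_R), assuming a
# plain tanh cell; weights are random placeholders, not a trained model.
import numpy as np

class SimpleRNN:
    def __init__(self, d, m, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(size=(d, d)) * 0.1    # state-to-state weights
        self.W_x = rng.normal(size=(d, m)) * 0.1    # input-to-state weights
        self.W_y = rng.normal(size=(n_classes, d))  # output head for Y_R
        self.h0 = np.zeros(d)                       # initial state h_0

    def G(self, x_t, h_prev):
        """The recursive function h_t = G_R(x_t, h_{t-1})."""
        return np.tanh(self.W_h @ h_prev + self.W_x @ x_t)

    def states(self, x):
        """G_R^x: the state-vector sequence (h_0, ..., h_n) for input x."""
        hs = [self.h0]
        for x_t in x:
            hs.append(self.G(x_t, hs[-1]))
        return hs

    def Y(self, h):
        """Y_R^n: map a state vector to a class label in {0, ..., n-1}."""
        return int(np.argmax(self.W_y @ h))
```

Feeding an MNIST image row by row, `states` yields the sequence $(h_0, \ldots, h_{28})$ from which the abstraction described next is built.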
Given a sequential input $x = (x_1, \ldots, x_n)$, an RNN generates a sequence of state vectors $(h_0, h_1, \ldots, h_n)$ by repeated application of $G_R$. To simplify the notation, we use $G^x_R$ to denote the state vector sequence of the input $x$. $Y_R$ calculates different types of outputs depending on the application. In this paper, we mainly focus on the classification problem, where the output function maps each state vector to a specific class (i.e., $Y^n_R : \mathbb{R}^d \to \{0, \ldots, n-1\}$, where $n$ is the total number of classes). Specifically, for the sequence classification problem (e.g., sentiment analysis), the output of the last state vector (i.e., $Y^n_R(h_n)$) is the classification result of the whole input sequence. As for the sequence-to-sequence problem, such as speech recognition, $Y_R$ transforms each state vector $h_i$ into a character/word in the target language, and all the output characters/words form the translated sentence.

It should be noted that, different from feed-forward neural networks, an RNN takes an input sequentially. That is, at each time $i$, the RNN only processes the current segment $x_i$ of the input $x = (x_1, \ldots, x_n)$. As such, the influence analysis of an RNN should identify not only the training samples most responsible for the whole sample $x$ (i.e., sample-level influence analysis), but also the training samples most influential to a segment $x_i$ of the input (i.e., segment-level influence analysis).

**Definition 2 (Influence Model)** Given an RNN $R$ and its training data set $T$, the influence model is a 4-tuple $A = (Q, \Sigma, q_0, I)$, where $Q$ is a finite set of states, $\Sigma$ is the alphabet, $q_0$ is the initial state, and $I : Q \times \Sigma \to \mathcal{P}(T)$ is the influence function, where $\mathcal{P}(T)$ is the power set of $T$.

The influence function identifies the training samples that contribute most to the RNN's prediction on a specific input. For example, $I(q, x_i) = T'$ represents that the training samples $T' \subseteq T$ have a larger influence on the prediction of the input $x_i$ under the state $q$. If the input of the RNN is discrete data (e.g., a text $x$), the words of the text (e.g., $x_i$) can be the symbols of the alphabet. If the input is continuous data (e.g., a sequence of pixels $x_i$ in an image $x$), we can perform an input abstraction that maps $x_i$ to an abstract input $\hat{x}_i$ and treat $\hat{x}_i$ as a symbol of the alphabet.

To help identify influential training samples, the influence model should capture the statistical behaviors of the RNN over all the training data. As such, we first feed all the training samples in $T$ into the RNN and collect all the state vectors, denoted as $SV = \{h \mid x \in T,\, h \in G^x_R\} \cup \{h_0\}$. Then, a partitioning function $p : \mathbb{R}^d \to \mathbb{N}$ is applied to group similar state vectors into one abstract state, which serves as a state of the influence model (i.e., $Q = \{p(h) \mid h \in SV\}$). The initial state is $q_0 = p(h_0)$. We assume that the abstract states reflect different behaviors of the RNN. Different from existing research that extracts automata to mimic the prediction of an RNN (Weiss et al., 2018; 2019), our influence model aims to capture the RNN's internal behaviors for the subsequent influence analysis.

To efficiently represent a large number of state vectors, we use Gaussian Mixture Models (GMM), an unsupervised clustering method, to group the state vectors. The unsupervised clustering requires a pre-specified cluster number $K$, which directly decides the number of abstract states and thus affects the accuracy of the influence analysis.
However, since there is no explicit ground truth for measuring the correctness of a partition result for influence analysis, it is challenging to find a proper $K$ through cross-validation. To tackle this challenge, we propose a semantic-guided strategy to select an accurate $K$. The key insight is that the state vectors in one group should have similar semantics or, in other words, the RNN should produce a similar output or prediction for the vectors in the same group. Based on this insight, we propose a metric to evaluate the partition result and select $K$ based on this metric. Below, we introduce the confidence score, the metric developed to measure the semantics of an abstract state, followed by the selection strategy.

**Definition 3 (Confidence Scores)** Given an RNN classifier $R = (G_R, d, m, h_0, Y_R)$ and a partition result $Q$, the confidence score of each state $q \in Q$ is defined as $C_q = [c_0, \ldots, c_{n-1}]$, where

$$c_i = \frac{|\{h \mid h \in SV_q \wedge Y^n_R(h) = i\}|}{|SV_q|}.$$

$SV_q$ is the set of state vectors of training samples in $T$ that are clustered into the state $q$, and $n$ is the total number of classes of the classifier.

Intuitively, $C_q$ shows the distribution of the output classes of the state vectors in the state $q$: $c_i$ is the ratio of state vectors in the abstract state $q$ that are predicted as class $i$ by the RNN. A high $c_i$ indicates that most of the state vectors clustered into one abstract state share similar semantics. Given the confidence score, we further define the state stability as follows:

**Definition 4 (State Stability)** For a state $q$ with confidence score $C_q = [c_0, \ldots, c_{n-1}]$, the state is defined as $\delta$-stable, where $\delta = \max(C_q)$.

The state stability is measured by the concentration of the output classes of the state vectors in the abstract state: a high value of $\delta$ indicates a well-clustered state with regard to the concentration of output classes, i.e., most state vectors in the corresponding abstract state are predicted as the same label $i = \arg\max_{0 \le i < n} c_i$. Based on the state stability, we select the smallest $K$ whose partition result is sufficiently stable:

$$K = \min\{k \mid \bar{\delta} \ge \theta\}, \quad \text{where } \bar{\delta} = \mathrm{avg}(\{\delta_{q_1}, \ldots, \delta_{q_n}\}). \tag{2}$$

$\bar{\delta}$ is the average stability of all states, which represents the stability of a partition result, and $\theta$ is the target threshold of the clustering refinement. A higher $\theta$ gives a more stable partition result. Given a pre-specified $\theta$, we increase the cluster size $K$, starting from 1, and terminate the increment once $\bar{\delta}$ reaches the threshold $\theta$.
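As a concrete sketch of this selection strategy (Definitions 3 and 4 together with Eq. 2), assuming all training-time hidden states have been stacked into an array `state_vectors` with their per-state RNN labels in `labels`; the function names and the `max_k` cap are illustrative assumptions, not part of the paper.

```python
# A minimal sketch of the semantic-guided state abstraction.
import numpy as np
from sklearn.mixture import GaussianMixture

def confidence_scores(assignments, labels, k, n_classes):
    """Per-cluster distribution of RNN output classes (Definition 3)."""
    scores = np.zeros((k, n_classes))
    for q in range(k):
        members = labels[assignments == q]
        if len(members) > 0:
            scores[q] = np.bincount(members, minlength=n_classes) / len(members)
    return scores

def select_partition(state_vectors, labels, n_classes, theta=0.9, max_k=64):
    """Grow K from 1 until the average state stability reaches theta (Eq. 2)."""
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(state_vectors)
        assignments = gmm.predict(state_vectors)
        cs = confidence_scores(assignments, labels, k, n_classes)
        stability = cs.max(axis=1).mean()   # avg of delta_q = max(C_q)
        if stability >= theta:
            break
    return gmm, cs
```

In practice one might grow $K$ in larger steps, since refitting a GMM for every candidate $K$ is costly.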
As shown later in Section 4, this selection strategy can guide the clustering toward extracting accurate features.

[Figure 1: The prediction process of an image and the corresponding abstract states. The row "Prediction" shows the prediction results, including the label (i.e., $Y^n_R(h_i)$) and its probability. The confidence scores of the predicted labels in each abstract state are highlighted (the red values).]

Figure 1 shows the prediction process of an image. At each time step, the RNN reads one row of the image and outputs the hidden state. The sequence of abstract states, together with the confidence scores, is also shown: in each abstract state, the first column lists the labels and the second column the confidence scores, sorted in descending order for convenience. We can observe that: 1) except at time 11, every prediction output corresponds to the largest confidence score in its abstract state; 2) since the prediction confidence (i.e., the probability) of the RNN is usually low when it has seen only part of the input, the RNN enters non-stable states (with low confidence scores) early on. For example, from the human perspective, it is hard to say whether the images at times 13 and 15 are a "3".

### 3.2. Light-Weight Influence Analysis

**Definition 5 (Trace)** Given an input $x = (x_1, \ldots, x_n)$, the trace of $x$ is $(q_0, x_1, q_1, \ldots, x_n, q_n)$, obtained from the state vector sequence $G^x_R = (h_0, \ldots, h_n)$, where $q_i = p(h_i)$ and $p$ is the partitioning function.

For an input $x$, we extract a trace that represents its state vector sequence. Based on the abstract states constructed above, we build the transitions as well as the influence function for the influence analysis. Specifically, given the trace $(q_0, x_1, q_1, \ldots, x_n, q_n)$ of each training sample $x \in T$, the influence function $I$ is updated as follows:

$$\forall\, 0 < i \le n, \quad I(q_{i-1}, x_i) = I(q_{i-1}, x_i) \cup \{x\}. \tag{3}$$

In this way, the influence function $I$ captures the effect of training samples at each abstract state. After updating the influence function with the state vectors of all the training samples, we perform the influence analysis for a segment $x_i$ of a given test input (i.e., segment-level influence analysis) or for the entire testing sequence $x$ (i.e., sample-level influence analysis) as follows.

**Segment-level Influence Analysis.** Given a segment $x_i$ in the trace $(q_0, x_1, q_1, \ldots, x_n, q_n)$, we identify the influential training samples of $x_i$ as $I(q_{i-1}, x_i)$. This set contains the training samples that have the same segment $x_i$ at the state $q_{i-1}$ and are thus accountable for the prediction on $x_i$. Note that other training samples, which could also include $x_i$ at states other than $q_{i-1}$, may have little or no influence upon the prediction of $x_i$ and thus are not taken as influential samples of $x_i$. Taking text as an example, many sentences may contain the same word "point" but with totally different semantics, e.g., "The pencil has a sharp point" and "It is not polite to point at people". These training sentences may have very different influences depending on the test input. Our segment-level influence analysis is designed to distinguish such differences and identify only the training samples that are truly influential to a testing segment.

[Figure 2: (a) The overview of the fault localization and repair; the red circle represents the failed input. (b) Four examples of data generation for repairing, where each group contains four images (i.e., $x$, $T_{m_x}$, $T_{t_x}$, $r_x$).]

**Sample-level Influence Analysis.** To quantify the influence of training samples upon an entire testing sequence $x$, we define the temporal feature as follows:
**Definition 6 (Temporal Feature)** Given an RNN $R$, an input $x = (x_1, \ldots, x_n)$, and its trace $(q_0, x_1, q_1, \ldots, x_n, q_n)$, the temporal feature is defined as $F_x = (f_0, \ldots, f_n)$, where $f_i = (ID(q_i), C_{q_i}, Y^n_R(h_i))$. Here $q_i = p(h_i)$ is the abstract state to which $x_i$ belongs, $ID(q_i)$ is the unique identifier of the state $q_i$, $C_{q_i}$ represents the confidence scores (see Definition 3), and $Y^n_R(h_i)$ is the predicted label at time $i$.

With the definition of the temporal feature, we quantify the influence of a training sample on a test input. Specifically, given a training sample $x_{train}$ and a test sample $x_{test}$, the influence is quantified as the similarity between their temporal features:

$$\mathrm{inflscore}(x_{train}, x_{test}) = \mathrm{similarity}(F_{x_{train}}, F_{x_{test}}). \tag{4}$$

The higher the similarity, the higher the influence of the training sample $x_{train}$ upon $x_{test}$. In other words, due to the high influence of the training sample $x_{train}$, the prediction of $x_{test}$ is very similar to that of $x_{train}$. Note that different similarity metrics can be selected for different applications. For example, the $l_p$-norm distance can be used for fixed-length input sequences (e.g., images), while for inputs with varying lengths (e.g., natural language texts), one could select the Jaccard distance.

Considering Figure 1 again, the temporal feature of the input explored by the RNN is shown in the third row. Intuitively, the feature is aligned with human perception. For example, at time 9, the predicted label is "7" and the current input looks like the start of a 7, but the confidence score of "7" in the abstract state is not high (0.35). As more of the input is read, it looks like 3, 2, 3, 0, 9, and 8 in turn. At time 17, it really looks like a 0 and the confidence is higher (0.645); it is still not very high because this "0" is not similar to the zeros in the training data. At last, the RNN predicts "8" with very high confidence.

### 3.3. Fault Localization and Remediation

With the influence analysis method introduced above, we develop a remediation mechanism to repair misclassifications of the target RNN. Specifically, we focus on two kinds of misclassification: 1) misclassification caused by a whole input instance, and 2) misclassification caused by an input segment. In the following, we elaborate on our mechanism for repairing these two types of errors.

#### 3.3.1. Remediation with Sample-level Influence

To repair the first type of error, we first identify the responsible training samples. Then, we randomly generate new samples by manipulating the identified ones and apply the influence analysis to filter out the error-triggering training samples. Finally, we retrain the target RNN with the newly generated samples.

**Fault Localization.** Let $x$ be an input misclassified as $m_x$ with the ground truth label $t_x$, i.e., $t_x \neq m_x$. By applying the sample-level influence analysis, we identify the top-$n$ training samples (denoted as $\phi^x_n$) that are most responsible for the misclassification of $x$. We use $T_{t_x}$ and $T_{m_x}$ to denote the training samples in $\phi^x_n$ whose ground truth labels are $t_x$ and $m_x$, respectively. Our empirical study shows that $T_{m_x}$ contains many more training samples than $T_{t_x}$, i.e., the overall influence of $T_{m_x}$ is higher than that of $T_{t_x}$ (more detailed results can be found in the supplementary material). This observation explains why $x$ is classified as $m_x$: the training samples in $T_{m_x}$ have a higher influence upon $x$ than those in $T_{t_x}$. The left sub-figure in Figure 2(a) shows an example of the fault localization, discussed below.
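Before turning to the figure, here is a sketch of this fault-localization step built on Eq. (4), assuming fixed-length inputs (e.g., MNIST rows) so that temporal features can be compared with an $l_2$ distance; for variable-length text, the paper suggests a Jaccard-style similarity instead. The names are illustrative assumptions.

```python
# A sketch of sample-level influence scoring and top-n fault localization.
import numpy as np

def temporal_feature(gmm, conf_scores, hidden_states, rnn_labels):
    """F_x: concatenation of f_i = (ID(q_i), C_{q_i}, label_i) over time."""
    state_ids = gmm.predict(hidden_states)   # ID(q_i) at each time step
    cs = conf_scores[state_ids]              # C_{q_i}, shape (n, n_classes)
    return np.concatenate([state_ids, cs.ravel(), rnn_labels])

def inflscore(f_train, f_test):
    """Eq. (4), instantiated here with negated l2 distance as similarity."""
    return -np.linalg.norm(f_train - f_test)

def top_n_influential(f_test, train_features, n=10):
    """phi_n^x: indices of the n training samples most responsible for x."""
    scores = np.array([inflscore(f, f_test) for f in train_features])
    return np.argsort(-scores)[:n]
```

Partitioning the returned indices by their ground-truth labels then yields the sets $T_{t_x}$ and $T_{m_x}$ used in the fault localization.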
In the left sub-figure of Figure 2(a), the red circle is a test input, which is more influenced by the green triangles (i.e., higher similarity) than by the other circles. As a result, it is misclassified as a triangle.

**Remediation.** To repair the misclassification, we synthesize new samples whose true labels are $t_x$ but that have a higher influence on $x$ than the existing training samples. The right sub-figure in Figure 2(a) shows the basic idea of our remediation method: we intend to generate new samples (e.g., blue circles) that are more influential than the green triangles. By retraining the model with these synthesized samples, the decision boundary can be fine-tuned such that the misclassified input is corrected. For a failed input $x$, the samples used for retraining are:

$$r_x = \{x' \mid x' \in X' \wedge t_{x'} = t_x \wedge \mathrm{inflscore}(x', x) > \max(\{\mathrm{inflscore}(x'', x) \mid x'' \in T_{m_x}\})\},$$

where $X'$ is a set of generated inputs whose true labels are the same as that of $x$. The candidate set $X'$ can be generated by multiple techniques (e.g., random augmentation, generative adversarial networks). In this work, we synthesize new samples (i.e., $X'$) through data augmentation:

$$X' = \{x' \mid x' = aug(x'') \wedge x'' \in T_{t_x}\}, \tag{5}$$

where $aug$ is a data augmentation technique (e.g., image rotation and shearing). Note that, during the remediation, we do not perform the augmentation on the failed input $x$. Instead, we apply random augmentations to the training samples in $T_{t_x}$, which already have a strong influence upon $x$. Manipulating these samples is more likely to generate highly influential samples that are more useful for remediation than perturbing other samples. Figure 2(b) shows some examples of $x$, $T_{m_x}$, $T_{t_x}$, and $r_x$. From the perspective of human perception, in each of the four groups, the second image (i.e., $T_{m_x}$) looks very similar to the failed input (i.e., $x$); moreover, after manipulating the third image (i.e., $T_{t_x}$), we obtain a more influential sample (i.e., $r_x$).

#### 3.3.2. Remediation with Segment-level Influence Analysis

Similar to repairing the sample-level errors, we follow a three-step procedure to repair the second type of error, which is caused by segments rarely seen in the training data. Differently, we design the following method to identify the root-cause input segment rather than whole input samples.

**Fault Localization.** Given an input $x = (x_1, \ldots, x_n)$ as well as its trace $(q_0, x_1, q_1, \ldots, x_n, q_n)$, we identify the segments of the input that are more likely to be the root cause of the misclassification as:

$$S = \{x_i \mid 1 \le i \le n \wedge |I(q_{i-1}, x_i)| < \gamma\},$$

where $\gamma$ is a pre-defined parameter. Intuitively, if a segment $x_i$ has few influential training samples (i.e., fewer than $\gamma$), indicating that $x_i$ is rarely seen under the state $q_{i-1}$ during training, it is more likely to cause an incorrect prediction. For example, consider one failed input in the sentiment analysis task (misclassified as negative):

Just(1,43) → noticed(1,11) → who(1,19) → gave(1,5) → that(1,89) → lulz(0,0)

The prediction result and the number of influential training samples are shown after each word. For example, after reading "Just", the "1" indicates that the RNN outputs positive, and "43" indicates that "Just" appears 43 times after the initial state in the training samples (i.e., $|I(q_0, \text{Just})| = 43$). We observe that, after the word "lulz", the RNN returns negative (i.e., 0) because "lulz" never appeared after the preceding state during training, which causes the incorrect prediction.
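A sketch of the influence-function construction (Eq. 3) and this segment-level fault localization, assuming discrete input symbols (words) and traces precomputed as lists of $(q_{i-1}, x_i)$ pairs; the names are illustrative assumptions, not the paper's code.

```python
# Build I(q_{i-1}, x_i) over all training traces, then find rare segments.
from collections import defaultdict

def build_influence_function(train_traces):
    """Eq. (3): I(q_{i-1}, x_i) accumulates every training sample whose
    trace emits symbol x_i while in abstract state q_{i-1}."""
    I = defaultdict(set)
    for sample_id, trace in train_traces.items():
        for q_prev, symbol in trace:
            I[(q_prev, symbol)].add(sample_id)
    return I

def localize_rare_segments(I, test_trace, gamma=5):
    """S: positions whose segment was seen fewer than gamma times under its
    preceding state, i.e., the likely root causes of the misclassification."""
    return [(i, symbol) for i, (q_prev, symbol) in enumerate(test_trace)
            if len(I[(q_prev, symbol)]) < gamma]
```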
**Remediation.** To repair the misclassification, we need to insert such segments into the influential training samples so that the missing knowledge (i.e., the appearance of $x_i$ under the state $q_{i-1}$) can be learned. Specifically, for a localized segment $x_i \in S$, we conduct the remediation with the following steps:

- We randomly select $m$ training samples $X_m$ from $I(q_{i-1}, x_i)$, where $\forall x' \in X_m$, $t_x = t_{x'}$.
- For each selected training sample $x' \in X_m$, we insert $x_i$ into the corresponding position (i.e., after the state $q_{i-1}$). Our assumption is that the insertion of $x_i$ does not change the true label of $x'$, since the selected training sample $x'$ has the same true label as the failed input $x$ (i.e., $t_x = t_{x'}$).
- Finally, we obtain a set of augmented training samples and retrain the model to repair the misclassification on $x$.

## 4. Evaluation

In our experiments, we evaluated the correctness of the temporal features (Section 4.1), the effectiveness of our influence analysis (Section 4.2), and the effectiveness of the repair (Sections 4.3 and 4.4). More evaluation can be found in the supplementary material.

**Datasets and Models.** We selected two widely used public datasets (i.e., MNIST and Toxic) to evaluate the influence analysis. MNIST (LeCun & Cortes, 1998) is used for evaluating the sample-level influence analysis by comparison with the existing baselines; we train an LSTM network with hidden size 100 for this task, where at each time step the RNN reads one row (i.e., 28 pixels) of the image. The Toxic Comment Dataset (abbrev. Toxic, https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) is used for evaluating the segment-level influence analysis; the task is to classify whether a comment is toxic or not, and we train a GRU network with hidden size 300. In addition, we introduce the Stanford Sentiment Treebank (SST) (Socher et al., 2013) for the segment-level repair, on which an LSTM network with hidden size 300 is trained.

### 4.1. The Correctness of Temporal Features

**Setting.** The accuracy of the influence model directly affects the influence analysis. As such, we evaluate the accuracy of the influence model by measuring the fidelity of the temporal features extracted via the state clustering. We trained a simple linear classifier (denoted as Sim NN) on the different components of the temporal features (see Definition 6) extracted from the training samples and compared its performance with that of the original RNN. We repeated the experiment 10 times and report the average results in Table 1, where column Ori shows the test accuracy of the original RNN while the other columns use the corresponding temporal features as input to the Sim NN. Note that R_L denotes the predicted labels at each time step, i.e., we train the Sim NN with the sequence of predicted labels $Y^n_R(h_i)$.

Table 1: Results of feature analysis (%)

| | Ori | R_L | ID | CSs | (ID, R_L) | (ID, R_L, CSs) |
|---|---|---|---|---|---|---|
| MNIST | 85.61 | 80.01 | 92.35 | 97.34 | 97.50 | 98.45 |
| TOXIC | 86.62 | 63.00 | 87.81 | 88.90 | 89.04 | 92.08 |

We can observe that, using only ID or R_L, the Sim NN achieves a lower accuracy than combining them, on both datasets. With only the confidence scores, the test accuracy reaches 97.34% and 88.90%, much higher than using only ID, which indicates that our semantic-based abstraction captures more information than the clustering ID alone. Finally, models trained with the full temporal features achieve the performance most comparable to the original RNN, which indicates the fidelity of the extracted features.
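As a sketch of this fidelity check, assuming the temporal features have been flattened into fixed-length vectors; scikit-learn's logistic regression stands in here for the paper's simple linear classifier (Sim NN), which is an assumption on our part.

```python
# Fidelity probe: how well do extracted temporal features predict labels?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def feature_fidelity(train_feats, train_labels, test_feats, test_labels):
    """Test accuracy of a linear probe trained on the extracted features."""
    probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return accuracy_score(test_labels, probe.predict(test_feats))
```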
It is worth mentioning that CSs and ID have a one-to-one relation, i.e., there is a mapping from ID to CSs. However, their results in Table 1 are very different: the performance of ID is much lower than that of CSs. One may wonder whether the one-layer linear model is simply too weak to learn from the ID feature. We therefore conducted another experiment evaluating CSs and ID with more complicated DNNs (i.e., multi-layer perceptrons with 1/2/3 hidden layers). The results in Table 2 show a similar trend, i.e., CSs achieves better results than ID.

Table 2: Results of CSs and ID with different MLPs

| | ID (MNIST) | CSs (MNIST) | ID (TOXIC) | CSs (TOXIC) |
|---|---|---|---|---|
| MLP-1 | 93.07% | 97.25% | 63.56% | 91.46% |
| MLP-2 | 93.91% | 97.14% | 63.61% | 91.60% |
| MLP-3 | 91.48% | 97.27% | 62.04% | 91.51% |

We further conducted an experiment that attempts to reconstruct the image from the extracted feature. In particular, we constructed a generative adversarial network (GAN) to generate images from the given features. We found that, in most cases, our method is able to recover perceptually similar images from the extracted features. The detailed settings and results are shown in the supplementary material.

### 4.2. Sample-level Influence Analysis for Identifying Influential Mislabeled Training Data

**Setting.** Similar to the configuration in (Koh & Liang, 2017; Khanna et al., 2019), we randomly mislabeled some training samples and then identified these mislabeled samples with the influence analysis. Specifically, we took the subset of MNIST containing all images of digits 1 and 7. We then randomly selected 30% of the images of 7 in the training set, flipped their labels to 1, and trained a binary classifier. We ranked the training samples by their influence on the test errors of the classifier, and measured both the number of mislabeled samples identified within a given number of training samples (selected in influence order) and the number of errors repaired by fixing the identified mislabeled samples. Two state-of-the-art techniques, K&L (Koh & Liang, 2017) and SGD (Hara et al., 2019), as well as a random strategy, are selected as comparison baselines.

Fig. 3(a) shows the results of identifying flips by checking the labels of training samples, following the order of the influence-based prioritization. The horizontal axis represents how many training samples are selected, and the vertical axis how many flips are identified among them. Overall, the SGD method identifies flips more quickly: with its gradient-based estimation, it can identify even those training samples that have only a small influence on the loss. However, not all flipped/mislabeled training samples are responsible for the test errors. We found that, although many training samples are mislabeled (i.e., from 7 to 1), most of them are still predicted as 7 after training. Intuitively, such mislabeled samples may have low influence on the errors because they can still be predicted correctly. We therefore consider the mislabeled samples predicted as 1 after training as influential flips. In Fig. 3(b), the vertical axis represents how many influential flips are identified in the selected training samples. The results show that our method and K&L identify more influential flips than the other two approaches. Fig. 3(c) shows the repair results obtained by fixing all flips in the selected training samples. The results further confirm that the influential flips have more influence on the errors and that our method identifies them effectively.
However, although SGD identified more flips at an early stage (see Fig. 3(a)), many of them may have a lower influence on the errors (Fig. 3(b) and Fig. 3(c)).

[Figure 3: Comparison on repairing errors by identifying influential mislabeled samples over 10 runs. (a) Fixing all mislabels; (b) fixing influential mislabels; (c) repairing errors.]

**Performance.** The average running time of the model extraction is 76.37s, which is a one-time cost. Once the influence model is constructed, our influence analysis is very efficient and takes much less time (an average of 1.16s over all errors) than the existing methods (70.13s for K&L and 5690.66s for SGD), indicating that our influence analysis tends to be more scalable than existing techniques.

### 4.3. RNN Repair via Sample-level Influence Analysis

**Setting.** We used the MNIST dataset in this experiment. To filter out errors caused by randomness, we only selected misclassified samples that occur frequently across multiple training runs. Specifically, we trained seven models with different numbers of epochs and found 23 commonly failed inputs. Then, we applied random rotation and translation (Engstrom et al., 2019) to generate augmented data (see Eq. 5) and identified the most influential sample among the generated images for each error. In this way, we obtained 161 new images and added them to the training set. Using different training epochs, we trained 10 models with the original and the augmented training set, respectively. To further rule out randomness, we repeated this process 5 times, obtaining 50 models each from the original and the augmented training set, and compared the accuracy of these models on the 23 failed inputs. As a baseline, we used a random strategy, i.e., randomly selecting 161 images from the synthesized images without the influence guidance.

Table 3: Results of Repairing Erroneous Behavior on MNIST

| | #Faults | #Avg Fixed | 0 | (0, 0.1] | (0.1, 0.2] | (0.2, 0.5] | (0.5, 0.7] | (0.7, 1) | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Ori_Train | 23 | 1.3 (5.7%) | 8 | 10 | 5 | 0 | 0 | 0 | 0 |
| Rand_Train | 23 | 4.3 (18.7%) | 4 | 9 | 3 | 3 | 3 | 1 | 0 |
| RNNRepair_Train | 23 | 11.7 (50.9%) | 0 | 3 | 4 | 3 | 7 | 1 | 5 |

Table 3 summarizes the comparison among the models trained with the original training data (row Ori_Train), the training set augmented with randomly selected samples (row Rand_Train), and the training set including the samples selected by our method (row RNNRepair_Train). Column #Faults lists the number of errors to be repaired. Column #Avg Fixed shows the average number of errors correctly repaired by the 50 models. The remaining columns give the distribution of errors within different repair success-rate intervals, where the success rate of an error is the percentage of the (50) models that correct it. The results show that our method effectively repairs 50.9% of the errors by adding only 161 new training samples, while these errors are rarely predicted correctly with the original training set (only 5.7%) or with the randomly selected data (18.7%). In addition, the repair success rates under the original training set and the randomly selected data are extremely low (i.e., mostly within 0 to 0.2). In contrast, our method performs much better: it corrects errors that were consistently misclassified (e.g., the 8 and 4 errors with zero success rate under Ori_Train and Rand_Train) and obtains higher success rates overall.

### 4.4. RNN Repair via Segment-level Influence Analysis

**Setting.**
We used the Toxic dataset and SST to evaluate the segment-level repair. Specifically, we focus on the errors caused by segments, i.e., positive cases predicted as negative, rather than negative-to-positive errors, which are usually caused by wrong semantics of the whole sentence. Here are two examples:

- positive-to-negative: "Who the heck is Ramona anyway ? ? ? ?"
- negative-to-positive: "There are rumors that Boss Ross was gay , are there any proof to these claims ? People , wake up ... I will state here then that she is very pretty"

For the positive-to-negative case, we highlight the word (i.e., "heck") that causes the misclassification: after this word, the prediction of the RNN becomes negative. The negative-to-positive case is always predicted as positive during the RNN's processing; we observe that even humans find it hard to judge, and the key reason could be that there is no single clear word that definitely makes it negative. Hence, it is classified as positive. In addition, some positive-to-negative errors are caused by unsupported embeddings (i.e., the word is embedded as 0), and we ignored such errors. Finally, we selected 23 and 115 positive-to-negative test inputs that are misclassified on Toxic and SST, respectively. For each test case, we set the parameter $\gamma$ (see Section 3.3.2) to 5. To repair such errors, we insert the identified words into some positive sentences in the training data. As a baseline, we use a random strategy that selects the same number of sentences for the insertion. Finally, we use the augmented training data for training with 40 epochs (the same as the original model). To mitigate randomness, we repeat the experiments with 10 seeds.

Table 4: Results of Repairing on Toxic and SST

| | m = 5 | m = 15 | m = 25 | m = 35 | m = 45 |
|---|---|---|---|---|---|
| Toxic: Random | 43.63% | 63.18% | 65.91% | 66.36% | 61.36% |
| Toxic: RNNRepair | 50% | 65.64% | 72.73% | 81.82% | 81.82% |
| SST: Random | 26.09% | 21.74% | 47.83% | 47.83% | 60.86% |
| SST: RNNRepair | 30.43% | 52.17% | 60.87% | 65.22% | 65.22% |

Table 4 shows the results of the segment-based repair. The columns correspond to the number of training samples selected for insertion: we select 5, 15, 25, 35, and 45 training samples (i.e., $m$ in Section 3.3.2) for the augmentation, respectively. The Random and RNNRepair rows report the average success rate of repairing the erroneous cases. We can see that, as the number of training samples increases, the success rate also increases. With random insertion, some errors can be repaired; however, with the segment-level influence analysis, we find more influential samples and achieve better results.

## 5. Conclusion

This paper presented a novel model-based technique for influence analysis of RNNs. Different from existing techniques that estimate loss changes, our method is less computationally intensive and more efficient. It identifies the training samples most influential on given test inputs at both the segment level and the sample level. Based on our RNN influence analysis, we further proposed a method for repairing two types of misclassified samples of an RNN. We showed that our techniques are effective in identifying important mislabeled training samples and in repairing RNNs. In future work, we plan to improve the GMM-based partitioning with more fine-grained refinement. We also consider introducing more diverse types of data augmentation techniques (e.g., GAN, morphing) to generate candidate data for repairing.
Finally, we plan to extend our fault localization and repair to more types of errors, such as the negative-to-positive cases.

## Acknowledgments

This research is partially supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG2-RP-2020-019); the National Research Foundation, Prime Minister's Office, Singapore under its National Cybersecurity R&D Program (Award No. NRF2018NCR-NCR005-0001); NRF Investigatorship NRFI06-2020-0022-0001; the National Research Foundation through its National Satellite of Excellence in Trustworthy Software Systems (NSOE-TSS) project under the National Cybersecurity R&D (NCR) Grant Award No. NRF2018NCR-NSOE003-0001; the JSPS KAKENHI Grants No. JP20H04168, JP19K24348, JP19H04086, JP21H04877; and the JST-Mirai Program Grant No. JPMJMI20B8, Japan. Lei Ma is also supported by the Canada CIFAR AI Program and the Natural Sciences and Engineering Research Council of Canada. Wenbo Guo is supported by the IBM Ph.D. Fellowship Award.

## References

Ayache, S., Eyraud, R., and Goudian, N. Explaining black boxes on sequential data using weighted automata. In ICGI, 2018.

Boopathy, A., Weng, T.-W., Chen, P.-Y., Liu, S., and Daniel, L. CNN-Cert: An efficient framework for certifying robustness of convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3240–3247, 2019.

Cechin, A. L., Regina, D., Simon, P., and Stertz, K. State automata extraction from recurrent neural nets using k-means and fuzzy clustering. In 23rd International Conference of the Chilean Computer Science Society (SCCC 2003), pp. 73–78, 2003.

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

Du, X., Xie, X., Li, Y., Ma, L., Liu, Y., and Zhao, J. DeepStellar: Model-based quantitative analysis of stateful deep learning systems. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 477–487, 2019.

Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. Exploring the landscape of spatial robustness. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Giles, C. L., Sun, G.-Z., Chen, H.-H., Lee, Y.-C., and Chen, D. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Processing Systems, 1990.

Giles, C. L., Miller, C. B., Chen, D., Chen, H.-H., Sun, G.-Z., and Lee, Y.-C. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 1992.

Hara, S., Nitanda, A., and Maehara, T. Data cleansing for models trained with SGD. In Advances in Neural Information Processing Systems, 2019.

Khanna, R., Kim, B., Ghosh, J., and Koyejo, S. Interpreting black box predictions using Fisher kernels. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 3382–3390, 2019.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Koh, P. W. W., Ang, K.-S., Teo, H., and Liang, P. S.
On the accuracy of influence functions for measuring group effects. In Advances in Neural Information Processing Systems, 2019.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits, 1998.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pp. 142–150, 2011.

Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Okudono, T., Waga, M., Sekiyama, T., and Hasuo, I. Weighted automata extraction from recurrent neural networks via regression on state spaces. arXiv preprint arXiv:1904.02931, 2019.

Omlin, C. and Giles, C. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.

Singh, G., Gehr, T., Mirman, M., Püschel, M., and Vechev, M. Fast and effective robustness certification. In Advances in Neural Information Processing Systems, pp. 10802–10813, 2018.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, 2013.

Sotoudeh, M. and Thakur, A. V. Correcting deep neural networks with small, generalizing patches. In Workshop on Safety and Robustness in Decision Making, 2019.

Wang, H., Ustun, B., and Calmon, F. P. Repairing without retraining: Avoiding disparate impact with counterfactual distributions. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Weiss, G., Goldberg, Y., and Yahav, E. Extracting automata from recurrent neural networks using queries and counterexamples. In Proceedings of the 35th International Conference on Machine Learning, pp. 5247–5256, 2018.

Weiss, G., Goldberg, Y., and Yahav, E. Learning deterministic weighted automata with queries and counterexamples. In Advances in Neural Information Processing Systems, pp. 8558–8569, 2019.

Weng, T.-W., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Boning, D., Dhillon, I. S., and Daniel, L. Towards fast computation of certified robustness for ReLU networks. arXiv preprint arXiv:1804.09699, 2018.

Yu, B., Qi, H., Guo, Q., Juefei-Xu, F., Xie, X., Ma, L., and Zhao, J. DeepRepair: Style-guided repairing for DNNs in the real-world operational environment. arXiv preprint arXiv:2011.09884, 2020.

Zeng, Z., Goodman, R., and Smyth, P. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5, 1993.

Zhang, H. and Chan, W. Apricot: A weight-adaptation approach to fixing deep learning models. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 376–387. IEEE, 2019.

Zhang, X., Zhu, X., and Wright, S. Training set debugging using trusted items. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Zhang, X., Du, X., Xie, X., Ma, L., Liu, Y., and Sun, M. Decision-guided weighted automata extraction from recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11699–11707, 2021.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. pp. 2223–2232, 2017.