# Invariant Rationalization

Shiyu Chang\*¹, Yang Zhang\*¹, Mo Yu\*², Tommi S. Jaakkola³

\*Equal contribution. ¹MIT-IBM Watson AI Lab, ²IBM Research, ³MIT CSAIL. Correspondence to: Shiyu Chang, Yang Zhang, Mo Yu. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

Abstract

Selective rationalization improves neural network interpretability by identifying a small subset of input features (the rationale) that best explains or supports the prediction. A typical rationalization criterion, i.e., maximum mutual information (MMI), finds the rationale that maximizes the prediction performance based only on the rationale. However, MMI can be problematic because it picks up spurious correlations between the input features and the output. Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments. We show both theoretically and empirically that the proposed rationales can rule out spurious correlations and generalize better to different test scenarios. The resulting explanations also align better with human judgments. Our implementations are publicly available at https://github.com/code-terminator/invariant_rationalization.

1. Introduction

A number of selective rationalization techniques (Lei et al., 2016; Li et al., 2016b; Chen et al., 2018a;b; Yu et al., 2018; Carton et al., 2018; Bastings et al., 2019; Yu et al., 2019; Chang et al., 2019) have been proposed to explain the predictions of complex neural models. The key idea driving these methods is to find a small subset of the input features (the rationale) that suffices on its own to yield the same outcome. In practice, rationales that remove much of the spurious content from the input, e.g., text, could be used and examined as justifications for the model's predictions.

The most commonly used criterion for rationales is the maximum mutual information (MMI) criterion. In the context of natural language processing (NLP), it defines the rationale as the subset of input text that maximizes the mutual information between the subset and the model output, subject to the constraint that the selected subset remains within a prescribed length. Specifically, if we denote the random variables corresponding to the input as $X$, the rationale as $Z$, and the model output as $Y$, then the MMI criterion finds the explanation $Z = Z(X)$ that yields the highest prediction accuracy of $Y$.

The MMI criterion can nevertheless lead to undesirable results in practice. It is prone to highlighting spurious correlations between the input features and the output as valid explanations. While such correlations represent statistical relations present in the training data, and are thus incorporated into the neural model, the impact of such features on the true outcome (as opposed to the model's predictions) can change at test time. In other words, MMI may select features that do not explain the underlying relationship between the inputs and outputs, even though they may still faithfully report the model's behavior. We seek to modify the rationalization criterion to better tailor it to finding causal features.

As an example, consider figure 1, which shows a beverage review covering four aspects of beer: appearance, smell, palate, and overall. The reviewer also assigned a score to each of these aspects. Suppose we want to find an explanation supporting a positive score for smell.
The correct explanation should be the portion of the review that actually discusses smell, as highlighted in green. However, reviews of other aspects such as palate (highlighted in red) may co-vary with the smell score since, as senses, smell and palate are related. The overall statement (highlighted in blue) would typically also correlate with any individual aspect score, including smell. Taken together, the sentences highlighted in green, red, and blue would all be highly correlated with the positive score for smell. As a result, MMI may select any one of them (or some combination) as the rationale, depending on the precise statistics of the training data. Yet only the green sentence constitutes an adequate explanation.

Our goal is to design a rationalization criterion that approximates finding causal features. While assessing causality is challenging, we can approximate the task by searching instead for features that are in some sense invariant. This notion was recently introduced in the context of invariant risk minimization (IRM) (Arjovsky et al., 2019). The main idea is to expose spurious (non-causal) variation by dividing the data into different environments. The same predictor, if based on causal features, should remain optimal in each environment separately.

[Figure 1. An example beer review and possible rationales explaining why the score on the smell aspect is positive. Green highlights the review of the smell aspect, which is the true explanation. Red highlights the review of the palate aspect, which has a high correlation with smell. Blue highlights the overall review, which summarizes all the aspects, including smell. All three sentences have high predictive power for the smell score, but only the green sentence is the desired explanation.]

In this paper, we propose invariant rationalization (INVRAT), a novel rationalization scheme that incorporates the invariance constraint. We extend the IRM principle to neural predictions by resorting to a game-theoretic framework to impose invariance. Specifically, the proposed framework consists of three modules: a rationale generator, an environment-agnostic predictor, and an environment-aware predictor. The rationale generator generates rationales $Z$ from the input $X$, and both predictors try to predict $Y$ from $Z$. The only difference between the two predictors is that the environment-aware predictor also has access to which environment each training data point is drawn from. The goal of the rationale generator is to restrict the rationales in a manner that closes the performance gap between the two predictors while still maximizing the prediction accuracy of the environment-agnostic predictor. We show theoretically that INVRAT can solve the invariant rationalization problem, and that the invariant rationales generalize well to unknown test environments in a well-defined minimax sense.
We evaluate INVRAT on multiple datasets with known spurious correlations. The results show that INVRAT does significantly better at removing spurious correlations and at finding explanations that better align with human judgments. Both data and code will become publicly available.

2. Preliminaries: MMI and Its Limitation

In this section, we formally review the MMI criterion and analyze its limitation using a probabilistic model. Throughout the paper, upper-case letters, $X$ and $\boldsymbol{X}$, denote random scalars and vectors respectively; lower-case letters, $x$ and $\boldsymbol{x}$, denote deterministic scalars and vectors respectively; $H(X)$ denotes the Shannon entropy of $X$; $H(Y|X)$ denotes the entropy of $Y$ conditional on $X$; $I(Y;X)$ denotes the mutual information. Without causing ambiguity, we use $p_X(\cdot)$ and $p(X)$ interchangeably to denote the probability mass function of $X$.

[Figure 2. A probabilistic model illustrating different parts of an input that have different probabilistic relationships with the model output $Y$. A sentence $X$ can be divided into three variables $X_1$, $X_2$ and $X_3$. All of $X_1$, $X_2$ and $X_3$ can be highly correlated with $Y$, but only $X_1$ is regarded as a plausible explanation.]

2.1. Maximum Mutual Information Criterion

The MMI objective can be formulated as follows. Given the input-output pairs $(X, Y)$, MMI aims to find a rationale $Z$, which is a masked version of $X$, that maximizes the mutual information between $Z$ and $Y$. Formally,

$$\max_{m \in \mathcal{S}} I(Y; Z) \quad \text{s.t.} \quad Z = m \odot X, \tag{1}$$

where $m$ is a binary mask and $\mathcal{S}$ denotes a subset of $\{0, 1\}^N$ satisfying a sparsity and a continuity constraint. $N$ is the total length of $X$. We leave the exact mathematical form of the constraint set $\mathcal{S}$ abstract here; it will be formally introduced in section 3.5. $\odot$ denotes the element-wise multiplication of two vectors or matrices. Since the mutual information measures the predictive power of $Z$ on $Y$, MMI essentially tries to find the subset of input features that best predicts the output $Y$.

2.2. MMI Limitations

The biggest problem with MMI is that it is prone to picking up spurious probabilistic correlations rather than finding the causal explanation. To demonstrate why this is the case, consider the probabilistic graph in figure 2, where $X$ is divided into three variables, $X_1$, $X_2$ and $X_3$, which represent three typical relationships with $Y$: $X_1$ influences $Y$; $X_2$ is influenced by $Y$; $X_3$ has no direct connection with $Y$. The dashed arrows represent additional probabilistic dependencies among $X$. For now, we ignore $E$. As observed from the graph, $X_1$ serves as the valid explanation of $Y$, because it is the true cause of $Y$. Neither $X_2$ nor $X_3$ is a valid explanation. However, $X_1$, $X_2$ and $X_3$ can all be highly predictive of $Y$, so the MMI criterion may select any of the three as the rationale.

Concretely, consider the following toy example with all binary variables. Assume

$$p_{X_1}(1) = 0.5, \quad p_{Y|X_1}(1|1) = p_{Y|X_1}(0|0) = 0.9, \tag{2}$$

which makes $X_1$ a good predictor of $Y$. Next, define the conditional prior of $X_2$ as $p_{X_2|Y}(1|1) = p_{X_2|Y}(0|0) = 0.9$. By Bayes' rule,

$$p_{Y|X_2}(1|1) = p_{Y|X_2}(0|0) = 0.9, \tag{3}$$

which makes $X_2$ also a good predictor of $Y$. Finally, assume the conditional prior of $X_3$ is $p_{X_3|X_1,X_2}(1|1,1) = p_{X_3|X_1,X_2}(0|0,0) = 1$ and $p_{X_3|X_1,X_2}(1|0,1) = p_{X_3|X_1,X_2}(1|1,0) = 0.5$. It can be computed that

$$p_{Y|X_3}(1|1) = p_{Y|X_3}(0|0) = 0.9. \tag{4}$$

In short, according to equations (2), (3) and (4), we have constructed a set of priors such that the predictive power of $X_1$, $X_2$ and $X_3$ is exactly the same. As a result, there is no reason for MMI to favor $X_1$ over the others; the short enumeration below verifies this numerically.
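The following is a minimal numerical check of the toy construction in equations (2)-(4), assuming the graph $X_1 \to Y \to X_2$ with $X_3$ depending on $(X_1, X_2)$. It is a plain-Python enumeration; the variable names are illustrative and not part of the paper.

```python
# Enumerate the joint distribution implied by equations (2)-(4) and verify that
# p(Y=v | X_k=v) = 0.9 for all three variables, i.e., equal predictive power.
from itertools import product

p_x1 = {1: 0.5, 0: 0.5}                                        # equation (2)
p_y_x1 = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.1, (0, 0): 0.9}  # p(y | x1)
p_x2_y = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.1, (0, 0): 0.9}  # p(x2 | y)
p_x3 = {(1, 1): 1.0, (0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5}    # p(x3=1 | x1, x2)

def joint(x1, y, x2, x3):
    p3 = p_x3[(x1, x2)]
    return p_x1[x1] * p_y_x1[(y, x1)] * p_x2_y[(x2, y)] * (p3 if x3 else 1 - p3)

for idx in range(3):                       # idx 0, 1, 2 -> X1, X2, X3
    for v in (0, 1):
        num = den = 0.0
        for x1, y, x2, x3 in product((0, 1), repeat=4):
            pj = joint(x1, y, x2, x3)
            if (x1, x2, x3)[idx] == v:
                den += pj
                num += pj if y == v else 0.0
        print(f"p(Y={v} | X{idx + 1}={v}) = {num / den:.3f}")  # 0.900 every time
```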
In fact, $X_1$, $X_2$ and $X_3$ correspond to the three highlighted sentences in figure 1. $X_1$ corresponds to the smell review (green sentence), because it represents the true explanation that influences the output decision. $X_2$ corresponds to the overall review (blue sentence), because the overall summary of the beer is, conversely, influenced by the smell score. Finally, $X_3$ corresponds to the palate review (red sentence), because the palate review has no direct relationship with the smell score. However, $X_3$ may still be highly predictive of $Y$ because it can be strongly correlated with $X_1$ and $X_2$. Therefore, we need a novel rationalization scheme that can distinguish $X_1$ from the rest.

3. Adversarial Invariant Rationalization

In this section, we propose invariant rationalization, a rationalization criterion that can exclude rationales with spurious correlations by utilizing the extra information provided by an environment variable. We introduce INVRAT, a game-theoretic approach to solving the invariant rationalization problem, and then theoretically analyze its convergence properties and the generalizability of invariant rationales.

3.1. Invariant Rationalization

Without further information, distinguishing $X_1$ from $X_2$ and $X_3$ is a challenging task. However, this challenge can be resolved if we also have access to an extra piece of information: the environment. As shown in figure 2, an environment is defined as an instance of the variable $E$ that impacts the prior distribution of $X$ (Arjovsky et al., 2019). On the other hand, we make the same assumption as in IRM that $p(Y|X_1)$ remains the same across environments (hence there is no edge pointing from $E$ to $Y$ in figure 2), because $X_1$ is the true cause of $Y$. General guidance on how to choose the environments is presented in appendix A. As we will show shortly, $p(Y|X_2)$ and $p(Y|X_3)$ do not remain the same across environments, which distinguishes $X_1$ from $X_2$ and $X_3$.

Returning to the binary toy example in section 2.2, suppose there are two environments, $e_1$ and $e_2$. In environment $e_1$, all the prior distributions are exactly the same as in section 2.2. In environment $e_2$, the priors are almost the same, except for the prior of $X_1$. For notational ease, define $q_X(\cdot)$ as the probabilities under environment $e_2$, i.e., $p_{X|E}(\cdot|e_2)$, and assume that $q_{X_1}(1) = 0.6$. It turns out that such a small difference suffices to expose $X_2$ and $X_3$. In this environment, $q(Y|X_1)$ is the same as in equation (2), as assumed. However, it can be computed that

$$q_{Y|X_2}(1|1) \approx 0.926, \quad q_{Y|X_2}(0|0) \approx 0.867, \quad q_{Y|X_3}(1|1) \approx 0.912, \quad q_{Y|X_3}(0|0) \approx 0.883,$$

which differ from equations (3) and (4). Notice that we have not yet assumed any changes in the priors of $X_2$ and $X_3$, which would introduce further differences. The fundamental cause of these differences is that $Y$ is independent of $E$ only when conditioned on $X_1$, so $p_{Y|X_1}(\cdot|\cdot)$ does not change with $E$. We call this property invariance. The same conditional independence does not hold for $X_2$ and $X_3$. Therefore, given that we have access to multiple environments during training, i.e., multiple instances of $E$, we propose the invariant rationalization objective as follows:

$$\max_{m \in \mathcal{S}} I(Y; Z) \quad \text{s.t.} \quad Z = m \odot X, \quad Y \perp E \mid Z, \tag{5}$$

where $\perp$ denotes probabilistic independence. The only difference between equations (1) and (5) is that the latter has the invariance constraint, which is used to screen out $X_2$ and $X_3$. The sketch below repeats the toy enumeration under both environments and reproduces the numbers above.
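This is the same enumeration as before, now parameterized by the prior $p(X_1 = 1)$ and run under both environments; only the $X_1$ prior changes between the two calls. A hedged plain-Python sketch with illustrative names; the output shows that $p(Y|X_1)$ stays at 0.9 while the conditionals for $X_2$ and $X_3$ drift.

```python
# Re-run the toy enumeration under environments e1 (p(X1=1)=0.5) and
# e2 (q(X1=1)=0.6); only p(Y|X1) is invariant across the two.
from itertools import product

p_y_x1 = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.1, (0, 0): 0.9}  # invariant
p_x2_y = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.1, (0, 0): 0.9}
p_x3 = {(1, 1): 1.0, (0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5}    # p(x3=1 | x1, x2)

def posteriors(p1):
    """Print p(Y=v | X_k=v) for k = 1, 2, 3 when p(X1=1) = p1."""
    def joint(x1, y, x2, x3):
        p3 = p_x3[(x1, x2)]
        return ((p1 if x1 else 1 - p1) * p_y_x1[(y, x1)]
                * p_x2_y[(x2, y)] * (p3 if x3 else 1 - p3))
    for idx in range(3):
        vals = []
        for v in (0, 1):
            num = den = 0.0
            for x1, y, x2, x3 in product((0, 1), repeat=4):
                pj = joint(x1, y, x2, x3)
                if (x1, x2, x3)[idx] == v:
                    den += pj
                    num += pj if y == v else 0.0
            vals.append(f"p(Y={v}|X{idx + 1}={v})={num / den:.3f}")
        print("  ".join(vals))

posteriors(0.5)  # e1: all three pairs equal 0.900
posteriors(0.6)  # e2: X2 -> 0.867/0.926, X3 -> 0.883/0.912, X1 unchanged
```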
In practice, finding an eligible environment is feasible. In the beer review example in figure 1, a possible choice of environment is the brand of the beer, because different brands have different prior distributions over the reviews of each aspect: some brands do better on appearance, others on palate. Such variations in priors suffice to expose the non-invariance of the palate review or the overall review in predicting the smell score.

3.2. The INVRAT Framework

The constrained optimization in equation (5) is hard to solve in its original form. INVRAT introduces a game-theoretic framework that can approximately solve this problem. Notice that the invariance constraint can be converted into a constraint on entropies, i.e.,

$$Y \perp E \mid Z \iff H(Y|Z, E) = H(Y|Z), \tag{6}$$

which means that if $Z$ is invariant, $E$ cannot provide extra information beyond $Z$ for predicting $Y$. Guided by this perspective, INVRAT consists of three players, as shown in figure 3: an environment-agnostic (environment-independent) predictor $f_i(Z)$, an environment-aware predictor $f_e(Z, E)$, and a rationale generator $g(X)$.

[Figure 3. The INVRAT framework with three players: the rationale generator, and the environment-agnostic and environment-aware predictors.]

The goal of the environment-agnostic and environment-aware predictors is to predict $Y$ from the rationale $Z$. The only difference between them is that the latter has access to $E$ as another input feature while the former does not. Formally, denote $L(Y; f)$ as the cross-entropy loss on a single instance. Then the learning objectives of the two predictors can be written as

$$\mathcal{L}_i^* = \min_{f_i(\cdot)} \mathbb{E}[L(Y; f_i(Z))], \quad \mathcal{L}_e^* = \min_{f_e(\cdot,\cdot)} \mathbb{E}[L(Y; f_e(Z, E))], \tag{7}$$

where $Z = g(X)$. The rationale generator generates $Z$ by masking $X$. The goal of the rationale generator is also to minimize the environment-agnostic prediction loss $\mathcal{L}_i^*$, but it has the additional goal of making the gap between $\mathcal{L}_i^*$ and $\mathcal{L}_e^*$ small. Formally, the objective of the generator is

$$\min_{g(\cdot)} \; \mathcal{L}_i^* + \lambda h(\mathcal{L}_i^* - \mathcal{L}_e^*), \tag{8}$$

where $h(t)$ is a convex function that is monotonically increasing in $t$ when $t < 0$, and strictly monotonically increasing in $t$ when $t \geq 0$, e.g., $h(t) = t$ and $h(t) = \mathrm{ReLU}(t)$.

3.3. Convergence Properties

This section justifies that equations (7) and (8) solve equation (5) in its Lagrangian form. If the representation power of $f_i(\cdot)$ and $f_e(\cdot,\cdot)$ is sufficient, the cross-entropy losses achieve their entropy lower bounds, i.e., $\mathcal{L}_i^* = H(Y|Z)$ and $\mathcal{L}_e^* = H(Y|Z, E)$. Notice that the environment-aware loss should be no greater than the environment-agnostic loss, because more information is available, i.e., $H(Y|Z) \geq H(Y|Z, E)$. Therefore, the invariance constraint in equation (6) can be rewritten as an inequality constraint:

$$H(Y|Z) = H(Y|Z, E) \iff H(Y|Z) \leq H(Y|Z, E). \tag{9}$$

Finally, notice that $I(Y; Z) = H(Y) - H(Y|Z)$. Thus the objective in equation (8) can be regarded as the Lagrangian form of equation (5), with the constraint rewritten as the inequality

$$h(H(Y|Z) - H(Y|Z, E)) \leq h(0). \tag{10}$$

According to the KKT conditions, $\lambda > 0$ when equation (10) is binding. Moreover, the objectives in equations (7) and (8) can be rewritten as a minimax game:

$$\min_{g(\cdot), f_i(\cdot)} \max_{f_e(\cdot,\cdot)} \; \mathcal{L}_i(g, f_i) + \lambda h(\mathcal{L}_i(g, f_i) - \mathcal{L}_e(g, f_e)), \tag{11}$$

where $\mathcal{L}_i(g, f_i) = \mathbb{E}[L(Y; f_i(Z))]$ and $\mathcal{L}_e(g, f_e) = \mathbb{E}[L(Y; f_e(Z, E))]$. Therefore, the generator plays a co-operative game with the environment-agnostic predictor and an adversarial game with the environment-aware predictor. The optimization can be performed using alternating gradient descent/ascent, as sketched below.
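Below is a minimal, self-contained sketch of the three-player game in equation (11), assuming PyTorch and $h = \mathrm{ReLU}$. The module classes, toy bag-of-words input, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, N_ENV, LAMBDA = 100, 2, 1.0

class RationaleGenerator(nn.Module):
    """g(X): produces a soft mask m over the N input positions."""
    def __init__(self, dim=32):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, N) token ids
        s = self.score(self.emb(x)).squeeze(-1)    # per-position scores
        return torch.sigmoid(s)                    # soft relaxation of the mask

class Predictor(nn.Module):
    """f_i(Z) or, with env_aware=True, f_e(Z, E)."""
    def __init__(self, dim=32, env_aware=False):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        self.env_emb = nn.Embedding(N_ENV, dim) if env_aware else None
        self.out = nn.Linear(dim, 2)

    def forward(self, x, m, e=None):
        h = (self.emb(x) * m.unsqueeze(-1)).mean(dim=1)   # mask, then pool
        if self.env_emb is not None:
            h = h + self.env_emb(e)                       # environment feature
        return self.out(h)

g, f_i, f_e = RationaleGenerator(), Predictor(), Predictor(env_aware=True)
opt_gi = torch.optim.Adam(list(g.parameters()) + list(f_i.parameters()), lr=1e-3)
opt_e = torch.optim.Adam(f_e.parameters(), lr=1e-3)

x = torch.randint(0, VOCAB, (16, 20))      # toy batch: 16 texts of 20 tokens
y = torch.randint(0, 2, (16,))
env = torch.randint(0, N_ENV, (16,))

for step in range(100):
    # Adversarial player: f_e minimizes its own loss L_e (the inner max of
    # equation (11), since a smaller L_e enlarges h(L_i - L_e)).
    loss_e = F.cross_entropy(f_e(x, g(x).detach(), env), y)
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

    # Co-operative players: g and f_i minimize L_i + lambda * h(L_i - L_e).
    m = g(x)
    L_i = F.cross_entropy(f_i(x, m), y)
    L_e = F.cross_entropy(f_e(x, m, env), y)
    loss_gi = L_i + LAMBDA * F.relu(L_i - L_e)    # h = ReLU
    opt_gi.zero_grad(); loss_gi.backward(); opt_gi.step()
```

The environment-aware player takes gradient steps on its own loss, which widens the gap term, while the generator and the environment-agnostic player jointly minimize the full objective, closing the gap while keeping $\mathcal{L}_i$ small.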
3.4. Invariance and Generalizability

In our previous discussion, we justified invariant rationales in the sense that they can uncover consistent, causal explanations and leave out spurious statistical correlations. In this section, we further justify invariant rationales in terms of generalizability. We consider a set of training environments $\{e_t\}$ and a test environment $e_a$. Only the training environments are accessible during training; the prior distributions in the test environment are completely unknown. The question we ask is: does keeping the invariant rationales and dropping the non-invariant ones improve generalizability in the unknown test environment?

Assume that 1) the training data are sufficient, 2) the predictor is environment-agnostic, 3) the predictor has sufficient representation power, and 4) training converges to the global optimum. Under these assumptions, any predictor is able to replicate the training-set distribution (with all the training environments mixed) $p(Y|Z, E \in \{e_t\})$, which is optimal under the cross-entropy training objective. In the test environment $e_a$, the cross-entropy loss of this predictor is given by

$$\mathcal{L}_{\mathrm{test}}(Z) = H(p(Y|Z, e_a); p(Y|Z, \{e_t\})),$$

where $p(Y|Z, \{e_t\})$ is short for $p(Y|Z, E \in \{e_t\})$. $\mathcal{L}_{\mathrm{test}}(Z)$ cannot be evaluated because the prior distribution in the test environment is unknown. Instead, we consider the worst case. For notational ease, we introduce the following shorthand for the test-environment distributions:

$$\pi_1(x_1) = p_{X_1|E}(x_1|e_a), \quad \pi_2(x_2|x_1, y) = p_{X_2|X_1,Y,E}(x_2|x_1, y, e_a), \quad \pi_3(x_3|x_1, x_2) = p_{X_3|X_1,X_2,E}(x_3|x_1, x_2, e_a).$$

For the selected rationale $Z$, we consider an adversarial test environment (hence the notation $e_a$), which chooses $\pi_1$, $\pi_2$ and $\pi_3$ to maximize $\mathcal{L}_{\mathrm{test}}(Z; \pi_1, \pi_2, \pi_3)$ (note that $\mathcal{L}_{\mathrm{test}}(Z)$ is a function of $\pi_1$, $\pi_2$ and $\pi_3$). The following theorem shows that the minimizer of this adversarial loss is the invariant rationale $X_1$.

Theorem 1. Assume the probabilistic graph in figure 2 and two environments $e_t$ and $e_a$. $Z = X_1$ achieves the saddle point of the minimax problem

$$\min_{Z \in \mathcal{X}} \max_{\pi_1, \pi_2, \pi_3} \mathcal{L}_{\mathrm{test}}(Z; \pi_1, \pi_2, \pi_3),$$

where $\mathcal{X}$ denotes the power set of $\{X_1, X_2, X_3\}$.

The proof is provided in appendix B. Theorem 1 shows a desirable property of the invariant rationale: it minimizes the risk under the most adverse test environment.

3.5. Incorporating Sparsity and Continuity Constraints

The sparsity and continuity constraint $m \in \mathcal{S}$ (equation (5)) stipulates that the total number of 1's in $m$ is upper-bounded and that the 1's are contiguous. There are two ways to implement the constraints.

Soft constraints: Following Chang et al. (2019), we can add two more Lagrangian terms to equation (11):

$$\mu_1 \left| \frac{1}{N} \mathbb{E}[\lVert m \rVert_1] - \alpha \right| + \mu_2 \, \mathbb{E}\!\left[ \sum_n |m_n - m_{n-1}| \right], \tag{12}$$

where $m_n$ denotes the $n$-th element of $m$ and $\alpha$ is a predefined sparsity level. $m$ is produced by an independent selection process (Lei et al., 2016). This method is flexible but requires sophisticated tuning of three Lagrange multipliers.

Hard constraints: An alternative approach is to force $g(\cdot)$ to select a single chunk of text with a pre-specified length $l$. Instead of predicting the mask directly, $g(\cdot)$ produces a score $s_n$ for each position $n$ and chooses the start of the chunk at the position with the highest score:

$$n^* = \arg\max_n s_n, \quad m_n = \mathbb{1}[n \in [n^*, n^* + l - 1]], \tag{13}$$

where $\mathbb{1}[\cdot]$ denotes the indicator function, which equals 1 if the argument is true and 0 otherwise. Equation (13) is not differentiable, so when computing the gradients for back-propagation, we apply the straight-through technique (Bengio et al., 2013) and approximate it with the gradient of $\hat{s} = \mathrm{softmax}(s)$, $m = \mathrm{CausalConv}(\hat{s})$, where $\mathrm{CausalConv}(\cdot)$ denotes causal convolution whose kernel is an all-one vector of length $l$. Both constraint types are sketched below.
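The following is a hedged sketch of both constraint implementations, assuming PyTorch. `sparsity_continuity_penalty` mirrors the two soft-constraint terms of equation (12), and `hard_mask` mirrors equation (13) with the straight-through surrogate; the function names and the toy call at the end are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sparsity_continuity_penalty(m, alpha, mu1, mu2):
    """Soft constraints of equation (12): a batch estimate of
    mu1 * |E[||m||_1]/N - alpha| + mu2 * E[sum_n |m_n - m_{n-1}|]."""
    sparsity = (m.mean() - alpha).abs()
    continuity = (m[:, 1:] - m[:, :-1]).abs().sum(dim=1).mean()
    return mu1 * sparsity + mu2 * continuity

def hard_mask(scores, l):
    """scores: (batch, N) position scores. Forward: the discrete mask of
    equation (13), a length-l chunk starting at the arg-max position.
    Backward: the gradient of CausalConv(softmax(scores)) (straight-through)."""
    batch, N = scores.shape
    start = scores.argmax(dim=1, keepdim=True)                 # n*
    pos = torch.arange(N, device=scores.device).unsqueeze(0)
    m_hard = ((pos >= start) & (pos < start + l)).float()
    # Causal convolution of softmax(s) with an all-one kernel of length l:
    # m_n = sum_{k=0}^{l-1} s_hat_{n-k}, via left padding of l - 1.
    s_hat = F.softmax(scores, dim=1)
    kernel = torch.ones(1, 1, l, device=scores.device)
    m_soft = F.conv1d(F.pad(s_hat.unsqueeze(1), (l - 1, 0)), kernel).squeeze(1)
    # Straight-through: forward value is m_hard, gradient flows through m_soft.
    return m_soft + (m_hard - m_soft).detach()

scores = torch.randn(4, 50, requires_grad=True)
m = hard_mask(scores, l=10)
m.sum().backward()                          # gradients flow via the surrogate
soft_m = torch.sigmoid(torch.randn(4, 50))  # a soft mask, for the soft case
print(sparsity_continuity_penalty(soft_m, alpha=0.2, mu1=1.0, mu2=1.0))
```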
4. Experiments

4.1. Datasets

To evaluate invariant rationale generation, we consider the following two binary classification datasets with known spurious correlations.

IMDB (Maas et al., 2011): The original dataset consists of 25,000 movie reviews for training and 25,000 for testing. The output $Y$ is the binarized score of the movie. We construct a synthetic setting that manually injects tokens with false correlations with $Y$, whose priors vary across artificial environments. The goal is to validate whether the proposed method excludes these tokens from the rationale selection. Specifically, we first randomly split the training set into two balanced subsets, each of which is treated as an environment. At the beginning of each review, we randomly insert one punctuation token $S \in \{\text{``,''}, \text{``.''}\}$ with the distribution

$$p(S = \text{``,''} \mid Y = 1, e_i) = p(S = \text{``.''} \mid Y = 0, e_i) = \alpha_i,$$

where $i \in \{0, 1\}$ is the environment index. Specifically, we set $\alpha_0$ and $\alpha_1$ to 0.9 and 0.7, respectively, for the training set. For model selection and evaluation, we randomly split the original test set into two balanced subsets, which serve as our new validation and test sets. To test how different rationalization techniques generalize to unknown environments, we also inject the punctuation into the validation and test sets, but with $\alpha_0$ and $\alpha_1$ set to 0.5 for the validation set and to 0.1 and 0.3 for the test set. By this construction, the manually injected "," and "." can be thought of as the $X_2$ variable in figure 2, which has a strong correlation with the label. It is worth mentioning that the environment ID is provided only for the training set.
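A sketch of this bias injection, assuming plain Python. The `reviews`/`labels` arguments and the helper name are illustrative; the $\alpha$ values follow the text (0.9/0.7 for training, 0.5 for validation, 0.1/0.3 for test).

```python
import random

def inject_punctuation(reviews, labels, env_ids, alphas, seed=0):
    """Prepend ',' or '.' so that
    p(S=',' | Y=1, e_i) = p(S='.' | Y=0, e_i) = alphas[i]."""
    rng = random.Random(seed)
    out = []
    for text, y, e in zip(reviews, labels, env_ids):
        a = alphas[e]
        biased = "," if y == 1 else "."   # punctuation correlated with label
        other = "." if y == 1 else ","
        s = biased if rng.random() < a else other
        out.append(s + " " + text)
    return out

# Toy usage with the training-set alphas; real data would replace these lists.
train = inject_punctuation(["great movie", "awful plot"], [1, 0], [0, 1],
                           alphas={0: 0.9, 1: 0.7})
```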
Multi-aspect beer reviews (McAuley et al., 2012): This dataset is commonly used in the rationalization literature (Lei et al., 2016; Bao et al., 2018; Yu et al., 2019; Chang et al., 2019). It contains 1.5 million beer reviews, each of which evaluates multiple aspects of a beer, including appearance, aroma (smell), palate, and overall. Each aspect has a rating on a $[0, 1]$ scale. The goal is to provide rationales for these ratings. The rating scores of different aspects in the same review are highly correlated, making it difficult to learn a rationalization model directly from the original data; for this reason, previous work selected only decorrelated subsets as training data (Lei et al., 2016; Yu et al., 2019). However, the high correlation among rating scores in the original data provides us a perfect evaluation benchmark for INVRAT's ability to exclude irrelevant but highly correlated aspects, because these aspects can be thought of as the $X_2$ and $X_3$ variables in figure 2, as discussed in section 2.2.

To construct different environments, we cluster the data by the degree of correlation among the aspects. To gauge this correlation, we train a simple linear regression model to predict the rating of the target aspect from the ratings of all the other aspects except overall; a low prediction error implies high correlation among the aspects. We then assign the data to different environments based on the linear prediction error. In particular, we construct two training environments using the data with the lowest prediction error, i.e., the highest correlations: the first training environment is sampled from around the lowest 25th percentile of the prediction error, and the second from around the 25th to the 50th percentile. (For each aspect, the exact percentiles need to be adjusted so that there are enough positive and negative examples to form a label-balanced subset of a given size; the same holds for the other environment partitions.) A sketch of this partitioning follows.
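A hedged sketch of the partitioning just described, assuming scikit-learn/numpy and a per-review matrix of aspect ratings. The function name, toy data, and the use of in-sample prediction error are illustrative assumptions; the paper does not specify these implementation details.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def assign_environments(ratings, target, overall):
    """ratings: (n_reviews, n_aspects) array. Returns index arrays for two
    training environments (most correlated data) and a held-out split."""
    other = [j for j in range(ratings.shape[1]) if j not in (target, overall)]
    X, y = ratings[:, other], ratings[:, target]
    # Per-review error of predicting the target rating from the other aspects.
    err = np.abs(LinearRegression().fit(X, y).predict(X) - y)
    q25, q50 = np.percentile(err, [25, 50])
    env1 = np.where(err <= q25)[0]                   # lowest-error quartile
    env2 = np.where((err > q25) & (err <= q50))[0]   # 25th-50th percentile
    held_out = np.where(err > q50)[0]                # highest-error half
    return env1, env2, held_out

ratings = np.random.rand(1000, 5)   # toy stand-in for the real rating matrix
env1, env2, held_out = assign_environments(ratings, target=1, overall=4)
```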
In contrast to the training environments, we construct a validation set and a subjective evaluation set from the data with the highest prediction error (around the highest 50th percentile). Following the same evaluation protocol as prior work (Bao et al., 2018; Chang et al., 2019), we consider a classification setting that treats reviews with ratings ≤ 0.4 as negative and ≥ 0.6 as positive. Each training environment is further sub-sampled to contain 5,000 label-balanced examples, making the total training set size 10,000. The validation set is similarly sub-sampled to size 2,000, and the subjective evaluation set has size 400. As in almost all previous work on rationalization, we focus on the appearance, aroma, and palate aspects only. The dataset also includes sentence-level annotations for about 1,000 reviews, where each sentence is annotated with one or more aspect labels indicating which aspects it describes. We use this set to automatically evaluate the precision of the extracted rationales.

4.2. Baselines

We consider the following two baselines:

RNP: A generator-predictor framework proposed by Lei et al. (2016) for rationalizing neural predictions (RNP). The generator selects text spans as rationales, which are then fed to the predictor for label classification. The selection optimizes the MMI criterion shown in equation (1).

3PLAYER: An improvement of RNP by Yu et al. (2019) that aims to alleviate RNP's degeneration problem. The model consists of three modules: the generator, the predictor, and the complement predictor. The complement predictor tries to maximize the predictive accuracy from the unselected words. Besides the MMI objective optimized between the generator and the predictor, the generator also plays an adversarial game with the complement predictor, trying to minimize its performance.

There exist other differentiable selective rationalization methods with good performance, e.g., Bastings et al. (2019). These methods rely on the properties of probability distributions for the binary selection of rationale words, which fall into a degenerate mode in our more challenging settings. Appendix C studies the out-of-the-box algorithm from Bastings et al. (2019). Adapting these algorithms to span selection is non-trivial, and we leave it to future work.

4.3. Implementation Details

For all experiments, we use bidirectional gated recurrent units (Chung et al., 2014) with hidden dimension 256 for the generator and both predictors. All methods are initialized with 100-dimensional GloVe embeddings (Pennington et al., 2014). We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and a batch size of 500. For fair comparison, we keep the settings of RNP and 3PLAYER the same as ours, adapting the open-source implementations of RNP (https://github.com/YujiaBao/R2A/tree/master/rationalization) and 3PLAYER (https://github.com/Gorov/three_player_for_emnlp). The only major difference between these models is that both RNP and INVRAT use the straight-through technique (Bengio et al., 2013) to handle the non-differentiability of rationale selection, while 3PLAYER is based on the policy gradient (Williams, 1992).

For the IMDB dataset, we follow the standard setting (Lei et al., 2016; Chang et al., 2019) and use the soft constraints to regularize the selected rationales for all methods. For the beer review task, we find that the baseline methods perform much worse with the soft constraints than with the hard ones, possibly because the reviews of the different aspects are highly correlated in the training set; we therefore use the hard constraints (equation (13)) with different lengths when generating rationales. We also find that training with multiple random initializations helps avoid poor local optima. Hyperparameters ($\mu_1$ and $\mu_2$ in equation (12) for the IMDB experiment; $\lambda$ and $h(\cdot)$ in equation (8); the number of consecutive gradient ascent/descent steps for each player per iteration; and the number of training epochs for both experiments) are determined by the best performance on the validation set.

Table 1. Results on the synthetic IMDB dataset. The last column is the percentage of test examples in which the injected punctuation is selected as part of the rationale. The best test result is bolded. We also list the result of the black-box model with the full text as input for reference.

| Method    | Dev Acc | Test Acc  | Bias Highlighted (%) |
|-----------|---------|-----------|----------------------|
| RNP       | 78.90   | 72.25     | 78.24                |
| INVRAT    | 86.65   | **87.05** | 0.00                 |
| Full Text | 82.90   | 78.00     | 100.00               |

4.4. Results

IMDB: Table 1 shows the results on the synthetic IMDB dataset. RNP selects the injected punctuation in 78.24% of the test examples, while INVRAT, as expected, highlights none. This result verifies our theoretical analysis in section 3. Moreover, because RNP relies on the injected punctuation, whose distribution varies drastically between the training and test sets, it generalizes poorly, leading to low predictive accuracy on the test set: there is a gap of around 15% between the test performance of RNP and that of INVRAT. As a reference, table 1 also reports the result on the full text, i.e., with the entire text as the rationale. Similar to RNP, the full-text model also generalizes poorly to the test set, because it too includes the non-invariant punctuation as part of its input. It is worth pointing out that, by the dataset construction, 3PLAYER would trivially fail by including all the punctuation in its rationales, because otherwise the complement predictor would have a clear clue for guessing the label; we therefore exclude 3PLAYER from this comparison.

Beer Review: We conduct both objective and subjective evaluations on the beer review dataset. We first compare the generated rationales against the human annotations and report precision, recall, and F1 score in table 2. As before, the reported performances correspond to the best accuracy on the validation set, which is also reported. We consider highlight lengths of 10, 20, and 30. We observe that INVRAT consistently surpasses the two baselines in finding rationales that align with the human annotations for most rationale lengths and aspects.
In particular, although the best validation accuracies of the three methods differ only slightly, the improvements in finding the correct rationales are significant. For example, INVRAT improves over the other two methods by more than 20 absolute percentage points in F1 on the appearance aspect. The two baseline methods fail to distinguish the true clues for the different aspects, which confirms that the MMI objective is insufficient for ruling out spurious words.

In addition, we visualize the rationales generated by our method with a preset length of 20 in figure 4. INVRAT produces meaningful justifications for all three aspects: by reading the selected texts alone, humans can easily predict the aspect label.

To further verify that the rationales generated by INVRAT align with human judgment, we present a subjective evaluation via Amazon Mechanical Turk. Recall that for each aspect we preserved a hold-out set of 400 examples (1,200 examples in total across the three aspects). We generate rationales of different lengths for all methods. In each subjective test, the subject is presented with the rationale of one aspect of a beer review, generated by one of the three methods (with unselected words blocked), and asked to guess which aspect the rationale is talking about. We then compute the accuracy as the performance metric, shown in figure 5. Under this setting, a generator that picks spuriously correlated texts will have a low accuracy. As can be observed, INVRAT achieves the best performance in all cases.

[Figure 5. Subjective performance of the generated rationales. Subjects are asked to guess the target aspect (i.e., the aspect on which the model is trained) based on the generated rationales, for preset rationale lengths of 10, 20, and 30.]
Table 2. Experimental results on the multi-aspect beer reviews. We compare with the baselines at highlight lengths of 10, 20, and 30. For each aspect and length, we report the best accuracy on the validation set (Acc) and its corresponding performance on the human annotation set. The best precision (P), recall (R), and F1 scores are bolded.

| Method  | Len | App Acc | App P     | App R     | App F1    | Aroma Acc | Aroma P   | Aroma R   | Aroma F1  | Palate Acc | Palate P  | Palate R  | Palate F1 |
|---------|-----|---------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|------------|-----------|-----------|-----------|
| RNP     | 10  | 75.20   | 13.51     | 5.75      | 8.07      | 75.30     | 30.30     | 15.26     | 20.30     | 75.00      | 28.20     | 17.24     | 21.40     |
| 3PLAYER | 10  | 77.55   | 15.84     | 6.78      | 9.50      | 80.75     | **48.85** | **24.43** | **32.57** | 76.60      | 14.15     | 8.54      | 10.65     |
| INVRAT  | 10  | 75.65   | **49.54** | **20.93** | **29.43** | 77.95     | 48.21     | 24.36     | 32.36     | 76.10      | **32.80** | **20.01** | **24.86** |
| RNP     | 20  | 77.70   | 13.54     | 11.29     | 12.31     | 78.85     | 34.32     | 34.18     | 34.25     | 77.10      | 19.80     | 23.78     | 21.60     |
| 3PLAYER | 20  | 82.56   | 15.63     | 13.47     | 14.47     | 82.95     | 35.73     | 35.89     | 35.81     | 79.75      | 20.73     | 24.91     | 22.63     |
| INVRAT  | 20  | 81.30   | **58.03** | **49.59** | **53.48** | 81.90     | **42.72** | **42.52** | **42.62** | 80.45      | **44.04** | **52.75** | **48.00** |
| RNP     | 30  | 81.65   | 26.26     | 33.10     | 29.29     | 83.10     | 39.97     | 60.13     | 48.02     | 78.55      | 19.18     | 33.81     | 24.47     |
| 3PLAYER | 30  | 80.55   | 12.56     | 15.90     | 14.03     | 84.40     | 33.02     | 49.66     | 39.67     | 81.85      | 21.98     | 39.27     | 28.18     |
| INVRAT  | 30  | 82.85   | **54.03** | **69.23** | **60.70** | 84.40     | **44.72** | **67.35** | **53.75** | 81.00      | **26.51** | **46.91** | **33.87** |

[Figure 4. Examples of INVRAT-generated rationales (preset length 20) on the multi-aspect beer reviews, shown for a single review under the appearance, aroma, and palate aspects.
Human-annotated words are underlined; the appearance, aroma, and palate rationales are in bold text and highlighted in green, red, and blue, respectively.]

5. Related Work

Selective rationalization. Selective rationalization is one of the major categories of model interpretability in machine learning. Lei et al. (2016) first proposed a generator-predictor framework for rationalization. The framework is formally a co-operative game that maximizes the mutual information between the selected rationales and the labels, as shown by Chen et al. (2018a). Following this work, Chen et al. (2018b) improve the generator-predictor framework by proposing a new rationalization criterion that considers the combinatorial nature of the selection. Yu et al. (2019) point out the communication problem in co-operative learning and propose a three-player framework to control the unselected texts. Chang et al. (2019) aim to generate rationales for all possible classes instead of the target label only, which makes the model perform counterfactual reasoning. These models address different challenges in generating high-quality rationales, but they remain insufficient for distinguishing invariant words from correlated ones.

Self-explaining models beyond selective rationalization. Besides selective rationalization, other approaches also improve the interpretability of neural predictions. For example, module networks (Andreas et al., 2016a;b; Johnson et al., 2017) compose appropriate modules following the logical program produced by a natural language component; the restriction to a small set of pre-defined programs currently limits their applicability. Other lines of work include evaluating feature importance with gradient information (Simonyan et al., 2013; Li et al., 2016a; Sundararajan et al., 2017) or local perturbations (Kononenko et al., 2010; Lundberg & Lee, 2017), and interpreting deep networks by locally fitting interpretable models (Ribeiro et al., 2016; Alvarez-Melis & Jaakkola, 2018). However, these methods aim at providing post-hoc explanations of already-trained models and are not able to find invariant texts.

Learning with biases. Our work also relates to the topic of discovering dataset-specific biases. Neural models have shown remarkable results in many NLP applications; however, they are sometimes prone to fitting dataset-specific patterns or biases. For example, in natural language inference, such biased clues can be the word overlap between the input sentence pair (McCoy et al., 2019) or the mere presence of the negation word "not" (Niven & Kao, 2019). Similar observations have been made in multi-hop question answering (Welbl et al., 2018; Min et al., 2019). To learn with biased data without fully relying on the biases, Lewis & Fan (2018) use generative objectives to force QA models to make use of the full question. Agrawal et al. (2018) and Wang et al. (2019) propose carefully designed model architectures that capture more complex interactions between input clues beyond the biases. Ramakrishnan et al. (2018) and Belinkov et al. (2019) propose adversarial regularizations that penalize internal representations that cooperate well with bias-only models. Clark et al. (2019), He et al. (2019), and Karimi Mahabadi et al. (2020) propose to learn ensemble models that fit the residual from predictions made with bias features. However, all of these works assume that the biases are known. Our work instead can rule out unwanted features without knowing the exact pattern a priori. Finally, Feng et al. (2018) discovered nonsensical clues by removing uninformative words recognized by pre-trained neural models, indicating that these models do not always learn human-understandable causes for their predictions, which may be partly due to fitting dataset biases.

6. Conclusion

In this paper, we propose a game-theoretic approach to invariant rationalization, in which the method is trained to constrain the probability of the output conditioned on the rationales to be the same across multiple environments. The framework consists of three players, which competitively rule out spurious words with strong correlations to the output.
We theoretically demonstrate that the proposed game-theoretic framework drives the solution towards better generalization to test scenarios whose distributions differ from that of the training data. Extensive objective and subjective evaluations on both synthetic and multi-aspect sentiment classification datasets demonstrate that INVRAT performs favorably against existing algorithms in rationale generation.

References

Agrawal, A., Batra, D., Parikh, D., and Kembhavi, A. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971–4980, 2018.

Alvarez-Melis, D. and Jaakkola, T. S. Towards robust interpretability with self-explaining neural networks. arXiv preprint arXiv:1806.07538, 2018.

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Learning to compose neural networks for question answering. In Proceedings of NAACL-HLT, pp. 1545–1554, 2016a.

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48, 2016b.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bao, Y., Chang, S., Yu, M., and Barzilay, R. Deriving machine attention from human rationales. arXiv preprint arXiv:1808.09367, 2018.

Bastings, J., Aziz, W., and Titov, I. Interpretable neural predictions with differentiable binary variables. arXiv preprint arXiv:1905.08160, 2019.

Belinkov, Y., Poliak, A., Shieber, S. M., Van Durme, B., and Rush, A. M. On adversarial removal of hypothesis-only bias in natural language inference. arXiv preprint arXiv:1907.04389, 2019.

Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Carton, S., Mei, Q., and Resnick, P. Extractive adversarial networks: High-recall explanations for identifying personal attacks in social media posts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3497–3507, 2018.

Chang, S., Zhang, Y., Yu, M., and Jaakkola, T. A game theoretic approach to class-wise selective rationalization. In Advances in Neural Information Processing Systems, pp. 10055–10065, 2019.

Chen, J., Song, L., Wainwright, M., and Jordan, M. Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pp. 882–891, 2018a.

Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. L-Shapley and C-Shapley: Efficient model interpretation for structured data. arXiv preprint arXiv:1808.02610, 2018b.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

Clark, C., Yatskar, M., and Zettlemoyer, L. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. arXiv preprint arXiv:1909.03683, 2019.

Feng, S., Wallace, E., Grissom II, A., Iyyer, M., Rodriguez, P., and Boyd-Graber, J. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3719–3728, 2018.

He, H., Zha, S., and Wang, H. Unlearn dataset bias in natural language inference by fitting the residual. arXiv preprint arXiv:1908.10763, 2019.
Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2989–2998, 2017.

Karimi Mahabadi, R., Belinkov, Y., and Henderson, J. End-to-end bias mitigation by modelling biases in corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8706–8716, Online, July 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kononenko, I. et al. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11(Jan):1–18, 2010.

Lei, T., Barzilay, R., and Jaakkola, T. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155, 2016.

Lewis, M. and Fan, A. Generative question answering: Learning to answer the whole question. 2018.

Li, J., Chen, X., Hovy, E., and Jurafsky, D. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 681–691, 2016a.

Li, J., Monroe, W., and Jurafsky, D. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016b.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774, 2017.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In ACL: Human Language Technologies, pp. 142–150, 2011.

McAuley, J., Leskovec, J., and Jurafsky, D. Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pp. 1020–1025. IEEE, 2012.

McCoy, T., Pavlick, E., and Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448, 2019.

Min, S., Wallace, E., Singh, S., Gardner, M., Hajishirzi, H., and Zettlemoyer, L. Compositional questions do not necessitate multi-hop reasoning. arXiv preprint arXiv:1906.02900, 2019.

Niven, T. and Kao, H.-Y. Probing neural network comprehension of natural language arguments. arXiv preprint arXiv:1907.07355, 2019.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

Ramakrishnan, S., Agrawal, A., and Lee, S. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems, pp. 1541–1551, 2018.

Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3319–3328. JMLR.org, 2017.
Wang, H., Yu, M., Guo, X., Das, R., Xiong, W., and Gao, T. Do multi-hop readers dream of reasoning chains? arXiv preprint arXiv:1910.14520, 2019.

Welbl, J., Stenetorp, P., and Riedel, S. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302, 2018.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Yu, M., Chang, S., and Jaakkola, T. S. Learning corresponded rationales for text matching. OpenReview, 2018.

Yu, M., Chang, S., Zhang, Y., and Jaakkola, T. S. Rethinking cooperative rationalization: Introspective extraction and complement control. arXiv preprint arXiv:1910.13294, 2019.