# Invariant Risk Minimization Games

Kartik Ahuja¹, Karthikeyan Shanmugam¹, Kush R. Varshney¹, Amit Dhurandhar¹

¹IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY. Correspondence to: Kartik Ahuja.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The standard risk minimization paradigm of machine learning is brittle when operating in environments whose test distributions differ from the training distribution due to spurious correlations. Training on data from many environments and finding invariant predictors reduces the effect of spurious features by concentrating models on features that have a causal relationship with the outcome. In this work, we pose such invariant risk minimization as finding the Nash equilibrium of an ensemble game among several environments. By doing so, we develop a simple training algorithm that uses best response dynamics and, in our experiments, yields similar or better empirical accuracy with much lower variance than the challenging bi-level optimization problem of Arjovsky et al. (2019). One key theoretical contribution is showing that the set of Nash equilibria for the proposed game is equivalent to the set of invariant predictors for any finite number of environments, even with nonlinear classifiers and transformations. As a result, our method also retains the generalization guarantees to a large set of environments shown in Arjovsky et al. (2019). The proposed algorithm adds to the collection of successful game-theoretic machine learning algorithms such as generative adversarial networks.

1. Introduction

The annals of machine learning are rife with embarrassing examples of spurious correlations that fail to hold outside a specific training (and identically distributed test) distribution. Beery et al. (2018) trained a convolutional neural network (CNN) to classify camels from cows. The training dataset had one source of bias: most of the pictures of cows had green pastures, while most pictures of camels were in deserts. The CNN picked up the spurious correlation, i.e., it associated green pastures with cows and failed to classify pictures of cows on sandy beaches correctly. In another case, a neural network used a brake light indicator to continue applying brakes, which was a spurious correlation in the training data (de Haan et al., 2019); the list of such examples goes on.

To address the problem of models inheriting spurious correlations, Arjovsky et al. (2019) show that one can exploit the varying degrees of spurious correlation naturally present in data collected from multiple data sources to learn robust predictors. The authors propose to find a representation Φ such that the optimal classifier given Φ is invariant across training environments. This formulation leads to a challenging bi-level optimization, which the authors relax by fixing a simple linear classifier and learning a representation Φ such that the classifier is approximately locally optimal in all the training environments.

In this work, we take a very different approach. We create an ensemble of classifiers with each environment controlling one component of the ensemble. Each environment uses the entire ensemble to make predictions. We let all the environments play a game where each environment's action is to decide its contribution to the ensemble such that it minimizes its risk.
Remarkably, we establish that the set of predictors that solve the ensemble game is equal to the set of invariant predictors across the training environments; this result holds for a large class of nonlinear classifiers. This brings us to the question: how do we solve the game? We use classic best response dynamics (Fudenberg & Levine, 1998), which has a very simple implementation. Each environment periodically takes its turn and moves its classifier in the direction that minimizes the risk specific to its environment. Empirically, we establish that the invariant predictors found by our approach lead to better or comparable performance with much lower standard deviation than Arjovsky et al. (2019) on several different datasets. A nice consequence of our approach is that we do not restrict classifiers to be linear, which was emphasized as an important direction for future work by Arjovsky et al. (2019). Broadly speaking, we believe that the game-theoretic perspective herein can open up a totally new paradigm to address the problem of invariance.

2. Related Work

2.1. Invariance Principles in Causality

The invariant risk minimization formulation of Arjovsky et al. (2019) is the most closely related work, and it is motivated by the theory of causality and causal Bayesian networks (CBNs) (Pearl, 1995). A variable y is caused by a set of non-spurious actual causal factors x_Pa(y) if and only if, in all environments where y has not been intervened on, the conditional probability P(y | x_Pa(y)) remains invariant. This is called the modularity condition (Bareinboim et al., 2012). Related notions are the independent causal mechanism principle (Schölkopf et al., 2012; Janzing & Schölkopf, 2010; Janzing et al., 2012) and the invariant causal prediction principle (Peters et al., 2016; Heinze-Deml et al., 2018). These principles imply that if all the environments (train and test) are modeled by interventions that do not affect the causal mechanism of the target variable y, then a classifier conservatively trained on the transformation that involves the causal factors (Φ(x) = x_Pa(y)) to predict y is robust to unseen interventions. In general, for finite sets of environments, there may be other invariant predictors. If one has information about the CBN structure, one can find invariant predictors that are maximally predictive using conditional independence tests and other graph-theoretic tools (Magliacane et al., 2018; Subbaswamy et al., 2019).

The above works select subsets of features, primarily using conditional independence tests, that make the optimal classifier trained on the selected features invariant. Arjovsky et al. (2019) give an optimization-based reformulation of this invariance that facilitates searching over transformations in a continuous space, making their work widely applicable in areas such as computer vision where the causal features are latent (see Figure 6 in Arjovsky et al. (2019)).

2.2. Sample Reweighting, Domain Adaptation, and Robust Optimization

Statistical machine learning has dealt with the shift between the training distribution and the test distribution in a number of ways. Conventional approaches are sample weighting, domain adaptation, and robust optimization. Importance weighting, or more generally sample weighting, attempts to match test and train distributions by reweighting samples (Shimodaira, 2000; Sugiyama et al., 2008; Gretton et al., 2009; Zhao et al., 2018).
It typically assumes that the probability of labels given all covariates does not shift, and in more general cases it requires access to test labels. Domain adaptation tries to find a representation Φ whose distribution is invariant across source and target domains (Ajakan et al., 2014; Ben-David et al., 2007; Glorot et al., 2011; Ganin et al., 2016). Domain adaptation is known to have serious limitations even when the marginal distribution of labels shifts across environments (Zhao et al., 2019; Johansson et al., 2019). When only training data sources are given, robust optimization techniques minimize the worst-case loss over all possible convex combinations of the training sources (Mohri et al., 2019; Hoffman et al., 2018; Lee & Raginsky, 2018; Duchi et al., 2016). This assumes that the test distribution lies within the convex hull of the training distributions, which is not true in many settings.

3. Preliminaries

3.1. Game Theory Concepts

We begin with some basic concepts from game theory (Fudenberg & Tirole, 1991) that we will use. Let Γ = (N, {S_i}_{i∈N}, {u_i}_{i∈N}) be the tuple representing a standard normal form game, where N is the finite set of players. Player i ∈ N takes actions from a strategy set S_i. The utility of player i is u_i : S → R, where we write the joint set S = Π_{i∈N} S_i. The joint strategy of all the players is given as s ∈ S, the strategy of player i is s_i, and the strategy of the rest of the players is s_{-i} = (s_{i'})_{i'≠i}. If the set S is finite, then we call the game Γ a finite game. If the set S is uncountably infinite, then the game Γ is a continuous game.

Nash equilibrium in pure strategies. A strategy s* is said to be a pure strategy Nash equilibrium (NE) if it satisfies

u_i(s*_i, s*_{-i}) ≥ u_i(k, s*_{-i}), ∀ k ∈ S_i, ∀ i ∈ N.

3.2. Invariant Risk Minimization

We describe the invariant risk minimization (IRM) formulation of Arjovsky et al. (2019). Consider datasets {(x^e_i, y^e_i)}_{i=1}^{n_e} from multiple training environments e ∈ E_tr. The feature values x^e_i ∈ X and the corresponding labels y^e_i ∈ Y, where X ⊆ R^n and Y ⊆ R^k. (The setup applies to both continuous and categorical data; if any feature or label is categorical, we one-hot encode it.) Define a predictor f : X → R^k. The goal of IRM is to use these multiple datasets to construct a predictor f that performs well across many unseen environments E_all, where E_all ⊇ E_tr. Define the risk achieved by f in environment e as R^e(f) = E_{X^e, Y^e}[ℓ(f(X^e), Y^e)], where ℓ is the loss when f(X) is the predicted value and Y is the corresponding label. Assuming that f maps to real values is not restrictive; for instance, in a k-class classification problem, the output of f is the score for each class, which can be converted into a hard label by selecting the class with the highest score.

Invariant predictor: We say that a data representation Φ : X → Z ⊆ R^d elicits an invariant predictor w ∘ Φ across environments e ∈ E if there is a classifier w : Z → R^k that achieves the minimum risk for all the environments, i.e., w ∈ argmin_{w̄ ∈ H_w} R^e(w̄ ∘ Φ) for all e ∈ E. The set of all the mappings Φ is given as H_Φ and the set of all the classifiers is given as H_w. IRM may be phrased as the following constrained optimization problem (Arjovsky et al., 2019):

min_{Φ ∈ H_Φ, w ∈ H_w} Σ_{e ∈ E_tr} R^e(w ∘ Φ)
s.t. w ∈ argmin_{w̄ ∈ H_w} R^e(w̄ ∘ Φ), ∀ e ∈ E_tr.   (1)

If w ∘ Φ satisfies the constraints above, then it is an invariant predictor across the training environments E_tr.
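For reference, the practical relaxation of (1) that Arjovsky et al. (2019) actually train with (the relaxation mentioned in the Introduction, often called IRMv1) fixes a scalar "dummy" classifier w = 1.0 and penalizes the squared gradient of each environment's risk with respect to it. Below is a minimal PyTorch sketch of that penalized objective; the function names and the binary cross-entropy loss are our illustrative choices, so this should be read as a hedged sketch rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, labels):
    # Squared gradient of the environment risk w.r.t. a fixed dummy classifier w = 1.0.
    dummy_w = torch.tensor(1.0, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * dummy_w, labels)
    (grad,) = torch.autograd.grad(risk, [dummy_w], create_graph=True)
    return grad.pow(2).sum()

def irmv1_objective(phi, envs, lam=1e2):
    # Sum of per-environment risks plus the invariance penalty.
    # envs: list of (features, labels) pairs, one per training environment; phi: representation Φ.
    total = 0.0
    for x_e, y_e in envs:
        logits = phi(x_e).squeeze(-1)
        risk = F.binary_cross_entropy_with_logits(logits, y_e)
        total = total + risk + lam * irmv1_penalty(logits, y_e)
    return total
```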
Intuition behind the IRM optimization in equation (1). We use a simplified version of the model described by Peters et al. (2016). In each environment e, the random variable X^e = [X^e_1, ..., X^e_n] corresponds to the feature vector and Y^e corresponds to the label. The data for each environment are generated by i.i.d. sampling (X^e, Y^e) from the following generative model. Assume a subset S* ⊆ {1, ..., n} is causal for the label Y^e. For each environment e ∈ E_all, X^e has an arbitrary distribution and

Y^e ← g(X^e_{S*}) + ε^e,   (2)

where X^e_{S*} is the vector X^e restricted to the indices in S*, g : R^{|S*|} → R describes the conditional expectation, and ε^e ~ F^e with E[ε^e] = 0 and ε^e ⊥ X^e_{S*}. Let ℓ be the squared error loss function. We fix the representation Φ*(X^e) = X^e_{S*}. With Φ* as the representation, the optimal classifier w among all functions is g (this follows from the generative model). Since this optimal classifier does not vary across environments, w ∘ Φ* = g(X^e_{S*}) is an invariant predictor across all of E_all (assuming g ∈ H_w). Define X^e_{S̄*} as the remaining feature values that are not causal for Y^e. If Φ does not focus on S* and some of the variables in X^e_{S̄*} are part of the Markov blanket of Y^e (Pearl, 2014), then the optimal classifier may not be the same across environments, and thus invariance will not be achieved. By solving (1) across training environments, IRM hopes to arrive at g, which will generalize well across all the test environments E_all.

Define the set of representations and corresponding classifiers (Φ, w) that satisfy the constraints in the above optimization problem (1) as S^IV, where IV stands for invariant. Also, separately define the set of invariant predictors as Ŝ^IV = {w ∘ Φ | (Φ, w) ∈ S^IV}.

Remark. The sets S^IV and Ŝ^IV depend on the choice of classifier class H_w and representation class H_Φ. We avoid making this dependence explicit until later sections.

Members of S^IV are equivalently expressed as solutions to

R^e(w ∘ Φ) ≤ R^e(w̄ ∘ Φ), ∀ w̄ ∈ H_w, ∀ e ∈ E_tr.   (3)

The main generalization result (Theorem 9) in Arjovsky et al. (2019) states that if representations and classifiers are from the class of linear models, i.e., Φ ∈ H_Φ = R^{n×n} (the representation output for input x is Φx) and w ∈ H_w = R^{n×1} (the classifier output for input z is w^T z), then under certain conditions on the data generation process and the training environments E_tr, the solution to (3) remains invariant in E_all (we use the same setup in our Theorem 2 as well).

4. Ensemble Invariant Risk Minimization Games

4.1. Game-Theoretic Reformulation

Optimization problem (1) can be quite challenging to solve. We introduce an alternate characterization based on game theory. We endow each environment with its own classifier w^e ∈ H_w. We use a simple ensemble to construct an overall classifier w^av : Z → R^k defined as w^av = (1/|E_tr|) Σ_{q=1}^{|E_tr|} w^q, where for each z ∈ Z, w^av(z) = (1/|E_tr|) Σ_{q=1}^{|E_tr|} w^q(z). (The "av" stands for average.) Consider the example of binary classification with two environments {e1, e2}; w^e = [w^e_1, w^e_2] is the classifier of environment e, where each component is the score for one class. We define component j of the ensemble classifier w^av as w^av_j = (w^{e1}_j + w^{e2}_j)/2. These scores are input to a softmax; the final probability assigned to class j for an input z is exp(w^av_j(z)) / (exp(w^av_1(z)) + exp(w^av_2(z))).

We require all the environments to use this ensemble w^av. We want to solve the following new optimization problem:

min_{Φ ∈ H_Φ, w^av ∈ H_w} Σ_{e ∈ E_tr} R^e(w^av ∘ Φ)
s.t. w^e ∈ argmin_{w̄^e ∈ H_w} R^e([ (1/|E_tr|)(w̄^e + Σ_{q≠e} w^q) ] ∘ Φ), ∀ e ∈ E_tr.
We can equivalently restate the above as:

min_{Φ ∈ H_Φ, w^av} Σ_{e ∈ E_tr} R^e(w^av ∘ Φ)
s.t. R^e([ (1/|E_tr|)(w^e + Σ_{q≠e} w^q) ] ∘ Φ) ≤ R^e([ (1/|E_tr|)(w̄^e + Σ_{q≠e} w^q) ] ∘ Φ), ∀ w̄^e ∈ H_w, ∀ e ∈ E_tr.   (4)

What are the advantages of this formulation (4)?

- Using the ensemble automatically enforces that the same classifier is used across the environments.
- Each environment is free to select its classifier w^e from the entire set H_w, unlike in (1), where all environments' choices are required to be the same.
- The constraints in (4) are equivalent to the set of pure NE of a game that we define next.

The game is played between |E_tr| players, with each player corresponding to an environment e. The set of actions of environment e is w^e ∈ H_w. At the start of the game, a representation Φ is selected from the set H_Φ and is observed by all the environments. The utility function for an environment e is defined as u_e[w^e, w^{-e}, Φ] = −R^e(w^av ∘ Φ), where w^{-e} = {w^q}_{q≠e} is the set of choices of all environments but e. We call this game Ensemble Invariant Risk Minimization (EIRM) and express it as a tuple Γ_EIRM = (E_tr, H_Φ, {H_w}_{q=1}^{|E_tr|}, {u_e}_{e ∈ E_tr}).

We represent a pure NE as a tuple (Φ, {w^q}_{q=1}^{|E_tr|}). Since each pure NE depends on Φ, we include it as part of the tuple. (We could also express each environment's action as a mapping π : H_Φ → H_w, but we do not, to avoid complicated notation.) We define the set of pure NE as S^EIRM. We construct the set of all ensemble predictors obtained from NE (without double counting compositions leading to the same predictor) as

Ŝ^EIRM = { (1/|E_tr|) Σ_{q=1}^{|E_tr|} w^q ∘ Φ | (Φ, {w^q}_{q=1}^{|E_tr|}) ∈ S^EIRM }.

Members of S^EIRM are equivalently expressed as the solutions to

u_e[w^e, w^{-e}, Φ] ≥ u_e[w̄^e, w^{-e}, Φ], ∀ w̄^e ∈ H_w, ∀ e ∈ E_tr.   (5)

If we replace u_e[w^e, w^{-e}, Φ] with −R^e(w^av ∘ Φ), we obtain the inequalities in (4). So far we have defined the game and given its relationship to the problem in (4).
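To make the objects above concrete, here is a minimal PyTorch sketch of the ensemble construction: each environment owns one classifier w^e, predictions always go through the averaged scores w^av, and an environment's utility is the negative of its risk evaluated on the ensemble. The module and function names (EnsembleEIRM, environment_risk) are ours, and the linear per-environment classifiers are an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleEIRM(nn.Module):
    """One classifier per environment; predictions always use the averaged scores w_av."""
    def __init__(self, phi, num_envs, feat_dim, num_classes):
        super().__init__()
        self.phi = phi                                    # shared representation Φ
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_envs)]
        )

    def forward(self, x):
        z = self.phi(x)
        scores = torch.stack([w(z) for w in self.classifiers], dim=0)
        return scores.mean(dim=0)                         # w_av(z) = (1/|E_tr|) Σ_q w^q(z)

def environment_risk(model, x_e, y_e):
    """R^e(w_av ∘ Φ); the utility u_e of environment e is its negative."""
    return F.cross_entropy(model(x_e), y_e)
```

In this sketch the softmax over the averaged scores is applied implicitly by the cross-entropy loss.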
Overview of the Results. In Sections 4.2 and 4.3 we discuss the main theoretical results of this work. Here we give a brief preview of them.

- In Theorem 1 and Corollary 1 we establish equivalence between the predictors obtained from the NE of the EIRM game, S^EIRM, and the invariant predictors, S^IV. We establish this equivalence for a large class of representations and classifiers, where both can be nonlinear.
- In Theorem 2, we borrow the generalization result from Arjovsky et al. (2019) and show that the same generalization guarantees continue to hold in our setting. Following Arjovsky et al. (2019), we assume both classifiers and representations are linear.
- In Theorem 3, we discuss the role of the representation and how in some cases we can reduce the computational expense that one may incur in searching for representations. We establish this result for a large class of classifiers and invertible representations, where both can be nonlinear.
- In Theorem 4, we discuss the existence of both the Nash equilibria of the EIRM game and the invariant predictors. We restrict the classifiers to be linear, but the representations may be nonlinear. In the supplement, we extend the result to nonlinear classifiers.

4.2. Equivalence Between NE and Invariant Predictors

What is the relationship between the predictors obtained from NE, Ŝ^EIRM, and the invariant predictors, Ŝ^IV? Remarkably, these two sets are the same under very mild conditions. Before we show this result, we establish a stronger result from which it follows.

We use the set S^EIRM to construct a new set. To each tuple (Φ, {w^q}_{q=1}^{|E_tr|}) ∈ S^EIRM we adjoin the ensemble classifier w^av = (1/|E_tr|) Σ_{q=1}^{|E_tr|} w^q to get (Φ, {w^q}_{q=1}^{|E_tr|}, w^av). We call the set of these new tuples S̃^EIRM. We use the set S^IV to construct a new set. Consider an element (Φ, w) ∈ S^IV. We define a decomposition of w in terms of environment-specific classifiers as follows: w = (1/|E_tr|) Σ_{q=1}^{|E_tr|} w^q, where each w^q ∈ H_w. (Setting w^q = w for every q ∈ E_tr is one trivial decomposition.) We use each such decomposition to augment the tuple and obtain (Φ, {w^q}_{q=1}^{|E_tr|}, w). We call this set of new tuples S̃^IV.

Both the sets S̃^IV and S̃^EIRM consist of tuples of a representation, a set of environment-specific classifiers, and the ensemble classifier. We ask an even more interesting question than the one above. Is the set of representations, environment-specific classifiers, and ensembles found by playing EIRM (5) the same as the one found by solving IRM (3)? If these two sets are equal, then the equality between Ŝ^EIRM and Ŝ^IV follows trivially. We state the only assumption we need.

Assumption 1. Affine closure: The class of functions H_w is closed under the following operations.

- Finite sum: if w_1 ∈ H_w and w_2 ∈ H_w, then w_1 + w_2 ∈ H_w, where for every z ∈ Z, (w_1 + w_2)(z) = w_1(z) + w_2(z).
- Scalar multiplication: for any c ∈ R and w ∈ H_w, c·w ∈ H_w, where for every z ∈ Z, (c·w)(z) = c·w(z).

The addition of functions and the scalar multiplication are defined in the standard pointwise manner. Therefore, the class H_w also forms a vector space.

Examples of function classes that satisfy affine closure: linear classifiers; kernel-based classifiers, i.e., functions in an RKHS (Hofmann et al., 2008); ensemble models with an arbitrary number of weak learners (Freund et al., 1999); functions in L^p space (Ash & Doléans-Dade, 2000); and ReLU networks with arbitrary depth. We provide the justification for each of these classes in the supplement. We now state the main result.

Theorem 1. If Assumption 1 holds, then S̃^IV = S̃^EIRM.

The proofs of all the results are in the supplement.

Corollary 1. If Assumption 1 holds, then Ŝ^IV = Ŝ^EIRM.

Significance of Theorem 1 and Corollary 1:

i. Computational. This equivalence permits computational tools from game theory to find NE of the EIRM game and hence the invariant predictors (see Algorithm 1).

ii. Theoretical. This equivalence permits the use of game theory to analyze the solutions of the EIRM game and to understand the invariant predictors (see Theorem 3).

iii. Generalization. In Theorem 9 of Arjovsky et al. (2019), it was shown for linear classifiers and linear representations that the invariant predictors generalize to a large set of unseen environments under certain conditions. Since our equivalence (Theorem 1) holds for linear classifiers (and is even broader), the same generalization holds for the predictors obtained from the EIRM game. Although the result follows straightaway from the equivalence in Theorem 1, we state it formally next for completeness.

We describe the generative model for the next theorem. For environment e, Y^e ← (Z^e_1)^T γ + ε^e, where ε^e is independent of Z^e_1, Z^e_1 ∈ R^c, and γ ∈ R^c. We observe X^e, which is a scrambled version of Z^e_1 and Z^e_2, where Z^e_2 can be correlated with both Z^e_1 and ε^e. We use Assumption 8 from Arjovsky et al. (2019), restated below.

Assumption 2. A set of environments E_tr lies in linear general position of degree r if |E_tr| > n − r + n/r for some r ∈ N and, for all non-zero x ∈ R^n,

dim( span( { E_{X^e}[X^e (X^e)^T] x − E_{X^e, ε^e}[X^e ε^e] }_{e ∈ E_tr} ) ) > n − r.

In the next theorem, we consider linear representations, i.e., H_Φ = R^{n×n}, and linear classifiers, i.e., H_w = R^{n×1}, with w ∘ Φ given by w^T Φ.
Theorem 2. For each environment e ∈ E_all, assume

Y^e ← (Z^e_1)^T γ + ε^e,  Z^e_1 ⊥ ε^e,  E[ε^e] = 0,
X^e ← S(Z^e_1, Z^e_2).   (6)

Here γ ∈ R^c, Z^e_1 ∈ R^c, Z^e_2 ∈ R^q, and S ∈ R^{n×(c+q)}. Assume that the Z_1 component of S is invertible, i.e., there exists S̃ ∈ R^{c×n} such that S̃(S(z_1, z_2)) = z_1 for all z_1 ∈ R^c and z_2 ∈ R^q. Let Φ ∈ R^{n×n} have rank r > 0. If at least n − r + n/r training environments E_tr ⊆ E_all lie in linear general position of degree r, then any predictor obtained from the EIRM game over the training environments, i.e., any element of Ŝ^EIRM, is invariant across all the environments E_all.

The above theorem establishes generalization guarantees for predictors obtained from the NE of the EIRM game; the proof follows from Theorem 1 above and the proof of Theorem 9 in Arjovsky et al. (2019).

Role of the representation Φ. We investigate the scenario where we fix Φ to the identity mapping; this will motivate one of our approaches. Define the set Ŝ^EIRM(Φ) as the set of ensemble predictors arrived at by playing the EIRM game with a fixed representation Φ. (Note that ∪_{Φ ∈ H_Φ} Ŝ^EIRM(Φ) = Ŝ^EIRM.) Similarly, define the set Ŝ^IV(Φ) as the set of invariant predictors derived using the representation Φ. From Theorem 1, it follows that Ŝ^EIRM(Φ) = Ŝ^IV(Φ). We modify some of the earlier notation for the results that follow. The set of predictors that result from the EIRM game, Ŝ^EIRM, and the set of invariant predictors, Ŝ^IV, are defined for a family of maps Φ with co-domain Z. We make the co-domain Z explicit in the notation: we write Ŝ^EIRM_Z for Ŝ^EIRM and Ŝ^IV_Z for Ŝ^IV.

Assumption 3. Every Φ ∈ H_Φ satisfies the following:

- Bijective: there exists Φ^{-1} : Z → X such that for all x ∈ X, Φ^{-1}(Φ(x)) = x, and for all z ∈ Z, Φ(Φ^{-1}(z)) = z. Both X and Z are subsets of R^n.
- Φ is differentiable and Lipschitz continuous.

Let L^p(Z) denote the set of functions f : Z → R such that ∫_Z |f|^p dµ < ∞.

Assumption 4. H_w = L^p(Z).

Define the subset S̄^IV_Z ⊆ Ŝ^IV_Z consisting of the invariant predictors that are in L^p(X), i.e., S̄^IV_Z = {u | u ∈ Ŝ^IV_Z and u ∈ L^p(X)}. Let Φ = I, where I : X → X is the identity mapping. Following the above notation, the set of invariant predictors and the set of ensemble predictors obtained from NE with Φ = I are Ŝ^IV_X(I) and Ŝ^EIRM_X(I), respectively.

Theorem 3. If Assumptions 3 and 4 are satisfied and S̄^IV_Z is non-empty, then S̄^IV_Z = Ŝ^IV_X(I) = Ŝ^EIRM_X(I).

Significance of Theorem 3. If we fix the representation to the identity and play the EIRM game, then it is sufficient to recover all the invariant predictors (with bounded L^p norm) that can be obtained using all the representations Φ ∈ H_Φ. Therefore, we can simply fix Φ = I and use game-theoretic algorithms for learning equilibria.

4.3. Existence of NE of Γ_EIRM and Invariant Predictors

In this section, we argue that there are many settings in which both invariant predictors and NE exist. Recall that in the example described in Section 3.2, based on the generative model in equation (2), we already showed the existence of an invariant predictor by constructing one. In the same example, if we also assume that H_w is affine closed, then NE also exist (from Theorem 1). The above claims of existence require assumptions on the data generation process. Next, we discuss existence with no assumptions on the data generation process.

Assumption 5.

- H_w is a class of linear models, i.e., each w ∈ H_w is an element of R^{d×1} and the classifier output for input z is w^T z.
- H_w is closed, bounded, and convex, and the interior of H_w is non-empty.
- The loss function ℓ(w^T z, Y), where Y ∈ R is the label, is convex and continuous in w.
For example, if the loss is the cross-entropy for binary classification or the mean squared error for regression, then the last condition is automatically satisfied.

Theorem 4. If Assumption 5 is satisfied, then a pure strategy Nash equilibrium of the game Γ_EIRM exists, i.e., S^EIRM is not empty. Moreover, suppose there exists a (Φ, {w^q}_{q=1}^{|E_tr|}) ∈ S^EIRM such that w^q is in the interior of H_w for every q ∈ E_tr; then the corresponding ensemble predictor (1/|E_tr|) Σ_{q=1}^{|E_tr|} w^q ∘ Φ is invariant across all the training environments E_tr.

The family H_w of bounded linear functions does not satisfy affine closure, which is why the existence of NE does not immediately imply the existence of an invariant predictor (from Theorem 1). However, if the solution is in the interior of H_w, then it is the globally optimal solution among all linear functions, which do satisfy affine closure. As a result, in this case the invariant predictor also exists.

Significance of Theorem 4. Our approach is based on finding NE, so it is important to understand when such solutions are guaranteed to exist. In the above theorem, we proved the result for linear classifiers only, but no assumptions were made on the representation class. In the supplement, we discuss extensions to nonlinear classifiers. Following the sufficient condition for the existence of invariant predictors, understanding what conditions cause the NE to lie in the interior or on the boundary of H_w can help further the theory of invariant prediction.

4.4. Algorithms for Finding NE of Γ_EIRM

There are different strategies in the literature to compute equilibria, such as best response dynamics (BRD) and fictitious play (Fudenberg & Levine, 1998), but none of these strategies is guaranteed to arrive at equilibria in continuous games except for special classes of games (Hofbauer & Sorin, 2006; Barron et al., 2010; Mertikopoulos & Zhou, 2019; Bervoets et al., 2016; Daskalakis et al., 2017). BRD is one of the most popular methods given its intuitive and natural structure. The training of GANs also follows an approximate BRD (Goodfellow et al., 2014). BRD is not known to converge to equilibrium in GANs; instead, a recently proposed modification of it (Hsieh et al., 2018) achieves mixed NE. Our game Γ_EIRM is a non-zero-sum game with continuous actions, unlike GANs. Since there are no known techniques that are guaranteed to compute the equilibrium (pure or mixed) for these games, we adopt the classic BRD approach.

In our first approach, we use a fixed representation Φ. Recall that in Theorem 3 we showed how simply fixing Φ to the identity can be very effective. Hence, we can fix Φ to be the identity mapping, or we can select Φ as some other mapping such as an approximation of the Gaussian kernel map (Rahimi & Recht, 2008). Once we fix Φ, the environments play according to best response dynamics as follows. Each environment takes its turn (in a periodic manner, with each environment going once) and minimizes its respective objective. This procedure is repeated until a certain criterion is met, e.g., a maximum number of epochs or a desired value of training accuracy.

The above approach does not give much room to optimize Φ. We therefore go back to the formulation in (4) and use the upper-level optimization objective as a way to guide the search for Φ. In this second approach, Φ is updated periodically by the representation learner using the objective in (4), and between two updates of Φ the environments play according to best response dynamics as described above.
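A minimal PyTorch sketch of this best response training is given below, reusing the EnsembleEIRM module and environment_risk function sketched in Section 4.1 (both hypothetical names): with update_phi=False and phi set to the identity it corresponds to the fixed-Φ variant, and with update_phi=True the representation is refreshed every k rounds from the sum of environment risks. It is a sketch under these assumptions; the detailed procedure used in our experiments is Algorithm 1 below.

```python
import itertools
import torch

def best_response_training(model, env_loaders, num_steps, lr=1e-3, update_phi=False, k=1):
    # model: EnsembleEIRM-style module; env_loaders: one DataLoader per training environment.
    model.phi.requires_grad_(update_phi)                   # freeze Φ in the fixed-Φ variant
    env_opts = [torch.optim.Adam(w.parameters(), lr=lr) for w in model.classifiers]
    phi_opt = torch.optim.Adam(model.phi.parameters(), lr=lr) if update_phi else None
    batches = [itertools.cycle(loader) for loader in env_loaders]

    for step in range(num_steps):
        if phi_opt is not None and step % k == 0:
            # Representation learner: one step on the sum of environment risks (upper level of (4)).
            phi_opt.zero_grad()
            sum(environment_risk(model, *next(it)) for it in batches).backward()
            phi_opt.step()
        # Environments take turns; each minimizes its own risk of the shared ensemble.
        for opt, it in zip(env_opts, batches):
            x_e, y_e = next(it)
            opt.zero_grad()
            environment_risk(model, x_e, y_e).backward()   # on e's turn, only w^e is stepped
            opt.step()
```

Keeping one optimizer per environment mirrors the game: each player only ever updates its own classifier, even though every loss is computed through the shared ensemble.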
We now make assumptions on H_w and H_Φ and give the detailed algorithm (see Algorithm 1) that we use in the experiments. We assume that each w^e is parametrized by a family of neural networks θ_w ∈ Θ_w and that Φ is parametrized by a family of neural networks θ_Φ ∈ Θ_Φ. In Algorithm 1, one of the variables Fixed-Phi (for our first approach) or Variable-Phi is set to true, and accordingly Φ either remains fixed or is updated periodically. In the algorithm, K is a hyperparameter that dictates how many updates of each environment occur before updating Φ. In Figure 1, we also show an illustration of the best response training when there are two environments and one representation learner.

[Figure 1. Illustration of best response training with 2 environments and a representation learner (w^1: environment 1, w^2: environment 2, Φ: representation). Dotted lines denote backpropagation; solid lines denote the forward pass.]

5. Experiments

5.1. Benchmarks

The most important benchmark for comparison is Arjovsky et al. (2019), which we refer to as IRM in the comparisons. We use the architecture described in their work (details in the supplement). We also compare with:

- Variants of empirical risk minimization: ERM on the entire training data (ERM), ERM on each environment separately (ERM e refers to ERM trained on environment e), and ERM on data with no spurious correlations.
- Robust min-max training: in this method, we minimize the maximum loss across the multiple environments.

We have two approaches for EIRM games: one that uses a Φ fixed to the identity and one that uses a variable Φ, which we refer to as the F-IRM and V-IRM game, respectively. Details on the architectures, hyperparameters, and optimizers used are in the supplement. The source code is available at https://github.com/IBM/IRM-games.

5.2. Datasets

Colored MNIST. In Arjovsky et al. (2019), the comparisons were done on a colored digits MNIST dataset. We create the same dataset for our experiments. The task is to classify whether the digit is less than 5 or at least 5. There are three environments: two training environments containing 30,000 points each and one test environment containing 10,000 points. We add noise to the preliminary label (y = 0 if the digit is between 0 and 4, y = 1 if the digit is between 5 and 9) by flipping it with 25 percent probability to construct the final label. We sample the color id z by flipping the final label with probability p_e, where p_e is 0.2 in the first environment, 0.1 in the second environment, and 0.9 in the third environment; the third environment is the testing environment. We color the digit red if z = 1 and green if z = 0.
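A minimal NumPy sketch of this construction is shown below; the function name and the two-channel array layout are ours, while the label threshold and flip probabilities are as described above.

```python
import numpy as np

def make_colored_mnist_env(images, digits, p_color_flip, label_noise=0.25, seed=0):
    # Binary label y = 1 if digit >= 5; flip with 25% noise; color id z flips the final
    # label with probability p_color_flip (0.2 / 0.1 for training, 0.9 for the test environment).
    rng = np.random.default_rng(seed)
    y = (digits >= 5).astype(np.int64)
    y = np.where(rng.random(len(y)) < label_noise, 1 - y, y)
    z = np.where(rng.random(len(y)) < p_color_flip, 1 - y, y)
    colored = np.zeros((len(images), 2) + images.shape[1:], dtype=images.dtype)
    colored[z == 1, 0] = images[z == 1]        # red channel if z = 1
    colored[z == 0, 1] = images[z == 0]        # green channel if z = 0
    return colored, y
```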
In addition to Colored MNIST, we create two other datasets inspired by it: Colored Fashion MNIST and Colored Desprites. In these datasets as well, the color is spuriously correlated with the label. We also create another dataset, Structured Noise Fashion MNIST. In this dataset, instead of coloring the images to establish spurious correlations, we create small patches of noise at specific locations in the image, where the locations are correlated with the labels. (A detailed description of the datasets is in the supplement.) In all the comparisons, we average the performance of the different approaches over ten runs.

Algorithm 1: Best Response Training
  Input: data for each environment and the combined data.
  Initialize: randomly initialize {w^e_cur}_{e=1}^{|E_tr|} and Φ_cur from H_w and H_Φ, respectively.
  while iter ≤ iter_max do
    if Fixed-Phi then
      Φ_cur = I
    end if
    if Variable-Phi then
      Φ_cur = SGD[ Σ_e R^e(w^av_cur ∘ Φ_cur) ]   (SGD[·]: one step update using stochastic gradient descent)
    end if
    for p ∈ {1, ..., K} do
      for e ∈ {1, ..., |E_tr|} do
        w^e_cur = SGD[ R^e(w^av_cur ∘ Φ_cur) ]
        w^av_cur = (1/|E_tr|) Σ_e w^e_cur
      end for
      iter = iter + 1
    end for
  end while

5.3. Comparisons

Colored MNIST (Table 1). The standard ERM-based approaches and the robust training approach achieve between 10 and 15 percent accuracy on the testing set. The F-IRM game achieves 59.9 ± 2.7 percent testing accuracy. This implies that, unlike the ERM-based and robust training approaches, the model is not using the spurious correlation present in the color of the digit. F-IRM has a comparable mean and a much lower standard deviation than IRM, which achieves 62.75 ± 9.5 percent. ERM grayscale is ERM on uncolored data, which is why it is better than all other methods. In each table, we include the optimal performance that is achievable (75 percent train and test).

Colored Fashion MNIST (Table 2). We observe that the V-IRM game performs best both in terms of the mean and the standard deviation, achieving 70.2 ± 1.5 percent.

Colored Desprites (Table 3). We observe that the V-IRM game achieves 50.0 ± 0.2 percent while IRM achieves 51.8 ± 6 percent.

Structured Noise Fashion MNIST (Table 4). We observe that F-IRM achieves 62.0 ± 2.0 percent and is comparable with IRM, which achieves 63.9 ± 10.9 percent; again observe that we have a lower standard deviation.

Table 1. Colored MNIST: comparison of methods in terms of training and testing accuracy (mean ± std deviation).

| Algorithm | Train accuracy | Test accuracy |
|---|---|---|
| ERM | 84.88 ± 0.16 | 10.45 ± 0.66 |
| ERM 1 | 84.84 ± 0.21 | 10.86 ± 0.52 |
| ERM 2 | 84.95 ± 0.20 | 10.05 ± 0.23 |
| Robust min-max | 84.25 ± 0.43 | 15.24 ± 2.45 |
| F-IRM game | 63.37 ± 1.14 | 59.91 ± 2.69 |
| V-IRM game | 63.97 ± 1.03 | 49.06 ± 3.43 |
| IRM | 59.27 ± 4.39 | 62.75 ± 9.59 |
| ERM grayscale | 71.81 ± 0.47 | 71.36 ± 0.65 |
| Optimal | 75 | 75 |

Table 2. Colored Fashion MNIST: comparison of methods in terms of training and testing accuracy (mean ± std deviation).

| Algorithm | Train accuracy | Test accuracy |
|---|---|---|
| ERM | 83.17 ± 1.01 | 22.46 ± 0.68 |
| ERM 1 | 81.33 ± 1.35 | 33.34 ± 8.85 |
| ERM 2 | 84.39 ± 1.89 | 13.16 ± 0.82 |
| Robust min-max | 82.81 ± 0.11 | 29.22 ± 8.56 |
| F-IRM game | 62.31 ± 2.35 | 69.25 ± 5.82 |
| V-IRM game | 68.96 ± 0.95 | 70.19 ± 1.47 |
| IRM | 75.01 ± 0.25 | 55.25 ± 12.42 |
| ERM grayscale | 74.79 ± 0.37 | 74.67 ± 0.48 |
| Optimal | 75 | 75 |

Table 3. Colored Desprites: comparison of methods in terms of training and testing accuracy (mean ± std deviation).

| Algorithm | Train accuracy | Test accuracy |
|---|---|---|
| ERM | 85.01 ± 0.03 | 9.97 ± 0.05 |
| ERM 1 | 81.33 ± 1.35 | 33.34 ± 8.85 |
| ERM 2 | 84.39 ± 1.89 | 13.16 ± 0.82 |
| Robust min-max | 84.94 ± 0.09 | 10.28 ± 0.33 |
| F-IRM game | 53.36 ± 1.40 | 48.61 ± 3.06 |
| V-IRM game | 56.31 ± 4.94 | 50.04 ± 0.15 |
| IRM | 52.67 ± 2.40 | 51.82 ± 5.95 |
| ERM grayscale | 67.67 ± 0.58 | 66.97 ± 0.69 |
| Optimal | 75 | 75 |

Table 4. Structured Noise Fashion MNIST: comparison of methods in terms of training and testing accuracy (mean ± std deviation).

| Algorithm | Train accuracy | Test accuracy |
|---|---|---|
| ERM | 83.49 ± 1.22 | 20.13 ± 8.06 |
| ERM 1 | 81.80 ± 1.50 | 30.94 ± 1.01 |
| ERM 2 | 84.66 ± 0.40 | 11.98 ± 0.23 |
| Robust min-max | 82.78 ± 1.32 | 25.59 ± 9.14 |
| F-IRM game | 51.54 ± 2.96 | 62.03 ± 2.02 |
| V-IRM game | 47.70 ± 1.69 | 61.46 ± 0.53 |
| IRM | 52.57 ± 9.95 | 63.92 ± 10.95 |
| ERM no noise | 74.79 ± 0.37 | 74.67 ± 0.48 |
| Optimal | 75 | 75 |

5.4. Analyzing the Experiments

In this section, we use plots of the F-IRM game played on Colored Fashion MNIST (the plots for both F-IRM and V-IRM on all other datasets are similar and convey the same message; we provide them in the supplement). In Figure 2, we show the accuracy of the ensemble model on the entire data and on the two environments separately. In the initial stages, the training accuracy increases; eventually it starts to oscillate. Best response dynamics can often oscillate (Herings & Predtetchinski, 2017; Fudenberg & Levine, 1998; Barron et al., 2010).

[Figure 2. F-IRM, Colored Fashion MNIST: training accuracy of the ensemble (entire data, environment 1, environment 2) against training steps.]
Next, we demystify these oscillations, explain their importance, and discuss how we terminate the procedure.

5.4.1. Explaining the Mechanism of Oscillations

The oscillation has two states. In the first state, the ensemble model performs well (88% accuracy). In the second state, the accuracy dips to 75%. In Figure 3, we plot the correlation between the ensemble model and the color. When the oscillations appear in the training accuracy in Figure 2, the correlation also starts to oscillate in Figure 3. In the first state, when the model performs well, the model is heavily correlated (negative correlation) with the color. In the second state, the model performs worse; observe that the model now has much less correlation (close to zero) with the color.

[Figure 3. F-IRM, Colored Fashion MNIST: correlation of the ensemble model with color against training steps.]

We ask two questions: (i) why do the oscillations persist in the training accuracy plot (Figure 2) and the correlation plot (Figure 3)?, and (ii) how do the oscillations emerge?

Why do the oscillations persist? In our experiments there are two environments, the labels are binary, and we want to maximize the log-likelihood. Let s_j be the score vector from environment j's classifier, p be the softmax of the ensemble score s, and y be the one-hot encoded vector of labels. The gradient of the log-likelihood with respect to the scores output by each model for a given instance x (see the derivation in the supplement) is

∂ log(p_y) / ∂ s_j = y − p = e,   (7)

where e is the error vector. The error e is determined by both models (both models impact p); it backpropagates and affects the individual weights.
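For completeness, here is a one-line version of that derivation, assuming the ensemble score s is formed additively from the per-environment scores s_j. Using the standard softmax identity ∂ log(p_c) / ∂ s_k = 1[c = k] − p_k and stacking over classes gives ∂ log(p_y) / ∂ s = y − p; since s depends on each s_j through an affine combination, ∂ log(p_y) / ∂ s_j = (y − p) · (∂s / ∂s_j), which equals y − p up to the constant factor introduced by the averaging.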
We argue next that the examples on which errors occur are very different in the two states, and that this is the reason for the oscillations. Consider a step at which the correlation (in absolute value) between the ensemble model and the color is high, and suppose it is the turn of Model 1 to train. Observe that the accuracy of the model is high because the ensemble model is exploiting the spurious correlation with the color. We approximate this mathematically: the score from Model j for label 1 is s_j^1 − s_j^0 ≈ β_j^T φ_j^{nc}(x) + γ_j φ_j^{c}(x), where φ_j^{nc} are the features that are not correlated with the color and φ_j^{c} is the indicator of the color. From Figure 4, γ_1 and γ_2 should have opposite signs, i.e., positive and negative respectively. In the current step, γ_2 dominates γ_1, which is why the ensemble model has a heavy negative correlation. The errors (7) that backpropagate come from the examples for which exploiting the spurious correlation with color does not work, i.e., for which the color is not indicative of the digit. Since Model 1 is trained during this step, backpropagation changes its weights so that γ_1 increases. As a result, the ensemble model's correlation with the color decreases (as we see in Figure 3). In the next step, it is the turn of Model 2 to train. Model 2's environment has more examples than environment 1 in which exploiting the color can improve accuracy. As a result, errors from these examples backpropagate and γ_2 decreases. This brings the ensemble model back to being negatively correlated with color and also brings the training accuracy approximately back to where it was. This cycle of push and pull between the models continues.

[Figure 4. F-IRM, Colored Fashion MNIST: correlation of the individual models with the color.]

How do these cycles emerge? The oscillations are weak at the beginning of training. In the beginning, when Model 2 trains, the impact of the errors (from examples where spurious correlations can be exploited) on the weights is much stronger than when Model 1 trains, since the number of examples that benefit from the spurious correlation is much larger in comparison. As training proceeds, this impact decreases, because many examples are already classified correctly by using spurious correlations, while the weights continue to accumulate for Model 1, thus giving rise to oscillations.

How to terminate? We terminate training when the oscillations are stable and the ensemble model is in the lower-accuracy state, which corresponds to the state with lower correlation with color. To ensure the oscillations are stable, we do not terminate until a certain number of steps have been completed (in our experiments we set this duration to (training data size)/(batch size) steps). To capture the model in a state of lower correlation with color, we set a threshold on accuracy (we decide the threshold by observing the accuracy plot); we terminate only when the training accuracy falls below this threshold.

6. Conclusion

We developed a new framework based on game-theoretic tools to learn invariant predictors. We work with data from multiple environments. In our framework, we set up an ensemble game: we construct an ensemble of classifiers with each environment controlling one portion of the ensemble. Remarkably, the set of solutions to this game is exactly the set of invariant predictors across the training environments. The proposed framework performs comparably to the existing framework of Arjovsky et al. (2019) and also exhibits lower variance. We hope this framework opens new ways to address other problems pertaining to invariance in causal inference using tools from game theory.

References

Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., and Marchand, M. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Ash, R. B. and Doléans-Dade, C. A. Probability and Measure Theory. Academic Press, San Diego, California, 2000.

Bareinboim, E., Brito, C., and Pearl, J. Local characterizations of causal Bayesian networks. In Graph Structures for Knowledge Representation and Reasoning, pp. 1–17. Springer, 2012.
Barron, E., Goebel, R., and Jensen, R. Best response dynamics for continuous games. Proceedings of the American Mathematical Society, 138(3):1069–1083, 2010.

Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision, pp. 456–473, 2018.

Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pp. 137–144, 2007.

Bervoets, S., Bravo, M., and Faure, M. Learning and convergence to Nash in games with continuous action sets. Technical report, Working paper, 2016.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

de Haan, P., Jayaraman, D., and Levine, S. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pp. 11693–11704, 2019.

Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

Freund, Y., Schapire, R., and Abe, N. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.

Fudenberg, D. and Levine, D. K. The Theory of Learning in Games. MIT Press, Cambridge, Massachusetts, 1998.

Fudenberg, D. and Tirole, J. Game Theory. MIT Press, Cambridge, Massachusetts, 1991.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the International Conference on Machine Learning, pp. 513–520, 2011.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning. 2009.

Heinze-Deml, C., Peters, J., and Meinshausen, N. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018.

Herings, P. J.-J. and Predtetchinski, A. Best-response cycles in perfect information games. Mathematics of Operations Research, 42(2):427–433, 2017.

Hofbauer, J. and Sorin, S. Best response dynamics for continuous zero-sum games. Discrete and Continuous Dynamical Systems Series B, 6(1):215, 2006.

Hoffman, J., Mohri, M., and Zhang, N. Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems, pp. 8246–8256, 2018.

Hofmann, T., Schölkopf, B., and Smola, A. J. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.

Hsieh, Y.-P., Liu, C., and Cevher, V. Finding mixed Nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002, 2018.

Janzing, D. and Schölkopf, B. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.
Johansson, F. D., Ranganath, R., and Sontag, D. Support and invertibility in domain-invariant representations. arXiv preprint arXiv:1903.03448, 2019.

Lee, J. and Raginsky, M. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, pp. 2687–2696, 2018.

Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pp. 10846–10856, 2018.

Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507, 2019.

Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. arXiv preprint arXiv:1902.00146, 2019.

Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier, 2014.

Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Subbaswamy, A., Chen, B., and Saria, S. Should I include this edge in my prediction? Analyzing the stability-performance tradeoff. arXiv preprint arXiv:1905.11374, 2019.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.

Zhao, H., Combes, R. T. d., Zhang, K., and Gordon, G. J. On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453, 2019.

Zhao, S., Fard, M. M., Narasimhan, H., and Gupta, M. Metric-optimized example weights. arXiv preprint arXiv:1805.10582, 2018.