# heterogeneous_risk_minimization__e154c263.pdf Heterogeneous Risk Minimization Jiashuo Liu 1 Zheyuan Hu 1 Peng Cui 1 Bo Li 2 Zheyan Shen 1 Machine learning algorithms with empirical risk minimization usually suffer from poor generalization performance due to the greedy exploitation of correlations among the training data, which are not stable under distributional shifts. Recently, some invariant learning methods for outof-distribution (OOD) generalization have been proposed by leveraging multiple training environments to find invariant relationships. However, modern datasets are frequently assembled by merging data from multiple sources without explicit source labels. The resultant unobserved heterogeneity renders many invariant learning methods inapplicable. In this paper, we propose Heterogeneous Risk Minimization (HRM) framework to achieve joint learning of latent heterogeneity among the data and invariant relationship, which leads to stable prediction despite distributional shifts. We theoretically characterize the roles of the environment labels in invariant learning and justify our newly proposed HRM framework. Extensive experimental results validate the effectiveness of our HRM framework. 1. Introduction The effectiveness of machine learning algorithms with empirical risk minimization (ERM) relies on the assumption that the testing and training data are identically drawn from the same distribution, which is known as the IID hypothesis. However, distributional shifts between testing and training data are usually inevitable due to data selection biases or unobserved confounders that widely exist in real data. Under such circumstances, machine learning algorithms 1Department of Computer Science and Technology, Tsinghua University, Beijing, China; Email: {liujiashuo77, zyhu2001}@gmail.com, cuip@tsinghua.edu.cn, shenzy17@mails.tsinghua.edu.cn. 2School of Economics and Management, Tsinghua University, Beijing, China; Email: libo@sem.tsinghua.edu.cn. Correspondence to: Peng Cui . Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). with ERM usually suffer from poor generalization performance due to the greedy exploitation of correlations among the training data, which are not stable under distributional shifts. How to guarantee a machine learning algorithm with out-of-distribution (OOD) generalization ability and stable performances under distributional shifts is of paramount significance, especially in high-stake applications such as medical diagnosis, criminal justice, and financial analysis etc (Kukar, 2003; Berk et al., 2018; Rudin & Ustun, 2018). There are mainly two branches of methods proposed to solve the OOD generalization problem, namely distributionally robust optimization (DRO) (Esfahani & Kuhn, 2018; Duchi & Namkoong, 2018; Sinha et al., 2018; Sagawa et al., 2019) and invariant learning (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Chang et al., 2020). DRO methods aim to optimize the worst-performance over a distribution set to ensure their OOD generalization performances. While DRO is a powerful family of methods, it is often argued for its over-pessimism problem when the distribution set is large (Hu et al., 2018; Frogner et al., 2019). From another perspective, invariant learning methods propose to exploit the causally invariant correlations(rather than varying spurious correlations) across multiple training environments, resulting in out-of-distribution (OOD) optimal predictors. However, the effectiveness of such methods relies heavily on the quality of training environments, and the intrinsic role of environments in invariant learning remains vague in theory. More importantly, modern big data are frequently assembled by merging data from multiple sources without explicit source labels. The resultant unobserved heterogeneity renders these invariant learning methods inapplicable. In this paper, we propose Heterogeneous Risk Minimization (HRM), an optimization framework to achieve joint learning of the latent heterogeneity among the data and the invariant predictor, which leads to better generalization ability despite distributional shifts. More specifically, we theoretically characterize the roles of the environment labels in invariant learning, which motivates us to design two modules in the framework corresponding to heterogeneity identification and invariant learning respectively. We provide theoretical justification on the mutual promotion of these two modules, which resonates the joint optimization process in a reciprocal way. Extensive experiments in both synthetic and real-world experiments datasets demonstrate Submission and Formatting Instructions for ICML 2021 the superiority of HRM in terms of average performance, stability performance as well as worst-case performance under different settings of distributional shifts. We summarize our contributions as following: 1. We propose the novel HRM framework for OOD generalization without environment labels, in which heterogeneity identification and invariant prediction are jointly optimized. 2. We theoretically characterize the role of environments in invariant learning from the perspective of heterogeneity, based on which we propose a novel clustering method for heterogeneity identification from heterogeneous data. 3. We theoretically justify the mutual promotion relationship between heterogeneity identification and invariant learning, resonating the joint optimization process in HRM. 2. Problem Formulation 2.1. OOD and Maximal Invariant Predictor Following (Arjovsky et al., 2019; Chang et al., 2020), we consider a dataset D = {De}e supp(Etr), which is a mixture of data De = {(xe i, ye i )}ne i=1 collected from multiple training environments e supp(Etr), xe i X and ye i Y are the i-th data and label from environment e respectively and ne is number of samples in environment e. Environment labels are unavailable as in most real applications. Etr is a random variable on indices of training environments and P e is the distribution of data and label in environment e. The goal of this work is to find a predictor f( ) : X Y with good out-of-distribution generalization performance, which can be formalized as: arg min f max e supp(E) L(f|e) (1) where L(f|e) = E[l(f(X), Y )|e] = Ee[l(f(Xe), Y e)] is the risk of predictor f on environment e, and l( , ) : Y Y R+ is the loss function. E is the random variable on indices of all possible environments such that supp(E) supp(Etr). Usually, for all e supp(E) \ supp(Etr), the data and label distribution P e(X, Y ) can be quite different from that of training environments Etr. Therefore, the problem in Equation 1 is referred to as Out-of-Distribution (OOD) Generalization problem (Arjovsky et al., 2019). Without any prior knowledge or structural assumptions, it is impossible to figure out the OOD generalization problem, since one cannot characterize the unseen latent environments in supp(E). A commonly used assumption in invariant learning literature (Rojas-Carulla et al., 2015; Gong et al., 2016; Arjovsky et al., 2019; Kuang et al., 2020; Chang et al., 2020) is as follow: Assumption 2.1. There exists random variable Φ (X) such that the following properties hold: a. Invariance property: for all e, e supp(E), we have P e(Y |Φ (X)) = P e (Y |Φ (X)) holds. b. Sufficiency property: Y = f(Φ ) + ϵ, ϵ X. This assumption indicates invariance and sufficiency for predicting the target Y using Φ , which is known as invariant covariates or representations with stable relationships with Y across different environments e E. In order to acquire the invariant predictor Φ (X), a branch of work to find maximal invariant predictor (Chang et al., 2020; Koyama & Yamaguchi, 2020) has been proposed, where the invariance set and the corresponding maximal invariant predictor are defined as: Definition 2.1. The invariance set I with respect to E is defined as: IE = {Φ(X) : Y E|Φ(X)} = {Φ(X) : H[Y |Φ(X)] = H[Y |Φ(X), E]} (2) where H[ ] is the Shannon entropy of a random variable. The corresponding maximal invariant predictor (MIP) of IE is defined as: S = arg max Φ IE I(Y ; Φ) (3) where I( ; ) measures Shannon mutual information between two random variables. Here we prove that the MIP S can can guarantee OOD optimality, as indicated in Theorem 2.1. The formal statement of Theorem 2.1 as well as its proof can be found in appendix. Theorem 2.1. (Informal) For predictor Φ (X) satisfying Assumption 2.1, Φ is the maximal invariant predictor with respect to E and the solution to OOD problem in equation 1 is EY [Y |Φ ] = arg minf supe supp(E) E[L(f)|e]. Recently, some works suppose the availability of data from multiple environments with environment labels, wherein they can find MIP (Chang et al., 2020; Koyama & Yamaguchi, 2020). However, they rely on the underlying assumption that the invariance set IEtr of Etr is exactly the invariance set IE of all possible unseen environments E, which cannot be guaranteed as shown in Theorem 2.2. Theorem 2.2. IE IEtr As shown in Theorem 2.2 that IE IEtr, the learned predictor is only invariant to such limited environments Etr but is not guaranteed to be invariant with respect to all possible environments E. Here we give a toy example in Table 1 to illustrate this. We consider a binary classification between cats and dogs, where each photo contains 3 features, animal feature X1 {cat, dog}, a background feature X2 {on grass, in water} and the photographer s signature feature X3 {Irma, Eric}. Assume all possible testing Submission and Formatting Instructions for ICML 2021 Class 0 (Cats) Class 1 (Dogs) Index X1 X2 X3 X1 X2 X3 e1 Cats Water Irma Dogs Grass Eric e2 Cats Grass Eric Dogs Water Irma e3 Cats Water Eric Dogs Grass Irma e4 Cats Grass Irma Dogs Water Eric e5 Mixture: 90% data from e1 and 10% data from e2 e6 Mixture: 90% data from e3 and 10% data from e4 Table 1. A Toy Example for the difference between IE and IEtr. environments supp(E) = {e1, e2, e3, e4, e5, e6} and the train environment supp(Etr) = {e5, e6}, then IE = {Φ|Φ = Φ(X1)} while IEtr = {Φ|Φ = Φ(X1, X2)}. The reason is that e5, e6 only tell us X3 cannot be included in the invariance set but cannot exclude X2. But if e5 and e6 can be further divided into e1, e2 and e3, e4 respectively, the invariance set IEtr becomes IEtr = IE = {Φ(X1)}. This example shows that the manually labeled environments may not be sufficient to achieve MIP, not to mention the cases where environment labels are not available. This limitation necessitates the study on how to exploit the latent intrinsic heterogeneity in training data (like e5 and e6 in the above example) to form more refined environments for OOD generalization. The environments need to be subtly uncovered, in the sense of OOD generalization problem, as indicated by Theorem 2.3, not all environments are helpful to tighten the invariance set. Theorem 2.3. Given set of environments supp( ˆE), denote the corresponding invariance set I ˆE and the corresponding maximal invariant predictor ˆΦ. For one newlyadded environment enew with distribution P new(X, Y ), if P new(Y |ˆΦ) = P e(Y |ˆΦ) for e supp( ˆE), the invariance set constrained by supp( ˆE) {enew} is equal to I ˆE. 2.2. Problem of Heterogeneous Risk Minimization Besides Assumption 2.1, we make another assumption on the existence of heterogeneity in training data as: Assumption 2.2. Heterogeneity Assumption. For random variable pair (X, Φ ) and Φ satisfying Assumption 2.1, using functional representation lemma (El Gamal & Kim, 2011), there exists random variable Ψ such that X = X(Φ , Ψ ), then we assume P e(Y |Ψ ) can arbitrary change across environments e supp(E). The heterogeneity among provided environments can be evaluated by the compactness of the corresponding invariance set as |IE|. Specifically, smaller |IE| leads to higher heterogeneity, since more variant features can be excluded. Based on the assumption, we come up with the problem of heterogeneity exploitation for OOD generalization. Problem 1. Heterogeneous Risk Minimization. Given heterogeneous dataset D = {De}e supp(Elatent) without environment labels, the task is to generate environments Etr with minimal |IEtr| and learn invariant model under learned Etr with good OOD performance. Theorem 2.3 together with Assumption 2.2 indicate that, to better constrain IEtr, the effective way is to generate environments with varying P(Y |Ψ (X)) that can exclude variant features from IEtr. Under this problem setting, we encounter the circular dependency: first we need variant Ψ to generate heterogeneous environments Etr; then we need Etr to learned invariant Φ as well as variant Ψ . Furthermore, there exists positive feedback between these two steps. When acquiring Etr with tighter IEtr, more invariant predictor Φ(X) (i.e. a better approximation of MIP) can be found, which will further bring a clearer picture of variant parts, and therefore promote the generation of Etr. With this notion, we propose our framework for Heterogeneous Risk Minimization (HRM) which leverages the mutual promotion between the two steps and conduct joint optimization. In this work, we temporarily focus on a simple but general setting, where X = [Φ , Ψ ]T Rd in raw feature level and Φ , Ψ satisfy Assumption 2.1. Under this setting, Our Heterogeneous Risk Minimization (HRM) framework contains two interactive parts, the frontend Mc for heterogeneity identification and the backend Mp for invariant prediction. The general framework is shown in Figure 1. Given the pooled heterogeneous data, it starts with the het- Figure 1. The framework of HRM. erogeneity identification module Mc leveraging the learned variant representation Ψ(X) to generate heterogeneous environments Elearn. Then the learned environments are used by OOD prediction module Mp to learn the MIP Φ(X) as well as the invariant prediction model f(Φ(X)). After that, we derive the variant Ψ(X) to further boost the module Mc, which is supported by Theorem 2.3. As for the convert step, under our setting, we adopt feature selection in this work, through which more variant feature Ψ can be attained when more invariant feature Φ is learned. Specifically, the invariant predictor Φ(X) is generated as Φ(X) = M X, and the variant part Ψ(X) = (1 M) X correspondingly, where M {0, 1}d is the binary invariant feature selection mask. For instance, for Table 1, X = [X1, X2, X3], the Submission and Formatting Instructions for ICML 2021 ground truth binary mask is M = [1, 0, 0]. In this way, the better Φ is learned, the better Ψ can be obtained. Note that we use the soft selection which is more flexible and general in our algorithm with M [0, 1]d. The whole framework is jointly optimized, so that the mutual promotion between heterogeneity identification and invariant learning can be fully leveraged. 3.1. Implementation of Mp Here we introduce our invariant prediction module Mp, which takes multiple environments training data D = {De}e supp(Etr) as input, and outputs the corresponding invariant predictor f and the indices of invariant features M given current environments Etr. We combine feature selection with invariant learning under heterogeneous environments, which can select the features with stable/invariant correlations with the label across Etr. Specifically, the former module can select most informative features with respect to the loss function and latter module ensures the selected features are invariant. Their combination ensures Mp to select the most informative invariant features. For invariant learning, we follow the variance penalty regularizer proposed in (Koyama & Yamaguchi, 2020) and simplify it in feature selection scenarios. The objective function of Mp with M {0, 1}d is: Le(M X, Y ; θ) = EP e[ℓ(M Xe, Y e; θ)] (4) Lp(M X, Y ; θ) = EEtr[Le] + λtrace(Var Etr( θLe)) (5) However, as the optimization of hard feature selection with binary mask M suffers from high variance, we use the soft feature selection with gates taking continuous value in [0, 1]. Specifically, following (Yamada et al., 2020), we approximate each element of M = [m1, . . . , md]T to clipped Gaussian random variable parameterized by µ = [µ1, . . . , µd]T as mi = max{0, min{1, µi + ϵ}} (6) where ϵ is drawn from N(0, σ2). With this approximation, the objective function with soft feature selection can be written as: Le(θ, µ) = EP e EM [ℓ(M Xe, Y e; θ) + α M 0] (7) where M is a random vector with d independent variables mi for i [d]. Under the approximation in Equation 6, M 0 is simply P i [d] P(mi > 0) and can be calculated as M 0 = P i [d] CDF(µi/σ), where CDF is the standard Gaussian CDF. We formulate our objective as risk minimization problem: min θ,µ Lp(θ; µ) = EEtr[Le(θ, µ)] + λtrace(Var Etr( θLe)) (8) Le(θ, µ) = EP e EM ℓ(M Xe, Y e; θ) + α X i [d] CDF(µi/σ) Further, as for linear models, we simply approximate the regularizer trace(Var Etr( θLe)) by Var Etr( θLe) M 2. Then we obtain Φ(X) and Ψ(X) when we obtain µ as well as M. Further in Section 4, we theoretically prove that the prediction module Mp is able to learn the MIP with respect to given environments Etr. 3.2. Implementation of Mc Notation. Ψ means the learned variant part Ψ(X). K means K-dimension simplex. fθ( ) means the function f parameterized by θ. The heterogeneity identification module Mc takes a single dataset as input, and outputs a multi-environment dataset partition for invariant prediction. We implement it with a clustering algorithm. As indicated in Theorem 2.3, the more diverse P(Y |Ψ) for our generated environments, the better the invariance set I is. Therefore, we cluster the data points according to the relationship between Ψ and Y , for which we use P(Y |Ψ) as the cluster centre. Note that Ψ is initialized as Ψ(X) = X in our joint optimization. Specifically, we assume the j-th cluster centre PΘj(Y |Ψ) parameterized by Θj to be a Gaussian around fΘj(Ψ) as N(fΘj(Ψ), σ2): hj(Ψ, Y ) = PΘj(Y |Ψ) = 1 2πσ exp( (Y fΘj(Ψ))2 (10) For the given N = P e Etr ne empirical data samples D = {ψi(xi), yi}N i=1, the empirical distribution is modeled as ˆPN = 1 N PN i=1 δi(Ψ, Y ) where δi(Ψ, Y ) = ( 1, if Ψ = ψi and Y = yi 0, otherwise (11) The target of our heterogeneous clustering is to find a distribution in Q = {Q|Q = P j [K] qjhj(Ψ, Y ), q K} to fit the empirical distribution best. Therefore, the objective function of our heterogeneous clustering is: min Q Q DKL( ˆPN Q) (12) The above objective can be further simplified to: j=1 qjhj(ψi, yi) As for optimization, we use EM algorithm to optimize the centre parameter Θ and the mixture weight q. After optimizing equation 13, for building Etr, we assign each data point to environment ej Etr with probability: P(ej|Ψ, Y ) = qjhj(Ψ, Y )/ i=1 qihi(Ψ, Y ) In this way, Etr is generated by Mc. Submission and Formatting Instructions for ICML 2021 4. Theoretical Analysis In this section, we theoretically analyze our proposed Heterogeneous Risk Minimization (HRM) method. We first analyze our proposed invariant learning module Mp, and then justify the existence of the positive feedback in our HRM. Justification of Mp We prove that given training environments Etr, our invariant prediction model Mp can learn the maximal invariant predictor Φ(X) with respect to the corresponding invariance set IEtr. Theorem 4.1. Given Etr, the learned Φ(X) = M X is the maximal invariant predictor of IEtr. Justification of the Positive Feedback The core of our HRM framework is the mechanism for Mc and Mp to mutual promote each other. Here we theoretically justify the existence of such positive feedback. In Assumption 2.1, we assume the invariance and sufficiency properties of the stable features Φ and assume the relationship between unstable part Ψ and Y can arbitrarily change. Here we make a more specific assumption on the heterogeneity across environments with respect to Φ and Ψ . Assumption 4.1. Assume the pooled training data is made up of heterogeneous data sources: Ptr = P e supp(Etr) we P e. For any ei, ej Etr, ei = ej, we assume Ic i,j(Y ; Φ |Ψ ) max(Ii(Y ; Φ |Ψ ), Ij(Y ; Φ |Ψ )) (15) where Φ is invariant feature and Ψ the variant. Ii represents mutual information in P ei and Ic i,j represents the cross mutual information between P ei and P ej takes the form of Ic i,j(Y ; Φ|Ψ) = Hc i,j[Y |Ψ] Hc i,j[Y |Φ, Ψ] and Hc i,j[Y ] = R pei(y) log pej(y)dy. Here we would like to intuitively demonstrate this assumption. Firstly, the mutual information Ii(Y ; Φ ) = Hi[Y ] Hi[Y |Φ ] can be viewed as the error reduction if we use Φ to predict Y rather than predict by nothing. Then the cross mutual information Ii,j(Y ; Φ ) can be viewed as the error reduction if we use the predictor learned on Φ in environment ej to predict in environment ei, rather than predict by nothing. Therefore, the R.H.S in equation 15 measures that, in environment ei, how much prediction error can be reduced if we further add Φ for prediction rather than use only Ψ . And the L.H.S measures that, when using predictors trained in ei to predict in ej, how much prediction error can be reduced if we further add Φ for prediction rather than use only Ψ . Intuitively, Assumption 4.1 assumes that invariant feature Φ provides more information for predicting Y across environments than in one single environment, and correspondingly, the information provided by Ψ shrinks a lot across environments, which indicates that the relationship between variant feature Ψ and Y varies across environments. Based on this assumption, we first prove that the cluster centres are pulled apart as invariant feature is excluded from clustering. Theorem 4.2. For ei, ej supp(Etr), assume that X = [Φ , Ψ ]T satisfying Assumption 2.1, where Φ is invariant and Ψ variant. Then under Assumption 4.1, we have DKL(P ei(Y |X) P ej(Y |X)) DKL(P ei(Y |Ψ ) P ej(Y |Ψ )) Theorem 4.2 indicates that the distance between cluster centres is larger when using variant features Ψ , therefore, it is more likely to obtain the desired heterogeneous environments, which explains why we use learned variant part Ψ(X) for clustering. Finally, we provide the theorem for optimality guarantee for our HRM. Theorem 4.3. Under Assumption 2.1 and 4.1, for the proposed Mc and Mp, we have the following conclusions: 1. Given environments Etr such that IE = IEtr, the learned Φ(X) by Mp is the maximal invariant predictor of IE. 2. Given the maximal invariant predictor Φ of IE, assume the pooled training data is made up of data from all environments in supp(E), there exist one split that achieves the minimum of the objective function and meanwhile the invariance set regularized is equal to IE. Intuitively, Theorem 4.3 proves that given one of the Mc and Mp optimal, the other is optimal, which validates the existence of the global optimal point of our algorithm. 5. Experiment In this section, we validate the effectiveness of our method on simulation data and real-world data. Baselines We compare our proposed HRM with the following methods: Empirical Risk Minimization(ERM): minθ EP0[ℓ(θ; X, Y )] Distributionally Robust Optimization(DRO (Sinha et al., 2018)): minθ sup Q W (Q,P0) ρ EQ[ℓ(θ; X, Y )] Environment Inference for Invariant Learning(EIIL (Creager et al., 2020)): min Φ max u i ui(e)ℓ(w Φ(xi), yi)+ e E λ w|w=1.0 1 Ne i ui(e)ℓ(w Φ(xi), yi) 2 Invariant Risk Minimization(IRM (Arjovsky et al., 2019)) with environment Etr labels: e Etr Le + λ w|w=1.0Le(w Φ) 2 (17) Submission and Formatting Instructions for ICML 2021 Further, for ablation study, we also compare with HRMs, which runs HRM for only one iteration without the feedback loop. Note that IRM is based on multiple training environments and we provide environment Etr labels for IRM, while others do not need environment labels. Evaluation Metrics To evaluate the prediction performance, we use Mean Error defined as Mean Error = 1 |Etest| P e Etest Le, Std Error defined as Std Error = q 1 |Etest| 1 P e Etest(Le Mean Error)2, which are mean and standard deviation error across Etest and Max Error = maxe Etest Le, which are mean error, standard deviation error and worst-case error across Etest. Imbalanced Mixture It is a natural phenomena that empirical data follow a power-law distribution, i.e. only a few environments/subgroups are common and the rest are rare (Shen et al., 2018; Sagawa et al., 2019; 2020). Therefore, we perform non-uniform sampling among different environments in training set. 5.1. Simulation Data We design two mechanisms to simulate the varying correlations among covariates across environments, named by selection bias and anti-causal effect. Selection Bias In this setting, the correlations between variant covariates and the target are perturbed through selection bias mechanism. According to Assumption 2.1, we assume X = [Φ , Ψ ]T Rd and Y = f(Φ ) + ϵ and that P(Y |Φ ) remains invariant across environments while P(Y |Ψ ) changes arbitrarily. For simplicity, we select data points according to a certain variable set Vb Ψ : vi Vb |r| 5 |f(φ ) sign(r) vi| (18) where |r| > 1, Vb Rnb and ˆP(x) denotes the probability of point x to be selected. Intuitively, r eventually controls the strengths and direction of the spurious correlation between Vb and Y (i.e. if r > 0, a data point whose Vb is close to its y is more probably to be selected.). The larger value of |r| means the stronger spurious correlation between Vb and Y , and r 0 means positive correlation and vice versa. Therefore, here we use r to define different environments. In training, we generate sum = 2000 data points, where κ = 95% points from environment e1 with a predefined r and 1 κ = 5% points from e2 with r = 1.1. In testing, we generate data points for 10 environments with r [ 3, 2.7, 2.3, . . . , 2.3, 2.7, 3.0]. β is set to 1.0. We compare our HRM with ERM, DRO, EIIL and IRM for Linear Regression. We conduct extensive experiments with different settings on r, nb and d. In each setting, we carry out the procedure 10 times and report the average results. The results are shown in Table 2. Figure 2. Visualization of differences between environments in scenario 1 in selection bias experiment(r = 1.9). The left figure shows the initial clustering results using X, and the right one shows the learned Elearn using the learned variant part Ψ(X). From the results, we have the following observations and analysis: ERM suffers from the distributional shifts in testing and yields poor performance in most of the settings. DRO surprisingly has the worst performance, which we think is due to the over-pessimism problem (Frogner et al., 2019). EIIL has the similar performance with ERM, which indicates that its inferred environments cannot reveal the spurious correlations between Y and Vb. IRM performs much better than the above two baselines, however, as IRM depends on the available environment labels to work, it uses much more information than the other three methods. Compared to the three baselines, our HRM achieves nearly perfect performance with respect to average performance and stability, especially the variance of losses across environments is close to 0, which reflects the effectiveness of our heterogeneous clustering as well as the invariant learning algorithm. Furthermore, our HRM does not need environment labels, which verifies that our clustering algorithm can mine the latent heterogeneity inside the data and further shows our superiority to IRM. Besides, we visualize the differences between environments using Task2Vec (Achille et al., 2019) in Figure 2, where larger value means the two environments are more heterogeneous. The pooled training data are mixture of environments with r = 1.9 and r = 1.1, the difference between whom is shown in yellow box. And the red boxes show differences between learned environments by HRMs and HRM. The big promotion between Einit and Elearn verifies our HRM can exploit heterogeneity inside data as well as the existence of the positive feedback. Due to space limitation, results of varying sum, κ, nb as well as experimental details are left to appendix. Anti-causal Effect Inspired by (Arjovsky et al., 2019), we induce the spurious correlation by using anti-causal relationship from the target Y to the variant covariates Ψ . In this experiment, we assume X = [Φ , Ψ ]T Rd and firstly sample Φ from mixture Gaussian distribution characterized as Pk i=1 zk N(µi, I) and the target Y = θT φ Φ + βΦ1Φ2Φ3 + N(0, 0.3). Then the spurious Submission and Formatting Instructions for ICML 2021 Table 2. Results in selection bias simulation experiments of different methods with varying selection bias r, and dimensions nb and d of training data, and each result is averaged over ten times runs. Scenario 1: varying selection bias rate r (d = 10, nb = 1) r r = 1.5 r = 1.9 r = 2.3 Methods Mean Error Std Error Max Error Mean Error Std Error Max Error Mean Error Std Error Max Error ERM 0.476 0.064 0.524 0.510 0.108 0.608 0.532 0.139 0.690 DRO 0.467 0.046 0.516 0.512 0.111 0.625 0.535 0.143 0.746 EIIL 0.477 0.057 0.543 0.507 0.102 0.613 0.540 0.139 0.683 IRM(with Etr label) 0.460 0.014 0.475 0.456 0.015 0.472 0.461 0.015 0.475 HRMs 0.465 0.045 0.511 0.488 0.078 0.577 0.506 0.096 0.596 HRM 0.447 0.011 0.462 0.449 0.010 0.465 0.447 0.011 0.463 Scenario 2: varying dimension d (r = 1.9, nb = 0.1d) d d = 10 d = 20 d = 40 Methods Mean Error Std Error Max Error Mean Error Std Error Max Error Mean Error Std Error Max Error ERM 0.510 0.108 0.608 0.533 0.141 0.733 0.528 0.175 0.719 DRO 0.512 0.111 0.625 0.564 0.186 0.746 0.555 0.196 0.758 EIIL 0.507 0.102 0.613 0.543 0.147 0.699 0.542 0.178 0.727 IRM(with Etr label) 0.456 0.015 0.472 0.484 0.014 0.489 0.500 0.051 0.540 HRMs 0.488 0.078 0.577 0.486 0.069 0.555 0.477 0.081 0.553 HRM 0.449 0.010 0.465 0.466 0.011 0.478 0.465 0.015 0.482 Table 3. Prediction errors of the anti-causal effect experiment. We design two settings with different dimensions of Φ and Ψ as nφ and nψ respectively. The results are averaged over 10 runs. Scenario 1: nφ = 9, nψ = 1 e Training environments Testing environments Methods e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 ERM 0.290 0.308 0.376 0.419 0.478 0.538 0.596 0.626 0.640 0.689 DRO 0.289 0.310 0.388 0.428 0.517 0.610 0.627 0.669 0.679 0.739 EIIL 0.075 0.128 0.349 0.485 0.795 1.162 1.286 1.527 1.558 1.884 IRM(with Etr label) 0.306 0.312 0.325 0.328 0.343 0.358 0.365 0.374 0.377 0.392 HRMs 1.060 1.085 1.112 1.130 1.207 1.280 1.325 1.340 1.371 1.430 HRM 0.317 0.314 0.322 0.318 0.321 0.317 0.315 0.315 0.316 0.320 Scenario 2: nφ = 5, nψ = 5 e Training environments Testing environments Methods e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 ERM 0.238 0.286 0.433 0.512 0.629 0.727 0.818 0.860 0.895 0.980 DRO 0.237 0.294 0.452 0.529 0.651 0.778 0.859 0.911 0.950 1.028 EIIL 0.043 0.145 0.521 0.828 1.237 1.971 2.523 2.514 2.506 3.512 IRM(with Etr label) 0.287 0.293 0.329 0.345 0.382 0.420 0.444 0.461 0.478 0.504 HRMs 0.455 0.463 0.479 0.478 0.495 0.508 0.513 0.519 0.525 0.533 HRM 0.316 0.315 0.315 0.330 0.3200 0.317 0.326 0.330 0.333 0.335 correlations between Ψ and Y are generated by anti-causal effect as Ψ = θψY + N(0, σ(µi)2) (19) where σ(µi) means the Gaussian noise added to Ψ depends on which component the invariant covariates Φ belong to. Intuitively, in different Gaussian components, the corresponding correlations between Ψ and Y are varying due to the different value of σ(µi). The larger the σ(µi) is, the weaker correlation between Ψ and Y . We use the mixture weight Z = [z1, . . . , zk]T to define different environments, where different mixture weights represent different overall strength of the effect Y on Ψ . In this experiment, we set β = 0.1 and build 10 environments with varying σ and the dimension of Φ , Ψ , the first three for training and the last seven for testing. We run experiments for 10 times and the averaged results are shown in Table 3. EIIL achieves the best training performance with respect to prediction errors on training environments e1, e2, e3, while its performances in testing are poor. ERM suffers from distributional shifts in testing. DRO seeks for over-considered robustness and performs much worse. IRM performs much better as it learns invariant representations with help of environment labels. HRM achieves nearly uniformly good performance in training environments as well as the testing ones, which validates the effectiveness of our method and proves its excellent generalization ability. 5.2. Real-world Data We test our method on three real-world tasks, including car insurance prediction, people income prediction and house price prediction. 5.2.1. SETTINGS Car Insurance Prediction In this task, we use a realworld dataset for car insurance prediction (Kaggle). It is Submission and Formatting Instructions for ICML 2021 (a) Training and testing accuracy for the car insurance prediction. Left sub-figure shows the training results for 5 settings and the right shows their corresponding testing results. (b) Mis-Classification Rate for the income prediction. (c) Prediction error for the house price prediction. RMSE refers to the Root Mean Square Error. Figure 3. Results of real-world datasets, including training and testing performance for five methods. a classification task to predict whether a person will buy car insurance based on related information, such as vehicle damage, annual premium, vehicle age etc1. We impose selection bias mechanism on the correlation between the outcome (i.e. the label indicating whether buying insurance) and the sex attribute to simulate multiple environments. Specifically, we simulate different strengths |r| of the spurious correlation between sex and target in training, and reverse the direction of such correlation in testing(+|r| in training and |r| in testing). For IRM, in each setting, we divide the training data into three training environments with r1 = 0.95, r2 = 0.9, r3 = 0.8, and different overall correlation r corresponds to different numbers of data in e1, e2, e3. We perform 5 experiments with varying r and the results in both training and testing are shown in Figure 3(a). People Income Prediction In this task we use the Adult dataset (Dua & Graff, 2017) to predict personal income levels as above or below $50,000 per year based on personal details. We split the dataset into 10 environments according to demographic attributes sex and race. In training phase, all methods are trained on pooled data including 693 points from environment 1 and 200 from environment 2, and validated on 100 sampled from both. For IRM, the ground-truth environment labels are provided. In testing phase, we test all methods on the 10 environments and report the mis-classification rate on all environments in Figure 3(b). House Price Prediction In this experiment, we use a realworld regression dataset (Kaggle) of house sales prices from King County, USA2. The target variable is the transaction price of the house and each sample contains 17 predictive variables such as the built year of the house, number of bedrooms, and square footage of home, etc. We simulate 1https://www.kaggle.com/anmolkumar/health-insurancecross-sell-prediction 2https://www.kaggle.com/c/house-prices-advancedregressiontechniques/data different environments according to the built year of the house, since it is fairly reasonable to assume the correlations among covariates and the target may vary along time. Specifically, we split the dataset into 6 periods, where each period approximately covers a time span of two decades. All methods are trained on data from the first period([1900, 1920)) and test on the other periods. For IRM, we further divide the training data into two environments where built year [1900, 1910) and [1910, 1920) respectively. Results are shown in Figure 3(c). 5.2.2. ANALYSIS From the results of three real-world tasks, we have the following observations and analysis: ERM achieves high accuracy in training while performing much worse in testing, indicating its inability in dealing with OOD predictions. DRO s performance is not satisfactory, sometimes even worse than ERM. One plausible reason is its overpessimistic nature which leads to too conservative predictors. Comparatively, invariant learning methods perform better in testing. IRM performs better than ERM and DRO, which shows the usefulness of environment labels for OOD generalization and the possibility of learning invariant predictor from multiple environments. EIIL performs inconsistently across different tasks, possibly due to its instability of the environment inference method. In all tasks and almost all testing environments (16/18), HRM consistently achieves the best performances. HRM even outperforms IRM significantly in a unfair setting where we provide perfect environment labels for IRM. One one side, it shows the limitation of manually labeled environments. On the other side, it demonstrates that, relieving the dependence on environment labels, HRM can effectively uncover and fully leverage the intrinsic heterogeneity in training data for invariant learning. Submission and Formatting Instructions for ICML 2021 6. Related Works There are mainly two branches of methods for the OOD generalization problem, namely distributionally robust optimization (DRO) (Esfahani & Kuhn, 2018; Duchi & Namkoong, 2018; Sinha et al., 2018; Sagawa et al., 2019) and invariant learning (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Chang et al., 2020; Creager et al., 2020). DRO methods propose to optimize the worst-case risk within an uncertainty set, which lies around the observed training distribution and characterizes the potential testing distributions. However, in real scenarios, to better capture the testing distribution, the uncertainty set should be pretty large, which also results in the over-pessimism problem of DRO methods(Hu et al., 2018; Frogner et al., 2019). Realizing the difficulty of solving OOD generalization problem without any prior knowledge or structural assumptions, invariant learning methods assume the existence of causally invariant relationships between some predictors Φ(X) and the target Y . (Arjovsky et al., 2019) and (Koyama & Yamaguchi, 2020) propose to learning an invariant representation through multiple training environments. (Chang et al., 2020) also proposes to select features whose predictive relationship with the target stays invariant across environments. However, their effectiveness relies on the quality of the given multiple training environments, and the role of environments remains vague theoretically. Recently, (Creager et al., 2020) improves (Arjovsky et al., 2019) by relaxing its requirements for multiple environments. Specifically, (Creager et al., 2020) proposes a two-stage method, which firstly infers the environment division with a pre-provided biased model, and then performs invariant learning on the inferred environments. However, the two stages cannot be jointly optimized, and the environment division relies on the given biased model and lacks theoretical guarantees. 7. Discussions In this work, we theoretically analyze the role of environments in invariant learning, and propose our HRM for joint heterogeneity identification and invariant prediction, which relaxes the requirements for environment labels and opens a new direction for invariant learning. To our knowledge, this is the first work to both theoretically and empirically analyze how the equality of multiple environments affects invariant learning. This paper mainly focuses on the raw variable level with the assumption of X = [Φ , Ψ ]T , which is able to cover a broad spectrum of applications, e.g. healthcare, finance, marketing etc, where the raw variables are informative enough. However, our work has some limitations, which we hope to improve in the future. Firstly, in order to achieve the mutual promotion, we should use the variant features Ψ for heterogeneity identification rather than the invariant ones. However, the process of invariant prediction continuously discards the variant features Ψ (for invariant features or representation), which makes it quite hard to recover the variant features. To overcome this, we focus on the simple setting where X = [Φ , Ψ ]T , since we can directly obtain the variant features Ψ when having invariant features Φ . To further extend the power of HRM, we will consider to incorporate representation learning from X in future work. Secondly, our clustering algorithm in Mc lacks theoretical guarantees for its convergence. To the best of our knowledge, in order to theoretically analyze the convergence of a clustering algorithm, it is necessary to measure the distance between data points. However, our clustering algorithm takes models parameters as cluster centers and aims to cluster data points (X, Y ) according to their relationships between X and Y , whose dissimilarity cannot be easily measured, since the relationship is statistical magnitude and cannot be calculated individually. How to theoretically analyze the convergence property of such clustering algorithms remains unsolved. 8. Acknowledgements This work was supported in part by National Key R&D Program of China (No. 2018AAA0102004, No. 2020AAA0106300), National Natural Science Foundation of China (No. U1936219, 61521002, 61772304), Beijing Academy of Artificial Intelligence (BAAI), and a grant from the Institute for Guo Qiang, Tsinghua University. Bo Li s research was supported by the Tsinghua University Initiative Scientific Research Grant, No. 2019THZWJC11; Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400 and 2020AAA01084020108403; Major Program of the National Social Science Foundation of China (21ZDA036). Submission and Formatting Instructions for ICML 2021 Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., and Perona, P. Task2vec: Task embedding for meta-learning. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 6429 6438. IEEE, 2019. doi: 10.1109/ICCV.2019.00653. URL https://doi.org/10.1109/ICCV.2019. 00653. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez Paz, D. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, pp. 0049124118782533, 2018. Chang, S., Zhang, Y., Yu, M., and Jaakkola, T. S. Invariant rationalization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 1448 1458. PMLR, 2020. URL http://proceedings.mlr.press/ v119/chang20c.html. Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In ICML Workshop on Uncertainty and Robustness, 2020. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. ar Xiv preprint ar Xiv:1810.08750, 2018. El Gamal, A. and Kim, Y.-H. Network information theory. Network Information Theory, 12 2011. doi: 10.1017/ CBO9781139030687. Esfahani, P. M. and Kuhn, D. Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Math. Program., 171(1-2):115 166, 2018. doi: 10.1007/ s10107-017-1172-1. URL https://doi.org/10. 1007/s10107-017-1172-1. Frogner, C., Claici, S., Chien, E., and Solomon, J. Incorporating unlabeled data into distributionally robust learning. ar Xiv preprint ar Xiv:1912.07729, 2019. Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., and Sch olkopf, B. Domain adaptation with conditional transferable components. In Balcan, M. and Weinberger, K. Q. (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 2839 2848. JMLR.org, 2016. URL http://proceedings.mlr.press/ v48/gong16.html. Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 2034 2042. PMLR, 2018. URL http:// proceedings.mlr.press/v80/hu18a.html. Koyama, M. and Yamaguchi, S. Out-of-distribution generalization with maximal invariant predictor. Co RR, abs/2008.01883, 2020. URL https://arxiv.org/ abs/2008.01883. Kuang, K., Xiong, R., Cui, P., Athey, S., and Li, B. Stable prediction with model misspecification and agnostic distribution shift. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 4485 4492. AAAI Press, 2020. URL https://aaai.org/ojs/ index.php/AAAI/article/view/5876. Kukar, M. Transductive reliability estimation for medical diagnosis. Artificial Intelligence in Medicine, 29(1-2): 81 106, 2003. Rojas-Carulla, M., Schlkopf, B., Turner, R., and Peters, J. Invariant models for causal transfer learning. Stats, 2015. Rudin, C. and Ustun, B. Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice. Interfaces, 48(5):449 466, 2018. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv preprint ar Xiv:1911.08731, 2019. Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. 2020. Shen, Z., Cui, P., Kuang, K., Li, B., and Chen, P. Causally regularized learning with agnostic data selection bias. In 2018 ACM Multimedia Conference, 2018. Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. International Conference on Learning Representations, 2018. Submission and Formatting Instructions for ICML 2021 Yamada, Y., Lindenbaum, O., Negahban, S., and Kluger, Y. Feature selection using stochastic gates. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 10648 10659. PMLR, 2020. URL http://proceedings.mlr.press/v119/ yamada20a.html.