Published as a conference paper at ICLR 2023

MEASURING THE PREDICTIVE HETEROGENEITY

Jiashuo Liu, Jiayun Wu, Renjie Pi, Renzhe Xu, Xingxuan Zhang, Bo Li, Peng Cui
Tsinghua University; Hong Kong University of Science and Technology
{liujiashuo77, jiayun.wu.work}@gmail.com, rpi@connect.ust.hk, xrz199721@gmail.com, xingxuanzhang@hotmail.com, libo@sem.tsinghua.edu.cn, cuip@tsinghua.edu.cn

ABSTRACT

As an intrinsic and fundamental property of big data, data heterogeneity exists in a variety of real-world applications, such as agriculture, sociology, and health care. For machine learning algorithms, ignoring data heterogeneity significantly hurts generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and formalize it as the Predictive Heterogeneity, which takes model capacity and computational constraints into account. We prove that it can be reliably estimated from finite data with PAC bounds, even in high dimensions. Additionally, we propose the Information Maximization (IM) algorithm, a bi-level optimization algorithm, to explore the predictive heterogeneity of data. Empirically, the explored predictive heterogeneity provides insights for sub-population divisions in agriculture, sociology, and object recognition, and leveraging such heterogeneity benefits out-of-distribution generalization performance.

1 INTRODUCTION

Big data bring great opportunities to modern society and promote the development of machine learning, facilitating human life in a wide variety of areas, such as the digital economy, health care, and scientific discovery. Along with this progress, the intrinsic heterogeneity of big data introduces new challenges to machine learning systems and data scientists (Fan et al., 2014; He, 2017). In general, data heterogeneity, as a fundamental property of big data, refers to any diversity inside data, including the diversity of data sources, data generation mechanisms, sub-populations, data structures, etc. When not properly treated, data heterogeneity can bring pitfalls to machine learning systems, especially in high-stakes applications such as precision medicine, autonomous driving, and financial risk management (Dzobo et al., 2018; Breitenstein et al., 2020; Challen et al., 2019), leading to poor out-of-distribution generalization and fairness issues. For example, in supervised learning tasks where models learn from data to predict a target variable from given covariates, when the whole dataset consists of multiple sub-populations with distributional shifts or different prediction mechanisms, traditional machine learning algorithms mainly fit the majority while ignoring the minority, which hurts generalization and compromises algorithmic fairness, as shown in (Kearns et al., 2018; Sagawa et al., 2019; Duchi & Namkoong, 2021). Another well-known example is Simpson's paradox, which brings false discoveries to social research (Wagner, 1982; Hernán et al., 2011).

Despite its widespread existence, data heterogeneity has not converged to a uniform formulation so far, owing to its complexity, and it has different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology based on system properties and complexity or variability.
Rosenbaum (2005) views the uncertainty of the potential outcome as unit heterogeneity in observational studies in economics. More recently, in machine learning, several works on causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Liu et al., 2021; Creager et al., 2021) and robust learning (Sagawa et al., 2019; Liu et al., 2022) leverage heterogeneous data from multiple environments to improve out-of-distribution generalization. However, previous works have not provided a precise definition or sound quantification of data heterogeneity. In this work, from the perspective of prediction power, we propose the predictive heterogeneity, a new type of data heterogeneity.

From the machine learning perspective, the main concern is the possible negative effect of data heterogeneity on prediction. Therefore, given the complexity of data heterogeneity, we focus on the data heterogeneity that affects the prediction of machine learning models, which could facilitate the building of machine learning systems, and we name it the predictive heterogeneity. We give it a precise definition: it is quantified as the maximal additional predictive information that can be gained by dividing the whole data distribution into sub-populations. The new measure takes model capacity and computational constraints into account, and can be reliably estimated from finite samples with PAC bounds, even in high dimensions. We theoretically analyze its properties and examine it under typical cases of data heterogeneity (Fan et al., 2014). Additionally, we design the Information Maximization (IM) algorithm to empirically explore the predictive heterogeneity inside data. Empirically, we find that the explored heterogeneity is explainable and provides insights for sub-population division in many fields, including agriculture, sociology, and object recognition. Moreover, the explored sub-populations can be leveraged to enhance the out-of-distribution generalization of machine learning models, which we verify with both simulated and real-world data.

2 PRELIMINARIES ON MUTUAL INFORMATION AND PREDICTIVE V-INFORMATION

In this section, we briefly introduce the mutual information and the predictive V-information (Xu et al., 2020), which are the preliminaries of our proposed predictive heterogeneity.

Notations. For a probability triple $(\mathcal{S}, \mathcal{F}, P)$, define random variables $X: \mathcal{S} \to \mathcal{X}$ and $Y: \mathcal{S} \to \mathcal{Y}$, where $\mathcal{X}$ is the covariate space and $\mathcal{Y}$ is the target space. Accordingly, $x \in \mathcal{X}$ denotes the covariates and $y \in \mathcal{Y}$ denotes the target. Denote the set of random categorical variables by $\mathcal{C} = \{C: \mathcal{S} \to \mathbb{N} \mid \mathrm{supp}(C) \text{ is finite}\}$. Additionally, $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$ denote the sets of all probability measures over the Borel algebras on the spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively. $H(\cdot)$ denotes the Shannon entropy of a random variable, and $H(\cdot \mid \cdot)$ denotes the conditional entropy of two random variables.

In information theory, the mutual information of two random variables $X, Y$ measures the dependence between the two variables, quantifying the reduction of entropy of one variable when observing the other:
$$I(X; Y) = H(Y) - H(Y \mid X). \tag{1}$$
It is known that the mutual information is associated with the predictability of $Y$ (Cover & Thomas, 1991). However, the standard definition of mutual information unrealistically assumes unbounded computational capacity of the predictor, rendering it hard to estimate, especially in high dimensions.
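As a concrete, worked instance of Equation 1 (our toy illustration, not from the paper), the mutual information of a discrete pair can be computed exactly from its joint table:

```python
import numpy as np

# Joint distribution P(X, Y) for binary X, Y (rows: x, columns: y).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_y = p_xy.sum(axis=0)                       # marginal P(Y)
h_y = -np.sum(p_y * np.log(p_y))             # H(Y)

p_x = p_xy.sum(axis=1)                       # marginal P(X)
# H(Y|X) = -sum_{x,y} P(x,y) log P(y|x)
h_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))

print(h_y - h_y_given_x)   # I(X; Y) = H(Y) - H(Y|X) ~= 0.193 nats
```

For continuous, high-dimensional $X$, no such direct computation is available, which is exactly the estimation difficulty that motivates the predictive V-information below.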
To mitigate this problem, Xu et al. (2020) propose the predictive V-information under realistic computational constraints, where the predictor is only allowed to use models in a predictive family $\mathcal{V}$ to predict the target variable $Y$.

Definition 1 (Predictive Family (Xu et al., 2020)). Let $\Omega = \{f: \mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})\}$. We say that $\mathcal{V} \subseteq \Omega$ is a predictive family if it satisfies:
$$\forall f \in \mathcal{V},\ \forall P \in \mathrm{range}(f),\ \exists f' \in \mathcal{V}\ \text{s.t.}\ \forall x \in \mathcal{X},\ f'[x] = P,\ f'[\varnothing] = P. \tag{2}$$
A predictive family contains all predictive models that are allowed to be used, which forms computational or statistical constraints. The additional condition in Equation 2 means that the predictor can always ignore the input covariates ($x$) if it chooses to (using only $\varnothing$).

Definition 2 (Predictive V-information (Xu et al., 2020)). Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{V}$ be a predictive family. The predictive V-information from $X$ to $Y$ is defined as:
$$I_\mathcal{V}(X \to Y) = H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X), \tag{3}$$
where $H_\mathcal{V}(Y \mid \varnothing)$ and $H_\mathcal{V}(Y \mid X)$ are the predictive conditional V-entropies, defined as:
$$H_\mathcal{V}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y}[-\log f[x](y)], \tag{4}$$
$$H_\mathcal{V}(Y \mid \varnothing) = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}[-\log f[\varnothing](y)]. \tag{5}$$
Note that $f \in \mathcal{V}$ is a function $\mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})$, so $f[x] \in \mathcal{P}(\mathcal{Y})$ is a probability measure on $\mathcal{Y}$, and $f[x](y) \in \mathbb{R}$ is the density evaluated at $y \in \mathcal{Y}$. $H_\mathcal{V}(Y \mid \varnothing)$ is also denoted by $H_\mathcal{V}(Y)$.

Compared with the mutual information, the predictive V-information restricts the computational power of the predictor and is much easier to estimate in high-dimensional cases. When the predictive family $\mathcal{V}$ contains all possible models, i.e. $\mathcal{V} = \Omega$, it is proved that $I_\mathcal{V}(X \to Y) = I(X; Y)$ (Xu et al., 2020).

3 PREDICTIVE HETEROGENEITY

In this paper, from the machine learning perspective, we quantify the data heterogeneity that affects decision making, named Predictive Heterogeneity, which is easy to integrate with machine learning algorithms and could help analyze big data and build more rational algorithms.

3.1 INTERACTION HETEROGENEITY

To formally define the predictive heterogeneity, we begin with the formulation of the interaction heterogeneity:

Definition 3 (Interaction Heterogeneity). Let $X, Y$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$. Denote the set of random categorical variables by $\mathcal{C}$, and take a subset $\mathcal{E} \subseteq \mathcal{C}$. Then $\mathcal{E}$ is an environment set iff there exists $E \in \mathcal{E}$ such that $(X, Y) \perp E$. Each $E \in \mathcal{E}$ is called an environment variable. The interaction heterogeneity between $X$ and $Y$ w.r.t. the environment set $\mathcal{E}$ is defined as:
$$\mathcal{H}_\mathcal{E}(X, Y) = \sup_{E \in \mathcal{E}} I(Y; X \mid E) - I(Y; X). \tag{6}$$

Each environment variable $E$ represents a stochastic partition of $\mathcal{X} \times \mathcal{Y}$, and the condition on the environment set implies that the joint distribution of $X, Y$ can be preserved in each environment. In information theory, $I(Y; X \mid E) - I(Y; X)$ is called the interaction information, which measures the influence of the environment variable $E$ on the amount of information shared between the target $Y$ and the covariates $X$. The interaction heterogeneity defined in Equation 6 quantifies the maximal additional information that can be gained from involving or uncovering the environment variable $E$. Intuitively, a large $\mathcal{H}_\mathcal{E}(X, Y)$ indicates that the predictive power from $X$ to $Y$ is enhanced by $E$, which means that uncovering the latent sub-populations associated with the environment partition $E$ will benefit the $X \to Y$ prediction.

3.2 PREDICTIVE HETEROGENEITY

Based on the mutual information, the interaction heterogeneity is quite hard to compute, since the standard mutual information is notoriously difficult to estimate, especially in big-data scenarios.
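In contrast, each term of Definition 2 is just an empirical risk minimization inside $\mathcal{V}$, so the predictive V-information can be estimated directly from samples. Below is a minimal sketch (our illustration, not code from the paper), assuming the linear-Gaussian family $\mathcal{V} = \{f[x] = \mathcal{N}(\theta^T x + b, 1)\}$, for which the infima reduce to least squares:

```python
import numpy as np

def h_v(x, y):
    """Empirical H_V(Y|X) for V = { f[x] = N(<theta, x> + b, 1) }.
    With unit variance, -log f[x](y) = (y - <theta,x> - b)^2 / 2 + const."""
    n = len(y)
    if x is None:                                  # f may ignore the input: H_V(Y | empty)
        resid = y - y.mean()
    else:
        xb = np.column_stack([x, np.ones(n)])      # append intercept
        theta, *_ = np.linalg.lstsq(xb, y, rcond=None)
        resid = y - xb @ theta
    return 0.5 * np.mean(resid ** 2) + 0.5 * np.log(2 * np.pi)

def v_information(x, y):
    """Empirical I_V(X -> Y) = H_V(Y|empty) - H_V(Y|X)."""
    return h_v(None, y) - h_v(x, y)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 5))
y = x @ rng.normal(size=5) + rng.normal(size=1000)
print(v_information(x, y))   # > 0: X is predictive of Y within V
```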
Moreover, even if the mutual information could be accurately estimated, the prediction model may not be able to make good use of it. Inspired by Xu et al. (2020), we propose the Predictive Heterogeneity, which measures the interaction heterogeneity that can be captured under computational constraints and that affects the prediction of models within a specified predictive family. To begin with, we propose the Conditional Predictive V-information, which generalizes the predictive V-information.

Definition 4 (Conditional Predictive V-information). Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and $E$ be an environment variable. The conditional predictive V-information is defined as:
$$I_\mathcal{V}(X \to Y \mid E) = H_\mathcal{V}(Y \mid \varnothing, E) - H_\mathcal{V}(Y \mid X, E), \tag{7}$$
where $H_\mathcal{V}(Y \mid \varnothing, E)$ and $H_\mathcal{V}(Y \mid X, E)$ are defined as:
$$H_\mathcal{V}(Y \mid X, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y \mid E = e}[-\log f[x](y)]\Big], \tag{8}$$
$$H_\mathcal{V}(Y \mid \varnothing, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y \mid E = e}[-\log f[\varnothing](y)]\Big]. \tag{9}$$

Intuitively, the conditional predictive V-information is the weighted average of the predictive V-information over environments. We are now ready to formalize the predictive heterogeneity measure.

Definition 5 (Predictive Heterogeneity). Let $X, Y$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{E}$ be an environment set. The predictive heterogeneity for the prediction task $X \to Y$ with respect to $\mathcal{E}$ is defined as:
$$\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y), \tag{10}$$
where $I_\mathcal{V}(X \to Y)$ is the predictive V-information following Definition 2.

Leveraging the predictive V-information, the predictive heterogeneity defined in Equation 10 characterizes the maximal additional information that can be used by the prediction model when the environment variable $E$ is involved. It restricts the prediction models to $\mathcal{V}$, and the explored additional information can benefit the prediction performance of a model $f \in \mathcal{V}$, for which reason it is named the predictive heterogeneity. Next, we present some basic properties of the interaction heterogeneity and the predictive heterogeneity.

Proposition 1 (Basic Properties of Predictive Heterogeneity). Let $X, Y$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$, $\mathcal{V}$ be a predictive family, and $\mathcal{E}, \mathcal{E}_1, \mathcal{E}_2$ be environment sets.
1. Monotonicity: If $\mathcal{E}_1 \subseteq \mathcal{E}_2$, then $\mathcal{H}^{\mathcal{E}_1}_\mathcal{V}(X \to Y) \le \mathcal{H}^{\mathcal{E}_2}_\mathcal{V}(X \to Y)$.
2. Nonnegativity: $\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) \ge 0$.
3. Boundedness: $\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) \le H_\mathcal{V}(Y \mid X)$.
4. Corner Case: If the predictive family $\mathcal{V}$ is the largest possible predictive family $\Omega$ that includes all possible models, then $\mathcal{H}_\mathcal{E}(X, Y) = \mathcal{H}^\mathcal{E}_\Omega(X \to Y)$.

For further theoretical properties of the predictive heterogeneity, in Section 3.3 we derive its explicit forms under endogeneity, a common reflection of data heterogeneity. And we demonstrate in Section 3.4 that our proposed predictive heterogeneity can be empirically estimated with guarantees if the complexity of $\mathcal{V}$ (e.g., its Rademacher complexity) is bounded.
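To make Definition 5 concrete: for a single fixed, hard candidate partition the supremum drops out and the measure is a difference of two empirical V-information estimates. A minimal sketch (our illustration; it reuses `v_information` from the previous block, and the IM algorithm of Section 4 searches over soft partitions instead of fixing one):

```python
import numpy as np
# Reuses v_information() from the previous sketch (same Python session).

def heterogeneity_fixed_partition(x, y, env):
    """Lower bound on Definition 5: a single fixed hard partition `env`
    (integer codes) replaces the supremum over environment variables."""
    cond = sum((env == e).mean() * v_information(x[env == e], y[env == e])
               for e in np.unique(env))            # I_V(X -> Y | E)
    return cond - v_information(x, y)              # minus I_V(X -> Y)

# Two sub-populations with opposite mechanisms: Y = +X or Y = -X.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))
env = rng.integers(0, 2, size=2000)
y = np.where(env == 1, 1.0, -1.0) * x[:, 0] + 0.1 * rng.normal(size=2000)
print(heterogeneity_fixed_partition(x, y, env))   # clearly positive
```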
3.3 THEORETICAL PROPERTIES IN LINEAR CASES

In this section, we analyze the theoretical properties of the predictive heterogeneity in several linear settings, including (1) a homogeneous case with independent noise and (2) heterogeneous cases with endogeneity brought by selection bias and by hidden variables. Under these typical settings we can approximate the analytical forms of the proposed measure, and the conclusions provide insights for general cases.

Firstly, under a homogeneous case with no data heterogeneity, Theorem 1 proves that our measure is bounded by the scale of the label noise (which is usually small) and reduces to 0 in the linear case under mild assumptions, indicating that the predictive heterogeneity is insensitive to independent noise. Note that in the linear case we only deal with environment variables satisfying $X \perp \epsilon \mid E$, since in common prediction tasks the independent noise is unknown and cannot realistically be exploited for inferring latent environments $E$.

Theorem 1 (Homogeneous Case with Independent Noise). For a prediction task $X \to Y$ where $X, Y$ are random variables taking values in $\mathbb{R}^n \times \mathbb{R}$, consider the data generation process $Y = g(X) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $g: \mathbb{R}^n \to \mathbb{R}$ is a measurable function.
1) For a function class $\mathcal{G}$ such that $g \in \mathcal{G}$, define the predictive family $\mathcal{V}_\mathcal{G} = \{f \mid f[x] = \mathcal{N}(\varphi(x), \sigma_\mathcal{V}^2),\ \varphi \in \mathcal{G},\ \sigma_\mathcal{V} \in \mathbb{R}^+\}$. For any environment set $\mathcal{E}$, we have $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\mathcal{G}}(X \to Y) \le \pi\sigma^2$.
2) Take $n = 1$ and $g(x) = \beta x$, $\beta \in \mathbb{R}$. Assume $\mathbb{E}[X] = 0$ and that $\mathbb{E}[X^2]$ exists. Given the function family $\mathcal{V}_\sigma = \{f \mid f[x] = \mathcal{N}(\theta x, \sigma^2),\ \theta \in \mathbb{R},\ \sigma \text{ fixed}\}$ and the environment set $\mathcal{E} = \{E \mid E \in \mathcal{C},\ |\mathrm{supp}(E)| = 2,\ X \perp \epsilon \mid E\}$, we have $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) = 0$.

Secondly, we examine the proposed measure under two typical cases of data heterogeneity (Fan et al., 2014), namely endogeneity from selection bias (Heckman, 1979; Winship & Mare, 1992; Cui & Athey, 2022) and endogeneity with hidden variables (Fan et al., 2014; Arjovsky et al., 2019). In Theorem 2, we consider the prediction task $X \to Y$ with $X, Y$ taking values in $\mathbb{R}^2 \times \mathbb{R}$, and let $X = [S, V]^T$. The predictive family is specified as:
$$\mathcal{V} = \{f \mid f[x] = \mathcal{N}(\theta_S S + \theta_V V, \sigma^2),\ \theta_S, \theta_V \in \mathbb{R},\ \sigma = 1\}. \tag{11}$$
The data distribution $P(X, Y)$ is a mixture of latent sub-populations, which can be formulated by an environment variable $E^* \in \mathcal{C}$ such that $P(X, Y) = \sum_{e \in \mathrm{supp}(E^*)} P(E^* = e) P(X, Y \mid E^* = e)$. For each $e \in \mathrm{supp}(E^*)$, $P(X, Y \mid E^* = e)$ is the distribution of a homogeneous sub-population. Note that the prediction task is to predict $Y$ from the covariates $X$, while the sub-population structure is latent, i.e. $P(E^* \mid X, Y)$ is unknown to the model. In the following, we derive the analytical form of our measure under the selection-bias case.

Theorem 2 (Endogeneity with Selection Bias). For the prediction task $X = [S, V]^T \to Y$ with a latent environment variable $E^*$, the data generation process with selection bias is defined as:
$$Y = \beta S + f(S) + \epsilon_Y,\ \ \epsilon_Y \sim \mathcal{N}(0, \sigma_Y^2); \qquad V = r(E^*) f(S) + \sigma(E^*)\, \epsilon_V,\ \ \epsilon_V \sim \mathcal{N}(0, 1), \tag{12}$$
where $f: \mathbb{R} \to \mathbb{R}$ and $r, \sigma: \mathrm{supp}(E^*) \to \mathbb{R}$ are measurable functions and $\beta \in \mathbb{R}$. Assume that $\mathbb{E}[f(S)S] = 0$ and that there exists $L > 1$ such that $L\sigma^2(E^*) < r^2(E^*)\,\mathbb{E}[f^2]$. For the predictive family defined in Equation 11 and the environment set $\mathcal{E} = \mathcal{C}$, the predictive heterogeneity of the prediction task $[S, V]^T \to Y$ approximates to:
$$\mathcal{H}^\mathcal{C}_\mathcal{V}(X \to Y) \approx \frac{\mathrm{Var}(r_e)\,\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}{\mathbb{E}[r_e^2]\,\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}\; \mathbb{E}[f^2(S)], \tag{13}$$
with approximation error bounded by $\frac{1}{2}\max\big(\sigma_Y^2,\ R(r, \sigma, f)\big)$, where $r_e$ abbreviates $r(E^*)$, $\sigma_e$ abbreviates $\sigma(E^*)$, and
$$R(r(E^*), \sigma(E^*), f) = \mathbb{E}\bigg[\Big(\frac{1}{r_e^2\mathbb{E}[f^2]/\sigma_e^2 + 1}\Big)^2\bigg]\mathbb{E}[f^2] + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e/\sigma_e + \sigma_e/(r_e\mathbb{E}[f^2])}\Big)^2\bigg] < \mathbb{E}[f^2]\Big(\frac{1}{(L+1)^2} + \frac{1}{L + 2 + 1/L}\Big).$$

Intuitively, the data generation process in Theorem 2 introduces a spurious correlation between the spurious feature $V$ and the target $Y$, which varies across sub-populations (i.e. $r(E^*)$ and $\sigma(E^*)$ vary) and brings about data heterogeneity. Here $\mathbb{E}[f(S)S] = 0$ indicates a model misspecification, since the nonlinear term $f(S)$ cannot be captured by the linear predictive family using the stable feature $S$. The constant $L$ characterizes the strength of the spurious correlation between $V$ and $Y$: larger $L$ means $V$ provides more information for prediction.
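A small simulation of the generation process in Equation 12 makes the effect visible (our illustration; the particular choices of $f$, $r$, $\sigma$, $\beta$ below are hypothetical and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical instantiation of Equation 12: f(S) = S^2 - 1 gives
# E[f(S)S] = 0 for S ~ N(0,1); r(E*) flips sign across two equally
# likely environments; sigma(E*) is held constant.
e = rng.integers(0, 2, size=n)                 # latent environment E*
s = rng.normal(size=n)
f_s = s ** 2 - 1
r = np.where(e == 0, 2.0, -2.0)                # r(E*)
v = r * f_s + 0.3 * rng.normal(size=n)         # V = r(E*) f(S) + sigma(E*) eps_V
y = 1.5 * s + f_s + 0.1 * rng.normal(size=n)   # beta = 1.5, sigma_Y = 0.1
x = np.column_stack([s, v])

def mse_lin(x_, y_):
    """MSE of the best linear model, as in the family of Equation 11."""
    xb = np.column_stack([x_, np.ones(len(y_))])
    theta, *_ = np.linalg.lstsq(xb, y_, rcond=None)
    return np.mean((y_ - xb @ theta) ** 2)

pooled = mse_lin(x, y)
per_env = sum((e == k).mean() * mse_lin(x[e == k], y[e == k]) for k in (0, 1))
print(pooled, per_env)   # pooled fit is far worse: the partition carries information
```

Since $\mathrm{Var}(r_e)$ is large here, Equation 13 predicts a large heterogeneity, and indeed the per-environment linear fits are markedly better than the pooled one.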
From the approximation in Equation 13, we can see that our proposed predictive heterogeneity is dominated by two terms: (1) $\mathrm{Var}[r(E^*)]/\mathbb{E}[r^2(E^*)]$, which characterizes the variance of $r(E^*)$ among sub-populations; and (2) $\mathbb{E}[f^2(S)]$, which reflects the strength of the model misspecification. These two components account for the two sources of data heterogeneity under selection bias, which validates the rationality of our proposed measure. According to the theorem, the more $r(E^*)$ varies among the sub-populations and the stronger the model misspecification, the larger the predictive heterogeneity.

In general, Theorems 1 and 2 indicate that (1) our proposed measure is insensitive to homogeneous cases, and (2) for the two typical sources of data heterogeneity, our measure accounts for the key components reflecting the latent heterogeneity. Therefore, the theoretical results validate the rationality of our measure.

3.4 PAC GUARANTEES FOR PREDICTIVE HETEROGENEITY ESTIMATION

Defined under explicit computational constraints, our predictive heterogeneity can be empirically estimated with guarantees if the complexity of the model family $\mathcal{V}$ is bounded. In this work, we provide finite-sample generalization bounds in terms of the Rademacher complexity. First, we describe the empirical predictive heterogeneity informally; the explicit formula can be found in Definition 7 in the Appendix.

Definition 6 (Empirical Predictive Heterogeneity (informal)). For the prediction task $X \to Y$ with $X, Y$ taking values in $\mathcal{X} \times \mathcal{Y}$, a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ is independently and identically drawn from the population $X, Y$. Given the predictive family $\mathcal{V}$ and the environment set $\mathcal{E}_K = \{E \mid E \in \mathcal{C}, |\mathrm{supp}(E)| = K\}$, where $K \in \mathbb{N}^+$ is the number of environments, the empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$ with respect to $D$ is obtained by estimating $\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y)$ on $D$, with expectations replaced by statistics of the finite samples. The formal definition is placed in Definition 7.

Theorem 3 (PAC Bound). Consider the prediction task $X \to Y$ where $X, Y$ are random variables taking values in $\mathcal{X} \times \mathcal{Y}$. Assume that the predictive family $\mathcal{V}$ satisfies $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[x](y) \in [-B, B]$ where $B > 0$. For given $K \in \mathbb{N}$, the environment set is defined as $\mathcal{E}_K = \{E \mid E \in \mathcal{C}, |\mathrm{supp}(E)| = K\}$. Let $\mathcal{Q}$ be the set of all probability distributions of $(X, Y, E)$ where $E \in \mathcal{E}_K$. Take an $e \in \mathrm{supp}(E)$ and define the function class $\mathcal{G}_\mathcal{V} = \{g \mid g(x, y) = \log f[x](y)\, Q(E = e \mid x, y),\ f \in \mathcal{V},\ Q \in \mathcal{Q}\}$. Denote the Rademacher complexity of a function class $\mathcal{G}$ with $N$ samples by $\mathcal{R}_N(\mathcal{G})$. Then for any $\delta \in (0, 1/(2K + 2))$, with probability at least $1 - 2(K + 1)\delta$, for a dataset $D$ independently and identically drawn from $X, Y$, we have:
$$\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y) - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)\big| \le 4(K + 1)\,\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2(K + 1)B\sqrt{\frac{2\log(1/\delta)}{|D|}}, \tag{14}$$
where $\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) = O(|D|^{-1/2})$ (Bartlett & Mendelson, 2002).

4 ALGORITHM

To empirically estimate the predictive heterogeneity in Definition 6, we derive the Information Maximization (IM) algorithm from the formal definition in Equation 33 to infer the distribution of $E$ that maximizes the empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$.

Objective Function. Given the dataset $D = \{X_N, Y_N\} = \{(x_i, y_i)\}_{i=1}^N$ and denoting $\mathrm{supp}(E) = \{e_1, \dots, e_K\}$, we parameterize the distribution of $E \mid (X_N, Y_N)$ with a weight matrix $W \in \mathcal{W}_K$, where $K$ is the pre-defined number of environments and $\mathcal{W}_K = \{W : W \in \mathbb{R}^{N \times K}_+ \text{ and } W \mathbf{1}_K = \mathbf{1}_N\}$ is the allowed weight space. Each element $w_{ij}$ of $W$ represents $P(E = e_j \mid x_i, y_i)$, i.e. the probability of the $i$-th data point belonging to the $j$-th sub-population.
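Because the optimization below performs gradient updates on $W$, each row must be mapped back to a valid distribution over the $K$ sub-populations after every step. A minimal sketch of the row-wise Euclidean projection onto the simplex (our illustration; the paper specifies projected gradient descent but not a particular projection — we use the standard sorting-based method):

```python
import numpy as np

def project_rows_to_simplex(w):
    """Euclidean projection of each row of w onto the probability simplex,
    so that W stays in W_K = {W >= 0, W 1_K = 1_N} after a gradient step."""
    n, k = w.shape
    u = -np.sort(-w, axis=1)                        # sort each row descending
    css = np.cumsum(u, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    cond = u - css / idx > 0
    rho = k - np.argmax(cond[:, ::-1], axis=1) - 1  # last index where cond holds
    theta = css[np.arange(n), rho] / (rho + 1.0)
    return np.maximum(w - theta[:, None], 0.0)

# Example: a noisy 4x2 score matrix becomes a valid P(E = e_j | x_i, y_i).
w = np.array([[1.2, -0.3], [0.5, 0.5], [3.0, 1.0], [-1.0, -2.0]])
print(project_rows_to_simplex(w))   # rows are nonnegative and sum to 1
```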
For a predictive family $\mathcal{V}$, the solution to the supremum problem in Definition 7 is equivalent to the following objective function:
$$\min_{W \in \mathcal{W}_K}\ R_\mathcal{V}\big(W, \theta_1^*(W), \dots, \theta_K^*(W)\big) = \sum_{i=1}^N \sum_{j=1}^K w_{ij}\, \ell_\mathcal{V}\big(f_{\theta_j^*}(x_i), y_i\big) + U_\mathcal{V}(W, Y_N)$$
$$\text{s.t.}\quad \theta_j^*(W) \in \arg\min_{\theta} \sum_{i=1}^N w_{ij}\, \ell_\mathcal{V}\big(f_\theta(x_i), y_i\big), \quad j = 1, \dots, K, \tag{15}$$
where $f_\theta: \mathcal{X} \to \mathcal{Y}$ denotes a predicting function parameterized by $\theta$, $\ell_\mathcal{V}(\cdot, \cdot): \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function, and $U_\mathcal{V}(W, Y_N)$ is a regularizer. Specifically, $f_\theta$, $\ell_\mathcal{V}$ and $U_\mathcal{V}$ are determined by the predictive family $\mathcal{V}$. Here we provide implementations for two typical and general machine learning tasks, regression and classification.

(1) For the regression task, the predictive family is typically modeled as:
$$\mathcal{V}_1 = \{g: g[x] = \mathcal{N}(f_\theta(x), \sigma^2) \mid f \text{ is the predicting function, } \theta \text{ is learnable, } \sigma \text{ is a constant}\}. \tag{16}$$
The corresponding loss function is $\ell_{\mathcal{V}_1}(f_\theta(X), Y) = (f_\theta(X) - Y)^2$, and the regularizer becomes
$$U_{\mathcal{V}_1}(W, Y_N) = \mathrm{Var}_{j \in [K]}\big(\bar{Y}^j_N\big) = \sum_{j=1}^K \frac{1}{N}\Big(\sum_{i=1}^N w_{ij}\Big)\big(\bar{Y}^j_N - \bar{Y}_N\big)^2, \qquad \bar{Y}^j_N = \frac{\sum_{i=1}^N w_{ij}\, y_i}{\sum_{i=1}^N w_{ij}}, \tag{17}$$
where $\bar{Y}^j_N$ denotes the mean value of the label $Y$ given $E = e_j$, $\bar{Y}_N$ the overall mean, and $U_{\mathcal{V}_1}(W, Y_N)$ computes the variance of $\bar{Y}^j_N$ among the sub-populations $e_1, \dots, e_K$.

(2) For the classification task, the predictive family is typically modeled as:
$$\mathcal{V}_2 = \{g: g[x] = f_\theta(x) \in \Delta^c \mid f \text{ is the classification model and } \theta \text{ is learnable}\}, \tag{18}$$
where $c$ is the number of classes and $\Delta^c$ denotes the $c$-dimensional simplex. Each model in the predictive family $\mathcal{V}_2$ outputs a discrete distribution in the form of a $c$-dimensional simplex. In this case, the corresponding loss function $\ell_{\mathcal{V}_2}(\cdot, \cdot)$ is the cross-entropy loss, and the regularizer becomes $U_{\mathcal{V}_2}(W, Y_N) = -\sum_{j=1}^K \frac{1}{N}\big(\sum_{i=1}^N w_{ij}\big) H(\bar{Y}^j_N)$, where $H(\bar{Y}^j_N)$ is the entropy of $Y$ given $E = e_j$.

Optimization. The bi-level optimization in Equation 15 can be solved by performing projected gradient descent w.r.t. $W$. The gradient of $W$ can be calculated by (omitting the subscript $\mathcal{V}$):
$$\nabla_W R = \nabla_W U + \big[\ell(f_{\theta_j^*}(x_i), y_i)\big]^{N \times K}_{i,j} + \sum_{j=1}^K \nabla_{\theta_j} R\big|_{\theta_j^*}\, \nabla_W\, \theta_j^*, \tag{19}$$
where the hypergradient $\nabla_W \theta_j^*$ of each sub-population's inner solution can be approximated by back-propagating through the (truncated) inner optimization trajectory (Shaban et al., 2019).

5 EXPERIMENTS

We validate the IM algorithm on simulated data and on the Colored MNIST benchmark.

Data Generation of Simulated Data. Each sub-population carries a factor $r$, and the data heterogeneity is brought by endogeneity with hidden variables (Fan et al., 2014): $V$ is the spurious feature whose relationship with $Y$ is unstable across sub-populations and is controlled by the factor $r$. Intuitively, $\mathrm{sign}(r)$ controls whether the spurious correlation between $V$ and $Y$ is positive or negative, and $|r|$ controls its strength, i.e. larger $|r|$ means stronger spurious correlation. In training, we generate 10000 points, where the major group contains 80% of the data with $r = 1.9$ (i.e. strong positive spurious correlation) and the minor group contains 20% of the data with $r = -1.9$ (i.e. strong negative spurious correlation). In testing, we evaluate the performance on the two groups respectively, and we also set $r = -2.3$ and $r = -2.7$ to simulate stronger distributional shifts. We use linear regression and set $K = 2$ for all methods, and we report the mean squared errors (MSE) of all methods.

Data Generation of Colored MNIST. Following Arjovsky et al. (2019), we design a binary classification task constructed on the MNIST dataset. Firstly, digits 0–4 are labeled $\tilde{Y} = 0$ and digits 5–9 are labeled $\tilde{Y} = 1$. Secondly, noisy labels $Y$ are induced by randomly flipping the label $\tilde{Y}$ with a probability of 0.2. Then we sample the color id $V$, spuriously correlated with $Y$, as
$$V = \begin{cases} +Y, & \text{with probability } r, \\ -Y, & \text{with probability } 1 - r. \end{cases}$$
In fact, $r$ controls the spurious correlation between $Y$ and $V$.
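The label and color sampling above can be sketched as follows (our illustration; we encode the color id as 0/1 rather than $\pm Y$, and the actual coloring of image tensors is omitted):

```python
import numpy as np

def colored_mnist_labels(digits, r, rng):
    """Label/color sampling described above. digits: MNIST digit classes 0-9."""
    y_clean = (digits >= 5).astype(int)            # digits 0-4 -> 0, 5-9 -> 1
    flip = rng.random(len(digits)) < 0.2           # 20% label noise
    y = np.where(flip, 1 - y_clean, y_clean)
    agree = rng.random(len(digits)) < r            # color agrees with Y w.p. r
    v = np.where(agree, y, 1 - y)                  # color id in {0, 1}
    return y, v

rng = np.random.default_rng(0)
digits = rng.integers(0, 10, size=10000)
y, v = colored_mnist_labels(digits, r=0.85, rng=rng)   # training setting below
print((y == v).mean())   # ~0.85: strong but unstable spurious correlation
```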
In training, we randomly sample 10000 data points and set $r = 0.85$, meaning that for 85% of the data $V$ is positively correlated with $Y$, while for the remaining 15% the spurious correlation is negative, which causes data heterogeneity w.r.t. $V$ and $Y$. In testing, we set $r = 0$ (a strong negative spurious correlation), bringing strong shifts between training and testing.

Analysis. From the results in Table 1, for both the simulated data and Colored MNIST, the two backbones equipped with our IM algorithm achieve the best OOD generalization performance. For the simulated data, the learned predictive heterogeneity also enables the backbone algorithms to treat the majority and minority inside the data equally (i.e. a low performance gap between "Major" and "Minor"), and significantly benefits OOD generalization. Furthermore, for both experiments, we plot the learned sub-populations of our IM algorithm in Figures 4 and 5. From Figure 4, compared with KMeans and EIIL, our predictive heterogeneity exposes the spurious correlation between $V$ and $Y$ and enables the backbone algorithms to eliminate it. From Figure 5, the learned sub-populations of our method also reflect the different directions of the spurious correlation between the digit labels $Y$ and the colors (red or green), which helps the backbone methods avoid using colors to predict digits.

6 CONCLUSION

We define the predictive heterogeneity, the first quantitative formulation of the data heterogeneity that affects the prediction of machine learning models. We demonstrate its theoretical properties and show that it benefits out-of-distribution generalization performance.

ACKNOWLEDGEMENTS

We would like to thank Yuting Pan, Jiaming Song, Fan Bao and the anonymous reviewers for helpful feedback. Peng Cui's research was supported in part by the National Key R&D Program of China (No. 2018AAA0102004, No. 2020AAA0106300), the National Natural Science Foundation of China (No. U1936219, 62141607), and the Beijing Academy of Artificial Intelligence (BAAI). Bo Li's research was supported by the National Natural Science Foundation of China (No. 72171131, 72133002) and the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403.

REFERENCES

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Jasmin Breitenstein, Jan-Aike Termöhlen, Daniel Lipinski, and Tim Fingscheidt. Systematization of corner cases for visual perception in automated driving. In 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1257–1264. IEEE, 2020.

Robert Challen, Joshua Denny, Martin Pitt, Luke Gompels, Tom Edwards, and Krasimira Tsaneva-Atanasova. Artificial intelligence, bias and clinical safety. BMJ Quality & Safety, 28(3):231–237, 2019.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. New York: Wiley, 1991.

Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In International Conference on Machine Learning, pp. 2189–2200. PMLR, 2021.

Peng Cui and Susan Athey. Stable learning establishes some common ground between causal inference and machine learning. Nature Machine Intelligence, 4(2):110–115, 2022.

John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong.
Distributionally robust losses for latent covariate mixtures. Operations Research, 2022.

John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.

Kevin Dzobo, Dimakatso Alice Senthebane, Nicholas Ekow Thomford, Arielle Rowe, Collet Dandara, and M Iqbal Parker. Not everyone fits the mold: Intratumor and intertumor heterogeneity and innovative cancer drug design and development. OMICS: A Journal of Integrative Biology, 22(1):17–34, 2018.

Jianqing Fan, Fang Han, and Han Liu. Challenges of big data analysis. National Science Review, 1(2):293–314, 2014.

Jingrui He. Learning from data heterogeneity: Algorithms and applications. In IJCAI, pp. 5126–5130, 2017.

James J Heckman. Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, pp. 153–161, 1979.

Sumyea Helal. Subgroup discovery algorithms: a survey and empirical evaluation. Journal of Computer Science and Technology, 31(3):561–576, 2016.

Miguel A Hernán, David Clayton, and Niels Keiding. The Simpson's paradox unraveled. International Journal of Epidemiology, 40(3):780–785, 2011.

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pp. 2564–2572. PMLR, 2018.

R. Kohavi and B. Becker. Adult, 1996.

Masanori Koyama and Shoichiro Yamaguchi. Out-of-distribution generalization with maximal invariant predictor. 2020.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.

H Li and JF Reynolds. On definition and quantification of heterogeneity. Oikos, pp. 280–284, 1995.

Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Heterogeneous risk minimization. In International Conference on Machine Learning, pp. 6804–6814. PMLR, 2021.

Jiashuo Liu, Zheyan Shen, Peng Cui, Linjun Zhou, Kun Kuang, and Bo Li. Distributionally robust learning with stable adversarial training. IEEE Transactions on Knowledge and Data Engineering, 2022.

David B Lobell, Marshall B Burke, Claudia Tebaldi, Michael D Mastrandrea, Walter P Falcon, and Rosamond L Naylor. Prioritizing climate change adaptation needs for food security in 2030. Science, 319(5863):607–610, 2008.

Todd K Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.

Douglas A Reynolds. Gaussian mixture models. Encyclopedia of Biometrics, 741:659–663, 2009.

Paul R Rosenbaum. Heterogeneity and causality: Unit heterogeneity and design sensitivity in observational studies. The American Statistician, 59(2):147–152, 2005.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723–1732. PMLR, 2019.

Clifford H Wagner.
Simpson's paradox in real life. The American Statistician, 36(1):46–48, 1982.

Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022–2032, 2019.

Christopher Winship and Robert D Mare. Models for sample selection bias. Annual Review of Sociology, pp. 327–350, 1992.

Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=r1eBeyHFDH.

Shengjia Zhao, Abhishek Sinha, Yutong He, Aidan Perreault, Jiaming Song, and Stefano Ermon. Comparing distributions by measuring differences that affect decision making. In International Conference on Learning Representations, 2021.

Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, and Tong Zhang. Model agnostic sample reweighting for out-of-distribution learning. In International Conference on Machine Learning, pp. 27203–27221. PMLR, 2022.

A FORMAL DEFINITION OF EMPIRICAL PREDICTIVE HETEROGENEITY

In this section, we derive the explicit formula for the empirical estimate of the predictive heterogeneity described in Definition 6. The dataset $D = \{(x_i, y_i)\}_{i=1}^{|D|}$ is independently and identically drawn from the population $X, Y$. Given a predictive family $\mathcal{V}$ and an environment set $\mathcal{E}_K$, let $\mathcal{Q}$ be the set of all probability distributions of $(X, Y, E)$ where $E \in \mathcal{E}_K$. For a given $E$, denote $\mathrm{supp}(E) = \{e_k\}_{k=1}^K$. The empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$ is given by:
$$\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D) = \sup_{E \in \mathcal{E}_K} \hat{I}_\mathcal{V}(X \to Y \mid E; D) - \hat{I}_\mathcal{V}(X \to Y; D) \tag{24}$$
$$= \sup_{\hat{Q} \in \mathcal{Q}} \Big[\sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D) - \sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D)\Big] - \big[\hat{H}_\mathcal{V}(Y; D) - \hat{H}_\mathcal{V}(Y \mid X; D)\big]. \tag{25–26}$$
Specifically,
$$\hat{Q}(E = e_k)\,\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D) = \inf_{f \in \mathcal{V}} \hat{Q}(E = e_k) \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \frac{\hat{Q}(x_i, y_i \mid E = e_k)}{\sum_{(x_j, y_j) \in D} \hat{Q}(x_j, y_j \mid E = e_k)} \tag{27–28}$$
$$= \inf_{f \in \mathcal{V}} \hat{Q}(E = e_k) \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \frac{\hat{Q}(E = e_k \mid x_i, y_i)\,\hat{Q}(x_i, y_i)}{\sum_{(x_j, y_j) \in D} \hat{Q}(E = e_k \mid x_j, y_j)\,\hat{Q}(x_j, y_j)} \tag{29}$$
$$= \inf_{f \in \mathcal{V}} \hat{Q}(E = e_k) \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \frac{\hat{Q}(E = e_k \mid x_i, y_i)\,\hat{Q}(x_i, y_i)}{\hat{Q}(E = e_k)} \tag{30}$$
$$= \inf_{f \in \mathcal{V}} \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \hat{Q}(E = e_k \mid x_i, y_i)\, \hat{Q}(x_i, y_i) \tag{31}$$
$$= \inf_{f \in \mathcal{V}} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \hat{Q}(E = e_k \mid x_i, y_i), \tag{32}$$
where the last step uses the empirical distribution $\hat{Q}(x_i, y_i) = 1/|D|$. The explicit formulas for $\hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D)$, $\hat{H}_\mathcal{V}(Y \mid X; D)$ and $\hat{H}_\mathcal{V}(Y; D)$ can be derived similarly. We are now ready to formally define the empirical predictive heterogeneity.

Definition 7 (Empirical Predictive Heterogeneity (formal)). For the prediction task $X \to Y$ with $X, Y$ taking values in $\mathcal{X} \times \mathcal{Y}$, a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ is independently and identically drawn from the population $X, Y$. Given the predictive family $\mathcal{V}$ and the environment set $\mathcal{E}_K = \{E \mid E \in \mathcal{C}, |\mathrm{supp}(E)| = K\}$, where $K \in \mathbb{N}$, let $\mathcal{Q}$ be the set of all probability distributions of $(X, Y, E)$ with $E \in \mathcal{E}_K$. The empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$ with respect to $D$ is defined as:
$$\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D) = \sup_{\hat{Q} \in \mathcal{Q}} \Big[\sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D) - \sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D)\Big] - \big[\hat{H}_\mathcal{V}(Y; D) - \hat{H}_\mathcal{V}(Y \mid X; D)\big], \tag{33}$$
where
$$\hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i)\,\hat{Q}(E = e_k \mid x_i, y_i), \tag{34}$$
$$\hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[\varnothing](y_i)\,\hat{Q}(E = e_k \mid x_i, y_i), \tag{35}$$
$$\hat{H}_\mathcal{V}(Y \mid X; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i), \tag{36}$$
$$\hat{H}_\mathcal{V}(Y; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[\varnothing](y_i). \tag{37}$$
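For the Gaussian family of Section 4, the infimum in Equation 34 is a weighted least squares; a minimal sketch (our illustration; `q_e` is one column of the weight matrix $W$ from Section 4):

```python
import numpy as np

def weighted_term_eq34(x, y, q_e):
    """Empirical Q(E=e_k) * H_V(Y | X, E=e_k; D) of Equation 34 for the
    linear-Gaussian family with sigma = 1: the inf becomes weighted LS."""
    xb = np.column_stack([x, np.ones(len(y))])
    sw = np.sqrt(q_e)                          # q_e[i] = Q(E=e_k | x_i, y_i)
    theta, *_ = np.linalg.lstsq(xb * sw[:, None], y * sw, rcond=None)
    nll = 0.5 * (y - xb @ theta) ** 2 + 0.5 * np.log(2 * np.pi)
    return np.mean(q_e * nll)                  # (1/|D|) sum_i q_e[i] * (-log f)
```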
B SENSITIVITY OF K

In the experiments of Section 5, we set $K = 2$ for easy illustration. In this section, we add results for different choices of $K$ on the simulated experiment of Section 5.2, showing that the OOD generalization performance of typical algorithms combined with our proposed method is not sensitive to the choice of $K$.

Figure 6: The out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones. We plot the errors of the different backbones under $r = -2.7$, which introduces strong distributional shifts from the training data.

In Figure 6, we show the out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones, testing under $r = -2.7$, which has a strong distributional shift from the training distribution. From the results, the performance of the three OOD generalization methods is not much affected by the choice of $K$, and from Table 1, our method significantly outperforms all the baselines.

We also add one more experiment to show that (1) when the chosen $K$ is smaller than the ground truth, the performance of our method drops but remains better than ERM, and (2) when the chosen $K$ is larger, the performance is not much affected (consistent with the results above).

Experiment Setting: The input features $X = [S, T, V] \in \mathbb{R}^{10}$ consist of stable features $S \in \mathbb{R}^5$, noisy features $T \in \mathbb{R}^4$ and a spurious feature $V \in \mathbb{R}$:
$$S \sim \mathcal{N}(2, 2I_5), \qquad T \sim \mathcal{N}(0, 2I_4), \qquad Y = \theta_S^T S + S_1 S_2 S_3 + \mathcal{N}(0, 0.5),$$
and we generate the spurious feature via
$$V = \theta_V^e\, Y + \mathcal{N}(0, 0.3),$$
where $\theta_V^e$ varies across sub-populations, depending on which sub-population the data point belongs to. In training, we sample 8000 data points from $e_1$ with $\theta_V^1 = 3.0$, 1000 points from $e_2$ with $\theta_V^2 = 1.0$, 1000 points from $e_3$ with $\theta_V^3 = -2.0$ and 1000 points from $e_4$ with $\theta_V^4 = -3.0$. Therefore, the ground-truth number of sub-populations is 4. In testing, we test the performance on $e_4$ with $\theta_V^4 = -3.0$, which has strong distributional shifts from the training data. The average MSE over 10 runs is shown in Figure 7.

Figure 7: The out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones for the added experiment. The ground-truth number of sub-populations is 4.

From the results, we can see that when $K$ is smaller than the ground truth, increasing $K$ benefits the OOD generalization performance, and when $K$ is larger, the performance is not much affected, which is consistent with the results in Figure 6. For our IM algorithm, we see mainly two ways to choose $K$:

According to the predictive heterogeneity index: when the chosen $K$ is smaller than the ground truth, our measure tends to increase quickly as $K$ increases; once $K$ exceeds the ground truth, the increase slows down, which can direct people to an appropriate $K$.

According to the prediction models: since our IM algorithm aims to learn sub-populations with different prediction mechanisms, one can compare the learned model parameters $\theta_1, \dots, \theta_K$ to judge whether $K$ is much larger than the ground truth; if two resulting models are quite similar, $K$ may be too large (one sub-population has been divided into two). For linear models, one can directly compare the coefficients, as sketched below; for deep models, one can calculate the transfer losses across sub-populations.
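A minimal sketch of this coefficient-comparison heuristic (our illustration, with hypothetical learned coefficients):

```python
import numpy as np

def redundant_pairs(thetas, tol=0.98):
    """Flag pairs of sub-population models whose coefficients nearly coincide,
    suggesting the chosen K over-splits one sub-population."""
    t = np.asarray(thetas, dtype=float)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = t @ t.T                                  # pairwise cosine similarity
    return [(i, j) for i in range(len(t)) for j in range(i + 1, len(t))
            if sim[i, j] > tol]

# Hypothetical coefficients learned by IM with K = 3 on a 2-sub-population task:
thetas = [[1.0, 2.0, -0.5],
          [0.98, 2.05, -0.48],    # nearly identical to the first -> K too large
          [-1.2, 0.3, 2.0]]
print(redundant_pairs(thetas))    # [(0, 1)]
```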
A detailed analysis of the best choice of $K$ is left for future work.

C RELATED WORK

To the best of our knowledge, data heterogeneity has not converged to a uniform formulation so far, and it has different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology based on system properties and complexity or variability. Rosenbaum (2005) views the uncertainty of the potential outcome as unit heterogeneity in observational studies in economics. For graph data, heterogeneity refers to the various types of nodes and edges (Wang et al., 2019). More recently, in machine learning, several works on causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Creager et al., 2021) and robust learning (Sagawa et al., 2019) leverage heterogeneous data from multiple environments to improve out-of-distribution generalization. Specifically, invariant learning methods (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Creager et al., 2021; Zhou et al., 2022) leverage heterogeneous environments to learn invariant predictors with uniform performance across environments. In the distributionally robust optimization field, Sagawa et al. (2019) and Duchi et al. (2022) propose to optimize the worst-group prediction error to guarantee OOD generalization. However, in machine learning, previous works have not provided a precise definition or sound quantification of data heterogeneity, which makes it confusing and hard to leverage for developing more rational machine learning algorithms.

As for clustering algorithms, most of them focus only on the covariates $X$, typified by KMeans and the Gaussian Mixture Model (GMM; Reynolds, 2009). However, the clusters learned by KMeans cannot reflect the predictive heterogeneity, as shown in our experiments. Expectation maximization (EM; Moon, 1996) can also be used for clustering, but our IM algorithm differs essentially from EM: IM infers latent variables that maximize the predictive heterogeneity, whereas EM maximizes the likelihood. There are also methods from the invariant learning field that infer environments (Creager et al., 2021); though they can benefit OOD generalization, they lack a theoretical foundation and only work in some settings.

D PROOF OF PROPOSITION 1

Proof of Proposition 1.
1. Monotonicity: Since $\mathcal{E}_1 \subseteq \mathcal{E}_2$,
$$\mathcal{H}^{\mathcal{E}_1}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}_1} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y) \le \sup_{E \in \mathcal{E}_2} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y) = \mathcal{H}^{\mathcal{E}_2}_\mathcal{V}(X \to Y). \tag{38–40}$$
2. Nonnegativity: According to the definition of the environment set, there exists $E_0 \in \mathcal{E}$ such that for any $e \in \mathrm{supp}(E_0)$, $X, Y \mid E_0 = e$ is identically distributed as $X, Y$. Thus we have
$$\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}}\big[H_\mathcal{V}(Y \mid \varnothing, E) - H_\mathcal{V}(Y \mid X, E)\big] - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] \ge \big[H_\mathcal{V}(Y \mid \varnothing, E_0) - H_\mathcal{V}(Y \mid X, E_0)\big] - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big]. \tag{41–42}$$
Specifically,
$$H_\mathcal{V}(Y \mid X, E_0) = \mathbb{E}_{e \sim E_0}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y \mid E_0 = e}[-\log f[x](y)]\Big] = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y}[-\log f[x](y)] = H_\mathcal{V}(Y \mid X). \tag{43–45}$$
Similarly, $H_\mathcal{V}(Y \mid \varnothing, E_0) = H_\mathcal{V}(Y \mid \varnothing)$. Thus $\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) \ge 0$.
3. Boundedness: First, we have
$$H_\mathcal{V}(Y \mid X, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y \mid E = e}[-\log f[x](y)]\Big] \ge 0, \tag{46–48}$$
by noticing that $\mathbb{E}_{y \sim Y \mid x, e}[-\log f[x](y)]$ is the cross entropy between $Y \mid x, e$ and $f[x]$. Second,
$$H_\mathcal{V}(Y \mid \varnothing, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y \mid E = e}[-\log f[\varnothing](y)]\Big] \le \inf_{f \in \mathcal{V}} \mathbb{E}_{e \sim E}\mathbb{E}_{y \sim Y \mid E = e}[-\log f[\varnothing](y)] = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}[-\log f[\varnothing](y)] = H_\mathcal{V}(Y \mid \varnothing), \tag{49–52}$$
where Equation 50 is due to Jensen's inequality.
Combining the above inequalities,
$$\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}}\big[H_\mathcal{V}(Y \mid \varnothing, E) - H_\mathcal{V}(Y \mid X, E)\big] - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] \tag{53}$$
$$\le \sup_{E \in \mathcal{E}} H_\mathcal{V}(Y \mid \varnothing, E) - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] \le H_\mathcal{V}(Y \mid \varnothing) - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] = H_\mathcal{V}(Y \mid X). \tag{54–56}$$
4. Corner Case: According to Proposition 2 in Xu et al. (2020),
$$H_\Omega(Y \mid \varnothing) = H(Y), \qquad H_\Omega(Y \mid X) = H(Y \mid X). \tag{57–58}$$
By taking random variables $R, S$ identically distributed as $X, Y \mid E = e$ for $e \in \mathrm{supp}(E)$, we have
$$H_\Omega(Y \mid X, E = e) = H_\Omega(S \mid R) = H(S \mid R) = H(Y \mid X, E = e), \tag{59}$$
$$H_\Omega(Y \mid X, E) = \mathbb{E}_{e \sim E}[H_\Omega(Y \mid X, E = e)] = \mathbb{E}_{e \sim E}[H(Y \mid X, E = e)] = H(Y \mid X, E). \tag{60}$$
Similarly, we have $H_\Omega(Y \mid \varnothing, E) = H(Y \mid E)$. Thus,
$$\mathcal{H}^\mathcal{E}_\Omega(X \to Y) = \sup_{E \in \mathcal{E}}\big[H_\Omega(Y \mid \varnothing, E) - H_\Omega(Y \mid X, E)\big] - \big[H_\Omega(Y \mid \varnothing) - H_\Omega(Y \mid X)\big] \tag{61}$$
$$= \sup_{E \in \mathcal{E}}\big[H(Y \mid E) - H(Y \mid X, E)\big] - \big[H(Y) - H(Y \mid X)\big] = \sup_{E \in \mathcal{E}} I(Y; X \mid E) - I(Y; X) = \mathcal{H}_\mathcal{E}(X, Y). \tag{62–64}$$

E PROOF OF THEOREM 1

Proof of Theorem 1. 1) We have
$$H_{\mathcal{V}_\mathcal{G}}(Y \mid X) = \inf_{f \in \mathcal{V}_\mathcal{G}} \mathbb{E}_{x \sim X}\mathbb{E}_{y \sim Y \mid x}[-\log f[x](y)] \le \mathbb{E}_{x \sim X}\mathbb{E}_{y \sim Y \mid x}\big[\pi (y - g(x))^2\big] = \pi\sigma^2, \tag{65–68}$$
where the inequality (Equation 66) holds by taking $f[x] = \mathcal{N}\big(g(x), \frac{1}{2\pi}\big)$, whose density satisfies $-\log f[x](y) = \pi(y - g(x))^2$. Together with the Boundedness property of Proposition 1, this yields $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\mathcal{G}}(X \to Y) \le H_{\mathcal{V}_\mathcal{G}}(Y \mid X) \le \pi\sigma^2$.

2) Given the function family $\mathcal{V}_\sigma = \{f \mid f[x] = \mathcal{N}(\theta x, \sigma^2),\ \theta \in \mathbb{R},\ \sigma \text{ fixed}\}$, by expanding the Gaussian probability density function in the definition of the predictive V-information, it can be shown that
$$I_{\mathcal{V}_\sigma}(X \to Y) \propto \mathrm{Var}(Y) - \min_{k \in \mathbb{R}} \mathbb{E}[(Y - kX)^2], \tag{69}$$
i.e. the predictive V-information is proportional to the variance of the target minus the mean squared error, with a positive coefficient depending only on $\sigma$. Without loss of generality take $\beta = 1$, so that $Y = X + \epsilon$; the minimization problem is solved by $k^* = \mathbb{E}[XY]/\mathbb{E}[X^2] = 1$. Substituting $k^*$ into Equation 69,
$$I_{\mathcal{V}_\sigma}(X \to Y) \propto \mathrm{Var}(X + \epsilon) - \mathbb{E}[\epsilon^2] = \mathrm{Var}(X) = \mathbb{E}[X^2]. \tag{71–72}$$
Denote $\mathrm{supp}(E) = \{E_1, E_2\}$. Let $Q$ be the joint distribution of $(X, \epsilon, E)$, with marginals $Q(E_1) = \alpha$ and $Q(E_2) = 1 - \alpha$. Abbreviate $Q(X, \epsilon \mid E = E_1)$ by $P_1(X, \epsilon)$ and $Q(X, \epsilon \mid E = E_2)$ by $P_2(X, \epsilon)$. Similarly to Equation 69,
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E) \propto \mathbb{E}_E\Big[\mathrm{Var}(Y \mid E) - \min_k \mathbb{E}[(Y - kX)^2 \mid E]\Big]. \tag{73}$$
For $E = E_1$, the minimization is solved by $k^* = \mathbb{E}_{P_1}[XY]/\mathbb{E}_{P_1}[X^2]$ (74), giving
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_1) \propto \mathrm{Var}_{P_1}(Y) - \mathbb{E}_{P_1}\Big[\Big(Y - \tfrac{\mathbb{E}_{P_1}[XY]}{\mathbb{E}_{P_1}[X^2]}X\Big)^2\Big] = \frac{\mathbb{E}^2_{P_1}[XY]}{\mathbb{E}_{P_1}[X^2]} - \mathbb{E}^2_{P_1}[Y], \tag{75–77}$$
and similarly
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_2) \propto \frac{\mathbb{E}^2_{P_2}[XY]}{\mathbb{E}_{P_2}[X^2]} - \mathbb{E}^2_{P_2}[Y]. \tag{78}$$
Notably, the moments of $P_1$ and $P_2$ are constrained by $\alpha$ and the overall moments:
$$\mathbb{E}[X^2] = \alpha\mathbb{E}_{P_1}[X^2] + (1 - \alpha)\mathbb{E}_{P_2}[X^2], \quad \mathbb{E}[X^2] = \mathbb{E}[XY] = \alpha\mathbb{E}_{P_1}[XY] + (1 - \alpha)\mathbb{E}_{P_2}[XY], \quad 0 = \mathbb{E}[Y] = \alpha\mathbb{E}_{P_1}[Y] + (1 - \alpha)\mathbb{E}_{P_2}[Y]. \tag{79–81}$$
The moments of $P_2$ can thereafter be represented by those of $P_1$:
$$\mathbb{E}_{P_2}[X^2] = \frac{\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[X^2]}{1 - \alpha}, \quad \mathbb{E}_{P_2}[XY] = \frac{\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[XY]}{1 - \alpha}, \quad \mathbb{E}_{P_2}[Y] = -\frac{\alpha\mathbb{E}_{P_1}[Y]}{1 - \alpha}. \tag{82–84}$$
Substituting into Equation 78,
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_2) \propto -\frac{\alpha^2}{(1 - \alpha)^2}\mathbb{E}^2_{P_1}[Y] + \frac{1}{1 - \alpha}\cdot\frac{\big(\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[XY]\big)^2}{\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[X^2]}. \tag{85}$$
Therefore,
$$\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) \propto \sup_{E \in \mathcal{E}}\ \alpha I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_1) + (1 - \alpha)I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_2) - I_{\mathcal{V}_\sigma}(X \to Y) \tag{86–87}$$
$$= \sup_{E \in \mathcal{E}}\ -\frac{\alpha}{1 - \alpha}\mathbb{E}^2_{P_1}[X + \epsilon] + \frac{\alpha\,\mathbb{E}^2_{P_1}[X\epsilon]}{\mathbb{E}_{P_1}[X^2]\big(\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[X^2]\big)\mathbb{E}[X^2]}, \tag{88–89}$$
using $\mathbb{E}_{P_1}[XY] - \mathbb{E}_{P_1}[X^2] = \mathbb{E}_{P_1}[X\epsilon]$. Assuming $X \perp \epsilon \mid E$, the second term vanishes, so
$$\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) \propto \sup_{E \in \mathcal{E}}\ -\frac{\alpha}{1 - \alpha}\mathbb{E}^2_{P_1}[X + \epsilon] \le 0. \tag{90}$$
From Proposition 1, we also have $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) \ge 0$. Thus $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) = 0$.

F PROOF OF LINEAR CASES (THEOREM 2 AND ??)

Proof of Theorem 2. For ease of notation, we denote $r(E^*)$ by $r_e$, $\sigma(E^*)$ by $\sigma_e$, and $\sigma(E^*)\epsilon_V$ by $\epsilon_e$, and we omit the superscript $\mathcal{C}$ of $\mathcal{H}^\mathcal{C}_\mathcal{V}$. Firstly, we calculate $H_\mathcal{V}[Y \mid \varnothing]$:
$$H_\mathcal{V}[Y \mid \varnothing] = \frac{1}{2\sigma^2}\mathrm{Var}(Y) + \log\sigma + \frac{1}{2}\log 2\pi, \qquad H_\mathcal{V}[Y \mid \varnothing, E^*] = \frac{1}{2\sigma^2}\mathbb{E}_{E^*}[\mathrm{Var}(Y \mid E^*)] + \log\sigma + \frac{1}{2}\log 2\pi. \tag{91–92}$$
Therefore, we have
$$H_\mathcal{V}[Y \mid \varnothing, E^*] - H_\mathcal{V}[Y \mid \varnothing] = -\frac{1}{2\sigma^2}\mathrm{Var}\big(\mathbb{E}[Y \mid E^*]\big) \le 0. \tag{93}$$
As for $H_\mathcal{V}[Y \mid X]$, writing $\tilde{h}_S = h_S - \beta$ and substituting the generation process of Equation 12, we have
$$2\sigma^2 H_\mathcal{V}[Y \mid X] = \inf_{\tilde{h}_S, h_V} \mathbb{E}_{E^*}\Big[\mathbb{E}\big[\big((1 - h_V r_e) f(S) - \tilde{h}_S S\big)^2 \mid E^*\big]\Big] + \sigma_Y^2 + h_V^2\, \mathbb{E}_{E^*}[\sigma_e^2] \tag{94–98}$$
(up to an additive constant that cancels in all differences below). Note that for $e_i, e_j \in \mathrm{supp}(E^*)$ we assume $P_{e_i}(S, Y) = P_{e_j}(S, Y)$ (we choose such an $E^*$ as one possible split). The solution is:
$$\tilde{h}_S^* = \frac{\big(\mathrm{Var}(r_e)\mathbb{E}[f^2(S)] + \mathbb{E}[\sigma_e^2]\big)\mathbb{E}[f(S)S]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}, \qquad h_V^* = \frac{\mathbb{E}[r_e]\big(\mathbb{E}[f^2(S)]\mathbb{E}[S^2] - \mathbb{E}^2[f(S)S]\big)}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}. \tag{99–100}$$
According to the assumption $\mathbb{E}[f(S)S] = 0$, these reduce to
$$\tilde{h}_S^* = 0, \qquad h_V^* = \frac{\mathbb{E}[r(E^*)]\,\mathbb{E}[f^2]}{\mathbb{E}[r^2(E^*)]\,\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}. \tag{101–102}$$
Therefore,
$$2\sigma^2 H_\mathcal{V}[Y \mid X] = \mathbb{E}_{E^*}\big[(1 - h_V^* r_e)^2\big]\mathbb{E}[f^2] + h_V^{*2}\mathbb{E}_{E^*}[\sigma_e^2] + \sigma_Y^2 = \frac{\mathrm{Var}(r_e)\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}\,\mathbb{E}[f^2(S)] + \sigma_Y^2, \tag{103–104}$$
while, optimizing within each environment,
$$2\sigma^2 H_\mathcal{V}[Y \mid X, E^*] = \sigma_Y^2 + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e^2\mathbb{E}[f^2]/\sigma_e^2 + 1}\Big)^2\bigg]\mathbb{E}[f^2] + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e/\sigma_e + \sigma_e/(r_e\mathbb{E}[f^2])}\Big)^2\bigg]. \tag{105}$$
Note that we simply set $\sigma = 1$ in the main body. Combining the above, we have
$$\mathcal{H}_\mathcal{V}(X \to Y) \approx \frac{\mathrm{Var}(r_e)\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}\,\mathbb{E}[f^2(S)], \tag{106}$$
where the approximation error is bounded by $\frac{1}{2}\max\big(\sigma_Y^2, R(r(E^*), \sigma(E^*), \mathbb{E}[f^2])\big)$, and $R$ is defined as:
$$R(r(E^*), \sigma(E^*), \mathbb{E}[f^2]) = \mathbb{E}\bigg[\Big(\frac{1}{r_e^2\mathbb{E}[f^2]/\sigma_e^2 + 1}\Big)^2\bigg]\mathbb{E}[f^2] + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e/\sigma_e + \sigma_e/(r_e\mathbb{E}[f^2])}\Big)^2\bigg]. \tag{107}$$

Proof of Theorem ??. As proved above, in the general case we have
$$h_S^* = \beta + \frac{\big(\mathrm{Var}(r_e)(\mathbb{E}[f^2(S)] + \sigma_Y^2) + \mathbb{E}[\sigma_e^2]\big)\mathbb{E}[f(S)S]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[r_e^2]\sigma_Y^2\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}, \tag{108}$$
$$h_V^* = \frac{\mathbb{E}[r_e]\big(\sigma_Y^2 + \mathbb{E}[f^2(S)]\big)\mathbb{E}[S^2] - \mathbb{E}[r_e]\mathbb{E}^2[f(S)S]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[r_e^2]\sigma_Y^2\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}. \tag{109}$$
For the model misspecification case, we further assume that (1) $\mathbb{E}[f(S)S] = 0$ and (2) $\mathbb{E}[\sigma_e^2] \ll \mathbb{E}[f^2(S)]\mathbb{E}[S^2]$; then
$$h_S^* = \beta, \qquad h_V^* = \frac{\mathbb{E}[r_e]}{\mathbb{E}[r_e^2]}, \tag{110–111}$$
and for the heterogeneity we have
$$\frac{\mathrm{Var}(r_e)}{\mathbb{E}[r_e^2]}\big(\mathbb{E}[f^2(S)] + \sigma_Y^2\big) + h_V^{*2}\,\mathbb{E}_{E^*}[\sigma_e^2] - \sigma_Y^2 \;\le\; 2\sigma^2\,\mathcal{H}_\mathcal{V}(X \to Y) \;\le\; \frac{\mathrm{Var}(r_e)}{\mathbb{E}[r_e^2]}\big(\mathbb{E}[f^2(S)] + \sigma_Y^2\big) + h_V^{*2}\,\mathbb{E}_{E^*}[\sigma_e^2] - \mathbb{E}_{E^*}\Big[\frac{\sigma_e^2}{r_e^2}\Big]. \tag{112}$$
Without the model misspecification, we assume $f \equiv 0$; then
$$h_S^* = \beta, \qquad h_V^* = \frac{\mathbb{E}[r_e]\,\sigma_Y^2}{\mathbb{E}[r_e^2]\,\sigma_Y^2 + \mathbb{E}[\sigma_e^2]}, \tag{113–114}$$
and for the heterogeneity,
$$2\sigma^2\mathcal{H}_\mathcal{V}(X \to Y) \ge \sigma_Y^2\big(1 - 2h_V^*\mathbb{E}[r_e] + h_V^{*2}\mathbb{E}[r_e^2]\big) + h_V^{*2}\mathbb{E}[\sigma_e^2] - \mathbb{E}\Big[\frac{\sigma_e^2}{r_e^2}\Big], \tag{115}$$
$$2\sigma^2\mathcal{H}_\mathcal{V}(X \to Y) \le \sigma_Y^2\big(1 - 2h_V^*\mathbb{E}[r_e] + h_V^{*2}\mathbb{E}[r_e^2]\big) + h_V^{*2}\mathbb{E}[\sigma_e^2]. \tag{116}$$

G PROOF OF THE ERROR BOUND FOR FINITE-SAMPLE ESTIMATION (THEOREM 3)

In this section, we prove the error bound for estimating the predictive heterogeneity with the empirical predictive heterogeneity. Before the proof of Theorem 3, which is inspired by Xu et al. (2020), we introduce three lemmas.

Lemma 1. Assume $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[x](y) \in [-B, B]$ where $B > 0$. Define the function class $\mathcal{G}^k_\mathcal{V} = \{g \mid g(x, y) = \log f[x](y)\, q(E = e_k \mid x, y),\ f \in \mathcal{V},\ q \in \mathcal{Q}\}$. Denote the Rademacher complexity of $\mathcal{G}$ with $N$ samples by $\mathcal{R}_N(\mathcal{G})$. Define $\hat{f}_k = \arg\inf_f \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i)$. Then for any $q \in \mathcal{Q}$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\bigg| q(E = e_k)\, H_\mathcal{V}(Y \mid X, E = e_k) - \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log \hat{f}_k[x_i](y_i)\, q(E = e_k \mid x_i, y_i)\bigg| \le 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{117–118}$$
Proof.
Apply McDiarmid's inequality to the function $\Phi(D)$ defined as:
$$\Phi(D) = \sup_{f \in \mathcal{V}, q \in \mathcal{Q}} \bigg( q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i) \bigg). \tag{119}$$
Let $D$ and $D'$ be two identical datasets except for one data point, $(x_j, y_j)$ versus $(x'_j, y'_j)$. We have:
$$\Phi(D) - \Phi(D') \le \frac{1}{|D|}\sup_{f, q} \Big| \log f[x_j](y_j)\, q(E = e_k \mid x_j, y_j) - \log f[x'_j](y'_j)\, q(E = e_k \mid x'_j, y'_j) \Big| \le \frac{2B}{|D|}. \tag{120–125}$$
According to McDiarmid's inequality, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$$\Phi(D) \le \mathbb{E}_D[\Phi(D)] + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{126}$$
Next we derive a bound for $\mathbb{E}_D[\Phi(D)]$. Consider a ghost dataset $D'$ independently and identically drawn from $q(X, Y) = P(X, Y)$ with the same size as $D$. We notice that
$$q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k] = \mathbb{E}_q[-\log f[x](y)\, q(E = e_k \mid x, y)] = \mathbb{E}_{D'}\bigg[\frac{1}{|D'|}\sum_{(x'_i,y'_i) \in D'} -\log f[x'_i](y'_i)\, q(E = e_k \mid x'_i, y'_i)\bigg]. \tag{127–131}$$
Thus, $\mathbb{E}_D[\Phi(D)]$ can be bounded by the standard symmetrization argument. Introducing independent Rademacher variables $\sigma_i$,
$$\mathbb{E}_D[\Phi(D)] \le \mathbb{E}_{D, D'}\bigg[\sup_{f, q} \frac{1}{|D|}\sum_i \Big(\log f[x_i](y_i)\,q(E = e_k \mid x_i, y_i) - \log f[x'_i](y'_i)\,q(E = e_k \mid x'_i, y'_i)\Big)\bigg]$$
$$\le 2\,\mathbb{E}_{D, \sigma}\bigg[\sup_{f, q} \frac{1}{|D|}\sum_i \sigma_i \log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i)\bigg] = 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}), \tag{132–143}$$
where the first inequality follows from Jensen's inequality and the convexity of the supremum (Equation 137), and the second holds due to the symmetry of the paired differences and the fact that Rademacher variables preserve the expected supremum of sums of symmetric random variables under a convex mapping (Equation 139; Ledoux & Talagrand (1991), Lemma 6.3).

Substituting Equation 143 into Equation 126, we have: for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $f \in \mathcal{V}$ and $q \in \mathcal{Q}$,
$$q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_i -\log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i) \le 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{144–145}$$
Let $f^*_k = \arg\inf_f \{q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k]\}$ and $\hat{f}_k = \arg\inf_f \{\frac{1}{|D|}\sum_i -\log f[x_i](y_i)\,q(E = e_k \mid x_i, y_i)\}$. Now we have
$$q(E = e_k)\,\mathbb{E}_q[-\log f^*_k[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_i -\log f^*_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i)$$
$$\le q(E = e_k)\,H_\mathcal{V}(Y \mid X, E = e_k) - \frac{1}{|D|}\sum_i -\log \hat{f}_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i)$$
$$\le q(E = e_k)\,\mathbb{E}_q[-\log \hat{f}_k[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_i -\log \hat{f}_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i). \tag{146–148}$$
Combining Equation 144 and Equations 146–148, the lemma is proved.

Lemma 2. Assume $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[\varnothing](y) \in [-B, B]$ where $B > 0$. The definitions of $\mathcal{G}^k_\mathcal{V}$ and $\mathcal{R}_N(\mathcal{G})$ follow Lemma 1. Define $\hat{f}_k = \arg\inf_f \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[\varnothing](y_i)\, q(E = e_k \mid x_i, y_i)$. Then for any $q \in \mathcal{Q}$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\bigg| q(E = e_k)\,H_\mathcal{V}(Y \mid E = e_k) - \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log \hat{f}_k[\varnothing](y_i)\, q(E = e_k \mid x_i, y_i) \bigg| \le 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{149–150}$$
Proof. Similarly to Lemma 1, we can prove that
$$\bigg| q(E = e_k)\,H_\mathcal{V}(Y \mid E = e_k) - \frac{1}{|D|}\sum_i -\log \hat{f}_k[\varnothing](y_i)\,q(E = e_k \mid x_i, y_i) \bigg| \le 2\mathcal{R}_{|D|}(\mathcal{G}'^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}, \tag{151–152}$$
where $\mathcal{G}'^k_\mathcal{V} = \{g \mid g(x, y) = \log f[\varnothing](y)\, q(E = e_k \mid x, y),\ f \in \mathcal{V},\ q \in \mathcal{Q}\}$. According to the definition of the predictive family $\mathcal{V}$ (Xu et al.
(2020), Definition 1), for any $f \in \mathcal{V}$ there exists $f' \in \mathcal{V}$ such that $\forall x \in \mathcal{X}$, $f'[x] = f[\varnothing]$. Thus $\mathcal{G}'^k_\mathcal{V} \subseteq \mathcal{G}^k_\mathcal{V}$, and therefore $\mathcal{R}_{|D|}(\mathcal{G}'^k_\mathcal{V}) \le \mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V})$. Substituting into Equation 151, the lemma is proved.

Lemma 3 ((Xu et al., 2020), Theorem 1). Assume $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[x](y) \in [-B, B]$ where $B > 0$. Define the function class $\mathcal{G}^\varnothing_\mathcal{V} = \{g \mid g(x, y) = \log f[x](y),\ f \in \mathcal{V}\}$. The definition of $\mathcal{R}_N(\mathcal{G})$ follows Lemma 1. Then for any $\delta \in (0, 0.5)$, with probability at least $1 - 2\delta$, we have
$$\big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big| \le 4\mathcal{R}_{|D|}(\mathcal{G}^\varnothing_\mathcal{V}) + 2B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{153}$$

Finally, we are prepared to prove Theorem 3.

Proof of Theorem 3. We first bound the empirical estimation error by a sum of the terms appearing in Lemmas 1–3:
$$\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y) - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)\big| = \Big|\sup_{E \in \mathcal{E}_K} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y) - \sup_{E \in \mathcal{E}_K} \hat{I}_\mathcal{V}(X \to Y \mid E; D) + \hat{I}_\mathcal{V}(X \to Y; D)\Big|$$
$$\le \sup_{E \in \mathcal{E}_K}\big| I_\mathcal{V}(X \to Y \mid E) - \hat{I}_\mathcal{V}(X \to Y \mid E; D) \big| + \big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big|$$
$$\le \sum_{k=1}^K \sup_{q \in \mathcal{Q}} \big| q(E = e_k)H_\mathcal{V}(Y \mid E = e_k) - q(E = e_k)\hat{H}_\mathcal{V}(Y \mid E = e_k; D) \big| + \sum_{k=1}^K \sup_{q \in \mathcal{Q}} \big| q(E = e_k)H_\mathcal{V}(Y \mid X, E = e_k) - q(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D) \big| + \big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big|, \tag{154–166}$$
where the empirical terms are evaluated at $\hat{f}_k = \arg\inf_f \frac{1}{|D|}\sum_i -\log f[x_i](y_i)\,q(E = e_k \mid x_i, y_i)$ and $\hat{f}'_k = \arg\inf_f \frac{1}{|D|}\sum_i -\log f[\varnothing](y_i)\,q(E = e_k \mid x_i, y_i)$, for any $q \in \mathcal{Q}$ and $1 \le k \le K$. For simplicity, let
$$\mathrm{Err}_k = \sup_{q \in \mathcal{Q}} \Big| q(E = e_k)H_\mathcal{V}(Y \mid X, E = e_k) - \tfrac{1}{|D|}\textstyle\sum_i -\log \hat{f}_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i) \Big|, \tag{167}$$
$$\mathrm{Err}'_k = \sup_{q \in \mathcal{Q}} \Big| q(E = e_k)H_\mathcal{V}(Y \mid E = e_k) - \tfrac{1}{|D|}\textstyle\sum_i -\log \hat{f}'_k[\varnothing](y_i)\,q(E = e_k \mid x_i, y_i) \Big|, \tag{168}$$
$$\mathrm{Err}_\varnothing = \big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big|. \tag{169}$$
Then, by Lemmas 1–3 and a union bound, noting that $\mathcal{G}^k_\mathcal{V} = \mathcal{G}_\mathcal{V}$ and $\mathcal{G}^\varnothing_\mathcal{V} \subseteq \mathcal{G}_\mathcal{V}$ (hence $\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) \le \mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V})$ and $\mathcal{R}_{|D|}(\mathcal{G}^\varnothing_\mathcal{V}) \le \mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V})$, Equation 172),
$$P\bigg(\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V} - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(D)\big| > 4(K + 1)\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2(K + 1)B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\bigg)$$
$$\le \sum_{k=1}^K P\Big(\mathrm{Err}_k > 2\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\Big) + \sum_{k=1}^K P\Big(\mathrm{Err}'_k > 2\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\Big) + P\Big(\mathrm{Err}_\varnothing > 4\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\Big) \le 2(K + 1)\delta. \tag{170–177}$$
Therefore, with probability at least $1 - 2(K + 1)\delta$,
$$\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y) - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)\big| \le 4(K + 1)\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2(K + 1)B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{178–179}$$

H PROOF OF THEOREM 4

Proof of Theorem 4. The objective function of our IM algorithm is directly derived from the definition of the empirical predictive heterogeneity in Definition 6. For the regression task, we take the predictive family
$$\mathcal{V}_1 = \{g: g[x] = \mathcal{N}(f_\theta(x), \sigma^2) \mid f \text{ is the regression model, } \theta \text{ is learnable, } \sigma = 1.0 \text{ (fixed)}\}, \tag{180}$$
where we only care about the output of the model; the noise scale of the Gaussian distribution is often ignored, so we simply fix $\sigma = 1.0$. Then for each environment $e \in \mathrm{supp}(E^*)$, $I_\mathcal{V}(X \to Y \mid E = e)$ becomes
$$I_\mathcal{V}(X \to Y \mid E = e) \propto \mathrm{Var}(Y \mid E = e) - \min_\theta \mathbb{E}\big[\|Y - f_\theta(X)\|^2 \mid E = e\big], \tag{181}$$
which corresponds to the MSE loss and the proposed regularizer in Equation 17. For the classification task, the derivation is similar: the regularizer becomes the entropy of $Y$ in sub-population $e$, and the loss function becomes the cross-entropy loss.

I DISCUSSION ON DIFFERENCES FROM SUBGROUP DISCOVERY

Subgroup discovery (SD; Helal, 2016) is aimed at extracting interesting relations among different variables ($X$) with respect to a target variable $Y$.
Coverage and precision of each discovered group are the focus of such methods. To be specific, subgroup discovery learns a partition of $P(X)$ such that some target label $y$ dominates within each group. The most significant difference between subgroup discovery and our predictive heterogeneity lies in the pattern of distributional shift among clusters: for subgroup discovery, $P(X)$ and $P(Y)$ vary across subgroups but there is a universal $P(Y \mid X)$, while for predictive heterogeneity $P(Y \mid X)$ differs across sub-populations, which indicates diversified prediction mechanisms. It is exactly this disparity of prediction mechanisms that inhibits the performance of a single predictive model on a heterogeneous dataset, which is the emphasis of the OOD generalization problem and of group fairness.

We think subgroup discovery is more applicable to settings where the distributional shift is minor while high explainability is required, since it generates simplified rules that people can understand. Subgroup discovery methods are also suited to settings that only involve tabular data (typically from a relational database), where the input features have clear semantics. Our proposed method, in contrast, can deal with general machine learning settings, including complicated data (e.g., image data) that involves representation learning, and is more applicable when data heterogeneity w.r.t. the prediction mechanism exists inside the data. That said, both kinds of methods can be used to help people understand data and make more reasonable decisions.

J DISCUSSION ON THE POTENTIAL FOR FAIRNESS

We find that combining our measure with algorithmic fairness is an interesting and promising direction, and we think our measure has the potential to deal with algorithmic bias. Our method generates sub-populations with possibly different prediction mechanisms, which could help in the following aspects:

Risk feature selection: one could select features according to our predictive heterogeneity measure to see which features bring the largest heterogeneity. If they are sensitive features, people should avoid their effects; if they are not, they could direct people to build better machine learning models.

Examining algorithmic fairness: one could use the learned sub-populations to examine whether a given algorithm is fair by calculating its performance gap across the sub-populations, as sketched below.
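A minimal sketch of such a fairness audit (our illustration; `errors` is any per-sample loss produced by the model under audit, and `env` is a hypothetical IM-learned hard partition):

```python
import numpy as np

def subpopulation_gap(errors, env):
    """Worst minus best mean error across learned sub-populations:
    a simple fairness audit over an IM-style partition."""
    means = np.array([errors[env == e].mean() for e in np.unique(env)])
    return means.max() - means.min()

# Example with a hypothetical partition of 6 samples into 2 sub-populations.
errors = np.array([0.1, 0.2, 0.15, 0.9, 1.1, 0.95])
env = np.array([0, 0, 0, 1, 1, 1])
print(subpopulation_gap(errors, env))   # ~0.83: a large gap flags unfairness
```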