Published as a conference paper at ICLR 2023

MEASURING THE PREDICTIVE HETEROGENEITY

Jiashuo Liu, Jiayun Wu, Renjie Pi, Renzhe Xu, Xingxuan Zhang, Bo Li, Peng Cui
Tsinghua University; Hong Kong University of Science and Technology
{liujiashuo77, jiayun.wu.work}@gmail.com, rpi@connect.ust.hk, xrz199721@gmail.com, xingxuanzhang@hotmail.com, libo@sem.tsinghua.edu.cn, cuip@tsinghua.edu.cn

ABSTRACT

As an intrinsic and fundamental property of big data, data heterogeneity exists in a variety of real-world applications, such as agriculture, sociology, and health care. For machine learning algorithms, ignoring data heterogeneity significantly hurts generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and formalize it as the Predictive Heterogeneity, which takes model capacity and computational constraints into account. We prove that it can be reliably estimated from finite data with PAC bounds, even in high dimensions. Additionally, we propose the Information Maximization (IM) algorithm, a bi-level optimization algorithm, to explore the predictive heterogeneity of data. Empirically, the explored predictive heterogeneity provides insights for sub-population divisions in agriculture, sociology, and object recognition, and leveraging such heterogeneity benefits out-of-distribution generalization performance.

1 INTRODUCTION

Big data bring great opportunities to modern society and promote the development of machine learning, facilitating human life in a wide variety of areas, such as the digital economy, health care, and scientific discovery. Along with this progress, the intrinsic heterogeneity of big data introduces new challenges to machine learning systems and data scientists (Fan et al., 2014; He, 2017). In general, data heterogeneity, as a fundamental property of big data, refers to any diversity inside data, including the diversity of data sources, data generation mechanisms, sub-populations, data structures, etc. When not properly treated, data heterogeneity can bring pitfalls to machine learning systems, especially in high-stakes applications such as precision medicine, autonomous driving, and financial risk management (Dzobo et al., 2018; Breitenstein et al., 2020; Challen et al., 2019), leading to poor out-of-distribution generalization and fairness issues. For example, in supervised learning tasks where models learn from data to predict a target variable from given covariates, when the whole dataset consists of multiple sub-populations with distributional shifts or different prediction mechanisms, traditional machine learning algorithms mainly fit the majority while ignoring the minority, which hurts generalization and compromises algorithmic fairness, as shown in (Kearns et al., 2018; Sagawa et al., 2019; Duchi & Namkoong, 2021). Another well-known example is Simpson's paradox, which brings false discoveries to social research (Wagner, 1982; Hernán et al., 2011).

Despite its widespread existence, data heterogeneity has not converged to a uniform formulation so far, owing to its complexity, and it has different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology based on system properties and complexity or variability.
Rosenbaum (2005) views the uncertainty of the potential outcome as unit heterogeneity in observational studies in economics. More recently, in machine learning, several works on causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Liu et al., 2021; Creager et al., 2021) and robust learning (Sagawa et al., 2019; Liu et al., 2022) leverage heterogeneous data from multiple environments to improve out-of-distribution generalization. However, previous works have not provided a precise definition or sound quantification of data heterogeneity. In this work, from the perspective of prediction power, we propose the predictive heterogeneity, a new type of data heterogeneity.

From the machine learning perspective, the main concern is the possible negative effect of data heterogeneity on prediction. Therefore, given the complexity of data heterogeneity, we focus on the data heterogeneity that affects the prediction of machine learning models, which could facilitate the building of machine learning systems, and we name it the predictive heterogeneity. We give it a precise definition: it is quantified as the maximal additional predictive information that can be gained by dividing the whole data distribution into sub-populations. The new measure takes model capacity and computational constraints into account, and can be reliably estimated from finite samples with PAC bounds, even in high dimensions. We theoretically analyze its properties and examine it under typical cases of data heterogeneity (Fan et al., 2014). Additionally, we design the Information Maximization (IM) algorithm to empirically explore the predictive heterogeneity inside data. Empirically, we find that the explored heterogeneity is explainable and provides insights for sub-population division in many fields, including agriculture, sociology, and object recognition. Moreover, the explored sub-populations can be leveraged to enhance the out-of-distribution generalization of machine learning models, which we verify with both simulated and real-world data.

2 PRELIMINARIES ON MUTUAL INFORMATION AND PREDICTIVE V-INFORMATION

In this section, we briefly introduce the mutual information and the predictive V-information (Xu et al., 2020), which are the preliminaries of our proposed predictive heterogeneity.

Notations. For a probability triple $(\mathcal{S}, \mathcal{F}, P)$, define random variables $X: \mathcal{S} \to \mathcal{X}$ and $Y: \mathcal{S} \to \mathcal{Y}$, where $\mathcal{X}$ is the covariate space and $\mathcal{Y}$ is the target space. Accordingly, $x \in \mathcal{X}$ denotes the covariates and $y \in \mathcal{Y}$ denotes the target. Denote the set of random categorical variables by $\mathcal{C} = \{C: \mathcal{S} \to \mathbb{N} \mid \mathrm{supp}(C) \text{ is finite}\}$. Additionally, $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$ denote the sets of all probability measures over the Borel algebras on the spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively. $H(\cdot)$ denotes the Shannon entropy of a random variable, and $H(\cdot \mid \cdot)$ denotes the conditional entropy of two random variables.

In information theory, the mutual information of two random variables $X, Y$ measures the dependence between the two variables, quantifying the reduction of entropy of one variable when observing the other:
$$I(X; Y) = H(Y) - H(Y \mid X). \tag{1}$$
It is known that the mutual information is associated with the predictability of $Y$ (Cover & Thomas, 1991). However, the standard definition of mutual information unrealistically assumes unbounded computational capacity of the predictor, rendering it hard to estimate, especially in high dimensions.
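As a concrete, worked instance of Equation 1 (our toy illustration, not from the paper), the mutual information of a discrete pair can be computed exactly from its joint table:

```python
import numpy as np

# Joint distribution P(X, Y) for binary X, Y (rows: x, columns: y).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_y = p_xy.sum(axis=0)                       # marginal P(Y)
h_y = -np.sum(p_y * np.log(p_y))             # H(Y)

p_x = p_xy.sum(axis=1)                       # marginal P(X)
# H(Y|X) = -sum_{x,y} P(x,y) log P(y|x)
h_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))

print(h_y - h_y_given_x)   # I(X; Y) = H(Y) - H(Y|X) ~= 0.193 nats
```

For continuous, high-dimensional $X$, no such direct computation is available, which is exactly the estimation difficulty that motivates the predictive V-information below.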
To mitigate this problem, Xu et al. (2020) propose the predictive V-information under realistic computational constraints, where the predictor is only allowed to use models in a predictive family $\mathcal{V}$ to predict the target variable $Y$.

Definition 1 (Predictive Family (Xu et al., 2020)). Let $\Omega = \{f: \mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})\}$. We say that $\mathcal{V} \subseteq \Omega$ is a predictive family if it satisfies:
$$\forall f \in \mathcal{V},\ \forall P \in \mathrm{range}(f),\ \exists f' \in \mathcal{V}\ \text{s.t.}\ \forall x \in \mathcal{X},\ f'[x] = P,\ f'[\varnothing] = P. \tag{2}$$
A predictive family contains all predictive models that are allowed to be used, which forms computational or statistical constraints. The additional condition in Equation 2 means that the predictor can always ignore the input covariates ($x$) if it chooses to (using only $\varnothing$).

Definition 2 (Predictive V-information (Xu et al., 2020)). Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{V}$ be a predictive family. The predictive V-information from $X$ to $Y$ is defined as:
$$I_\mathcal{V}(X \to Y) = H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X), \tag{3}$$
where $H_\mathcal{V}(Y \mid \varnothing)$ and $H_\mathcal{V}(Y \mid X)$ are the predictive conditional V-entropies, defined as:
$$H_\mathcal{V}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y}[-\log f[x](y)], \tag{4}$$
$$H_\mathcal{V}(Y \mid \varnothing) = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}[-\log f[\varnothing](y)]. \tag{5}$$
Note that $f \in \mathcal{V}$ is a function $\mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})$, so $f[x] \in \mathcal{P}(\mathcal{Y})$ is a probability measure on $\mathcal{Y}$, and $f[x](y) \in \mathbb{R}$ is the density evaluated at $y \in \mathcal{Y}$. $H_\mathcal{V}(Y \mid \varnothing)$ is also denoted by $H_\mathcal{V}(Y)$.

Compared with the mutual information, the predictive V-information restricts the computational power of the predictor and is much easier to estimate in high-dimensional cases. When the predictive family $\mathcal{V}$ contains all possible models, i.e. $\mathcal{V} = \Omega$, it is proved that $I_\mathcal{V}(X \to Y) = I(X; Y)$ (Xu et al., 2020).

3 PREDICTIVE HETEROGENEITY

In this paper, from the machine learning perspective, we quantify the data heterogeneity that affects decision making, named Predictive Heterogeneity, which is easy to integrate with machine learning algorithms and could help analyze big data and build more rational algorithms.

3.1 INTERACTION HETEROGENEITY

To formally define the predictive heterogeneity, we begin with the formulation of the interaction heterogeneity:

Definition 3 (Interaction Heterogeneity). Let $X, Y$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$. Denote the set of random categorical variables by $\mathcal{C}$, and take a subset $\mathcal{E} \subseteq \mathcal{C}$. Then $\mathcal{E}$ is an environment set iff there exists $E \in \mathcal{E}$ such that $(X, Y) \perp E$. Each $E \in \mathcal{E}$ is called an environment variable. The interaction heterogeneity between $X$ and $Y$ w.r.t. the environment set $\mathcal{E}$ is defined as:
$$\mathcal{H}_\mathcal{E}(X, Y) = \sup_{E \in \mathcal{E}} I(Y; X \mid E) - I(Y; X). \tag{6}$$

Each environment variable $E$ represents a stochastic partition of $\mathcal{X} \times \mathcal{Y}$, and the condition on the environment set implies that the joint distribution of $X, Y$ can be preserved in each environment. In information theory, $I(Y; X \mid E) - I(Y; X)$ is called the interaction information, which measures the influence of the environment variable $E$ on the amount of information shared between the target $Y$ and the covariates $X$. The interaction heterogeneity defined in Equation 6 quantifies the maximal additional information that can be gained from involving or uncovering the environment variable $E$. Intuitively, a large $\mathcal{H}_\mathcal{E}(X, Y)$ indicates that the predictive power from $X$ to $Y$ is enhanced by $E$, which means that uncovering the latent sub-populations associated with the environment partition $E$ will benefit the $X \to Y$ prediction.

3.2 PREDICTIVE HETEROGENEITY

Based on the mutual information, the interaction heterogeneity is quite hard to compute, since the standard mutual information is notoriously difficult to estimate, especially in big-data scenarios.
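In contrast, each term of Definition 2 is just an empirical risk minimization inside $\mathcal{V}$, so the predictive V-information can be estimated directly from samples. Below is a minimal sketch (our illustration, not code from the paper), assuming the linear-Gaussian family $\mathcal{V} = \{f[x] = \mathcal{N}(\theta^T x + b, 1)\}$, for which the infima reduce to least squares:

```python
import numpy as np

def h_v(x, y):
    """Empirical H_V(Y|X) for V = { f[x] = N(<theta, x> + b, 1) }.
    With unit variance, -log f[x](y) = (y - <theta,x> - b)^2 / 2 + const."""
    n = len(y)
    if x is None:                                  # f may ignore the input: H_V(Y | empty)
        resid = y - y.mean()
    else:
        xb = np.column_stack([x, np.ones(n)])      # append intercept
        theta, *_ = np.linalg.lstsq(xb, y, rcond=None)
        resid = y - xb @ theta
    return 0.5 * np.mean(resid ** 2) + 0.5 * np.log(2 * np.pi)

def v_information(x, y):
    """Empirical I_V(X -> Y) = H_V(Y|empty) - H_V(Y|X)."""
    return h_v(None, y) - h_v(x, y)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 5))
y = x @ rng.normal(size=5) + rng.normal(size=1000)
print(v_information(x, y))   # > 0: X is predictive of Y within V
```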
Moreover, even if the mutual information could be accurately estimated, the prediction model may not be able to make good use of it. Inspired by Xu et al. (2020), we propose the Predictive Heterogeneity, which measures the interaction heterogeneity that can be captured under computational constraints and that affects the prediction of models within a specified predictive family. To begin with, we propose the Conditional Predictive V-information, which generalizes the predictive V-information.

Definition 4 (Conditional Predictive V-information). Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and $E$ be an environment variable. The conditional predictive V-information is defined as:
$$I_\mathcal{V}(X \to Y \mid E) = H_\mathcal{V}(Y \mid \varnothing, E) - H_\mathcal{V}(Y \mid X, E), \tag{7}$$
where $H_\mathcal{V}(Y \mid \varnothing, E)$ and $H_\mathcal{V}(Y \mid X, E)$ are defined as:
$$H_\mathcal{V}(Y \mid X, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y \mid E = e}[-\log f[x](y)]\Big], \tag{8}$$
$$H_\mathcal{V}(Y \mid \varnothing, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y \mid E = e}[-\log f[\varnothing](y)]\Big]. \tag{9}$$

Intuitively, the conditional predictive V-information is the weighted average of the predictive V-information over environments. We are now ready to formalize the predictive heterogeneity measure.

Definition 5 (Predictive Heterogeneity). Let $X, Y$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{E}$ be an environment set. The predictive heterogeneity for the prediction task $X \to Y$ with respect to $\mathcal{E}$ is defined as:
$$\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y), \tag{10}$$
where $I_\mathcal{V}(X \to Y)$ is the predictive V-information following Definition 2.

Leveraging the predictive V-information, the predictive heterogeneity defined in Equation 10 characterizes the maximal additional information that can be used by the prediction model when the environment variable $E$ is involved. It restricts the prediction models to $\mathcal{V}$, and the explored additional information can benefit the prediction performance of a model $f \in \mathcal{V}$, for which reason it is named the predictive heterogeneity. Next, we present some basic properties of the interaction heterogeneity and the predictive heterogeneity.

Proposition 1 (Basic Properties of Predictive Heterogeneity). Let $X, Y$ be random variables taking values in $\mathcal{X} \times \mathcal{Y}$, $\mathcal{V}$ be a predictive family, and $\mathcal{E}, \mathcal{E}_1, \mathcal{E}_2$ be environment sets.
1. Monotonicity: If $\mathcal{E}_1 \subseteq \mathcal{E}_2$, then $\mathcal{H}^{\mathcal{E}_1}_\mathcal{V}(X \to Y) \le \mathcal{H}^{\mathcal{E}_2}_\mathcal{V}(X \to Y)$.
2. Nonnegativity: $\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) \ge 0$.
3. Boundedness: $\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) \le H_\mathcal{V}(Y \mid X)$.
4. Corner Case: If the predictive family $\mathcal{V}$ is the largest possible predictive family $\Omega$ that includes all possible models, then $\mathcal{H}_\mathcal{E}(X, Y) = \mathcal{H}^\mathcal{E}_\Omega(X \to Y)$.

For further theoretical properties of the predictive heterogeneity, in Section 3.3 we derive its explicit forms under endogeneity, a common reflection of data heterogeneity. And we demonstrate in Section 3.4 that our proposed predictive heterogeneity can be empirically estimated with guarantees if the complexity of $\mathcal{V}$ (e.g., its Rademacher complexity) is bounded.
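To make Definition 5 concrete: for a single fixed, hard candidate partition the supremum drops out and the measure is a difference of two empirical V-information estimates. A minimal sketch (our illustration; it reuses `v_information` from the previous block, and the IM algorithm of Section 4 searches over soft partitions instead of fixing one):

```python
import numpy as np
# Reuses v_information() from the previous sketch (same Python session).

def heterogeneity_fixed_partition(x, y, env):
    """Lower bound on Definition 5: a single fixed hard partition `env`
    (integer codes) replaces the supremum over environment variables."""
    cond = sum((env == e).mean() * v_information(x[env == e], y[env == e])
               for e in np.unique(env))            # I_V(X -> Y | E)
    return cond - v_information(x, y)              # minus I_V(X -> Y)

# Two sub-populations with opposite mechanisms: Y = +X or Y = -X.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))
env = rng.integers(0, 2, size=2000)
y = np.where(env == 1, 1.0, -1.0) * x[:, 0] + 0.1 * rng.normal(size=2000)
print(heterogeneity_fixed_partition(x, y, env))   # clearly positive
```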
3.3 THEORETICAL PROPERTIES IN LINEAR CASES

In this section, we analyze the theoretical properties of the predictive heterogeneity in several linear settings, including (1) a homogeneous case with independent noise and (2) heterogeneous cases with endogeneity brought by selection bias and by hidden variables. Under these typical settings we can approximate the analytical forms of the proposed measure, and the conclusions provide insights for general cases.

Firstly, under a homogeneous case with no data heterogeneity, Theorem 1 proves that our measure is bounded by the scale of the label noise (which is usually small) and reduces to 0 in the linear case under mild assumptions, indicating that the predictive heterogeneity is insensitive to independent noise. Note that in the linear case we only deal with environment variables satisfying $X \perp \epsilon \mid E$, since in common prediction tasks the independent noise is unknown and cannot realistically be exploited for inferring latent environments $E$.

Theorem 1 (Homogeneous Case with Independent Noise). For a prediction task $X \to Y$ where $X, Y$ are random variables taking values in $\mathbb{R}^n \times \mathbb{R}$, consider the data generation process $Y = g(X) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $g: \mathbb{R}^n \to \mathbb{R}$ is a measurable function.
1) For a function class $\mathcal{G}$ such that $g \in \mathcal{G}$, define the predictive family $\mathcal{V}_\mathcal{G} = \{f \mid f[x] = \mathcal{N}(\varphi(x), \sigma_\mathcal{V}^2),\ \varphi \in \mathcal{G},\ \sigma_\mathcal{V} \in \mathbb{R}^+\}$. For any environment set $\mathcal{E}$, we have $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\mathcal{G}}(X \to Y) \le \pi\sigma^2$.
2) Take $n = 1$ and $g(x) = \beta x$, $\beta \in \mathbb{R}$. Assume $\mathbb{E}[X] = 0$ and that $\mathbb{E}[X^2]$ exists. Given the function family $\mathcal{V}_\sigma = \{f \mid f[x] = \mathcal{N}(\theta x, \sigma^2),\ \theta \in \mathbb{R},\ \sigma \text{ fixed}\}$ and the environment set $\mathcal{E} = \{E \mid E \in \mathcal{C},\ |\mathrm{supp}(E)| = 2,\ X \perp \epsilon \mid E\}$, we have $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) = 0$.

Secondly, we examine the proposed measure under two typical cases of data heterogeneity (Fan et al., 2014), namely endogeneity from selection bias (Heckman, 1979; Winship & Mare, 1992; Cui & Athey, 2022) and endogeneity with hidden variables (Fan et al., 2014; Arjovsky et al., 2019). In Theorem 2, we consider the prediction task $X \to Y$ with $X, Y$ taking values in $\mathbb{R}^2 \times \mathbb{R}$, and let $X = [S, V]^T$. The predictive family is specified as:
$$\mathcal{V} = \{f \mid f[x] = \mathcal{N}(\theta_S S + \theta_V V, \sigma^2),\ \theta_S, \theta_V \in \mathbb{R},\ \sigma = 1\}. \tag{11}$$
The data distribution $P(X, Y)$ is a mixture of latent sub-populations, which can be formulated by an environment variable $E^* \in \mathcal{C}$ such that $P(X, Y) = \sum_{e \in \mathrm{supp}(E^*)} P(E^* = e) P(X, Y \mid E^* = e)$. For each $e \in \mathrm{supp}(E^*)$, $P(X, Y \mid E^* = e)$ is the distribution of a homogeneous sub-population. Note that the prediction task is to predict $Y$ from the covariates $X$, while the sub-population structure is latent, i.e. $P(E^* \mid X, Y)$ is unknown to the model. In the following, we derive the analytical form of our measure under the selection-bias case.

Theorem 2 (Endogeneity with Selection Bias). For the prediction task $X = [S, V]^T \to Y$ with a latent environment variable $E^*$, the data generation process with selection bias is defined as:
$$Y = \beta S + f(S) + \epsilon_Y,\ \ \epsilon_Y \sim \mathcal{N}(0, \sigma_Y^2); \qquad V = r(E^*) f(S) + \sigma(E^*)\, \epsilon_V,\ \ \epsilon_V \sim \mathcal{N}(0, 1), \tag{12}$$
where $f: \mathbb{R} \to \mathbb{R}$ and $r, \sigma: \mathrm{supp}(E^*) \to \mathbb{R}$ are measurable functions and $\beta \in \mathbb{R}$. Assume that $\mathbb{E}[f(S)S] = 0$ and that there exists $L > 1$ such that $L\sigma^2(E^*) < r^2(E^*)\,\mathbb{E}[f^2]$. For the predictive family defined in Equation 11 and the environment set $\mathcal{E} = \mathcal{C}$, the predictive heterogeneity of the prediction task $[S, V]^T \to Y$ approximates to:
$$\mathcal{H}^\mathcal{C}_\mathcal{V}(X \to Y) \approx \frac{\mathrm{Var}(r_e)\,\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}{\mathbb{E}[r_e^2]\,\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}\; \mathbb{E}[f^2(S)], \tag{13}$$
with approximation error bounded by $\frac{1}{2}\max\big(\sigma_Y^2,\ R(r, \sigma, f)\big)$, where $r_e$ abbreviates $r(E^*)$, $\sigma_e$ abbreviates $\sigma(E^*)$, and
$$R(r(E^*), \sigma(E^*), f) = \mathbb{E}\bigg[\Big(\frac{1}{r_e^2\mathbb{E}[f^2]/\sigma_e^2 + 1}\Big)^2\bigg]\mathbb{E}[f^2] + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e/\sigma_e + \sigma_e/(r_e\mathbb{E}[f^2])}\Big)^2\bigg] < \mathbb{E}[f^2]\Big(\frac{1}{(L+1)^2} + \frac{1}{L + 2 + 1/L}\Big).$$

Intuitively, the data generation process in Theorem 2 introduces a spurious correlation between the spurious feature $V$ and the target $Y$, which varies across sub-populations (i.e. $r(E^*)$ and $\sigma(E^*)$ vary) and brings about data heterogeneity. Here $\mathbb{E}[f(S)S] = 0$ indicates a model misspecification, since the nonlinear term $f(S)$ cannot be captured by the linear predictive family using the stable feature $S$. The constant $L$ characterizes the strength of the spurious correlation between $V$ and $Y$: larger $L$ means $V$ provides more information for prediction.
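A small simulation of the generation process in Equation 12 makes the effect visible (our illustration; the particular choices of $f$, $r$, $\sigma$, $\beta$ below are hypothetical and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical instantiation of Equation 12: f(S) = S^2 - 1 gives
# E[f(S)S] = 0 for S ~ N(0,1); r(E*) flips sign across two equally
# likely environments; sigma(E*) is held constant.
e = rng.integers(0, 2, size=n)                 # latent environment E*
s = rng.normal(size=n)
f_s = s ** 2 - 1
r = np.where(e == 0, 2.0, -2.0)                # r(E*)
v = r * f_s + 0.3 * rng.normal(size=n)         # V = r(E*) f(S) + sigma(E*) eps_V
y = 1.5 * s + f_s + 0.1 * rng.normal(size=n)   # beta = 1.5, sigma_Y = 0.1
x = np.column_stack([s, v])

def mse_lin(x_, y_):
    """MSE of the best linear model, as in the family of Equation 11."""
    xb = np.column_stack([x_, np.ones(len(y_))])
    theta, *_ = np.linalg.lstsq(xb, y_, rcond=None)
    return np.mean((y_ - xb @ theta) ** 2)

pooled = mse_lin(x, y)
per_env = sum((e == k).mean() * mse_lin(x[e == k], y[e == k]) for k in (0, 1))
print(pooled, per_env)   # pooled fit is far worse: the partition carries information
```

Since $\mathrm{Var}(r_e)$ is large here, Equation 13 predicts a large heterogeneity, and indeed the per-environment linear fits are markedly better than the pooled one.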
From the approximation in Equation 13, we can see that our proposed predictive heterogeneity is dominated by two terms: (1) $\mathrm{Var}[r(E^*)]/\mathbb{E}[r^2(E^*)]$, which characterizes the variance of $r(E^*)$ among sub-populations; and (2) $\mathbb{E}[f^2(S)]$, which reflects the strength of the model misspecification. These two components account for the two sources of data heterogeneity under selection bias, which validates the rationality of our proposed measure. According to the theorem, the more $r(E^*)$ varies among the sub-populations and the stronger the model misspecification, the larger the predictive heterogeneity.

In general, Theorems 1 and 2 indicate that (1) our proposed measure is insensitive to homogeneous cases, and (2) for the two typical sources of data heterogeneity, our measure accounts for the key components reflecting the latent heterogeneity. Therefore, the theoretical results validate the rationality of our measure.

3.4 PAC GUARANTEES FOR PREDICTIVE HETEROGENEITY ESTIMATION

Defined under explicit computational constraints, our predictive heterogeneity can be empirically estimated with guarantees if the complexity of the model family $\mathcal{V}$ is bounded. In this work, we provide finite-sample generalization bounds in terms of the Rademacher complexity. First, we describe the empirical predictive heterogeneity informally; the explicit formula can be found in Definition 7 in the Appendix.

Definition 6 (Empirical Predictive Heterogeneity (informal)). For the prediction task $X \to Y$ with $X, Y$ taking values in $\mathcal{X} \times \mathcal{Y}$, a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ is independently and identically drawn from the population $X, Y$. Given the predictive family $\mathcal{V}$ and the environment set $\mathcal{E}_K = \{E \mid E \in \mathcal{C}, |\mathrm{supp}(E)| = K\}$, where $K \in \mathbb{N}^+$ is the number of environments, the empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$ with respect to $D$ is obtained by estimating $\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y)$ on $D$, with expectations replaced by statistics of the finite samples. The formal definition is placed in Definition 7.

Theorem 3 (PAC Bound). Consider the prediction task $X \to Y$ where $X, Y$ are random variables taking values in $\mathcal{X} \times \mathcal{Y}$. Assume that the predictive family $\mathcal{V}$ satisfies $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[x](y) \in [-B, B]$ where $B > 0$. For given $K \in \mathbb{N}$, the environment set is defined as $\mathcal{E}_K = \{E \mid E \in \mathcal{C}, |\mathrm{supp}(E)| = K\}$. Let $\mathcal{Q}$ be the set of all probability distributions of $(X, Y, E)$ where $E \in \mathcal{E}_K$. Take an $e \in \mathrm{supp}(E)$ and define the function class $\mathcal{G}_\mathcal{V} = \{g \mid g(x, y) = \log f[x](y)\, Q(E = e \mid x, y),\ f \in \mathcal{V},\ Q \in \mathcal{Q}\}$. Denote the Rademacher complexity of a function class $\mathcal{G}$ with $N$ samples by $\mathcal{R}_N(\mathcal{G})$. Then for any $\delta \in (0, 1/(2K + 2))$, with probability at least $1 - 2(K + 1)\delta$, for a dataset $D$ independently and identically drawn from $X, Y$, we have:
$$\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y) - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)\big| \le 4(K + 1)\,\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2(K + 1)B\sqrt{\frac{2\log(1/\delta)}{|D|}}, \tag{14}$$
where $\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) = O(|D|^{-1/2})$ (Bartlett & Mendelson, 2002).

4 ALGORITHM

To empirically estimate the predictive heterogeneity in Definition 6, we derive the Information Maximization (IM) algorithm from the formal definition in Equation 33 to infer the distribution of $E$ that maximizes the empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$.

Objective Function. Given the dataset $D = \{X_N, Y_N\} = \{(x_i, y_i)\}_{i=1}^N$ and denoting $\mathrm{supp}(E) = \{e_1, \dots, e_K\}$, we parameterize the distribution of $E \mid (X_N, Y_N)$ with a weight matrix $W \in \mathcal{W}_K$, where $K$ is the pre-defined number of environments and $\mathcal{W}_K = \{W : W \in \mathbb{R}^{N \times K}_+ \text{ and } W \mathbf{1}_K = \mathbf{1}_N\}$ is the allowed weight space. Each element $w_{ij}$ of $W$ represents $P(E = e_j \mid x_i, y_i)$, i.e. the probability of the $i$-th data point belonging to the $j$-th sub-population.
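Because the optimization below performs gradient updates on $W$, each row must be mapped back to a valid distribution over the $K$ sub-populations after every step. A minimal sketch of the row-wise Euclidean projection onto the simplex (our illustration; the paper specifies projected gradient descent but not a particular projection — we use the standard sorting-based method):

```python
import numpy as np

def project_rows_to_simplex(w):
    """Euclidean projection of each row of w onto the probability simplex,
    so that W stays in W_K = {W >= 0, W 1_K = 1_N} after a gradient step."""
    n, k = w.shape
    u = -np.sort(-w, axis=1)                        # sort each row descending
    css = np.cumsum(u, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    cond = u - css / idx > 0
    rho = k - np.argmax(cond[:, ::-1], axis=1) - 1  # last index where cond holds
    theta = css[np.arange(n), rho] / (rho + 1.0)
    return np.maximum(w - theta[:, None], 0.0)

# Example: a noisy 4x2 score matrix becomes a valid P(E = e_j | x_i, y_i).
w = np.array([[1.2, -0.3], [0.5, 0.5], [3.0, 1.0], [-1.0, -2.0]])
print(project_rows_to_simplex(w))   # rows are nonnegative and sum to 1
```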
For a predictive family $\mathcal{V}$, the solution to the supremum problem in Definition 7 is equivalent to the following objective function:
$$\min_{W \in \mathcal{W}_K}\ R_\mathcal{V}\big(W, \theta_1^*(W), \dots, \theta_K^*(W)\big) = \sum_{i=1}^N \sum_{j=1}^K w_{ij}\, \ell_\mathcal{V}\big(f_{\theta_j^*}(x_i), y_i\big) + U_\mathcal{V}(W, Y_N)$$
$$\text{s.t.}\quad \theta_j^*(W) \in \arg\min_{\theta} \sum_{i=1}^N w_{ij}\, \ell_\mathcal{V}\big(f_\theta(x_i), y_i\big), \quad j = 1, \dots, K, \tag{15}$$
where $f_\theta: \mathcal{X} \to \mathcal{Y}$ denotes a predicting function parameterized by $\theta$, $\ell_\mathcal{V}(\cdot, \cdot): \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function, and $U_\mathcal{V}(W, Y_N)$ is a regularizer. Specifically, $f_\theta$, $\ell_\mathcal{V}$ and $U_\mathcal{V}$ are determined by the predictive family $\mathcal{V}$. Here we provide implementations for two typical and general machine learning tasks, regression and classification.

(1) For the regression task, the predictive family is typically modeled as:
$$\mathcal{V}_1 = \{g: g[x] = \mathcal{N}(f_\theta(x), \sigma^2) \mid f \text{ is the predicting function, } \theta \text{ is learnable, } \sigma \text{ is a constant}\}. \tag{16}$$
The corresponding loss function is $\ell_{\mathcal{V}_1}(f_\theta(X), Y) = (f_\theta(X) - Y)^2$, and the regularizer becomes
$$U_{\mathcal{V}_1}(W, Y_N) = \mathrm{Var}_{j \in [K]}\big(\bar{Y}^j_N\big) = \sum_{j=1}^K \frac{1}{N}\Big(\sum_{i=1}^N w_{ij}\Big)\big(\bar{Y}^j_N - \bar{Y}_N\big)^2, \qquad \bar{Y}^j_N = \frac{\sum_{i=1}^N w_{ij}\, y_i}{\sum_{i=1}^N w_{ij}}, \tag{17}$$
where $\bar{Y}^j_N$ denotes the mean value of the label $Y$ given $E = e_j$, $\bar{Y}_N$ the overall mean, and $U_{\mathcal{V}_1}(W, Y_N)$ computes the variance of $\bar{Y}^j_N$ among the sub-populations $e_1, \dots, e_K$.

(2) For the classification task, the predictive family is typically modeled as:
$$\mathcal{V}_2 = \{g: g[x] = f_\theta(x) \in \Delta^c \mid f \text{ is the classification model and } \theta \text{ is learnable}\}, \tag{18}$$
where $c$ is the number of classes and $\Delta^c$ denotes the $c$-dimensional simplex. Each model in the predictive family $\mathcal{V}_2$ outputs a discrete distribution in the form of a $c$-dimensional simplex. In this case, the corresponding loss function $\ell_{\mathcal{V}_2}(\cdot, \cdot)$ is the cross-entropy loss, and the regularizer becomes $U_{\mathcal{V}_2}(W, Y_N) = -\sum_{j=1}^K \frac{1}{N}\big(\sum_{i=1}^N w_{ij}\big) H(\bar{Y}^j_N)$, where $H(\bar{Y}^j_N)$ is the entropy of $Y$ given $E = e_j$.

Optimization. The bi-level optimization in Equation 15 can be solved by performing projected gradient descent w.r.t. $W$. The gradient of $W$ can be calculated by (omitting the subscript $\mathcal{V}$):
$$\nabla_W R = \nabla_W U + \big[\ell(f_{\theta_j^*}(x_i), y_i)\big]^{N \times K}_{i,j} + \sum_{j=1}^K \nabla_{\theta_j} R\big|_{\theta_j^*}\, \nabla_W\, \theta_j^*, \tag{19}$$
where the hypergradient $\nabla_W \theta_j^*$ of each sub-population's inner solution can be approximated by back-propagating through the (truncated) inner optimization trajectory (Shaban et al., 2019).

5 EXPERIMENTS

We validate the IM algorithm on simulated data and on the Colored MNIST benchmark.

Data Generation of Simulated Data. Each sub-population carries a factor $r$, and the data heterogeneity is brought by endogeneity with hidden variables (Fan et al., 2014): $V$ is the spurious feature whose relationship with $Y$ is unstable across sub-populations and is controlled by the factor $r$. Intuitively, $\mathrm{sign}(r)$ controls whether the spurious correlation between $V$ and $Y$ is positive or negative, and $|r|$ controls its strength, i.e. larger $|r|$ means stronger spurious correlation. In training, we generate 10000 points, where the major group contains 80% of the data with $r = 1.9$ (i.e. strong positive spurious correlation) and the minor group contains 20% of the data with $r = -1.9$ (i.e. strong negative spurious correlation). In testing, we evaluate the performance on the two groups respectively, and we also set $r = -2.3$ and $r = -2.7$ to simulate stronger distributional shifts. We use linear regression and set $K = 2$ for all methods, and we report the mean squared errors (MSE) of all methods.

Data Generation of Colored MNIST. Following Arjovsky et al. (2019), we design a binary classification task constructed on the MNIST dataset. Firstly, digits 0–4 are labeled $\tilde{Y} = 0$ and digits 5–9 are labeled $\tilde{Y} = 1$. Secondly, noisy labels $Y$ are induced by randomly flipping the label $\tilde{Y}$ with a probability of 0.2. Then we sample the color id $V$, spuriously correlated with $Y$, as
$$V = \begin{cases} +Y, & \text{with probability } r, \\ -Y, & \text{with probability } 1 - r. \end{cases}$$
In fact, $r$ controls the spurious correlation between $Y$ and $V$.
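The label and color sampling above can be sketched as follows (our illustration; we encode the color id as 0/1 rather than $\pm Y$, and the actual coloring of image tensors is omitted):

```python
import numpy as np

def colored_mnist_labels(digits, r, rng):
    """Label/color sampling described above. digits: MNIST digit classes 0-9."""
    y_clean = (digits >= 5).astype(int)            # digits 0-4 -> 0, 5-9 -> 1
    flip = rng.random(len(digits)) < 0.2           # 20% label noise
    y = np.where(flip, 1 - y_clean, y_clean)
    agree = rng.random(len(digits)) < r            # color agrees with Y w.p. r
    v = np.where(agree, y, 1 - y)                  # color id in {0, 1}
    return y, v

rng = np.random.default_rng(0)
digits = rng.integers(0, 10, size=10000)
y, v = colored_mnist_labels(digits, r=0.85, rng=rng)   # training setting below
print((y == v).mean())   # ~0.85: strong but unstable spurious correlation
```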
In training, we randomly sample 10000 data points and set $r = 0.85$, meaning that for 85% of the data $V$ is positively correlated with $Y$, while for the remaining 15% the spurious correlation is negative, which causes data heterogeneity w.r.t. $V$ and $Y$. In testing, we set $r = 0$ (a strong negative spurious correlation), bringing strong shifts between training and testing.

Analysis. From the results in Table 1, for both the simulated data and Colored MNIST, the two backbones equipped with our IM algorithm achieve the best OOD generalization performance. For the simulated data, the learned predictive heterogeneity also enables the backbone algorithms to treat the majority and minority inside the data equally (i.e. a low performance gap between "Major" and "Minor"), and significantly benefits OOD generalization. Furthermore, for both experiments, we plot the learned sub-populations of our IM algorithm in Figures 4 and 5. From Figure 4, compared with KMeans and EIIL, our predictive heterogeneity exposes the spurious correlation between $V$ and $Y$ and enables the backbone algorithms to eliminate it. From Figure 5, the learned sub-populations of our method also reflect the different directions of the spurious correlation between the digit labels $Y$ and the colors (red or green), which helps the backbone methods avoid using colors to predict digits.

6 CONCLUSION

We define the predictive heterogeneity, the first quantitative formulation of the data heterogeneity that affects the prediction of machine learning models. We demonstrate its theoretical properties and show that it benefits out-of-distribution generalization performance.

ACKNOWLEDGEMENTS

We would like to thank Yuting Pan, Jiaming Song, Fan Bao and the anonymous reviewers for helpful feedback. Peng Cui's research was supported in part by the National Key R&D Program of China (No. 2018AAA0102004, No. 2020AAA0106300), the National Natural Science Foundation of China (No. U1936219, 62141607), and the Beijing Academy of Artificial Intelligence (BAAI). Bo Li's research was supported by the National Natural Science Foundation of China (No. 72171131, 72133002) and the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403.

REFERENCES

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Jasmin Breitenstein, Jan-Aike Termöhlen, Daniel Lipinski, and Tim Fingscheidt. Systematization of corner cases for visual perception in automated driving. In 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1257–1264. IEEE, 2020.

Robert Challen, Joshua Denny, Martin Pitt, Luke Gompels, Tom Edwards, and Krasimira Tsaneva-Atanasova. Artificial intelligence, bias and clinical safety. BMJ Quality & Safety, 28(3):231–237, 2019.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. New York: Wiley, 1991.

Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In International Conference on Machine Learning, pp. 2189–2200. PMLR, 2021.

Peng Cui and Susan Athey. Stable learning establishes some common ground between causal inference and machine learning. Nature Machine Intelligence, 4(2):110–115, 2022.

John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong.
Distributionally robust losses for latent covariate mixtures. Operations Research, 2022.

John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.

Kevin Dzobo, Dimakatso Alice Senthebane, Nicholas Ekow Thomford, Arielle Rowe, Collet Dandara, and M Iqbal Parker. Not everyone fits the mold: Intratumor and intertumor heterogeneity and innovative cancer drug design and development. OMICS: A Journal of Integrative Biology, 22(1):17–34, 2018.

Jianqing Fan, Fang Han, and Han Liu. Challenges of big data analysis. National Science Review, 1(2):293–314, 2014.

Jingrui He. Learning from data heterogeneity: Algorithms and applications. In IJCAI, pp. 5126–5130, 2017.

James J Heckman. Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, pp. 153–161, 1979.

Sumyea Helal. Subgroup discovery algorithms: a survey and empirical evaluation. Journal of Computer Science and Technology, 31(3):561–576, 2016.

Miguel A Hernán, David Clayton, and Niels Keiding. The Simpson's paradox unraveled. International Journal of Epidemiology, 40(3):780–785, 2011.

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pp. 2564–2572. PMLR, 2018.

R. Kohavi and B. Becker. Adult, 1996.

Masanori Koyama and Shoichiro Yamaguchi. Out-of-distribution generalization with maximal invariant predictor. 2020.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.

H Li and JF Reynolds. On definition and quantification of heterogeneity. Oikos, pp. 280–284, 1995.

Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Heterogeneous risk minimization. In International Conference on Machine Learning, pp. 6804–6814. PMLR, 2021.

Jiashuo Liu, Zheyan Shen, Peng Cui, Linjun Zhou, Kun Kuang, and Bo Li. Distributionally robust learning with stable adversarial training. IEEE Transactions on Knowledge and Data Engineering, 2022.

David B Lobell, Marshall B Burke, Claudia Tebaldi, Michael D Mastrandrea, Walter P Falcon, and Rosamond L Naylor. Prioritizing climate change adaptation needs for food security in 2030. Science, 319(5863):607–610, 2008.

Todd K Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.

Douglas A Reynolds. Gaussian mixture models. Encyclopedia of Biometrics, 741:659–663, 2009.

Paul R Rosenbaum. Heterogeneity and causality: Unit heterogeneity and design sensitivity in observational studies. The American Statistician, 59(2):147–152, 2005.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723–1732. PMLR, 2019.

Clifford H Wagner.
Simpson's paradox in real life. The American Statistician, 36(1):46–48, 1982.

Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022–2032, 2019.

Christopher Winship and Robert D Mare. Models for sample selection bias. Annual Review of Sociology, pp. 327–350, 1992.

Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=r1eBeyHFDH.

Shengjia Zhao, Abhishek Sinha, Yutong He, Aidan Perreault, Jiaming Song, and Stefano Ermon. Comparing distributions by measuring differences that affect decision making. In International Conference on Learning Representations, 2021.

Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, and Tong Zhang. Model agnostic sample reweighting for out-of-distribution learning. In International Conference on Machine Learning, pp. 27203–27221. PMLR, 2022.

A FORMAL DEFINITION OF EMPIRICAL PREDICTIVE HETEROGENEITY

In this section, we derive the explicit formula for the empirical estimate of the predictive heterogeneity described in Definition 6. The dataset $D = \{(x_i, y_i)\}_{i=1}^{|D|}$ is independently and identically drawn from the population $X, Y$. Given a predictive family $\mathcal{V}$ and an environment set $\mathcal{E}_K$, let $\mathcal{Q}$ be the set of all probability distributions of $(X, Y, E)$ where $E \in \mathcal{E}_K$. For a given $E$, denote $\mathrm{supp}(E) = \{e_k\}_{k=1}^K$. The empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$ is given by:
$$\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D) = \sup_{E \in \mathcal{E}_K} \hat{I}_\mathcal{V}(X \to Y \mid E; D) - \hat{I}_\mathcal{V}(X \to Y; D) \tag{24}$$
$$= \sup_{\hat{Q} \in \mathcal{Q}} \Big[\sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D) - \sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D)\Big] - \big[\hat{H}_\mathcal{V}(Y; D) - \hat{H}_\mathcal{V}(Y \mid X; D)\big]. \tag{25–26}$$
Specifically,
$$\hat{Q}(E = e_k)\,\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D) = \inf_{f \in \mathcal{V}} \hat{Q}(E = e_k) \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \frac{\hat{Q}(x_i, y_i \mid E = e_k)}{\sum_{(x_j, y_j) \in D} \hat{Q}(x_j, y_j \mid E = e_k)} \tag{27–28}$$
$$= \inf_{f \in \mathcal{V}} \hat{Q}(E = e_k) \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \frac{\hat{Q}(E = e_k \mid x_i, y_i)\,\hat{Q}(x_i, y_i)}{\sum_{(x_j, y_j) \in D} \hat{Q}(E = e_k \mid x_j, y_j)\,\hat{Q}(x_j, y_j)} \tag{29}$$
$$= \inf_{f \in \mathcal{V}} \hat{Q}(E = e_k) \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \frac{\hat{Q}(E = e_k \mid x_i, y_i)\,\hat{Q}(x_i, y_i)}{\hat{Q}(E = e_k)} \tag{30}$$
$$= \inf_{f \in \mathcal{V}} \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \hat{Q}(E = e_k \mid x_i, y_i)\, \hat{Q}(x_i, y_i) \tag{31}$$
$$= \inf_{f \in \mathcal{V}} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} -\log f[x_i](y_i)\, \hat{Q}(E = e_k \mid x_i, y_i), \tag{32}$$
where the last step uses the empirical distribution $\hat{Q}(x_i, y_i) = 1/|D|$. The explicit formulas for $\hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D)$, $\hat{H}_\mathcal{V}(Y \mid X; D)$ and $\hat{H}_\mathcal{V}(Y; D)$ can be derived similarly. We are now ready to formally define the empirical predictive heterogeneity.

Definition 7 (Empirical Predictive Heterogeneity (formal)). For the prediction task $X \to Y$ with $X, Y$ taking values in $\mathcal{X} \times \mathcal{Y}$, a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ is independently and identically drawn from the population $X, Y$. Given the predictive family $\mathcal{V}$ and the environment set $\mathcal{E}_K = \{E \mid E \in \mathcal{C}, |\mathrm{supp}(E)| = K\}$, where $K \in \mathbb{N}$, let $\mathcal{Q}$ be the set of all probability distributions of $(X, Y, E)$ with $E \in \mathcal{E}_K$. The empirical predictive heterogeneity $\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)$ with respect to $D$ is defined as:
$$\hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D) = \sup_{\hat{Q} \in \mathcal{Q}} \Big[\sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D) - \sum_{k=1}^K \hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D)\Big] - \big[\hat{H}_\mathcal{V}(Y; D) - \hat{H}_\mathcal{V}(Y \mid X; D)\big], \tag{33}$$
where
$$\hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i)\,\hat{Q}(E = e_k \mid x_i, y_i), \tag{34}$$
$$\hat{Q}(E = e_k)\hat{H}_\mathcal{V}(Y \mid \varnothing, E = e_k; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[\varnothing](y_i)\,\hat{Q}(E = e_k \mid x_i, y_i), \tag{35}$$
$$\hat{H}_\mathcal{V}(Y \mid X; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i), \tag{36}$$
$$\hat{H}_\mathcal{V}(Y; D) = \inf_{f \in \mathcal{V}} \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[\varnothing](y_i). \tag{37}$$
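For the Gaussian family of Section 4, the infimum in Equation 34 is a weighted least squares; a minimal sketch (our illustration; `q_e` is one column of the weight matrix $W$ from Section 4):

```python
import numpy as np

def weighted_term_eq34(x, y, q_e):
    """Empirical Q(E=e_k) * H_V(Y | X, E=e_k; D) of Equation 34 for the
    linear-Gaussian family with sigma = 1: the inf becomes weighted LS."""
    xb = np.column_stack([x, np.ones(len(y))])
    sw = np.sqrt(q_e)                          # q_e[i] = Q(E=e_k | x_i, y_i)
    theta, *_ = np.linalg.lstsq(xb * sw[:, None], y * sw, rcond=None)
    nll = 0.5 * (y - xb @ theta) ** 2 + 0.5 * np.log(2 * np.pi)
    return np.mean(q_e * nll)                  # (1/|D|) sum_i q_e[i] * (-log f)
```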
B SENSITIVITY OF K

In the experiments of Section 5, we set $K = 2$ for easy illustration. In this section, we add results for different choices of $K$ on the simulated experiment of Section 5.2, showing that the OOD generalization performance of typical algorithms combined with our proposed method is not sensitive to the choice of $K$.

Figure 6: The out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones. We plot the errors of the different backbones under $r = -2.7$, which introduces strong distributional shifts from the training data.

In Figure 6, we show the out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones, testing under $r = -2.7$, which has a strong distributional shift from the training distribution. From the results, the performance of the three OOD generalization methods is not much affected by the choice of $K$, and from Table 1, our method significantly outperforms all the baselines.

We also add one more experiment to show that (1) when the chosen $K$ is smaller than the ground truth, the performance of our method drops but remains better than ERM, and (2) when the chosen $K$ is larger, the performance is not much affected (consistent with the results above).

Experiment Setting: The input features $X = [S, T, V] \in \mathbb{R}^{10}$ consist of stable features $S \in \mathbb{R}^5$, noisy features $T \in \mathbb{R}^4$ and a spurious feature $V \in \mathbb{R}$:
$$S \sim \mathcal{N}(2, 2I_5), \qquad T \sim \mathcal{N}(0, 2I_4), \qquad Y = \theta_S^T S + S_1 S_2 S_3 + \mathcal{N}(0, 0.5),$$
and we generate the spurious feature via
$$V = \theta_V^e\, Y + \mathcal{N}(0, 0.3),$$
where $\theta_V^e$ varies across sub-populations, depending on which sub-population the data point belongs to. In training, we sample 8000 data points from $e_1$ with $\theta_V^1 = 3.0$, 1000 points from $e_2$ with $\theta_V^2 = 1.0$, 1000 points from $e_3$ with $\theta_V^3 = -2.0$ and 1000 points from $e_4$ with $\theta_V^4 = -3.0$. Therefore, the ground-truth number of sub-populations is 4. In testing, we test the performance on $e_4$ with $\theta_V^4 = -3.0$, which has strong distributional shifts from the training data. The average MSE over 10 runs is shown in Figure 7.

Figure 7: The out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones for the added experiment. The ground-truth number of sub-populations is 4.

From the results, we can see that when $K$ is smaller than the ground truth, increasing $K$ benefits the OOD generalization performance, and when $K$ is larger, the performance is not much affected, which is consistent with the results in Figure 6. For our IM algorithm, we see mainly two ways to choose $K$:

According to the predictive heterogeneity index: when the chosen $K$ is smaller than the ground truth, our measure tends to increase quickly as $K$ increases; once $K$ exceeds the ground truth, the increase slows down, which can direct people to an appropriate $K$.

According to the prediction models: since our IM algorithm aims to learn sub-populations with different prediction mechanisms, one can compare the learned model parameters $\theta_1, \dots, \theta_K$ to judge whether $K$ is much larger than the ground truth; if two resulting models are quite similar, $K$ may be too large (one sub-population has been divided into two). For linear models, one can directly compare the coefficients, as sketched below; for deep models, one can calculate the transfer losses across sub-populations.
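A minimal sketch of this coefficient-comparison heuristic (our illustration, with hypothetical learned coefficients):

```python
import numpy as np

def redundant_pairs(thetas, tol=0.98):
    """Flag pairs of sub-population models whose coefficients nearly coincide,
    suggesting the chosen K over-splits one sub-population."""
    t = np.asarray(thetas, dtype=float)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = t @ t.T                                  # pairwise cosine similarity
    return [(i, j) for i in range(len(t)) for j in range(i + 1, len(t))
            if sim[i, j] > tol]

# Hypothetical coefficients learned by IM with K = 3 on a 2-sub-population task:
thetas = [[1.0, 2.0, -0.5],
          [0.98, 2.05, -0.48],    # nearly identical to the first -> K too large
          [-1.2, 0.3, 2.0]]
print(redundant_pairs(thetas))    # [(0, 1)]
```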
A detailed analysis of the best choice of $K$ is left for future work.

C RELATED WORK

To the best of our knowledge, data heterogeneity has not converged to a uniform formulation so far, and it has different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology based on system properties and complexity or variability. Rosenbaum (2005) views the uncertainty of the potential outcome as unit heterogeneity in observational studies in economics. For graph data, heterogeneity refers to the various types of nodes and edges (Wang et al., 2019). More recently, in machine learning, several works on causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Creager et al., 2021) and robust learning (Sagawa et al., 2019) leverage heterogeneous data from multiple environments to improve out-of-distribution generalization. Specifically, invariant learning methods (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Creager et al., 2021; Zhou et al., 2022) leverage heterogeneous environments to learn invariant predictors with uniform performance across environments. In the distributionally robust optimization field, Sagawa et al. (2019) and Duchi et al. (2022) propose to optimize the worst-group prediction error to guarantee OOD generalization. However, in machine learning, previous works have not provided a precise definition or sound quantification of data heterogeneity, which makes it confusing and hard to leverage for developing more rational machine learning algorithms.

As for clustering algorithms, most of them focus only on the covariates $X$, typified by KMeans and the Gaussian Mixture Model (GMM; Reynolds, 2009). However, the clusters learned by KMeans cannot reflect the predictive heterogeneity, as shown in our experiments. Expectation maximization (EM; Moon, 1996) can also be used for clustering, but our IM algorithm differs essentially from EM: IM infers latent variables that maximize the predictive heterogeneity, whereas EM maximizes the likelihood. There are also methods from the invariant learning field that infer environments (Creager et al., 2021); though they can benefit OOD generalization, they lack a theoretical foundation and only work in some settings.

D PROOF OF PROPOSITION 1

Proof of Proposition 1.
1. Monotonicity: Since $\mathcal{E}_1 \subseteq \mathcal{E}_2$,
$$\mathcal{H}^{\mathcal{E}_1}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}_1} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y) \le \sup_{E \in \mathcal{E}_2} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y) = \mathcal{H}^{\mathcal{E}_2}_\mathcal{V}(X \to Y). \tag{38–40}$$
2. Nonnegativity: According to the definition of the environment set, there exists $E_0 \in \mathcal{E}$ such that for any $e \in \mathrm{supp}(E_0)$, $X, Y \mid E_0 = e$ is identically distributed as $X, Y$. Thus we have
$$\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}}\big[H_\mathcal{V}(Y \mid \varnothing, E) - H_\mathcal{V}(Y \mid X, E)\big] - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] \ge \big[H_\mathcal{V}(Y \mid \varnothing, E_0) - H_\mathcal{V}(Y \mid X, E_0)\big] - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big]. \tag{41–42}$$
Specifically,
$$H_\mathcal{V}(Y \mid X, E_0) = \mathbb{E}_{e \sim E_0}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y \mid E_0 = e}[-\log f[x](y)]\Big] = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y}[-\log f[x](y)] = H_\mathcal{V}(Y \mid X). \tag{43–45}$$
Similarly, $H_\mathcal{V}(Y \mid \varnothing, E_0) = H_\mathcal{V}(Y \mid \varnothing)$. Thus $\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) \ge 0$.
3. Boundedness: First, we have
$$H_\mathcal{V}(Y \mid X, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y \mid E = e}[-\log f[x](y)]\Big] \ge 0, \tag{46–48}$$
by noticing that $\mathbb{E}_{y \sim Y \mid x, e}[-\log f[x](y)]$ is the cross entropy between $Y \mid x, e$ and $f[x]$. Second,
$$H_\mathcal{V}(Y \mid \varnothing, E) = \mathbb{E}_{e \sim E}\Big[\inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y \mid E = e}[-\log f[\varnothing](y)]\Big] \le \inf_{f \in \mathcal{V}} \mathbb{E}_{e \sim E}\mathbb{E}_{y \sim Y \mid E = e}[-\log f[\varnothing](y)] = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}[-\log f[\varnothing](y)] = H_\mathcal{V}(Y \mid \varnothing), \tag{49–52}$$
where Equation 50 is due to Jensen's inequality.
Combining the above inequalities,
$$\mathcal{H}^\mathcal{E}_\mathcal{V}(X \to Y) = \sup_{E \in \mathcal{E}}\big[H_\mathcal{V}(Y \mid \varnothing, E) - H_\mathcal{V}(Y \mid X, E)\big] - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] \tag{53}$$
$$\le \sup_{E \in \mathcal{E}} H_\mathcal{V}(Y \mid \varnothing, E) - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] \le H_\mathcal{V}(Y \mid \varnothing) - \big[H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)\big] = H_\mathcal{V}(Y \mid X). \tag{54–56}$$
4. Corner Case: According to Proposition 2 in Xu et al. (2020),
$$H_\Omega(Y \mid \varnothing) = H(Y), \qquad H_\Omega(Y \mid X) = H(Y \mid X). \tag{57–58}$$
By taking random variables $R, S$ identically distributed as $X, Y \mid E = e$ for $e \in \mathrm{supp}(E)$, we have
$$H_\Omega(Y \mid X, E = e) = H_\Omega(S \mid R) = H(S \mid R) = H(Y \mid X, E = e), \tag{59}$$
$$H_\Omega(Y \mid X, E) = \mathbb{E}_{e \sim E}[H_\Omega(Y \mid X, E = e)] = \mathbb{E}_{e \sim E}[H(Y \mid X, E = e)] = H(Y \mid X, E). \tag{60}$$
Similarly, we have $H_\Omega(Y \mid \varnothing, E) = H(Y \mid E)$. Thus,
$$\mathcal{H}^\mathcal{E}_\Omega(X \to Y) = \sup_{E \in \mathcal{E}}\big[H_\Omega(Y \mid \varnothing, E) - H_\Omega(Y \mid X, E)\big] - \big[H_\Omega(Y \mid \varnothing) - H_\Omega(Y \mid X)\big] \tag{61}$$
$$= \sup_{E \in \mathcal{E}}\big[H(Y \mid E) - H(Y \mid X, E)\big] - \big[H(Y) - H(Y \mid X)\big] = \sup_{E \in \mathcal{E}} I(Y; X \mid E) - I(Y; X) = \mathcal{H}_\mathcal{E}(X, Y). \tag{62–64}$$

E PROOF OF THEOREM 1

Proof of Theorem 1. 1) We have
$$H_{\mathcal{V}_\mathcal{G}}(Y \mid X) = \inf_{f \in \mathcal{V}_\mathcal{G}} \mathbb{E}_{x \sim X}\mathbb{E}_{y \sim Y \mid x}[-\log f[x](y)] \le \mathbb{E}_{x \sim X}\mathbb{E}_{y \sim Y \mid x}\big[\pi (y - g(x))^2\big] = \pi\sigma^2, \tag{65–68}$$
where the inequality (Equation 66) holds by taking $f[x] = \mathcal{N}\big(g(x), \frac{1}{2\pi}\big)$, whose density satisfies $-\log f[x](y) = \pi(y - g(x))^2$. Together with the Boundedness property of Proposition 1, this yields $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\mathcal{G}}(X \to Y) \le H_{\mathcal{V}_\mathcal{G}}(Y \mid X) \le \pi\sigma^2$.

2) Given the function family $\mathcal{V}_\sigma = \{f \mid f[x] = \mathcal{N}(\theta x, \sigma^2),\ \theta \in \mathbb{R},\ \sigma \text{ fixed}\}$, by expanding the Gaussian probability density function in the definition of the predictive V-information, it can be shown that
$$I_{\mathcal{V}_\sigma}(X \to Y) \propto \mathrm{Var}(Y) - \min_{k \in \mathbb{R}} \mathbb{E}[(Y - kX)^2], \tag{69}$$
i.e. the predictive V-information is proportional to the variance of the target minus the mean squared error, with a positive coefficient depending only on $\sigma$. Without loss of generality take $\beta = 1$, so that $Y = X + \epsilon$; the minimization problem is solved by $k^* = \mathbb{E}[XY]/\mathbb{E}[X^2] = 1$. Substituting $k^*$ into Equation 69,
$$I_{\mathcal{V}_\sigma}(X \to Y) \propto \mathrm{Var}(X + \epsilon) - \mathbb{E}[\epsilon^2] = \mathrm{Var}(X) = \mathbb{E}[X^2]. \tag{71–72}$$
Denote $\mathrm{supp}(E) = \{E_1, E_2\}$. Let $Q$ be the joint distribution of $(X, \epsilon, E)$, with marginals $Q(E_1) = \alpha$ and $Q(E_2) = 1 - \alpha$. Abbreviate $Q(X, \epsilon \mid E = E_1)$ by $P_1(X, \epsilon)$ and $Q(X, \epsilon \mid E = E_2)$ by $P_2(X, \epsilon)$. Similarly to Equation 69,
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E) \propto \mathbb{E}_E\Big[\mathrm{Var}(Y \mid E) - \min_k \mathbb{E}[(Y - kX)^2 \mid E]\Big]. \tag{73}$$
For $E = E_1$, the minimization is solved by $k^* = \mathbb{E}_{P_1}[XY]/\mathbb{E}_{P_1}[X^2]$ (74), giving
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_1) \propto \mathrm{Var}_{P_1}(Y) - \mathbb{E}_{P_1}\Big[\Big(Y - \tfrac{\mathbb{E}_{P_1}[XY]}{\mathbb{E}_{P_1}[X^2]}X\Big)^2\Big] = \frac{\mathbb{E}^2_{P_1}[XY]}{\mathbb{E}_{P_1}[X^2]} - \mathbb{E}^2_{P_1}[Y], \tag{75–77}$$
and similarly
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_2) \propto \frac{\mathbb{E}^2_{P_2}[XY]}{\mathbb{E}_{P_2}[X^2]} - \mathbb{E}^2_{P_2}[Y]. \tag{78}$$
Notably, the moments of $P_1$ and $P_2$ are constrained by $\alpha$ and the overall moments:
$$\mathbb{E}[X^2] = \alpha\mathbb{E}_{P_1}[X^2] + (1 - \alpha)\mathbb{E}_{P_2}[X^2], \quad \mathbb{E}[X^2] = \mathbb{E}[XY] = \alpha\mathbb{E}_{P_1}[XY] + (1 - \alpha)\mathbb{E}_{P_2}[XY], \quad 0 = \mathbb{E}[Y] = \alpha\mathbb{E}_{P_1}[Y] + (1 - \alpha)\mathbb{E}_{P_2}[Y]. \tag{79–81}$$
The moments of $P_2$ can thereafter be represented by those of $P_1$:
$$\mathbb{E}_{P_2}[X^2] = \frac{\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[X^2]}{1 - \alpha}, \quad \mathbb{E}_{P_2}[XY] = \frac{\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[XY]}{1 - \alpha}, \quad \mathbb{E}_{P_2}[Y] = -\frac{\alpha\mathbb{E}_{P_1}[Y]}{1 - \alpha}. \tag{82–84}$$
Substituting into Equation 78,
$$I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_2) \propto -\frac{\alpha^2}{(1 - \alpha)^2}\mathbb{E}^2_{P_1}[Y] + \frac{1}{1 - \alpha}\cdot\frac{\big(\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[XY]\big)^2}{\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[X^2]}. \tag{85}$$
Therefore,
$$\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) \propto \sup_{E \in \mathcal{E}}\ \alpha I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_1) + (1 - \alpha)I_{\mathcal{V}_\sigma}(X \to Y \mid E = E_2) - I_{\mathcal{V}_\sigma}(X \to Y) \tag{86–87}$$
$$= \sup_{E \in \mathcal{E}}\ -\frac{\alpha}{1 - \alpha}\mathbb{E}^2_{P_1}[X + \epsilon] + \frac{\alpha\,\mathbb{E}^2_{P_1}[X\epsilon]}{\mathbb{E}_{P_1}[X^2]\big(\mathbb{E}[X^2] - \alpha\mathbb{E}_{P_1}[X^2]\big)\mathbb{E}[X^2]}, \tag{88–89}$$
using $\mathbb{E}_{P_1}[XY] - \mathbb{E}_{P_1}[X^2] = \mathbb{E}_{P_1}[X\epsilon]$. Assuming $X \perp \epsilon \mid E$, the second term vanishes, so
$$\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) \propto \sup_{E \in \mathcal{E}}\ -\frac{\alpha}{1 - \alpha}\mathbb{E}^2_{P_1}[X + \epsilon] \le 0. \tag{90}$$
From Proposition 1, we also have $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) \ge 0$. Thus $\mathcal{H}^\mathcal{E}_{\mathcal{V}_\sigma}(X \to Y) = 0$.

F PROOF OF LINEAR CASES (THEOREM 2 AND ??)

Proof of Theorem 2. For ease of notation, we denote $r(E^*)$ by $r_e$, $\sigma(E^*)$ by $\sigma_e$, and $\sigma(E^*)\epsilon_V$ by $\epsilon_e$, and we omit the superscript $\mathcal{C}$ of $\mathcal{H}^\mathcal{C}_\mathcal{V}$. Firstly, we calculate $H_\mathcal{V}[Y \mid \varnothing]$:
$$H_\mathcal{V}[Y \mid \varnothing] = \frac{1}{2\sigma^2}\mathrm{Var}(Y) + \log\sigma + \frac{1}{2}\log 2\pi, \qquad H_\mathcal{V}[Y \mid \varnothing, E^*] = \frac{1}{2\sigma^2}\mathbb{E}_{E^*}[\mathrm{Var}(Y \mid E^*)] + \log\sigma + \frac{1}{2}\log 2\pi. \tag{91–92}$$
Therefore, we have
$$H_\mathcal{V}[Y \mid \varnothing, E^*] - H_\mathcal{V}[Y \mid \varnothing] = -\frac{1}{2\sigma^2}\mathrm{Var}\big(\mathbb{E}[Y \mid E^*]\big) \le 0. \tag{93}$$
As for $H_\mathcal{V}[Y \mid X]$, writing $\tilde{h}_S = h_S - \beta$ and substituting the generation process of Equation 12, we have
$$2\sigma^2 H_\mathcal{V}[Y \mid X] = \inf_{\tilde{h}_S, h_V} \mathbb{E}_{E^*}\Big[\mathbb{E}\big[\big((1 - h_V r_e) f(S) - \tilde{h}_S S\big)^2 \mid E^*\big]\Big] + \sigma_Y^2 + h_V^2\, \mathbb{E}_{E^*}[\sigma_e^2] \tag{94–98}$$
(up to an additive constant that cancels in all differences below). Note that for $e_i, e_j \in \mathrm{supp}(E^*)$ we assume $P_{e_i}(S, Y) = P_{e_j}(S, Y)$ (we choose such an $E^*$ as one possible split). The solution is:
$$\tilde{h}_S^* = \frac{\big(\mathrm{Var}(r_e)\mathbb{E}[f^2(S)] + \mathbb{E}[\sigma_e^2]\big)\mathbb{E}[f(S)S]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}, \qquad h_V^* = \frac{\mathbb{E}[r_e]\big(\mathbb{E}[f^2(S)]\mathbb{E}[S^2] - \mathbb{E}^2[f(S)S]\big)}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}. \tag{99–100}$$
According to the assumption $\mathbb{E}[f(S)S] = 0$, these reduce to
$$\tilde{h}_S^* = 0, \qquad h_V^* = \frac{\mathbb{E}[r(E^*)]\,\mathbb{E}[f^2]}{\mathbb{E}[r^2(E^*)]\,\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}. \tag{101–102}$$
Therefore,
$$2\sigma^2 H_\mathcal{V}[Y \mid X] = \mathbb{E}_{E^*}\big[(1 - h_V^* r_e)^2\big]\mathbb{E}[f^2] + h_V^{*2}\mathbb{E}_{E^*}[\sigma_e^2] + \sigma_Y^2 = \frac{\mathrm{Var}(r_e)\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}\,\mathbb{E}[f^2(S)] + \sigma_Y^2, \tag{103–104}$$
while, optimizing within each environment,
$$2\sigma^2 H_\mathcal{V}[Y \mid X, E^*] = \sigma_Y^2 + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e^2\mathbb{E}[f^2]/\sigma_e^2 + 1}\Big)^2\bigg]\mathbb{E}[f^2] + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e/\sigma_e + \sigma_e/(r_e\mathbb{E}[f^2])}\Big)^2\bigg]. \tag{105}$$
Note that we simply set $\sigma = 1$ in the main body. Combining the above, we have
$$\mathcal{H}_\mathcal{V}(X \to Y) \approx \frac{\mathrm{Var}(r_e)\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2] + \mathbb{E}[\sigma^2(E^*)]}\,\mathbb{E}[f^2(S)], \tag{106}$$
where the approximation error is bounded by $\frac{1}{2}\max\big(\sigma_Y^2, R(r(E^*), \sigma(E^*), \mathbb{E}[f^2])\big)$, and $R$ is defined as:
$$R(r(E^*), \sigma(E^*), \mathbb{E}[f^2]) = \mathbb{E}\bigg[\Big(\frac{1}{r_e^2\mathbb{E}[f^2]/\sigma_e^2 + 1}\Big)^2\bigg]\mathbb{E}[f^2] + \mathbb{E}_{E^*}\bigg[\Big(\frac{1}{r_e/\sigma_e + \sigma_e/(r_e\mathbb{E}[f^2])}\Big)^2\bigg]. \tag{107}$$

Proof of Theorem ??. As proved above, in the general case we have
$$h_S^* = \beta + \frac{\big(\mathrm{Var}(r_e)(\mathbb{E}[f^2(S)] + \sigma_Y^2) + \mathbb{E}[\sigma_e^2]\big)\mathbb{E}[f(S)S]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[r_e^2]\sigma_Y^2\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}, \tag{108}$$
$$h_V^* = \frac{\mathbb{E}[r_e]\big(\sigma_Y^2 + \mathbb{E}[f^2(S)]\big)\mathbb{E}[S^2] - \mathbb{E}[r_e]\mathbb{E}^2[f(S)S]}{\mathbb{E}[r_e^2]\mathbb{E}[f^2(S)]\mathbb{E}[S^2] + \mathbb{E}[r_e^2]\sigma_Y^2\mathbb{E}[S^2] + \mathbb{E}[\sigma_e^2]\mathbb{E}[S^2] - \mathbb{E}^2[r_e]\mathbb{E}^2[f(S)S]}. \tag{109}$$
For the model misspecification case, we further assume that (1) $\mathbb{E}[f(S)S] = 0$ and (2) $\mathbb{E}[\sigma_e^2] \ll \mathbb{E}[f^2(S)]\mathbb{E}[S^2]$; then
$$h_S^* = \beta, \qquad h_V^* = \frac{\mathbb{E}[r_e]}{\mathbb{E}[r_e^2]}, \tag{110–111}$$
and for the heterogeneity we have
$$\frac{\mathrm{Var}(r_e)}{\mathbb{E}[r_e^2]}\big(\mathbb{E}[f^2(S)] + \sigma_Y^2\big) + h_V^{*2}\,\mathbb{E}_{E^*}[\sigma_e^2] - \sigma_Y^2 \;\le\; 2\sigma^2\,\mathcal{H}_\mathcal{V}(X \to Y) \;\le\; \frac{\mathrm{Var}(r_e)}{\mathbb{E}[r_e^2]}\big(\mathbb{E}[f^2(S)] + \sigma_Y^2\big) + h_V^{*2}\,\mathbb{E}_{E^*}[\sigma_e^2] - \mathbb{E}_{E^*}\Big[\frac{\sigma_e^2}{r_e^2}\Big]. \tag{112}$$
Without the model misspecification, we assume $f \equiv 0$; then
$$h_S^* = \beta, \qquad h_V^* = \frac{\mathbb{E}[r_e]\,\sigma_Y^2}{\mathbb{E}[r_e^2]\,\sigma_Y^2 + \mathbb{E}[\sigma_e^2]}, \tag{113–114}$$
and for the heterogeneity,
$$2\sigma^2\mathcal{H}_\mathcal{V}(X \to Y) \ge \sigma_Y^2\big(1 - 2h_V^*\mathbb{E}[r_e] + h_V^{*2}\mathbb{E}[r_e^2]\big) + h_V^{*2}\mathbb{E}[\sigma_e^2] - \mathbb{E}\Big[\frac{\sigma_e^2}{r_e^2}\Big], \tag{115}$$
$$2\sigma^2\mathcal{H}_\mathcal{V}(X \to Y) \le \sigma_Y^2\big(1 - 2h_V^*\mathbb{E}[r_e] + h_V^{*2}\mathbb{E}[r_e^2]\big) + h_V^{*2}\mathbb{E}[\sigma_e^2]. \tag{116}$$

G PROOF OF THE ERROR BOUND FOR FINITE-SAMPLE ESTIMATION (THEOREM 3)

In this section, we prove the error bound for estimating the predictive heterogeneity with the empirical predictive heterogeneity. Before the proof of Theorem 3, which is inspired by Xu et al. (2020), we introduce three lemmas.

Lemma 1. Assume $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[x](y) \in [-B, B]$ where $B > 0$. Define the function class $\mathcal{G}^k_\mathcal{V} = \{g \mid g(x, y) = \log f[x](y)\, q(E = e_k \mid x, y),\ f \in \mathcal{V},\ q \in \mathcal{Q}\}$. Denote the Rademacher complexity of $\mathcal{G}$ with $N$ samples by $\mathcal{R}_N(\mathcal{G})$. Define $\hat{f}_k = \arg\inf_f \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i)$. Then for any $q \in \mathcal{Q}$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\bigg| q(E = e_k)\, H_\mathcal{V}(Y \mid X, E = e_k) - \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log \hat{f}_k[x_i](y_i)\, q(E = e_k \mid x_i, y_i)\bigg| \le 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{117–118}$$
Proof.
Apply McDiarmid's inequality to the function $\Phi(D)$ defined as:
$$\Phi(D) = \sup_{f \in \mathcal{V}, q \in \mathcal{Q}} \bigg( q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i) \bigg). \tag{119}$$
Let $D$ and $D'$ be two identical datasets except for one data point, $(x_j, y_j)$ versus $(x'_j, y'_j)$. We have:
$$\Phi(D) - \Phi(D') \le \frac{1}{|D|}\sup_{f, q} \Big| \log f[x_j](y_j)\, q(E = e_k \mid x_j, y_j) - \log f[x'_j](y'_j)\, q(E = e_k \mid x'_j, y'_j) \Big| \le \frac{2B}{|D|}. \tag{120–125}$$
According to McDiarmid's inequality, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$$\Phi(D) \le \mathbb{E}_D[\Phi(D)] + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{126}$$
Next we derive a bound for $\mathbb{E}_D[\Phi(D)]$. Consider a ghost dataset $D'$ independently and identically drawn from $q(X, Y) = P(X, Y)$ with the same size as $D$. We notice that
$$q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k] = \mathbb{E}_q[-\log f[x](y)\, q(E = e_k \mid x, y)] = \mathbb{E}_{D'}\bigg[\frac{1}{|D'|}\sum_{(x'_i,y'_i) \in D'} -\log f[x'_i](y'_i)\, q(E = e_k \mid x'_i, y'_i)\bigg]. \tag{127–131}$$
Thus, $\mathbb{E}_D[\Phi(D)]$ can be bounded by the standard symmetrization argument. Introducing independent Rademacher variables $\sigma_i$,
$$\mathbb{E}_D[\Phi(D)] \le \mathbb{E}_{D, D'}\bigg[\sup_{f, q} \frac{1}{|D|}\sum_i \Big(\log f[x_i](y_i)\,q(E = e_k \mid x_i, y_i) - \log f[x'_i](y'_i)\,q(E = e_k \mid x'_i, y'_i)\Big)\bigg]$$
$$\le 2\,\mathbb{E}_{D, \sigma}\bigg[\sup_{f, q} \frac{1}{|D|}\sum_i \sigma_i \log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i)\bigg] = 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}), \tag{132–143}$$
where the first inequality follows from Jensen's inequality and the convexity of the supremum (Equation 137), and the second holds due to the symmetry of the paired differences and the fact that Rademacher variables preserve the expected supremum of sums of symmetric random variables under a convex mapping (Equation 139; Ledoux & Talagrand (1991), Lemma 6.3).

Substituting Equation 143 into Equation 126, we have: for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $f \in \mathcal{V}$ and $q \in \mathcal{Q}$,
$$q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_i -\log f[x_i](y_i)\, q(E = e_k \mid x_i, y_i) \le 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{144–145}$$
Let $f^*_k = \arg\inf_f \{q(E = e_k)\,\mathbb{E}_q[-\log f[x](y) \mid E = e_k]\}$ and $\hat{f}_k = \arg\inf_f \{\frac{1}{|D|}\sum_i -\log f[x_i](y_i)\,q(E = e_k \mid x_i, y_i)\}$. Now we have
$$q(E = e_k)\,\mathbb{E}_q[-\log f^*_k[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_i -\log f^*_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i)$$
$$\le q(E = e_k)\,H_\mathcal{V}(Y \mid X, E = e_k) - \frac{1}{|D|}\sum_i -\log \hat{f}_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i)$$
$$\le q(E = e_k)\,\mathbb{E}_q[-\log \hat{f}_k[x](y) \mid E = e_k] - \frac{1}{|D|}\sum_i -\log \hat{f}_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i). \tag{146–148}$$
Combining Equation 144 and Equations 146–148, the lemma is proved.

Lemma 2. Assume $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[\varnothing](y) \in [-B, B]$ where $B > 0$. The definitions of $\mathcal{G}^k_\mathcal{V}$ and $\mathcal{R}_N(\mathcal{G})$ follow Lemma 1. Define $\hat{f}_k = \arg\inf_f \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log f[\varnothing](y_i)\, q(E = e_k \mid x_i, y_i)$. Then for any $q \in \mathcal{Q}$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\bigg| q(E = e_k)\,H_\mathcal{V}(Y \mid E = e_k) - \frac{1}{|D|}\sum_{(x_i,y_i) \in D} -\log \hat{f}_k[\varnothing](y_i)\, q(E = e_k \mid x_i, y_i) \bigg| \le 2\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{149–150}$$
Proof. Similarly to Lemma 1, we can prove that
$$\bigg| q(E = e_k)\,H_\mathcal{V}(Y \mid E = e_k) - \frac{1}{|D|}\sum_i -\log \hat{f}_k[\varnothing](y_i)\,q(E = e_k \mid x_i, y_i) \bigg| \le 2\mathcal{R}_{|D|}(\mathcal{G}'^k_\mathcal{V}) + B\sqrt{\frac{2\log(1/\delta)}{|D|}}, \tag{151–152}$$
where $\mathcal{G}'^k_\mathcal{V} = \{g \mid g(x, y) = \log f[\varnothing](y)\, q(E = e_k \mid x, y),\ f \in \mathcal{V},\ q \in \mathcal{Q}\}$. According to the definition of the predictive family $\mathcal{V}$ (Xu et al.
(2020), Definition 1), for any $f \in \mathcal{V}$ there exists $f' \in \mathcal{V}$ such that $\forall x \in \mathcal{X}$, $f'[x] = f[\varnothing]$. Thus $\mathcal{G}'^k_\mathcal{V} \subseteq \mathcal{G}^k_\mathcal{V}$, and therefore $\mathcal{R}_{|D|}(\mathcal{G}'^k_\mathcal{V}) \le \mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V})$. Substituting into Equation 151, the lemma is proved.

Lemma 3 ((Xu et al., 2020), Theorem 1). Assume $\forall x \in \mathcal{X}, y \in \mathcal{Y}, f \in \mathcal{V}$: $\log f[x](y) \in [-B, B]$ where $B > 0$. Define the function class $\mathcal{G}^\varnothing_\mathcal{V} = \{g \mid g(x, y) = \log f[x](y),\ f \in \mathcal{V}\}$. The definition of $\mathcal{R}_N(\mathcal{G})$ follows Lemma 1. Then for any $\delta \in (0, 0.5)$, with probability at least $1 - 2\delta$, we have
$$\big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big| \le 4\mathcal{R}_{|D|}(\mathcal{G}^\varnothing_\mathcal{V}) + 2B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{153}$$

Finally, we are prepared to prove Theorem 3.

Proof of Theorem 3. We first bound the empirical estimation error by a sum of the terms appearing in Lemmas 1–3:
$$\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y) - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)\big| = \Big|\sup_{E \in \mathcal{E}_K} I_\mathcal{V}(X \to Y \mid E) - I_\mathcal{V}(X \to Y) - \sup_{E \in \mathcal{E}_K} \hat{I}_\mathcal{V}(X \to Y \mid E; D) + \hat{I}_\mathcal{V}(X \to Y; D)\Big|$$
$$\le \sup_{E \in \mathcal{E}_K}\big| I_\mathcal{V}(X \to Y \mid E) - \hat{I}_\mathcal{V}(X \to Y \mid E; D) \big| + \big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big|$$
$$\le \sum_{k=1}^K \sup_{q \in \mathcal{Q}} \big| q(E = e_k)H_\mathcal{V}(Y \mid E = e_k) - q(E = e_k)\hat{H}_\mathcal{V}(Y \mid E = e_k; D) \big| + \sum_{k=1}^K \sup_{q \in \mathcal{Q}} \big| q(E = e_k)H_\mathcal{V}(Y \mid X, E = e_k) - q(E = e_k)\hat{H}_\mathcal{V}(Y \mid X, E = e_k; D) \big| + \big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big|, \tag{154–166}$$
where the empirical terms are evaluated at $\hat{f}_k = \arg\inf_f \frac{1}{|D|}\sum_i -\log f[x_i](y_i)\,q(E = e_k \mid x_i, y_i)$ and $\hat{f}'_k = \arg\inf_f \frac{1}{|D|}\sum_i -\log f[\varnothing](y_i)\,q(E = e_k \mid x_i, y_i)$, for any $q \in \mathcal{Q}$ and $1 \le k \le K$. For simplicity, let
$$\mathrm{Err}_k = \sup_{q \in \mathcal{Q}} \Big| q(E = e_k)H_\mathcal{V}(Y \mid X, E = e_k) - \tfrac{1}{|D|}\textstyle\sum_i -\log \hat{f}_k[x_i](y_i)\,q(E = e_k \mid x_i, y_i) \Big|, \tag{167}$$
$$\mathrm{Err}'_k = \sup_{q \in \mathcal{Q}} \Big| q(E = e_k)H_\mathcal{V}(Y \mid E = e_k) - \tfrac{1}{|D|}\textstyle\sum_i -\log \hat{f}'_k[\varnothing](y_i)\,q(E = e_k \mid x_i, y_i) \Big|, \tag{168}$$
$$\mathrm{Err}_\varnothing = \big| I_\mathcal{V}(X \to Y) - \hat{I}_\mathcal{V}(X \to Y; D) \big|. \tag{169}$$
Then, by Lemmas 1–3 and a union bound, noting that $\mathcal{G}^k_\mathcal{V} = \mathcal{G}_\mathcal{V}$ and $\mathcal{G}^\varnothing_\mathcal{V} \subseteq \mathcal{G}_\mathcal{V}$ (hence $\mathcal{R}_{|D|}(\mathcal{G}^k_\mathcal{V}) \le \mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V})$ and $\mathcal{R}_{|D|}(\mathcal{G}^\varnothing_\mathcal{V}) \le \mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V})$, Equation 172),
$$P\bigg(\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V} - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(D)\big| > 4(K + 1)\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2(K + 1)B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\bigg)$$
$$\le \sum_{k=1}^K P\Big(\mathrm{Err}_k > 2\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\Big) + \sum_{k=1}^K P\Big(\mathrm{Err}'_k > 2\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\Big) + P\Big(\mathrm{Err}_\varnothing > 4\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2B\sqrt{\tfrac{2\log(1/\delta)}{|D|}}\Big) \le 2(K + 1)\delta. \tag{170–177}$$
Therefore, with probability at least $1 - 2(K + 1)\delta$,
$$\big|\mathcal{H}^{\mathcal{E}_K}_\mathcal{V}(X \to Y) - \hat{\mathcal{H}}^{\mathcal{E}_K}_\mathcal{V}(X \to Y; D)\big| \le 4(K + 1)\mathcal{R}_{|D|}(\mathcal{G}_\mathcal{V}) + 2(K + 1)B\sqrt{\frac{2\log(1/\delta)}{|D|}}. \tag{178–179}$$

H PROOF OF THEOREM 4

Proof of Theorem 4. The objective function of our IM algorithm is directly derived from the definition of the empirical predictive heterogeneity in Definition 6. For the regression task, we take the predictive family
$$\mathcal{V}_1 = \{g: g[x] = \mathcal{N}(f_\theta(x), \sigma^2) \mid f \text{ is the regression model, } \theta \text{ is learnable, } \sigma = 1.0 \text{ (fixed)}\}, \tag{180}$$
where we only care about the output of the model; the noise scale of the Gaussian distribution is often ignored, so we simply fix $\sigma = 1.0$. Then for each environment $e \in \mathrm{supp}(E^*)$, $I_\mathcal{V}(X \to Y \mid E = e)$ becomes
$$I_\mathcal{V}(X \to Y \mid E = e) \propto \mathrm{Var}(Y \mid E = e) - \min_\theta \mathbb{E}\big[\|Y - f_\theta(X)\|^2 \mid E = e\big], \tag{181}$$
which corresponds to the MSE loss and the proposed regularizer in Equation 17. For the classification task, the derivation is similar: the regularizer becomes the entropy of $Y$ in sub-population $e$, and the loss function becomes the cross-entropy loss.

I DISCUSSION ON DIFFERENCES FROM SUBGROUP DISCOVERY

Subgroup discovery (SD; Helal, 2016) is aimed at extracting interesting relations among different variables ($X$) with respect to a target variable $Y$.
Coverage and precision of each discovered group are the focus of such methods. To be specific, subgroup discovery learns a partition of $P(X)$ such that some target label $y$ dominates within each group. The most significant difference between subgroup discovery and our predictive heterogeneity lies in the pattern of distributional shift among clusters: for subgroup discovery, $P(X)$ and $P(Y)$ vary across subgroups but there is a universal $P(Y \mid X)$, while for predictive heterogeneity $P(Y \mid X)$ differs across sub-populations, which indicates diversified prediction mechanisms. It is exactly this disparity of prediction mechanisms that inhibits the performance of a single predictive model on a heterogeneous dataset, which is the emphasis of the OOD generalization problem and of group fairness.

We think subgroup discovery is more applicable to settings where the distributional shift is minor while high explainability is required, since it generates simplified rules that people can understand. Subgroup discovery methods are also suited to settings that only involve tabular data (typically from a relational database), where the input features have clear semantics. Our proposed method, in contrast, can deal with general machine learning settings, including complicated data (e.g., image data) that involves representation learning, and is more applicable when data heterogeneity w.r.t. the prediction mechanism exists inside the data. That said, both kinds of methods can be used to help people understand data and make more reasonable decisions.

J DISCUSSION ON THE POTENTIAL FOR FAIRNESS

We find that combining our measure with algorithmic fairness is an interesting and promising direction, and we think our measure has the potential to deal with algorithmic bias. Our method generates sub-populations with possibly different prediction mechanisms, which could help in the following aspects:

Risk feature selection: one could select features according to our predictive heterogeneity measure to see which features bring the largest heterogeneity. If they are sensitive features, people should avoid their effects; if they are not, they could direct people to build better machine learning models.

Examining algorithmic fairness: one could use the learned sub-populations to examine whether a given algorithm is fair by calculating its performance gap across the sub-populations, as sketched below.
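A minimal sketch of such a fairness audit (our illustration; `errors` is any per-sample loss produced by the model under audit, and `env` is a hypothetical IM-learned hard partition):

```python
import numpy as np

def subpopulation_gap(errors, env):
    """Worst minus best mean error across learned sub-populations:
    a simple fairness audit over an IM-style partition."""
    means = np.array([errors[env == e].mean() for e in np.unique(env)])
    return means.max() - means.min()

# Example with a hypothetical partition of 6 samples into 2 sub-populations.
errors = np.array([0.1, 0.2, 0.15, 0.9, 1.1, 0.95])
env = np.array([0, 0, 0, 1, 1, 1])
print(subpopulation_gap(errors, env))   # ~0.83: a large gap flags unfairness
```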