# outofdistribution_optimality_of_invariant_risk_minimization__f3a84c43.pdf

Published in Transactions on Machine Learning Research (01/2024)

Out-of-Distribution Optimality of Invariant Risk Minimization

Shoji Toyota shoji@ism.ac.jp The Institute of Statistical Mathematics

Kenji Fukumizu fukumizu@ism.ac.jp The Institute of Statistical Mathematics

Reviewed on Open Review: https: // openreview. net/ forum? id= p Wsf WDn JDa

Deep Neural Networks often inherit spurious correlations embedded in training data and hence may fail to generalize to unseen domains, which have different distributions from the domain to provide training data. Arjovsky et al. (2019) introduced the concept out-ofdistribution (o.o.d.) risk, which is the maximum risk among all domains, and formulated the issue caused by spurious correlations as a minimization problem of the o.o.d. risk. Invariant Risk Minimization (IRM) is considered to be a promising approach to minimize the o.o.d. risk: IRM estimates a minimum of the o.o.d. risk by solving a bi-level optimization problem. While IRM has attracted considerable attention with empirical success, it comes with few theoretical guarantees. Especially, a solid theoretical guarantee that the bi-level optimization problem gives the minimum of the o.o.d. risk has not yet been established. Aiming at providing a theoretical justification for IRM, this paper rigorously proves that a solution to the bi-level optimization problem minimizes the o.o.d. risk under certain conditions. The result also provides sufficient conditions on distributions providing training data and on a dimension of a feature space for the bi-leveled optimization problem to minimize the o.o.d. risk.

1 Introduction

Training data used in supervised learning may contain features that are spuriously correlated to the response variables of data. Deep Neural Networks (DNNs) often learn such spurious correlations embedded in the data and hence may fail to predict desirable response variables of test data generated by a distribution that is different from the one to provide training data. To list a few examples, in a classification of animal images, models obtained by conventional procedures tend to misclassify cows on sandy beaches because most training pictures are captured in green pastures and DNNs inherit context information in training (Beery et al., 2018; Shane, 2018). Another example is learning from medical data. Systems trained with data collected in one hospital do not generalize well to other hospitals; DNNs unintentionally extract environmental factors specific to a particular hospital in training (Al Badawy et al., 2018; Perone et al., 2019; Heaven, 2020).

Arjovsky et al. (2019) introduced the concept out-of-distribution (o.o.d.) risk to formulate the issue caused by spurious correlations. Let X and Y be measurable spaces of explanatory and response variables respectively. Let E be a set with each element e E called the domain (or environment) e. Assume that for a given domain e E, there corresponds a corresponding random variable (Xe, Y e) that takes values in X Y with its probability law PXe,Y e. Assume we are given training datasets De := {(xe i, ye i )}ne i=1 PXe,Y e i.i.d. from multiple domains Etr E. For a given predictor f : X Y,

Re(f) := Z l(f(x), y)d PXe,Y e

Published in Transactions on Machine Learning Research (01/2024)

denotes the risk of f on domain e. The o.o.d. risk of the predictor f is as follows:

Ro.o.d.(f) := max e E Re(f), (1)

which is the worst-case risk over E including unseen domain E Etr. Arjovsky et al. (2019) formulated the problem caused by spurious correlations as a minimization problem of the o.o.d. risk (1):

min f F Ro.o.d.(f), (2)

where F is the set of all measurable functions f : X Y.

It is difficult to directly solve the o.o.d. risk minimization (2) since we can not evaluate the maximum of risks among all domains E, including unseen domains E Etr, only by data from training domains Etr E. Invariant Risk Minimization (IRM) is a rapidly developing approach to the challenging o.o.d. risk minimization (Arjovsky et al., 2019). Its proposed predictor f := w Φ is composed of two maps: a feature map Φ : X H, which is called an invariance, and a predictor w : H Y which estimates the response variable of feature Φ(x). Here, for a given feature space H, we call a measurable function Φ : X H an invariance when it holds that PY e1|Φ(Xe1) = PY e2|Φ(Xe2) for any e1, e2 E1. Arjovsky et al. (2019) estimated the two maps by solving the bi-leveled optimization problem

minΦ Itr,w W X

e Etr Re(w Φ), (3)

where W is a model of predictors w : H Y and Itr is the set of invariances captured by training domains Etr: Itr := Φ : X H | PY e1|Φ(Xe1) = PY e2|Φ(Xe2) for any e1, e2 Etr . (4)

Influenced by the seminal study, several alternative bi-leveled optimization problems have been proposed (Ahuja et al., 2020; Chang et al., 2020; Ahuja et al., 2021a;b; Lin et al., 2022a; Zhou et al., 2022; Liu et al., 2021a;b; Lu et al., 2022; Koyama & Yamaguchi, 2021; Parascandolo et al., 2022; Krueger et al., 2021; Toyota & Fukumizu, 2022; Lin et al., 2022b; Huh & Baidya, 2022; Rame et al., 2022; Pogodin et al., 2023; Chen et al., 2023; Tan et al., 2023). For example, Ahuja et al. (2020) proposed a novel bi-leveled optimization problem leveraging the principles of game theory. The recently proposed Maximal Invariant Predictor (Koyama & Yamaguchi, 2021) employed a new bi-leveled problem grounded in the concept of information theory.

While IRM is widely recognized as a promising approach for the o.o.d. risk minimization (2), it comes with few theoretical guarantees; especially, a mathematical guarantee that the bi-level optimization problem (3) gives the minimum of the o.o.d. risk (1) has not yet been established.2 The original IRM paper did not mention any theoretical properties for the minimum of (3). Rosenfeld et al. (2021) proved that, assuming that data follow a simple linear Gaussian structural equation model (SEM), a predictor obtained by (3) makes a prediction relying only on a feature of Xe X whose distribution does not depend on domains (Rosenfeld et al., 2021, Section 5). However, their analysis did not focus on relations between the bi-level optimization problem (3) and the o.o.d. risk (2). More recently, Kamath et al. (2021) provided an example of distributions on which a minimum of (3) does not minimize the o.o.d risk (Kamath et al., 2021, Section 4). However, their analysis assumed that data follow particular SEMs constructed to derive the case where (3) does not provide a minimum of the o.o.d. risk; for verifying the o.o.d. performance of the bi-leveled optimization problem (3), it should be analyzed under more general assumptions on distributions.

Aiming at providing a theoretical justification for IRM, this paper rigorously proves that a solution to the bi-leveled optimization problem (3) also minimizes the o.o.d. risk (1); formally speaking, we prove that the

1The definition is based on conditional independence (Peters et al., 2016; Koyama & Yamaguchi, 2021; Rojas-Carulla et al., 2018), while Arjovsky et al. (2019); Ahuja et al. (2020) used a different type of invariances based on arg minw Re(w Φ) instead of PY e|Φ(Xe). Throughout the paper, we argue by adopting the definition based on conditional independence. 2Since it is difficult to solve the bi-leveled optimization problem (3), several papers have proposed optimization methods for (3) such as IRMv1 (Arjovsky et al., 2019) or Invariant Rationalization (Chang et al., 2020). While their optimization ability for solving (3) should also be discussed theoretically, this paper does not address it and only focuses on the problem of whether, assuming that (3) can be solved completely, the resulting predictor minimizes the o.o.d. risk.

Published in Transactions on Machine Learning Research (01/2024)

inclusion arg minΦ Itr,w W X

e Etr Re(w Φ) arg min f F Ro.o.d.(f) (5)

is attained under certain conditions. The result also provides sufficient conditions on the training domains Etr and the feature space H to minimize the o.o.d risk. In our analysis, we set distributions on domains E by the ones proposed in Rojas-Carulla et al. (2018). The distributions do not rely on any specific SEM structures, unlike existing theoretical analysis of IRM (Rosenfeld et al., 2021; Kamath et al., 2021), and they are used for the analysis of methods related to invariances (Rojas-Carulla et al., 2018; Toyota & Fukumizu, 2022).

The rest of the paper is organized as follows. Section 2 illustrates two main theorems. Section 2.1 provides the first main theorem, which states that the inclusion (5) is achieved in the regression case. In Section 2.2, we extend the first theorem to the classification case. The novelty and significance of these two theorems are discussed in Section 2.3. We provide a review of the prior works concerning the relationship between the bi-leveled optimization problem (3) and the o.o.d. risk (1) in Section 3. The two main theorems stated in Section 2 are proved in Section 4. Section 5 is devoted to brief concluding remarks.

2 Main Results

We explain the settings and assumptions persisting throughout our analysis.

We set domains {(Xe, Y e)}e E by the ones proposed in Rojas-Carulla et al. (2018). Let X := X1 X2 where X1 := Rd1 and X2 := Rd2 with d1, d2 N>0, and (XI 1, Y I) be a fixed random variable on X1 Y. Rojas Carulla et al. (2018) defined the domain set E by all the probability distributions with the fixed conditional distribution PY I|XI 1 ; namely, denoting ΦX1 : X X1 a projection onto X1, {(Xe, Y e)}e E is defined by

{(Xe, Y e)}e E := n (X, Y ) : a random variable on X Y PY |ΦX1(X) = PY I|XI 1

Note that, under the setting (6), the projection ΦX1 : X X1 is an invariance among E, because PY |ΦX1(X) = PY I|XI 1 for any (X, Y ) {(Xe, Y e)}e E. For simplicity of theoretical analysis, we assume that the conditional distribution PY I|XI 1 has a probability density function p I(y|x1).

We explain assumptions about the feature space and models. The feature space H for an invariance Φ Itr is assumed to be the multi-dimensional Euclidean space Rd H. Moreover, we assume that Φ Itr and w W in the minimization problem (3) run only continuous functions; namely, we investigate the property of a solution for minΦ IC0 tr ,w WC0 X

e Etr Re(w Φ), (7)

where WC0 is the set of all continuous functions w : H Y , and IC0 tr is the set of continuous invariances captured by a training domain Etr:

IC0 tr := Φ : X H | PY e1|Φ(Xe1) = PY e2|Φ(Xe2) for any e1, e2 Etr, Φ : continuous .

2.1 Case I: Least Square Loss

First, we consider the case where Y = Rd Y (d Y N>0) and l is the least square loss; that is, for a given predictor f : X Y, its risk Re(f) on (Xe, Y e) {(Xe, Y e)}e E is given by

Re(f) := Z y f(x) 2d PXe,Y e.

The following theorem ensures that the optimization problem (7) provides a solution for the o.o.d. risk minimization problem (2) under four conditions:

Theorem 1 (o.o.d. optimality of the bi-leveled optimization problem (7) under least square loss setting). Domains {(Xe, Y e)}e E are assumed to be (6). We also assume that the following four conditions hold:

Published in Transactions on Machine Learning Research (01/2024)

(i) IC0 tr = IC0, where IC0 is the set of continuous invariances captured by all domains E, not training domains Etr:

IC0 := Φ : X H | PY e1|Φ(Xe1) = PY e2|Φ(Xe2) for any e1, e2 E, Φ : continuous .

(ii) S e Etr supp(PΦX1(Xe)) = X1. Here, for probability measure µ on X1, supp(µ) is defined by

supp(µ) := {x1 X1 |Nx1 : open neighborhood around x1 µ(Nx1) > 0}.

(iii) The dimensions d1 and d H on the subspace X1 X of the input space X and the feature space H = Rd H satisfy d1 d H.

(iv) PY I|XI 1 has a continuous probability density function p I(y|x1). Here, we call p I(y|x1) continuous when correspondence X1 Y (x1, y) 7 p I(y|x1) is continuous.

Then, we have arg minΦ IC0 tr ,w WC0 X

e Etr Re(w Φ) arg min f F Ro.o.d.(f). (8)

Here, F is the set of all measurable functions f : X Y.

We explain the feasibilities and interpretations of the above four conditions.

Condition (i): Condition (i) implies that invariances captured by training domains Etr correspond to the ones by all domains E. Arjovsky et al. (2019) also discussed the relationship between the equation Itr = I and o.o.d. generalization, briefly illustrating that the equation Itr = I facilitates the estimation of a predictor with high o.o.d. performance solely based on data from training domains Etr (Arjovsky et al., 2019, Section 4.1). If it holds that Itr = I, we can capture an invariance Φ I among all domains E only using the training domains Etr. Arjovsky et al. (2019) pointed that, once an invariance Φ I among all domains is obtained, a predictor w that minimizes risks only on training domains Etr, namely w arg min W P e Etr Re(w Φ), satisfies Re(w Φ) = minw Re(w Φ) for all domains e E, including unseen domains E Etr, under certain settings. Developing the discussion by Arjovsky et al. (2019), Theorem 1 clarifies a more rigorous relation among the equation Itr = I, the o.o.d. risk (1), and the bi-leveled optimization problem (7): the equation I = Itr is one of the sufficient conditions for the bi-leveled optimization problem (7) to minimize the o.o.d. risk (1).

The condition Itr = I is not generally satisfied and Peters et al. (2016); Arjovsky et al. (2019) presented sufficient conditions on the training domains Etr for the equation Itr = I when data follow simple SEMs. Peters et al. (2016) proved the equation Itr = I holds when distributions on domains follow a linear Gaussian SEM and training data are obtained by certain types of interventions (Peters et al., 2016, Section 4.3). Arjovsky et al. (2019) generalized the result by Peters et al. (2016). Assuming that data follow a linear SEM, which is not restricted to a Gaussian distribution and a certain type of interventions, Arjovsky et al. (2019) deduced a sufficient condition for the equality Itr = I on training domains Etr, which is called lying in the general position (Arjovsky et al., 2019, Assumption 8). On the other hand, sufficient conditions for the equality Itr = I under the setting (6) have not yet been revealed. Providing them would be an important area for future research.

Conditions (ii) and (iii): As shown in Lemma 3, the conditional expectation R y p I(y|x1)dy = E[Y I = y|XI 1 = x1] achieves the minimization of the o.o.d. risk, signifying that the information embedded in X1 is important for predicting response variables on unseen domains. Condition (ii) implies that the support of training domains Etr covers X1 that contains such important information for o.o.d. prediction. Condition (iii) implies that H is such a large feature space that a feature Φ : X H can preserve information on the X1-component of x X by selecting Φ appropriately. Condition (iii) also provides a practical perspective on how to construct the feature space H = Rd H: the dimension d H on the feature space H should be fixed high. The dimension d H of the feature space is fixed by hand, and hence, Condition (iii) is expected to hold unless we fix the dimension of the feature space too small.

Published in Transactions on Machine Learning Research (01/2024)

Condition (iv): Condition (iv) presents continuity of the p.d.f. of PY I|XI 1 . By Condition (iv), we also have continuity of the conditional expectation R y p I(y|x1)dy = E[Y I = y|XI 1 = x1]. In our analysis, we assume that the model WC0 consists of all continuous functions, and hence, Condition (iv) ensures that the model includes the conditional expectation E[Y I|XI 1], which minimizes the o.o.d. risk (Lemma 3).

2.2 Case II: Cross Entropy Loss

Theorem 1 can be easily extended to the classification case where w W has a probabilistic output and evaluate risks by the cross entropy loss. Let Y be a finite set {1, ..., m} (m N>0), and we model w : H Y by pθ : H PY, where PY denotes the set of probabilities on Y; namely

Here, R+ := {x R |x 0} and pi denotes the i-th component of p. We call pθ : H PY continuous, that is pθ WC0, when correspondence H h 7 pθ(h) R|Y| is continuous, seeing pθ(h) PY as a vector on R|Y|. For a given pθ : H PY and i Y, pθ(h)

i is often abbreviated by pθ(i|h). The risk evaluated by the cross-entropy loss is then written as

Re(pθ Φ) = Z log pθ(Y e|Φ(Xe))d PXe,Y e.

We expand Theorem 1 to the above classification case:

Theorem 2 (o.o.d. optimality of the bi-leveled optimization problem (7) under cross-entropy loss setting). Domains {(Xe, Y e)}e E are assumed to be (6). Assume that, in addition to (i) (iii) in Theorem 1, the following condition (v) holds:

(v) For any x 1 X1, # y Y p I(y|x 1) > 0 > 1.

Then, we have the inclusion

arg minΦ IC0 tr ,pθ WC0 X

e Etr Re(pθ Φ) arg min pθ F

Ro.o.d.(f), (9)

where F is the set of all measurable functions f : X PY.3

Condition (v) indicates that domains {(Xe, Y e)}e E have high uncertainty in labels y Y given x1 X1. The condition is expected to be feasible when classes Y are subdivided and difficult to be uniquely determined from x1 X1.

2.3 Novelty and Significance of Theorems 1 and 2

Theorems 1 and 2 and their proofs have the following four novel and significant points:

Setting of Domains The first point is the setting of domains. The setting by Rojas-Carulla et al. (2018), which is used throughout our analysis, does not impose any specific SEM structures, linearity, and Gaussianity on domains while existing works on theoretical analysis of IRM assumed that data follow simple SEMs. Theorems 1 and 2 indicate that, under such a general setting, IRM presents the minimum of the o.o.d. risk. This implies that our results provide a solid foundation to use IRM for a broad range of o.o.d. generalization problem.

3The same as the definition of continuous, we call f F measurable when correspondence X x 7 f(x) R|Y| is measurable, seeing f(x) PY as a vector on R|Y|.

Published in Transactions on Machine Learning Research (01/2024)

Assumption on the Underlying Distribution PY I|XI 1 As well as the assumption on domains, prior theoretical results of IRM assume that the underlying true distribution PY I|XI 1 is represented by a simple SEM (Arjovsky et al., 2019; Rosenfeld et al., 2021; Kamath et al., 2021). On the other hand, Condition (iv) only imposes PY I|XI 1 on the continuity; hence, Condition (iv) is a significantly mild condition in comparison with the assumptions on PY I|XI 1 by prior works.

Characterization of Invariance Second, a theoretical characterization of invariances Φ IC0 is given in Lemmas 4 and 6: it is proved that Φ IC0 can be represented as Φ = Ψ ΦX1 for some continuous map Ψ . Any theoretical characterizations have not yet been presented, and hence, the results in Lemmas 4 and 6 are novel. To present the non-trivial characterization, we develop a novel theoretical technique based on the proof by contradiction. Lemmas 4 and 6 play an important role in our desirable assertion (5), and hence, the derivation of these lemmas is a significant technical contribution of our analysis.

Range of Invariance The fourth point is a range of invariances Φ: we assume that Φ run all continuous functions, while most of the existing works on theoretical analysis of IRM assume that Φ run more simplified functions, such as linear functions (Rosenfeld et al., 2021) or variable selections (Toyota & Fukumizu, 2022). It is common to construct a learning model of invariances with deep neural networks in the context of IRM, and hence, the variable selection and linear function settings by Toyota & Fukumizu (2022); Rosenfeld et al. (2021) are significantly simplified to analyze IRM. On the other hand, our large class of continuous functions is relatively realistic compared to existing ones, since it is widely recognized that neural networks of sufficient size can represent a wide range of functions (Cybenko, 1989; Hornik et al., 1989; Barron, 1993; Mhaskar, 1996; Sonoda & Murata, 2017).

3 Previous Works

As explained in Section 1, Rosenfeld et al. (2021); Kamath et al. (2021) derived the theoretical results concerning the minimum of the bi-leveled optimization problem (3) and its connection to the o.o.d. risk (1). Rosenfeld et al. (2021) proved that a predictor obtained by minimizing (3) predicts Y e Y relying only on a feature of Xe X whose distribution does not depend on domains (Rosenfeld et al., 2021, Section 5). However, they did not provide any connections between the minimum of (3) and the o.o.d risk. Moreover, they assume that data follow a linear Gaussian SEM, and that invariances Φ in the bi-leveled optimization problem run linear functions for simplicity. Unlike their analysis, this paper derives the direct relations between (3) and the o.o.d. risk (1). Additionally, we assume that data follow the distributions proposed by Rojas-Carulla et al. (2018) that do not rely on any specific SEM structures and that invariances run all continuous functions including neural networks. Kamath et al. (2021) provided an example of distributions on which a minimum of (3) does not minimize the o.o.d risk. However, the distributions are particular SEMs constructed to derive the case where (5) is violated, and analysis in more general settings is required (Kamath et al., 2021, Section 4). In construct, the distributions by Rojas-Carulla et al. (2018) used in this paper do not rely on any specific SEM structures, and they are used to analyze estimation methods related to invariances (Rojas-Carulla et al., 2018; Toyota & Fukumizu, 2022).

Arjovsky et al. (2019); Koyama & Yamaguchi (2021); Rojas-Carulla et al. (2018) discussed theoretical relations between invariances and the o.o.d. risk (1). As explained in the last section, Arjovsky et al. (2019) intuitively explained that the condition Itr = I facilitates an estimation of a predictor which can predict Y e

on unseen domains only by data from training domains Etr (Arjovsky et al., 2019, Section 4.1). They also derived sufficient conditions on training domains for the equation Itr = I, assuming that data follow a simple linear SEM (Arjovsky et al., 2019, Theorem 9). Koyama & Yamaguchi (2021); Rojas-Carulla et al. (2018) presented sufficient conditions for an invariance Φ to achieve the minimum of (1). Koyama & Yamaguchi (2021) proved that the invariance that maximizes the mutual information with labels also maximizes the o.o.d. risk. Rojas-Carulla et al. (2018) proved that, under the domain setting (6), the conditional expectation E[Y e|ΦX1(Xe) = x1] also minimizes the o.o.d. risk, even when E[Y e|ΦX1(Xe)] is nonlinear. However, all the results by Koyama & Yamaguchi (2021); Rojas-Carulla et al. (2018); Arjovsky et al. (2019) did not deal with any theoretical connections between invariances obtained by minimizing the bi-leveled optimization problem (3) and the o.o.d. risk (1). It does not follow obviously that the minimum of (3) satisfies these sufficient

Published in Transactions on Machine Learning Research (01/2024)

conditions by Koyama & Yamaguchi (2021); Rojas-Carulla et al. (2018), and hence our main theorems can not be deduced as a trivial corollary of the results by Koyama & Yamaguchi (2021); Rojas-Carulla et al. (2018). To discuss the non-trivial relation between the bi-leveled optimization problem (3) and the o.o.d. risk (1), we establish a novel characterization of invariances (Lemmas 4 and 6), and derive the main theorems based on it.

To reduce the annotation cost required for the original IRM approach, Toyota & Fukumizu (2022) introduced a new bi-level optimization problem similar to (3). They considered a situation in which the training data for target classification are provided in only one domain, while the task of a higher label hierarchy, which requires lower annotation cost, has data from multiple domains. Under the availability of data, they deduced a bi-level optimization problem, in which invariances were given by additional data in a higher label hierarchy. For further details, we refer the reader to the original paper Toyota & Fukumizu (2022). Their study provided a detailed theoretical analysis concerning their method and its connection to the o.o.d. risk; however, they did not analyze relationships between their bi-leveled optimization problem and the o.o.d. risk, which is the focus of this paper. Instead, they investigated relationships between an optimization method for their bi-level optimization problem and the o.o.d. risk. Moreover, they assume that invariances Φ run all variable selections for simplicity of theoretical analysis. On the other hand, this paper derives the direct relations between the minimum of the bi-leveled optimization problem and the o.o.d. risk (1). Moreover, we consider the more realistic setting for the analysis of IRM where invariances Φ run all continuous functions.

In this section, we prove Theorem 1 and 2. Through the section, for Xe X and x X, its Xi-components (i = 1, 2) are denoted by Xe i and xi respectively.

4.1 Proof Sketch of Main Theorems

Before giving rigorous proof, we briefly describe the rough proof sketch of the main theorems. The following two lemmas (A) and (B) play an important role in our proof:

(A) The conditional expectation E[Y e|Xe 1] = E[Y I|XI 1] and conditional probability PY e|Xe 1 = PY I|XI 1 minimize the o.o.d. risk under the least-square and cross-entropy losses respectively (Lemmas 3 and 5).

(B) Any invariance among all domains can be represented by the composition of the projection onto X1; that is, Φ IC0 can be represented as Φ = Ψ ΦX1

for some continuous map Ψ (Lemmas 4 and 6).

The two lemmas intuitively conclude the proof of the main theorem as follows. Firstly, since IC0 tr = IC0 holds (Condition (i)), observe that a predictor in (3) runs composition maps w Φ with w WC0 and Φ IC0. Moreover, since the above second lemma (B) ensures that Φ IC0 can be represented by the composition of the projection onto X1, we can see that a predictor in (3) runs w ΦX1 for some function class w W , and hence, the bi-leveled optimization problem is expressed as

e Etr Re(w ΦX1). (10)

It is well-known that, assuming that w runs all measurable functions,

ˆw arg min w Re(w ΦX1) ˆw(x1) = E[Y e|Xe 1 = x1] = E[Y I|XI 1 = x1] PXe 1 almost everywhere.

or ˆw arg min w Re(w ΦX1) ˆw(x1) = PY e|Xe 1=x1 = PY I|XI 1 =x1 PXe 1 almost everywhere.

Published in Transactions on Machine Learning Research (01/2024)

hold for any e E under the least-square and cross-entropy losses respectively (Christmann & Steinwart, 2008, Example 2.6). Hence, ignoring the capability of W and the discussion of almost everywhere, we can see that E[Y I|XI 1 = x1] arg minw W X

e Etr Re(w ΦX1) (11)

or PY I|XI 1 =x1 arg minw W X

e Etr Re(w ΦX1) (12)

hold. Combining eq.s (11), (12) and the first lemma (A), it concludes the main theorems intuitively. In the following section, we give the rigorous justification of the above rough proof sketch.

4.2 Proof of Theorem 1

To prove the main theorem, we prepare two lemmas.

Lemma 3. Let w I : X1 Y be the conditional expectation obtained by p I(y|x1); namely,

w I(x1) = E[Y I|XI 1 = x1] := Z y p I(y|x1)dy.

Then, w I ΦX1 arg min f:X Y Ro.o.d.(f).

Lemma 4. Any Φ IC0 tr is represented as

for some continuous map Ψ : X1 H.

Proof of Lemma 3 4 It suffices to prove the following statement:

For any f F and (Xa, Y a) {(Xe, Y e)}e E, there exists (Xb, Y b) {(Xe, Y e)}e E such that

Z w I ΦX1(x) y 2d PXa,Y a(x, y) Z f(x) y 2d PXb,Y b(x, y). (13)

Take arbitrary f F and (Xa, Y a) {(Xe, Y e)}e E. Define (Xb, Y b) {(Xe, Y e)}e E such that its distribution is the direct product PXa 1 ,Y a PX2, where PXa 1 ,Y a is the marginal distribution of PXa,Y a on X1 Y and PX2 is an arbitrary distribution on X2.

Then, the right-hand side of the inequality (13) is given by

Z f(x) y 2d PXb,Y b(x, y) = Z f(x) y 2d(PXa 1 ,Y a PX2)(x, y)

= Z PX2(x2) Z f(x1, x2) y 2d PXa 1 ,Y a(x1, y).

Clearly, for any x 2 X2, the inequality

Z f(x1, x 2) y 2d PXa 1 ,Y a(x1, y) Z E[Y e1|Xa 1 = x1] y 2d PXa 1 ,Y a(x1, y)

4The proof is essentially the same as the one for Theorem 1 in Rojas-Carulla et al. (2018) and Theorem 6 in Toyota & Fukumizu (2022).

Published in Transactions on Machine Learning Research (01/2024)

holds, because the minimum of a risk on the least square loss is attained at the conditional expectation E[Y a|Xa 1 ]. Hence, we obtain Z f(x) y 2d PXb,Y b(x, y) = Z PX2(x2) Z f(x1, x2) y 2d PXa 1 ,Y a(x1, y)

Z PX2(x2) Z E[Y e1|Xa 1 = x1] y 2d PXa 1 ,Y a(x1, y)

= Z E[Y a|Xa 1 = x1] y 2d PXa 1 ,Y a(x1, y)

= Z PXa 2 |Xa 1 ,Y a(x2) Z E[Y a|Xa 1 = x1] y 2d PXa 1 ,Y a(x1, y)

= Z E[Y a|Xa 1 = ΦX1(x)] y 2d PXa,Y a(x, y)

= Z w I ΦX1(x) y 2d PXa,Y a(x, y),

which concludes the proof. Here, the last equality is derived from the fact that the conditional expectation E[Y e|Xe 1 = ΦX1(x)] does not depend on e E and corresponds to w I ΦX1.

Proof sketch of Lemma 4 Before providing a complete proof, we show a proof sketch of Lemma 4 to make the flow of our proof easier to understand. First, we prove that Φ IC0 tr can be represented as

Φ = Ψ ΦX1 (14)

by some map Ψ : X1 H, which is not restricted to a continuous map. Take arbitrary Φ IC0 tr . Then, since Φ IC0 = IC0 tr (Condition (i)), for any (Xa, Y a), (Xb, Y b) {(Xe, Y e)}e E,

PY a|Φ(Xa) = PY b|Φ(Xb),

and therefore, we have PY a|Φ(Xa)(N|Φ(x)) = PY b|Φ(Xb)(N|Φ(x)) (15)

for any set N Y and x X. We prove the statement (14) by contradiction. Assume that there exist no maps Ψ that satisfy (14). Then, there exist x 1 X1, x 2, x 2 X2 such that

Φ(x 1, x 2) = Φ(x 1, x 2 ).5

By utilizing x 1 X1, x 2, x 2 X2, we can construct (Xa, Y a), (Xb, Y b) {(Xe, Y e)}e E and N Y which satisfy PY a|Φ(Xa)(N|Φ(x 1, x 2)) = PY b|Φ(Xb)(N|Φ(x 1, x 2)).

This contradicts the assumption (15), and we can conclude Φ IC0 tr can be represented as (14). The continuity of Ψ is easily derived from the continuity of Φ, and we can conclude the proof.

Proof of Lemma 4 First, we prove that Φ IC0 tr can be represented as

Φ = Ψ ΦX1 (16)

by some map Ψ : X1 H, which is not restricted to a continuous map. We prove this statement by contradiction. Take Φ IC0 tr and assume that there exist no maps Ψ which satisfy (16). Then, there exist x 1 X1, x 2, x 2 X2 such that Φ(x 1, x 2) = Φ(x 1, x 2 ). (17)

5If Φ(x 1, x 2) = Φ(x 1, x 2 ) for any x 1 X1, x 2, x 2 X2, Φ depend only on the first component X1; hence, we can see that

Φ IC0 tr can be represented as Φ = Ψ ΦX1 by some map Ψ, which contradicts to the assumption.

Published in Transactions on Machine Learning Research (01/2024)

supp(𝑃𝑋𝑎,𝑌𝑎)

supp(𝑃𝑋𝑏,𝑌𝑏)

Figure 1: Supports of probability distributions PXa,Y a and PXb,Y b. The figure implies that PXa,Y a(Ny Φ 1(Φ )) = 0 and PXb,Y b(Ny Φ 1(Φ )) = 0 ( (x 1, x 2 ) / Φ 1(Φ ) (17)), and that PXa,Y a(Φ 1(Φ )) = 0 and PXb,Y b(Φ 1(Φ )) = 0. These Eqs. lead us PY a|Φ(Xa)(Ny |Φ ) = 0 = PY b|Φ(Xb)(Ny |Φ ).

Fix y Y with p I(y |x 1) > 0 and take an open neighborhood Ny Y centered at y which satisfies

Ny p I(y|x 1)dy < 1.

Here, the existence of Ny is derived from the continuity of p I( |x 1) (Condition (iv)).

Define two maps gi : Y X2 (i = 1, 2) by

g1(y) = x 2 (y Ny ) x 2 ( else ) aaaaaa g2(y) = x 2 (y Ny ) x 2 ( else ).

Take two distributions (Xa, Y a), (Xb, Y b) {(Xe, Y e)}e E such that their distributions PXa,Y a and PXb,Y b coincide with

PXa,Y a = PXa 2 |Y a PY I|XI 1 PX1, PXb,Y b = PXb 2|Y b PY I|XI 1 PX1.

PX1 is a distribution on X1 where its p.d.f. coincides with a delta function δx 1(x1) on x 1,

the conditional p.d.f.s of PXa 2 |Y a( |y) and PXb 2|Y b( |y) coincide with δg1(y)(x2) and δg2(y)(x2) respectively.

The supports of PXa,Y a and PXb,Y b are visualized in Fig. 1. As Φ IC0 = IC0 tr (Condition (i)) and (Xa, Y a), (Xb, Y b) {(Xe, Y e)}e E,

PY a|Φ(Xa)(Ny |Φ ) = PY b|Φ(Xb)(Ny |Φ ), (18)

where Φ := Φ(x 1, x 2). Let us compute PY a|Φ(Xa)(Ny |Φ ) and PY b|Φ(Xb)(Ny |Φ ) to derive PY a|Φ(Xa)(Ny |Φ ) = PY b|Φ(Xb)(Ny |Φ ), which contradicts to the equality (18)6. We evaluate

PY a|Φ(Xa)(Ny |Φ ) = PΦ(Xa),Y a({Φ } Ny )

PΦ(Xa)({Φ }) = PXa,Y a(Φ 1(Φ ) Ny )

PXa(Φ 1(Φ ))

6Fig. 1 illustrates the intuitive reason why PY a|Φ(Xa)(Ny |Φ ) = PY b|Φ(Xb)(Ny |Φ ) is derived, and hence, will help us understand the following rigorous proof.

Published in Transactions on Machine Learning Research (01/2024)

by computing its numerator and denominator separately. First, the numerator is evaluated as

PXa,Y a(Φ 1(Φ ) Ny ) = Z

Φ 1(Φ ) Ny δg1(y)(x2) p I(y|x1) δx 1(x1)dxdy

Φ 1(Φ ) δg1(y)(x2) p I(y|x1) δx 1(x1)dx.

Noting that g1(y) = x 2 for y Ny and δx 1(x1) δx 2(x2) = δ(x 1,x 2)(x1, x2), we obtain Z

Φ 1(Φ ) δg1(y)(x2) p I(y|x1) δx 1(x1)dx = Z

Φ 1(Φ ) δx 2(x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δ(x 1,x 2)(x1, x2) p I(y|x1)dx.

Since (x 1, x 2) Φ 1(Φ ), we have Z

Φ 1(Φ ) δ(x 1,x 2)(x1, x2) p I(y|x1)dx = Z

Ny p I(y|x 1)dy,

which leads us to the equality

PXa,Y a(Φ 1(Φ ) Ny ) = Z

Ny p I(y|x 1)dy.

Next, let us evaluate the denominator PXa(Φ 1(Φ )).

PXa(Φ 1(Φ )) = PXa,Y a(Φ 1(Φ ) Y) = Z

Φ 1(Φ ) Y δg1(y)(x2) p I(y|x1) δx 1(x1)dxdy

Φ 1(Φ ) δg1(y)(x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δg1(y)(x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δg1(y) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δx 2(x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δx 2 (x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δ(x 1,x 2)(x1, x2) p I(y|x1)dx

Φ 1(Φ ) δ(x 1,x 2 )(x1, x2) p I(y|x1)dx.

Here, the fourth equality is derived from the facts that g1(y) = x 2 for y Ny and g1(y) = x 2 for y Y Ny . Noting that (x 1, x 2) Φ 1(Φ ) and (x 1, x 2 ) / Φ 1(Φ ), we obtain

PXa(Φ 1(Φ )) = Z

Φ 1(Φ ) δ(x 1,x 2)(x1, x2) p I(y|x1)dx

Φ 1(Φ ) δ(x 1,x 2 )(x1, x2) p I(y|x1)dx

Ny dy p I(y|x 1) + Z

Ny p I(y|x 1)dy.

Published in Transactions on Machine Learning Research (01/2024)

Hence, we obtain

PY a|Φ(Xa)(Ny |Φ ) = PXa,Y a(Φ 1(Φ ) Ny )

PXa(Φ 1(Φ ))

Ny p I(y|x 1)dy R

Ny p I(y|x 1)dy = 1.

Next, let us evaluate

PY b|Φ(Xb)(Ny |Φ ) = PΦ(Xb),Y b({Φ } Ny )

PΦ(Xb)({Φ }) = PXb,Y b(Φ 1(Φ ) Ny )

PXb(Φ 1(Φ )) .

The numerator is evaluated as

PXb,Y b(Φ 1(Φ ) Ny ) = Z

Φ 1(Φ ) Ny δg2(y)(x2) p I(y|x1) δx 1(x1)dxdy

Φ 1(Φ ) δg2(y)(x2) p I(y|x1) δx 1(x1)dx.

Noting that g2(y) = x 2 for y Ny , we obtain

Φ 1(Φ ) δg2(y)(x2) p I(y|x1) δx 1(x1)dx = Z

Φ 1(Φ ) δx 2 (x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δ(x 1,x 2 )(x1, x2) p I(y|x1)dx

Ny dy 0 = 0.

Here, the third equality is derived from (x 1, x 2 ) / Φ 1(Φ ). Next, the denominator PXb(Φ 1(Φ )) is evaluated as

PXb(Φ 1(Φ )) = PXb,Y b(Φ 1(Φ ) Y) = Z

Φ 1(Φ ) Y δg2(y)(x2) p I(y|x1) δx 1(x1)dxdy

Φ 1(Φ ) δg2(y)(x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δg2(y)(x2) p I(y|x1) δx 1(x1)dx+

Φ 1(Φ ) δg2(y) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δx 2 (x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δx 2(x2) p I(y|x1) δx 1(x1)dx

Φ 1(Φ ) δ(x 1,x 2 )(x1, x2) p I(y|x1)dx

Φ 1(Φ ) δ(x 1,x 2)(x1, x2) p I(y|x1)dx.

Published in Transactions on Machine Learning Research (01/2024)

Noting that (x 1, x 2) Φ 1(Φ ) and (x 1, x 2 ) / Φ 1(Φ ), we obtain

PXb(Φ 1(Φ )) = Z

Φ 1(Φ ) δ(x 1,x 2 )(x1, x2) p I(y|x1)dx

Φ 1(Φ ) δ(x 1,x 2)(x1, x2) p I(y|x1)dx

Ny dy 0 + Z

Y Ny dy p I(y|x 1)

Y Ny p I(y|x 1)dy = 0.

Hence, we obtain

PY b|Φ(Xb)(Ny |Φ ) = PXb,Y b(Φ 1(Φ ) Ny )

PXb(Φ 1(Φ ))

Y Ny p I(y|x 1) = 0.

Combing these results, we obtain

PY a|Φ(Xa)(Ny |Φ ) = 1 = 0 = PY b|Φ(Xb)(Ny |Φ ),

which contradicts the assumption

PY a|Φ(Xa) = PY b|Φ(Xb).

Because the continuity of Ψ is trivial, we can conclude the proof.

Finally, we prove Theorem 1.

Proof of Theorem 1 Take

f arg minΦ IC0 tr ,w WC0 X

e Etr Re(w Φ). (19)

Then, by Lemma 4, we can represent f as

for some continuous map w : X1 Y7. Let us prove that w ΦX1 arg minf:X Y Ro.o.d.(f) by contradiction; assuming that w ΦX1 / arg minf Ro.o.d.(f), we will derive w ΦX1 / arg minΦ IC0 tr ,w WC0 P e Etr Re(w Φ), which contradicts to (19). We prove it by the following three steps.

Step 1 First, we prove that there exist a training domain e Etr and an open set N1 X1 which satisfy

w (x1) = w I(x1) for x1 N1 (20)

with PXe 1 (N1) > 0. As w I ΦX1 minimizes the o.o.d. risk (Lemma 3), we have

Ro.o.d.(w ΦX1) > min f:X Y Ro.o.d.(f) = Ro.o.d.(w I ΦX1).

7By Lemma 4, f can be represented by f = w Ψ ΦX1 for Ψ : X1 H and w : H Y. Replacing w Ψ by w , we can obtain the desirable statement.

Published in Transactions on Machine Learning Research (01/2024)

Here, the first inequality is derived from the assumption of a proof by contradiction. Noting that Ro.o.d. is maximum of risk among {(Xe, Y e)}, there exists (Xe , Y e ) {(Xe, Y e)}e E such that

Re (w ΦX1) = Z y w (x1) 2d PXe 1 ,Y e

> Re (w I ΦX1) = Z y w I(x1) 2d PXe 1 ,Y e (21)

holds8. Since (21) is rewritten as Z y w (x1) 2 y w I(x1) 2 d PXe 1 ,Y e > 0,

we can see that y w (x 1) 2 y w I(x 1) 2 > 0

for some (x 1, y ) X1 Y. Since w and w I are continuous, taking sufficiently small ε > 0, we have

y w (x1) 2 y w I(x1) 2 > 0 for x1 N ε x 1, (22)

where N ε x 1 is the ε-ball centered at x 1. Here, the continuity of w I is derived from Condition (iv) in Theorem 1. Moreover, (22) leads us to the statement

w (x1) = w I(x1) for x1 N ε x 1.

By the condition (ii), N ε x 1 T supp(PXe 1 ) = for some e Etr. Take

x 1 N ε x 1 \ supp(PXe 1 ) def of supp ===

N ε x 1 \ n x1 X1 Nx1 : open neighborhood around x1 (PXe 1 )(Nx1) > 0 o = .

Replacing x 1 , if necessary, we may assume that

x 1 N ε x 1 \ n x1 X1 Nx1 : open neighborhood around x1 (PXe 1 )(Nx1) > 0 o . (23)

Take an open set N1 N ε x 1 which includes x 1 . Then, we have

w (x1) = w I(x1) for x1 N1. (24)

Observing that

x 1 {x1 X1 |Nx1 : open neighborhood with x1 Nx1 (PXe )(Nx1) > 0} ,

we have PXe 1 (N1) > 0. It concludes the proof of Step 1.

Step 2 Next, we prove the inequality X

e Etr Re(w ΦX1) > X

e Etr Re(w I ΦX1). (25)

To derive the inequality, note that

ˆw arg min w Re(w ΦX1) ˆw(x1) = w I(x1) PXe 1 a.e.,

8Note that e is not necessarily included in training domains Etr. The inequality (21) for some training domain e Etr are proved in Step 2 (eq. (29)).

Published in Transactions on Machine Learning Research (01/2024)

or equivalently,

ˆw arg min w Re(w ΦX1) PXe 1 ( x1 X1 ˆw(x1) = w I(x1) ) = 0 (26)

holds for any e E (Christmann & Steinwart, 2008, Example 2.6). Taking the contraposition of the implication from the left to right propositions in (26), we have

ˆw satisfies PXe 1 ( x1 X1 ˆw(x1) = w I(x1) ) > 0 ˆw / arg min w Re(w ΦX1). (27)

From (20), we have the inequality

PXe 1 ( x1 X1 w (x1) = w I(x1) ) > PXe 1 (N1) > 0 (28)

for some e Etr and an open set N1 X1. (27) and (28) lead us to statement w / arg minw Re (w ΦX1), and hence, we have the inequality

Re (w ΦX1) > min w Re (w ΦX1) = Re (w I ΦX1). (29)

Moreover, since the conditional expectation w I minimizes the risk, we have

Re(w ΦX1) Re(w I ΦX1) (30)

for any e E. (29) and (30) lead us to the inequality X

e Etr Re(w ΦX1) = Re (w ΦX1) + X

e Etr {e } Re(w ΦX1)

(29) > Re (w I ΦX1) + X

e Etr {e } Re(w ΦX1)

(30) Re (w I ΦX1) + X

e Etr {e } Re(w I ΦX1) = X

e Etr Re(w I ΦX1).

Step 3 Finally, we prove w ΦX1 / arg minΦ IC0 tr ,w WC0 P

e Etr Re(w Φ), which contradicts to (19). By

the inequality (25) proved in Step 2, it suffices to prove that there exist Φ IC0 tr and w WC0 such that w I ΦX1 = w Φ . Define Φ = Ψ ΦX1 where the embedding Ψ : X1 (= Rd1) H (= Rd H) is defined by

Here, we can define the embedding Ψ since d1 d H (Condition (iii)). Noting that PY e|Φ (Xe) = PY e|ΦX1(Xe) = PY I|XI 1 for any e E, we can see that Φ IC0 tr . Defining

xd+1 ... xh

w 7 E[Y I|XI 1 =

we can see that w I ΦX1 = w Φ . Observing w WC0 by Condition (iv), we can concludes w ΦX1 / arg minΦ IC0 tr ,w WC0 P

e Etr Re(w Φ), which contradicts to (19).

Published in Transactions on Machine Learning Research (01/2024)

4.3 Proof of Theorem 2

We prepare two lemmas. Lemma 5. Let p I : X1 PY be the conditional p.d.f. of PY I|XI 1 ; namely, p I(x)

i := p I(i|x).

Then, p I ΦX1 arg min f:X PY Ro.o.d.(f).

Lemma 6. Any Φ IC0 tr is represented as

for some continuous map Ψ : X1 H.

Proof of Lemma 5 The proof is essentially the same as the ones for Lemma 3; hence, we omit the proof.

Proof of Lemma 6 First, we prove that Φ IC0 tr can be represented as

Φ = Ψ ΦX1 (31)

by some map Ψ : X1 X1, which is not restricted to a continuous map. We prove this statement by contradiction in the same manner as the proof in Lemma 4. Take Φ IC0 tr . Then, there exist x 1 X1, x 2, x 2 X2 such that Φ(x 1, x 2) = Φ(x 1, x 2 ).

Fix y Y with p I(y |x 1) > 0. Define two maps gi : Y X2 (i = 1, 2) by

g1(y) = x 2 (y = y ) x 2 ( else ) aaaaaa g2(y) = x 2 (y = y ) x 2 ( else )

Take two distributions (Xa, Y a), (Xb, Y b) {(Xe, Y e)}e E such that their distributions PXa,Y a and PXb,Y b coincide with

PXa,Y a = PXa 2 |Y a PY I|XI 1 PX1, PXb,Y b = PXb 2|Y b PY I|XI 1 PX1.

PX1 is a distribution on X1 where its p.d.f. coincides with a delta function δx 1(x1) on x 1,

the conditional p.d.f.s of PXa 2 |Y a( |y) and PXb 2|Y b( |y) coincide with δg1(y)(x2) and δg2(y)(x2) respectively.

Since Φ IC0 = IC0 tr (Condition (i)) and (Xa, Y a), (Xb, Y b) {(Xe, Y e)}e E,

PY a|Φ(Xa)({y }|Φ ) = PY b|Φ(Xb)({y }|Φ ), (32)

where Φ := Φ(x 1, x 2). Let us compute PY a|Φ(Xa)({y }|Φ ) and PY b|Φ(Xb)({y }|Φ ), respectively. Same as the proof in Lemma 4, we have the two equalities

PΦ(Xa),Y a({Φ } {y }) = p I(y |x 1) and PΦ(Xa)({Φ }) = p I(y |x 1),

which lead us to the equality

PY a|Φ(Xa)({y }|Φ ) = PΦ(Xa),Y a({Φ } {y })

PΦ(Xa)({Φ })

= p I(y |x 1) p I(y |x 1) = 1.

Published in Transactions on Machine Learning Research (01/2024)

Similarly, we have

PY b|Φ(Xb)({y }|Φ ) = 0 and PΦ(Xb)({Φ }) = X

y Y {y } p I(y|x 1) = 0,

which lead us to the equality

PY b|Φ(Xb)({y }|Φ ) = PΦ(Xb),Y b({Φ } {y })

PΦ(Xb)({Φ })

y Y {y } p I(y|x 1) = 0.

y Y {y } p I(y|x 1) = 0 is derived by Condition (v). Combing these results, we obtain

PY a|Φ(Xa)({y }|Φ ) = 1 = 0 = PY b|Φ(Xb)({y }|Φ ),

which contradicts the assumption

PY a|Φ(Xa) = PY b|Φ(Xb).

Because the continuity of Ψ is trivial, we can conclude the proof.

Proof of Theorem 2 This is essentially the same as the one for Theorem 1, and hence, we omit the proof.

5 Conclusions

In this paper, we have proved that a solution for the bi-leveled optimization problem (3) also minimizes o.o.d. risk (2) under four conditions in regression and classification cases, assuming that distributions on domains are the ones proposed in Rojas-Carulla et al. (2018) and that models run all continuous functions. Particularly, we have provided a sufficient condition on the training domains Etr and the dimension of the feature space H for the optimization problem (3) to minimize the o.o.d. risk.

Several challenges still exist. The first problem is the theoretical analysis of the optimization method for (3). To solve the challenging optimization problem (3), various optimization techniques have been proposed (Arjovsky et al., 2019; Lin et al., 2022a; Zhou et al., 2022), and there has been little discussion about their effectiveness. For example, while Arjovsky et al. (2019) optimized (3) by minimizing X

e Etr Re(Φ) + λ w|w=1.0Re(w Φ) 2,

their effectiveness was evaluated only under specific SEMs (Rosenfeld et al., 2021). Thus, it is important to investigate this analysis under a more general case.

Second, we should evaluate the o.o.d. performance of the bi-leveled optimization problem (3) under the case where the conditions in Theorems 1 and 2 are violated. Particularly, as noted in Section 2, condition Itr = I does not generally hold. In such cases, for (Φ , w ) arg minΦ IC0 tr ,w WC0 P

e Etr Re(w Φ),

Ro.o.d(w Φ ) min Ro.o.d(f)

is not necessarily 0. The quantitative evaluation of the difference is crucial for future work.

Thirdly, we should investigate the feasibility of the condition Itr = I, which is known to be an important and unsolved problem shared by all invariance-based methods (Arjovsky et al., 2019; Peters et al., 2016; Toyota & Fukumizu, 2022). As Condition (i) in our main results, all methods based on invariances implicitly or explicitly assume that invariances among training domains correspond to ones among all domains. As discussed in Section 2, some sufficient conditions under a simple linear SEM setting have been found (Peters

Published in Transactions on Machine Learning Research (01/2024)

et al., 2016; Arjovsky et al., 2019), but general theoretical results have not yet been established. This is also among our unsolved problems, and should be provided in further work.

Finally, extending our results to general domain sets beyond the case by Rojas-Carulla et al. (2018) is an important topic for future work. Invariant Risk Minimization (IRM) estimates the feature map Φ that has the same conditional distribution PY e|Φ(Xe) among all domains e E; in other words, IRM framework assumes that a domain set E has a feature map Φ such that PY e|Φ(Xe) are equal among all domains. Among domain sets that satisfy the property, the domain set by Rojas-Carulla et al. (2018) is the simplest one; the projection ΦX1 induces the same conditional independence PY e|ΦX1(Xe). In some cases, a map that induces the same conditional distribution is a more complex function than the projection ΦX1, so the relation between (3) and the o.o.d. risk on such general domains beyond the case by Rojas-Carulla et al. (2018) is should be investigated.

Acknowledgements

We thank Dr. Yano in the Institute of Statistical Mathematics for valuable discussions. The research was supported by Grant-in-Aid for JSPS Fellows 20J21396, Grant-in-Aid for Research Activity Start-up 23K19966, JST CREST JPMJCR2015, and JSPS Grant-in-Aid for Transformative Research Areas (A) 22H05106.

Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant Risk Minimization Games. In Proceedings of the 37th International Conference on Machine Learning, pp. 145 155, 2020.

Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization. In Advances in Neural Information Processing Systems, 2021a.

Kartik Ahuja, Karthikeyan Shanmugam, and Amit Dhurandhar. Linear Regression Games: Convergence Guarantees to Approximate Out-of-Distribution Solutions. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pp. 1270 1278, 2021b.

Ehab A. Al Badawy, Ashirbani Saha, and Maciej A. Mazurowski. Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing. Medical Physics, 45(3):1150 1158, 2018.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant Risk Minimization. ar Xiv:1907.02893, 2019.

Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930 945, 1993.

Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in Terra Incognita. In Computer Vision ECCV 2018, volume 11220, pp. 472 489. 2018.

Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. Invariant Rationalization. In Proceedings of the 37th International Conference on Machine Learning, pp. 1448 1458, 2020.

Yongqiang Chen, Kaiwen Zhou, Yatao Bian, Binghui Xie, Bingzhe Wu, Yonggang Zhang, Ma Kaili, Han Yang, Peilin Zhao, Bo Han, and James Cheng. Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization. In The Eleventh International Conference on Learning Representations, 2023.

Andreas Christmann and Ingo Steinwart. Support Vector Machines. Information Science and Statistics. Springer, 2008.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303 314, 1989.

Published in Transactions on Machine Learning Research (01/2024)

Will Douglas Heaven. Google s medical AI was super accurate in a lab. Real life was a different story. https://www.technologyreview.com/2020/04/27/1000658/google-medical-ai-accurate-lab-real-lifeclinic-covid-diabetes-retina-disease/, 2020.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989.

Dongsung Huh and Avinash Baidya. The Missing Invariance Principle found the Reciprocal Twin of Invariant Risk Minimization. In Advances in Neural Information Processing Systems, 2022.

Pritish Kamath, Akilesh Tangella, Danica Sutherland, and Nathan Srebro. Does Invariant Risk Minimization Capture Invariance? In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pp. 4069 4077, 2021.

Masanori Koyama and Shoichiro Yamaguchi. When is invariance useful in an Out-of-Distribution Generalization problem ? ar Xiv:2008.01883, 2021.

David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-Distribution Generalization via Risk Extrapolation (REx). In Proceedings of the 38th International Conference on Machine Learning, pp. 5815 5826, 2021.

Yong Lin, Hanze Dong, Hao Wang, and Tong Zhang. Bayesian Invariant Risk Minimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000 16009, 2022a.

Yong Lin, Shengyu Zhu, Lu Tan, and Peng Cui. Zin: When and how to learn invariance without environment partition? In Advances in Neural Information Processing Systems, 2022b.

Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Heterogeneous Risk Minimization. In Proceedings of the 38th International Conference on Machine Learning, pp. 6804 6814, 2021a.

Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Kernelized Heterogeneous Risk Minimization, 2021b.

Chaochao Lu, Yuhuai Wu, José Miguel Hernández-Lobato, and Bernhard Schölkopf. Invariant Causal Representation Learning for Out-of-Distribution Generalization. In International Conference on Learning Representations, 2022.

Hrushikesh Mhaskar. Neural Networks for Optimal Approximation of Smooth and Analytic Functions. Neural Computation, 8(1):164 177, 1996.

Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, and Bernhard Schölkopf. Learning explanations that are hard to vary. In International Conference on Learning Representations, 2022.

Christian S. Perone, Pedro Ballester, Rodrigo C. Barros, and Julien Cohen-Adad. Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. Neuro Image, 194:1 11, 2019.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947 1012, 2016.

Roman Pogodin, Namrata Deka, Yazhe Li, Danica J. Sutherland, Victor Veitch, and Arthur Gretton. Efficient Conditionally Invariant Representation Learning. In The Eleventh International Conference on Learning Representations, 2023.

Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant Gradient Variances for Out-of Distribution Generalization. In Proceedings of the 39th International Conference on Machine Learning, pp. 18347 18377, 2022.

Published in Transactions on Machine Learning Research (01/2024)

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant Models for Causal Transfer Learning. Journal of Machine Learning Research, 19(36):1 34, 2018.

Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The Risks of Invariant Risk Minimization. In International Conference on Learning Representations, 2021.

Janelle Shane. Do neural nets dream of electric sheep? - AI Weirdness Comment Share Comment Share. https://www.aiweirdness.com/do-neural-nets-dream-of-electric-18-03-02/, 2018.

Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233 268, 2017.

Xiaoyu Tan, LIN Yong, Shengyu Zhu, Chao Qu, Xihe Qiu, Xu Yinghui, Peng Cui, and Yuan Qi. Provably invariant learning without domain information. In Proceedings of the 40th International Conference on Machine Learning, 2023.

Shoji Toyota and Kenji Fukumizu. Invariance Learning based on Label Hierarchy. In Advances in Neural Information Processing Systems, 2022.

Xiao Zhou, Yong Lin, Weizhong Zhang, and Tong Zhang. Sparse Invariant Risk Minimization. In Proceedings of the 39th International Conference on Machine Learning, pp. 27222 27244, 2022.