# Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data

Xin Zou, Weiwei Liu*

School of Computer Science, Institute of Artificial Intelligence, National Engineering Research Center for Multimedia Software, Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China
{zouxin2021, liuweiwei863}@gmail.com

*Corresponding author.

## Abstract

Out-of-distribution (OOD) generalization has attracted increasing research attention in recent years, due to its promising experimental results in real-world applications. In this paper, we study the confidence set prediction problem in the OOD generalization setting. Split conformal prediction (SCP) is an efficient framework for handling the confidence set prediction problem. However, the validity of SCP requires the examples to be exchangeable, which is violated in the OOD setting. Empirically, we show that trivially applying SCP results in a failure to maintain the marginal coverage when the unseen target domain is different from the source domain. To address this issue, we develop a method for forming confident prediction sets in the OOD setting and theoretically prove the validity of our method. Finally, we conduct experiments on simulated data to empirically verify the correctness of our theory and the validity of our proposed method.

## 1 Introduction

Recent years have witnessed the remarkable success of modern machine learning techniques in many applications. A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is consistently violated in many practical applications. In reality, the test environment is influenced by a range of factors, such as the distributional shifts across photos caused by the use of different cameras in image classification tasks, the voices of different persons in voice recognition tasks, and the variations between scenes in self-driving tasks (Nagarajan, Andreassen, and Neyshabur 2021). As a result, there is now a rapidly growing body of research with a focus on generalizing to unseen target domains with the help of the source domains, namely OOD generalization (Shen et al. 2021).

Existing OOD generalization methods focus on improving worst-case performance on the target domains, i.e., improving the average test accuracy of the model on the worst target domain. However, in some systems that require high security (such as medical diagnosis), even a single mistake may have disastrous consequences. In these cases, it is important to quantify the uncertainty of the predictions.

One way to perform uncertainty estimation (Amodei et al. 2016; Jiang et al. 2012, 2018; Angelopoulos et al. 2021) is to create confident prediction sets that provably contain the correct answer with high probability. Let $X_{n+1} \in \mathcal{X}$ be a new test example for which we would like to predict the corresponding label $Y_{n+1} \in \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the label space. For any given $\alpha \in (0, 1)$, the aim of confidence set prediction is to construct a set-valued output, $C(X_{n+1})$, which contains the label $Y_{n+1}$ with distribution-free marginal coverage at a significance level $\alpha$, i.e., $\mathbb{P}(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha$.
A confidence set predictor $C^\alpha$ is said to be valid if $\mathbb{P}(Y_{n+1} \in C^\alpha(X_{n+1})) \geq 1 - \alpha$ for any $\alpha \in (0, 1)$, where $\alpha$ is a hyper-parameter of the predictor. To simplify the notation, we omit the superscript $\alpha$ in the remainder of this paper.

Conformal prediction (Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008, CP) is a model-agnostic, non-parametric, and distribution-free (the coverage guarantee holds for any distribution) framework for creating confident prediction sets. Split conformal prediction (Vovk 2013; Vovk, Gammerman, and Shafer 2005, SCP), a special type of CP, has been shown to be computationally efficient. SCP reserves a set of data as the calibration set, and then uses the relative value of the scores of the calibration set and that of a new test example to construct the prediction set. The validity of SCP relies on the assumption that the examples are exchangeable. However, in the OOD setting, the distributional shift between the training and test distributions leads to the violation of the exchangeability assumption. We empirically evaluate the performance of SCP in the OOD setting in Section 4. Unfortunately, we find that trivially applying SCP results in a failure to maintain marginal coverage in the OOD setting.

To address this issue, we construct a set predictor based on the f-divergence (Alfréd 1961) between the test distribution (target domain) and the convex hull of the training distributions (source domains). We theoretically show that our set predictor is guaranteed to maintain the marginal coverage (Corollary 9). We then conduct simulation experiments to verify our theory.

The remainder of this article is structured as follows: Section 2 introduces related works; Section 3 presents the notation and preliminaries; Section 4 presents experiments that show the failure of SCP in the OOD generalization setting; Section 5 constructs a corrected confidence set predictor for the OOD generalization setting; Section 6 provides our experimental results; Section 7 discusses the most related work. Finally, conclusions are presented in Section 8. All of our proofs are given in Appendix A.

## 2 Related Works

**OOD generalization.** OOD generalization aims to train a model with data from the source domains so that it is capable of generalizing to an unseen target domain. A large number of algorithms have been developed to improve OOD generalization. One series of works focuses on minimizing the discrepancies between the source domains (Li et al. 2018b; Ganin et al. 2016; Li et al. 2018c; Sun and Saenko 2016). Meta-learning domain generalization (Li et al. 2018a, MLDG) leverages the meta-learning approach and simulates train/test distributional shift during training by synthesizing virtual testing domains within each mini-batch. Another line of works (Xin et al. 2022; Wang et al. 2022) conducts adversarial training (Madry et al. 2018) to improve OOD generalization performance. (Zou and Liu 2023) considers improving the adversarial robustness on the unseen target domain. Notably, the above works all focus on improving the average performance on the target domain; in contrast, we focus on designing valid confidence set predictors for data from unseen target domains, as this is a crucial element of making high-stakes decisions in systems that require high security.

**Conformal prediction.** As introduced in Section 1, conformal prediction is a model-agnostic, non-parametric, and distribution-free framework that provides valid confidence set predictors.
Generally speaking, examples are assumed to be exchangeable in a CP context. Most pertinent to our work, (Gendler et al. 2022; Tibshirani et al. 2019; Fisch et al. 2021; Cauchois et al. 2020; Gibbs and Candès 2021; Oliveira et al. 2022) all consider various situations in which the exchangeability of the examples is violated to some extent. (Gendler et al. 2022) considers the case in which the test examples may be adversarially attacked (Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018); (Tibshirani et al. 2019) investigates the situation in which the density ratio between the target domain and the source domain is known; (Fisch et al. 2021) studies the few-shot learning setting and assumes that the source domains and the target domain are independent and identically distributed (i.i.d.) from some distribution on the domains; (Gibbs and Candès 2021) considers an online learning setting; and (Oliveira et al. 2022) provides results when the examples are mixing (Achim 2013; Xiaohong, Lars Peter, and Marine 2010; Bin 1994). Different from all the works discussed above, we consider the OOD generalization setting in which the f-divergence between the target domain and the convex hull of the source domains is constrained. The most related work among them is (Cauchois et al. 2020), which studies the worst-case coverage guarantee over an f-divergence ball centered at a single source domain. For a discussion of the similarities and differences with (Cauchois et al. 2020), please refer to Section 7.

## 3 Preliminaries

We begin with the OOD setup and a review of conformal prediction.

**Notation.** We denote $\{1, 2, \dots, n\}$ by $[n]$ for $n \in \mathbb{N}_+$. For a distribution $P$ on $\mathbb{R}$, we define the quantile function of $P$ as $Q(\beta; P) := \inf\{s \in \mathbb{R} \mid \mathbb{P}_{S \sim P}(S \leq s) \geq \beta\}$. Similarly, for a cumulative distribution function (c.d.f.) $F$ on $\mathbb{R}$, we define $Q(\beta; F) := \inf\{s \in \mathbb{R} \mid F(s) \geq \beta\}$. For $n$ distributions $P_1, \dots, P_n$, we define
$$\mathrm{CH}(P_1, \dots, P_n) := \left\{\sum_{i=1}^n \lambda_i P_i \,\middle|\, \lambda_1, \dots, \lambda_n \geq 0;\ \sum_{i=1}^n \lambda_i = 1\right\}$$
as the convex hull of the distributions $P_1, \dots, P_n$. We further define $N(\mu, \Sigma)$ as the multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. For a set $A$, we define the indicator function as $\mathbb{I}_A(\cdot)$, where $\mathbb{I}_A(x) = 1$ if $x \in A$ and $\mathbb{I}_A(x) = 0$ otherwise.

### 3.1 Out-of-Distribution Generalization

We define the input space as $\mathcal{X}$ and the label space as $\mathcal{Y}$. We set $\mathcal{Y} = \{\pm 1\}$, $\mathcal{Y} = \{1, 2, \dots, K\}$ (where $K$ is the number of classes), and $\mathcal{Y} = \mathbb{R}$ for the binary classification problem, the multi-class classification problem, and the regression problem, respectively. Let $\mathcal{S} := \{S_1, \dots, S_d\}$ be the set of source domains, where $d$ is the number of source domains. $S_1, \dots, S_d$ are distributions on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, and we use the terminologies domain and distribution interchangeably in this paper. Let $T$ denote the target domain. The goal of OOD generalization is to obtain good performance on all $T \in \mathcal{T}$, where $\mathcal{T}$ is the set of all possible target domains; we usually assume $\mathcal{S} \subseteq \mathcal{T}$.

In a standard OOD generalization setting, we learn a predictor $h \in \mathcal{H} \subseteq \{h : \mathcal{X} \to \mathcal{Y}\}$ from the source domains $\mathcal{S}$ and define a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, where $\mathbb{R}_+ = [0, +\infty)$. We aim to minimize the worst-case population risk of the predictor $h$ on the unseen target domain as follows:
$$R_{\mathcal{T}}(h) = \max_{T \in \mathcal{T}} \mathbb{E}_{(X,Y) \sim T}\left[\ell(h(X), Y)\right].$$
However, in some systems that require high security, a mistake may lead to serious disasters. In these cases, a good solution is to output a prediction set with a marginal coverage guarantee.
For a predefined confidence level $1 - \alpha \in (0, 1)$, we wish to output a prediction set $C(x) \subseteq \mathcal{Y}$ such that, for any $T \in \mathcal{T}$:
$$\mathbb{P}_{(X,Y) \sim T,\, C}\left[Y \in C(X)\right] \geq 1 - \alpha, \quad (1)$$
where the probability is over the randomness of the test example $(X, Y) \sim T$ and the randomness of the prediction set $C$. To achieve (1), we follow the idea of SCP (Vovk 2013; Vovk, Gammerman, and Shafer 2005) to construct $C(x)$. The next section introduces the main idea of SCP.

### 3.2 Split Conformal Prediction

**Nonconformity score.** In SCP, we consider a supervised learning problem that involves predicting the label $y \in \mathcal{Y}$ of the input $x \in \mathcal{X}$. We assume that we have a predictive model $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which outputs the nonconformity score $s(x, y)$. The nonconformity score function $s(\cdot, \cdot)$ is usually trained with a set of training data. $s(x, y) < s(x, y')$ means that for the input $x$, $y$ is more likely than $y'$ to be the label. Some examples of nonconformity scores are as follows: for a probabilistic model $p(y|x)$, we can take the negative log-likelihood as the score, $s(x, y) = -\log p(y|x)$; for a regression model $h : \mathcal{X} \to \mathcal{Y}$, a typical choice is $s(x, y) = |h(x) - y|$; for a multi-class classifier $h : \mathcal{X} \to \Delta^{K-1}$, where $\Delta^{K-1}$ is the $(K-1)$-dimensional simplex in $\mathbb{R}^K$, we can take $s(x, y) = 1 - h(x)_y$.

In SCP, we assume that the examples $\{(X_i, Y_i)\}_{i=1}^{n+1} \subseteq \mathcal{X} \times \mathcal{Y}$ are exchangeable (Definition 1). For a predefined significance level $\alpha \in (0, 1)$, the goal is to provide a valid confidence set $\hat{C}(X_{n+1})$. CP methods (Shafer and Vovk 2008) take advantage of the exchangeability of the data and the properties of the quantile function to make such a construction possible.

**Definition 1** ((Shafer and Vovk 2008), Exchangeability). The random variables $Z_1, \dots, Z_n$ are exchangeable if for every permutation $\tau$ of the integers $1, \dots, n$, the variables $W_1, \dots, W_n$, where $W_i = Z_{\tau(i)}$, have the same joint probability distribution as $Z_1, \dots, Z_n$.

Let $V_i = s(X_i, Y_i)$ for $i \in [n+1]$ be the nonconformity scores corresponding to the examples $\{(X_i, Y_i)\}_{i=1}^{n+1}$, where $s(\cdot, \cdot)$ is independent of $\{(X_i, Y_i)\}_{i=1}^{n+1}$. The independence between $s(\cdot, \cdot)$ and $\{(X_i, Y_i)\}_{i=1}^{n+1}$ is useful since in this case we can prove that the scores $\{V_i\}_{i=1}^{n+1}$ are exchangeable; their exchangeability follows from the exchangeability of $\{(X_i, Y_i)\}_{i=1}^{n+1}$ and the independence between $s(\cdot, \cdot)$ and $\{(X_i, Y_i)\}_{i=1}^{n+1}$. Next, define $\mathrm{rank}(V_i)$ as the rank of $V_i$ among $\{V_i\}_{i=1}^{n+1}$ for any $i \in [n+1]$ (in ascending order; we assume that ties are broken randomly). By the exchangeability of $\{V_i\}_{i=1}^{n+1}$, $\mathrm{rank}(V_i)$ is uniformly distributed on $[n+1]$, which is used to prove the validity of SCP in Lemma 2. We use $\hat{P}(\{V_i\}_{i=1}^n)$ to denote the empirical distribution determined by the examples $V_1, \dots, V_n$. Let
$$\hat{C}_n(x) := \left\{y \in \mathcal{Y} \,\middle|\, s(x, y) \leq Q\left(\frac{n+1}{n}(1-\alpha);\ \hat{P}(\{V_i\}_{i=1}^n)\right)\right\}; \quad (2)$$
we then have the following marginal coverage guarantee.

**Lemma 2** (The validity of SCP). Assume that the examples $\{(X_i, Y_i)\}_{i=1}^{n+1}$ are exchangeable. For any nonconformity score $s(\cdot, \cdot)$ and any $\alpha \in (0, 1)$, the prediction set defined in Equation (2) satisfies:
$$\mathbb{P}\left(Y_{n+1} \in \hat{C}_n(X_{n+1})\right) \geq 1 - \alpha, \quad (3)$$
where the probability is over the randomness of $\{(X_i, Y_i)\}_{i=1}^{n+1}$.
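To make the construction in Equation (2) concrete, the following is a minimal Python sketch of split conformal prediction. It is our own illustration, not the authors' code; the toy score $s(x, y) = |x - y|$ and all constants are assumptions for demonstration.

```python
import numpy as np

def scp_threshold(cal_scores, alpha):
    """Threshold from Equation (2): the ((n+1)/n)(1-alpha) empirical quantile
    of the calibration scores, i.e., the ceil((n+1)(1-alpha))-th smallest score
    (assuming (n+1)(1-alpha) <= n)."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(cal_scores)[k - 1]

def scp_prediction_set(x, candidate_labels, score_fn, t):
    """C_hat_n(x) = {y : s(x, y) <= t} over a finite grid of candidate labels."""
    return [y for y in candidate_labels if score_fn(x, y) <= t]

# Toy usage: exchangeable calibration scores from a hypothetical regression model.
rng = np.random.default_rng(0)
cal_x = rng.normal(size=500)
cal_y = cal_x + rng.normal(scale=0.3, size=500)
scores = np.abs(cal_x - cal_y)                    # V_i = s(X_i, Y_i) = |x - y|
t = scp_threshold(scores, alpha=0.1)
print(scp_prediction_set(0.3, np.linspace(-1.0, 1.7, 28), lambda x, y: abs(x - y), t))
```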
In the OOD generalization setting, we also want a set predictor that is valid for any $T \in \mathcal{T}$. In light of this, some natural questions arise: Does the set predictor defined in Equation (2) remain valid when the unseen target domain is different from the source domains? If not, can we construct a new set predictor that is valid in the OOD generalization setting?

Unfortunately, the answer to the first question is negative. Theoretically, as shown in Appendix A.1, the proof of Lemma 2 is highly dependent on the exchangeability of the examples $\{(X_i, Y_i)\}_{i=1}^{n+1}$, which is easily violated if there is any distributional shift between the distribution of $\{(X_i, Y_i)\}_{i=1}^n$ and the distribution of $(X_{n+1}, Y_{n+1})$. This means that in the OOD setting, the proof technique of Lemma 2 cannot be applied. Empirically, in Section 4, we provide a toy example to show that the set predictor $\hat{C}_n(x)$ is no longer valid in the OOD setting. In Section 5, we give an affirmative answer to the second question: we first construct a new set predictor based on the f-divergence between the target domain and the convex hull of the source domains, and then provide marginal coverage guarantees for the constructed predictor.

## 4 SCP Fails in the OOD Setting

In this section, we construct a toy example to show that for the OOD confidence set prediction problem, SCP is no longer valid, even under a slight distributional shift. For simplicity, we consider a single-domain case. Specifically, we consider the regression problem and set $\mathcal{X} = \mathbb{R}^l$, $\mathcal{Y} = \mathbb{R}$. We define the source domain $S$ as follows: given a linear predictor $L(x) = \langle w^*, x \rangle + b^*$, where $w^* \in \mathbb{R}^l$ and $b^* \in \mathbb{R}$, the marginal distribution of $X$ and the conditional distribution of $Y$ given $X$ are defined as:
$$X \sim N(\mu_s, \sigma_{s,x}^2 I_l), \qquad Y \mid X = x \sim N(L(x), \sigma_{s,y}^2),$$
where $\mu_s \in \mathbb{R}^l$ is the mean vector of $X$, $\sigma_{s,x}, \sigma_{s,y}$ are positive scalars, and $I_l \in \mathbb{R}^{l \times l}$ is an identity matrix. Similarly, for the target domain $T$, we define:
$$X \sim N(\mu_t, \sigma_{t,x}^2 I_l), \qquad Y \mid X = x \sim N(L(x), \sigma_{t,y}^2),$$
where $\mu_t \in \mathbb{R}^l$ is the mean vector of $X$ and $\sigma_{t,x}, \sigma_{t,y}$ are positive scalars. For simplicity, we set $\mu_s \neq \mu_t$, $\sigma_{s,x} = \sigma_{t,x}$, and $\sigma_{s,y} = \sigma_{t,y}$.

We sample $m_{\mathrm{train}}$ training examples from $S$ to train a linear predictor $\hat{L}(x) = \langle \hat{w}, x \rangle + \hat{b}$, where $\hat{w} \in \mathbb{R}^l$ and $\hat{b} \in \mathbb{R}$. We then define the nonconformity score as $s(x, y) = |\hat{L}(x) - y|$. We sample $n$ examples from $S$ to construct the prediction set $\hat{C}_n(x)$ in Equation (2) and sample $m_{\mathrm{test}}$ examples from $T$ to form the test data. We repeat this process 1000 times with different random seeds. The results for the coverage (left) and length (right) of the prediction set are presented in box plot form in Figure 1. Here, the coverage is the ratio between the number of test examples such that $y_i \in \hat{C}_n(x_i)$ and the size of the test set. The red lines stand for the desired marginal coverages. Since the boxes are below the red coverage lines, we conclude that SCP fails to provide a prediction set with the desired coverage when there exists a distributional shift between the source domain and the target domain.

[Figure 1: Box plots of the coverage (left) and length (right) of the prediction sets over the 1000 runs, for $\alpha \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$ on the horizontal axis. The red lines are the marginal coverage guarantees that we wish to achieve.]
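As a rough illustration of this failure mode, the sketch below simulates one run of the toy example; the dimension, sample sizes, and shift magnitude are our own illustrative choices. On typical runs the target-domain coverage falls clearly below $1 - \alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m_train, n_cal, m_test, alpha = 20, 50, 1000, 5000, 0.1
w_star, b_star = rng.normal(size=l), 0.5
mu_s, mu_t = 0.0, 2.0                       # mean shift between S and T

def sample(mu, m, sigma_x=1.0, sigma_y=1.0):
    X = rng.normal(mu, sigma_x, size=(m, l))
    Y = X @ w_star + b_star + rng.normal(scale=sigma_y, size=m)
    return X, Y

# Train the linear predictor L_hat on source data (least squares).
X_tr, Y_tr = sample(mu_s, m_train)
theta = np.linalg.lstsq(np.c_[X_tr, np.ones(m_train)], Y_tr, rcond=None)[0]
predict = lambda X: X @ theta[:-1] + theta[-1]

# Calibrate on fresh source data, exactly as in Equation (2).
X_cal, Y_cal = sample(mu_s, n_cal)
V = np.sort(np.abs(predict(X_cal) - Y_cal))   # scores s(x, y) = |L_hat(x) - y|
t = V[int(np.ceil((n_cal + 1) * (1 - alpha))) - 1]

# Evaluate on the shifted target domain.
X_te, Y_te = sample(mu_t, m_test)
cov = np.mean(np.abs(predict(X_te) - Y_te) <= t)
print(f"nominal {1 - alpha:.2f}, empirical target coverage {cov:.3f}")
```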
## 5 Corrected SCP for OOD Data

In this section, we consider correcting SCP for OOD data. We first consider the case in which we have access to the population distributions of the scores from the source domains. We then consider the case in which we only have access to the empirical distributions and correct the prediction set to obtain the marginal coverage guarantee in Equation (3).

### 5.1 Target Distribution Set and Confidence Sets

It is obvious that obtaining a marginal coverage guarantee for an arbitrary target distribution is impossible unless we set $\hat{C}(x) = \mathcal{Y}$ for all $x \in \mathcal{X}$, which is a trivial confidence set predictor and does not provide any useful information. In this paper, we consider the case in which the f-divergence (Alfréd 1961) between the target domain and the convex hull of the source domains does not exceed a predefined value. The well-known KL divergence and TV distance are both special cases of the f-divergence. As Section 4 shows, when the target domain differs from the source domain, the marginal coverage does not hold for the predictor (2). Below, we construct new prediction sets for which the marginal coverage (3) holds.

We define a set of distributions $\mathcal{T} \subseteq \{Q \mid Q \text{ is a distribution on } \mathcal{X} \times \mathcal{Y}\}$. For each $T \in \mathcal{T}$, the distribution of the score for the data from $T$ is defined as the push-forward distribution $s_\#T$, where $(s_\#T)(A) = T(s^{-1}(A))$ for any measurable set $A \subseteq \mathbb{R}$. We define the distribution set of the scores as $\mathcal{P} := \{s_\#T : T \in \mathcal{T}\}$. For a given $\alpha \in (0, 1)$, our goal is to choose a threshold $t \in \mathbb{R}$ such that the confidence set $\tilde{C}(x) := \{y \in \mathcal{Y} \mid s(x, y) \leq t\}$ satisfies (3) when $(X_{n+1}, Y_{n+1})$ is drawn from any target domain $T \in \mathcal{T}$. The following lemma provides a proper choice of $t$.

**Lemma 3.** For any unknown target distribution $T \in \mathcal{T}$, assume that $(X_{n+1}, Y_{n+1})$ is drawn from $T$. If we set $t \geq \max_{P \in \mathcal{P}} Q(1-\alpha; P)$, then:
$$\mathbb{P}\left(Y_{n+1} \in \tilde{C}(X_{n+1})\right) \geq 1 - \alpha. \quad (4)$$

For a given set $\mathcal{P}$ of distributions for the score, Lemma 3 reduces the problem of finding a valid confidence set predictor to the following optimization problem:
$$\max_P \ Q(1-\alpha; P) \quad \text{s.t.} \quad P \in \mathcal{P}. \quad (5)$$

Next, we formulate the set $\mathcal{T}$ through the lens of the f-divergence.

**Definition 4** (f-divergence). Let $f : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a closed convex function satisfying $f(1) = 0$ and $f(t) = +\infty$ for $t < 0$. Let $P, Q$ be two probability distributions such that $P \ll Q$ ($P$ is absolutely continuous with respect to $Q$). The f-divergence between $P$ and $Q$ can then be defined as follows:
$$D_f(P\|Q) := \int f\left(\frac{dP}{dQ}\right) dQ,$$
where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative (Patrick 2008).

**Remark 1.** For a given function $f$ that satisfies the conditions in Definition 4, define $f_0(t) := f(t) - f'(1)(t-1)$. We then obtain that, for any $P \ll Q$:
$$D_{f_0}(P\|Q) = \int f_0\left(\frac{dP}{dQ}\right) dQ = D_f(P\|Q).$$
By the convexity of $f$, it can be easily observed that $f_0(t) \geq 0$ for all $t \in \mathbb{R}$. Moreover, $\inf_t f_0(t) = f_0(1) = 0$ and $f_0'(1) = 0$. Since $f_0$ produces the same f-divergence as $f$, without loss of generality, we can assume that $f'(1) = f(1) = 0$ and $f \geq 0$.

Equipped with the f-divergence, we can now define our target distribution set $\mathcal{T}$ for a given threshold $\rho > 0$:
$$\mathcal{T}_{f,\rho}(S_1, \dots, S_d) := \left\{T \,\middle|\, \exists\, Q \in \mathrm{CH}(S_1, \dots, S_d) \text{ s.t. } D_f(T\|Q) \leq \rho\right\}.$$
We omit $S_1, \dots, S_d$ and write $\mathcal{T}_{f,\rho}$ for simplicity. The corresponding distribution set for the scores is then:
$$\mathcal{P} := \{s_\#T \mid T \in \mathcal{T}_{f,\rho}\}. \quad (6)$$
However, it is hard to obtain the precise relationship between $\mathcal{P}$ and the distributions $s_\#S_1, \dots, s_\#S_d$, which makes it difficult to analyze $\mathcal{P}$. We instead consider the following distribution set of scores:
$$\mathcal{P}_{f,\rho} := \left\{S \text{ is a distribution on } \mathbb{R} \,\middle|\, \exists\, S_0 \in \mathrm{CH}(s_\#S_1, \dots, s_\#S_d) \text{ s.t. } D_f(S\|S_0) \leq \rho\right\}. \quad (7)$$
The following lemma reveals the relationship between $\mathcal{P}$ and $\mathcal{P}_{f,\rho}$.

**Lemma 5.** Let $\mathcal{P}$, $\mathcal{P}_{f,\rho}$ be defined as in (6), (7), respectively. Then, $\mathcal{P} \subseteq \mathcal{P}_{f,\rho}$.

**Remark 2.** According to Lemma 5, $\sup_{P \in \mathcal{P}_{f,\rho}} Q(1-\alpha; P) \geq \sup_{P \in \mathcal{P}} Q(1-\alpha; P)$.
Lemma 3 accordingly tells us that if we set $t = \sup_{P \in \mathcal{P}_{f,\rho}} Q(1-\alpha; P)$, then for $(X_{n+1}, Y_{n+1})$ drawn from any target distribution $T \in \mathcal{T}_{f,\rho}$, we have $\mathbb{P}(Y_{n+1} \in \tilde{C}(X_{n+1})) \geq 1 - \alpha$. Our goal is now to solve Problem (5) for the set $\mathcal{P}_{f,\rho}$. In light of Remark 2, we define the worst-case quantile function for the distribution set $\mathcal{P}_{f,\rho}$ as
$$\tilde{Q}(\alpha; \mathcal{P}_{f,\rho}) := \sup_{P \in \mathcal{P}_{f,\rho}} Q(\alpha; P).$$
Remark 2 tells us that taking $t = \tilde{Q}(1-\alpha; \mathcal{P}_{f,\rho})$ produces a valid confidence set $\tilde{C}$. The next theorem allows us to express the worst-case quantile function in terms of the standard quantile function, which helps us to calculate the worst-case quantile efficiently.

**Theorem 6.** Let $F_1, \dots, F_d$ be the c.d.f.'s of the distributions $s_\#S_1, \dots, s_\#S_d$. Define the function $g_{f,\rho} : [0,1] \to [0,1]$ as
$$g_{f,\rho}(\beta) := \inf\left\{z \in [0,1] \,\middle|\, \beta f\left(\frac{z}{\beta}\right) + (1-\beta) f\left(\frac{1-z}{1-\beta}\right) \leq \rho\right\}$$
and define the inverse of $g_{f,\rho}$ as $g_{f,\rho}^{-1}(\tau) = \sup\{\beta \in [0,1] \mid g_{f,\rho}(\beta) \leq \tau\}$. Let $F_{\min}(x) := \min_{1 \leq i \leq d} F_i(x)$, which is a c.d.f.; then the following holds for all $\alpha \in (0,1)$:
$$\tilde{Q}(\alpha; \mathcal{P}_{f,\rho}) = Q\left(g_{f,\rho}^{-1}(\alpha); F_{\min}\right).$$

### 5.2 Marginal Coverage Guarantee for Empirical Source Distributions

In the previous section, we presented marginal coverage guarantees for the case in which we have access to the population distributions of the scores for the source domains. However, in practice, it is difficult or even impossible to access these population distributions. In this section, we provide marginal coverage guarantees even when we only have access to the empirical distributions, which is useful in practice.

For any $i \in [d]$, assume we have $m_i$ i.i.d. examples $\{V_{ij} = s(X_{ij}, Y_{ij})\}_{j=1}^{m_i}$ from the source distribution $S_i$. Further, suppose that $\hat{F}_i$ is the empirical c.d.f. corresponding to $F_i$, which is defined as $\hat{F}_i(x) = \frac{1}{m_i}\sum_{j=1}^{m_i} \mathbb{I}_{(-\infty, x]}(V_{ij})$. Define $\hat{F}_{\min}(x) = \min_{1 \leq i \leq d} \hat{F}_i(x)$. We first provide an error bound for estimating $F_{\min}$ with $\hat{F}_{\min}$.

**Proposition 7.** Let $F_1, \dots, F_d$ be c.d.f.'s on $\mathbb{R}$ and define $F_{\min}(x) = \min_{1 \leq i \leq d} F_i(x)$. Suppose $\hat{F}_1, \dots, \hat{F}_d$ are the empirical c.d.f.'s corresponding to $F_1, \dots, F_d$, defined with $m_1, \dots, m_d$ examples, respectively. Define $\hat{F}_{\min}(x) = \min_{1 \leq i \leq d} \hat{F}_i(x)$. Then, for any $\epsilon > 0$,
$$\mathbb{P}\left(\sup_{x \in \mathbb{R}}\left|F_{\min}(x) - \hat{F}_{\min}(x)\right| > \epsilon\right) \leq 2\sum_{i=1}^d e^{-2 m_i \epsilon^2},$$
where the probability is over the randomness of the examples that define the empirical c.d.f.'s.

Proposition 7 allows us to quantify the error caused by replacing the population distributions with the empirical distributions, which leads to the following marginal coverage guarantee for the prediction set $\tilde{C}$ that we have defined before.

**Theorem 8** (Marginal coverage guarantee for the empirical estimations). Assume $V_{n+1} = s(X_{n+1}, Y_{n+1}) \sim P \in \mathcal{P}_{f,\rho}$ is independent of $\{V_{ij}\}_{i,j=1}^{d, m_i}$, where $\{V_{ij}\}_{j=1}^{m_i} \overset{\mathrm{i.i.d.}}{\sim} s_\#S_i$ for $i \in [d]$. Suppose $\rho^* = \inf_{P_0 \in \mathrm{CH}_s} D_f(P\|P_0) \leq \rho$, where $\mathrm{CH}_s = \mathrm{CH}(s_\#S_1, \dots, s_\#S_d)$. Let $\hat{F}_{\min}$ be defined as in Proposition 7 and let $\hat{S}_1, \dots, \hat{S}_d$ be the empirical distributions of $S_1, \dots, S_d$, respectively. If we set $t = \tilde{Q}(1-\alpha; \hat{\mathcal{P}}_{f,\rho}) = Q(g_{f,\rho}^{-1}(1-\alpha); \hat{F}_{\min})$, then for any $\epsilon > 0$, we obtain the following marginal coverage guarantee for $\tilde{C}$:
$$\mathbb{P}\left(Y_{n+1} \in \tilde{C}(X_{n+1})\right) \geq \left(1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}\right) g_{f,\rho^*}\left(g_{f,\rho}^{-1}(1-\alpha) - \epsilon\right),$$
where the randomness is over the choice of the source examples and $(X_{n+1}, Y_{n+1})$, and
$$\hat{\mathcal{P}}_{f,\rho} := \left\{S \,\middle|\, \exists\, S_0 \in \mathrm{CH}(s_\#\hat{S}_1, \dots, s_\#\hat{S}_d) \text{ s.t. } D_f(S\|S_0) \leq \rho\right\}.$$
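As a concrete reading of Theorem 8's threshold, the sketch below (our own, with assumed inputs) computes $Q(\beta; \hat{F}_{\min})$ from per-domain calibration scores and sets $t = Q(g_{f,\rho}^{-1}(1-\alpha); \hat{F}_{\min})$; the map `g_inv` is passed in, anticipating the closed forms and the binary-search solver of Section 5.3.

```python
import numpy as np

def q_fmin_hat(score_lists, beta):
    """Q(beta; F_min_hat): smallest score x with min_i F_i_hat(x) >= beta.
    score_lists holds d arrays of calibration scores, one per source domain."""
    for x in np.sort(np.concatenate(score_lists)):  # F_min_hat jumps only here
        if min(np.mean(v <= x) for v in score_lists) >= beta:
            return x
    return float(np.max(np.concatenate(score_lists)))  # fallback, not reached

def ood_scp_threshold(score_lists, alpha, g_inv):
    """t = Q(g_inv(1 - alpha); F_min_hat), the threshold of Theorem 8."""
    return q_fmin_hat(score_lists, g_inv(1.0 - alpha))

# Example with the total-variation g_inv of Section 5.3 and a hypothetical rho.
rho = 0.05
g_inv_tv = lambda tau: min(tau + rho, 1.0)
rng = np.random.default_rng(1)
scores = [np.abs(rng.normal(size=400)), np.abs(rng.normal(scale=1.1, size=600))]
print(ood_scp_threshold(scores, alpha=0.1, g_inv=g_inv_tv))
```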
By Lemma 14 in the Appendix, $g_{f,\rho}(\beta)$ is non-increasing in $\rho$ and non-decreasing in $\beta$, so $g_{f,\rho^*}(g_{f,\rho}^{-1}(1-\alpha) - \epsilon) \geq g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha) - \epsilon)$. In practice, we do not know $\rho^*$, so we use $g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha) - \epsilon)$ instead. Since $g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha) - \epsilon) \leq g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha)) = 1-\alpha$, the guaranteed coverage satisfies $\left(1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}\right) g_{f,\rho}(g_{f,\rho}^{-1}(1-\alpha) - \epsilon) \leq 1-\alpha$. To achieve a marginal coverage level of at least $1-\alpha$, we need to correct the output set by replacing $\alpha$ with some $\alpha' < \alpha$ when running our confidence set predictor. The following corollary tells us how to choose $\alpha'$ to correct the prediction set.

**Corollary 9** (Correcting the prediction set to get a $(1-\alpha)$ marginal coverage). Let $(X_{n+1}, Y_{n+1})$, $\hat{F}_{\min}$, $\hat{\mathcal{P}}_{f,\rho}$ be defined as in Theorem 8. For arbitrary $\epsilon > 0$, if we set $t = \tilde{Q}(1-\alpha'; \hat{\mathcal{P}}_{f,\rho}) = Q\left(g_{f,\rho}^{-1}(1-\alpha'); \hat{F}_{\min}\right)$, where $\alpha'$ satisfies
$$g_{f,\rho}^{-1}(1-\alpha') \geq \epsilon + g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}}\right),$$
then we obtain the following marginal coverage guarantee:
$$\mathbb{P}\left(Y_{n+1} \in \tilde{C}(X_{n+1})\right) \geq 1 - \alpha.$$

**Remark 3.** Corollary 9 tells us that we can take
$$t = Q\left(g_{f,\rho}^{-1}(1-\alpha'); \hat{F}_{\min}\right) = Q\left(\epsilon + g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}}\right); \hat{F}_{\min}\right)$$
to get a marginal coverage guarantee with confidence level $1-\alpha$. When $f(\cdot)$, $s(\cdot,\cdot)$ are chosen and the numbers of examples used to estimate the source distributions, i.e., $m_1, \dots, m_d$, are given, we solve the following optimization problem to find a desired $t$:
$$\min_{0 < \epsilon \leq 1} \ Q\left(\epsilon + g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}}\right); \hat{F}_{\min}\right) \quad \text{s.t.} \quad \epsilon + g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}}\right) \leq 1.$$
Since the quantile function $Q(\cdot\,; \hat{F}_{\min})$ is non-decreasing, letting $h(\epsilon) = \epsilon + g_{f,\rho}^{-1}\left(\frac{1-\alpha}{1 - 2\sum_{i=1}^d e^{-2 m_i \epsilon^2}}\right)$, we solve the following problem instead:
$$\min_\epsilon \ h(\epsilon) \quad \text{s.t.} \quad 0 < \epsilon \leq 1, \ h(\epsilon) \leq 1.$$

For some choices of $f$, the functions $g_{f,\rho}$ and $g_{f,\rho}^{-1}$ have closed forms (please refer to the examples in Section 5.3). For a general $f$ for which we do not have a closed form of $g_{f,\rho}^{-1}$, the following lemma tells us that we can use a binary search algorithm to efficiently compute the value of $g_{f,\rho}^{-1}(\tau)$ for a given $\tau$.

**Lemma 10** ((Cauchois et al. 2020), a form of $g_{f,\rho}^{-1}$ that can be efficiently solved). Let $g_{f,\rho}$, $g_{f,\rho}^{-1}$ be defined as in Theorem 6. Then, for any $\tau \in [0,1]$, we have:
$$g_{f,\rho}^{-1}(\tau) = \sup\left\{\beta \in [\tau, 1] \,\middle|\, \beta f\left(\frac{\tau}{\beta}\right) + (1-\beta) f\left(\frac{1-\tau}{1-\beta}\right) \leq \rho\right\}.$$

### 5.3 Examples

In this section, we present examples of calculating $g_{f,\rho}$ and $g_{f,\rho}^{-1}$ for some important f-divergences.

**Example 1** ($\chi^2$-divergence). Let $f(t) = (t-1)^2$; then, $D_f(P\|Q) = \mathbb{E}_Q\left[\left(\frac{dP}{dQ} - 1\right)^2\right] = \mathbb{E}_Q\left[\left(\frac{dP}{dQ}\right)^2\right] - 1$ is the $\chi^2$-divergence. In this case, we have $g_{f,\rho}(\beta) = \left(\beta - \sqrt{\rho\beta(1-\beta)}\right)_+$, where $(x)_+ = \max\{0, x\}$. $g_{f,\rho}^{-1}(\tau)$ is the solution of the following optimization problem:
$$\max \ \beta \quad \text{s.t.} \quad \frac{\rho}{\rho+1} \leq \beta \leq 1, \quad \beta - \sqrt{\rho\beta(1-\beta)} \leq \tau.$$

**Example 2** (Total variation distance, (Cauchois et al. 2020)). Let $f(t) = \frac{1}{2}|t-1|$; then, $D_f(P\|Q) = \mathbb{E}_Q\left[\frac{1}{2}\left|\frac{dP}{dQ} - 1\right|\right]$ is the total variation distance. In this case, we can provide analytic forms for $g_{f,\rho}$ and $g_{f,\rho}^{-1}$: $g_{f,\rho}(\beta) = (\beta - \rho)_+$ and $g_{f,\rho}^{-1}(\tau) = \min\{\tau + \rho, 1\}$.

**Example 3** (Kullback-Leibler divergence). Let $f(t) = t\log t$; then, $D_f(P\|Q) = \mathbb{E}_Q\left[\frac{dP}{dQ}\log\frac{dP}{dQ}\right]$ is the Kullback-Leibler (KL) divergence (Solomon and Richard A 1951). Unfortunately, we cannot provide analytic forms of $g_{f,\rho}$ and $g_{f,\rho}^{-1}$ for the KL divergence. Fortunately, according to Theorem 6 and Lemma 10, we can compute the values $g_{f,\rho}(\beta)$ and $g_{f,\rho}^{-1}(\tau)$ by solving a one-dimensional convex optimization problem, which can be done efficiently using binary search.
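To illustrate these examples, here is a small sketch (ours, not the authors' code) of the closed-form total-variation pair and a binary-search computation of $g_{f,\rho}^{-1}$ for the KL divergence via Lemma 10; it uses the fact that the criterion $\beta f(\tau/\beta) + (1-\beta) f((1-\tau)/(1-\beta))$ is non-decreasing in $\beta$ on $[\tau, 1)$.

```python
import numpy as np

# Total variation distance (Example 2): closed forms.
def g_tv(beta, rho):
    return max(beta - rho, 0.0)

def g_inv_tv(tau, rho):
    return min(tau + rho, 1.0)

# KL divergence (Example 3): g_inv via binary search on Lemma 10's criterion.
def _h_kl(tau, beta):
    """beta*f(tau/beta) + (1-beta)*f((1-tau)/(1-beta)) with f(t) = t*log(t)."""
    def xlogx_ratio(a, b):  # a * log(a / b), with the 0 * log(0) = 0 convention
        return 0.0 if a == 0.0 else a * np.log(a / b)
    return xlogx_ratio(tau, beta) + xlogx_ratio(1.0 - tau, 1.0 - beta)

def g_inv_kl(tau, rho, iters=60):
    """g_inv(tau) = sup{beta in [tau, 1] : h(tau, beta) <= rho} (Lemma 10)."""
    lo, hi = tau, 1.0 - 1e-12
    if _h_kl(tau, hi) <= rho:        # even beta ~ 1 is feasible
        return 1.0
    for _ in range(iters):           # bisect the feasibility boundary
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if _h_kl(tau, mid) <= rho else (lo, mid)
    return lo

rho = 0.1
print(g_inv_tv(0.9, rho), g_inv_kl(0.9, rho))
```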
## 6 Experiments

In this section, we use simulated data to verify our theory and the validity of our constructed confidence set predictor (referred to as OOD-SCP in the remainder of this paper). We consider two cases: first, we verify the validity of OOD-SCP using the same settings as in Section 4; then, we construct a multi-source OOD confidence set prediction task and show that OOD-SCP is valid for this task.

According to Figure 2, unlike standard SCP, for all values of $\alpha$ the violin for OOD-SCP is above the desired coverage line, which shows that OOD-SCP is empirically valid.

[Figure 2: Violin plots of the coverage over the 1000 runs under the same data generation settings as in Section 4, for $\alpha \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$. The red lines are the marginal coverage guarantees that we wish to achieve. The white point represents the median, while the two endpoints of the thick line are the 0.25 quantile and the 0.75 quantile.]

We next consider a multi-source OOD confidence set prediction task. Similar to Section 4, we consider the regression problem and set $\mathcal{X} = \mathbb{R}^l$, $\mathcal{Y} = \mathbb{R}$. Define the oracle linear predictor $L : \mathcal{X} \to \mathcal{Y}$ as $L(x) = \langle w^*, x \rangle + b^*$, where $w^* \in \mathbb{R}^l$ and $b^* \in \mathbb{R}$. We define the marginal distributions of $X$ for the source domains $S_1$ and $S_2$ as $S_{1X} = N(\mu_1, \sigma_{s,x}^2 I_l)$ and $S_{2X} = N(\mu_2, \sigma_{s,x}^2 I_l)$, respectively, where $\mu_1, \mu_2 \in \mathbb{R}^l$ are the mean vectors, $\sigma_{s,x} > 0$ is a scalar, and $I_l \in \mathbb{R}^{l \times l}$ is the identity matrix with dimension $l \times l$. We define $Y \mid X = x \sim N(L(x), \sigma_{s,y}^2)$ for both $S_1$ and $S_2$. For the target domain $T$, we define the marginal distribution of $X$ as $T_X = \frac{S_{1X} + S_{2X}}{2}$ and the conditional distribution of $Y$ given $X$ as $Y \mid X = x \sim N(L(x), \sigma_{t,y}^2)$. Here, $\sigma_{s,y}, \sigma_{t,y} > 0$ are the standard deviations, and $\sigma_{s,y} \neq \sigma_{t,y}$.

Similar to Section 4, we sample $m_{\mathrm{train}}/2$ examples from $S_1$ and $m_{\mathrm{train}}/2$ examples from $S_2$ to train a linear predictor $\hat{L}(x) = \langle \hat{w}, x \rangle + \hat{b}$, where $\hat{w} \in \mathbb{R}^l$ and $\hat{b} \in \mathbb{R}$. We then define the nonconformity score as $s(x, y) = |\hat{L}(x) - y|$. We sample $n/2$ examples from $S_1$ and $n/2$ examples from $S_2$ to construct the prediction set $\tilde{C}(x)$, and sample $m_{\mathrm{test}}$ examples from $T$ to form the test data.

[Figure 3: Violin plots of the coverage over the 1000 runs for the multi-source OOD confidence set prediction task, for $\alpha \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$. The red lines are the marginal coverage guarantees that we wish to achieve. The white point represents the median, while the two endpoints of the thick line are the 0.25 quantile and the 0.75 quantile.]

Figure 3 shows the results for the multi-source OOD confidence set prediction task. From the figure, we can see that the violins for standard SCP are under the desired coverage lines, which means that standard SCP is invalid in this case. By contrast, the violins for OOD-SCP are above the desired coverage lines, indicating that OOD-SCP is valid, which validates Corollary 9.
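For concreteness, here is a rough end-to-end sketch of one run of the multi-source experiment, reusing the $\hat{F}_{\min}$-based threshold from the Section 5.2 sketch with the total-variation $g_{f,\rho}^{-1}$. All constants, including the choice of $\rho$, are illustrative assumptions of ours, and for brevity we use the uncorrected threshold of Theorem 8 rather than the full Corollary 9 correction; on typical runs the target coverage is at or above $1 - \alpha$.

```python
import numpy as np

def q_fmin_hat(score_lists, beta):      # as in the Section 5.2 sketch
    for x in np.sort(np.concatenate(score_lists)):
        if min(np.mean(v <= x) for v in score_lists) >= beta:
            return x

def ood_scp_threshold(score_lists, alpha, g_inv):
    return q_fmin_hat(score_lists, g_inv(1.0 - alpha))

rng = np.random.default_rng(2)
l, m_half, n_half, m_test, alpha, rho = 5, 500, 500, 4000, 0.2, 0.1
w_star, b_star = rng.normal(size=l), 0.5
mu1, mu2 = np.full(l, -1.0), np.full(l, 1.0)
sigma_sy, sigma_ty = 1.0, 1.2                  # conditional shift on the target

def sample(mu, m, sigma_y):
    X = rng.normal(mu, 1.0, size=(m, l))
    return X, X @ w_star + b_star + rng.normal(scale=sigma_y, size=m)

# Train on m_train/2 examples from each source domain.
(X1, Y1), (X2, Y2) = sample(mu1, m_half, sigma_sy), sample(mu2, m_half, sigma_sy)
X_tr, Y_tr = np.vstack([X1, X2]), np.concatenate([Y1, Y2])
theta = np.linalg.lstsq(np.c_[X_tr, np.ones(len(Y_tr))], Y_tr, rcond=None)[0]
predict = lambda X: X @ theta[:-1] + theta[-1]

# Per-domain calibration scores; threshold via F_min_hat (Theorem 8).
score_lists = [np.abs(predict(X) - Y)
               for X, Y in (sample(mu1, n_half, sigma_sy), sample(mu2, n_half, sigma_sy))]
t = ood_scp_threshold(score_lists, alpha, lambda tau: min(tau + rho, 1.0))

# Target domain: X from the equal-weight mixture of S1X and S2X, noisier Y.
mus = np.where(rng.random((m_test, 1)) < 0.5, mu1, mu2)
X_te = rng.normal(mus, 1.0)
Y_te = X_te @ w_star + b_star + rng.normal(scale=sigma_ty, size=m_test)
print("target coverage:", np.mean(np.abs(predict(X_te) - Y_te) <= t))
```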
The reason we do not conduct experiments on real datasets is that we do not know how to set the value of $\rho$ for the existing OOD datasets. Our main claim is that when the target domain satisfies $T \in \mathcal{T}_{f,\rho}$, the coverage of our method is guaranteed. We claim that this limitation is acceptable, as many fields face the same problem. In adversarial robustness (Szegedy et al. 2014), the theories (for example, (Montasser, Hanneke, and Srebro 2019)) provide an upper bound on the test adversarial risk $\mathbb{P}_{(x,y) \sim D}[\exists \delta, \|\delta\| \leq \epsilon : h(x + \delta) \neq y]$, where $h$ is a classifier. The results only tell us that we have a guarantee for the test accuracy if the test perturbation $\delta$ satisfies $\|\delta\| \leq \epsilon$; the case $\|\delta\| > \epsilon$ is out of the scope of those theories. For distributionally robust optimization (DRO), the theories (Lee and Raginsky 2018) prove that if the test distribution lies in a Wasserstein ball with radius $r$, then the test risk can be upper bounded. Formally, $\max_{D' \in W(r)} \mathbb{P}_{(x,y) \sim D'}[h(x) \neq y]$ is upper bounded, where $W(r)$ is a Wasserstein ball with radius $r$. They do not know how to set $r$ to make $W(r)$ contain the test distribution either; however, this does not overshadow their contribution to the DRO community. In other words, the issue of $\rho$ does not overshadow our contribution to the OOD community.

## 7 Discussions

Our work is an extension of (Cauchois et al. 2020) to the multi-domain case. In this section, we discuss the differences between our work and (Cauchois et al. 2020).

### 7.1 The Necessity of Our Extension

In the multi-source setting, to make use of all the source domains, a trivial method is to regard the mixture of the source domains as a single domain $\bar{S} = \sum_{i=1}^d \lambda_i S_i$ and use the method in (Cauchois et al. 2020). However, there are two issues:

- Given the empirical data from $S_1, \dots, S_d$, we do not know the exact values of $\lambda_1, \dots, \lambda_d$ for the mixed domain, so we do not know the set $\mathcal{P} = \{s_\#T \mid D_f(T\|\bar{S}) \leq \rho\}$ for a given $\rho$. Consequently, we do not know the set for which we are providing a coverage guarantee.
- We may not be able to provide a coverage guarantee for data from one of the source domains. Take the KL divergence as an example: drawing from $\bar{S}$ can be regarded as first drawing an index $I$ from $\lambda$ and then drawing an example from $S_I$; $S_i$ can be seen as drawn from the same process with $\lambda = e_i$, where $e_i = (0, \dots, 0, 1, 0, \dots, 0) \in \mathbb{R}^d$ has only the $i$-th element equal to 1. By the chain rule, $\mathrm{KL}(S_i\|\bar{S}) = \mathrm{KL}(e_i\|\lambda) + \mathbb{E}_{j \sim e_i}[\mathrm{KL}(S_j\|S_j)] = \log(1/\lambda_i)$ (see the numerical check below). If $\rho < \max_i \log(1/\lambda_i)$, then there exists an $S_i$ such that $\mathrm{KL}(S_i\|\bar{S}) > \rho$, i.e., we cannot even get a coverage guarantee for the source domain $S_i$, which is unacceptable! The problem gets worse as the number of source domains $d$ grows, since $\max_i \log(1/\lambda_i) \geq \log d$.

However, our generalization has no such problem even if we choose $\rho = 0$, which means the method in (Cauchois et al. 2020) is not compatible with the multi-source setting, so our extension is necessary.
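As a quick numerical check of the identity $\mathrm{KL}(S_i\|\bar{S}) = \log(1/\lambda_i)$, the sketch below uses discrete source distributions with disjoint supports, an illustrative assumption of ours under which the identity holds exactly.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with supp(p) contained in supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Three source domains supported on disjoint blocks of a common alphabet.
S = np.zeros((3, 6))
S[0, 0:2] = [0.4, 0.6]
S[1, 2:4] = [0.5, 0.5]
S[2, 4:6] = [0.9, 0.1]
lam = np.array([0.2, 0.3, 0.5])
S_bar = lam @ S                     # the mixture domain

for i in range(3):
    # Both columns agree: KL(S_i || S_bar) equals log(1 / lambda_i).
    print(kl(S[i], S_bar), np.log(1.0 / lam[i]))
```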
### 7.2 The Difference in Proof Skills

In fact, our Theorem 6 is an extension of (Cauchois et al. 2020) to the OOD setting and mainly depends on Lemmas 17 and 18. Lemma 17 helps us reduce the multi-input $g_{f,\rho}$ to the single-input case. The main idea of the proof of Lemma 18 comes from the argument in (Cauchois et al. 2020); however, the extension is non-trivial. In Lemma 18, letting $h(z, \beta) = \beta f(z/\beta) + (1-\beta) f((1-z)/(1-\beta))$, we use the multi-input
$$g_{f,\rho}(\beta_1, \dots, \beta_d) = \inf\left\{z \in [0,1] \,\middle|\, \inf_{\lambda \in \Delta_{d-1}} h\left(z, \sum_{i=1}^d \lambda_i \beta_i\right) \leq \rho\right\},$$
which involves taking an infimum with respect to $\lambda$ and is much more complicated than the single-input case in (Cauchois et al. 2020). We construct a set $\mathcal{P}'_{f,\rho}$ that is more complicated than that in (Cauchois et al. 2020), and the proof is more difficult. Moreover, due to the multiple inputs and the $\inf_{\lambda \in \Delta_{d-1}}$ operator, we need to consider four cases according to whether each $F_i(t)$ is 0 or 1.

Our Theorem 8 and its corresponding Corollary 9 are novel and quite different from Corollaries 2.1 and 2.2 in (Cauchois et al. 2020). The common point is that they all consider finite sample approximation. The proof of Corollary 2.1 in (Cauchois et al. 2020) relies on the exchangeability of the source examples; however, in the OOD setting, examples are drawn from different source domains and are not exchangeable, so the analysis techniques in (Cauchois et al. 2020) cannot be applied in our case. To fill this gap, we use a decomposition technique and concentration inequalities.

## 8 Conclusion

We study the confidence set prediction problem in the OOD generalization setting. We first empirically show that SCP is not valid in the OOD generalization setting. We then develop a method for forming valid confident prediction sets in the OOD setting and theoretically prove the validity of our proposed method. Finally, we conduct experiments on simulated data to empirically verify both the correctness of our theory and the validity of our proposed method.

## Acknowledgements

This work is supported by the National Key R&D Program of China under Grant 2023YFC3604702, the National Natural Science Foundation of China under Grant 61976161, and the Fundamental Research Funds for the Central Universities under Grant 2042022rc0016.

## References

Achim, K. 2013. Probability theory: a comprehensive course. Springer Science & Business Media.

Alfréd, R. 1961. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, 547-562.

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P. F.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. CoRR, abs/1606.06565.

Angelopoulos, A. N.; Bates, S.; Jordan, M. I.; and Malik, J. 2021. Uncertainty Sets for Image Classifiers using Conformal Prediction. In ICLR.

Bin, Y. 1994. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94-116.

Cauchois, M.; Gupta, S.; Ali, A.; and Duchi, J. C. 2020. Robust Validation: Confident Predictions Even When Distributions Shift. CoRR, abs/2008.04267.

Fisch, A.; Schuster, T.; Jaakkola, T. S.; and Barzilay, R. 2021. Few-Shot Conformal Prediction with Auxiliary Tasks. In ICML, 3329-3339.

Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. S. 2016. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res., 17: 59:1-59:35.

Gendler, A.; Weng, T.; Daniel, L.; and Romano, Y. 2022. Adversarially Robust Conformal Prediction. In ICLR.

Gibbs, I.; and Candès, E. J. 2021. Adaptive Conformal Inference Under Distribution Shift. In NeurIPS, 1660-1672.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.

Jiang, H.; Kim, B.; Guan, M. Y.; and Gupta, M. R. 2018. To Trust Or Not To Trust A Classifier. In NeurIPS, 5546-5557.

Jiang, X.; Osl, M.; Kim, J.; and Ohno-Machado, L. 2012. Calibrating predictive model estimates to support personalized medicine. J. Am. Medical Informatics Assoc., 19(2): 263-274.

Lee, J.; and Raginsky, M. 2018. Minimax Statistical Learning with Wasserstein distances. In NeurIPS, 2692-2701.

Li, D.; Yang, Y.; Song, Y.; and Hospedales, T. M. 2018a. Learning to Generalize: Meta-Learning for Domain Generalization. In AAAI, 3490-3497.

Li, H.; Pan, S. J.; Wang, S.; and Kot, A. C. 2018b. Domain Generalization With Adversarial Feature Learning. In CVPR, 5400-5409.

Li, Y.; Tian, X.; Gong, M.; Liu, Y.; Liu, T.; Zhang, K.; and Tao, D. 2018c. Deep Domain Generalization via Conditional Invariant Adversarial Networks.
In ECCV, volume 11219, 647-663.

Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR.

Montasser, O.; Hanneke, S.; and Srebro, N. 2019. VC Classes are Adversarially Robustly Learnable, but Only Improperly. In COLT, 2512-2530.

Nagarajan, V.; Andreassen, A.; and Neyshabur, B. 2021. Understanding the failure modes of out-of-distribution generalization. In ICLR.

Oliveira, R. I.; Orenstein, P.; Ramos, T.; and Romano, J. V. 2022. Split Conformal Prediction for Dependent Data. arXiv:2203.15885.

Patrick, B. 2008. Probability and measure. John Wiley & Sons.

Shafer, G.; and Vovk, V. 2008. A Tutorial on Conformal Prediction. J. Mach. Learn. Res., 9: 371-421.

Shen, Z.; Liu, J.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; and Cui, P. 2021. Towards Out-Of-Distribution Generalization: A Survey. CoRR, abs/2108.13624.

Solomon, K.; and Richard A, L. 1951. On information and sufficiency. The Annals of Mathematical Statistics, 22(1): 79-86.

Sun, B.; and Saenko, K. 2016. Deep CORAL: Correlation Alignment for Deep Domain Adaptation. In ECCV, volume 9915, 443-450.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing properties of neural networks. In ICLR.

Tibshirani, R. J.; Barber, R. F.; Candès, E. J.; and Ramdas, A. 2019. Conformal Prediction Under Covariate Shift. In NeurIPS, 2526-2536.

Vovk, V. 2013. Conditional validity of inductive conformal predictors. Mach. Learn., 92(2-3): 349-376.

Vovk, V.; Gammerman, A.; and Shafer, G. 2005. Algorithmic Learning in a Random World. Springer-Verlag. ISBN 0387001522.

Wang, Q.; Wang, Y.; Zhu, H.; and Wang, Y. 2022. Improving Out-of-Distribution Generalization by Adversarial Training with Structured Priors. CoRR, abs/2210.06807.

Xiaohong, C.; Lars Peter, H.; and Marine, C. 2010. Nonlinearity and temporal dependence. Journal of Econometrics, 155: 155-169.

Xin, S.; Wang, Y.; Su, J.; and Wang, Y. 2022. Domain-wise Adversarial Training for Out-of-Distribution Generalization.

Zou, X.; and Liu, W. 2023. On the Adversarial Robustness of Out-of-distribution Generalization Models. In Thirty-seventh Conference on Neural Information Processing Systems.