Published as a conference paper at ICLR 2022

PAC PREDICTION SETS UNDER COVARIATE SHIFT

Sangdon Park (Dept. of Computer & Info. Science, PRECISE Center, University of Pennsylvania; sangdonp@seas.upenn.edu), Edgar Dobriban (Dept. of Statistics & Data Science, The Wharton School, University of Pennsylvania; dobriban@wharton.upenn.edu), Insup Lee (Dept. of Computer & Info. Science, PRECISE Center, University of Pennsylvania; lee@cis.upenn.edu), Osbert Bastani (Dept. of Computer & Info. Science, PRECISE Center, University of Pennsylvania; obastani@seas.upenn.edu)

ABSTRACT

An important challenge facing modern machine learning is how to rigorously quantify the uncertainty of model predictions. Conveying uncertainty is especially important when there are changes to the underlying data distribution that might invalidate the predictive model. Yet, most existing uncertainty quantification algorithms break down in the presence of such shifts. We propose a novel approach that addresses this challenge by constructing probably approximately correct (PAC) prediction sets in the presence of covariate shift. Our approach focuses on the setting where there is a covariate shift from the source distribution (where we have labeled training examples) to the target distribution (for which we want to quantify uncertainty). Our algorithm assumes it is given importance weights that encode how the probabilities of the training examples change under the covariate shift. In practice, importance weights typically need to be estimated; thus, we extend our algorithm to the setting where we are given confidence intervals for the importance weights. We demonstrate the effectiveness of our approach on covariate shifts based on DomainNet and ImageNet. Our algorithm satisfies the PAC constraint, and gives prediction sets with the smallest average normalized size among approaches that always satisfy the PAC constraint.

1 INTRODUCTION

A key challenge in machine learning is quantifying prediction uncertainty. A promising approach is via prediction sets, where the model predicts a set of labels instead of a single label. For example, prediction sets can be used by a robot to navigate while avoiding regions where the prediction set includes an obstacle, or in healthcare to notify a doctor if the prediction set includes a problematic diagnosis. Various methods based on this idea can provide probabilistic correctness guarantees (i.e., that the predicted set contains the true label with high probability) when the training and test distributions are equal; formally, they assume the observations are exchangeable (Vovk et al., 2005; Papadopoulos et al., 2002; Lei et al., 2015) or i.i.d. (Vovk, 2013; Park et al., 2020a; Bates et al., 2021). However, this assumption often fails to hold in practice due to covariate shift, i.e., where the input distribution changes but the conditional label distribution remains the same (Sugiyama et al., 2007; Quiñonero-Candela et al., 2009). These shifts can be either natural (e.g., changes in color and lighting) (Hendrycks & Dietterich, 2019) or adversarial (e.g., ℓ∞ attacks) (Szegedy et al., 2014). We consider unsupervised domain adaptation (Ben-David et al., 2007), where we are given labeled examples from the source domain, but only unlabeled examples from the target (covariate shifted) domain.
We propose an algorithm that constructs probably approximately correct (PAC) prediction sets under bounded covariate shifts (Wilks, 1941; Valiant, 1984), i.e., with high probability over the training data ("probably"), the prediction set contains the true label for test instances ("approximately correct"). Our algorithm uses importance weights to capture the likelihood of a source example under the target domain. When the importance weights are known, it uses rejection sampling (von Neumann, 1951) to construct the prediction sets. Often, the importance weights must be estimated, in which case we need to account for estimation error. When only given confidence intervals around the importance weights, our algorithm constructs prediction sets that are robust to this uncertainty.

[Figure 1: panels show (a) a prediction set example comparing the sets {brain} and {brain, fish, lion, lollipop, sea turtle}, (b) the size and error tradeoff, and (c) results across many shifts.] Figure 1: (a) An example of a covariate-shifted image where an existing approach, PS (Park et al., 2020a), constructs prediction sets that do not contain the true label (i.e., "sea turtle"); in contrast, our proposed approach PS-W does. (b) The red curve shows the tradeoff between size and error achieved by different choices of τ on a single shift; the goal is to be as far to the left as possible without crossing the desired error bound ε = 0.1, shown as the black dashed line. The existing approaches PS and WSCI (Tibshirani et al., 2019) fail to satisfy the desired error bound due to covariate shift, whereas our approach satisfies it. (c) Our approach satisfies the error bound over nine covariate shifts on DomainNet and ImageNet, whereas existing approaches do not.

We evaluate our approach in two settings. First, we consider rate shift, where the model is trained on a broad domain but deployed on a narrow domain, e.g., an autonomous car may be trained on both daytime and nighttime images but operate at night. Even if the model continues to perform well, this covariate shift may invalidate the prediction sets. We show that our approach constructs valid prediction sets under rate shifts constructed from DomainNet (Peng et al., 2019), whereas several existing approaches do not. Second, we consider support shift, where the model is trained on one domain, but evaluated on another domain with shifted support. We train ResNet101 (He et al., 2016) using unsupervised domain adaptation (Ganin et al., 2016) on both ImageNet-C synthetic perturbations (Hendrycks & Dietterich, 2019) and PGD adversarial perturbations (Madry et al., 2017) to ImageNet (Russakovsky et al., 2015). Our PS-W algorithm satisfies the PAC constraint, and gives prediction sets with the smallest average normalized size among approaches that always satisfy the PAC constraint. See Figure 1 for a summary of our approach and results.

Related work. Our work is related to conformal prediction (Vovk et al., 2005; Balasubramanian et al., 2014) without a shift, where the goal is to construct models that predict sets of labels designed to include the true label with high probability (instead of predicting individual labels), under the assumption that the source and target distributions are the same.
In particular, our setting is related to inductive (or split) conformal prediction (Papadopoulos et al., 2002; Vovk, 2013; Lei et al., 2015), where the training set is split into a proper training set used to train a traditional predictive model, and a calibration set used to construct the prediction sets. The general approach is to train a model f : X Y R that outputs conformity scores, and then to choose a threshold τ R that satisfies a correctness guarantee, where the corresponding prediction sets are C(x) = {y Y | f(x, y) τ} (Park et al., 2020a; Gupta et al., 2021); this notation is defined in Section 2. Several kinds of correctness guarantees under no shift have been considered. One possibility is input conditional validity (Vovk, 2013; Barber et al., 2020), which ensures correctness for all future covariates x, with high probability only over the conditional label distribution p(y | x). This guarantee is very strong, and hard to ensure in practice. A weaker notion is approximate (Lei & Wasserman, 2014; Barber et al., 2020) or group (Romano et al., 2019) conditional validity, which ensures correctness with high probability over p(y | x) as well as some distribution p(x) over a subgroup. Finally, unconditional validity ensures only correctness over the joint distribution p(x, y). We focus on unconditional validity, though our approach extends readily to group conditional validity. Published as a conference paper at ICLR 2022 A separate issue, arising in the no shift setting, is how to condition on the calibration set Z. A conventional goal is unconditional validity, over the distribution p(x, y, Z). We refer to this strategy as fully unconditional validity. However, the guarantee we consider uses a separate confidence level for the training data, which is called a training conditional guarantee (Vovk, 2013); this correctness notion is equivalent to a PAC correctness guarantee (Park et al., 2020a), and is also equivalent to the standard content guarantee with a certain confidence level for tolerance regions (Wilks, 1941; Fraser, 1956). We build on Park et al. (2020a), which formulates the problem of choosing τ as learning a binary classifier where the input and parameter spaces are both one-dimensional; thus, the correctness guarantee corresponds to a PAC generalization bound. This approach is widely applicable since it can work with a variety of objectives (Bates et al., 2021). Recent work has extended inductive conformal prediction to a setting with covariate shift (Tibshirani et al., 2019; Lei & Cand es, 2020); similarly, Podkopaev & Ramdas (2021) considers conformal prediction under label shift (Lipton et al., 2018), i.e., assuming the conditional probabilities p(x | y) do not change. These approaches start from the assumption that the true importance weights (IWs) are known (Tibshirani et al., 2019; Podkopaev & Ramdas, 2021), or assume some properties of the estimated IWs (Lei & Cand es, 2020), whereas our approach considers a special form of ambiguity in the estimated IWs. Furthermore, they are focused on fully unconditional validity, whereas we obtain PAC prediction sets. In addition, Cauchois et al. (2020) designs confidence sets that are robust to all distribution shifts with bounded f-divergence; in contrast, we consider the unsupervised learning setting where we have unlabeled examples from the target distribution. We provide additional related work in Appendix A. 
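To make the inductive (split) conformal construction described above concrete, the following is a minimal sketch (ours, not the authors' code) of the standard split-conformal threshold rule under exchangeability; the function and variable names are illustrative assumptions.

```python
import numpy as np

def split_conformal_threshold(cal_scores: np.ndarray, eps: float) -> float:
    """Standard split conformal rule: choose tau so that a fresh exchangeable test point
    satisfies f(x, y) >= tau with probability at least 1 - eps (marginally over the data)."""
    m = len(cal_scores)
    k = int(np.floor(eps * (m + 1)))           # how many calibration scores may fall below tau
    if k < 1:
        return -np.inf                         # too little data: include every label
    return float(np.sort(cal_scores)[k - 1])   # the k-th smallest calibration score f(x_i, y_i)

# Usage: C(x) = {y : f(x, y) >= tau}, with cal_scores[i] = f(x_i, y_i) on a held-out calibration set.
```

This marginal guarantee is the "fully unconditional" validity discussed above; the PAC (training-conditional) construction reviewed next instead conditions on the calibration set.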
2 BACKGROUND ON PAC PREDICTION SETS

We give background on PAC prediction sets (Park et al., 2020a); we also re-interpret this approach using Clopper-Pearson confidence intervals (Clopper & Pearson, 1934), which aids our analysis.

2.1 PAC PREDICTION SETS ALGORITHM

Let x ∈ X be covariates and y ∈ Y be labels. We consider a source distribution P over X × Y with probability density function (PDF) p(x, y).¹ A prediction set² is a set-valued function C : X → 2^Y.

Inputs. We are given a held-out calibration set S_m of i.i.d. samples (x_i, y_i) ∼ P for i ∈ [m] := {1, . . . , m}, written S_m ∼ P^m, and a score function f : X × Y → ℝ≥0. For example, f(x, y) can be a prediction of the probability that y is the label for x. The score function can be arbitrary, but score functions assigning higher scores to likely labels yield smaller prediction sets.

PAC prediction set. A PAC prediction set is a set-valued function C : X → 2^Y satisfying the following two conditions. First, C is approximately correct in the sense that its expected error (failing to contain the true label) is bounded by a given ε ∈ (0, 1), i.e.,
$$L_P(C) := \mathbb{E}_{(x,y)\sim P}\big[\mathbb{1}(y \notin C(x))\big] = \mathbb{P}_{(x,y)\sim P}\big[y \notin C(x)\big] \le \varepsilon.$$
Second, noting that C_{S_m} is constructed based on a calibration set S_m ∼ P^m, the condition that C_{S_m} is approximately correct must be satisfied with high probability, i.e., for given δ ∈ (0, 1), we have
$$\mathbb{P}_{S_m \sim P^m}\big[L_P(C_{S_m}) \le \varepsilon\big] \ge 1 - \delta.$$
Our goal is to devise an algorithm for constructing a PAC prediction set C. Letting C(x) = Y for all x ∈ X always satisfies both conditions above, but this extreme case would be uninformative if used as a prediction set. Therefore, we additionally want to minimize the expected size E[S(C(x))] of the prediction sets C(x), where S : 2^Y → ℝ≥0 is a size measure, which is application specific (e.g., the cardinality of a set in classification); however, our construction only relies on the monotonicity of the size measure with respect to the prediction set parameterization.

¹All quantities that we define are measurable with respect to a fixed σ-algebra on X × Y; for instance, p is the density induced by a fixed σ-finite measure. To be precise, we consider a probability measure P defined with respect to the base measure M on X × Y; then, p = dP/dM is the Radon-Nikodym derivative of P with respect to M. For classification, M is the product of a Lebesgue measure on X and a counting measure on Y.
²We use the term prediction set to denote both the set-valued function and a set output by this function.

Algorithm. To construct C, we first define the search space of possible prediction sets along with the size measure S. We parameterize C by a scalar τ ∈ T := ℝ≥0 as
$$C_\tau(x) = \{y \in Y \mid f(x, y) \ge \tau\},$$
i.e., τ is the threshold on f(x, y) above which we include y in C(x). Importantly, τ controls the tradeoff between size and expected error. The reason is that if τ_1 ≤ τ_2, then C_{τ_2}(x) ⊆ C_{τ_1}(x) for all x ∈ X. Thus, size is monotonically decreasing in τ, i.e., S(C_{τ_2}(x)) ≤ S(C_{τ_1}(x)) for all x ∈ X, and error is monotonically increasing in τ, i.e., L_P(C_{τ_1}) ≤ L_P(C_{τ_2}). See Figure 1b for an illustration, and Park et al. (2020a) and Gupta et al. (2021) for details. As a consequence, a typical goal is to construct C_τ that provably contains the true label with high probability, while empirically minimizing size (Vovk et al., 2005; Gupta et al., 2021). In our setting, we want to maximize τ (equivalently, minimize expected size) subject to the constraint that C_τ is PAC.
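As a concrete illustration of this parameterization (a minimal sketch of our own, not the authors' released code), the snippet below builds C_τ(x) from a score vector for a single input; the variable names are assumptions.

```python
import numpy as np

def prediction_set(scores: np.ndarray, tau: float) -> np.ndarray:
    """C_tau(x) = {y in Y : f(x, y) >= tau}, given the score vector f(x, .) for one input x."""
    return np.flatnonzero(scores >= tau)

# Toy example with softmax-like scores over 5 labels; lowering tau can only enlarge the set,
# which is exactly the size/error monotonicity in tau described above.
scores = np.array([0.62, 0.20, 0.10, 0.05, 0.03])
print(prediction_set(scores, tau=0.30))   # [0]
print(prediction_set(scores, tau=0.02))   # [0 1 2 3 4]
```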
Let $\bar L_{S_m}(C_\tau) := \sum_{(x,y)\in S_m} \mathbb{1}(y \notin C_\tau(x))$ be the empirical error count on S_m, and let $F(k; m, \varepsilon) = \sum_{i=0}^{k} \binom{m}{i}\,\varepsilon^i (1-\varepsilon)^{m-i}$ be the cumulative distribution function (CDF) of the binomial distribution Binom(m, ε) with m trials and success probability ε. In prior work, Park et al. (2020a) constructs C_τ̂ by solving the following problem:
$$\hat\tau = \max_{\tau\in T} \tau \quad \text{subj. to} \quad \bar L_{S_m}(C_\tau) \le k(m, \varepsilon, \delta), \qquad (1)$$
where $k(m, \varepsilon, \delta) = \max_{k\in\mathbb{N}\cup\{0\}} k$ subj. to $F(k; m, \varepsilon) \le \delta$. Intuitively, the PAC guarantee via this construction is related to the binomial distribution; $\bar L_{S_m}(C)$ has distribution Binom(m, L_P(C)) for a fixed C, since $\mathbb{1}(y \notin C(x))$ has a Bernoulli(L_P(C)) distribution when (x, y) ∼ P. Thus, k(m, ε, δ) defines a confidence interval such that if $\bar L_{S_m}(C) \le k(m, \varepsilon, \delta)$, then L_P(C) ≤ ε with probability at least 1 − δ. Below, we formalize this intuition by drawing a connection to the Clopper-Pearson confidence interval.

2.2 CLOPPER-PEARSON INTERPRETATION

We interpret (1) using the Clopper-Pearson (CP) upper bound $\bar\theta(k; m, \delta) \in [0, 1]$ (Clopper & Pearson, 1934; Park et al., 2021). This is an upper bound on the true success probability µ, constructed from a sample k ∼ Binom(m, µ), which holds with probability at least 1 − δ, i.e., $\mathbb{P}_{k\sim\mathrm{Binom}(m,\mu)}[\mu \le \bar\theta(k; m, \delta)] \ge 1 - \delta$, where
$$\bar\theta(k; m, \delta) := \inf\big(\{\theta \in [0, 1] \mid F(k; m, \theta) \le \delta\} \cup \{1\}\big). \qquad (2)$$
Intuitively, for a fixed C, $\bar L_{S_m}(C) \sim \mathrm{Binom}(m, L_P(C))$, so the true error L_P(C) is at most the CP upper bound $\bar\theta(\bar L_{S_m}(C); m, \delta)$ with probability at least 1 − δ. We can therefore choose τ so that this upper bound is at most ε. However, we need to account for the generalization error of our estimator. To this end, we have the following (see Appendix D.1 for a proof and Algorithm 2 in Appendix E for implementation details on the corresponding algorithm):

Theorem 1 Let $U_{\mathrm{CP}}(C, S_m, \delta) := \bar\theta(\bar L_{S_m}(C); m, \delta)$, where $\bar\theta$ is defined in (2). Let τ̂ be the solution of the following problem³:
$$\hat\tau = \max_{\tau\in T} \tau \quad \text{subj. to} \quad U_{\mathrm{CP}}(C_\tau, S_m, \delta) \le \varepsilon. \qquad (3)$$
Then, we have $\mathbb{P}_{S_m\sim P^m}[L_P(C_\tau) \le \varepsilon] \ge 1 - \delta$ for any τ ≤ τ̂.

³We consider a trivial solution τ = 0 when (3) is not feasible, but omit it here to avoid clutter.
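The construction in Theorem 1 reduces to a binomial tail bound plus a scan over τ. Below is a minimal sketch under the assumption that calibration scores f(x_i, y_i) and a τ grid are given; it uses the standard beta-quantile form of the Clopper-Pearson upper bound, and it is our illustration rather than the authors' Algorithm 2.

```python
import numpy as np
from scipy.stats import beta

def cp_upper(k: int, m: int, delta: float) -> float:
    """Clopper-Pearson upper bound: smallest theta with F(k; m, theta) <= delta (1 if none)."""
    if k >= m:
        return 1.0
    return float(beta.ppf(1.0 - delta, k + 1, m - k))

def ps_threshold(cal_scores: np.ndarray, eps: float, delta: float, taus) -> float:
    """Largest tau on the ascending grid with U_CP(C_tau, S_m, delta) <= eps, as in (3).
    cal_scores[i] = f(x_i, y_i), so C_tau errs on example i exactly when cal_scores[i] < tau."""
    m = len(cal_scores)
    tau_hat = 0.0
    for tau in taus:
        k = int(np.sum(cal_scores < tau))    # empirical error count on S_m
        if cp_upper(k, m, delta) <= eps:
            tau_hat = tau
        else:
            break                            # the error count is monotone in tau, so we can stop
    return tau_hat
```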
3 PAC PREDICTION SETS UNDER COVARIATE SHIFT

We extend the PAC prediction set algorithm described in Section 2 to the covariate shift setting. Our novel approach combines rejection sampling (von Neumann, 1951) with Theorem 1.

3.1 PROBLEM FORMULATION

We assume labeled training examples from the source distribution P are given, but want to construct prediction sets that satisfy the PAC property with respect to a (possibly) shifted target distribution Q. Both of these are distributions over X × Y; denote their PDFs by p(x, y) and q(x, y), respectively. The challenge is that we are only given unlabeled examples from Q.

Preliminaries and assumptions. We denote the likelihood ratio of the covariate distributions by w*(x) := q(x)/p(x), also called the importance weight (IW) of x. Our main assumption is the covariate shift condition, which says the label distributions are equal (i.e., p(y | x) = q(y | x)), while the covariate distributions may differ (i.e., we can have p(x) ≠ q(x)).

Inputs. We assume a labeled calibration set S_m consisting of i.i.d. samples (x_i, y_i) ∼ P (for i ∈ [m]) is given, along with a score function f : X × Y → ℝ≥0. For now, we also assume we have the true importance weights w*_i := w*(x_i) for each (x_i, y_i) ∈ S_m, as well as an upper bound b ≥ max_{x∈X} w*(x) on the IWs. In Section 3.3 and Appendix B, we describe how to estimate these quantities in a way that provides guarantees under smoothness assumptions on the distributions.⁴

⁴In practice, we can improve the stability of importance weights by considering source and target distributions induced by a feature mapping, e.g., learned using unsupervised domain adaptation (Park et al., 2020b).

Problem. Our goal is to construct C_{S_m} that is a PAC prediction set under Q, i.e., $\mathbb{P}_{S_m\sim P^m}[L_Q(C_{S_m}) \le \varepsilon] \ge 1 - \delta$, where $L_Q(C) := \mathbb{P}_{(x,y)\sim Q}[y \notin C(x)]$ is the error of C on Q, while empirically controlling its size.

3.2 REJECTION SAMPLING STRATEGY

Our strategy is to use rejection sampling to convert S_m, consisting of i.i.d. samples from P, into a labeled calibration set consisting of i.i.d. samples from Q.

Rejection sampling. Rejection sampling (von Neumann, 1951; Owen, 2013; Rubinstein & Kroese, 2016) is a technique for generating samples from a target distribution q(x) based on samples from a proposal distribution p(x). Given importance weights (IWs) w*(x) and an upper bound b ≥ max_{x∈X} w*(x), it constructs a set of i.i.d. samples from q(x). Typically, rejection sampling is used when the proposal distribution is easy to sample from compared to the target distribution. In contrast, we use it to convert samples from the source distribution into samples from the target distribution. In particular, our algorithm takes the proposal distribution to be the source covariate distribution P_X, and the target distribution to be our target covariate distribution Q_X. Let V_i ∼ U := Uniform([0, 1]) be i.i.d., let V := (V_1, . . . , V_m), and let w* := (w*_1, . . . , w*_m) with w*_i = w*(x_i) as defined before. Then, given S_m, it uses rejection sampling to construct a randomly chosen set of N samples
$$T_N(S_m, V, w^*, b) := \{(x_i, y_i) \in S_m \mid V_i \le w^*_i / b\} \qquad (4)$$
from Q_X. The expected number of samples is E[N] = m/b; thus, rejection sampling is more effective when the proposal distribution is similar to the target.

Rejection sampling Clopper-Pearson (RSCP) bound. Given T_N, our algorithm uses the CP bound to construct a PAC prediction set C for Q. Let σ_i := 1(V_i ≤ w*_i / b) indicate whether example (x_i, y_i) ∈ S_m is accepted, i.e., T_N(S_m, V, w*, b) = {(x_i, y_i) ∈ S_m | σ_i = 1}, where |T_N(S_m, V, w*, b)| = Σ_{i=1}^m σ_i. Then, it follows that U_RSCP bounds the error on Q, where
$$U_{\mathrm{RSCP}}(C, S_m, V, w^*, b, \delta) := U_{\mathrm{CP}}(C, T_N(S_m, V, w^*, b), \delta). \qquad (5)$$
Thus, we have the following (see Appendix D.3 for the proof, and Algorithm 4 in Appendix E for a detailed description of the PS-R algorithm that leverages this bound):

Theorem 2 Define U_RSCP as in (5). Let τ̂ be the solution of the following problem:
$$\hat\tau = \max_{\tau\in T} \tau \quad \text{subj. to} \quad U_{\mathrm{RSCP}}(C_\tau, S_m, V, w^*, b, \delta) \le \varepsilon. \qquad (6)$$
Then, we have $\mathbb{P}_{S_m\sim P^m,\, V\sim U^m}[L_Q(C_\tau) \le \varepsilon] \ge 1 - \delta$ for any τ ≤ τ̂.

Note that V is sampled only once for Algorithm 4, so the randomness in V can contribute to the failure of the correctness guarantee; however, the failure rate is controlled by δ.
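As a sketch of this rejection-sampling construction (ours, not the authors' Algorithm 4), the snippet below draws V once, forms T_N as in (4), and evaluates U_RSCP from (5) by reusing cp_upper from the previous sketch; here the true weights w_true and the bound b are assumed to be given.

```python
import numpy as np

def rscp_upper(cal_scores, w_true, b, V, tau, delta):
    """U_RSCP(C_tau, S_m, V, w*, b, delta): the CP upper bound on the rejection-sampled subset."""
    keep = V <= np.asarray(w_true) / b              # sigma_i = 1(V_i <= w*_i / b)
    kept_scores = np.asarray(cal_scores)[keep]      # scores of the examples in T_N(S_m, V, w*, b)
    k = int(np.sum(kept_scores < tau))              # errors of C_tau on T_N
    return cp_upper(k, len(kept_scores), delta)     # cp_upper from the earlier sketch

def ps_r_threshold(cal_scores, w_true, b, eps, delta, taus, seed=0):
    """PS-R-style search (Theorem 2): V is sampled once and shared across all tau."""
    V = np.random.default_rng(seed).uniform(size=len(cal_scores))
    tau_hat = 0.0
    for tau in taus:
        if rscp_upper(cal_scores, w_true, b, V, tau, delta) <= eps:
            tau_hat = tau
        else:
            break
    return tau_hat
```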
3.3 APPROXIMATE IMPORTANCE WEIGHTS

So far, we have assumed that the true importance weight w*(x) is known. Since in practice it needs to be estimated, we relax this assumption to only needing an uncertainty set of possible importance weights. This allows us to handle estimation error in the weights.

Problem. Let S^X_m be unlabeled calibration examples from the source distribution (i.e., the covariates in S_m), let T^X_n be n unlabeled calibration examples from the target distribution (denoted T^X_n ∼ Q^n_X), and let w* := (w*_1, ..., w*_m) ∈ ℝ^m be the vector of true importance weights w*_i := w*(x_i) for (x_i, y_i) ∈ S_m. Then, we assume an uncertainty set W ⊆ ℝ^m that contains w* with high probability, i.e.,
$$\mathbb{P}_{S^X_m \sim P^m_X,\, T^X_n \sim Q^n_X}\big[w^* \in W\big] \ge 1 - \delta_w, \qquad (7)$$
where δ_w ∈ (0, 1). We assume W has the form
$$W := \{w \in \mathbb{R}^m \mid \forall i \in [m],\ \underline{w}_i \le w_i \le \overline{w}_i\},$$
for some w̲_i and w̄_i. See the discussion on our choice of W in Appendix C.1.

Robust Clopper-Pearson bound. To construct a PAC prediction set C for Q, it suffices to bound the worst-case error over w ∈ W; i.e., we have the following (see Appendix D.4 for the proof):

Theorem 3 Suppose W satisfies (7). Define U_RSCP as in (5). Let τ̂ be the solution of the following:
$$\hat\tau = \max_{\tau\in T} \tau \quad \text{subj. to} \quad \max_{w\in W} U_{\mathrm{RSCP}}(C_\tau, S_m, V, w, b, \delta_C) \le \varepsilon. \qquad (8)$$
Then, we have $\mathbb{P}_{S_m\sim P^m,\, V\sim U^m,\, T^X_n\sim Q^n_X}[L_Q(C_\tau) \le \varepsilon] \ge 1 - \delta_C - \delta_w$ for any τ ≤ τ̂.

A key challenge in applying Theorem 3 is solving the maximization over w ∈ W. We propose a simple greedy algorithm that achieves the maximum.

Greedy algorithm for robust U_RSCP. The RSCP bound U_RSCP satisfies certain monotonicity properties that enable us to efficiently compute an upper bound on the maximum in (8). In particular, if C makes an error on (x_i, y_i) (i.e., y_i ∉ C(x_i)), then U_RSCP is monotonically non-decreasing in w*_i = w*(x_i); intuitively, this holds since a larger w*_i increases the probability that (x_i, y_i) is included in T_N(S_m, V, w*, b), which in turn increases the empirical error $\bar L_{T_N(S_m,V,w^*,b)}(C)$. Conversely, if C does not make an error on (x_i, y_i) (i.e., y_i ∈ C(x_i)), then U_RSCP is non-increasing in w*_i. More formally, we have the following result (see Appendix D.5 for a proof):

Lemma 1 For any i ∈ [m], U_RSCP(C, S_m, V, w*, b, δ) is monotonically non-decreasing in w*_i if y_i ∉ C(x_i), and monotonically non-increasing in w*_i if y_i ∈ C(x_i).

Our greedy algorithm leverages the monotonicity of U_RSCP. In particular, given W and C, the choice ŵ := (ŵ_1, . . . , ŵ_m), where
$$\hat w_i = \begin{cases} \overline{w}_i & \text{if } y_i \notin C(x_i) \\ \underline{w}_i & \text{if } y_i \in C(x_i) \end{cases} \qquad (\forall i \in [m]), \qquad (9)$$
achieves the maximum over w ∈ W of the constraint U_RSCP(C_τ, S_m, V, w, b, δ_C) in (8), i.e.,
$$\max_{w\in W} U_{\mathrm{RSCP}}(C, S_m, V, w, b, \delta) = U_{\mathrm{RSCP}}(C, S_m, V, \hat w, b, \delta). \qquad (10)$$
Thus, we have the following, which follows from (10) and the same argument as Theorem 3:

Theorem 4 Suppose W satisfies (7). Define ŵ_{τ,S^X_m,T^X_n} as in (9), making the dependence on τ, S^X_m, and T^X_n explicit, and define U_RSCP as in (5). Let τ̂ be the solution of the following problem:
$$\hat\tau = \max_{\tau\in T} \tau \quad \text{subj. to} \quad U_{\mathrm{RSCP}}(C_\tau, S_m, V, \hat w_{\tau,S^X_m,T^X_n}, b, \delta_C) \le \varepsilon. \qquad (11)$$
Then, we have $\mathbb{P}_{S_m\sim P^m,\, V\sim U^m,\, T^X_n\sim Q^n_X}[L_Q(C_\tau) \le \varepsilon] \ge 1 - \delta_C - \delta_w$ for any τ ≤ τ̂.⁵

⁵We assume b is given for simplicity, but our approach estimates it; see Appendix C.2 for details.
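A minimal sketch of the greedy choice (9) and the resulting robust search (11), mirroring the PS-ROBUST procedure of Algorithm 1 below; it reuses cp_upper from the earlier sketch, and w_lo / w_hi are assumed to hold the interval endpoints [w̲_i, w̄_i] for each labeled calibration example. This is our illustration, not the released implementation.

```python
import numpy as np

def greedy_weights(cal_scores, tau, w_lo, w_hi):
    """Eq. (9): the upper endpoint for calibration points C_tau gets wrong, the lower otherwise."""
    err = np.asarray(cal_scores) < tau
    return np.where(err, w_hi, w_lo)

def ps_robust(cal_scores, w_lo, w_hi, b, eps, delta_C, taus, seed=0):
    """Grid search for the largest tau whose worst-case RSCP bound is at most eps (Eq. 11)."""
    cal_scores = np.asarray(cal_scores)
    V = np.random.default_rng(seed).uniform(size=len(cal_scores))   # sampled once, shared by all tau
    tau_hat = 0.0
    for tau in taus:
        w_hat = greedy_weights(cal_scores, tau, w_lo, w_hi)
        kept = cal_scores[V <= w_hat / b]            # rejection sampling with the worst-case weights
        k = int(np.sum(kept < tau))
        if cp_upper(k, len(kept), delta_C) <= eps:   # cp_upper from the earlier sketch
            tau_hat = tau
        else:
            break
    return tau_hat
```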
In particular, we heuristically partition the feature space of the score function f into K bins by using a probabilistic classifier g that separates source and target examples, and then estimate the source and target covariate distributions p(x) and q(x) based on the smoothness assumption over the distributions, where the degree of the smoothness is parameterized by E. Then, we can construct the uncertainty set W that satisfies the specified guarantee in (7). See Appendix B for details. Algorithm. Our algorithm, called PS-W, is detailed in Algorithm 1; it solves (20) and also performs importance weight estimation according to (17) in Appendix B. In particular, Algorithm 1 constructs a prediction set that satisfies the PAC guarantee in Theorem 4. See Section 4.1 for our choice of parameters (e.g., K, E, and grid search parameters). 4 EXPERIMENTS We show the efficacy of our approach on rate and support shifts on Domain Net and Image Net. 4.1 EXPERIMENTAL SETUP Models. For each source-target distribution pair, we split each the labeled source data and unlabeled target data into train and calibration sets. We use a separate labeled test set from the target for evaluation. For each shift from source to target, we use a deep neural network score function f based on Res Net101 (He et al., 2016), trained using unsupervised domain adaptation based on the source and target training sets. See Appendix F for details, including data split. Prediction set construction. To construct our prediction sets, we first estimate IWs by training a probabilistic classifier g using the source and target training sets. Next, we use g to construct heuristic IWs w(x) = 1/g(s = 1 | x) 11, where s = 1 if x is from source. Then, we estimate the lower and upper bound of the true IWs using Theorem 5 with E = 0.001 and K = 10 bins (chosen to contain equal numbers of heuristic IWs), where we compute the lower and upper Clopper-Pearson interval using the source and target calibration sets. Furthermore, given a confidence level δ, we use δC = δw = δ/2. For the grid search in line 12 of Algorithm 1, we increase τ by 10 7 until the bound URSCP exceeds 1.5ε. Finally, we evaluate the prediction set error on the labeled target test set. Baselines. We compare the proposed method in Algorithm 1 (PS-W) with the following: PS: The prediction set using Algorithm 2 that satisfies the PAC guarantee from Theorem 1, based on Park et al. (2020a), which makes the i.i.d. assumption. PS-C: A Clopper-Pearson method addressing covariate shift by a conservative upper bound on the empirical loss (see Appendix E.2 for details), resulting in Algorithm 3 that solves the following: ˆτ = arg max τ T τ subj. to UCP(Cτ, Sm, δ) ε/b. WSCI: Weighted split conformal inference, proposed in (Tibshirani et al., 2019). Under the exchangeability assumption (which is somewhat weaker than our i.i.d. assumption), this approach Published as a conference paper at ICLR 2022 PS WSCI PS-C PS-W 0.00 prediction set error PS WSCI PS-C PS-W 0 prediction set size (a) Comparison PS-R PS-M PS-W 0.00 prediction set error PS-R PS-M PS-W 0 (b) Ablation study Figure 2: The prediction set error and size over 100 random trials under synthetic shift from Image Net to Image Net-C13. Parameters are m = 20, 000, ε = 0.1, and δ = 10 5. (a) and (b) shows the box plots for comparison and ablation study, respectively. provides correctness guarantees only for a single future test instance, i.e., it includes an ε confidence level (denoted α in their paper) but not δ. 
Also, it assumes the true IWs are known; following their experiments, we use the heuristic IWs constructed using g. PS-R: The prediction set using Algorithm 4 that satisfies the PAC guarantee from Theorem 2 with the heuristic IWs constructed using a probabilistic classifier g. PS-M: The prediction set using Algorithm 5, which is identical to PS-R except that PS-M estimates IWs heuristically using histogram density estimation and a calibration set. In particular, we partition the IW values into bins, project source and target calibration examples into bins based on the value of the probabilistic classifier g, and estimate the source density ˆp B and target density ˆq B for each bin. The estimated importance weight is ˆw(x) = ˆq B(x)/ˆp B(x), where ˆp B and ˆq B are defined in (14) and (15), respectively. Metrics. We measure performance via the prediction set error and size on a held-out test set i.e., error is the fraction of (x, y) such that y C(x) and size is the average of |C(x)| over x. Rate shifts on Domain Net. We consider settings where the model is trained on data from a variety of domains, but is then deployed on a specific domain; we call such a shift rate shift. For instance, a self-driving car may be trained on both day and night images, but tested during the night time. While the model should still perform well since the target is a subset of the source, the covariate shift can nevertheless invalidate prediction set guarantees. We consider rate shifts on Domain Net (Peng et al., 2019), which consists of images of 345 classes from six domains (sketch, clipart, painting, quickdraw, real, and infograph); we use all domains as the source and each domain as a target. Support shifts on Image Net. Next, we consider support shifts, where the support of the target is different from the support of the source, but unsupervised domain adaptation is used to learn feature representations that align the two distributions (Ganin et al., 2016); then, the score function f is trained on this representation. First, we consider Image Net-C (Hendrycks & Dietterich, 2019), which modifies the original Image Net dataset (Russakovsky et al., 2015) using 15 synthetic perturbations with 5 severity levels. We use 13 perturbations, omitting snow and glass blur , which are computationally expensive to run. We consider the original Image Net dataset as the source, and the all synthetic perturbations on all of Image Net (denoted Image Net-C13) as the target. See Appendix F.3 for data split. Second, we consider adversarial shifts, where we generate adversarial examples for Image Net using the PGD attack (Madry et al., 2017) with 0.01 ℓ -norm perturbations with respect to a pretrained Res Net101. We consider the original Image Net as the source and the adversarially perturbed Image Net as the target. 4.2 EXPERIMENTAL RESULTS We summarize our results in Table 1, and provide more details in Figure 4 in Appendix G.2. Our PS-W algorithm satisfies the PAC constraint, and gives prediction sets with the smallest average normalized size among approaches that always satisfy the PAC constraint. Rate shifts on Domain Net. We use ε = 0.1 and δ = 10 5. As can be seen in Table 1, the prediction set error of our approach (PS-W) does not violate the error constraint ε = 0.1, while all other comparing approaches (except for PS-C) violate it for at least one of the shifts. 
While PS-C satisfies the desired bound, it is overly conservative; its prediction sets are significantly larger than necessary, making it less useful for uncertainty quantification. For the shift to Infograph in Table 1, PS-W achieves a relatively larger average prediction set size compared to other shifts; this is because the classification error of the score function f over Infograph is large (71.28%), whereas that over Sketch is 37.16%. Even with this poor score function, our proposed approach still satisfies the ε = 0.1 error constraint.

Table 1: Average prediction set error and sizes over 100 random trials under rate shifts on DomainNet (first six shifts) and support shifts on ImageNet (last two shifts). We denote an approach satisfying the ε error constraint by ✓, and by ✗ otherwise. The normalized size is the size divided by the total number of classes (i.e., 345 for DomainNet and 1000 for ImageNet). Parameters are m = 50,000 for DomainNet, m = 20,000 for ImageNet, ε = 0.1, and δ = 10⁻⁵. Our approach PS-W satisfies the ε constraint, while producing prediction sets with the smallest average normalized size among approaches that always satisfy the error constraint. See Appendix G.2 for box plots.

                    Baselines                                          Ablations                            Ours
                    PS              WSCI            PS-C               PS-R             PS-M               PS-W
                    error   size    error   size    error   size       error   size     error   size       error   size
All                 0.094   10.5    0.099   9.5     0.093   10.7       0.094   10.6     0.094   10.8       0.070   17.0
Sketch              0.142   13.1    0.116   18.6    0.020   141.7      0.097   28.2     0.105   26.1       0.078   40.3
Painting            0.159   15.4    0.113   30.0    0.025   125.4      0.096   37.7     0.103   34.5       0.076   52.8
Quickdraw           0.069   5.9     0.097   3.8     0.021   23.8       0.088   4.3      0.087   4.2        0.067   6.1
Real                0.079   8.7     0.087   7.2     0.032   47.8       0.080   8.7      0.087   7.1        0.068   11.8
Clipart             0.105   10.2    0.101   10.9    0.000   345.0      0.080   19.4     0.086   14.8       0.060   25.7
Infograph           0.363   36.4    0.114   165.1   0.000   345.0      0.085   202.6    0.107   177.4      0.078   216.4
ImageNet-PGD        0.090   5.5     0.096   4.7     0.000   1000.0     0.000   1000.0   0.074   7.8        0.049   13.9
ImageNet-C13        0.127   9.3     0.111   67.0    0.000   1000.0     0.000   1000.0   0.095   15.9       0.061   35.8
Mean normalized size: 0.0338, 0.0257, 0.0047

Finally, while our approach PS-W generally performs best subject to satisfying the error constraint, we note that our ablations PS-R and PS-M also perform well, providing different tradeoffs. First, the performance of PS-W is significantly more reliable than PS-R, but in some cases PS-R performs better (e.g., it produces slightly smaller prediction sets on rate shifts but significantly larger sets on support shifts). Alternatively, PS-M consistently produces smaller prediction sets, though it sometimes violates the ε error constraint.

Support shifts on ImageNet. We show results for synthetic and adversarial shifts in Table 1 (and Figure 4 in Appendix G.2). As can be seen, the error of our approach (PS-W) is below the desired level. PS-R performs poorly, likely due to the uncalibrated point IW estimation; we find that calibrated importance weights mitigate these issues, though accounting for uncertainty in the IWs is necessary for achieving the desired error rate; see Appendix G.3 for details. For adversarial shifts, the target classification error of the source-trained ResNet101 and the domain-adapted ResNet101 is 99.97% and 28.05%, respectively. Domain adaptation can significantly decrease the average error rate, but label predictions still do not have guarantees.
However, our prediction set controls the prediction set error rate as specified by ε. As shown in Figure 4, our prediction set function outputs a prediction set for a given example that includes the true label at least 90% of the time. Thus, downstream modules can rely on this guarantee for further decision-making. Ablation and sensitivity study. We conduct an ablation study on the effect of IW calibration and the smoothness parameter E for PS-W. We observe that the IW calibration considering uncertainty intervals around calibrated IWs is required for the PAC guarantee (e.g., Figure 2b). Also, we find that a broad range of E-s can be used to satisfy the PAC guarantee; we believe this result is due to the use of a domain adapted score function. See Appendices G.3 & G.5 for details. 5 CONCLUSION We propose a novel algorithm for building PAC prediction sets under covariate shift; it leverages rejection sampling and the Clopper-Pearson interval, and accounts for uncertain IWs. We demonstrate the efficacy of our approach on natural, synthetic, and adversarial covariate shifts on Domain Net and Image Net. Future work includes providing optimality guarantees on the prediction set size, rigorous estimation of the hyperparameter E, and incorporating probabilistic IW uncertainty estimates. Published as a conference paper at ICLR 2022 Ethics statement. We do not foresee any significant ethical issues with our work. One possible issue is that end-users may overly trust the prediction sets if they do not understand the limitations of our approach e.g., it is only a high probability guarantee. Reproducibility statement. Algorithms used for evaluation, including ours, are stated in Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4, and Algorithm 5, along with hyperparameters for algorithms in Section 4.1 and Appendix F. Related code is released6. The proof of our theorems are stated in Appendix D; the required assumption is stated in Assumption 1. ACKNOWLEDGMENTS This work was supported in part by ARO W911NF-20-1-0080, AFRL and DARPA FA8750-18-C-0090, NSF award CCF 1910769, and NSF award 2031895 on the Mathematical and Scientific Foundations of Deep Learning (Mo DL). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Air Force Research Laboratory (AFRL), the Army Research Office (ARO), the Defense Advanced Research Projects Agency (DARPA), or the Department of Defense, or the United States Government. The authors are grateful to Art Owen for helpful comments. Anastasios N Angelopoulos, Stephen Bates, Emmanuel J Cand es, Michael I Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. ar Xiv preprint ar Xiv:2110.01052, 2021. Vineeth Balasubramanian, Shen-Shyang Ho, and Vladimir Vovk. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014. Rina Foygel Barber, Emmanuel J. Cand es, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of distribution-free conditional predictive inference, 2020. Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I Jordan. Distribution-free, risk-controlling prediction sets. ar Xiv preprint ar Xiv:2101.02703, 2021. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pp. 137 144, 2007. Steffen Bickel, Michael Br uckner, and Tobias Scheffer. 
Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pp. 81 88. ACM, 2007. Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1 3, 1950. Emmanuel J. Cand es, Lihua Lei, and Zhimei Ren. Conformalized survival analysis, 2021. Maxime Cauchois, Suyash Gupta, Alnur Ali, and John C. Duchi. Robust validation: Confident predictions even when distributions shift, 2020. Victor Chernozhukov, Kaspar W uthrich, and Zhu Yinchu. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Conference On Learning Theory, pp. 732 749. PMLR, 2018. Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404 413, 1934. Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International conference on algorithmic learning theory, pp. 38 53. Springer, 2008. David R Cox. Two further applications of a model for binary regression. Biometrika, 45(3/4): 562 565, 1958. 6https://github.com/sangdon/pac-ps-w Published as a conference paper at ICLR 2022 Morris H De Groot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12 22, 1983. Adam Fisch, Tal Schuster, Tommi Jaakkola, and Regina Barzilay. Few-shot conformal prediction with auxiliary tasks, 2021. Donald Alexander Stuart Fraser. Nonparametric methods in statistics. John Wiley & Sons Inc, 1956. Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1 35, 2016. URL http://jmlr.org/ papers/v17/15-239.html. Isaac Gibbs and Emmanuel Cand es. Adaptive conformal inference under distribution shift, 2021. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. ar Xiv preprint ar Xiv:1706.04599, 2017. Chirag Gupta, Arun K. Kuchibhotla, and Aaditya K. Ramdas. Nested conformal prediction and quantile out-of-bag ensemble methods, 2021. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019. Ying Jin, Zhimei Ren, and Emmanuel J Cand es. Sensitivity analysis of individual treatment effects: A robust conformal inference approach. ar Xiv preprint ar Xiv:2111.12161, 2021. Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391 1445, 2009. Danijel Kivaranovic, Kory D Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep networks. In International Conference on Artificial Intelligence and Statistics, pp. 4346 4356. PMLR, 2020. Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pp. 3474 3482, 2015. Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. 
Accurate uncertainties for deep learning using calibrated regression. ar Xiv preprint ar Xiv:1807.00263, 2018. Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pp. 3792 3803, 2019. Donghwan Lee, Xinmeng Huang, Hamed Hassani, and Edgar Dobriban. T-cal: An optimal test for the calibration of predictive models. ar Xiv preprint ar Xiv:2203.01850, 2022. Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71 96, 2014. Jing Lei, Alessandro Rinaldo, and Larry Wasserman. A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74(1):29 43, 2015. Lihua Lei and Emmanuel J Cand es. Conformal inference of counterfactuals and individual treatment effects. ar Xiv preprint ar Xiv:2006.06138, 2020. Sarah Lichtenstein, Baruch Fischhoff, and Lawrence D Phillips. Calibration of probabilities: The state of the art. Decision making and change in human affairs, pp. 275 324, 1977. Published as a conference paper at ICLR 2022 Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International conference on machine learning, pp. 3122 3130. PMLR, 2018. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Ali Malik, Volodymyr Kuleshov, Jiaming Song, Danny Nemer, Harlan Seymour, and Stefano Ermon. Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314 4323, 2019. Robert G Miller. Statistical prediction by discriminant analysis. In Statistical Prediction by Discriminant Analysis, pp. 1 54. Springer, 1962. Allan H Murphy. Scalar and vector partitions of the probability score: Part i. two-state situation. Journal of Applied Meteorology, 11(2):273 282, 1972. Xuan Long Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861, Nov 2010. ISSN 1557-9654. doi: 10.1109/tit.2010.2068870. URL http: //dx.doi.org/10.1109/TIT.2010.2068870. Art B. Owen. Monte Carlo theory, methods and examples. 2013. Artidoro Pagnoni, Stefan Gramatovici, and Samuel Liu. Pac learning guarantees under covariate shift. ar Xiv preprint ar Xiv:1812.06393, 2018. Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pp. 345 356. Springer, 2002. Sangdon Park, Osbert Bastani, Nikolai Matni, and Insup Lee. Pac confidence sets for deep neural networks via calibrated prediction. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=BJx VI04Yv B. Sangdon Park, Osbert Bastani, James Weimer, and Insup Lee. Calibrated prediction with covariate shift via unsupervised domain adaptation. In The 23rd International Conference on Artificial Intelligence and Statistics, 2020b. Sangdon Park, Shuo Li, Insup Lee, and Osbert Bastani. PAC confidence predictions for deep neural network classifiers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Qk-Wq5AIjpq. 
Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1406 1415, 2019. John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 1999. Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. ar Xiv preprint ar Xiv:2103.03323, 2021. Dimitris N Politis. Model-free prediction in regression. In Model-Free Prediction and Regression, pp. 57 80. Springer, 2015. Hongxiang Qiu, Edgar Dobriban, and Eric Tchetgen Tchetgen. Distribution-free prediction sets adaptive to unknown covariate shift. ar Xiv preprint ar Xiv:2203.06126, 2022. Joaquin Qui nonero-Candela, Masashi Sugiyama, Neil D Lawrence, and Anton Schwaighofer. Dataset shift in machine learning. Mit Press, 2009. Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel J. Cand es. With malice towards none: Assessing uncertainty via equalized coverage, 2019. Published as a conference paper at ICLR 2022 Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume 10. John Wiley & Sons, 2016. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211 252, 2015. Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Muller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985 1005, 2007. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014. Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32:2530 2540, 2019. Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984. John von Neumann. Various techniques used in connection with random digits. In A. S. Householder, G. E. Forsythe, and H. H. Germond (eds.), Monte Carlo Method, volume 12 of National Bureau of Standards Applied Mathematics Series, chapter 13, pp. 36 38. US Government Printing Office, Washington, DC, 1951. Vladimir Vovk. Conditional validity of inductive conformal predictors. Machine learning, 92(2-3): 349 376, 2013. Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005. Ximei Wang, Mingsheng Long, Jianmin Wang, and Michael Jordan. Transferable calibration with lower bias and variance in domain adaptation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 19212 19223. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper/2020/file/df12ecd077efc8c23881028604dbb8cc-Paper.pdf. Samuel S Wilks. Determination of sample sizes for setting tolerance limits. The Annals of Mathematical Statistics, 12(1):91 96, 1941. Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning, pp. 11559 11569. PMLR, 2021. 
Yachong Yang, Arun Kumar Kuchibhotla, and Eric Tchetgen Tchetgen. Doubly robust calibration of prediction sets under covariate shift. ar Xiv preprint ar Xiv:2203.01761, 2022. Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In In Proceedings of the Eighteenth International Conference on Machine Learning. Citeseer, 2001. Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694 699. ACM, 2002. Published as a conference paper at ICLR 2022 A ADDITIONAL RELATED WORK Prediction sets under i.i.d. assumption. We built our proposed approach on the known PAC prediction set approach (Wilks, 1941; Park et al., 2020a) due to its simplicity and sample efficiency. As other candidates, nested conformal prediction (Vovk et al., 2005; Gupta et al., 2021) reinterprets multiple known conformal prediction approaches using a scalar parameteriztion of prediction sets; but different from Wilks (1941) and Park et al. (2020a), it is not known to provide PAC guarantees. Kivaranovic et al. (2020) provides a PAC style guarantee, but their approach is limited to regression and is sample-inefficient, i.e., it requires a sample of size O(1/ϵ2), while Park et al. (2020a) has a better sample complexity, as demonstrated in their paper. Prediction sets under various settings. Prior prediction set algorithms have been considered in several settings. First, traditional conformal prediction (Vovk et al., 2005) often considers the setting where labeled examples arrive sequentially from the same distribution; there has also been work extending conformal prediction to the setting where the distribution is time-varying (Politis, 2015; Chernozhukov et al., 2018; Xu & Xie, 2021; Gibbs & Cand es, 2021). Alternatively, there has been work on constructing risk-controlling prediction sets for supervised learning setting (Vovk, 2013; Park et al., 2020a; Bates et al., 2021; Angelopoulos et al., 2021). Finally, there has been recent work on conformal prediction in the meta-learning setting (Fisch et al., 2021); in particular, given a few labeled examples drawn from the new task, their approach leverages labeled examples from previous tasks to construct a conformal predictor for the new task, assuming the tasks are exchangeable. Developments after our work. After our work was made publicly available, Jin et al. (2021) has developed a different, robust conformal inference approach to constructing prediction sets with estimated weights under covariate shift. Their algorithm assumes given upper and lower bounds on the importance weights, and uses the worst-case quantile over all weights that satisfy the constraint to set the critical values. Further, Yang et al. (2022) have developed a doubly robust approach to construct prediction sets satisfying approximate marginal coverage under covariate shift (which can be robust to estimating the weights and per-covariate prediction error), leveraging semiparametric efficiency theory. Qiu et al. (2022) have developed a parallel approach for the PAC case. Calibration. 
An alternative way to quantify uncertainty is calibrated prediction (e.g., Brier, 1950; Cox, 1958; Miller, 1962; Murphy, 1972; Lichtenstein et al., 1977; De Groot & Fienberg, 1983; Guo et al., 2017, etc), which aims to ensure that among instances with a predicted confidence p, the model is correct a fraction p of the time. Techniques have been proposed to re-scale predicted confidences to improve calibration (Platt, 1999; Guo et al., 2017; Zadrozny & Elkan, 2001; 2002; Kuleshov & Liang, 2015; Kuleshov et al., 2018; Malik et al., 2019); including ones with theoretical guarantees (Kumar et al., 2019; Park et al., 2021) and ones that handle covariate shift (Park et al., 2020b; Wang et al., 2020). There are also methods to rigorously test calibration, dating back to Cox (1958); Miller (1962), see e.g., Lee et al. (2022) for a recent approach. These approaches provide a qualitatively different form of uncertainty quantification compared to the one we consider. Rejection sampling. Rejection sampling (sometimes accept-reject sampling) is a well-known technique (Owen, 2013; Rubinstein & Kroese, 2016) dating back at least to von Neumann (1951). We highlight that most work on covariate shift relies on importance weighting; our rejection sampling approach is relatively less common and more novel. In fact, to the best of our knowledge, only Pagnoni et al. (2018) has used rejection sampling for a different problem in this area. IW estimation. There has been a long line of work studying the problem of estimating importance weights (IWs), also called likelihood ratios, in a way that provides theoretical guarantees. For instance, (Nguyen et al., 2010) provides a convergence rate analysis of IW estimators under smoothness assumptions, they provide a finite sample bound on the Hellinger distance between the true IW and estimated IW. Next, (Kanamori et al., 2009) shows a similar finite sample guarantee, assuming the true IW can be represented as a linear combination of kernels. Finally, (Cortes et al., 2008) proposes non-parametric IW estimators, modeling the source and target distribution by histograms over clusters in sample space. The IW estimation approaches can be used in conjuction with prediction set construction. Compared to Cand es et al. (2021), which only guarantees asymptotic validity under certain assumptions on the estimated IWs, Theorem 4 provides a finite-sample correctness guarantee. Furthermore, we explicitly describe an algorithm for approximate IWs, required by Theorem 4, in Appendix B. Published as a conference paper at ICLR 2022 B IMPORTANCE WEIGHT ESTIMATION In general, to estimate the importance weights (IWs), some assumptions on their structure are required. A number of approaches have been proposed, with varying guarantees under different assumptions (Kanamori et al., 2009; Cortes et al., 2008; Nguyen et al., 2010; Lipton et al., 2018). We use a cluster-based approach (Cortes et al., 2008); our approach is compatible with any of these strategies if they can be modified to provide uncertainty estimates of the IWs. In our approach, given a partition X = SK j=1 Bj into bins, we can estimate the IWs based on the fractions of source and target samples in each bin. If the partition is sufficiently fine, then we can obtain confidence intervals around the estimated IWs with finite-sample guarantees. However, this strategy requires the number of bins in the partition to be exponential in the dimension of X. Thus, in practice, we use a heuristic to construct the partition. 
We describe the cluster-based approach and our partition construction heuristic below. B.1 CLUSTER-BASED APPROACH We assume given unlabeled calibration sets SX m and T X n , where SX m consists of i.i.d. samples xi p(x) for i [m], and T X n consists of i.i.d. samples xi q(x) for i [n], respectively7. Roughly speaking, the cluster-based strategy estimates the average IW in each bin Bj; assuming p and q are roughly constant in each bin, these accurately estimate the true IWs. Let j(x) be the bin containing x (i.e., x Bj(x)), and let p B(x) := pj(x) s.t. pj = Z Bj p(x ) dx and q B(x) := qj(x) s.t. qj = Z Bj q(x ) dx be the (unnormalized) approximations of the densities p and q, respectively, that are constant on each bin. We assume that p B and q B are accurate approximations: Assumption 1 Given E R 0, the partition satisfies Z Bj |p(x) p(x )|dx E and Z Bj |q(x) q(x )|dx E (j [K], x Bj). (12) Thus, p and q are roughly constant on the partitions. In general, (12) can hold for any E R>0 if p and q are Lipschitz continuous and each Bj is sufficiently small (see Appendix B.3 for discussion). Then, under Assumption 1, it can be verified that |v(x) p(x) p B(x)| E and |v(x) q(x) q B(x)| E ( x X), (13) where v(x) = vj(x), and vj = R Bj dx is the volume of bin Bj (see the proof of Theorem 5 for the validity of (13)). Next, we have the following empirical estimates of p B and q B, respectively: ˆp B(x) := ˆpj(x) s.t. ˆpj = 1 x SX m 1 (x Bj) and (14) ˆq B(x) := ˆqj(x) s.t. ˆqj = 1 x T X n 1 (x Bj) . (15) Now, 1(x Bj) has distribution Bernoulli(pj) when x P, thus m ˆpj has distribution Binom(m, pj), and pj is contained in a Clopper-Pearson interval around ˆpj with high probability; in particular, let θ be the Clopper-Pearson lower bound corresponding to the Clopper-Pearson upper bound defined in Section 2.2, i.e., Pk Binom(m,µ)[µ θ(k; m, δ)] 1 δ. Then, we have θ(m ˆpj; m, δ ) pj θ(m ˆpj; m, δ ) (16) with probability at least 1 δ with respect to the samples SX m. Combining (13) and (16), we have the following result (and see Appendix D.6 for a proof): 7We can use the same calibration set to construct IWs and prediction sets due to the union bound in Theorem 3. Published as a conference paper at ICLR 2022 Theorem 5 Letting δ = δw/(2K) and [v]+ := max{0, v} for all v R, we have w(x) := [θ (n ˆq B(x); n, δ ) E]+ θ (m ˆp B(x); m, δ ) + E w (x) w(x) := θ (n ˆq B(x), n, δ ) + E [θ (m ˆp B(x), m, δ ) E]+ ( x X) with probability at least 1 δw over SX m and T X n . We use these upper and lower bounds on the IWs as the inputs w and w to our algorithm i.e., W = {w : X R | x X, w(x) w(x) w(x)} , and use b = maxx X w(x) = maxj [K] wj as the maximum IW. If b is known, we need importance weights associated with source calibration samples SX m; thus we use a simpler form of W as follows: W = {w Rm | i [m], wi wi wi} , where wi := w(xi) and wi := w(xi) for xi SX m. Theorem 5 under Assumption 1 is one way to construct uncertainty set W that contains the true IWs. Corollary 1 Given K N, E R 0, and δw (0, 1), suppose Assumption 1 is satisfied and W is constructed using Theorem 5. Then, we have PSX m P m X ,T X n Qn X [w W] 1 δw. B.2 PARTITION CONSTRUCTION HEURISTIC In general, exponentially many bins are needed to guarantee Assumption 1. 
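The interval construction of Theorem 5 amounts to per-bin Clopper-Pearson bounds on the source and target bin probabilities. The sketch below is our illustration under the stated assumptions, with E the smoothness slack; the bin assignments src_bin and tgt_bin are assumed to come from the heuristic partition described next (B.2).

```python
import numpy as np
from scipy.stats import beta

def cp_bounds(k: int, m: int, delta: float):
    """One-sided Clopper-Pearson lower/upper bounds for a Binom(m, mu) observation k."""
    lo = 0.0 if k == 0 else float(beta.ppf(delta, k, m - k + 1))
    hi = 1.0 if k == m else float(beta.ppf(1.0 - delta, k + 1, m - k))
    return lo, hi

def iw_intervals(src_bin, tgt_bin, n_bins, delta_w, E=0.0):
    """Per-bin IW interval [w_lo_j, w_hi_j] following the form in Theorem 5, with
    delta' = delta_w / (2K); src_bin / tgt_bin are bin indices of the unlabeled
    source / target calibration covariates."""
    m, n = len(src_bin), len(tgt_bin)
    d = delta_w / (2 * n_bins)
    w_lo, w_hi = np.zeros(n_bins), np.zeros(n_bins)
    for j in range(n_bins):
        p_lo, p_hi = cp_bounds(int(np.sum(src_bin == j)), m, d)
        q_lo, q_hi = cp_bounds(int(np.sum(tgt_bin == j)), n, d)
        w_lo[j] = max(q_lo - E, 0.0) / (p_hi + E)
        w_hi[j] = (q_hi + E) / max(p_lo - E, 1e-12)   # guard: the upper bound is vacuous if p_lo <= E
    return w_lo, w_hi
```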
B.2 PARTITION CONSTRUCTION HEURISTIC

In general, exponentially many bins are needed to guarantee Assumption 1. Instead, we consider an intuitive heuristic for constructing these bins so that the importance weights $w(x)$, rather than the density functions $p(x)$ and $q(x)$ individually, are roughly constant on each bin, inspired by (Park et al., 2021; 2020b). A standard heuristic for estimating IWs is to train a probabilistic classifier $g(s \mid x)$ to distinguish source and target training examples, and then use these probabilities to construct the IWs. In particular, define the distribution
$$g^*(x, s) = \tfrac{1}{2}\,p(x)\,\mathbb{1}(s = 1) + \tfrac{1}{2}\,q(x)\,\mathbb{1}(s = 0).$$
Then, letting $g^*(s \mid x)$ be the conditional distribution, we have $w^*(x) = 1/g^*(s = 1 \mid x) - 1$ (Bickel et al., 2007). Thus, we train $g(s \mid x) \approx g^*(s \mid x)$ and construct bins according to $\hat{w}(x) = 1/g(s = 1 \mid x) - 1$, i.e., $B_j = \{x \in \mathcal{X} \mid \hat{w}(x) \in [w_j, w_{j+1})\}$, where $0 = w_1 \le w_2 \le \dots \le w_{K+1} = \infty$. Finally, we describe how to train $g$. Let $S_{m'}^X$ and $T_{n'}^X$ be unlabeled training examples from the source and target, respectively; then, the set $R_{m',n'}^X = \{(x, 1) \mid x \in S_{m'}^X\} \cup \{(x, 0) \mid x \in T_{n'}^X\}$ consists of i.i.d. samples $(x, s) \sim g^*(x, s)$. Thus, we can train $g$ on $R_{m',n'}^X$ using supervised learning. In practice, the corresponding IW estimates $\hat{w}(x)$ can be inaccurate, partly since $\hat{w}$ is likely overfit to $R_{m',n'}^X$, which is why re-estimating the IWs in each bin according to Theorem 5 remains necessary (a short code sketch of this heuristic is given at the end of this appendix).

B.3 DENSITY ESTIMATION SATISFYING ASSUMPTION 1

We consider the following binning strategy that satisfies Assumption 1 for Lipschitz continuous PDFs. In particular, assume the PDFs $p(x)$ and $q(x)$ are $L$-Lipschitz continuous for some norm $\|\cdot\|$. Then, for any given $E$, construct each bin $B_j$ such that
$$\|x - x'\| \le \frac{E}{L\,v_j} \qquad \forall x, x' \in B_j,$$
where $v_j$ is the volume of bin $B_j$. Then, we know that
$$|p(x) - p(x')| \le L\,\|x - x'\| \le \frac{E}{v_j} \quad\Longrightarrow\quad \int_{B_j} |p(x) - p(x')|\,dx' \le E.$$
Similarly, we have $\int_{B_j} |q(x) - q(x')|\,dx' \le E$. Thus, the bins satisfy Assumption 1. Finally, assuming $\|\cdot\|$ is the $L_\infty$ norm, we describe one way to construct bins. In particular, construct bins by taking an $\varepsilon$-net with $\varepsilon = (E/L)^{1/(d+1)}$, where $d$ is the dimension of $\mathcal{X}$. Then, we have
$$\|x - x'\| \le \varepsilon = \frac{\varepsilon^{d+1}}{\varepsilon^d} = \frac{E}{L\,v_j},$$
as desired.
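The following is a minimal sketch of the B.2 heuristic: fit a source-vs-target classifier $g(s \mid x)$, map examples to the weight estimate $\hat{w}(x) = 1/g(s{=}1 \mid x) - 1$, and form equal-mass bins in that one-dimensional score. The scikit-learn logistic regression, the equal-mass bin edges, and all names are illustrative assumptions; the experiments in Appendix F.1 instead use a small neural domain classifier on ResNet101 features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binning_heuristic(src_feats, tgt_feats, K=10):
    """Train g(s | x) on source-vs-target data and return (w_hat, bin_index).

    src_feats, tgt_feats: arrays of shape (num_samples, num_features).
    Bins are equal-mass in the estimated weight w_hat(x) = 1 / g(s=1 | x) - 1.
    """
    X = np.vstack([src_feats, tgt_feats])
    s = np.concatenate([np.ones(len(src_feats)), np.zeros(len(tgt_feats))])
    g = LogisticRegression(max_iter=1000).fit(X, s)

    def w_hat(feats):
        p_src = np.clip(g.predict_proba(feats)[:, 1], 1e-6, 1 - 1e-6)
        return 1.0 / p_src - 1.0

    # Equal-mass bin edges computed from the source weights; outer edges are open.
    edges = np.quantile(w_hat(src_feats), np.linspace(0, 1, K + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    def bin_index(feats):
        return np.clip(np.searchsorted(edges, w_hat(feats), side="right") - 1, 0, K - 1)

    return w_hat, bin_index
```

Downstream, `bin_index` would supply the bin indices consumed by the `binned_iw_bounds` sketch above, which re-estimates the per-bin IWs with Clopper-Pearson intervals, as B.2 notes is necessary.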
C ADDITIONAL DISCUSSION ON IMPORTANCE WEIGHTS

C.1 APPROXIMATE IMPORTANCE WEIGHTS

In Section 3.3, we assume $W$ has the following form:
$$W := \{w \in \mathbb{R}^m \mid \forall i \in [m],\; \underline{w}_i \le w_i \le \overline{w}_i\}.$$
However, considering the fact that the expected importance weight is one, i.e., $\mathbb{E}_{x \sim P}[w^*(x)] = 1$, the uncertainty set that contains the true importance weights with high probability can be further constrained as follows:
$$W' := \left\{w \in \mathbb{R}^m \,\middle|\, \forall i \in [m],\; \underline{w}_i \le w_i \le \overline{w}_i,\;\; \underline{c} \le \sum_{i=1}^m w_i \le \overline{c}\right\}$$
for some $\underline{c}$ and $\overline{c}$. In particular, using Hoeffding's inequality, for example, we can estimate $\underline{c}$ and $\overline{c}$ such that $w_1, \dots, w_m$ are consistent with the true importance weight $w^*$, which satisfies the mass constraint $\mathbb{E}_{x \sim P}[w^*(x)] = 1$. Recall that we have to find a maximizer in the uncertainty set as in (8); however, due to the additional constraint on $\sum_{i=1}^m w_i$ in $W'$, it is challenging to solve the maximization problem exactly.

C.2 MAXIMUM IMPORTANCE WEIGHT ESTIMATION

To generalize our approach to estimate the maximum importance weight $b$, we redefine the uncertainty set $W$ over importance weights as follows:
$$W := \{w : \mathcal{X} \to \mathbb{R} \mid \forall x \in \mathcal{X},\; \underline{w}(x) \le w(x) \le \overline{w}(x)\}$$
for some $\underline{w} : \mathcal{X} \to \mathbb{R}$ and $\overline{w} : \mathcal{X} \to \mathbb{R}$ such that it contains the true importance weight $w^*$ with high probability, i.e.,
$$\mathbb{P}_{S_m^X \sim P_X^m,\, T_n^X \sim Q_X^n}[w^* \in W] \ge 1 - \delta_w. \quad (18)$$
Given this, the maximum importance weight is obtained as $\hat{b} = \max_{x \in \mathcal{X}} \overline{w}(x)$. Considering that $W$ can be estimated using binning as in Appendix B, the maximum importance weight can be rewritten as follows:
$$\hat{b} = \max_{x \in \mathcal{X}} \overline{w}(x) = \max_{j \in [K]} \overline{w}_j, \qquad \overline{w}_j := \frac{\overline{\theta}(n\hat{q}_j; n, \delta') + E}{[\underline{\theta}(m\hat{p}_j; m, \delta') - E]_+}. \quad (19)$$
Here, $\underline{\theta}$, $\overline{\theta}$, $m$, $n$, $\delta'$, $E$, $\hat{p}_j$, and $\hat{q}_j$ are defined in Theorem 5 and Appendix B. The guarantee (18) implies the guarantee (7). Thus, for the maximization over the uncertainty set in (8), we use the original definition of $W$ and the greedy algorithm for $\hat{w}$ in (9). Then, Theorem 4 with the estimated $b$ using (19) still holds, as follows:

Theorem 6 Suppose Assumption 1 is satisfied and $W$ is estimated using Theorem 5. Define $\hat{w}_{\tau, S_m^X, T_n^X}$ as in (9) and $\hat{b}_{S_m^X, T_n^X}$ as in (19), making the dependency on $\tau$, $S_m^X$, and $T_n^X$ explicit, and $\overline{U}_{\mathrm{RSCP}}$ as in (5). Let $\hat{\tau}$ be the solution of the following problem:
$$\hat{\tau} = \max_{\tau \in \mathcal{T}} \tau \quad\text{subj. to}\quad \overline{U}_{\mathrm{RSCP}}(C_\tau, S_m, V, \hat{w}_{\tau, S_m^X, T_n^X}, \hat{b}_{S_m^X, T_n^X}, \delta_C) \le \varepsilon. \quad (20)$$
Then, we have $\mathbb{P}_{S_m \sim P^m,\, V \sim \mathcal{U}^m,\, T_n^X \sim Q_X^n}[L_Q(C_\tau) \le \varepsilon] \ge 1 - \delta_C - \delta_w$ for any $\tau \le \hat{\tau}$.

C.3 CHOOSING HYPERPARAMETERS

In general, there is no systematic way to choose the smoothness parameter $E$ and the number of bins $K$; we briefly discuss strategies for doing so.

Number of bins $K$. The bins are defined in a one-dimensional space as described in Appendix B.2, so we follow the standard practice in the calibration literature for binning (Guo et al., 2017; Park et al., 2021), where $K$ is between 10 and 20. As we are using equal-mass binning, we choose the number of bins so that each bin contains sufficiently many source examples (in our case, 5,000 examples) for the length of the Clopper-Pearson interval over the IWs of each bin to be below some threshold (in our case, $10^{-3}$), which leads us to $K = 10$. We provide a sensitivity analysis in Appendix G.6.

Smoothness parameter $E$. Estimating $E$ (i.e., computing the integral of the difference of source and target probabilities) is intractable in general, as it requires that we perform density estimation in a high-dimensional space (2048 dimensions in our case) and then integrate this density over each bin. To avoid this hyperparameter selection, we choose $E$ equal to zero or close to zero (in our case, 0.001). Intuitively, we find that binning based on the source-discriminator scores is an effective way to group examples with similar IWs; thus, the main contribution to the uncertainty is the error of the point estimate. We provide a sensitivity analysis on $E$ in Appendix G.5. Importantly, note that even with $E = 0$, PS-W satisfies the PAC criterion. One direction for future work is devising better strategies for choosing $E$.

C.4 COMPARISON WITH WSCI

Guarantee. The main difference between WSCI and PS-W lies in the guarantee provided (rather than the prediction set size); PS-W provides a stronger guarantee than WSCI. In particular, WSCI does not provide a PAC guarantee. Instead, to satisfy the coverage probability guarantee, it requires a new calibration set for every new test example. However, in practice, we usually have a single held-out calibration set. Thus, our approach PS-W provides a guarantee that holds conditioned on this set. This difference is illustrated in Figure 3a, which compares WSCI and PS-W given the true importance weight. PS-W produces a larger set size than WSCI, but strictly satisfies the error constraint. The shortcoming of WSCI can also be observed in Figure 2 in Tibshirani et al. (2019), which empirically shows that the guarantee only holds on average over examples.

Usage of IWs. WSCI requires the true IWs, whereas our method can use approximate IWs.
Also, the IWs are used differently. Given the true importance weights, WSCI uses them to reweight examples in the calibration set, whereas PS-W uses them to generate a labeled target calibration set via rejection sampling.

D.1 PROOF OF THEOREM 1

First, note that the constraint in (1) implies $F(\hat{L}_{S_m}(C_\tau); m, \varepsilon) \le \delta$; conversely, any value of $\tau$ satisfying $F(\hat{L}_{S_m}(C_\tau); m, \varepsilon) \le \delta$ also satisfies $\hat{L}_{S_m}(C_\tau) \le k(m, \varepsilon, \delta)$. Thus, we can rewrite (1) as
$$\hat{\tau} = \arg\max_{\tau \in \mathcal{T}} \tau \quad\text{subj. to}\quad F(\hat{L}_{S_m}(C_\tau); m, \varepsilon) \le \delta.$$
On the other hand, if $\tau$ satisfies (3), then by the definition of $\overline{U}_{\mathrm{CP}}(C_\tau, S_m, \delta) = \overline{\theta}(k; m, \delta)$, we have
$$\inf\left(\{\theta \in [0, 1] \mid F(\hat{L}_{S_m}(C_\tau); m, \theta) \le \delta\} \cup \{1\}\right) \le \varepsilon,$$
which implies that $F(\hat{L}_{S_m}(C_\tau); m, \varepsilon) \le \delta$ (since the infimum is attained within the set, as the binomial CDF $F$ is continuous). Conversely, any value of $\tau$ satisfying $F(\hat{L}_{S_m}(C_\tau); m, \varepsilon) \le \delta$ also satisfies $\overline{U}_{\mathrm{CP}}(C_\tau, S_m, \delta) \le \varepsilon$. Thus, we can rewrite (3) as
$$\hat{\tau} = \arg\max_{\tau \in \mathcal{T}} \tau \quad\text{subj. to}\quad F(\hat{L}_{S_m}(C_\tau); m, \varepsilon) \le \delta,$$
implying (1) and (3) are equal. Thus, the claim follows from Theorem 1 of (Park et al., 2020a) and the fact that the error $L_P(C_\tau)$ is monotonically increasing in $\tau$.

D.2 MONOTONICITY OF THE CLOPPER-PEARSON BOUND

The CP bound $\overline{U}_{\mathrm{CP}}$ enjoys certain monotonicity properties that we will need. Intuitively, the CDF decreases as the number of observations $m$ increases while holding the number of successes $k$ fixed, but increases if both $m$ and $k$ are increased by the same amount (i.e., holding the number of failures $m - k$ fixed). In particular, we have the following:

Lemma 2 We have $\overline{\theta}(k; m, \delta) \le \overline{\theta}(k; m - 1, \delta)$ and $\overline{\theta}(k - 1; m - 1, \delta) \le \overline{\theta}(k; m, \delta)$.

Proof. Recall that $F(k; m, \theta)$ is the cumulative distribution function of a binomial distribution $\mathrm{Binom}(m, \theta)$, or equivalently of the random variable $\sum_{i=1}^m X_i$, where the $X_i \sim \mathrm{Bernoulli}(\theta)$ are i.i.d.

Decreasing case. If $k \le m - 1$, then since $\sum_{i=1}^{m-1} X_i \le \sum_{i=1}^{m} X_i$, the event $\{\sum_{i=1}^{m} X_i \le k\}$ is contained in $\{\sum_{i=1}^{m-1} X_i \le k\}$, so $F(k; m, \theta) \le F(k; m - 1, \theta)$. Then, we have
$$\overline{\theta}(k; m, \delta) := \inf\left(\{\theta \in [0,1] \mid F(k; m, \theta) \le \delta\} \cup \{1\}\right) \le \inf\left(\{\theta \in [0,1] \mid F(k; m - 1, \theta) \le \delta\} \cup \{1\}\right) =: \overline{\theta}(k; m - 1, \delta),$$
thus $\overline{\theta}$ is monotonically non-increasing in $m$.

Increasing case. If $\sum_{i=1}^{m-1} X_i \le k - 1$, then $\sum_{i=1}^{m} X_i = \sum_{i=1}^{m-1} X_i + X_m \le k$, so $F(k - 1; m - 1, \theta) \le F(k; m, \theta)$. Then, we have
$$\overline{\theta}(k; m, \delta) := \inf\left(\{\theta \in [0,1] \mid F(k; m, \theta) \le \delta\} \cup \{1\}\right) \ge \inf\left(\{\theta \in [0,1] \mid F(k - 1; m - 1, \theta) \le \delta\} \cup \{1\}\right) =: \overline{\theta}(k - 1; m - 1, \delta),$$
thus $\overline{\theta}$ is monotonically jointly non-decreasing in $(m, k)$.
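As a quick numerical illustration of Lemma 2, the sketch below evaluates the one-sided Clopper-Pearson upper bound via its standard beta-quantile form and checks both monotonicity properties on a few (k, m) pairs. The helper name and the sampled values are ours, and the beta-quantile expression is the usual closed form rather than anything stated in this excerpt.

```python
from scipy.stats import beta

def theta_bar(k, m, delta):
    # One-sided Clopper-Pearson upper bound for k successes out of m trials:
    # the (1 - delta)-quantile of Beta(k + 1, m - k), with the k = m edge case.
    return 1.0 if k >= m else beta.ppf(1 - delta, k + 1, m - k)

delta = 0.05
for k, m in [(1, 30), (3, 50), (10, 200)]:
    # Non-increasing in m with k fixed (first claim of Lemma 2).
    assert theta_bar(k, m, delta) <= theta_bar(k, m - 1, delta)
    # Non-decreasing when m and k grow together (second claim of Lemma 2).
    assert theta_bar(k - 1, m - 1, delta) <= theta_bar(k, m, delta)
print("Lemma 2 monotonicity holds on the sampled cases")
```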
D.3 PROOF OF THEOREM 2

The rejection sampling prediction set consists of two steps: (i) generate target samples, using the source samples $S_m$, the importance weights $w$, and an upper bound $b$ on their maximum value, and (ii) construct the Clopper-Pearson prediction set using the generated target samples. From rejection sampling, we choose $N := \sum_{i=1}^{m} \sigma_i$ samples from $S_m$, denoting them by $T_N$, where $\sigma_i := \mathbb{1}(V_i \le w(x_i)/b)$ and $V_i \sim \mathrm{Uniform}([0, 1])$; here, $N \sim \mathrm{Binom}(m, 1/b)$, and $1/b$ is the acceptance probability (von Neumann, 1951). The samples in $T_N$ are independent and identically distributed, conditionally on the random number $N$ of samples being equal to any fixed value $n$. The reason is that one can view the rejection sampling algorithm as proceeding in stages, iterating through the samples one by one. The first stage starts at the very beginning, and then each stage ends when a datapoint is accepted, followed by starting a new stage at the next datapoint. The last stage ends at the last datapoint. Based only on the source samples observed in one stage, rejection sampling produces a sample from the target distribution. Thus, within each stage, we produce one sample from the target distribution, and because each stage is independent of all the other ones, conditionally on any number of stages reached, our produced target samples are i.i.d. Thus, we can use the Clopper-Pearson bound conditionally on each $N = n$. To this end, let $\hat{\tau}(S_m, V) = \hat{\tau}$ explicitly denote the dependence on $S_m$ and $V$, and let
$$\tilde{\tau}(T_n) = \arg\max_{\tau \in \mathcal{T}} \tau \quad\text{subj. to}\quad \overline{U}_{\mathrm{RSCP}}(C_\tau, T_n, \delta) \le \varepsilon.$$
Note that, conditioned on obtaining $n$ samples using rejection sampling (i.e., $|T_n(S_m, V, w, b)| = n$), we have $\hat{\tau}(S_m, V) \overset{D}{=} \tilde{\tau}(T_n)$, where $\overset{D}{=}$ denotes equality in distribution. Then, we have
$$\mathbb{P}_{S_m \sim P^m,\, V \sim \mathcal{U}^m}\left[L_Q(C_{\hat{\tau}(S_m, V)}) \le \varepsilon\right] = \sum_{n=0}^{m} \mathbb{P}_{S_m \sim P^m,\, V \sim \mathcal{U}^m}\left[L_Q(C_{\hat{\tau}(S_m, V)}) \le \varepsilon \mid N = n\right] \mathbb{P}[N = n]$$
$$= \sum_{n=0}^{m} \mathbb{P}_{T_n \sim Q^n}\left[L_Q(C_{\tilde{\tau}(T_n)}) \le \varepsilon\right] \mathbb{P}[N = n] \ge \sum_{n=0}^{m} (1 - \delta)\, \mathbb{P}[N = n] = 1 - \delta,$$
where the inequality follows by Theorem 1. The claim follows.

D.4 PROOF OF THEOREM 3

Let
$$\tilde{\tau} = \arg\max_{\tau \in \mathcal{T}} \tau \quad\text{subj. to}\quad \overline{U}_{\mathrm{RSCP}}(C_\tau, S_m, V, w^*, b, \delta_C) \le \varepsilon, \quad (21)$$
which satisfies $\mathbb{P}_{S_m \sim P^m,\, V \sim \mathcal{U}^m}[L_Q(C_{\tilde{\tau}}) \le \varepsilon] \ge 1 - \delta_C$ by Theorem 2. Now, with probability at least $1 - \delta_w$, we have $w^* \in W$. Under this event, we have
$$\overline{U}_{\mathrm{RSCP}}(C_\tau, S_m, V, w^*, b, \delta_C) \le \max_{w \in W} \overline{U}_{\mathrm{RSCP}}(C_\tau, S_m, V, w, b, \delta_C),$$
so $\hat{\tau}$ satisfies the constraint in (21). Thus, we must have $\hat{\tau} \le \tilde{\tau}$. By monotonicity of $L_Q(C_\tau)$ in $\tau$, we have $L_Q(C_{\hat{\tau}}) \le L_Q(C_{\tilde{\tau}})$, which implies that
$$\mathbb{P}_{S_m \sim P^m,\, V \sim \mathcal{U}^m,\, T_n^X \sim Q_X^n}[L_Q(C_{\hat{\tau}}) \le \varepsilon] \ge \mathbb{P}_{S_m \sim P^m,\, V \sim \mathcal{U}^m}[L_Q(C_{\tilde{\tau}}) \le \varepsilon] \ge 1 - \delta_C,$$
where the last step follows by Theorem 2. The claim follows by a union bound, since $w^* \in W$ with probability at least $1 - \delta_w$.

D.5 PROOF OF LEMMA 1

Let $w$ and $v$ be IWs where $w(x_i) \ge v(x_i)$ and $w(x_j) = v(x_j)$ for $j \ne i$. Additionally, we use the following shorthands:
$$n_w := \sum_{i'=1}^{m} \mathbb{1}\!\left(V_{i'} \le \frac{w(x_{i'})}{b}\right), \qquad T_{n_w} := \left\{(x_{i'}, y_{i'}) \in S_m \,\middle|\, V_{i'} \le \frac{w(x_{i'})}{b}\right\}, \qquad k_w := \sum_{(x, y) \in T_{n_w}} \mathbb{1}(y \notin C(x)),$$
$$n_v := \sum_{i'=1}^{m} \mathbb{1}\!\left(V_{i'} \le \frac{v(x_{i'})}{b}\right), \qquad T_{n_v} := \left\{(x_{i'}, y_{i'}) \in S_m \,\middle|\, V_{i'} \le \frac{v(x_{i'})}{b}\right\}, \qquad k_v := \sum_{(x, y) \in T_{n_v}} \mathbb{1}(y \notin C(x)).$$
Here, $n_w \ge n_v$ since $w(x_i) \ge v(x_i)$. Finally, recall that $F(k; m, \theta)$ is the cumulative distribution function of the binomial random variable $\sum_{i'=1}^{m} X_{i'}$, where $X_{i'} \sim \mathrm{Bern}(\theta)$.

Non-decreasing case. If $y_i \notin C(x_i)$, there are two cases to consider:
1. If $v(x_i)/b < V_i \le w(x_i)/b$, then we can verify that $n_w = n_v + 1$ and $k_w = k_v + 1$.
2. Otherwise, we can verify that $n_w = n_v$ and $k_w = k_v$.
In both cases, $k_w \ge k_v$ and $n_w \ge n_v$, with $n_w - n_v = k_w - k_v$. Since $\overline{\theta}$ is monotonically jointly non-decreasing in $(m, k)$ as in Lemma 2, we have
$$\overline{U}_{\mathrm{RSCP}}(C, S_m, V, w, b, \delta) := \overline{U}_{\mathrm{CP}}(\hat{L}_{T_{n_w}}(C), \delta) := \overline{\theta}(k_w; n_w, \delta) \ge \overline{\theta}(k_v; n_v, \delta) =: \overline{U}_{\mathrm{CP}}(\hat{L}_{T_{n_v}}(C), \delta) =: \overline{U}_{\mathrm{RSCP}}(C, S_m, V, v, b, \delta),$$
thus $\overline{U}_{\mathrm{RSCP}}$ is monotonically non-decreasing in $w(x_i)$.

Non-increasing case. If $y_i \in C(x_i)$, then $k_w = k_v$. Since $\overline{\theta}$ is monotonically non-increasing in $m$ as in Lemma 2, we have
$$\overline{U}_{\mathrm{RSCP}}(C, S_m, V, w, b, \delta) := \overline{U}_{\mathrm{CP}}(\hat{L}_{T_{n_w}}(C), \delta) := \overline{\theta}(k_w; n_w, \delta) \le \overline{\theta}(k_v; n_v, \delta) =: \overline{U}_{\mathrm{CP}}(\hat{L}_{T_{n_v}}(C), \delta) =: \overline{U}_{\mathrm{RSCP}}(C, S_m, V, v, b, \delta),$$
thus $\overline{U}_{\mathrm{RSCP}}$ is monotonically non-increasing in $w(x_i)$.

D.6 PROOF OF THEOREM 5

Recall that
$$\hat{p}_B(x) = \sum_{j=1}^{K} \mathbb{1}(x \in B_j)\, \frac{1}{m} \sum_{x' \in S_m^X} \mathbb{1}(x' \in B_j), \qquad \hat{q}_B(x) = \sum_{j=1}^{K} \mathbb{1}(x \in B_j)\, \frac{1}{n} \sum_{x' \in T_n^X} \mathbb{1}(x' \in B_j),$$
$$p_B(x) = \sum_{j=1}^{K} \mathbb{1}(x \in B_j) \int_{B_j} p(x')\,dx', \qquad q_B(x) = \sum_{j=1}^{K} \mathbb{1}(x \in B_j) \int_{B_j} q(x')\,dx', \qquad\text{and}\qquad v(x) := v_{j(x)} = \int_{B_{j(x)}} dx'.$$
Due to Assumption (12), $|v(x)\,p(x) - p_B(x)|$ is bounded for any $x \in B_j$ as follows:
$$|v(x)\,p(x) - p_B(x)| = \left|\int_{B_j} p(x)\,dx' - \int_{B_j} p(x')\,dx'\right| = \left|\int_{B_j} \big(p(x) - p(x')\big)\,dx'\right| \le \int_{B_j} |p(x) - p(x')|\,dx' \le E, \quad (22)$$
and similarly
$$|v(x)\,q(x) - q_B(x)| \le E. \quad (23)$$
Observe that $m\hat{p}_B(x) \sim \mathrm{Binom}\big(m, \int_{B_j} p(x')\,dx'\big)$ for any $x \in B_j$; thus $p_B$ is bounded with probability at least $1 - \delta'$ as follows, due to the Clopper-Pearson interval $(\underline{\theta}, \overline{\theta})$:
$$\underline{\theta}(m\hat{p}_B(x); m, \delta') \le p_B(x) \le \overline{\theta}(m\hat{p}_B(x); m, \delta'), \quad (24)$$
and likewise
$$\underline{\theta}(n\hat{q}_B(x); n, \delta') \le q_B(x) \le \overline{\theta}(n\hat{q}_B(x); n, \delta'). \quad (25)$$
From (22), (23), (24), and (25), the following holds:
$$\underline{\theta}(m\hat{p}_B(x); m, \delta') - E \le v(x)\,p(x) \le \overline{\theta}(m\hat{p}_B(x); m, \delta') + E \quad\text{and}\quad \underline{\theta}(n\hat{q}_B(x); n, \delta') - E \le v(x)\,q(x) \le \overline{\theta}(n\hat{q}_B(x); n, \delta') + E.$$
Therefore, for any $x \in B_j$, $w^*(x)$ is bounded as follows:
$$\frac{\underline{\theta}(n\hat{q}_B(x); n, \delta') - E}{\overline{\theta}(m\hat{p}_B(x); m, \delta') + E} \;\le\; w^*(x) = \frac{q(x)}{p(x)} \;\le\; \frac{\overline{\theta}(n\hat{q}_B(x); n, \delta') + E}{\underline{\theta}(m\hat{p}_B(x); m, \delta') - E}.$$
Since we apply the Clopper-Pearson interval to $K$ bins for both the source and the target, the claim holds by the union bound.

E ADDITIONAL ALGORITHMS

E.1 PS ALGORITHM

Algorithm 2 PS: an algorithm using the CP bound in (3)
procedure PS($S_m$, $f$, $\mathcal{T}$, $\varepsilon$, $\delta$)
    $\hat{\tau} \leftarrow 0$
    for $\tau \in \mathcal{T}$ do    ▷ Grid search in ascending order
        if $\overline{U}_{\mathrm{CP}}(C_\tau, S_m, \delta) \le \varepsilon$ then $\hat{\tau} \leftarrow \max(\hat{\tau}, \tau)$
        else break
    return $\hat{\tau}$

E.2 PS-C ALGORITHM

Algorithm 3 PS-C: an algorithm using the CP bound in (3) with $\varepsilon/b$
procedure PS-C($S_m$, $f$, $\mathcal{T}$, $b$, $\varepsilon$, $\delta$)
    return PS($S_m$, $f$, $\mathcal{T}$, $\varepsilon/b$, $\delta$)

We describe the PS-C algorithm, which uses a conservative upper bound on the CP interval. Let
$$L_P(C) := \mathbb{E}_{(x,y) \sim P}[\mathbb{1}(y \notin C(x))], \quad L_Q(C) := \mathbb{E}_{(x,y) \sim Q}[\mathbb{1}(y \notin C(x))], \quad w^*(x) := \frac{q(x)}{p(x)}, \quad b := \max_{x \in \mathcal{X}} w^*(x).$$
Then, we have
$$L_Q(C) = \mathbb{E}_{(x,y) \sim Q}[\mathbb{1}(y \notin C(x))] = \mathbb{E}_{(x,y) \sim P}[w^*(x)\,\mathbb{1}(y \notin C(x))] \le \mathbb{E}_{(x,y) \sim P}[b\,\mathbb{1}(y \notin C(x))] = b\,\mathbb{E}_{(x,y) \sim P}[\mathbb{1}(y \notin C(x))] = b\,L_P(C).$$
Thus, $L_Q(C) \le \varepsilon$ if $b\,L_P(C) \le \varepsilon$; equivalently, $L_Q(C) \le \varepsilon$ if $L_P(C) \le \varepsilon/b$. As a consequence, we can choose $C$ based on the CP bound for the i.i.d. case (i.e., Algorithm 2), except using the desired error $\varepsilon/b$ (instead of $\varepsilon$). The algorithm is described in Algorithm 3.

E.3 PS-R ALGORITHM

Algorithm 4 PS-R: an algorithm using the RSCP bound in (6)
procedure PS-R($S_m$, $f$, $\mathcal{T}$, $w$, $b$, $\varepsilon$, $\delta$)
    $V \sim \mathrm{Uniform}([0, 1])^m$
    $\hat{\tau} \leftarrow 0$
    for $\tau \in \mathcal{T}$ do    ▷ Grid search in ascending order
        if $\overline{U}_{\mathrm{RSCP}}(C_\tau, S_m, V, w, b, \delta) \le \varepsilon$ then $\hat{\tau} \leftarrow \max(\hat{\tau}, \tau)$
        else break
    return $\hat{\tau}$

E.4 PS-M ALGORITHM

Algorithm 5 PS-M: an algorithm using the RSCP bound in (6) along with IW rescaling
procedure PS-M($S_m$, $T_n^X$, $f$, $\mathcal{T}$, $w$, $b$, $\varepsilon$, $\delta$)
    $\hat{w}(x) \leftarrow \hat{q}_B(x)/\hat{p}_B(x)$ for $x \in \mathcal{X}$, where $\hat{p}_B$ and $\hat{q}_B$ are defined in (14) and (15), respectively
    return PS-R($S_m$, $f$, $\mathcal{T}$, $\hat{w}$, $b$, $\varepsilon$, $\delta$)
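The following is a minimal Python sketch of the rejection-sampling bound and the PS-R grid search above, together with a greedy worst-case choice of IWs in the spirit of PS-W. It assumes a thresholded prediction set of the form $C_\tau(x) = \{y : f(x)_y \ge \tau\}$ (so an error on a calibration point means its true-label score falls below $\tau$), which is not spelled out in this excerpt; all function names are ours, and the greedy choice is an assumption justified by the monotonicity in Lemma 1 rather than a verbatim transcription of (9).

```python
import numpy as np
from scipy.stats import beta

def cp_upper(k, n, delta):
    # Clopper-Pearson upper bound on the error rate given k errors out of n samples.
    if n == 0:
        return 1.0  # no accepted samples: fall back to the vacuous bound
    return 1.0 if k >= n else beta.ppf(1 - delta, k + 1, n - k)

def u_rscp(scores_true, tau, V, w, b, delta):
    # RSCP bound sketch: accept source point i when V[i] <= w[i] / b (rejection
    # sampling), then apply the CP bound to the errors of the accepted subsample.
    accept = V <= w / b
    n = int(accept.sum())
    k = int((scores_true[accept] < tau).sum())
    return cp_upper(k, n, delta)

def worst_case_iw(scores_true, tau, w_lo, w_hi):
    # Greedy maximizer over the box uncertainty set W (cf. Lemma 1): weights go
    # up on calibration errors and down on correctly covered points.
    return np.where(scores_true < tau, w_hi, w_lo)

def ps_r(scores_true, taus, w, b, eps, delta, seed=0):
    # Algorithm 4 sketch: ascending grid search, keep the largest feasible tau.
    V = np.random.default_rng(seed).uniform(size=len(scores_true))
    tau_hat = 0.0
    for tau in np.sort(taus):
        if u_rscp(scores_true, tau, V, w, b, delta) <= eps:
            tau_hat = max(tau_hat, tau)
        else:
            break
    return tau_hat
```

A PS-W-style search would, inside the loop, replace `w` by `worst_case_iw(scores_true, tau, w_lo, w_hi)` (with the bounds evaluated at each calibration point from the binned intervals) and `b` by the estimated maximum IW from (19).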
F EXPERIMENT DETAILS

F.1 DOMAIN ADAPTATION

We use a fully-connected network (with two hidden layers of 500 neurons each, followed by ReLU activations and a 0.5-dropout layer) as the domain classifier (recall that the input of this domain classifier is the last hidden layer of ResNet101). We use the last hidden layer of the model as the example space $\mathcal{X}$, whose dimension is 2048. For neural network training, we run stochastic gradient descent (SGD) for 100 epochs with an initial learning rate of 0.1, decaying it by half once every 20 epochs. The domain adaptation regularizer is gradually increased as in (Ganin et al., 2016). We use the same hyperparameters for all experiments.

F.2 DOMAINNET

We split the dataset into 409,832 training, 88,371 calibration, and 88,372 test images.

F.3 IMAGENET-C13

We split ImageNet into 1.2M training, 25K calibration, and 25K test images, and ImageNet-C13 into 83M training, 1.6M calibration, and 1.6M test images. To train a model using domain adaptation, due to the large size of the target training set, we subsample the target training set to be the same size as the source training set for each random trial.

G ADDITIONAL RESULTS

G.1 SYNTHETIC RATE SHIFT BY TWO GAUSSIANS

We demonstrate the efficacy of the proposed approaches (i.e., PS-R with the known IWs and PS-W with the estimated IWs) using a synthetic dataset consisting of samples from two Gaussian distributions.

Dataset. We consider two Gaussian distributions $N(\mu, \Sigma)$ and $N(\mu, \Sigma')$ over the 2048-dimensional covariate space $\mathcal{X}$. Here, $\mu = 0$; $\Sigma$ and $\Sigma'$ are diagonal, with $\Sigma_{1,1} = 5^2$, $\Sigma_{i,i} = 10^{-1}$, $\Sigma'_{1,1} = 1$, and $\Sigma'_{i,i} = 10^{-1}$ for $i \in \{2, \dots, 2048\}$. We consider the flat Gaussian $N(\mu, \Sigma)$ as the source and the tall Gaussian $N(\mu, \Sigma')$ as the target. Intuitively, there is a rate shift from the source to the target, i.e., the target examples are a subset of the source, but occur with higher frequencies. We use the following labeling function: $p(y \mid x) = \sigma(5 x_1)$, where $\sigma$ is the sigmoid function. Finally, we generate 50,000 labeled examples for each of the training, calibration, and test sets.

[Figure 3: Error under the rate shift by two Gaussians (over 100 random trials); panels show prediction set error and size (a) with the known true IW (PS, WSCI, PS-C, PS-W) and (b) with an estimated IW (PS, WSCI, PS-C, PS-R, PS-W). Parameters are m = 50,000, ε = 0.01, and δ = 10⁻⁵.]

Results. We consider two different setups: 1) the true IW is known, and 2) the true IW is unknown. In Figure 3a, we show the prediction set errors given the true IW. As expected, PS-R satisfies the PAC guarantee, i.e., the error is below ε. However, as shown in Figure 3b, when we need to estimate IWs, using just the point estimate of the IW results in PS-R performing poorly in terms of prediction set error; it still satisfies the ε constraint, but the error is close to zero, indicating that the prediction set size is too large to be useful for an uncertainty quantifier. In contrast, PS-W (i.e., rejection sampling based on interval estimates of IWs) produces a larger, more reasonable error rate while still satisfying the PAC condition. These experiments demonstrate that PS-R works well when given the true IW, but accounting for IW uncertainty is important when using estimated IWs.

G.2 PREDICTION SET SIZE AND ERROR

[Figure 4: Prediction set error and size under rate shifts on DomainNet (a-g) and under support shifts on ImageNet (h, i) (over 100 random trials for error and over a held-out test set for size); visible panel labels include (c) Clipart, (d) Painting, (e) Quickdraw, (g) Infograph, (h) ImageNet-C13, and (i) ImageNet + PGD. Parameters are m = 50,000 for DomainNet and m = 20,000 for ImageNet, ε = 0.1, and δ = 10⁻⁵.]
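Returning to the synthetic setup of G.1, here is a small sketch of how such data and the corresponding true IW could be generated. The sampler structure and names are ours; only the first coordinate differs between source and target, so the true IW reduces to a ratio of one-dimensional normal densities with maximum b = 5 at x₁ = 0.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, n = 2048, 50_000  # shrink d or n for a quick local check

def sample(num, sigma1):
    # Diagonal Gaussian with variance 0.1 in all coordinates except the first.
    x = rng.normal(scale=np.sqrt(0.1), size=(num, d))
    x[:, 0] = rng.normal(scale=sigma1, size=num)
    return x

x_src = sample(n, 5.0)   # source: Sigma_{1,1} = 5^2
x_tgt = sample(n, 1.0)   # target: Sigma'_{1,1} = 1
y_src = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-5.0 * x_src[:, 0]))).astype(int)  # p(y|x) = sigmoid(5 x_1)

# True IW: the densities differ only in the first coordinate, so the ratio is
# one-dimensional and maximized at x_1 = 0 with value b = 5.
w_true = norm.pdf(x_src[:, 0], scale=1.0) / norm.pdf(x_src[:, 0], scale=5.0)
```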
G.3 ABLATION STUDY ON CALIBRATION

[Figure 5: Ablation study on calibration (over 100 random trials for error and over a held-out test set for size); panels: (a) DomainNet-Sketch, (b) DomainNet-Clipart, (c) DomainNet-Painting, (d) DomainNet-Quickdraw, (e) DomainNet-Real, (f) DomainNet-Infograph, (g) ImageNet-C13, (h) ImageNet-PGD, each comparing PS-R, PS-M, and PS-W. PS-R does not use rescaled IWs, PS-M uses point estimates of rescaled IWs, and PS-W uses intervals around rescaled IWs. Parameters are m = 50,000 for DomainNet shifts, m = 20,000 for ImageNet shifts, ε = 0.1, and δ = 10⁻⁵.]

In particular, we consider a variant PS-M of PS-W that ignores the worst-case IWs as in Theorem 3; instead, it rescales the importance weights in each bin via a point estimate, i.e., Theorem 5 without the Clopper-Pearson interval (i.e., m = n = ∞) and with E = 0. As can be seen in the results on the shifts to the various domains, our approach satisfies the PAC guarantee but this variant does not; in fact, its error is even worse than that of PS-R, which uses the non-rescaled importance weights from the probabilistic classifier g.

G.4 COMPARISON WITH VARIOUS TOP-K PREDICTION SETS

[Figure 6: Prediction set error and size under rate shifts on DomainNet (a-g) and under support shifts on ImageNet (h, i) (over 100 random trials), comparing Top-1, Top-5, Top-10, and Top-50 prediction sets with PS-W; visible panel labels include (c) Clipart, (d) Painting, (e) Quickdraw, (g) Infograph, (h) ImageNet-C13, and (i) ImageNet + PGD. Default parameters are m = 50,000 for DomainNet and m = 20,000 for ImageNet, ε = 0.1, and δ = 10⁻⁵.]

A Top-K prediction set is a prediction set that contains the top K labels based on a domain-adapted score function; thus, the set size is always K. As can be seen in the prediction set error plots, Top-K prediction sets do not consistently satisfy the desired ε guarantee. Moreover, the prediction set size is worse than that of PS-W. For example, when the Top-50 prediction set almost achieves the desired error rate (e.g., Figure 6d), the mean and median of the corresponding prediction set size are larger than those of PS-W.
G.5 VARYING A SMOOTHNESS PARAMETER

[Figure 7: Prediction set error and size under rate shifts on DomainNet (a-g) and under support shifts on ImageNet (h, i) (over 100 random trials) for varying E ∈ {0, 0.001, 0.005, 0.01}; visible panel labels include (c) Clipart, (d) Painting, (e) Quickdraw, (g) Infograph, (h) ImageNet-C13, and (i) ImageNet + PGD. Parameters are m = 50,000 for DomainNet and m = 20,000 for ImageNet, ε = 0.1, and δ = 10⁻⁵.]

In particular, the parameter E in Assumption 1 bounds the quality of our estimates of p(x) and q(x); since these errors cannot be conveniently measured, we have chosen it heuristically as a hyperparameter. In this figure, we show the error of PS-W as a function of E. As E becomes smaller, the prediction sets become more optimistic while still satisfying the PAC guarantee. Note that the optimal case is E = 0, since PS-W still satisfies the PAC guarantee; this result suggests that our IW estimates are reasonably accurate.

G.6 VARYING A NUMBER OF BINS

[Figure 8: Prediction set error and size under rate shifts on DomainNet (a-g) and under support shifts on ImageNet (h, i) (over 100 random trials) for varying K ∈ {5, 10, 15, 20}; visible panel labels include (c) Clipart, (d) Painting, (e) Quickdraw, (g) Infograph, (h) ImageNet-C13, and (i) ImageNet + PGD. Parameters are m = 50,000 for DomainNet and m = 20,000 for ImageNet, ε = 0.1, and δ = 10⁻⁵.]

In general, K must be chosen to be small enough that each bin contains sufficiently many source examples to achieve a small Clopper-Pearson interval size (e.g., 10⁻³), though it also needs to be sufficiently large to satisfy the smoothness assumption.

G.7 PREDICTION SET VISUALIZATION

[Figure 9 shows example images (columns: example x, Ĉ_PS(x), Ĉ_PS-W(x)) comparing the prediction sets of PS and PS-W side by side.]

Figure 9: Prediction sets of the DomainNet shift from All to Painting. Parameters are m = 50,000, ε = 0.1, and δ = 10⁻⁵.
The green label is the true label and the label with the hat is the predicted label. We choose examples where the two approaches differ; in particular, if PS-W is incorrect, then PS is incorrect as well, since the prediction set sizes are monotone in τ.

[Figure 10 shows example images (columns: example x, Ĉ_PS(x), Ĉ_PS-W(x)) comparing the prediction sets of PS and PS-W side by side.]

Figure 10: Prediction sets of the shift from ImageNet to ImageNet-C13. Parameters are m = 20,000, ε = 0.1, and δ = 10⁻⁵. The green label is the true label and the label with the hat is the predicted label. We choose examples where the two approaches differ; in particular, if PS-W is incorrect, then PS is incorrect as well, since the prediction set sizes are monotone in τ.

[Figure 11 shows example images (columns: example x, Ĉ_PS(x), Ĉ_PS-W(x)) comparing the prediction sets of PS and PS-W side by side.]

Figure 11: Prediction sets of the shift from ImageNet to ImageNet-PGD. Parameters are m = 20,000, ε = 0.1, and δ = 10⁻⁵. The green label is the true label and the label with the hat is the predicted label. We choose examples where the two approaches differ; in particular, if PS-W is incorrect, then PS is incorrect as well, since the prediction set sizes are monotone in τ.