# JAWS: Auditing Predictive Uncertainty Under Covariate Shift

Drew Prinster, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211. drew@cs.jhu.edu
Anqi Liu, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211. aliu@cs.jhu.edu
Suchi Saria, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21211. ssaria@cs.jhu.edu

Abstract

We propose JAWS, a series of wrapper methods for distribution-free uncertainty quantification tasks under covariate shift, centered on the core method JAW, the JAckknife+ Weighted with data-dependent likelihood-ratio weights. JAWS also includes computationally efficient Approximations of JAW using higher-order influence functions: JAWA. Theoretically, we show that JAW relaxes the jackknife+'s assumption of data exchangeability to achieve the same finite-sample coverage guarantee even under covariate shift. JAWA further approaches the JAW guarantee in the limit of the sample size or the influence function order under common regularity assumptions. Moreover, we propose a general approach to repurposing predictive interval-generating methods and their guarantees to the reverse task: estimating the probability that a prediction is erroneous, based on user-specified error criteria such as a safe or acceptable tolerance threshold around the true label. We then propose JAW-E and JAWA-E as the repurposed versions of our proposed methods for this Error assessment task. Practically, JAWS outperform state-of-the-art predictive inference baselines on a variety of biased real-world datasets for interval-generation and error-assessment predictive uncertainty auditing tasks.

1 Introduction

Auditing uncertainty under data shift. Principled quantification of predictive uncertainty is crucial for enabling users to calibrate how much they should or should not trust a given prediction [Thiebes et al., 2021, Ghosh et al., 2021, Tomsett et al., 2020, Bhatt et al., 2021]. Uncertainty-based predictor auditing can be considered a type of uncertainty quantification performed post hoc, for example by a regulator without detailed knowledge of a predictor's architecture and with limited resources [Schulam and Saria, 2019]. Data shift poses a major challenge to uncertainty quantification due to violation of the common assumption that the training and test data are exchangeable, or more specifically independent and identically distributed (i.i.d.) [Ovadia et al., 2019, Ulmer et al., 2020, Zhou and Levine, 2021, Chan et al., 2020]. Therefore, it is essential to develop convenient tools for users or regulators to audit the uncertainty of a given prediction even when the training data are biased.

Predictor auditing: Interval generation. In this work we distinguish between two types of predictive uncertainty auditing. We describe the first type as interval generation, which refers to a common goal in the distribution-free uncertainty quantification literature: to generate a predictive confidence interval (or set) that covers the true label with at least a user-specified probability. For instance, an auditor might ask for predictive intervals that contain the true label with at least, say, 90% frequency.
Predictor auditing: Error assessment. While predictive interval generation has been a central focus of the distribution-free uncertainty quantification literature [Angelopoulos and Bates, 2021], in some applications the reverse computation may be more actionable: estimating the probability that a prediction is erroneous or not, based on user-specified error criteria such as a safe or acceptable tolerance region around the true label. We thus refer to this task as error assessment. For instance, take the setting of chemical or radiation therapy dose prediction for cancer treatment, where administering a dose within approximately 10% of the optimal dose is considered safety-critical (see Appendix A.1 for details). Whereas predictive interval generation could fail to provide safety assurance (e.g., if the predictive confidence interval is larger than the safe tolerance region), error assessment would give a worst-case probability of the prediction being safe. Similar examples could be formulated in other applications, such as incision planning in surgical robotics and autonomous vehicle navigation.

Coverage. We assume a standard regression setup with a multiset of training data $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ and a test point $(X_{n+1}, Y_{n+1})$ with unknown label $Y_{n+1}$, where $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$ for all $i \in \{1, \dots, n+1\}$. Also, we denote a predictor as $\hat{\mu} = \mathcal{A}(\{(X_1, Y_1), \dots, (X_n, Y_n)\})$, where $\mathcal{A}$ is a model-fitting algorithm. For a predictive interval (or set) $\hat{C}^{\text{audit}}_{n,\alpha} : \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}\}$, a coverage guarantee gives a lower bound to the probability that the interval covers the true test label:

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}^{\text{audit}}_{n,\alpha}(X_{n+1})\big\} \geq 1 - \alpha. \quad (1)$$

The coverage guarantee provides the basis for both interval-generation and error-assessment auditing, though it is important to note that in this work we focus on marginal rather than conditional coverage (see [Foygel Barber et al., 2021] for more details on this distinction). Standard conformal prediction methods [Vovk et al., 2005, Shafer and Vovk, 2008, Vovk, 2013], along with the jackknife+ and related methods [Barber et al., 2021], which we refer to together as predictive inference methods, provide a framework for generating predictive intervals with finite-sample guaranteed coverage.

Exchangeability. Standard conformal prediction and the jackknife+ rely on two crucial notions of exchangeability: first, data exchangeability, i.e., that the training and test data are all exchangeable (e.g., i.i.d.); and second, that the model-fitting algorithm $\mathcal{A}$ treats the data symmetrically [Barber et al., 2022]. In common situations of dataset shift, however, the data exchangeability assumption is violated. Empirically, the coverage performance of standard conformal prediction methods can suffer under data shift [Tibshirani et al., 2019, Podkopaev and Ramdas, 2021].

Figure 1: Jackknife+ loses coverage on the airfoil dataset under covariate shift (details in Section 4).

In this work, we build on the jackknife+ method due to its beneficial compromise between the statistical and computational limitations of other conformal prediction methods [Barber et al., 2021]. However, jackknife+ coverage performance can still degrade under data shift, as shown in Figure 1, and in some applications its computational requirements can still be limiting. To address these concerns and make extensions to error assessment, we develop JAWS, a series of wrapper methods for distribution-free uncertainty quantification under covariate shift (see Table 1 for key properties).
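To make the two audit targets concrete, the following minimal Python sketch illustrates how an auditor might empirically check marginal coverage as in (1) and the error criterion from the error-assessment task; the arrays, the predictor, the interval endpoints, and the tolerance `tau` are all synthetic assumptions for illustration, not quantities from the paper.

```python
import numpy as np

# Synthetic stand-ins: true labels, point predictions, and interval endpoints
# from some interval-generating method (all assumptions for illustration).
rng = np.random.default_rng(0)
y_true = rng.normal(size=500)
y_pred = y_true + rng.normal(scale=0.3, size=500)
lo, hi = y_pred - 0.6, y_pred + 0.6

# Interval-generation audit: empirical marginal coverage vs. a 1 - alpha target.
coverage = np.mean((lo <= y_true) & (y_true <= hi))
print(f"empirical coverage: {coverage:.3f} (target 0.90)")

# Error-assessment audit: how often |Y - mu_hat(X)| exceeds a user tolerance tau.
tau = 0.5
error_rate = np.mean(np.abs(y_true - y_pred) > tau)
print(f"empirical error rate at tolerance tau={tau}: {error_rate:.3f}")
```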
Table 1: Summary of key properties for JAWS methods (details in Section 3). Guarantees are under covariate shift.

| Method | Task | Finite-sample guarantee | Asymptotic guarantee | Avoids retraining |
|---|---|---|---|---|
| JAW | Interval generation | ✓ | ✓ | ✗ |
| JAWA | Interval generation | ✗ | ✓ | ✓ |
| JAW-E | Error assessment | ✓ | ✓ | ✗ |
| JAWA-E | Error assessment | ✗ | ✓ | ✓ |

Our contributions can be summarized as follows:

1. We develop JAW: a jackknife+ method with data-dependent likelihood-ratio weights for predictive interval generation under covariate shift. We show that JAW achieves the same rigorous, finite-sample coverage guarantee as the jackknife+ [Barber et al., 2021] while relaxing the data exchangeability assumption to allow for covariate shift.
2. We develop JAWA: a sequence of computationally efficient approximations to JAW that uses higher-order influence functions to avoid retraining. Under assumptions outlined in Giordano et al. [2019a] regarding the regularity of the data, the Hessian of the objective (local strong convexity), and the existence and boundedness of higher-order derivatives, we provide an asymptotic guarantee for JAWA coverage in the limit of the sample size or the influence function order.
3. We propose a general approach to repurposing any distribution-free predictive inference method to the error assessment task, with rigorous guarantees for the coverage probability estimation. Our approach applies to methods that assume exchangeable data and to methods like JAW and JAWA that allow for covariate shift; JAW-E and JAWA-E refer to the error assessment versions.
4. We demonstrate superior empirical performance of JAWS over other distribution-free predictive inference baselines on a variety of benchmark datasets under covariate shift.

2 Background and related work

2.1 Standard conformal prediction

Conformal prediction has grown into a broad research field since arising in the 1990s [Vovk et al., 2005, Shafer and Vovk, 2008, Balasubramanian et al., 2014, Angelopoulos and Bates, 2021]. Standard conformal prediction methods generate a prediction interval (or set) with a finite-sample coverage guarantee as in (1), which is distribution-free in the sense that the guarantee applies to any exchangeable data distribution [Lei and Wasserman, 2014, Lei et al., 2018]. With the exchangeability assumptions in Section 1, standard conformal prediction methods rely on a pre-fit score function $\hat{S} : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ (in regression, the absolute-value residual score $\hat{S}(x, y) = |y - \hat{\mu}(x)|$ is commonly used). A conformal prediction interval at confidence level $1 - \alpha$ is then determined by a corresponding quantile of a multiset of (exchangeable) score values. Split conformal and full conformal are the two main types of standard conformal prediction, and each bears its own limitation [Vovk et al., 2005, Shafer and Vovk, 2008]. Split conformal generates scores on labeled holdout data and is computationally efficient because it does not require retraining, but sample splitting to obtain the holdout set can reduce model accuracy [Papadopoulos, 2008, Lei et al., 2018, Vovk, 2012]. On the other hand, full conformal prediction avoids the holdout set requirement, but at the heavy computational cost of retraining the model on every possible target value (or, in practice, on a fine grid of target values) [Ndiaye and Takeuchi, 2019, Zeni et al., 2020].

2.2 Covariate shift

Under the covariate shift assumption, the $Y \mid X$ distribution is assumed to be the same between training and test data, but the marginal $X$ distributions may change [Sugiyama et al., 2007, Shimodaira, 2000]:

$$(X_i, Y_i) \overset{\text{i.i.d.}}{\sim} P_X \times P_{Y|X}, \ \ i = 1, \dots, n; \qquad (X_{n+1}, Y_{n+1}) \sim \widetilde{P}_X \times P_{Y|X}, \ \text{independently.} \quad (2)$$
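As a small illustration of the setting in (2) (our own sketch, not the paper's experimental protocol), the Python snippet below simulates covariate shift by drawing training covariates from $P_X$ and test covariates from an exponentially tilted marginal $\widetilde{P}_X$, while the conditional $P_{Y|X}$ is shared; the data-generating functions and the tilting parameter `a` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_y_given_x(x, rng):
    # Shared conditional P_{Y|X}: unchanged between training and test data.
    return np.sin(3 * x) + 0.2 * rng.normal(size=x.shape)

# Training covariates from P_X.
x_train = rng.normal(loc=0.0, scale=1.0, size=2000)
y_train = sample_y_given_x(x_train, rng)

# Test covariates from a tilted marginal P~_X with density proportional to
# exp(a * x) * p_X(x), drawn here by importance resampling from a large pool
# of P_X samples (a simple, assumed tilting scheme).
a = 1.0
pool = rng.normal(loc=0.0, scale=1.0, size=50000)
tilt = np.exp(a * pool)
x_test = rng.choice(pool, size=500, replace=False, p=tilt / tilt.sum())
y_test = sample_y_given_x(x_test, rng)

# Under this construction the oracle likelihood ratio w(x) = dP~_X/dP_X(x)
# is proportional to exp(a * x).
```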
A rich literature exists in this domain; see Appendix A.2 for more details. Uncertainty quantification is relatively less explored under covariate shift, though recent work [Ovadia et al., 2019, Zhou and Levine, 2021, Chan et al., 2020] emphasizes its importance, especially in deep learning.

2.3 Conformal prediction under covariate shift and beyond exchangeability

Tibshirani et al. [2019] develop the idea of weighted exchangeability for adapting conformal prediction to the covariate shift setting. Random variables $V_1, \dots, V_n$ are weighted exchangeable with weight functions $w_1, \dots, w_n$ if their joint density $f$ can be factorized as $f(v_1, \dots, v_n) = \prod_{i=1}^{n} w_i(v_i) \cdot g(v_1, \dots, v_n)$, where $g$ is invariant to the ordering of its inputs. For covariate shift as in (2), if $\widetilde{P}_X$ is absolutely continuous with respect to $P_X$, then the data $\{(X_i, Y_i)\}$ are weighted exchangeable with weight functions $w_1 = \dots = w_n = 1$ and $w_{n+1} = w = d\widetilde{P}_X / dP_X$ [Tibshirani et al., 2019]. If $\{v_i\}$ represents a set of scores for standard conformal prediction, then we can represent the empirical distribution of $\{v_i\}$ as $\frac{1}{n+1}\sum_{i=1}^{n} \delta_{v_i} + \frac{1}{n+1}\delta_{\infty}$, where $\delta_{v_i}$ denotes a point mass at $v_i$ [Barber et al., 2022]. By extension, weighted conformal prediction uses the weighted empirical distribution defined as $\sum_{i=1}^{n} p^w_i(x)\,\delta_{v_i} + p^w_{n+1}(x)\,\delta_{\infty}$, with weights given by

$$p^w_i(x) = \frac{w(X_i)}{\sum_{j=1}^{n} w(X_j) + w(x)}, \ \ i = 1, \dots, n; \qquad p^w_{n+1}(x) = \frac{w(x)}{\sum_{j=1}^{n} w(X_j) + w(x)}, \quad (3)$$

where $w = d\widetilde{P}_X / dP_X$, so $p^w_i(X_{n+1})$ can be thought of as a normalized likelihood-ratio weight for each $i \in \{1, \dots, n+1\}$. Corollary 1 in Tibshirani et al. [2019] provides the coverage guarantee of weighted conformal prediction, which takes the form of (1) but relaxes the exchangeable data assumption to allow for covariate shift. However, weighted split and weighted full conformal inherit the same statistical and computational limitations, respectively, as their standard (exchangeable) variants.

The recent work of Barber et al. [2022] provides a novel extension of conformal prediction and the jackknife+ to unknown violations of the exchangeability assumption, including a nonexchangeable jackknife+ defined with fixed weights. The key difference between the nonexchangeable jackknife+ of Barber et al. [2022] and our proposed JAW method is that Barber et al. [2022] use fixed weights to compensate for unknown exchangeability violations (not limited to covariate shift) but at the expense of a bounded but generally nonzero coverage gap (a drop in guaranteed coverage relative to the exchangeable case), whereas our JAW method with data-dependent weights assumes covariate shift but does not suffer from any similar coverage gap. See Appendix A.3 for more details.
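As a minimal sketch (in Python, assuming the oracle likelihood ratio $w = d\widetilde{P}_X/dP_X$ is available, which is an assumption for illustration), the normalized weights in (3) can be computed as follows; the snippet also reports the effective sample size implied by the weighting, which is relevant to the variance discussion later in the paper. Note that because the weights are normalized, $w$ only needs to be known up to a multiplicative constant.

```python
import numpy as np

def normalized_weights(w_train, w_test):
    """Normalized likelihood-ratio weights p^w_i(x) from (3).

    w_train: array of w(X_i) = dP~_X/dP_X evaluated at the n training points.
    w_test:  scalar w(x) evaluated at the test point x.
    Returns an array of length n+1 whose last entry is p^w_{n+1}(x).
    """
    w = np.append(np.asarray(w_train, dtype=float), float(w_test))
    return w / w.sum()

# Example with an assumed exponential-tilting likelihood ratio w(x) ∝ exp(a * x).
rng = np.random.default_rng(0)
a = 1.0
x_train = rng.normal(size=200)
x_test = 1.5
p = normalized_weights(np.exp(a * x_train), np.exp(a * x_test))

# Kish effective sample size of the weighted training points: heavier covariate
# shift concentrates mass on fewer points, one source of higher coverage variance.
n_eff = p[:-1].sum() ** 2 / np.sum(p[:-1] ** 2)
print(f"p_(n+1)(x) = {p[-1]:.4f}, effective training sample size ≈ {n_eff:.1f}")
```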
2.4 Jackknife+

The jackknife+ [Barber et al., 2021], which is closely related to cross-conformal prediction [Vovk et al., 2018], offers a compromise between the statistical limitation of split conformal and the computational limitation of full conformal, at the cost of a slightly weaker coverage guarantee. The jackknife+ predictive interval can most easily be understood as a modification of a predictive interval from the classic jackknife resampling method [Miller, 1974, Steinberger and Leeb, 2018, 2016]. For a set of point masses $\{\delta_{v_i}\}$ at values $v_1, \dots, v_n$, let $Q^-_{\beta}\{\frac{1}{n+1}\delta_{v_i}\}$ denote the level $\beta$ quantile of the empirical distribution $\sum_{i=1}^{n}\frac{1}{n+1}\delta_{v_i} + \frac{1}{n+1}\delta_{-\infty}$, and let $Q^+_{\beta}\{\frac{1}{n+1}\delta_{v_i}\}$ denote the level $\beta$ quantile of the empirical distribution $\sum_{i=1}^{n}\frac{1}{n+1}\delta_{v_i} + \frac{1}{n+1}\delta_{+\infty}$.

Then, denoting the model trained without the $i$th point as $\hat{\mu}_{-i} = \mathcal{A}\big((X_1, Y_1), \dots, (X_{i-1}, Y_{i-1}), (X_{i+1}, Y_{i+1}), \dots, (X_n, Y_n)\big)$ and the leave-one-out residual as $R^{\text{LOO}}_i = |Y_i - \hat{\mu}_{-i}(X_i)|$, the jackknife prediction interval can be written as

$$\hat{C}^{\text{jackknife}}_{n,\alpha}(X_{n+1}) = \Big[\, Q^-_{\alpha}\big\{\tfrac{1}{n+1}\delta_{\hat{\mu}(X_{n+1}) - R^{\text{LOO}}_i}\big\},\ Q^+_{1-\alpha}\big\{\tfrac{1}{n+1}\delta_{\hat{\mu}(X_{n+1}) + R^{\text{LOO}}_i}\big\} \,\Big]. \quad (4)$$

In contrast, we obtain the jackknife+ predictive interval of Barber et al. [2021] by replacing the full-model prediction $\hat{\mu}(X_{n+1})$ in (4) with $\hat{\mu}_{-i}(X_{n+1})$:

$$\hat{C}^{\text{jackknife+}}_{n,\alpha}(X_{n+1}) = \Big[\, Q^-_{\alpha}\big\{\tfrac{1}{n+1}\delta_{\hat{\mu}_{-i}(X_{n+1}) - R^{\text{LOO}}_i}\big\},\ Q^+_{1-\alpha}\big\{\tfrac{1}{n+1}\delta_{\hat{\mu}_{-i}(X_{n+1}) + R^{\text{LOO}}_i}\big\} \,\Big]. \quad (5)$$

Barber et al. [2021] prove that, under the same exchangeability assumptions as in standard conformal prediction, the jackknife+ prediction interval satisfies

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}^{\text{jackknife+}}_{n,\alpha}(X_{n+1})\big\} \geq 1 - 2\alpha. \quad (6)$$

2.5 Approximating leave-one-out models with higher-order influence functions

Influence functions (IFs) [Cook, 1977] have a long history in robust statistics for estimating the dependence of parameters on sample data. Recently, IFs have become more widespread in machine learning for uses including model interpretability [Koh and Liang, 2017] and approximating classic resampling-based uncertainty quantification methods including the bootstrap [Schulam and Saria, 2019], the jackknife, and leave-$k$-out cross validation [Giordano et al., 2019b,a]. In each of these cases, IFs enable approximation of the parameters that would be obtained if the model were retrained on resampled data by instead estimating the effect of a corresponding reweighting. In prior work, Alaa and Van Der Schaar [2020] proposed approximating the leave-one-out models required by the jackknife+ with higher-order IFs, but their work assumes exchangeable or i.i.d. train and test data.

Let $\hat{\theta}$ denote the fitted parameters for the predictor $\hat{\mu}$ trained on the full training data. Given Assumptions 1-4 in Giordano et al. [2019a], which require that $\hat{\theta}$ is a local minimum of the objective function, that the objective is $k+1$ times continuously differentiable with bounded norms, and that the objective is strongly convex in a neighborhood of $\hat{\theta}$, the $k$-th order leave-one-out IF refers to the $k$-th order directional derivative of the model parameters $\hat{\theta}$ with respect to the data weights, in the direction of the leave-one-out change in weights (see Appendix A.4 for more details). With each of these $k$-th order leave-one-out IFs for $k \in \{1, \dots, K\}$, denoted with the condensed notation $\delta^k_i \hat{\theta}$, we can construct a $K$-th order Taylor series approximation to estimate the leave-one-out model parameters $\hat{\theta}_{-i}$:

$$\hat{\theta}^{\text{IF-}K}_{-i} = \hat{\theta} + \sum_{k=1}^{K} \frac{1}{k!}\, \delta^k_i \hat{\theta}. \quad (7)$$

In this work we implement the algorithm proposed by Giordano et al. [2019a] to compute higher-order IFs, a recursive procedure based on forward-mode automatic differentiation [Maclaurin et al., 2015] for memory efficiency in computing higher-order directional derivatives. Our introduction of IFs is highly simplified; we refer to Appendix A.4 and to Giordano et al. [2019a] for more details.
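To give a concrete, runnable flavor of this idea, the sketch below shows only the first-order special case of (7) for an L2-regularized linear model, where the influence of dropping a point reduces to one Newton-style step with the full-data Hessian; this is our simplified illustration under those assumptions, not the recursive higher-order algorithm of Giordano et al. [2019a] used in the paper.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize 0.5 * sum_i (y_i - x_i^T theta)^2 + 0.5 * lam * ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_params_first_order_if(X, y, lam=1.0):
    """First-order IF approximation to every leave-one-out parameter vector.

    Removing point i changes the gradient at the full-data optimum by x_i * r_i
    (with r_i the full-model residual), so one Newton step with the full-data
    Hessian H = X^T X + lam*I gives theta_{-i} ≈ theta_hat - H^{-1} x_i r_i.
    """
    d = X.shape[1]
    theta_hat = ridge_fit(X, y, lam)
    H_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    resid = y - X @ theta_hat
    theta_loo = theta_hat[None, :] - (X @ H_inv) * resid[:, None]
    return theta_loo, theta_hat

# Quick sanity check against an exact refit on toy data (should be close, not exact).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
theta_loo, theta_hat = loo_params_first_order_if(X, y)
exact = ridge_fit(np.delete(X, 0, axis=0), np.delete(y, 0))
print(np.round(theta_loo[0], 4), np.round(exact, 4))
```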
2.6 Error assessment

Whereas conformal prediction and related methods generate prediction intervals that control the error probability (miscoverage level $\alpha$) at a user-specified level, we refer to the reverse task as error assessment: estimating the probability that a prediction is erroneous or not, based on user-specified error criteria. For instance, a user might define an error as any deviation between the prediction $\hat{\mu}(X_{n+1})$ and the true label $Y_{n+1}$ greater than some acceptable tolerance threshold $\tau$: that is, when $|Y_{n+1} - \hat{\mu}(X_{n+1})| > \tau$. In Section 3.3, we present a general approach to repurposing predictive inference methods with validity under covariate shift to error assessment.

We note that for score functions that are monotonic in $y$, such as $\hat{S}(x, y) = y - \hat{\mu}(x)$, guarantees for this error assessment task can be obtained using conformal predictive distributions (CPDs) as described by Vovk et al. [2017] (also see Vovk et al. [2020], Vovk and Bendtsen [2018], Xie and Zheng [2022]). In regression tasks with exchangeable data, CPDs generate a probability distribution for the label over $\mathbb{R}$. However, CPDs require that score functions be monotonic in $y$, whereas we allow for certain non-monotone conformity scores such as the commonly used absolute-value residual $|y - \hat{\mu}(x)|$; moreover, CPDs assume exchangeable data, whereas our approach extends to covariate shift.

3 Proposed approach and theoretical results

3.1 JAW: Jackknife+ weighted with data-dependent weights

We present JAW, the JAckknife+ Weighted with data-dependent likelihood-ratio weights, defined by the following predictive interval:

$$\hat{C}^{\text{JAW}}_{n,\alpha}(X_{n+1}) = \Big[\, Q^-_{\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{\hat{\mu}_{-i}(X_{n+1}) - R^{\text{LOO}}_i} \big\},\ Q^+_{1-\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{\hat{\mu}_{-i}(X_{n+1}) + R^{\text{LOO}}_i} \big\} \,\Big], \quad (8)$$

where $R^{\text{LOO}}_i = |\hat{\mu}_{-i}(X_i) - Y_i|$ and $p^w_i(x)$ for $i \in \{1, \dots, n+1\}$ is as in (3); here $Q^-_{\alpha}\{ p^w_i(X_{n+1})\, \delta_{\hat{\mu}_{-i}(X_{n+1}) - R^{\text{LOO}}_i} \}$ denotes the level $\alpha$ quantile of the empirical distribution $\sum_{i=1}^{n} p^w_i(X_{n+1})\, \delta_{\hat{\mu}_{-i}(X_{n+1}) - R^{\text{LOO}}_i} + p^w_{n+1}(X_{n+1})\, \delta_{-\infty}$, and $Q^+_{1-\alpha}\{ p^w_i(X_{n+1})\, \delta_{\hat{\mu}_{-i}(X_{n+1}) + R^{\text{LOO}}_i} \}$ is the level $1-\alpha$ quantile of $\sum_{i=1}^{n} p^w_i(X_{n+1})\, \delta_{\hat{\mu}_{-i}(X_{n+1}) + R^{\text{LOO}}_i} + p^w_{n+1}(X_{n+1})\, \delta_{+\infty}$. We choose to define JAW using likelihood-ratio weights $w(X_i) = d\widetilde{P}_X(X_i)/dP_X(X_i)$ in the $p^w_i(x)$ to address covariate shift, but a similar result holds for other instances of weighted exchangeability and corresponding data-dependent weight functions (see Appendix B.1).

We show that $\hat{C}^{\text{JAW}}_{n,\alpha}(X_{n+1})$ satisfies the same coverage guarantee as the jackknife+, except relaxing the data exchangeability assumption to allow for covariate shift, which we state formally in the following theorem.

Theorem 1. Assume data under covariate shift from (2). If $\widetilde{P}_X$ is absolutely continuous with respect to $P_X$, then the JAW interval in (8) satisfies

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}^{\text{JAW}}_{n,\alpha}(X_{n+1})\big\} \geq 1 - 2\alpha. \quad (9)$$

Remark 1. The results from Tibshirani et al. [2019] do not directly imply Theorem 1. The approach in Tibshirani et al. [2019] relies on leveraging the weighted exchangeability of the data to reweight the nonconformity scores $\{V_1, \dots, V_{n+1}\}$ so they can be treated as exchangeable, and for the jackknife+ this approach would entail treating $\hat{\mu}_{-i}(X_{n+1}) \pm R^{\text{LOO}}_i$ as implicit nonconformity scores. But observe that for $i \in \{1, \dots, n\}$, $\hat{\mu}_{-i}$ is trained on $n-1$ datapoints, whereas $\hat{\mu}_{-(n+1)} = \hat{\mu}$ is trained on $n$ datapoints. Thus, no reweighting can make $\hat{\mu}_{-i}$ equivalent in distribution to $\hat{\mu}$ and thereby allow us to treat the reweighted $\hat{\mu}_{-i}(X_{n+1}) \pm R^{\text{LOO}}_i$ and $\hat{\mu}(X_{n+1}) \pm R^{\text{LOO}}_{n+1}$ as exchangeable.

Proof sketch: Our proof technique for Theorem 1 extends the jackknife+ coverage guarantee proof in Barber et al. [2021] to the covariate shift setting for JAW, using likelihood-ratio weights as in Tibshirani et al. [2019]. The full proof is given in Appendix C.1, but the outline is as follows.

Setup: Following Barber et al. [2021], we define a set of leave-two-out models $\{\tilde{\mu}_{-(i,j)}\}$. We then generalize the notion of "strange" points described in Barber et al. [2021] to covariate shift.
1. Bounding the total normalized weight of strange points: We establish deterministically that the total normalized weight of strange points cannot exceed $2\alpha$.
2. Weighted exchangeability using the leave-two-out models: Using the leave-two-out model construction, we leverage weighted exchangeability to show that the probability that the test point $n+1$ is strange is thus bounded by $2\alpha$.
3. Connection to JAW: Lastly, we show that the JAW interval can only fail to cover the test label value $Y_{n+1}$ if $n+1$ is a strange point.

While JAW assumes access to oracle likelihood-ratio weights, in practice this information often has to be estimated. See Appendix D.5 for a discussion and experiments on JAW with estimated weights.

3.2 JAWA: Using higher-order influence functions to approximate JAW without retraining

For computationally efficient JAW Approximations that avoid retraining $n$ leave-one-out models, we propose the JAWA sequence, which approximates the leave-one-out models required by JAW using higher-order influence functions. For each training point $i \in \{1, \dots, n\}$, define the $K$-th order influence function approximation to the leave-one-out refit parameters $\hat{\theta}_{-i}$, obtained from Algorithm 4 in Giordano et al. [2019a], as given by equation (7), and let $\hat{\mu}^{\text{IF-}K}_{-i}$ be the model with these approximated parameters $\hat{\theta}^{\text{IF-}K}_{-i}$ for each $i \in \{1, \dots, n\}$. Then, the prediction interval for the $K$-th order JAWA (i.e., for JAWA-$K$) is given by

$$\hat{C}^{\text{JAWA-}K}_{n,\alpha}(X_{n+1}) = \Big[\, Q^-_{\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{\hat{\mu}^{\text{IF-}K}_{-i}(X_{n+1}) - R^{\text{IF-}K,\text{LOO}}_i} \big\},\ Q^+_{1-\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{\hat{\mu}^{\text{IF-}K}_{-i}(X_{n+1}) + R^{\text{IF-}K,\text{LOO}}_i} \big\} \,\Big], \quad (10)$$

with $R^{\text{IF-}K,\text{LOO}}_i = |Y_i - \hat{\mu}^{\text{IF-}K}_{-i}(X_i)|$, with $p^w_i(x)$ as in (3), and with quantiles defined analogously to JAW.

We now provide an asymptotic coverage guarantee for $\hat{C}^{\text{JAWA-}K}_{n,\alpha}(X_{n+1})$ that holds either in the limit of the sample size or in the limit of the influence function order, under regularity conditions formally described in Giordano et al. [2019a]. These assumptions concern the regularity and continuity of the training data, local convexity of the objective (or that the Hessian is strongly positive definite), and the existence and boundedness of the objective's 1st through $(K+1)$-th order directional derivatives.

Theorem 2. Assume data under covariate shift from (2) and that $\widetilde{P}_X$ is absolutely continuous with respect to $P_X$. Let Assumptions 1-4 and either Condition 2 or Condition 4 from Giordano et al. [2019a] hold uniformly for all $n$. Then, in the limit of the training sample size $n \to \infty$ or in the limit of the influence function order $K \to \infty$, the JAWA-$K$ interval in (10) satisfies

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}^{\text{JAWA-}K}_{n,\alpha}(X_{n+1})\big\} \geq 1 - 2\alpha. \quad (11)$$

We leave the proof to Appendix C.2, but we note that the result follows by combining Propositions 1 and 3 in Giordano et al. [2019a] with the JAW coverage guarantee that we present in Theorem 1.

3.3 Error assessment under covariate shift

We now propose a general approach to repurposing predictive inference methods with validity under covariate shift from predictive interval generation to the reverse task: estimating the probability that a prediction is erroneous or not, based on user-specified error criteria. For example, consider a user that defines a prediction $\hat{\mu}(X_{n+1})$ as erroneous, relative to the true label $Y_{n+1}$, if it is farther than some acceptable tolerance threshold $\tau$ from $Y_{n+1}$: i.e., if $|Y_{n+1} - \hat{\mu}(X_{n+1})| > \tau$. For this common regression error criterion, our approach to adapting a method such as JAW (8) or weighted split conformal prediction [Tibshirani et al., 2019] to error assessment reduces to first defining the set of labels that would not be considered erroneous, $E = [\hat{\mu}(X_{n+1}) - \tau,\ \hat{\mu}(X_{n+1}) + \tau]$, and then finding the method's largest predictive interval contained within $E$, call it $\hat{C}^{\text{w-audit}}_{n,\alpha_E}(X_{n+1})$. The coverage guarantee for $\hat{C}^{\text{w-audit}}_{n,\alpha_E}(X_{n+1})$ then yields a lower bound on $\mathbb{P}\{Y_{n+1} \in E\}$, the probability of no error (or, equivalently, an upper bound on the error probability). See Figure 2 for an illustration of this example.

Figure 2: Illustration of the approach to repurposing a predictive inference method ("w-audit") to error assessment. The interval $E = [\hat{\mu}(X_{n+1}) - \tau,\ \hat{\mu}(X_{n+1}) + \tau]$ is shown in violet, the lower score values $\{V^L_i\}$ in blue, the upper score values $\{V^U_i\}$ in red, and the interval $\hat{C}^{\text{w-audit}}_{n,\alpha_E}(X_{n+1})$ in green. Each vertical line at a location $V_i$ on the real line represents a point mass $\delta_{V_i}$ with height corresponding to the normalized likelihood-ratio weight $p^w_i$.
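Pulling the pieces together, the following self-contained Python sketch computes a JAW-style interval as in (8) on synthetic data and then repurposes it to the error-assessment example above by searching for the largest interval contained in $E = [\hat{\mu}(X_{n+1}) - \tau,\ \hat{\mu}(X_{n+1}) + \tau]$. The data, the ridge predictor, the assumed oracle likelihood ratio, and the helper names are ours for illustration (this is not the released jaws code), and the small weighted-quantile helper is repeated here so the block stays self-contained.

```python
import numpy as np
from sklearn.linear_model import Ridge

def wq(values, probs, beta, side):
    """Level-beta quantile of sum_i probs[i]*delta_{values[i]} plus the leftover
    probability mass placed at +inf (side='+') or -inf (side='-')."""
    order = np.argsort(values)
    v, p = np.asarray(values, float)[order], np.asarray(probs, float)[order]
    leftover = 1.0 - p.sum()
    if side == "+":
        cdf = np.cumsum(p)
        i = np.searchsorted(cdf, beta, side="left")
        return v[i] if i < len(v) else np.inf
    if beta <= leftover:
        return -np.inf
    cdf = leftover + np.cumsum(p)
    return v[min(np.searchsorted(cdf, beta, side="left"), len(v) - 1)]

def lratio(x):
    # Assumed oracle likelihood ratio dP~_X/dP_X (exponential tilting, up to a constant).
    return np.exp(1.0 * x[:, 0])

# Synthetic training data and a single shifted test point (assumptions for illustration).
rng = np.random.default_rng(0)
n, alpha, tau = 200, 0.1, 1.0
x_train = rng.normal(0.0, 1.0, size=(n, 1))
y_train = 2.0 * x_train[:, 0] + rng.normal(scale=0.5, size=n)
x_test = np.array([[1.5]])

# Leave-one-out refits (exact, as in JAW; JAWA would replace these with IF approximations).
loo_pred, loo_resid = np.empty(n), np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    m = Ridge(alpha=1.0).fit(x_train[keep], y_train[keep])
    loo_pred[i] = m.predict(x_test)[0]
    loo_resid[i] = abs(y_train[i] - m.predict(x_train[i:i + 1])[0])

# Normalized likelihood-ratio weights as in (3): p[:n] for training points, p[n] for the test point.
w = np.append(lratio(x_train), lratio(x_test))
p = w / w.sum()

# JAW-style interval as in (8).
lo = wq(loo_pred - loo_resid, p[:n], alpha, side="-")
hi = wq(loo_pred + loo_resid, p[:n], 1 - alpha, side="+")
print("JAW interval:", (lo, hi))

# Error assessment: smallest miscoverage level whose interval still fits inside E.
mu_hat = Ridge(alpha=1.0).fit(x_train, y_train).predict(x_test)[0]
ok = [a for a in np.linspace(0.001, 0.999, 999)
      if wq(loo_pred - loo_resid, p[:n], a, side="-") >= mu_hat - tau
      and wq(loo_pred + loo_resid, p[:n], 1 - a, side="+") <= mu_hat + tau]
alpha_E = min(ok) if ok else None
# Estimated probability of no error at the target coverage of that interval; the
# worst-case JAW-E guarantee would scale the miscoverage by a constant (see Theorem 3).
print("estimated P{no error}:", 1 - alpha_E if alpha_E is not None else 0.0)
```

For the unweighted jackknife+ of (5), the same sketch applies with uniform weights, i.e., `p = np.full(n + 1, 1.0 / (n + 1))`.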
Generally, a user must specify error criteria by a test-point score function $\hat{S} : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ (for conformal prediction, a nonconformity score), as well as minimum and maximum acceptable score values $\tau^-$ and $\tau^+$, where $\tau^- < \tau^+$; i.e., $\hat{\mu}(X_{n+1})$ is considered erroneous if $\hat{S}(X_{n+1}, Y_{n+1}) < \tau^-$ or if $\tau^+ < \hat{S}(X_{n+1}, Y_{n+1})$. (For nonnegative $\hat{S}$, we might let $\tau^- = 0$.) Then, the values of $y$ for which observing $Y_{n+1} = y$ would not imply that $\hat{\mu}(X_{n+1})$ is erroneous are

$$E = \big\{ y \in \mathbb{R} : \tau^- \leq \hat{S}(X_{n+1}, y) \leq \tau^+ \big\}. \quad (12)$$

Now, assume a predictive inference method with predictive sets that can be written in the form

$$\hat{C}^{\text{w-audit}}_{n,\alpha}(X_{n+1}) = \Big\{ y \in \mathbb{R} : \hat{Q}^-_{\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{V^L_i} \big\} \leq \hat{S}(X_{n+1}, y) \leq \hat{Q}^+_{1-\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{V^U_i} \big\} \Big\} \quad (13)$$

with valid coverage guaranteed under covariate shift. (Note that (13) gives the JAW interval (8) by setting $\hat{S}(x, y) = y - \hat{\mu}(x)$, $V^L_i = \hat{\mu}_{-i}(X_{n+1}) - \hat{\mu}(X_{n+1}) - R^{\text{LOO}}_i$, and $V^U_i = \hat{\mu}_{-i}(X_{n+1}) - \hat{\mu}(X_{n+1}) + R^{\text{LOO}}_i$; see Appendix B.3. Similarly, (13) gives the prediction interval for weighted split conformal prediction [Tibshirani et al., 2019] with absolute-value residual scores when $\hat{S}(x, y) = |y - \hat{\mu}(x)|$ and, for all calibration data $i$, we let $V^U_i = |Y_i - \hat{\mu}(X_i)|$ and $V^L_i = 0$.) Then, defining

$$\alpha^{\text{w-audit}}_E = \inf\Big\{ \alpha \in (0, 1) : \tau^- \leq \hat{Q}^-_{\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{V^L_i} \big\} \ \text{and}\ \hat{Q}^+_{1-\alpha}\big\{ p^w_i(X_{n+1})\, \delta_{V^U_i} \big\} \leq \tau^+ \Big\}, \quad (14)$$

we can estimate the probability of $\hat{\mu}(X_{n+1})$ not resulting in an error as in (12) as

$$\hat{p}\{Y_{n+1} \in E\} = \begin{cases} 1 - \alpha^{\text{w-audit}}_E & \text{if } \alpha^{\text{w-audit}}_E \text{ exists}, \\ 0 & \text{otherwise.} \end{cases} \quad (15)$$

While the target coverage for $\hat{C}^{\text{w-audit}}_{n,\alpha_E}(X_{n+1})$ is used in (15), the following theorem gives the worst-case error assessment guarantee for covariate shift (proof in Appendix C.3). Corollary 1 in Appendix B.3 and Corollary 2 in Appendix B.4 give the error assessment guarantees for JAW-E and JAWA-E respectively. Appendix B.2 gives the analogous guarantee for exchangeable data.

Theorem 3. Assume a predictive inference method of the form (13) has coverage guarantee $\mathbb{P}\{Y_{n+1} \in \hat{C}^{\text{w-audit}}_{n,\alpha}(X_{n+1})\} \geq 1 - c_1\alpha - c_2$, with $c_1, c_2 \in \mathbb{R}$, under covariate shift (2) where $\widetilde{P}_X$ is absolutely continuous with respect to $P_X$. Define $E$ as in (12) and $\alpha^{\text{w-audit}}_E$ as in (14). Then,

$$\mathbb{P}\{Y_{n+1} \in E\} \geq \begin{cases} 1 - c_1\,\alpha^{\text{w-audit}}_E - c_2 & \text{if } \alpha^{\text{w-audit}}_E \text{ exists and } 1 - c_1\,\alpha^{\text{w-audit}}_E - c_2 \geq 0, \\ 0 & \text{otherwise.} \end{cases} \quad (16)$$

4 Experiments

Additional analysis is provided in Appendix D; code is available at https://github.com/drewprinster/jaws.git.

4.1 Datasets and creation of covariate shift

We conduct experiments on five UCI datasets [Dua and Graff, 2017] with varying dimensionality (Table 2): airfoil self-noise, red wine quality prediction [Cortez et al., 2009], wave energy converters, superconductivity [Hamidieh, 2018], and communities and crime [Redmond and Baveja, 2002].

Table 2: Statistics for the UCI datasets. Only the first 2000 samples were used for the wave and superconductivity datasets (for wave, the first 2000 samples of Adelaide data).

| Dataset | # of samples | # of features | Label range |
|---|---|---|---|
| Airfoil self-noise (airfoil) | 1503 | 5 | [103.38, 140.987] |
| Red wine quality (wine) | 1599 | 11 | [3, 8] |
| Wave energy converters (wave) | 2000 | 48 | [1226969, 1449349] |
| Superconductivity (superconduct) | 2000 | 81 | [0.2, 136.0] |
| Communities and crime (communities) | 1994 | 99 | [0, 1] |

We use exponential tilting to induce covariate shift on the test data, based on the approach used in Tibshirani et al. [2019].
We first randomly sample 200 points for the training data, and then sample the biased test data from the remaining datapoints that are not used for training, with probabilities proportional to the exponential tilting weights. See Appendix D.1 for additional details.

4.2 Baselines

Baselines for comparison to JAW. We compared JAW to the following baselines:

1. Naive estimates are based on the training-data residuals $|Y_i - \hat{\mu}(X_i)|$, which suffer from overfitting.
2. Jackknife uses classic jackknife resampling as in (4).
3. Jackknife+ follows (5), which replaces the prediction $\hat{\mu}(X_{n+1})$ in the jackknife with $\hat{\mu}_{-i}(X_{n+1})$.
4. Jackknife-mm [Barber et al., 2021] is a more conservative alternative to jackknife+ that guarantees coverage at the $1-\alpha$ level with exchangeable data, but usually with overly wide intervals:
$$\hat{C}^{\text{jackknife-mm}}_{n,\alpha}(X_{n+1}) = \Big[\, \min_{i=1,\dots,n} \hat{\mu}_{-i}(X_{n+1}) - Q^+_{1-\alpha}\big\{\tfrac{1}{n+1}\delta_{R^{\text{LOO}}_i}\big\},\ \max_{i=1,\dots,n} \hat{\mu}_{-i}(X_{n+1}) + Q^+_{1-\alpha}\big\{\tfrac{1}{n+1}\delta_{R^{\text{LOO}}_i}\big\} \,\Big].$$
5. Cross validation+ (CV+) [Barber et al., 2021] is similar to jackknife+ but splits the data into K folds and replaces $\hat{\mu}_{-i}(X_{n+1})$ with $\hat{\mu}_{-k}(X_{n+1})$, the model trained with the $k$th subset removed.
6. Split method follows split conformal prediction, which uses half the data for training and the other half for generating the nonconformity scores.
7. Weighted split is a version of split conformal with likelihood-ratio weights to maintain coverage under covariate shift, as in Tibshirani et al. [2019].

Baselines for comparison to JAWA. For influence function orders $K \in \{1, 2, 3\}$, we compared the proposed JAWA-$K$ method with $K$-th order influence function approximations of the jackknife-based baselines that we used as comparisons to JAW; we thus refer to these approximations as IF-$K$ jackknife, IF-$K$ jackknife+, and IF-$K$ jackknife-mm. Each baseline compared to JAWA-$K$ is thus also approximated with the same $K$-th order leave-one-out influence function models.

4.3 Experimental results

We report experimental results on the predictive interval-generation task for both JAW and JAWA and on the error assessment task for JAW, compared to baselines. Additional experimental details and supplementary experiments can be found in Appendix D, including experiments with estimated likelihood-ratio weights in D.5, an ablation study over shift magnitudes in D.6, and coverage histograms in D.9.

4.3.1 Interval generation results for JAW: Coverage and interval width

Figure 3 compares JAW and its baselines, firstly regarding mean coverage and secondarily regarding median interval width, on all five UCI datasets for both neural network and random forest predictors, averaged over 1000 experimental replicates. See Appendix D.2 for predictor function details. Meeting the target coverage level of $1-\alpha$ is the primary goal of the interval-generation audit task, but for methods that meet or nearly meet the target coverage level, smaller interval widths are more informative. Additionally, smaller variance in coverage indicates a more reliable or consistent method.

As seen in Figure 3, the JAW predictive interval coverage is above the target level of 0.9 across all datasets, for both random forest and neural network $\hat{\mu}$ functions, along with the jackknife-mm and weighted split methods. However, JAW's interval widths are generally smaller and thus more informative than those of jackknife-mm (which are often overly large, as noted in Barber et al. [2021]).
Weighted split and JAW perform similarly on mean coverage and median interval width (both methods have coverage guarantees under covariate shift), but JAW avoids sample splitting and as a result has lower coverage variance than weighted split for all dataset and predictor conditions (see Appendix D.3), which suggests that JAW's predictive intervals are more reliable.

Figure 3: Mean coverage (first row) and median interval width (second row) for neural network and random forest predictors on the UCI datasets (panels: (a) airfoil, (b) wine, (c) wave, (d) superconduct, (e) communities). The dashed line is the target coverage level ($1-\alpha = 0.9$). Error bars show the standard error over 1000 repeated experiments. JAW maintains target coverage under covariate shift for all predictor and dataset conditions, along with jackknife-mm and weighted split; however, JAW's intervals are generally smaller and thus more informative than jackknife-mm's, and JAW's coverage variance is smaller and thus more reliable than weighted split's (Appendix D.3).

4.3.2 Interval generation results for JAWA: Coverage and interval width

Figure 4 evaluates JAWA coverage and interval width compared to baselines for IF orders $K \in \{1, 2, 3\}$ with a neural network predictor (see Appendix D.2 for predictor details). As with the JAW experiments, coverage at the target level of $1-\alpha = 0.9$ is the primary goal, while secondarily, smaller intervals are more informative for methods that meet or nearly meet target coverage. For three of the five datasets (airfoil, wine, and communities), JAWA is the only method that consistently reaches or nearly reaches the target coverage level. JAWA and all the baselines perform well on the wave dataset, and on the superconduct dataset JAWA still outperforms the approximations of jackknife and jackknife+ for all IF orders. Appendix D.7 provides an example empirical comparison of JAWA and JAW runtimes, which demonstrates that JAWA can be orders of magnitude faster to compute.

4.3.3 Error assessment results for JAW-E: AUC

We now turn to an error-assessment audit task where the goal is to evaluate a method's ability to estimate the probability that a given prediction is erroneous or not, based on the error criterion $|Y_{n+1} - \hat{\mu}(X_{n+1})| > \tau$. Let $E = [\hat{\mu}(X_{n+1}) - \tau,\ \hat{\mu}(X_{n+1}) + \tau]$. Then, the goal is to estimate the probability that $\hat{\mu}(X_{n+1})$ is correct, i.e., $Y_{n+1} \in E$, or an error, i.e., $Y_{n+1} \notin E$. For five predictive interval-generation methods repurposed to the error assessment task (JAW-E, jackknife+-E, cross validation+-E, split conformal-E, and weighted split conformal-E), Figure 5 reports the area under the receiver operating characteristic curve (AUROC) for 50 repeated experiments with a neural network predictor, with dataset-specific values of $\tau$ (see Appendix D.4 for details and additional experiments with a random forest predictor). Better performing methods have higher AUROC values for all values of $\tau$.

For most tolerance levels and datasets, JAW achieves AUROC values comparable to jackknife+ and CV+, as well as higher AUROC values than split and weighted split conformal prediction.

Figure 4: Mean coverage (first row) and median interval width (second row) for JAWA and baselines for influence function orders $K \in \{1, 2, 3\}$ (panels: (a) airfoil, (b) wine, (c) wave, (d) superconduct, (e) communities). The dashed line is the target coverage level ($1-\alpha = 0.9$). Error bars show the standard error over 200 repeated experiments.
JAWA is more consistent than the baselines in reaching or nearly reaching the target coverage level across datasets and influence function orders, and it is more computationally efficient than JAW (Appendix D.7).

Figure 5: AUROC values for tolerance levels $\tau$ across the datasets for the neural network predictor (panels: (a) airfoil, (b) wine, (c) wave, (d) superconduct, (e) communities), averaged across 50 experiment replicates. Results for the random forest predictor are in Appendix D.4.

The comparable performance of JAW and jackknife+ is likely due to a tradeoff between the benefit of JAW's validity under covariate shift and its reduced effective sample size inherent to likelihood-ratio weighting, as jackknife+'s and CV+'s AUROC degrades with reduced sample size (Appendix D.6).

5 Conclusion

In this paper, we develop JAWS, a series of wrapper methods for distribution-free predictive uncertainty auditing tasks when the data exchangeability assumption is violated due to covariate shift. We also propose a general approach to repurposing any distribution-free predictive inference method to the error assessment task. We provide rigorous finite-sample guarantees for JAW and JAW-E on the interval generation and error assessment tasks respectively, and analogous asymptotic guarantees for the computationally efficient JAWA and JAWA-E. We moreover demonstrate superior performance of the JAWS series on a variety of datasets. In supplementary experiments we investigate a number of JAWS limitations: weight estimation can address the assumed access to oracle weights with similar empirical performance (Appendix D.5), and JAW's increased coverage variance under covariate shift can be explained by reduced effective sample size due to importance weighting (Appendix D.6). Additionally, we note that JAW and JAWA share a limitation with weighted conformal prediction [Tibshirani et al., 2019] of potentially producing overly large intervals in extreme covariate shift cases where a test point's normalized likelihood ratio approaches or exceeds $\alpha$. In the future, we aim to address the problems of reducing coverage variance and improving predictive interval sharpness.

Acknowledgments

This work was supported by the National Science Foundation grant IIS-1840088. We thank Yoav Wald for helpful discussions and advice, as well as Peter Schulam for sharing code that facilitated our influence function approximation implementation and AUC experiments.

References

Scott Thiebes, Sebastian Lins, and Ali Sunyaev. Trustworthy artificial intelligence. Electronic Markets, 31(2):447–464, 2021.
Soumya Ghosh, Q Vera Liao, Karthikeyan Natesan Ramamurthy, Jiri Navratil, Prasanna Sattigeri, Kush R Varshney, and Yunfeng Zhang. Uncertainty quantification 360: A holistic toolkit for quantifying and communicating the uncertainty of AI. arXiv preprint arXiv:2106.01410, 2021.
Richard Tomsett, Alun Preece, Dave Braines, Federico Cerutti, Supriyo Chakraborty, Mani Srivastava, Gavin Pearson, and Lance Kaplan. Rapid trust calibration through interpretable and uncertainty-aware AI. Patterns, 1(4):100049, 2020.
Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, et al. Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 401–413, 2021.
Peter Schulam and Suchi Saria. Can you trust this prediction? Auditing pointwise reliability after learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1022–1031. PMLR, 2019.
Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32, 2019.
Dennis Ulmer, Lotta Meijerink, and Giovanni Cinà. Trust issues: Uncertainty estimation does not enable reliable OOD detection on medical tabular data. In Machine Learning for Health, pages 341–354. PMLR, 2020.
Aurick Zhou and Sergey Levine. Amortized conditional normalized maximum likelihood: Reliable out of distribution uncertainty estimation. In International Conference on Machine Learning, pages 12803–12812. PMLR, 2021.
Alex Chan, Ahmed Alaa, Zhaozhi Qian, and Mihaela Van Der Schaar. Unlabelled data improves Bayesian uncertainty calibration under covariate shift. In International Conference on Machine Learning, pages 1392–1402. PMLR, 2020.
Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.
Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
Vladimir Vovk. Transductive conformal predictors. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 348–360. Springer, 2013.
Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, 2021.
Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. arXiv preprint arXiv:2202.13415, 2022.
Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32, 2019.
Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, pages 844–853. PMLR, 2021.
Ryan Giordano, Michael I Jordan, and Tamara Broderick. A higher-order swiss army infinitesimal jackknife. arXiv preprint arXiv:1907.12116, 2019a.
Vineeth Balasubramanian, Shen-Shyang Ho, and Vladimir Vovk. Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications. Newnes, 2014.
Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014.
Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. INTECH Open Access Publisher, Rijeka, 2008.
Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475–490. PMLR, 2012.
Eugene Ndiaye and Ichiro Takeuchi. Computing full conformal prediction set with approximate homotopy. Advances in Neural Information Processing Systems, 32, 2019.
Gianluca Zeni, Matteo Fontana, and Simone Vantini. Conformal prediction: a unified review of theory and new challenges. arXiv preprint arXiv:2005.07972, 2020.
Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
Vladimir Vovk, Ilia Nouretdinov, Valery Manokhin, and Alexander Gammerman. Cross-conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pages 37–51. PMLR, 2018.
Rupert G Miller. The jackknife-a review. Biometrika, 61(1):1–15, 1974.
Lukas Steinberger and Hannes Leeb. Conditional predictive inference for high-dimensional stable algorithms. arXiv preprint arXiv:1809.01412, 2018.
Lukas Steinberger and Hannes Leeb. Leave-one-out prediction intervals in linear regression models with many variables. arXiv preprint arXiv:1602.05801, 2016.
R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.
Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894. PMLR, 2017.
Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1139–1147. PMLR, 2019b.
Ahmed Alaa and Mihaela Van Der Schaar. Discriminative jackknife: Quantifying uncertainty in deep learning via higher-order influence functions. In International Conference on Machine Learning, pages 165–174. PMLR, 2020.
Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Effortless gradients in numpy. In ICML 2015 AutoML Workshop, volume 238, 2015.
Vladimir Vovk, Jieli Shen, Valery Manokhin, and Min-ge Xie. Nonparametric predictive distributions based on conformal prediction. In Conformal and Probabilistic Prediction and Applications, pages 82–102. PMLR, 2017.
Vladimir Vovk, Ivan Petej, Ilia Nouretdinov, Valery Manokhin, and Alexander Gammerman. Computationally efficient versions of conformal predictive distributions. Neurocomputing, 397:292–308, 2020.
Vladimir Vovk and Claus Bendtsen. Conformal predictive decision making. In Conformal and Probabilistic Prediction and Applications, pages 52–62. PMLR, 2018.
Min-ge Xie and Zheshi Zheng. Homeostasis phenomenon in conformal prediction and predictive distribution functions. International Journal of Approximate Reasoning, 141:131–145, 2022.
Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346–354, 2018.
Michael Redmond and Alok Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.
Mary Feng, Gilmer Valdes, Nayha Dixit, and Timothy D Solberg. Machine learning in radiation oncology: opportunities, requirements, and needs. Frontiers in Oncology, 8:110, 2018.
Elizabeth Huynh, Ahmed Hosny, Christian Guthier, Danielle S Bitterman, Steven F Petit, Daphne A Haas-Kogan, Benjamin Kann, Hugo JWL Aerts, and Raymond H Mak. Artificial intelligence in radiation oncology. Nature Reviews Clinical Oncology, 17(12):771–781, 2020.
Saul N Weingart, Lulu Zhang, Megan Sweeney, and Michael Hassett. Chemotherapy medication errors. The Lancet Oncology, 19(4):e191–e199, 2018.
Marcel Van Herk. Errors and margins in radiotherapy. In Seminars in Radiation Oncology, volume 14, pages 52–64. Elsevier, 2004.
Howard Gurney. How to calculate the dose of chemotherapy. British Journal of Cancer, 86(8):1297–1302, 2002.
Michael R Cohen, Roger W Anderson, Richard M Attilio, Laurence Green, Raymond J Muller, and Jane M Pruemer. Preventing medication errors in cancer chemotherapy. American Journal of Health-System Pharmacy, 53(7):737–746, 1996.
Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9), 2009.
Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. arXiv preprint arXiv:1206.4650, 2012.
Kai Zhang, Vincent Zheng, Qiaojun Wang, James Kwok, Qiang Yang, and Ivan Marsic. Covariate shift in Hilbert space: A solution via sorrogate kernels. In International Conference on Machine Learning, pages 388–395. PMLR, 2013.
Yin Zhao, Longjun Cai, et al. Reducing the covariate shift by mirror samples in cross domain alignment. Advances in Neural Information Processing Systems, 34, 2021.
Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. Advances in Neural Information Processing Systems, 27, 2014.
Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pages 1270–1279. PMLR, 2016.
John C Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. Under review, 2, 2019.
Ashkan Rezaei, Anqi Liu, Omid Memarrast, and Brian D Ziebart. Robust fairness under covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9419–9427, 2021.
Luisa Turrin Fernholz. Von Mises Calculus for Statistical Functionals, volume 19. Springer Science & Business Media, 2012.
HG Landau. On dominance relations and the structure of animal societies: III. The condition for a score structure. The Bulletin of Mathematical Biophysics, 15(2):143–148, 1953.
Sashank Reddi, Barnabas Poczos, and Alex Smola. Doubly robust covariate shift correction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
Samyadeep Basu, Philip Pope, and Soheil Feizi. Influence functions in deep learning are fragile. arXiv preprint arXiv:2006.14651, 2020.
Fengpei Li, Henry Lam, and Siddharth Prusty. Robust importance weighting for covariate shift. In International Conference on Artificial Intelligence and Statistics, pages 352–362. PMLR, 2020.
Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I Jordan. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.