Published as a conference paper at ICLR 2024

NON-EXCHANGEABLE CONFORMAL RISK CONTROL

António Farinhas 1,2, Chrysoula Zerva 1,2, Dennis Ulmer 3,4, André F. T. Martins 1,2,5
1 Instituto de Telecomunicações, 2 Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit), 3 IT University of Copenhagen, 4 Pioneer Centre for Artificial Intelligence, 5 Unbabel
{antonio.farinhas,chrysoula.zerva,andre.t.martins}@tecnico.ulisboa.pt, dennis.ulmer@mailbox.org

ABSTRACT

Split conformal prediction has recently sparked great interest due to its ability to provide formally guaranteed uncertainty sets or intervals for predictions made by black-box neural models, ensuring a predefined probability of containing the actual ground truth. While the original formulation assumes data exchangeability, some extensions handle non-exchangeable data, which is often the case in many real-world scenarios. In parallel, some progress has been made in conformal methods that provide statistical guarantees for a broader range of objectives, such as bounding the best F1-score or minimizing the false negative rate in expectation. In this paper, we leverage and extend these two lines of work by proposing non-exchangeable conformal risk control, which allows controlling the expected value of any monotone loss function when the data is not exchangeable. Our framework is flexible, makes very few assumptions, and allows weighting the data based on its relevance for a given test example; a careful choice of weights may result in tighter bounds, making our framework useful in the presence of change points, time series, or other forms of distribution drift. Experiments with both synthetic and real-world data show the usefulness of our method.

1 INTRODUCTION

As the use of machine learning systems for automated decision-making becomes more widespread, the demand for these systems to produce reliable and trustworthy predictions has grown significantly. In this context, conformal prediction (Papadopoulos et al., 2002; Vovk et al., 2005) has recently resurfaced as an attractive framework. Instead of providing a single output, this framework creates prediction sets or intervals that inherently account for uncertainty. These sets come with a statistical guarantee known as coverage, which ensures that they contain the ground truth in expectation, thereby providing a formal promise of reliability.

The standard formulation of conformal prediction has, however, important limitations. First, it assumes that all data is exchangeable, a condition which is often violated in practice (e.g., when there is correlation over time or space). Second, while the predicted sets/intervals provide guarantees on coverage, they do not bound arbitrary losses, some of which may be more relevant for the situation at hand (e.g., the F1-score or the false negative rate in multilabel classification problems). Several works have been proposed to address these two shortcomings, namely through non-exchangeable conformal prediction (Tibshirani et al., 2019; Gibbs & Candes, 2021; Barber et al., 2023) and conformal risk control (Bates et al., 2021; Angelopoulos et al., 2023a, CRC). In this paper, we extend these lines of research and propose non-exchangeable conformal risk control (non-X CRC).
Our main contributions are:
- We propose a new method for conformal risk control that provides formal guarantees when the data is not exchangeable, while also achieving the same guarantees as existing methods if the data is in fact exchangeable (see Table 1, where we position our work in the literature).
- Theorem 1 establishes a new bound on the expected loss (assumed to be monotonic and bounded), allowing weighting the calibration data based on its relevance for a given test example.
- We demonstrate the usefulness of our framework on three tasks: multilabel classification on synthetic data by minimizing the false negative rate; monitoring electricity usage by minimizing the λ-insensitive absolute loss; and open-domain question answering by bounding the best F1-score.¹

¹Our code is available at https://github.com/deep-spin/non-exchangeable-crc.

Table 1: Our framework combines two approaches, non-exchangeable conformal prediction and conformal risk control. Through this combination we are able to control the expected value of arbitrary monotonic loss functions when the data is not exchangeable, extending both frameworks.

Method | Data assumptions | Loss
Papadopoulos et al. (2002) | exchangeable | miscoverage
Barber et al. (2023) | arbitrary | miscoverage
Angelopoulos et al. (2023a) | exchangeable | nonincreasing, arbitrary
Angelopoulos et al. (2023a, Prop. 3) | covariate shift, known likelihood ratio | nonincreasing, arbitrary
This paper | arbitrary | nonincreasing, arbitrary

Throughout the paper, we use the following definition of exchangeable data distribution, which is a weaker assumption than independent and identically distributed (i.i.d.) data.

Definition 1 (Exchangeable data distribution). Let $\mathcal{X}$ and $\mathcal{Y}$ designate input and output spaces. A data distribution on $\mathcal{X} \times \mathcal{Y}$ is said to be exchangeable if and only if we have
$$P\big((X_{\pi(1)}, Y_{\pi(1)}), \ldots, (X_{\pi(n)}, Y_{\pi(n)})\big) = P\big((X_1, Y_1), \ldots, (X_n, Y_n)\big)$$
for any finite sample $\{(X_i, Y_i)\}_{i=1}^n \subseteq \mathcal{X} \times \mathcal{Y}$ and any permutation function $\pi$. Note that if the data distribution is i.i.d., then it is also exchangeable, since $P\big((X_1, Y_1), \ldots, (X_n, Y_n)\big) = \prod_{i=1}^n P\big((X_i, Y_i)\big)$.

2 BACKGROUND

We start by providing background on conformal prediction (Papadopoulos et al., 2002; Vovk et al., 2005) in §2.1. We then discuss recent extensions of the framework: §2.2 covers the case where the data is non-exchangeable (Barber et al., 2023), which is often the situation when models are deployed in practice, and §2.3 covers an extension that pivots from guaranteeing coverage to constraining the expected value of any monotone loss function (Angelopoulos et al., 2023a), useful for tasks in which the natural notion of error is not miscoverage.

2.1 CONFORMAL PREDICTION

Although other methods exist, this paper focuses on split conformal prediction (Papadopoulos et al., 2002; hereinafter referred to simply as conformal prediction). We start with a pretrained model and measure its performance on a calibration set $\{(X_i, Y_i)\}_{i=1}^n$ of paired examples. Under the assumption of exchangeable data $\{(X_i, Y_i)\}_{i=1}^{n+1}$, conformal prediction constructs prediction sets with the following coverage guarantee:
$$P\big(Y_{n+1} \in \mathcal{C}(X_{n+1})\big) \ge 1 - \alpha, \tag{1}$$
where $(X_{n+1}, Y_{n+1})$ is a new data point and $\alpha$ a predefined error level. This is accomplished through the following steps. Let $s(x, y) \in \mathbb{R}$ be a non-conformity score function, where larger scores indicate worse agreement between $x$ and $y$. We compute the value $\hat{q}$ as the $\lceil (n+1)(1-\alpha) \rceil / n$ empirical quantile of the calibration scores and construct a prediction set as follows:
$$\mathcal{C}(X_{n+1}) = \big\{ y : s(X_{n+1}, y) \le \hat{q} \big\}. \tag{2}$$
This prediction set satisfies the coverage guarantee in Eq. (1); see, e.g., Angelopoulos & Bates (2021, App. D) for a proof.
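For concreteness, a minimal, illustrative sketch of this recipe is shown below, assuming the non-conformity scores of the calibration set have already been computed; function names and toy data are illustrative only.

```python
# Minimal sketch of split conformal prediction (Eqs. 1-2), assuming the
# non-conformity scores s(X_i, Y_i) of the calibration set are given.
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """q_hat: the ceil((n+1)(1-alpha))/n empirical quantile of the calibration scores."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    return np.quantile(cal_scores, level, method="higher")

def prediction_set(test_scores, q_hat):
    """C(X_{n+1}) = {y : s(X_{n+1}, y) <= q_hat}, for a finite set of candidate outputs."""
    return np.flatnonzero(test_scores <= q_hat)

# toy usage: 100 calibration scores, 5 candidate labels for one test input
rng = np.random.default_rng(0)
q_hat = conformal_quantile(rng.uniform(size=100), alpha=0.1)
print(prediction_set(rng.uniform(size=5), q_hat))
```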
While this guarantee helps to ensure a certain reliability of the calibrated model, the assumption of exchangeable data is often not true when models are deployed in practice, e.g., due to distribution drift in time series or correlations between different data points.

2.2 NON-EXCHANGEABLE CONFORMAL PREDICTION

Let us now consider prespecified weights $\{w_i\}_{i=1}^n \in [0, 1]^n$ and define $\tilde{w}_i := w_i / (1 + \sum_{j=1}^n w_j)$. We take a look at a generalization of conformal prediction put together by Barber et al. (2023), which provides the following coverage guarantee, also valid when exchangeability is violated:
$$P\big(Y_{n+1} \in \mathcal{C}(X_{n+1})\big) \ge 1 - \alpha - \sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i), \tag{3}$$
where $Z := \big((X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, Y_{n+1})\big)$ is a sequence of $n$ calibration examples followed by a test example, $Z^i$ denotes $Z$ after swapping $(X_i, Y_i)$ with $(X_{n+1}, Y_{n+1})$, and $d_{\mathrm{TV}}(Z, Z^i)$ is the total variation (TV) distance between $Z$ and $Z^i$. This is accomplished by using
$$\hat{q} = \inf\Big\{ q : \sum_{i=1}^n \tilde{w}_i\, \mathbf{1}\{s_i \le q\} \ge 1 - \alpha \Big\} \tag{4}$$
to construct prediction sets the same way as in Eq. (2). See Barber et al. (2023, §4) for a proof. It is worth noting that this method recovers standard conformal prediction when $w_i = 1$ for all $i$. Besides, if the data is exchangeable, then the distribution of $Z$ is equal to the distribution of $Z^i$, and thus using a weighted procedure does not hurt coverage according to Eq. (3), since $d_{\mathrm{TV}}(Z, Z^i) = 0$ for all $i$.

Intuitively, the closer to exchangeable the data is, the smaller the last term in Eq. (3) will be. By choosing the weights $w_i$ wisely, e.g., by setting large weights for calibration points $(x_i, y_i)$ such that $Z$ and $Z^i$ are similarly distributed and smaller weights otherwise, tighter bounds can be obtained. For example, in time series data we may want to place larger weights on more recent observations.

2.3 CONFORMAL RISK CONTROL

Let us now consider an additional parameter $\lambda$ and construct prediction sets of the form $\mathcal{C}_\lambda(\cdot)$, where larger $\lambda$ yields larger prediction sets, i.e., $\lambda \le \lambda' \Rightarrow \mathcal{C}_\lambda(\cdot) \subseteq \mathcal{C}_{\lambda'}(\cdot)$ (see Angelopoulos & Bates (2021, §4.3) for an example). Let $\ell$ be an arbitrary (bounded) loss function that shrinks as $\mathcal{C}(X_{n+1})$ grows (i.e., that is monotonically nonincreasing with respect to $\lambda$). We switch from conformal methods that provide prediction sets bounding the miscoverage, $P\big(Y_{n+1} \notin \mathcal{C}(X_{n+1})\big) \le \alpha$, to conformal risk control (Angelopoulos et al., 2023a), which provides guarantees of the form
$$\mathbb{E}\big[\ell(\mathcal{C}(X_{n+1}), Y_{n+1})\big] \le \alpha. \tag{5}$$
This is accomplished as follows. Let $L_i(\lambda) = \ell(\mathcal{C}_\lambda(X_i), Y_i)$, $i = 1, \ldots, n+1$, with $L_i : \Lambda \to (-\infty, B]$ and $\lambda_{\max} := \sup \Lambda$, be an exchangeable collection of nonincreasing functions of $\lambda$. Choosing an optimal $\hat{\lambda}$ as
$$\hat{\lambda} = \inf\Big\{ \lambda : \frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha \Big\}, \qquad \hat{R}_n(\lambda) = \frac{1}{n}\sum_{i=1}^n L_i(\lambda), \tag{6}$$
yields the guarantee in Eq. (5); see Angelopoulos et al. (2023a, §2) for a proof. When $\ell(\mathcal{C}(X_{n+1}), Y_{n+1}) = \mathbf{1}\{Y_{n+1} \notin \mathcal{C}(X_{n+1})\}$ is the miscoverage loss, we recover standard conformal prediction (§2.1). Note that, as required, this loss is nonincreasing. Other nonincreasing losses include the false negative rate, the λ-insensitive absolute error, and the best token-level F1-loss, all of which are used in our experiments in §4.

A limitation of the construction presented in this section is that it relies on the assumption of data exchangeability, which might be violated in practical settings. Our work circumvents this requirement, as we show next.
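Before moving on, the sketch below gives one possible, illustrative rendering of the two ingredients just described: the weighted quantile of Eq. (4) and the CRC threshold search of Eq. (6). It assumes losses have been precomputed on an increasing grid of λ values; all names are illustrative.

```python
# Illustrative sketches of Eq. (4) (weighted conformal quantile) and Eq. (6) (CRC).
import numpy as np

def weighted_conformal_quantile(cal_scores, weights, alpha):
    """q_hat = inf{q : sum_i w_tilde_i * 1[s_i <= q] >= 1 - alpha} (Eq. 4)."""
    w_tilde = np.asarray(weights, float) / (1.0 + np.sum(weights))  # test point keeps the rest
    order = np.argsort(cal_scores)
    cum = np.cumsum(w_tilde[order])
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    if idx >= len(cal_scores):       # not enough calibration mass:
        return np.inf                # the prediction set is the whole output space
    return np.asarray(cal_scores)[order][idx]

def crc_lambda_hat(loss_matrix, lambdas, alpha, B=1.0):
    """Eq. (6): smallest lambda with n/(n+1) * R_hat(lambda) + B/(n+1) <= alpha.
    loss_matrix[i, j] = L_i(lambdas[j]) must be nonincreasing in j."""
    n = loss_matrix.shape[0]
    r_hat = loss_matrix.mean(axis=0)
    feasible = (n / (n + 1)) * r_hat + B / (n + 1) <= alpha
    return lambdas[np.argmax(feasible)] if feasible.any() else lambdas[-1]
```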
3 NON-EXCHANGEABLE CONFORMAL RISK CONTROL

Up to this point, we have described how to construct prediction sets/intervals with coverage guarantees for non-exchangeable data (§2.2) and how to control the expected value of arbitrary monotone loss functions when the data is exchangeable (§2.3). Using the same notation as before, we now present our method, non-exchangeable conformal risk control, which puts together these parallel lines of research, providing guarantees of the form
$$\mathbb{E}\big[L(\hat{\lambda}; (X_{n+1}, Y_{n+1}))\big] \le \alpha + (B - A)\sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i), \tag{7}$$
where we additionally assume the loss to be bounded below, i.e., $L_i : \Lambda \to [A, B]$ with $-\infty < A \le B < \infty$. Let us define $N_w := \sum_{i=1}^n w_i$. Eq. (7) is obtained by choosing an optimal $\hat{\lambda}$ as
$$\hat{\lambda} = \inf\Big\{ \lambda : \frac{N_w}{N_w + 1}\hat{R}_n(\lambda) + \frac{B}{N_w + 1} \le \alpha \Big\}, \qquad \hat{R}_n(\lambda) = \frac{1}{N_w}\sum_{i=1}^n w_i\, L(\lambda; (x_i, y_i)). \tag{8}$$
We can see how Eq. (7) simultaneously mirrors both Eq. (3) and Eq. (5): for an optimal choice of $\lambda$, the expected risk for a new test point is bounded by $\alpha$ plus an extra loosening term that depends on the normalized weights $\{\tilde{w}_i\}_{i=1}^n$ and on the total variation distance between $Z$ and $Z^i$. When the data is in fact exchangeable, we have again $d_{\mathrm{TV}}(Z, Z^i) = 0$ for all $i$, and we recover Eq. (5), i.e., our method achieves the same guarantees as standard conformal risk control.

Although our theoretical bound in Eq. (7) holds for any choice of weights, this result is only useful when the loosening term is small, i.e., if we choose small weights $w_i$ for data points $(X_i, Y_i)$ with large total variation distance $d_{\mathrm{TV}}(Z, Z^i)$. While the true value of this term is typically unknown, in some situations, such as distribution drift in time series, we expect it to decrease with $i$, motivating the choice of weights that increase with $i$. The same principle can be applied in other domains (e.g., for spatial data, one may place higher weights on points close in space to the test point). We come back to this point in §3.2.

The result in Eq. (7) is valid when the weights are fixed, i.e., data-independent. However, our result still applies in the case of data-dependent weights $w_i = w(X_i, X_{n+1})$ if we replace $\sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i)$ by $\mathbb{E}\big[\sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i \mid w_1, \ldots, w_n)\big]$ (see Barber et al. (2023, §4.5) for more information). We experiment with this approach in §4.3, where $w_i$ is a function of the embedding similarity between $X_i$ and $X_{n+1}$, showing that the new bound is still useful in practice.

3.1 FORMAL GUARANTEES

Now that we have presented an overview of our method, we proceed to providing a formal proof of the guarantee in Eq. (7). We begin with a lemma, proved in App. A, that establishes a TV bound extending the one introduced by Barber et al. (2023):

Lemma 1. Let $f : S \to [A, B] \subseteq \mathbb{R}$ be a bounded function on a measurable space $(S, \mathcal{A})$ (where $\mathcal{A} \subseteq 2^S$ is a σ-algebra) and let $P$ and $Q$ be two probability measures on $(S, \mathcal{A})$. Then
$$\big|\mathbb{E}_P[f] - \mathbb{E}_Q[f]\big| \le (B - A)\, d_{\mathrm{TV}}(P, Q). \tag{9}$$

Note that when $f(t) = \mathbf{1}\{t \in V\}$ for some event $V \in \mathcal{A}$, the left-hand side becomes $|P(V) - Q(V)|$ and we recover the bound used in the proof of Barber et al. (2023, §6.2).

We now state the main result. The proof technique is similar to that of Barber et al. (2023), but instead of modeling the event of a variable belonging to a "strange" set, we model expectations of loss functions that depend on a calibration variable. See App. B for the full proof.

Theorem 1 (Non-exchangeable conformal risk control). Assume that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ the loss $L(\lambda; (x, y))$ is nonincreasing in $\lambda$ and bounded as $A \le L(\lambda; (x, y)) \le B$ for any $\lambda$.
Let $Z := \big((X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, Y_{n+1})\big)$ be a sequence of $n$ calibration examples followed by a test example, and let $w_1, \ldots, w_n \in [0, 1]$ be data-independent weights. Define $N_w = \sum_{i=1}^n w_i$, $\tilde{w}_i = w_i/(N_w + 1)$ for $i \in [n]$, and $\tilde{w}_{n+1} = 1/(N_w + 1)$. Let $\alpha \in [A, B]$ be the maximum tolerable risk, and define
$$\hat{\lambda} = \inf\Big\{ \lambda : \frac{N_w}{N_w + 1}\hat{R}_n(\lambda) + \frac{B}{N_w + 1} \le \alpha \Big\}, \tag{10}$$
where $\hat{R}_n(\lambda)$ is the weighted empirical risk on the calibration set:
$$\hat{R}_n(\lambda) = \frac{1}{N_w}\sum_{i=1}^n w_i\, L(\lambda; (x_i, y_i)). \tag{11}$$
Then, we have
$$\mathbb{E}\big[L(\hat{\lambda}; (X_{n+1}, Y_{n+1}))\big] \le \alpha + (B - A)\sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i), \tag{12}$$
where $Z^i$ is obtained from $Z$ by swapping $(X_i, Y_i)$ and $(X_{n+1}, Y_{n+1})$.

The next section illustrates how we can make practical use of this result to minimize loss functions beyond the miscoverage loss in the presence of non-exchangeable data distributions.

3.2 HOW TO CHOOSE WEIGHTS

To make practical use of Theorem 1, we need a procedure to choose the weights $w_i$. We next suggest a strategy based on regularized minimization of the coverage gap $g(\tilde{w}_1, \ldots, \tilde{w}_n) := (B - A)\sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i)$ via the maximum entropy principle (Jaynes, 1957). Note first that simply minimizing this gap would lead to $\tilde{w}_i = 0$ for all $i \in [n]$ and $\tilde{w}_{n+1} = 1$, which ignores all the calibration data and leads to an infeasible $\hat{\lambda}$ in Eq. (10). In general, if all weights $w_i$ are too small, this leads to a very large $\tilde{w}_{n+1}$ and an unreasonably large $\hat{\lambda}$. At the other extreme, having all weights too large (e.g., $w_i = 1$ for all $i$, which leads to $\tilde{w}_i = 1/(n+1)$ for $i \in [n+1]$) ignores the non-exchangeability of the data and may lead to a large coverage gap. Therefore, it is necessary to find a good balance between ensuring a small coverage gap and, at the same time, ensuring that the distribution $\tilde{w}_1, \ldots, \tilde{w}_{n+1}$ is not too peaked, i.e., that it has sufficiently high entropy. Since, by definition, we must have $\tilde{w}_{n+1} \ge \tilde{w}_i$ for all $i \in [n]$, this can be formalized as the following regularized minimization problem:
$$\min_{\tilde{w}_1, \ldots, \tilde{w}_{n+1}} \; (B - A)\sum_{i=1}^n \tilde{w}_i\, d_{\mathrm{TV}}(Z, Z^i) - \beta H(\tilde{w}_1, \ldots, \tilde{w}_{n+1}) \quad \text{s.t.} \quad \sum_{i=1}^{n+1} \tilde{w}_i = 1 \;\text{ and }\; 0 \le \tilde{w}_i \le \tilde{w}_{n+1} \text{ for all } i \in [n], \tag{13}$$
where $H(\tilde{w}_1, \ldots, \tilde{w}_{n+1}) = -\sum_{i=1}^{n+1} \tilde{w}_i \log \tilde{w}_i$ is the entropy function and $\beta > 0$ is a temperature parameter. The solution of this problem is $\tilde{w}_i \propto \exp\big(-\beta (B - A)\, d_{\mathrm{TV}}(Z, Z^i)\big)$ for $i \in [n+1]$.

Although in general $d_{\mathrm{TV}}(Z, Z^i)$ is not known, it is possible in some scenarios to bound or to estimate this quantity: for example, when variables are independent but not identically distributed, it can be shown that $d_{\mathrm{TV}}(Z, Z^i) \le 2\, d_{\mathrm{TV}}(Z_i, Z_{n+1})$, where $Z_i = (X_i, Y_i)$ denotes a single data point (Barber et al., 2023, Lemma 1); and it is possible to upper bound the total variation distance as a function of the (more tractable and amenable to estimation) Kullback-Leibler divergence, e.g., via Pinsker's or Bretagnolle-Huber's inequalities (Bretagnolle & Huber, 1979; Csiszár & Körner, 2011), which may provide good heuristics. For example, in a time series under a distribution drift scenario bounded with a Lipschitz-type condition $d_{\mathrm{TV}}(Z_i, Z_{n+1}) \le \epsilon\,(n + 1 - i)$ for some $\epsilon > 0$ (see, e.g., Barber et al., 2023, §4.4), we could replace $d_{\mathrm{TV}}(Z, Z^i)$ in Eq. (13) by this upper bound to obtain the maxent solution $\tilde{w}_i \propto \exp\big(-\beta\epsilon(n + 1 - i)\big) = \rho^{\,n+1-i}$, where $\rho = \exp(-\beta\epsilon) \in (0, 1)$. This exponential decay of the weights was suggested by Barber et al. (2023); our maximum entropy heuristic provides further justification for that choice. We use this strategy in some of our experiments in §4.
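Putting the pieces together, a minimal, illustrative sketch of the resulting procedure is given below, assuming a precomputed loss matrix on a finite λ grid and using the exponentially decaying weights $w_i = \rho^{\,n+1-i}$ motivated above; the function name and the toy data are illustrative, not taken from the released implementation.

```python
# Sketch of non-exchangeable CRC (Eqs. 10-11) with exponentially decaying weights.
import numpy as np

def nonx_crc_lambda_hat(loss_matrix, lambdas, weights, alpha, B=1.0):
    """loss_matrix[i, j] = L(lambdas[j]; (x_i, y_i)), nonincreasing in j;
    weights[i] = w_i in [0, 1], with the oldest calibration point listed first."""
    w = np.asarray(weights, float)
    n_w = w.sum()
    r_hat = (w @ loss_matrix) / n_w                                 # Eq. (11)
    feasible = (n_w / (n_w + 1)) * r_hat + B / (n_w + 1) <= alpha   # Eq. (10)
    return lambdas[np.argmax(feasible)] if feasible.any() else lambdas[-1]

# toy usage with rho = 0.99, the value used in the experiments of Section 4
rng = np.random.default_rng(0)
n, lambdas = 500, np.linspace(0, 1, 101)
losses = np.clip(rng.uniform(size=(n, 1)) - lambdas[None, :], 0.0, 1.0)  # nonincreasing in lambda
weights = 0.99 ** (n - np.arange(n))                     # w_i = 0.99**(n+1-i), i = 1, ..., n
print(nonx_crc_lambda_hat(losses, lambdas, weights, alpha=0.2))
```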
4 EXPERIMENTS

In this section, we demonstrate the validity of our theoretical results on three different tasks using different nonincreasing losses: a multilabel classification problem using synthetic time series data, minimizing the false negative rate (§4.1); a problem involving monitoring electricity usage, minimizing the λ-insensitive absolute loss (§4.2); and an open-domain question answering (QA) task, where we control the best token-level F1-score (§4.3). Throughout, we report our method alongside a conformal risk control (CRC) baseline that predicts $\hat{\lambda}$ following Eq. (6).

4.1 MULTILABEL CLASSIFICATION IN A TIME SERIES

We start by validating our approach on synthetic data, before moving to real-world data in the following subsections. To this end, we modified the synthetic regression experiment of Barber et al. (2023, §5.1) to turn it into a multilabel classification problem with up to M = 10 different labels. We consider three different setups:

1. Exchangeable (i.i.d.) data: We sample N = 2000 i.i.d. data points $(X_i, Y_i) \in \mathbb{R}^M \times \mathbb{R}^M$. We sample $X_i$ from a Gaussian distribution, $X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I_M)$, and we set $Y_i = \mathrm{sign}\big(W X_i + b + 0.1\,\mathcal{N}(0, I_M)\big)$. The coefficient matrix $W$ is set to the identity matrix $I_M$ and the biases to $b = -0.5$, to encourage a sparse set of labels.

2. Changepoints: We follow setting (1) and sample N = 2000 i.i.d. data points $(X_i, Y_i)$, setting $X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I_M)$ and $Y_i = \mathrm{sign}\big(W^{(k)} X_i + b + 0.1\,\mathcal{N}(0, I_M)\big)$, again with $b = -0.5$. We start with the same coefficients $W^{(0)} = I_M$, and for every changepoint $k > 0$ we rotate the coefficients such that $W^{(k)}_{i,j} = W^{(k-1)}_{i-1,j}$ for $i > 1$ and $W^{(k)}_{1,j} = W^{(k-1)}_{M,j}$. Following Barber et al. (2023), we use two changepoints (k = 2), at timesteps 500 and 1500.

3. Distribution drift: We follow setting (2) and sample N = 2000 i.i.d. data points $(X_i, Y_i)$, with $X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I_M)$ and $Y_i = \mathrm{sign}\big(W^{(k)} X_i + b + 0.1\,\mathcal{N}(0, I_M)\big)$, with $b$ as above. Again, we start with $W^{(0)} = I_M$, but now we set $W^{(N)}$ to the last matrix of setting (2). We then compute each intermediate $W^{(k)}$ by linearly interpolating between $W^{(0)}$ and $W^{(N)}$.

After a warmup period of 200 time points, at each time step $n = 200, \ldots, N - 1$ we assign odd indices to the training set, even indices to the calibration set, and we let $X_{n+1}$ be the test point. We fit M independent logistic regression models to the training data to obtain predictors for each label; we let $f_m(X_i)$ denote the estimated probability of the $m$th label according to the model. Based on this predictor, we define prediction sets $\mathcal{C}_\lambda(X_i) := \{ m \in [M] : f_m(X_i) \ge 1 - \lambda \}$. We compare standard CRC with non-exchangeable (non-X) CRC, for which we use weights $w_i = 0.99^{\,n+1-i}$ and predict $\hat{\lambda}$ following Eq. (10). In both cases, we minimize the false negative rate (FNR):
$$L(\lambda; (X_i, Y_i)) = 1 - \frac{|Y_i \cap \mathcal{C}_\lambda(X_i)|}{|Y_i|}, \tag{14}$$
where, with some abuse of notation, we use $Y_i \subseteq \{1, \ldots, M\}$ to denote the set of gold labels with value +1. Note that this loss is nonincreasing in λ, as required. App. C contains additional experiments considering λ to be the number of active labels and using $\mathcal{C}_\lambda(X_i) = \text{top-}\lambda(f(X_i))$.
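The sketch below illustrates this setup, assuming per-label probabilities $f_m(X_i)$ and binary gold labels are available; it builds the FNR loss matrix of Eq. (14) on a λ grid, which can then be passed to the non-X CRC routine sketched at the end of §3.2. All names and toy data are illustrative.

```python
# Illustrative sketch: prediction sets C_lambda(X) = {m : f_m(X) >= 1 - lambda}
# and the FNR loss of Eq. (14), computed on a grid of lambda values.
import numpy as np

def fnr_loss_matrix(probs, gold, lambdas):
    """probs[i, m] = f_m(X_i); gold[i, m] = 1 if label m is active for example i
    (every example is assumed to have at least one active label).
    Returns L[i, j] = 1 - |Y_i ∩ C_{lambdas[j]}(X_i)| / |Y_i|."""
    in_set = probs[:, :, None] >= 1.0 - lambdas[None, None, :]   # (examples, labels, lambdas)
    hits = (in_set & (gold[:, :, None] > 0)).sum(axis=1)
    return 1.0 - hits / gold.sum(axis=1, keepdims=True)

# toy usage: 200 calibration examples, M = 10 labels, 101 lambda values
rng = np.random.default_rng(0)
probs = rng.uniform(size=(200, 10))
gold = (rng.uniform(size=(200, 10)) < 0.3).astype(int)
gold[gold.sum(axis=1) == 0, 0] = 1                 # avoid empty gold label sets
L = fnr_loss_matrix(probs, gold, np.linspace(0, 1, 101))
print(L.shape, L[:, 0].mean(), L[:, -1].mean())    # loss decreases as lambda grows
```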
Fig. 1 shows results averaged across 10 independent trials for α = 0.2, summarized in Table 2. We see that the performance of both methods is comparable when the data is i.i.d., with non-X CRC being slightly more conservative. However, when the data is not exchangeable due to the presence of changepoints or distribution drift, our proposed method is considerably better. In particular, after the changepoints in setting (2), non-X CRC is able to achieve the desired risk level more rapidly; in setting (3), the performance of standard CRC gradually drops over time, a problem that can be mitigated by accounting for the non-exchangeability introduced by the distribution drift. Importantly, while the average risk is above the predefined threshold for standard CRC in settings (2) and (3) (0.246 and 0.225, respectively), our method achieves the desired risk level on average (0.196 and 0.182, respectively).

Figure 1: Average loss (top) and $\hat{\lambda}$ (bottom) over 10 independent trials for settings (1), (2), and (3). We smooth all the curves by taking a rolling average with a window of 30 time points.

Table 2: Scalar statistics (mean/median) for settings (1), (2), and (3) of the multilabel classification problem using synthetic time series data reported in §4.1.

Method | Setting 1 (i.i.d. data) | Setting 2 (changepoints) | Setting 3 (distribution drift)
CRC | 0.191 / 0.183 | 0.246 / 0.228 | 0.225 / 0.218
non-X CRC | 0.181 / 0.175 | 0.196 / 0.183 | 0.182 / 0.175

4.2 MONITORING ELECTRICITY USAGE

We use the ELEC2 dataset (Harries, 1999), which tracks electricity transfer between two states in Australia, considering the subset of the data used by Barber et al. (2023), which contains 3444 time points. The data points correspond to the 09:00am - 12:00pm timeframe, and we use the price (nswprice, vicprice) and demand (nswdemand, vicdemand) variables as input features $x_i$ to predict the target transfer values $y_i$. We also consider a randomly permuted version of the dataset such that the exchangeability assumption is satisfied.

We use the same definitions and settings of §4.1, but this time we fit a least-squares regression model to predict the transfer values, $\hat{y}_i = f(x_i)$, at each time step. For non-X CRC, we use weights $w_i = 0.99^{\,n+1-i}$, and we also experiment with weighted least-squares regression, placing weights $t_i = w_i$ on each data point (non-X CRC + WLS). For both standard and non-X CRC we control the residual (distance) with respect to the confidence interval $\mathcal{C}_\lambda(x_i) = [f(x_i) - \lambda, f(x_i) + \lambda]$, where $f(x_i)$ corresponds to the predicted value of transfer. We use the λ-insensitive absolute loss, a loss function commonly used in support vector regression (Schölkopf et al., 1998; Vapnik, 1999):
$$L(\lambda; (x_i, y_i)) = \begin{cases} 0, & \text{if } |f(x_i) - y_i| \le \lambda, \\ |f(x_i) - y_i| - \lambda, & \text{otherwise.} \end{cases} \tag{15}$$
We experiment using $\lambda \in [0, 1]$ with a step of 0.01. Since we are using the normalized ELEC2 dataset, transfer takes values in $[0, 1]$, and thus $L(\lambda; (x_i, y_i))$ is bounded by $B = 1$. By definition, $L(\lambda; (x_i, y_i))$ is nonincreasing with respect to λ.
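The sketch below shows this loss evaluated on the λ grid used here (illustrative; it assumes predictions and targets are already normalized to [0, 1], so that B = 1).

```python
# Illustrative sketch of the lambda-insensitive absolute loss of Eq. (15),
# for intervals C_lambda(x) = [f(x) - lambda, f(x) + lambda].
import numpy as np

def insensitive_loss_matrix(y_pred, y_true, lambdas):
    """L[i, j] = max(|f(x_i) - y_i| - lambdas[j], 0): nonincreasing in lambda
    and bounded by B = 1 when targets and predictions live in [0, 1]."""
    residuals = np.abs(np.asarray(y_pred) - np.asarray(y_true))
    return np.maximum(residuals[:, None] - np.asarray(lambdas)[None, :], 0.0)

# toy usage with the grid of Section 4.2 (step 0.01 on [0, 1])
rng = np.random.default_rng(0)
y_true = rng.uniform(size=300)
y_pred = np.clip(y_true + 0.05 * rng.standard_normal(300), 0.0, 1.0)
L = insensitive_loss_matrix(y_pred, y_true, np.arange(0.0, 1.01, 0.01))
print(L.max() <= 1.0, (np.diff(L, axis=1) <= 1e-12).all())   # bounded, nonincreasing
```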
Fig. 2 shows results for the aforementioned setup. We can observe that in the original setting, both non-exchangeable methods approximate the desired loss threshold well, even during the timesteps at which the data suffers from distribution drift. Specifically, as observed by Barber et al. (2023), the electricity transfer values are noisier during the middle of the time range, and we can see that the standard CRC + LS method underestimates $\hat{\lambda}$ for these data points, resulting in a loss above the desired one. With respect to the non-X CRC + WLS setup, we can see that it manages to reach the desired loss with a smaller interval width on average, indicating that fitting the weighted least-squares model performs better when the data distribution changes, allowing for a smaller λ during calibration. For the permuted data that simulates the exchangeable data scenario, we can see that all methods perform similarly, reaching the desired loss, as expected.

Figure 2: Results on ELEC2 data for α = 0.05 and λ defined by the prediction interval width. Presented curves are smoothed by taking a rolling average with a window of 300 data points per timestep.

4.3 OPEN-DOMAIN QUESTION ANSWERING

We now shift to open-domain QA, a task that consists in answering factoid questions using a large collection of documents. This is done in two stages, following Angelopoulos et al. (2023a): (i) a retriever model (Karpukhin et al., 2020, DPR) selects passages from Wikipedia that might contain the answer to the question, and (ii) a reader model examines the retrieved contexts and extracts text sub-spans that serve as candidate answers.³ Given a vocabulary $\mathcal{V}$, each $X_i \in \mathcal{Z}$ is a question and $Y_i \in \mathcal{Z}^k$ a set of $k$ correct answers, where $\mathcal{Z} := \mathcal{V}^m$ (we assume that $X_i$ and $Y_i$ are sequences composed of up to $m$ tokens). We calibrate the best token-based F1-score of the prediction set,⁴ taken over all pairs of predictions and answers,
$$L(\lambda; (X_i, Y_i)) = 1 - \max\{\mathrm{F1}(a, c) : c \in \mathcal{C}_\lambda(X_i),\, a \in Y_i\}, \qquad \mathcal{C}_\lambda(X_i) = \{ y : f(X_i, y) \ge 1 - \lambda \}, \tag{16}$$
which is nonincreasing and upper-bounded by B = 1. We consider a CRC baseline that predicts $\hat{\lambda}$ following Eq. (6). For non-X CRC, we choose weights $\{w_i\}_{i=1}^n$ by computing the dot product between the embedding representations of $\{X_i\}_{i=1}^n$ and $X_{n+1}$, obtained using a sentence-transformer model (Reimers & Gurevych, 2019) designed for semantic search,⁵ and predict $\hat{\lambda}$ following Eq. (10). While in standard CRC $\hat{\lambda}$ is the same for each test example, this is not the case for non-X CRC.

³Enumerating all possible answers is intractable, and thus we retrieve the top several hundred candidate answers, extracted from the top 100 passages (which is sufficient to control all risks).
⁴This is the same loss used by Angelopoulos et al. (2023a).
⁵We use the multi-qa-mpnet-base-dot-v1 model available at https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1.

While Theorem 1 requires the weights to be independent of the test example, we relax this assumption by setting higher weights for questions in a neighborhood of $X_{n+1}$ (see §3). Intuitively, we could think of a situation where the questions are posed by multiple users, each of whom may have a tendency to ask semantically similar questions or questions from the same domain. In this case, we could choose a priori higher weights for closer domains/users without violating this assumption.
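The sketch below shows one simple way to turn such embedding similarities into weights in [0, 1]; the model name is the one in footnote 5, but the final min-max rescaling of the dot products is only one illustrative choice, not a prescribed step.

```python
# Illustrative sketch of data-dependent weights for the QA setup: embed the
# questions, take dot products with the test question, and rescale to [0, 1].
# The min-max normalization below is an assumption of this sketch.
import numpy as np
from sentence_transformers import SentenceTransformer

def similarity_weights(cal_questions, test_question,
                       model_name="sentence-transformers/multi-qa-mpnet-base-dot-v1"):
    model = SentenceTransformer(model_name)
    cal_emb = model.encode(cal_questions)          # shape (n, d)
    test_emb = model.encode([test_question])[0]    # shape (d,)
    sims = cal_emb @ test_emb                      # dot-product similarities
    lo, hi = sims.min(), sims.max()
    return (sims - lo) / (hi - lo + 1e-12)         # weights w_i in [0, 1]

# usage: weights = similarity_weights(cal_questions, test_question), then call
# nonx_crc_lambda_hat(loss_matrix, lambdas, weights, alpha=0.3) per test question.
```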
We use the Natural Questions dataset (Kwiatkowski et al., 2019; Karpukhin et al., 2020), considering n = 2500 points for calibration and 1110 for evaluation. Following Angelopoulos et al. (2023a), we use α = 0.3 and report results over 1000 trials in Fig. 3. While the test risk is similar in both cases (0.30 ± 0.015), the prediction sets of our method are considerably smaller than those of standard CRC (23.0 ± 1.47 vs. 24.6 ± 1.83, respectively). By choosing appropriate weights, we can better estimate the set size needed to obtain the desired risk level, while standard CRC tends to overestimate the set size needed to reach the same value. We thus obtain better estimates of confidence over the predictions.

Figure 3: F1-score control on the Natural Questions dataset. Average set size (left) and risk (right) over 1000 independent random data splits.

5 RELATED WORK

Conformal prediction (Gammerman et al., 1998; Vovk et al., 1999; Saunders et al., 1999) has proven to be a useful tool for obtaining uncertainty sets/intervals for the predictions of machine learning models, having found a variety of extensions and applications over the years. Among these are split conformal prediction (Papadopoulos et al., 2002), which does not require retraining the predictor and instead uses a held-out dataset, and cross-conformal prediction (Vovk, 2015), which is a hybrid between split conformal prediction and cross-validation. Some of these methods have recently been applied in tasks such as language modeling (Schuster et al., 2022), molecular design (Fannjiang et al., 2022), pose estimation (Yang & Pavone, 2023), and image denoising (Teneggi et al., 2023).

In addition to the works discussed in §2, several extensions to non-exchangeable data have been proposed for time series (Chernozhukov et al., 2018; 2021b; Xu & Xie, 2021; Stankeviciute et al., 2021; Lin et al., 2022; Zaffran et al., 2022; Sun & Yu, 2022; Schlembach et al., 2022; Angelopoulos et al., 2023b), covariate shift (Tibshirani et al., 2019), label shift (Podkopaev & Ramdas, 2021), and other settings (Cauchois et al., 2020; Gibbs & Candes, 2021; Chernozhukov et al., 2021a; Gibbs & Candès, 2022; Oliveira et al., 2022; Guan, 2022). Moreover, there is recent work aiming at controlling arbitrary risks in an online setting (Feldman et al., 2022). The ideas, assumptions, or formal guarantees in these works are different from ours; we refer the reader to the specific papers for further information.

Angelopoulos et al. (2023a) touch on the case of conformal risk control under covariate shift (their Proposition 3, without providing any empirical validation), explaining how to generalize the work of Tibshirani et al. (2019) to any monotone risk under the strong assumptions that the distribution of $Y \mid X$ is the same for both the training and test data and that the likelihood ratio between $X_{\text{test}}$ and $X_{\text{train}}$ is known or can be accurately estimated using a large set of test data. This result is orthogonal to ours. Besides, they quantify how unweighted conformal risk control degrades when there is an arbitrary distribution shift. Our work is more general and differs in several significant ways: we allow for an arbitrary design of weights, the bounds can be tighter, and the losses are bounded in $[A, B]$, not necessarily in $[0, B]$. Specifically, their Proposition 4 is a particular case of our main result (choosing A = 0 and unitary weights), which we use as a baseline in our experiments.

6 CONCLUSIONS

We have proposed a new method for conformal risk control which is still valid when the data is not exchangeable (e.g., due to an arbitrary distribution shift) and provides a tighter bound on the expected loss than that of previous work. Our simulated experiments illustrate how non-exchangeable conformal risk control effectively provides prediction sets satisfying the risk requirements in the presence of non-exchangeable data (in particular, in the presence of change points and distribution drift), without sacrificing performance if the data is in fact exchangeable. Additional experiments with real data validate the usefulness of our approach.
Our work opens up exciting possibilities for research on risk control in challenging settings. For instance, it is an attractive framework for providing guarantees on the predictions of large language models, being of particular interest in tasks involving language generation, medical data (Jalali et al., 2020), or reinforcement learning (Wang et al., 2023), where the i.i.d. assumption does not hold.

ACKNOWLEDGMENTS

We would like to thank Mário Figueiredo, the SARDINE lab team, and the anonymous reviewers for helpful discussions. This work was built on open-source software; we acknowledge Van Rossum & Drake (2009); Oliphant (2006); Virtanen et al. (2020); Walt et al. (2011); Pedregosa et al. (2011); and Paszke et al. (2019). This work was supported by the EU's Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by the project DECOLLAGE (ERC-2022-CoG 101088763), by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI), and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020.

REFERENCES

Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control, 2023a.

Anastasios N. Angelopoulos, Emmanuel J. Candes, and Ryan J. Tibshirani. Conformal PID control for time series prediction, 2023b.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023. doi: 10.1214/23-AOS2276. URL https://doi.org/10.1214/23-AOS2276.

Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael Jordan. Distribution-free, risk-controlling prediction sets. J. ACM, 68(6), sep 2021. ISSN 0004-5411. doi: 10.1145/3478535. URL https://doi.org/10.1145/3478535.

Jean Bretagnolle and Catherine Huber. Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47:119–137, 1979.

Maxime Cauchois, Suyash Gupta, Alnur Ali, and John C. Duchi. Robust validation: Confident predictions even when distributions shift. arXiv preprint arXiv:2008.04267, 2020.

Victor Chernozhukov, Kaspar Wüthrich, and Zhu Yinchu. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 732–749. PMLR, 06–09 Jul 2018. URL https://proceedings.mlr.press/v75/chernozhukov18a.html.

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118(48):e2107794118, 2021a. doi: 10.1073/pnas.2107794118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2107794118.

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 116(536):1849–1864, Jun 2021b. ISSN 1537-274X. doi: 10.1080/01621459.2021.1920957. URL http://dx.doi.org/10.1080/01621459.2021.1920957.

Imre Csiszár and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
Clara Fannjiang, Stephen Bates, Anastasios N. Angelopoulos, Jennifer Listgarten, and Michael I. Jordan. Conformal prediction under feedback covariate shift for biomolecular design. Proceedings of the National Academy of Sciences, 119(43):e2204569119, 2022. doi: 10.1073/pnas.2204569119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2204569119.

Shai Feldman, Liran Ringel, Stephen Bates, and Yaniv Romano. Achieving risk control in online learning settings, 2022.

Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In Gregory F. Cooper and Serafín Moral (eds.), UAI '98: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, University of Wisconsin Business School, Madison, Wisconsin, USA, July 24-26, 1998, pp. 148–155. Morgan Kaufmann, 1998.

Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 1660–1672. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/0d441de75945e5acbc865406fc9a2559-Paper.pdf.

Isaac Gibbs and Emmanuel Candès. Conformal inference for online prediction with arbitrary distribution shifts, 2022.

Leying Guan. Localized conformal prediction: a generalized inference framework for conformal prediction. Biometrika, 110(1):33–50, Jul 2022. ISSN 1464-3510. doi: 10.1093/biomet/asac040. URL http://dx.doi.org/10.1093/biomet/asac040.

Michael Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, University of New South Wales, 1999.

Ali Jalali, Hannah Lonsdale, Nhue Do, Jacquelin Peck, Monesha Gupta, Shelby Kutty, Sharon R. Ghazarian, Jeffrey P. Jacobs, Mohamed Rehman, and Luis M. Ahumada. Deep learning for improved risk prediction in surgical outcomes. Scientific Reports, 10(1):9289, 2020. doi: 10.1038/s41598-020-62971-3. URL https://doi.org/10.1038/s41598-020-62971-3.

Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.

Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Conformal prediction intervals with temporal dependence. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=8QoxXTDcsH.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997. ISSN 00018678. URL http://www.jstor.org/stable/1428011.
Travis E. Oliphant. A guide to NumPy, volume 1. Trelgol Publishing USA, 2006.

Roberto I. Oliveira, Paulo Orenstein, Thiago Ramos, and João Vitor Romano. Split conformal prediction for dependent data, 2022.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen (eds.), Machine Learning: ECML 2002, pp. 345–356, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-36755-0.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Aleksandr Podkopaev and Aaditya Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Cassio de Campos and Marloes H. Maathuis (eds.), Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pp. 844–853. PMLR, 27–30 Jul 2021. URL https://proceedings.mlr.press/v161/podkopaev21a.html.

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410.

Craig Saunders, Alexander Gammerman, and Volodya Vovk. Transduction with confidence and credibility. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI '99, pp. 722–726, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1558606130.

Filip Schlembach, Evgueni Smirnov, and Irena Koprinska. Conformal multistep-ahead multivariate time-series forecasting. In Ulf Johansson, Henrik Boström, Khuong An Nguyen, Zhiyuan Luo, and Lars Carlsson (eds.), Proceedings of the Eleventh Symposium on Conformal and Probabilistic Prediction with Applications, volume 179 of Proceedings of Machine Learning Research, pp. 316–318. PMLR, 24–26 Aug 2022. URL https://proceedings.mlr.press/v179/schlembach22a.html.

Bernhard Schölkopf, Peter Bartlett, Alex Smola, and Robert C. Williamson. Shrinking the tube: a new support vector regression algorithm. Advances in Neural Information Processing Systems, 11, 1998.

Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=uLYc4L3C81A.
Kamile Stankeviciute, Ahmed M. Alaa, and Mihaela van der Schaar. Conformal time-series forecasting. Advances in Neural Information Processing Systems, 34:6216–6228, 2021.

Sophia Sun and Rose Yu. Copula conformal prediction for multi-step time series forecasting, 2022.

Jacopo Teneggi, Matthew Tivnan, Web Stayman, and Jeremias Sulam. How to trust your diffusion model: A convex optimization approach to conformal risk control. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 33940–33960. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/teneggi23a.html.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/8fb21ee7a2207526da55a679f0332de2-Paper.pdf.

Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009. ISBN 1441412697.

Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 1999.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.

Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1):9–28, 2015. doi: 10.1007/s10472-013-9368-4. URL https://doi.org/10.1007/s10472-013-9368-4.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522.

Volodya Vovk, Alexander Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pp. 444–453, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1558606122.

Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

Jun Wang, Jiaming Tong, Kaiyuan Tan, Yevgeniy Vorobeychik, and Yiannis Kantaros. Conformal temporal logic planning using large language models: Knowing when to do what and when to ask for help, 2023.

Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11559–11569. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/xu21h.html.
Heng Yang and Marco Pavone. Object pose estimation with statistical guarantees: Conformal keypoint detection and geometric uncertainty propagation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2023. doi: 10.1109/cvpr52729.2023.00864. URL http://dx.doi.org/10.1109/CVPR52729.2023.00864.

Margaux Zaffran, Olivier Feron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 25834–25866. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/zaffran22a.html.

A PROOF OF LEMMA 1

The TV distance can be written as an integral probability metric (Müller, 1997):
$$d_{\mathrm{TV}}(P, Q) = \frac{1}{2}\sup_{g :\, \|g\|_\infty \le 1} \big(\mathbb{E}_P[g] - \mathbb{E}_Q[g]\big). \tag{17}$$
Now, we define $m = (A + B)/2$, $v = (B - A)/2$, and $\tilde{f} = (f - m)/v : S \to [-1, 1]$. Noticing that for any $c \in \mathbb{R}$ we have $\mathbb{E}_P[f] - \mathbb{E}_Q[f] = \mathbb{E}_P[f + c] - \mathbb{E}_Q[f + c]$, we can evaluate the difference in expectations as
$$\mathbb{E}_P[f] - \mathbb{E}_Q[f] = v\,\big(\mathbb{E}_P[\tilde{f}] - \mathbb{E}_Q[\tilde{f}]\big) \tag{18}$$
$$\le \frac{B - A}{2}\sup_{g :\, \|g\|_\infty \le 1}\big(\mathbb{E}_P[g] - \mathbb{E}_Q[g]\big) \tag{19}$$
$$= (B - A)\, d_{\mathrm{TV}}(P, Q). \tag{20}$$
Repeating the argument with $\tilde{f} = (m - f)/v$ (which also takes values in $[-1, 1]$) yields a similar upper bound for $\mathbb{E}_Q[f] - \mathbb{E}_P[f]$, from which the result for $|\mathbb{E}_P[f] - \mathbb{E}_Q[f]|$ follows.

B PROOF OF THEOREM 1

The proof adapts elements of the proofs from Barber et al. (2023) and Angelopoulos et al. (2023a). Let $Z^K$ be obtained from $Z$ by swapping $(X_K, Y_K)$ and $(X_{n+1}, Y_{n+1})$, where $K$ is a random variable with $P\{K = i\} = \tilde{w}_i$ (note that $Z^{n+1} = Z$). Let
$$\hat{R}_{n+1}(\lambda) = \frac{\sum_{i=1}^n w_i\, L(\lambda; (x_i, y_i)) + L(\lambda; (x_{n+1}, y_{n+1}))}{N_w + 1} = \frac{N_w \hat{R}_n(\lambda) + L(\lambda; (x_{n+1}, y_{n+1}))}{N_w + 1} \tag{21}$$
be the weighted empirical risk on the calibration set plus the additional test example. Let us define
$$\lambda^* = \inf\big\{ \lambda : \hat{R}_{n+1}(\lambda) \le \alpha \big\}. \tag{22}$$
Given the random variable $Z$, we can think of $\lambda^*(Z)$ as another random variable which is a transformation of $Z$. Moreover, we define the random variable $F_i(Z) = L(\lambda^*(Z); (X_i, Y_i))$ for $i \in [n+1]$, as well as the vector of random variables $F(Z) = [F_1(Z), \ldots, F_{n+1}(Z)]$. From Lemma 1, we have
$$\mathbb{E}[F_i(Z^i)] \le \mathbb{E}[F_i(Z)] + (B - A)\, d_{\mathrm{TV}}(F(Z), F(Z^i)), \tag{23}$$
a bound that we will use later. Writing $L_i(\lambda) \equiv L(\lambda; (X_i, Y_i))$ for convenience, we also have, for any $\lambda$ and for any $k \in [n+1]$,
$$\begin{aligned}
\hat{R}_{n+1}(\lambda; Z^k) &= \sum_{i=1, i \ne k}^{n} \tilde{w}_i L_i(\lambda) + \tilde{w}_k L_{n+1}(\lambda) + \tilde{w}_{n+1} L_k(\lambda) \\
&= \sum_{i=1, i \ne k}^{n} \tilde{w}_i L_i(\lambda) + \tilde{w}_k\big(L_k(\lambda) + \underbrace{L_{n+1}(\lambda)}_{\le B}\big) + \underbrace{(\tilde{w}_{n+1} - \tilde{w}_k)}_{\ge 0}\,\underbrace{L_k(\lambda)}_{\le B} \\
&\le \sum_{i=1, i \ne k}^{n} \tilde{w}_i L_i(\lambda) + \tilde{w}_k\big(L_k(\lambda) + B\big) + (\tilde{w}_{n+1} - \tilde{w}_k)\, B \\
&= \sum_{i=1}^{n} \tilde{w}_i L_i(\lambda) + \tilde{w}_{n+1} B \;=\; \frac{N_w}{N_w + 1}\hat{R}_n(\lambda; Z) + \frac{B}{N_w + 1}.
\end{aligned} \tag{24}$$
Therefore, setting $\lambda = \hat{\lambda}$ and using Eq. (10), we obtain $\hat{R}_{n+1}(\hat{\lambda}; Z^k) \le \frac{N_w}{N_w+1}\hat{R}_n(\hat{\lambda}; Z) + \frac{B}{N_w+1} \le \alpha$, which, from Eq. (22), implies $\lambda^*(Z^k) \le \hat{\lambda}(Z)$. Since the loss $L$ is nonincreasing in $\lambda$, we get
$$\begin{aligned}
\mathbb{E}[L_{n+1}(\hat{\lambda}(Z); Z)] &\le \mathbb{E}[L_{n+1}(\lambda^*(Z^K); Z)] = \mathbb{E}[L_K(\lambda^*(Z^K); Z^K)] \\
&= \sum_{i=1}^{n+1} \underbrace{P\{K = i\}}_{= \tilde{w}_i}\,\underbrace{\mathbb{E}[L_i(\lambda^*(Z^i); Z^i)]}_{= \mathbb{E}[F_i(Z^i)]} \\
&\le \sum_{i=1}^{n+1} \tilde{w}_i\Big(\underbrace{\mathbb{E}[L_i(\lambda^*(Z); Z)]}_{= \mathbb{E}[F_i(Z)]} + (B - A)\, d_{\mathrm{TV}}(F(Z), F(Z^i))\Big) \\
&= \mathbb{E}\Big[\sum_{i=1}^{n+1} \tilde{w}_i\, L_i(\lambda^*(Z); Z)\Big] + (B - A)\sum_{i=1}^{n} \tilde{w}_i\, d_{\mathrm{TV}}(F(Z), F(Z^i)) \\
&= \mathbb{E}\big[\hat{R}_{n+1}(\lambda^*(Z))\big] + (B - A)\sum_{i=1}^{n} \tilde{w}_i\, d_{\mathrm{TV}}(F(Z), F(Z^i)) \\
&\le \alpha + (B - A)\sum_{i=1}^{n} \tilde{w}_i\, d_{\mathrm{TV}}(F(Z), F(Z^i)).
\end{aligned} \tag{25}$$
The result follows by noting that $d_{\mathrm{TV}}(F(Z), F(Z^i)) \le d_{\mathrm{TV}}(Z, Z^i)$. Eq. (25) is actually a tighter bound, similarly to what has been noted by Barber et al. (2023).

C MULTILABEL CLASSIFICATION IN A TIME SERIES
Fig. 4 shows results averaged across 10 independent trials for α = 0.2, with λ defined in a slightly different way than in §4.1. In this case, λ represents the number of active labels and we use $\mathcal{C}_\lambda(X_i) = \text{top-}\lambda(f(X_i))$. The main takeaways remain the same: both methods perform similarly when the data is exchangeable, in setting (1). Accounting for the non-exchangeability introduced by changepoints and distribution drift using our method enables lowering the risk to the desired level in settings (2) and (3).

Figure 4: Average loss (top) and $\hat{\lambda}$ (bottom) over 10 independent trials for settings (1), (2), and (3). In this case, λ represents the number of predicted labels. We smooth the curves by taking a rolling average with a window of 30 time points.
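A small sketch of this alternative construction (illustrative, with names of our choosing) is shown below.

```python
# Illustrative sketch of top-lambda prediction sets, where lambda is an integer
# number of labels, and the corresponding FNR loss matrix.
import numpy as np

def fnr_loss_matrix_topk(probs, gold, lambdas):
    """L[i, j] = 1 - |Y_i ∩ top-lambdas[j](f(X_i))| / |Y_i|, nonincreasing in lambda.
    Assumes every example has at least one gold label."""
    n = probs.shape[0]
    L = np.empty((n, len(lambdas)))
    for i in range(n):
        ranked = np.argsort(probs[i])[::-1]        # labels sorted by decreasing score
        gold_i = set(np.flatnonzero(gold[i]))
        for j, lam in enumerate(lambdas):
            L[i, j] = 1.0 - len(set(ranked[:lam]) & gold_i) / len(gold_i)
    return L

# e.g., L = fnr_loss_matrix_topk(probs, gold, np.arange(0, 11)) for M = 10 labels
```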