# split_conformal_prediction_and_nonexchangeable_data__985da890.pdf

Journal of Machine Learning Research 25 (2024) 1-38 Submitted 11/23; Revised 7/24; Published 7/24

Split Conformal Prediction and Non-Exchangeable Data

Roberto I. Oliveira rimfo@impa.br IMPA, Rio de Janeiro, Brazil.

Paulo Orenstein pauloo@impa.br IMPA, Rio de Janeiro, Brazil.

Thiago Ramos thiagorr@impa.br IMPA, Rio de Janeiro, Brazil.

João Vitor Romano joao.vitor@impa.br IMPA, Rio de Janeiro, Brazil.

Editor: Chris Oates

Split conformal prediction (CP) is arguably the most popular CP method for uncertainty quantiﬁcation, enjoying both academic interest and widespread deployment. However, the original theoretical analysis of split CP makes the crucial assumption of data exchangeability, which hinders many real-world applications. In this paper, we present a novel theoretical framework based on concentration inequalities and decoupling properties of the data, proving that split CP remains valid for many non-exchangeable processes by adding a small coverage penalty. Through experiments with both real and synthetic data, we show that our theoretical results translate to good empirical performance under non-exchangeability, e.g., for time series and spatiotemporal data. Compared to recent conformal algorithms designed to counter speciﬁc exchangeability violations, we show that split CP is competitive in terms of coverage and interval size, with the beneﬁt of being extremely simple and orders of magnitude faster than alternatives.

Keywords: concentration inequalities, conformal prediction, non-exchangeable data, β-mixing, uncertainty quantiﬁcation

1. Introduction

Conformal prediction (CP), introduced by Vovk et al. (2005), is a set of techniques for quantifying uncertainty in the predictions of any model, under very general assumptions on the data-generating distribution. CP yields ﬁnite-sample coverage guarantees of many kinds, and has generated much recent interest (Shafer and Vovk, 2008; Lei et al., 2018; Romano et al., 2019; Angelopoulos and Bates, 2023; Cauchois et al., 2021). A concrete and popular formulation of CP is split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018). Consider a regression setting where the data is a random sample (Xi, Yi)n i=1 of covariate and response pairs (Xi, Yi) X Y. Split CP proceeds as follows: (a) partition the data indices in three parts: a training set Itrain, a calibration set Ical and a test set Itest, each with sizes ntrain, ncal and ntest; (b) train a nonconformity score bstrain : X Y R, for example the residual bstrain(x, y) = |y bµ(x)| of an arbitrary model bµ trained on (Xi, Yi)i Itrain; (c) compute the empirical (1 α)-quantile bq1 α of

2024 Roberto I. Oliveira, Paulo Orenstein, Thiago Ramos and João Vitor Romano.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1553.html.

Oliveira, Orenstein, Ramos and Romano

{bstrain(Xi, Yi)}i Ical; and (d) for each i Itest, deﬁne a conﬁdence set

C1 α(Xi) := {y Y : bstrain(Xi, y) bq1 α}.

We note in passing that split CP itself does not depend on test data; Itest is included for notational ease and theoretical convenience, as the guarantees of the method pertain to unseen data. If the data (Xi, Yi)n i=1 is exchangeable, then the usual theory of conformal prediction guarantees that the sets C1 α(Xi) have good marginal coverage over the test set; that is, for any i Itest, P[Yi C1 α(Xi)] 1 α η, (1)

where η = (1 α)/(ncal + 1). Equivalently, one can take the lower bound to be 1 α by employing C1 α+η instead. Additionally, for independent and identically distributed (iid) data and η 1/ min{ncal, ntest}, Lei et al. (2018) prove empirical coverage over the test set; that is, for some positive constant c > 0 and δ = exp( c η2 min{ncal, ntest}),

i Itest 1{Yi C1 α(Xi)} 1 α η

Unfortunately, the results above are strongly reliant on the data exchangeability. Similarly, most guarantees from the classical theory of CP do not apply to several important data processes. For instance, in a time series context it is typical to predict the current value Yi via a time-lagged vector Xi = (Yi j)p j=1, and the resulting process (Xi, Yi)n i=1 is typically far from exchangeable. Spatial models and shifting data distributions will also break the exchangeability assumption. Several recent papers have tried to address these issues (Chernozhukov et al., 2018; Xu and Xie, 2021, 2023; Jensen et al., 2022; Gibbs and Candès, 2021, 2024; Barber et al., 2023; Zaﬀran et al., 2022; Feldman et al., 2022), but they generally require the introduction of new CP algorithms speciﬁcally tailored to diﬀerent types of non-exchangeability. Some of these are very computationally intensive, while others only possess asymptotic guarantees. A third group requires adaptativity that is, recalibration of the prediction set C1 α at each time step , which may not be desirable in applications. The main message of this work is that on many occasions there is no need to introduce speciﬁc CP methods for non-exchangeable data. We prove that in such cases split CP possesses the marginal and empirical guarantees above, up to the addition of a slightly larger penalty term η in (1) and (2). These guarantees hold in ﬁnite samples and make no underlying assumptions on model consistency. While the penalty depends on the nature of the non-exchangeability more speciﬁcally, on decoupling and concentration properties of the process , we show that in practice the eﬀect is small even for moderately dependent data, and that increasing the calibration set size is a viable corrective. Importantly, split CP is computationally simple, avoiding intensive routines such as bootstrapping, ensembling or blocking. Finally, the method is exactly the same as the one used for the iid data and attests to its robustness, which is essential to ensure its validity in practical settings. For example, Figure 1 shows how split CP s marginal coverage (7), calculated through 10 000 simulations, behaves for an AR(1) time series and three diﬀerent underlying models. The data-generating mechanism is given by Yt = λYt 1 + εt, t N, λ R and εt N(0, 1)

Split CP and Non-Exchangeable Data

0.0 0.2 0.4 0.6 0.8 1.0 Autoregressive coefficient

Marginal coverage

Gradient Boosting

0.0 0.2 0.4 0.6 0.8 1.0 Autoregressive coefficient

Neural Network

0.0 0.2 0.4 0.6 0.8 1.0 Autoregressive coefficient

Random Forest

Figure 1: Marginal coverage for AR(1) process (solid) and nominally prescribed iid level of 90% (dashed) for diﬀerent values of the autoregressive coeﬃcient and three diﬀerent models. Split CP holds well even under moderate dependence and undercoverage only happens at very high levels.

independently. Split conformal quantile regression (Romano et al., 2019) is employed to achieve a nominal level of 90%, with models trained on 11 lags (i.e. Xt = (Yt j)11 j=1) to predict the next element in the sequence. The x-axis is indexed by λ, which can be interpreted as a level of dependence in the data. Unless the dependence is very high, split CP still has adequate coverage: autoregressive coeﬃcients up to λ = 0.99 achieve coverage higher than 89%. Signiﬁcant losses of coverage only happen when λ 0.999. Our main contributions are as follows:

We show coverage guarantees from split CP can be extended to large classes of dependent processes through the addition of a small coverage penalty;

We do so by introducing a novel mathematical framework that is based on concentration inequalities and data decoupling properties rather than exchangeability;

We present many generalizations that ﬁt our framework, including both marginal, empirical and conditional coverage guarantees, as well as extensions to non-split CP methods and non-stationary data;

We explicitly consider the broad class of stationary β-mixing processes (which includes, e.g., hidden Markov models, ARMA models and Markov chains), and show their empirical coverage bounds can match the order of the bounds under exchangeability;

We conduct experiments that show split CP s success in generating prediction intervals in real and synthetic data, highlighting its advantages in terms of coverage, interval size, simplicity and speed compared to many recent algorithms, such as Gibbs and Candès (2024); Xu and Xie (2021, 2023); Barber et al. (2023).

Proofs are postponed to the appendices.

2. Related Works

There have been many new proposals in extending CP to non-exchangeable data. Barber et al. (2023) focus on distributional drift. They bound the coverage gap i.e., the diﬀerence between nominal and actual coverage levels by a measure of deviation from exchangeability,

Oliveira, Orenstein, Ramos and Romano

which may be quite large for the time series or spatiotemporal data we deal with. Gibbs and Candès (2021, 2024) consider an online method where there is no calibration set and the quantile bq1 α is tuned online. They obtain very strong empirical coverage guarantees, but have no marginal coverage guarantees, and the online aspect of their method, which requires updates at every step, may be undesirable, e.g., in deployed applications. Similar comments apply to extensions and variants of this approach in Feldman et al. (2022) and Zaﬀran et al. (2022).

Chernozhukov et al. (2018) design prediction sets for time series data via a blockpermutation method. Their Theorem 2 gives approximate guarantees in a setting including α-mixing processes, which is in principle weaker than our β-mixing assumption. However, this mixing condition interacts in a complicated way with the block structure (cf. Lemma 1 in Chernozhukov et al. 2018), which is a complicated hyperparameter that increases the computational cost. In other work, the same authors (Chernozhukov et al., 2021) achieve stronger conditional guarantees for a simpler method than in Chernozhukov et al. (2018), at the cost of much stronger assumptions. That is, they require that population objects be learned asymptotically from the data, which can be undesirable in real-world applications where one may want to be agnostic about the quality of the model. Similar comments apply to Xu and Xie (2021, 2023), which have additional hyperparameters and computationally demanding algorithms.

In contrast, our framework for split CP holds for a broad range of non-exchangeable data; retains ﬁnite-sample marginal, empirical and conditional guarantees; does not require online modiﬁcations; has no hyperparameters and is both simpler and orders of magnitude faster than these alternatives. These advantages can be seen in the experiments in Section 4.

3. Theoretical Results

We begin this section by introducing the notation for the theoretical results. Denote by (Xi, Yi)n i=1 a sample of n random covariate/response pairs with stationary marginals. The pairs (Xi, Yi) take values in X Y, where X and Y are measurable spaces. An additional random pair (X , Y ) with values in X Y, independent from the sample (Xi, Yi)n i=1, will also be considered, and we assume (Xi, Yi) (X , Y ) for all i [n], where [n] := {1, . . . , n}.

The data indices can be partitioned [n] = Itrain Ical Itest, where n = ntrain +ncal +ntest and Itrain := [ntrain] corresponds to the training data, Ical := [ntrain+ncal]\[ntrain] corresponds to calibration data, and Itest := [n]\[ntrain + ncal] corresponds to test data.

For any function s: (X Y)ntrain+1 R and (x, y) X Y, denote a nonconformity score as

bstrain(x, y) := s((Xi, Yi)i Itrain, (x, y)),

corresponding to the values of s when the ﬁrst ntrain pairs in the input are the training data. Intuitively, the role of bstrain is to measure how discrepant a prediction based on xi is compared to the true yi; e.g., bstrain(x, y) = |y bµ(x)|, where bµ is some regression model trained on (Xi, Yi)i Itrain. Several choices have been proposed in the literature (Lei et al., 2018; Hechtlinger et al., 2019; Romano et al., 2019; Angelopoulos et al., 2021).

Split CP and Non-Exchangeable Data

Given φ [0, 1), let bqφ,cal denote the empirical φ-quantile of bstrain(Xi, Yi) over Ical; i.e.,

bqφ,cal := inf

t R : 1 ncal

i Ical 1{bstrain(Xi, Yi) t} φ

For x X, the prediction sets are then deﬁned via Cφ(x) := {y Y : bstrain(x, y) bqφ,cal}. Also, given a measurable function q: (X Y)ntrain R (which can be thought of as a quantile), deﬁne qtrain := q((Xi, Yi)i Itrain) and the probability

Pq,train := P[bstrain(X , Y ) qtrain | (Xi, Yi)i Itrain]. (4)

3.1 Marginal and Empirical Guarantees

This subsection details how marginal and empirical guarantees (1) and (2) can be extended when the data is not exchangeable. Some basic assumptions are needed, though they are satisﬁed by large classes of processes. Section 3.3 shows that is the case for stationary β-mixing data, but the general framework developed here generalizes to other processes. First, it is necessary to have some form of concentration over the calibration data, as well as a degree of decoupling over the test data. We assume there exist εcal (0, 1) and δcal (0, 1) such that

i Ical 1{bstrain(Xi, Yi) qtrain} Pq,train

1 δcal, (5)

where Pq,train is deﬁned as in (4). Intuitively, this condition requires that the empirical and population cumulative distribution functions of bstrain(X, Y ) are close over calibration data. A key point here, however, is that this closeness should hold even when the cdf is computed over a point depending on training data. Further, we assume that there exists εtrain such that, for i Itest,

|P[bstrain(Xi, Yi) qtrain] P[bstrain(X , Y ) qtrain]| εtrain. (6)

This means that (Xi, Yi) for i Itest essentially behaves like (X , Y ), i.e., a data point that is independent of training data. Conditions like (5) and (6) (with small error parameters) hold for many types of dependent data, such as strictly stationary processes whose memory is not too strong. Examples include ﬁnite Markov chains, hidden Markov chains and ARMA-type processes (both linear and nonlinear). A broader class of such processes can be described under so-called β-mixing conditions, which we discuss in detail in Section 3.3. Under these conditions, the usual marginal coverage guarantees can be recovered for split CP.

Theorem 1 (Marginal coverage over test data) Given α (0, 1) and δcal > 0, if conditions (5) and (6) hold, then, for all i Itest:

P[Yi C1 α(Xi)] 1 α εcal δcal εtrain. (7)

Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then

|P[Yi C1 α(Xi)] (1 α)| εcal + δcal + εtrain.

Oliveira, Orenstein, Ramos and Romano

To also guarantee empirical coverage, suppose that instead of the decoupling assumption (6), there exists concentration of the empirical cumulative distribution function of the nonconformity score over the test data, that is, there exist εtest, δtest (0, 1) such that

" Pq,train 1 ntest

i Itest 1{bstrain(Xi, Yi) qtrain}

1 δtest. (8)

Theorem 2 (Empirical coverage over test data) Given α (0, 1), δcal > 0 and δtest > 0, if (5) and (8) hold, then:

i Itest 1{Yi C1 α(Xi)} 1 α η

1 δcal δtest,

where η = εcal+εtest. Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then:

i Itest 1{Yi C1 α(Xi)} (1 α)

1 2δcal 2δtest.

While the purpose of the above theorems is to extend split conformal guarantees to non-exchangeable data, they also readily apply to the iid case. Indeed, it is straightforward to show that, in such case, it suﬃces to take εcal = p

(2ncal) 1 log(2/δcal) and εtest = p

(2ntest) 1 log(2/δtest).

3.2 Conditional Guarantees

Obtaining a conditional version of (1) and (2) is of interest in many cases. Barber et al. (2020) prove that coverage is not generally attainable, even for iid data. On the positive side, they show they can be achieved when conditioning on sets of ﬁnite VC dimension that are not too small. Our goal is to show similar guarantees for split conformal prediction under non-exchangeable data. First, conditional versions of assumptions (5) and (6) are needed. For concentration over the calibration data, suppose there exist δcal and εcal (0, 1) such that, for Pq,train(A) := P[bstrain(X , Y ) qtrain | (Xi, Yi)i Itrain, X A] and ncal(A) := #{i Ical : Xi A},

1 max{ncal(A), 1}

i Ical(A) 1{bstrain(Xi, Yi) qtrain} Pq,train(A)

(9) For a conditional version of marginal decoupling, assume there exists εtrain (0, 1) such that

|P[bstrain(Xk, Yk) qtrain | Xk A] P[bstrain(X , Y ) qtrain | X A]| εtrain. (10)

These conditions suﬃce for conditional marginal coverage.

Theorem 3 (Conditional coverage over test data) Given α (0, 1) and δcal > 0, if (9) and (10) hold, then, for each A A X and any i Itest:

P[Yi C1 α(Xi; A) | Xi A] 1 α εcal δcal εtrain. (11)

Split CP and Non-Exchangeable Data

Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then:

|P[Yi C1 α(Xi; A) | Xi A] (1 α)| εcal + δcal + εtrain.

Appendix B includes a conditional version of the empirical coverage guarantee.

3.3 Application to Stationary β-Mixing Data

We now apply the framework from Section 3.1 to the class of stationary β-mixing processes. This class of non-exchangeable data is broad enough to cover many important applications, such as hidden Markov models and Markov chains (Doukhan, 2012) as well as ARMA and GARCH models (Carrasco and Chen, 2002; Mokkadem, 1988), while still providing explicit error terms in the bounds of Theorems 1 and 2. Recall a sequence of random elements {Zt} t= of a measurable space Z is stationary if its ﬁnite-dimensional distributions are time-invariant; that is, for any t Z and m, k N,

Zt:(t+m) = (Zt, . . . , Zt+m) d= (Zt+k, . . . , Zt+m+k) = Z(t+k):(t+m+k).

Furthermore, for a stationary stochastic process {Zt} t= and index a N, the β-mixing coeﬃcient of the process at a is deﬁned as

β(a) = P :0,a: P :0 Pa: TV,

where TV denotes the total variation norm, and P :0,a: is the joint distribution of the blocks (Z :0, Za: ). The process is said to be β-mixing if β(a) 0 when a . The β-mixing condition allows us to replace independence with asymptotic independence and still retain some important concentration results. In particular, the so-called Blocking Technique (Yu, 1994; Mohri and Rostamizadeh, 2010; Kuznetsov and Mohri, 2017) allows one to compare a β-mixing process with another process made of independent blocks. The results below generally follow from combining the Blocking Technique with decoupling arguments and Bernstein s concentration inequality. Crucially, the guarantees might depend on block sizes, but split CP itself does not. This is an advantage over CP variants (Chernozhukov et al., 2018). The sets of parameters we optimize over are deﬁned as follows:

Fcal = {(a, m, r) N3 >0 : 2ma = ncal r + 1, δcal > 4(m 1)β(a) + β(r)},

Ftest = {(a, m, s) N2 >0 N : 2ma = ntest s, δtest > 4(m 1)β(a) + β(ncal)}.

These two sets correspond to block size choices in the calibration and test sets, respectively. For the calibration set, deﬁne the error term as follows:

εcal := inf (a,m,r) Fcal

4 ncal r + 1 log 4 δcal 4(m 1)β(a) β(r)

3m log 4 δcal 4(m 1)β(a) β(r)

Oliveira, Orenstein, Ramos and Romano

where eσ(a) = q

a Pa 1 j=1(a j)β(j). Similarly, we deﬁne the test error correction factor for a stationary β-mixing process as

εtest = inf (a,m,s) Ftest

4 ntest log 4 δtest 4(m 1)β(a) β(ncal)

3m log 4 δtest 4(m 1)β(a) β(ncal)

We emphasize that this optimization of block sizes plays a role exclusively on the coverage guarantees below; the split CP algorithm itself remains unchanged. With εcal as above, Theorem 1 yields the following result for stationary β-mixing processes:

Theorem 4 (Marginal coverage: stationary β-mixing processes) Suppose the sample (Xi, Yi)n i=1 is stationary β-mixing. Then given α (0, 1) and δcal > 0, for i Itest,

P [Yi C1 α(Xi)] 1 α η,

with η = εcal + εtrain + δcal, where εcal is as in (12) and εtrain = β(i ntrain). Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data:

|P[Yi C1 α(Xi)] (1 α)| η.

Under certain assumptions over the dependence of the processes, the stationary βmixing bounds given by (12) are of the same asymptotic order as the corresponding iid bounds. Indeed, if β(k) k b and δ n c cal for b > 1, c > 0, with 1 + 2c < b, as long as m = o(n(b c)/(b+1) cal ) and p

ncal log(ncal) = o (m), the bounds are of the same order. This is satisﬁed, for example, if m = nλ cal, a = n1 λ cal /2 with 1/2 < λ < (b c)/(b + 1). Additionally, with εtest as above, Theorem 2 yields the following:

Theorem 5 (Empirical coverage: stationary β-mixing processes) Suppose the sample (Xi, Yi)n i=1 is stationary β-mixing. Then given α (0, 1), δcal > 0 and δtest > 0

i Itest 1{Yi C1 α(Xi)} 1 α η

1 δcal δtest,

with η = εcal + εtest, and εcal and εtest deﬁned in (12) and (13). Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then:

i Itest 1{Yi C1 α(Xi)} (1 α)

1 2δcal 2δtest.

We note in passing that the expression in (12) follows from a stationary β-mixing version of Bernstein s inequality which might be of independent interest. See Appendix C for the proof and for conditional extensions of the above results.

Split CP and Non-Exchangeable Data

3.4 Extensions to Non-Split CP Methods and Non-Stationary Data

The results obtained in previous subsections also apply when the data is non-stationary. For brevity, we focus on marginal coverage. Our analysis is partly inspired by the recent work of Barber et al. (2023). Replace the pair (X , Y ) with an auxiliary process (X ,i, Y ,i)i [n] that is an independent copy of the original data (Xi, Yi)i [n]. Let Ncal be a random number, uniformly distributed over Ical, independently of the problem data and auxiliary process. For j Itest, the quantity δ(j) := Law(Xj, Yj) Law(XNcal, YNcal) TV, for j Itest, measures how far the distribution of (Xj, Yj) is to that of a randomly chosen point in the calibration data set (i.e., a measure of distributional drift). For marginal coverage, we take the random variable qtrain as before, but replace (4) by a time-inhomogeneous version, that is, P (j) q,train := P[bstrain(X ,j, Y ,j) qtrain | (Xi, Yi)i Itrain]. Furthermore, (5) and (6) are also replaced with time-inhomogeneous versions:

i Ical (1{bstrain(Xi, Yi) qtrain} P (i) q,train)

1 δcal, (14)

and, for j Itest,

|P[bstrain(Xj, Yj) qtrain] P[bstrain(X ,j, Y ,j) qtrain]| εtrain. (15)

Theorem 6 (Marginal coverage for split CP on non-stationary data) Given a level α (0, 1), if (14) and (15) hold, then for all j Itest:

P[bstrain(Xj, Yj) bq1 α,cal] 1 α η δ(j),

where η = εcal + δcal + εtrain is as in Theorem 1.

In particular, we recover Theorem 1 up to an error depending on how much distributional drift there is between j and the calibration set. This is similar to the main result in Barber et al. (2023), except that there the authors consider weighted calibration sets. Finally, our framework also extends to other popular methods in the literature that are not based on split CP, such as rank-one-out (ROO) conformal prediction (Lei et al., 2018) and risk-controlling prediction sets (RCPS) (Bates et al., 2021). ROO calibrates the quantiles used for each test data point by using the remaining test points, so it is diﬀerent from split CP. RCPS gives a general CP methodology that applies in a variety of settings, including regression, multiclass classiﬁcation and image segmentation. Importantly, RCPS does not involve nonconformity scores but the construction of nested sets. Details for both of these extensions to non-exchangeable data are given in Appendix D.

4. Experiments

This section studies split CP s empirical performance in several numerical experiments. The ﬁrst one uses real spatiotemporal climate data to compare split CP s coverage and average interval size to recent alternative conformal algorithms due to Barber et al. (2023) and Gibbs and Candès (2024). The second experiment benchmarks split CP on synthetic β-mixing

Oliveira, Orenstein, Ramos and Romano

data against a popular conformal approach for time-series (Xu and Xie, 2021, 2023). In both cases, split conformal is used with absolute residuals as nonconformity score. The third example involves a hidden Markov model in which the bounds can be calculated explicitly, while the last shows that split CP s marginal and conditional guarantees work with real ﬁnancial data, even when exchangeability is clearly violated. In these two ﬁnal examples, as in the AR(1) experiment (cf. Figure 1), we employ split conformal quantile regression (Romano et al., 2019). The experiments were conducted on a server with 774 GB of RAM and two Intel Xeon Platinum 8354H processors, totalling 8 physical cores and 288 threads. All conformal implementations made use of multiprocessing for better performance. Further implementation details are discussed in Appendix E. Code to reproduce all the ﬁgures and tables is available at https://github.com/jv-rv/split-conformal-nonexchangeable. Example 1 (Spatiotemporal climate data) Let Yt,l R be the 14-day forward rolling average temperature in degrees Celsius, with t representing the ﬁrst measurement day included in the average and l L a location on Earth deﬁned by latitudes and longitudes discretized in a 1.5 1.5 grid totalling 7512 distinct locations. For all grid locations, measurements from the start of 1979 through the end of 2022 were retrieved via the National Oceanic and Atmospheric Administration (NOAA, 2023). It will be notationally convenient to split the dates in year, month and day, so we consider instead the equivalent representation Yy,m,d,l. A standard procedure in the meteorological literature to predict temperatures for a given date t and location l, called climatology, is to average the temperatures observed on the same day, month and location over the previous 30 years (Hwang et al., 2019; Arguez et al., 2012). More precisely, we take Xy,m,d,l := (Yy j,m,d,l)30 j=1 as features and ˆµ( ) = Avg( ) as the prediction model. Since 30 years (1979 2008) of temperature data are needed for the features, our response series starts on 2009. Note that the model ˆµ is ﬁxed a priori and needs no training data, so Itrain = . For a new test point Yy ,m ,d ,l , the calibration set Ical is taken as all the previous observations for that same day and month and the absolute residual is used as nonconformity score function. Thus, the set of nonconformity scores evaluated on calibration data is given by

{|Yy,m,d,l ˆµ(Xy,m,d,l)| : y < y , m = m , d = d , l L}. (16)

The intuition behind this choice is that temperature measurements for a given location, day and month can be reasonably modeled as β-mixing and stationary over the years (Paura, 2006). Split conformal prediction is eﬃcient in generating prediction intervals in this scenario. At any given date, the calibration set for all 7512 geographic points is the same and taking the quantile of (16) allows us to build thousands of prediction intervals at once. We follow the online sequential procedure through time implied by (16), with no additional overhead due to the spatial dependence. Figure 2 shows the temperature measurements in the center, from which the spatial dependence becomes evident. The prediction intervals are represented by their minimum (left) and maximum (right). Coverage is generally attained and intervals are narrow enough to be useful. We compare the standard and widespread split CP method (Papadopoulos et al., 2002; Vovk et al., 2005; Lei et al., 2018) against recent developments tailored to non-exchangeable

Split CP and Non-Exchangeable Data

Figure 2: Average temperatures around the globe on 2022-12-31 (center), split CP s lower prediction (left) and upper prediction (right). Darker blues (reds) indicate lower (higher) temperatures. Approximately 92.6% of the 7512 grid points are inside the prediction interval, which is close to the 90% prescribed coverage. Averaging over all days, coverage is of 90.9%

data, namely Nex CP due to Barber et al. (2023) and dynamically-tuned adaptive conformal inference (Dt ACI) due to Gibbs and Candès (2024), which we brieﬂy describe next. Barber et al. (2023) proposed variations of conformal prediction to deal with nonexchangeability. We focus on the split version of their Nex CP algorithm, that is identical to split CP apart from the fact that quantiles are weighted to give more importance to more dependent residuals. In the original paper, this novel method is evaluated on time series data and shows improvement on coverage for data presenting distribution drift. For the non-exchangeable violations studied in our work, however, one would expect standard split CP to work well enough. Using Nex CP for spatiotemporal data requires a weighting scheme that takes into account both space and time. We deal with the time dimension as in the original paper via ρy y time for some decay parameter ρtime (0, 1], giving less weight to observations from the distant past. In a similar vein, we employ the exponential decay ρd(l ,l) space for ρspace (0, 1] and d(l , l) a distance between the test point location l and the residual location l. A good approximation for the distance of two points on Earth deﬁned in terms of latitudes and longitudes is achieved by assuming a spherical surface and making use of the haversine formula. We employ this procedure so that d: R2 R2 [0, π] is the angular distance between l and l. The ﬁnal weights are given by ρy y time ρd(l ,l) space . Calculating a weighted quantile has average time complexity O(ncal log ncal) in contrast to O(ncal) for the unweighted case. Moreover, Nex CP must calculate a diﬀerent quantile for each one of the 7512 geographic points, while Split CP makes one single calculation per time step. Therefore, one would expect Nex CP to be much slower than Split CP for spatiotemporal data, even when parallelized (cf. Table 1). Gibbs and Candès (2024) introduced dynamically-tuned adaptive conformal inference (Dt ACI) as an online learning method for constructing prediction intervals with a target coverage. As usual in the online setting, it is assumed that one new test point arrives at each time step. In the case of spatiotemporal data, with all grid points arriving at once, a reasonable strategy is to consider many individual time series. Indeed, this is how Covid-19 predictions are dealt with in the original paper and how we proceed. Table 1 summarizes the comparison between split CP, Nex CP and Dt ACI. The main takeaway is that split CP a much simpler method, with no hyperparameters, and orders of

Oliveira, Orenstein, Ramos and Romano

Method Hyperparams Coverage Interval size Avg time Avg SD (time) SD (space) ( C) (min)

Split CP None 90.9% 0.033 0.130 9.605 2.85

ρtime = ρspace = 0.99 90.9% 0.033 0.130 9.601

ρtime = ρspace = 0.9 90.8% 0.033 0.131 9.554 ρtime = ρspace = 0.7 90.6% 0.033 0.132 9.446 ρtime = ρspace = 0.5 90.3% 0.035 0.133 9.360 ρtime = ρspace = 0.3 90.0% 0.038 0.134 9.289 ρtime = ρspace = 0.1 89.5% 0.042 0.136 9.205

|I| = 1, γ ( 2i

1000)8 i=0 91.9% 0.034 0.096 9.658

75.15 |I| = 3, γ ( 2i

1000)8 i=0 92.1% 0.034 0.090 9.779 |I| = 5, γ ( 2i

1000)8 i=0 92.1% 0.034 0.090 9.806 |I| = 7, γ ( 2i

1000)8 i=0 92.1% 0.034 0.090 9.814 |I| = 9, γ ( 2i

1000)8 i=0 92.2% 0.034 0.091 9.815

Table 1: Comparison of conformal methods in terms of average running time, hyperparameters, average coverage, standard deviation of coverage over time and space, and average interval size. The prescribed level is 90%. While all methods are close to the nominal level, split CP is simpler, has no hyperparameter and is orders of magnitude faster for the spatiotemporal data considered.

magnitude faster is comparable to the benchmarks both in terms of coverage and interval size, working as expected for the spatiotemporal data. Nex CP s weighting scheme was useful in reducing the average interval size while maintaining a coverage above the prescribed 90% level, but the gains were small (less than 4% for the optimal choice ρtime and ρspace equal to 0.3). Dt ACI reduced the coverage standard deviation over space, but generated larger intervals. Xu and Xie (2021, 2023) developed ensemble batch prediction intervals (Enb PI) speciﬁcally as a conformal method for time-series data. However, it crucially relies on ﬁtting an underlying prediction algorithm on bootstrapped samples of the training data. Recall that the climatology model uses no training data and simply takes the average of the features, so Itrain = and Enb PI cannot be applied. In order to compare split CP to this important recent development, we conducted the following experiment. Example 2 (Hidden random walk on the cycle graph) Consider a β-mixing sequence generated from a random walk on the cycle graph. On any given state, there is a probability b of moving backward, f of moving forward and s of staying still. The dependence increases for larger values of s, which makes it more likely to have repeating values. Intuitively, the number of vertices also inﬂuence dependence, as it should take a larger sample to account for a larger number of states to be visited. Indeed, for r N>0 the β-mixing coeﬃcient β(r) for a cycle of v vertices decays at rate e r/v2. A Gaussian noise with zero mean and variance of 10 6 is added to the resulting sequence to avoid draws. Figure 3 compares the rolling coverage 1 500 P499+j i=j 1{Yi C1 α(Xi)} for j {1, . . . , 2501} of split CP and Enb PI, both using a random forest trained on 11 lags to predict the next element in the sequence as model and α set to 0.1. The implementation of Enb PI used

Split CP and Non-Exchangeable Data

0 500 1000 1500 2000 2500 Starting point

Rolling coverage

Split CP Enb PI

Figure 3: Split CP s rolling coverage for β-mixing data generated from a random walk on the cycle graph is close to the prescribed 90% level, whereas Enb PI signiﬁcantly undercovers at some points. Considering all test points, Split CP s coverage was of 91.4% with an average interval size of 0.34; Enb Pi s coverage was of 86.6% with an average interval size of 0.15.

was from MAPIE (Taquet et al., 2022). In this experiment, split CP was more than 8 times faster and had a rolling coverage much closer to the prescribed level. Fifteen other random sequences besides the one presented here were generated from the random walk on the cycle and the conclusion was invariably the same. Example 3 (Two-state hidden Markov model) Let (W0, W1, W2, . . .) be a Markov chain with state space W = {0, 1}, probabilities P[Wt = 1|Wt 1 = 0] = p and P[Wt = 0|Wt 1 = 1] = q, following the stationary distribution with π = h q p+q p p+q i , with p, q (0, 1) and p+q > 0. This data is stationary β-mixing, and the mixing coeﬃcients can be found explicitly (Mc Donald et al., 2015). When p = q = 0.5, β(r) 0 for all r N>0, so the Markov chain reduces to a sequence of iid Bernoulli trials. On the other hand, as p and q tend towards zero, β(r) becomes large for every r and dependence increases. We construct a hidden Markov model by adding a Gaussian noise with zero mean and variance of 10 6 to the Markov chain

0.5 0.6 0.7 0.8 0.9 1.0 Probability of repeating previous state

Marginal coverage

Gradient Boosting

0.5 0.6 0.7 0.8 0.9 1.0 Probability of repeating previous state

Neural Network

0.5 0.6 0.7 0.8 0.9 1.0 Probability of repeating previous state

Random Forest

Figure 4: Marginal coverage for two-state hidden Markov model (solid) and nominally prescribed 90% target (dashed) for diﬀerent levels of dependence and three diﬀerent models. Coverage is above 89% for all but very extreme levels.

Oliveira, Orenstein, Ramos and Romano

0 10000 20000 30000 40000 50000 Calibration set size

Coverage guarantee

Dependence factor

0.5 (indep.) 0.55 0.6 0.65 0.7

Figure 5: Marginal coverage for diﬀerent calibration set sizes and dependence levels; the guarantees under dependence converge to the iid case with larger calibration sizes.

with p = q, and consider predicting a single data point by using the 11 preceding elements in the series as features. Figure 4 shows how marginal coverage (7), calculated through 10 000 simulations, is aﬀected by increasing levels of dependence 1 p for three diﬀerent models (boosting, neural network and random forest) and ntrain = 1000, ncal = 500 and ntest = 1. We note in passing that the AR(1) experiment (cf. Figure 1) used the same amount of data for training, calibration and testing. Marginal coverage observed is close to nominal iid value of 90% for the independent case (p = 0.5) and weak to medium dependence, measured by the probabilities 1 p of repeating the previous state. Coverage remains above 89% even for large values of dependence, and falls below 88% only after 1 p = 0.999. Also, Figure 5 shows how the correction εcal + δcal + εtrain in Theorem 1 depends on the calibration set sizes, quickly converging to the iid limit even for moderately dependent data.

0.50 0.55 0.60 0.65 0.70 Dependence factor

Empirical coverage Theoretical bound

Figure 6: Empirical coverage bound; while the worst-case bound decreases with dependence, empirical coverage remains close to the iid level.

Split CP and Non-Exchangeable Data

Figure 7: Daily marginal coverages of minute-by-minute online prediction for ﬁnancial time series eurusd, bcousd and spxusd (solid blue), prescribed iid levels of 1 α = 0.9 (dashed black) and observed marginal coverages over the entire year (dashed orange), above 0.895 in all cases.

Further, Figure 6 shows the empirical coverage for a gradient boosting model over a thousand simulations with ntrain = 1000, ncal = ntest = 15 000 and δcal = δtest = 0.005. Note the empirical coverage revolves around the prescribed iid level 0.9, and it remains above the worst-case theoretical bound, which decreases with the dependence level 1 p. Example 4 (Financial time series) We study split CP s performance on three real-world time series: the euro spot exchange rate (eurusd), Brent crude oil future (bcousd) and S&P 500 stock index future (spxusd). Minute-by-minute data is retrieved directly from Hist Data (2022) via an open-source API (Rémy, 2022). We compute linear returns by dividing a price at minute t by the price at minute t 1 and subtracting 1. Due to market closures, Fridays and Sundays were discarded. We use gradient boosting to predict the price at time t using the prices at times t 11, . . . , t 1. Then, we apply online conformal prediction over a sliding window of 1000 training points, 500 calibration points and 1 single test point for the entire year of 2021. Figure 7 shows the daily coverage of split CP. The dashed black line represents the iid nominal coverage of 90% and the dashed orange one the marginal coverage over the entire year. Marginal coverage is slightly below 90%, but never drastically so.

Data set Cal. set size Conditional coverage

Uptrend Downtrend High vol. Low vol.

eurusd 500 88.76% 88.82% 87.64% 90.07% 1000 89.19% 89.17% 88.38% 90.19% 5000 90.03% 89.98% 89.85% 90.08%

bcousd 500 88.94% 88.72% 87.10% 89.43% 1000 89.35% 89.04% 87.65% 89.95% 5000 89.78% 89.77% 89.33% 89.98%

spxusd 500 89.12% 89.01% 88.87% 89.68% 1000 89.53% 89.48% 88.84% 90.03% 5000 90.04% 89.73% 89.53% 90.30%

Table 2: Conditional coverage for distinct trend and volatility events and varying calibration set size (before conditioning). Note that conditional coverage is generally close to nominal iid level 1 α = 0.9 and results improve given more calibration points.

Oliveira, Orenstein, Ramos and Romano

Table 2 presents the conditional coverage (11) on four events of interest for all three ﬁnancial data sets. Uptrend (respectively, downtrend) stands for two consecutive observations of positive (negative) returns. High (low) volatility events are taken to be those in which the standard deviation of the previous 10 returns observed is above (below) a given threshold. Note that conditioning on all such events still yields coverage close to the nominal iid level, on all three data sets. As previously noted, larger calibration sets have an important eﬀect in improving coverage.

All experiments in this section were also conducted for nominally prescribed iid levels of 85% and 95% and the conclusions remained the same, as expected.

5. Conclusion

This paper shows that many of the appealing properties of split conformal prediction hold well even when the data is not exchangeable. Theorems 1 and 2 give marginal and empirical coverage guarantees for broad classes of non-exchangeable data, and Theorem 3 gives marginal coverage guarantees. We also consider the concrete case of stationary β-mixing sequences in Theorems 4 and 5, and show that our error bounds for such non-exchangeable data may match the order of the iid bounds. To prove these results, we introduce a novel mathematical framework that frees split CP from its exchangeability hypotheses, replacing it with concentration inequalities and decoupling properties.

We further show that these guarantees translate to robust empirical performance by split CP beyond independent data, with examples included in Section 4 ranging from synthetic to real data. In the same section, in spite of its generality, split CP is shown to fare well when compared to recent conformal algorithms developed for speciﬁc kinds of non-exchangeability. Except for extreme levels of dependence or very small calibration sets small eﬀective sample size regimes , performance holds exceedingly well for standard split CP without any modiﬁcations.

Finally, we note that the general framework introduced in Section 3.1 can be extended in many directions. Section 3.4 looks at how results generalize for non-stationary data, as well as for other conformal prediction methods that are not based on split CP. One particular open direction is the application of concentration inequalities for weighted sums of exchangeable random variables (e.g., Barber 2024) to analyze weighted conformal methods such as Nex CP (Barber et al., 2023). More generally, we expect that several results and methods (e.g., Chernozhukov et al. 2021, 2018; Zaﬀran et al. 2022; Feldman et al. 2022) can be analyzed and extended via our uniﬁed framework, which remains an avenue for future work.

Acknowledgments

RIO is supported by CNPq grants 432310/2018-5 (Universal) and 304475/2019-0 (Produtividade em Pesquisa), and FAPERJ grants 202.668/2019 (Cientista do Nosso Estado) and 290.024/2021 (Edital Inteligência Artiﬁcial). PO is supported by FAPERJ grant SEI260003/001545/2022.

Split CP and Non-Exchangeable Data

Appendix A. Proofs of Section 3.1

For the proofs below, we need to introduce certain population quantiles for bstrain(X , Y ) conditionally on the training data.

Deﬁnition 7 (Conditional φ-quantile of the conformity score) Given φ [0, 1), let qφ,train denote the φ-quantile of bstrain(X , Y ) conditioned on the training data; that is:

qφ,train := inf{t R : P[bstrain(X , Y ) t | {(Xi, Yi)}i Itrain] φ}.

Alternatively, deﬁne, for a deterministic (xi, yi)ntrain i=1 (X Y)ntrain, the φ-quantile:

qφ((xi, yi)ntrain i=1 ) := inf{t R : P[s((xi, yi)ntrain i=1 , (X , Y )) t] φ},

and set qφ,train := qφ((Xi, Yi)i Itrain). We also deﬁne:

pφ,train := P[bstrain(X , Y ) t | (Xi, Yi)i Itrain]

Remark 8 When the conditional law of bstrain(X , Y ) given the training data is continuous, we have pφ,train = φ. Otherwise, it only holds that pφ,train φ.

Proof [Theorem 1] First we show that the event F = {q1 α εcal,train bq1 α,cal} satisﬁes P[F] 1 δcal. Indeed, by (3) and Deﬁnition 7, if condition (5) holds, for any ℓ N>0, with probability at least 1 δcal,

i Ical 1{bstrain(Xi, Yi) q1 α εcal,train 1/ℓ} P[bstrain(X , Y ) q1 α εcal,train 1/ℓ] + εcal

< 1 α εcal + εcal = 1 α

i Ical 1{bstrain(Xi, Yi) bq1 α,cal}.

This implies that the event

i Ical 1{bstrain(Xi, Yi) q1 α εcal,train 1/ℓ} < 1 ncal

i Ical 1{bstrain(Xi, Yi) bq1 α,cal}

satisﬁes P[Eℓ] 1 δcal for all ℓ N>0, and since Eℓ+1 Eℓ, we have 1 δcal limℓ P[Eℓ] = P[E ], where,

i Ical 1{bstrain(Xi, Yi) q1 α εcal,train} 1 ncal

i Ical 1{bstrain(Xi, Yi) bq1 α,cal}

proving that P[F] 1 δcal. Therefore, given i Itest, using the fact that the function t 7 P[bstrain(Xi, Yi) t] is increasing,

P[bstrain(Xi, Yi) bq1 α,cal] P[{bstrain(Xi, Yi) bq1 α,cal} F]

P[bstrain(Xi, Yi) q1 α εcal,train] δcal.

Oliveira, Orenstein, Ramos and Romano

Hence, by condition (6) and a conditioning argument,

P[bstrain(Xi, Yi) bq1 α,cal] P[bstrain(Xi, Yi) q1 α εcal,train] δcal (17)

P[bstrain(X , Y ) q1 α εcal,train] εtest δcal 1 α εcal εtest δcal,

proving the ﬁrst part of the theorem. For the second part, note that by Deﬁnition 7 and condition (5) we have with probability at least 1 δcal,

i Ical 1{bstrain(Xi, Yi) q1 α+εcal,train} P[bstrain(X , Y ) q1 α+εcal,train] εcal 1 α,

and since bq1 α εcal,cal is the smallest possible value satisfying the expression above, the event G = {bq1 α,cal q1 α+εcal,train} satisﬁes P[G] 1 δcal. Then,

P[bstrain(Xi, Yi) bq1 α,cal] P[{bstrain(Xi, Yi) bq1 α,cal} G] + δcal P[bstrain(Xi, Yi) q1 α+εcal,train] + δcal.

Hence, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, by condition (6)

P[bstrain(Xi, Yi) bq1 α,cal] P[bstrain(X , Y ) q1 α+εcal,train] + εtest + δcal = 1 α + εcal + εtest + δcal.

Putting this together with (17) concludes the second part.

Proof [Theorem 2] Assuming that condition (5) and condition (8) hold and using a similar argument as we did in the proof of Theorem 1, it is straightforward to show that the event F = {bq1 α εcal εtest,test bq1 α,cal} satisﬁes P[F] 1 δcal δtest. But then,

i Itest 1{bstrain(Xi, Yi) bq1 α,cal} 1 α εcal εtest

i Itest 1{bstrain(Xi, Yi) bq1 α εcal εtest,test} 1 α εcal εtest

1 δcal δtest,

proving the ﬁrst part. For the second part, note that G = {bq1 α+εcal+εtest,test bq1 α,cal} also has probability at least 1 δcal δtest, therefore the event F G = {bq1 α+εcal+εtest,test bq1 α εcal εtest,test} has probability at least 1 2δcal 2δtest Hence, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, using the same argument as we did in the proof of Theorem 1 concludes the theorem.

Split CP and Non-Exchangeable Data

Proof [Application to the iid case] First, note that, in the iid case, when i Itest,

P[bstrain(Xi, Yi) qtrain] = P[bstrain(X , Y ) qtrain],

showing that condition (6) holds with εtest = 0. Moreover, using the fact that (1{bstrain(Xi, Yi) qtrain})n i=1 is an iid sample of bounded random variables, by Hoeﬀding s inequality, with probability at least 1 δ, 1 n

i=1 1{bstrain(Xi, Yi) qtrain} Pq,train

Therefore, conditions (5) and (8) are proved by taking

1 2ncal log 2

and εtest =

1 2ntest log 2

Appendix B. Proofs of Section 3.2 and Further Results

Let A be a family of measurable subsets of X. Given φ [0, 1) and A A, let Ical(A) := {i Ical : Xi A} and ncal(A) := #Ical(A). Denote the empirical φ-quantile of bstrain(Xi, Yi) over i Ical as:

bqφ,cal(A) := inf

t R : 1 ncal(A)

i Ical(A) 1{bstrain(Xi, Yi) t} φ

and, for x A, deﬁne the prediction set:

Cφ(x; A) := {y Y : bstrain(x, y) bqφ,cal(A)}.

For A A with P[X A] > 0, let

Pq,train(A) := P[bstrain(X , Y ) qtrain | (Xi, Yi)i Itrain, X A]. (18)

Proof [Theorem 3] Following the same strategy as in Theorem 1, we have that, with probability at least 1 δcal, the event Fcal = {q1 α εcal(A) bq1 α,cal(A), A A} satisﬁes P[Fcal] 1 δcal. Now, using the fact that the function t 7 P[bstrain(Xk, Yk) t | Xk A] is increasing, for any k Itest and all A A,

P[bstrain(Xk, Yk) bq1 α,cal(A) | Xk A] P[{bstrain(Xk, Yk) bq1 α,cal(A)} Fcal | Xk A]

P[{bstrain(Xk, Yk) q1 α εcal(A)} Fcal | Xk A]

P[bstrain(Xk, Yk) q1 α εcal(A) | Xk A] δcal.

Then, to conclude,

P[bstrain(Xk, Yk) bq1 α,cal(A) | Xk A]

P[bstrain(Xk, Yk) q1 α εcal(A) | Xk A] δcal P[bstrain(X , Y ) q1 α εcal(A) | X A] εtrain δcal = E[P[bstrain(X , Y ) q1 α εcal(A) | (Xi, Yi)i Itrain, X A]] εtrain δcal 1 α εcal εtrain δcal.

Oliveira, Orenstein, Ramos and Romano

We now proceed to give an empirical conditional coverage guarantee. For a conditional version of (8), suppose there exist δtest and εtest (0, 1) such that

Pq,train(A) 1 ntest(A)

i Itest(A) 1{bstrain(Xi, Yi) qtrain}

1 δtest, (19)

where Pq,train(A) is deﬁned as in (18). This suﬃces for empirical conditional coverage.

Theorem 9 (Empirical conditional coverage over test data) Given α (0, 1), δcal > 0 and δtest > 0, if (9) and (19) hold, then for each A A:

inf A A 1 ntest(A)

i Itest(A) 1{Yi C1 α(Xi; A)} 1 α η

1 δcal δtest,

where η = εcal+εtest. Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then:

i Itest(A) 1{Yi C1 α(Xi; A)} (1 α)

1 2δcal 2δtest.

Proof As in the proof of empirical coverage over test data (Theorem 2), the event F = {bq1 α εcal εtest,test(A) bq1 α,cal(A) A A} satisﬁes P[F] 1 δcal δtest and the remaining of the proof is just a direct application of the deﬁnition of conditional empirical quantile calibrated over the test data.

The results above are directly applicable in the iid case. It is straightforward to show that (10) holds with εtest = 0 and, if the family A has ﬁnite VC dimension VC(A) = d and P[A] > γ for some γ > 0 and all A A, it suﬃces to take εcal = γ 1(4 p

log(2(n + 1)d)/n + 2 p

log(4/δ)/(2n)).

Appendix C. Proofs of Section 3.3 and Further Results

C.1 Standard Coverage Guarantees

Our goal is to check conditions (5), (6) and (8) when (Xi, Yi)n i=1 is a stationary β-mixing process. As stated in the main text, the main tool we will use is the so-called Blocking Technique (Yu, 1994; Mohri and Rostamizadeh, 2010; Kuznetsov and Mohri, 2017). It allows one to measure the diﬀerence in expectation between a function of a β-mixing process and the same function over an independent process, thereby transforming the original dependent problem into an independent one with the addition of a penalty factor.

Proposition 10 (Blocking Technique) Let {Zt}T t=1 be a sample of a stationary β-mixing process. Split the sample into 2m interleaved blocks, with even blocks of size a and odd blocks of size b, such that T = m(a + b). Denote each block by Bj = {Zi}u(j) i=l(j), where

Split CP and Non-Exchangeable Data

l(j) = 1+ (j 2)/2 a+ j/2 b and u(j) = j/2 a+ j/2 b, so the set of odd blocks, each of size b, is given by Bodd = (B1, B3, . . . , B2m 1). Consider also the set B odd = (B 1, B 3, . . . , B 2m 1)

where B j are independent for j = 1, 3, . . . , 2m 1, and B j d= Bj. If h : Rmb R is a Borel-measurable function with |h| M for some M > 0, then

|E[h(Bodd)] E[h(B odd)]| 2M(m 1)β(a),

where β(a) is the β-mixing coeﬃcient of {Zt}T t=1.

Using the Blocking Technique, we can prove that up to a error correction factor, we can transform our stationary β-mixing problem into an iid one:

Lemma 11 (Mohri and Rostamizadeh 2009) Let Z1, . . . , Zn be a sample drawn from a stationary β-mixing distribution and F be a class of functions from X to [0, 1]. Split the sample into 2m blocks, with blocks of size a with n = 2ma. Denote the blocks by Bj = {Zi}u(j) i=l(j) where l(j) = 1 + (j 1)a and u(j) = ja, with Bodd = (B1, B3, . . . , B2m 1). Call the independent

version of Bodd by B odd = (B 1, B 3, . . . , B 2m 1), where B j are independent with B j d= Bj, and let P be their law. Then,

E [f(Z1)] 1

Zj B odd f(Zj)

+ 4(m 1)β(a).

Finally, using Lemma 11 and Bernstein s Inequality, we are ready to prove a concentration inequality for stationary β-mixing processes.

Lemma 12 Let Z1, . . . , Zn be a sample drawn from a stationary β-mixing distribution with Z1 [0, 1] and Var[Z1] = v < . Then, for any m, a, s N+ with m > 1, n = 2ma + s and δ > 4(m 1)β(a)), with probability at least 1 δ it holds that E [Z1] 1

4 n log 4 δ 4(m 1)β(a)

3m log 4 δ 4(m 1)β(a)

and eσ(a) = q

a Pa 1 k=1(a k)β(k).

Proof By an application of Lemma 11 taking F = {x 7 x, x [0, 1]} and Bernstein s inequality over the m independent blocks, with probability at least 1 δ E [Z1] 1

E [Z1] 1 2ma

2 m log 4 δ 4(m 1)β(a)

3m log 4 δ 4(m 1)β(a)

Oliveira, Orenstein, Ramos and Romano

where σ2 = Var h 1 a P i:Zi Bj Zi i . To estimate σ2, note that by stationarity,

a E Z2 1 E[Z1]2 + 1

|i j|=k E [Zi Zj] .

Now, using the fact that {Zi}i Bj is β-mixing, we have

a E Z2 1 E[Z1]2 + 1

E [Z]2 + β(k)

a E Z2 1 E[Z1]2 + 1

|i j|=k E [Z]2 + 1

|i j|=k β(k)

a Var[Z1] + 1

|i j|=k β(k) = 1

Var[Z1] + 2

k=1 (a k)β(k)

k=1 (a k)β(k)

Plugging the above expression in (20) and using the fact that 2ma = n, yields the result.

Now we are ready to prove conditions (5), (6) and (8) for the stationary β-mixing case.

Proposition 13 If (Xi, Yi)n i=1 is stationary β-mixing, then condition (5) holds with

εcal = inf (a,m,r) Fcal

4 ncal r + 1 log 4 δcal 4(m 1)β(a) β(r)

3m log 4 δcal 4(m 1)β(a) β(r)

for Fcal = {(a, m, r) N3 >0 : 2ma = ncal r + 1, δcal > 4(m 1)β(a) + β(r)},

where eσ(a) = q

a Pa 1 k=1(a k)β(k).

Proof We want to use Lemma 12 for the random variables {1{bstrain(Xi, Yi) qtrain}}i Ical , however, since the random variables (Xi, Yi)i Ical are dependent to bstrain and the quantile qtrain, we cannot simply apply the result. To ﬁx this problem, it will be necessary to create a gap between our training and calibration data and use the Blocking Technique, Proposition 10, to transpose our problem to an independent setting. For ε > 0 and r {1, . . . , ncal}, let Ical,r = {ntrain + r, . . . , ntrain + ncal} and deﬁne the event

1 ncal r + 1

i Ical,r 1{bstrain(Xi, Yi) qtrain} Pq,train

Split CP and Non-Exchangeable Data

we want to show that there exists ε > 0 such that P[E(1, ε)] δ. Note that if E(1, ε) holds, then

i Ical,r 1{bstrain(Xi, Yi) qtrain} Pq,train

and since ncal ncal r + 1,

1 ncal r + 1

i Ical,r 1{bstrain(Xi, Yi) qtrain} Pq,train

that is, E(1, ε) E(r, ε (r 1)/ncal). Now, deﬁne P = Pntrain 1 Pntrain+ncal ntrain+r , so under P , we have that (Xi, Yi)i Itrain and (Xi, Yi)i Ical,r are independent. Then, by Proposition 10,

P[E(1, ε)] P[E(r, ε (r 1)/ncal)]

P [E(r, ε (r 1)/ncal)] + β(r)

= E [P [E(r, ε (r 1)/ncal) | (Xi, Yi)i Itrain]] + β(r).

Note that by Lemma 12, for any m, a N+ with ncal (r + s) + 1 = 2ma and δcal > 4(m 1)β(a)), using the fact that Var [1{bstrain(Xi, Yi) qtrain}] 1/4, taking

4 ncal r + 1 log 4 δ 4(m 1)β(a)

3m log 4 δ 4(m 1)β(a)

implies that P [E(r, ε (r 1)/ncal) | (Xi, Yi)i Itrain] δcal. Therefore, P[E(1, ε)] δcal+β(r), which is equivalent to P[E(1, ε )] δcal taking

4 ncal r + 1 log 4 δ 4(m 1)β(a) β(r)

3m log 4 δ 4(m 1)β(a) β(r)

Finally, since this is true for any choice of a, m, r N>0 and s N with s+2ma = ncal r+1 and δ > 4(m 1)β(a) + β(r), we can choose a, m, r optimally such that the value of ε is minimized and there is no need to optimize in s in this case.

Proposition 14 If (Xi, Yi)n i=1 is stationary β-mixing, then condition (6) holds with εtrain = β(k ntrain). Moreover, since β(k ntrain) β(1 ntrain), it is possible to ﬁnd εtrain not depending on k.

Proof Given k Itest, deﬁne P = Pntrain 1 Pk k, so under P the random variable (Xk, YK) is independent of the training data (Xi, Yi)i Itrain. Then, if βk = β(k ntrain) we have,

βk |P[bstrain(Xk, Yk) qtrain] P [bstrain(Xk, Yk) qtrain]|

= |P[bstrain(Xk, Yk) qtrain] E [P [bstrain(Xk, Yk) qtrain | (Xi, Yi)i Itrain]]|

= |P[bstrain(Xk, Yk) qtrain] P[bstrain(X , Y ) qtrain | (Xi, Yi)i Itrain]| ,

Oliveira, Orenstein, Ramos and Romano

where the βk penalty follows from Proposition 10. Note that the larger the k the smaller the penalty incurred by the dependence in the β-mixing process. Moreover, since βk β1, it is possible to deﬁne εtrain = β1 not depending on k Itest.

Proposition 15 If (Xi, Yi)n i=1 is stationary β-mixing, then condition (8) holds with

εtest = inf (a,m) Ftest

4 ntest log 4 δtest 4(m 1)β(a) β(ncal)

3m log 4 δtest 4(m 1)β(a) β(ncal)

Ftest = {(a, m, s) N2 >0 N : s + 2ma = ntest, δ > 4(m 1)β(a) + β(ncal)},

and eσ(a) = q

a Pa 1 k=1(a k)β(k).

Proof The proof is similar to the proof of Proposition 13. Let the event E(ε) be

( Pq,train 1 ntest

i Itest 1{bstrain(Xi, Yi) qtrain}

Deﬁne, P = Pntrain 1 Pntrain+ncal+ntest ntrain+ncal , so under P we have that (Xi, Yi)i Itrain and (Xi, Yi)i Itest are independent. By Proposition 10 we have

P[E(ε)] P [E(ε)] + β(ncal) = E [P [E(ε) | (Xi, Yi)i Itrain]] + β(ncal).

Now we can apply Lemma 12 and conclude that, just as we did in Proposition 13, that for any m, a N+, s N with ntest = 2ma s and δtest > 4(m 1)β(a)) + β(ncal), it is true that P[E(ε)] δ, where

4 ntest log 4 δtest 4(m 1)β(a) β(ncal)

3m log 4 δtest 4(m 1)β(a) β(ncal)

+ s ntest ,

and eσ(a) = q

a Pa 1 k=1(a k)β(k). Finally, since this is true for any choice of a, m N>0 and s N, with s + 2ma = ntest and δtest > 4(m 1)β(a) + β(ncal), we can choose a, m, s optimally to minimize ε.

Proof [Theorem 4] The result follows from applying Propositions 13 and 14 in Theorem 1.

Proof [Theorem 5] The result follows from applying Propositions 13 and 15 in Theorem 2.

Split CP and Non-Exchangeable Data

C.2 Conditional Guarantees

To obtain conditional guarantees for stationary β-mixing processes, we need to specify a family A of measurable sets in X satisfying certain conditions. In particular, we will assume that, for a ﬁxed value γ > 0, the family A has ﬁnite VC dimension VC(A) = d and P[X A] > γ for all A A. Then, given δcal > 0 and α (0, 1), deﬁne the calibration error correction factor for a stationary β-mixing process conditioned to the family A as

εcal = inf (a,m,r) Gcal

2 m log 16 δcal 16(m 1)β(a) β(r)

where κ(m, r) = 4ncal p

log(2(m + 1)d)/m + 2(r 2) and

Gcal = {(a, m, r) N3 >0 : 2ma = ncal r + 1, δcal > 16(m 1)β(a) + β(r)}.

Note the factor 1/γ in εcal: for η to be small, we need εcal to be small and consequently m has to be large. This is quite natural, since if γ is too small, the probability P[X A] can be close to zero, and thus a larger sample is necessary to estimate the empirical quantile well. Therefore, Theorem 3 yields the following.

Theorem 16 (Conditional coverage: stationary β-mixing processes) Suppose that (Xi, Yi)n i=1 is stationary β-mixing. Then given α (0, 1), γ > 0 and δcal > 0, for each A A and any i Itest

P[Yi C1 α(Xi; A) | Xi A] 1 α η,

with η = εcal + εtest, where εcal is as in (21) and εtest = β(i ntrain). Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then:

|P[Yi C1 α(Xi; A) | Xi A] (1 α)| εcal + δcal + εtest.

Now, denote the test error correction factor for a stationary β-mixing process conditioned to the family A as

εtest = inf (a,m,s) Gtest

2 m log 8 δtest 8(m 1)β(a) β(ncal)

where κ(m, r) = 4ntest p

log(2(m + 1)d)/m + 2s and

Gtest = {(a, m, s) N2 >0 N : 2ma = ntest s, δtest > 8(m 1)β(a) + β(ncal)}.

The following result then follows from Theorem 9.

Theorem 17 (Empirical conditional coverage: stationary β-mixing processes) Suppose that (Xi, Yi)n i=1 is stationary β-mixing, then given α (0, 1), γ > 0, δcal > 0 and δtest > 0, for each A A:

inf A A 1 ntest(A)

i Itest(A) 1{Yi C1 α(Xi; A)} 1 α η

1 δcal δtest,

Oliveira, Orenstein, Ramos and Romano

where η = εcal + εtest, for εcal as in (21) and εtest as in (22). Additionally, if bstrain(X , Y ) almost surely has a continuous distribution conditionally on the training data, then:

i Itest(A) 1{Yi C1 α(Xi; A)} (1 α)

1 2δcal 2δtest.

The proofs in this section are very similar to the proofs in Section C.1, however, since we are dealing with a family of Borel measurable sets A, we will need concentration results that allow us to uniformly control certain quantities over the family A. First, we state such classical results for iid sequences.

Theorem 18 Let Z1, . . . , Zn be iid random variables taking values on X and F be a class of functions from X to {0, 1}. Then

log(2SF(n))

Using the Blocking Technique, we can prove that up to a error correction factor, we can transform our stationary β-mixing problem into a iid one. For example,

Corollary 19 Let Z1, . . . , Zn be a sample drawn from a stationary β-mixing distribution and F be a class of functions from X to {0, 1}. Then, for any a, m, s N+, with m > 1, n = 2ma + s and δ > 4(m 1)β(a), it holds that

E [f(Z1)] 1

ε0(a, m, δ)

ε0(a, m, s, δ) = 2

log(2SF(m))

1 2m log 4 δ 4(m 1)β(a)

Proof By an application of Lemma 11 and Mc Diarmids s inequality over the m independent blocks, it follows that

4(e 2mε 2 + (m 1)β(a)).

i:Zi Bj f(Zi)

Split CP and Non-Exchangeable Data

Denote by Z(i) j the ith random variable of the jth block Bj B odd, therefore the expectation in (24) can be written as

i=1 f(Z(i) j )

i=1 f(Z(i) j )

where the inequality comes from the triangular inequality and the monotonicity of the supremum. Note that in 1 m Pm i=1 f(Z(i) j ) we are considering only one element of each independent block Bj B odd, therefore this is a sum over iid random variables. Hence, by Theorem 18

i:Zi Bj f(Zi)

log(2SF(m))

That is, ε > ε 2 q

log(2SF(m))

n. So taking δ > 4(e 2mε 2 +(m 1)β(a)) and ε = ε0(a, m, δ) yields

> ε0(a, m, δ)

Corollary 20 Let (X , Y ), . . . , (Xn, Yn) be a sample drawn from a stationary β-mixing distribution, s : X Y R be any deterministic function and A be a family of Borel measurable sets in X with ﬁnite VC dimension VC(A) = d. Then, for any m, a N+ with m > 1, n = 2ma and δ > 4(m 1)β(a), it holds that

i=1 1{Xi A}

ε0(a, m, δ)

where ε0(a, m, δ) is as deﬁned in (23).

Proof Taking F = {x 7 1{X A} : A A} in Corollary 19 and using Sauer-Shelah lemma (Sauer, 1972; Mohri et al., 2018) yields the result.

Lemma 21 Let X1, . . . , Xn be a sample drawn from a stationary β-mixing distribution, γ (0, 1) and A be a family of Borel measurable sets in X with ﬁnite VC dimension VC(A) = d such that P[X A] > γ for all A A. For m, a N+ with m > 1, n = 2ma and δ > 4(m 1)β(a) suppose that 2

γ ε0(a, m, δ) < 1, with ε0(a, m, δ) as in (23). Then,

inf A A 1 n

i=1 1{Xi A} > γ

Oliveira, Orenstein, Ramos and Romano

Proof By Corollary 20, for any m, a N+ with m > 1, n = 2ma and δ > 4(m 1)β(a), using the fact that ε0(a, m, δ) < γ/2,

inf A A 1 n

i=1 1{Xi A} γ

sup A A γ 1

i=1 1{Xi A} γ

i=1 1{Xi A}

i=1 1{Xi A}

ε0(a, m, δ)

Lemma 22 Let (X , Y ), . . . , (Xn, Yn) be a sample drawn from a stationary β-mixing distribution, s : X Y R be a deterministic function, γ (0, 1) and A be a family of Borel measurable sets in X with ﬁnite VC dimension VC(A) = d such that P[X A] > γ for all A A. For m, a N+ with m > 1, n = 2ma and δ > 8(m 1)β(a), if ε := 2

γ ε0(a, m, δ/2) < 1, for ε0(a, m, δ/2) as in (23), then with probability at least 1 δ

sup A A t R

P[s(X , Y ) t, X A]

P[X A] Pn i=1 1{s(Xi, Yi) t} 1{Xi A} Pn i=1 1{Xi A}

Proof Deﬁne ε as in the lemma statement. We want to show that:

sup A A t R

P[s(X , Y ) t, X A]

P[X A] Pn i=1 1{s(Xi, Yi) t} 1{Xi A} Pn i=1 1{Xi A}

has probability at most δ. To this end, we deﬁne the following auxiliary event, which controls the random denominator term in C:

inf A A 1 n

i=1 1{Xi A} > γ

By Lemma 21, P[Bc] < δ/2, so it suﬃces to show that P[E] δ

2 where E := C B.

Note that, if E holds, then the quotient Pn i=1 1{s(Xi,Yi) t} 1{Xi A} Pn i=1 1{Xi A} is well deﬁned and

ε < sup A A t R

P[s(X , Y ) t, X A]

P[X A] Pn i=1 1{s(Xi, Yi) t} 1{Xi A} Pn i=1 1{Xi A}

sup A A t R

P[s(X , Y ) t, X A] 1

n Pn i=1 1{s(Xi, Yi) t} 1{Xi A} P[X A]

+ sup A A t R

Pn i=1 1{s(Xi, Yi) t} 1{Xi A} P[X A] 1

n Pn i=1 1{Xi A}

P[X A] Pn i=1 1{Xi A}

Split CP and Non-Exchangeable Data

sup A A t R

P[s(X , Y ) t, X A] 1

n Pn i=1 1{s(Xi, Yi) t} 1{Xi A} γ

n Pn i=1 1{Xi A} γ

Moreover, for any A A: P[X A] 1

n Pn i=1 1{Xi A} γ

P[s(X , Y ) t, X A] 1

n Pn i=1 1{s(Xi, Yi) t} 1{Xi A} γ

We deduce that:

E holds ε < 2 sup A A

n Pn i=1 1{Xi A} γ

By Corollary 20, we know that, with our choice of ε:

n Pn i=1 1{Xi A} γ

By (25), we also have P[E] δ/2. This ﬁnishes the proof.

Proposition 23 Let

ε = inf (a,m,r) Gcal

a, m, δ β(r)

where ε0(a, m, δ/2) as in (23) and

Gcal = {(a, m, r) N3 >0 : 2ma = ncal r + 1, δ > 16(m 1)β(a) + β(r)}.

If ε < 1, then condition (9) holds with εcal = ε.

Proof The proof is similar to the proof of Proposition 13. For ε > 0 and r {1, . . . , ncal}, let Ical,r = {ntrain + r, . . . , ntrain + ncal} and Ical,r(A) = {i Ical,r : Xi A}. Deﬁne the events

P i Ical,r(A) 1{bstrain(Xi, Yi) qtrain}

#Ical,r(A) Pq,train(A)

inf A A 1 ncal

i Ical 1{Xi A} > γ

and B(r, ε ) = E(r, ε ) C.

Oliveira, Orenstein, Ramos and Romano

We want to show that there exists ε > 0 such that if ε < 1 then P[E(1, ε )] δ. Note that if B(1, ε ) holds, then for all A A,

P i Ical,r(A) 1{bstrain(Xi, Yi) qtrain}

#Ical,r(A) Pq,train(A)

That is, B(1, ε ) B r, ε 2(r 1)

. Now, deﬁne P = Pntrain 1 Pntrain+ncal ntrain+r , so under P we have that (Xi, Yi)i Itrain and (Xi, Yi)i Ical,r are independent. By Proposition 10 we have

P[B(1, ε )] P B r, ε 2(r 1)

B r, ε 2(r 1)

But this implies that P[E(1, ε )] P h B r, ε 2(r 1)

i +β(r)+1 P[C]. For any m, a N+ with ncal r + 1 = 2ma and δcal > 8(m 1)β(a), if we take

log(2(m + 1)d)

1 2m log 8 δ 8(m 1)β(a)

and assume that ε < 1 then

log(2(m + 1)d)

1 2m log 8 δ 8(m 1)β(a)

so Lemma 22 tells us that E h P h B r, ε 2(r 1)

| (Xi, Yi)i Itrain ii δ and Lemma 21

tells us 1 P[C] δ. That is, P[E(1, ε )] 2δ + β(r), which is equivalent to P[E(1, ε)] δ, if ε is as in the proposition statement.

Proposition 24 Condition (10) holds with εtrain = β(k ntrain).

Proof Given k Itest, note that we can decompose P = Pntrain 1 Pk k, so under P we have that (Xk, Yk) is independent of (Xi, Yi)i Itrain. Then, deﬁning βk = β(k ntrain) we have for all A A,

βk |P[bstrain(Xk, Yk) qtrain(A), Xk A] P [bstrain(Xk, Yk) qtrain(A), Xk A]|

where the βk penalty follows from Proposition 10. But then, by a conditioning argument,

βk |P[bstrain(Xk, Yk) qtrain(A), Xk A] P[bstrain(X , Y ) qtrain(A), X A]| .

Since βk P[Xk A] βk, dividing by P[Xk A] = P[X A] yields the result.

Split CP and Non-Exchangeable Data

Proposition 25 Deﬁne

ε = inf (a,m,s) Gtest

a, m, δ β(ncal)

where Gtest = {(a, m) N2 >0 : s + 2ma = ntest, δ > 8(m 1)β(a) + β(ncal)}. If ε < 1, then condition (19) holds with εtest = ε.

Proof The proof is similar to the proof of Proposition 23. Let the event E(ε) be

Pq,train(A)

P i Itest(A) 1{bstrain(Xi, Yi) qtrain]}

Deﬁne, P = Pntrain 1 Pntrain+ncal+ntest ntrain+ncal , so under P we have that (Xi, Yi)i Itrain and (Xi, Yi)i Itest are independent. By Proposition 10 we have

P[E(ε)] P [E(ε)] + β(ncal) = E [P [E(ε) | (Xi, Yi)i Itrain]] + β(ncal).

Now we can apply Lemma 22 and conclude that if ε < 1, for any m, a N+ with ntest = 2ma and δtest > 8(m 1)β(a)) + β(ncal), it is true that P[E] δ, where

log(2(m + 1)d)

1 2m log 8 δ 8(m 1)β(a) β(ncal)

Finally, since this is true for any choice of a, m, s N>0 with s + 2ma = ntest and δtest > 8(m 1)β(a) + β(ncal), we can choose a, m, s optimally to minimize ε.

Proof [Theorem 16] Follows from applying Propositions 23 and 24 in Theorem 3.

Proof [Theorem 17] Follows from applying Propositions 23 and 25 in Theorem 9.

Appendix D. Proofs of Section 3.4 and Further Results

D.1 Non-Stationary Data

Proof [Theorem 6] This proof is similar to that of Theorem 1, but the notation is somewhat more complicated due to nonstationarity. Consider Ncal as in the statement of the present Theorem. For φ (0, 1), deﬁne the conditional φ-quantile of bstrain(X ,Ncal, Y ,Ncal) given the training data:

qφ,train := inf{t R : P[bstrain(X ,Ncal, Y ,Ncal) t | {(Xi, Yi)}i Itrain] φ},

or alternatively,

qφ,train := inf

t R : 1 ncal

j Ncal P[bstrain(X ,j, Y ,j) t | {(Xi, Yi)}i Itrain] φ

Oliveira, Orenstein, Ramos and Romano

For each j Ical, set p(j) φ,train := P[bstrain(X ,j, Y ,j) qφℓ,train | {(Xi, Yi)}i Itrain]. Fix i Itest. Following the proof of Theorem 1, we consider values of φ of the form:

φℓ:= 1 α εcal δ(i) 1/ℓfor ℓ= 1, 2, 3 . . .

to obtain that P[F] 1 δcal, where F := {q1 α εcal. cal bq1 α,cal} . Now,

P[bstrain(Xi, Yi) bq1 α,cal] P[bstrain(Xi, Yi) q1 α εcal] P[F c]

P[bstrain(X ,i, Y ,i) q1 α εcal] δcal εtest.

By assumption, the law of (X ,i, Y ,i) is δ(i)-close in total variation to that of (X ,Ncal, Y ,Ncal). Since the event {bstrain(X ,i, Y ,i) q1 α δ(i) εcal} depends on (X ,i, Y ,i) and on the independent process (Xj, Yj)j [n], and q1 α εcal,train is the conditional (1 α εcal)-quantile of bstrain(X ,Ncal, Y ,Ncal) given the training data, one may conclude:

P[bstrain(X ,i, Y ,i) q1 α εcal] P[bstrain(X ,Ncal, Y ,Ncal) q1 α εcal] δ(i)

1 α εcal δ(i).

Plugging this back into the previous display ﬁnishes the proof.

D.2 Risk-Controlling Prediction Sets

Risk-controlling prediction sets (RCPS), introduced by Bates et al. (2021), give a general methodology for CP that applies in a variety of settings, including regression, multiclass classiﬁcation and image segmentation. Importantly, RCPS does not involve nonconformity scores, but rather, the construction of nested sets. While the original theory of RCPS assumes independent data, we now show it also applies within our framework. Suppose Y is a family of sets, Λ R {+ } is a closed set, and a map T : (X Y)ntrain X Λ Y is given with the following property: for all choices of (xi, yi)ncal i=1 (X Y)ncal, x X and λ1, λ2 Λ: if λ1 λ2, then T ((xi, yi)ncal i=1, x, λ1) T ((xi, yi)ncal i=1, x, λ2). For (x, λ) X, we use the notation

b Tλ,train(x) := T ((Xi, Yi)i Itrain, x, λ)

to denote the values of T when the ﬁrst ntrain pairs in the input correspond to the training data. We call b Tλ,train( ) a trained tolerance region. Finally, L: Y Y R is a loss function that is decreasing in the Y component. The goal of RCPS is to compute a value bλ from the calibration data that achieves (conditional) risk smaller than a prespeciﬁed level α > 0. To deﬁne the conditional risk, assume that the map from λ Λ to E[L(Y , b Tλ(X )) | (Xi, Yi)i Itrain] almost surely is continuous and achieves arbitrarily small positive values. Given a measurable ℓ: (X Y)ntrain Λ, we let ℓtrain := ℓ((Xi, Yi)i Itrain), deﬁne the conditional expected risk as

R(ℓ) := E[L Y , b Tℓtrain(X ) | (Xi, Yi)i Itrain].

Split CP and Non-Exchangeable Data

Also, deﬁne the empirical risk over the calibration data as

b Rcal(λ) := 1 ncal

i Ical L(Yi, Tλ(Xi)).

Now, a threshold bλ must be chosen from calibration data to control the risk. In Bates et al. (2021), this requires ﬁnding a function λ 7 b RUCB(λ) that gives a pointwise high-probability upper bound on R(λ). In our case, we can allow for a b R(λ) that bounds the risk up to a small error; for us, the empirical risk will play this role. Thus, consider the empirical threshold

bλα,cal := inf n λ Λ : λ Λ, λ > λ = b Rcal(λ) < α o .

Finally, we give conditions that guarantee that bλα,cal controls the risk with high probability. First, assume that there exist εcal > 0, δcal (0, 1) such that, for any ℓ, ℓtrain,

P h | b Rcal(ℓtrain) R(ℓ)| εcal i 1 δcal. (26)

Also, assume there exists a εtest such that for all i Itest and all ℓ,

|E[L(Yi, Tℓtrain(Xi))] E[R(ℓ)]| εtest. (27)

Then, the following result on the performance of RCPS over a single test point holds.

Theorem 26 (Approximate risk control for bλα,cal) Assume (26) and (27). Then,

P h E[L(Y , Tbλα,cal(X ))] α + εcal i 1 δcal.

Moreover, if L is uniformly bounded, we have the following for all i Itest:

E[L(Yi, Tbλα,cal(Xi))] α + εtest + εcal + L δtest.

Proof The proof is a combination of our arguments for Theorem 1 with the reasoning in Bates et al. (2021). Recall that for any function ℓ: (X Y)ntrain Λ, if ℓtrain := ℓ((Xi, Yi)i Itrain),

R(ℓ) := E[L(Y , Tℓtrain(X )) | (Xi, Yi)i Itrain].

Make the speciﬁc choice ℓtrain := inf{λ Λ : R(λ) α + εcal}, so that, by right-continuity of λ 7 E[L(Y , Tλ(X )) | (Xi, Yi)i Itrain],

R(ℓtrain) α + εcal (28)

while at the same time R(ℓtrain 1/k) > α + εcal for all k N. The deﬁnition of bλα,cal = inf{λ Λ : b Rcal(λ) < α} and the fact that our risk decreases with λ imply:

P[bλα,cal ℓtrain 1/k] P[ b Rcal(ℓtrain 1/k) > α] 1 δcal

where the last step uses assumption (26). Letting k + gives:

P[bλα,cal ℓtrain] 1 δcal (29)

Oliveira, Orenstein, Ramos and Romano

Now, bλα,cal ℓtrain and (28) together imply:

E[L(Y , Tbλα,cal(X ))] E[L(Y , Tℓtrain(X ))] = E[R(ℓtrain)] α + εcal, (30)

so the ﬁrst assertion in the Theorem follows from (29). The second assertion follows from:

E[L(Yi, Tbλα,cal(Xi))] E[L(Yi, Tℓtrain(Xi))] + L P[bλα,cal < ℓtrain]

combined with (27) and (30).

Thus the expected loss at any test point is controlled by α plus an error term that can be shown to be small, even for non-exchangeable data. Importantly, the result is achieved via assumptions that only bound the behavior of the loss over individual thresholds ℓtrain obtained from the training data. In particular, there is no need to require uniform control of the loss over a range of ℓ, which would require stronger (and looser) concentration bounds. The uniform bound on L can be replaced by a moment assumption, at the cost of a less clean bound.

D.3 Rank-One-Out Conformal Prediction

Rank-one-out (ROO) conformal prediction, introduced by Lei et al. (2018), is diﬀerent from split CP in that the method calibrates the quantile used for each test data point by looking at the remaining test points. This requires adapting the above setup as follows: partition the data indices as [n] = Itrain Itest, and for each i Itest the calibration set is I(i) cal = Itest \ {i}. Also, deﬁne the empirical quantiles as follows: given φ [0, 1) and i Itest, let bq(i) φ,cal denote the empirical φ-quantile

bq(i) φ,cal := inf

t R : 1 ntest 1

1{bstrain(Xj, Yj) t} φ

For x X, the rank-one-out prediction set for i Itest is then deﬁned via:

C(i) φ (x) := {y Y : bstrain(x, y) bq(i) φ,cal}.

We can then adapt the concentration and decoupling hypotheses. Indeed, we assume there exist εtest (0, 1), {εtest(i)}i Itest (0, 1) and δtest (0, 1) such that, for any i Itest,

|P[bstrain(Xi, Yi) qtrain] P[bstrain(X , Y ) qtrain]| εtest(i), (31)

and, moreover,

i Itest 1{bstrain(Xi, Yi) qtrain} Pq,train

1 δtest. (32)

Then, the analogue of Theorems 1 and 2 still hold for ROO.

Split CP and Non-Exchangeable Data

Theorem 27 (Marginal and empirical coverage over test data for ROO) Given a level α (0, 1), if (31) and (32) hold, then, for all i Itest:

P[bstrain(Xi, Yi) bq1 α,cal] 1 α εtest(i) εtest δcal 1 ntest .

Moreover, it holds that

i Itest 1 n bstrain(Xi, Yi) bq(i) 1 α,cal o 1 α εtest 1 ntest

Proof If we consider Ical := Itest in the proof of Theorem 1, the event F = {q1 α εcal,train bq1 α,cal} satisﬁes P[F] 1 δcal. But since bq(i) 1 α,cal bq1 α 1/ntest,cal, it is also true that the

event F = {q1 α εcal 1/ntest,train bq(i) 1 α,cal} also satisﬁes P[F ] 1 δcal. The rest of the

proof follows the same strategy as in Theorem 1 using bq(i) 1 α,cal instead of bq1 α,cal.

One can adapt the analysis in Section 3.3 to bound the parameters δtest, εtest and εtest(i) for β-mixing data. In particular, one may take εtest(i) = β(i ncal), and εtest, δtest equal to the respective parameters εcal, δcal in that section, but with ntest replacing ncal (since the calibration set for each point of rank-one-out is essentially equal to the test set). On the other hand, we note that marginal coverage might suﬀer somewhat over the ﬁrst few test data, since εtest(i) = β(i ncal) may be large for small values i ncal. In contrast to split CP, there is no gap in ROO between training and test data so the ﬁrst test points may be strongly correlated with the training data.

Appendix E. Details of Section 4

For the experiment comparing split CP and Enb PI (cf. Figure 3), split CP s underlying random forest model comprised 100 trees; mean squared error was used as the split criterion; no maximum tree depth was set, so nodes are expanded until all leaves contain less than 2 samples; all features were considered for splitting. Enb PI s random forest was exactly the same and the length of the blocks in the Enb PI s block bootstrap procedure was set to 8 with the number of resamplings set to 30. For the AR(1) experiment (cf. Figure 1) and Examples 3 and 4, gradient boosting was set to boost 100 trees with a learning rate of 0.1 and pinball loss; trees of any depth were allowed; the minimal number of data in one leaf was 20; the minimal sum hessian in one leaf was 0.001; no minimal gain to perform a split was required; no more than 31 leaves were allowed per tree; no regularization was set. The neural network consisted of three fully connected layers with Re LU activation; the number of output units were 128, 64 and 2, respectively, where the ﬁnal output of 2 units represents the low and high quantiles being estimated; Adam W with learning rate of 10 3 and weight decay of 10 6 was used; training was over 100 epochs with batches of size 64; pinball loss was used. The random forest model (quantile regression forest) comprised 10 trees; mean squared error was used as the split criterion; no maximum tree depth was set, so nodes are expanded until all leaves contain less than 2 samples; all features were considered for splitting. Quantile regressors making use of the pinball loss Lτ(y, ˆy) = τ(y ˆy) 1{y ˆy} + (1 τ)(ˆy y) 1{y < ˆy} had τ set to α/2 and 1 α/2, with α the acceptable miscoverage level.

Oliveira, Orenstein, Ramos and Romano

Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4):494 591, 2023. ISSN 1935-8237. doi: 10.1561/2200000101.

Anastasios N. Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classiﬁers using conformal prediction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=e Ndi U_Db M9.

Anthony Arguez, Imke Durre, Scott Applequist, Russell S Vose, Michael F Squires, Xungang Yin, Richard R Heim, and Timothy W Owen. NOAA s 1981 2010 US climate normals: an overview. Bulletin of the American Meteorological Society, 93(11):1687 1697, 2012.

Rina Foygel Barber. Hoeﬀding and bernstein inequalities for weighted sums of exchangeable random variables. ar Xiv preprint ar Xiv:2404.06457, 2024.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455 482, 2020. ISSN 2049-8772. doi: 10.1093/imaiai/iaaa017.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816 845, 2023. doi: 10.1214/23-AOS2276.

Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM (JACM), 68(6): 1 34, 2021.

Marine Carrasco and Xiaohong Chen. Mixing and moment properties of various GARCH and stochastic volatility models. Econometric Theory, 18(1):17 39, 2002.

Maxime Cauchois, Suyash Gupta, and John C. Duchi. Knowing what you know: valid and validated conﬁdence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1 42, 2021. URL http://jmlr.org/papers/v22/20-753.html.

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages 732 749. PMLR, 2018. URL http: //proceedings.mlr.press/v75/chernozhukov18a.html.

Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118(48):e2107794118, 2021. doi: 10.1073/ pnas.2107794118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2107794118.

Paul Doukhan. Mixing: properties and examples, volume 85. Springer Science & Business Media, 2012.

Split CP and Non-Exchangeable Data

Shai Feldman, Liran Ringel, Stephen Bates, and Yaniv Romano. Risk control for online learning models. ar Xiv preprint ar Xiv:2205.09095, 2022.

Isaac Gibbs and Emmanuel J. Candès. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, 2021. URL https://openreview. net/forum?id=6va Actvpcp3.

Isaac Gibbs and Emmanuel J. Candès. Conformal inference for online prediction with arbitrary distribution shifts. Journal of Machine Learning Research, 25(162):1 36, 2024. URL http://jmlr.org/papers/v25/22-1218.html.

Yotam Hechtlinger, Barnabas Poczos, and Larry Wasserman. Cautious deep learning. ar Xiv preprint ar Xiv:1805.09460, 2019. URL https://openreview.net/forum?id=SJequs Cq KQ.

Hist Data, 2022. https://www.histdata.com/, Retrieved on 2022-01-27.

Jessica Hwang, Paulo Orenstein, Judah Cohen, Karl Pfeiﬀer, and Lester Mackey. Improving subseasonal forecasting in the western US with machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2325 2335, 2019.

Vilde Jensen, Filippo Maria Bianchi, and Stian Norman Anﬁnsen. Ensemble conformalized quantile regression for probabilistic time series forecasting. ar Xiv preprint ar Xiv:2202.08756, 2022.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93 117, 2017. ISSN 1573-0565. doi: 10.1007/ s10994-016-5588-2.

Jing Lei, Max G Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111, 2018. doi: 10.1080/01621459.2017.1307116.

Daniel J. Mc Donald, Cosma Rohilla Shalizi, and Mark Schervish. Estimating beta-mixing coeﬃcients via histograms. Electronic Journal of Statistics, 9:2855 2883, 2015. doi: 10.1214/15-EJS1094.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009. URL https://proceedings.neurips.cc/paper/2008/file/ 7eacb532570ff6858afd2723755ff790-Paper.pdf.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary ϕ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(2), 2010.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2nd edition, 2018. ISBN 978-0-262-03940-6.

Oliveira, Orenstein, Ramos and Romano

Abdelkader Mokkadem. Mixing properties of ARMA processes. Stochastic processes and their applications, 29(2):309 315, 1988.

NOAA, 2023. http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/ .temperature/.daily/, Retrieved on 2023-03-08.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alexander Gammerman. Inductive conﬁdence machines for regression. In Proceedings of the 13th European Conference on Machine Learning, ECML 02, pages 345 356. Springer-Verlag, 2002.

William A. Gardner; Antonio Napolitano; Luigi Paura. Cyclostationarity: Half a century of research. Signal Processing, 86:639 697, 2006. ISSN 0165-1684. doi: 10.1016/j.sigpro.2005. 06.016.

Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 5103c3584b063c431bd1268e9b5e76fb-Paper.pdf.

Philippe Rémy. Hist Data API. https://github.com/philipperemy/FX-1-Minute-Data, 2022.

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145 147, 1972.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(12):371 421, 2008. URL http://jmlr.org/papers/v9/shafer08a. html.

Vianney Taquet, Vincent Blot, Thomas Morzadec, Louis Lacombe, and Nicolas Brunel. Mapie: an open-source library for distribution-free uncertainty quantiﬁcation, 2022.

Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, 2005. ISBN 0387001522.

Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11559 11569. PMLR, 2021. URL https://proceedings. mlr.press/v139/xu21h.html.

Chen Xu and Yao Xie. Conformal prediction for time series. ar Xiv preprint ar Xiv:2010.09107, 2023.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94 116, 1994. doi: 10.1214/aop/1176988849.

Margaux Zaﬀran, Olivier Feron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25834 25866. PMLR, 2022. URL https://proceedings.mlr.press/v162/zaffran22a.html.