# adaptive_conformal_inference_under_distribution_shift__4b9b85d5.pdf

Adaptive Conformal Inference

Under Distribution Shift

Isaac Gibbs Department of Statistics

Stanford University igibbs@stanford.edu

Emmanuel J. Candès Department of Statistics Department of Mathematics

Stanford University candes@stanford.edu

We develop methods for forming prediction sets in an online setting where the data generating distribution is allowed to vary over time in an unknown fashion. Our framework builds on ideas from conformal inference to provide a general wrapper that can be combined with any black box method that produces point predictions of the unseen label or estimated quantiles of its distribution. While previous conformal inference methods rely on the assumption that the data points are exchangeable, our adaptive approach provably achieves the desired coverage frequency over long-time intervals irrespective of the true data generating process. We accomplish this by modelling the distribution shift as a learning problem in a single parameter whose optimal value is varying over time and must be continuously re-estimated. We test our method, adaptive conformal inference, on two real world datasets and ﬁnd that its predictions are robust to visible and signiﬁcant distribution shifts.

1 Introduction

Machine learning algorithms are increasingly being employed in high stakes decision making processes. For instance, deep neural networks are currently being used in self-driving cars to detect nearby objects [2] and parole decisions are being made with the assistance of complex models that combine over a hundred features [1]. As the popularity of black box methods and the cost of making wrong decisions grow it is crucial that we develop tools to quantify the uncertainty of their predictions.

In this paper we develop methods for constructing prediction sets that are guaranteed to contain the target label with high probability. We focus speciﬁcally on an online learning setting in which we observe covariate-response pairs {(Xt, Yt)}t2N Rd R in a sequential fashion. At each time step t 2 N we are tasked with using the previously observed data {(Xr, Yr)}1 r t 1 along with the new covariates, Xt, to form a prediction set ˆCt for Yt. Then, given a target coverage level 2 (0, 1) our generic goal is to guarantee that Yt belongs to ˆCt at least 100(1 )% of the time.

Perhaps the most powerful and ﬂexible tools for solving this problem come from conformal inference [see e.g. 34, 16, 32, 22, 31, 15, 3] . This framework provides a generic methodology for transforming the outputs of any black box prediction algorithm into a prediction set. The generality of this approach has facilitated the development of a large suite of conformal methods, each specialized to a speciﬁc prediction problem of interest [e.g. 30, 11, 23, 8, 24, 21]. With only minor exceptions all of these algorithms share the same common guarantee that if the training and test data are exchangeable, then the prediction set has valid marginal coverage P(Yt 2 ˆCt) = 1 .

While exchangeability is a common assumption, there are many real-world applications in which we do not expect the marginal distribution of (Xt, Yt) to be stationary. For example, in ﬁnance and economics market behaviour can shift drastically in response to new legislation or major world

35th Conference on Neural Information Processing Systems (Neur IPS 2021).

events. Alternatively, the distribution of (Xt, Yt) may change as we deploy our prediction model in new environments. This paper develops adaptive conformal inference (ACI), a method for forming prediction sets that are robust to changes in the marginal distribution of the data. Our approach is both simple, in that it requires only the tracking of a single parameter that models the shift, and general as it can be combined with any modern machine learning algorithm that produces point predictions or estimated quantiles for the response. We show that over long time intervals ACI achieves the target coverage frequency without any assumptions on the data-generating distribution. Moreover, when the distribution shift is small and the prediction algorithm takes a certain simple form we show that ACI will additionally obtain approximate marginal coverage at most time steps.

1.1 Conformal inference

Suppose we are given a ﬁtted regression model for predicting the value of Y from X. Let y be a candidate value for Yt. To determine if y is a reasonable estimate of Yt, we deﬁne a conformity score S(X, Y ) that measures how well the value y conforms with the predictions of our ﬁtted model. For example, if our regression model produces point predictions ˆµ(X) then we could use a conformity score that measures the distance between ˆµ(Xt) and y. One such example is

S(Xt, y) = |ˆµ(Xt) y|.

Alternatively, suppose our regression model outputs estimates ˆq(X; p) of the pth quantile of the distribution of Y |X. Then, we could use the method of conformal quantile regression (CQR) [28], which examines the signed distance between y and ﬁtted upper and lower quantiles through the score

S(Xt, y) = max{ˆq(Xt; /2) y, y ˆq(Xt; 1 /2)}.

Regardless of what conformity score is chosen the key issue is to determine how small S(Xt, y) should be in order to accept y as a reasonable prediction for Yt. Assume we have a calibration set Dcal {(Xr, Yr)}1 r t 1 that is different from the data that was used to ﬁt the regression model. Using this calibration set we deﬁne the ﬁtted quantiles of the conformity scores to be

ˆQ(p) := inf

(Xr,Yr)2Dcal

{S(Xr,Yr) s}

and say that y is a reasonable prediction for Yt if S(Xt, y) ˆQ(1 ).

The crucial observation is that if the data Dcal [ {(Xt, Yt)} are exchangeable and we break ties uniformly at random then the rank of S(Xt, Yt) amongst the points {S(Xr, Yr)}(Xr,Yr)2Dcal [ {S(Xt, Yt)} will be uniform. Therefore,

P(S(Xt, Yt) ˆQ(1 )) = d|Dcal|(1 )e

|Dcal| + 1 .

Thus, deﬁning our prediction set to be ˆCt := {y : S(Xt, y) ˆQ(1 )} gives the marginal coverage guarantee

P(Yt 2 ˆCt) = P(S(Xt, Yt) ˆQ(1 )) = d|Dcal|(1 )e

|Dcal| + 1 .

By introducing additional randomization this generic procedure can be altered slightly to produce a set ˆCt that satisﬁes the exact marginal coverage guarantee P(Yt 2 ˆCt) = 1 [34]. For the purposes of this paper this adjustment is not critical and so we omit the details here. Additionally, we remark that the method outlined above is often referred to as split or inductive conformal inference [27, 34, 26]. This refers to the fact that we have split the observed data between a training set used to ﬁt the regression model and a withheld calibration set. The adaptive conformal inference method developed in this article can also be easily adjusted to work with full conformal inference in which data splitting is avoided at the cost of greater computational resources [34].

2 Adapting conformal inference to distribution shifts

Up until this point we have been working with a single score function S( ) and quantile function ˆQ( ). In the general case where the distribution of the data is shifting over time both these functions should

be regularly re-estimated to align with the most recent observations. Therefore, we assume that at each time t we are given a ﬁtted score function St( ) and corresponding quantile function ˆQt( ). We deﬁne the realized miscoverage rate of the prediction set ˆCt( ) := {y : St(Xt, y) ˆQt(1 )} as

Mt( ) := P(St(Xt, Yt) > ˆQt(1 )),

where the probability is over the test point (Xt, Yt) as well as the data used to ﬁt St( ) and ˆQt( ).

Now, since the distribution generating the data is non-stationary we do not expect Mt( ) to be equal, or even close to, . Even so, we can still postulate that if the conformity scores used to ﬁt ˆQt( ) cover the bulk of the distribution of St(Xt, Yt) then there may be an alternative value

t 2 [0, 1] such that Mt(

t ) = . More rigorously, assume that with probability one, ˆQt( ) is continuous, non-decreasing and such that ˆQt(0) = 1 and ˆQt(1) = 1. This does not hold for the split conformal quantile functions deﬁned in (1), but in the case where there are no ties amongst the conformity scores we can adjust our deﬁnition to guarantee this by smoothing over the jump discontinuities in ˆQ( ). Then, Mt( ) will be non-decreasing on [0, 1] with Mt(0) = 0 and Mt(1) = 1 and so we may deﬁne

t := sup{β 2 [0, 1] : Mt(β) }.

Moreover, if we additionally assume that

P(St(Xt, Yt) = ˆQt(1

then we will have that Mt(

t ) = . So, in particular we ﬁnd that by correctly calibrating the argument to ˆQt( ) we can achieve either approximate or exact marginal coverage.

To perform this calibration we will use a simple online update. This update proceeds by examining the empirical miscoverage frequency of the previous prediction sets and then decreasing (resp. increasing) our estimate of

t if the prediction sets were historically under-covering (resp. over-covering) Yt. In particular, let 1 denote our initial estimate (in our experiments we will choose 1 = ). Recursively deﬁne the sequence of miscoverage events

1, if Yt /2 ˆCt( t),

0, otherwise, where ˆCt( t) := {y : St(Xt, y) ˆQt(1 t)}.

Then, ﬁxing a step size parameter γ > 0 we consider the simple online update

t+1 := t + γ( errt). (2)

We refer to this algorithm as adaptive conformal inference. Here, errt plays the role of our estimate of the historical miscoverage frequency. A natural alternative to this is the update

t+1 = t + γ

where {ws}1 s t [0, 1] is a sequence of increasing weights with Pt

s=1 ws = 1. This update has the appeal of more directly evaluating the recent empirical miscoverage frequency when deciding whether or not to lower or raise t. In practice, we ﬁnd that (2) and (3) produce almost identical results. For example, in Section A.3 in the Appendix we show some sample trajectories for t obtained using the update (3) with

ws := 0.95t s Pt

s0=1 0.95t s0 .

We ﬁnd that these trajectories are very similar to those produced by (2). The main difference is that the trajectories obtained with (3) are smoother with less local variation in t. In the remainder of this article we will focus on (2) for simplicity.

2.1 Choosing the step size

The choice of γ gives a tradeoff between adaptability and stability. While raising the value of γ will make the method more adaptive to observed distribution shifts, it will also induce greater volatility in

the value of t. In practice, large ﬂuctuations in t may be undesirable as it allows the method to oscillate between outputting small conservative and large anti-conservative prediction sets.

In Theorem 4.2 we give an upper bound on (Mt( t) )2 that is optimized by choosing γ proportional to

t |. While not directly applicable in practice, this result supports the intuition that in environments with greater distributional shift the algorithm needs to be more adapatable and thus γ should be chosen to be larger. In our experiments we will take γ = 0.005. This value was chosen because it was found to give relatively stable trajectories for t while still being sufﬁciently large as to allow t to adapt to observed shifts. In agreement with the general principles outlined above we found that larger values of γ also successfully protect against distribution shifts, while taking γ to be too small causes adaptive conformal inference to perform similar to non-adaptive methods that hold t = constant across time.

2.2 Real data example: predicting market volatility

We apply ACI to the prediction of market volatility. Let {Pt}1 t T denote a sequence of daily open prices for a stock. For all t 2, deﬁne the return Rt := (Pt Pt 1)/Pt 1 and realized volatility Vt = R2

t . Our goal is to use the previously observed returns Xt := {Rs}1 s t 1 to form prediction sets for Yt := Vt. More sophisticated ﬁnancial models might augment Xt with additional market covariates (available to the analyst at time t 1). As the primary purpose of this section is to illustrate adaptive conformal inference we work with only a simple prediction method.

We start off by forming point predictions using a GARCH(1,1) model [4]. This method assumes that Rt = σt t with 2, . . . , T taken to be i.i.d. N(0, 1) and σt satisfying the recursive update

t = ! + Vt 1 + βσ2

This is a common approach used for forecasting volatility in economics. In practice, shifting market dynamics can cause the predictions of this model to become inaccurate over large time periods. Thus, when forming point predictions we ﬁt the model using only the last 1250 trading days (i.e. approximately 5 years) of market data. More precisely, for all times t > 1250 we ﬁt the coefﬁcients

ˆ!t, ˆ t, ˆβt as well as the sequence of variances {ˆσt

s}1 s t 1 using only the data {Rr}t 1250 r<t. Then, our point prediction for the realized volatility at time t is

t)2 := ˆ!t + ˆ t Vt 1 + ˆβt(ˆσt

To form prediction intervals we deﬁne the sequence of conformity scores

St := |Vt (ˆσt

and the corresponding quantile function

ˆQt(p) := inf

Then, our prediction set at time t is

v : |v (ˆσt

t)2 ˆQt(1 t)

where { t} is initialized with 1250 = = 0.1 and then updated recursively as in (2).

We compare this algorithm to a non-adaptive alternative that takes t = ﬁxed. To measure the performance of these methods across time we examine their local coverage frequencies deﬁned as the average coverage rate over the most recent two years, i.e.

local Covt := 1 1 500

If the methods perform well then we expect the local coverage frequency to stay near the target value 1 across all time points.

Adaptive Alpha Fixed Alpha Bernoulli Sequence

Local Coverage Level

Black Berry

2005 2010 2015 2020

Local Coverage Level

2000 2005 2010 2015 2020 Time

Figure 1: Local coverage frequencies for adaptive conformal (blue), a non-adaptive method that holds t = ﬁxed (red), and an i.i.d. Bernoulli(0.1) sequence (grey) for the prediction of stock market volatility. The coloured dotted lines mark the average coverage obtained across all time points, while the black line indicates the target level of 1 = 0.9.

Daily open prices were obtained from publicly available datasets published by The Wall Street Journal. The realized local coverage frequencies for the non-adaptive and adaptive conformal methods on four different stocks are shown in Figure 1. These stocks were selected out of a total of 12 stocks that we examined because they showed a clear failure of the non-adaptive method. Adaptive conformal inference was found to perform well in all cases (see Figure 9 in the appendix).

As a visual comparator, the grey curves show the moving average 1 1 500

r=t 250+1 Ir for sequences {It}1 t T that are i.i.d. Bernoulli(0.1). We see that the local coverage frequencies obtained by adaptive conformal inference (blue lines) always stay within the variation that would be expected from an i.i.d. Bernoulli sequence. On the other hand, the non-adaptive method undergoes large excursions away from the target level of 1 = 0.9 (red lines). For example, in the bottom right panel we can see that the non-adaptive method fails to cover the realized volatility of Fannie Mae during the 2008 ﬁnancial crisis, while the adaptive method is robust to this event (see Figure 4 in the Appendix for a plot of the price of Fannie Mae over this time period).

3 Related Work

Prior work on conformal inference has considered two different types of distribution shift [33, 10]. In both cases the focus was on environments in which the calibration data is drawn i.i.d. from a single distribution P0, while the test point comes from a second distribution P1. In this setting Tibshirani et al. [33] showed that valid prediction sets can be obtained by re-weighting the calibration data using the likelihood ratio between P1 and P0. However, this requires the conditional distribution of Y |X to be constant between training and testing and the likelihood ratio P1(X)/P0(X) to be either known or very accurately estimated. On the other hand, Cauchois et al. [10] develop methods for forming prediction sets that are valid whenever P1 and P0 are close in f-divergence. Similar to our work, they show that if Df(P1||P0) then there exists a conservative value 2 (0, 1) such that

M( ) := P(S(Xt, Yt) > ˆQ(1 )) .

The difference between our approach and theirs is twofold. First, while they ﬁx a single conservative value our methods aim to estimate the optimal choice satisfying M( ) = . This is not possible in the setting of [10] as they do not observe any data from which the size of the distribution shift can be estimated. Second, while they consider only one training and one testing distribution we work in a fully online setting in which the distribution is allowed to shift continuously over time.

4 Coverage guarantees

4.1 Distribution-free results

In this section we outline the theoretical coverage guarantees of adaptive conformal inference. We will assume throughout that with probability one 1 2 [0, 1] and ˆQt is non-decreasing with ˆQt(x) = 1 for all x < 0 and ˆQt(x) = 1 for all x > 1. Our ﬁrst result shows that over long time intervals adaptive conformal inference obtains the correct coverage frequency irrespective of any assumptions on the data-generating distribution.

Lemma 4.1 With probability one we have that 8t 2 N, t 2 [ γ, 1 + γ].

Proof: Assume by contradiction that with positive probability { t}t2N is such that inft t < γ (the case where supt t > 1 + γ is identical). Note that supt | t+1 t| = supt γ| errt| < γ. Thus, with positive probability we may ﬁnd t 2 N such that t < 0 and t+1 < t. However,

t < 0 =) ˆQt(1 t) = 1 =) errt = 0 =) t+1 = t + γ( errt) t and thus P(9t such that t+1 < t < 0) = 0. We have reached a contradiction.

Proposition 4.1 With probability one we have that for all T 2 N, 44444

44444 max{ 1, 1 1} + γ

In particular, lim T !1 1

Proof: By expanding the recursion deﬁned in (2) and applying Lemma 4.1 we ﬁnd that

[ γ, 1 + γ] 3 T +1 = 1 +

Rearranging this gives the result.

Proposition 4.1 puts no constraints on the data generating distribution. One may immediately ask whether these results can be improved by making mild assumptions on the distribution shifts. We argue that without assumptions on the quality of the initialization the answer to this question is negative. To understand this, consider a setting in which there is a single ﬁxed optimal target 2 [0, 1] and assume that

Mt(p) = M(p) =

1 (p ), if p > , +

(p ) if p . .

Suppose additionally that E[errt| t] = M( t).1 In order to simplify the calculations consider the noiseless update t+1 = t + γ( M( t)) = t + γ( E[errt| t]). Intuitively, the noiseless update can be viewed as the average case behaviour of (2). Now, for any initialization 1 and any γ min{ 1

} there exists a constant c 2 { 1

} such that for all t, M( t) = c( t ). So, we have that

E[errt] = c E[ t ] = c E[ t 1 + γ( Mt 1( t 1)) ] = c(1 cγ)E[ t 1 ]. Repeating this calculation recursively gives that

E[errt] = c(1 cγ)t 1E[ 1 ] = c(1 cγ)t 1( 1 ), and thus, 44444

44444 = 1 (1 cγ)T

The comparison of this bound to (5) is self-evident. The main difference is that we have replaced max{1 1, 1} with | 1 |. This arises from the fact that 2 (0, 1) is arbitrary and thus max{1 1, 1} is the best possible upper bound on | 1 |. So, we view Proposition 4.1 as both an agnostic guarantee that shows that our method gives the correct long-term empirical coverage frequency irrespective of the true data generating process, and as an approximately tight bound on the worst-case behaviour immediately after initialization.

1This last assumption is in general only true if ˆQt( ) and t are ﬁt independently of one another.

4.2 Performance in a hidden Markov model

Although we believe Proposition 4.1 is an approximately tight characterization of the behaviour after initialization, we can still ask whether better bounds can be obtained for large time steps. In this section we answer this question positively by showing that if 1 is initialized appropriately and the distribution shift is small, then tighter coverage guarantees can be given. In order to obtain useful results we will make some simplifying assumptions about the data generating process. While we do not expect these assumptions to hold exactly in any real-world setting, we do consider our results to be representative of the true behaviour of adaptive conformal inference and we expect similar results to hold under alternative models.

4.2.1 Setting

We model the data as coming from a hidden Markov model. In particular, we let {At}t2N A denote the underlying Markov chain for the environment and we assume that conditional on {At}t2N, {(Xt, Yt)}t2N is an independent sequence with (Xt, Yt) PAt for some collection of distributions {Pa : a 2 A}. In order to simplify our calculations, we assume additionally that the estimated quantile function ˆQt( ) and score function St( ) do not depend on t and we denote them by ˆQ( ) and S( ). This occurs for example in the split conformal setting with ﬁxed training and calibration sets.

In this setting, {( t, At)}t2N forms a Markov chain on [ γ, 1 + γ] A. We assume that this chain has a unique stationary distribution and that ( 1, A1) . This implies that ( t, At, errt) is a stationary process and thus will greatly simplify our characterization of the behaviour of errt. While there is little doubt that the theory can be extended, recall our that main goal is to get useful and simple results. That said, what we really have in mind here is that {At}t2N is sufﬁciently wellbehaved to guarantee that ( t, At) has a limiting stationary distribution. In Section A.5 we give an example where this is indeed provably the case. Lastly, the assumption that ( 1, A1) is essentially equivalent to assuming that we have been running the algorithm for long enough to exit the initialization phase described in Section 4.1.

4.2.2 Large deviation bound for the errors

Our ﬁrst observation is that errt has the correct average value. More precisely, by Proposition 4.1 we have that lim T !1 T 1 PT

a.s. = and since errt is stationary it follows that E[errt] = . Thus, to understand the deviation of T 1 PT

t=1 errt from we simply need to characterize the dependence structure of {errt}t2N.

We accomplish this in Theorem 4.1, which gives a large deviation bound on |T 1 PT

t=1 errt |. The idea behind this result is to decompose the dependence in {errt}t2N into two parts. First, there is dependence due to the fact that t is a function of {errr}1 r t 1. In Section A.7 in the Appendix we argue that this dependence induces a negative correlation and thus the errors concentrate around their expectation at a rate no slower than that of an i.i.d. Bernoulli sequence. This gives rise to the ﬁrst term in (6), which is what would be obtained by applying Hoeffding s inequality to an i.i.d. sequence. Second, there is dependence due to the fact that At depends on At 1. More speciﬁcally, consider a setting in which the distribution of Y |X has more variability in some states than others. The goal of adaptive conformal inference is to adapt to the level of variability and thus return larger prediction sets in states where the distribution of Y |X is more spread. However, this algorithm is not perfect and as a result there may be some states a 2 A in which E[errt|At = a] is biased away from . Furthermore, if the environment tends to spend long stretches of time in more variable (or less variable) states this will induce a positive dependence in the errors and cause T 1 PT

t=1 errt to deviate from . To control this dependence we use a Bernstein inequality for Markov chains to bound |T 1 PT

t=1 E[errt|At] |. This gives rise to the second term in (6).

Theorem 4.1 Assume that {At}t2N has non-zero absolute spectral gap 1 > 0. Let

|E[errt|At = a] | and σ2

B := E[(E[errt|At] )2].

A formal proof of this result can be found in Section A.7. The quality of this concentration inequality will depend critically on the size of the bias terms B and σ2

B. Before proceeding, it is important that we emphasize that the deﬁnitions of B and σ2

B are independent of the choice of t owing to the fact that ( t, At, errt) is assumed stationary. Now, to understand these quantities, let

M(p|a) := P(S(Xt, Yt) > ˆQ(1 p)|At = a)

denote the realized miscoverage level in state a 2 A obtained by the quantile ˆQ(1 p). Assume that M(p|a) is continuous. This will happen for example when ˆQ( ) is continuous and S(Xt, Yt)|At = a is continuously distributed. Then, there exists an optimal value

a such that M(

a|a) = . Lemma A.4 in the Appendix shows that if in addition M( |a) admits a second order Taylor expansion, then

γ + γ 1 sup

44At+k = a]

Here, the constant C will depend on how much M( |a) differs from the ideal case in which ˆQ( ) is the true quantile function for S(Xt, Yt)|At = a. In this case we would have that M( |a) is the linear function M(p|a) = p, 8p 2 [0, 1] and C 2.

We remark that the term E[|

44At+k = a] can be seen as a quantitative measurement of the size of the distribution shift in terms of the change in the critical value

a. Thus, we interpret these results as showing that if the distribution shift is small and 8a 2 A, ˆQ( ) gives reasonable coverage of the distribution of S(Xt, Yt)|At = a, then T 1 PT

t=1 errt will concentrate well around .

4.2.3 Achieving approximate marginal coverage

Theorem 4.1 bounds the distance between the average miscoverage rate and the target level over long stretches of time. On the other hand, it provides no information about the marginal coverage frequency at a single time step. The following result shows that if the distribution shift is small, the realized marginal coverage rate M( t|At) will be close to on average.

Theorem 4.2 Assume that there exists a constant L > 0 such that for all a 2 A and all 1, 2 2 R,

|M( 2|a) M( 1|a)| L| 2 1|.

Assume additionally that for all a 2 A there exists

a 2 (0, 1) such that M(

a|a) = . Then,

E[(M( t|At) )2] L(1 + γ)

Once again we emphasize that (7) holds for any choice of t owing to the fact that ( t, At, errt) is assumed stationary and thus the quantities appearing in the bound are invariant across t. Proof of this result can be found in Section A.8 of the Appendix. We remark that the right-hand side of (7) is minimized by choosing γ = (2E[|

At|])1/2, which gives the inequality

E[(M( t|At) )2] L(

As above we have that in the ideal case ˆQ( ) is a perfect estimate of the quantiles of S(Xt, Yt)|At = a and thus M(p|a) = p and L = 1. Moreover, we once again have the interpretation that E[|

At|] is a quantitative measurement of the distribution shift. Thus, this result can be interpreted as bounding the average difference between the realized and target marginal coverage in terms of the size of the underlying distribution shift. Finally, note that the choice γ = (2E[|

formalizes our intuition that γ should be chosen to be larger in domains with greater distribution shift, while not being so large as to cause t to be overly volatile.

5 Impact of St St St( ) on the performance

The performance of all conformal inference methods depends heavily on the design of the conformity score. Previous work has shown how carefully chosen scores or even explicit optimization of the

Adaptive Alpha Fixed Alpha Bernoulli Sequence

Local Coverage Level

Black Berry

2005 2010 2015 2020

Local Coverage Level

2000 2005 2010 2015 2020 Time

Figure 2: Local coverage frequencies for adaptive conformal (blue), a non-adaptive method that holds t = ﬁxed (red), and an i.i.d. Bernoulli(0.1) sequence (grey) for the prediction of stock market volatility with conformity score St. The coloured dotted lines mark the average coverage obtained across all time points, while the black line indicates the target level of 1 = 0.9.

interval width can be used to obtain smaller prediction sets [e.g. 28, 29, 20, 12]. Adaptive conformal inference can work with any conformity score St( ) and quantile function ˆQt( ) and thus can be directly combined with other improvements in conformal inference to obtain shorter intervals. One important caveat here is that the lengths of conformal prediction sets depend directly on the quality of the ﬁtted regression model. Thus, to obtain smaller intervals one should re-ﬁt the model at each time step using the most recent data to build the most accurate predictions. This is exactly what we have done in our experiments in Sections 2.2 and 6.

In addition to this, the choice of St( ) can also have a direct effect on the coverage properties of adaptive conformal inference. Theorems 4.1 and 4.2 show that the performance of adaptive conformal inference is controlled by the size of the shift in the optimal parameter

t across time. Moreover,

t itself is in one-to-one correspondence with the 1 quantile of St(Xt, Yt). Thus, the coverage properties of adaptive conformal inference depend on how close St(Xt, Yt) is to being stationary.

For a simple example illustrating the impact of this dependence, note that in Section 2.2 we formed prediction sets using the conformity score

St := |Vt ˆσ2

An a priori reasonable alternative to this is the unnormalized score

St := |Vt ˆσ2

t |. However, after a more careful examination it becomes unsurprising that normalization by ˆσ2

t is critical for obtaining an approximately stationary conformity score and thus St leads to much worse coverage properties. Figure 2 shows the local coverage frequency (see (4)) of adaptive conformal inference using St. In comparison to Figure 1 the coverage now undergoes much wider swings away from the target level of 0.9. This issue can be partially mitigated by choosing a larger value of γ that gives greater adaptivity to the algorithm.

6 Real data example: election night predictions

During the 2020 US presidential election The Washington Post used conformalized quantile regression (CQR) (see (1) and Section 1.1) to produce county level predictions of the vote total on election night [13]. Here we replicate the core elements of this method using both ﬁxed and adaptive quantiles.

Adaptive Alpha Fixed Alpha Bernoulli Sequence

500 1000 1500 2000 Time

Local Coverage Level

Figure 3: Local coverage frequencies of adaptive conformal (blue), a non-adaptive method that holds t = ﬁxed (red), and an i.i.d. Bernoulli(0.1) sequence (grey) for county-level election predictions. Coloured dotted lines show the average coverage across all time points, while the black line indicates the target coverage level of 1 = 0.9.

To make the setting precise, let {Yt}1 t T denote the number of votes cast for presidential candidate Joe Biden in the 2020 election in each of approximately T = 3000 counties in the United States. Let Xt denote a set of demographic covariates associated to the tth county. In our experiment Xt will include information on the make-up of the county population by ethnicity, age, sex, median income and education (see Section A.6.1 for details). On election night county vote totals were observed as soon as the vote count was completed. If the order in which vote counts completed was uniformly random {(Xt, Yt)}1 t T would be an exchangeable sequence on which we could run standard conformal inference methods. In reality, larger urban counties tend to report results later than smaller rural counties and counties on the east coast of the US report earlier than those on the west coast. Thus, the distribution of (Xt, Yt) can be viewed as drifting throughout election night.

We apply CQR to predict the county-level vote totals (see Section A.6.2 for details). To replicate the east to west coast bias observed on election night we order the counties by their time zone with eastern time counties appearing ﬁrst and Hawaiian counties appearing last. Within each time zone counties are ordered uniformly at random. Figure 3 shows the realized local coverage frequency over the most recent 300 counties (see (4)) for the non-adaptive and adaptive conformal methods. We ﬁnd that the non-adaptive method fails to maintain the desired 90% coverage level, incurring large troughs in its coverage frequency during time zone changes. On the other hand, the adaptive method maintains approximate 90% coverage across all time points with deviations in its local coverage level comparable to what is observed in Bernoulli sequences.

7 Discussion

There are still many open problems in this area. The methods we develop are speciﬁc to cases where Yt is revealed at each time point. However, there are many settings in which we receive the response in a delayed fashion or in large batches. In addition, our theoretical results in Section 4.2 are limited to a single model for the data generating distribution and the special case where the quantile function

ˆQt( ) is ﬁxed across time. It would be interesting to determine if similar results can be obtained in settings where ˆQt( ) is ﬁt in an online fashion on the most recent data. Another potential area for improvement is in the choice of the step size γ. In Section 2.1 we give some heuristic guidelines for choosing γ based on the size of the distribution shift in the environment. Ideally however we would like to be able to determine γ adaptively without prior knowledge. Finally, our experimental results are limited to just two domains. Additional work is needed to determine if our methods can successfully protect against a wider variety of real-world distribution shifts.

8 Acknowledgements

E.C. was supported by Ofﬁce of Naval Research grant N00014-20-12157, by the National Science Foundation grants OAC 1934578 and DMS 2032014, by the Army Research Ofﬁce (ARO) under grant W911NF-17-1-0304, and by the Simons Foundation under award 814641. We thank John Cherian for valuable discussions related to Presidential Election Night 2020 and Lihua Lei for helpful comments on the connection to online learning.

[1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There s software used across the country to predict future criminals. and it s biased against blacks, 2016. URL https://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing. [2] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B.

Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M. Paixão, Filipe Mutz, Lucas de Paula Veronese, Thiago Oliveira-Santos, and Alberto F. De Souza. Self-driving cars: A survey. Expert Systems with Applications, 165:113816, 2021. ISSN 0957-4174. doi: https: //doi.org/10.1016/j.eswa.2020.113816. URL https://www.sciencedirect.com/science/ article/pii/S095741742030628X. [3] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive

inference with the jackknife+. The Annals of Statistics, 49(1):486 507, 2021. doi: 10.1214/ 20-AOS1965. URL https://doi.org/10.1214/20-AOS1965. [4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307 327, 1986. ISSN 0304-4076. doi: https://doi.org/10.1016/ 0304-4076(86)90063-1. URL https://www.sciencedirect.com/science/article/ pii/0304407686900631. [5] United States Census Bureau. County characteristics resident population estimates, 2019. data

retrieved from https://www.census.gov/data/tables/time-series/demo/popest/ 2010s-counties-detail.html. [6] United States Census Bureau. 2015-2019 american community survey 5-year average county-

level estimates, 2019. data retrieved from https://www.ers.usda.gov/data-products/ county-level-data-sets/download-data/. [7] United States Census Bureau. Small area income and poverty estimates: 2019, 2019. data retrieved from https://www.ers.usda.gov/data-products/ county-level-data-sets/download-data/. [8] Emmanuel J. Candès, Lihua Lei, and Zhimei Ren. Conformalized survival analysis. ar Xiv

preprint, 2021. ar Xiv:2103.09763. [9] X. Cao, J. Zhang, and H. V. Poor. On the time-varying distributions of online stochastic

optimization. In 2019 American Control Conference (ACC), pages 1494 1500, 2019. doi: 10.23919/ACC.2019.8814889. [10] Maxime Cauchois, Suyash Gupta, Alnur Ali, and John C. Duchi. Robust validation: Conﬁdent

predictions even when distributions shift. ar Xiv preprint, 2020. ar Xiv:2008.04267. [11] Maxime Cauchois, Suyash Gupta, and John C. Duchi. Knowing what you know: valid and

validated conﬁdence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1 42, 2021. URL http://jmlr.org/papers/v22/20-753.html. [12] Haoxian Chen, Ziyi Huang, Henry Lam, Huajie Qian, and Haofeng Zhang. Learning prediction

intervals for regression: Generalization and calibration. In Proceedings of The 24th International Conference on Artiﬁcial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 820 828. PMLR, 13 15 Apr 2021. [13] John Cherian and Lenny Bronner. How the washington post estimates outstanding votes

for the 2020 presidential election. https://elex-models-prod.s3.amazonaws.com/ 2020-general/write-up/election_model_writeup.pdf, 2020. [14] MIT Election Data and Science Lab. County Presidential Election Returns 2000-2016, 2018.

URL https://doi.org/10.7910/DVN/VOQCHQ.

[15] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. The

limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 08 2020. ISSN 2049-8772. doi: 10.1093/imaiai/iaaa017. URL https: //doi.org/10.1093/imaiai/iaaa017. iaaa017. [16] Alexander Gammerman and Vladimir Vovk. Hedging predictions in machine learning. The

Computer Journal, 50(2):151 163, 2007. doi: 10.1093/comjnl/bxl065. [17] Elad Hazan. Introduction to online convex optimization. ar Xiv preprint, 2019. ar Xiv:1909.05207. [18] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal

of the American Statistical Association, 58(301):13 30, 1963. doi: 10.1080/01621459.1963. 10500830. [19] Bai Jiang, Qiang Sun, and Jianqing Fan. Bernstein s inequality for general markov chains.

ar Xiv preprint, 2020. ar Xiv:1805.10721. [20] Danijel Kivaranovic, Kory D. Johnson, and Hannes Leeb. Adaptive, distribution-free prediction

intervals for deep networks. In Proceedings of the Twenty Third International Conference on Artiﬁcial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research,

pages 4346 4356. PMLR, 26 28 Aug 2020. [21] Danijel Kivaranovic, Robin Ristl, Martin Posch, and Hannes Leeb. Conformal prediction

intervals for the individual treatment effect. ar Xiv preprint, 2020. ar Xiv:2006.01474. [22] Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression.

Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 01 2014. doi: 10.1111/rssb.12021. [23] Jing Lei, Max G Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman.

Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111, 2018. doi: 10.1080/01621459.2017.1307116. URL https://doi.org/10.1080/01621459.2017.1307116. [24] Lihua Lei and Emmanuel J. Candès. Conformal inference of counterfactuals and individual

treatment effects. ar Xiv preprint, 2021. ar Xiv:2006.06138. [25] David Leip. Dave leip s atlas of u.s. presidential elections., 2020. http://uselectionatlas.

org. [26] Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural net-

works. In Tools in Artiﬁcial Intelligence,, pages 315 330, 2008. [27] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive

conﬁdence machines for regression. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen, editors, Machine Learning: ECML 2002, pages 345 356, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-36755-0. [28] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 5103c3584b063c431bd1268e9b5e76fb-Paper.pdf. [29] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classiﬁcation with valid and adaptive coverage. In Advances in Neural Information Processing Systems, volume 33, pages 3581 3591, 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 244edd7e85dc81602b7615cd705545f5-Paper.pdf. [30] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classiﬁcation with valid and adap-

tive coverage. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3581 3591. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/ file/244edd7e85dc81602b7615cd705545f5-Paper.pdf. [31] Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classiﬁers with

bounded error levels. Journal of the American Statistical Association, 114(525):223 234, 2019. doi: 10.1080/01621459.2017.1395341. URL https://doi.org/10.1080/01621459.2017. 1395341.

[32] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine

Learning Research, 9(12):371 421, 2008. URL http://jmlr.org/papers/v9/shafer08a. html. [33] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal

prediction under covariate shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/ file/8fb21ee7a2207526da55a679f0332de2-Paper.pdf. [34] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random

World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522. [35] Jingyi Zhu and James Spall. Tracking capability of stochastic gradient algorithm with constant

gain. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4522 4527, 12 2016. doi: 10.1109/CDC.2016.7798957.