# adaptive_conformal_inference_under_distribution_shift__4b9b85d5.pdf Adaptive Conformal Inference Under Distribution Shift Isaac Gibbs Department of Statistics Stanford University igibbs@stanford.edu Emmanuel J. Candès Department of Statistics Department of Mathematics Stanford University candes@stanford.edu We develop methods for forming prediction sets in an online setting where the data generating distribution is allowed to vary over time in an unknown fashion. Our framework builds on ideas from conformal inference to provide a general wrapper that can be combined with any black box method that produces point predictions of the unseen label or estimated quantiles of its distribution. While previous conformal inference methods rely on the assumption that the data points are exchangeable, our adaptive approach provably achieves the desired coverage frequency over long-time intervals irrespective of the true data generating process. We accomplish this by modelling the distribution shift as a learning problem in a single parameter whose optimal value is varying over time and must be continuously re-estimated. We test our method, adaptive conformal inference, on two real world datasets and find that its predictions are robust to visible and significant distribution shifts. 1 Introduction Machine learning algorithms are increasingly being employed in high stakes decision making processes. For instance, deep neural networks are currently being used in self-driving cars to detect nearby objects [2] and parole decisions are being made with the assistance of complex models that combine over a hundred features [1]. As the popularity of black box methods and the cost of making wrong decisions grow it is crucial that we develop tools to quantify the uncertainty of their predictions. In this paper we develop methods for constructing prediction sets that are guaranteed to contain the target label with high probability. We focus specifically on an online learning setting in which we observe covariate-response pairs {(Xt, Yt)}t2N Rd R in a sequential fashion. At each time step t 2 N we are tasked with using the previously observed data {(Xr, Yr)}1 r t 1 along with the new covariates, Xt, to form a prediction set ˆCt for Yt. Then, given a target coverage level 2 (0, 1) our generic goal is to guarantee that Yt belongs to ˆCt at least 100(1 )% of the time. Perhaps the most powerful and flexible tools for solving this problem come from conformal inference [see e.g. 34, 16, 32, 22, 31, 15, 3] . This framework provides a generic methodology for transforming the outputs of any black box prediction algorithm into a prediction set. The generality of this approach has facilitated the development of a large suite of conformal methods, each specialized to a specific prediction problem of interest [e.g. 30, 11, 23, 8, 24, 21]. With only minor exceptions all of these algorithms share the same common guarantee that if the training and test data are exchangeable, then the prediction set has valid marginal coverage P(Yt 2 ˆCt) = 1 . While exchangeability is a common assumption, there are many real-world applications in which we do not expect the marginal distribution of (Xt, Yt) to be stationary. For example, in finance and economics market behaviour can shift drastically in response to new legislation or major world 35th Conference on Neural Information Processing Systems (Neur IPS 2021). events. Alternatively, the distribution of (Xt, Yt) may change as we deploy our prediction model in new environments. This paper develops adaptive conformal inference (ACI), a method for forming prediction sets that are robust to changes in the marginal distribution of the data. Our approach is both simple, in that it requires only the tracking of a single parameter that models the shift, and general as it can be combined with any modern machine learning algorithm that produces point predictions or estimated quantiles for the response. We show that over long time intervals ACI achieves the target coverage frequency without any assumptions on the data-generating distribution. Moreover, when the distribution shift is small and the prediction algorithm takes a certain simple form we show that ACI will additionally obtain approximate marginal coverage at most time steps. 1.1 Conformal inference Suppose we are given a fitted regression model for predicting the value of Y from X. Let y be a candidate value for Yt. To determine if y is a reasonable estimate of Yt, we define a conformity score S(X, Y ) that measures how well the value y conforms with the predictions of our fitted model. For example, if our regression model produces point predictions ˆµ(X) then we could use a conformity score that measures the distance between ˆµ(Xt) and y. One such example is S(Xt, y) = |ˆµ(Xt) y|. Alternatively, suppose our regression model outputs estimates ˆq(X; p) of the pth quantile of the distribution of Y |X. Then, we could use the method of conformal quantile regression (CQR) [28], which examines the signed distance between y and fitted upper and lower quantiles through the score S(Xt, y) = max{ˆq(Xt; /2) y, y ˆq(Xt; 1 /2)}. Regardless of what conformity score is chosen the key issue is to determine how small S(Xt, y) should be in order to accept y as a reasonable prediction for Yt. Assume we have a calibration set Dcal {(Xr, Yr)}1 r t 1 that is different from the data that was used to fit the regression model. Using this calibration set we define the fitted quantiles of the conformity scores to be ˆQ(p) := inf (Xr,Yr)2Dcal {S(Xr,Yr) s} and say that y is a reasonable prediction for Yt if S(Xt, y) ˆQ(1 ). The crucial observation is that if the data Dcal [ {(Xt, Yt)} are exchangeable and we break ties uniformly at random then the rank of S(Xt, Yt) amongst the points {S(Xr, Yr)}(Xr,Yr)2Dcal [ {S(Xt, Yt)} will be uniform. Therefore, P(S(Xt, Yt) ˆQ(1 )) = d|Dcal|(1 )e |Dcal| + 1 . Thus, defining our prediction set to be ˆCt := {y : S(Xt, y) ˆQ(1 )} gives the marginal coverage guarantee P(Yt 2 ˆCt) = P(S(Xt, Yt) ˆQ(1 )) = d|Dcal|(1 )e |Dcal| + 1 . By introducing additional randomization this generic procedure can be altered slightly to produce a set ˆCt that satisfies the exact marginal coverage guarantee P(Yt 2 ˆCt) = 1 [34]. For the purposes of this paper this adjustment is not critical and so we omit the details here. Additionally, we remark that the method outlined above is often referred to as split or inductive conformal inference [27, 34, 26]. This refers to the fact that we have split the observed data between a training set used to fit the regression model and a withheld calibration set. The adaptive conformal inference method developed in this article can also be easily adjusted to work with full conformal inference in which data splitting is avoided at the cost of greater computational resources [34]. 2 Adapting conformal inference to distribution shifts Up until this point we have been working with a single score function S( ) and quantile function ˆQ( ). In the general case where the distribution of the data is shifting over time both these functions should be regularly re-estimated to align with the most recent observations. Therefore, we assume that at each time t we are given a fitted score function St( ) and corresponding quantile function ˆQt( ). We define the realized miscoverage rate of the prediction set ˆCt( ) := {y : St(Xt, y) ˆQt(1 )} as Mt( ) := P(St(Xt, Yt) > ˆQt(1 )), where the probability is over the test point (Xt, Yt) as well as the data used to fit St( ) and ˆQt( ). Now, since the distribution generating the data is non-stationary we do not expect Mt( ) to be equal, or even close to, . Even so, we can still postulate that if the conformity scores used to fit ˆQt( ) cover the bulk of the distribution of St(Xt, Yt) then there may be an alternative value t 2 [0, 1] such that Mt( t ) = . More rigorously, assume that with probability one, ˆQt( ) is continuous, non-decreasing and such that ˆQt(0) = 1 and ˆQt(1) = 1. This does not hold for the split conformal quantile functions defined in (1), but in the case where there are no ties amongst the conformity scores we can adjust our definition to guarantee this by smoothing over the jump discontinuities in ˆQ( ). Then, Mt( ) will be non-decreasing on [0, 1] with Mt(0) = 0 and Mt(1) = 1 and so we may define t := sup{β 2 [0, 1] : Mt(β) }. Moreover, if we additionally assume that P(St(Xt, Yt) = ˆQt(1 then we will have that Mt( t ) = . So, in particular we find that by correctly calibrating the argument to ˆQt( ) we can achieve either approximate or exact marginal coverage. To perform this calibration we will use a simple online update. This update proceeds by examining the empirical miscoverage frequency of the previous prediction sets and then decreasing (resp. increasing) our estimate of t if the prediction sets were historically under-covering (resp. over-covering) Yt. In particular, let 1 denote our initial estimate (in our experiments we will choose 1 = ). Recursively define the sequence of miscoverage events 1, if Yt /2 ˆCt( t), 0, otherwise, where ˆCt( t) := {y : St(Xt, y) ˆQt(1 t)}. Then, fixing a step size parameter γ > 0 we consider the simple online update t+1 := t + γ( errt). (2) We refer to this algorithm as adaptive conformal inference. Here, errt plays the role of our estimate of the historical miscoverage frequency. A natural alternative to this is the update t+1 = t + γ where {ws}1 s t [0, 1] is a sequence of increasing weights with Pt s=1 ws = 1. This update has the appeal of more directly evaluating the recent empirical miscoverage frequency when deciding whether or not to lower or raise t. In practice, we find that (2) and (3) produce almost identical results. For example, in Section A.3 in the Appendix we show some sample trajectories for t obtained using the update (3) with ws := 0.95t s Pt s0=1 0.95t s0 . We find that these trajectories are very similar to those produced by (2). The main difference is that the trajectories obtained with (3) are smoother with less local variation in t. In the remainder of this article we will focus on (2) for simplicity. 2.1 Choosing the step size The choice of γ gives a tradeoff between adaptability and stability. While raising the value of γ will make the method more adaptive to observed distribution shifts, it will also induce greater volatility in the value of t. In practice, large fluctuations in t may be undesirable as it allows the method to oscillate between outputting small conservative and large anti-conservative prediction sets. In Theorem 4.2 we give an upper bound on (Mt( t) )2 that is optimized by choosing γ proportional to t |. While not directly applicable in practice, this result supports the intuition that in environments with greater distributional shift the algorithm needs to be more adapatable and thus γ should be chosen to be larger. In our experiments we will take γ = 0.005. This value was chosen because it was found to give relatively stable trajectories for t while still being sufficiently large as to allow t to adapt to observed shifts. In agreement with the general principles outlined above we found that larger values of γ also successfully protect against distribution shifts, while taking γ to be too small causes adaptive conformal inference to perform similar to non-adaptive methods that hold t = constant across time. 2.2 Real data example: predicting market volatility We apply ACI to the prediction of market volatility. Let {Pt}1 t T denote a sequence of daily open prices for a stock. For all t 2, define the return Rt := (Pt Pt 1)/Pt 1 and realized volatility Vt = R2 t . Our goal is to use the previously observed returns Xt := {Rs}1 s t 1 to form prediction sets for Yt := Vt. More sophisticated financial models might augment Xt with additional market covariates (available to the analyst at time t 1). As the primary purpose of this section is to illustrate adaptive conformal inference we work with only a simple prediction method. We start off by forming point predictions using a GARCH(1,1) model [4]. This method assumes that Rt = σt t with 2, . . . , T taken to be i.i.d. N(0, 1) and σt satisfying the recursive update t = ! + Vt 1 + βσ2 This is a common approach used for forecasting volatility in economics. In practice, shifting market dynamics can cause the predictions of this model to become inaccurate over large time periods. Thus, when forming point predictions we fit the model using only the last 1250 trading days (i.e. approximately 5 years) of market data. More precisely, for all times t > 1250 we fit the coefficients ˆ!t, ˆ t, ˆβt as well as the sequence of variances {ˆσt s}1 s t 1 using only the data {Rr}t 1250 r ˆQ(1 )) . The difference between our approach and theirs is twofold. First, while they fix a single conservative value our methods aim to estimate the optimal choice satisfying M( ) = . This is not possible in the setting of [10] as they do not observe any data from which the size of the distribution shift can be estimated. Second, while they consider only one training and one testing distribution we work in a fully online setting in which the distribution is allowed to shift continuously over time. 4 Coverage guarantees 4.1 Distribution-free results In this section we outline the theoretical coverage guarantees of adaptive conformal inference. We will assume throughout that with probability one 1 2 [0, 1] and ˆQt is non-decreasing with ˆQt(x) = 1 for all x < 0 and ˆQt(x) = 1 for all x > 1. Our first result shows that over long time intervals adaptive conformal inference obtains the correct coverage frequency irrespective of any assumptions on the data-generating distribution. Lemma 4.1 With probability one we have that 8t 2 N, t 2 [ γ, 1 + γ]. Proof: Assume by contradiction that with positive probability { t}t2N is such that inft t < γ (the case where supt t > 1 + γ is identical). Note that supt | t+1 t| = supt γ| errt| < γ. Thus, with positive probability we may find t 2 N such that t < 0 and t+1 < t. However, t < 0 =) ˆQt(1 t) = 1 =) errt = 0 =) t+1 = t + γ( errt) t and thus P(9t such that t+1 < t < 0) = 0. We have reached a contradiction. Proposition 4.1 With probability one we have that for all T 2 N, 44444 44444 max{ 1, 1 1} + γ In particular, lim T !1 1 Proof: By expanding the recursion defined in (2) and applying Lemma 4.1 we find that [ γ, 1 + γ] 3 T +1 = 1 + Rearranging this gives the result. Proposition 4.1 puts no constraints on the data generating distribution. One may immediately ask whether these results can be improved by making mild assumptions on the distribution shifts. We argue that without assumptions on the quality of the initialization the answer to this question is negative. To understand this, consider a setting in which there is a single fixed optimal target 2 [0, 1] and assume that Mt(p) = M(p) = 1 (p ), if p > , + (p ) if p . . Suppose additionally that E[errt| t] = M( t).1 In order to simplify the calculations consider the noiseless update t+1 = t + γ( M( t)) = t + γ( E[errt| t]). Intuitively, the noiseless update can be viewed as the average case behaviour of (2). Now, for any initialization 1 and any γ min{ 1 } there exists a constant c 2 { 1 } such that for all t, M( t) = c( t ). So, we have that E[errt] = c E[ t ] = c E[ t 1 + γ( Mt 1( t 1)) ] = c(1 cγ)E[ t 1 ]. Repeating this calculation recursively gives that E[errt] = c(1 cγ)t 1E[ 1 ] = c(1 cγ)t 1( 1 ), and thus, 44444 44444 = 1 (1 cγ)T The comparison of this bound to (5) is self-evident. The main difference is that we have replaced max{1 1, 1} with | 1 |. This arises from the fact that 2 (0, 1) is arbitrary and thus max{1 1, 1} is the best possible upper bound on | 1 |. So, we view Proposition 4.1 as both an agnostic guarantee that shows that our method gives the correct long-term empirical coverage frequency irrespective of the true data generating process, and as an approximately tight bound on the worst-case behaviour immediately after initialization. 1This last assumption is in general only true if ˆQt( ) and t are fit independently of one another. 4.2 Performance in a hidden Markov model Although we believe Proposition 4.1 is an approximately tight characterization of the behaviour after initialization, we can still ask whether better bounds can be obtained for large time steps. In this section we answer this question positively by showing that if 1 is initialized appropriately and the distribution shift is small, then tighter coverage guarantees can be given. In order to obtain useful results we will make some simplifying assumptions about the data generating process. While we do not expect these assumptions to hold exactly in any real-world setting, we do consider our results to be representative of the true behaviour of adaptive conformal inference and we expect similar results to hold under alternative models. 4.2.1 Setting We model the data as coming from a hidden Markov model. In particular, we let {At}t2N A denote the underlying Markov chain for the environment and we assume that conditional on {At}t2N, {(Xt, Yt)}t2N is an independent sequence with (Xt, Yt) PAt for some collection of distributions {Pa : a 2 A}. In order to simplify our calculations, we assume additionally that the estimated quantile function ˆQt( ) and score function St( ) do not depend on t and we denote them by ˆQ( ) and S( ). This occurs for example in the split conformal setting with fixed training and calibration sets. In this setting, {( t, At)}t2N forms a Markov chain on [ γ, 1 + γ] A. We assume that this chain has a unique stationary distribution and that ( 1, A1) . This implies that ( t, At, errt) is a stationary process and thus will greatly simplify our characterization of the behaviour of errt. While there is little doubt that the theory can be extended, recall our that main goal is to get useful and simple results. That said, what we really have in mind here is that {At}t2N is sufficiently wellbehaved to guarantee that ( t, At) has a limiting stationary distribution. In Section A.5 we give an example where this is indeed provably the case. Lastly, the assumption that ( 1, A1) is essentially equivalent to assuming that we have been running the algorithm for long enough to exit the initialization phase described in Section 4.1. 4.2.2 Large deviation bound for the errors Our first observation is that errt has the correct average value. More precisely, by Proposition 4.1 we have that lim T !1 T 1 PT a.s. = and since errt is stationary it follows that E[errt] = . Thus, to understand the deviation of T 1 PT t=1 errt from we simply need to characterize the dependence structure of {errt}t2N. We accomplish this in Theorem 4.1, which gives a large deviation bound on |T 1 PT t=1 errt |. The idea behind this result is to decompose the dependence in {errt}t2N into two parts. First, there is dependence due to the fact that t is a function of {errr}1 r t 1. In Section A.7 in the Appendix we argue that this dependence induces a negative correlation and thus the errors concentrate around their expectation at a rate no slower than that of an i.i.d. Bernoulli sequence. This gives rise to the first term in (6), which is what would be obtained by applying Hoeffding s inequality to an i.i.d. sequence. Second, there is dependence due to the fact that At depends on At 1. More specifically, consider a setting in which the distribution of Y |X has more variability in some states than others. The goal of adaptive conformal inference is to adapt to the level of variability and thus return larger prediction sets in states where the distribution of Y |X is more spread. However, this algorithm is not perfect and as a result there may be some states a 2 A in which E[errt|At = a] is biased away from . Furthermore, if the environment tends to spend long stretches of time in more variable (or less variable) states this will induce a positive dependence in the errors and cause T 1 PT t=1 errt to deviate from . To control this dependence we use a Bernstein inequality for Markov chains to bound |T 1 PT t=1 E[errt|At] |. This gives rise to the second term in (6). Theorem 4.1 Assume that {At}t2N has non-zero absolute spectral gap 1 > 0. Let |E[errt|At = a] | and σ2 B := E[(E[errt|At] )2]. A formal proof of this result can be found in Section A.7. The quality of this concentration inequality will depend critically on the size of the bias terms B and σ2 B. Before proceeding, it is important that we emphasize that the definitions of B and σ2 B are independent of the choice of t owing to the fact that ( t, At, errt) is assumed stationary. Now, to understand these quantities, let M(p|a) := P(S(Xt, Yt) > ˆQ(1 p)|At = a) denote the realized miscoverage level in state a 2 A obtained by the quantile ˆQ(1 p). Assume that M(p|a) is continuous. This will happen for example when ˆQ( ) is continuous and S(Xt, Yt)|At = a is continuously distributed. Then, there exists an optimal value a such that M( a|a) = . Lemma A.4 in the Appendix shows that if in addition M( |a) admits a second order Taylor expansion, then γ + γ 1 sup 44At+k = a] Here, the constant C will depend on how much M( |a) differs from the ideal case in which ˆQ( ) is the true quantile function for S(Xt, Yt)|At = a. In this case we would have that M( |a) is the linear function M(p|a) = p, 8p 2 [0, 1] and C 2. We remark that the term E[| 44At+k = a] can be seen as a quantitative measurement of the size of the distribution shift in terms of the change in the critical value a. Thus, we interpret these results as showing that if the distribution shift is small and 8a 2 A, ˆQ( ) gives reasonable coverage of the distribution of S(Xt, Yt)|At = a, then T 1 PT t=1 errt will concentrate well around . 4.2.3 Achieving approximate marginal coverage Theorem 4.1 bounds the distance between the average miscoverage rate and the target level over long stretches of time. On the other hand, it provides no information about the marginal coverage frequency at a single time step. The following result shows that if the distribution shift is small, the realized marginal coverage rate M( t|At) will be close to on average. Theorem 4.2 Assume that there exists a constant L > 0 such that for all a 2 A and all 1, 2 2 R, |M( 2|a) M( 1|a)| L| 2 1|. Assume additionally that for all a 2 A there exists a 2 (0, 1) such that M( a|a) = . Then, E[(M( t|At) )2] L(1 + γ) Once again we emphasize that (7) holds for any choice of t owing to the fact that ( t, At, errt) is assumed stationary and thus the quantities appearing in the bound are invariant across t. Proof of this result can be found in Section A.8 of the Appendix. We remark that the right-hand side of (7) is minimized by choosing γ = (2E[| At|])1/2, which gives the inequality E[(M( t|At) )2] L( As above we have that in the ideal case ˆQ( ) is a perfect estimate of the quantiles of S(Xt, Yt)|At = a and thus M(p|a) = p and L = 1. Moreover, we once again have the interpretation that E[| At|] is a quantitative measurement of the distribution shift. Thus, this result can be interpreted as bounding the average difference between the realized and target marginal coverage in terms of the size of the underlying distribution shift. Finally, note that the choice γ = (2E[| formalizes our intuition that γ should be chosen to be larger in domains with greater distribution shift, while not being so large as to cause t to be overly volatile. 5 Impact of St St St( ) on the performance The performance of all conformal inference methods depends heavily on the design of the conformity score. Previous work has shown how carefully chosen scores or even explicit optimization of the Adaptive Alpha Fixed Alpha Bernoulli Sequence Local Coverage Level Black Berry 2005 2010 2015 2020 Local Coverage Level 2000 2005 2010 2015 2020 Time Figure 2: Local coverage frequencies for adaptive conformal (blue), a non-adaptive method that holds t = fixed (red), and an i.i.d. Bernoulli(0.1) sequence (grey) for the prediction of stock market volatility with conformity score St. The coloured dotted lines mark the average coverage obtained across all time points, while the black line indicates the target level of 1 = 0.9. interval width can be used to obtain smaller prediction sets [e.g. 28, 29, 20, 12]. Adaptive conformal inference can work with any conformity score St( ) and quantile function ˆQt( ) and thus can be directly combined with other improvements in conformal inference to obtain shorter intervals. One important caveat here is that the lengths of conformal prediction sets depend directly on the quality of the fitted regression model. Thus, to obtain smaller intervals one should re-fit the model at each time step using the most recent data to build the most accurate predictions. This is exactly what we have done in our experiments in Sections 2.2 and 6. In addition to this, the choice of St( ) can also have a direct effect on the coverage properties of adaptive conformal inference. Theorems 4.1 and 4.2 show that the performance of adaptive conformal inference is controlled by the size of the shift in the optimal parameter t across time. Moreover, t itself is in one-to-one correspondence with the 1 quantile of St(Xt, Yt). Thus, the coverage properties of adaptive conformal inference depend on how close St(Xt, Yt) is to being stationary. For a simple example illustrating the impact of this dependence, note that in Section 2.2 we formed prediction sets using the conformity score St := |Vt ˆσ2 An a priori reasonable alternative to this is the unnormalized score St := |Vt ˆσ2 t |. However, after a more careful examination it becomes unsurprising that normalization by ˆσ2 t is critical for obtaining an approximately stationary conformity score and thus St leads to much worse coverage properties. Figure 2 shows the local coverage frequency (see (4)) of adaptive conformal inference using St. In comparison to Figure 1 the coverage now undergoes much wider swings away from the target level of 0.9. This issue can be partially mitigated by choosing a larger value of γ that gives greater adaptivity to the algorithm. 6 Real data example: election night predictions During the 2020 US presidential election The Washington Post used conformalized quantile regression (CQR) (see (1) and Section 1.1) to produce county level predictions of the vote total on election night [13]. Here we replicate the core elements of this method using both fixed and adaptive quantiles. Adaptive Alpha Fixed Alpha Bernoulli Sequence 500 1000 1500 2000 Time Local Coverage Level Figure 3: Local coverage frequencies of adaptive conformal (blue), a non-adaptive method that holds t = fixed (red), and an i.i.d. Bernoulli(0.1) sequence (grey) for county-level election predictions. Coloured dotted lines show the average coverage across all time points, while the black line indicates the target coverage level of 1 = 0.9. To make the setting precise, let {Yt}1 t T denote the number of votes cast for presidential candidate Joe Biden in the 2020 election in each of approximately T = 3000 counties in the United States. Let Xt denote a set of demographic covariates associated to the tth county. In our experiment Xt will include information on the make-up of the county population by ethnicity, age, sex, median income and education (see Section A.6.1 for details). On election night county vote totals were observed as soon as the vote count was completed. If the order in which vote counts completed was uniformly random {(Xt, Yt)}1 t T would be an exchangeable sequence on which we could run standard conformal inference methods. In reality, larger urban counties tend to report results later than smaller rural counties and counties on the east coast of the US report earlier than those on the west coast. Thus, the distribution of (Xt, Yt) can be viewed as drifting throughout election night. We apply CQR to predict the county-level vote totals (see Section A.6.2 for details). To replicate the east to west coast bias observed on election night we order the counties by their time zone with eastern time counties appearing first and Hawaiian counties appearing last. Within each time zone counties are ordered uniformly at random. Figure 3 shows the realized local coverage frequency over the most recent 300 counties (see (4)) for the non-adaptive and adaptive conformal methods. We find that the non-adaptive method fails to maintain the desired 90% coverage level, incurring large troughs in its coverage frequency during time zone changes. On the other hand, the adaptive method maintains approximate 90% coverage across all time points with deviations in its local coverage level comparable to what is observed in Bernoulli sequences. 7 Discussion There are still many open problems in this area. The methods we develop are specific to cases where Yt is revealed at each time point. However, there are many settings in which we receive the response in a delayed fashion or in large batches. In addition, our theoretical results in Section 4.2 are limited to a single model for the data generating distribution and the special case where the quantile function ˆQt( ) is fixed across time. It would be interesting to determine if similar results can be obtained in settings where ˆQt( ) is fit in an online fashion on the most recent data. Another potential area for improvement is in the choice of the step size γ. In Section 2.1 we give some heuristic guidelines for choosing γ based on the size of the distribution shift in the environment. Ideally however we would like to be able to determine γ adaptively without prior knowledge. Finally, our experimental results are limited to just two domains. Additional work is needed to determine if our methods can successfully protect against a wider variety of real-world distribution shifts. 8 Acknowledgements E.C. was supported by Office of Naval Research grant N00014-20-12157, by the National Science Foundation grants OAC 1934578 and DMS 2032014, by the Army Research Office (ARO) under grant W911NF-17-1-0304, and by the Simons Foundation under award 814641. We thank John Cherian for valuable discussions related to Presidential Election Night 2020 and Lihua Lei for helpful comments on the connection to online learning. [1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There s software used across the country to predict future criminals. and it s biased against blacks, 2016. URL https://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing. [2] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B. Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M. Paixão, Filipe Mutz, Lucas de Paula Veronese, Thiago Oliveira-Santos, and Alberto F. De Souza. Self-driving cars: A survey. Expert Systems with Applications, 165:113816, 2021. ISSN 0957-4174. doi: https: //doi.org/10.1016/j.eswa.2020.113816. URL https://www.sciencedirect.com/science/ article/pii/S095741742030628X. [3] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486 507, 2021. doi: 10.1214/ 20-AOS1965. URL https://doi.org/10.1214/20-AOS1965. [4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307 327, 1986. ISSN 0304-4076. doi: https://doi.org/10.1016/ 0304-4076(86)90063-1. URL https://www.sciencedirect.com/science/article/ pii/0304407686900631. [5] United States Census Bureau. County characteristics resident population estimates, 2019. data retrieved from https://www.census.gov/data/tables/time-series/demo/popest/ 2010s-counties-detail.html. [6] United States Census Bureau. 2015-2019 american community survey 5-year average county- level estimates, 2019. data retrieved from https://www.ers.usda.gov/data-products/ county-level-data-sets/download-data/. [7] United States Census Bureau. Small area income and poverty estimates: 2019, 2019. data retrieved from https://www.ers.usda.gov/data-products/ county-level-data-sets/download-data/. [8] Emmanuel J. Candès, Lihua Lei, and Zhimei Ren. Conformalized survival analysis. ar Xiv preprint, 2021. ar Xiv:2103.09763. [9] X. Cao, J. Zhang, and H. V. Poor. On the time-varying distributions of online stochastic optimization. In 2019 American Control Conference (ACC), pages 1494 1500, 2019. doi: 10.23919/ACC.2019.8814889. [10] Maxime Cauchois, Suyash Gupta, Alnur Ali, and John C. Duchi. Robust validation: Confident predictions even when distributions shift. ar Xiv preprint, 2020. ar Xiv:2008.04267. [11] Maxime Cauchois, Suyash Gupta, and John C. Duchi. Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1 42, 2021. URL http://jmlr.org/papers/v22/20-753.html. [12] Haoxian Chen, Ziyi Huang, Henry Lam, Huajie Qian, and Haofeng Zhang. Learning prediction intervals for regression: Generalization and calibration. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 820 828. PMLR, 13 15 Apr 2021. [13] John Cherian and Lenny Bronner. How the washington post estimates outstanding votes for the 2020 presidential election. https://elex-models-prod.s3.amazonaws.com/ 2020-general/write-up/election_model_writeup.pdf, 2020. [14] MIT Election Data and Science Lab. County Presidential Election Returns 2000-2016, 2018. URL https://doi.org/10.7910/DVN/VOQCHQ. [15] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 08 2020. ISSN 2049-8772. doi: 10.1093/imaiai/iaaa017. URL https: //doi.org/10.1093/imaiai/iaaa017. iaaa017. [16] Alexander Gammerman and Vladimir Vovk. Hedging predictions in machine learning. The Computer Journal, 50(2):151 163, 2007. doi: 10.1093/comjnl/bxl065. [17] Elad Hazan. Introduction to online convex optimization. ar Xiv preprint, 2019. ar Xiv:1909.05207. [18] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. doi: 10.1080/01621459.1963. 10500830. [19] Bai Jiang, Qiang Sun, and Jianqing Fan. Bernstein s inequality for general markov chains. ar Xiv preprint, 2020. ar Xiv:1805.10721. [20] Danijel Kivaranovic, Kory D. Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 4346 4356. PMLR, 26 28 Aug 2020. [21] Danijel Kivaranovic, Robin Ristl, Martin Posch, and Hannes Leeb. Conformal prediction intervals for the individual treatment effect. ar Xiv preprint, 2020. ar Xiv:2006.01474. [22] Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 01 2014. doi: 10.1111/rssb.12021. [23] Jing Lei, Max G Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111, 2018. doi: 10.1080/01621459.2017.1307116. URL https://doi.org/10.1080/01621459.2017.1307116. [24] Lihua Lei and Emmanuel J. Candès. Conformal inference of counterfactuals and individual treatment effects. ar Xiv preprint, 2021. ar Xiv:2006.06138. [25] David Leip. Dave leip s atlas of u.s. presidential elections., 2020. http://uselectionatlas. org. [26] Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural net- works. In Tools in Artificial Intelligence,, pages 315 330, 2008. [27] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen, editors, Machine Learning: ECML 2002, pages 345 356, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-36755-0. [28] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 5103c3584b063c431bd1268e9b5e76fb-Paper.pdf. [29] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, volume 33, pages 3581 3591, 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 244edd7e85dc81602b7615cd705545f5-Paper.pdf. [30] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classification with valid and adap- tive coverage. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3581 3591. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/ file/244edd7e85dc81602b7615cd705545f5-Paper.pdf. [31] Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223 234, 2019. doi: 10.1080/01621459.2017.1395341. URL https://doi.org/10.1080/01621459.2017. 1395341. [32] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(12):371 421, 2008. URL http://jmlr.org/papers/v9/shafer08a. html. [33] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/ file/8fb21ee7a2207526da55a679f0332de2-Paper.pdf. [34] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522. [35] Jingyi Zhu and James Spall. Tracking capability of stochastic gradient algorithm with constant gain. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4522 4527, 12 2016. doi: 10.1109/CDC.2016.7798957.