# online_conformal_prediction_via_online_optimization__52560847.pdf Online Conformal Prediction via Online Optimization Felipe Areces * 1 Christopher Mohri * 2 Tatsunori Hashimoto 2 John Duchi 1 3 We introduce a family of algorithms for online conformal prediction with coverage guarantees for both adversarial and stochastic data. In the adversarial setting, we establish the standard guarantee: over time, a pre-specified target fraction of confidence sets cover the ground truth. For stochastic data, we provide a guarantee at every time instead of just on average over time: the probability that a confidence set covers the ground truth conditioned on past observations converges to a pre-specified target when the conditional quantiles of the errors are a linear function of past data. Complementary to our theory, our experiments spanning over 15 datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data s dependence, even when baselines are tuned on the test set. We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a 10% reduction in confidence set sizes, a more than a 30% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all 78 out of 78 pre-registered hypotheses. We provide documentation for the pypi package implementing our algorithms here: https: //conformalopt.readthedocs.io/. 1. Introduction It would be unreasonable to presume that a predictive model would maintain its accuracy eternally populations change, new scenarios arise, individuals modify behavior so seek- *Equal contribution 1Department of Electrical Engineering, Stanford University, Stanford, USA 2Department of Computer Science, Stanford University, Stanford, USA 3Department of Statistics, Stanford University, Stanford, USA. Correspondence to: Felipe Areces , Christopher Mohri . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). ing a predictor that can always guarantee its accuracy is no more than a snipe hunt. Instead, however, we might seek an appropriate (and dynamic) level of confidence in our predictions, even in the face of changing data processes. Even more, we ought to develop procedures that can provide such guarantees in arbitrary settings, so that we do not rely on (likely invalid) modeling assumptions, but which can in fact do the right thing, adapting to underlying structure when it exists. To address these desiderata, this work adopts and adapts online conformal prediction, providing predictive confidence sets for which, over time, a pre-specified fraction covers the ground truth, even if data is adversarial. Current algorithms for online conformal prediction resemble bang-bang controllers from control theory, which switch between two extreme states to achieve a desired behavior. The first algorithm to propose this approach, Gibbs & Cand es s Adaptive Conformal Inference (ACI) 2021, constructs confidence sets using a single parameter that governs their level of conservativeness. The update rule for this parameter is simple: on an error, increase the parameter by a constant, and otherwise, decrease it by (another) constant, inducing bang-bang behavior. Existing theoretical analyses for ACI suggest making these switches as aggressive as possible to obtain the tightest coverage. Several works have extended ACI by adaptively tuning the magnitude of these updates (Gibbs & Cand es, 2024; Zaffran et al., 2022; Bhatnagar et al., 2023) or with more sophisticated update rules that, that for example, incorporate other aspects from control (Angelopoulos et al., 2023; Yang et al., 2024), but all fundamentally share the same bang-bang backbone. This purely adversarial approach overlooks the possibility of achieving stronger guarantees under more predictable data. In this work, we show that directly modeling the distribution of errors can provide both new theoretical guarantees and empirical benefits. Our key observation is that we can develop algorithms that not only satisfy the standard adversarial guarantees, but also achieve time-conditional coverage guarantees for stochastic data by solving a stochastic optimization problem. These time-conditional coverage guarantees ensure that we eventually achieve the target coverage level at all times, instead of only when averaged across time. While we are not the first to depart from purely adversarial guarantees past work considers the i.i.d. setting (Angelopoulos et al., 2024) we provide guarantees for a Online Conformal Prediction via Online Optimization general class of standard stochastic processes that exhibit dependence across time, such as autoregressive processes. Our theoretical results show the benefits of this approach across three main settings. In the first, we show the standard adversarial guarantee that over time, a pre-specified fraction of our prediction sets contain the true label. That is, in the worst-case setting where an adversary hand-picks the data, we still satisfy the same general guarantees as in past work. In the other two, we consider stochastic data. We first prove that when our model for the underlying errors is well-specified, we get a time-conditional coverage guarantee. The byproducts of these guarantees are new theoretical results for strongly convex stochastic optimization, where we develop convergence guarantees for stochastic gradient methods that operate with dependent data. In the second setting, where our model is mis-specified but an ergodic stochastic process generates the data, we show the convergence of our algorithms to the minimizer of an expected loss with respect to the stationary data distribution. Here, we extend past results on convex optimization with ergodic data (Duchi et al., 2012; Agarwal & Duchi, 2013) to last-iterate strongly convex settings. In addition to our theory, we empirically evaluate our proposed algorithms alongside existing ones. We first use several standard datasets in the online conformal literature, where we control the marginal coverage of all algorithms to make fair comparisons. In the spirit of truly testing our proposed algorithms, we also conduct a pre-registered experiment on a period of electricity consumption data in Texas. There, we record our hyperparameter choices and experiment design before the data is available. Across all 18 datasets, our algorithms never perform discernibly worse than baselines (even after tuning baseline hyperparameters on the test set), and consistently outperform them whenever the data exhibits strong linear dependence (which we measure in terms of autocorrelation of the first lag). We summarize our contributions and the organization of the paper below. Section 3 presents a new algorithm for online conformal prediction, allowing for arbitrary parametric online conformal predictors beyond scalars. Section 4 develops adversarial guarantees, showing it always achieves long-run coverage (1). Sections 5 and 6 develop stochastic guarantees, showing it attains conditional coverage when well-specified and relaxed conditional coverage when mis-specified. Section 7 contains experiments on 17 datasets with a pre-registration to compare our method with existing alternatives and provide evidence of its empirical benefits. We defer all proofs to Appendix C. 2. Preliminaries The goal of online conformal prediction is to generate confidence sets in an online fashion. In this setting, we sequentially observe covariates Xt X and wish to construct confidence sets Ct : X Y for the labels Yt Y. We construct these confidence sets from a score function st : X Y R, which may be learned and updated over time. The score function defines Ct(Xt) via the level sets Ct(Xt) := {y Y : st(Xt, y) qt}, for some threshold qt R. The conformal confidence set makes an error at time t if the conformal score St := st(Xt, Yt), exceeds the threshold qt, which we formally define as errt := 1qt 0 almost surely, and η-bounded if for a learning rate schedule η = {ηt}t and any θ Θ , P t=1 ηt|q t qt(θ )| < almost surely. Many of our results apply to conformal predictors qt( ) with a bias term. In this case, for θ Θ Rd 1 and c R such that θ = [ θ , c] and all t N, there exists ht : X t 1 Yt 1 Θ R satisfying qt(θ) = ht( θ) + c. We often also require these predictors to be bounded so that Θ { θ Rd 1 : t N, |ht( θ)| Kq} for Kq (0, ). In our experiments, we define conformal predictors using feature maps Φt : X t 1 Yt 1 Rd producing conformal covariates Zt = Φt(Xt 1 1 , Y t 1 1 ), so that qt(θ) = Z t θ; these are bounded if Zt and θ have bounded Euclidean norm . Algorithm 1 Batched projected online gradient descent Input: θ1 Rd, batch size m, convex feasible set Θ, learning rate schedule {ηb} for b = 1 to B do for i = 1 to m do Predict threshold qb,i(θb) Observe score Sb,i Compute gradient ˆgb,i ℓb,i(θb) end for Update θb+1 ΠΘ θb ηb 1 m Pm i=1 ˆgb,i 3. Algorithms Our online conformal predictors serve as estimators for the conditional quantiles q t , so the core idea of our method is to optimize their parameters via online gradient descent on the quantile loss ℓt(θ) := ℓquantile(St qt(θ)) so as to approach the the minimizers of its conditional expectation Lt(θ) := E[ℓt(θ) | Ft 1]. This procedure follows similar algorithmic principles as ACI but allows for arbitrary parametric online conformal predictors beyond scalars. We present the framework of our proposed method in Algorithm 1, which performs a gradient descent step on the conformal predictors parameter θ Θ after observing each batch of data. To simplify notation regarding batches, we use t to represent time t, and the tuple (b, i) to denote the i th sample in the b th batch. Our batch size is m, so time (b, i) is equivalent to time (b 1)m + i. When subscript b appears on its own, it refers to a quantity that is uniform across batch b. In the following examples, we instantiate some special cases of conformal predictors that can be used with Algorithm 1. Example 3.1. Scalar quantile tracking (SQT) (Angelopoulos et al., 2023): Choose Θ = R and This model is well-specified when {St}t are i.i.d. or all conditional quantiles q t are equal. Example 3.2. Scalar quantile tracking (SQT) + scorecaster ˆqt: Choose Θ = R and qt(θ) = θ + ˆqt. This model is asymptotically well-specified when (ˆqt q t ) converges to a constant and fully well-specified when (ˆqt q t ) is constant across all time. Since the conditional quantiles q t are not necessarily the same for all t as in Example 3.1, the scorecaster aims to predict the time dependence in the conditional quantile so that the differences (ˆqt q t ) are constant. In our experiment section, the model SQT+AR is of this form and fits ˆqt as a quantile estimator for autoregressive (AR) models. Online Conformal Prediction via Online Optimization Example 3.3. Linear quantile tracking (LQT) of order p: Choose Θ = Rp+1, feature map Φt(Xt 1 1 , Y t 1 1 ) = (St 1, , St p, 1), and qt(θ) = θ Φt(Xt 1 1 , Y t 1 1 ). This model is well-specified for AR processes of order at most p, and can be extended to richer feature vectors. Our experiments focus on the models described above, but our framework is designed to be flexible and allow for even more general conformal predictors. For example, given access to p N scorecasters {ˆqt,i}p i=1, we could extend Example 3.2 by choosing Θ = [0, 1]p R and qt(θ) = θ [ˆqt,1, , ˆqt,p, 1]. Similarly to the SQT + scorecaster method, this variant would be well-specified if any of the scorecasters satisfy the conditions in Example 3.2 and enjoy regret bounds against the best predictor in hindsight with respect to the quantile loss immediately from classical online convex optimization results (Hazan, 2016). In particular, by choosing SQT predictors with learning rates in a fixed grid as scorecasters, we could obtain a conformal predictor that preserves long-run coverage and attains quantile loss at least as good as the best SQT predictor in that grid without the need for tuning. 4. Guarantees under adversarial data We first focus on the setting where covariate and label pairs (Xt, Yt) are adversarial. In this context, we show that by arguments similar to those in Gibbs & Cand es (2021); Angelopoulos et al. (2023; 2024), our algorithms satisfy the long-run coverage guarantee (1). We provide the proof in Section C.1 of Appendix C. Theorem 4.1. Let the scores St be almost surely bounded for any time t with supt |St| Ks < and let {ηb}b be a non-increasing sequence of learning rates. Then, for any bounded conformal predictor with a bias term initialized so that |c1| Ks + Kq, Algorithm 1 with batch size m and B = T m almost surely satisfies 1 T 2m(Ks + Kq + η1) This long-run coverage bound generalizes those in Angelopoulos et al. (2023) and Angelopoulos et al. (2024) for SQT, as it allows batch sizes larger than 1 and more complex underlying models. 5. Guarantees under well-specification We now show that in addition to always satisfying the longrun coverage guarantee (1) for adversarial data, our algorithms achieve the conditional coverage guarantee (2) when the parameter set Θ is well-specified. We begin by showing this property asymptotically, and then provide finite sample rates characterizing the speed of convergence when suitably large batch sizes are chosen. For the remaining sections with stochastic data we always assume that our conformal predictor qt is linear in its parameter θ Θ Rd, so that qt(θ) = θ Zt for covariates Zt = Φt(Xt 1 1 , Y t 1 1 ) Rd. This implies that ℓt is convex for all t N. We also assume that these covariates have almost surely bounded norm supt Zt G so that our conformal predictor is G-Lipschitz. For this section in particular, we make an assumption on the distribution of {St}t: the conditional score distributions St | Ft 1 have uniformly lower and upper bounded continuous densities ft in some ε-neighborhood around their unique quantiles q t . More concretely, for all t: p ft(s) u if |s q t | ε. While all of our stochastic guarantees apply to the existing SQT algorithm, they are stronger for the LQT algorithm; its larger parameter set Θ is well-specified more often and has minimizers θ Θ at least as good, according to any loss. 5.1. Asymptotic Consistency The following result shows the consistency of Algorithm 1 under the weakest assumptions, where we require only asymptotic well-specification and non-increasing learning rates ηt satisfying P t=1 ηt = and P t=1 η2 t < . We provide the proof in Section C.2 of Appendix C. Theorem 5.1. Let the assumptions in the previous paragraph hold, and assume that the conformal predictors have a bias term, are asymptotically well-specified, regular, and η-bounded. Then the iterates θb produced by Algorithm 1 satisfy θb a.s. θ Θ , and attain the conditional coverage guarantee P (Yt / Ct(Xt) | Ft 1) a.s. α. This consistency result also holds more broadly for conformal predictors that ensure that ℓt is convex for all t, with linear models being a special case. 5.2. Finite-sample Convergence We will now provide finite-sample convergence rates for Algorithm 1. The two main challenges are the changing expected conditional loss Lt (with expected subgradients that depends on the time t) and the lack of strong convexity or smoothness in the quantile loss. A key result for our convergence analysis is the following strong convexity lemma, which shows that the batched con- Online Conformal Prediction via Online Optimization ditional losses are large if our iterate θ is far from θ . This is non-trivial, as each conditional loss Lt can have an infinite number of minimizes, for example when Zt represents the last p 1 scores and has a bias term. We provide a proof of this Lemma in Section C.3 of Appendix C. Lemma 5.2. Let Θ be well-specified with finite diameter D. For batch b, let λb be the minimum eigenvalue of the sample covariance matrix 1 m Pm i=1 Zb,i Z b,i and let µb := pλb 2 min ε GD, 1 . Then, the conditional loss gap in batch b is lower bounded for any θ Θ: i=1 Lb,i(θ) Lb,i(θ ) µb θ θ 2. The bound above is data-dependent and for our main convergence result, we will need a positive lower bound on minb [B] E[λb | Fb 1,m], which ensures that the progress with each gradient step is lower-bounded. This requires an analysis of the minimum eigenvalues λb. We provide an example lower bound for the weakly-stationary AR(1) model St = ϕSt 1 + εt with |ϕ| < 1 and zero mean εt iid Pε, which we follow for most of our experiments, and prove it in Section C.4 of Appendix C. Lemma 5.3. Let {St}t follow a weakly-stationary AR(1) stochastic process where E ε4 t O(1)E ε2 t 2, and there is a constant σ > 0 such that P([ε]+ σ) 1 4 and P([ ε]+ σ) 1 4. Let Zb,i = (Sb,i, 1) and Σb = 1 m Pm i=1 Zb,i Z b,i. Then for m 1 1 ϕ there are numerical constants p, c > 0 such that P(λmin(Σb) c (σ 1) | Fb 1,m) p. In particular, E[λmin(Σb) | Fb 1,m] pc(σ 1). We can now state our finite-sample convergence result, which we prove in Section C.5 of Appendix C, provided that a result analogous to Lemma 5.3 holds almost surely. Theorem 5.4. Let the assumptions of Lemma 5.2 hold, and let µmin minb [B] E[µb | Fb 1,m], B 4, and c [0, 1]. Then, for any b B and a universal constant C, the iterates of Algorithm 1 with learning rates ηb = 2c 1 µminbc satisfy the following bound with probability at least 1 δ: θb θ 2 C (log(B log(B)/δ) + 1)G2 This implies that if u is a uniform upper bound for the conditional densities in a neighborhood of size GD around q t , then for some (other) constant C depending on m, u , G, µmin, and any t T, P(Yt / Ct(Xt) | Ft 1) α C r (log(T log(T)/δ)) Our result provides a data-dependent, last-iterate highprobability bound. To the best of our knowledge, it is the first finite-sample convergence analysis for SGD that does not rely on stationarity or mixing, but instead assumes only the existence of a unique minimizer consistent across time. This convergence resuls shows a trade-off in terms of learning rate when compared to the long-run coverage bound in Theorem 4.1. As the learning rates decay quickly and approach Θ(1/t), this bound tightens to a fast O(1/t), while the long-run coverage bound turns vacuous. On the other hand, as the learning rates approach Θ(1), this convergence rate becomes vacuous while the long-run coverage bound strengthens to O(1/t). This suggests learning rate schedules with faster decay to achieve better convergence rates when the well-specified assumption holds. We will observe the same learning rate tradeoff in the following section, where we trade well-specification for a mixing assumption on the data. 6. Guarantees under mis-specification Under mis-specification, the full conditional guarantee (2) is impossible. Thus, we weaken this goal to consider achieving a type of best-in-class approximation to conditional coverage, where for a sequence of coverage values αt optimal relative to the data and parameters Θ, we obtain P (Yt / Ct(Xt) | Ft 1) αt If αt = α for all t, which holds under well-specification, we recover conditional coverage (2). We thus consider a setting where there is some stationary distribution Π governing the long run behavior of {(St, Zt)}. For this long-run distribution Π, we define the αt implicitly via the best quantile predictor θ argmin θ Θ LΠ(θ) := EΠ ℓquantile(S Z θ) , (Typically, we consider an ergodic stochastic process converging to a stationary distribution Π.) When the minimizer θ is unique, it defines conformal sets C t (Xt) = y Y | s(Xt, y) Z t θ and relatively optimal confidence values αt = P (Yt / C t (Xt) | Ft 1) . (3) These αt are often close to α, as all other θ Θ attain worse expected coverage relative to our chosen features Zi. We formalize and prove this, as well as providing synthetic experiments showing the coverage properties of αt, in Section D.1 of Appendix D. Online Conformal Prediction via Online Optimization Aside from the good coverage provided by θ , it is not immediately clear that the quantile loss is the correct objective to minimize. The iterates of Algorithm 1 perform well with respect to the quantile loss since they achieve sublinear regret by the argument in Theorem 3.1 of Hazan (2016), but one may instead be concerned with alternative loss functions, such as conformal set sizes or the square loss. However, in Appendix D.2 we argue as in Mannor et al. (2009) that in the online adversarial setting, minimizing a loss function subject to long-run average constraints is impossible even in the case of simple linear functions, so it is unlikely that this is possible in the case of minimizing some arbitrary loss subject to the long-run coverage guarantee (1). To characterize the speed of convergence to the αt in (3), we will assume the stochastic process generating {(St, Zt)}t is β-mixing. If for 0 < r t, we let P t [r] be the distribution of (Zt, St) given Fr with density pt [r], we can define the β-mixing coefficient β(τ) := sup t N n 2E h P t+τ [t] Π TV with mixing time τβ(P, ϵ) = inf{τ N | β(τ) ϵ}. Under these assumptions, Algorithm 1 with m = 1 reduces to a special instance of ergodic mirror descent (Duchi et al., 2012; Agarwal & Duchi, 2013) applied to the convex function LΠ(θ). When the scores S have densities, the expected quantile loss behaves like a strongly convex function, allowing us to extend prior analyses to the last-iterate strongly convex setting. We provide a proof of this extension in Section C.6 of Appendix C. (Our proofs formally require that under Π, the conditional score distribution S | Z has continuous density πZ with the uniformly upper and lower-bounds pΠ πZ(s) uΠ in a ε-neighborhood of ZT θ .) Theorem 6.1. Let the assumptions stated above hold, and let Θ have finite diameter D. For some λ > 0, let EΠ ZZ λI, t [T + 1], c [0, 1], and µ = pΠλ min ε GD, 1 . Then the minimizer θ is unique and the iterates θt produced by Algorithm 1 with m = 1 and ηt = 2c µtc satisfy E h θt θ 2i C G2 for τ = τβ(P, t c) and a universal constant C. This implies that if u is a uniform upper bound for the conditional densities in a ε-neighborhood around Z t θ , for some other constant C now depending on µ, G, D, u, ε, E [|P (Yt / Ct(Xt) | Ft 1) αt|] C r τ In both cases, the expectation is taken over the samples {(Si, Zi)}t i=1. Here we also observe the same tradeoff as in Theorem 5.4 in terms of the learning rate when compared to the long-run coverage guarantee of Theorem 4.1. 7. Experiments In this section, we empirically evaluate Algorithm 1 alongside baselines. We focus on SQT+AR as in Example 3.2 and LQT of order p as in Example 3.3, and use batch size m = 1 since larger batch sizes provide comparable results. 7.1. Baseline algorithms We now describe the baselines we compare against. Adaptive conformal inference (ACI) (Gibbs & Cand es, 2021). In offline conformal prediction, the threshold q to construct confidence sets C( ) is chosen (up to a minor correction) as the (1 α) empirical quantile of a heldout set of validation scores. ACI instead chooses the (1 αt) empirical quantile of past scores for time-varying αt updated via gradient descent on the quantile loss. Our remaining baselines fall under the conformal PID control framework (Angelopoulos et al., 2023). Scalar quantile tracking (SQT). This method (also known as conformal P control) directly updates the threshold qt via gradient descent on the quantile loss. It predicts qt = Pt 1 i=1 ηi(erri α) when initialized at q1 = 0. Conformal PI control (PI). This method incorporates an error integrator rt : R R, adding rt(Pt 1 i=1 erri α) to the SQT update. As in Angelopoulos et al. (2023), we use the tangent integrator rt(x) := KI tan(x log(t)/t Csat), where KI, Csat > 0 are hyperparameters and tan(x) = sign(x) for x / [ π/2, π/2]. Conformal PID control (PID). This method is the same as PI, but also predicts future scores via a scorecaster ˆqt, whose predictions are added to the PI update. As in Angelopoulos et al. (2023), we use a Theta (Assimakopoulos & Nikolopoulos, 2000) scorecaster, which we call the PID(T) method, and an AR scorecaster, which we call the PID(A) method. We use the code in Angelopoulos et al. (2023) to implement the ACI, PI, and PID methods, where we replace predictions of with the maximum score in each test set. We test all methods with both fixed and decaying step sizes. The decaying step sizes are of the form c t 0.6 as in Angelopoulos et al. (2024), which satisfies the conditions of our stochastic consistency guarantees. For the PI and PID methods, we additionally adopt their learning rate trick of multiplying the learning rate by the highest score over a trailing window, even though this does not satisfy the conditions of Theorem 4.1. We describe the hyperparameter grids for each Online Conformal Prediction via Online Optimization method in Section B.1 of Appendix B. 7.2. Existing data In this section, we run our algorithms and baselines on datasets appearing in the online conformal literature. We describe our datasets below. 7.2.1. DATASETS Stock data (AMZN, GOOGL, MSFT). Using stock data is common in online conformal work. Here we consider the returns of Amazon, Google, and Microsoft stock, which are datasets used in Angelopoulos et al. (2023) and contain roughly 3,000 observations each. Daily climate. This dataset has 1,575 daily temperature measurements in Delhi, India from 2013 to 2017, and is also used in Angelopoulos et al. (2023). For the stock and daily climate datasets, we use the following four base forecasters: autoregressive (AR) model of order 3, Theta model, Prophet model, and Transformer model. These are retrained at every time-step and give predictions { ˆYt}t of the time series {Yt}t defining conformal scores St := | ˆYt Yt|. We use the code in Angelopoulos et al. (2023) to generate the predictions and subsequent scores. This gives a total of 16 (small) datasets. Elec2 (Harries, 1999). This dataset consists of 45,312 hourly measurements of electricity demand in New South Wales, Austrailia from May 7, 1996 to December 5, 1998. As in Angelopoulos et al. (2024), we use a one-day delayed moving average as base forecaster, that is ˆYt := 1 24 P24 i=1 Yt 24 i and conformal scores St := | ˆYt Yt|. 7.2.2. EXPERIMENTAL SET-UP In all experiments, we set the confidence level to 1 α = 0.9. We always reserve the first scores as a validation set, and set the rest as the test set. We tune the hyperparameters for our algorithms on the validation set, and for the baselines, we directly tune the hyperparameters on the test set. We justify this approach below. We choose our experimental design following the recommendation in Gulrajani & Lopez-Paz (2020) to specify strategies for model selection under distribution shifts. Since this is not explicitly done in all prior online conformal work, we tune baseline hyperparameters on the test set to identify a rough upper bound on their performance. We want to neither rely on heuristics (which can only be worse) nor subject baselines to our (potentially suboptimal) tuning strategies, so our setup ensures that the best possible choices, subject to our extensive grid, are always made. Several follow-up works on ACI have considered adaptively setting hyperparameters; these are orthogonal to this work, and can be adapted to our algorithms (Gibbs & Cand es, 2024; Zaffran et al., 2022; Bhatnagar et al., 2023). For our methods, we recommend tuning hyperparameters by performing a grid search on recent data. One almost always has access to a validation set of past conformal scores, which can give a sense of general properties of the data. We provide a starting grid in Section B.1 of Appendix B, which is implemented in our code and used in all our experiments. If the optimal values are found at the edge of the grid, the grid should be made larger. 7.2.3. EXPERIMENTAL RESULTS Our experimental results come in two parts, as we separate datasets into ones with stronger and weaker linear dependence. Since we fit linear autoregressive models to the scores, we expect our algorithms to do well when the data exhibits linear dependence, and otherwise be on par with existing algorithms. To measure this property in our setting, we consider the autocorrelation of the first lag, which quantifies the degree of linear dependence between a time series and its one-step lagged values. We foresee increasing linear dependence when the base forecaster gets worse. For example, in a regression setting, if the labels Yt R are decomposed into their conditional mean µt and uncorrelated zero-mean noise wt as Yt = µt + wt, then the scores St = |wt| have no linear dependence when the base forecaster predicts the conditional mean µt perfectly. Otherwise, the error in conditional mean (a function of past data) introduces correlations between the scores. This is supported by the stock data where we can directly compare base predictors: we show in Table 4 of Appendix B that larger loss (average score) corresponds to larger autocorrelation of the first lag. In the first four blocks of Table 1, we present results for the Microsoft stock dataset. Due to space constraints, we defer to Appendix B the remaining results for Amazon stock, Google stock, and the daily climate dataset, although they are similar. In the subsequent block, we report results for the larger Elec2 dataset. We only present separate results for both fixed and decaying step sizes for our main algorithm (LQT) for all other algorithms we report the best result. We focus on both the quantile loss to measure the closeness of the threshold to the true scores, as well as the set sizes to show how conservative the conformal sets are. In Appendix B, we also measure the average win rate against LQT(fixed), which we define as 1 T PT t=1 1{ℓt,baseline ℓt,LQT (fixed)} when ℓcorresponds to quantile loss or set size, to get a sense of instantaneous performance. The key takeaway is that when there is linear dependence in the conformal scores, our algorithms fit the data better Online Conformal Prediction via Online Optimization Dataset, metric LQT (fixed) LQT (decay) SQT+AR SQT ACI PI PID(T) PID(A) Smaller Autocorr. MSFT ar, q. loss (avg) 0.091 0.09 0.09 0.088 0.09 0.088 0.088 0.091 MSFT ar, set size (avg) 0.853 0.818 0.869 0.83 0.85 0.822 0.812 0.838 MSFT theta, q. loss (avg) 0.139 0.133 0.136 0.14 0.154 0.14 0.128 0.138 MSFT theta, set size (avg) 2.048 1.956 1.997 2.113 2.266 2.092 1.922 2.068 Larger Autocorrelation MSFT prophet, q. loss (avg) 0.104 0.1 0.101 0.151 0.213 0.15 0.134 0.134 MSFT prophet, set size (avg) 2.718 2.705 2.738 3.117 3.57 3.143 2.938 2.934 MSFT transformer, q. loss (avg) 0.115 0.118 0.107 0.289 0.352 0.196 0.149 0.148 MSFT transformer, set size (avg) 5.947 5.942 5.852 7.629 7.938 6.661 6.119 6.104 Elec2, q. loss (avg) 0.005 0.005 0.005 0.013 0.015 0.013 0.011 0.011 Elec2, set size (avg) 0.16 0.159 0.157 0.229 0.244 0.227 0.201 0.201 Elec2, runtime (sec) 0.18 - 129.82 0.05 17.57 0.58 234.76 3.94 ERCOT, q. loss (avg) 29.095 29.158 36.918 59.118 76.629 53.233 42.344 42.349 ERCOT, set size (avg) 691.496 691.414 746.546 977.629 1117.69 912.214 778.808 778.99 Table 1. Performance of conformal predictors across datasets, separated by datasets with smaller autocorrelation (top 2) and larger autocorrelation (bottom 4) of the first lag. Our algorithms are the three on the left. Numbers are in bold if they represent at least a 5% improvement over all methods in the other category. All algorithms achieved at least 0.88 long-run coverage. We do not report runtime for LQT(decay) because it is the same as LQT(fixed). 1700 1750 1800 1850 1900 1950 2000 Final 300 scores Predictions (LQT (fixed)) Scores 1700 1750 1800 1850 1900 1950 2000 Final 300 scores Predictions (LQT (fixed)) Scores 4400 4450 4500 4550 4600 4650 Final 300 scores Predictions (LQT (fixed)) Scores Figure 1. Linear quantile tracking (fixed) predictions versus scores for MSFT AR (left) and MSFT Prophet (middle) and ERCOT (right). than the baselines. Otherwise, all algorithms perform similarly. For stock data, this dependence typically exists for the Prophet and Transformer base predictors, but not as much with the Theta and AR base predictors. This can be seen in Figure 1, where we show an ending window of scores and predictions for the LQT algorithm on both the MSFT AR and Prophet datasets. In the former, the scores look more like pure noise with a first lag having autocorrelation 0.19, while there are clearer waves in the latter with the first lag having autocorrelation 0.95. LQT is roughly constant in the former and the scores appear unpredictable; in the latter, LQT tracks the scores. Even when results are similar, it is advantageous to use the LQT algorithm due to its lightweight nature and simple update. In Table 1, we also show the running times of each algorithm with a fixed step size for the largest Elec2 dataset. Here the p order for LQT is 1, which is a very common choice in our experiments. This is only slightly slower than SQT, and much faster than any other method like PID that attempts to model the score distribution. 7.3. Pre-registered experiment Our main experiment is pre-registered on Open Science Framework (OSF). The pre-registration document is included in the supplementary material and we provide the OSF link here: https://osf.io/64jbp/. We gather data from the Electric Reliability Council of Texas (ERCOT), an organization that operates Texas s electrical grid. This data is accessible through the Grid Status API, which provides the true electricity load and a forecast for the load every 5 minutes. Our conformal scores are their absolute difference, and our test data is from 12/18/2024 to 1/04/2025, which provides roughly 5,000 scores. We tuned the hyperparameters for our algorithms on the data from 2 weeks prior to the date of preregistration, and the hyperparameters for all baselines are tuned on the test data. Summarized results are at the bottom of Table 1, and we also show an ending window of scores in the rightmost plot of Figure 1, as well as long-run coverage rates in Figure 2. Our methods show significant improvements, and the Online Conformal Prediction via Online Optimization 0 1000 2000 3000 4000 Time step Long-run coverage rate (target = 0.9) Long-run coverage rates on ERCOT SQT LQT ACI PI PID(theta) SQT+AR 0 1000 2000 3000 4000 Time step Long-run coverage rate (target = 0.9) Long-run coverage tradeoff via LQT learning rate decay on ERCOT decay: 0 decay: 0.2 decay: 0.4 decay: 0.6 decay: 0.8 decay: 1 Figure 2. Long-run coverage of online conformal algorithms on pre-registered ERCOT data. The left plot shows the long-run coverage as a function of time for several algorithms; all converge to the target (1 α = 0.9) rather quickly. On the right, we vary the learning rate decay (the learning rate is of the form ct x, we vary x here) for LQT to exhibit the tradeoff from our theory. We observe a slower convergence for more quickly decaying learning rates. autocorrelation of the first lag was 0.94. In addition to reporting our standard metrics, we also committed to hypothesis tests in our preregistration. We performed 78 tests with several loss functions comparing each baseline to LQT(fixed), and we describe these in Section B.2 of Appendix B. We report here that all 78 hypothesis tests can be rejected at a 0.05 significance level, and that almost all p-values are significantly smaller than 0.05. 8. Conclusion In this work, we developed a new framework for online conformal prediction that not only satisfies standard marginal adversarial guarantees but also achieves time-conditional coverage at every time for more predictable stochastic data. In our experiments, we observed that the linear dependence of the data, which we measure in terms of autocorrelation, is a strong indicator of the performance improvements that our algorithms offer over the baselines. While our approach focused on simple autoregressive models, we expect that more complex conformal predictors that, for example, use richer feature vectors or incorporate transformations of the data, can offer further empirical improvements. We expect that techniques from offline conformal prediction that approach covariate-conditional coverage can be incorporated into our framework to provide even stronger conditional guarantees. We also believe that similar consistency guarantees can hold when ℓt is weakly convex, as stochastic subgradient iterates converge to stationary points in this case (Duchi & Ruan, 2017; Davis & Drusvyatskiy, 2019). Acknowledgements Partially supported by the Office of Naval Research Grant N00014-22-1-2669 and N00014-24-1-2609, as well as a gift from Schmidt Sciences. Felipe Areces acknowledges support from the Mr. and Mrs. Chun Chiu Stanford Graduate Fellowship. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Agarwal, A. and Duchi, J. C. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573 587, 2013. Angelopoulos, A., Cand es, E., and Tibshirani, R. J. Conformal PID control for time series prediction. In Advances in Neural Information Processing Systems 36, 2023. Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. ar Xiv preprint ar Xiv:2208.02814, 2022. Angelopoulos, A. N., Barber, R. F., and Bates, S. Online conformal prediction with decaying step sizes. ar Xiv:2402.01139 [stat.ML], 2024. Areces, F., Cheng, C., Duchi, J., and Kuditipudi, R. Two fundamental limits for uncertainty quantification in predictive inference. In Proceedings of the Thirty Seventh Annual Conference on Computational Learning Theory, 2024. Online Conformal Prediction via Online Optimization Assimakopoulos, V. and Nikolopoulos, K. The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4):521 530, 2000. Awasthi, P., Cortes, C., and Mohri, C. Theory and algorithm for batch distribution drift problems. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 9826 9851. PMLR, 25 27 Apr 2023. Barber, R. F., Cand es, E. J., Ramdas, A., and Tibshirani, R. J. Conformal prediction beyond exchangeability. ar Xiv:2202.13415 [stat.ME], 2022. Bartlett, P. L., Dani, V., Hayes, T. P., Kakade, S. M., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Proceedings of the Twenty First Annual Conference on Computational Learning Theory, pp. 335 342, 2008. Bastani, O., Gupta, V., Jung, C., Noarov, G., Ramalingam, R., and Roth, A. Practical adversarial multivalid conformal prediction. ar Xiv:2206.01067 [cs.LG], 2022. Bhatnagar, A., Wang, H., Xiong, C., and Bai, Y. Improved online conformal prediction via strongly adaptive online learning. ar Xiv:2302.07869 [cs.LG], 2023. Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207 239, 2019. Duchi, J. C. and Ruan, F. Stochastic methods for composite and weakly convex optimization problems. ar Xiv:1703.08570 [math.OC], 2017. URL https: //arxiv.org/abs/1703.08570. Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549 1578, 2012. Gibbs, I. and Cand es, E. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems 34, 2021. Gibbs, I. and Cand es, E. J. Conformal inference for online prediction with arbitrary distribution shifts. Journal of Machine Learning Research, 25(162):1 36, 2024. Gibbs, I., Cherian, J., and Cand es, E. Conformal prediction with conditional guarantees. ar Xiv:2305.12616 [stat.ME], 2023. Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. ar Xiv:2007.01434 [cs.LG], 2020. Gupta, V., Jung, C., Noarov, G., Pai, M. M., and Roth, A. Online multivalid learning: Means, moments, and prediction intervals. In Proceedings of the Thirteenth Innovations in Theoretical Computer Science (ITCS), 2022. Harries, M. Splice-2 Comparative Evaluation: Electricity Pricing. University of New South Wales, School of Computer Science and Engineering, 1999. Harvey, N. J. A., Liaw, C., Plan, Y., and Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. In Proceedings of the 36th International Conference on Machine Learning, 2019. Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3 4):157 325, 2016. Herrndorf, N. A functional central limit theorem for ρmixing sequences. Journal of Multivariate Analysis, 15 (1):141 146, 1984. Jung, C., Noarov, G., Ramalingam, R., and Roth, A. Batch multivalid conformal prediction. ar Xiv:2209.15145 [cs.LG], 2023. Kiyani, S., Pappas, G., and Hassani, H. Conformal prediction with learned features. ar Xiv:2404.17487 [cs.LG], 2024. Koenker, R. and Bassett Jr., G. Regression quantiles. Econometrica: Journal of the Econometric Society, 46(1):33 50, 1978. Kushner, H. J. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition, 2003. Lei, J., G Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111, 2018. Lobato, I. N. Testing that a dependent process is uncorrelated. Journal of the American Statistical Association, 96 (455):1066 1076, 2001. Mannor, S., Tsitsiklis, J. N., and Yu, J. Y. Online learning with sample path constraints. Journal of Machine Learning Research, 10:569 590, 2009. Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009. Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, volume Online Conformal Prediction via Online Optimization 235 of Proceedings of Machine Learning Research, pp. 36029 36047. PMLR, 21 27 Jul 2024. Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983. Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009. Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012. Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407, 1951. Robbins, H. and Siegmund, D. A convergence theorem for non-negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pp. 233 257. Academic Press, New York, 1971. Romano, Y., Patterson, E., and Cand es, E. J. Conformalized quantile regression. In Advances in Neural Information Processing Systems 32, 2019. Ryu, E. K. and Yin, W. Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, 2022. Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pp. 71 79, 2013. URL http://proceedings.mlr.press/ v28/shamir13.html. Vovk, V., Grammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer, 2005. Yang, Z., Cand es, E., and Lei, L. Bellman conformal inference: Calibrating prediction intervals for time series. ar Xiv:2402.05203 [cs.LG], 2024. Zaffran, M., Feron, O., Goude, Y., Josse, J., and Dieuleveut, A. Adaptive conformal predictions for time series. In Proceedings of the 39th International Conference on Machine Learning, 2022. Online Conformal Prediction via Online Optimization A. Related work Online conformal prediction builds on standard offline conformal prediction for exchangeable data (Vovk et al., 2005), a framework that has recently grown in popularity, with its tools also finding applications in areas such as risk control (Angelopoulos et al., 2022) and language modeling (Mohri & Hashimoto, 2024). It guarantees that P(Yt+1 C(Xt+1)) 1 α, and P(Yt+1 C(Xt+1)) 1 α + 1 t+1, when the conformal scores {St}t are almost surely distinct (Lei et al., 2018). A key property of these results is that the probability statements are marginal over both the dataset used to fit the predictive sets C and the new covariate Xt+1. Real-world data can however fail to satisfy the exchangeability assumption. Similar to how more traditional machine learning theory has been adapted to develop guarantees under weaker assumptions (Mansour et al., 2009; Awasthi et al., 2023), adaptive conformal inference (ACI) (Gibbs & Cand es, 2021) introduced the first algorithm to present an adversarial approach to conformal prediction. Subsequent work follows two main directions. The first explores how to adaptively tune the magnitude of ACI updates using online learning techniques, resulting in algorithms such as Ag ACI (Zaffran et al., 2022), SAOCP (Bhatnagar et al., 2023), and Dt ACI (Gibbs & Cand es, 2024). These methods primarily address the adversarial setting, offering improved regret guarantees while preserving similar long-run coverage properties as ACI. Similarly to our work, Zaffran et al. (2022) discuss the idea of making ACI adaptive to time series with general dependency, but Ag ACI is not guaranteed to provide adversarial long-run coverage and is thus not comparable with our algorithms. The second line of work proposes modifications to ACI s update rule inspired by ideas from control theory, leading to new algorithms like PID-conformal (Angelopoulos et al., 2023) and BCI (Yang et al., 2024). Our work is more closely aligned with a recent line of research developing online conformal algorithms that achieve good performance in stochastic settings while preserving long-run coverage when data is adversarial, such as decaying step-size ACI (Angelopoulos et al., 2024), which focuses on the i.i.d. setting. Our work focuses on developing time-conditional conformal guarantees that average over the covariate distribution while conditioning on the set C, as in Angelopoulos et al. (2024). Such guarantees are fundamental for ensuring that our algorithms remain valid without the need for retraining, as is common in standard offline conformal algorithms such as CQR (Romano et al., 2019). Recent work also addresses the remaining challenge of providing covariate-conditional or simply conditional coverage guarantees, both in the offline setting (Barber et al., 2022; Gibbs et al., 2023; Jung et al., 2023; Areces et al., 2024) and in the online adversarial setting (Gupta et al., 2022; Bastani et al., 2022). In particular, Kiyani et al. (2024) introduce mean squared conditional error as a metric to evaluate algorithms seeking covariate-conditional guarantees in the stochastic i.i.d. setting, which is similar to the notion of time-conditional coverage we discuss in Section 6. Our results are also related to the convergence theory of Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951), which provides rates of convergence for the gap (or error) between the loss attained by the algorithm s outputs and the true minimizer. In the case where the algorithm receives i.i.d. samples from the target distribution and the loss is convex and smooth, Nemirovski et al. (2009) show that after T steps of SGD, the expected error of the last iterate is O(T 1/2) and O(T 1) if the loss is additionally strongly convex. The quantile regression framework, as presented in Koenker & Bassett Jr. (1978), has a non-smooth objective, so our results build upon the convergence theory for non-smooth functions. In this setting, many guarantees focus on the loss attained by the average iterate rather than the final iterate, achieving optimal expected rates of O T 1/2 (Nemirovski & Yudin, 1983) and O T 1 (Rakhlin et al., 2012) when the loss is strongly convex. Shamir & Zhang (2013) show that these rates are indeed better than those for the final iterates, which only achieve expected errors of O log(T)T 1/2 and O log(T)T 1 in the convex and strongly convex settings, respectively. Harvey et al. (2019) extends both of these to the high probability regime. In our setting, we analyze the performance of SGD when the data is not sampled i.i.d. from the target distribution, so our results are also related to the convergence theory of SGD with dependent data. The class of problems we focus on is part of the family of stochastic problems with exogenous correlated noise (Kushner & Yin, 2003), which Duchi et al. (2012) analyzes in the context of β-mixing and ϕ-mixing processes. Duchi et al. (2012) show that for convex losses under such mixing assumptions, the error rate of the average SGD iterate is governed by the β and ϕ mixing times of our process, respectively. Agarwal & Duchi (2013) extend these results to the strongly convex setting but still focus on average iterate guarantees. Online Conformal Prediction via Online Optimization Dataset, metric LQT (fixed) LQT (decay) QT+AR SQT ACI PI PID(T) PID(A) Elec2, q. loss (avg) 0.005 0.005 0.005 0.013 0.015 0.013 0.011 0.011 Elec2, set size (avg) 0.16 0.159 0.157 0.229 0.244 0.227 0.201 0.201 Elec2, q. loss (win %) 1.0 0.563 0.603 0.158 0.183 0.189 0.241 0.243 Elec2, set size (win %) 1.0 0.567 0.612 0.183 0.206 0.209 0.255 0.258 Elec2, q. loss (avg) p-val 1.0 0.939 0.899 0.0 0.0 0.0 0.0 0.0 Elec2, set size (avg) p-val 1.0 0.948 0.914 0.0 0.0 0.0 0.0 0.0 Elec2, a. loss (avg) p-val 1.0 0.928 0.886 0.0 0.0 0.0 0.0 0.0 Elec2, q. loss (win %) p-val 1.0 0.881 0.977 0.0 0.0 0.0 0.0 0.0 Elec2, set size (win %) p-val 1.0 0.884 0.988 0.0 0.0 0.0 0.0 0.0 Elec2, a. loss (win %) p-val 1.0 0.885 0.973 0.0 0.0 0.0 0.0 0.0 Table 2. Elec2 dataset. Our algorithms are the three on the left. Numbers are bolded if they represent at least a 5% improvement over all methods in the other category. All algorithms achieved at least 0.89 coverage. Dataset, metric LQT (fixed) LQT (decay) QT+AR SQT ACI PI PID(T) PID(A) ERCOT, q. loss (avg) 29.095 29.158 36.918 59.118 76.629 53.233 42.344 42.349 ERCOT, set size (avg) 691.496 691.414 746.546 977.629 1117.69 912.214 778.808 778.99 ERCOT, q. loss (win %) 1.0 0.371 0.399 0.227 0.288 0.296 0.286 0.283 ERCOT, set size (win %) 1.0 0.34 0.407 0.243 0.31 0.317 0.301 0.296 ERCOT, q. loss (avg) p-val 1.0 0.5 0.0 0.003 0.016 0.002 0.0 0.0 ERCOT, set size (avg) p-val 1.0 0.507 0.0 0.002 0.016 0.001 0.0 0.001 ERCOT, a. loss (avg) p-val 1.0 0.494 0.0 0.004 0.015 0.003 0.0 0.0 ERCOT, q. loss (win %) p-val 1.0 0.183 0.001 0.0 0.0 0.0 0.0 0.0 ERCOT, set size (win %) p-val 1.0 0.175 0.003 0.0 0.0 0.0 0.0 0.0 ERCOT, a. loss (win %) p-val 1.0 0.18 0.0 0.0 0.0 0.0 0.0 0.0 Table 3. Pre-registered ERCOT dataset. Our algorithms are the three on the left. Numbers are bolded if they represent at least a 5% improvement over all methods in the other category. All algorithms achieved at least 0.89 coverage. B. Additional Experiments In this section, we report more detailed results for our experiments. In Table 2 and Table 3, we present full results for the Elec2 dataset and the pre-registered ERCOT dataset, respectively. The hyperparameter grids are in B.1, with ours tuned on validation data and the baseline hyperparameters tuned on the test sets. These additionally contain win rates versus LQT (fixed), defined as 1 T PT t=1 1{ℓt,baseline ℓt,LQT (fixed)} when ℓcorresponds to quantile loss or set size, as well as p-values to determine significance. For each baseline algorithm and loss function ℓcorresponding to quantile loss, absolute loss, and set size, we construct a time series of the form {ℓt,baseline ℓt,LQT (fixed)}t, and one with corresponding indicator form {1{ℓt,baseline ℓt,LQT (fixed)} 0.5}t. This gives 6 time series per method. With the assumption that the data is (weakly) stationary, our null hypothesis is that the baseline algorithm is at least as good as LQT(fixed). That is, the mean µ of each stochastic process is non-positive: µ 0, with alternative hypothesis µ > 0. Since the data is not independent, we must resort to a hypothesis test for dependent data. For that reason, we use the test in Lobato (2001, Section 2.1), which we describe in detail in Section B.2. Lastly, we show full results for the smaller datasets in Table 6. We reserve the first 1/3 of the datasets as validation data and tune our hyperparamters with the hyperparameter grid in Appendix B.1, while still tuning baseline hyperparameters on the test set. We also show the relationship between loss of base predictor and autocorrelation of the first lag for the stock data in Table 4. Online Conformal Prediction via Online Optimization Dataset Base predictor Loss (average score) Autocorrelation of first lag GOOGL AR 4.68 0.23 Prophet 24.47 0.95 Theta 11.38 0.76 Transformer 151.40 0.99 MSFT AR 0.38 0.19 Prophet 1.78 0.95 Theta 0.93 0.75 Transformer 5.03 0.99 AMZN AR 3.85 0.31 Prophet 25.28 0.98 Theta 3.86 0.30 Transformer 63.53 0.99 Daily-Climate AR 1.25 0.19 Prophet 2.16 0.67 Theta 1.48 0.15 Transformer 9.83 0.97 Table 4. Relationship between loss (average score) and dependence (measured via autocorrelation of first lag) for stock data. B.1. Hyperparameters Here we describe the hyperparameters used in our experiments. All hyperparameters are tuned via grid search, with grids presented in Table 5. Again, we emphasize that all baseline hyperparameters are always tuned on the test sets. When tuning hyperparameters, we use the ones that lead to the best quantile loss, since this is implicitly what all algorithms minimize and is also a measure of the closeness of the predictions to the true scores, provided that they achieved 1 α 0.01 = 0.89 coverage. If no choice achieves at least 0.89 coverage, we disregard that constraint. Hyperparameter Applicable Algorithms Grid Values Learning rate All 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5 P order LQT, SQT+AR 0, 1, 2 Bias covariate LQT 0.1, 1, 5, 10, 100, 200, 1000 KI PI, PID(A), PID(T) 10, 100, 200, 1000, Angelopoulos et al. (2023, Appendix B) heuristic Csat PI, PID(A), PID(T) 0.1, 1, 5, 20, Angelopoulos et al. (2023, Appendix B) heuristic Table 5. Hyperparameter grids. We provide intuition for the hyperparameters of LQT here. The p order corresponds to the AR(p) model that we use to predict the quantiles of our scores. Therefore, it should correspond to the number of lags with large partial autocorrelation, which can be examined by plotting the partial autocorrelation function (PACF) plot. As a heuristic, we observe that choosing p equal to the number of consecutive lags with a magnitude larger than 0.2 leads to good performance. In some cases, values larger than those in Table 5 can be helpful, as in the case of Elec2 dataset. The other main hyperparameter (besides the learning rate) is the feature corresponding to the bias in the conformal covariate. Specifically, if we choose a p order of 1, the conformal covariate will be of the form Zt = [St 1, w], where w is the bias covariate. While any non-zero bias term satisfies our guarantees, we generally observe that it should be on the same order of magnitude as the scores. For that reason, a heuristic is the average validation score (or a slightly smaller value). We can likely achieve a similar effect by normalizing the scores and setting w = 1. B.1.1. SCORECASTER DETAILS For the PID baseline methods, we implement both the Theta and AR scorecaster using code from Angelopoulos et al. (2023). Angelopoulos et al. (2023) train the scorescaster at using all past data at every time step. We follow this procedure for the smaller datasets in Table 6. This becomes prohibitely slow for larger datasets, so there we only use the most recent 200 Online Conformal Prediction via Online Optimization scores. In the ERCOT pre-registration, we had committed to only using the most recent 10 scores, but we show better results here with 200 scores since this is a baseline. The algorithm SQT+AR also uses an AR scorecaster to predict the next conditional quantile. In all datasets besides ERCOT, the AR scorecaster is trained using the most recent 200 scores via quantile regression and the cvxpy python library. The runtime can likely be made much faster by using other libraries. At the time of preregistration, we had only committed to using the most recent 10 scores, so for ERCOT data we only use the most recent 10 scores since this is one of our proposed methods. B.2. Dependent hypothesis testing To determine if our algorithm, represented by LQT (fixed), provides a significant improvement over a baseline for a specific loss function ℓ, we look at the following two sequences of random variables Xt = ℓt,baseline ℓt,LQT (fixed) Wt = 1{ℓt,baseline > ℓt,LQT (fixed)} 0.5. In this case, if we assume that E [Xt] = µx and E [Wt] = µw are constant over time, testing for a significant improvement would be equivalent to testing the null hypotheses (H0, H0) against the alternative hypotheses (H1, H1) in the following hypothesis tests H0 : µx 0 H0 : µw 0 H1 : µx > 0 H1 : µw > 0. However, because Xt and Wt come from a time series and are not i.i.d. samples from an underlying distribution, standard hypothesis tests are not applicable here. For this reason, we use the hypothesis test in Lobato (2001, Section 2.1) which applies to an arbitrary dependent sequence Yt on a probability space (Ω, A, P) satisfying the following properties: 1. Yt is wide-sense stationary so that E [Yt] = µy for all t N and E [Yt1Yt2] = E [Yt1+τYt2+τ] for all t1, t2, τ N. 2. For some δ > 0, E Y 2+δ t < . 3. For St = Pt i=1 Yi, E S2 t = σ2 t satisfies σ2 t . 4. Yt is ρ-mixing so that the maximum correlation coefficient ρ(k) = sup {corr(Z1, Z2) : Z1 L2(Ω, σ(Yi : i t)), Z2 L2(Ω, σ(Yi : i t + k)), t N} , satisfies ρ(k) 0. Under these assumptions, if µy = 0, then Herrndorf (1984, Corollary) shows that for all r [0, 1] i=1 Yi d ϕB(r), where ϕ = p 2πfy(0), fy(0) is the spectral density of Yt at zero frequency, and B(r) is a Brownian motion process. Lobato (2001) now uses this result to conclude that the test statistic Vt = t1/2 Yt s1/2 t with st = 1 Vt d U1/2 with U1/2 = B(1) q R 1 0 (B(r) r B(1))2 dr . Assuming that the conditions outlined above hold for Xt and Wt under the null hypothesis, we can now perform one-sided hypothesis tests on our random sequences by computing the value of Vt and comparing it to the critical values of the distribution of U1/2. We provide estimated p-values with the same simulation technique in Lobato (2001) but using 10,000,000 sequences of length 10,000. Online Conformal Prediction via Online Optimization 0.0 0.2 0.4 0.6 0.8 1.0 Fixed learning rate Decaying learning rate Figure 3. θt θ vs. time t for synthetic AR(2) data, averaged over 10 runs where the error bars represent standard deviation. B.3. Synthetic data Here we experimentally verify our convergence theory in Section 5 by generating synthetic AR(p) data where our wellspecified assumption holds. In this case, an optimal parameter θ Θ exists, so we can track the distance between the iterates of Algorithm 1 and this optimal parameter. Our synthetic data is sampled from the following AR(2) process: St = 0.3St 1 0.3St 2 + ϵt, where ϵt iid N(0, 1). If we now run Algorithm 1 with covariates Zt = [St 1, St 2, 1] , the optimal parameters are θ = [0.3, 0.3, ϕ 1(1 α)] , where ϕ 1 represents the inverse of the standard normal CDF. In Figure 3, we plot the distance to θ over time for fixed and decaying learning rates. The decaying learning rates are Θ(1/t), which Theorem 5.4 suggests gives the fastest convergence. The results are as expected: fixed learning rates appear to converge to a noise ball, and decaying learning rates appear to converge to θ , suggesting that the conditional coverage condition (2) holds. Online Conformal Prediction via Online Optimization Dataset, metric LQT (fixed) LQT (decay) SQT+AR SQT ACI PI PID(T) PID(A) MSFT ar, q. loss (avg) 0.091 0.09 0.09 0.088 0.09 0.088 0.088 0.091 MSFT ar, set size (avg) 0.853 0.818 0.869 0.83 0.85 0.822 0.812 0.838 MSFT ar, q. loss (win %) 1.0 0.554 0.486 0.538 0.538 0.515 0.526 0.517 MSFT ar, set size (win %) 1.0 0.566 0.474 0.542 0.538 0.526 0.528 0.517 MSFT theta, q. loss (avg) 0.139 0.133 0.136 0.14 0.154 0.14 0.128 0.138 MSFT theta, set size (avg) 2.048 1.956 1.997 2.113 2.266 2.092 1.922 2.068 MSFT theta, q. loss (win %) 1.0 0.561 0.485 0.428 0.447 0.451 0.547 0.505 MSFT theta, set size (win %) 1.0 0.563 0.49 0.423 0.446 0.457 0.543 0.504 MSFT prophet, q. loss (avg) 0.104 0.1 0.101 0.151 0.213 0.15 0.134 0.134 MSFT prophet, set size (avg) 2.718 2.705 2.738 3.117 3.57 3.143 2.938 2.934 MSFT prophet, q. loss (win %) 1.0 0.519 0.466 0.302 0.281 0.301 0.257 0.251 MSFT prophet, set size (win %) 1.0 0.508 0.449 0.31 0.3 0.309 0.268 0.269 MSFT transformer, q. loss (avg) 0.115 0.118 0.107 0.289 0.352 0.196 0.149 0.148 MSFT transformer, set size (avg) 5.947 5.942 5.852 7.629 7.938 6.661 6.119 6.104 MSFT transformer, q. loss (win %) 1.0 0.443 0.543 0.29 0.243 0.345 0.38 0.373 MSFT transformer, set size (win %) 1.0 0.441 0.536 0.294 0.257 0.356 0.379 0.373 AMZN ar, q. loss (avg) 1.198 1.215 1.205 1.198 1.29 1.195 1.197 1.228 AMZN ar, set size (avg) 10.112 9.711 10.398 10.112 11.592 10.009 10.15 10.63 AMZN ar, q. loss (win %) 1.0 0.538 0.446 1.0 0.508 0.537 0.528 0.469 AMZN ar, set size (win %) 1.0 0.556 0.446 1.0 0.513 0.554 0.554 0.481 AMZN theta, q. loss (avg) 1.218 1.203 1.207 1.188 1.283 1.192 1.195 1.23 AMZN theta, set size (avg) 10.252 10.364 10.586 9.927 11.364 9.91 9.976 10.426 AMZN theta, q. loss (win %) 1.0 0.405 0.397 0.452 0.42 0.458 0.453 0.406 AMZN theta, set size (win %) 1.0 0.379 0.371 0.434 0.394 0.446 0.438 0.391 AMZN prophet, q. loss (avg) 2.208 1.419 1.341 2.432 2.998 2.264 1.855 1.84 AMZN prophet, set size (avg) 49.995 42.825 43.105 52.808 57.934 51.747 46.209 46.231 AMZN prophet, q. loss (win %) 1.0 0.529 0.511 0.341 0.293 0.364 0.384 0.378 AMZN prophet, set size (win %) 1.0 0.514 0.488 0.352 0.302 0.369 0.386 0.379 AMZN transformer, q. loss (avg) 1.713 1.863 1.39 4.975 6.622 4.286 2.055 2.057 AMZN transformer, set size (avg) 93.963 97.903 92.614 127.9 143.724 120.728 97.04 97.04 AMZN transformer, q. loss (win %) 1.0 0.389 0.567 0.309 0.25 0.292 0.283 0.274 AMZN transformer, set size (win %) 1.0 0.367 0.545 0.314 0.261 0.307 0.277 0.269 GOOGL ar, q. loss (avg) 1.117 1.136 1.138 1.122 1.185 1.124 1.118 1.142 GOOGL ar, set size (avg) 10.436 9.828 10.681 10.335 11.167 10.334 9.932 10.827 GOOGL ar, q. loss (win %) 1.0 0.63 0.465 0.418 0.545 0.541 0.633 0.495 GOOGL ar, set size (win %) 1.0 0.675 0.477 0.413 0.557 0.567 0.666 0.494 GOOGL theta, q. loss (avg) 1.744 1.61 1.615 1.76 1.911 1.762 1.623 1.724 GOOGL theta, set size (avg) 24.999 23.433 23.027 25.503 27.248 24.773 23.235 24.545 GOOGL theta, q. loss (win %) 1.0 0.594 0.603 0.458 0.495 0.509 0.586 0.512 GOOGL theta, set size (win %) 1.0 0.598 0.616 0.455 0.5 0.523 0.598 0.519 GOOGL prophet, q. loss (avg) 1.335 1.33 1.317 2.086 2.459 2.072 1.812 1.803 GOOGL prophet, set size (avg) 34.916 35.136 35.297 43.597 47.222 43.553 39.424 39.154 GOOGL prophet, q. loss (win %) 1.0 0.38 0.406 0.27 0.258 0.272 0.245 0.251 GOOGL prophet, set size (win %) 1.0 0.333 0.371 0.279 0.27 0.284 0.248 0.25 GOOGL transformer, q. loss (avg) 1.711 1.585 1.6 5.729 12.424 4.045 2.396 2.407 GOOGL transformer, set size (avg) 208.989 206.891 207.588 245.381 291.003 228.834 212.054 211.965 GOOGL transformer, q. loss (win %) 1.0 0.566 0.601 0.158 0.212 0.277 0.318 0.318 GOOGL transformer, set size (win %) 1.0 0.557 0.605 0.174 0.227 0.292 0.322 0.322 daily-climate ar, q. loss (avg) 0.243 0.246 0.249 0.243 0.251 0.239 0.237 0.254 daily-climate ar, set size (avg) 2.708 2.514 2.609 2.632 2.689 2.705 2.583 2.708 daily-climate ar, q. loss (win %) 1.0 0.537 0.525 1.0 0.5 0.549 0.556 0.47 daily-climate ar, set size (win %) 1.0 0.539 0.532 1.0 0.504 0.558 0.564 0.461 daily-climate theta, q. loss (avg) 0.274 0.268 0.269 0.264 0.275 0.265 0.263 0.288 daily-climate theta, set size (avg) 3.325 3.351 3.291 3.295 3.192 3.245 3.174 3.247 daily-climate theta, q. loss (win %) 1.0 0.47 0.476 0.539 0.527 0.552 0.542 0.483 daily-climate theta, set size (win %) 1.0 0.454 0.473 0.541 0.533 0.57 0.539 0.498 daily-climate prophet, q. loss (avg) 0.269 0.275 0.281 0.317 0.3 0.309 0.32 0.317 daily-climate prophet, set size (avg) 3.828 3.788 3.918 4.219 4.22 4.168 4.2 4.162 daily-climate prophet, q. loss (win %) 1.0 0.506 0.483 0.347 0.383 0.394 0.356 0.374 daily-climate prophet, set size (win %) 1.0 0.515 0.482 0.351 0.39 0.404 0.371 0.399 daily-climate transformer, q. loss (avg) 0.313 0.317 0.315 0.443 0.359 0.392 0.42 0.421 daily-climate transformer, set size (avg) 9.1 9.148 8.955 10.126 9.923 9.808 9.831 9.899 daily-climate transformer, q. loss (win %) 1.0 0.409 0.549 0.431 0.376 0.431 0.345 0.311 daily-climate transformer, set size (win %) 1.0 0.406 0.548 0.437 0.384 0.436 0.359 0.317 Table 6. Numbers are bolded if they represent at least a 5% improvement over all methods in the other category. Asterisk numbers achieved less than 0.87 coverage. All baselines achieved at least 0.89 coverage because we tuned them on the test set. Online Conformal Prediction via Online Optimization C. Deferred proofs for Sections 4, 5, and 6 C.1. Proof of Theorem 4.1 This proof follows Theorems 1 and 2 of (Angelopoulos et al., 2024). Similarly, we argue that adversarial coverage follows from the boundedness of the bias terms in our iterates. More concretely, using the decomposition θb = ( θb, cb) Θ R we prove the following proposition. Proposition C.1. If supb |cb| C < almost surely, then with probability 1, Algorithm 1 satifies 1 T It now remains to show that the bias terms cb are indeed bounded, which we do via the following claim. Claim C.1. The iterates of Algorithm 1 satisfy supb |cb| Ks + Kq + η1 almost surely. We now provide proofs for both of these results. Demonstrating Proposition C.1. The final time T is not necessarily divisible by the batch size m, so by time T, Algorithm 1 will only perform updates based on the first Bm T scores. For this reason, we first point out that by the triangle inequality, 1 T t=1 (errt α) t=1 (errt α) i=1 |err B+1,i α| , and since |err B+1,i α| 1 for all i [T Bm] and T Bm m 1, 1 T t=1 (errt α) t=1 (errt α) The rightmost term m 1 T matches the result in Proposition C.1, so it only remains to show that t=1 (errt α) For this purpose, we define ( mη 1 1 if b = 1 m(η 1 b η 1 b 1) if b > 1, to argue that the left-hand side of inequality (4) can be written in terms of the bias terms {cb}B b=1 as 1 T t=1 (errt α) i=1 (errb,i α) ! ηb m(errb,i α) ηb m(errb,i α) r=1 r (c B+1 cr) Online Conformal Prediction via Online Optimization Since the bias terms are bounded |c B+1 cr| 2C, by H older s inequality we have that Since the learning rate schedule {ηb}B b=1 is non-increasing 1:B 1 is a telescoping sum so that 1:B 1 = mη 1 B , which shows inequality (4) and thus the desired result. Demonstrating Claim C.1. For C := (Ks + Kq + η1), we will only prove that cb C with probability 1, since the other direction cb C follows by an identical argument. We will proceed by contradiction, so we let τ := inf{b N | cb < C} < be the first index where cb falls below C, and note that τ > 1 since |c1| Ks + Kq. We will now show that if a finite τ existed, then cτ 1 < cτ < C, contradicting our definition that τ is the first index where cb falls below C. Recall that Θ = Θ R, so the projection step in Algorithm 1 leaves the last coordinate unchanged, and the update for the bias term has bounded magnitude sup b |cb+1 cb| = sup b i=1 |errb,i α| η1. This implies that the previous bias term cτ 1 < (Ks + Kq) and the predictions for batch (τ 1) with i [m] satisfy qτ 1,i(θτ 1) = hτ 1,i( θτ 1) + cτ 1 < Ks. But, by our assumption that all scores are at least Ks almost surely, these predictions all lead to errors as errτ 1,i = 1qτ 1,i cτ 1, which shows the desired contradiction and demonstrates Claim C.1. This argument only requires that the bias term can take any value in the set [ C, C] so it can hold if Θ has finite diameter, as is required by some of our other results. C.2. Proof of Theorem 5.1 In our search for consistent conformal predictors that satisfy the conditional coverage guarantee (2), we must first characterize what model parameters (if any) attain this property. Our first step is thus to establish the conditional coverage of any sequence which eventually represents all the conditional quantiles, matching our definition of asymptotic well-specification: Proposition C.2. For any sequence of iterates satisfying θb a.s. θ Θ , then P (Yt Ct(Xt) | Ft 1) a.s. 1 α, and our conformal predictors satisfy the consistency guarantee (2). In light of this proposition, proving Theorem 5.1 is equivalent to showing that Algorithm 1 eventually converges to some θ Θ : Proposition C.3. The sequence of iterates {θb}b produced by Algorithm 1 satisfies θb a.s. θ Θ . We now provide proofs for both of these statements. Online Conformal Prediction via Online Optimization Demonstrating Proposition C.2. Recall that for b = t P (Yt Ct(Xt) | Ft 1) = P (St qt(θb) | Ft 1) = P (St q t | Ft 1) + Z (1St [q t ,qt(θb)] 1St [qt(θb),q t ])d PSt|Ft 1. For any ε > 0, if we let ε be the neighborhood around the quantiles where the uniform density upper bound p1 holds, with probability 1 we can take t and b large enough that |qt(θ ) q t | min{ϵ/p1, ε} |qt(θb) qt(θ )| min{ϵ/p1, ε} simultaneously. For such t, it must also be the case that |P (Yt Ct(Xt) | Ft 1) (1 α)| ϵ, and since ϵ was arbitrary P (Yt Ct(Xt) | Ft 1) a.s. 1 α. Demonstrating Proposition C.3. We start by defining ˆgb,i lb,i(θb) as our empirical subgradient for batch b and index i satisfying gb,i = E [ˆgb,i | Fb,i 1] Lb,i(θb). For any θ Θ , we now use our assumption that the conformal predictor has a bias term to define 0 ... 0 q b,i qb,i(θ ) which can be nonzero as we only require asymptotic well-specification, so that qb,i(θ + b,i) = q b,i, 0 Lb,i(θ + b,i). Our proof is based on the proof of Theorem 4 in Chapter 7 of Ryu & Yin (2022), which hinges on the following three claims: (i) Almost surely, limb θb θ exists and is finite for all θ Θ . (ii) The iterates {θb}b are arbitrarily close to optimal infinitely often, since for any θ Θ lim inf b 1 m i=1 gb,i, θb (θ + b,i) = 0. (iii) Claims (i) and (ii) imply that θb a.s. θ Θ . Demonstrating Claims (i) and (ii). For this part, we rely on the Robbins and Siegmund quasimartingale convergence theorem (Robbins & Siegmund, 1971), which we state in Theorem E.1 in Appendix E without proof. To apply this result, we must first construct an appropriate quasimartingale. Online Conformal Prediction via Online Optimization Claim C.2. For any θ Θ , define the Fmb-measurable random variables Vb := θb+1 θ 2 Db := 2ηb+1 i=1 E [ ˆgb+1,i, θb+1 (θ + b+1,i) | Fmb] Ub := G2η2 b+1 + 2ηb+1 i=1 E |q b+1,i qb+1,i(θ )| | Fmb . Then Vb, Db, Ub are non-negative with P b=1 Ub < almost surely and E Vb | Fm(b 1) Vb 1 Db 1 + Ub 1, so we can apply Theorem E.1. Proof. Since θb+1 = ΠΘ θb ηb m Pm i=1 ˆgb,i θb+1 θ 2 θb θ 2 + η2 b i=1 ˆgb,i, θb θ θb θ 2 + G2η2 b 2ηb i=1 ˆgb,i, θb (θ + b,i) 2ηb i=1 ˆgb,i, b,i . If we now let θb = ( θb, cb) Θ R, we can use the fact that our model has a bias term to argue that the last coordinate of our empirical gradient [ˆgb,i]d satisfies [ˆgb,i]d cbℓ(Sb,i hb,i( θb) cb), so that |[ˆgb,i]d| max{α, 1 α} < 1 and i=1 ˆgb,i, b,i i=1 [ˆgb,i]d(q b,i qb,i(θ )) i=1 |q b,i qb,i(θ )|. We can now use this fact and our previous upper bound to conclude that θb+1 θ 2 θb θ 2 + G2η2 b + 2ηb i=1 |q b,i qb,i(θ )| 2ηb i=1 ˆgb,i, θb (θ + b,i) . This shows that the Fmb-measurable random variables i=1 E [ ˆgb+1,i, θb+1 (θ + b+1,i) | Fmb] Ub = G2η2 b+1 + 2ηb+1 i=1 E |q b+1,i qb+1,i(θ )| | Fmb . E Vb | Fm(b 1) Vb 1 Db 1 + Ub 1. The sequence {Db}b is also non-negative as i=1 E [E [ ˆgb+1,i, θb+1 (θ + b+1,i) | Fmb+i 1] | Fmb] i=1 E [ gb+1,i, θb+1 (θ + b+1,i) | Fmb] Online Conformal Prediction via Online Optimization since Lb,i(θ) is convex and (θ + b,i) minimizes Lb,i for any (b, i). Additionally, Ub is also non-negative and by η-summability and the fact that P b=1 η2 b < , with probability 1 We can now use Claim C.2 and Theorem E.1 to conclude that almost surely (1) limb Vb exists and is finite. (2) P b=2 Pm i=1 ηb m gb,i, θb (θ + b,i) < . Theorem E.1 states that P b=1 Db < almost surely instead of (2), but this is immediate consequence of the former since P b=1 Db being almost surely finite implies that E [P b=1 Db] < and ηb m gb,i, θb (θ + b,i) so the finiteness of this expectation also implies (2). This argument holds for any θ Θ so by (1) for all θ Θ [with probability 1 limb θb θ exists], and since Θ Rd by Proposition 1 in Chapter 5 of Ryu & Yin (2022) this implies (i) as with probability 1 [for all θ Θ , limb θb θ exists]. For (ii), since P b=1 ηb = , by our positivity constraint (2) implies that for any θ Θ , lim inf b 1 m i=1 gb,i, θb (θ + b,i) = 0. Demonstrating Claim (iii). The fact that θb θ is bounded by claim (i) implies that θb is almost surely bounded too, so by (ii) there exists a subsequence bj i=1 gbj,i, θbj (θ + bj,i) 0. We will now show that θ Θ by contradiction, so we assume that θ / Θ and argue that if this is the case i=1 gbj,i, θbj (θ + bj,i) > 0, which contradicts the condition above. The fact that our conformal predictors are regular implies that there exist infinitely many k 1 such that |qbk,1(θ ) q bk,1| ϵ for some ϵ > 0. We can now find N large enough so that any j N satisfies θbj θ ϵ 2G and by our Lipschitz assumption |qbj,1(θbj) qbj,1(θ )| ϵ Therefore, for any of the infinitely many k N where |qbk,1(θ ) q bk,1| ϵ we also have |qbk,i(θbk) q bk,i| ϵ Online Conformal Prediction via Online Optimization We now apply Lemma E.3 and our uniform local lower bound on the conditional densities for any such k N to conclude that there exist constants c1, c2 such that i=1 gbk,i, θbk (θ + bk,i) 1 m gbk,1, θbk (θ + bk,1) m (Lbk,1(θbk) Lbk,1(θ + bk,1)) m(c1ϵ c2ϵ2) > 0. This result contradicts our assumption that limj 1 m Pm i=1 gbj,i, θbj (θ + bj,i) = 0, so θ Θ . Finally, by (i) we know that almost surely θb θ exists for all θ Θ , which includes θ = θ so θb θ 0 and the entire sequence converges to θ Θ almost surely. C.3. Proof of Lemma 5.2 Since Θ is well-specified, there exists θ Θ such that θ Zb,i = q b,i is the conditional quantile of Sb,i | Fb,i 1. Noting that Zb,i is measurable with respect to Fb,i 1 and |θ Zb,i θ Zb,i| GD by our Lipschtiz and finite diameter assumptions, we can use Lemma E.3, the lower bounds in the standard scalar case, to write i=1 Lb,i(θ) Lb,i(θ ) = 1 i=1 E[ℓquantile(Sb,i θ Zb,i) ℓquantile(Sb,i θ Zb,i) | Fb,i 1] p 2 min n ε GD, 1 o |θ Zb,i θ Zb,i|2. This can further be lower bounded as: i=1 Lb,i(θ) Lb,i(θ ) p 2m min n ε GD, 1 o m X i=1 (θ θ ) Zb,i Z b,i(θ θ ) GD, 1 o (θ θ ) 1 m i=1 Zb,i Z b,i GD, 1 o θ θ 2 2 = µb θ θ 2 2. The second inequality follows from the fact that x Ax λmin(A) x 2 for any symmetric matrix A and with minimum eigenvalue λmin(A) R, and any x. Note that λb here is non-negative since its corresponding matrix is the average of positive semidefinite matrices. This completes the proof. C.4. Proof of Lemma 5.3 For simplicity, we consider only a single batch of m observed covariate vectors Zi = (Si, 1) for i = 1, . . . , m, as in the assumed AR(1) process we have Si = ϕSi 1 + εi, where εi are i.i.d. with E[εi] = 0. This is sufficient to show Lemma 5.3 since the proof for an arbitrary batch b is identical after appropriately offsetting the indices of the scores. Define the matrix i=1 Zi Z i = 1 S2 i Si Si 1 m Pm i=1 Si and S2m = 1 m Pm i=1 S2 i be, respectively, the sample mean and second moment of the scores. We can relate the minimal eigenvalues of Σ to first and second moments of Sm 1 via the following lemma: Lemma C.4. For the matrix Σ = 1 m Pm i=1 Zi Z i , we have S2m 1 + S2m . Online Conformal Prediction via Online Optimization Deferring the proof of Lemma C.4 temporarily, we see that to show Lemma 5.3, it suffices to show that both (i) S2m c0σ2 for a constant c0 > 0 (ii) Sm 2 c1 S2m for a constant c1 < 1 with constant probability, as when these both occur, we have λmin(Σ) (1 c1) c0σ2 with constant probability. For the remainder of the proof, we thus seek to demonstrate both of these, and in both cases, we assume S0 = s for a fixed scalar value s R, as the process is independent of (S i)i N given S0, and we condition implicitly on S0 throughout. Demonstrating claim (i). Writing Si = ϕSi 1 + εi, we have i=1 (ϕSi 1 + εi)2 i=1 ε2 i 1{sign(εi) = sign(Si 1)} . But of course, by assumption on the innovation process we have P([ε]+ σ) 1 4 and P([ ε]+ σ) 1 i=1 ε2 i 1{sign(εi) = sign(Si 1)} m 4 (1 o(1))σ2 with high probability.1 This demonstrates point (i). Demonstrating claim (ii). This requires a substantially more tedious argument. We first develop explicit formulae for Si and their sums. Define the sequence of vectors vi := ϕi 1 ϕi 2 1 0 0 Rm +. Then we have j=1 ϕi jεj + ϕi S0 = ϕi S0 + v T i ε, so that m X i=1 Si = ϕ1 ϕm+1 1 ϕ εi (5a) i=1 (ϕi S0 + v T i ε)2 = ϕ2 1 ϕ2(m+1) 1 ϕ2 S2 0 + εT m X i=1 viv T i i=1 ϕiv T i ε. (5b) We will control the deviations of both the expansions (5). For the former (5a), a quick calculation with Chebyshev s inequality shows that with probability at least 1 1/t2, For the second term (5b), we control both the quadratic Pm i=1(v T i ε)2 and the final linear term. Claim C.3. Let M = Pm i=1 viv T i . Then E[εT Mε] = σ2 ε tr(M), and there is a numerical constant p > 0 such that for all t [0, 1], P εT Mε tσ2 ε tr(M) (1 t)2 p. 1Making this rigorous would use Azuma s inequality. Online Conformal Prediction via Online Optimization Proof. The expectation calculation is trivial. To lower bound the probabilities, note that by the Paley-Zygmund inequality, for any t [0, 1], P(εT Mε tσ2 ε tr(M)) (1 t)2 σ4 ε tr(M)2 E[(εT Mε)2], so that the claimed constant p = σ4 ε tr(M)2 E[(εT Mε)2]. We now demonstrate that p is indeed a (numerical) constant under the assumptions of Lemma 5.3. To control the denominator, the expansion εT Mε = P i,j εiεj Mij and that the εi are are i.i.d. and mean zero together imply E[(εT Mε)2] = i=1 M 2 ii E[ε4 i ] + O(1) X i =j Mii Mjj E[ε2 i ε2 j] + O(1) X i =j M 2 ij E[ε2 i ε2 j] = O(1)σ4 ε tr(M)2 + M 2 Fr . Because vt has entries vt,i = ϕt i for i t and vt,i = 0 otherwise, t=1 vt,ivt,j = t=i j ϕ2t i j = ϕ i j m X t=i j ϕ2t = ϕ2(i j) i j 1 ϕ2(i j+1) So tr(M) = Pm i=1 1 ϕ2(i+1) 1 ϕ2 m 1 ϕ2 , while by considering the diagonal, one-off diagonal, two-off-diagonal, and so on, M 2 Fr Pm 1 i=0 (m i)ϕ2i 1 (1 ϕ2)2 . The following observation allows us to control such sums: Observation C.1. Let c = 1. Then for any m N, i=0 (m i)ci = m1 cm 1 c + (m + 1)cm 1 (1 c)2 + 1 cm Proof. We observe that Pm 1 i=0 ci = 1 cm 1 c whenever c = 1, and so i=0 (m i)ci = m1 cm i=0 (i + 1)ci + i=0 ci+1 + 1 cm Computing derivatives gives the result. Returning to our chain of inequalities and substituting c = ϕ2 in the observation, we thus obtain M 2 Fr m (1 ϕ2)3 tr(M)2, and so the constant p = tr(M)2 O(1)(tr(M)2+ M 2 Fr) 1 as desired. Finally, we control the final sum in the expansion (5b). Defining the scalars ui = 1 1 ϕϕi 1 ϕm i+1 , so for the vector u = (u1, . . . , um) we observe that i=1 ϕiv T i ε = i=1 uiεi = u T ε, Online Conformal Prediction via Online Optimization i=1 ϕiv T i ε 4# = E[(u T ε)4] = i=1 u4 i E[ε4 i ] + O(1) X i =j u2 i u2 j E[ε2 i ε2 j] = O(1)σ4 ε u 4 4 + u 4 2 . Because u 4 4 ϕ4 (1 ϕ)4(1 ϕ4) and u 2 2 ϕ2 (1 ϕ)2(1 ϕ2), we obtain that i=1 ϕiv T i ε tσε (1 ϕ)4(1 ϕ2)2 O(1) for all t 0. Summarizing, in pursuit of proving claim (ii), we have shown Claim C.4. There is a numerical constant p > 0 such that for any ta, tb 0, with probability at least p 1/t2 a 1/t4 b, i=1 Si ϕ1 ϕm+1 O(1)ta m σε i=1 S2 i ϕ2 1 ϕ2(m+1) 1 ϕ2 S2 0 + σ2 ε 2 tr O(1)S0tbσε ϕ By claim C.4, we may evidently choose ta, tb to be appropriate constants and see that with (numerical) constant probability, m2(1 ϕ)2 S2 0 + σ2 ε m(1 ϕ)2 and S2m ϕ2 m 1 1 ϕ2 S2 0 + σ2 ε 1 ϕ2 S0σεϕ We consider two cases: depending on whether S2 0 mσ2 ε/ϕ2. In the first case (that it is larger), we see that Sm 2 ϕ2 m2(1 ϕ)2 S2 0 while S2m ϕ2S2 0 m(1 ϕ2) S2 0ϕ2 m3/2(1 ϕ) p That 1 (1 ϕ)2 1 1 ϕ2 immediately shows S2m Sm 2. In the latter case that S2 0 < mσ2 ε/ϕ2, we have Sm 2 σ2 ε m(1 ϕ)2 while by the Fenchel-Young inequality ab a2 S2m ϕ2S2 0 m(1 ϕ2) + σ2 ε 1 ϕ S2 0ϕ2 2m(1 ϕ2) σ2 ε 2m(1 ϕ)2 ( ) ϕ2S2 0 2m(1 ϕ2) + σ2 ε 2m(1 ϕ)2 , where inequality ( ) used that m 1 1 ϕ. In particular, we have S2m Sm 2, as desired. Lastly, we return to prove Lemma C.4: Proof. We explicitly find the eigenvalues of a two-by-two matrix. For scalars a, b and matrix A = a b b 1 we have det(A λI) = (a λ)(1 λ) b2 = λ2 (a + 1)λ + a b2. Solving for det(A λI) = 0, we obtain eigenvalues λ = a + 1 p (a 1)2 + 4b2 Now, let us assume that b2 = (1 γ)a for some γ [0, 1]. Then using the first-order concavity of the square root (that is, that x + x + 2 x), we obtain 2λmin(A) = a + 1 p (a 1)2 + 4(1 γ)a = a + 1 p (a + 1)2 4γa 2γa Substituting a = S2m and b = Sm, which satisfy b2 a by Jensen s inequality, on the event that (1 γ)S2m Sm 2 we have λmin(Σ) γS2m 1+S2m . Noting that (1 γ)S2m Sm 2 if and only if γ 1 Sm 2 S2m gives the lemma. Online Conformal Prediction via Online Optimization C.5. Proof of Theorem 5.4 We first introduce some new notation regarding gradients for ease of presentation. We denote the conditional subgradient by gb,i(θ) θLb,i(θ) and the observed subgradient by ˆgb,i(θ) θℓquantile(Sb,i θ Zb,i), dropping the argument θ when it is clear from context. We denote the batch expected subgradient conditioned on previous batches, as gb = E[ 1 m Pm i=1 ˆgb,i | Fb 1,m], the batch empirical subgradient as ˆgb := 1 m Pm i=1 ˆgb,i, and define the difference zb = gb ˆgb. As in the proof of asymptotic consistency of Theorem 5.1, we start by establishing the connection between the conditional coverage guarantee (2) and the conformal predictor s parameter θ Θ. The following proposition states that the conditional coverage error can be bounded by the distance of θ to the optimal θ Θ that represents all conditional quantiles under well-specification. Proposition C.5. Let Sb,i | Fb,i 1 have density fb,i upper-bounded by u in a GD-neighborhood around its conditional quantile q b,i. Then, P(Yb,i Cb,i(Xb,i) | Fb,i 1) (1 α) u G θb θ . Deferring the proof of Proposition C.5 temporarily, we see that to prove Theorem 5.4 it only remains to show that Algorithm 1 s parameters θb Θ satisfy θb θ 2 O(log(B log(B)/δ)/bc) with high probability. We do this through the following two claims: (i) For any δ > 0 there exist recursive upper bounds θb+1 θ 2 q Pb i=1 Ab,i θi θ 2 + Rb that hold simultaneously for all b B with probability at least 1 δ. (ii) Under the event in (i), q Pb i=1 Ab,i θi θ 2 + Rb O(log(B log(B)/δ)/bc) by strong induction. Substituting our bound on θb θ 2, we get that for some constant Cb, which is a function of G, µmin, u , P(Yb,i Cb,i(Xb,i) | Fb,i 1) (1 α) Cb (log(B log(B)/δ)) Without batching, as in the statement of Theorem 5.4 for different constant C, a function of m, G, µmin, u , we can write P(Yt Ct(Xt) | Ft 1) (1 α) C (log(T log(T)/δ)) We now turn to the proof, which borrows several parts from Proposition 1 of Rakhlin et al. (2012). The novelty lies in the batch setting, the changing objective function, and the generalized learning rates. Demonstrating Claim (i). We start by unrolling one iteration of Algorithm 1, using the fact that θb+1 = ΠΘ(θb ηbˆgb): θb+1 θ 2 θb ηbˆgb θ 2 = θb θ 2 2ηb ˆgb, θb θ + η2 b ˆgb 2 θb θ 2 2ηb ˆgb, θb θ + 2ηb gb, θb θ 2ηb gb, θb θ + η2 b G2 = θb θ 2 2ηb 1 m i=1 E[ gb,i, θb θ | Fb 1,m] + 2ηb zb, θb θ + η2 b G2. This expression can now be upper bounded using the subgradient inequality gb,i, θb θ Lb,i(θb) Lb,i(θ ) and Lemma 5.2 due to well-specification: θb+1 θ 2 θb θ 2 2ηb 1 m i=1 E[(Lb,i(θb) Lb,i(θ )) | Fb 1,m] + 2ηb zb, θb θ + η2 b G2 θb θ 2 2ηb θb θ 2E[µb | Fb 1,m] + 2ηb zb, θb θ + η2 b G2. Online Conformal Prediction via Online Optimization We now use the assumption that E [µb | Fb 1,m] is uniformly lower bounded by µmin > 0 with probability 1 (as in Lemma 5.3), and our choice of step size ηb = 2c 1 bcµmin to argue that θb+1 θ 2 θb θ 2 2ηb θb θ 2µmin + 2ηb zb, θb θ + η2 b G2 = (1 2ηbµmin) θb θ 2 + 2ηb zb, θb θ + η2 b G2 c θb θ 2 + 2ηb zb, θb θ + η2 b G2. Unrolling this recursion until b = 2, we obtain: c 2ηi zi, θi θ + ic zi, θi θ + G2 where the first term vanishes since 2 i c = 1 at i = 2. Note that the sum in the second term can be bounded by 1 Lemma E.6 so that using the simplified notation fb(i) := Qb j=i+1 1 2 j c 1 ic , θb+1 θ 2 2c i=2 fb(i) zi, θi θ The upper bound in its current form does not match Claim (i) as it depends on zi, θi θ rather than the norms θi θ . We bridge this gap in the following proposition using the fact that ( ) is a martingale that we can bound with high probability. Proposition C.6. Let Mb(i) := fb(i) zi, θi θ and δ > 0. Then Mb(i) is a martingale difference sequence and with probability at least 1 δ i=2 Mb(i) 8G max i=2 fb(i)2 θi θ 2, Gfb(b) log B log(B) log B log(B) for all b B simultaneously. Proof. Our sequence Mb(i) forms a martingale difference sequence since E[ˆgi gi | Fi 1,m] = E j=1 ˆgi,j E j=1 ˆgi,j | Fi 1,m fb(i) is constant, and θi is measurable with respect to Fi 1,m. The conditional variance, using the fact that zi 2G, can be bounded by Var(Mb(i) | Fi 1,m) 4fb(i)2G2 θi θ 2 , so that the sum of the conditional variances from Mb(2) to Mb(b) satisfies i=2 Var(Mb(i) | Fi 1,m) 4G2 b X i=2 fb(i)2 θi θ 2 . We also have the uniform bound (since fb is increasing by arguments in the inductive step of Lemma E.4) |Mb(i)| fb(b)2G θi θ 2G2fb(b) Online Conformal Prediction via Online Optimization where the upper-bound θi θ G µmin follows from the fact that G θi θ gi θi θ gi, θi θ E [µi | Fi 1,m] θi θ 2 µmin θi θ 2 . (6) Now, we can apply Lemma E.2 from Bartlett et al. (2008) to bound the sum of martingale differences Pb i=2 Mb(i). As b increases from 2 to B, we have (B 1) martingales, which we will bound together with a union bound. As long as B 4 and δ (0, 1/e), with probability at least 1 Bδ , for all b B, i=2 Mb(i) 2 max v u u t4G2 b X i=2 fb(i)2 θi θ 2, 2G2fb(b) i=2 fb(i)2 θi θ 2, Gfb(b) Substituting δ = δ B yields the stated result. Summarizing, with the goal of establishing a O(log(B log(B)/δ)/bc) rate for θb θ 2, we have shown that for all b B, with probability at least 1 δ, µmin 8G max i=2 fb(i)2 θi θ 2, Gfb(b) log B log(B) log B log(B) log(B log(B)/δ) i=2 fb(i)2 θi θ 2 + 16G2 log(B log(B)/δ) µ2 minbc + G2 log(B log(B)/δ) i=2 fb(i)2 θi θ 2 + G2(16 log(B log(B)/δ) + 1) This is equivalent to Claim (i). Demonstrating Claim (ii). We now prove by strong induction that θb θ 2 a bc for suitably chosen a. This holds for b = 1, 2 if a 2G2 µ2 min , since that implies θb θ 2 G2 µ2 min , which holds by inequality (6). To show the inductive step, we write the above inequality as i=2 fb(i)2 θi θ 2 + y for x = 16G log(B log(B)/δ) µmin and y = G2(16 log(B log(B)/δ)+1) µ2 min . By the inductive hypothesis, θi θ 2 a ic for i = 1, . . . b. To show that θb+1 θ 2 a (b+1)c , it suffices to find a sufficiently large so that i=2 fb(i)2 a bc a (b + 1)c . By Lemma E.7, which bounds Pb i=2 fb(i)2 1 bc a (b + 1)c (x a + y) b + 1 Online Conformal Prediction via Online Optimization Using the fact that b+1 2 for b 2 and c [0, 1], then 2(x a + y) a 3x Solving the quadratic inequality, we get that 4 + 6y = 9x2 Substituting in the values of x and y, we get that a 576G2 log(B log(B)/δ) µ2 min + 3G2(16 log(B log(B)/δ) + 1) µ2 min = a G2(624 log(B log(B)/δ) + 3) Note that this satisfies the condition on a in the base case, that is a 2G2 µ2 min , since 3G2 µ2 min . This shows Claim (ii) and the first statement of the theorem. Lastly, we return to prove Proposition C.5: Proof. We use our assumption that St | Ft 1 has density ft upper-bounded by u in a GD neighborhood around the conditional quantile q t , so that P(Yb,i Cb,i(Xb,i) | Fb,i 1) (1 α) = P(qb,i(θb) Sb,i | Fb,i 1) P(qb,i(θ ) Sb,i | Fb,i 1) (Z qb,i(θ ) qb,i(θb) fb,i(x)dx, Z qb,i(θb) qb,i(θ ) fb,i(x)dx u |qb,i(θb) qb,i(θ )| C.6. Proof of Theorem 6.1 We first provide a roadmap for the proof, and then the details for each step. For our iterates θt to converge to θ which defines the best quantile predictor, we must first establish the existence and uniqueness of the minimizer θ . We will argue that under our assumptions, this is a consequence of the strong convexity of LΠ(θ) around its minimizer. Claim C.5. Let the assumptions of Theorem 6.1 hold. Then: (i) There exists θ Θ such that LΠ(θ ) LΠ(θ) for any other θ Θ. (ii) For any θ Θ and µ = pΠλ min 1, ε GD LΠ(θ) LΠ(θ ) µ so θ is unique. Claim C.5 not only states the existence and uniqueness of θ , but also the fact that LΠ behaves like a strongly convex function with respect to its minimizer. This allows us to bound the squared distance θt θ 2 using a recursion argument based on Proposition 1 of Rakhlin et al. (2012): Online Conformal Prediction via Online Optimization Claim C.6. The iterates θt of Algorithm 1 with m = 1 and ηt = 2c µtc satisfy c 2ηi(LΠ(θi) LΠ(θ ) ℓi(θi) + ℓi(θ )) c η2 i ˆgi 2 . To prove the first part of Theorem 6.1 it now suffices to control the expected value of the bound in Claim C.6. Deriving a bound for the rightmost term is straightforward from our Lipschitz assumption and the learning rate inequality in Lemma E.6. To control the remaining term we rely on the following lemma. Lemma C.7. Let {ν2, , νt} be any sequence of positive scalars and i=2 νi(LΠ(θi) LΠ(θ ) ℓi(θi) + ℓi(θ )), then for 0 < τ < t 2 + |νi νi+τ| + νiτηi G i=t τ+1 2νi The expectation is taken over the samples {(Si, Zi)}t i=1. Combining the bounds for both terms now yields the first part of Theorem 6.1, which we summarize below. Proposition C.8. Based on the bound in Claim C.6, the iterates θt of Algorithm 1 with m = 1 and ηt = 2c µtc satisfy E h θt+1 θ 2i C G2 τ (t + 1)c , for t [T + 1] and a universal constant C. The final step is to link this upper bound to the expected coverage gap via the following lemma. Lemma C.9. Let St | Ft 1 have density ft upper-bounded by u in a ε-neighborhood around Z t θ . Then, E [|P (Yt / Ct(Xt) | Ft 1) αt|] G E h θt+1 θ 2i . We now provide proofs for each of these results, with the proof of Lemma C.7 deferred to the end of the section. Demonstrating Claim C.5. Part (i) is a straightforward consequence of the extreme value theorem since Θ is closed and bounded and LΠ(θ) is continuous, so there exists θ Θ such that LΠ(θ ) LΠ(θ) for any other θ Θ. Part (ii) requires a more detailed analysis that we split into 3 sub-parts: (ii)(a) The Hessian 2LΠ(θ) exists for all θ B2(θ , ε (ii)(b) LΠ is pΠλ-strongly convex in B2(θ , ε G) since for any θ in this set the Hessian satisfies 2LΠ(θ) pΠλI. (ii)(c) Part (b) and the finite diameter of Θ imply part (ii) of Claim C.5. Online Conformal Prediction via Online Optimization Demonstrating (ii)(a). Recall that by the chain rule for subdifferentials LΠ(θ) = EΠ Z(P S Z θ | Z (1 α)) . For any z and i [d] we now define the auxiliary function gi(θ; z) = zi(P S z θ | Z = z (1 α)), so that for any j [d] if we now let πz be the continuous conditional density of S | Z = z, then the derivative = πz(z θ)zizj, exists by the chain rule and the fundamental theorem of calculus. We now note that for any θ B2(θ , ε 1. |zi| G < so gi(θ; z) is an integrable function of z. 2. |zizj| G2 < and πz is upper bounded by uΠ over {s = z θ : θ B2(θ , ε with probability 1 and G2uΠ < is an integrable function of z. These results allow us to swap expectation and differentiation to conclude that for any θ B2(θ , ε 2LΠ(θ) = EΠ πZ(Z θ)ZZT . Demonstrating (ii)(b). Armed with our proof of (ii)(a) and an explicit expression for 2LΠ(θ), the result (ii)(b) now follows from the fact that for any θ in this set πZ(Z θ) pΠ and 2LΠ(θ) pΠλI. We have thus shown that LΠ(θ) is µ-strongly convex for µ = pΠλ on B2(θ , ε Demonstrating (ii)(c). If ε GD then LΠ is µ-strongly convex for all Θ so the claim immediately holds. If ε < GD for any θ Θ and β [0, 1] we can define θ = βθ + (1 β)θ Θ so that θ θ = β θ θ βD, and choosing β = ε GD (0, 1) now ensures that θ B2(θ , ε G). We now know that for any other θ = θ the convex optimality conditions indicate that LΠ(θ ), θ θ 0 and by the convexity of LΠ(θ) LΠ(θ) LΠ(θ ) 1 LΠ( θ) LΠ(θ ) LΠ(θ ), θ θ + pΠλ 2 β θ θ 2 . This implies that θ is unique since LΠ(θ) > LΠ(θ ) for θ = θ . Online Conformal Prediction via Online Optimization Demonstrating Claim C.6. To obtain our upper bound we define ˆgt lt(θt) as our empirical subgradient for time t and turn to the standard recursive relationship θt+1 θ 2 θt θ 2 2ηt ˆgt, θt θ + η2 t ˆgt 2 2 . We can now use the convexity of ℓt to conclude that θt+1 θ 2 θt θ 2 2ηt(ℓt(θt) ℓt(θ )) + η2 t ˆgt 2 2 = θt θ 2 2ηt(LΠ(θt) LΠ(θ )) + 2ηt(LΠ(θt) LΠ(θ ) ℓt(θt) + ℓt(θ )) + η2 t ˆgt 2 2 . Recall that by Claim C.5 if µ = pΠλ min ε GD, 1 then LΠ(θ) LΠ(θ ) µ θt+1 θ 2 (1 µηt) θt θ 2 + 2ηt(LΠ(θt) LΠ(θ ) ℓt(θt) + ℓt(θ )) + η2 t ˆgt 2 2 . For arbitrary t 2 we now set ηt = 2c µtc and unroll the inequality to obtain c 2ηi(LΠ(θi) LΠ(θ ) ℓi(θi) + ℓi(θ )) c η2 i ˆgi 2 , where the first term vanishes since (1 2c/ic) = 0 at i = 2. Demonstrating Proposition C.8. We can obtain a bound for the leftmost term in Claim C.6 by applying Lemma C.7 with and using the learning rate inequalities in Lemmas E.4, E.5, E.6, and E.8 to bound each individual term. After some algebra, this yields the upper bound E [A] GDβ(τ) and choosing τ = τβ(P, (t + 1) c) E [A] GD µ(t + 1)c + 28τGD To handle the remaining term, we see that by the learning rate inequality in Lemma E.6 µ2i2c E h ˆgi 2i 4G2 and thus since t+1 E h θt+1 θ 2i GD µ(t + 1)c + 56τGD µ(t + 1)c + 18τG2 τ (t + 1)c . The value t 2 was chosen arbitrarily so the result holds for any 3 t + 1 T + 1. Moreover, the inequality (6) in the proof of Theorem 5.4 also holds in this case, with the corresponding µ, so θi θ 2 GD/µ and the bound is also valid for t + 1 = 1, 2. Online Conformal Prediction via Online Optimization Demonstrating Lemma C.9. We start by using our local upper bound on the conditional density to upper bound the coverage gap Rt = |P (Yt Ct(Xt) | Ft 1) (1 αt)| = |P (St qt(θt) | Ft 1) P (St qt(θ ) | Ft 1) | Z (1St [qt(θ ),qt(θt)] 1St [qt(θt),qt(θ )])d PSt|Ft 1 G (1St [qt(θ ),qt(θt)] 1St [qt(θt),qt(θ )])d PSt|Ft 1 G + u |qt(θt) qt(θ )| G + u G θt θ . We now take the expectation of Rt to conclude that E [Rt] P θt θ > ε + u GE [ θt θ ] , and by Markov s inequality, ε + u G E [ θt θ ] . Finally, applying Jensen s inequality E h θt θ 2i , which proves the desired result. Demonstrating Lemma C.7. The proof of this lemma follows from a similar decomposition as in the proof of Theorem 3.1 in Duchi et al. (2012), so we first introduce the following two lemmas adapted from their analysis. Lemma C.10. [Adapted from Lemma 6.3 in Duchi et al. (2012)] Let our ergodicity assumption hold, let our conformal models be G-Lipschitz, and let Θ have finite diameter D. Then for θ m(Fi), t > τ 0, i t τ, and any θ Θ E [LΠ(θ) LΠ(θ ) ℓi+τ(θ) + ℓi+τ(θ ) | Fi] GD P i+τ [i] Π TV Proof. Note that E [LΠ(θ) LΠ(θ ) ℓi+τ(θ) + ℓi+τ(θ ) | Fi] = Z (ℓ(s z θ) ℓ(s z θ ))dΠ(z, s) Z (ℓ(s z θ) ℓ(s z θ ))d P i+τ [i] (z, s), and since we assume that Π and P i+τ [i] have densities with respect to the Lebesgue measure λ on Rd+1 |E [LΠ(θ) LΠ(θ ) ℓt+τ(θ) + ℓt+τ(θ ) | Ft]| = Z (ℓ(s z θ) ℓ(s z θ ))(π(z, s) pi+τ [i] (z, s))dλ(z, s) GD Z |π(z, s) pi+τ [i] (z, s)|dλ(z, s) = GD P i+τ [i] Π TV . Online Conformal Prediction via Online Optimization Lemma C.11. [Adapted from Lemma 6.4 in Duchi et al. (2012)] Let our conformal models be G-Lipschitz and let {ηt}t be a non-increasing sequence of step sizes. For t > τ 0 and i t τ the iterates produced by Algorithm 1 with batch size 1 satisfy |ℓi+τ(θi) ℓi+τ(θi+τ)| τηi G2. Proof. Recall that r=i ηr ℓr(θr), and since the learning rate schedule {ηt} is non-increasing r=i ηr ℓr(θr) r=i ηr ℓr(θr) Finally, using our Lipschitz assumption once again |ℓi+τ(θi) ℓi+τ(θi+τ)| τηi G2. We now note that we can decompose A into the following 5 terms i=2 νi (LΠ(θi) LΠ(θ ) ℓi+τ(θi) + ℓi+τ(θ )) i=2 νi+τ(ℓi+τ(θi) ℓi+τ(θi+τ)) i=2 (νi νi+τ)(ℓi+τ(θi) ℓi+τ(θ )) i=2 νi (ℓi(θ ) ℓi(θi)) i=t τ+1 νi(LΠ(θi) LΠ(θ ) ℓi(θi) + ℓi(θ )), so that A = P5 k=1 Ak. We will now bound the expectation of each of these terms individually. Firstly, we observe that i=2 νi E [(LΠ(θi) LΠ(θ ) ℓi+τ(θi) + ℓi+τ(θ ))] i=2 νi E [E [(LΠ(θi) LΠ(θ ) ℓi+τ(θi) + ℓi+τ(θ )) | Fi]] , and apply Lemma C.10 to obtain i=2 νi GDβ(τ) Online Conformal Prediction via Online Optimization For A2 we note that by Lemma C.11 i=2 νi+ττηi G2, so E [A2] must also be upper bounded by this quantity. The upper bounds for the remaining 3 terms follow from the triangle inequality, our Lipschitz assumption, and our diameter bound i=2 GD|νi νi+τ| i=t τ+1 2νi GD. Summing up all the stated upper bounds yields the result. D. Justifying the quantile loss Here we provide some intuition to justify why the quantile loss is an effective metric for evaluating and comparing online conformal algorithms. We argue that the minimizers of the quantile loss exhibit desirable properties in both mis-specified ergodic and adversarial settings, as discussed in Sections D.1 and D.2, respectively. In addition to these results, we also point out that the quantile loss is often quite natural for the online conformal task. Specifically, it grows linearly as the predicted quantile deviates further from the observed score St, and its asymmetric nature penalizes mis-coverage, where an actual error occurs, more heavily than over-coverage, where the conformal set is simply less conservative than necessary, as long as α < 0.5. D.1. Properties of the quantile loss in the mis-specified ergodic setting In Section 6, we argued that if our quantile models are linear and the covariate and score pairs {(Zt, St)} are produced by an ergodic process P converging to a stationary distribution Π, then the iterates of Algorithm 1 with batch size m = 1 converge to θ minimizing LΠ, the expected quantile loss with respect to the stationary distribution. This implies that the coverage properties of our algorithm are determined by αt = P St > Z t θ | Ft 1 . To argue that these αt often provide good coverage in the mis-specified setting, we consider the case where P St > Z t θ | Ft 1 = P St > Z t θ | Zt . This holds, for example, when the distribution of St only depends on the previous q scores and we use an LQT model of order p q. In this case we can provide an alternative characterization of the stronger guarantee (2), since we assume that Π is the stationary distribution of our process αt(z) := P St > z θ | Zt = z = EΠ 1{S>z θ } | Z = z , so that αt = α σ(Z) if and only if sup V L2(Ω,σ(Z),ΠZ) EΠ V (α 1{S>Z θ }) = 0. In the mis-specified setting, our predictor will not be able to attain conditional coverage, but may be able to attain the weaker condition of coverage with respect to a family of witnesses V L2(Ω, σ(Z), ΠZ): sup V V EΠ V (α 1{S>Z θ }) = 0. We will now argue that these αt provide good coverage within our class of conformal predictors since for any other θ we can find a witness function V defined by our features Z such that the sets obtained from qt(θ) have strictly worse coverage than qt(θ ) relative to V . Online Conformal Prediction via Online Optimization 2 3 4 5 6 7 8 9 10 Maximum degree of mis-specified model (p) Average over random Figure 4. Comparison of attained Lcov for θcov, θ , and the average value of Lcov over 10,000 randomly chosen θ. Theorem D.1. Let the assumptions in the previous paragraph and those of Theorem 6.1 hold. Then, for any θ Θ \ {θ } there exists β(θ) Rd such that the witness V = β(θ) Z satisfies EΠ V (α 1{S>Z θ}) > EΠ V (α 1{S>Z θ }) 0. Proof. Recall from our proof of Theorem 6.1 that LΠ(θ) = EΠ ℓquantile(S Z θ) has gradient L(θ) = EΠ Z(α 1{S>Z θ}) , and that by the convex optimality conditions LΠ(θ), θ θ LΠ(θ) LΠ(θ ) LΠ(θ ), θ θ + µ 2 θ θ 2 2 > LΠ(θ ), θ θ 0. This immediately implies that for β(θ) = (θ θ ) and V = β(θ) Z EΠ V (α 1{S>Z θ}) > EΠ V (α 1{S>Z θ }) 0. Note that the proof of Theorem D.1 also shows that when θ Int(Θ), the model defined by θ is the only one to achieve coverage relative to our features Zi since L(θ ) = EΠ Z(α 1{S>Z θ }) = 0. This result illustrates why minimizing the quantile loss is desirable, especially when θ Int(Θ), as the minimizer obtains the best coverage relative to our features. We can also verify that αt provide good coverage in synthetic experiments where the stationary distribution Π is known. For our experiment we consider the standard AR(1) model with standard Gaussian i.i.d. innovations and parameter 0.5, and use a mis-specified linear model with feature vector Φt(St 1 1 ) = [1, S2 1, , Sp 1] for p 2. In this case, a natural way to measure coverage of our model is through Lcov(θ) = EΠ P S > Z θ | Z α the expected absolute difference between the conditional mis-coverage of our model and α. We now show the values of Lcov attained by the minimizer θ of LΠ, the minimizer θcov of Lcov, and the average value of Lcov for 10,000 randomly chosen θ, in Figure 4. Our results suggest that even under the criterion Lcov the θ minimizing the quantile loss provides good coverage comparable to θcov, which is significantly better than the coverage attained by most other parameter values. D.2. Properties of the quantile loss in the adversarial setting In many cases, we can interpret online conformal prediction algorithms from online convex optimization perspective, where we analyze the performance of an algorithm A producing iterates θt optimizing a sequence of convex functions ft : Θ R in terms of static regret with respect to some set Θ Θ RT (A, {ft}, Θ) = t=1 ft(θt) inf θ Θ In this context, by the argument in Theorem 3.1 of (Hazan, 2016) Algorithm 1 (A1) also satisfies the following regret bound. Online Conformal Prediction via Online Optimization Theorem D.2. Let ℓt be convex and G-Lipschtiz for every t N, and let Θ have finite diameter D. Then the iterates {θt}t produced by Algorithm 1 with non-increasing learning rates {ηt}t and m = 1 satisfy RT (A1, {ℓt}, Θ) D2 ηT + G2 T X We now note that by Theorem 4.1, when scores are bounded Algorithm 1 also achieves small constraint violation with respect to the constraint functions gt(θ) = 1qt(θ) 0 and m = 1 provides a solution to the constrained non-convex online optimization problem min θ1, ,θT Θ t=1 ℓt(θt) min θ ΘT t=1 gt(θt) 0, having sublinear regret RT (A1, {ℓt}, ΘT ) = O(T 1/2+ϵ) and constraint violations PT t=1 gt(θt) O(T 1/2+ϵ), simultaneously. Note that any algorithm with sublinear constraint violations will satisfy t=1 errt α o(1), which is equivalent to providing at least (1 α) adversarial marginal coverage in the long-run. This naturally prompts the question of whether we can design algorithms with at least (1 α) adversarial marginal coverage and similarly small regret RT (A1, {ft}, ΘT ) for an arbitrary set of convex functions {ft} chosen by the user. The following Lemma, which follows from a simple adaptation of the argument in Proposition 4 of Mannor et al. (2009), points out that this is impossible in general even when objective and constraint functions are linear. We provide a proof of this fact in Section D.3 for convenience of the reader. Theorem D.3. Let Θ have finite diameter D. Then there exist sequences of linear functions {ft}, {gt} such that any algorithm A attaining t=1 gt(θt) 0, also suffers RT (A, {ft}, ΘT ) and it is thus impossible to attain sublinear regret and constraint violations simultaneously. However, we believe the design of online conformal algorithms that allow the user to trade off between coverage and regret with respect to some arbitrary sequence of convex loss functions {ft} is a promising area of future research. D.3. Proof of Theorem D.3 This proof closely follows the proof of Proposition 4 of Mannor et al. (2009); we only adapt it to our online convex optimization setup for the convenience of the reader. To prove this result, we first define the behavior of our adversary and then show how this choice makes it impossible to attain sublinear regret and constraint violations simultaneously. Online Conformal Prediction via Online Optimization Defining the adversary. Let (v, w) argmaxv,w Θ v w be the two furthest points in our set so that v w = D < , and consider a setting where the behavior of our adversary is dictated by a sequence of indices jt so that ( 1 jt = 1 1 D2 (w v) (θ v) jt = 2, and ft(θ) = 1 1 D2 (w v) (θ v). For any θ Θ D2 w v θ v 0, and ft(w) = 0, so that w minimizes ft for any t. Let ˆq T = 1 T PT t=1 1jt=2 then if ˆq T 1/2 we could choose θ = w (in hindsight) to minimize all ft(θ) while satisfying the average constraint. We now let our adversary initialize a counter k = 0 and pick the indices jt using the following two steps: 1. While k = 0 or 1 D2(t 1) Pt 1 r=1(w v) (θr v) > 1 2 pick jt = 2 and increment k by 1. 2. For the next k time indices jt = 1. Then, reset counter k to 0 and go back to step 1. Proving the impossibility result. We will now show that any algorithm that has sublinear constraint violations against this adversary cannot attain sublinear regret. To do so, we show that: (i) The adversary executes both steps infinitely often. (ii) Any algorithm attains at least linear regret when the adversary finishes executing step 2. Demonstrating claim (i). We will prove this result by contradiction, so we first assume that step 2 occurs only a finite number of times. This implies that after a finite number of time steps, the adversary only chooses jt = 2 so that ˆq T 1. However, our sublinear constraint t=1 gt(θt) 0 implies that the condition 1 D2(t 1) Pt 1 r=1(w v) (θr v) > 1 2 must be violated at some point, triggering step 2. This result implies that both steps occur infinitely often, so we can find infinitely many ti < t i < ti+1 such that the adversary plays step 1 for t (ti, t i] and step 2 for t (t i, ti+1]. Demonstrating claim (ii). Since both steps are equally long we know that ˆqti = 1 r=1 fr(θ) = 0. Observe that by construction ti+1 t i t i so that t i ti+1 r=1 (w v) (θr v) 1 ti+1 t i 2 + (ti+1 t i) This implies that Rti+1(A, {ft}, Θti+1) ti+1 = 1 1 D2ti+1 r=1 (w v) (θr v) 1 Online Conformal Prediction via Online Optimization and also the desired result RT (A, {ft}, ΘT ) E. Auxiliary results E.1. Existing Theorems and Lemmas In this section we compile several known results that we use in our proofs for the convenience of the reader. The first result is a useful tool for asymptotic convergence proofs commonly known as the Robbins Siegmund quasimartingale convergence Theorem. We apply this result in our proof of Theorem 5.1. Theorem E.1. [Adapted from Robbins & Siegmund (1971)] Let (Ω, F, P) be a probability space and F1 F2 be a sequence of sub-σ-algebras of F. For each n = 1, 2, let Vn, Dn, Un, and βn be non-negative Fn-measurable random variables such that E [Vn+1 | Fn] Vn(1 + βn) Dn + Un, then limn Vn exists and is finite and P n=1 Dn < almost surely on The second result is a concentration inequality for martingale difference sequences from Bartlett et al. (2008) and is used to derive the high-probability bounds in Theorem 5.4. Lemma E.2. [Copied from Bartlett et al. (2008)] Let d1, . . . , d T be a martingale difference sequence with a uniform bound |di| b for all i. Let V = PT t=1 Vart 1(dt) be the sum of conditional variances of dt s. Further, let σ = V . Then we have, for any δ < 1/e and T 4, t=1 dt > 2 max n 2σ, b p log(1/δ) o p The final result provides some lower bounds on the expected quantile loss gap between some q R and the true quantile q , in terms of the distance between q and q . We use this lemma for our proofs of the results in Section 5 and 6. Lemma E.3. Let S have a positive continuous density f, lower bounded by p > 0 in an ε-neighborhood around its unique (1 α) quantile q . Then, (i) For q such that |q q | ε, E[ℓquantile(S q) ℓquantile(S q )] p (ii) For q such that |q q | > ε, E[ℓquantile(S q) ℓquantile(S q )] pε (iii) For q such that |q q | K with K (ε, ), E[ℓquantile(S q) ℓquantile(S q )] pε 2K |q q |2. Proof. We dedicate a section to the proof of each lower bound. Online Conformal Prediction via Online Optimization Demonstrating (i). We only prove (i) for q q as the other direction follows by an identical argument. We temporarily adopt the notation L(q) := E[ℓquantile(S q)] and recall that qℓquantile(S q) = ( α 1S q>0 S = q [α 1, α] S = q. The subgradient of the expected quantile loss L is thus q L(q) = q E[ℓquantile(S q)] = E[ qℓquantile(S q)] = E[α 1S q>0] = P(S q) (1 α), since we assume that S has a density. This is single valued so we can conclude that L(q) is differentiable and L (q) = P(S q) (1 α). Note that P(S q ) = (1 α) since f is continuous, and for q [q , q + ε], L (q) = L (q) L (q ) = P(S q) P(S q ) = Z q q f(s)ds p(q q ) . This now implies that L(q) L(q ) = Z q q L (z)dz p q2 2 + q 2 = p Demonstrating (ii). We once again only prove (ii) for q q as the other direction follows by an identical argument. In this case, q > q := q + ε so that L(q) L(q ) = L(q) L(q ) + L(q ) L(q ) L (q )(q q ) + p 2(q q )2 (convexity of L; lower bound for q [q , q + ε]) pε(q q ) + pε 2 (q q ) (q q = ε) as desired. Demonstrating (iii). This follows immediately from the previous bound by noting that if |q q | ε then L(q) L(q ) p 2|q q |2 pε 2K |q q |2, and if |q q | > ε L(q) L(q ) pε 2K (q q )2. E.2. Learning rate inequalities The proofs of Theorems 5.4 and 6.1 rely on multiple inequalities bounding functions of the products Qt j=i+1 1 2 j c for 2 i t, so we present and prove them in this section. In standard analyses of strongly convex stochastic gradient descent, the learning rates are of the form Θ(1/tc) for c = 1, which simplifies the analysis of these products. Lemma E.4. For any t 2, i [2, t] and c [0, 1], Proof. We prove the statement by induction on i = t, . . . , 2. Online Conformal Prediction via Online Optimization Setting up the induction. Let and note the recursive relationship ft(i 1) = 1 2 (i 1)c ft(i). (7) Our goal is now to show that ft(t) 1 tc for the base case, and given ft(i) 1 tc then ft(i 1) 1 tc for the inductive step. Base case. Observe that ft(t) = 1 tc , so the inequality immediately holds. Inductive step. We now move to the inductive step, where by our previous relationship (7) and the inductive hypothesis ft(i 1) = 1 2 (i 1)c ft(i) proving the desired result. Lemma E.5. For t 2 and c [0, 1], t X Proof. We prove the statement by induction on t 2. Setting up the induction. Let and note the recursive relationship We already know that X2 = 2 c for the base case, so it remains to show that if Xt 2 c, then Xt+1 2 c. Inductive step. We now move to the inductive step where, by our previous relationship (8) and the inductive hypothesis Xt+1 = 1 2 t + 1 c Xt + 1 (t + 1)c 2c + 1 (t + 1)c proving the desired result. Online Conformal Prediction via Online Optimization Lemma E.6. For t 1 and c [0, 1], t X Proof. We prove the statement by induction on t 1. Setting up the induction. Let and note the recursive relationship We already know that X1 = 0 1 for the base case, so our goal is to show that if Xt 1 (t 1) c then Xt t c. Inductive step. For t 2, using the recursive relationship (9) and the inductive hypothesis, c 1 (t 1)c + 1 t2c tc 2c + 1 We now define g(t) := tc (t 1)c and note that g (t) = c(tc 1 (t 1)c 1) 0 so g(t) is decreasing with g(t) g(2) = 2c 1, and Xt tc (t 1)c 2c + 1 (t 1)c + 1 1 tc = g(t) 2c + 1 (t 1)c + 1 1 proving the desired result. Lemma E.7. For t 1 and c [0, 1], Proof. We prove the statement by induction on t 1. Setting up the induction. Let and note the recursive relationship c 2 Xt 1 + 1 We already know that X1 = 0 1 for the base case so our goal is to show that if Xt (t 1) 2c then Xt t 2c. Inductive step. For t 2, using the recursive relationship (10) and the inductive hypothesis, c 2 1 (t 1)2c + 1 Online Conformal Prediction via Online Optimization We now recall from the last step of the proof of Lemma E.6 that (t 1)c 1 1 (t 1)c 1, so that tc 2c (t 1)c 2 tc 2c 1 1 (t 1)c + 1 proving the desired result. Lemma E.8. For any t 3, τ [t 2] and c [0, 1] c 1 (i + τ)c Proof. Note that so applying Lemma E.9, 3τ (i + τ)2c . We finally note that and applying the bound in Lemma E.6, Lemma E.9. For any t 2 and τ 1, 3τ (t + τ)c . Proof. For every t 2 we will use induction on τ 1 to show that the stated bound holds. Online Conformal Prediction via Online Optimization Setting up the induction. Note that we can express using the telescoping product We will now show that for any t 2, then |1 Rt,1| 3 (t+1)c for the base case, and if |1 Rt,τ| 3τ (t+τ)c then |1 Rt,τ+1| 3(τ+1) (t+τ+1)c for the inductive step. Base case. For the base case with τ = 1, observe that 1 Rt,1 = 1 1 + 1 so we can argue that tc = (t + 1)c (t + 1)c 3 (t + 1)c , Rt,1 1 1 + 1 tc 3 (t + 1)c . This implies that the condition holds for the base case since |1 Rt,1| 3 (t + 1)c . Inductive step. We now move to the inductive step and note that 1 Rt,τ+1 = 1 (t + τ + 1)c 2c + (t + τ + 1)c 2c (t + τ)c [1 Rt,τ] , and by the triangle inequality |1 Rt,τ+1| 1 (t + τ + 1)c 2c + (t + τ + 1)c 2c (t + τ)c |1 Rt,τ|. Focusing on the first term we can apply the same logic as before to conclude that 1 (t + τ + 1)c 2c (t + τ)c 2c (t + τ + 1)c 2c (t + τ)c 1 c t + τ 2c Online Conformal Prediction via Online Optimization or equivalently 1 (t + τ + 1)c 2c We now let gt+τ(c) be as defined in Lemma E.10 and use the inductive hypothesis on the second term to conclude that |1 Rt,τ+1| 2c (t + τ)c + (t + τ + 1)c 2c 3τ (t + τ)c = t + τ + 1 (t + τ + 1)c + gt+τ(c) 3τ (t + τ + 1)c . We now apply Lemma E.10 to conclude that |1 Rt,τ+1| 3 (t + τ + 1)c + 3τ (t + τ + 1)c 3(τ + 1) (t + τ + 1)c , proving the desired result. Lemma E.10. For u 3 and c [1/2, 1] the function gu(c) := (u + 1)c 2c satisfies gu(c) gu(1) 1. Proof. By computation we find that the derivative of our function at c [1/2, 1] is g u(c) = (u + 1)c 2(u + 1)c log u + 1 2c log 2(u + 1) Note that since u 3 we have u+1 u > 1 and 2(u+1) 9 < 1, so that for any u, g u(c) > 0 and the function gu(c) is increasing. Therefore, for any u, gu(c) gu(1) = (u + 1)2 2(u + 1)