# Conformal Predictions under Markovian Data

Frédéric Zheng 1   Alexandre Proutière 1

Abstract

We study the split Conformal Prediction method when applied to Markovian data. We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data). This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as √(t_mix ln(n)/n) (where t_mix is the mixing time of the chain). We also derive upper bounds on the impact of the correlations on the size of the prediction set. Finally, we present K-split CP, a method that consists in thinning the calibration dataset and that adapts to the mixing properties of the chain. Its coverage gap is reduced to t_mix/(n ln(n)) without really affecting the size of the prediction set. We finally test our algorithms on synthetic and real-world datasets.

1 Division of Decision and Control Systems, EECS, KTH and Digital Futures, Stockholm, Sweden. Correspondence to: Frédéric Zheng.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Machine Learning (ML) algorithms, and in particular those based on deep neural networks, are increasingly used by practitioners in critical decision processes. While the resulting models often achieve unprecedented levels of prediction performance, they most often come without any guarantees, which can be too hazardous in application fields where safety is crucial (including health care, self-driving vehicles, etc.). Introduced and developed by Vovk, see (Vovk et al., 2005; Gammerman & Vovk, 2007; Shafer & Vovk, 2008), Conformal Prediction (CP) is a robust and flexible tool to quantify and handle the inherent uncertainty of these models. The CP framework allows us to build, from any black-box prediction model, a new model whose output is a prediction set with a guaranteed level of certainty. CP has become very popular recently, see (Lei & Wasserman, 2014; Lei et al., 2018; Romano et al., 2019; Angelopoulos & Bates, 2023) and references therein, and has been applied to various learning tasks, including regression and classification.

CP typically works as follows. Consider a model ˆµ fitted to a training dataset, with input (or covariate) X and output ˆµ(X). Based on this model and on a calibration dataset of n (covariate, response) pairs {(X_t, Y_t)}_{t=1}^n, CP builds, for a given certainty level 1 − α, a set C_n(X_{n+1}) of most likely responses. This set is chosen with size as small as possible while ensuring the following coverage guarantee: P[Y_{n+1} ∈ C_n(X_{n+1})] ≥ 1 − α.

The coverage guarantees achieved using CP crucially require the calibration data to be sampled independently and identically (i.i.d.), or at least to be exchangeable. This assumption does not hold in many real-world datasets, where samples may indeed be highly correlated. Such correlations arise naturally in learning tasks pertaining to time series (e.g., predicting the evolution of market behavior) (Gibbs & Candes, 2021; Zaffran et al., 2022) or to more general dynamical systems (e.g., in Reinforcement Learning or adaptive control) (Foffano et al., 2023; Dixit et al., 2023). In this paper, we aim at studying CP methods when the correlations in the data are modelled as a Markov chain ({(X_t, Y_t)}_{t≥1} is a Markov chain).
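As a concrete instance of such data (made precise in Example 1 of Section 3), one can take covariates that form a Markov chain and responses drawn conditionally on the current covariate. A minimal generative sketch in Python, with illustrative names and parameter choices of our own, is:

```python
import numpy as np

def sample_markov_pairs(T, w=20, seed=0):
    """Sample T pairs (X_t, Y_t): X is a lazy random walk on {0, ..., w-1}
    (a Markov chain), and Y_t is drawn conditionally on X_t only, so that
    the joint process {(X_t, Y_t)} is itself a Markov chain."""
    rng = np.random.default_rng(seed)
    X = np.empty(T, dtype=int)
    X[0] = rng.integers(w)
    for t in range(1, T):
        step = rng.choice([-1, 0, 1], p=[0.25, 0.5, 0.25])   # lazy walk transition
        X[t] = (X[t - 1] + step) % w
    mu = lambda x: np.sin(2 * np.pi * x / w)                  # unknown regression function
    Y = mu(X) + 0.1 * rng.normal(size=T)                      # noise independent across time
    return X, Y
```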
This model is general and includes the classical scenario where the successive covariates form a Markov chain and where the response Y_t remains independent of {(X_s, Y_s)}_{s≠t} (Bresler et al., 2020; Oliveira et al., 2022). For this model, we address the following questions. (a) How are the coverage and the prediction set size affected by the correlations in the data? (b) How can we adapt the CP methods to minimize the impact of these correlations?

Contributions.

1. We first provide a theoretical analysis of the marginal coverage and of the prediction set size under the classical split CP method in the case of Markovian data. The additional coverage gap due to the correlations depends on the mixing properties of the underlying Markov chain, and we establish that under generic ergodicity assumptions, this gap scales at most as √(t_mix ln(n)/n), where n is the size of the calibration dataset and t_mix denotes the mixing time of the Markov chain. We further show that typically, the increase in the size of the prediction interval due to the correlations does not exceed √(t_mix ln(n)/n).

2. We then investigate the idea of thinning the calibration dataset so as to alleviate the impact of correlations. We refer to the resulting method, in which only one in K samples of the calibration dataset is kept, as K-split CP. We optimize the value of K to achieve an efficient trade-off between coverage and size of the prediction set. The optimal value depends on the mixing time of the Markov chain, and can be estimated in an online manner without really impacting the coverage. K-split CP improves the coverage gap, which now scales as t_mix/(n ln(n)), and has very little impact on the size of the prediction interval.

3. Our theoretical results are confirmed using numerical experiments, both on synthetic and real-world data (e.g., for the prediction of the EUR/SEK exchange rate).

2. Related Work

Extending the CP framework to non-exchangeable data has been considered in the literature, e.g., to model distribution shifts (Tibshirani et al., 2019; Barber et al., 2023). The work (Oliveira et al., 2022) is close to ours. There, the authors investigate split CP for dependent data satisfying a set of concentration-type assumptions, such as stationary β-mixing chains. They apply concentration inequalities to an independent calibration dataset and come back to the original trajectory by adding an additional β coefficient. Our analysis of split CP is inspired by their work. However, we leverage the Markovian nature of the data and in particular concentration results available for Markov chains. The key differences between (Oliveira et al., 2022) and our work are: 1) our analysis can be conducted without the stationarity assumption (which we believe is very restrictive); 2) we get more explicit, more generic and simpler upper bounds on the coverage gap for Markovian data; 3) we present an analysis of the size of the prediction interval; 4) we present and investigate the idea of thinning the calibration dataset.

A different line of research consists in adjusting the empirical quantile level to account for the possible undercoverage due to the correlations. For example, the authors of (Gibbs & Candes, 2021; Zaffran et al., 2022) propose that for each new observed covariate X_n, the quantile level α_n used to compute the next CP interval is modified (α_n is updated following a classic stochastic approximation scheme).

3.
Preliminaries In this section, we first introduce our models, and provide a few existing results on Markov chains and β-mixing stochastic processes that will be useful in our analysis1. We then recall the classical Conformal Prediction framework for i.i.d. or exchangeable data. 1Refer to (Meyn & Tweedie, 1993) for a more extensive exposition on Markov chains with general state space, and to (Levin & Peres, 2017) for mixing properties of Markov chains. 3.1. Markovian data We assume that {(Xt, Yt)}t 1 N 2 is an homogeneous Markov chain with kernel P. The covariates {Xt}t 1 N take values in X Rd, and the responses {Yt}t 1 N in Y Rr. We assume that the Markov chain is ϕ-irreducible, aperiodic, and admits a stationary distribution π. We define Zt = (Xt, Yt) and Z = X Y. Example 1. An important example where Z = {Zt}t 1 N is indeed a Markov chain is the case studied in (Bresler et al., 2020) where X = {Xt}t 1 N is an homogeneous Markov chain of kernel PX, and where given Xt = x, the response Yt is independent of {Zs}s =t and of distribution b(x, ). For instance, one may assume that Yt = µ(Xt) + εt where µ is an unknown function and the noise process {εt}t 1 N is i.i.d. and independent of {Xt}t 1 N. The kernel of Z is simply PX(x, dx )b(x , dy) and its stationary distribution πX(dx)b(x, dy) where πX is that of X. Our results apply to this example but to the more general case where Z is a Markov chain. Mixing time. The mixing time of the Markov chain quantifies the time it takes to approach its steady state. When its transition kernel is P, the mixing time is defined as tmix = τ( 1 4) where for any ε > 0, τ(ε) = min{t 1 : supz ||P t(z, .) π||T V ε}. Geometric ergodicity. Consider a ϕ-irreducible and aperiodic Markov chain with stationary distribution π. The chain is geometrically ergodic (Roberts & Rosenthal, 1997) if there exists a constant 0 ρ < 1, referred to as the rate of the chain, such that for π-a.e. z Z, there exists Q(z) < with for all n 1, P n(z, .) π T V Q(z)ρn. We know (Nummelin & Tuominen, 1982) that the function Q can be chosen so that it is π-integrable ( R Z Q(z)π(dz) < ). The chain is uniformly geometrically ergodic if the constant Q(z) is independent of z. Note that if the state space Z is finite, an aperiodic and irreducible Markov chain is uniformly geometrically ergodic (Levin & Peres, 2017). Also observe that in the case of Example 1, if X is geometrically ergodic with rate ρ, then so is Z with the same rate. Connection between ρ and tmix. These two quantities, involved in the statement of our results, are closely linked in the case of uniformly geometric Markov chains, as shown in (Paulin, 2015) (see Proposition 3.4). Indeed, the rate ρ of a uniformly geometric ergodic Markov chain can be chosen so that ρ p1 γps where γps denotes its pseudospectral gap. Combined with the fact that γps 1 2tmix , we 1 1 2tmix . When the chain is reversible, it 2Starting the process at time N + 1 will turn convenient notation-wise. The first samples up to time 0 will constitute the training data, and the remaining samples the calibration data. Conformal Predictions under Markovian Data admits an absolute spectral gap γ γps, leading to a tighter bound of ρ 1 γ. For a more exhaustive discussion on the connection between the mixing time, the rate and the spectral gap or its variants for non-reversible Markov chain, refer to (Chatterjee, 2023). 3.2. 
β-mixing processes

As in (Oliveira et al., 2022), we are interested in β-mixing processes as they can be divided into approximately independent blocks. This property will be instrumental in the analysis of split CP. Let {Z_t}_{t≥1} be a stochastic process. For any 1 ≤ s ≤ t ≤ ∞, we denote by Z_s^t = (Z_s, Z_{s+1}, ..., Z_t), by P_s^t its distribution, and by σ(Z_s^t) the σ-algebra generated by Z_s^t. The β-coefficients of the process are defined as β(a) = sup_{t≥1} E[ sup{ |P(B | σ(Z_1^t)) − P(B)| : B ∈ σ(Z_{t+a}^∞) } ] (Davydov, 1973; Yu, 1994). The process is called β-mixing if β(a) → 0 as a → ∞. When the process admits a stationary distribution π, we will also use a slightly different definition of the β-coefficients: β'(a) = sup_{t≥1} E[ ||P_{t+a}[· | Z_1^t] − π||_TV ], where P_{t+a}[· | Z_1^t] is the conditional distribution of Z_{t+a} given Z_1^t.

The connection between the rate of convergence of Markov chains and their β-mixing coefficients has been extensively studied. We cite here results from (Davydov, 1973; Nummelin & Tuominen, 1982; Liebscher, 2005) that will be instrumental in our analysis.

Stationary Geometrically Ergodic Markov chains. Let {Z_t}_{t≥1} be a geometrically ergodic Markov chain with rate ρ < 1 and stationary distribution π. Assume that it is in steady state (Z_1 ∼ π); then its β-mixing coefficients satisfy (Davydov, 1973): β(a) ≤ C ρ^a, with C = ∫_Z Q(z) π(dz).

Non-stationary Geometrically Ergodic Markov chains. Let {Z_t}_{t≥1} be a geometrically ergodic Markov chain with rate ρ and stationary distribution π. Let ν_1 be its initial distribution and suppose that ∫_Z Q(z) ν_1(dz) < ∞. Then one can show, as in the proof of Proposition 3 in (Liebscher, 2005), that for all a ≥ 1, β'(a) ≤ 3C'(√ρ)^a, where C' = max(∫_Z Q(z) π(dz), ∫_Z Q(z) ν_1(dz)).

3.3. Split Conformal Prediction

We outline below the classical split CP framework and its basic coverage guarantees in the case of i.i.d. or exchangeable data. Assume that we have access to a dataset {(X_t, Y_t)}_{t=1−N}^n. Given a new data point X_{n+1}, we wish to create a conformal prediction set C_n(X_{n+1}) with minimal size such that Y_{n+1} lies in C_n(X_{n+1}) with probability at least 1 − α. In the original split CP framework, the dataset is divided into two parts by a random split, as the data is exchangeable. However, to handle temporal dependencies in our analysis, we use a sequential split: the training dataset is D_tr = {(X_t, Y_t)}_{t=1−N}^0 and the calibration dataset is D_cal = {(X_t, Y_t)}_{t=1}^n. The first dataset is used to fit a model ˆf_N : X → R^r. Based on this model, a score function s : X × R^r → R is designed (e.g., this function could simply be s(x, y) = ||ˆf_N(x) − y||). Next, we compute the scores of the samples in the calibration dataset, s_1 = s(X_1, Y_1), ..., s_n = s(X_n, Y_n), as well as ˆq_{α,n}, the ⌈(n+1)(1−α)⌉/n empirical quantile of these scores. Given the new data point X_{n+1}, the conformal prediction set C_n(X_{n+1}) is finally defined as {y : s(X_{n+1}, y) ≤ ˆq_{α,n}}. In the case of i.i.d. (or exchangeable) calibration data (i.e., {(X_i, Y_i)}_{i=1}^{n+1} are i.i.d.), the above construction enjoys the following marginal coverage guarantee (Vovk et al., 2005):

1 − α ≤ P[Y_{n+1} ∈ C_n(X_{n+1})] ≤ 1 − α + 1/(1 + n).   (1)

We investigate whether this guarantee holds in the case of Markovian data and, when it does not, how we can adapt the above framework.

3.4. Notation

The notation v_{N,n} = O_{N,n}(w(N, n)) means that there exist M > 0 and two integers N_0, n_0 such that for all N ≥ N_0 log^2(n) and n ≥ n_0, |v_{N,n}| ≤ M w(N, n). The notations O_n and o_n are the usual big-O and little-o notations as n → ∞.
Z(N, n) = o P(1) means that limn lim N P[|Z(N, n)| ϵ] = lim N limn P[|Z(N, n)| ϵ] = 0 for all ϵ > 0. Let L(A) be the Lebesgue measure of the set A and A B be the symmetric difference between sets A, B. 4. Split CP for Markov Chains In this section, we apply the classical Split CP framework to Markovian data, and evaluate the coverage gap induced by the non-exchangeability of the data. We also study the size of the conformal prediction set. 4.1. Marginal coverage We formally define the notion of coverage gap as follows. Let Cn(Xn+1) be a prediction set based on the n last observed samples taken from the calibration dataset. We say that it has a coverage gap smaller than γ 0 if: P[Yn+1 Cn(Xn+1)] 1 α γ, 1 α + γ + 1 Consider the Markov chain {(Xt, Yt)}t N+1 with initial distribution ν0 (i.e., (X1 N, Y1 N) ν0) and stationary distribution π. In the following, to simplify the notation, Conformal Predictions under Markovian Data we define δ(a) = ν0P a π T V for any a N. When applying the standard Split CP to this Markov chain, the induced coverage gap will necessarily depends on its transient behavior, partly described by tmix and by the sequences {δ(a)}a 0 and {β(a)}a 0. To analyze the coverage gap, we consider two scenarios: 1. With restart. In this scenario, we assume that the calibration and the training datasets are independent, and that the Markov chain is restarted with distribution ν1 at time 1. Assuming such independence makes the conditioning on the training dataset easy and simplifies the analysis. When applying a restart, we will use the notation: for a 0, δ1(a) = ν1P a π T V . 2. Without restart. Here, the samples composing the training and calibration datasets come from a single trajectory of the Markov chain. The datasets are hence not independent and the distribution of (X1, Y1) is ν0P N. We tackle this case by creating a separation of r [n] time-steps between the training and calibration datasets so as to reduce their stochastic dependence. The following proposition provides upper bounds on the coverage gap in both scenarios. These upper bounds will be optimized in various cases depending on the nature of the Markov chain. Proposition 4.1. (1. With restart) Applying Split CP with restart yields a coverage gap γ satisfying, for any u > 1 n Pn a=1 δ1(a), γ γ(u) where γ(u) = u + e 2n 9tmix u 1 n Pn a=1 δ1(a) 2 + δ1(n + 1). (3) (2. Without restart) Applying Split CP yields a coverage gap γ satisfying, for any u > δ(N) and for any r [n], γ γ(u, r) where γ(u, r) = u+e 2(n r) 9tmix u δ(N) 2 + δ(n + N + 1) + 2β(r) + 1 + αr n + 1 . (4) We may optimize the above coverage gap upper bounds over u and r. The following theorem summarizes the outcomes of this optimization. Theorem 4.1. (1. With restart) Applying Split CP with restart yields a coverage gap γ satisfying n + δ1(n + 1) + 1 (5) We deduce that, if the Markov chain is geometrically ergodic, then γ = On( q n ). (2. Without restart) Applying Split CP to a geometrically ergodic Markov chain yields a coverage gap γ satisfying Proposition 4.1 and Theorem 4.1 quantify the coverage gap due to the non-exchangeability of the data. The gap depends (in both cases with or without restart) on the mixing properties of the underlying Markov chain. As expected, the gap becomes negligible only if the number of samples in the calibration dataset significantly exceeds the mixing time. 4.2. 
Size of the conformal prediction set To investigate the impact of the non-exchangeability of the data on the size of the conformal prediction set, we use the same setting as that used in (Lei et al., 2018). We consider real valued responses, and we make the following assumptions (similar to those made in Example 1). Assumption 1. For any t N + 1 and x, given Xt = x, the response Yt can be written as µ(x)+ε where the density of the noise ε is symmetric around 0 and non-increasing on R+. The training dataset is used to design an estimator ˆµN of the function µ, and the score function used in the CP method is defined through the residuals: st = |Yt ˆµN(Xt)|. The next assumption states that the estimator ˆµN becomes accurate as the size of the training dataset increases. This assumption is not critical, and results for the size of the prediction set can also be obtained by just assuming stability (Lei et al., 2018), i.e., convergence of ˆµN to a given µ possibly different than µ. Assumption 2. There exist two sequences {c N}N 1 and {d N}N 1 such that c N = o N(1), d N = o N(1) and P(||ˆµN µ|| c N) d N. Assumption 3. The noise |Y µ(X)| admits a density f that is lower bounded by κ > 0 on a neighborhood N of its (1 α) quantile qα. Under the above assumptions and should the data be i.i.d., we know (Lei & Wasserman, 2014) that the optimal conformal prediction set (that with conditional coverage 1 α and minimal size) is given Xn+1 = x, C (x) = [µ(x) qα, µ(x) + qα], where qα is the (1 α) quantile of the law of |ε|. We use the set C (x) as a benchmark to study the impact of (i) non-i.i.d. data and (ii) of the fact that µ and qα are unknown. We compare C (x) to the returned conformal prediction set Cn(x) = [ˆµN(x) ˆqα,n, ˆµN(x) ˆqα,n]. 4.2.1. CONCENTRATION OF EMPIRICAL QUANTILES We start quantifying the deviation between the empirical quantile ˆqα,n used to build the prediction set and its true Conformal Predictions under Markovian Data counterpart ˆqα, the (1 α) quantile of |Y ˆµN(X)| when (X, Y ) π. We introduce the set U = {t > 0 : [qα t, qα+t] N}. Proposition 4.2. Suppose that Assumptions 1, 2, 3 hold. (1. With restart) Let u 0 and let u = u (2κc N + d N). Assume that u lie in U, and that u > 1 n Pn a=1 δ1(a). Then we have: P(|ˆqα,n ˆqα| > u κ) 2e 2n 9tmix u 1 n Pn a=1 δ1(a) 2 + d N. Therefore, for any δ > 0 with probability at least 1 δ d N, we have |ˆqα,n ˆqα| u u = d N + 2κc N + 1 i=1 δ1(a) + 9tmix ln( 2 provided that u U. (2. Without restart) For any r [n] and u R, let u = u + 1+rα n+1 (2κc N + d N). Assume that both 1+rα lie in U and that u > δ(N). Then we have: P(|ˆqα,n ˆqα| > u κ) 2e 2(n r) 9tmix u δ(N) 2 +d N+β(r). Further assuming that the Markov chain is geometrically ergodic with rate ρ, we deduce that for any δ > 0 with probability at least 1 δ d N 2 n, we have |ˆqα,n ˆqα| u u = d N + 2κc N + δ(N) + α ln(n) 9tmix ln( 2 provided that u U. Note that when the Markov chain is geometrically ergodic, with or without restart, we indeed obtain that ˆqα,n concentrates around ˆqα as n and N grow large. Further observe that as a direct application of the above proposition, we recover the result from (Kolla et al., 2019) stating that in the case of i.i.d. data, with probability at least 1 δ, |ˆqα,n ˆqα| c p ln(2/δ)/2n where c is a constant that depends on κ. Handling Markovian data induces an additional multiplicative factor tmix in the difference between the true and the estimated quantile of the residuals. 4.2.2. 
ASYMPTOTIC OPTIMALITY To compare Cn(x) to C (x), we further need to make an assumption about the consistency of the estimated model ˆµN. We make the same assumption as in (Lei et al., 2018): Assumption 4. The estimated model ˆµN satisfies: N = EX π,tr[(ˆµN(X) µ(X))2] = o N(1). As shown in (Lei et al., 2018), this assuption allows us to control the gap between the true quantile qα and that of the estimated model ˆqα. Indeed, if we assume that the density function of |ε| has continuous derivate upper bounded by M > 0 and is lower bounded by κ > 0 on (qα η, qα + η) for some η > (M/2κ) N, then |ˆqα qα| (M/2κ) N. Combining this result to that of Proposition 4.2, we can finally quantify the difference between Cn(Xn+1) and C (Xn+1). Proposition 4.3. Assume that the Markov chain Z = (X, Y ) is β-mixing. We have: L(C (Xn+1) Cn(Xn+1)) = o P(1), if either (i) the Markov chain is in steady-state (i.e., ν0 = π without restart or ν1 = π with restart), and Assumptions 1, 3, 4 hold, or (ii) Assumptions 1, 2, 3, 4 hold. The proposition states that asymptotically, when the calibration dataset grows large, the prediction set becomes very close to the optimal conformal prediction set. It is worth noting that we do not assume that the Markov chain is geometrically ergodic. We do not quantify the speed at which Cn(Xn+1) converges to C (Xn+1), and hence assuming β-mixing is enough. 5. K-Split Conformal Prediction The coverage gap γ identified when applying the classical CP approach to Markovian data stems from the correlations in the data. A natural way to decrease the impact of these correlations consists in thinning the calibration data. Specifically, we may build the prediction set based on calibration samples taken every K steps. We refer to this approach as the K-split conformal prediction. It applies the classical CP method to the calibration data Dcal,K = {(Xi K+1, Yi K+1)} n K 1 i=0 . In the following, for simplicity, we assume that n = Km for some m N so that Dcal,K consists of m samples. We define ˆqα,K,n as the (m+1)(1 α) m quantile of the scores s1 = s(X1, Y1), . . . , sm = s(XK(m 1)+1, YK(m 1)+1). Given the new data point Xn+1, the conformal prediction set CK(Xn+1) is defined as {y : s(Xn+1, y) ˆqα,K,n}. The analysis made in the previous section extends here since the thinned process remains a Markov chain but with transition kernel P K. However, applying Theorem 4.1 does not provide interesting results. Indeed, observe that the mixing time of thinned process is tmix,K = tmix K . This implies that the coverage gap γ obtained in Theorem 4.1 would scale as r K K ln(n/K) tmix ln(n/K) Hence, from this analysis, we do not see any improvement in thinning the process. We propose below an alternative analysis based on the so-called blocking technique (that can be traced back to (Bernstein, 1927)). Conformal Predictions under Markovian Data 5.1. Marginal coverage guarantees via the blocking technique The blocking technique applies not only to Markov chains but also to β-mixing processes. Hence we start by stating general results for these processes, and then specialize the results in the case of Markov chains. 5.1.1. β-MIXING PROCESSES Let {Xt, Yt}n 1 N a stochastic process with initial distribution ν0 and stationary distribution π. For any r [n], denote by β r(a) the mixing coefficient calculated on the trajectory {(Xt, Yt)}n t=r, i.e., β r(a) = supt r E[ Pt+a[ |(Xi, Yi)t i=r] π T V ]. Proposition 5.1. (1. 
With restart) Under the K-Split CP with restart, P[Yn+1 CK(Xn+1)] belongs to the interval [1 α γ(K), 1 α + γ(K) + K n ] for any n 1 and any K [n] where K β(K) if ν1 = π, 2 n K β 1(K) otherwise. (7) (2. Without restart) Under the K-Split CP without restart, P[Yn+1 CK(Xn+1)] belongs to the interval [1 α γ(K, r), 1 α + γ(K, r) + K n r] for any n 1, r [n], K [n r] where n+1 + 2 n r K β(K) + β(r) if ν0 = π, 1+αr n+1 + 2 n r K β r(K) + β(r) otherwise. (8) The above bounds on the coverage gap are rather simpler than those obtained in the case without thinning (see Proposition 4.1) and only depend on the β coefficients of the process. Note that these bounds do make sense only when these coefficients decrease rapidly with K. They are also useless in the case without thinning (K = 1). The term (1 α) appearing in the lower bounds of the coverage stems from the analysis of split CP in the i.i.d setting. In this setting, due to the finiteness of the calibration dataset, the coverage is actually lower bounded by : coverage (ncal + 1)(1 α) ncal + 1 1 α, (9) where ncal is the size of the calibration dataset which is n K for K-split CP. When ncal (resp. n K ) is small, the difference between the two bounds can be non negligible and lead to an over-coverage of split CP (resp. K-split CP). We illustrate this phenomenon and propose a slight refinement of the analysis of K-split CP in Section 6. 5.1.2. GEOMETRICALLY ERGODIC MARKOV CHAINS Next we optimize the value of K to achieve a good trade-off between coverage and size of the prediction set. Observe that indeed such a trade-off exists. If we just want to maximize coverage, then we can select values of K maximizing the functions γ(K) or γ(K, r) defined in Proposition 5.1. This would lead to choosing K very large, but would be at the expense of enlarging the prediction set. This trade-off is confirmed by the terms K/n and K/(n r) in the upper limit of the intervals where P[Yn+1 CK(Xn+1)] lies. To address the coverage-size trade-off, we choose to minimize the size of these intervals. Specifically, for the case with restart, we wish to minimize γ(K) + K/n over K; whereas for the case without restart, the function to minimize is γ(K, r) + K/(n r) over K and r. To this aim, we assume that the process is a geometrically ergodic Markov chain with rate ρ. In the case without restart, to simplify the optimization in r, we observe that β r(a) β 1(a) C ρ a 2 for C = 3 max( R X Q(x)π(dx), R X Q(x)ν1(dx)) < (refer to subsection 3.2). Theorem 5.1. (1. With restart) Assume R X Q(x)ν1(dx) < . Under the K-Split CP with restart and with3 K = W0(n2(ln ρ)2) ρ ) , P(Yn+1 CK (Xn+1)) belongs to [1 α γlow, 1 α + γup] with γlow = On( 1 n ln(n) ln( 1 and γup = On( ln(n) (2. Without restart) The same result as in the case of restart holds if we assume R X Q(x)ν0P N(dx) < . This theorem states that for geometrically ergodic Markov chains, the coverage gap of the optimal K-split CP scales as 1 n ln(n) ln(1/ρ). For clarity, we restate the result with tmix only as in Section 4. As discussed in Section 3, we can choose ρ satisfying ρ (1 1 2tmix )1/2. Therefore, the coverage gap of optimal K-split CP scales as tmix n ln(n) Recall that the gap of the split CP without thinning was scaling as q n , therefore K-split CP divides the coverage gap by a factor n ln(n)3/2. 5.2. Size of the conformal prediction set As mentioned earlier, by thinning the initial Markov chain, we obtain another Markov chain with kernel P K. 
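To make the procedure concrete, the following sketch (helper names are ours, and ρ is assumed known) computes the plain split CP threshold and its K-split variant with the thinning parameter K* = W0(n²(ln ρ)²)/ln(1/ρ) of Theorem 5.1; scipy.special.lambertw provides the Lambert function W0.

```python
import numpy as np
from scipy.special import lambertw

def split_cp_threshold(scores, alpha):
    """ceil((m+1)(1-alpha))/m empirical quantile of the calibration scores;
    the prediction set is then {y : s(X_{n+1}, y) <= threshold}."""
    m = len(scores)
    level = min(np.ceil((m + 1) * (1 - alpha)) / m, 1.0)
    return np.quantile(scores, level, method="higher")

def k_star(n, rho):
    """Thinning parameter of Theorem 5.1: K* = W0(n^2 (ln rho)^2) / ln(1/rho)."""
    z = float(np.real(lambertw(n**2 * np.log(rho) ** 2)))    # principal branch W0
    return max(1, int(round(z / np.log(1.0 / rho))))

def k_split_threshold(scores, alpha, rho):
    """K-split CP: keep one calibration score out of K*, then run plain split CP."""
    K = k_star(len(scores), rho)
    return split_cp_threshold(scores[::K], alpha), K
```

Since W0(x) = O(ln x), K* grows roughly like ln(n)/ln(1/ρ), i.e., on the order of t_mix ln(n), so the thinned calibration set keeps on the order of n/(t_mix ln n) samples.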
Hence our results pertaining to the size of the prediction set and derived in Subsection 4.2 remain valid. More precisely, all statements made in Proposition 4.2 hold provided that we replace n by n/K, tmix by tmix n Pn a=1 δ1(a) by n P n/K a=1 δ1(Ka) (note that all terms related to the training data, e.g. c N, d N, remain unchanged). When applying 3W0 is the Lambert function of order 0. Conformal Predictions under Markovian Data the results to geometrically ergodic chains, we also have to replace the rate ρ by ρK. The following proposition summarizes the above observations. Proposition 5.2. Suppose that Assumptions 1, 2, 3 hold. Assume that the Markov chain is geometrically ergodic with rate ρ. Applying the K-Split CP with or without restart yields that: for any δ > 0 with probability at least 1 δ d N 2 n, we have |ˆqα,n ˆqα| u u = d N + 2κc N + δ(N) + On (tmix K) ln( 2 provided that u U. We can plug K , the optimal value of K identified in Theorem 5.1, in the above result. Using the fact that W0(x) = O(ln x), we have K = On( ln(n) On(tmix ln(n)). Hence with this choice of K, the high probability upper bound of |ˆqα,n ˆqα| goes δ ) n ) for the classical Split CP tmix ln(n) ln( 2 δ ) n ) for the K -Split CP. This result suggests that the impact of thinning the calibration dataset using the optimal K does not impact much the size of conformal prediction set. Finally observe that with thinning, the results of Proposition 4.3 hold. 5.3. Adaptive K-Split CP As demonstrated in Theorem 5.1, the choice of the thinning parameter K in the K-split CP is critical to achieve an efficient trade-off between coverage and prediction set size. However, K is a function of the rate ρ of the Markov chain, and the latter is initially unknown. We assume below that the state space is finite so that the Markov chain is uniformly geometric ergodic. In this case, its rate can be selected as ρ = p1 γps or ρ = 1 γ if the chain is reversible. Both quantities γps, γ can be estimated (Hsu et al., 2015; 2019; Combes & Touati, 2019; Wolfer & Kontorovich, 2019). For simplicity, we state the results in the reversible case and assuming that we estimate γ using the training dataset. Our results can however be easily extended to the non-reversible case by estimating γps (Wolfer & Kontorovich, 2019). We use the estimator ˆγN proposed in (Hsu et al., 2015) to construct ˆρN = 1 ˆγN. This estimator enjoys the following guarantees: for any N 1, for any δ (0, 1), δ ) ln(N) π (1 λ)N where C is a universal constant and π := minz Z π(z) (here π is the stationary distribution of the Markov chain). Based on ˆρN, we build the following estimator for K : ˆKN = ln(n)/ ln(1/ˆρN). When plugging this value in our K-Split CP method, we get the following guarantees. Proposition 5.3. (1. With restart) Assume that R X Q(x)ν1(dx) < . Under the K-Split CP with restart and thinning parameter ˆKN, P(Yn+1 C ˆ KN (Xn+1)) belongs to the interval [1 α γlow l N,n, 1 α + γup + u N,n] with γlow, γup as in Theorem 5.1 and l N,n = ON,n(ln(N) N 1 ρ(ln ρ)2n ln(n)), u N,n = ON,n(ln(N) (2. Without restart) The same result as in the case of restart holds if we assume R X Q(x)ν0P N(dx) < . From the above results, we conclude that the estimation of K does not impact much the coverage gap, and one can easily enjoy the benefits of the optimal K-split CP without the apriori knowledge of K . 6. Experiments We illustrate experimentally the coverage gap and set length of our two described conformal prediction methods, namely original split conformal and K-split conformal. 
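Several of the experiments below rely on an estimate of ρ computed from the training data, as in the adaptive scheme of Section 5.3. For a finite state space, a naive plug-in version can be sketched as follows; this is a simplified stand-in for the estimator of (Hsu et al., 2015) on which the guarantees of Proposition 5.3 rely, and the function names are ours.

```python
import numpy as np

def estimate_rho_finite(states, n_states):
    """Plug-in estimate of rho for a finite (assumed reversible) chain: second
    largest eigenvalue modulus of the empirical transition matrix. Assumes the
    states are integers in {0, ..., n_states-1} and that every state is visited."""
    P = np.zeros((n_states, n_states))
    for s, s_next in zip(states[:-1], states[1:]):
        P[s, s_next] += 1.0
    P /= P.sum(axis=1, keepdims=True)          # row-normalise the transition counts
    lambdas = np.sort(np.abs(np.linalg.eigvals(P)))
    return float(lambdas[-2])                  # rho_hat = 1 - gamma_hat

def adaptive_K(n, rho_hat):
    """Plug-in thinning parameter of Section 5.3: K_hat = ln(n) / ln(1/rho_hat)."""
    return max(1, int(round(np.log(n) / np.log(1.0 / rho_hat))))
```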
The score considered is the usual residual |Y ˆµ(X)| (all responses are real-valued) therefore the conformal set length is simply 2ˆqα,n (resp 2ˆqα,K,n). When the model is specified, the latter is compared with the optimal length 2qα (as defined in Section 4.2) through a relative difference. The objective of our experiments is to mainly assess the performance of the original split CP method in the case of Markovian data, and that of our proposed algorithms, namely K-split CP and corrected K-split CP (as defined below). Note that we do not compare our algorithms to those designed to cope with non-stationary dependent data such as ACI (Gibbs & Candes, 2021; Zaffran et al., 2022) (ACI extends the original CP algorithm by allowing the confidence level α to change in an active and online manner depending on the observed coverage). 6.1. K-split conformal prediction in practice As discussed in Section 5, K-split CP may exhibit an overcoverage when there are not enough values in the reduced calibration dataset. To circumvent this issue, we can adjust the quantile level used by K-split CP. This level is set to α by taking into account the difference between the two bounds presented (9). Specifically, we select α such that K + 1)(1 α ) n K + 1 = 1 α Conformal Predictions under Markovian Data 6.2. Synthetic data In this subsection, we apply all split CP methods to synthetic examples, one with a finite state space and one with a continuous state space. An experiment consists in generating one trajectory of length N +n+1 and in applying CP to the last point. We repeat the experiment Ntrials = 1000 times and report the average coverage rate. We fix N = 10000. 6.2.1. THE LAZY RANDOM WALK ON Z/w Z Consider the stationary lazy random walk. It is a simple example of a finite state space, irreducible, aperiodic and reversible Markov Chain, for which ρ = λ2 = 1+cos( 2π w ) 2 . For w = 20, this example already exhibits strong temporal correlations as ρ = 0.97. The discrete state space setting allows to use the estimator from (Hsu et al., 2015). For a given true model µ(x) and independent symmetric noise εt, we generate Yt = µ(Xt) + εt (refer to Appendix B.1 for details). Figure 1. Coverage for the lazy walk (w = 20) as a function of n. The second values on the x-axis represent the (optimal) values of K (n). In Figure 1, we observe that all three methods achieve almost 1 α coverage gap as n increases. Note that K-split CP always has stronger coverage than full split CP, and overcovering steadily diminishes when the number of samples in the reduced dataset increase. The correction proposed allows to remain closer to 1 α in the regime where n is small as expected. Figure 2 shows that all CP methods output an interval whose length approaches the optimal length as n increases. Again, we observe the overcoverage phenomenon when n is low for K-split CP, but asymptotically vanishes as analysed in Section 5. 6.2.2. THE GAUSSIAN AUTOREGRESSIVE (AR) MODEL OF ORDER 1 The classical AR(1) models are reversible Markov chains defined by the following recursive equation: n, Xn+1 = θXn + εn+1 with εn N(0, ω2) and for some θ [0, 1[ and ω > 0. εn is a Gaussian noise independent of X. Figure 2. |qn,α qα| qα for the lazy walk as a function of n. The second values on the x-axis represent the (optimal) values of K (n). For this stochastic process, we predict the AR model itself, i.e., Yt = Xt+1. (X, Y ) is a Markov chain with the same kernel as X and it is geometrically ergodic (Bhattacharya & Lee, 1995). 
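A simplified, self-contained version of this experiment can be sketched as follows (our own code with illustrative parameters; the exact setup, including the choice K*(n) and the quantile-level correction of Section 6.1, is described in Appendix B.2). Both reported coverages should come out close to 1 − α.

```python
import numpy as np

def ar1_coverage(theta=0.9, omega=1.0, N=10_000, n=2_000, K=20,
                 alpha=0.1, trials=200, seed=0):
    """Empirical coverage of split CP and K-split CP on the AR(1) example,
    where the target is the next value of the chain (Y_t = X_{t+1})."""
    rng = np.random.default_rng(seed)

    def cp_threshold(scores):
        # ceil((m+1)(1-alpha))/m empirical quantile of the calibration scores
        m = len(scores)
        level = min(np.ceil((m + 1) * (1 - alpha)) / m, 1.0)
        return np.quantile(scores, level, method="higher")

    hits_split = hits_ksplit = 0
    for _ in range(trials):
        X = np.empty(N + n + 2)
        X[0] = rng.normal(scale=omega / np.sqrt(1 - theta**2))     # near-stationary start
        for t in range(1, len(X)):
            X[t] = theta * X[t - 1] + omega * rng.normal()
        cov, resp = X[:-1], X[1:]                                   # pairs (X_t, Y_t = X_{t+1})
        # training part: least-squares estimate of the conditional mean mu(x) = theta * x
        theta_hat = cov[:N] @ resp[:N] / (cov[:N] @ cov[:N])
        scores = np.abs(resp[N:N + n] - theta_hat * cov[N:N + n])   # calibration scores
        s_test = abs(resp[N + n] - theta_hat * cov[N + n])          # score of the test point
        hits_split += s_test <= cp_threshold(scores)                # split CP
        hits_ksplit += s_test <= cp_threshold(scores[::K])          # K-split CP (thinned)
    return hits_split / trials, hits_ksplit / trials

print(ar1_coverage())   # both values should be close to 1 - alpha = 0.9
```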
Since the state space is continuous, we can not use the estimator defined in Section 5.3, however in this example, we can explicity compute ρ (refer to Appendix B.2 for details). In Figure 3, we compare our different CP methods. Again, we observe that K-split CP outperforms full split CP. But all methods achieve a very good coverage, close to 1 α as n grows large. Figure 3. Coverage for Gaussian AR (θ = 0.9, ω = 1) as a function of n. The second values on the x-axis represent the (optimal) values of K (n). 6.3. Real-world application In this subsection, we apply split CP to real-world datasets. Experiments to assess the performance of split CP on dependent data have already been conducted in (Wisniewski et al., 2020) (without guarantees) and in (Oliveira et al., 2022) (with guarantees for stationary β mixing data). We further consider non-stationary data and investigate the performance of K-split CP. Conformal Predictions under Markovian Data 6.3.1. EXCHANGE RATE EUR/SEK The objective here is to predict the exchange rate EUR/SEK4 in 2022. Let Xt be the exchange rate, reported every minute. As done in (Oliveira et al., 2022), we assume that the series of returns rt = Xt+1 Xt 1 is β-mixing with exponentially fast convergence to 0. We estimate the corresponding value ρ by computing the auto-correlations as they exhibit a similar decay as that of the β coefficients, and applying a simple linear regression to their logarithms. This gives a value of ρ = 0.57. At each timestep, we apply conformal prediction to the next return rt+1 with a rolling window of fixed size divided into training and calibration datasets (1 month = 30x24x60 data points for each in this example). We compute whether the given CP method covered rt+1 in 2022. Figure 4. Daily coverage for EUR/SEK exchange rate (n = 43200, N = 43200) during one month In Figure 4, we plotted the empirical averages of the coverage over a month. We observe that all three methods achieve 1 α coverage. 6.3.2. ELECTRICITY PRICE FORECASTING In this second example, we consider the same dataset as (Zaffran et al., 2022), which contains the French electricity price between 2016 to 2019, reported every hour. We consider again the prediction of the one-step return rt+1 with a rolling window of fixed size (18 months = 18x30x24 data points for both training/calibration). In this example, we obtained a value of ρ = 0.78. We calculate the empirical coverage for the year 2019. In Figure 7, we can observe that all methods are quite similar in performance, with a coverage close to 1 α. 7. Conclusion In this paper, we extended the analysis of the original split CP method to the case of Markovian data. We established upper bounds on the impact of the correlations in the data on the coverage guarantees and size of the prediction set. When 4The dataset can be found at https://www.histdata. com/ Figure 5. Monthly coverage for French electricity price forecast (n = 12960, N = 12960) the underlying Markov chain mixes rapidly, this impact is negligible. When this is not the case, handling correlations remains challenging and an interesting topic for future research. We could try for example to identify fundamental limits on the coverage vs. size of the prediction set tradeoff satisfied by any conformal method; such limits would indicate the incompressible price one has to pay when dealing with Markovian data. These limits would also provide insights into the design of CP algorithms. 
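For completeness, the autocorrelation-based estimate of ρ used for the exchange-rate and electricity-price experiments (Section 6.3) can be sketched as follows; this is a heuristic without theoretical guarantees, and the implementation details are ours.

```python
import numpy as np

def estimate_rho_from_acf(returns, max_lag=50):
    """Estimate a geometric decay rate rho by fitting a line to the logarithm of
    the absolute autocorrelations of the return series, assuming |acf(k)| ~ rho^k."""
    x = returns - returns.mean()
    acf = np.array([x[:-k] @ x[k:] / (x @ x) for k in range(1, max_lag + 1)])
    lags = np.arange(1, max_lag + 1)
    mask = np.abs(acf) > 1e-12                       # drop numerically negligible lags
    slope, _ = np.polyfit(lags[mask], np.log(np.abs(acf[mask])), 1)
    return float(np.exp(slope))                      # slope of the fit is ln(rho)
```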
Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgements This research was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, the Swedish Research Council (VR), and Digital Futures. Conformal Predictions under Markovian Data Angelopoulos, A. N. and Bates, S. Conformal prediction: A gentle introduction. Found. Trends Mach. Learn., 16 (4):494 591, mar 2023. ISSN 1935-8237. doi: 10.1561/ 2200000101. URL https://doi.org/10.1561/ 2200000101. Barber, R. F., Candes, E. J., Ramdas, A., and Tibshirani, R. J. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816 845, 2023. Bernstein, S. Sur l extension du th eor eme limite du calcul des probabilit es aux sommes de quantit es d ependantes. Mathematische Annalen, 97:1 59, 1927. URL https://api.semanticscholar. org/Corpus ID:122172457. Bhattacharya, R. N. and Lee, C. Ergodicity of nonlinear first order autoregressive models. Journal of Theoretical Probability, 8:207 219, 1995. Bresler, G., Jain, P., Nagaraj, D., Netrapalli, P., and Wu, X. Least squares regression with markovian data: fundamental limits and algorithms. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. Chatterjee, S. Spectral gap of nonreversible markov chains, 2023. Chatzigeorgiou, I. Bounds on the lambert function and their application to the outage analysis of user cooperation. IEEE Communications Letters, 17(8):1505 1508, 2013. Combes, R. and Touati, M. Computationally efficient estimation of the spectral gap of a markov chain. Proc. ACM Meas. Anal. Comput. Syst., 3(1), mar 2019. Davydov, Y. A. Mixing conditions for markov chains. Teoriya Veroyatnostei i ee Primeneniya, 18(2):321 338, 1973. Dixit, A., Lindemann, L., Wei, S. X., Cleaveland, M., Pappas, G. J., and Burdick, J. W. Adaptive conformal prediction for motion planning among dynamic agents. In Matni, N., Morari, M., and Pappas, G. J. (eds.), Proceedings of The 5th Annual Learning for Dynamics and Control Conference, volume 211 of Proceedings of Machine Learning Research, pp. 300 314. PMLR, 15 16 Jun 2023. URL https://proceedings.mlr.press/ v211/dixit23a.html. Doukhan, P. Mixing: Properties and Examples. Lecture Notes in Statistics. Springer New York, 2012. ISBN 9781461226420. Foffano, D., Russo, A., and Prouti ere, A. Conformal off-policy evaluation in markov decision processes. In 62nd IEEE Conference on Decision and Control, CDC 2023, Singapore, December 13-15, 2023, pp. 3087 3094. IEEE, 2023. doi: 10.1109/CDC49753.2023. 10383469. URL https://doi.org/10.1109/ CDC49753.2023.10383469. Foygel Barber, R., Candes, E. J., Ramdas, A., and Tibshirani, R. J. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455 482, 2021. Gallegos-Herrada, M. A., Ledvinka, D., and Rosenthal, J. S. Equivalences of geometric ergodicity of markov chains. Journal of Theoretical Probability, pp. 1 27, 2023. Gammerman, A. and Vovk, V. Hedging predictions in machine learning. Comput. J., 50(2):151 163, mar 2007. ISSN 0010-4620. doi: 10.1093/comjnl/bxl065. URL https://doi.org/10.1093/comjnl/bxl065. Gibbs, I. and Candes, E. Adaptive conformal inference under distribution shift. 
Advances in Neural Information Processing Systems, 34:1660 1672, 2021. Hsu, D., Kontorovich, A., Levin, D. A., Peres, Y., Szepesv ari, C., and Wolfer, G. Mixing time estimation in reversible markov chains from a single sample path. The Annals of Applied Probability, 29(4):pp. 2439 2480, 2019. ISSN 10505164, 21688737. Hsu, D. J., Kontorovich, A., and Szepesv ari, C. Mixing time estimation in reversible markov chains from a single sample path. Advances in neural information processing systems, 28, 2015. Kolla, R. K., Prashanth, L., Bhat, S. P., and Jagannathan, K. Concentration bounds for empirical conditional value-atrisk: The unbounded case. Operations Research Letters, 47(1):16 20, 2019. Kuznetsov, V. and Mohri, M. Generalization bounds for non-stationary mixing processes. Machine Learning, 106 (1):93 117, 2017. Lei, J. and Wasserman, L. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pp. 71 96, 2014. Lei, J., G Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094 1111, 2018. Levin, D. A. and Peres, Y. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. Conformal Predictions under Markovian Data Liebscher, E. Towards a unified approach for proving geometric ergodicity and mixing properties of nonlinear autoregressive processes. Journal of Time Series Analysis, 26(5):669 689, 2005. Meyn, S. and Tweedie, R. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993. URL /brokenurl#probability.ca/MT. Nummelin, E. and Tuominen, P. Geometric ergodicity of harris recurrent marcov chains with applications to renewal theory. Stochastic Processes and Their Applications, 12(2):187 202, 1982. Oliveira, R. I., Orenstein, P., Ramos, T., and Romano, J. V. Split conformal prediction for dependent data. ar Xiv preprint ar Xiv:2203.15885, 2022. Paulin, D. Concentration inequalities for markov chains by marton couplings and spectral methods. 2015. Roberts, G. and Rosenthal, J. Geometric ergodicity and hybrid markov chains. 1997. Romano, Y., Patterson, E., and Candes, E. Conformalized quantile regression. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ 5103c3584b063c431bd1268e9b5e76fb-Paper. pdf. Shafer, G. and Vovk, V. A tutorial on conformal prediction. J. Mach. Learn. Res., 9:371 421, jun 2008. ISSN 15324435. Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. Conformal prediction under covariate shift. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ 8fb21ee7a2207526da55a679f0332de2-Paper. pdf. Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world, volume 29. Springer, 2005. Wisniewski, W., Lindsay, D., and Lindsay, S. Application of conformal prediction interval estimations to market makers net positions. In Conformal and probabilistic prediction and applications, pp. 285 301. PMLR, 2020. Wolfer, G. and Kontorovich, A. Estimating the mixing time of ergodic markov chains. 
In Conference on Learning Theory, pp. 3120 3159. PMLR, 2019. Yu, B. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pp. 94 116, 1994. Zaffran, M., F eron, O., Goude, Y., Josse, J., and Dieuleveut, A. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pp. 25834 25866. PMLR, 2022. Conformal Predictions under Markovian Data A.1. Notation We will often make use of the following definition of the total variation between two measures P, Q defined on a measurable space (X, F) ||P Q||T V = sup A F |P(A) Q(A)| (10) We will also make use of the equivalent formulation of β(a) = sup t 1 Pt 1 P t+a Pt,a T V for a 1, where Pt,a is the joint distribution of (Xt 1, X t+a) (Doukhan, 2012). It can be shown (Gallegos-Herrada et al., 2023) that if a Markov chain is geometrically ergodic, then for any measure ν Lp(π) = {µ, R | dµ dπ|pdπ < } for some p 1, there exists 0 ρ < 1 (that depends on ν through p only) and a constant Cν such that for all n 1, νP n π T V Cνρn. Without loss of generality, we will identify for conciseness the constants C, Cν (resp C ) and ρ appearing in the total variation term νP n π T V and in β(a) (resp β (a)) to be the same (consider the maximum of each value for example). We define the following probability notations: (i) the probability over the calibration dataset given the training dataset, Pcal = Pcal,(Xn+1,Yn+1)(.|Dtr), (ii) the probability Pπ when the test point (Xn+1, Yn+1) is taken independently of the chain and follows the stationary distribution π. A.2. Proof of Proposition 4.1 and Theorem 4.1 Proof. The proof is divided into 4 parts: Calculation of the marginal coverage when the training and calibration datasets are independent Calculation of the marginal coverage in the general case Optimisation of the coverage gap w.r.t u Optimisation w.r.t r Define the random variable Z = (X, Y ) and let Zn+1 1 N = ((Xi, Yi))n+1 i=1 N Calculation of the marginal coverage when the training and calibration datasets are independent Suppose in this section only that Z0 1 N is fixed therefore S is fixed and furthermore suppose that Z0 1 N is independent of Zn 1 Similar to (Foygel Barber et al., 2021; Oliveira et al., 2022), we show with high probability over Zn 1 that ˆqα,n is lower bounded by a true quantile ˆqα+ of the scores S(X, Y ) when (X, Y ) π, and where α+ = α + u > α for 1 α u > 0. Define the function f i=1 1{Si > ˆqα+} Conformal Predictions under Markovian Data Then by definition on the calibration dataset Ecalf(Zn 1 ) = i=1 Pcal(Si > ˆqα+) i=1 Pcal(Si > ˆqα+) Pπ(S > ˆqα+) + Pπ(S > ˆqα+) i=1 [Pπ(S > ˆqα+) ||ν0P N+i π||T V ] i=1 ||ν0P N+i π||T V By (Paulin, 2015), we have the following Mc Diarmid concentration t > 0, Pcal( 1 n(f(Zn 1 ) Ecalf(Zn 1 )) > t) 1 exp 2n2t2 τmin = inf 0 ε<1 τ(ε) 2 ε 1 ε 2 and τmin 9tmix and c Rn+1 = (1, ..., 1)T , ||c||2 = n is a vector chosen such that for any (Z, Z ) f(Zn 1 ) f(Z 1 n) i=1 ci1{Zi = Z i} From (11), we have 1 n(f(Zn 1 ) Ecalf(Zn 1 )) t = 1 i=1 1{Si > ˆqα+} α+ δ(N) t i=1 1{Si > ˆqα+} nα + n(u δ(N) t) i=1 1{Si > ˆqα+} nα for t = u δ(N) By definition of ˆqα,n, this inequality implies that ˆqα+ ˆqα,n. For any 1 r n and u > δ(N), let Er(u) = {f(Zn+1 r ) Ecal(f(Zn+1 r )) (u δ(N))}. 
Therefore, for any 1 r n, we proved that under the hypothesis that Z0 1 N is independent of Zn+1 r , then we have over both the training and (reduced) calibration datasets P(ˆqα+ ˆqα,(n r)) P(Er(u)) 1 exp 2(n r)(u δ(N))2 where ˆqα,(n r) is the empirical quantile over {S(Zn r )}. Finally, we relate ˆqα,(n r) to ˆqα ,n with α well chosen. Conformal Predictions under Markovian Data For any dataset D and point Z D, let rank(D, Z) be the rank of Z in D. Then by definition of the empirical quantile, rank(Dcal,r, ˆqα,(n r)) = (n + r 1)(1 α) . When we add the remaining S1, .., Sr then rank(Dcal, ˆqα,(n r)) [ (n + r 1)(1 α) r, (n + r 1)(1 α) + r]. The extremities of the interval correspond to the cases where either min(S1, ..., Sr) max(Sr+1, ..., Sn) or max(S1, ..., Sr) min(Sr+1, ..., Sn). Hence, we directly obtain that ˆqα+ 1+rα n+1 ,n ˆqα,(n r) ˆqα 1+rα n+1 ,n and equivalently ˆqα+ 1+rα n+1 ,(n r) ˆqα,n ˆqα 1+rα Calculation of the marginal coverage in the general case We now consider the probability over the entire chain, therefore the training and calibration datasets are not independent. Similar to (Oliveira et al., 2022), for r 1, we consider the reduced chain Zn r . Let P r be the joint distribution under which Z0 1 N and Zn r are supposed independent. Let P n+1 be the joint distribution under which Z0 1 N and Zn+1 are supposed independent. For any u > δ(N), for any r [1, n], we have P(Sn+1 ˆqα,n) P(Sn+1 ˆqα ,(n r)) where α = α + 1 + rα n + 1 P(Sn+1 ˆqα ,(n r), Zn r Er(u)) P(Sn+1 ˆqα +u, Zn r Er(u)) since on Er(u), ˆqα +u ˆqα ,n r and the c.d.f is non decreasing P(Sn+1 ˆqα +u) P(Zn r Er(u)) P n+1(Sn+1 ˆqα +u) β(n + 1) P r(Zn r Er(u)) β(r) Pπ(S ˆqα +u) + [P n+1(Sn+1 ˆqα +u) Pπ(S ˆqα +u)] β(n + 1) exp 2(n r)(u δ(N))2 1 α u 1 + rα n + 1 δ(n + N + 1) exp 2(n r)(u δ(N))2 β(n + 1) β(r) where the fifth inequality comes from P(Sn+1 ˆqα +u) = P(Sn+1 ˆqα +u) P n+1(Sn+1 ˆqα +u) + P n+1(Sn+1 ˆqα +u) P n+1(Sn+1 ˆqα +u) ||P0,n+1 P0 1 N P n+1||T V P n+1(Sn+1 ˆqα +u) sup t ||Pt,n+1 Pt 1 N P t+n+1||T V = P n+1(Sn+1 ˆqα +u) β(n + 1) and similarly for P(Zn r Er(u)). Notice that in the case of a restart of the chain at the calibration with initial distribution ν1, we can choose r = 0 and ignore the β terms therefore obtaining γ(u) = u + exp 2n(u 1 n Pn i=1 δ1(i))2 + δ1(n + 1) (13) If X is ϕ-irreductible aperiodic Markov chain with coefficients β(a), we have γ(u, r) = u + exp 2(n r)(u δ(N))2 + +δ(n + N + 1) + β(n + 1) + β(r) (14) Conformal Predictions under Markovian Data If X is a ϕ-irreductible aperiodic geometric ergodic Markov chain, γ(u, r) = u + 1 + rα n + 1 + exp 2(n r)(u CρN)2 + C(ρr + ρn+1 + ρn+N+1) (15) Upper bound We keep the same notations but replace α+ by α = α + α 1 n u < α where the additional α 1 n takes into account the correction in the empirical quantile calculation. Again, supposing that Z0 1 N is independent of Zn+1 1 then, we have t > 0, Pcal( 1 n(f(Z) Ecalf(Z)) t) 1 exp 2nt2 and we also have 1 n(f(Z) Ecalf(Z)) t = 1 i=1 1{Si > ˆqα } α + δ(N) + t i=1 1{Si > ˆqα } (n + 1)α 1 + n(t (u δ(N))) i=1 1{Si ˆqα } (n + 1)α 1 for t = u δ(N) i=1 1{Si ˆqα } > (n + 1)(1 α) (n + 1)(1 α) since the l.h.s is an integer By definition of ˆqα,n, this inequality implies that ˆqα ˆqα,n. For any 1 r n and u > δ(N), let Er(u) = {f(Zn+1 r ) Ecal(f(Zn+1 r )) + (u δ(N))}. 
P(Sn+1 ˆqα,n) P(Sn+1 ˆqα ,(n r)) where α = α 1 + rα n + 1 = P(Sn+1 ˆqα ,(n r), Zn r Er(u)) + P(Sn+1 ˆqα ,(n r), Zn r Er(u)c) P(Sn+1 ˆqα u, Zn r Er(u)) + P(Zn r Er(u)c) P(Sn+1 ˆqα u) + P(Zn r Er(u)c) P n+1(Sn+1 ˆqα u) + β(n + 1) + P r(Zn r Er(u)) + β(r) Hence, we obtain the same value of γ with an additional 1 α n r for all cases as in 13,14 and 15. Optimisation of γ(u, r) w.r.t to u We study the behaviour of γ as a function of u. For simplicity, let γ(u) = u + exp( A(u K)2) + B where 9tmix K = δ(N) B = 1 + rα n + 1 + δ(n + N + 1) + β(n + 1) + β(r) Conformal Predictions under Markovian Data We study the behaviour of the shifted function u γ(u + K). By writing the first derivative, we have for u δ(N) γ (u) = 1 2Au exp( Au2) Hence by equivalence (as u > 0) γ (u) = 0 1 = 2Au exp( Au2) 1 = 4A2u2 exp( 2u2) 2A = 2Au2 exp( 2Au2) 2A and z = 2Au2 then we need to solve the following Lambert equation zez = w with unknown z. e i.e n r < 9e 4 tmix en0 then this equation has no solution as u, ueu > 1 e . And by ascending the previous calculus by equivalence, this implies γ (u) > 0. If n r en0 then there exists two critical points z1 = W0(w) and z2 = W 1(w), which in turn gives To know the ordering between both, we need to study the second order derivative. γ (u) = 2C exp( Au2)(2Au2 1) Hence, u < q 1 2A = u0, γ (u) < 0 and u u0, γ (u) 0. This implies that γ is decreasing on [0, u0] and increasing on [u0, 1]. Since γ (0) = 1, lim γ = 1, and γ (u0) < 0 (which naturally comes from the condition n r n0), then γ is positive on [0, u1] [u2, ], and negative on [u1, u2]. This can be summed up in the following variation table. 0 u1 u0 u2 + γ(u1) γ(u1) γ(u2) γ(u2) Therefore, we proved that γ(u2) γ(u1). Finally, we add the constants to obtain the optimal expression of γ(u, r) which will again be denoted by γ(r). Conformal Predictions under Markovian Data n r) + 1 + rα n + 1 +δ(N)+δ(n +N +1)+β(n +1)+β(r) (16) Remark that in the restart case, the optimisation remains the same and we obtain i=1 δ1(i) + δ1(n + 1) (17) Let us consider a few important points in the optimisation of γ(r): From 14 or 15 it is clear that for any u, r γ(u, r) is non increasing. Therefore it must also hold for the optimal value of u = u2 n r + exp 1 2W 1( n0 n + exp 1 2W 1( n0 For our method to achieve optimal lower bound γas, we are interested in processes with β coefficient such that for a well chosen r(n) = on(n) we have β(r) = On(γas) and r n = On(γas) Finally, this will yield an optimal asymptotic value (in the geometric ergodic case) γ = γ(r(n)) = On,N(γas) We start by analysing the asymptotic order of γas. Using the inequality from (Chatzigeorgiou, 2013), we have 2z z W 1( e z 1) 1 Hence, with z = ln n n0 , we have γmin γas γmax n )2/3 exp( r It is clear that γmin = On( r n0 )) = On( q For γmax, we prove that 1 n1/3 exp( q 1 2 ln n) < q n0 ). This is equivalent to proving that for n large 2 ln n ln ln n < 0 We have for any z large enough Conformal Predictions under Markovian Data and the term inside the parenthesis can be seen as a polynomial of degree 2 with unknown 1 ln z. The discriminant is strictly positive hence h is the sign of the leading coefficient which is 1 here. Therefore h is decreasing and for h(e) = 1 2 < 0 hence h is negative. 
This proves that γmax = On( q n0 ) and therefore γas has also the same rate of convergence i.e γas = n0 ) = On( q Hence, ignoring the total variation terms, we are looking for stochastic processes such that there exists r = on(n) and β(r) = On( q Optimisation of γ(r) for geometrically ergodic chains We show that all conditions are fulfilled for geometric ergodic Markov chains and we can achieve asymptotically optimal rate of convergence γas Indeed, since β(r) Cρr, taking r = ln n ρ gives r = on(n), r n = On(γas), β(r) = On( 1 n) = On(γas) and the proof is A.3. Proof of Proposition 4.2 Proof. Comparison with µ Let qα,n be the empirical quantile of {|Yi µ(Xi)|}n i=1. Let E = {||ˆµN µ|| c N} such that P(E) 1 d N. |ˆqα,n ˆqα| |ˆqα,n qα,n| + |qα,n qα| + |qα ˆqα| Following the proof of (Lei et al., 2018), we have for any Dtr E, |qα ˆqα| d N and for any Dcal |ˆqα,n qα,n| c N Hence for any u > 2κc N + d N, we have {|qα,n qα| u κ } {|ˆqα,n ˆqα| u P(|ˆqα,n ˆqα| u κ) P(|ˆqα,n ˆqα| u P(|qα,n qα| u In the restart version, we can directly apply Equation 12 provided that u = u 2κc N d N N and obtain that Conformal Predictions under Markovian Data P(|ˆqα,n ˆqα| u κ) (1 2 exp 2n u δ1(N) 2 )Ptr(E) u δ1(N) 2 d N Application of the concentration inequality over Dcal as in Proposition 4.1 We now compute P(|qα,n qα| u Since the concentration inequality was applied on the last n r samples, we need to replace |qα,n qα| by |qα ,n r qα| for a certain α . For any r [1, n], we have qα+dα,(n r) qα,n qα dα,(n r) where dα = 1+rα n+1 . Hence, we have for v = u qα v qα,n qα + v = qα + v qα+dα,(n r) , qα v qα dα,(n r) Supposing dα N, we also have by the bounded hypothesis on the density f, (1 α + dα) (1 α) = Z qα dα qα f (qα dα qα)κ qα dα qα + dα κ and qα+dα qα dα qα+v qα+dα,(n r) , qα v qα dα,(n r) = qα+dα+dα κ +v qα+dα,(n r) , qα dα dα κ v qα dα,(n r) Finally, we know from Equation 12 that for any u > δ(N), we have provided that u α + dα, u N qα+dα,(n r) qα+dα u qα+dα + u with probability at least 1 exp 2(n r)(u δ(N))2 if the (reduced) calibration and training datasets are independent and similarly qα dα,(n r) qα dα+u qα dα u Therefore identifying u κ + v i.e u = dα + vκ we have Conformal Predictions under Markovian Data P(|qα,n qα| u P(qα dα qα dα,n r u 2κc N d N + dα κ , qα+dα,n r qα+dα u 2κc N d N + dα P r(qα dα qα dα,n r u 2κc N d N + dα κ , qα+dα,n r qα+dα u 2κc N d N + dα κ , E) β(r) (1 2 exp 2(n r) u δ(N) 2 )Ptr(E) β(r) 1 2 exp 2(n r) u δ(N) 2 d N β(r) A.4. Proof of Proposition 4.3 The proof in (Lei et al., 2018) consists in proving the two following points: ε > 0 lim N PXn+1,tr(|ˆµN(Xn+1) µ(Xn+1)| > ε) = 0 ε > 0 lim N limn P(|ˆqα,n qα| > ε) = limn lim N P(|ˆqα,n qα| > ε) = 0 We tackle 1) the additional difficulty that N and n are different, 2) the non iid settings. For the second point, we know by (Lei et al., 2018) that |ˆqα qα| (M/2κ) N hence a N,n = P(|ˆqα,n qα| > ε) P(|ˆqα,n ˆqα| + (M/2κ) N > ε) By the hypothesis that N = o N(1), c N = o N(1), d N = o N(1), Let N0 such that N N0, N ε Let N1 N0 such that N N1, f(N) = (M/2κ) N + 2κc N + d N κε/2. Let n0 such that n n0, κϵ 1+αr n+1 f(N1). This is justified as the l.h.s tends to κε when n therefore it must be larger than κε/2 after an integer n0. 
Then, by applying Proposition 4.2, we have for any $n \ge n_0$ and $N \ge N_1$,
$$\mathbb{P}\big(|\hat q_{\alpha,n} - q_\alpha| > \varepsilon\big) \le 2\exp\Big(-\frac{2(n - \ln n)}{9\,t_{mix}}\big(u' - \delta(N_1)\big)^2\Big) + d_{N_1} + \beta(\ln n), \qquad\text{where } u' = \kappa\varepsilon - \frac{M}{2\kappa}b_{N_1} - 2\kappa c_{N_1} - d_{N_1} - \frac{1 + \alpha\ln n}{n+1}.$$
Let $N_2 \ge N_1$ be such that for all $N \ge N_2$, $f(N) \le f(N_1)$. Then, for all $n \ge n_0$ and $N \ge N_2$,
$$\mathbb{P}\big(|\hat q_{\alpha,n} - q_\alpha| > \varepsilon\big) \le 2\exp\Big(-\frac{2(n - \ln n)}{9\,t_{mix}}\big(u' - \delta(N)\big)^2\Big) + d_N + \beta(\ln n), \qquad\text{where } u' = \kappa\varepsilon - \frac{M}{2\kappa}b_N - 2\kappa c_N - d_N - \frac{1 + \alpha\ln n}{n+1}.$$
Now that we have decoupled $n$ and $N$ (since all the integers defined above only depend on $\varepsilon$), we can take the limits with respect to $n$ and $N$. Since the bound is a sum of terms that tend to $0$ (assuming the chain is $\beta$-mixing), the order of the limits does not matter and we have, for all $\varepsilon > 0$,
$$\lim_{N\to\infty}\lim_{n\to\infty}\mathbb{P}\big(|\hat q_{\alpha,n} - q_\alpha| > \varepsilon\big) = \lim_{n\to\infty}\lim_{N\to\infty}\mathbb{P}\big(|\hat q_{\alpha,n} - q_\alpha| > \varepsilon\big) = 0.$$

To prove the first point, remark that the condition $\mathbb{E}_{X\sim\pi, tr}\big[(\hat\mu_N(X) - \mu(X))^2\big] = o_N(1)$ implies that there exist sequences $\eta_N = o(1)$ and $\rho_N = o(1)$ such that
$$\mathbb{P}\Big(\mathbb{E}_{X\sim\pi}\big[(\hat\mu_N(X) - \mu(X))^2\,\big|\,\hat\mu_N\big] \ge \eta_N\Big) \le \rho_N. \qquad (18)$$
Let $I = \big\{\mathbb{E}_{X\sim\pi}\big[(\hat\mu_N(X) - \mu(X))^2\,\big|\,\hat\mu_N\big] \le \eta_N\big\}$ be the corresponding event, which has probability at least $1 - \rho_N$. We calculate
$$\mathbb{P}\big(|\hat\mu_N(X_{n+1}) - \mu(X_{n+1})| \ge \eta_N^{1/3}\big) \le \mathbb{P}^{\otimes}_{n+1}\big(|\hat\mu_N(X'_{n+1}) - \mu(X'_{n+1})| \ge \eta_N^{1/3}\big) + \beta(n+1) = \mathbb{E}\Big(\mathbb{P}^{\otimes}_{n+1}\big(|\hat\mu_N(X'_{n+1}) - \mu(X'_{n+1})| \ge \eta_N^{1/3}\,\big|\,\mathcal{D}_{tr}\big)\Big) + \beta(n+1).$$
Therefore, by the Markov inequality conditioned on $I$,
$$\mathbb{P}^{\otimes}_{n+1}\big(|\hat\mu_N(X'_{n+1}) - \mu(X'_{n+1})| \ge \eta_N^{1/3}\,\big|\,\mathcal{D}_{tr}\big) \le \frac{\mathbb{E}_{X'_{n+1}}\big[(\hat\mu_N(X'_{n+1}) - \mu(X'_{n+1}))^2\big]}{\eta_N^{2/3}}.$$
If $X'_{n+1} \sim \pi$, the right-hand side becomes $\eta_N^{1/3}$ and, under the $\beta$-mixing assumption, we prove the first point since
$$\mathbb{P}\big(|\hat\mu_N(X_{n+1}) - \mu(X_{n+1})| \ge \eta_N^{1/3}\big) \le \eta_N^{1/3} + \rho_N + \beta(n+1),$$
and the right-hand side tends to $0$ independently of the order of the limits in $n$ and $N$.

Otherwise, under $\mathbb{P}$, $X_{n+1} \sim \nu_0 P^{n+N+1}$ (or $\nu_1 P^{n}$ in the restart version). Suppose that we have furthermore Assumption 2 and let $I = \big\{\mathbb{E}_{X\sim\pi}[(\hat\mu_N(X) - \mu(X))^2\,|\,\hat\mu_N] \le \eta_N,\ \|\hat\mu_N - \mu\|_\infty \le c_N\big\}$ be the event, which has probability at least $1 - \rho_N - d_N$. From the same calculation as above, we bound the right-hand side:
$$\mathbb{E}_{X'_{n+1}}\big[(\hat\mu_N(X'_{n+1}) - \mu(X'_{n+1}))^2\big] = \int_x (\hat\mu_N(x) - \mu(x))^2\,\nu_0 P^{n+N+1}(dx) \le c_N^2.$$
Hence, we obtain
$$\mathbb{P}\big(|\hat\mu_N(X_{n+1}) - \mu(X_{n+1})| \ge \eta_N^{1/3}\big) \le \frac{c_N^2}{\eta_N^{2/3}} + \rho_N + d_N + \beta(n+1).$$
Furthermore, remark that Assumption 2 also implies Equation (18), since
$$\mathbb{E}_{X\sim\pi}\big[(\hat\mu_N(X) - \mu(X))^2\,\big|\,\hat\mu_N\big] \le \|\hat\mu_N - \mu\|_\infty^2 \le c_N^2,$$
where the last inequality holds with probability at least $1 - d_N$; we can therefore take $\eta_N = c_N^2$ and $\rho_N = d_N$. Hence, we can conclude again by taking the limits, independently of their order, as $n, N \to \infty$.

A.5. Proof of Proposition 5.1

Proof. For simplicity, we suppose that $K$ divides $n$ (otherwise the difference is asymptotically negligible). For any $r \le s$, let $Z_r^s = \{(X_i, Y_i)\}_{i=r}^s$.

Stationary case. We describe here the original blocking technique for stationary $\beta$-mixing sequences with fixed block size. Divide a sample $Z = (Z_1, \dots, Z_n)$ into $2m$ blocks $B_1, \dots, B_{2m}$: even blocks are of size $a$ and odd blocks are of size $b$, i.e. $n = m(a + b)$ (suppose $n$ is even, as there are only minor changes otherwise). Denote by $B_{odd} = (B_1, B_3, \dots, B_{2m-1})$ the list of odd blocks and define its independent version $B'_{odd} = (B'_1, B'_3, \dots, B'_{2m-1})$, where $B'_{2k-1} \overset{d}{=} B_{2k-1}$ and $B'_1, B'_3, \dots$ are mutually independent. Then we have the following theorem.

Theorem A.1 (Blocking technique (Yu, 1994)). For all bounded measurable functions $|h| \le M$,
$$\big|\mathbb{E}\big(h(B_{odd})\big) - \mathbb{E}\big(h(B'_{odd})\big)\big| \le 2 M m \beta(a).$$

1) Suppose, in this paragraph only, that $Z_{1-N}^0$ and $Z_1^n$ are independent, and let $Z_{1-N}^0$ be fixed so that the score function $S$ is well defined and constant. Let $B_{odd} = \{Z_{iK+1}\}_{i=0}^{n/K} = \{Z_{iK+1}\}_{i=0}^{n/K - 1} \cup \{Z_{n+1}\}$ and $h(B_{odd}) = \mathbb{1}\{S_{n+1} \le \hat q_{\alpha,K,n}\}$. Conditionally on $Z_{1-N}^0$, $Z_1^n$ still follows the stationary distribution since they are independent. Therefore, applying Theorem A.1, we obtain
$$\big|\mathbb{P}_{cal,Z_{n+1}}(S_{n+1} \le \hat q_{\alpha,K,n}) - \mathbb{P}_{cal',Z'_{n+1}}(S_{n+1} \le \hat q_{\alpha,K,n})\big| \le 2\,\frac{n}{K}\,\beta_{cal}(K),$$
where under $\mathbb{P}_{cal',Z'_{n+1}}$ the samples $\{Z_{iK+1}\}_{i=0,\dots,n/K}$ are i.i.d.
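Concretely, the thinned calibration sample appearing above is the one used by K-split CP. The following sketch (not the authors' code; the score array, $\alpha$ and $K$ below are placeholder assumptions) keeps one in $K$ calibration scores and computes the usual split-CP empirical quantile on the retained sub-sample.

```python
# Minimal sketch of the K-split CP calibration step: keep the sub-sample
# {S_{iK+1}} (one in K scores) and compute the split-CP quantile on it.
import numpy as np

def k_split_quantile(scores, alpha, K):
    """Empirical quantile \\hat q_{alpha,K,n} computed on every K-th calibration score."""
    thinned = np.sort(scores[::K])                   # sub-sample indices 0, K, 2K, ...
    m = len(thinned)
    k = min(int(np.ceil((1 - alpha) * (m + 1))), m)  # usual split-CP rank on m retained points
    return thinned[k - 1]

rng = np.random.default_rng(0)
scores = np.abs(rng.normal(size=10_000))             # hypothetical conformity scores |Y_i - mu_hat(X_i)|
q_hat = k_split_quantile(scores, alpha=0.1, K=25)
print(q_hat)
```

The prediction set for a new covariate $x$ is then built exactly as in split CP, e.g. $\{y : |y - \hat\mu(x)| \le \hat q_{\alpha,K,n}\}$, with the only change being that the quantile is computed on the thinned sub-sample.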
Since the calibration dataset is independent of the training dataset, $\beta_{cal}(K)$ can be seen simply as the original $\beta(K)$ coefficient, but for a shorter chain. Therefore $\beta_{cal}(K) \le \beta(K)$. It is well known that
$$1 - \alpha \;\le\; \mathbb{P}_{cal',Z'_{n+1}}(S_{n+1} \le \hat q_{\alpha,K,n}) \;\le\; 1 - \alpha + \frac{1}{1 + n/K}.$$
From there, we can conclude for the restart version in the stationary case.

2) Suppose now that $Z_{1-N}^0$ and $Z_1^n$ are no longer independent. Denote by $\mathbb{P}^{\otimes}_{r,cal,Z_{n+1}}$ the probability under which $Z_r^{n+1}$ is independent of $Z_{1-N}^0$. Similarly, denote by $\mathbb{P}^{\otimes}_{r,cal',Z'_{n+1}}$ the same probability with, in addition, $\{Z_i\}_{i=r}^{n+1}$ i.i.d. Then
$$
\begin{aligned}
\mathbb{P}(S_{n+1} \le \hat q_{\alpha,K,n})
&\ge \mathbb{P}(S_{n+1} \le \hat q_{\alpha',K,n-r}), \qquad\text{where } \alpha' = \alpha + \frac{1 + \alpha r}{n+1},\\
&\ge \mathbb{P}^{\otimes}_r(S_{n+1} \le \hat q_{\alpha',K,n-r}) - \beta(r)\\
&= \mathbb{E}\big[\mathbb{P}^{\otimes}_{r,cal,Z_{n+1}}(S_{n+1} \le \hat q_{\alpha',K,n-r})\big] - \beta(r)\\
&\ge \mathbb{E}\big[\mathbb{P}^{\otimes}_{r,cal',Z'_{n+1}}(S_{n+1} \le \hat q_{\alpha',K,n-r})\big] - 2\,\frac{n-r}{K}\,\beta(K) - \beta(r)\\
&\ge 1 - \alpha - \frac{1 + \alpha r}{n+1} - 2\,\frac{n-r}{K}\,\beta(K) - \beta(r),
\end{aligned}
$$
and similarly,
$$\mathbb{P}(S_{n+1} \le \hat q_{\alpha,K,n}) \le 1 - \alpha + \frac{K}{n-r} + \frac{1 + \alpha r}{n+1} + 2\,\frac{n-r}{K}\,\beta(K) + \beta(r).$$

Non-stationary case. 1) Suppose again, in this paragraph only, that $Z_{1-N}^0$ and $Z_1^n$ are independent, and let $Z_{1-N}^0$ be fixed so that $S$ is well defined and constant. However, $Z_1^n$ is not necessarily stationary anymore since, for all $i \in [1, n]$, $P_i = \mathcal{L}(Z_i) = \nu_0 P^{N+i}$. We will use an extension of the original blocking technique to non-stationary $\beta$-mixing sequences, which can be found in (Kuznetsov & Mohri, 2017). More precisely, suppose $n = ma$ for $a \ge 1$, $m \ge 1$, and define, for $j = 0, \dots, a-1$, the list $B^j_{odd} = (Z_{1+j}, Z_{1+a+j}, \dots, Z_{1+(m-1)a+j})$, where consecutive samples are separated by $a$. Then we have the following proposition.

Proposition A.1 (Proposition 2 of (Kuznetsov & Mohri, 2017)). For all bounded measurable functions $|h| \le M$, we have
$$\big|\mathbb{E}\big(h(B^j_{odd})\big) - \mathbb{E}\big(h(B'_\pi)\big)\big| \le 2 M m \bar\beta(a),$$
where $\bar\beta(a) = \sup_{t\ge1}\mathbb{E}\big(\|P^{t+a}(\cdot\,|\,(Z_1,\dots,Z_t)) - \pi\|_{TV}\big)$ and $B'_\pi$ is an i.i.d. sample of size $m$ drawn from the stationary distribution $\pi$.

Let $B_{odd} = \{Z_{iK+1}\}_{i=0}^{n/K} = \{Z_{iK+1}\}_{i=0}^{n/K-1} \cup \{Z_{n+1}\}$ and $h(B_{odd}) = \mathbb{1}\{S_{n+1} \le \hat q_{\alpha,K,n}\}$. Then, applying Proposition A.1 to $B_{odd}$, we obtain
$$\big|\mathbb{P}_{cal,Z_{n+1}}(S_{n+1} \le \hat q_{\alpha,K,n}) - \mathbb{P}_{cal',Z'_{n+1}}(S_{n+1} \le \hat q_{\alpha,K,n})\big| \le 2\,\frac{n}{K}\,\bar\beta(K),$$
where, for all $0 \le i \le n/K$, the $Z'_{iK+1}$ are independent and follow the stationary distribution $\pi$. Therefore, we again fall back to the i.i.d. setting, and it is well known that
$$1 - \alpha \;\le\; \mathbb{P}_{cal',Z'_{n+1}}(S_{n+1} \le \hat q_{\alpha,K,n}) \;\le\; 1 - \alpha + \frac{1}{1 + n/K}.$$
We can conclude for the restart version in the non-stationary case.

2) Suppose now that $Z_{1-N}^0$ and $Z_1^n$ are no longer independent. Then
$$
\begin{aligned}
\mathbb{P}(S_{n+1} \le \hat q_{\alpha,K,n})
&\ge \mathbb{P}(S_{n+1} \le \hat q_{\alpha',K,n-r}), \qquad\text{where } \alpha' = \alpha + \frac{1 + \alpha r}{n+1},\\
&\ge \mathbb{P}^{\otimes}_r(S_{n+1} \le \hat q_{\alpha',K,n-r}) - \beta(r)\\
&= \mathbb{E}\big[\mathbb{P}^{\otimes}_{r,cal,Z_{n+1}}(S_{n+1} \le \hat q_{\alpha',K,n-r})\big] - \beta(r)\\
&\ge \mathbb{E}\big[\mathbb{P}^{\otimes}_{r,cal',Z'_{n+1}}(S_{n+1} \le \hat q_{\alpha',K,n-r})\big] - 2\,\frac{n-r}{K}\,\bar\beta_r(K) - \beta(r)\\
&\ge 1 - \alpha - \frac{1 + \alpha r}{n+1} - 2\,\frac{n-r}{K}\,\bar\beta_r(K) - \beta(r),
\end{aligned}
$$
and similarly,
$$\mathbb{P}(S_{n+1} \le \hat q_{\alpha,K,n}) \le 1 - \alpha + \frac{K}{n-r} + \frac{1 + \alpha r}{n+1} + 2\,\frac{n-r}{K}\,\bar\beta_r(K) + \beta(r).$$

A.6. Proof of Theorem 5.1

We focus our analysis on the version without restart, as the restart version is simply an application of the former.
We suppose $K \gg 1$ (which will be verified further on) and we wish to minimise the total coverage gap as a function of $K$ (i.e. without the terms depending on $r$ only):
$$\gamma_K = \frac{K}{n-r} + a\,\frac{n-r}{K}\,\rho^{K}$$
(replace $\rho$ by $\bar\rho$ in the non-stationary case). For $u > 0$, let $f(u) = \frac{u}{n-r} + a\,\frac{n-r}{u}\,e^{u\ln\rho}$; then
$$f'(u) = \frac{1}{n-r} + a(n-r)\,\frac{e^{u\ln\rho}\,(u\ln\rho - 1)}{u^2} = 0.$$
Suppose furthermore that the optimal $K$ (and therefore $u$) satisfies $|K\ln\rho| \gg 1$; then the last equation is asymptotically equivalent to
$$\frac{1}{n-r} + a(n-r)\,\frac{\ln(\rho)\,e^{u\ln\rho}}{u} = 0 \;\Longleftrightarrow\; \Big(u\ln\tfrac{1}{\rho}\Big)\,e^{u\ln\frac{1}{\rho}} = a(n-r)^2(\ln\rho)^2,$$
which gives
$$u = \frac{W_0\big(a(n-r)^2(\ln\rho)^2\big)}{\ln\frac{1}{\rho}} = O_n\Big(\frac{\ln\big(a(n-r)^2(\ln\rho)^2\big)}{\ln\frac{1}{\rho}}\Big) = O_n\Big(\frac{\ln(n-r)}{\ln\frac{1}{\rho}}\Big).$$

Let us again consider a few important points when choosing the optimal value of $r$:
- The coverage gap is an increasing function of $r$.
- When $r = 0$, $K = \frac{W_0(an^2(\ln\rho)^2)}{\ln\frac{1}{\rho}}$ (which indeed satisfies $K \gg 1$ and $|K\ln\rho| \gg 1$), therefore $\frac{K}{n} = O_n\big(\frac{\ln n}{n\ln\frac{1}{\rho}}\big)$ and
$$a\,\frac{n}{K}\,\rho^{K} = O_n\Big(a\,\frac{n\ln\frac{1}{\rho}}{\ln n}\exp\big[-\ln\big(an^2(\ln\rho)^2\big)\big]\Big) = O_n\Big(a\,\frac{n\ln\frac{1}{\rho}}{\ln n}\,\frac{1}{an^2(\ln\rho)^2}\Big) = O_n\Big(\frac{1}{n\ln(n)\ln\frac{1}{\rho}}\Big),$$
which finally gives $\gamma_K = O_n\big(\frac{K}{n}\big) = O_n\big(\frac{\ln n}{n\ln\frac{1}{\rho}}\big)$.

From this calculation, we can already conclude in the restart case:
$$\gamma_{low} = \gamma = O_n\Big(\frac{1}{n\ln(n)\ln\frac{1}{\rho}}\Big), \qquad \gamma_{up} = O_n\Big(\frac{\ln n}{n\ln\frac{1}{\rho}}\Big).$$
Let $\gamma_{as} = \frac{\ln n}{n\ln\frac{1}{\rho}}$. Hence, we are looking for $r \in [n]$ such that, asymptotically:
- $r = o(n)$, which ensures $n - r = O_n(n)$ as well as $K \gg 1$ and $|K\ln\rho| \gg 1$;
- the remaining terms that depend on $r$ only are also negligible, i.e. $\frac{1 + \alpha r}{n+1} = O_n(\gamma_{as})$ and $\beta(r) = O_n(\gamma_{as})$.
Taking $r = \frac{\ln n}{\ln\frac{1}{\rho}}$ is suitable. Finally, we also achieve an asymptotically optimal coverage gap in the case without restart:
$$\gamma_{low} = \gamma = O_n\Big(\frac{1}{n\ln(n)\ln\frac{1}{\rho}}\Big), \qquad \gamma_{up} = O_n\Big(\frac{\ln n}{n\ln\frac{1}{\rho}}\Big).$$

A.7. Proof of Proposition 5.3

Proof. For a given $\hat K_N$, we wish to compute the asymptotic coverage gaps $\gamma_{low}(\hat K_N)$ and $\gamma_{up}(\hat K_N)$. We start by bounding the estimation error on $\hat K_N$. For $\delta \ge 0$ and a universal constant $C$, let
$$w = w(N, \delta) = C\,\sqrt{\frac{\ln\frac{N}{\delta}}{\pi_{\min}(1-\rho)\,N}}.$$
We have, with probability at least $1 - \delta$,
$$\hat K_N \le \frac{\ln n}{\ln\frac{1}{\rho + w}} = \frac{\ln n}{\ln\frac{1}{\rho} - \ln\big(1 + \frac{w}{\rho}\big)} = \frac{\ln n}{\ln\frac{1}{\rho}}\Big(1 + \frac{w}{\rho\ln\frac{1}{\rho}} + o_N(w)\Big) = K + \frac{w\ln n}{\rho(\ln\rho)^2} + o_N(w).$$
Similarly, $\hat K_N \ge K - \frac{w\ln n}{\rho(\ln\rho)^2} + o_N(w)$. Therefore there exists a universal constant, again denoted $C > 0$, such that $|\hat K_N - K| \le C\,w(N,\delta)\,\ln n$; for simplicity, we will omit this constant in what follows.

For the lower bound of the coverage gap, we have with probability at least $1 - \delta$:
$$\gamma_{low}(\hat K_N) \ge \frac{n}{K + \frac{w\ln n}{\rho(\ln\rho)^2}}\,\rho^{\hat K_N} = \frac{n}{K}\Big(1 - \frac{w\ln n}{\rho(\ln\rho)^2 K} + o_N(w)\Big)\rho^{K} = \gamma_{low} - \frac{w\ln n}{\rho(\ln\rho)^2 K}\,\gamma_{low} + o_N(w) = \gamma_{low} - \frac{w}{\rho(\ln\rho)^2\,n\ln n} + o_N(w),$$
and similarly $\gamma_{low}(\hat K_N) \le \gamma_{low} + \frac{w}{\rho(\ln\rho)^2\,n\ln n} + o_N(w)$. Doing the same calculation for the upper bound of the coverage gap $\gamma_{up}(\hat K_N) = \frac{\hat K_N}{n}$, we finally obtain, for a universal constant $C > 0$,
$$|\gamma_{low}(\hat K_N) - \gamma_{low}| \le C\,\frac{w(N,\delta)}{\rho(\ln\rho)^2\,n\ln n}, \qquad |\gamma_{up}(\hat K_N) - \gamma_{up}| \le C\,\frac{w(N,\delta)\,\ln n}{n}.$$
Let $I = \big\{|\gamma_{low}(\hat K_N) - \gamma_{low}| \le C\frac{w(N,\delta)}{\rho(\ln\rho)^2 n\ln n},\ |\gamma_{up}(\hat K_N) - \gamma_{up}| \le C\frac{w(N,\delta)\ln n}{n}\big\}$; we have $\mathbb{P}_{tr}(I) \ge 1 - \delta$. By combining both probabilities,
$$
\begin{aligned}
\mathbb{P}\big(Y_{n+1} \in C_{\hat K_N}(X_{n+1})\big)
&\ge \mathbb{P}\big(Y_{n+1} \in C_{\hat K_N}(X_{n+1}),\, I\big) = \mathbb{E}_{tr}\big[\mathbb{P}\big(Y_{n+1} \in C_{\hat K_N}(X_{n+1})\,\big|\,\mathcal{D}_{tr}\big)\,\mathbb{1}_I\big]\\
&\ge \mathbb{E}_{tr}\big[\big(1 - \alpha - \gamma_{low}(\hat K_N)\big)\,\mathbb{1}_I\big] \ge \mathbb{E}_{tr}\Big[\Big(1 - \alpha - \gamma_{low} - C\,\frac{w(N,\delta)}{\rho(\ln\rho)^2 n\ln n}\Big)\,\mathbb{1}_I\Big]\\
&\ge 1 - \alpha - \gamma_{low} - C\,\frac{w(N,\delta)}{\rho(\ln\rho)^2 n\ln n} - \delta,
\end{aligned}
$$
and similarly, for the upper bound,
$$\mathbb{P}\big(Y_{n+1} \in C_{\hat K_N}(X_{n+1})\big) \le 1 - \alpha + \gamma_{up} + C\,\frac{w(N,\delta)\,\ln n}{n} + \delta.$$
Taking $\delta = \frac{|\mathcal{X}|\ln N}{N}$, this gives $w(N,\delta) = O_N\Big(\sqrt{\frac{\ln N}{\pi_{\min}(1-\rho)N}}\Big)$.

B. Experiments

B.1. Lazy random walk

For a given $w > 0$, the associated transition matrix is $P : [1, w]^2 \to [0, 1]$,
$$P(x, y) = \begin{cases} \frac{1}{4} & \text{if } y = x + 1 \text{ or } y = x - 1 \pmod w,\\[2pt] \frac{1}{2} & \text{if } y = x,\\[2pt] 0 & \text{otherwise.}\end{cases}$$
We consider a simple linear regression for illustration purposes and take $\mu(x) = ax$ for a certain $a \in [0, 1]$. The independent noise follows $\varepsilon_t \sim \mathcal{N}(0, 1)$.
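For illustration, a minimal simulation sketch of this setting is given below (not the authors' code; the values of $w$, $a$ and the trajectory length are hypothetical choices). It generates a trajectory of the lazy random walk together with the responses $Y_t = aX_t + \varepsilon_t$; the resulting pairs can then be split into training and calibration sets and fed to split CP or K-split CP.

```python
# Minimal sketch of the lazy random walk experiment: states in [1, w] with
# wrap-around, stay with probability 1/2, move to either neighbour with
# probability 1/4 each, and responses Y_t = a * X_t + eps_t, eps_t ~ N(0, 1).
import numpy as np

def lazy_random_walk(w, a, n, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(1, w + 1)                 # uniform initial state in [1, w]
    for t in range(1, n):
        step = rng.choice([-1, 0, 0, 1])          # P(+-1) = 1/4, P(0) = 1/2
        x[t] = (x[t - 1] - 1 + step) % w + 1      # wrap around modulo w, staying in [1, w]
    y = a * x + rng.normal(size=n)                # linear response with unit Gaussian noise
    return x, y

X, Y = lazy_random_walk(w=20, a=0.5, n=5_000)
print(X[:10], Y[:3])
```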
B.2. Gaussian AR(1)

It can be easily shown that the corresponding transition kernel $P$ and all its iterates are also Gaussian:
$$P(a, dy) = \frac{1}{\sqrt{2\pi}\,\omega}\exp\Big(-\frac{(y - \theta a)^2}{2\omega^2}\Big)dy, \qquad P^n(x, \cdot) \sim \mathcal{N}\Big(\theta^n x,\ \omega^2\,\frac{1 - \theta^{2n}}{1 - \theta^2}\Big).$$
In the stationary case, we know that $\beta$ can be written as
$$\beta(n) = \int \pi(dx)\,\|P^n(x, \cdot) - \pi\|_{TV}.$$
Furthermore, we also have the following Pinsker inequality on the total variation distance:
$$\|P^n(x, \cdot) - \pi\|_{TV} \le \sqrt{\tfrac{1}{2}\,KL\big(P^n(x, \cdot)\,\|\,\pi\big)},$$
where $KL$ is the usual KL divergence between two distributions. For Gaussian distributions $\mathcal{N}(m_1, \sigma_1^2)$ and $\mathcal{N}(m_2, \sigma_2^2)$, its expression is well known:
$$KL = \frac{1}{2}\Big(\log\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2 + (m_1 - m_2)^2}{\sigma_2^2} - 1\Big).$$
In our case, for a given starting state $x \in \mathbb{R}$,
$$KL\big(P^n(x, \cdot)\,\|\,\pi\big) = -\frac{1}{2}\log(1 - \theta^{2n}) - \frac{\theta^{2n}}{2} + \frac{x^2\theta^{2n}(1 - \theta^2)}{2\omega^2} \;\approx\; \frac{x^2\theta^{2n}(1 - \theta^2)}{2\omega^2} \quad\text{for } n \gg 1,$$
so that
$$\beta(n) = \int \pi(dx)\,\|P^n(x, \cdot) - \pi\|_{TV} \;\le\; \int \pi(dx)\,\sqrt{\frac{x^2\theta^{2n}(1 - \theta^2)}{4\omega^2}},$$
which decays geometrically in $n$ (at rate $\theta^n$). Note also that, if the initial distribution $\nu_0$ is Gaussian, then $\nu_0 P^n$ will also be Gaussian. Therefore, given the geometric ergodicity function $Q(x) = |x|$, the condition of Theorem 5.1, $\int_{\mathbb{R}} Q(x)\,\nu_0 P^N(dx) < \infty$, is verified for any $N$.

B.3. Real-world datasets

Figure 6. Exchange rate EUR/SEK.
Figure 7. French electricity price.
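Returning to the Gaussian AR(1) chain of Appendix B.2, the following sketch (not part of the paper; the values of $\theta$ and $\omega$ are arbitrary choices) evaluates the Pinsker-based bound $\beta(n) \le \int \pi(dx)\sqrt{KL(P^n(x,\cdot)\,\|\,\pi)/2}$ by Monte-Carlo and illustrates its geometric decay in $n$.

```python
# Minimal sketch: Monte-Carlo evaluation of the Pinsker bound on beta(n)
# for the Gaussian AR(1) chain X_{t+1} = theta * X_t + omega * xi_t.
import numpy as np

theta, omega = 0.8, 1.0                                  # hypothetical AR(1) parameters
var_pi = omega**2 / (1 - theta**2)                       # stationary variance of pi

def kl_pn_vs_pi(x, n):
    """KL( P^n(x, .) || pi ) for the Gaussian kernel derived in Appendix B.2."""
    var_n = omega**2 * (1 - theta**(2 * n)) / (1 - theta**2)
    return 0.5 * (np.log(var_pi / var_n) + var_n / var_pi - 1 + (theta**n * x)**2 / var_pi)

rng = np.random.default_rng(0)
xs = rng.normal(scale=np.sqrt(var_pi), size=100_000)     # Monte-Carlo sample from pi
for n in (1, 5, 10, 20):
    beta_bound = np.mean(np.sqrt(kl_pn_vs_pi(xs, n) / 2))  # Pinsker: TV <= sqrt(KL / 2)
    print(n, beta_bound)                                   # decays roughly like theta^n
```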