# sparse_markov_models_for_highdimensional_inference__ecbc13e8.pdf Journal of Machine Learning Research 24 (2023) 1-54 Submitted 3/22; Published 8/23 Sparse Markov Models for High-dimensional Inference Guilherme Ost guilhermeost@im.ufrj.br Institute of Mathematics Federal University of Rio de Janeiro Rio de Janeiro, RJ, Brazil Daniel Y. Takahashi takahashiyd@gmail.com Brain Institute Federal University of Rio Grande do Norte Natal, RN, Brazil Editor: Aur elien Garivier Finite-order Markov models are well-studied models for dependent finite alphabet data. Despite their generality, application in empirical work is rare when the order d is large relative to the sample size n (e.g., d = O(n)). Practitioners rarely use higher-order Markov models because (1) the number of parameters grows exponentially with the order, (2) the sample size n required to estimate each parameter grows exponentially with the order, and (3) the interpretation is often difficult. Here, we consider a subclass of Markov models called Mixture of Transition Distribution (MTD) models, proving that when the set of relevant lags is sparse (i.e., O(log(n))), we can consistently and efficiently recover the lags and estimate the transition probabilities of high-dimensional (d = O(n)) MTD models. Moreover, the estimated model allows straightforward interpretation. The key innovation is a recursive procedure for a priori selection of the relevant lags of the model. We prove a new structural result for the MTD and an improved martingale concentration inequality to prove our results. Using simulations, we show that our method performs well compared to other relevant methods. We also illustrate the usefulness of our method on weather data where the proposed method correctly recovers the long-range dependence. Keywords: Markov Chains, High-dimensional inference, Mixture Transition Distribution 1. Introduction From the daily number of COVID-19 cases to the activity of neurons in the brain, discrete time series are ubiquitous in our life. A natural way to model these time series is by describing how the present events depend on the past events, i.e., characterizing the transition probabilities. Therefore, finite-order Markov chains - models specified by transition probabilities that depend only on a limited portion of the past - are an obvious choice to model time series with discrete values. The length of the portion of the relevant past defines the order of the Markov chain. At first glance, estimating the transition probabilities of a Markov chain from the data is straightforward. Given a sample X1:n := (X1, X2, . . . , Xn) of a stationary d-th order Markov chain on a discrete alphabet A, the empirical transition probabilities are computed, for all past x d: 1 := (x d, . . . , x 1) A{ d,..., 1} and symbol c 2023 Guilherme Ost and Daniel Y. Takahashi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0266.html. Ost and Takahashi ˆpn(a|x d: 1) := Nn(x d: 1, a) P b A Nn(x d: 1, b), where Nn(x d: 1, a) denotes the number of occurrences of the past x d: 1 followed by the symbol a in the sample X1:n. Nevertheless, some difficulties become apparent. First, for a Markov chain of order d, we have to estimate |A|d(|A| 1) transition probabilities (parameters), making the uniform control of estimation errors much harder when the order d increases. One solution to avoid the exponential increase in the number of parameters is to consider more parsimonious classes of models. One such popular class of models is the variable length Markov chains (VLMC), in which P(Xt = a|Xt d:t 1 = x d: 1) = P(Xt = a|Xt ℓ:t 1 = x ℓ: 1), where ℓis a function of the past x d: 1 (Rissanen, 1983; B uhlmann and Wyner, 1999; Galves et al., 2012). The relevant portion x ℓ: 1 of the past x d: 1 is called a context. The key feature of VLMC is that all transition probabilities with the same context have the same values. Therefore, denoting τ as the set of all contexts, the number of transition probabilities that need to be estimated reduces to |τ|(|A| 1). Another class of models that is even more parsimonious is the Minimal Markov Models - also known as Sparse Markov Chains (SMC) (Garcıa et al., 2011; J a askinen et al., 2014). In SMC, we say that the pasts x d: 1 and y d: 1 are related if for all symbols a A, P(Xt = a|Xt d:t 1 = x d: 1) = P(Xt = a|Xt d:t 1 = y d: 1). This relation generates the equivalent classes C1, . . . , CK that partition A{ d,..., 1}. Now, the number of transition probabilities that need to be estimated is K(|A| 1). Both VLMC and SMC have the advantage of better balancing the bias and variance tradeoff. Nevertheless, in both models we still need to estimate the transition probability using ˆpn(a|x d: 1), either because we need to estimate the largest context (for VLMC) or because we need first to calculate the transition probabilities to establish the partitions (for SMC). This creates a second difficulty. For the estimator ˆpn(a|x d: 1) to have any meaning, we have to observe the sequence x d: 1 in the sample X1:n at least once. By ergodicity, the number of times that we will observe the sequence x d: 1 is roughly n P(X1:d = x d: 1). It is straightforward to show that, if the transition probabilities are bounded below from zero, there exists a constant c > 0 such that P(X1:d = x d: 1) < e cd. Therefore, in general, it is hopeless to have a reasonable estimator ˆpn(a|x d: 1) if d > (1 + ε) log n/c, for some positive value ε. This imposes a fundamental limit to the size of the past that can be included in the description of the time series. Moreover, Markov chains with small orders are not consistent with the known workings of several natural phenomena where the transition probabilities might depend on remote pasts. For example, in predicting whether today will be a warm or cold day, we might need to know remote past events like the corresponding weather approximately a year ago (Kir aly et al., 2006; Yuan et al., 2013). Physiological phenomena in humans with cycles of different lengths might result from dependence on events at vastly different temporal scales (Gilden et al., 1995; Chen et al., 1997; Buzsaki and Draguhn, 2004). Importantly, Sparse Markov Models for High-dimensional Inference not all portions of the past are necessarily relevant. These observations motivate us to explore sparser representations of the dependence on past events. The mixture of transition distribution model (MTD) is a subclass of finite order Markov chains that can be used to obtain such sparse representation. Like VLMC and SMC, MTD was initially introduced to overcome the problem of exponential increase in the number of transition probabilities for Markov chains (Raftery, 1985; Berchtold and Raftery, 2002). MTD represents higher-order Markov models as a convex mixture of single-step Markov chains, where each single-step Markov chain depends on a single time point in the past. If an MTD model is a mixture of only a few single step Markov chains, we naturally obtain a class of sparse Markov chains that depends only on a small portion of the past events. Nevertheless, available methods to consistently estimate the transition probabilities of MTD still need to consider all the past events up to the MTD order (Berchtold and Raftery, 2002), which might include irrelevant portions of the past. This fact still restricts the MTD order to d = O(log n). In this work, we introduce a simple method that consistently recovers the relevant part of the past even when the order d of the MTD model is proportional to the sample size n (i.e., d = O(n)) if the size of the relevant past is O(log n). Consequently, we prove that we can consistently estimate the transition probabilities for high dimensional MTD under sparsity constraint. Our estimator is computationally efficient, consisting of a forward stepwise procedure that finds the candidates for relevant past portions and a cutting procedure to eliminate the irrelevant portions. The theoretical guarantees of our estimator are based on a novel structural result for MTD and an improved martingale concentration inequality. Both results might have an interest on their own. Moreover, we show that the estimator can be further improved when the alphabet is binary. We also prove that in several cases, our estimator is minimax rate optimal. Finally, using simulated data, we show that our method s performance is, in general superior to a best subset selection method, where the lags with k largest weights are selected after estimating the model with a classical MTD estimation method (Berchtold, 2001), and similar to the performance of Conditional Tensor Factorization (CTF) based on higher-order Markov chain estimation when the order is moderate (Sarkar and Dunson, 2016). We also applied our method to weather data to model a binary sequence indicating days with and without rain. Our method successfully captures long-range dependencies (e.g. annual cycle) that were not detected either by VLMC algorithm with BIC order selection (Csisz ar and Talata, 2006) or by the CTF based higher order Markov chain estimation. New Bayesian approaches for higher order VLMC and MTD selection were introduced in (Kontoyiannis et al., 2020; Heiner and Kottas, 2021), where a posteriori most likely model estimation is considered. These works provide interesting alternative approaches for modeling higherorder Markovian dependence in a Bayesian setting. We organized the paper as follows. In Section 2 we introduce the main notations, definitions, and assumptions that we will use throughout the paper. In Section 3 we introduce the algorithms to select the relevant part of the past. In Section 3.4 we provide an estimate of the estimator s convergence rate for the transition probabilities. In Section 3.5, we show that our estimator achieves the optimal minimax rate. In Section 4, we illustrate the performance of the proposed estimators through simulations and an application on weather data. Ost and Takahashi 2. Notation, Model Definition and Preliminary Remarks 2.1 General notation We denote Z = {. . . 1, 0, 1, . . .} and Z+ = {1, 2 . . .} the set of integers and positive integers respectively. For s, t Z with s t, we write Js, t K to denote the discrete interval Z [s, . . . , t]. Throughout the article A denotes a finite subset of R, called alphabet. The elements of A will be denoted by the first letters of the alphabet a, b and c. Hereafter, we denote A = maxa A |a| and Diam(A) = maxa,b A |a b|. For each S Z, the set AS denotes the set of all A-valued strings x S = (xj)j S indexed by the set S. To alleviate the notation, if S = Js, t K for s, t Z with s t, we write xs:t instead of x Js,t K. For any non-empty subsets U S Z and any string x S AS, we denote x(S\U) A(S\U) the string obtained from x S by removing the string x U AU. For all t Z and S Z, we will write in some cases t + S do denote the set {t + s : s S}. The set of all finite A-valued strings is denoted by S Z:S finite AS. For all x A, we denote Sx Z the set indexing the string x, i.e., such that x ASx. Given two probability measures µ and ν on A, we denote d TV (µ, ν) the total variation distance between µ and ν, defined as d TV (µ, ν) = 1 a A |µ(a) ν(a)|. For q Z+, the q-norm of vector v RL is defined as ℓ=1 |vℓ|q !1/q The dimension L Z+ will be implicit in most cases. For two probability distributions P and Q on AJ1,k K where P is absolutely continuous with respect to Q, we denote KL(P||Q) the Kullback-Leibler divergence between P and Q, given by KL(P||Q) = X x1:k AJ1,k K P(x1:k) log P(x1:k) 2.2 Markov models Let X = (Xt)t Z be a discrete time stochastic chain, defined in a suitable probability space (Ω, F, P), taking values in an alphabet A. For a d Z+, we say that X is a Markov chain of order d if for all k Z+ with k > d, t Z and xt k:t AJt k,t K with P(Xt k:t 1 = xt k:t 1) > 0, we have P (Xt = xt|Xt k:t 1 = xt k:t 1) = P (Xt = xt|Xt d:t 1 = xt d:t 1) . (1) Sparse Markov Models for High-dimensional Inference We say that a Markov chain is stationary if Xs:t and Xs+h:t+h have the same distribution for all t, s, h Z. Throughout the article, the distribution of a stationary Markov chain will be denoted by P. For a finite S Z and x S AS, we write P(x S) to denote P(XS = x S). The support of a stationary Markov model is the set supp(P) = {x A : P(x Sx) > 0}. For stationary Markov chains, the conditional probabilities in (1) do not depend on the time index t. Therefore, for a stationary Markov chain of order d, for any a A, x S supp(P) with S J d, 1K and t Z, we denote p(a|x S) = P (Xt = a|Xt+S = x S) . Notice that p( |x S) is a probability measure on A, for each fixed past x S supp(P). The set {p( |x d: 1) : x d: 1 supp(P)} is called the family of transition probabilities of the chain. In this article, we consider only stationary Markov chains. For a Markov chain of order d, the oscillation δj for j J d, 1K is defined as δj = max{d TV (p( |x d: 1), p( |y d: 1)) : (x d: 1, y d: 1) AJ d, 1K, x k = y k, k = j}. The oscillation is useful to measure the influence of a j-th past value in the values of the transition probabilities. 2.3 Mixture transition distribution (MTD) models A MTD model of order d Z+ is a Markov chain of order d for which the associated family of transition probabilities {p( |x d: 1) : x d: 1 supp(P)} admits the following representation: p(a|x d: 1) = λ0p0(a) + j= d λjpj(a|xj), a A, (2) with λ0, λ 1, . . . , λ d [0, 1] satisfying P0 j= d λj = 1 and p0( ) and pj( |b),j J d, 1K and b A, being probability measures on A. Following (Berchtold and Raftery, 2002), we call the index j J d, 0K of the weight λj in (2) the j-th lag of the model. The representation in (2) has the following probabilistic interpretation. To sample a symbol from p( |x d: 1), we first choose a lag in J d, 0K randomly, being λj the probability of choosing the lag j. Once the lag has been chosen, say lag j, we then sample a symbol from the probability measure pj( |xj) which depends on the past x d: 1 only through the symbol xj. Notice that a symbol is sampled independently from the past x d: 1, whenever the lag 0 is chosen. For later use, let us define the conditional average at lag j as a A apj(a|b), (3) for each j J d, 1K and b A. For a MTD model of order d, we have that the oscillation δj of the lag j J d, 1K can be written as, δj = λj max b,c A d TV (pj( |b), pj( |c)). (4) Ost and Takahashi Notice that in this case δj = 0 if and only if either λj = 0 or d TV (pj( |b), pj( |c)) = 0 for all b, c A. In the sequel, we say that the lag j is relevant if δj > 0, and irrelevant otherwise. We will denote Λ the set of all relevant lags, i.e., Λ = {j J d, 1K : δj > 0}. (5) The set Λ captures the dependence structure of the MTD model. The size |Λ| of the set Λ represents the degree of sparsity of the MTD model. The smaller the value of |Λ|, the sparser the MTD model. The following quantities will appear in many of our results: δmin = min j Λ δj and δmin = min j Λ λj mj Lip, (6) where mj Lip = max{|mj(b) mj(c)|/|b c| : b, c A, b = c} denotes the Lipschitz norm of the function mj defined in (3). One can check easily that these quantities coincide when the alphabet A is binary (i.e. A = {0, 1}). For general alphabets, the following inequality holds: δmin A 1 δmin min b,c A:b =c |b c|. 2.4 Statistical lag selection Suppose that we are given a sample X1:n of a MTD model of known order d < n and whose set of relevant lags Λ is unknown. The goal of statistical lag selection is to estimate the set Λ from the sample X1:n. Our particular interest is in the high-dimensional setting in which the parameters d = dn and |Λ| = |Λn| scale as a function of the sample size n. Let us write ˆΛn to indicate an estimator of the set of relevant lags Λ computed from the sample X1:n. We say that the estimator ˆΛn is consistent if P(ˆΛn = Λ) 0 as n . With respect to statistical lag selection, our goal is to exhibit sufficient conditions for each proposed estimator guaranteeing its consistency. 2.5 Empirical transition probabilities Let n, m and d be positive integers such that n m > d. We denote for each a A and x S A with S J d, 1K non-empty, Nm,n(x S, a) = t=m+d+1 1{Xt+j = xj, j S, Xt = a}. The random variable Nm,n(x S, a) indicates the number of occurrences of the string x S followed by the symbol a, in the last n m symbols Xm+1:n of the sample X1:n. We also define Nm,n(x S) = P a A Nm,n(x S, a). With this notation, the empirical transition Sparse Markov Models for High-dimensional Inference probabilities computed from the last n m symbols Xm+1:n of the sample X1:n are defined as, ˆpm,n(a|x S) = (Nm,n(x S,a) Nm,n(x S) , if Nm,n(x S) > 0 1 |A|, otherwise . (7) When the countings are made over the whole sample X1:n, we denote Nn(x S, a) and Nn(x S) the corresponding counting random variables, and ˆpn(a|x S) the corresponding empirical transition probabilities. In the next sections, the estimators for the set of relevant lags we propose in this paper rely on these empirical transition probabilities. If ˆΛm denotes an estimator for the set of relevant lags Λ computed from X1:m, we expect that under some assumptions (guaranteeing in particular the consistency of ˆΛm) the empirical transition probability ˆpm,n(a|xˆΛm) converges (in probability) to p(a|xΛ) as min{n, m} , for any x d: 1 supp(P). To understand the convergence for the transition probabilities of high order Markov chains is crucial in our analysis. 2.6 Assumptions We collect here the main assumptions used in the article. Assumption 1 The MTD model has full support, that is, supp(P) = A. In other words, Assumption 1 means that P(XS = x S) = P(x S) > 0 for any string x S AS with S Z finite. This means that the marginal distributions of the distribution generating the data are strictly positive. Such a condition is usually assumed in the problem of estimating the graph structure underlying graphical models (see for instance Chapter 11 of (Wainwright, 2019)). Notice that this assumption implies, in particular, that pmin = min{p(a|xΛ) : a A, xΛ AΛ} > 0, (8) where p( |xΛ) are the transition probabilities of MTD generating the data. Assumption 2 The quantity := 1 P j Λ δj > 0, where δj is given by (4). We have that λ0 > 0 is a sufficient condition to Assumption 2 to hold. To check this, notice that X j Λ λj max b,c A d TV (pj( |b), pj( |c)) X j Λ λj = 1 λ0, where we have used that d TV (pj( |b), pj( |c)) 1 for all b, c A and j Λ. Hence, it follows that > 0 whenever λ0 > 0. Assumptions 1 and 2 are used to obtain concentration inequalities for the counting random variables Nm,n(x S, a) and Nm,n(x S) appearing in the definition of the empirical transition probabilities ˆpm,n(a|x). The next assumption is as follows. Assumption 3 For each j Λ, there exists b , c A such that mj(b ) = mj(c ), where mj( ) is defined in (3). Ost and Takahashi Notice that if A = {0, 1}, then mj(1) mj(0) = pj(1|1) pj(1|0), so that Assumption 3 holds whenever d TV (pj( |1), pj( |0)) = |pj(1|1) pj(1|0)| > 0 for each j Λ. In this case this is always true by the definition of the set Λ. As we will see in Section 3, the condition is crucial to prove a structural result about MTD models, presented in Proposition 6. In what follows, Px S(Xj |Xk = b) denotes the conditional distribution of Xj given XS = x S and Xk = b. We use the convention that, for S = , these conditional probabilities correspond to the unconditional ones. Moreover, for any function f : A R, we write Ex S(f(Xj)|Xk = b) to denote the expectation of f(Xj) with respect to Px S(Xj |Xk = b). The next two assumptions are the following. Assumption 4 (Inward weak dependence condition) There exists Γ1 (0, 1] such that the following condition holds: for all S J d, 1K such that Λ S, k Λ \ S and b, c A with b = c satisfying |mk(b) mk(c)| > 0, max x S AS X λj |Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)| λk|mk(b) mk(c)| (1 Γ1). (9) Assumption 5 (Outward weak dependence condition) The alphabet is binary, i.e. A = {0, 1}. Moreover, there exists Γ2 (0, 1] such that the following condition holds: for all S J d, 1K such that S Λ and k / Λ, j Λ\S max x S {0,1}S |Px S(Xk = 1|Xj = 1) Px S(Xk = 1|Xj = 0)| Γ2. (10) Both Assumptions 4 and 5 are conditions of weak dependence. In words, Assumption 4 says that no relevant lag j can be completely determined by any subset S containing only relevant lags or any other relevant lag k when combined with some irrelevant lags. Similarly, Assumption 5 says that irrelevant lags cannot be completely determined by some subset of relevant lags. These two assumptions will be only necessary to obtain a computationally very efficient algorithm. 3. Statistical Lag Selection In this section, we address the problem of statistical lag selection for the MTD models. We will first introduce a statistical procedure called PCP estimator that is general and works well if there is a known small set S such that Λ S. When such set S is not available, we will have to consider an alternative procedure called FSC estimator, which will be introduced later. 3.1 Estimator based on pairwise comparisons Throughout this section we suppose that there is a known set S J d, 1K such that Λ S. Note that this is always satisfied in the worse case scenario in which the set S is the whole set J d, 1K. In some cases, however, we may have a prior knowledge on the set Λ and we can use this information to restrict our analysis to the lags in a known set S of size (possibly much) smaller than d. Sparse Markov Models for High-dimensional Inference The estimator discussed in this section is based on pairwise comparisons of empirical transition probabilities corresponding to compatible pasts. For this reason, we call it PCP estimator. The estimator is based on the following observation. For any j S, we say that the pasts x S, y S AS are (S \ {j})-compatible, if y S\{j} = x S\{j}. We have that if j Λ, then there exist a pair of (S \ {j})-compatible pasts x S, y S AS such that total variation distance between p( |x S) and p( |y S) is strictly positive. On the other hand, if j S \ Λ, then the total variation distance between p( |x S) and p( |y S) is 0 for all (S \{j})-compatible pasts x S, y S AS. These remarks suggests to estimate Λ by the subset of all lags j S for which the total variation distance between ˆpn( |x S) and ˆpn( |y S) is larger than a suitable positive threshold, for some pair of (S\{j})-compatible pasts x S and y S. An uniform threshold over all possible realizations usually gives suboptimal results by either underestimating or overestimating for some configurations. The threshold we use here is adapted to each realization of the MTD, relying on improved martingale concentration inequalities that are of independent interest (see Appendix B). Fix ε > 0, α > 0 and µ (0, 3) such that µ > ψ(µ) := eµ µ 1. For each x S, y S AS, consider the random threshold tn(x S, y S) defined as, tn(x S, y S) = sn(x S) + sn(y S), (11) where sn(x S) is given by α(1 + ε) 2 Nn(x S) ˆpn(a|x S) + α Nn(x S) + α|A| 6 Nn(x S). (12) With this notation, the PCP estimator ˆΛ1,n is defined as follows. A lag j S belongs to ˆΛ1,n if and only if there exists a (S \ {j})-compatible pair of pasts x S, y S AS such that d TV (ˆpn( |x S), ˆpn( |y S)) tn(x S, y S). (13) In the sequel, the set S J d, 1K such that Λ S and the constants ε > 0, α > 0 and µ (0, 3) such that µ > ψ(µ) are called parameters of the PCP estimator ˆΛ1,n. Hereafter, for each j S and any b, c A, let Cj(b, c) = (x, y) AS AS : x S\{j} = y S\{j}, xj = b and yj = c , tn,j(b, c) = min (x S,y S) Cj(b,c) tn (x S, y S) , tn,j = max b,c A:b =c tn,j(b, c) and γn,j = 2tn,j. (14) Finally, consider the following quantity PS = min j Λ min b,c A:b =c max (x S,y S) Cj(b,c) (P(x S) P(y S)) . (15) With these definitions, we have the following result. Ost and Takahashi Theorem 1 Let X1:n be a sample of MTD model with set of relevant lags Λ, where n > d. If ˆΛ1,n is the PCP estimator defined in (13) with parameters µ (0, 3) such that µ > ψ(µ), α > 0, ε > 0 and Λ S J d, 1K, we have that 1. For each j S \ Λ, we have that P j ˆΛ1,n 8|A|(n d) log(µ(n d)/α + 2) 2. For each j Λ, we have that P j / ˆΛ1,n, γn,j δj 8|A| log(µ(n d)/α + 2) where γn,j and and δj are defined in (14) and (4) respectively. 3. Furthermore, if assumptions 1 and 2 hold, then there exits a constant C = C(ε, µ) > 0 such that for n satisfying n d + C|A|α δ2 min PS , (16) where δmin and PS are defined in respectively (6) and (15), we have that P ˆΛ1,n = Λ 8|A| ((|S| |Λ|)(n d) + |Λ|) log(µ(n d)/α + 2) + 6|A|(|A| 1)|Λ| exp 2(n d)2P2 S 8n(|S| + 1)2 The proof of Theorem 1 is given in Appendix A.1.1. Remark 2 (a) The sum over j S \Λ of the upper bound provided by Item 1 of Theorem 1 controls the probability that the PCP estimator ˆΛ1,n overestimates the set of relevant lags Λ. The sum over j Λ of the upper bound given in Item 2 of Theorem 1 is as an upper bound for the probability that the PCP estimator underestimates the subset of relevant lags j Λ whose oscillation δj is larger or equal than the noise level γn,j. Note that the sum of these upper bounds corresponds to the first term appearing on the right hand side of (17). (b) The second term on the right hand side of (17) is an upper bound for the probability that there exists some relevant lag j Λ whose oscillation δj is strictly smaller than the noise level γn,j. (c) (Computation of PCP estimator) As we show in Appendix (A.6), the PCP estimator can be implemented with at most O(|A|2|S|(n d)) computations. Remark 3 By Assumption 1, we have that PS pmin/|A||S| 1, where pmin is defined in (8). As a consequence, it follows from (16) that if the sample size n is such that n d + C|A||S|α pminδ2 min , (18) Sparse Markov Models for High-dimensional Inference then inequality (17) holds with the second exponential term replaced by exp 2p2 min(n d)2 8n(|S| + 1)2|A|2(|S| 1) Combining Theorem 1 and Remark 3, one can deduce the following result. Corollary 4 For each n, consider a MTD model with set of relevant lags Λn and transition probabilities pn(a|xΛn) such that pmin,n p min and n min for some positive constants p min and min. Let dn = βn for some β (0, 1) and suppose that Λn Sn J dn, 1K with |Sn| ((1 γ)/2) log|A|(n) for some γ (0, 1). Let X1:n be a sample from the MTD specified by Λn and pn(a|xΛn), and denote ˆΛ1,n the PCP estimator defined in (13) computed from this sample with parameters µn = µ (0, 3) such that µ > ψ(µ), εn = ε > 0, αn = (1 + η) log(n) with η > 0 and Sn. Under these assumptions there exists a constant C = C(µ, ε, β, p min, min, γ, η) > 0 such that if δ2 min,n C log(n) n(1+γ)/2 , (20) then P(ˆΛ1,n = Λn) 0 as n . The proof of Corollary is given in Appendix A.1.2. Remark 5 (a) Under the assumptions of Corollary 4, if additionally we have |Λn| L for all values of n for some positive integer L, then one can choose a suitable sequence γn 1 as n to obtain that P(ˆΛ1,n = Λn) vanishes as n , as long as δ2 min,n C log(n) where the constant C here is larger than the one given in (20). (b) Observe that in Corollary 4, the set of relevant lags can be either finite or grow very slowly with respect to the sample size n. On the other hand, no assumption on the orders dn of the underlying sequence of MTD models is made. In particular, we could consider MTD models with very large orders, for example dn = βn with β (0, 1). As Corollary 4 indicates, in the setting dn = βn, the major drawback of the PCP estimator ˆΛ1,n is that it requires a prior knowledge of Λn in the form of a set Sn growing slowly enough and such that Λn Sn. The main goal of the next two sections is to propose alternative estimators of Λn to deal with this issue. 3.2 Forward Stepwise and Cut estimator In this section we introduce a second estimator of the set of relevant lags Λ, called Forward Stepwise and Cut (FSC) estimator. This estimator is based on a structural result about MTD models presented in Proposition 6 below. Before presenting this structural result, we need to introduce some notation. Ost and Takahashi In what follows, for each lag k J d, 1K, subset S J d, 1K \ {k}, configuration x S AS and symbols b, c A, let us denote dk,S(b, c, x S) = d TV (Px S(X0 |Xk = b), Px S(X0 |Xk = c)) , (22) and wk,S(b, c, x S) = Px S(Xk = b)Px S(Xk = c). (23) Recall that Px S(X0 |Xk = b) and Px S(Xk = b) denote, respectively, the conditional distribution of X0 given XS = x S and Xk = b and the conditional probability of Xk = b given XS = x S, with the convention that these conditional probabilities for S = correspond to the unconditional ones. Let us also denote for each lag k J d, 1K and subset S J d, 1K \ {k}, νk,S(x S) = X c A wk,S(b, c, x S)dk,S(b, c, x S), (24) and νk,S = E (νk,S(XS)) . (25) The quantity νk,S(x S) measures the influence of Xk on X0, conditionally on the variables XS = x S. The average conditional influence of Xk on X0 is measured through the quantity νk,S. In the sequel, we write Covx S(X0, Xk) to denote the conditional covariance between the random variables X0 and Xk given that XS = x S. Here, we also use the convention that the conditional covariance for S = corresponds to the unconditional one. With this notation, we can prove the following structural result about MTD models. Proposition 6 For any lag k J d, 1K and subset S J d, 1K \ {k}, Diam(A) A νk,S E (|Cov XS(X0, Xk)|) . (26) Moreover, if Assumptions 1 and 3 hold, then there exists a constant κ > 0 such that the following property holds: for any S J d, 1K such that Λ S there exists k Λ \ S satisfying E (|Cov XS(X0, Xk)|) κ. (27) Furthermore, if Assumption 3 is replaced by Assumption 4, then the constant κ satisfies κ p2 minΓ1 min{|b c|2 : b = c} δmin where δmin, pmin and Γ1 are defined respectively in (6), (8) and (9). The proof of Proposition 6 is given in A.2. Remark 7 Denote f(S) = maxk Sc νk,S, for each S J d, 1K. On one hand, we have that f(S) = 0 for any S J d, 1K such that Λ S. This follows immediately from the definition of Λ. On the other hand, Proposition 6 assures that f(S) κ/Diam(A) A > 0 for any S J d, 1K such that Λ S. Putting together these facts, we deduce that the set of relevant lags can be written as Λ = arg min{|S| : f(S) = 0}. This observation motivates the FSC estimator defined below. Sparse Markov Models for High-dimensional Inference In what follows, we split the data X1:n into two pieces. The first part is composed of the first m symbols X1:m where 1 m < n, whereas the second part is composed of the n m last symbols Xm+1:n. In the sequel, we write ˆνm,k,S to denote the empirical estimate of νk,S computed from X1:m. The formal definition of ˆνm,k,S involves extra notation and is postponed to Appendix A. The FSC estimator is built in two steps. The first step is called Forward Stepwise (FS) and the second one is called CUT. In the FS step, we start with S = and add iteratively to the set S a lag j arg maxk Sc ˆνm,k,S, as long as |S| < ℓ, where 0 ℓ d is a parameter of the estimator. We denote ˆSm the set obtained at the end of FS step, with the convention that ˆSm = if the parameter ℓ= 0. As we will see, if ℓis properly chosen the candidate set ˆSm will contain the set of relevant lags Λ with high probability. It may, of course, include irrelevant lags j (those with δj = 0). In the CUT step, for each j ˆSm, we remove j from ˆSm unless d TV (ˆpm,n( |x ˆSm), ˆpm,n( |y ˆSm)) tm,n(x ˆSm, y ˆSm) := sm,n(x ˆSm) + sm,n(y ˆSm) for some ( ˆSm \ {j})-compatible pasts x ˆSm, y ˆSm A ˆSm, where sm,n(x ˆSm) is given by (12) replacing Nn( ) and ˆpn( | ) by Nm,n( ) and ˆpm,n( | ) respectively. The FSC estimator is defined as the set ˆΛ2,n of all lags not removed in the CUT step. The pseudo-code of the algorithm to compute the FSC estimator is given in Algorithm 1. Algorithm 1: FSC(X1, . . . , Xn) FS Step; 1. ˆSm = ; 2. While | ˆSm| < ℓ; 3. Compute j = arg maxj ˆScm ˆνm,j, ˆSm and include j in ˆSm; CUT step; 6. For each j ˆSm, remove j from ˆSm unless d TV (ˆpm,n( |x ˆSm), ˆpm,n( |y ˆSm)) tm,n(x ˆSm, y ˆSm), for some ( ˆSm \ {j})-compatible pasts x ˆSm, y ˆSm A ˆSm ; 7. Output ˆSm; Remark 8 (a) It is worth mentioning the following alternative algorithm (henceforth called Algorithm 2) to estimate the set of relevant lags Λ. As Algorithm 1, Algorithm 2 has two steps as well. In the first step, we start with S = and add iteratively a lag j arg maxk Sc ˆνn,k,S as long as ˆνn,j,S > τ, where τ is a parameter of the algorithm and ˆνn,j,S is the empirical estimate of νj,S computed from the entire data X1:n. Let ˆSn denote the set obtained at the end of this step. Next, in the second step, for each j ˆSn, we remove j from ˆSn unless ˆνn,j, ˆSn\{j} τ. The output of Algorithm 2 is the set of all lags in ˆSn which were not removed in the second step. Algorithm 2 can be seen as a version adapted for our setting of the Learn Nbhd algorithm, proposed in Bresler (2015), to estimate the interaction graph underlying an Ising model from i.i.d samples of the model. Ost and Takahashi (b) As opposed to Algorithm 2, notice that the data X1:n is split into two parts in Algorithm 1. The first m symbols X1:m of the sample are used in the FS step, whereas the last n m symbols Xm+1:n are only used in the CUT step. Despite requiring to split the data into two parts, one nice feature of Algorithm 1 is that even if a large ℓis chosen the CUT step would remove the non-relevant lags, whereas in Algorithm 2, we have to calibrate τ carefully to recover the relevant lags. (c) (Computation of FSC estimator) As we show in the Appendix A.6, we need to perform at most O(|A|3ℓ(m d)(d (ℓ 1)/2) + |A|2(n m d)ℓ) computations to determine the FSC estimator. The first summand in the sum corresponds to the algorithmic complexity of the FS step, whereas that the second summand can be interpreted as the algorithmic complexity of the PCP estimator computed from a sample of size n m and whose set S has ℓelements (recall item (c) of Remark 2). In what follows, for any ξ > 0 and 0 ℓ d, let us define the following event, Gm(ℓ, ξ) = \ max j Sc | νj,S ˆνm,j,S| ξ , (29) where Sd,ℓ= {S J d, 1K : |S| ℓ}. In the next result we show that whenever the event Gm(ℓ, ξ) holds with properly chosen parameters ξ and ℓ, the candidate set ˆSm constructed in the FS step with parameter ℓcontains Λ. Theorem 9 Suppose Assumptions 1 and 3 hold and let κ be the lower bound provided by Proposition 6. Let ξ = κ 4 A Diam(A) and ℓ = log2(|A|) = 2(Diam(A) A )2 log2(|A|) Let ˆSm denote the candidate set constructed in the FS step of Algorithm 1 with parameter ℓ . On the event Gm(ℓ , ξ ), we have that Λ ˆSm. The proof of Theorem 9 is given in Appendix A.2.2. Theorem 9 ensures that the candidate set ˆSm contains the set of relevant lags Λ whenever the event Gm(ℓ , ξ ) holds. In this case, we can think of the CUT step as the PCP estimator discussed in the previous section applied to the n m last observations Xm+1:n of the data, where S = ˆSm. The main difference is that ˆSm is a random set, depending on the first m observations X1:m of the data. In the sequel, let us denote PSd,ℓ= min S Sd,ℓPS, (31) where PS is defined in (15). In the next result we estimate the error probability of the FSC estimator. Theorem 10 Suppose Assumptions 1, 2, and 3 hold. Let > 0 be the quantity defined in Assumption 2. Denote ˆΛ2,n the FSC estimator constructed by Algorithm 1 with parameter Sparse Markov Models for High-dimensional Inference ℓ , as defined in (30). Suppose also that m > d 2ℓ . Then there exits a constant C = C(ε, µ) > 0 such that if n m + d + C|A|α δ2 min PSd,ℓ , (32) where δmin and PSd,ℓ are defined in (6) and (31), then we have that, P(ˆΛ2,n = Λ) 2|A|(ℓ + 1) d ℓ d|A|ℓ +1 exp (ξ )2(m d)2 18m|A|2(ℓ +2)(ℓ + 2)2 +3(|A| 1)|Λ| exp 2(n m d)2P2 Sd,ℓ 8(n m)(ℓ + 1)2 + 8|A| ((ℓ |Λ|)(n m d) + |Λ|) log(µ(n m d)/α + 2) where ξ is defined in (30). The proof of Theorem 10 is given in Appendix A.2.3. Remark 11 (a) Let us give some intuition about the three terms appearing on the righthand side of (33). The first one is an upper bound for P(Gc m(ℓ , ξ )). The other two are related to the terms appearing in (17). Indeed, by recalling that | ˆSm| = ℓ , one immediately sees that the third terms of (33) corresponds to the first term of (17) with ˆSm and n m in the place of S and n respectively. Besides, the second term of (33) is similar (modulo a factor which depends on d and ℓ ) to the second term of (17). This extra factor reflects the fact that we do not know a priori a set S containing the set of relevant lags Λ. (b) Similar to Remark 3, one can also show that PSd,ℓ pmin/|A|ℓ 1, where pmin is defined in (8). Hence, we can deduce from (32) that when the sample size n satisfies n d + C|A|ℓ α pminδ2 min , (34) then inequality (33) holds with the second exponential term replaced by exp ( pmin)2(n d)2 8n(ℓ + 1)2|A|2(ℓ 1) The next result is a corollary of Theorem 10. Corollary 12 For each n, consider a MTD model with set of relevant lags Λn and transition probabilities pn(a|xΛn) satisfying pmin,n p min n min for some positive constants p min and min, and such that Assumption 3 holds. Let mn = n/2 and dn = mnβ with β (0, 1). Let X1:n be a sample from the MTD specified by Λn and pn(a|xΛn), and denote ˆΛ2,n the FSC estimator constructed by Algorithm 1 with parameters mn, µn = µ (0, 3) such that µ > ψ(µ), εn = ε > 0, αn = (1 + η) log(n) with η > 0 and ℓ ,n as defined in (30). Ost and Takahashi Assume that ℓ ,n ((1 γ)/2) log|A|(n) for some γ (0, 1). Then there exists a constant C = C(β, γ, η, p min, min, ε, µ) > 0 such that P(ˆΛ2,n = Λ ) 0 as n , whenever δ2 min,n C log(n) n(1+γ)/2 . (36) The proof of Corollary 12 is given in Appendix A.2.4. Remark 13 (a) Under Assumption 4, one can check that ℓ ,n ((1 γ)/2) log|A|(n) whenever, δ2 min |Λ| 16(1 γ) Γ2 1(p min min{|b c| : b = c})4 log|A|(n). (b) Comparing Corollaries 12 and 4, we observe that the consistency of both FSC and PCP estimators require the same lower bound on the decay of minimal oscillation δmin,n. Despite requiring additional assumptions (Assumption 3 and a condition on the growth of ℓ ), FSC estimator do not need prior knowledge of a small subset S containing the set of relevant lags Λ as opposed to the PCP estimator, which is a significant advantage in practice. (c) Let us mention that under the assumptions of Corollary 12, we have that the algorithmic complexity of the FSC is O(|A|3n2 log|A|(n)). This follows immediately from item (c) of Remark 8. 3.3 Improving the efficiency for the binary case In this section, we show that when the alphabet is binary, i.e., A = {0, 1}, we can further improve the FSC algorithm if we consider Assumptions 4 and 5. Observe that when the alphabet is binary, Assumption 3 holds automatically (see Section 2.6). Moreover, we have that νk,S = 2E (PXS(Xk = 1)PXS(Xk = 1) |PXS(X0 = 1|Xk = 1) PXS(X0 = 1|Xk = 0)|) = 2E (|Cov XS(X0, Xk)|) , for any lag k J d, 1K, subset S J d, 1K \ {k} and configuration x S {0, 1}S. For a binary MTD, we have the following result. Theorem 14 Under Assumptions 4 and 5, it holds that max j Λ\S νj,S max j (Λ)c νj,S 2(Γ1 Γ2)p2 minδmin, where δmin and pmin are defined in (6) and (8) respectively. In particular, if Γ1 > Γ2 and ˆSm denotes the candidate set constructed at the end of the FS step of Algorithm 1 with parameter ℓ |Λ|, then Λ ˆSm whenever the event Gm(ℓ, ξ) holds where 0 < ξ < (Γ1 Γ2)p2 minδmin. (37) The proof of Theorem 14 is given in Appendix A.3.1. Sparse Markov Models for High-dimensional Inference Remark 15 Notice that if the size of Λ is known, then Theorem 14 implies ˆSm = Λ on the event Gm(|Λ|, ξ) with ξ satisfying (37). In particular, in this case, we neither need to execute the CUT step nor to split the data into two pieces. In the same spirit of the previous corollaries, we can show the following result. Corollary 16 For each n, consider a MTD model with set of relevant lags Λn and transition probabilities pn(a|xΛn) satisfying Assumptions 4 and 5 with Γ1,n = Γ1 > Γ2 = Γ2,n and such that |Λn| L for some integer L, pmin,n p min and n min for some positive constants p min and min. Let X1:n be a sample from the MTD specified by Λn and pn(a|xΛn), and denote ˆΛ2,n the FSC estimator with parameters with parameters mn = n/2, µn = µ (0, 3) such that µ > ψ(µ), εn = ε > 0, αn = (1+η) log(n) with η > 0 and ℓ= L. Suppose that dn = βn with β (0, 1). Then there exists a constant C = C(β, L, min, p min, Γ1, Γ2, η, µ, ε) > 0 such that P(ˆΛ2,n = Λ ) 0 as n , as long as The proof of Corollary 16 is given in Appendix A.3.2. 3.4 Post-selection transition probabilities estimation Once the set of relevant lags have been estimated by applying the FSC estimator to the sample X1:n, we reuse the entire sample to compute the estimator ˆpn(a|xˆΛ2,n) of the transition probability p(a|xΛ). In the next result, we provide an estimate for rate of convergence of ˆpn(a|xˆΛ2,n) towards p(a|xΛ), simultaneously for all pasts x d: 1 AJ d, 1K. Theorem 17 Under assumptions and notation of Theorem 10, x d: 1 AJ d, 1K |ˆpn(a|xˆΛ2,n) p(a|xΛ)| v u u t2α(1 + ϵ) ˆVn(a, xˆΛ2,n) + α 3 Nn(xˆΛ2,n) 4|A|(n d) log(µ(n m d)/α + 2) e α + P(ˆΛ2,n = Λ), (39) where ˆVn(a, xˆΛ2,n) is given by ˆVn(a, xˆΛ2,n) = µ µ ψ(µ) ˆpn(a|xˆΛ2,n) + α µ ψ(µ) 1 Nn(xˆΛ2,n). The proof of Theorem 17 is given in Appendix A.4. 3.5 A remark on the minimax rate for the lag selection We take A = {0, 1} and consider the set of {p(j)( | ), j J d, 1K} of transition probabilities of the following form: p(j)(1|x d: 1) = (1 λ) 2 + λp(1|xj), j J d, 1K, λ (0, 1), (40) Ost and Takahashi where λ|p(1|1) p(1|0)| := δ > 0. For each j J d, 1K, we denote P(j) the probability measure under which (Xt)t Z is a stationary MTD model having transition probability p(j)( | ). For each t 1, we denote P (j) t the marginal distribution with respect to the variables X1:t: P (j) t (x1:t) = P(j)(X1:t = x1:t) In what follows, KL(P (j) t ||P (k) t ) denotes the Kullback-Leibler divergence between the distributions P (j) t and P (k) t . We denote MTDd,δ the set all transition probabilities p = {p(a|xΛ) : a A, xΛ AΛ} of a MTD model of order d whose corresponding δmin δ. For a given p MTDd,δ, we denote Pp the probability distribution under which (Xt)t Z is a stationary MTD model of order d with transition probabilities given by p. With this notation, we have the following result. Proposition 18 Let n > d. Then the following inequality holds: for j, k J d, 1K, KL(P (j) n ||P (k) n ) 2nδ2 In particular, if β (0, 1), d = nβ, and 2 log(2) , (42) then inf ˆΛn sup p MTDd,δ Pp(ˆΛn = Λ) 1/4, (43) where the infimum is over all lag estimators ˆΛn based on a sample of size n. The proof of (43) follows immediately from Fano s inequality and the upper bound (41). Combining (38) and (42), we deduce that the condition on the minimal oscillation required for the consistency of the FSC estimator in Corollary 16 is sharp. The proof of Proposition 18 is given in Appendix A.5 4. Simulations Here, we investigated the performance of the proposed methods using simulations. 4.1 Experiment 1 We first used a MTD model on alphabet A = {0, 1} with two relevant lags, denoted here as i and j for notational convenience. The choices for the order d and for the values of i and j are shown in the first three columns of Table 1. Let p0(1) = p0(0) = 0.5 and λ0 = 0.4. Also, let λ i = 0.2, λ j = 0.4, p i(0|0) = 0.3, p i(0|1) = 0.6, p j(0|0) = 0.5, and p j(0|1) = 0.9. For all x d: 1 {0, 1}J d, 1K and a A, the transition probability of the model was given by p(a|x d: 1) = λ0p0(a) + λ ip i(a|x i) + λ jp j(a|x j). Sparse Markov Models for High-dimensional Inference We simulated the above model using sample sizes n {102, 5.102, 103, 5.103, 104, 5.104, 105, 5.105}. For each choice of i, j, d, and n we simulated 100 realizations. We compared four different methods to select the relevant lags. FSC(ℓ) stands for the Forward Stepwise and Cut algorithm described in Algorithm 1 with parameter ℓ, ϵ = 0.1, µ = 0.5, and α = C log(n), where the values of the constant C was chosen by optimizing the probability to select the relevant lags correctly only for sample size n = 100, for the given choice of d, i and j. We used the first n/2 samples for the Forward Stepwise and the last n/2 for Cut. Remember that ϵ, µ, α are used to define the random threshold for the Cut step. BSS(2) stands for the best subset selection algorithm, where we first estimated the parameters of the MTD model using n samples and the algorithm described in Berchtold (2001) with python implementation mtd-learn. This algorithm estimated for k {1, 2, . . . , d} the weight parameters λ k. We then choose the lags of the two largest λ k as the lags selected by BSS(2). We were not able to run the mtd-learn on models with order d larger than 15 in our computers because that algorithm did not converge. Finally, CTF(ℓ) stands for Conditional Tensor Factorization based Higher Order Markov Chain estimation together with the test for relevance of lags described in Sarkar and Dunson (2016), the parameter ℓbeing the maximal number of relevant lags. We used the code available at https://github.com/david-dunson/bnphomc. The maximal possible order of the Markov chain was set to d and the number of simulation for the Gibbs sampler was set to 1000. The set of relevant lags chosen by CTF was given by the lags with non-null inclusion probability estimated using the Gibbs sampler. We were not able to run CTF(ℓ) when j = n/5 and d = n/4 because the algorithm did not converge when n > 103. We note that FS and BSS assume prior knowledge of the number of relevant sites, giving advantage over FSC and CTF. The results are indicated in Table 1. Table 1: Estimated probability of correctly selecting only the relevant lags. Model parameter Method Sample size (n) i j d 100 500 103 5.103 104 5.104 105 5.105 1 8 8 FSC(3) 0.05 0.08 0.13 0.53 0.81 0.86 0.93 1 1 8 8 CTF(3) 0 0 0.04 0.67 0.99 1 1 1 1 8 8 FS(2) 0.07 0.3 0.47 0.98 1 1 1 1 1 8 8 BSS(2) 0.05 0.14 0.23 0.41 0.79 0.78 0.84 0.87 1 15 15 FSC(5) 0.03 0.36 0.51 0.82 0.97 1 1 1 1 15 15 CTF(5) 0 0 0.01 0.62 0.99 1 1 1 1 15 15 FS(2) 0.02 0.2 0.66 0.92 1 1 1 1 1 15 15 BSS(2) 0.04 0.18 0.17 0.28 0.31 0.8 0.8 0.93 1 n/5 n/4 FSC(5) 0 0 0.04 0.19 0.46 1 1 1 1 n/5 n/4 CTF(5) 0 0 0 - - - - - 1 n/5 n/4 FS(2) 0.01 0.11 0.27 0.89 0.96 1 1 1 1 n/5 n/4 BSS(2) - - - - - - - - Ost and Takahashi 4.2 Experiment 2 Here we used the following MTD model on alphabet A = {0, 1}. We considered different choices of order d and relevant lags i, j J1, d K (see Table 2). Let p0(1) = p0(0) = 0.5 and λ0 = 0.2. Also, let p i(0|0) = 1 p i(0|1) = p j(0|1) = 1 p j(0|1) = 0.7 and λ i = λ j = 0.4. For all x d: 1 {0, 1}J d, 1K and a A, the transition probability of the model was given by p(a|x d: 1) = λ0p0(a) + λ ip i(a|x i) + λ jp j(a|x j). We simulated the above model using sample sizes n {28, 29, 210, 211, 212, 213}. For each choice of i, j, d, and n we simulated 100 realizations. For each realization, we estimated the transition probability p(0|0d). We used different estimators for the comparisons. FSC(ℓ) and FS(ℓ) are the same as described in Experiment 1. For transition probability estimation with FSC, we used X1:n/2 for Forward Stepwise and Xn/2+1:n for Cut step, obtaining the estimated relevant lag set ˆΛn. Then we used X1:n to calculate ˆpn(0|0ˆΛn). For transition probability estimation after PCP, we used X1:n to calculate ˆΛn for the PCP relevant lag estimator with initial set S = J d, 1K. The parameters for the threshold were chosen as follows: ϵ = 0.1, µ = 0.5, and α = C log(n), where we choose the values of the constant C by optimizing the probability to select the relevant lags correctly only for sample size n = 100, for the given choice of d, i and j. Then we used X1:n to calculate ˆpn(0|0ˆΛn). We also compared the performance of transition probability estimator ˆpn(0|0 d: 1), where we did not select the relevant lags (Naive estimator). In our simulations, when d was larger than 5, for both PCP and Naive estimators we did not obtain meaningful results because Nn(0d) = 0 with high probability. Therefore, we compared PCP and Naive estimators only for d = 5. In this case, FSC showed similar performance to PCP estimator and was in general better than Naive estimator. When d > 5, e.g. d = n/8, FSC still exhibited good performance. The results are indicated in Table 2. Table 2: Empirical standard deviation of the estimator of p(0|0d). FSC, FS, PCP, and Naive are described in the main text. Model parameter Method Sample size (n) i j d 256 512 1024 2048 4096 8192 1 5 5 FS(2) 0.0774 0.0682 0.0506 0.0286 0.0174 0.0133 1 5 5 FSC(5) 0.0745 0.0835 0.0602 0.0426 0.0222 0.0129 1 5 5 PCP 0.0965 0.0786 0.0577 0.0432 0.0242 0.0131 1 5 5 Naive 0.1518 0.0933 0.0624 0.0455 0.0340 0.0252 1 5 10 FSC(5) 0.0836 0.0842 0.0659 0.0425 0.0228 0.0141 1 10 15 FSC(5) 0.0864 0.0781 0.0641 0.0438 0.0249 0.0151 1 15 15 FSC(5) 0.0833 0.0834 0.0747 0.0488 0.0222 0.0167 11 100 120 FSC(5) - - 0.0838 0.0647 0.0312 0.0169 1 10 n/8 FSC(5) 0.0563 0.0543 0.0780 0.0698 0.0504 0.0105 Sparse Markov Models for High-dimensional Inference 4.3 Application We applied the proposed method to study the relevant lags on a daily weather data registering the rainy and non-rainy days in Canberra Australia for a n = 1000 days. We obtained the data from kaggle (https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package). We used the first 1000 time points of the data. We used Forward Stepwise algorithm with ℓ= 3 (FS(3)) and maximal order d = 400 to include the possibility of the annual cycle. We obtained as the three relevant lags {1, 62, 330}. The selected relevant lags were the same for d = 365 and d = 450, showing teh robustness of the result. The day before (lag 1) is clearly relevant and is often included in weather prediction models. Annual cycles ( 12 months) are also predictor of the weather, matching the 330 days lag that we found. Finally, the 62 days lag is consistent with the cycle of Madden-Julian oscillator ( 60 days), which is the largest inter-seasonal source of precipitation events in Australia (Wheeler et al. (2009)). We note that the estimated Markov chain is of order 330, which is around one-third of the sample size n = 1000, whereas using VLMC we do not expect to typically estimate Markov chains of order larger than log(10) 7. Indeed, using VLMC with BIC model selection criterion we selected a model with order 1. We set the upper limit of the model size as 400 for the VLMC order selection. As a further comparison, we applied the Conditional Tensor Factorization based Higher Order Markov Chain estimation together with the test for relevance of lags described in Sarkar and Dunson (2016). We again used the code available at https://github.com/david-dunson/bnphomc. The maximal possible order of the Markov chain was set to 400, the maximal number of relevant lags was 3, and the number of simulation for the Gibbs sampler was set to 1000. The inclusion probability calculated using Gibbs sampler for lags (1, 2, 3, 4, 5, 6, 7) were (100, 1.2, 0.2, 0.6, 0.2, 0.4, 0.2) percent, respectively. For all other orders the inclusion probability was zero. Therefore, no larger lags were detected by this method. Acknowledgments This research has been conducted as part of FAPESP project Research, Innovation and Dissemination Center for Neuromathematics (grant 2013/07699-0). GO thanks FAPERJ (grants E-26/201.397/2021 and E-26/211.343/2019) and CNPq (grant 303166/2022-3). Appendix A. Proofs of Section 3 A.1 Proofs of Section 3.1 A.1.1 Proof of Theorem 1 Proof [Proof of Theorem 1] Since the set S J d, 1K containing the set Λ is fixed, we will write x instead of x S to alleviate the notation. We start proving Item 1. Proof of Item 1. For each x AS, let us define the event |ˆpn(a|x) p(a|x)| < 2α(1 + ε) ˆVn(a, x) Nn(x) + α 3 Nn(x) Ost and Takahashi where ˆVn(a, x) is given by ˆVn(a, x) = µ µ ψ(µ) ˆpn(a|x) + α (µ ψ(µ)) Nn(x). By using first the union bound and then by applying Proposition 34, we deduce that for each x AS, P(Gc x) 4|A| log(µ(n d)/α + 2) e αP Nn(x) > 0 . (44) Now, observe that on Gx, we have that d TV (ˆpn( |x), p( |x)) < X α(1 + ε) ˆVn(a, x) 2 Nn(x) + α|A| 6 Nn(x) = sn(x), which, together with (44), implies that P (d TV (ˆpn( |x), p( |x)) sn(x)) 4|A| log(µ(n d)/α + 2) e αP Nn(x) > 0 . (45) Note that if j / Λ, then by the definition of the set Λ it follows that p(a|x) = p(a|y) for all x, y AS which are (S\{j})-compatible. Hence, by applying first the triangle inequality and then using that tn(x, y) = sn(x) + sn(y), we deduce that the event {d TV (ˆpn( |x), ˆpn( |y)) tn(x, y)} is contained in the event {d TV (ˆpn( |x), p( |x)) sn(x)} {d TV (ˆpn( |y), p( |y)) sn(y)}, P j ˆΛ1,n 2 X x AS P(d TV (ˆpn( |x), pn( |x)) sn(x)) 8|A| log(µ(n d)/α + 2) x AS P( Nn(x) > 0), where in the second inequality we have used (45). Since n d = P x AS Nn(x) P x AS 1{ Nn(x) > 0} which implies that n d E P x AS 1{ Nn(x) > 0} = P x AS P( Nn(x) > 0), we obtain from the above inequality that, P j ˆΛ1,n 8|A| log(µ(n d)/α + 2) concluding the the proof of Item 1. Proof of Item 2. Let j Λ, recall that δj = λj maxb,c A d TV (pj( |b), pj( |c)) and consider the event E = {δj γn,j}. Take b , c A such that d TV (pj( |b ), pj( |c )) = maxb,c A d TV (pj( |b), pj( |c)), and observe that with this choice, δj = λjd TV (pj( |b ), pj( |c )). Sparse Markov Models for High-dimensional Inference From this equality it follows that for any pair (x, y) Cj(b , c ), we have E = {d TV (p( |x), p( |y)) 2tn,j}, where we have used also that γn,j = 2tn,j. Now, take a pair (x , y ) Cj(a , b ) attaining the minimum in (14): tn,j(b , c ) = tn(x , y ). From the definition of tn,j, it follows then that tn,j tn,j(b , c ) = tn(x , y ). Therefore, we conclude that E {d TV (p( |x ), p( |y )) 2tn(x , y )}, so that by the triangle inequality, we obtain that on E, 2tn(x , y ) d TV (ˆpn( |x ), p( |x )) + d TV (ˆpn( |y ), p( |y )) + d TV (ˆpn( |x ), ˆpn( |y )). Hence, on {j / ˆΛ1,n} E, we have tn(x , y ) d TV (ˆpn( |x ), p( |x )) + d TV (ˆpn( |y ), p( |y )), implying that P {j / ˆΛ1,n} E P(d TV (ˆpn( |x ), p( |x )) sn(x )) + P(d TV (ˆpn( |y ), p( |y )) sn(y )). From (45), it follows then that P j / ˆΛ1,n, γn,j δj = P {j / ˆΛ1,n} E 8|A| log(µ(n d)/α + 2) concluding the proof of Item 2. Proof of Item 3. Observe that by combining Items 1 and 2 together with the union bound, we deduce that P ˆΛ1,n = Λ 8|A| ((|S| |Λ|)(n d) + |Λ|) log(µ(n d)/α + 2) j Λ P (γn,j > δj) . Hence, to conclude the proof of Item 3, it suffices to show that P (γn,j > δj) 6|A|(|A| 1) exp 2(n d)2P2 S 8n(|S| + 1)2 Ost and Takahashi for all j Λ, whenever the sample size n satisfies (16). By the union bound, we have that P (γn,j > δj) X c A:c =b P (tn,j(b, c) > δj/2) , (47) and for each b, c A with b = c, P (tn,j(b, c) > δj/2) P (tn(x, y) > δj/2) P (sn(x) > δj/4) + P (sn(y) > δj/4) , (48) for any (x, y) Cj(b, c). By using again the union bound, we can deduce that for each in x AS, P (sn(x) > δj/4) P α(1 + ε) ˆVn(a, x) 2 Nn(x) > δj/8 + P α 6 Nn(x) > δj 8|A| and also that α(1 + ε) ˆVn(a, x) 2 Nn(x) > δj/8 α(1 + ε)ˆpn(a|x)µ 2(µ ψ(µ)) Nn(x) > δj/16 (1 + ε) 2(µ ψ(µ)) > δj/16 By applying Proposition 26 with u1 = P(x) (4|A|α)/(3δj(n d)) and u2 = P(x) (16|A|α p (1 + ε)/2(µ ψ(µ)))/(δj(n d)), one can show that 6 Nn(x) > δj/8 + P (1 + ε) 2(µ ψ(µ)) > δj/16 2n(|S| + 1)2 P(x) 16|A|α δj(n d) (1 + ε) 2(µ ψ(µ)) as long as (n d) > 16|A|α (1+ε) 2(µ ψ(µ)). By using Jensen inequality, one can verify that α(1 + ε)ˆpn(a|x)µ 2(µ ψ(µ)) Nn(x) > δj/16 Nn(x) < 128α(1 + ε)µ|A| δ2 j (µ ψ(µ)) so that by Proposition 26 with u3 = P(x) (128α(1 + ε)µ|A|)/(δ2 j (µ ψ(µ))), it follows that 2α(1 + ε)ˆpn(a|x)µ (µ ψ(µ)) Nn(x) > δj/16 2n(|S| + 1)2 P(x) 128|A|αµ(1 + ε) δ2 j (µ ψ(µ))(n d) Sparse Markov Models for High-dimensional Inference whenever (n d) > (128|A|αµ(1 + ε))/(δ2 j P(x)(µ ψ(µ))). Therefore, we have shown that for any x AS, P (sn(x) > δj/4) 3 exp 2(n d)2P2(x) 8n(|S| + 1)2 8µ(1 + ε) (µ ψ(µ)) + (1 + ε) 2(µ ψ(µ)) Now, considering b ,j, c ,j A with b = c and (x ,j, y ,j) Cj(a ,j, b ,j) such that min b,c A:b =c max (x,y) Cj(a,b)(P(x) P(y)) = (P(x ,j) P(y ,j)), we can deduce from (48) and (49) that P (tn,j(b, c) > δj/2) 6 exp 2(n d)2(P(x ,j) P(y ,j))2 8n(|S| + 1)2 16 |A|α δ2 j P(x ,j) P(y ,j) 8µ(1 + ε) (µ ψ(µ)) + (1 + ε) 2(µ ψ(µ)) Since P(x ,j) P(y ,j) PS for all j Λ , we can take C = C(µ, ε) = 32 8µ(1 + ε) (µ ψ(µ)) + (1 + ε) 2(µ ψ(µ)) to deduce that (46) is indeed satisfied whenever (n d) C|A|α δ2 min PS , concluding the proof. A.1.2 Proof of Corollary 4 Proof [Proof of Corollary 4] Notice that Assumptions 1 and 2 are satisfied for all values of n, since pmin,n p min and n min for positive constants p min and min. Hence, the result follows immediately from Theorem 1-Item 3 and Remark 2-Item (c). Ost and Takahashi A.2 Proofs of Section 3.2 A.2.1 Proof of Proposition 6 In this section we prove Proposition 6. To that end, we need some auxiliary results. The first auxiliary result is the following. Recall that we write Covx S(X0, mk(Xk)) to denote the conditional covariance between the random variables X0 and mk(Xk) given that XS = x S, where mk is defined (3). Lemma 19 For each S J d, 1K, k Sc and x S AS, the following identity holds: Covx S(X0, mk(Xk)) = X j Λ\S λj Covx S(mj(Xj), mk(Xk)). (50) Remark 20 In (50), we use the convention that P j λj Covx S(mj(Xj), mk(Xk)) = 0. Proof [Proof of Lemma 19] Observe that if Λ S, then the both sides of (50) are 0, so that the result holds immediately in this case. Now suppose that Λ S. In this case, to shorten the notation, let us write Px S(xΛ\S) = Px S(XΛ\S = xΛ\S), for xΛ\S AΛ\S. We want to compute Covx S(X0, mk(Xk)) = Ex S(X0mk(Xk)) Ex S(X0)Ex S(mk(Xk)). We first compute Ex S(X0). To that end, write Ex S(X0) = X a A a Px S(X0 = a), and observe that for each a A, Px S(X0 = a) = X xΛ\S AΛ\S Px S(xΛ\S)p(a|x SxΛ\S) = λ0p0(a) + X j Λ S λjpj(a|xj) + X xΛ\S AΛ\S Px S(xΛ\S)pj(a|xj) = λ0p0(a) + X j Λ S λjpj(a|xj) + X j Λ\S λj Ex S(pj(a|Xj)), where in the second equality we have used the definition of the transition probabilities (2). Hence, we have that Ex S(X0) = λ0m0 + X j Λ S λjmj(xj) + X j Λ\S λj Ex S(mj(Xj)), where m0 = P a A ap0(a). Sparse Markov Models for High-dimensional Inference As a consequence of the above equality, it follows that Ex S(X0)Ex S(mk(Xk)) = j Λ S λjmj(xj) Ex S(mk(Xk)) j Λ\S λj Ex S(mj(Xj))Ex S(mk(Xk)). (51) We now compute Ex S(X0mk(Xk)). We consider only the case k Λ, the other is treated similarly. In this case, we first write Ex S(X0mk(Xk)) = X b A amk(b)Px S(X0 = a, Xk = b), (52) and then we proceed similar as above to deduce that for each a, b A, Px S(X0 = a, Xk = b) = X xΛ\S AΛ\S Px S(xΛ\S)p(a|x SxΛ\S)1{xk = b} λ0p0(a) + X j Λ S λjpj(a|xj) + λkpk(a|b) Px S(Xk = b) j Λ\(S {k}) λj X c A pj(a|c)Px S(Xj = c, Xk = b). (53) where in the second equality we have used the definition of the transition probabilities (2). Combining (52) and (53), we deduce that Ex S(X0mk(Xk)) = j Λ S λjmj(xj) Ex S(mk(Xk)) + λk Ex S(m2 k(Xk)) j Λ\(S {k}) λj Ex S(mk(Xk)mj(Xj)) j Λ S λjmj(xj) Ex S(mk(Xk)) j Λ\S λj Ex S(mk(Xk)mj(Xj)). (54) Putting together the identities (51) and (54), we then conclude that Covx S(X0mk(Xk)) = X j Λ\S λj (Ex S(mj(Xj)mk(Xk)) Ex S(mj(Xj))Ex S(mk(Xk)) , (55) and the result follows. The next auxiliary result is the following. Ost and Takahashi Lemma 21 Suppose Assumptions 1 and 3 hold. Then there exists a constant κ > 0 such that the following property holds: for any S J d, 1K such that Λ S, we have k Λ\S λjλk E(Cov XS(mj(Xj), mk(Xk))) κ . Proof [Proof of Lemma 21] It suffices to show that for S J d, 1K such that Λ S, we have X k Λ\S λjλk E(Covx S(mj(Xj), mk(Xk))) > 0. Suppose that this is not the case. Then, k Λ\S λjλk E(Covx S(mj(Xj), mk(Xk))) k Λ\S λjλk Cov (mj(Xj) EXS(mj(Xj)), mk(Xk) EXS(mk(Xk))) j Λ\S λj(mj(Xj) EXS(mj(Xj))) so that P-almost surely, X j Λ\S λj(mj(Xj) EXS(mj(Xj))) = 0. This implies that P-almost surely, j Λ\S λjmj(Xj) = EXS j Λ\S λjmj(Xj) or equivalently, X j Λ\S λjmj(Xj) = f(XS), P-a.s., for some function f : AS R. Now take any configuration x S AS and consider the event A = {XS = x S}. From the above identity, it follows that P-a.s., j Λ\S λjmj(Xj) = 1Af(x S). Finally, take any configuration xΛ\S AΛ\S such that j Λ\S λjmj(xj) = f(x S), Sparse Markov Models for High-dimensional Inference and let B = {XΛ\S = xΛ\S}. Such a configuration must exist by Assumption 3. As a consequence, we have that P-a.s., j Λ\S λjmj(xj) = 1A Bf(x S), implying that P(A B) X j Λ\S λjmj(xj) = P(A B)f(x S). By Assumption 1, we have P(A B) = P(xΛ) > 0 so that the identify above would imply that X j Λ\S λjmj(xj) = f(x S), which is a contradiction. Therefore, we must have j Λ\S λj(mj(Xj) EXS(mj(Xj))) and the result follows. We also need the following result. Lemma 22 For each S J d, 1K, k Sc and x S AS, the following identity holds: |Covx S(X0, mk(Xk))| mk Lip|Covx S(X0, Xk)|, (56) where mk and mk Lip are defined (3) and (6) respectively. Proof [Proof of Lemma 22] First observe that Covx S(X0, mk(Xk)) = Covx S(X0, mk(Xk) mk(c)), for any c A. Since, Covx S(X0, mk(Xk) mk(c)) = X b A (mk(b) mk(c))Covx S(X0, 1{Xk = b}) and |mk(b) mk(c)| mk Lip|b c|, it follows then that Covx S(X0, mk(Xk)) mk Lip X b A |b c|Covx S(X0, 1{Xk = b}). By taking c = min(A), we have that |b c| = (b c) for any b A and we deduce from the above inequality that Covx S(X0, mk(Xk)) mk Lip Covx S(X0, Xk) mk Lip|Covx S(X0, Xk)|. A similar argument shows that Covx S(X0, mk(Xk)) mk Lip|Covx S(X0, Xk)|, concluding the proof. Our last auxiliary result is the following. Ost and Takahashi Lemma 23 For each S J d, 1K such Λ S, x S AS and j, k Λ \ S, the following identity holds: Covx S(mj(Xj), mk(Xk)) = 1 c A Px S(Xk = b)Px S(Xk = c)(mk(b) mk(c)) (Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)) . (57) where mj is defined (3). Proof [Proof of Lemma 23] First notice that Covx S(mj(Xj), mk(Xk)) = X a,b A mj(a)mk(b)Covx S(1{Xj = a}, 1{Xk = b}). Now, for any a, b A, one can check that Covx S(1{Xj = a}, 1{Xk = b}) = X c A Px S(Xk = b)Px S(Xk = c) (Px S(Xj = a|Xk = b) Px S(Xj = a|Xk = c)) Hence, we deduce from the above equalities that Covx S(mj(Xj), mk(Xk)) = X c A mk(b)Px S(Xk = b)Px S(Xk = c)) (Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)). Interchanging the role of the symbols b and c in the equality above, we obtain that Covx S(mj(Xj), mk(Xk)) = X b A mk(c)Px S(Xk = b)Px S(Xk = c)) (Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)). The result follows by combing the last two equalities above. We now prove Proposition 6. Proof [Proof of Proposition 6] We first prove inequality (26). Let us denote Dk,S(a, b, c, x S) = Px S(X0 = a|Xk = b) Px S(X0 = a|Xk = c), for each a, b, c A, x S AS and k / S. With this notation, one can check that for any x S AS and k / S, we have that Covx S(X0, Xk) = 1 c A (b c)wk,S(b, c, x S) X a A a Dk,S(a, b, c, x S). (58) Now, observe that the triangle inequality and the equality a A |Dk,S(a, b, c, x S)| = dk,S(b, c, x S), Sparse Markov Models for High-dimensional Inference |Covx S(X0, Xk)| Diam(A) A X c A wk,S(b, c, x S)dk,S(b, c, x S), so that E (|Cov XS(X0, Xk)|) Diam(A) A νk,S, proving inequality (26). We now prove (27). This is done as follows. In the sequel, we shall write λ1Λ\S to denote the vector λ = (λj)j Λ restricted to the coordinates in Λ \ S: λ1Λ\S = (λj)j Λ\S. With this notation, it follows from Lemma 19 and Lemma 21 that for any S J d, 1K such that Λ S, X k Λ\S λk E (Cov XS(X0, mk(Xk))) κ . By the triangle inequality, it then follows that X k Λ\S λk |E (Cov XS(X0, mk(Xk))|) κ . Now using that max k Λ\S |E (Cov XS(X0, mk(Xk)))| λ1Λ\S 1 X k Λ\S λk |E (Cov XS(X0, mk(Xk)))| , we conclude that max k Λ\S |E (Cov XS(X0, mk(Xk)))| λ1Λ\S 1 κ . By observing that 1 λ0 = P k Λ λk λ1Λ\S 1, we conclude from the above inequality that max k Λ\S |E (Cov XS(X0, mk(Xk)))| κ /(1 λ0) > 0, and the result follows from Lemma 22. Therefore, it remains to prove (28). To that end, we first use Lemma 19, Lemma (22) and Lemma 23 to obtain that X k Λ\S λk mk Lip|Covx S(X0, Xk)| 1 c A λkλj Px S(Xk = b)Px S(Xk = c) (mk(b) mk(c)) (Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)) . Next, we observe that Assumption 4 implies that λj (|Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)|) λk|mk(b) mk(c)| k Λ\S λk mk Lip E (|Covx S(X0, Xk)|) p2 minΓ1 k Λ\S λ2 k X c A (mk(b) mk(c))2, Ost and Takahashi where we have also used that Px S(Xk = b) pmin. Finally, note that |mk(b) mk(c)| min{|b c|2 : b = c} mk 2 Lip to obtain that max k Λ\S E (|Covx S(X0, Xk)|) X k Λ\S λk mk Lip p2 minΓ1 2 min{|b c|2 : b = c} k Λ\S λ2 k mk 2 Lip. Then, by using Cauchy-Schwartz inequality, we deduce that X k Λ\S λk mk Lip s X k Λ\S λ2 k mk 2 Lip p |Λ \ S| s X k Λ\S λ2 k mk 2 Lip p The result follows by combining the last two inequalities. A.2.2 Proof of Theorem 9 Before starting the proof of Theorem 9, we recall some definitions from Information Theory. In what follows, for S J d, 1K and j J d, 1K, we write I(X0; Xj|XS) to denote the conditional mutual information between X0 and Xj given XS, defined as I(X0; Xj|XS) = X x S AS P(x S)I(X0; Xj|XS = x S), (59) where I(X0; Xj|XS = x S) := Ij(x S) denotes the conditional mutual information between X0 and Xj given XS = x S, defined as Ij(x S) = X a,b A Px S(X0 = a, Xj = b) log Px S(X0 = a, Xj = b) Px S(X0 = a)Px S(Xj = b) We use the convention that when S = , the conditional probability Pxs is the unconditional probability P. Hence, in this case, the conditional mutual information between X0 and Xj is the mutual information between these random variables, denoted I(X0; Xj) := Ij. The entropy H(X0) of X0 is defined as a A P(X0 = a) log(P(X0 = a)). (61) To prove Theorem 9 we proceed similarly to Bresler (2015). During the proof we will need the the following lemma. Lemma 24 Suppose that the event Gm(ξ, ℓ) defined in (29) holds and let S J d, 1K with |S| ℓ. If ˆνm,k,S τ with k Sc, then I(X0; Xk|XS) 2(τ ξ)2. Proof Definition (59) together with Jensen inequality implies that for any j Sc, r 1 2I(X0; Xj|XS) = x S AS P(x S)Ij(x S) x S AS P(x S) 1 2Ij(x S). Sparse Markov Models for High-dimensional Inference By Pinsker inequality, it then follows that r 1 2Ij(x S) 1 a,b A |Px S(X0 = a, Xj = b) Px S(X0 = a)Px S(Xj = b)| b A Px S(Xj = b)1 a A |Px S(X0 = a|Xj = b) Px S(X0 = a)| c A wj,S(b, c, x S)d TV (Px S(X0 |Xj = b), Px S(X0 |Xj = c)) = νj,S(x S), where in the second equality we have used that for any a, b A, Px S(X0 = a|Xj = b) = X c A Px S(Xj = c)Px S(X0 = a|Xj = b), and also that Px S(X0 = a) = X c A Px S(Xj = c)Px S(X0 = a|Xj = c)). As a consequence, we deduce that r 1 2I(X0; Xj|XS) X x S AS P(x S)νj,S(x S) = νj,S. Now, on the event Gm(ξ, ℓ), we have that νk,S ˆνm,k,S ξ so that 1 2I(X0; Xk|XS) ˆνm,k,S ξ τ ξ, where in the rightmost inequality we have used that ˆνm,k,S τ. Hence, I(X0; Xj|XS) 2(τ ξ)2, and the result follows. We now prove Theorem 9. Proof Suppose the event Gm(ξ , ℓ ) holds and let ˆSm be the set obtained at the end of FS step of Algorithm 1 with parameter ℓ , where the parameters ξ and ℓ are defined as in (30). In the sequel, let S0 = and Sk = Sk 1 {jk}, where jk arg maxj Sc k 1 ˆνm,j,Sk 1 for 1 k d, and observe that by construction ˆSm = Sℓ . We want to show that Λ ˆSm. We argue by contraction. Suppose that Λ is not contained in ˆSm. In this case, it follows that Λ Sk for all 1 k ℓ , and Proposition 6 implies that for all 1 k ℓ , max j Sc k νj,Sk κ A Diam(A) = 4ξ , Ost and Takahashi where the equality holds by the choice of ξ . Since the event Gm(ξ , ℓ ) holds and |Sk| |Sℓ | = ℓ , it follows from the above inequality that ˆνm,jk,Sk 1 = max j Sc k 1 ˆνm,j,Sk 1 3ξ , for all 1 k ℓ + 1. By Lemma 24, we then deduce that I(X0, Xjk|XSk 1) 8ξ2 for all 1 k ℓ + 1. Now, notice that log2(|A|) H(X0) I(X0; X ˆSm {jℓ +1}) = k=1 I(X0, Xjk|XSk 1), (62) where we have used Gibbs inequality in the first passage, the fact that the entropy is always larger than the mutual information in the second passage and the Chain Rule in the last passage. The proof of these facts can be found, for instance, in (Cover and Thomas, 2006). By the choice of ℓ = log2(|A|)/8ξ2 , we have that ℓ + 1 > log2(|A|)/8ξ2 so that it follows from (62) that log2(|A|) (ℓ + 1)8ξ2 > log2(|A|), a contradiction. Thus, we must have Λ ˆSm and the result follows. A.2.3 Proof of Theorem 10 To prove Theorem 10 we shall need the following result. Proposition 25 Suppose Assumptions 1 and 2 hold, and let > 0 the quantity defined in Assumption 2. Then, for any ξ > 0 and m > d 2ℓ 0, P(Gc m(ℓ, ξ)) 2d(ℓ+ 1) d ℓ |A|ℓ+2 exp ξ2(m d)2( )2 18|A|2(ℓ+2)m(ℓ+ 2)2 During the proof of Proposition 25 we will make use of the following proposition. For any function f : AJ1,m K R, define for each 1 j m, δj(f) = sup n |f(x1:j 1axj+1:m) f(x1:j 1bxj+1:m)| : a, b A, x1:m AJ1,m Ko , (64) with the convention that x1:0 = xm+1:m = , ax2:m = ax2:m and x1:m 1a = x1:m 1a . Let δ(f) = (δ1(f), . . . , δm(f)) and denote δ(f) 2 2 = Pm j=1 δ2 j (f). In what follows, we write E1:m[f] = P x1:m AJ1,m K P(X1:m = x1:m)f(x1:m). Proposition 26 (Theorem 3.4. of (Chazottes et al., 2020)) Suppose Assumption 2 holds, that is, > 0. 1. For any u > 0 and f : AJ1,m K R, P (|f(X1:m) E1:m[f]| > u) 2 exp 2u2 2 Sparse Markov Models for High-dimensional Inference 2. For m > d, any g : AS A R with S J d, 1K and u > 0, t=d+1 g(Xt+S, Xt) ES[g] 2 exp u2(m d)2 2 2m(|S| + 1)2 g 2 where ES[g] = P x S AS P a A P(XS = x S, X0 = a)g(x S, a) and Xt+S = (Xt+j)j S. Before starting the proof Proposition 25, we need to introduce some additional notation. For each x AS with S J d, 1K, we write ˆPm(x) = Nm(x) m d . (65) In what follows, we write xa V {0}, with a A and V J d, 1K, to denote the configuration ((xa)j)j V {0}, defined as ( xj, for j V a, for j = 0 . When V = S {k} and xk = b A, we shall write xba S {k,0} instead of xa V {0}. With this notation, the empirical version of νk,S is defined as follows: ˆνm,k,S = X x S AS ˆPm(x S)ˆνm,k,S(x S) (66) where for x S AS, we define ˆνm,k,S(x S) = X c A ˆwm,k,S(b, c, x S) ˆdm,k,S(b, c, x S), (67) and for b, c A, ˆwm,k,S(b, c, x S) = ˆPm(xb S {k}) ˆPm(xc S {k}) ˆPm(x S) , (68) and ˆdm,k,S(b, c, x S) = 1 ˆPx S(X0 = a|Xk = b) ˆPx S(X0 = a|Xk = c) , where for each b A, ˆPx S(X0 = a|Xk = b) = ˆPm(xba S {k,0}) ˆPm(xb S {k}) . Hereafter, we omit the dependence on S and on m, whenever there is no risk of confusion. We now prove Proposition 25. Proof [Proof of Proposition 25] Claim 1. Let S J d, 1K with |S| ℓ< d/2 and take j Sc. Then, | νj,S ˆνj,S| 3 X ˆP(XS = x, Xj = b, X0 = a) P(XS = x, Xj = b, X0 = a) . Ost and Takahashi Proof of the Claim 1. By applying the triangle inequality twice, one can check that | νj,S ˆνj,S| 1 a,b,c A |P(x)wj,S(b, c, x) (Px(X0 = a|Xj = b) Px(X0 = a|Xj = c)) ˆP(x) ˆwj,S(b, c, x) ˆPx(X0 = a|Xj = b) ˆPx(X0 = a|Xj = c) . (69) Now observe that for fixed x AS and a, b, c A, P(x)wj,S(b, c, x)Px(X0 = a|Xj = b) = Px(Xj = c)P(XS = x, Xj = b, X0 = a) and similarly, ˆP(x) ˆwj,S(b, c, x)ˆPx(X0 = a|Xj = b) = ˆPx(Xj = c)ˆP(XS = x, Xj = b, X0 = a). By using these identities in (69) and then by applying the triangle inequality, one can deduce that | νj,S ˆνj,S| X a,b,c A |Px(Xj = c)P(XS = x, Xj = b, X0 = a) ˆPx(Xj = c)ˆP(XS = x, Xj = b, X0 = a) . (70) By adding and subtracting the term Px(Xj = c)ˆP(XS = x, Xj = b, X0 = a) in the right-hand side of the above inequality and using again the triangle inequality, it follows that a,b,c A |Px(Xj = c)P(XS = x, Xj = b, X0 = a) ˆPx(Xj = c)ˆP(XS = x, Xj = b, X0 = a) a,b A |P(XS = x, Xj = b, X0 = a) ˆP(XS = x, Xj = b, X0 = a)| a,c A ˆP(XS = x, X0 = a)|Px(Xj = c) ˆPx(Xj = c)|. (71) By adding and subtracting the term P(XS = x)P(XS = x, Xj = c), we can then check that |Px(Xj = c) ˆPx(Xj = c)| P(Xj = c) ˆP(XS = x) |ˆP(XS = x) P(XS = x)| + 1 ˆP(XS = x) |ˆP(XS = x, Xj = c) P(XS = x, Xj = c)|. (72) Sparse Markov Models for High-dimensional Inference From (71) and (72), one deduces that | νj,S ˆνj,S| X a,b A |P(XS = x, Xj = b, X0 = a) ˆP(XS = x, Xj = b, X0 = a)| c A |P(XS = x, Xj = c) ˆP(XS = x, Xj = c)| x AS |P(XS = x) ˆP(XS = x)|. (73) |P(XS = x, Xj = c) ˆP(XS = x, Xj = c)| a A |P(XS = x, Xj = c, X0 = a) ˆP(XS = x, Xj = c, X0 = a)| |P(XS = x) ˆP(XS = x)| a,c A |P(XS = x, Xj = c, X0 = a) ˆP(XS = x, Xj = c, X0 = a)|, the proof of Claim 1 follows from (73). Claim 2. For any u > 0, ˆP(XS = x, Xj = b, X0 = a) P(XS = x, Xj = b, X0 = a) > u 2|A||S|+2 exp u2(m d)2 2 18|A|2(|S|+2)m(|S| + 2)2 Proof of Claim 2. It follows from the union bound and Proposition 26. We now will conclude the proof. Let Sk = {S J d, 1K : |S| = k} and observe that by the union bound P(Gc m(ξ, ℓ)) j Sc P (| νj,S ˆνj,S| > ξ) . Combining Claims 1 and 2, it follows that P (| νj,S ˆνj,S| > ξ) 2|A||S|+2 exp ξ2(m d)( )2 18|A|2(|S|+2)m(|S| + 2)2 which implies that P(Gc m(ξ, ℓ)) 2 (d k)|A|k+2 exp ξ2(m d)2 2 18|A|2(k+2)m(k + 2)2 Ost and Takahashi Since ℓ d/2, we can use that d k d ℓ for all 0 k ℓto obtain that P(Gc m(ξ, ℓ)) 2d(ℓ+ 1) d ℓ |A|ℓ+2 exp ξ2(m d)( )2 18|A|2(ℓ+2)m(ℓ+ 2)2 and the result follows. We now prove Theorem 10. Proof [Proof of Theorem 10] First, observe that by Theorem 9, P(ˆΛ2,n = Λ) P(Gc m(ξ , ℓ )) + P(Λ ˆSm, ˆΛ2,n = Λ) (74) Next, notice that the second term on the right hand side of (74) can be written as P(Λ ˆSm, ˆΛ2,n = Λ) = X S J d, 1K:Λ S,|S| ℓ P( ˆSm = S, ˆΛ2,n = Λ). Now for any S J d, 1K such that Λ S, |S| ℓ , it follows from the union bound that P( ˆSm = S, ˆΛ2,n = Λ) X j Λ P( ˆSm = S, j / ˆΛ2,n) + X j S\Λ P( ˆSm = S, j ˆΛ2,n). By proceeding similarly as in the proof of Item 1 of Theorem 1, one can deduce that for any j S \ Λ, P( ˆSm = S, j ˆΛ2,n) 8|A| log(µ(n m d)/α + 2) x AS P( ˆSm = S, Nm,n(x) > 0), j S\Λ P( ˆSm = S, j ˆΛ2,n) 8(ℓ |Λ|)|A| log(µ(n m d)/α + 2) x AS P( ˆSm = S, Nm,n(x) > 0). S J d, 1K:Λ S,|S| ℓ x AS P( ˆSm = S, Nm,n(x) > 0) (n m d)P(Λ ˆSm) we then deduce that X S J d, 1K:Λ S,|S| ℓ j S\Λ P( ˆSm = S, j ˆΛ2,n) 8(ℓ |Λ|)|A| log(µ(n m d)/α + 2) e α(n m d). Sparse Markov Models for High-dimensional Inference Following the steps of the proof of Item 2 of of Theorem 1, we can also show that X S J d, 1K:Λ S,|S| ℓ j Λ P( ˆSm = S, j / ˆΛ2,n, δj γS m,n,j) 8|Λ||A| log(µ(n m d)/α + 2) S J d, 1K:Λ S,|S| ℓ P( ˆSm = S) 8|Λ||A| log(µ(n m d)/α + 2) where γS m,n,j is defined as in (14) with tm,n,j = maxb,c A:b =c min(x S,y S) Cj(b,c) tm,n(x S, y S) in the place of tn,j. Hence, it remains to estimate X S J d, 1K:Λ S,|S| ℓ j Λ P( ˆSm = S, j / ˆΛ2,n, δj < γS m,n,j). By proceeding similarly to the proof of Item 3 of Theorem 1, one can show that for each S J d, 1K such that |S| ℓ , P γS m,n,j > δj 6|A|(|A| 1) exp ( ( pmin)2(n m d)2 2(n m)|A|2(ℓ 1)(ℓ + 1)2 1 nmin n m d for all j Λ as long as n satisfies n > m + d + nmin. By using this upper bound and by recalling that P S J d, 1K:Λ S,|S| ℓ Pℓ k=0 d k (ℓ + 1) d ℓ (since 2ℓ d), we deduce that X S J d, 1K:Λ S,|S| ℓ j Λ P( ˆSm = S, j / ˆΛ2,n, δj < γS m,n,j) 6|A|(|A| 1)(ℓ + 1) d ℓ ( ( pmin)2(n m d)2 2(n m)|A|2(ℓ 1)(ℓ + 1)2 1 nmin n m d for all j Λ whenever n > m + d + nmin, implying the result. A.2.4 Proof of Corollary 12 Proof [Proof of Corollary 12] Assumptions 1 and 2 are satisfied for all values of n, since p n p min and n min for positive constants p min and min for all n. Since the sequence of MTD models also satisfy Assumption 3, the result follows from Theorem 10 and Remark 11-Item (b). A.3 Proofs of Section 3.3 A.3.1 Proof of Theorem 14 Proof [Proof of Theorem 14] Ost and Takahashi Notice that we can write for each j J d, 1K, mj(Xj) = (pj(1|1) pj(1|0)X j + pj(1|0), so that equality (50) can be rewritten for any j J d, 1K satisfying (pj(1|1) pj(1|0)) = 0, S J d, 1K \ {j} and x S {0, 1}S, as Covx S(X0, Xj) = X ℓ Λ\S ℓCovx S(Xℓ, Xj), where ℓ= λℓ(pℓ(1|1) pℓ(1|0)) for ℓ Λ. Recalling that νj,S = 2E (|Cov XS(X0, Xj)|) in the binary case, we can deduce that for any S J d, 1K, x S {0, 1}S and j J d, 1K\S, ℓ Λ\S ℓCov XS(Xℓ, Xj) As a consequence, it follows that for S Λ and j Λ \ S, Var XS(Xj)| j| X ℓ Λ\S:ℓ =j | ℓ||Cov XS(Xℓ, Xj)| = 2E (| j|Var XS(Xj) | ℓ| | j||Px S(Xℓ= 1|Xj = 1) Px S(Xℓ= 1|Xj = 0)| where in the second inequality we have used that |Covx S(Xℓ, Xj)| = Varx S(Xj)|Px S(Xℓ= 1|Xj = 1) Px S(Xℓ= 1|Xj = 0)|. By Assumption 4, we then deduce that νj,S 2Γ1| j|E(Var XS(Xj)). (75) Now, take S Λ and let j S arg minℓ Λ\S | j|E(Var XS(Xℓ)). For any j (Λ)c, use the triangle inequality, equality (75) and Assumption 5 to deduce that ℓ Λ\S | ℓ||Cov XS(Xℓ, Xj)| 2| j S|E (Var XS(Xj S)) Γ2 (76) Using that δj = | j| and combing inequalities (75) and (76), it follows then that max j Λ\S νj,S max j (Λ)c νj,S 2(Γ1 Γ2)| j S|E (Var XS(Xj S)) , (77) Sparse Markov Models for High-dimensional Inference where we have used also that maxj Λ\S νj,S νj S,S. Using that E (Var XS(Xj S)) (p )2 and | j S| minj Λ | j| we obtain that max j Λ\S νj,S max j (Λ)c νj,S 2(Γ1 Γ2)p2 min min j Λ | j|. concluding the first half of the proof. To show the second assertion of the theorem, take S Λ, let j S arg maxj Λ\S νj,S and note that on Gn(ℓ, ξ), max j Λ\S ˆνn,j,S ˆνn,j S,S νj S,S ξ = max j Λ\S νj,S ξ. Similarly, one can show that on Gn(ℓ, ξ), max j (Λ)c ˆνn,j,S max j (Λ)c νj,S + ξ. As a consequence, it follows that max j Λ\S ˆνn,j,S max j (Λ)c ˆνn,j,S max j Λ\S νj,S max j (Λ)c νj,S whenever Gn(ℓ, ξ). By taking ξ as in (37), we have that max j Λ\S ˆνn,j,S max j (Λ)c ˆνn,j,S > 0, implying that arg maxj Sc ˆνn,j,S Λ for all S Λ, and the result follows. A.3.2 Proof of Corollary 16 Proof [Proof of Corollary 16] By Theorem 14, we have that P(ˆΛ2,n = Λ) P(Gc m(ξ, L)) + P(Λ ˆSm, ˆΛ2,n = Λ), for any ξ < (Γ1 Γ2)p min minj Λ δj. By Proposition 25, we have that P(Gc m(ξ, L)) 2d(L + 1) d L |A|L+2 exp ξ2(m d)2 2 18|A|2(L+2)m(L + 2)2 By taking ξ = (Γ1 Γ2)p min minj Λ δj, one can check that if min j Λ δj C1 log(n) for some constant C1 = C1(β, L, min, p min, Γ1, Γ2), then P(Gc m(ξ, L)) 0 as n . Ost and Takahashi By proceeding exactly as in the proof of Theorem 10 and using 11-item (b), we can show that P(Λ ˆSm, ˆΛ2,n = Λ) 6|A|(L + 1) d L d|A|L+1(|A| 1)|Λ| exp ( pmin)2(n d)2 8n(L + 1)2|A|2(L 1) + 8|A| ((L |Λ|)(n m d) + |Λ|) log(µ(n m d)/α + 2) n d + C|A|Lα pminδ2 min , where C = C(µ, ε). Therefore, using that min, pmin p min, d = nβ, α = (1 + η) log(n), we also see that if δ2 min C2 log(n) n for some C2 = C2(µ, ε, η, min, p min, β, L) then P(Λ ˆSm, ˆΛ2,n = Λ) 0 as n . By taking C = C1 C2, we deduce that P(ˆΛ2,n = Λ) 0 as n as long as δ2 min C log(n) n , and the result follows. A.4 Proofs of Section 3.4 Proof [Proof of Theorem 17] By the union bound, we have that x d: 1 AJ d, 1K |ˆpn(a|xˆΛ2,n) p(a|xΛ)| v u u t2α(1 + ϵ) ˆVn(a, xˆΛ2,n) + α 3 Nn(xˆΛ2,n) P(Λ = ˆΛ2,n) |ˆpn(a|xΛ) p(a|xΛ)| 2α(1 + ϵ) ˆVn(a, xΛ) Nn(xΛ) + α 3 Nn(xΛ) Now, Proposition 34 implies that for any a A and xΛ AΛ, |ˆpn(a|xΛ) p(a|xΛ)| 2α(1 + ϵ) ˆVn(a, xΛ) Nn(xΛ) + α 3 Nn(xΛ) 4 log(µ(n d)/α + 2) e αP Nn(xΛ) > 0 , Sparse Markov Models for High-dimensional Inference |ˆpn(a|xΛ) p(a|xΛ)| 2α(1 + ϵ) ˆVn(a, xΛ) Nn(xΛ) + α 3 Nn(xΛ) 4|A| log(µ(n d)/α + 2) xΛ AΛ E 1{ Nn(xΛ) > 0} . xΛ AΛ E[1{ Nn(xΛ) > 0}] (n d), the result follows. A.5 Proof of Section 3.5 A.5.1 Proof of Proposition 18 Proof [Proof of Proposition 18] First observe that since all MTDs are stationary Markov chains of order at most d, we can use the Markov property to show that KL(P (j) n ||P (k) n ) = KL(P (j) d ||P (k) d ) + (n d)E(j)(KL(p(j)( |X d: 1)||p(k)( |X d: 1))). where KL(p(j)( |x d: 1)||p(k)( |x d: 1)) denotes the Kullback-Leibler divergence between p(j)( |x d: 1) and p(k)( |x d: 1). Now note that for each fixed x :d1 AJ d, 1K, we can use the definition of the transition probabilities p(j)( | ) together with Lemma 6 of Csizar and Talata (2005) to deduce that KL(p(j)( |x d: 1)||p(k)( |x d: 1)) λ2|p(1|1) p(1|0)|2(pmin) 11{xj =xk}. Since pmin (1 λ)/2 and δ = λ|p(1|1) p(1|0)|, it follows from the above inequality that E(j)(KL(p(j)( |X d: 1)||p(k)( |X d: 1))) 2δ2 By using similar arguments, one can also show that KL(P (j) d ||P (k) d ) 1 pmin max x d: 1,y d: 1 |p(j)(1|x d: 1) p(k)(1|y d: 1)|2 i=1 max x d: 1,y d: 1:x i: 1=y i: 1 |p(j)(1|x d: 1) p(k)(1|y d: 1)|2 ! Therefore, it follows that KL(P (j) n ||P (k) n ) 2dδ2 1 λ + (n d) 2δ2 and the result follows. Ost and Takahashi A.6 Computation of PCP and FSC estimators We will first show that one can compute the PCP estimator with at most O(|A|2|S|(n d)) computations, as claimed in item (c) of Remark 2. Proof of item (c) of Remark 2. One way to compute the PCP estimator is the following. First, we compute Nn(x S, a) simultaneously for all pasts x S and symbols a A, and build the set ES = {x S : Nn(x S) > 0}. This can be done with O(n d) computations. Indeed, we set initially Nn(x S, a) = 0 for all past x S and symbol a A. Then at each time d + 1 t n, we increment by 1 the count of Nn(x S, a) for which Xt+S = x S and Xt = a, leaving all the other counts unchanged. Moreover, at the first time that Nn(x S, a) > 0, we include x S in the set E. Note that the cardinality of the set ES is at most (n d). Next, we need to compute sn(x S) and ˆpn( |x S) for each x S E, which can be done with at most O(|A|) additional computations. Once all these quantities are determined, we then need to test whether a given lag j S has to be removed or not, by evaluating inequality (13) for all pairs of (S \ {j})-compatible pasts in ES. This can be done with at most O(|A|2(n d)) more computations because 1) the number of different pasts in ES is at most (n d); 2) there are at most |A| pasts in E which are compatible with a fixed past x S in ES; and 3) one can evaluate whether inequality (13) holds or not to a given pair of compatible past with O(|A|) additional computations. Finally, since the number of lags to be tested is |S|, it follows that we can implement the PCP estimator with at most O(|A|2|S|(n d)) computations, concluding the proof. We now show that we can compute the FSC estimator by using at most O(|A|3ℓ(m d)(d (ℓ 1)/2) + |A|2(n m d)ℓ) computations, as stated in item (c) of Remark 8. Proof of item (c) of Remark 8. By the item (c) of Remark 2, the CUT step can be computed with at most O(ℓ|A|2(n m d)) computations since the FS step outputs a subset of size ℓand the size of the second half of the sample is n m. Hence, the proof will be concluded if we show that the FS step can be computed with at most O(|A|3(m d)(ℓd (ℓ 1)ℓ/2) computations. To see that, let us fix S J d, 1K and j / S. Proceeding as in the proof item (c) of Remark 2, one can check that we compute Nm(xba S {0,j}) simultaneously for all configurations xba S {j,0} and build the set ES = {x S : Nm(x S) > 0} with O(m d) computations. Notice that the size of the set ES is most (m d). Since for each x S ES, we need to perform at most O(|A|3) additional operations to compute ˆPm(x S) and ˆνj,m(x S), it follows that with at most O(|A|3(m d)) computations we can determine ˆνm,j,S. Therefore, the step 3 of the FS step (where we need to compute ˆνm,j,S for j Sc) can be implemented with O(|A|3(m d)(d |S|)) calculations. Since we need to repeat step 3 of the FS step for ℓ different sets, we conclude that with at most O(|A|3(m d) |S|=0 (d |S|)) = O(|A|3(m d)(ℓd (ℓ 1)ℓ/2) computations, we can implement the FS step. This concludes the proof. Sparse Markov Models for High-dimensional Inference Appendix B. Martingale concentration inequalities In the sequel, N denotes the set of non-negative integers {0, 1, . . .} Let (Ω, F, P) be a probability space. We assume that this probability space is rich enough so that the following stochastic processes may be defined on it. In what follows, let (Xt)t Z be a Markov chain of order d Z+, taking values on a finite alphabet A, with family of transition probabilities {p( |x d: 1) : x d: 1 supp(P)}. Denote Ft = σ(X d:t) for t N. For each a A, consider the stochastic process Ma = ((Ma t ))t N defined as, Ma t = 1{Xt = a} p(a|X(t d):(t 1)), t N. Let H = (Ht)t N be a stochastic process taking values on a finite alphabet B R, satisfying H0 = 0 and Ht Ft 1 for all t Z+, and consider H Ma = (H Ma t )t N defined as, s=0 Hs Ma s , t N. (78) Notice that H Ma is adapted to the fitration F := (Ft)t N, that is H Ma t Ft for all t N. Also H Ma 0 = 0. Recall the notation B = maxb B |b|. Lemma 27 Let H Ma = (H Ma t )t N be the stochastic process defined in (78). Then H Ma is a square integrable Martingale w.r.t. F starting from H Ma 0 = 0. Moreover, the predictable quadratic variation of H Ma, denoted by H Ma = ( H Ma t)t N, is given by s=0 H2 s p a|X(s d):(s 1) 1 p a|X(s d):(s 1) , t N. (79) Furthermore, for any λ > 0 and b > 0 such that B b, the stochastic process exp λH Ma eλb λb 1 b2 H Ma = exp λH Ma t eλb λb 1 is a supermartingale w.r.t. F starting from 1. Proof For each t Z+, we have that Ht Ft 1 and also that E [1{Xt = a}|Ft 1] = p(a|X(t d):(t 1)). These two facts imply that for any t Z+, E [Ht Ma t |Ft 1] = Ht E [Ma t |Ft 1] = 0, which, in turn, implies that E[H Ma t |Ft 1] = H Ma t 1. Hence, H Ma is a martingale w.r.t. to F. Since |H Ma t | B t for t 1, it follows that H Ma is also square integrable. The predictable quadratic variation of H Ma is defined as s=1 E (Ms Ms 1)2 |Fs 1 , for t Z+ with H Ma 0 = 0. For any t Z+, one can check that H Ma t H Ma t 1 2 = H2 t (1{Xt = a} 2p(a|X(t d):(t 1))1{Xt = a} p2(a|X(t d):(t 1))). Ost and Takahashi Using again that Ht Ft 1 and also that E [1{Xt = a}|Ft 1] = p(a|X(t d):(t 1)), one then deduces that for any t Z+, E H Ma t H Ma t 1 2 |Ft 1 = H2 t p(a|X(t d):(t 1))(1 p(a|X(t d):(t 1))), which establishes (79). The proof that exp λH Ma eλb λb 1 b2 H Ma is a supermartingale w.r.t F can be found in (Raginsky and Sason, 2014). We will use Lemma 27 to prove the following concentration inequality. Proposition 28 Let H Ma = (H Ma t )t N be the stochastic process defined in (78). Suppose that B b for some b > 0. For any fixed α > 0 and v > 0, we have for t N, 3 , H Ma t v exp ( α) P( H Ma t > 0). Remark 29 This is basically Lemma 5 of (Oliveira, 2015) (see the Economical Freedman s inequality provided in Inequality (41)) applied to the square integrable martingale H Ma. The only difference is the factor 2 in front of the linear term αb 3 which is not present here. Notice that for t Z+, the concentration inequality above can be rewritten in the following form: 3 , H Ma t v| H Ma t > 0 exp ( α) . The conditioning on event { H Ma t > 0} reflects the fact that if H Ma t = 0 almost surely, then H Ma t = H Ma 0 = 0 almost surely as well. Proof For t = 0 the result holds trivially. Now, suppose t Z+. By considering the set B/b = {c/b : c B} instead of B, it suffices to prove the case b = 1. To shorten the notation, we denote Mt = H Ma t in the sequel. By the Markov property, we have that for any λ > 0, P(λMt ψ(λ) M t α) exp( α)E [exp (λMt ψ(λ) M t) 1{ M t > 0}] , where ψ(λ) = eλ λ 1 and we have used that if M t = 0 almost surely, then Mt = M0 = 0 almost surely. By using the fact that (exp (λMt ψ(λ) M t))t N is a supermartingale (Lemma 27 with b = 1) together with the decomposition { M t > 0} = k=1 { M k > 0 and M j = 0 for all j < k}, as in (Oliveira, 2015), we can deduce that E [exp (λMt ψ(λ) M t) 1{ M t > 0}] P( M t > 0), Sparse Markov Models for High-dimensional Inference which implies not only that for any λ > 0, exp( α)P( M t > 0), (80) but also that λ, M t v exp( α)P( M t > 0). Now, we use that for λ (0, 3) it holds that ψ(λ) λ2(1 λ/3) 1/2. Hence, from the above inequalities we deduce that for any λ (0, 3), P Mt λ 2(1 λ/3) M t + α exp( α)P( M t > 0), (81) and also that P Mt λ 2(1 λ/3)v + α λ, M t v exp( α)P( M t > 0). Minimizing λ (0, 3) 7 λ (1 λ/3)v + α λ, the result follows. By using a peeling argument as in (Hansen et al., 2015), we deduce from the above result the following. Proposition 30 Let H Ma = (H Ma t )t N be the stochastic process defined in (78). Suppose that B b for some b > 0. For ϵ > 0, v > w > 0 and α > 0, we have for t N, 2α(1 + ϵ) H Ma t + αb 3 , w H Ma t v log(v/w + 1) exp ( α) P( H Ma t > 0). Proof It suffices to prove the case b = 1. The general case follows from this one by first replacing B by B/b = {c/b : c B} and then rearranging the terms properly. Let us denote v0 = w and vk = (1 + ε)vk 1 for 1 k K := l log(v/w+1) log(1+ε) m . Notice that v K v, by the definition of K. To shorten the notation, we denote H Ma t = Mt in what follows. Starting from (81), one can deduce that for any 0 k < K and λ (0, 3), we have P Mt λ 2(1 λ/3) M t + α λ, vk M t vk+1 exp( α)P( M t > 0), which, in turn, implies that P Mt λ 2(1 λ/3)vk+1 + α λ, vk M t vk+1 exp( α)P( M t > 0). Ost and Takahashi Minimizing w.r.t. to λ (0, 3) as in Proposition 28, it then follows that 3 , vk M t vk+1 exp ( α) P( M t > 0). Now, on the event {vk M t}, we have that M t(1 + ε) (1 + ε)vk = vk+1, so that the inequality above implies 2(1 + ε) M tα + α 3 , vk M t vk+1 exp ( α) P( M t > 0). Summing over k the result follows (recall that v v K by the choice of K). Hereafter, let m N and consider a function ϕ : AJ1,m K AJ d, 1K such that its supremum norm ϕ = max(x1:m,x d: 1) AJ1,m K AJ d, 1K |ϕ(x1:m, x d: 1)| b. Here we use the convention that ϕ is a function defined only on AJ d, 1K when m = 0. Given such a function ϕ, let us denote Hϕ = (Hϕ t )t 0 the stochastic process defined as Hϕ 0 = . . . = Hϕ m+d = 0 and Hϕ t = ϕ(X1:m, X(t d):(t 1)) for t d + m + 1. Clearly, Hϕ t Ft 1 for all t Z+. From (79), one can check that the predictable quadratic variation Hϕ Ma of the martingale Hϕ Ma is given by Hϕ Ma 0 = . . . = Hϕ Ma m+d = 0 and for t m + d + 1, s=m+d+1 ϕ2(X1:m, X(s d):(s 1))p a|X(s d):(s 1) 1 p a|X(s d):(s 1) . (82) As a direct consequence of Proposition 30, we derive the following result. Corollary 31 Let X1:n be a sample from a MTD model of order d with set of relevant lags Λ. Let ˆΛm be an estimator of Λ computed from X1:m, where n > m. For each x AJ d, 1K, a A and S J d, 1K, let ˆpm,n(a|x S) be the empirical transition probability defined in (7) computed from Xm+1:n. Then for any S J d, 1K such that Λ S, ε > 0, α > 0 and n m + d + 1, we have ˆΛm = S, |ˆpm,n(a|x S) p(a|xΛ)| 2α(1 + ε)p(a|xΛ)(1 p(a|xΛ)) + α 3 Nm,n(x S) 2 log(n m d + 1) e αP Nm,n(x S) > 0, ˆΛm = S . (83) In particular, Λ ˆΛm, |ˆpm,n(a|xˆΛm) p(a|xΛ)| 2α(1 + ε)p(a|xΛ)(1 p(a|xΛ)) + α 3 Nm,n(xˆΛm) 2 log(n m d + 1) e αP Nm,n(xˆΛm) > 0, Λ ˆΛm . (84) Sparse Markov Models for High-dimensional Inference Proof Summing in both sides of (83) over S J d, 1K such that Λ S, we obtain inequality (84). Thus, it remains to prove (83). To that end, take ϕ(X1:m, X(t d):(t 1)) = 1{ˆΛm = S}1{Xt+j = xj, j S} and notice that in this case Hϕ Ma n = 1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)). So, if either p(a|xΛ) = 0 or p(a|xΛ) = 1, then we have necessarily Hϕ Ma n = 0 for all n m + d + 1, which implies that almost surely for all n m + d + 1, Hϕ Ma n = 1{ˆΛm = S}( Nm,n(x S, a) Nm,n(x S)p(a|xΛ)) = 0. By noticing that Hϕ Ma n = 1{ˆΛm = S} Nm,n(x S)(ˆpm,n(a|x S) p(a|xΛ)), it follows that, on the event {ˆΛm = S, Nm,n(x S) > 0}, we must have ˆpm,n(a|x S) = p(a|xΛ) almost surely and so the left-hand side of (83) is 0 and the result holds trivially. Let us now suppose 0 < p(a|xΛ) < 1. In this case, we apply Proposition 30 with w = p(a|xΛ)(1 p(a|xΛ)), v = (n m d)w and b = 1 to deduce that, P ˆΛm = S, N n,m(x S)(ˆpm,n(a|x S) p(a|xΛ)) q 2α(1 + ε)p(a|xΛ)(1 p(a|xΛ)) Nm,n(x S) + α 3 , Nm,n(x S) > 0 log(n m d + 1) e αP(1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0). (85) To conclude the proof, observe that in this case {1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0} = {ˆΛm = S, Nm,n(x S) > 0}, and use again Proposition 30 with H ϕ Ma in the place of Hϕ Ma (by noting also that Hϕ Ma = H ϕ Ma ). Remark 32 Let us briefly comment on the results of Corollary 31. Suppose x AJ d, 1K and a A are such that 0 < p(a|xΛ) < 1 and also that ˆΛm is a consistent estimator of Λ. By the CLT for aperiodic and irreducible Markov Chains it follows that q Nm,n(xΛ)(ˆpm,n(a|xΛ) p(a|xΛ))1{ˆΛm = Λ, Nm,n(xΛ) > 0} converges in distribution (as min{m, n} ) to a centered Gaussian random variable with variance p(a|xΛ)(1 p(a|xΛ)). This implies that for sufficiently large n, P (ˆpm,n(a|xΛ) p(a|xΛ) 2αp(a|xΛ)(1 p(a|xΛ)) Nm,n(xΛ) |ˆΛm = Λ, Nm,n(xΛ) > 0 Let us compare this heuristic argument with Corollary 31 applied to S = Λ. In this case, in Inequality (83), the variance term q 2α(1+ε)p(a|xΛ)(1 p(a|xΛ)) Nm,n(xΛ) can be made arbitrarily close to Ost and Takahashi optimal value q 2αp(a|xΛ)(1 p(a|xΛ)) Nm,n(xΛ) , at the cost 1 log(1+ε). Both the linear term α Nm,n(xΛ) and the log(n m d + 1) factor are the price to pay to achieve the result which holds every n m+d+1 and reflect the fact that ˆpm,n(a|xΛ) p(a|xΛ) is Gaussian only asymptotically. In particular, Corollary 31 improves the Economical Freedman s Inequality (as stated in (Oliveira, 2015) - Lemma 5, Inequality (42)) when restricted to the martingale Hϕ Ma. In the sequel, let us denote P a = (P a t )t N, for each a A, the stochastic process defined as P a t = p(a|X(t d:t 1)) for each t N. With this notation, notice that s=0 H2 s p(a|X(s 1):(s d)), t N, (86) is such that H Ma t H2 P a t for all t N. In particular, Proposition 28 holds if we replace H Ma t by H2 P a t . A closer inspection of the proof of Proposition 30 reveals that this proposition also holds with H2 P a t in the place of H Ma t. In the next theorem, we show that we can replace H2 P a t by a linear transformation of its empirical version which is crucial for our analysis. Theorem 33 Let H Ma = (H Ma t )t N be the stochastic process defined in (78). Suppose that B b for some b > 0. For any fixed µ (0, 3) satisfying µ > ψ(µ) = exp(µ) µ 1 and α > 0, define for t N, H2 ˆP a t = µ µ ψ(µ) s=0 H2 s ˆP a s + b2α µ ψ(µ), where ˆP a s = 1{Xs = a} for all s N. Then, for any fixed ϵ > 0 and v > w > 0, we have for any t N, 2(1 + ϵ)αH2 ˆP a t + αb 3 , w H2 ˆP a t v 2 log(v/w + 1) exp ( α) P( H Ma t > 0). Proof We prove only the case b = 1. Note that H2 Ma t = H2 P a t H2 ˆP a t . Also recall that H Ma t H2 P a t . We now proceed to the proof. We first use Inequality (80) for the martingale H2 Ma together with that fact that H2 Ma t H4 P a t H2 P a t (the last inequality holds because b = 1) to deduce that for any µ > 0, P H2 P a t H2 ˆP a t + ψ(µ) µ H2 P a t + α exp( α)P( H Ma t > 0), which implies that for any µ (0, 3) satisfying µ > ψ(µ), it holds P H2 P a t H2 ˆP a t exp( α)P( H Ma t > 0). Sparse Markov Models for High-dimensional Inference Hence, combining this inequality with (81), we conclude that for any λ (0, 3) and any µ (0, 3) satisfying µ > ψ(µ), P H Ma t λ 2(1 λ/3)H2 ˆP a t + α P H Ma t λ 2(1 λ/3)H2 ˆP a t + α µ, H2 P a t H2 ˆP a t + P H2 P a t H2 ˆP a t 2 exp( α)P( H Ma t > 0), where we also used in the last inequality the fact that H Ma t H2 P a t . To conclude the proof, we need to follow the same steps as Proposition 28 and then use the peeling argument as in the proof of Proposition 30 with H2 ˆP a t in the place of H Ma t. As consequence of Theorem 33, we obtain the following result. Proposition 34 Let X1:n be a sample from a MTD model of order d with set of relevant lags Λ. Let ˆΛm be an estimator of Λ computed from X1:m where n > m. For any x AJ d, 1K, a A and S J d, 1K, let ˆpm,n(a|x S) be the empirical transition probability defined in (7) computed from Xm+1:n, and consider for α > 0 and µ (0, 3) satisfying µ > ψ(µ) = exp(µ) µ 1, ˆVm,n(a, x, S) = µ µ ψ(µ) ˆpm,n(a|x S) + α µ ψ(µ) 1 Nm,n(x S). Then for any S J d, 1K such that Λ S and n m + d + 1, we have ˆΛm = S, |ˆpm,n(a|x S) p(a|xΛ)| 2α(1 + ϵ) ˆVm,n(a, x, S) Nm,n(x S) + α 3 Nm,n(x S) 4 log(µ(n m d)/α + 2) e αP ˆΛm = S, Nm,n(x S) > 0 . (87) In particular, Λ ˆΛm, |ˆpm,n(a|xˆΛm) p(a|xΛ)| 2α(1 + ϵ) ˆVm,n(a, x, ˆΛm) Nm,n(xˆΛm) + α 3 Nm,n(xˆΛm) 4 log(µ(n m d)/α + 2) e αP Λ ˆΛn, Nm,n(xˆΛm) > 0 . (88) Proof Summing in both sides of (87) over S J d, 1K such that Λ S, we obtain inequality (88). Hence, it remains to show (87). Arguing as in Corollary 31 we need to consider only the case 0 < p(a|x) < 1. Ost and Takahashi By applying Theorem 33 with H = H ϕ where ϕ is as in the proof of Corollary 31, v = µ µ ψ(µ)(n m d) + α µ ψ(µ) and w = α µ ψ(µ), we obtain that P ˆΛm = S, Nm,n(x S)|ˆpm,n(a|x S) p(a|xΛ)| q 2(1 + ϵ)α Vm,n(a, x, S) + α 4 log(µ(n m d)/α + 2) e αP(1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0), where Vm,n(a, x, S) = Nm,n(x S) ˆVm,n(a, x, S). By using that when 0 < p(a|x) < 1, {1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0} = {ˆΛm = S, Nm,n(x S) > 0}, and the fact that Vm,n(a, x, S) = Nm,n(x S) ˆVm,n(a, x, S), we deduce (87) from the above inequality. Andre Berchtold. Estimation in the mixture transition distribution model. Journal of Time Series Analysis, 22(4):379 397, 2001. Andr e Berchtold and Adrian Raftery. The Mixture Transition Distribution Model for High Order Markov Chains and Non-Gaussian Time Series. Statistical Science, 17(3):328 356, 2002. Guy Bresler. Efficiently learning ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 15, page 771 782. Association for Computing Machinery, 2015. Peter B uhlmann and Abraham J Wyner. Variable length markov chains. The Annals of Statistics, 27(2):480 513, 1999. Gyorgy Buzsaki and Andreas Draguhn. Neuronal oscillations in cortical networks. science, 304(5679):1926 1929, 2004. J.R. Chazottes, S. Gallo, and D.Y. Takahashi. Optimal gaussian concentration bounds for stochastic chains of unbounded memory. Ar Xiv, 2020. Yanqing Chen, Mingzhou Ding, and JA Scott Kelso. Long memory processes (1/f α type) in human coordination. Physical Review Letters, 79(22):4501, 1997. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954. Imre Csisz ar and Zsolt Talata. Context tree estimation for not necessarily finite memory processes, via bic and mdl. IEEE Transactions on Information theory, 52(3):1007 1016, 2006. Sparse Markov Models for High-dimensional Inference Antonio Galves, Charlotte Galves, Jesus E Garcia, Nancy L Garcia, and Florencia Leonardi. Context tree selection and linguistic rhythm retrieval from written texts. The Annals of Applied Statistics, 6(1):186 209, 2012. Jes us E Garcıa, Ver onica A Gonz alez-L opez, Rua Sergio Buarque de Holanda, and Cidade Universit aria-Barao Geraldo. Minimal markov models. In Fourth Workshop on Information Theoretic Methods in Science and Engineering, page 25, 2011. David L Gilden, Thomas Thornton, and Mark W Mallon. 1/f noise in human cognition. Science, 267(5205):1837 1839, 1995. Niels Richard Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1):83 143, 2015. Matthew Heiner and Athanasios Kottas. Estimation and selection for high-order markov chains with bayesian mixture transition distribution models. Journal of Computational and Graphical Statistics, pages 1 13, 2021. V ain o J a askinen, Jie Xiong, Jukka Corander, and Timo Koski. Sparse markov chains for sequence data. Scandinavian Journal of Statistics, 41(3):639 655, 2014. Andrea Kir aly, Imre Bartos, and Imre M J anosi. Correlation properties of daily temperature anomalies over land. Tellus A: Dynamic Meteorology and Oceanography, 58(5):593 600, 2006. Ioannis Kontoyiannis, Lambros Mertzanis, Athina Panotopoulou, Ioannis Papageorgiou, and Maria Skoularidou. Bayesian context trees: Modelling and exact inference for discrete time series. ar Xiv preprint ar Xiv:2007.14900, 2020. Roberto Imbuzeiro Oliveira. Stochastic processes with random contexts: A characterization and adaptive estimators for the transition probabilities. IEEE Transactions on Information Theory, 61(12):6910 6925, 2015. Adrian E. Raftery. A model for high-order markov chains. Journal of the Royal Statistical Society. Series B (Methodological), 47(3):528 539, 1985. Maxim Raginsky and Igal Sason. Concentration of Measure Inequalities in Information Theory, Communications, and Coding: Second Edition. 2014. Jorma Rissanen. A universal data compression system. IEEE Transactions on information theory, 29(5):656 664, 1983. Abhra Sarkar and David B Dunson. Bayesian nonparametric modeling of higher order markov chains. Journal of the American Statistical Association, 111(516):1791 1803, 2016. M.J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. Ost and Takahashi Matthew C Wheeler, Harry H Hendon, Sam Cleland, Holger Meinke, and Alexis Donald. Impacts of the madden julian oscillation on australian rainfall and circulation. Journal of Climate, 22(6):1482 1498, 2009. Naiming Yuan, Zuntao Fu, and Shida Liu. Long-term memory in climate variability: A new look based on fractional integral techniques. Journal of Geophysical Research: Atmospheres, 118(23):12 962, 2013.