# sparse_markov_models_for_highdimensional_inference__ecbc13e8.pdf

Journal of Machine Learning Research 24 (2023) 1-54 Submitted 3/22; Published 8/23

Sparse Markov Models for High-dimensional Inference

Guilherme Ost guilhermeost@im.ufrj.br Institute of Mathematics Federal University of Rio de Janeiro Rio de Janeiro, RJ, Brazil

Daniel Y. Takahashi takahashiyd@gmail.com Brain Institute Federal University of Rio Grande do Norte Natal, RN, Brazil

Editor: Aur elien Garivier

Finite-order Markov models are well-studied models for dependent ﬁnite alphabet data. Despite their generality, application in empirical work is rare when the order d is large relative to the sample size n (e.g., d = O(n)). Practitioners rarely use higher-order Markov models because (1) the number of parameters grows exponentially with the order, (2) the sample size n required to estimate each parameter grows exponentially with the order, and (3) the interpretation is often diﬃcult. Here, we consider a subclass of Markov models called Mixture of Transition Distribution (MTD) models, proving that when the set of relevant lags is sparse (i.e., O(log(n))), we can consistently and eﬃciently recover the lags and estimate the transition probabilities of high-dimensional (d = O(n)) MTD models. Moreover, the estimated model allows straightforward interpretation. The key innovation is a recursive procedure for a priori selection of the relevant lags of the model. We prove a new structural result for the MTD and an improved martingale concentration inequality to prove our results. Using simulations, we show that our method performs well compared to other relevant methods. We also illustrate the usefulness of our method on weather data where the proposed method correctly recovers the long-range dependence.

Keywords: Markov Chains, High-dimensional inference, Mixture Transition Distribution

1. Introduction

From the daily number of COVID-19 cases to the activity of neurons in the brain, discrete time series are ubiquitous in our life. A natural way to model these time series is by describing how the present events depend on the past events, i.e., characterizing the transition probabilities. Therefore, ﬁnite-order Markov chains - models speciﬁed by transition probabilities that depend only on a limited portion of the past - are an obvious choice to model time series with discrete values. The length of the portion of the relevant past deﬁnes the order of the Markov chain. At ﬁrst glance, estimating the transition probabilities of a Markov chain from the data is straightforward. Given a sample X1:n := (X1, X2, . . . , Xn) of a stationary d-th order Markov chain on a discrete alphabet A, the empirical transition probabilities are computed, for all past x d: 1 := (x d, . . . , x 1) A{ d,..., 1} and symbol

c 2023 Guilherme Ost and Daniel Y. Takahashi.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0266.html.

Ost and Takahashi

ˆpn(a|x d: 1) := Nn(x d: 1, a) P b A Nn(x d: 1, b),

where Nn(x d: 1, a) denotes the number of occurrences of the past x d: 1 followed by the symbol a in the sample X1:n. Nevertheless, some diﬃculties become apparent. First, for a Markov chain of order d, we have to estimate |A|d(|A| 1) transition probabilities (parameters), making the uniform control of estimation errors much harder when the order d increases. One solution to avoid the exponential increase in the number of parameters is to consider more parsimonious classes of models. One such popular class of models is the variable length Markov chains (VLMC), in which

P(Xt = a|Xt d:t 1 = x d: 1) = P(Xt = a|Xt ℓ:t 1 = x ℓ: 1),

where ℓis a function of the past x d: 1 (Rissanen, 1983; B uhlmann and Wyner, 1999; Galves et al., 2012). The relevant portion x ℓ: 1 of the past x d: 1 is called a context. The key feature of VLMC is that all transition probabilities with the same context have the same values. Therefore, denoting τ as the set of all contexts, the number of transition probabilities that need to be estimated reduces to |τ|(|A| 1). Another class of models that is even more parsimonious is the Minimal Markov Models - also known as Sparse Markov Chains (SMC) (Garcıa et al., 2011; J a askinen et al., 2014). In SMC, we say that the pasts x d: 1 and y d: 1 are related if for all symbols a A,

P(Xt = a|Xt d:t 1 = x d: 1) = P(Xt = a|Xt d:t 1 = y d: 1).

This relation generates the equivalent classes C1, . . . , CK that partition A{ d,..., 1}. Now, the number of transition probabilities that need to be estimated is K(|A| 1). Both VLMC and SMC have the advantage of better balancing the bias and variance tradeoﬀ. Nevertheless, in both models we still need to estimate the transition probability using ˆpn(a|x d: 1), either because we need to estimate the largest context (for VLMC) or because we need ﬁrst to calculate the transition probabilities to establish the partitions (for SMC). This creates a second diﬃculty. For the estimator ˆpn(a|x d: 1) to have any meaning, we have to observe the sequence x d: 1 in the sample X1:n at least once. By ergodicity, the number of times that we will observe the sequence x d: 1 is roughly n P(X1:d = x d: 1). It is straightforward to show that, if the transition probabilities are bounded below from zero, there exists a constant c > 0 such that P(X1:d = x d: 1) < e cd. Therefore, in general, it is hopeless to have a reasonable estimator ˆpn(a|x d: 1) if d > (1 + ε) log n/c, for some positive value ε. This imposes a fundamental limit to the size of the past that can be included in the description of the time series. Moreover, Markov chains with small orders are not consistent with the known workings of several natural phenomena where the transition probabilities might depend on remote pasts. For example, in predicting whether today will be a warm or cold day, we might need to know remote past events like the corresponding weather approximately a year ago (Kir aly et al., 2006; Yuan et al., 2013). Physiological phenomena in humans with cycles of diﬀerent lengths might result from dependence on events at vastly diﬀerent temporal scales (Gilden et al., 1995; Chen et al., 1997; Buzsaki and Draguhn, 2004). Importantly,

Sparse Markov Models for High-dimensional Inference

not all portions of the past are necessarily relevant. These observations motivate us to explore sparser representations of the dependence on past events. The mixture of transition distribution model (MTD) is a subclass of ﬁnite order Markov chains that can be used to obtain such sparse representation. Like VLMC and SMC, MTD was initially introduced to overcome the problem of exponential increase in the number of transition probabilities for Markov chains (Raftery, 1985; Berchtold and Raftery, 2002). MTD represents higher-order Markov models as a convex mixture of single-step Markov chains, where each single-step Markov chain depends on a single time point in the past. If an MTD model is a mixture of only a few single step Markov chains, we naturally obtain a class of sparse Markov chains that depends only on a small portion of the past events. Nevertheless, available methods to consistently estimate the transition probabilities of MTD still need to consider all the past events up to the MTD order (Berchtold and Raftery, 2002), which might include irrelevant portions of the past. This fact still restricts the MTD order to d = O(log n).

In this work, we introduce a simple method that consistently recovers the relevant part of the past even when the order d of the MTD model is proportional to the sample size n (i.e., d = O(n)) if the size of the relevant past is O(log n). Consequently, we prove that we can consistently estimate the transition probabilities for high dimensional MTD under sparsity constraint. Our estimator is computationally eﬃcient, consisting of a forward stepwise procedure that ﬁnds the candidates for relevant past portions and a cutting procedure to eliminate the irrelevant portions. The theoretical guarantees of our estimator are based on a novel structural result for MTD and an improved martingale concentration inequality. Both results might have an interest on their own. Moreover, we show that the estimator can be further improved when the alphabet is binary. We also prove that in several cases, our estimator is minimax rate optimal.

Finally, using simulated data, we show that our method s performance is, in general superior to a best subset selection method, where the lags with k largest weights are selected after estimating the model with a classical MTD estimation method (Berchtold, 2001), and similar to the performance of Conditional Tensor Factorization (CTF) based on higher-order Markov chain estimation when the order is moderate (Sarkar and Dunson, 2016). We also applied our method to weather data to model a binary sequence indicating days with and without rain. Our method successfully captures long-range dependencies (e.g. annual cycle) that were not detected either by VLMC algorithm with BIC order selection (Csisz ar and Talata, 2006) or by the CTF based higher order Markov chain estimation. New Bayesian approaches for higher order VLMC and MTD selection were introduced in (Kontoyiannis et al., 2020; Heiner and Kottas, 2021), where a posteriori most likely model estimation is considered. These works provide interesting alternative approaches for modeling higherorder Markovian dependence in a Bayesian setting.

We organized the paper as follows. In Section 2 we introduce the main notations, deﬁnitions, and assumptions that we will use throughout the paper. In Section 3 we introduce the algorithms to select the relevant part of the past. In Section 3.4 we provide an estimate of the estimator s convergence rate for the transition probabilities. In Section 3.5, we show that our estimator achieves the optimal minimax rate. In Section 4, we illustrate the performance of the proposed estimators through simulations and an application on weather data.

Ost and Takahashi

2. Notation, Model Deﬁnition and Preliminary Remarks

2.1 General notation

We denote Z = {. . . 1, 0, 1, . . .} and Z+ = {1, 2 . . .} the set of integers and positive integers respectively. For s, t Z with s t, we write Js, t K to denote the discrete interval Z [s, . . . , t]. Throughout the article A denotes a ﬁnite subset of R, called alphabet. The elements of A will be denoted by the ﬁrst letters of the alphabet a, b and c. Hereafter, we denote A = maxa A |a| and Diam(A) = maxa,b A |a b|. For each S Z, the set AS denotes the set of all A-valued strings x S = (xj)j S indexed by the set S. To alleviate the notation, if S = Js, t K for s, t Z with s t, we write xs:t instead of x Js,t K. For any non-empty subsets U S Z and any string x S AS, we denote x(S\U) A(S\U) the string obtained from x S by removing the string x U AU. For all t Z and S Z, we will write in some cases t + S do denote the set {t + s : s S}. The set of all ﬁnite A-valued strings is denoted by

S Z:S ﬁnite AS.

For all x A, we denote Sx Z the set indexing the string x, i.e., such that x ASx. Given two probability measures µ and ν on A, we denote d TV (µ, ν) the total variation distance between µ and ν, deﬁned as

d TV (µ, ν) = 1

a A |µ(a) ν(a)|.

For q Z+, the q-norm of vector v RL is deﬁned as

ℓ=1 |vℓ|q !1/q

The dimension L Z+ will be implicit in most cases. For two probability distributions P and Q on AJ1,k K where P is absolutely continuous with respect to Q, we denote KL(P||Q) the Kullback-Leibler divergence between P and Q, given by

KL(P||Q) = X

x1:k AJ1,k K P(x1:k) log P(x1:k)

2.2 Markov models

Let X = (Xt)t Z be a discrete time stochastic chain, deﬁned in a suitable probability space (Ω, F, P), taking values in an alphabet A. For a d Z+, we say that X is a Markov chain of order d if for all k Z+ with k > d, t Z and xt k:t AJt k,t K with P(Xt k:t 1 = xt k:t 1) > 0, we have

P (Xt = xt|Xt k:t 1 = xt k:t 1) = P (Xt = xt|Xt d:t 1 = xt d:t 1) . (1)

Sparse Markov Models for High-dimensional Inference

We say that a Markov chain is stationary if Xs:t and Xs+h:t+h have the same distribution for all t, s, h Z. Throughout the article, the distribution of a stationary Markov chain will be denoted by P. For a ﬁnite S Z and x S AS, we write P(x S) to denote P(XS = x S). The support of a stationary Markov model is the set supp(P) = {x A : P(x Sx) > 0}. For stationary Markov chains, the conditional probabilities in (1) do not depend on the time index t. Therefore, for a stationary Markov chain of order d, for any a A, x S supp(P) with S J d, 1K and t Z, we denote

p(a|x S) = P (Xt = a|Xt+S = x S) .

Notice that p( |x S) is a probability measure on A, for each ﬁxed past x S supp(P). The set {p( |x d: 1) : x d: 1 supp(P)} is called the family of transition probabilities of the chain. In this article, we consider only stationary Markov chains. For a Markov chain of order d, the oscillation δj for j J d, 1K is deﬁned as

δj = max{d TV (p( |x d: 1), p( |y d: 1)) : (x d: 1, y d: 1) AJ d, 1K, x k = y k, k = j}.

The oscillation is useful to measure the inﬂuence of a j-th past value in the values of the transition probabilities.

2.3 Mixture transition distribution (MTD) models

A MTD model of order d Z+ is a Markov chain of order d for which the associated family of transition probabilities {p( |x d: 1) : x d: 1 supp(P)} admits the following representation:

p(a|x d: 1) = λ0p0(a) +

j= d λjpj(a|xj), a A, (2)

with λ0, λ 1, . . . , λ d [0, 1] satisfying P0 j= d λj = 1 and p0( ) and pj( |b),j J d, 1K and b A, being probability measures on A. Following (Berchtold and Raftery, 2002), we call the index j J d, 0K of the weight λj in (2) the j-th lag of the model. The representation in (2) has the following probabilistic interpretation. To sample a symbol from p( |x d: 1), we ﬁrst choose a lag in J d, 0K randomly, being λj the probability of choosing the lag j. Once the lag has been chosen, say lag j, we then sample a symbol from the probability measure pj( |xj) which depends on the past x d: 1 only through the symbol xj. Notice that a symbol is sampled independently from the past x d: 1, whenever the lag 0 is chosen. For later use, let us deﬁne the conditional average at lag j as

a A apj(a|b), (3)

for each j J d, 1K and b A. For a MTD model of order d, we have that the oscillation δj of the lag j J d, 1K can be written as, δj = λj max b,c A d TV (pj( |b), pj( |c)). (4)

Ost and Takahashi

Notice that in this case δj = 0 if and only if either λj = 0 or d TV (pj( |b), pj( |c)) = 0 for all b, c A. In the sequel, we say that the lag j is relevant if δj > 0, and irrelevant otherwise. We will denote Λ the set of all relevant lags, i.e.,

Λ = {j J d, 1K : δj > 0}. (5)

The set Λ captures the dependence structure of the MTD model. The size |Λ| of the set Λ represents the degree of sparsity of the MTD model. The smaller the value of |Λ|, the sparser the MTD model. The following quantities will appear in many of our results:

δmin = min j Λ δj and δmin = min j Λ λj mj Lip, (6)

where mj Lip = max{|mj(b) mj(c)|/|b c| : b, c A, b = c} denotes the Lipschitz norm of the function mj deﬁned in (3). One can check easily that these quantities coincide when the alphabet A is binary (i.e. A = {0, 1}). For general alphabets, the following inequality holds: δmin A 1 δmin min b,c A:b =c |b c|.

2.4 Statistical lag selection

Suppose that we are given a sample X1:n of a MTD model of known order d < n and whose set of relevant lags Λ is unknown. The goal of statistical lag selection is to estimate the set Λ from the sample X1:n. Our particular interest is in the high-dimensional setting in which the parameters d = dn and |Λ| = |Λn| scale as a function of the sample size n. Let us write ˆΛn to indicate an estimator of the set of relevant lags Λ computed from the sample X1:n. We say that the estimator ˆΛn is consistent if

P(ˆΛn = Λ) 0 as n .

With respect to statistical lag selection, our goal is to exhibit suﬃcient conditions for each proposed estimator guaranteeing its consistency.

2.5 Empirical transition probabilities

Let n, m and d be positive integers such that n m > d. We denote for each a A and x S A with S J d, 1K non-empty,

Nm,n(x S, a) =

t=m+d+1 1{Xt+j = xj, j S, Xt = a}.

The random variable Nm,n(x S, a) indicates the number of occurrences of the string x S followed by the symbol a, in the last n m symbols Xm+1:n of the sample X1:n. We also deﬁne Nm,n(x S) = P a A Nm,n(x S, a). With this notation, the empirical transition

Sparse Markov Models for High-dimensional Inference

probabilities computed from the last n m symbols Xm+1:n of the sample X1:n are deﬁned as,

ˆpm,n(a|x S) =

(Nm,n(x S,a)

Nm,n(x S) , if Nm,n(x S) > 0

1 |A|, otherwise . (7)

When the countings are made over the whole sample X1:n, we denote Nn(x S, a) and Nn(x S) the corresponding counting random variables, and ˆpn(a|x S) the corresponding empirical transition probabilities. In the next sections, the estimators for the set of relevant lags we propose in this paper rely on these empirical transition probabilities. If ˆΛm denotes an estimator for the set of relevant lags Λ computed from X1:m, we expect that under some assumptions (guaranteeing in particular the consistency of ˆΛm) the empirical transition probability ˆpm,n(a|xˆΛm) converges (in probability) to p(a|xΛ) as min{n, m} , for any x d: 1 supp(P). To understand the convergence for the transition probabilities of high order Markov chains is crucial in our analysis.

2.6 Assumptions

We collect here the main assumptions used in the article.

Assumption 1 The MTD model has full support, that is, supp(P) = A.

In other words, Assumption 1 means that P(XS = x S) = P(x S) > 0 for any string x S AS with S Z ﬁnite. This means that the marginal distributions of the distribution generating the data are strictly positive. Such a condition is usually assumed in the problem of estimating the graph structure underlying graphical models (see for instance Chapter 11 of (Wainwright, 2019)). Notice that this assumption implies, in particular, that

pmin = min{p(a|xΛ) : a A, xΛ AΛ} > 0, (8)

where p( |xΛ) are the transition probabilities of MTD generating the data.

Assumption 2 The quantity := 1 P j Λ δj > 0, where δj is given by (4).

We have that λ0 > 0 is a suﬃcient condition to Assumption 2 to hold. To check this, notice that X

j Λ λj max b,c A d TV (pj( |b), pj( |c)) X

j Λ λj = 1 λ0,

where we have used that d TV (pj( |b), pj( |c)) 1 for all b, c A and j Λ. Hence, it follows that > 0 whenever λ0 > 0. Assumptions 1 and 2 are used to obtain concentration inequalities for the counting random variables Nm,n(x S, a) and Nm,n(x S) appearing in the deﬁnition of the empirical transition probabilities ˆpm,n(a|x). The next assumption is as follows.

Assumption 3 For each j Λ, there exists b , c A such that mj(b ) = mj(c ), where mj( ) is deﬁned in (3).

Ost and Takahashi

Notice that if A = {0, 1}, then mj(1) mj(0) = pj(1|1) pj(1|0), so that Assumption 3 holds whenever d TV (pj( |1), pj( |0)) = |pj(1|1) pj(1|0)| > 0 for each j Λ. In this case this is always true by the deﬁnition of the set Λ. As we will see in Section 3, the condition is crucial to prove a structural result about MTD models, presented in Proposition 6. In what follows, Px S(Xj |Xk = b) denotes the conditional distribution of Xj given XS = x S and Xk = b. We use the convention that, for S = , these conditional probabilities correspond to the unconditional ones. Moreover, for any function f : A R, we write Ex S(f(Xj)|Xk = b) to denote the expectation of f(Xj) with respect to Px S(Xj |Xk = b). The next two assumptions are the following.

Assumption 4 (Inward weak dependence condition) There exists Γ1 (0, 1] such that the following condition holds: for all S J d, 1K such that Λ S, k Λ \ S and b, c A with b = c satisfying |mk(b) mk(c)| > 0,

max x S AS X

λj |Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)|

λk|mk(b) mk(c)| (1 Γ1). (9)

Assumption 5 (Outward weak dependence condition) The alphabet is binary, i.e. A = {0, 1}. Moreover, there exists Γ2 (0, 1] such that the following condition holds: for all S J d, 1K such that S Λ and k / Λ,

j Λ\S max x S {0,1}S |Px S(Xk = 1|Xj = 1) Px S(Xk = 1|Xj = 0)| Γ2. (10)

Both Assumptions 4 and 5 are conditions of weak dependence. In words, Assumption 4 says that no relevant lag j can be completely determined by any subset S containing only relevant lags or any other relevant lag k when combined with some irrelevant lags. Similarly, Assumption 5 says that irrelevant lags cannot be completely determined by some subset of relevant lags. These two assumptions will be only necessary to obtain a computationally very eﬃcient algorithm.

3. Statistical Lag Selection

In this section, we address the problem of statistical lag selection for the MTD models. We will ﬁrst introduce a statistical procedure called PCP estimator that is general and works well if there is a known small set S such that Λ S. When such set S is not available, we will have to consider an alternative procedure called FSC estimator, which will be introduced later.

3.1 Estimator based on pairwise comparisons

Throughout this section we suppose that there is a known set S J d, 1K such that Λ S. Note that this is always satisﬁed in the worse case scenario in which the set S is the whole set J d, 1K. In some cases, however, we may have a prior knowledge on the set Λ and we can use this information to restrict our analysis to the lags in a known set S of size (possibly much) smaller than d.

Sparse Markov Models for High-dimensional Inference

The estimator discussed in this section is based on pairwise comparisons of empirical transition probabilities corresponding to compatible pasts. For this reason, we call it PCP estimator. The estimator is based on the following observation. For any j S, we say that the pasts x S, y S AS are (S \ {j})-compatible, if y S\{j} = x S\{j}. We have that if j Λ, then there exist a pair of (S \ {j})-compatible pasts x S, y S AS such that total variation distance between p( |x S) and p( |y S) is strictly positive. On the other hand, if j S \ Λ, then the total variation distance between p( |x S) and p( |y S) is 0 for all (S \{j})-compatible pasts x S, y S AS. These remarks suggests to estimate Λ by the subset of all lags j S for which the total variation distance between ˆpn( |x S) and ˆpn( |y S) is larger than a suitable positive threshold, for some pair of (S\{j})-compatible pasts x S and y S. An uniform threshold over all possible realizations usually gives suboptimal results by either underestimating or overestimating for some conﬁgurations. The threshold we use here is adapted to each realization of the MTD, relying on improved martingale concentration inequalities that are of independent interest (see Appendix B). Fix ε > 0, α > 0 and µ (0, 3) such that µ > ψ(µ) := eµ µ 1. For each x S, y S AS, consider the random threshold tn(x S, y S) deﬁned as,

tn(x S, y S) = sn(x S) + sn(y S), (11)

where sn(x S) is given by

α(1 + ε) 2 Nn(x S)

ˆpn(a|x S) + α Nn(x S)

+ α|A| 6 Nn(x S). (12)

With this notation, the PCP estimator ˆΛ1,n is deﬁned as follows. A lag j S belongs to ˆΛ1,n if and only if there exists a (S \ {j})-compatible pair of pasts x S, y S AS such that

d TV (ˆpn( |x S), ˆpn( |y S)) tn(x S, y S). (13)

In the sequel, the set S J d, 1K such that Λ S and the constants ε > 0, α > 0 and µ (0, 3) such that µ > ψ(µ) are called parameters of the PCP estimator ˆΛ1,n. Hereafter, for each j S and any b, c A, let

Cj(b, c) = (x, y) AS AS : x S\{j} = y S\{j}, xj = b and yj = c ,

tn,j(b, c) = min (x S,y S) Cj(b,c) tn (x S, y S) , tn,j = max b,c A:b =c tn,j(b, c) and γn,j = 2tn,j. (14)

Finally, consider the following quantity

PS = min j Λ min b,c A:b =c max (x S,y S) Cj(b,c) (P(x S) P(y S)) . (15)

With these deﬁnitions, we have the following result.

Ost and Takahashi

Theorem 1 Let X1:n be a sample of MTD model with set of relevant lags Λ, where n > d. If ˆΛ1,n is the PCP estimator deﬁned in (13) with parameters µ (0, 3) such that µ > ψ(µ), α > 0, ε > 0 and Λ S J d, 1K, we have that

1. For each j S \ Λ, we have that

P j ˆΛ1,n 8|A|(n d) log(µ(n d)/α + 2)

2. For each j Λ, we have that

P j / ˆΛ1,n, γn,j δj 8|A| log(µ(n d)/α + 2)

where γn,j and and δj are deﬁned in (14) and (4) respectively.

3. Furthermore, if assumptions 1 and 2 hold, then there exits a constant C = C(ε, µ) > 0 such that for n satisfying

n d + C|A|α

δ2 min PS , (16)

where δmin and PS are deﬁned in respectively (6) and (15), we have that

P ˆΛ1,n = Λ 8|A| ((|S| |Λ|)(n d) + |Λ|) log(µ(n d)/α + 2)

+ 6|A|(|A| 1)|Λ| exp 2(n d)2P2 S 8n(|S| + 1)2

The proof of Theorem 1 is given in Appendix A.1.1.

Remark 2 (a) The sum over j S \Λ of the upper bound provided by Item 1 of Theorem 1 controls the probability that the PCP estimator ˆΛ1,n overestimates the set of relevant lags Λ. The sum over j Λ of the upper bound given in Item 2 of Theorem 1 is as an upper bound for the probability that the PCP estimator underestimates the subset of relevant lags j Λ whose oscillation δj is larger or equal than the noise level γn,j. Note that the sum of these upper bounds corresponds to the ﬁrst term appearing on the right hand side of (17).

(b) The second term on the right hand side of (17) is an upper bound for the probability that there exists some relevant lag j Λ whose oscillation δj is strictly smaller than the noise level γn,j.

(c) (Computation of PCP estimator) As we show in Appendix (A.6), the PCP estimator can be implemented with at most O(|A|2|S|(n d)) computations.

Remark 3 By Assumption 1, we have that PS pmin/|A||S| 1, where pmin is deﬁned in (8). As a consequence, it follows from (16) that if the sample size n is such that

n d + C|A||S|α

pminδ2 min , (18)

Sparse Markov Models for High-dimensional Inference

then inequality (17) holds with the second exponential term replaced by

exp 2p2 min(n d)2

8n(|S| + 1)2|A|2(|S| 1)

Combining Theorem 1 and Remark 3, one can deduce the following result.

Corollary 4 For each n, consider a MTD model with set of relevant lags Λn and transition probabilities pn(a|xΛn) such that pmin,n p min and n min for some positive constants p min and min. Let dn = βn for some β (0, 1) and suppose that Λn Sn J dn, 1K with |Sn| ((1 γ)/2) log|A|(n) for some γ (0, 1). Let X1:n be a sample from the MTD speciﬁed by Λn and pn(a|xΛn), and denote ˆΛ1,n the PCP estimator deﬁned in (13) computed from this sample with parameters µn = µ (0, 3) such that µ > ψ(µ), εn = ε > 0, αn = (1 + η) log(n) with η > 0 and Sn. Under these assumptions there exists a constant C = C(µ, ε, β, p min, min, γ, η) > 0 such that if

δ2 min,n C log(n)

n(1+γ)/2 , (20)

then P(ˆΛ1,n = Λn) 0 as n .

The proof of Corollary is given in Appendix A.1.2.

Remark 5 (a) Under the assumptions of Corollary 4, if additionally we have |Λn| L for all values of n for some positive integer L, then one can choose a suitable sequence γn 1 as n to obtain that P(ˆΛ1,n = Λn) vanishes as n , as long as

δ2 min,n C log(n)

where the constant C here is larger than the one given in (20).

(b) Observe that in Corollary 4, the set of relevant lags can be either ﬁnite or grow very slowly with respect to the sample size n. On the other hand, no assumption on the orders dn of the underlying sequence of MTD models is made. In particular, we could consider MTD models with very large orders, for example dn = βn with β (0, 1).

As Corollary 4 indicates, in the setting dn = βn, the major drawback of the PCP estimator ˆΛ1,n is that it requires a prior knowledge of Λn in the form of a set Sn growing slowly enough and such that Λn Sn. The main goal of the next two sections is to propose alternative estimators of Λn to deal with this issue.

3.2 Forward Stepwise and Cut estimator

In this section we introduce a second estimator of the set of relevant lags Λ, called Forward Stepwise and Cut (FSC) estimator. This estimator is based on a structural result about MTD models presented in Proposition 6 below. Before presenting this structural result, we need to introduce some notation.

Ost and Takahashi

In what follows, for each lag k J d, 1K, subset S J d, 1K \ {k}, conﬁguration x S AS and symbols b, c A, let us denote

dk,S(b, c, x S) = d TV (Px S(X0 |Xk = b), Px S(X0 |Xk = c)) , (22)

and wk,S(b, c, x S) = Px S(Xk = b)Px S(Xk = c). (23)

Recall that Px S(X0 |Xk = b) and Px S(Xk = b) denote, respectively, the conditional distribution of X0 given XS = x S and Xk = b and the conditional probability of Xk = b given XS = x S, with the convention that these conditional probabilities for S = correspond to the unconditional ones. Let us also denote for each lag k J d, 1K and subset S J d, 1K \ {k},

νk,S(x S) = X

c A wk,S(b, c, x S)dk,S(b, c, x S), (24)

and νk,S = E (νk,S(XS)) . (25)

The quantity νk,S(x S) measures the inﬂuence of Xk on X0, conditionally on the variables XS = x S. The average conditional inﬂuence of Xk on X0 is measured through the quantity νk,S. In the sequel, we write Covx S(X0, Xk) to denote the conditional covariance between the random variables X0 and Xk given that XS = x S. Here, we also use the convention that the conditional covariance for S = corresponds to the unconditional one. With this notation, we can prove the following structural result about MTD models.

Proposition 6 For any lag k J d, 1K and subset S J d, 1K \ {k},

Diam(A) A νk,S E (|Cov XS(X0, Xk)|) . (26)

Moreover, if Assumptions 1 and 3 hold, then there exists a constant κ > 0 such that the following property holds: for any S J d, 1K such that Λ S there exists k Λ \ S satisfying E (|Cov XS(X0, Xk)|) κ. (27)

Furthermore, if Assumption 3 is replaced by Assumption 4, then the constant κ satisﬁes

κ p2 minΓ1 min{|b c|2 : b = c} δmin

where δmin, pmin and Γ1 are deﬁned respectively in (6), (8) and (9).

The proof of Proposition 6 is given in A.2.

Remark 7 Denote f(S) = maxk Sc νk,S, for each S J d, 1K. On one hand, we have that f(S) = 0 for any S J d, 1K such that Λ S. This follows immediately from the deﬁnition of Λ. On the other hand, Proposition 6 assures that f(S) κ/Diam(A) A > 0 for any S J d, 1K such that Λ S. Putting together these facts, we deduce that the set of relevant lags can be written as Λ = arg min{|S| : f(S) = 0}. This observation motivates the FSC estimator deﬁned below.

Sparse Markov Models for High-dimensional Inference

In what follows, we split the data X1:n into two pieces. The ﬁrst part is composed of the ﬁrst m symbols X1:m where 1 m < n, whereas the second part is composed of the n m last symbols Xm+1:n. In the sequel, we write ˆνm,k,S to denote the empirical estimate of νk,S computed from X1:m. The formal deﬁnition of ˆνm,k,S involves extra notation and is postponed to Appendix A. The FSC estimator is built in two steps. The ﬁrst step is called Forward Stepwise (FS) and the second one is called CUT. In the FS step, we start with S = and add iteratively to the set S a lag j arg maxk Sc ˆνm,k,S, as long as |S| < ℓ, where 0 ℓ d is a parameter of the estimator. We denote ˆSm the set obtained at the end of FS step, with the convention that ˆSm = if the parameter ℓ= 0. As we will see, if ℓis properly chosen the candidate set ˆSm will contain the set of relevant lags Λ with high probability. It may, of course, include irrelevant lags j (those with δj = 0). In the CUT step, for each j ˆSm, we remove j from ˆSm unless d TV (ˆpm,n( |x ˆSm), ˆpm,n( |y ˆSm)) tm,n(x ˆSm, y ˆSm) := sm,n(x ˆSm) + sm,n(y ˆSm) for some

( ˆSm \ {j})-compatible pasts x ˆSm, y ˆSm A ˆSm, where sm,n(x ˆSm) is given by (12) replacing Nn( ) and ˆpn( | ) by Nm,n( ) and ˆpm,n( | ) respectively. The FSC estimator is deﬁned as the set ˆΛ2,n of all lags not removed in the CUT step. The pseudo-code of the algorithm to compute the FSC estimator is given in Algorithm 1.

Algorithm 1: FSC(X1, . . . , Xn)

FS Step; 1. ˆSm = ; 2. While | ˆSm| < ℓ; 3. Compute j = arg maxj ˆScm ˆνm,j, ˆSm and include j in ˆSm; CUT step; 6. For each j ˆSm, remove j from ˆSm unless

d TV (ˆpm,n( |x ˆSm), ˆpm,n( |y ˆSm)) tm,n(x ˆSm, y ˆSm),

for some ( ˆSm \ {j})-compatible pasts x ˆSm, y ˆSm A ˆSm ; 7. Output ˆSm;

Remark 8 (a) It is worth mentioning the following alternative algorithm (henceforth called Algorithm 2) to estimate the set of relevant lags Λ. As Algorithm 1, Algorithm 2 has two steps as well. In the ﬁrst step, we start with S = and add iteratively a lag j arg maxk Sc ˆνn,k,S as long as ˆνn,j,S > τ, where τ is a parameter of the algorithm and ˆνn,j,S is the empirical estimate of νj,S computed from the entire data X1:n. Let ˆSn denote the set obtained at the end of this step. Next, in the second step, for each j ˆSn, we remove j from ˆSn unless ˆνn,j, ˆSn\{j} τ. The output of Algorithm 2 is

the set of all lags in ˆSn which were not removed in the second step. Algorithm 2 can be seen as a version adapted for our setting of the Learn Nbhd algorithm, proposed in Bresler (2015), to estimate the interaction graph underlying an Ising model from i.i.d samples of the model.

Ost and Takahashi

(b) As opposed to Algorithm 2, notice that the data X1:n is split into two parts in Algorithm 1. The ﬁrst m symbols X1:m of the sample are used in the FS step, whereas the last n m symbols Xm+1:n are only used in the CUT step. Despite requiring to split the data into two parts, one nice feature of Algorithm 1 is that even if a large ℓis chosen the CUT step would remove the non-relevant lags, whereas in Algorithm 2, we have to calibrate τ carefully to recover the relevant lags.

(c) (Computation of FSC estimator) As we show in the Appendix A.6, we need to perform at most O(|A|3ℓ(m d)(d (ℓ 1)/2) + |A|2(n m d)ℓ) computations to determine the FSC estimator. The ﬁrst summand in the sum corresponds to the algorithmic complexity of the FS step, whereas that the second summand can be interpreted as the algorithmic complexity of the PCP estimator computed from a sample of size n m and whose set S has ℓelements (recall item (c) of Remark 2).

In what follows, for any ξ > 0 and 0 ℓ d, let us deﬁne the following event,

Gm(ℓ, ξ) = \

max j Sc | νj,S ˆνm,j,S| ξ , (29)

where Sd,ℓ= {S J d, 1K : |S| ℓ}. In the next result we show that whenever the event Gm(ℓ, ξ) holds with properly chosen parameters ξ and ℓ, the candidate set ˆSm constructed in the FS step with parameter ℓcontains Λ.

Theorem 9 Suppose Assumptions 1 and 3 hold and let κ be the lower bound provided by Proposition 6. Let

ξ = κ 4 A Diam(A) and ℓ = log2(|A|)

= 2(Diam(A) A )2 log2(|A|)

Let ˆSm denote the candidate set constructed in the FS step of Algorithm 1 with parameter ℓ . On the event Gm(ℓ , ξ ), we have that Λ ˆSm.

The proof of Theorem 9 is given in Appendix A.2.2. Theorem 9 ensures that the candidate set ˆSm contains the set of relevant lags Λ whenever the event Gm(ℓ , ξ ) holds. In this case, we can think of the CUT step as the PCP estimator discussed in the previous section applied to the n m last observations Xm+1:n of the data, where S = ˆSm. The main diﬀerence is that ˆSm is a random set, depending on the ﬁrst m observations X1:m of the data. In the sequel, let us denote PSd,ℓ= min S Sd,ℓPS, (31)

where PS is deﬁned in (15). In the next result we estimate the error probability of the FSC estimator.

Theorem 10 Suppose Assumptions 1, 2, and 3 hold. Let > 0 be the quantity deﬁned in Assumption 2. Denote ˆΛ2,n the FSC estimator constructed by Algorithm 1 with parameter

Sparse Markov Models for High-dimensional Inference

ℓ , as deﬁned in (30). Suppose also that m > d 2ℓ . Then there exits a constant C = C(ε, µ) > 0 such that if

n m + d + C|A|α δ2 min PSd,ℓ , (32)

where δmin and PSd,ℓ are deﬁned in (6) and (31), then we have that,

P(ˆΛ2,n = Λ) 2|A|(ℓ + 1) d ℓ

d|A|ℓ +1 exp (ξ )2(m d)2

18m|A|2(ℓ +2)(ℓ + 2)2

+3(|A| 1)|Λ| exp

2(n m d)2P2 Sd,ℓ 8(n m)(ℓ + 1)2

+ 8|A| ((ℓ |Λ|)(n m d) + |Λ|) log(µ(n m d)/α + 2)

where ξ is deﬁned in (30).

The proof of Theorem 10 is given in Appendix A.2.3.

Remark 11 (a) Let us give some intuition about the three terms appearing on the righthand side of (33). The ﬁrst one is an upper bound for P(Gc m(ℓ , ξ )). The other two are related to the terms appearing in (17). Indeed, by recalling that | ˆSm| = ℓ , one immediately sees that the third terms of (33) corresponds to the ﬁrst term of (17) with ˆSm and n m in the place of S and n respectively. Besides, the second term of (33) is similar (modulo a factor which depends on d and ℓ ) to the second term of (17). This extra factor reﬂects the fact that we do not know a priori a set S containing the set of relevant lags Λ.

(b) Similar to Remark 3, one can also show that PSd,ℓ pmin/|A|ℓ 1, where pmin is deﬁned in (8). Hence, we can deduce from (32) that when the sample size n satisﬁes

n d + C|A|ℓ α

pminδ2 min , (34)

then inequality (33) holds with the second exponential term replaced by

exp ( pmin)2(n d)2

8n(ℓ + 1)2|A|2(ℓ 1)

The next result is a corollary of Theorem 10.

Corollary 12 For each n, consider a MTD model with set of relevant lags Λn and transition probabilities pn(a|xΛn) satisfying pmin,n p min n min for some positive constants p min and min, and such that Assumption 3 holds. Let mn = n/2 and dn = mnβ with β (0, 1). Let X1:n be a sample from the MTD speciﬁed by Λn and pn(a|xΛn), and denote ˆΛ2,n the FSC estimator constructed by Algorithm 1 with parameters mn, µn = µ (0, 3) such that µ > ψ(µ), εn = ε > 0, αn = (1 + η) log(n) with η > 0 and ℓ ,n as deﬁned in (30).

Ost and Takahashi

Assume that ℓ ,n ((1 γ)/2) log|A|(n) for some γ (0, 1). Then there exists a constant C = C(β, γ, η, p min, min, ε, µ) > 0 such that P(ˆΛ2,n = Λ ) 0 as n , whenever

δ2 min,n C log(n)

n(1+γ)/2 . (36)

The proof of Corollary 12 is given in Appendix A.2.4.

Remark 13 (a) Under Assumption 4, one can check that ℓ ,n ((1 γ)/2) log|A|(n) whenever, δ2 min |Λ| 16(1 γ) Γ2 1(p min min{|b c| : b = c})4 log|A|(n).

(b) Comparing Corollaries 12 and 4, we observe that the consistency of both FSC and PCP estimators require the same lower bound on the decay of minimal oscillation δmin,n. Despite requiring additional assumptions (Assumption 3 and a condition on the growth of ℓ ), FSC estimator do not need prior knowledge of a small subset S containing the set of relevant lags Λ as opposed to the PCP estimator, which is a signiﬁcant advantage in practice.

(c) Let us mention that under the assumptions of Corollary 12, we have that the algorithmic complexity of the FSC is O(|A|3n2 log|A|(n)). This follows immediately from item (c) of Remark 8.

3.3 Improving the eﬃciency for the binary case

In this section, we show that when the alphabet is binary, i.e., A = {0, 1}, we can further improve the FSC algorithm if we consider Assumptions 4 and 5. Observe that when the alphabet is binary, Assumption 3 holds automatically (see Section 2.6). Moreover, we have that

νk,S = 2E (PXS(Xk = 1)PXS(Xk = 1) |PXS(X0 = 1|Xk = 1) PXS(X0 = 1|Xk = 0)|)

= 2E (|Cov XS(X0, Xk)|) ,

for any lag k J d, 1K, subset S J d, 1K \ {k} and conﬁguration x S {0, 1}S. For a binary MTD, we have the following result.

Theorem 14 Under Assumptions 4 and 5, it holds that

max j Λ\S νj,S max j (Λ)c νj,S

2(Γ1 Γ2)p2 minδmin,

where δmin and pmin are deﬁned in (6) and (8) respectively. In particular, if Γ1 > Γ2 and ˆSm denotes the candidate set constructed at the end of the FS step of Algorithm 1 with parameter ℓ |Λ|, then Λ ˆSm whenever the event Gm(ℓ, ξ) holds where

0 < ξ < (Γ1 Γ2)p2 minδmin. (37)

The proof of Theorem 14 is given in Appendix A.3.1.

Sparse Markov Models for High-dimensional Inference

Remark 15 Notice that if the size of Λ is known, then Theorem 14 implies ˆSm = Λ on the event Gm(|Λ|, ξ) with ξ satisfying (37). In particular, in this case, we neither need to execute the CUT step nor to split the data into two pieces.

In the same spirit of the previous corollaries, we can show the following result.

Corollary 16 For each n, consider a MTD model with set of relevant lags Λn and transition probabilities pn(a|xΛn) satisfying Assumptions 4 and 5 with Γ1,n = Γ1 > Γ2 = Γ2,n and such that |Λn| L for some integer L, pmin,n p min and n min for some positive constants p min and min. Let X1:n be a sample from the MTD speciﬁed by Λn and pn(a|xΛn), and denote ˆΛ2,n the FSC estimator with parameters with parameters mn = n/2, µn = µ (0, 3) such that µ > ψ(µ), εn = ε > 0, αn = (1+η) log(n) with η > 0 and ℓ= L. Suppose that dn = βn with β (0, 1). Then there exists a constant C = C(β, L, min, p min, Γ1, Γ2, η, µ, ε) > 0 such that P(ˆΛ2,n = Λ ) 0 as n , as long as

The proof of Corollary 16 is given in Appendix A.3.2.

3.4 Post-selection transition probabilities estimation

Once the set of relevant lags have been estimated by applying the FSC estimator to the sample X1:n, we reuse the entire sample to compute the estimator ˆpn(a|xˆΛ2,n) of the transition probability p(a|xΛ). In the next result, we provide an estimate for rate of convergence of ˆpn(a|xˆΛ2,n) towards p(a|xΛ), simultaneously for all pasts x d: 1 AJ d, 1K.

Theorem 17 Under assumptions and notation of Theorem 10,

x d: 1 AJ d, 1K

|ˆpn(a|xˆΛ2,n) p(a|xΛ)|

v u u t2α(1 + ϵ) ˆVn(a, xˆΛ2,n)

+ α 3 Nn(xˆΛ2,n)

4|A|(n d) log(µ(n m d)/α + 2)

e α + P(ˆΛ2,n = Λ), (39)

where ˆVn(a, xˆΛ2,n) is given by

ˆVn(a, xˆΛ2,n) = µ µ ψ(µ) ˆpn(a|xˆΛ2,n) + α µ ψ(µ) 1 Nn(xˆΛ2,n).

The proof of Theorem 17 is given in Appendix A.4.

3.5 A remark on the minimax rate for the lag selection

We take A = {0, 1} and consider the set of {p(j)( | ), j J d, 1K} of transition probabilities of the following form:

p(j)(1|x d: 1) = (1 λ)

2 + λp(1|xj), j J d, 1K, λ (0, 1), (40)

Ost and Takahashi

where λ|p(1|1) p(1|0)| := δ > 0. For each j J d, 1K, we denote P(j) the probability measure under which (Xt)t Z is a stationary MTD model having transition probability p(j)( | ). For each t 1, we denote P (j) t the marginal distribution with respect to the variables X1:t: P (j) t (x1:t) = P(j)(X1:t = x1:t)

In what follows, KL(P (j) t ||P (k) t ) denotes the Kullback-Leibler divergence between the distributions P (j) t and P (k) t . We denote MTDd,δ the set all transition probabilities p = {p(a|xΛ) : a A, xΛ AΛ} of a MTD model of order d whose corresponding δmin δ. For a given p MTDd,δ, we denote Pp the probability distribution under which (Xt)t Z is a stationary MTD model of order d with transition probabilities given by p. With this notation, we have the following result.

Proposition 18 Let n > d. Then the following inequality holds: for j, k J d, 1K,

KL(P (j) n ||P (k) n ) 2nδ2

In particular, if β (0, 1), d = nβ, and

2 log(2) , (42)

then inf ˆΛn sup p MTDd,δ Pp(ˆΛn = Λ) 1/4, (43)

where the inﬁmum is over all lag estimators ˆΛn based on a sample of size n.

The proof of (43) follows immediately from Fano s inequality and the upper bound (41). Combining (38) and (42), we deduce that the condition on the minimal oscillation required for the consistency of the FSC estimator in Corollary 16 is sharp. The proof of Proposition 18 is given in Appendix A.5

4. Simulations

Here, we investigated the performance of the proposed methods using simulations.

4.1 Experiment 1

We ﬁrst used a MTD model on alphabet A = {0, 1} with two relevant lags, denoted here as i and j for notational convenience. The choices for the order d and for the values of i and j are shown in the ﬁrst three columns of Table 1. Let p0(1) = p0(0) = 0.5 and λ0 = 0.4. Also, let λ i = 0.2, λ j = 0.4, p i(0|0) = 0.3, p i(0|1) = 0.6, p j(0|0) = 0.5, and p j(0|1) = 0.9. For all x d: 1 {0, 1}J d, 1K and a A, the transition probability of the model was given by

p(a|x d: 1) = λ0p0(a) + λ ip i(a|x i) + λ jp j(a|x j).

Sparse Markov Models for High-dimensional Inference

We simulated the above model using sample sizes

n {102, 5.102, 103, 5.103, 104, 5.104, 105, 5.105}.

For each choice of i, j, d, and n we simulated 100 realizations. We compared four diﬀerent methods to select the relevant lags. FSC(ℓ) stands for the Forward Stepwise and Cut algorithm described in Algorithm 1 with parameter ℓ, ϵ = 0.1, µ = 0.5, and α = C log(n), where the values of the constant C was chosen by optimizing the probability to select the relevant lags correctly only for sample size n = 100, for the given choice of d, i and j. We used the ﬁrst n/2 samples for the Forward Stepwise and the last n/2 for Cut. Remember that ϵ, µ, α are used to deﬁne the random threshold for the Cut step. BSS(2) stands for the best subset selection algorithm, where we ﬁrst estimated the parameters of the MTD model using n samples and the algorithm described in Berchtold (2001) with python implementation mtd-learn. This algorithm estimated for k {1, 2, . . . , d} the weight parameters λ k. We then choose the lags of the two largest λ k as the lags selected by BSS(2). We were not able to run the mtd-learn on models with order d larger than 15 in our computers because that algorithm did not converge. Finally, CTF(ℓ) stands for Conditional Tensor Factorization based Higher Order Markov Chain estimation together with the test for relevance of lags described in Sarkar and Dunson (2016), the parameter ℓbeing the maximal number of relevant lags. We used the code available at https://github.com/david-dunson/bnphomc. The maximal possible order of the Markov chain was set to d and the number of simulation for the Gibbs sampler was set to 1000. The set of relevant lags chosen by CTF was given by the lags with non-null inclusion probability estimated using the Gibbs sampler. We were not able to run CTF(ℓ) when j = n/5 and d = n/4 because the algorithm did not converge when n > 103. We note that FS and BSS assume prior knowledge of the number of relevant sites, giving advantage over FSC and CTF. The results are indicated in Table 1.

Table 1: Estimated probability of correctly selecting only the relevant lags.

Model parameter Method Sample size (n) i j d 100 500 103 5.103 104 5.104 105 5.105

1 8 8 FSC(3) 0.05 0.08 0.13 0.53 0.81 0.86 0.93 1 1 8 8 CTF(3) 0 0 0.04 0.67 0.99 1 1 1 1 8 8 FS(2) 0.07 0.3 0.47 0.98 1 1 1 1 1 8 8 BSS(2) 0.05 0.14 0.23 0.41 0.79 0.78 0.84 0.87 1 15 15 FSC(5) 0.03 0.36 0.51 0.82 0.97 1 1 1 1 15 15 CTF(5) 0 0 0.01 0.62 0.99 1 1 1 1 15 15 FS(2) 0.02 0.2 0.66 0.92 1 1 1 1 1 15 15 BSS(2) 0.04 0.18 0.17 0.28 0.31 0.8 0.8 0.93 1 n/5 n/4 FSC(5) 0 0 0.04 0.19 0.46 1 1 1 1 n/5 n/4 CTF(5) 0 0 0 - - - - - 1 n/5 n/4 FS(2) 0.01 0.11 0.27 0.89 0.96 1 1 1 1 n/5 n/4 BSS(2) - - - - - - - -

Ost and Takahashi

4.2 Experiment 2

Here we used the following MTD model on alphabet A = {0, 1}. We considered diﬀerent choices of order d and relevant lags i, j J1, d K (see Table 2). Let p0(1) = p0(0) = 0.5 and λ0 = 0.2. Also, let p i(0|0) = 1 p i(0|1) = p j(0|1) = 1 p j(0|1) = 0.7 and λ i = λ j = 0.4. For all x d: 1 {0, 1}J d, 1K and a A, the transition probability of the model was given by

p(a|x d: 1) = λ0p0(a) + λ ip i(a|x i) + λ jp j(a|x j).

We simulated the above model using sample sizes n {28, 29, 210, 211, 212, 213}. For each choice of i, j, d, and n we simulated 100 realizations. For each realization, we estimated the transition probability p(0|0d). We used diﬀerent estimators for the comparisons. FSC(ℓ) and FS(ℓ) are the same as described in Experiment 1. For transition probability estimation with FSC, we used X1:n/2 for Forward Stepwise and Xn/2+1:n for Cut step, obtaining the estimated relevant lag set ˆΛn. Then we used X1:n to calculate ˆpn(0|0ˆΛn). For transition probability estimation after PCP, we used X1:n to calculate ˆΛn for the PCP relevant lag estimator with initial set S = J d, 1K. The parameters for the threshold were chosen as follows: ϵ = 0.1, µ = 0.5, and α = C log(n), where we choose the values of the constant C by optimizing the probability to select the relevant lags correctly only for sample size n = 100, for the given choice of d, i and j. Then we used X1:n to calculate ˆpn(0|0ˆΛn). We also compared the performance of transition probability estimator ˆpn(0|0 d: 1), where we did not select the relevant lags (Naive estimator). In our simulations, when d was larger than 5, for both PCP and Naive estimators we did not obtain meaningful results because Nn(0d) = 0 with high probability. Therefore, we compared PCP and Naive estimators only for d = 5. In this case, FSC showed similar performance to PCP estimator and was in general better than Naive estimator. When d > 5, e.g. d = n/8, FSC still exhibited good performance. The results are indicated in Table 2.

Table 2: Empirical standard deviation of the estimator of p(0|0d). FSC, FS, PCP, and Naive are described in the main text.

Model parameter Method Sample size (n) i j d 256 512 1024 2048 4096 8192

1 5 5 FS(2) 0.0774 0.0682 0.0506 0.0286 0.0174 0.0133 1 5 5 FSC(5) 0.0745 0.0835 0.0602 0.0426 0.0222 0.0129 1 5 5 PCP 0.0965 0.0786 0.0577 0.0432 0.0242 0.0131 1 5 5 Naive 0.1518 0.0933 0.0624 0.0455 0.0340 0.0252 1 5 10 FSC(5) 0.0836 0.0842 0.0659 0.0425 0.0228 0.0141 1 10 15 FSC(5) 0.0864 0.0781 0.0641 0.0438 0.0249 0.0151 1 15 15 FSC(5) 0.0833 0.0834 0.0747 0.0488 0.0222 0.0167 11 100 120 FSC(5) - - 0.0838 0.0647 0.0312 0.0169 1 10 n/8 FSC(5) 0.0563 0.0543 0.0780 0.0698 0.0504 0.0105

Sparse Markov Models for High-dimensional Inference

4.3 Application

We applied the proposed method to study the relevant lags on a daily weather data registering the rainy and non-rainy days in Canberra Australia for a n = 1000 days. We obtained the data from kaggle (https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package). We used the ﬁrst 1000 time points of the data. We used Forward Stepwise algorithm with ℓ= 3 (FS(3)) and maximal order d = 400 to include the possibility of the annual cycle. We obtained as the three relevant lags {1, 62, 330}. The selected relevant lags were the same for d = 365 and d = 450, showing teh robustness of the result. The day before (lag 1) is clearly relevant and is often included in weather prediction models. Annual cycles ( 12 months) are also predictor of the weather, matching the 330 days lag that we found. Finally, the 62 days lag is consistent with the cycle of Madden-Julian oscillator ( 60 days), which is the largest inter-seasonal source of precipitation events in Australia (Wheeler et al. (2009)). We note that the estimated Markov chain is of order 330, which is around one-third of the sample size n = 1000, whereas using VLMC we do not expect to typically estimate Markov chains of order larger than log(10) 7. Indeed, using VLMC with BIC model selection criterion we selected a model with order 1. We set the upper limit of the model size as 400 for the VLMC order selection. As a further comparison, we applied the Conditional Tensor Factorization based Higher Order Markov Chain estimation together with the test for relevance of lags described in Sarkar and Dunson (2016). We again used the code available at https://github.com/david-dunson/bnphomc. The maximal possible order of the Markov chain was set to 400, the maximal number of relevant lags was 3, and the number of simulation for the Gibbs sampler was set to 1000. The inclusion probability calculated using Gibbs sampler for lags (1, 2, 3, 4, 5, 6, 7) were (100, 1.2, 0.2, 0.6, 0.2, 0.4, 0.2) percent, respectively. For all other orders the inclusion probability was zero. Therefore, no larger lags were detected by this method.

Acknowledgments

This research has been conducted as part of FAPESP project Research, Innovation and Dissemination Center for Neuromathematics (grant 2013/07699-0). GO thanks FAPERJ (grants E-26/201.397/2021 and E-26/211.343/2019) and CNPq (grant 303166/2022-3).

Appendix A. Proofs of Section 3

A.1 Proofs of Section 3.1

A.1.1 Proof of Theorem 1

Proof [Proof of Theorem 1] Since the set S J d, 1K containing the set Λ is ﬁxed, we will write x instead of x S to alleviate the notation. We start proving Item 1. Proof of Item 1. For each x AS, let us deﬁne the event

|ˆpn(a|x) p(a|x)| <

2α(1 + ε) ˆVn(a, x)

Nn(x) + α 3 Nn(x)

Ost and Takahashi

where ˆVn(a, x) is given by

ˆVn(a, x) = µ µ ψ(µ) ˆpn(a|x) + α (µ ψ(µ)) Nn(x).

By using ﬁrst the union bound and then by applying Proposition 34, we deduce that for each x AS,

P(Gc x) 4|A| log(µ(n d)/α + 2)

e αP Nn(x) > 0 . (44)

Now, observe that on Gx, we have that

d TV (ˆpn( |x), p( |x)) < X

α(1 + ε) ˆVn(a, x)

2 Nn(x) + α|A| 6 Nn(x) = sn(x),

which, together with (44), implies that

P (d TV (ˆpn( |x), p( |x)) sn(x)) 4|A| log(µ(n d)/α + 2)

e αP Nn(x) > 0 . (45)

Note that if j / Λ, then by the deﬁnition of the set Λ it follows that p(a|x) = p(a|y) for all x, y AS which are (S\{j})-compatible. Hence, by applying ﬁrst the triangle inequality and then using that tn(x, y) = sn(x) + sn(y), we deduce that the event {d TV (ˆpn( |x), ˆpn( |y)) tn(x, y)} is contained in the event

{d TV (ˆpn( |x), p( |x)) sn(x)} {d TV (ˆpn( |y), p( |y)) sn(y)},

P j ˆΛ1,n 2 X

x AS P(d TV (ˆpn( |x), pn( |x)) sn(x))

8|A| log(µ(n d)/α + 2)

x AS P( Nn(x) > 0),

where in the second inequality we have used (45). Since n d = P x AS Nn(x) P x AS 1{ Nn(x) > 0} which implies that n d E P x AS 1{ Nn(x) > 0} = P x AS P( Nn(x) > 0), we obtain from the above inequality that,

P j ˆΛ1,n 8|A| log(µ(n d)/α + 2)

concluding the the proof of Item 1. Proof of Item 2. Let j Λ, recall that δj = λj maxb,c A d TV (pj( |b), pj( |c)) and consider the event E = {δj γn,j}. Take b , c A such that d TV (pj( |b ), pj( |c )) = maxb,c A d TV (pj( |b), pj( |c)), and observe that with this choice,

δj = λjd TV (pj( |b ), pj( |c )).

Sparse Markov Models for High-dimensional Inference

From this equality it follows that for any pair (x, y) Cj(b , c ), we have

E = {d TV (p( |x), p( |y)) 2tn,j},

where we have used also that γn,j = 2tn,j. Now, take a pair (x , y ) Cj(a , b ) attaining the minimum in (14): tn,j(b , c ) = tn(x , y ).

From the deﬁnition of tn,j, it follows then that

tn,j tn,j(b , c ) = tn(x , y ).

Therefore, we conclude that

E {d TV (p( |x ), p( |y )) 2tn(x , y )},

so that by the triangle inequality, we obtain that on E,

2tn(x , y ) d TV (ˆpn( |x ), p( |x ))

+ d TV (ˆpn( |y ), p( |y )) + d TV (ˆpn( |x ), ˆpn( |y )).

Hence, on {j / ˆΛ1,n} E, we have

tn(x , y ) d TV (ˆpn( |x ), p( |x )) + d TV (ˆpn( |y ), p( |y )),

implying that

P {j / ˆΛ1,n} E P(d TV (ˆpn( |x ), p( |x )) sn(x ))

+ P(d TV (ˆpn( |y ), p( |y )) sn(y )).

From (45), it follows then that

P j / ˆΛ1,n, γn,j δj = P {j / ˆΛ1,n} E 8|A| log(µ(n d)/α + 2)

concluding the proof of Item 2. Proof of Item 3. Observe that by combining Items 1 and 2 together with the union bound, we deduce that

P ˆΛ1,n = Λ 8|A| ((|S| |Λ|)(n d) + |Λ|) log(µ(n d)/α + 2)

j Λ P (γn,j > δj) .

Hence, to conclude the proof of Item 3, it suﬃces to show that

P (γn,j > δj) 6|A|(|A| 1) exp 2(n d)2P2 S 8n(|S| + 1)2

Ost and Takahashi

for all j Λ, whenever the sample size n satisﬁes (16). By the union bound, we have that

P (γn,j > δj) X

c A:c =b P (tn,j(b, c) > δj/2) , (47)

and for each b, c A with b = c,

P (tn,j(b, c) > δj/2) P (tn(x, y) > δj/2) P (sn(x) > δj/4) + P (sn(y) > δj/4) , (48)

for any (x, y) Cj(b, c). By using again the union bound, we can deduce that for each in x AS,

P (sn(x) > δj/4) P

α(1 + ε) ˆVn(a, x)

2 Nn(x) > δj/8

+ P α 6 Nn(x) > δj 8|A|

and also that

α(1 + ε) ˆVn(a, x)

2 Nn(x) > δj/8

α(1 + ε)ˆpn(a|x)µ 2(µ ψ(µ)) Nn(x) > δj/16

(1 + ε) 2(µ ψ(µ)) > δj/16

By applying Proposition 26 with u1 = P(x) (4|A|α)/(3δj(n d)) and u2 = P(x) (16|A|α p

(1 + ε)/2(µ ψ(µ)))/(δj(n d)), one can show that

6 Nn(x) > δj/8 + P

(1 + ε) 2(µ ψ(µ)) > δj/16

2n(|S| + 1)2

P(x) 16|A|α δj(n d)

(1 + ε) 2(µ ψ(µ))

as long as (n d) > 16|A|α

(1+ε) 2(µ ψ(µ)). By using Jensen inequality, one can verify that

α(1 + ε)ˆpn(a|x)µ 2(µ ψ(µ)) Nn(x) > δj/16

Nn(x) < 128α(1 + ε)µ|A|

δ2 j (µ ψ(µ))

so that by Proposition 26 with u3 = P(x) (128α(1 + ε)µ|A|)/(δ2 j (µ ψ(µ))), it follows that

2α(1 + ε)ˆpn(a|x)µ

(µ ψ(µ)) Nn(x) > δj/16

2n(|S| + 1)2

P(x) 128|A|αµ(1 + ε) δ2 j (µ ψ(µ))(n d)

Sparse Markov Models for High-dimensional Inference

whenever (n d) > (128|A|αµ(1 + ε))/(δ2 j P(x)(µ ψ(µ))). Therefore, we have shown that for any x AS,

P (sn(x) > δj/4) 3 exp 2(n d)2P2(x)

8n(|S| + 1)2

8µ(1 + ε) (µ ψ(µ)) +

(1 + ε) 2(µ ψ(µ))

Now, considering b ,j, c ,j A with b = c and (x ,j, y ,j) Cj(a ,j, b ,j) such that

min b,c A:b =c max (x,y) Cj(a,b)(P(x) P(y)) = (P(x ,j) P(y ,j)),

we can deduce from (48) and (49) that

P (tn,j(b, c) > δj/2) 6 exp 2(n d)2(P(x ,j) P(y ,j))2

8n(|S| + 1)2

16 |A|α δ2 j P(x ,j) P(y ,j)

8µ(1 + ε) (µ ψ(µ)) +

(1 + ε) 2(µ ψ(µ))

Since P(x ,j) P(y ,j) PS for all j Λ , we can take

C = C(µ, ε) = 32

8µ(1 + ε) (µ ψ(µ)) +

(1 + ε) 2(µ ψ(µ))

to deduce that (46) is indeed satisﬁed whenever

(n d) C|A|α

δ2 min PS ,

concluding the proof.

A.1.2 Proof of Corollary 4

Proof [Proof of Corollary 4] Notice that Assumptions 1 and 2 are satisﬁed for all values of n, since pmin,n p min and n min for positive constants p min and min. Hence, the result follows immediately from Theorem 1-Item 3 and Remark 2-Item (c).

Ost and Takahashi

A.2 Proofs of Section 3.2

A.2.1 Proof of Proposition 6

In this section we prove Proposition 6. To that end, we need some auxiliary results. The ﬁrst auxiliary result is the following. Recall that we write Covx S(X0, mk(Xk)) to denote the conditional covariance between the random variables X0 and mk(Xk) given that XS = x S, where mk is deﬁned (3).

Lemma 19 For each S J d, 1K, k Sc and x S AS, the following identity holds:

Covx S(X0, mk(Xk)) = X

j Λ\S λj Covx S(mj(Xj), mk(Xk)). (50)

Remark 20 In (50), we use the convention that P j λj Covx S(mj(Xj), mk(Xk)) = 0.

Proof [Proof of Lemma 19] Observe that if Λ S, then the both sides of (50) are 0, so that the result holds immediately in this case. Now suppose that Λ S. In this case, to shorten the notation, let us write

Px S(xΛ\S) = Px S(XΛ\S = xΛ\S), for xΛ\S AΛ\S.

We want to compute

Covx S(X0, mk(Xk)) = Ex S(X0mk(Xk)) Ex S(X0)Ex S(mk(Xk)).

We ﬁrst compute Ex S(X0). To that end, write

Ex S(X0) = X

a A a Px S(X0 = a),

and observe that for each a A,

Px S(X0 = a) = X

xΛ\S AΛ\S Px S(xΛ\S)p(a|x SxΛ\S)

= λ0p0(a) + X

j Λ S λjpj(a|xj) + X

xΛ\S AΛ\S Px S(xΛ\S)pj(a|xj)

= λ0p0(a) + X

j Λ S λjpj(a|xj) + X

j Λ\S λj Ex S(pj(a|Xj)),

where in the second equality we have used the deﬁnition of the transition probabilities (2). Hence, we have that

Ex S(X0) = λ0m0 + X

j Λ S λjmj(xj) + X

j Λ\S λj Ex S(mj(Xj)),

where m0 = P a A ap0(a).

Sparse Markov Models for High-dimensional Inference

As a consequence of the above equality, it follows that

Ex S(X0)Ex S(mk(Xk)) =

j Λ S λjmj(xj)

Ex S(mk(Xk))

j Λ\S λj Ex S(mj(Xj))Ex S(mk(Xk)). (51)

We now compute Ex S(X0mk(Xk)). We consider only the case k Λ, the other is treated similarly. In this case, we ﬁrst write

Ex S(X0mk(Xk)) = X

b A amk(b)Px S(X0 = a, Xk = b), (52)

and then we proceed similar as above to deduce that for each a, b A,

Px S(X0 = a, Xk = b) = X

xΛ\S AΛ\S Px S(xΛ\S)p(a|x SxΛ\S)1{xk = b}

λ0p0(a) + X

j Λ S λjpj(a|xj) + λkpk(a|b)

Px S(Xk = b)

j Λ\(S {k}) λj X

c A pj(a|c)Px S(Xj = c, Xk = b). (53)

where in the second equality we have used the deﬁnition of the transition probabilities (2). Combining (52) and (53), we deduce that

Ex S(X0mk(Xk)) =

j Λ S λjmj(xj)

Ex S(mk(Xk)) + λk Ex S(m2 k(Xk))

j Λ\(S {k}) λj Ex S(mk(Xk)mj(Xj))

j Λ S λjmj(xj)

Ex S(mk(Xk))

j Λ\S λj Ex S(mk(Xk)mj(Xj)). (54)

Putting together the identities (51) and (54), we then conclude that

Covx S(X0mk(Xk)) = X

j Λ\S λj (Ex S(mj(Xj)mk(Xk))

Ex S(mj(Xj))Ex S(mk(Xk)) , (55)

and the result follows.

The next auxiliary result is the following.

Ost and Takahashi

Lemma 21 Suppose Assumptions 1 and 3 hold. Then there exists a constant κ > 0 such that the following property holds: for any S J d, 1K such that Λ S, we have

k Λ\S λjλk E(Cov XS(mj(Xj), mk(Xk))) κ .

Proof [Proof of Lemma 21] It suﬃces to show that for S J d, 1K such that Λ S, we have X

k Λ\S λjλk E(Covx S(mj(Xj), mk(Xk))) > 0.

Suppose that this is not the case. Then,

k Λ\S λjλk E(Covx S(mj(Xj), mk(Xk)))

k Λ\S λjλk Cov (mj(Xj) EXS(mj(Xj)), mk(Xk) EXS(mk(Xk)))

j Λ\S λj(mj(Xj) EXS(mj(Xj)))

so that P-almost surely, X

j Λ\S λj(mj(Xj) EXS(mj(Xj))) = 0.

This implies that P-almost surely,

j Λ\S λjmj(Xj) = EXS

j Λ\S λjmj(Xj)

or equivalently, X

j Λ\S λjmj(Xj) = f(XS), P-a.s.,

for some function f : AS R. Now take any conﬁguration x S AS and consider the event A = {XS = x S}. From the above identity, it follows that P-a.s.,

j Λ\S λjmj(Xj) = 1Af(x S).

Finally, take any conﬁguration xΛ\S AΛ\S such that

j Λ\S λjmj(xj) = f(x S),

Sparse Markov Models for High-dimensional Inference

and let B = {XΛ\S = xΛ\S}. Such a conﬁguration must exist by Assumption 3. As a consequence, we have that P-a.s.,

j Λ\S λjmj(xj) = 1A Bf(x S),

implying that P(A B) X

j Λ\S λjmj(xj) = P(A B)f(x S).

By Assumption 1, we have P(A B) = P(xΛ) > 0 so that the identify above would imply that X

j Λ\S λjmj(xj) = f(x S),

which is a contradiction. Therefore, we must have

j Λ\S λj(mj(Xj) EXS(mj(Xj)))

and the result follows.

We also need the following result.

Lemma 22 For each S J d, 1K, k Sc and x S AS, the following identity holds:

|Covx S(X0, mk(Xk))| mk Lip|Covx S(X0, Xk)|, (56)

where mk and mk Lip are deﬁned (3) and (6) respectively.

Proof [Proof of Lemma 22] First observe that Covx S(X0, mk(Xk)) = Covx S(X0, mk(Xk) mk(c)), for any c A. Since,

Covx S(X0, mk(Xk) mk(c)) = X

b A (mk(b) mk(c))Covx S(X0, 1{Xk = b})

and |mk(b) mk(c)| mk Lip|b c|, it follows then that

Covx S(X0, mk(Xk)) mk Lip X

b A |b c|Covx S(X0, 1{Xk = b}).

By taking c = min(A), we have that |b c| = (b c) for any b A and we deduce from the above inequality that

Covx S(X0, mk(Xk)) mk Lip Covx S(X0, Xk) mk Lip|Covx S(X0, Xk)|.

A similar argument shows that Covx S(X0, mk(Xk)) mk Lip|Covx S(X0, Xk)|, concluding the proof.

Our last auxiliary result is the following.

Ost and Takahashi

Lemma 23 For each S J d, 1K such Λ S, x S AS and j, k Λ \ S, the following identity holds:

Covx S(mj(Xj), mk(Xk)) = 1

c A Px S(Xk = b)Px S(Xk = c)(mk(b) mk(c))

(Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)) . (57)

where mj is deﬁned (3).

Proof [Proof of Lemma 23] First notice that

Covx S(mj(Xj), mk(Xk)) = X

a,b A mj(a)mk(b)Covx S(1{Xj = a}, 1{Xk = b}).

Now, for any a, b A, one can check that

Covx S(1{Xj = a}, 1{Xk = b}) = X

c A Px S(Xk = b)Px S(Xk = c)

(Px S(Xj = a|Xk = b) Px S(Xj = a|Xk = c))

Hence, we deduce from the above equalities that

Covx S(mj(Xj), mk(Xk)) = X

c A mk(b)Px S(Xk = b)Px S(Xk = c))

(Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)).

Interchanging the role of the symbols b and c in the equality above, we obtain that

Covx S(mj(Xj), mk(Xk)) = X

b A mk(c)Px S(Xk = b)Px S(Xk = c))

(Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)).

The result follows by combing the last two equalities above.

We now prove Proposition 6. Proof [Proof of Proposition 6] We ﬁrst prove inequality (26). Let us denote Dk,S(a, b, c, x S) = Px S(X0 = a|Xk = b) Px S(X0 = a|Xk = c), for each a, b, c A, x S AS and k / S. With this notation, one can check that for any x S AS and k / S, we have that

Covx S(X0, Xk) = 1

c A (b c)wk,S(b, c, x S) X

a A a Dk,S(a, b, c, x S). (58)

Now, observe that the triangle inequality and the equality

a A |Dk,S(a, b, c, x S)| = dk,S(b, c, x S),

Sparse Markov Models for High-dimensional Inference

|Covx S(X0, Xk)| Diam(A) A X

c A wk,S(b, c, x S)dk,S(b, c, x S),

so that E (|Cov XS(X0, Xk)|) Diam(A) A νk,S,

proving inequality (26). We now prove (27). This is done as follows. In the sequel, we shall write λ1Λ\S to denote the vector λ = (λj)j Λ restricted to the coordinates in Λ \ S: λ1Λ\S = (λj)j Λ\S. With this notation, it follows from Lemma 19 and Lemma 21 that for any S J d, 1K such that Λ S, X

k Λ\S λk E (Cov XS(X0, mk(Xk))) κ .

By the triangle inequality, it then follows that X

k Λ\S λk |E (Cov XS(X0, mk(Xk))|) κ .

Now using that

max k Λ\S |E (Cov XS(X0, mk(Xk)))| λ1Λ\S 1 X

k Λ\S λk |E (Cov XS(X0, mk(Xk)))| ,

we conclude that max k Λ\S |E (Cov XS(X0, mk(Xk)))| λ1Λ\S 1 κ .

By observing that 1 λ0 = P k Λ λk λ1Λ\S 1, we conclude from the above inequality that max k Λ\S |E (Cov XS(X0, mk(Xk)))| κ /(1 λ0) > 0,

and the result follows from Lemma 22. Therefore, it remains to prove (28). To that end, we ﬁrst use Lemma 19, Lemma (22) and Lemma 23 to obtain that X

k Λ\S λk mk Lip|Covx S(X0, Xk)| 1

c A λkλj Px S(Xk = b)Px S(Xk = c)

(mk(b) mk(c)) (Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)) .

Next, we observe that Assumption 4 implies that

λj (|Ex S(mj(Xj)|Xk = b) Ex S(mj(Xj)|Xk = c)|)

λk|mk(b) mk(c)|

k Λ\S λk mk Lip E (|Covx S(X0, Xk)|) p2 minΓ1

k Λ\S λ2 k X

c A (mk(b) mk(c))2,

Ost and Takahashi

where we have also used that Px S(Xk = b) pmin. Finally, note that |mk(b) mk(c)| min{|b c|2 : b = c} mk 2 Lip to obtain that

max k Λ\S E (|Covx S(X0, Xk)|) X

k Λ\S λk mk Lip p2 minΓ1 2 min{|b c|2 : b = c}

k Λ\S λ2 k mk 2 Lip.

Then, by using Cauchy-Schwartz inequality, we deduce that X

k Λ\S λk mk Lip s X

k Λ\S λ2 k mk 2 Lip p

|Λ \ S| s X

k Λ\S λ2 k mk 2 Lip p

The result follows by combining the last two inequalities.

A.2.2 Proof of Theorem 9

Before starting the proof of Theorem 9, we recall some deﬁnitions from Information Theory. In what follows, for S J d, 1K and j J d, 1K, we write I(X0; Xj|XS) to denote the conditional mutual information between X0 and Xj given XS, deﬁned as

I(X0; Xj|XS) = X

x S AS P(x S)I(X0; Xj|XS = x S), (59)

where I(X0; Xj|XS = x S) := Ij(x S) denotes the conditional mutual information between X0 and Xj given XS = x S, deﬁned as

Ij(x S) = X

a,b A Px S(X0 = a, Xj = b) log Px S(X0 = a, Xj = b) Px S(X0 = a)Px S(Xj = b)

We use the convention that when S = , the conditional probability Pxs is the unconditional probability P. Hence, in this case, the conditional mutual information between X0 and Xj is the mutual information between these random variables, denoted I(X0; Xj) := Ij. The entropy H(X0) of X0 is deﬁned as

a A P(X0 = a) log(P(X0 = a)). (61)

To prove Theorem 9 we proceed similarly to Bresler (2015). During the proof we will need the the following lemma.

Lemma 24 Suppose that the event Gm(ξ, ℓ) deﬁned in (29) holds and let S J d, 1K with |S| ℓ. If ˆνm,k,S τ with k Sc, then I(X0; Xk|XS) 2(τ ξ)2.

Proof Deﬁnition (59) together with Jensen inequality implies that for any j Sc, r

1 2I(X0; Xj|XS) =

x S AS P(x S)Ij(x S)

x S AS P(x S)

1 2Ij(x S).

Sparse Markov Models for High-dimensional Inference

By Pinsker inequality, it then follows that r

1 2Ij(x S) 1

a,b A |Px S(X0 = a, Xj = b) Px S(X0 = a)Px S(Xj = b)|

b A Px S(Xj = b)1

a A |Px S(X0 = a|Xj = b) Px S(X0 = a)|

c A wj,S(b, c, x S)d TV (Px S(X0 |Xj = b), Px S(X0 |Xj = c))

= νj,S(x S),

where in the second equality we have used that for any a, b A,

Px S(X0 = a|Xj = b) = X

c A Px S(Xj = c)Px S(X0 = a|Xj = b),

and also that Px S(X0 = a) = X

c A Px S(Xj = c)Px S(X0 = a|Xj = c)).

As a consequence, we deduce that r

1 2I(X0; Xj|XS) X

x S AS P(x S)νj,S(x S) = νj,S.

Now, on the event Gm(ξ, ℓ), we have that νk,S ˆνm,k,S ξ so that

1 2I(X0; Xk|XS) ˆνm,k,S ξ τ ξ,

where in the rightmost inequality we have used that ˆνm,k,S τ. Hence,

I(X0; Xj|XS) 2(τ ξ)2,

and the result follows.

We now prove Theorem 9. Proof Suppose the event Gm(ξ , ℓ ) holds and let ˆSm be the set obtained at the end of FS step of Algorithm 1 with parameter ℓ , where the parameters ξ and ℓ are deﬁned as in (30). In the sequel, let S0 = and Sk = Sk 1 {jk}, where jk arg maxj Sc k 1 ˆνm,j,Sk 1 for 1 k d, and observe that by construction ˆSm = Sℓ . We want to show that Λ ˆSm. We argue by contraction. Suppose that Λ is not contained in ˆSm. In this case, it follows that Λ Sk for all 1 k ℓ , and Proposition 6 implies that for all 1 k ℓ ,

max j Sc k νj,Sk κ A Diam(A) = 4ξ ,

Ost and Takahashi

where the equality holds by the choice of ξ . Since the event Gm(ξ , ℓ ) holds and |Sk| |Sℓ | = ℓ , it follows from the above inequality that

ˆνm,jk,Sk 1 = max j Sc k 1 ˆνm,j,Sk 1 3ξ ,

for all 1 k ℓ + 1. By Lemma 24, we then deduce that I(X0, Xjk|XSk 1) 8ξ2 for all 1 k ℓ + 1. Now, notice that

log2(|A|) H(X0) I(X0; X ˆSm {jℓ +1}) =

k=1 I(X0, Xjk|XSk 1), (62)

where we have used Gibbs inequality in the ﬁrst passage, the fact that the entropy is always larger than the mutual information in the second passage and the Chain Rule in the last passage. The proof of these facts can be found, for instance, in (Cover and Thomas, 2006). By the choice of ℓ = log2(|A|)/8ξ2 , we have that ℓ + 1 > log2(|A|)/8ξ2 so that it follows from (62) that log2(|A|) (ℓ + 1)8ξ2 > log2(|A|),

a contradiction. Thus, we must have Λ ˆSm and the result follows.

A.2.3 Proof of Theorem 10

To prove Theorem 10 we shall need the following result.

Proposition 25 Suppose Assumptions 1 and 2 hold, and let > 0 the quantity deﬁned in Assumption 2. Then, for any ξ > 0 and m > d 2ℓ 0,

P(Gc m(ℓ, ξ)) 2d(ℓ+ 1) d ℓ

|A|ℓ+2 exp ξ2(m d)2( )2

18|A|2(ℓ+2)m(ℓ+ 2)2

During the proof of Proposition 25 we will make use of the following proposition. For any function f : AJ1,m K R, deﬁne for each 1 j m,

δj(f) = sup n |f(x1:j 1axj+1:m) f(x1:j 1bxj+1:m)| : a, b A, x1:m AJ1,m Ko , (64)

with the convention that x1:0 = xm+1:m = , ax2:m = ax2:m and x1:m 1a = x1:m 1a . Let δ(f) = (δ1(f), . . . , δm(f)) and denote δ(f) 2 2 = Pm j=1 δ2 j (f). In what follows, we write E1:m[f] = P x1:m AJ1,m K P(X1:m = x1:m)f(x1:m).

Proposition 26 (Theorem 3.4. of (Chazottes et al., 2020)) Suppose Assumption 2 holds, that is, > 0.

1. For any u > 0 and f : AJ1,m K R,

P (|f(X1:m) E1:m[f]| > u) 2 exp 2u2 2

Sparse Markov Models for High-dimensional Inference

2. For m > d, any g : AS A R with S J d, 1K and u > 0,

t=d+1 g(Xt+S, Xt) ES[g]

2 exp u2(m d)2 2

2m(|S| + 1)2 g 2

where ES[g] = P x S AS P a A P(XS = x S, X0 = a)g(x S, a) and Xt+S = (Xt+j)j S.

Before starting the proof Proposition 25, we need to introduce some additional notation. For each x AS with S J d, 1K, we write

ˆPm(x) = Nm(x) m d . (65)

In what follows, we write xa V {0}, with a A and V J d, 1K, to denote the conﬁguration ((xa)j)j V {0}, deﬁned as

( xj, for j V a, for j = 0 .

When V = S {k} and xk = b A, we shall write xba S {k,0} instead of xa V {0}. With this notation, the empirical version of νk,S is deﬁned as follows:

ˆνm,k,S = X

x S AS ˆPm(x S)ˆνm,k,S(x S) (66)

where for x S AS, we deﬁne

ˆνm,k,S(x S) = X

c A ˆwm,k,S(b, c, x S) ˆdm,k,S(b, c, x S), (67)

and for b, c A,

ˆwm,k,S(b, c, x S) = ˆPm(xb S {k})

ˆPm(xc S {k})

ˆPm(x S) , (68)

and ˆdm,k,S(b, c, x S) = 1

ˆPx S(X0 = a|Xk = b) ˆPx S(X0 = a|Xk = c) ,

where for each b A,

ˆPx S(X0 = a|Xk = b) = ˆPm(xba S {k,0})

ˆPm(xb S {k}) .

Hereafter, we omit the dependence on S and on m, whenever there is no risk of confusion. We now prove Proposition 25. Proof [Proof of Proposition 25] Claim 1. Let S J d, 1K with |S| ℓ< d/2 and take j Sc. Then,

| νj,S ˆνj,S| 3 X

ˆP(XS = x, Xj = b, X0 = a) P(XS = x, Xj = b, X0 = a) .

Ost and Takahashi

Proof of the Claim 1. By applying the triangle inequality twice, one can check that

| νj,S ˆνj,S| 1

a,b,c A |P(x)wj,S(b, c, x) (Px(X0 = a|Xj = b) Px(X0 = a|Xj = c))

ˆP(x) ˆwj,S(b, c, x) ˆPx(X0 = a|Xj = b) ˆPx(X0 = a|Xj = c) . (69)

Now observe that for ﬁxed x AS and a, b, c A,

P(x)wj,S(b, c, x)Px(X0 = a|Xj = b) = Px(Xj = c)P(XS = x, Xj = b, X0 = a)

and similarly,

ˆP(x) ˆwj,S(b, c, x)ˆPx(X0 = a|Xj = b) = ˆPx(Xj = c)ˆP(XS = x, Xj = b, X0 = a).

By using these identities in (69) and then by applying the triangle inequality, one can deduce that

| νj,S ˆνj,S| X

a,b,c A |Px(Xj = c)P(XS = x, Xj = b, X0 = a)

ˆPx(Xj = c)ˆP(XS = x, Xj = b, X0 = a) . (70)

By adding and subtracting the term Px(Xj = c)ˆP(XS = x, Xj = b, X0 = a) in the right-hand side of the above inequality and using again the triangle inequality, it follows that

a,b,c A |Px(Xj = c)P(XS = x, Xj = b, X0 = a)

ˆPx(Xj = c)ˆP(XS = x, Xj = b, X0 = a)

a,b A |P(XS = x, Xj = b, X0 = a) ˆP(XS = x, Xj = b, X0 = a)|

a,c A ˆP(XS = x, X0 = a)|Px(Xj = c) ˆPx(Xj = c)|. (71)

By adding and subtracting the term P(XS = x)P(XS = x, Xj = c), we can then check that

|Px(Xj = c) ˆPx(Xj = c)| P(Xj = c)

ˆP(XS = x) |ˆP(XS = x) P(XS = x)|

+ 1 ˆP(XS = x) |ˆP(XS = x, Xj = c) P(XS = x, Xj = c)|. (72)

Sparse Markov Models for High-dimensional Inference

From (71) and (72), one deduces that

| νj,S ˆνj,S| X

a,b A |P(XS = x, Xj = b, X0 = a) ˆP(XS = x, Xj = b, X0 = a)|

c A |P(XS = x, Xj = c) ˆP(XS = x, Xj = c)|

x AS |P(XS = x) ˆP(XS = x)|. (73)

|P(XS = x, Xj = c) ˆP(XS = x, Xj = c)|

a A |P(XS = x, Xj = c, X0 = a) ˆP(XS = x, Xj = c, X0 = a)|

|P(XS = x) ˆP(XS = x)|

a,c A |P(XS = x, Xj = c, X0 = a) ˆP(XS = x, Xj = c, X0 = a)|,

the proof of Claim 1 follows from (73). Claim 2. For any u > 0,

ˆP(XS = x, Xj = b, X0 = a) P(XS = x, Xj = b, X0 = a) > u

2|A||S|+2 exp u2(m d)2 2

18|A|2(|S|+2)m(|S| + 2)2

Proof of Claim 2. It follows from the union bound and Proposition 26.

We now will conclude the proof. Let Sk = {S J d, 1K : |S| = k} and observe that by the union bound

P(Gc m(ξ, ℓ))

j Sc P (| νj,S ˆνj,S| > ξ) .

Combining Claims 1 and 2, it follows that

P (| νj,S ˆνj,S| > ξ) 2|A||S|+2 exp ξ2(m d)( )2

18|A|2(|S|+2)m(|S| + 2)2

which implies that

P(Gc m(ξ, ℓ)) 2

(d k)|A|k+2 exp ξ2(m d)2 2

18|A|2(k+2)m(k + 2)2

Ost and Takahashi

Since ℓ d/2, we can use that d k d ℓ for all 0 k ℓto obtain that

P(Gc m(ξ, ℓ)) 2d(ℓ+ 1) d ℓ

|A|ℓ+2 exp ξ2(m d)( )2

18|A|2(ℓ+2)m(ℓ+ 2)2

and the result follows.

We now prove Theorem 10. Proof [Proof of Theorem 10] First, observe that by Theorem 9,

P(ˆΛ2,n = Λ) P(Gc m(ξ , ℓ )) + P(Λ ˆSm, ˆΛ2,n = Λ) (74)

Next, notice that the second term on the right hand side of (74) can be written as

P(Λ ˆSm, ˆΛ2,n = Λ) = X

S J d, 1K:Λ S,|S| ℓ P( ˆSm = S, ˆΛ2,n = Λ).

Now for any S J d, 1K such that Λ S, |S| ℓ , it follows from the union bound that

P( ˆSm = S, ˆΛ2,n = Λ) X

j Λ P( ˆSm = S, j / ˆΛ2,n) + X

j S\Λ P( ˆSm = S, j ˆΛ2,n).

By proceeding similarly as in the proof of Item 1 of Theorem 1, one can deduce that for any j S \ Λ,

P( ˆSm = S, j ˆΛ2,n) 8|A| log(µ(n m d)/α + 2)

x AS P( ˆSm = S, Nm,n(x) > 0),

j S\Λ P( ˆSm = S, j ˆΛ2,n) 8(ℓ |Λ|)|A| log(µ(n m d)/α + 2)

x AS P( ˆSm = S, Nm,n(x) > 0).

S J d, 1K:Λ S,|S| ℓ

x AS P( ˆSm = S, Nm,n(x) > 0) (n m d)P(Λ ˆSm)

we then deduce that X

S J d, 1K:Λ S,|S| ℓ

j S\Λ P( ˆSm = S, j ˆΛ2,n) 8(ℓ |Λ|)|A|

log(µ(n m d)/α + 2)

e α(n m d).

Sparse Markov Models for High-dimensional Inference

Following the steps of the proof of Item 2 of of Theorem 1, we can also show that X

S J d, 1K:Λ S,|S| ℓ

j Λ P( ˆSm = S, j / ˆΛ2,n, δj γS m,n,j) 8|Λ||A|

log(µ(n m d)/α + 2)

S J d, 1K:Λ S,|S| ℓ P( ˆSm = S)

8|Λ||A| log(µ(n m d)/α + 2)

where γS m,n,j is deﬁned as in (14) with tm,n,j = maxb,c A:b =c min(x S,y S) Cj(b,c) tm,n(x S, y S) in the place of tn,j. Hence, it remains to estimate X

S J d, 1K:Λ S,|S| ℓ

j Λ P( ˆSm = S, j / ˆΛ2,n, δj < γS m,n,j).

By proceeding similarly to the proof of Item 3 of Theorem 1, one can show that for each S J d, 1K such that |S| ℓ ,

P γS m,n,j > δj 6|A|(|A| 1) exp

( ( pmin)2(n m d)2

2(n m)|A|2(ℓ 1)(ℓ + 1)2

1 nmin n m d

for all j Λ as long as n satisﬁes n > m + d + nmin. By using this upper bound and by recalling that P S J d, 1K:Λ S,|S| ℓ Pℓ k=0 d k (ℓ + 1) d ℓ (since 2ℓ d), we deduce that X

S J d, 1K:Λ S,|S| ℓ

j Λ P( ˆSm = S, j / ˆΛ2,n, δj < γS m,n,j)

6|A|(|A| 1)(ℓ + 1) d ℓ

( ( pmin)2(n m d)2

2(n m)|A|2(ℓ 1)(ℓ + 1)2

1 nmin n m d

for all j Λ whenever n > m + d + nmin, implying the result.

A.2.4 Proof of Corollary 12

Proof [Proof of Corollary 12] Assumptions 1 and 2 are satisﬁed for all values of n, since p n p min and n min for positive constants p min and min for all n. Since the sequence of MTD models also satisfy Assumption 3, the result follows from Theorem 10 and Remark 11-Item (b).

A.3 Proofs of Section 3.3

A.3.1 Proof of Theorem 14

Proof [Proof of Theorem 14]

Ost and Takahashi

Notice that we can write for each j J d, 1K,

mj(Xj) = (pj(1|1) pj(1|0)X j + pj(1|0),

so that equality (50) can be rewritten for any j J d, 1K satisfying (pj(1|1) pj(1|0)) = 0, S J d, 1K \ {j} and x S {0, 1}S, as

Covx S(X0, Xj) = X

ℓ Λ\S ℓCovx S(Xℓ, Xj),

where ℓ= λℓ(pℓ(1|1) pℓ(1|0)) for ℓ Λ. Recalling that νj,S = 2E (|Cov XS(X0, Xj)|) in the binary case, we can deduce that for any S J d, 1K, x S {0, 1}S and j J d, 1K\S,

ℓ Λ\S ℓCov XS(Xℓ, Xj)

As a consequence, it follows that for S Λ and j Λ \ S,

Var XS(Xj)| j| X

ℓ Λ\S:ℓ =j | ℓ||Cov XS(Xℓ, Xj)|

= 2E (| j|Var XS(Xj)

| ℓ| | j||Px S(Xℓ= 1|Xj = 1) Px S(Xℓ= 1|Xj = 0)|

where in the second inequality we have used that

|Covx S(Xℓ, Xj)| = Varx S(Xj)|Px S(Xℓ= 1|Xj = 1) Px S(Xℓ= 1|Xj = 0)|.

By Assumption 4, we then deduce that

νj,S 2Γ1| j|E(Var XS(Xj)). (75)

Now, take S Λ and let j S arg minℓ Λ\S | j|E(Var XS(Xℓ)). For any j (Λ)c, use the triangle inequality, equality (75) and Assumption 5 to deduce that

ℓ Λ\S | ℓ||Cov XS(Xℓ, Xj)|

2| j S|E (Var XS(Xj S)) Γ2 (76)

Using that δj = | j| and combing inequalities (75) and (76), it follows then that

max j Λ\S νj,S max j (Λ)c νj,S 2(Γ1 Γ2)| j S|E (Var XS(Xj S)) , (77)

Sparse Markov Models for High-dimensional Inference

where we have used also that maxj Λ\S νj,S νj S,S. Using that E (Var XS(Xj S)) (p )2

and | j S| minj Λ | j| we obtain that

max j Λ\S νj,S max j (Λ)c νj,S

2(Γ1 Γ2)p2 min min j Λ | j|.

concluding the ﬁrst half of the proof. To show the second assertion of the theorem, take S Λ, let j S arg maxj Λ\S νj,S and note that on Gn(ℓ, ξ),

max j Λ\S ˆνn,j,S ˆνn,j S,S νj S,S ξ = max j Λ\S νj,S ξ.

Similarly, one can show that on Gn(ℓ, ξ),

max j (Λ)c ˆνn,j,S max j (Λ)c νj,S + ξ.

As a consequence, it follows that

max j Λ\S ˆνn,j,S max j (Λ)c ˆνn,j,S max j Λ\S νj,S max j (Λ)c νj,S

whenever Gn(ℓ, ξ). By taking ξ as in (37), we have that

max j Λ\S ˆνn,j,S max j (Λ)c ˆνn,j,S > 0,

implying that arg maxj Sc ˆνn,j,S Λ for all S Λ, and the result follows.

A.3.2 Proof of Corollary 16

Proof [Proof of Corollary 16] By Theorem 14, we have that

P(ˆΛ2,n = Λ) P(Gc m(ξ, L)) + P(Λ ˆSm, ˆΛ2,n = Λ),

for any ξ < (Γ1 Γ2)p min minj Λ δj. By Proposition 25, we have that

P(Gc m(ξ, L)) 2d(L + 1) d L

|A|L+2 exp ξ2(m d)2 2

18|A|2(L+2)m(L + 2)2

By taking ξ = (Γ1 Γ2)p min minj Λ δj, one can check that if

min j Λ δj C1 log(n)

for some constant C1 = C1(β, L, min, p min, Γ1, Γ2), then P(Gc m(ξ, L)) 0 as n .

Ost and Takahashi

By proceeding exactly as in the proof of Theorem 10 and using 11-item (b), we can show that

P(Λ ˆSm, ˆΛ2,n = Λ)

6|A|(L + 1) d L

d|A|L+1(|A| 1)|Λ| exp ( pmin)2(n d)2

8n(L + 1)2|A|2(L 1)

+ 8|A| ((L |Λ|)(n m d) + |Λ|) log(µ(n m d)/α + 2)

n d + C|A|Lα

pminδ2 min ,

where C = C(µ, ε). Therefore, using that min, pmin p min, d = nβ, α = (1 + η) log(n), we also see that if δ2 min C2 log(n)

n for some C2 = C2(µ, ε, η, min, p min, β, L) then P(Λ ˆSm, ˆΛ2,n = Λ) 0 as n . By taking C = C1 C2, we deduce that P(ˆΛ2,n = Λ) 0 as n as long as δ2 min C log(n)

n , and the result follows.

A.4 Proofs of Section 3.4

Proof [Proof of Theorem 17] By the union bound, we have that

x d: 1 AJ d, 1K

|ˆpn(a|xˆΛ2,n) p(a|xΛ)|

v u u t2α(1 + ϵ) ˆVn(a, xˆΛ2,n)

+ α 3 Nn(xˆΛ2,n)

P(Λ = ˆΛ2,n)

|ˆpn(a|xΛ) p(a|xΛ)|

2α(1 + ϵ) ˆVn(a, xΛ)

Nn(xΛ) + α 3 Nn(xΛ)

Now, Proposition 34 implies that for any a A and xΛ AΛ,

|ˆpn(a|xΛ) p(a|xΛ)|

2α(1 + ϵ) ˆVn(a, xΛ)

Nn(xΛ) + α 3 Nn(xΛ)

4 log(µ(n d)/α + 2)

e αP Nn(xΛ) > 0 ,

Sparse Markov Models for High-dimensional Inference

|ˆpn(a|xΛ) p(a|xΛ)|

2α(1 + ϵ) ˆVn(a, xΛ)

Nn(xΛ) + α 3 Nn(xΛ)

4|A| log(µ(n d)/α + 2)

xΛ AΛ E 1{ Nn(xΛ) > 0} .

xΛ AΛ E[1{ Nn(xΛ) > 0}] (n d),

the result follows.

A.5 Proof of Section 3.5

A.5.1 Proof of Proposition 18

Proof [Proof of Proposition 18] First observe that since all MTDs are stationary Markov chains of order at most d, we can use the Markov property to show that

KL(P (j) n ||P (k) n ) = KL(P (j) d ||P (k) d ) + (n d)E(j)(KL(p(j)( |X d: 1)||p(k)( |X d: 1))).

where KL(p(j)( |x d: 1)||p(k)( |x d: 1)) denotes the Kullback-Leibler divergence between p(j)( |x d: 1) and p(k)( |x d: 1). Now note that for each ﬁxed x :d1 AJ d, 1K, we can use the deﬁnition of the transition probabilities p(j)( | ) together with Lemma 6 of Csizar and Talata (2005) to deduce that

KL(p(j)( |x d: 1)||p(k)( |x d: 1)) λ2|p(1|1) p(1|0)|2(pmin) 11{xj =xk}.

Since pmin (1 λ)/2 and δ = λ|p(1|1) p(1|0)|, it follows from the above inequality that

E(j)(KL(p(j)( |X d: 1)||p(k)( |X d: 1))) 2δ2

By using similar arguments, one can also show that

KL(P (j) d ||P (k) d ) 1 pmin

max x d: 1,y d: 1 |p(j)(1|x d: 1) p(k)(1|y d: 1)|2

i=1 max x d: 1,y d: 1:x i: 1=y i: 1 |p(j)(1|x d: 1) p(k)(1|y d: 1)|2 !

Therefore, it follows that

KL(P (j) n ||P (k) n ) 2dδ2

1 λ + (n d) 2δ2

and the result follows.

Ost and Takahashi

A.6 Computation of PCP and FSC estimators

We will ﬁrst show that one can compute the PCP estimator with at most O(|A|2|S|(n d)) computations, as claimed in item (c) of Remark 2.

Proof of item (c) of Remark 2. One way to compute the PCP estimator is the following. First, we compute Nn(x S, a) simultaneously for all pasts x S and symbols a A, and build the set ES = {x S : Nn(x S) > 0}. This can be done with O(n d) computations. Indeed, we set initially Nn(x S, a) = 0 for all past x S and symbol a A. Then at each time d + 1 t n, we increment by 1 the count of Nn(x S, a) for which Xt+S = x S and Xt = a, leaving all the other counts unchanged. Moreover, at the ﬁrst time that Nn(x S, a) > 0, we include x S in the set E. Note that the cardinality of the set ES is at most (n d). Next, we need to compute sn(x S) and ˆpn( |x S) for each x S E, which can be done with at most O(|A|) additional computations. Once all these quantities are determined, we then need to test whether a given lag j S has to be removed or not, by evaluating inequality (13) for all pairs of (S \ {j})-compatible pasts in ES. This can be done with at most O(|A|2(n d)) more computations because 1) the number of diﬀerent pasts in ES is at most (n d); 2) there are at most |A| pasts in E which are compatible with a ﬁxed past x S in ES; and 3) one can evaluate whether inequality (13) holds or not to a given pair of compatible past with O(|A|) additional computations. Finally, since the number of lags to be tested is |S|, it follows that we can implement the PCP estimator with at most O(|A|2|S|(n d)) computations, concluding the proof.

We now show that we can compute the FSC estimator by using at most O(|A|3ℓ(m d)(d (ℓ 1)/2) + |A|2(n m d)ℓ) computations, as stated in item (c) of Remark 8.

Proof of item (c) of Remark 8.

By the item (c) of Remark 2, the CUT step can be computed with at most O(ℓ|A|2(n m d)) computations since the FS step outputs a subset of size ℓand the size of the second half of the sample is n m. Hence, the proof will be concluded if we show that the FS step can be computed with at most O(|A|3(m d)(ℓd (ℓ 1)ℓ/2) computations. To see that, let us ﬁx S J d, 1K and j / S. Proceeding as in the proof item (c) of Remark 2, one can check that we compute Nm(xba S {0,j}) simultaneously for all conﬁgurations xba S {j,0} and build the set ES = {x S : Nm(x S) > 0} with O(m d) computations. Notice that the size of the set ES is most (m d). Since for each x S ES, we need to perform at most O(|A|3) additional operations to compute ˆPm(x S) and ˆνj,m(x S), it follows that with at most O(|A|3(m d)) computations we can determine ˆνm,j,S. Therefore, the step 3 of the FS step (where we need to compute ˆνm,j,S for j Sc) can be implemented with O(|A|3(m d)(d |S|)) calculations. Since we need to repeat step 3 of the FS step for ℓ diﬀerent sets, we conclude that with at most

O(|A|3(m d)

|S|=0 (d |S|)) = O(|A|3(m d)(ℓd (ℓ 1)ℓ/2)

computations, we can implement the FS step. This concludes the proof.

Sparse Markov Models for High-dimensional Inference

Appendix B. Martingale concentration inequalities

In the sequel, N denotes the set of non-negative integers {0, 1, . . .} Let (Ω, F, P) be a probability space. We assume that this probability space is rich enough so that the following stochastic processes may be deﬁned on it. In what follows, let (Xt)t Z be a Markov chain of order d Z+, taking values on a ﬁnite alphabet A, with family of transition probabilities {p( |x d: 1) : x d: 1 supp(P)}. Denote Ft = σ(X d:t) for t N. For each a A, consider the stochastic process Ma = ((Ma t ))t N deﬁned as,

Ma t = 1{Xt = a} p(a|X(t d):(t 1)), t N.

Let H = (Ht)t N be a stochastic process taking values on a ﬁnite alphabet B R, satisfying H0 = 0 and Ht Ft 1 for all t Z+, and consider H Ma = (H Ma t )t N deﬁned as,

s=0 Hs Ma s , t N. (78)

Notice that H Ma is adapted to the ﬁtration F := (Ft)t N, that is H Ma t Ft for all t N. Also H Ma 0 = 0. Recall the notation B = maxb B |b|.

Lemma 27 Let H Ma = (H Ma t )t N be the stochastic process deﬁned in (78). Then H Ma is a square integrable Martingale w.r.t. F starting from H Ma 0 = 0. Moreover, the predictable quadratic variation of H Ma, denoted by H Ma = ( H Ma t)t N, is given by

s=0 H2 s p a|X(s d):(s 1) 1 p a|X(s d):(s 1) , t N. (79)

Furthermore, for any λ > 0 and b > 0 such that B b, the stochastic process

exp λH Ma eλb λb 1

b2 H Ma = exp λH Ma t eλb λb 1

is a supermartingale w.r.t. F starting from 1.

Proof For each t Z+, we have that Ht Ft 1 and also that E [1{Xt = a}|Ft 1] = p(a|X(t d):(t 1)). These two facts imply that for any t Z+,

E [Ht Ma t |Ft 1] = Ht E [Ma t |Ft 1] = 0,

which, in turn, implies that E[H Ma t |Ft 1] = H Ma t 1. Hence, H Ma is a martingale w.r.t. to F. Since |H Ma t | B t for t 1, it follows that H Ma is also square integrable. The predictable quadratic variation of H Ma is deﬁned as

s=1 E (Ms Ms 1)2 |Fs 1 ,

for t Z+ with H Ma 0 = 0. For any t Z+, one can check that H Ma t H Ma t 1 2 = H2 t (1{Xt = a} 2p(a|X(t d):(t 1))1{Xt = a} p2(a|X(t d):(t 1))).

Ost and Takahashi

Using again that Ht Ft 1 and also that E [1{Xt = a}|Ft 1] = p(a|X(t d):(t 1)), one then deduces that for any t Z+,

E H Ma t H Ma t 1 2 |Ft 1 = H2 t p(a|X(t d):(t 1))(1 p(a|X(t d):(t 1))),

which establishes (79). The proof that exp λH Ma eλb λb 1

b2 H Ma is a supermartingale w.r.t F can be found in (Raginsky and Sason, 2014).

We will use Lemma 27 to prove the following concentration inequality.

Proposition 28 Let H Ma = (H Ma t )t N be the stochastic process deﬁned in (78). Suppose that B b for some b > 0. For any ﬁxed α > 0 and v > 0, we have for t N,

3 , H Ma t v exp ( α) P( H Ma t > 0).

Remark 29 This is basically Lemma 5 of (Oliveira, 2015) (see the Economical Freedman s inequality provided in Inequality (41)) applied to the square integrable martingale H Ma. The only diﬀerence is the factor 2 in front of the linear term αb

3 which is not present here. Notice that for t Z+, the concentration inequality above can be rewritten in the following form:

3 , H Ma t v| H Ma t > 0 exp ( α) .

The conditioning on event { H Ma t > 0} reﬂects the fact that if H Ma t = 0 almost surely, then H Ma t = H Ma 0 = 0 almost surely as well.

Proof For t = 0 the result holds trivially. Now, suppose t Z+. By considering the set B/b = {c/b : c B} instead of B, it suﬃces to prove the case b = 1. To shorten the notation, we denote Mt = H Ma t in the sequel. By the Markov property, we have that for any λ > 0,

P(λMt ψ(λ) M t α) exp( α)E [exp (λMt ψ(λ) M t) 1{ M t > 0}] ,

where ψ(λ) = eλ λ 1 and we have used that if M t = 0 almost surely, then Mt = M0 = 0 almost surely. By using the fact that (exp (λMt ψ(λ) M t))t N is a supermartingale (Lemma 27 with b = 1) together with the decomposition

{ M t > 0} =

k=1 { M k > 0 and M j = 0 for all j < k},

as in (Oliveira, 2015), we can deduce that

E [exp (λMt ψ(λ) M t) 1{ M t > 0}] P( M t > 0),

Sparse Markov Models for High-dimensional Inference

which implies not only that for any λ > 0,

exp( α)P( M t > 0), (80)

but also that

λ, M t v exp( α)P( M t > 0).

Now, we use that for λ (0, 3) it holds that ψ(λ) λ2(1 λ/3) 1/2. Hence, from the above inequalities we deduce that for any λ (0, 3),

P Mt λ 2(1 λ/3) M t + α

exp( α)P( M t > 0), (81)

and also that

P Mt λ 2(1 λ/3)v + α

λ, M t v exp( α)P( M t > 0).

Minimizing λ (0, 3) 7 λ (1 λ/3)v + α

λ, the result follows.

By using a peeling argument as in (Hansen et al., 2015), we deduce from the above result the following.

Proposition 30 Let H Ma = (H Ma t )t N be the stochastic process deﬁned in (78). Suppose that B b for some b > 0. For ϵ > 0, v > w > 0 and α > 0, we have for t N,

2α(1 + ϵ) H Ma t + αb

3 , w H Ma t v log(v/w + 1)

exp ( α) P( H Ma t > 0).

Proof It suﬃces to prove the case b = 1. The general case follows from this one by ﬁrst replacing B by B/b = {c/b : c B} and then rearranging the terms properly. Let us denote v0 = w and vk = (1 + ε)vk 1 for 1 k K := l log(v/w+1)

log(1+ε) m . Notice that v K v, by the deﬁnition of K. To shorten the notation, we denote H Ma t = Mt in what follows. Starting from (81), one can deduce that for any 0 k < K and λ (0, 3), we have

P Mt λ 2(1 λ/3) M t + α

λ, vk M t vk+1

exp( α)P( M t > 0),

which, in turn, implies that

P Mt λ 2(1 λ/3)vk+1 + α

λ, vk M t vk+1

exp( α)P( M t > 0).

Ost and Takahashi

Minimizing w.r.t. to λ (0, 3) as in Proposition 28, it then follows that

3 , vk M t vk+1 exp ( α) P( M t > 0).

Now, on the event {vk M t}, we have that M t(1 + ε) (1 + ε)vk = vk+1, so that the inequality above implies

2(1 + ε) M tα + α

3 , vk M t vk+1 exp ( α) P( M t > 0).

Summing over k the result follows (recall that v v K by the choice of K).

Hereafter, let m N and consider a function ϕ : AJ1,m K AJ d, 1K such that its supremum norm ϕ = max(x1:m,x d: 1) AJ1,m K AJ d, 1K |ϕ(x1:m, x d: 1)| b. Here we use the convention that ϕ is a function deﬁned only on AJ d, 1K when m = 0. Given such a function ϕ, let us denote Hϕ = (Hϕ t )t 0 the stochastic process deﬁned as Hϕ 0 = . . . = Hϕ m+d = 0 and Hϕ t = ϕ(X1:m, X(t d):(t 1)) for t d + m + 1. Clearly, Hϕ t Ft 1 for all t Z+. From (79), one can check that the predictable quadratic variation Hϕ Ma of the martingale Hϕ Ma is given by Hϕ Ma 0 = . . . = Hϕ Ma m+d = 0 and for t m + d + 1,

s=m+d+1 ϕ2(X1:m, X(s d):(s 1))p a|X(s d):(s 1) 1 p a|X(s d):(s 1) .

(82) As a direct consequence of Proposition 30, we derive the following result.

Corollary 31 Let X1:n be a sample from a MTD model of order d with set of relevant lags Λ. Let ˆΛm be an estimator of Λ computed from X1:m, where n > m. For each x AJ d, 1K, a A and S J d, 1K, let ˆpm,n(a|x S) be the empirical transition probability deﬁned in (7) computed from Xm+1:n. Then for any S J d, 1K such that Λ S, ε > 0, α > 0 and n m + d + 1, we have

ˆΛm = S, |ˆpm,n(a|x S) p(a|xΛ)|

2α(1 + ε)p(a|xΛ)(1 p(a|xΛ))

+ α 3 Nm,n(x S)

2 log(n m d + 1)

e αP Nm,n(x S) > 0, ˆΛm = S . (83)

In particular,

Λ ˆΛm, |ˆpm,n(a|xˆΛm) p(a|xΛ)|

2α(1 + ε)p(a|xΛ)(1 p(a|xΛ))

+ α 3 Nm,n(xˆΛm)

2 log(n m d + 1)

e αP Nm,n(xˆΛm) > 0, Λ ˆΛm . (84)

Sparse Markov Models for High-dimensional Inference

Proof Summing in both sides of (83) over S J d, 1K such that Λ S, we obtain inequality (84). Thus, it remains to prove (83). To that end, take ϕ(X1:m, X(t d):(t 1)) = 1{ˆΛm = S}1{Xt+j = xj, j S} and notice that in this case

Hϕ Ma n = 1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)).

So, if either p(a|xΛ) = 0 or p(a|xΛ) = 1, then we have necessarily Hϕ Ma n = 0 for all n m + d + 1, which implies that almost surely for all n m + d + 1,

Hϕ Ma n = 1{ˆΛm = S}( Nm,n(x S, a) Nm,n(x S)p(a|xΛ)) = 0.

By noticing that Hϕ Ma n = 1{ˆΛm = S} Nm,n(x S)(ˆpm,n(a|x S) p(a|xΛ)), it follows that, on the event {ˆΛm = S, Nm,n(x S) > 0}, we must have ˆpm,n(a|x S) = p(a|xΛ) almost surely and so the left-hand side of (83) is 0 and the result holds trivially. Let us now suppose 0 < p(a|xΛ) < 1. In this case, we apply Proposition 30 with w = p(a|xΛ)(1 p(a|xΛ)), v = (n m d)w and b = 1 to deduce that,

P ˆΛm = S, N n,m(x S)(ˆpm,n(a|x S) p(a|xΛ)) q

2α(1 + ε)p(a|xΛ)(1 p(a|xΛ)) Nm,n(x S) + α

3 , Nm,n(x S) > 0

log(n m d + 1)

e αP(1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0). (85)

To conclude the proof, observe that in this case

{1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0} = {ˆΛm = S, Nm,n(x S) > 0},

and use again Proposition 30 with H ϕ Ma in the place of Hϕ Ma (by noting also that Hϕ Ma = H ϕ Ma ).

Remark 32 Let us brieﬂy comment on the results of Corollary 31. Suppose x AJ d, 1K

and a A are such that 0 < p(a|xΛ) < 1 and also that ˆΛm is a consistent estimator of Λ. By

the CLT for aperiodic and irreducible Markov Chains it follows that q

Nm,n(xΛ)(ˆpm,n(a|xΛ)

p(a|xΛ))1{ˆΛm = Λ, Nm,n(xΛ) > 0} converges in distribution (as min{m, n} ) to a centered Gaussian random variable with variance p(a|xΛ)(1 p(a|xΛ)). This implies that for suﬃciently large n,

P (ˆpm,n(a|xΛ) p(a|xΛ)

2αp(a|xΛ)(1 p(a|xΛ))

Nm,n(xΛ) |ˆΛm = Λ, Nm,n(xΛ) > 0

Let us compare this heuristic argument with Corollary 31 applied to S = Λ. In this case, in Inequality (83), the variance term q

2α(1+ε)p(a|xΛ)(1 p(a|xΛ))

Nm,n(xΛ) can be made arbitrarily close to

Ost and Takahashi

optimal value q

2αp(a|xΛ)(1 p(a|xΛ))

Nm,n(xΛ) , at the cost 1 log(1+ε). Both the linear term α Nm,n(xΛ) and the log(n m d + 1) factor are the price to pay to achieve the result which holds every n m+d+1 and reﬂect the fact that ˆpm,n(a|xΛ) p(a|xΛ) is Gaussian only asymptotically. In particular, Corollary 31 improves the Economical Freedman s Inequality (as stated in (Oliveira, 2015) - Lemma 5, Inequality (42)) when restricted to the martingale Hϕ Ma.

In the sequel, let us denote P a = (P a t )t N, for each a A, the stochastic process deﬁned as P a t = p(a|X(t d:t 1)) for each t N. With this notation, notice that

s=0 H2 s p(a|X(s 1):(s d)), t N, (86)

is such that H Ma t H2 P a t for all t N. In particular, Proposition 28 holds if we replace H Ma t by H2 P a t . A closer inspection of the proof of Proposition 30 reveals that this proposition also holds with H2 P a t in the place of H Ma t. In the next theorem, we show that we can replace H2 P a t by a linear transformation of its empirical version which is crucial for our analysis.

Theorem 33 Let H Ma = (H Ma t )t N be the stochastic process deﬁned in (78). Suppose that B b for some b > 0. For any ﬁxed µ (0, 3) satisfying µ > ψ(µ) = exp(µ) µ 1 and α > 0, deﬁne for t N,

H2 ˆP a t = µ µ ψ(µ)

s=0 H2 s ˆP a s + b2α µ ψ(µ),

where ˆP a s = 1{Xs = a} for all s N. Then, for any ﬁxed ϵ > 0 and v > w > 0, we have for any t N,

2(1 + ϵ)αH2 ˆP a t + αb

3 , w H2 ˆP a t v

2 log(v/w + 1)

exp ( α) P( H Ma t > 0).

Proof We prove only the case b = 1. Note that H2 Ma t = H2 P a t H2 ˆP a t . Also recall that H Ma t H2 P a t . We now proceed to the proof. We ﬁrst use Inequality (80) for the martingale H2 Ma

together with that fact that H2 Ma t H4 P a t H2 P a t (the last inequality holds because b = 1) to deduce that for any µ > 0,

P H2 P a t H2 ˆP a t + ψ(µ)

µ H2 P a t + α

exp( α)P( H Ma t > 0),

which implies that for any µ (0, 3) satisfying µ > ψ(µ), it holds

P H2 P a t H2 ˆP a t exp( α)P( H Ma t > 0).

Sparse Markov Models for High-dimensional Inference

Hence, combining this inequality with (81), we conclude that for any λ (0, 3) and any µ (0, 3) satisfying µ > ψ(µ),

P H Ma t λ 2(1 λ/3)H2 ˆP a t + α

P H Ma t λ 2(1 λ/3)H2 ˆP a t + α

µ, H2 P a t H2 ˆP a t

+ P H2 P a t H2 ˆP a t 2 exp( α)P( H Ma t > 0),

where we also used in the last inequality the fact that H Ma t H2 P a t . To conclude the proof, we need to follow the same steps as Proposition 28 and then use the peeling argument as in the proof of Proposition 30 with H2 ˆP a t in the place of H Ma t.

As consequence of Theorem 33, we obtain the following result.

Proposition 34 Let X1:n be a sample from a MTD model of order d with set of relevant lags Λ. Let ˆΛm be an estimator of Λ computed from X1:m where n > m. For any x AJ d, 1K, a A and S J d, 1K, let ˆpm,n(a|x S) be the empirical transition probability deﬁned in (7) computed from Xm+1:n, and consider for α > 0 and µ (0, 3) satisfying µ > ψ(µ) = exp(µ) µ 1,

ˆVm,n(a, x, S) = µ µ ψ(µ) ˆpm,n(a|x S) + α µ ψ(µ) 1 Nm,n(x S).

Then for any S J d, 1K such that Λ S and n m + d + 1, we have

ˆΛm = S, |ˆpm,n(a|x S) p(a|xΛ)|

2α(1 + ϵ) ˆVm,n(a, x, S)

Nm,n(x S) + α 3 Nm,n(x S)

4 log(µ(n m d)/α + 2)

e αP ˆΛm = S, Nm,n(x S) > 0 . (87)

In particular,

Λ ˆΛm, |ˆpm,n(a|xˆΛm) p(a|xΛ)|

2α(1 + ϵ) ˆVm,n(a, x, ˆΛm)

Nm,n(xˆΛm) + α 3 Nm,n(xˆΛm)

4 log(µ(n m d)/α + 2)

e αP Λ ˆΛn, Nm,n(xˆΛm) > 0 . (88)

Proof Summing in both sides of (87) over S J d, 1K such that Λ S, we obtain inequality (88). Hence, it remains to show (87). Arguing as in Corollary 31 we need to consider only the case 0 < p(a|x) < 1.

Ost and Takahashi

By applying Theorem 33 with H = H ϕ where ϕ is as in the proof of Corollary 31, v = µ µ ψ(µ)(n m d) + α µ ψ(µ) and w = α µ ψ(µ), we obtain that

P ˆΛm = S, Nm,n(x S)|ˆpm,n(a|x S) p(a|xΛ)| q

2(1 + ϵ)α Vm,n(a, x, S) + α

4 log(µ(n m d)/α + 2)

e αP(1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0),

where Vm,n(a, x, S) = Nm,n(x S) ˆVm,n(a, x, S). By using that when 0 < p(a|x) < 1,

{1{ˆΛm = S} Nm,n(x S)p(a|xΛ)(1 p(a|xΛ)) > 0} = {ˆΛm = S, Nm,n(x S) > 0},

and the fact that Vm,n(a, x, S) = Nm,n(x S) ˆVm,n(a, x, S), we deduce (87) from the above inequality.

Andre Berchtold. Estimation in the mixture transition distribution model. Journal of Time Series Analysis, 22(4):379 397, 2001.

Andr e Berchtold and Adrian Raftery. The Mixture Transition Distribution Model for High Order Markov Chains and Non-Gaussian Time Series. Statistical Science, 17(3):328 356, 2002.

Guy Bresler. Eﬃciently learning ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 15, page 771 782. Association for Computing Machinery, 2015.

Peter B uhlmann and Abraham J Wyner. Variable length markov chains. The Annals of Statistics, 27(2):480 513, 1999.

Gyorgy Buzsaki and Andreas Draguhn. Neuronal oscillations in cortical networks. science, 304(5679):1926 1929, 2004.

J.R. Chazottes, S. Gallo, and D.Y. Takahashi. Optimal gaussian concentration bounds for stochastic chains of unbounded memory. Ar Xiv, 2020.

Yanqing Chen, Mingzhou Ding, and JA Scott Kelso. Long memory processes (1/f α type) in human coordination. Physical Review Letters, 79(22):4501, 1997.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954.

Imre Csisz ar and Zsolt Talata. Context tree estimation for not necessarily ﬁnite memory processes, via bic and mdl. IEEE Transactions on Information theory, 52(3):1007 1016, 2006.

Sparse Markov Models for High-dimensional Inference

Antonio Galves, Charlotte Galves, Jesus E Garcia, Nancy L Garcia, and Florencia Leonardi. Context tree selection and linguistic rhythm retrieval from written texts. The Annals of Applied Statistics, 6(1):186 209, 2012.

Jes us E Garcıa, Ver onica A Gonz alez-L opez, Rua Sergio Buarque de Holanda, and Cidade Universit aria-Barao Geraldo. Minimal markov models. In Fourth Workshop on Information Theoretic Methods in Science and Engineering, page 25, 2011.

David L Gilden, Thomas Thornton, and Mark W Mallon. 1/f noise in human cognition. Science, 267(5205):1837 1839, 1995.

Niels Richard Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1):83 143, 2015.

Matthew Heiner and Athanasios Kottas. Estimation and selection for high-order markov chains with bayesian mixture transition distribution models. Journal of Computational and Graphical Statistics, pages 1 13, 2021.

V ain o J a askinen, Jie Xiong, Jukka Corander, and Timo Koski. Sparse markov chains for sequence data. Scandinavian Journal of Statistics, 41(3):639 655, 2014.

Andrea Kir aly, Imre Bartos, and Imre M J anosi. Correlation properties of daily temperature anomalies over land. Tellus A: Dynamic Meteorology and Oceanography, 58(5):593 600, 2006.

Ioannis Kontoyiannis, Lambros Mertzanis, Athina Panotopoulou, Ioannis Papageorgiou, and Maria Skoularidou. Bayesian context trees: Modelling and exact inference for discrete time series. ar Xiv preprint ar Xiv:2007.14900, 2020.

Roberto Imbuzeiro Oliveira. Stochastic processes with random contexts: A characterization and adaptive estimators for the transition probabilities. IEEE Transactions on Information Theory, 61(12):6910 6925, 2015.

Adrian E. Raftery. A model for high-order markov chains. Journal of the Royal Statistical Society. Series B (Methodological), 47(3):528 539, 1985.

Maxim Raginsky and Igal Sason. Concentration of Measure Inequalities in Information Theory, Communications, and Coding: Second Edition. 2014.

Jorma Rissanen. A universal data compression system. IEEE Transactions on information theory, 29(5):656 664, 1983.

Abhra Sarkar and David B Dunson. Bayesian nonparametric modeling of higher order markov chains. Journal of the American Statistical Association, 111(516):1791 1803, 2016.

M.J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

Ost and Takahashi

Matthew C Wheeler, Harry H Hendon, Sam Cleland, Holger Meinke, and Alexis Donald. Impacts of the madden julian oscillation on australian rainfall and circulation. Journal of Climate, 22(6):1482 1498, 2009.

Naiming Yuan, Zuntao Fu, and Shida Liu. Long-term memory in climate variability: A new look based on fractional integral techniques. Journal of Geophysical Research: Atmospheres, 118(23):12 962, 2013.