On Learning Markov Chains

Yi Hao, Alon Orlitsky, and Venkatadheeraj Pichapati
Dept. of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093
yih179@ucsd.edu, alon@ucsd.edu, dheerajpv7@ucsd.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade it has attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the minimax risk (expected loss) of estimating an unknown k-state Markov chain from its n sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general f-divergence measure. For the first problem, we determine the minimax prediction risk to within a linear factor in the alphabet size, showing that it is $\Omega(k \log\log n / n)$ and $O(k^2 \log\log n / n)$. For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth f-divergences, including KL-, L2-, Chi-squared, Hellinger, and Alpha-divergences.

1 Introduction

Many natural phenomena are inherently probabilistic. With past observations at hand, probabilistic models can therefore help us predict, estimate, and understand future outcomes and trends. The two most fundamental probabilistic models for sequential data are i.i.d. processes and Markov chains. In an i.i.d. process, each sample $X_i$, $i \ge 1$, is generated independently according to the same underlying distribution. In a Markov chain, for each $i \ge 2$, the distribution of the sample $X_i$ is determined by the value of $X_{i-1}$ alone.

We confine our discussion to random processes over finite alphabets, without loss of generality assumed to be $[k] := \{1, 2, \ldots, k\}$. An i.i.d. process is defined by a single distribution $p$ over $[k]$, while a Markov chain is characterized by a transition probability matrix $M$ over $[k] \times [k]$. We denote the initial and stationary distributions of a Markov model by $\mu$ and $\pi$, respectively. For notational consistency, let $P = (p)$ denote an i.i.d. model and $P = (M)$ denote a Markov model.

Having observed a sample sequence $X^n := X_1, \ldots, X_n$ from an unknown i.i.d. process or Markov chain, a natural problem is to predict the next sample point $X_{n+1}$. Since $X_{n+1}$ is a random variable, this task is typically interpreted as estimating the conditional probability distribution $P_{x^n} := \Pr(X_{n+1} = \cdot \mid X^n = x^n)$ of the next sample point $X_{n+1}$. Let $[k]^*$ denote the collection of all finite-length sequences over $[k]$. Conditioning on $X^n = x^n$, our first objective is therefore to estimate the conditional distribution $P_{x^n}$. More precisely, we would like to find an estimator $\hat P$ that associates with every sequence $x^n \in [k]^*$ a distribution $\hat P_{x^n}$ over $[k]$ that approximates $P_{x^n}$ in a suitable sense.
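To make the two model classes concrete, here is a minimal sketch (Python with NumPy; the specific values of p, M, and the sequence are illustrative and not taken from the paper) contrasting the two prediction targets: for an i.i.d. process the conditional distribution $P_{x^n}$ is simply $p$, while for a Markov chain it is the row of $M$ indexed by the last observed state.

```python
import numpy as np

# Illustrative 3-state example (k = 3); p, M, and x are made up for demonstration.
p = np.array([0.2, 0.3, 0.5])            # i.i.d. model (p): every sample ~ p
M = np.array([[0.6, 0.2, 0.2],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])           # Markov model (M): row i is Pr(X_{t+1} = . | X_t = i)
x = [2, 0, 1, 1]                          # an observed sequence x^n (0-indexed states)

# Conditional distribution of the next sample X_{n+1} given X^n = x^n:
P_iid = p                                 # for an i.i.d. process it does not depend on x^n
P_markov = M[x[-1]]                       # for a Markov chain it depends only on the last state

print("i.i.d. P_{x^n}:", P_iid)
print("Markov P_{x^n}:", P_markov)
```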
Perhaps a more classical problem is parameter estimation, namely estimating the parameters that describe the underlying process. An i.i.d. process is completely characterized by $P_{x^n} = p$, hence for i.i.d. processes this problem coincides with the previous one. For Markov chains, we seek to estimate the transition matrix $M$. Therefore, instead of producing a probability distribution $\hat P_{x^n}$, the estimator $\hat M$ maps every sequence $x^n \in [k]^*$ to a transition matrix $\hat M_{x^n}$ over $[k] \times [k]$.

For two distributions $p$ and $q$ over $[k]$, let $L(p, q)$ be the loss when $p$ is approximated by $q$. For the prediction problem, we measure the performance of an estimator $\hat P$ in terms of its prediction risk,
$$\rho^L_n(P, \hat P) := \mathbb{E}_{X^n \sim P}\bigl[L(P_{X^n}, \hat P_{X^n})\bigr] = \sum_{x^n \in [k]^n} P(x^n)\, L(P_{x^n}, \hat P_{x^n}),$$
the expected loss with respect to the sample sequence $X^n$, where $P(x^n) := \Pr(X^n = x^n)$.

For the estimation problem, we quantify the performance of the estimator by its estimation risk. We first consider the expected loss of $\hat M$ with respect to a single state $i \in [k]$, namely $\mathbb{E}_{X^n \sim (M)}[L(M(i, \cdot), \hat M_{X^n}(i, \cdot))]$. We then define the estimation risk of $\hat M$ as the maximum expected loss over all states,
$$\varepsilon^L_n(M, \hat M) := \max_{i \in [k]} \mathbb{E}_{X^n \sim (M)}\bigl[L(M(i, \cdot), \hat M_{X^n}(i, \cdot))\bigr].$$

While the process $P$ we are trying to learn is unknown, it often belongs to a known collection $\mathcal{P}$. The worst-case prediction risk of an estimator $\hat P$ over all processes in $\mathcal{P}$ is
$$\rho^L_n(\mathcal{P}, \hat P) := \max_{P \in \mathcal{P}} \rho^L_n(P, \hat P).$$
The minimal possible worst-case prediction risk, or simply the minimax prediction risk, incurred by any estimator is
$$\rho^L_n(\mathcal{P}) := \min_{\hat P} \rho^L_n(\mathcal{P}, \hat P) = \min_{\hat P} \max_{P \in \mathcal{P}} \rho^L_n(P, \hat P).$$
The worst-case estimation risk $\varepsilon^L_n(\mathcal{P}, \hat M)$ and the minimax estimation risk $\varepsilon^L_n(\mathcal{P})$ are defined similarly. Given $\mathcal{P}$, our goals are to approximate the minimax prediction/estimation risk to within a universal constant factor, and to devise estimators that achieve this performance.

An alternative definition of the estimation risk, considered in [1] and mentioned by a reviewer, weights each state by its stationary probability,
$$\bar\varepsilon^L_n(M, \hat M) := \sum_{i \in [k]} \pi_i\, \mathbb{E}_{X^n \sim (M)}\bigl[L(M(i, \cdot), \hat M_{X^n}(i, \cdot))\bigr].$$
We denote the corresponding minimax estimation risk by $\bar\varepsilon^L_n(\mathcal{P})$.

Let $o(1)$ represent a quantity that vanishes as $n \to \infty$. In the following, we use $a \lesssim b$ to denote $a \le b(1 + o(1))$, and $a \approx b$ to denote $a \le b(1 + o(1))$ and $b \le a(1 + o(1))$.

For the collection $\mathrm{IID}_k$ of all i.i.d. processes over $[k]$, the above two formulations coincide and the problem is essentially the classical discrete distribution estimation problem. The problem of determining $\rho^L_n(\mathrm{IID}_k)$ was introduced by [2] and studied in a sequence of papers [3, 4, 5, 6, 7]. For fixed $k$ and KL-divergence loss, as $n$ goes to infinity, [7] showed that
$$\rho^{KL}_n(\mathrm{IID}_k) \approx \frac{k-1}{2n}.$$

KL-divergence and many other important similarity measures between two distributions can be expressed as f-divergences [8]. Let $f$ be a convex function with $f(1) = 0$; the f-divergence between two distributions $p$ and $q$ over $[k]$, whenever well-defined, is
$$D_f(p, q) := \sum_{i \in [k]} q(i)\, f\!\left(\frac{p(i)}{q(i)}\right).$$
Call an f-divergence ordinary if $f$ is thrice continuously differentiable over $(0, \infty)$, sub-exponential, namely $\lim_{x \to \infty} |f(x)|/e^{cx} = 0$ for all $c > 0$, and satisfies $f''(1) \ne 0$. Observe that all the following notable measures are ordinary f-divergences: Chi-squared divergence [9] from $f(x) = (x-1)^2$, KL-divergence [10] from $f(x) = x \log x$, Hellinger divergence [11] from $f(x) = (\sqrt{x} - 1)^2$, and Alpha-divergence [12] from $f_\alpha(x) := 4(1 - x^{(1+\alpha)/2})/(1 - \alpha^2)$, where $\alpha \ne \pm 1$.
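The sketch below (Python with NumPy; the helper names are ours, and q is assumed strictly positive) evaluates $D_f(p,q) = \sum_i q(i) f(p(i)/q(i))$ for the generators listed above, with a small numerical safeguard for $x \log x$ at $x = 0$.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p, q) = sum_i q(i) * f(p(i) / q(i)); assumes q(i) > 0 for all i."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

# Generators taken from the text (all are ordinary f-divergences, f(1) = 0):
f_chi2 = lambda x: (x - 1) ** 2                                   # Chi-squared
f_kl   = lambda x: np.where(x > 0, x * np.log(np.maximum(x, 1e-300)), 0.0)  # KL: x log x, 0 log 0 = 0
f_hel  = lambda x: (np.sqrt(x) - 1) ** 2                          # Hellinger
def f_alpha(alpha):                                                # Alpha-divergence, alpha != +-1
    return lambda x: 4 * (1 - x ** ((1 + alpha) / 2)) / (1 - alpha ** 2)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, f_kl), f_divergence(p, q, f_chi2),
      f_divergence(p, q, f_hel), f_divergence(p, q, f_alpha(0.5)))
```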
Related Work

For any f-divergence, we denote the corresponding minimax prediction risk for an n-element sample over the collection $\mathcal{P}$ by $\rho^f_n(\mathcal{P})$. Researchers in [13] considered the problem of determining $\rho^f_n(\mathrm{IID}_k)$ for ordinary f-divergences. Besides the above minimax formulation, researchers have recently also considered formulations that are more adaptive to the underlying i.i.d. process [14, 15].

Surprisingly, while the minimax risk of i.i.d. processes was addressed in a large body of work, the risk of Markov chains, which frequently arise in practice, was not studied until very recently. Let $\mathcal{M}_k$ denote the collection of all Markov chains over $[k]$. For prediction with KL-divergence, [16] showed that $\rho^{KL}_n(\mathcal{M}_k) = \Theta_k(\log\log n / n)$, but did not specify the dependence on $k$. For estimation, [17] considered the class of Markov chains whose pseudo-spectral gap is bounded away from 0 and approximated the $L_1$ estimation risk to within a $\log n$ factor. Some of their techniques, in particular the lower-bound construction in their displayed equation (4.3), are of a similar nature to ours and were derived independently of the results in Section 5 of our paper.

Our first main result determines the dependence of $\rho^{KL}_n(\mathcal{M}_k)$ on both $k$ and $n$, to within a factor of roughly $k$.

Theorem 1. The minimax KL-prediction risk of Markov chains satisfies
$$\frac{(k-1)\log\log n}{4en} \lesssim \rho^{KL}_n(\mathcal{M}_k) \lesssim \frac{2k^2 \log\log n}{n}.$$

Depending on $M$, some states may be observed very infrequently, or not at all. This does not drastically affect the prediction problem: states that are rarely reached also contribute little to the prediction risk $\rho^L_n(P, \hat P)$. For estimation, however, rare and unobserved states still need to be well approximated, hence $\varepsilon^L_n(\mathcal{M}_k)$ does not decrease with $n$; for example, $\varepsilon^{KL}_n(\mathcal{M}_k) = \log k$ for all $n$. We therefore parametrize the risk by the lowest probability in the transition matrix. For $\delta > 0$, let
$$\mathcal{M}_{k,\delta} := \{(M) : M_{i,j} \ge \delta \ \ \forall\, i, j\}$$
be the collection of Markov chains whose lowest transition probability is at least $\delta$. Since $\mathcal{M}_{k,\delta}$ is trivial for $\delta \ge 1/k$, we only consider $\delta \in (0, 1/k)$. We characterize the minimax estimation risk of $\mathcal{M}_{k,\delta}$ almost precisely.

Theorem 2. For all ordinary f-divergences and all $\delta \in (0, 1/k)$,
$$\bar\varepsilon^f_n(\mathcal{M}_{k,\delta}) \lesssim \frac{(k-1)k f''(1)}{2n}, \qquad \frac{(1-\delta)(k-2) f''(1)}{2n\delta} \lesssim \varepsilon^f_n(\mathcal{M}_{k,\delta}) \lesssim \frac{(k-1) f''(1)}{2n\delta}.$$

We can further refine the estimation-risk bounds by partitioning $\mathcal{M}_{k,\delta}$ based on the smallest probability in the chain's stationary distribution $\pi$. Clearly, $\min_{i \in [k]} \pi_i \le 1/k$. For $0 < \pi^* \le 1/k$ and $0 < \delta < 1/k$, let
$$\mathcal{M}_{k,\delta,\pi^*} := \bigl\{(M) : (M) \in \mathcal{M}_{k,\delta} \text{ and } \min_{i \in [k]} \pi_i = \pi^*\bigr\}$$
be the collection of all Markov chains in $\mathcal{M}_{k,\delta}$ whose lowest stationary probability is $\pi^*$. We determine the minimax estimation risk over $\mathcal{M}_{k,\delta,\pi^*}$ nearly precisely.

Theorem 3. For all ordinary f-divergences,
$$\frac{(1-\pi^*)(k-2)k f''(1)}{2n} \lesssim \bar\varepsilon^f_n(\mathcal{M}_{k,\delta,\pi^*}) \lesssim \frac{(k-1)k f''(1)}{2n}, \qquad \frac{(1-\pi^*)(k-2) f''(1)}{2n\pi^*} \lesssim \varepsilon^f_n(\mathcal{M}_{k,\delta,\pi^*}) \lesssim \frac{(k-1) f''(1)}{2n\pi^*}.$$

For the $L_2$-distance corresponding to the squared Euclidean norm, we prove the following risk bounds.

Theorem 4. For all $\delta \in (0, 1/k)$,
$$\bar\varepsilon^{L_2}_n(\mathcal{M}_{k,\delta}) \lesssim \frac{k-1}{n}, \qquad (1-\delta)^2 \left(1 - \frac{1}{k}\right) \frac{1}{n\delta} \lesssim \varepsilon^{L_2}_n(\mathcal{M}_{k,\delta}) \lesssim \left(1 - \frac{1}{k}\right) \frac{1}{n\delta}.$$

Theorem 5. For all $\delta \in (0, 1/k)$ and $\pi^* \in (0, 1/k]$,
$$(1-\pi^*)^2 \left(k - \frac{k}{k-1}\right) \frac{1}{n} \lesssim \bar\varepsilon^{L_2}_n(\mathcal{M}_{k,\delta,\pi^*}) \lesssim \frac{k-1}{n}, \qquad (1-\pi^*)^2 \left(1 - \frac{1}{k}\right) \frac{1}{n\pi^*} \lesssim \varepsilon^{L_2}_n(\mathcal{M}_{k,\delta,\pi^*}) \lesssim \left(1 - \frac{1}{k}\right) \frac{1}{n\pi^*}.$$

The rest of the paper is organized as follows. Section 2 introduces add-constant estimators and additional definitions and notation for Markov chains. Each of the above results consists of a lower bound and an upper bound. We prove the lower bounds by constructing suitable prior distributions over the relevant collections of processes; Sections 3 and 5 describe these prior distributions for the prediction and estimation problems, respectively. The upper bounds are derived via simple variants of the standard add-constant estimators; Sections 4 and 6 describe these estimators for the prediction and estimation problems, respectively. For space considerations, we relegate all proofs to the supplemental material.
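To get a quick numerical feel for the bounds above (and for the "Theo" curves in the experiments of Section 7), the following sketch evaluates the stated upper bounds as plain functions of k, n, δ, π*, and f''(1). It is written in Python with NumPy; the function names are ours, and the formulas merely transcribe the displayed upper bounds, so they carry no more precision than the theorems themselves.

```python
import numpy as np

def pred_kl_upper(k, n):
    """Theorem 1 upper bound on the minimax KL-prediction risk: 2 k^2 log log n / n."""
    return 2 * k**2 * np.log(np.log(n)) / n

def est_f_upper(k, n, delta, f2_at_1):
    """Theorem 2 upper bound on eps^f_n(M_{k,delta}): (k-1) f''(1) / (2 n delta)."""
    return (k - 1) * f2_at_1 / (2 * n * delta)

def est_f_upper_pistar(k, n, pi_star, f2_at_1):
    """Theorem 3 upper bound on eps^f_n(M_{k,delta,pi*}): (k-1) f''(1) / (2 n pi*)."""
    return (k - 1) * f2_at_1 / (2 * n * pi_star)

# f''(1) for the listed generators: KL -> 1, Chi-squared -> 2, Hellinger -> 1/2.
for n in [10_000, 100_000]:
    print(n, pred_kl_upper(6, n), est_f_upper(6, n, 0.05, 1.0), est_f_upper_pistar(6, n, 0.05, 1.0))
```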
2 Definitions and Notation

2.1 Add-constant estimators

Given a sample sequence $X^n$ from an i.i.d. process $(p)$, let $N_i$ denote the number of times symbol $i$ appears in $X^n$. The classical empirical estimator estimates $p$ by
$$\hat p_{X^n}(i) := \frac{N_i}{n}, \quad i \in [k].$$
The empirical estimator performs poorly under loss measures such as the KL-divergence. For example, if $p$ assigns a tiny probability to a symbol, so that the symbol is unlikely to appear in $X^n$, then with high probability the KL-divergence between $p$ and $\hat p_{X^n}$ will be infinite. A common solution applies the Laplace smoothing technique [18], which assigns to each symbol $i$ a probability proportional to $N_i + \beta$, where $\beta > 0$ is a fixed constant. The resulting add-$\beta$ estimator is denoted by $\hat p^{+\beta}$. Due to their simplicity and effectiveness, add-$\beta$ estimators are widely used in machine learning algorithms such as naive Bayes classifiers [19]. As shown in [7], for i.i.d. processes, a variant of the add-3/4 estimator achieves the minimax estimation risk $\rho^{KL}_n(\mathrm{IID}_k)$.

Analogously, given a sample sequence $X^n$ generated by a Markov chain, let $N_{ij}$ denote the number of times symbol $j$ appears right after symbol $i$ in $X^n$, and let $N_i$ denote the number of times symbol $i$ appears in $X^{n-1}$. We define the add-$\beta$ estimator $\hat M^{+\beta}$ by
$$\hat M^{+\beta}_{X^n}(i, j) := \frac{N_{ij} + \beta}{N_i + k\beta}, \quad i, j \in [k].$$

2.2 More on Markov chains

Adopting the notation of [20], let $\Delta_k$ denote the collection of discrete distributions over $[k]$. Let $[k]_e$ and $[k]_o$ be the collections of even and odd integers in $[k]$, respectively. By convention, for a Markov chain over $[k]$, we call each symbol $i \in [k]$ a state. Given a Markov chain, the hitting time $\tau(j)$ is the first time the chain reaches state $j$, and we denote by $\Pr_i(\tau(j) = t)$ the probability that, starting from state $i$, the hitting time of $j$ is exactly $t$. For a Markov chain $(M)$, we denote by $P^t$ the distribution of $X_t$ when $X^n \sim (M)$. Additionally, for a fixed Markov chain $(M)$, the mixing time $t_{\mathrm{mix}}$ is the smallest index $t$ such that $L_1(P^t, \pi) < 1/2$. Finally, for notational convenience, we write $M_{ij}$ instead of $M(i, j)$ whenever appropriate.
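Before moving on, here is a minimal sketch of the add-β estimators defined in Section 2.1 (Python with NumPy; states are 0-indexed and the function names are ours): N_i counts occurrences of i, N_ij counts transitions from i to j, and each count is smoothed by β.

```python
import numpy as np

def add_beta_iid(x, k, beta):
    """Add-beta estimate of an i.i.d. distribution: (N_i + beta) / (n + k*beta)."""
    counts = np.bincount(x, minlength=k).astype(float)
    return (counts + beta) / (len(x) + k * beta)

def add_beta_markov(x, k, beta):
    """Add-beta estimate of a transition matrix: M[i,j] = (N_ij + beta) / (N_i + k*beta)."""
    N = np.zeros((k, k))
    for a, b in zip(x[:-1], x[1:]):       # N_ij: times j follows i; N_i = sum_j N_ij
        N[a, b] += 1.0
    return (N + beta) / (N.sum(axis=1, keepdims=True) + k * beta)

x = np.array([0, 1, 1, 2, 0, 0, 1])
print(add_beta_iid(x, 3, 0.75))           # add-3/4 estimator, as in [7] for i.i.d. processes
print(add_beta_markov(x, 3, 0.5))         # add-1/2 rows, as used for prediction in Section 4
```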
3 Minimax prediction: lower bound

A standard lower-bound argument for the minimax prediction risk uses the fact that
$$\rho^{KL}_n(\mathcal{P}) = \min_{\hat P} \max_{P \in \mathcal{P}} \rho^{KL}_n(P, \hat P) \ge \min_{\hat P} \mathbb{E}_{P \sim \Pi}\bigl[\rho^{KL}_n(P, \hat P)\bigr]$$
for any prior distribution $\Pi$ over $\mathcal{P}$. One advantage of this approach is that the optimal estimator minimizing $\mathbb{E}_{P \sim \Pi}[\rho^{KL}_n(P, \hat P)]$ can often be computed explicitly. Perhaps the simplest prior is the uniform distribution $U(\mathcal{P}_S)$ over a subset $\mathcal{P}_S \subseteq \mathcal{P}$. Let $\hat P$ be the optimal estimator minimizing $\mathbb{E}_{P \sim U(\mathcal{P}_S)}[\rho^{KL}_n(P, \hat P)]$. Computing $\hat P$ for all possible sample sequences $x^n$ may be unrealistic. Instead, letting $K_n$ be an arbitrary subset of $[k]^n$, we can lower bound $\rho^{KL}_n(P, \hat P) = \mathbb{E}_{X^n \sim P}[D_{KL}(P_{X^n}, \hat P_{X^n})]$ by
$$\rho^{KL}_n(P, \hat P; K_n) := \mathbb{E}_{X^n \sim P}\bigl[D_{KL}(P_{X^n}, \hat P_{X^n})\,\mathbb{1}_{X^n \in K_n}\bigr].$$
Hence,
$$\rho^{KL}_n(\mathcal{P}) \ge \min_{\hat P} \mathbb{E}_{P \sim U(\mathcal{P}_S)}\bigl[\rho^{KL}_n(P, \hat P; K_n)\bigr].$$
The key to applying the above argument is to find a proper pair $(\mathcal{P}_S, K_n)$.

Without loss of generality, assume that $k$ is even. Let $a := 1/n$ and $b := 1 - (k-2)/n$, and define
$$M_n(p_2, p_4, \ldots, p_k) := \begin{pmatrix} b-a & a & a & a & \cdots & a & a \\ p_2 & b-p_2 & a & a & \cdots & a & a \\ a & a & b-a & a & \cdots & a & a \\ a & a & p_4 & b-p_4 & \cdots & a & a \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ a & a & a & a & \cdots & b-a & a \\ a & a & a & a & \cdots & p_k & b-p_k \end{pmatrix},$$
so that every odd state $2i-1$ stays at itself with probability $b-a$, while every even state $2i$ moves to state $2i-1$ with probability $p_{2i}$ and stays at itself with probability $b-p_{2i}$. In addition, let
$$V_n := \left\{\frac{1}{\log^t n} : t \in \mathbb{N},\ 1 \le t \le \frac{\log n}{2\log\log n}\right\},$$
and let $u_k$ denote the uniform distribution over $[k]$. Finally, given $n$, define
$$\mathcal{P}_S = \bigl\{(M) \in \mathcal{M}_k : \mu = u_k \text{ and } M = M_n(p_2, p_4, \ldots, p_k), \text{ where } p_i \in V_n \ \forall\, i \in [k]_e\bigr\}.$$
Next, let $K_n$ be the collection of sequences $x^n \in [k]^n$ whose last appearing state did not transition to any other state. For example, 3132 and 31322 are in $K_n$, but 21323 is not. In other words, for any state $i \in [k]$, let $\bar i$ represent an arbitrary state in $[k] \setminus \{i\}$; then
$$K_n = \bigl\{x^n \in [k]^n : x^n = \bar i^{\,n-\ell}\, i^{\,\ell},\ i \in [k],\ 1 \le \ell \le n-1\bigr\}.$$

4 Minimax prediction: upper bound

For the $K_n$ defined in the last section,
$$\rho^{KL}_n(P, \hat P; K_n) = \sum_{x^n \in K_n} P(x^n)\, D_{KL}(P_{x^n}, \hat P_{x^n}).$$
We denote the partial minimax prediction risk over $K_n$ by
$$\rho^{KL}_n(\mathcal{P}; K_n) := \min_{\hat P} \max_{P \in \mathcal{P}} \rho^{KL}_n(P, \hat P; K_n).$$
Let $\bar K_n := [k]^n \setminus K_n$, and define $\rho^{KL}_n(P, \hat P; \bar K_n)$ and $\rho^{KL}_n(\mathcal{P}; \bar K_n)$ in the same manner. As a consequence of $\hat P$ being a function from $[k]^n$ to $\Delta_k$, we have the following triangle inequality:
$$\rho^{KL}_n(\mathcal{P}) \le \rho^{KL}_n(\mathcal{P}; K_n) + \rho^{KL}_n(\mathcal{P}; \bar K_n).$$
Turning back to Markov chains, let $\hat P^{+1/2}$ denote the estimator that maps $X^n \sim (M)$ to $\hat M^{+1/2}_{X^n}(X_n, \cdot)$. One can show that
$$\rho^{KL}_n(\mathcal{M}_k; \bar K_n) \le \max_{P \in \mathcal{M}_k} \rho^{KL}_n(P, \hat P^{+1/2}; \bar K_n) \le O_k\!\left(\frac{1}{n}\right).$$
Recall the lower bound $\rho^{KL}_n(\mathcal{M}_k) = \Omega_k(\log\log n / n)$. Together with the above upper bound on $\rho^{KL}_n(\mathcal{M}_k; \bar K_n)$ and the triangle inequality, this shows that an upper bound on $\rho^{KL}_n(\mathcal{M}_k; K_n)$ suffices to bound the leading term of $\rho^{KL}_n(\mathcal{M}_k)$. The following construction yields such an upper bound. We partition $K_n$ according to the last appearing state and the number of times it transitions to itself:
$$K_n = \bigcup_{i \in [k]} \bigcup_{\ell=1}^{n-1} K_\ell(i), \quad \text{where } K_\ell(i) := \{x^n \in [k]^n : x^n = \bar i^{\,n-\ell}\, i^{\,\ell}\}.$$
For any $x^n \in K_n$, there is a unique $K_\ell(i)$ such that $x^n \in K_\ell(i)$. Consider the estimator that, for $x^n \in K_\ell(i)$, sets
$$\hat P_{x^n}(i) := 1 - \frac{1}{\ell \log\frac{n}{\ell}} \quad \text{and} \quad \hat P_{x^n}(j) := \frac{1 - \hat P_{x^n}(i)}{k-1},\quad j \in [k] \setminus \{i\}.$$
We can show that
$$\rho^{KL}_n(\mathcal{M}_k; K_n) \le \max_{P \in \mathcal{M}_k} \rho^{KL}_n(P, \hat P; K_n) \le \frac{2k^2 \log\log n}{n}.$$
The upper-bound proof applies the following lemma, which uniformly bounds the hitting probabilities of any k-state Markov chain.

Lemma 1 ([21]). For any Markov chain over $[k]$ and any two states $i, j \in [k]$, if $n > k$, then
$$\Pr\nolimits_i(\tau(j) = n) \le \frac{k}{n}.$$
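The sketch below (Python with NumPy; helper names are ours) operationalizes the decomposition just described: it finds the terminal run length ℓ of the last state, checks whether x^n lies in K_n, and then applies the run-length rule in the reconstructed form given above, falling back to the add-1/2 estimate of the last state's row on the complement. The clipping is a numerical safeguard of ours for very short sequences; the paper's exact rule may differ in such corner cases.

```python
import numpy as np

def terminal_run(x):
    """Last state i of x^n and the length l of the run of i's that ends the sequence."""
    i, l = x[-1], 1
    while l < len(x) and x[-1 - l] == i:
        l += 1
    return i, l

def in_Kn(x):
    """x^n is in K_n iff its last state never transitioned to another state,
    i.e. it occurs only inside the terminal run."""
    i, l = terminal_run(x)
    return i not in x[:len(x) - l]

def predict(x, k):
    """Prediction rule of Sections 3-4 in the reconstructed form above."""
    n = len(x)
    i, l = terminal_run(x)
    if in_Kn(x) and l < n:
        # run-length rule on K_n, clipped so it remains a valid probability for small n
        pi = np.clip(1.0 - 1.0 / (l * np.log(n / l)), 1.0 / k, 1.0)
    else:
        # add-1/2 estimate of row i of the transition matrix on the complement of K_n
        Ni = sum(1 for a in x[:-1] if a == i)
        Nii = sum(1 for a, b in zip(x[:-1], x[1:]) if a == i and b == i)
        pi = (Nii + 0.5) / (Ni + k * 0.5)
    P = np.full(k, (1.0 - pi) / (k - 1))
    P[i] = pi
    return P

print(predict([2, 0, 2, 1, 1], k=3))   # in K_n: last state 1 appears only in the final run
print(predict([1, 0, 2, 1, 1], k=3))   # not in K_n: state 1 already left itself earlier
```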
5 Minimax estimation: lower bound

Analogously to Section 3, we use the following standard argument to lower bound the minimax risk:
$$\varepsilon^L_n(\mathcal{M}) = \min_{\hat M} \max_{(M) \in \mathcal{M}} \varepsilon^L_n(M, \hat M) \ge \min_{\hat M} \mathbb{E}_{(M) \sim U(\mathcal{M}_S)}\bigl[\varepsilon^L_n(M, \hat M)\bigr],$$
where $\mathcal{M}_S \subseteq \mathcal{M}$ and $U(\mathcal{M}_S)$ is the uniform distribution over $\mathcal{M}_S$. Setting $\mathcal{M} = \mathcal{M}_{k,\delta,\pi^*}$, we outline the construction of $\mathcal{M}_S$ as follows. Let $u_{k-1}$ be the uniform distribution over $[k-1]$. As in [13], denote the $L_\infty$ ball of radius $r$ around $u_{k-1}$ by
$$B_{k-1}(r) := \{p \in \Delta_{k-1} : L_\infty(p, u_{k-1}) < r\},$$
where $L_\infty(\cdot, \cdot)$ is the $L_\infty$ distance between two distributions. Let $\bar\pi := 1 - \pi^*$, and for $p^* := (p_1, p_2, \ldots, p_{k-1})$ with $\sum_{i=1}^{k-1} p_i = 1$, define
$$M_n(p^*) := \begin{pmatrix} \frac{\bar\pi}{k-1} & \frac{\bar\pi}{k-1} & \cdots & \frac{\bar\pi}{k-1} & \pi^* \\ \frac{\bar\pi}{k-1} & \frac{\bar\pi}{k-1} & \cdots & \frac{\bar\pi}{k-1} & \pi^* \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \frac{\bar\pi}{k-1} & \frac{\bar\pi}{k-1} & \cdots & \frac{\bar\pi}{k-1} & \pi^* \\ \bar\pi p_1 & \bar\pi p_2 & \cdots & \bar\pi p_{k-1} & \pi^* \end{pmatrix}.$$
Given $n$ and $\epsilon \in (0, 1)$, let $n' := (n(1+\epsilon)\pi^*)^{1/5}$. We set
$$\mathcal{M}_S = \bigl\{(M) \in \mathcal{M}_{k,\delta,\pi^*} : \mu = \pi \text{ and } M = M_n(p^*), \text{ where } p^* \in B_{k-1}(1/n')\bigr\}.$$
Note that the uniform distribution $U(\mathcal{M}_S)$ over $\mathcal{M}_S$ is induced by $U(B_{k-1}(1/n'))$, the uniform distribution over $B_{k-1}(1/n')$, and is thus well-defined. An important property of the above construction is that for a sample sequence $X^n \sim (M) \in \mathcal{M}_S$, the number of times $N_k$ that state $k$ appears in $X^n$ is a binomial random variable with parameters $n$ and $\pi^*$. Therefore, by the following lemma, $N_k$ is highly concentrated around its mean $n\pi^*$.

Lemma 2 ([22]). Let $Y$ be a binomial random variable with parameters $m \in \mathbb{N}$ and $p \in [0, 1]$. Then for any $\epsilon \in (0, 1)$,
$$\Pr\bigl(Y \ge (1+\epsilon)mp\bigr) \le \exp\bigl(-\epsilon^2 mp / 3\bigr).$$

In order to prove the lower bound on $\bar\varepsilon^f_n(\mathcal{M}_{k,\delta,\pi^*})$, we only need to modify the above construction: instead of drawing only the last row of the transition matrix $M_n(p^*)$ from the distribution induced by $U(B_{k-1}(1/n'))$, we draw all rows independently in the same fashion. The proof is omitted due to its similarity.

6 Minimax estimation: upper bound

The proof of the upper bound relies on a concentration inequality for Markov chains in $\mathcal{M}_{k,\delta}$, which can be informally expressed as
$$\Pr\bigl(|N_i - (n-1)\pi(i)| > t\bigr) \le \Theta_\delta\bigl(\exp(-\Theta_\delta(t^2/n))\bigr).$$
Note that this inequality is very similar to Hoeffding's inequality for i.i.d. processes. The difficulty in analyzing the performance of the original add-$\beta$ estimator is that the chain's initial distribution may be far from its stationary distribution, making it hard to obtain simple expressions for $\mathbb{E}[N_i]$ and $\mathbb{E}[N_{ij}]$. To overcome this difficulty, we ignore the first few sample points and construct a new add-$\beta$ estimator from the remaining ones. Specifically, let $X^n$ be a length-$n$ sample sequence drawn from the Markov chain $(M)$. Removing the first $m$ sample points, $X^n_{m+1} := X_{m+1}, \ldots, X_n$ can be viewed as a length-$(n-m)$ sample sequence drawn from $(M)$ with an initial distribution $\mu'$ satisfying $L_1(\mu', \pi) < 2(1-\delta)^{m-1}$. Let $m = \sqrt{n}$. For sufficiently large $n$, $L_1(\mu', \pi) \le 1/n^2$ and $n - \sqrt{n} \approx n$. Hence, without loss of generality, we assume that the original initial distribution $\mu$ already satisfies $L_1(\mu, \pi) < 1/n^2$; if not, we simply replace $X^n$ by $X^n_{\sqrt{n}+1}$.

To prove the desired upper bound for ordinary f-divergences, it suffices to use the add-$\beta$ estimator
$$\hat M^{+\beta}_{X^n}(i, j) := \frac{N_{ij} + \beta}{N_i + k\beta}, \quad i, j \in [k].$$
For the $L_2$-distance, instead of an add-constant estimator, we apply an add-$\sqrt{N_i}/k$ estimator,
$$\hat M^{+\sqrt{N_i}/k}_{X^n}(i, j) := \frac{N_{ij} + \sqrt{N_i}/k}{N_i + \sqrt{N_i}}, \quad i, j \in [k].$$

7 Experiments

We augment the theory with experiments that demonstrate the efficacy of the proposed estimators and validate the functional form of the derived bounds. We briefly describe the experimental setup. For the first three figures, $k = 6$, $\delta = 0.05$, and $10{,}000 \le n \le 100{,}000$. For the last figure, $\delta = 0.01$, $n = 100{,}000$, and $4 \le k \le 36$. In all experiments, the initial distribution $\mu$ of the Markov chain is drawn from the k-Dirichlet(1) distribution. For the transition matrix $M$, we first construct a matrix $M'$ whose rows are drawn independently from the k-Dirichlet(1) distribution. To ensure that each element of $M$ is at least $\delta$, we let $J_k$ denote the $k \times k$ all-ones matrix and set $M = M'(1 - k\delta) + \delta J_k$. We generate a new Markov chain for each curve in the plots, and each data point on a curve shows the average loss over 100 independent restarts of the same Markov chain.
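To connect the estimators of Section 6 with this setup, here is a compact simulation sketch (Python with NumPy): the chain construction M = M'(1 - kδ) + δJ_k and the add-√(N_i)/k estimator follow the text, while the single-run loss computation and the uniform fallback for unvisited rows are simplifications of ours rather than the paper's full protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chain(k, delta):
    """Rows of M' ~ Dirichlet(1,...,1), mixed with the all-ones matrix so that M_ij >= delta."""
    M_prime = rng.dirichlet(np.ones(k), size=k)
    return M_prime * (1 - k * delta) + delta

def sample_path(M, mu, n):
    """Draw X^n from the Markov chain (M) started from the initial distribution mu."""
    k = M.shape[0]
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(k, p=mu)
    for t in range(1, n):
        x[t] = rng.choice(k, p=M[x[t - 1]])
    return x

def transition_counts(x, k):
    N = np.zeros((k, k))
    for a, b in zip(x[:-1], x[1:]):
        N[a, b] += 1.0
    return N

def add_one(N, k):
    """Add-constant estimator with beta = 1."""
    return (N + 1.0) / (N.sum(axis=1, keepdims=True) + k)

def add_sqrt(N, k):
    """Proposed add-sqrt(N_i)/k estimator; unvisited rows (unlikely here) fall back to uniform."""
    Ni = N.sum(axis=1, keepdims=True)
    out = np.full_like(N, 1.0 / k)
    seen = Ni[:, 0] > 0
    out[seen] = (N[seen] + np.sqrt(Ni[seen]) / k) / (Ni[seen] + np.sqrt(Ni[seen]))
    return out

k, delta, n = 6, 0.05, 10_000
M = random_chain(k, delta)
x = sample_path(M, rng.dirichlet(np.ones(k)), n)
N = transition_counts(x, k)
for name, Mhat in [("add-1", add_one(N, k)), ("add-sqrt(Ni)/k", add_sqrt(N, k))]:
    worst_L2 = np.max(np.sum((M - Mhat) ** 2, axis=1))   # worst-state squared-L2 loss
    print(name, worst_L2)
```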
The plots use the following abbreviations: Theo for theoretical minimax-risk values; Real for experimental results obtained with the estimators described in Sections 4 and 6; Pre for average prediction loss and Est for average estimation loss; Const for add-constant estimators; Prop for the proposed add-$\sqrt{N_i}/k$ estimator described in Section 6; and Hell, Chi, and Alpha(c) for the Hellinger divergence, Chi-squared divergence, and Alpha-divergence with parameter c, respectively. In the first three plots, the theoretical minimax curves are precisely the upper bounds in the corresponding theorems, except that for the prediction curve in Figure 1a the constant factor 2 in the upper bound is adjusted to 1/2 to better fit the experiments. Note the excellent fit between the theoretical bounds and the experimental results.

Figure 1a shows the decay of the experimental and theoretical KL-prediction and KL-estimation losses with the sample size $n$. Figure 1b compares the $L_2$-estimation losses of our proposed estimator and the add-one estimator with the theoretical minimax values. Figure 1c compares the experimental estimation losses with the theoretical minimax-risk values for different loss measures. Finally, Figure 1d presents an experiment on KL-learning losses that scales $k$ up while $n$ is fixed. All four plots demonstrate that our theoretical results are accurate and can be used to estimate the loss incurred in learning Markov chains. Additionally, Figure 1b shows that our proposed add-$\sqrt{N_i}/k$ estimator is uniformly better than the traditional add-one estimator across the considered sample sizes $n$. We have also considered add-constant estimators with constants varying from 2 to 10, and our proposed estimator outperformed all of them.

Figure 1: Experiments. (a) KL-prediction and estimation losses; (b) $L_2$-estimation losses for different estimators; (c) Hellinger, Chi-squared, and Alpha-estimation losses; (d) fixed $n$ and varying $k$.

8 Conclusions

We studied the problem of learning an unknown k-state Markov chain from its n sequential sample points. We considered two formulations: prediction and estimation. For prediction, we determined the minimax risk up to a multiplicative factor of k. For estimation, when the transition probabilities are bounded away from zero, we obtained nearly matching lower and upper bounds on the minimax risk for the $L_2$ distance and ordinary f-divergences. The effectiveness of the proposed estimators was verified through experimental simulations. Future directions include closing the factor-k gap in the prediction problem of Section 1, extending the minimax estimation results to other classes of Markov chains, and extending the work from the classical setting $k \ll n$ to general $k$ and $n$.

References

[1] Moein Falahatgar, Mesrob I. Ohannessian, and Alon Orlitsky. Near-optimal smoothing of structured conditional probability matrices. In Advances in Neural Information Processing Systems (NIPS), pages 4860-4868, 2016.
[2] Edgar Gilbert. Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17(3):304-314, 1971.
[3] Thomas Cover. Admissibility properties of Gilbert's encoding for unknown source probabilities (Corresp.). IEEE Transactions on Information Theory, 18(1):216-217, 1972.
[4] Raphail Krichevsky and Victor Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199-207, 1981.
[5] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U. Simon. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In International Conference on Algorithmic Learning Theory, pages 380-394. Springer, 2002.
[6] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. In Advances in Neural Information Processing Systems, pages 1033-1040, 2004.
[7] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187-206, 2004.
[8] I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungarica, 2:299-318, 1967.
[9] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10-13, 2014.
[10] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.
[11] Mikhail S. Nikulin. Hellinger distance.
Encyclopedia of Mathematics, 151, 2001.
[12] Gavin E. Crooks. On measures of entropy and information. Tech. Note 9, v4, 2017.
[13] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Annual Conference on Learning Theory (COLT), pages 1066-1100, 2015.
[14] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems, pages 2143-2151, 2015.
[15] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In 48th Annual ACM Symposium on Theory of Computing, pages 142-155, 2016.
[16] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning Markov distributions: Does estimation trump compression? In IEEE International Symposium on Information Theory (ISIT), pages 2689-2693, 2016.
[17] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. arXiv:1809.05014, 2018.
[18] Kai Lai Chung and Farid Ait Sahlia. Elementary Probability Theory: With Stochastic Processes and an Introduction to Mathematical Finance. Springer Science & Business Media, 2012.
[19] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[20] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical Society, 2017.
[21] James Norris, Yuval Peres, and Alex Zhai. Surprise probabilities in Markov chains. Combinatorics, Probability and Computing, 26(4):603-627, 2017.
[22] Fan R. K. Chung and Linyuan Lu. Complex Graphs and Networks. American Mathematical Society, 2006.