# predictive_learning_on_hidden_treestructured_ising_models__17602149.pdf Journal of Machine Learning Research 22 (2021) 1-82 Submitted 2/19; Published 2/21 Predictive Learning on Hidden Tree-Structured Ising Models Konstantinos E. Nikolakakis k.nikolakakis@rutgers.edu Department of Electrical & Computer Engineering Rutgers, The State University of New Jersey 94 Brett Road, Piscataway, NJ 08854, USA Dionysios S. Kalogerias dionysis@msu.edu Department of Electrical & Computer Engineering Michigan State University 428 S. Shaw Lane, MI 48824, USA Anand D. Sarwate anand.sarwate@rutgers.edu Department of Electrical & Computer Engineering Rutgers, The State University of New Jersey 94 Brett Road, Piscataway, NJ 08854, USA Editor: Bert Huang We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (flipping each bit independently with some constant probability q [0, 1/2)). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantifies how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case (q = 0). In fact, for any tree with p vertices and probability of incorrect recovery δ > 0, the sufficient number of samples remains logarithmic as in the noiseless case, i.e., O(log(p/δ)), while the dependence on q is O 1/(1 2q)4 , for both aforementioned tasks. We also present a new equivalent of Isserlis Theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation. Keywords: Ising Model, Chow-Liu Algorithm, Structure Learning, Predictive Learning, Distribution Estimation, Noisy Data, Hidden Markov Random Fields 1. Introduction Graphical models are a useful tool for modeling high-dimensional structured data. The graph captures structural dependencies: its edge set corresponds to (often physical) interactions between variables. There is a long and deep literature on graphical models (see Koller and Friedman (2009) for a comprehensive introduction), and they have found wide applications in areas such as image processing and vision (Schwing and Urtasun, 2015; Li and Wand, 2016; Lin et al., 2016a; Liu et al., 2017; Morningstar and Melko, 2018; Wu et al., 2017), artificial c 2021 Konstantinos E. Nikolakakis, Dionysios S. Kalogerias and Anand D. Sarwate. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-149.html. Nikolakakis, Kalogerias and Sarwate intelligence more broadly (Wainwright et al., 2003; Wang et al., 2017), signal processing (Kim and Smaragdis, 2013; Wisdom et al., 2016), and gene regulatory networks (Zuo et al., 2017; Banf and Rhee, 2017), to name a few. An undirected graphical model, or Markov random field (MRF) in particular, is defined in terms of a hypergraph G = (V, E), that models the Markov properties of a joint distribution on p |V| node variables (X1, X2, . . . , Xp) X. A tree-structured graphical model is one in which G is a tree. We denote the tree-structured model as T = (V, E). In this paper, we consider binary models on 2p variables (X, Y), where the joint distribution p( ) of X is a tree-structured Ising model distribution on { 1, +1}p and Y = (Y1, Y2, . . . , Yp) is a noisy version of X, such that Yi = Ni Xi and {Ni} are independent and identically distributed (i.i.d.) Rademacher noise with P(Ni = 1) = 1 P(Ni = +1) = q, for all i V. We refer to X as the hidden layer and Y as the observed layer. Under this setting, our objective is to recover the underlying tree structure and accurately estimate the distribution of the hidden layer X (with high probability) using only the noisy observations Y. This is nontrivial because Y does not itself follow any tree structure; this is similar to more traditional problems in nonlinear filtering, where a Markov process of known distribution (and thus, of known structure) is observed through noisy measurements (Arulampalam et al., 2002; Jazwinski, 2007; Van Handel, 2009; Douc et al., 2011; Kalogerias and Petropulu, 2016). The sample complexity of the noiseless version of our model was recently studied by Bresler and Karzand (2020), where the well-known Chow-Liu algorithm (Chow and Liu, 1968) is employed for tree reconstruction. Like them, we also analyze the Chow-Liu algorithm. 1.1 Applications and Motivating Examples Models for joint distributions characterized by pairwise variable interactions have found many applications, with the Ising model being a popular model for binary variables. Our work is primarily motivated by examples of Ising models corrupted by noise. In many cases, the underlying graph-structured process cannot be observed directly; instead, only a noisy version of the process is available. Examples abound in physics, computer science, biology, medicine, psychology, social sciences, and finance. Some applications motivating this work include the following: 1) Statistical mechanics of population, social and pedestrian dynamics (see related work by Matsuda et al. (1992); Castellano et al. (2009)): The Ising model can be used to represent the statistical properties of the spreading of a feeling, behavior or the change of an emotional state among individuals in a crowd, where each individual interacts with his neighbors. 2) Epidemic dynamics and epidemiological models by Barnett et al. (2013); Erten et al. (2017): Disease spread can be modeled through the Ising model, where each individual is susceptible (spin down) or ineffective (spin up). 3) Neoplastic transitions and related applications in biology (Torquato (2011)): Each cell interacts with neighboring cells. Different cases are studied in the literature, for instance, healthy versus cancerous cells, malignant versus benign cells, where both can be modeled as spin up and spin down observations. The probability of diagnostic error is not zero which gives rise to the hidden model that we consider. 4) Differential Privacy, originally proposed by Dwork et al. (2006a,b): In computer science, differential privacy is used to guarantee privacy for individuals. A hidden model Predictive Learning on Hidden Tree-Structured Ising Models describes data gathered using a locally differentially private mechanism (Warner, 1965; Kasiviswanathan et al., 2008) such as randomized response. 5) Trading and related applications in economics (see related work by Zhou and Sornette (2007); Takaishi (2015)): The Ising model has been considered in the literature to model increasing (spin up) or decreasing (spin down) price trends in a market. 1.2 Structure Learning for Undirected Graphical Models and Related Work For a detailed review of methods for structure learning involving undirected and directed graphical models, see the relevant article by Drton and Maathuis (2017). In general, learning the structure of a graphical model from samples can be intractable (Karger and Srebro, 2001; Højsgaard et al., 2012). For general graphs, neighborhood selection methods (Jalali et al., 2011; Bresler, 2015; Ray et al., 2015) estimate the conditional distribution for each vertex in order to learn the neighborhood of each node and therefore the full structure. These approaches may use greedy search or ℓ1 regularization. For Gaussian or Ising models, ℓ1regularization (Ravikumar et al., 2010), the GLasso (Yuan and Lin, 2007; Banerjee et al., 2008), or coordinate descent approaches (Friedman et al., 2008) have been proposed, focusing on estimating the non-zero entries of the precision (or interaction) matrix. Model selection can also be performed using score matching methods (Hyvärinen, 2005, 2007; Nandy et al., 2018; Lin et al., 2016b), or Bayesian information criterion methods (Foygel and Drton, 2010; Gao et al., 2012; Barber et al., 2015). Other works address non-Gaussian models such as elliptical distributions, t-distribution models or latent Gaussian data (Finegold and Drton, 2011; Vogel and Fried, 2011; Vogel and Tyler, 2014; Bilodeau, 2014), or even mixed data (Fan et al., 2017). For treeor forest-structured models, exact inference and the structure learning problem are significantly simpler: the Chow-Liu algorithm provides an estimate of the tree or forest structure of the underlying graph (Chow and Liu, 1968; Wainwright et al., 2008; Edwards et al., 2010; Tan et al., 2011; Liu et al., 2011; Daskalakis et al., 2018; Bresler and Karzand, 2020). Furthermore, marginal distributions and maximum values are simpler to compute using a variety of algorithms (sum-product, max-product, message passing, variational inference) (Pearl, 1988; Lauritzen, 1996; Wainwright et al., 2003, 2008)). The noiseless counterpart of the model considered in this paper was studied recently by Bresler and Karzand (2020); in this paper, we extend their results to the hidden case, where samples from a tree-structured Ising model are passed through a binary symmetric channel with crossover probability q [0, 1/2). Of course, in the special case of a linear graph, our model reduces to a hidden Markov model. Latent variable models are often considered in the literature when some variables of the graph are deterministically unobserved (Chandrasekaran et al., 2010; Anandkumar and Valluvan, 2013; Ma et al., 2013; Anandkumar et al., 2014). Our model is most similar to that studied by Chaganty et al. (Chaganty and Liang, 2014), in which a hidden model is considered with a discrete exponential distribution and Gaussian noise. They solve the parameter estimation problem by using moment matching and pseudo-likelihood methods; the structure can be recovered indirectly using the estimated parameters. Connection with Phylogenetic Estimation. In phylogenetic estimation problems the goal is to learn the structure of tree given only observations form the leaves (Erdős et al., Nikolakakis, Kalogerias and Sarwate 1999). The sample complexity of phylogenetic reconstruction algorithms grows exponentially with respect to the depth of the tree (Erdős et al., 1999), however if we are interested in reconstructing only parts of the tree which are close to the leaves then the depth of tree does not affect the sample complexity (Daskalakis et al., 2009). The hidden structure learning problem that we consider in this paper is a special case of phylogeny estimation problem with constant depth; there is exactly one noisy observable for each hidden node of the tree. In contrast with phylogenetic estimation approaches, Chow-Liu algorithm is simple and computationally more efficient, while the sample complexity is of the same order1 with the well-known phylogenetic reconstruction methods, to name a few Dyadic Closure method by Erdős et al. (1999), the Contractor-Extender and Cherry-picking algorithms by Daskalakis et al. (2006, 2009, 2013). On the other hand, the approach of distribution estimation by matching the structure and the correlations (Bresler and Karzand, 2020) has not been considered in the phylogenetic estimation literature. Based on the above discussion, the following interesting question naturally rises: How well can we estimate the distribution of a hidden tree structured model while having access only to the leaves of the tree? The latter remains open problem for future work. 1.3 Statement of Contributions We are interested in answering the following general question: How does noise affect the sample complexity of the structure and predictive learning procedure? That is, given only noisy observations, our goal is to learn the tree structure of the hidden layer in a well-defined and meaningful sense. The MLE-structure from tree-structured (noiseless) data is the output of the Chow-Liu algorithm (Chow and Liu, 1968). However, the MLE-structure from noisy data is not consistent with the hidden structure in general because the graphical model of the observables is a complete graph. Further, the (latent) MLE of the actual interaction parameters θ of the hidden layer is intractable. In Sections 2.4 and 2.6 we explain the importance of Chow-Liu algorithm in our setting, we show why the classical MLE approach fails, and we discuss the connection between the output of the Chow-Liu algorithm and an alternative, projection-based MLE approach. The estimated structure is an essential statistic for estimating the underlying distribution of the hidden layer, allowing for predictive learning. Specifically, based on the structure estimate, we are also interested in appropriately approximating the tree-structured distribution under study, which can then be used for accurate predictions. We also consider the problem of hidden layer higher-order moment estimation of tree-structured Ising models and, in particular, how such estimation can be efficiently performed, on the basis of noisy observations. A summary of the main contributions of this paper is as follows: A lower bound on the sufficient number of samples needed to recover the exact hidden structure with high probability, by using the Chow-Liu algorithm. We also show an upper bound on the necessary number of samples for any algorithm to estimate the hidden structure. The proof of the lower bound follows the general structure of Lemmata 8.1-8.4 by Bresler and Karzand (2020), however we need to extend the necessary events and prove new concentration bounds for the noisy setting. Although 1. while considering the depth fixed Predictive Learning on Hidden Tree-Structured Ising Models the graphical model of the observables is a complete graph we show that the Chow-Liu algorithm (with input a finite number of noisy samples) returns the exact tree of the hidden layer with high probability and we characterize its sample complexity. The proof of the upper bound uses the same construction of the approach in Section 7.1 by Bresler and Karzand (2020) but requires the combination of Fano s inequality and a strong data processing inequality (SDPI) by Polyanskiy and Wu (2017). Specifically, we show that SDPI s can be a useful tool to derive minimax bounds when closed form expressions or upper bounds of the KL-divergence are hard to be found. The later is of independent interest and it can be applied to other machine learning problems that involve noisy observations. Determination of the sufficient and necessary number of samples for accurate predictive learning. We analyze the sample complexity of learning distribution estimates, which can accurately provide predictions on the hidden tree. The estimates are computed using the noisy data. Predictive learning under noisy samples is challenging because structural properties such as the independence of random variables Xi Xj and correlation estimates ˆE[Xi Xj] for (i, j) E do not hold for the noisy observable Y. To overcome this we evaluate the required conditional distributions of the dependent variables, construct a martingale difference sequence, and prove a high probability bound of the event that involves these variables by applying a concentration bound for supermartingales (generalized Bennet s inequality (Fan et al., 2012)). We refer the reader to Section 4.3 for a detailed discussion about the technical contributions and a sketch of proof of the main result. A closed-form expression and a computationally efficient estimator for higher-order moment estimation in tree-structured Ising models. This result corresponds to an equivalent statement of Isserlis theorem for sign-valued tree models. Given pair-wise correlations and the tree (or estimates of both, from noisy or noiseless data) we provide an algorithm that runs on the tree and returns the expression of high-order moments. The proof involves the existence and identification of (minimum length) disjoint paths among any set of pairs of nodes. The proposed algorithm (Algorithm 2) identifies these paths that yield the expression of the moments. The results may be of independent interest for a computational efficient exact or approximated higher-moment evaluation. Our main results Theorem 5 and Theorem 7 provide the amount of finite samples needed for exact structure recovery and accurate predictive learning with high probability. Although we are interested in the finite sample complexity bounds, our results are also asymptotically optimal. That is, for any fixed (constant) q [0, 1/2) the order of the upper bound (necessary number of samples) matches the corresponding (lower) minimax bound. The sample complexity bounds that we provide are the extended form of state of the art (noiseless setting) bounds by Bresler and Karzand (2020). By setting q = 0, our bounds reduce to the noiseless setting bounds. Further, the explicit version of our results (see Section 3) are continuous functions of the cross-over probability q. 1.4 Notation Boldface indicates a vector or tuple and calligraphic face for sets and trees. The sets of even and odd natural numbers are 2N and 2N + 1 respectively. For an integer n, define Nikolakakis, Kalogerias and Sarwate Symbol Meaning p number of variables nodes in the tree p(x) exp P (i,j) E θi,jxixj /Z(θ), x { 1, +1}p, Z(θ): partition function α minimum |θij| in the Ising model, mini,j V |θij| β maximum |θij| in the Ising model, maxi,j V |θij| T Original tree of the model PT(α, β) set of tree-structured Ising models with α |θij| β n number of samples q crossover probability of the BSC, q [0, 1/2) cq 1 2q p( ) distribution of the hidden node variables, X p( ) PT(α, β) p ( ) distribution of the observable node variables Y p ( ) 1A indicator function of the set A DKL KL divergence SKL symmetric KL divergence I(X, Y ) mutual information of X,Y d TV total variation distance L(2)(P, Q) supi,j V d TV (Pij, Qij), and Pij, Qij the pairwise marginals of P, Q X1:n n independent observations of X Y1:n n independent observations of Y TCL Chow-Liu-estimated structure from noiseless data X1:n TCL Chow-Liu-estimate of the hidden tree structure T from noisy data Y1:n path T(w, w) the set of edges which connects the nodes w, w VT ˆµi,j 1 n Pn k=1 X(k) i X(k) j ˆµ i,j 1 n Pn k=1 Y (k) i Y (k) j ΠTCL (ˆp ) estimator of the distribution p( ) from noisy data Y1:n η maximum error on the distribution estimation: L(2)(P, ˆP) η δ maximum probability of error, the notation depends on the task in structure estimation: P(TCL = T) δ in predictive learning: P L(2) p( ), ΠTCL (ˆp ) η 1 δ. Table 1: Notation/Definitions. [n] {1, 2, . . . n}. The indicator function of a set A is 1A. For a graph G = (V, E), V = [p] indexes the set of variables {X1, X2, . . . , Xp}, for any pair of vertices i, j V the correlation µij = E [Xi Xj] and for any edge e = (i, j) E it is µe E[Xi Xj]. For two nodes w, w of a tree, the term path(w, w) denotes the set of edges in the unique path with endpoints w and w. Further, BSC(q)p denotes a binary symmetric channel with crossover probability q and block-length p. The BSC(q)p is a conditional distribution from { 1, 1}p { 1, 1}p that acts componentwise independently on X to generate Y, such that Xi = Ni Yi and N is a vector of i.i.d. Rademacher variables equal to +1 with probability 1 q. We use the symbol to indicate the corresponding quantity for the observable (noisy) layer. For instance, p ( ) is the probability mass function of Y and µ i,j E[Yi Yj] corresponds to the correlation of Predictive Learning on Hidden Tree-Structured Ising Models variables Yi, Yj, where Yi generates noisy observations of Xi, for any i V. For our readers convenience, we summarize the notation in Table 1. Figure 1: The simulation corresponds to structure learning. Comparison of the experimental results (heat-map) and the theoretical bound of Theorem 5, the bound that yields (1). The colored regions denote different values of the estimated probability of error δ (at least one edge has been missed). The value of δ varies between 0 and 1 while the parameters α = 0.2, β = 1.1, p = 100 are fixed. The red line shows the bound from Theorem 5 (the explicit form of Theorem 1). The code of the experiment is available at https://github.com/Konstantinos Nikolakakis/ Structure-Learning. 1.5 Summary of the Results In this section, we present a summary of the main results of our work up to constant factors C, C > 0. We refer the reader to Table 1 for the definition of the model parameters. We provide the explicit statements of the results, and we specify the constants in Section 3. Recall that, the random vector Y { 1, +1}p is the output of the binary symmetric channel BSC(q)p with input the random vector X p( ) PT(α, β). 1.5.1 Structure Learning The first results provides the sufficient number of samples for exact structure recovery. Theorem 1 (Sample Complexity for Structure Learning.) The Chow-Liu algorithm with input n noisy samples Y1:n exactly estimates the hidden tree structure TCL T with probability at least 1 δ (0, 1), as long as n > C e2β(1+1q =0) (1 2q)4 tanh2(α) log(p/δ). (1) Nikolakakis, Kalogerias and Sarwate The order with respect to β is O(e4β) for all q > 0. The bound in (1) exactly reduces to the noiseless case (Bresler and Karzand, 2020, Theorem 3.2). Additionally, the explicit form of the result, Theorem 5, shows that the bound is also a continuous function of q [0, 1/2). The next proposition gives the necessary number of samples for exact structure recovery. Proposition 2 No algorithm can recover the structure with probability great than 1/2 if n < C e2β[1 (4q(1 q))p] 1 α tanh(α) log (p) . (2) Note that the terms (1 2q) 4 and [1 (4q(1 q))p] 1 introduce a gap between the sample complexity of (1) and (2). However, the sample complexity of Theorem 1 is indeed accurate. To illustrate this experimentally, we show that the theoretical and experimental bounds exactly match, see Figure 1. The latter indicates that the Chow-Liu algorithm requires exactly the number of samples that our theoretical result suggest (see Figure 1). On the other hand, Proposition 2 provides the necessary number of samples, for any algorithm. Finally, we conjecture that the bound of Proposition 1.2 is tight only under the low temperature regime |θi,j| for all i, j E. The derivation of generalized tighter forms of the bound in (2) is challenging and left for future work. 1.5.2 Predictive Learning To learn the tree-shaped distribution p( ) PT(α, β) of X from n noisy samples Y1:n, we first estimate the correlations ˆµ i,j for all i, j V. We then estimate the tree structure TCL by running the Chow-Liu algorithm with input the candidate edge weights ˆµ i,j and finally evaluate the estimator of p( ) (by matching correlations) as follows2 ΠTCL (ˆp ) 1 1 + xixj ˆµ i,j (1 2q)2 2 , x { 1, +1}p. (3) Note that one restriction of our approach is that the distribution estimator requires the value q to be known. The same restriction appears in other structure learning from noisy data approaches (Goel et al., 2019). However, in our setting q is required only for the predictive learning, while the Chow-Liu algorithm and the structure estimation does not require q to be known. Under the assumption that q is unknown, one can first learn its value through an independent procedure (Goel et al., 2019, Section 5). The accuracy of the estimated distribution in (3) is measured by the small-set Total Variation (ss TV), that captures the estimation error on the kth-order marginals (Georgii, 2011; Rebeschini et al., 2015; Bresler and Karzand, 2020). Let PS, QS denote the marginals of P, Q on a set S V, and |S| = k. Then the kth order ss TV of P and Q is defined as L(k) (P, Q) sup S:|S|=k d TV (PS, QS) . (4) 2. The distribution in (3) is a function of x, however we suppress the notation for consistency with prior work and for sake of space. Predictive Learning on Hidden Tree-Structured Ising Models Figure 2: The simulation corresponds to predictive learning. Comparison of the experimental results (heat-map) and the theoretical bound of Theorem 3. The colored regions denote different values of the estimated probability of error δ (ss TV to be greater than a fixed number η). The value of δ varies between 0 and 1 while the parameters η = 0.03, β = 1.1, p = 31 are fixed. The code of the experiment is available at https://github.com/Konstantinos Nikolakakis/ Predictive-Learning. The next results provides the necessary number of samples for accurate distribution estimation by guaranteeing that the L(2) is less than a small positive number η with high probability. We provide guarantees on higher-order marginals (k > 2) in Section 3.3. Theorem 3 (Sample Complexity for Predictive Learning) Fix δ (0, 1). Choose η > 0 (independent of δ). If ( 1 η2(1 2q)4 , e2β(1+1q =0) (1 2q)4 , e4β P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (6) Note that the dependence on β is O(e4β) for accurate distribution learning from noisy data (similarly to the structure learning task, Theorem 1). The bound in (5) exactly reduces to the noiseless setting bound by (Bresler and Karzand, 2020, Theorem 3.3). Theorem 3 is a short version of the main result of the paper. The explicit statement, Theorem 7, shows that the bound is also continuous at q 0. Conversely, the following proposition gives an upper bound on the necessary number of samples for accurate marginal distributions estimation under the assumption β > α. Nikolakakis, Kalogerias and Sarwate Sufficient Number of Samples Task/Setting Noiseless (prior work) Noisy Structure Learning C e2β tanh2(α) log(p/δ) C e2β(1+1q =0) (1 2q)4 tanh2(α) log(p/δ) Predictive Learning C max{η 2, e2β} log(p/δ) C max n η 2 (1 2q)4 , e2β(1+1q =0) (1 2q)4 , e4β1q =0 η2 o log (p/δ) Table 2: Sufficient number of samples for accurate structure and predictive learning. Necessary Number of Samples Task/Setting Noiseless (prior work) Noisy Structure learning C e2β α tanh(α) log (p) C e2β[1 (4q(1 q))p] 1 α tanh(α) log (p) Predictive learning C η 2 log(p) C η 2[1 (4q(1 q))p] 1 log(p) Table 3: Necessary number of samples for structure and predictive learning. Proposition 4 Fix η > 0 such that η (tanh(β) tanh(α))/2. Then no algorithm can accurately estimate the distribution of the hidden variables (ss TV less than η > 0) with probability greater than 1/2 if n < C η 2[1 (4q(1 q))p] 1 log(p). (7) A quick comparison of (5) and (7) shows that there is a gap between the sufficient and necessary number of samples. Our experiments (Figure 2) confirm the accuracy of our theoretical results. For instance the bound of Theorem 3 exactly matches the experimental curve. For further discussion related to the gap between the upper and lower bounds see Section 2.6. Further, we conjecture that bound in (7) is tight only under the low temperature regime, similarly to the Proposition 2. The derivation of tighter characterization of the necessary number of samples Propositions 2 and 4 remains an problem for future work. Additional plots of the experiments are provided in Section 3.4. Finally, Table 2 and 3 summarize the state-of-the-art bounds of the noiseless setting by Bresler and Karzand (2020) and the extended version under the noisy setting that we study in this paper. To summarize, the following holds for both structure and predictive learning: the dependence on the parameter β is of the order O(e2β) for q = 0 and becomes O(e4β) for positive values of q. Further, the bounds are continuous functions of q, as our results suggest (for the continuity see the explicit form of the results Theorem 5 and Theorem 7.) Similarly to the noiseless case, the following statement holds when noise exists as well: Under the high temperature regime (α close to zero), structure learning requires much more data than the predictive learning task, because of the tanh2(α) in the denominator of the bound in (1). On the contrary, the required number of samples for predictive learning (5) does not depend on α. Specifically, exact structure recovery is not necessary for learning the distribution Predictive Learning on Hidden Tree-Structured Ising Models efficiently, that is, weak edges identification failure does not affect the predictive learning task. We refer the reader to Section 4.3 for the definition of weak/strong edges and additional explanation. Finally, for q > 0 an extra term that involves both β and η appears in the bound of Theorem 3, while for values of q close to zero and q = 0 vanishes. The pairwise correlations of end-point vertices (E[Xi Xj]: (i, j) E) are sufficient statistics, and as expected, the accuracy of pairwise marginals corresponds to accuracy of higher order marginals and accurate estimation of higher order moments. In Sections 3.3 and 4.4 we provide a method for evaluating higher order moments (and marginals) from noisy observations. Our approach is based on an equivalent of Isserlis theorem for tree-structured Ising models that is also of independent interest. 2. Preliminaries and Problem Statement In this section, we introduce our model of hidden sign-valued Markov random fields on trees. 2.1 Undirected Graphical Models We consider sign-valued graphical models where the joint distribution p( ) has support { 1, +1}p. Let X = (X1, X2, . . . , Xp) { 1, +1}p be a collection of sign-valued (binary) random variables. Then, 1Xi=xi (1 + xi Xi)/2, and the distribution of X is S V:|S|=k E , x { 1, +1}p. (8) In this paper we assume that the marginal distributions of the Xi are uniform, that is, P (Xi = 1) = 1 2, i V. (9) Thus, E[Xi] = 0, for all i V. A distribution is Markov with respect to a hypergraph G = (V, E) if for every node i in the set V it is true that P Xi| x V\{i} = P Xi| x N(i) , where N(i) is the set of neighbors of i in G. One subclass of distributions for which the Markov property holds is the Ising model, in which the random variables Xi are sign-valued and the hypergraph is a simple undirected graph, indicating that variables have only pairwise and unary interactions. The joint distribution for the Ising model with zero external field is given by p(x) = 1 Z(θ) exp (s,t) E θstxsxt , x { 1, 1}p. (10) {θst : (s, t) E} are parameters of the model representing the interaction strength of the variables and Z( ) (0, ) is the partition function. These interactions are expressed through potential functions exp(θstxsxt) that ensure that the Markov property holds with respect to the graph G = (V, E). Next, we discuss the properties of distributions of the form of (8), which are Markov with respect to a tree. Nikolakakis, Kalogerias and Sarwate 2.2 Sign-Valued Markov Fields on Trees From prior work by Lauritzen (1996), it is known that any distribution p( ) that is Markov with respect to a tree (or forest) T = (V, E) factorizes as i V p (xi) Y p(xi, xj) p(xi)p(xj), x { 1, +1}p, (11) and we call p( ) as tree (forest) structured distribution, to indicate the factorization property. If the distribution p( ) has the form of (8) with P(Xi = 1) = 1/2, for all i V, and is Markov with respect to a tree T, then 1 + xixj E [Xi Xj] E [Xi Xj] = Y e path(i,j) µe, for all i, j V. (13) (see Appendix A, Lemma 12). Additionally, let us state the definition of the so-called Correlation (coefficient) Decay Property (CDP), that will be of central importance in our analysis. Definition 1 The CDP holds if and only if |E[Xi Xk]| |E[XℓXm]| for all tuples {i, k, ℓ, m} V such that path(i, k) path(ℓ, m). The CDP is a well known attribute of acyclic Markov fields (see, e.g., Tan et al. (2010), Bresler and Karzand (2020)). Further, it is true that the products Xi Xj for all (i, j) E are independent and the CDP holds for every p( ) of the form of (8), that factorizes with respect to a tree (see Lemma 13, Appendix A). This is a consequence of property (13) and the inequality |µe| 1, for all e E. We can interpret the CDP as a type of data processing inequality (see Cover and Thomas (2012)). The connection is clear through the relationship between the mutual information I(Xi, Xj) and the correlations E[Xi Xj], namely, I (Xi, Xj) = 1 2 log2 (1 E [Xi Xj])1 E[Xi Xj] (1 + E [Xi Xj])1+E[Xi Xj] , (14) for any pair of nodes i, j V. This expression shows that the mutual information is a symmetric function of E [Xi Xj] and increasing with respect to |E [Xi Xj]| (see also Lemma 17, Appendix A). Tree-structured Ising models: Despite its simple form, the Ising model has numerous useful properties. In particular, (12), (13) hold for any tree-structured Ising model with uniform marginal distributions and θr = 0 for all r V. Furthermore, E[Xi Xj] = tanh θij, (i, j) ET, (15) the latter implies that 1 + xixj tanh θij 2 , x { 1, 1}p, α |θij| β, (16) Predictive Learning on Hidden Tree-Structured Ising Models E[Xi Xj] = Y e path(i,j) µe = Y e path(i,j) tanh (θe) , i, j V. (17) A short argument showing (15) and (16) is included in Appendix A, Lemma 14. For the rest of the paper, we assume a tree-structured Ising model for the hidden variable X, that is, the distribution of X has the form of (12). We also impose a reasonable compactness assumption on the respective interaction parameters, as follows. Assumption 1 There exist α and β such that for the distribution p( ), 0 < α |θst| β < for all (s, t) E. For a fixed tree structure T, and for future reference, we hereafter let PT(α, β) be the class of Ising models satisfying Assumption 1. 2.3 Hidden Sign-Valued Tree-Structured Models The problem considered in this paper is that of learning a tree-structured model from corrupted observations. Because we have no access to the original samples X1:n, we obtain the noisy observations Y1:n. To formalize this, consider a hidden Markov random field whose hidden layer X is an Ising model with respect to a tree, i.e., X p( ) PT(α, β), as defined in (16). The observed variables Y are formed by setting Yr = Nr Xr for all r V, where {Nr} are i.i.d. Rademacher(q) random variables. Let p ( ) be the distribution of the observed variables Y. We can think of Y as the result of passing X through a binary symmetric channel BSC(q)p. We have the following expressions E[Nr] = 1 2q cq, r V, and q [0, 1/2), (18) µ r,s = E [Yr Ys] = E [Nr Xr Ns Xs] = (1 2q)2 E [Xr Xs] , r, s V. (19) The distribution p ( ) of Y also has support { 1, +1}p, and so the joint distribution satisfies the general form (8). Since the marginal distribution of each Yr is also uniform, E[Yr] = 0 for all r V, (8) and (18) yield k [p] 2N ck q X S V:|S|=k E , y { 1, 1}p. (20) The moments of the hidden variables E Q s S Xs in (20) can be expressed as products of the pairwise correlations E[Xs Xt], for any (s, t) ET (Section 3.3, Theorem 10). From (20) it is clear that the distribution p ( ) of Y does not factorize with respect to any tree, that is, p ( ) / PT(α, β) in general.3 2.4 Hidden Structure Estimation We are interested in characterizing the sample complexity of structure recovery: given data generated from p( ) PT(α, β) for an unknown tree T, what is the minimum number n 3. Lemma 30 shows the structure preserving property for the observable layer holds for the special case of single-edge forests. Nikolakakis, Kalogerias and Sarwate Algorithm 1 Chow Liu Require: D = y(1), y(2), . . . , y(n) { 1, 1}p n, where y(k) is the kth observation of Y 1: Compute ˆµ i,j 1 n Pn k=1 y(k) i y(k) j , for all i, j V 2: return TCL Maximum Spanning Tree i =j n ˆµ i,j o of samples {y(i), i [n ]} from p ( ) needed to recover the (unweighted) edge set of T with high probability? In particular, we would like to quantify how n depends on the crossover probability q. Intuitively, noise makes weak edges to appear weaker , and the sample complexity is expected to be an increasing function of q. Because the distribution p ( ) of the observable variables does not factorize according to any tree, this problem does not follow directly from the noiseless case. Although the classical MLE is the standard approach for the noiseless case, for the noisy setting the MLE estimation of parameters θ of the hidden model is intractable, due to the summation over the support of X. Additionally, the MLE structure estimate from noisy data is not in general consistent with the hidden structure as we explain in Section 2.6. However, for the model that we consider in this paper, the projected-MLE estimate of the observables onto the space of tree-structured models gives a consistent structure estimate. Additionally, that structure estimate is identical to the output of Chow-Liu algorithm (Algorithm 1) from noisy data. We refer the reader to Section 2.6 for the discussion about the MLE and the connection with the noisy Chow-Liu algorithm. In this work, we use and analyze the sample complexity of the classical Chow-Liu algorithm (Algorithm 1) for the following reasons: We show that given finite number of noisy data as input, the Chow-Liu algorithm recovers the original tree T with high probability. Further the sample complexity is asymptotically optimal for fixed q < 1/2 (see Tables 2 and 3), and its order remains O(log p) in the high dimensional regime. The algorithm is computationally efficient in comparison to other optimization techniques and it does not require the value q to be known. Additionally, Algorithm 1 solves the projected-MLE problem that we discuss in Section 2.6. The above reasons and our finite sample complexity bound Theorems 1 and 5 suggest that Algorithm 1 is an excellent approach for tree-structure learning from noisy data. 2.5 Evaluating the Accuracy of the Estimated Distribution In addition to recovering the graph structure, we are interested in the goodness of fit of the estimated distribution. Let PS, QS be the marginal distributions of P, Q on the set S V, let d TV denote the total variation distance, and fix k = 2. We measure the error of distribution estimator through the small set Total Variation (or ss TV) distance as defined by Bresler and Karzand (2020) L(k) (P, Q) sup S:|S|=k d TV (PS, QS) . (21) If Q is an estimate of P, the norm L(k) guarantees predictive accuracy because (Bresler and Karzand, 2020, Section 3, page 720) EXS h |P (Xi = +1|XS) Q (Xi = +1|XS)| i 2L(|S|+1) (P, Q) . (22) Predictive Learning on Hidden Tree-Structured Ising Models The estimated (from noisy data) distribution of the hidden variables in (3) is a simple extension of the noiseless estimate. In fact the estimated distribution factorizes according to the estimated from noisy data tree structure, that is the output of Algorithm 1. Further, the pairwise correlations are normalized by the constant (1 2q). As a result, the estimator is consistent because if n then TCL T, ˆµ i,j/(1 2q) µi,j, and as a consequence the estimate ΠTCL (ˆp ) converges to the original distribution p( ) of X. Our main result gives a lower bound on the number of samples needed to guarantee accurate estimation (in the sense of small ss TV), with high probability. 2.6 Maximum Likelihood Estimate A natural first place to start in estimation is the maximum-likelihood estimate (MLE). We explain why this is problematic and show a method (the projected-MLE) which turns out to be equivalent to the Chow-Liu algorithm. This motivates why we study the Chow-Liu algorithm in the first place. To begin, the distribution of the observables parametrized over the interaction parameters θ of the hidden layer is (s,t) EG θstxsxt p(y|x), y { 1, 1}p. (23) It is known that above expression is intractable in closed form and it can be evaluated only through approximations. Secondly, the log-likelihood of Y can be written as log p (y) = log X x { 1,+1}p p(y|x) Y i V p (xi) Y p(xi, xj) p(xi)p(xj), y { 1, 1}p, (24) and the logarithm of the summation cannot be expressed as summation of logarithms. Therefore we see the classical MLE structure estimation approach is not applicable for hidden models. Specifically, the structure of the observable layer is a complete graph and not a tree (there is no conditional independence between Y s). The maximum likelihood structure estimate with respect to the parameters θ of the observables in general will return a complete graph. Specifically, let G = (V, EG) be the graph (which is complete) of the observable layer, then the distribution p ( ) is an Ising-Model distribution and it can be written as p (y) = 1 Z (θ ) exp (s,t) EG θ stysyt , y { 1, 1}p. (25) Since all the edges exist in the edge set, none of the values θ st is zero. As a consequence, even asymptotically (n ) the maximum likelihood that estimates the parameters θ st gives a complete graph. Recall that we want to recover the structure of the hidden layer which is a tree. Thus, the maximum likelihood structure estimate directly applied on (25) is not consistent, because of the different hidden and observables structure. To overcome the inconsistency that is introduced by the noise, we can project the distribution p (y) to a set of tree-structured distributions and then find the maximum likelihood structure estimate. We denote the projection of p (y) onto the space of trees as p T (y) Nikolakakis, Kalogerias and Sarwate and we call the MLE with respect to p T (y) as projected-MLE (PMLE). Then the following questions are natural: Is the PMLE always consistent with respect to structure of the hidden layer? Is the PMLE asymptotically optimal (n )? Is the PMLE optimal for finite values of n? (by optimal we mean that the sample complexity bound matches the minimax bound). We continue by answering the questions above. First we present the structural consistency and then we continue by discussing the asymptotic optimality and optimality for finite n. Although, the PMLE is not in general consistent with structure of the hidden layer (see also related work by Nikolakakis et al. (2020)), for the setting of the BSC channel with i.i.d noise we do have ˆTPMLE T when n . In fact, the projected distribution as p T (y) is given by p T (y) argmin Q( ) PT(α,β) DKL(p (y)||Q(y)). (26) The proof of the claim follows by a standard argument (see also Lemma 1 and Lemma 2 by (Bresler and Karzand, 2020, Supplemetary material, Appendix A)) and it gives DKL(p (y)||p T (y)) = 1 H(p (y)) + X 1 + (1 2q)2µi,j As a consequence the projected-MLE ˆTPMLE is ˆTPMLE = argmin T T (i,j) ET HB 1 + ˆµ i,j 2 and the following (i,j) ET HB 1 + (1 2q)2µi,j (i,j) ET HB gives that ˆTPMLE TCL T (almost surely) when n . Although, the above discussion of the consistency for n shows the connection with MLE, our results for instance Theorem 3.1 shows that the Chow-Liu algorithm returns the original tree for finite n with probability 1 δ. Additionally, the PMLE is asymptotically optimal, however for finite n it may be not optimal. For our structure/predictive learning problem our bounds are asymptotically optimal (up to constants). That is, for fixed q the upper and lower bounds match as n . Nevertheless for finite n the PMLE is not optimal in general. It is known that under the presence of noise the MLE approach may be non-robust and sub-optimal and extra steps should be considered including pre-processing, statistical learning of the noise by using pilot samples, and detecting and rejecting bad samples (for further information see also Zoubir et al. (2012, page 62) and Nikolakakis et al. (2020)). The reason that we consider Chow-Liu algorithm in our work is that it is computationally efficient, while its sample complexity remains logarithmic with respect to p even when noise exists. The latter makes the Chow Liu algorithm useful in practice when only noisy observations are available. Finally, to give Predictive Learning on Hidden Tree-Structured Ising Models further insight about the gap between the upper and lower bounds we present an example in Section H.1 (Appendix), for which perfect denoising is possible for p before running the Chow-Liu algorithm. As consequence, for p the bounds in Propositions 2 and 4 reduce to the noiseless case as they should. This example is a marginal case (since perfect denoising is not possible in general) and it affects our converse results which are universal and owe to include corner cases. 3. Main Results The main question asked by this paper is as follows: what is the impact of noise on the sample complexity of learning a tree-structured graphical model in order to make predictions? This corresponds to sampling variables Y generated by sampling X from the model (10) and randomly flipping each sign independently with probability q. We use the Chow-Liu algorithm to estimate the hidden structure using the noise-corrupted samples. We first find upper (Theorem 5) and lower bounds (Theorem 6) on the sample complexity for exact hidden structure recovery using the Chow-Liu algorithm on noisy observations. Secondly, we use the structure statistic to derive an accurate estimate of the hidden layer s probability distribution. The distribution estimate is computed to be accurate under the ss TV utility measure, that was introduced by Bresler and Karzand (2020). Furthermore, the estimator of the distribution factorizes with respect to the structure estimate, while the ss TV metric ensures that the estimated distribution is a trustworthy predictor. Theorem 7 and Theorem 8 give the sufficient and necessary sample complexity for accurate distribution estimation from noisy samples. These theorems generalize the results for the noiseless case (q = 0) by Bresler and Karzand (2020) and lead to interesting connections between structure learning on hidden models and data processing inequalities (Raginsky, 2016; Polyanskiy and Wu, 2017). The third part of the results includes Theorem 10, which gives an equivalent of Isserlis theorem by providing closed form expressions for higher order moments of sign-valued Markov fields on trees. Based on Theorem 10 we propose a low complexity algorithm to estimate any higher order moment of the hidden variables given the estimated tree structure and estimates of the pairwise correlations (both evaluated from observations corrupted by noise). Finally, Theorem 11 gives the sufficient number of samples for distribution estimation, when the symmetric KL divergence is considered as utility measure. These give rise to extensions of testing algorithms Daskalakis et al. (2018) under a hidden model setting. 3.1 Tree Structure Learning from Noisy Observations Our goal is to learn the tree structure T of an Ising model with parameters |θst| [α, β], when the nodes Xi are hidden variables and we observe Yi Ni Xi, i V, where Ni Rademacher(q) are i.i.d, for all i V and for all q [0, 1/2). We derive the estimated structure TCL by applying the Chow-Liu algorithm (Algorithm 1) (Chow and Liu, 1968). Instead of mutual information estimates, our Chow-Liu algorithm (Algorithm 1) requires correlation estimates; these are sufficient statistics because of (14). Further, it can consistently recover the hidden structure through noisy observations. The latter is true because of the order preserving property of the mutual information. That is, the stochastic map- Nikolakakis, Kalogerias and Sarwate ping X BSC(q)p Y allows structure recovery of X by observing Y, because for any tuple Xi, Xj, Xi , Xj such that I (Xi; Xj) I Xi , Xj , it is true that I (Yi; Yj) I Yi , Yj . The proof directly comes from (14) and (19). In addition, the monotonicity of mutual information with respect to the absolute values of correlations allows us to apply the Chow-Liu algorithm directly on the estimated correlations ˆµ i,j 1/n Pn k=1 (Yi)(k) (Yj)(k). Notice that because of (19), ˆµ i,j can be used as an alternative of ˆµi,j. The algorithm returns the maximum spanning tree TCL . Further discussion about the Chow-Liu algorithm is given in Section 4.1. The following theorem provides the sufficient number of samples for exact structure recovery through noisy observations. Theorem 5 (Sufficient number of samples for structure learning) Let Y be the output of a BSC(q)p, with input variable X p( ) PT(α, β). Fix a number δ (0, 1). If the number of samples n of Y satisfies the inequality n 32 h 1 (1 2q)4 tanh β i (1 2q)4 (1 tanh β)2 tanh2 α log 2p2 then Algorithm 1 returns TCL = T with probability at least 1 δ. Theorem 5 characterizes the finite-sample performance of the Chow-Liu estimator and by taking n we can see that Algorithm 1 is consistent in the noisy setting. As a consequence of (30) and the inequality 1 tanh(β) e 2β, if the number of samples satisfies the following bound h 1q=0 + e2β (1 2q) 4 tanh(β) 1q =0 i log(p/δ), (31) then the structure is exactly recovered with probability at least 1 δ. The latter gives the statement of Theorem 1. Complementary to Theorem 5, our next result characterizes the necessary number of samples required for exact structure recovery. Specifically, we prove a lower bound on the sample complexity that characterizes the necessary number of samples for any estimator ψ. Theorem 6 (Necessary number of samples for structure learning) Let Y be the output of a BSC(q)p, with input variable X p( ) PT(α, β). If the given number of samples of Y satisfies the inequality n < [1 (4q(1 q))p] 1 16α tanh(α) e2β log (p) , (32) then for any estimator ψ, it is true that inf ψ sup T T p( ) PT(α,β) P ψ Y1:n = T > 1 Predictive Learning on Hidden Tree-Structured Ising Models It can be shown that the right hand-side of (30) is greater than the right-hand side of (32) for any q in [0, 1/2) (and for all possible values of p, β, α), by simply comparing the two terms. Theorems 5 and 6 reduce to the noiseless setting by setting q = 0 (Bresler and Karzand (2020)). The sample complexity is increasing with respect to q, and structure learning is always feasible as long as q = 1/2. Let n denote the required samples under a noiseless setting assumption, then for a fixed probability of exact recovery, we always need n n because 1 (1 2q)4 tanh(β) [(1 2q)4(1 tanh(β))] 1, q h 0, 1 and β R. (34) Furthermore, 1 1 (4q(1 q))p 1, q [0, 1/2) and p N, (35) the latter shows that the sample complexity in a hidden model is greater than the noiseless case (q = 0), for any measurable estimator (Theorem 6). When q approaches 1/2, the sample complexity approaches infinity, n , and the structure learning is impossible. Theorem 6 extends Theorem 3.1 by Bresler and Karzand (2020) to our hidden model. Our results combines Bresler s and Karzand s method and a strong data processing inequality (SDPI) by Polyanskiy and Wu (2017, Evaluation of the BSC). Upper bounds on the symmetric KL divergence for the output distribution p ( ) can not be found in a closed form. However, by using the SDPI, we manage to capture the dependence of the bound on the parameters α, β, q and derive a non-trivial result. When p , the bound becomes trivial since limp 1/ [1 (4q(1 q))p] 1, giving the classical data processing inequality (contraction of KL divergence for finite alphabets, (Raginsky, 2016; Polyanskiy and Wu, 2017)). While direct application of the SDPI is simple and provides an upper bound which is almost insensitive to q (for sufficiently large p), it introduces a gap between the lower and upper bounds. Nevertheless, it is important because it indicates a possible non-optimal performance of the classical Chow-Liu algorithm (under a hidden model). We conjecture that the sample complexity bounds in (Theorem 6 and Theorem 8) are tight only under the low temperature regime |θi,j| = β for all i, j E, while in general (θi,j [α, β]) the inequalities hold but they are not tight. For further explanation related to the gap between the upper and lower bounds see Section 2.6. The latter is a consequence of the SDPI, which is tight for the repetition code (Polyanskiy and Wu, 2017, Evaluation for the BSC, page 12). We performed extensive simulations (c.f. Figures 1, 2) that suggests that our bound does indeed accurately characterize the performance of Chow-Liu. These simulations choose p = 100, but our evidence shows that the dependence on q is not affected for larger (p = 200) or smaller (p = 50) values of q. We believe that the term 1/[(1 (4q(1 q))p] does not characterize the Chow-Liu algorithm, but possibly a more complicated algorithm. 3.2 Predictive Learning from Noisy Observations In addition to recovering the structure of the hidden Ising model, we are interested in estimating the distribution p( ) PT(α, β) itself. If the L(2) distance between the estimator and the true distribution is sufficiently small, then the estimated distribution is appropriate Nikolakakis, Kalogerias and Sarwate for predictive learning because of (22). For consistency, this distribution should factorize according to the structure estimate TCL and for the predictive learning part, the estimate TCL is considered the output of the Chow-Liu algorithm (see Algorithm 1). We continue by defining the distribution estimator of p( ) as ΠTCL (ˆp ) 1 1 + xixj ˆµ i,j (1 2q)2 2 . (36) The estimator (36) can be defined for any q [0, 1/2). For q = 0 it reduces to that in the noiseless case, since TCL TCL, ˆµ i,j ˆµi,j, and thus ΠTCL ˆp ΠTCL ˆp . It is also closely related to the reverse information projection onto the tree-structured Ising models (Bresler and Karzand, 2020, supplementary material, Appendix A), in the sense that ΠT(P) = argmin Q PT(α,β) DKL (P||Q) , P PT(α, β). (37) To compute ΠTCL (ˆp ), two sufficient statistics are required: the structure TCL and the set of second order moments (Chow and Liu, 1968; Bresler and Karzand, 2020), under the assumption that q is known. The next result provides a sufficient condition on the number of samples to guarantee that the L(2) distance between the true distribution and the estimated distribution is small with probability at least 1 δ. Note that the dependence on β changes from e2β to e4β when the data are noisy q > 0, while for q = 0 our bound exactly recovers the noiseless case (Bresler and Karzand, 2020). A key component of the bound is the following function Γ(β, q) 1 (1 2q)2 1 (1 2q)4 tanh2(β) 2 , β > 0 and q [0, 1/2). (38) Further, notice that Γ(β, q) [0, 1] for all β > 0 and q [0, 1/2), and Γ(β, 0) = 0 for all β > 0. Additionally, we define the functions K(β, q) 10(1 tanh2(β)) 9 + (1 2q)2 tanh2(β)(1 2q)2(9(1 2q)2 + 1), (39) B(β, q) max 1 K(β, q), 1 + 2eβp 2 (1 q) q tanh β 2 . (40) The latter constitute additional components of the main rusult that follows. Theorem 7 Fix δ (0, 1) and choose η > 0. If n max 512 η2(1 2q)4 , 1152e2βB(β, q) (1 2q)4 , 48e4β η2 Γ(β, q) log 6p3 P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (42) Predictive Learning on Hidden Tree-Structured Ising Models (41) and the inequalities Γ(β, q) 1q =0, B(β, q) 1 + 3 q 2 e2β1q =0 give Theorem 3. We provide the proof of Theorem 3 and Theorem 7 in Section E (Appendix). As we mentioned in Section 1.5.2, the sample complexity for accurate predictive learning does not depend on α, that is, even in the high temperature regime α 0 (and in contrast with the structure learning), the number of required samples does not increase. Conversely, the following result provides the necessary number of samples for small L(2) distance by a minimax bound, that characterizes any possible estimator ψ. In other words, it provides the necessary number of samples required for accurate distribution estimation, appropriate for predictive learning (small L(2)( )). Theorem 8 (Necessary number of samples for inference) Fix a number δ (0, 1). Choose η > 0 such that tanh(α) + 2η < tanh(β). If the given number of samples satisfies the inequality n < 1 [tanh(α) + 2η]2 16η2[1 (4q(1 q))p] log p, (43) then for any algorithm ψ, it is true that inf ψ sup T T p( ) PT(α,β) P L2 (p( ), ψ (Y1:n)) > η > 1 Theorems 7 and 8 reduce to the noiseless setting for q = 0, that has been studied earlier by Bresler and Karzand (2020). Similarly to our structure learning results, presented previously (Theorem 5, Theorem 6), when q 1/2 we have n , the latter indicates that the learning task becomes impossible for q = 1/2. Remark 9 Theorem 8 requires the assumption α < β. The special case α = β can be derived by applying the same proof technique of Theorem 8 combined with Theorem 3 by (Bresler and Karzand, 2020, supplementary material) and the SDPI by Polyanskiy and Wu (2017). Further details and proof sketches of Theorems 7 and 8 are provided in Section 4.3. 3.3 Estimating Higher Order Moments of Signed-Valued Trees A collection of moments is sufficient to represent completely any probability mass function. For many distributions, the first and second order moments are sufficient statistics; this is true, for instance, for the Gaussian distribution or the Ising model with unitary and pairwise interactions. Even further, in the Gaussian case, the well-known Isserlis Theorem (Isserlis (1918)) gives a closed form expression for all moments of every order. As part of this work, we derive the corresponding moment expressions, for any tree-structured Ising model. To derive the expression of higher order moments, we first prove a key property of tree structures: for any tree structure T = (V, E) and a even-sized set of nodes V V, we can partition V into |V |/2 pairs of nodes, such that the path along any pair is disjoint with the path of any other pair (see Appendix A, Lemma 15). We denote as CT(V ) the set of distinct |V |/2 pairs of nodes in V , such that path(u, u ) path(w, w ) = , for all Nikolakakis, Kalogerias and Sarwate Algorithm 2 Matching Pairs Require: Tree structure T = (V, E), any set V V : |V | 2N 2: for i V do 3: if i V then 7: for k [d] do d is the depth of the tree 8: Store all nodes at level k to L(k) 9: for k [d] do 10: for i L(d + 1 k) do Visit each of the nodes at level d + 1 k 11: if p(i) = 1 then 12: V V \ {i} 13: CPT CPT (i, ancestor(i)) 14: if p(ancestor(i)) = 1 then 15: V V \ {ancestor(i)} 16: p(ancestor(i)) 0 18: p(ancestor(i)) 1 19: if V then 20: return CPT {u, u }, {w, w } CT(V }. Let CPT(V ) be the set of all edges in all mutually edge-disjoint paths with endpoints the pairs of nodes in V , that is, {w,w } CT(V ) path T(w, w ). (44) For any tree T, the set CPT(V ) can be computed via the Matching Pairs algorithm, Algorithm 2. By using the notation above, we can now present the equivalent of Isserlis Theorem. The closed form expression of moments is given by the next theorem. Theorem 10 For any distribution of the form of (11), which factorizes according to a tree T and has support { 1, +1}p, it is true that E [Xi1Xi2 . . . Xik] = ( 0 k odd Q e CPT(i1,i2,...,ik) µe k even. (45) Theorem 10 is an equivalent of Isserlis theorem for tree-structured sign-valued distributions. Equation (45) is used later to define an estimator of higher order moments that requires two sufficient statistics: the estimated structure TCL and the correlation estimates ˆµ e, for any e TCL . Together with the parameter q, the higher order moments completely characterize the distribution of the noisy variables of the hidden model (20). We provide the proof of Theorem 10 in Appendix A. Predictive Learning on Hidden Tree-Structured Ising Models A similar expression to (45) has been introduced in prior work. Specifically, Algorithm 2 solves the problem of finding the optimal matching, see Definition 1 by (Bresler and Karzand, 2020, supplementary material). The evaluation of higher order moments requires an explicit expression or a way to compute the set CPT. For a given tree T = (E, V) and a set ((i1, i2, . . . , ik) V, there is a unique set CPT(i1, i2, . . . , ik) (see Appendix A, proof of Theorem 10). Given a set of edges E, we show that the set CPT can be evaluated by running a matching pair algorithm. For that purpose, we provide Algorithm 2 (with complexity (O(E))) and we prove its consistency (See Appendix, Lemma 15). The latter yields to an explicit expression of higher order moments; the Theorem 10. Furthermore, it provides a concrete higher order moments estimator, that is based on the estimated structure TCL (or TCL ) and the set of estimated correlations {ˆµe : e CPTCL}. High Order Moments Estimator: A higher order moment is the expected value of the product of the hidden tree-structured Ising model variables {Xi : i V } where V V. Theorem 10 gives the closed form solution for such moments. We have the following estimator for higher order moments using only noisy observations and known q. In particular, we have ˆE [Xi1Xi2 . . . Xik] 0, k 2N + 1, (46) ˆE [Xi1Xi2 . . . Xik] Y e CPTCL (i1,i2,...,ik) ˆµ e (1 2q)2 , k 2N. (47) The estimated structure and pairwise correlations are sufficient statistics: given those, (47) suggests a computationally efficient estimator for higher order moments. First we run the classical Chow-Liu algorithm to estimate the tree structure TCL , and then we run Algorithm 2 with input the estimate TCL to evaluate the set CPTCL . Thus, by estimating TCL , CPTCL and ˆµ e for any e TCL , we can in turn estimate any higher order moment through (47). Considering the absolute estimation error, we have ˆE # 2|V |L(2) p( ), ΠTCL ˆp . (48) Theorem 7 guarantees small ss TV and in combination with (48) gives an upper bound on the higher order moment estimate (47). In Section 4.4, we provide further details and discussion about Theorem 10, Algorithm 2, that computes the sets CPT(V ), CPTCL (V ), and the bound on the error of estimation (48). So far we have studied the consistency of the estimator with respect to the L(2) metric. We are also interested in sample complexity bounds for φ-divergences. While general divergences may be challenging, the most widely-used is the KL-divergence, particularly in testing Ising models (Daskalakis et al., 2018). The next result gives a bound for the sufficient number of samples to guarantee a small symmetric KL divergence SKL(P||Q) DKL(P||Q) + DKL(Q||P) with high probability. For any Ising model distributions P, Q of the form (10) with respective interaction parameters θ, θ , we have SKL θ||θ SKL(P||Q) = X θst θ st µst µ st . (49) Nikolakakis, Kalogerias and Sarwate Figure 3: Probability of incorrect structure recovery, The theoretical bound is given by Theorem 5. The top view of the figure is Figure 1 and provides a clear comparison between the experimental and theoretical results. Theorem 11 (Upper Bounds for the Symmetric KL Divergence) If the number of samples n of Y satisfies n 4 β2(p 1)2 (1 2q)4η2s log p2 then for p( ) PT(α, β) we have P SKL p( )||ΠTCL ˆp ηs 1 δ, (51) where TCL is the Chow-Liu tree defined in (52) and the estimate ΠTCL ˆp is given by (36). The asymptotic behavior of the bound in (50) was recently studied by Daskalakis et al. (2018). In that work, a set of testing algorithms are proposed and analyzed under the assumption of an Ising model with respect to trees and arbitrary graphs. Theorem 11 gives rise to possible extensions of testing algorithms to the hidden model setting. We consider the latter as an interesting subject for future work. 3.4 Simulations We provide empirical results based on synthetic data to illustrate the probability of error δ as function of the cross-over probability q and the number of samples n. For the simulations Predictive Learning on Hidden Tree-Structured Ising Models Figure 4: Estimate of the probability of the ss TV to be greater than η = 0.03. The theoretical bound is given by Theorem 7. The top view of the figure is Figure 2 and provides a clear comparison between the experimental and theoretical results. Figure 5: Estimate of the distribution error metric ss TV as a function of q and n. Nikolakakis, Kalogerias and Sarwate Figure 6: The probability of error in the predictive learning task for different values of n and estimation error of the parameter q. For both figures we consider p = 31, α = 0.2, β = 1, q = 0.1 and averaging over 500 independent runs. Left: ˆq [0.05, 0.15], η = 0.1, Right: ˆq [0, 0.2], η = 0.12. of this paper the original tree structure T is generated randomly where, starting from the root, we choose the parent of each new node uniformly at random among the nodes that are currently in the tree, in a sequential fashion. First, we estimate the probability of error P TCL = T (named as δ) of the structure learning problem, Figure 3. For the structure learning experiments, the number of nodes is 100, β = arctanh(0.8), and α = arctanh(0.2). Further, we considering 100 Monte Carlo runs for averaging, and we plot the estimated probability of incorrect structure recovery while q and n vary. As a next step, we would like to see how well the theoretical bound of Theorem 5 matches with the experimental results. To do this we plot the top view of Figure 3 to get Figure 1. Quite remarkably, the theoretical and experimental bounds exactly match. The latter suggests that our theoretical bound that we derive, sample complexity of the Chow-Liu algorithm (Theorem 5), is indeed accurate. Second, we plot the probability of error for the predictive learning task, that is the probability of the ss TV to be greater than a positive number η (Figure 4). For the simulation part, we restrict our attention to the case that the first of three terms in the maximization of (41) is the dominant. In fact, η = 0.03, p = 31, while α and β are the same as the structure learning. Finally, Figure 5 presents the ss TV itself for different values of q and n. Finally, the top view of Figure 4 is Figure 2, the latter suggest that the bound of our main result, Theorem 7 is accurate. Finally, we provide experimental results for the case of unknown q. Specifically, Figure 6 illustrates the relationship between the average probability of error and the relative error |ˆq q|/q for the predictive learning task. We notice that the distribution can be approximated by using an estimate ˆq of q even for relative error 30% or 60%. Predictive Learning on Hidden Tree-Structured Ising Models 4. Discussion In this section, we present sketches of proofs, we compare our results with prior work, we further elaborate on Algorithm 1, Algorithm 2 and the error of higher order moment estimates. First, we discuss the convergence of the estimate TCL (Section 4.1). In section 4.2, we explain the connection between the hidden and noiseless settings on the tree structure learning problem. Later, in Section 4.3, we present the analysis and a sketch of proof for Theorem 7. Finally, in Section 4.4, we provide further details about Theorem 10, discussion about the Matching Pairs algorithm (Algorithm 2) and the accuracy of the proposed higher order moments estimator (47). 4.1 Estimating the Tree Structure T In this work, the structure learning algorithm is based on the classical Chow-Liu algorithm, and is summarized in Algorithm 1. We can express its output as TCL = argmax T T ˆµ i,j , (52) (see also Section 2.6.). The difference between Algorithm 1 and the Chow-Liu algorithm of the noiseless scheme is the use of noisy observations as input, since we consider a hidden model, whereas Bresler and Karzand (2020) assume that observations directly from the treestructured model are available. Further, (52) shows the consistency of the estimate TCL for sufficiently large n. The tree structure estimator TCL converges to T when n , since lim n ˆµ i,j a.s. = c2µi,j. (53) From (52) and (53) we have (under an appropriate metric) lim n TCL a.s. = T. (54) Asymptotically, both TCL and TCL converge to T, where TCL denotes the structure estimate from noiseless data (q = 0). Most importantly, the Chow-Liu algorithm also returns the exact hidden structure with high probability given finite number of noisy samples. We provide the finite sample complexity bound in Theorem 5. For a fixed probability of exact structure recovery 1 δ, more samples are required in the hidden model setting, compared to the noiseless one. Additionally, the difference of the sample complexity between the noisy and noiseless setting comes from Theorem 5 by comparing the bound for the values q = 0 and q = 0. 4.2 Hidden Structure Recovery and Comparison with Prior Results Theorem 5 and Theorem 6 extend the noiseless setting (Bresler and Karzand, 2020, Theorem 3.2, Theorem 3.1) to our hidden model; the noiseless results correspond to q = 0. In particular, in the presence of noise, the dependence on p remains strictly logarithmic, that is, O(log(p/δ)). To make the connection between sufficient conditions more explicit, by setting q = 0 in (30) of Theorem 5, we retrieve the corresponding structure learning result Nikolakakis, Kalogerias and Sarwate by Bresler and Karzand (2020, Theorem 3.2) exactly: Fix a number δ (0, 1). If the number of samples of X satisfy the inequality n 32 tanh2 α (1 tanh β) log 2p2 then the Chow-Liu algorithm returns TCL = T with probability at least 1 δ. An equivalent condition of (55) is tanh α 4ϵ 1 tanh β τ(ϵ), and ϵ p 2/n log (2p2/δ), (56) the latter shows that the weight of weakest edge should satisfy the following inequality α > arctanh 4ϵ/ 1 tanh β (Bresler and Karzand (2020)). For the hidden model, the equivalent extended condition for the weakest edge is tanh α 4ϵ q 1 (1 2q)4 tanh β (1 2q)2 (1 tanh β) τ (ϵ ), and ϵ = 2 log (2p2/δ) (see Appendix C, Lemma 22) Condition (56) is retrieved through (57) for q = 0. Note that, for q = 1/2, the mutual information of the hidden and observable variables is zero, thus structure recovery is impossible. Theorem 6 provides the necessary number of samples bound for exact structure recovery given noisy observations. In fact, it generalizes Theorem 3.1 by Bresler and Karzand (2020) to the hidden setting. By fixing q = 0, Theorem 6 recovers the noiseless case. Fix δ (0, 1). If the number of samples of X satisfies the inequality 16e2β [α tanh(α)] 1 log (p) , (58) then for any algorithmic mapping (estimator) ψ, it is true that inf ψ sup T T P PT(α,β) P (ψ (X1:n) = T) > 1 When there is no noise, q = 0, we retrieve the noiseless result, while for any q (0, 1/2) the sample complexity increases since [1 (4q(1 q))p] 1 > 1 in (32) and for q 1/2 the required number of samples n , which makes structure learning impossible. The ratio between the noiseless and noisy necessary conditions indicates the gap between the hidden model and the original (noiseless) setting, which reads n [1 (4q(1 q))p] 1 1 ηKL , (60) (see Appendix E.). The right hand-side of (60) is the strong data processing inequality for the binary symmetric channel, which was recently developed by Polyanskiy and Wu (2017, Equation (39)). We continue by providing the main idea and the important steps of the proof of Theorem 7. Predictive Learning on Hidden Tree-Structured Ising Models 4.3 Theorem 7: A Sketch of the Proof Recall that the indices i, j V of the quantities µ i,j and ˆµ i,j are pair of nodes, and in fact they can be considered as one (pair) index. For sake of space we introduce the notation µ e and ˆµ e for some e ET, that is consistent with our previous definition and e represents a pair of nodes. Theorem 7 guarantees that the estimated pairwise marginal distributions are close to the the original distributions by ensuring that the L(2) is small. In this section we provide a sketch of the proof of the Theorem and we mention the main differences between the hidden model and the noiseless case (Bresler and Karzand, 2020). The intersection of three events is sufficient to guarantee that L(2) is upper bound by η > 0: µ i,j ˆµ i,j ϵ Estrong (ϵ ) e ET : |tanh θe| τ (ϵ ) Ecascade (γ ) e path T(i,j) ˆµ e (1 2q)2 Y e path T(i,j) µ e (1 2q)2 where (57) gives the definition of τ (ϵ ). The three events are equivalent events of the noiseless case, but they are modified accordingly to guarantee accurate estimation based on noisy data. The event Ecorr (ϵ ) guarantees that the error of the correlation estimates is not greater than ϵ . Under the event Estrong (ϵ ) all the strong edges are recovered by the Chow-Liu algorithm. Similarly to the noiseless setting, the event Estrong (ϵ ) requires the Chow-Liu algorithm to recover all the strong edges, while the weak edges (those that do not satisfy the inequality in (62)) do not affect the accuracy of the predictive learning, even if the Chow-Liu algorithm fails to recover them. In contrast with structure learning, exact structure recovery is not necessary for the predicative learning task. In other words, even if α is extremely small, assume tanh(α) τ (ϵ )/(1 2q)2 the required number of samples for accurate predictive learning will remain unaffected. Under the event Ecascade (γ ) the end-to-end error along paths is no greater than γ . In fact, each path between two nodes of the tree can be considered a sequence of segments with strong and weak edges. The end-to-end path error is determined by the strong edge segments of the path through the parameter γ for the Ecascade (γ ) event, while the effect of weak edges parameters is controlled by the quantity τ (ϵ ) (for the segmentation of the tree and the detailed proof see 26). Our goal is to find sufficient conditions on the parameters ϵ and γ that guarantee that the events Ecorr (ϵ ) , Ecascade (γ ) and Estrong (ϵ) occur with high probability. Recall that our goal is to guarantee that the quantity L(2)(p( ), ΠTCL (ˆp )) is smaller than a fixed number η > 0 with probability at least 1 δ. To do this, we follow the technique of prior work by Bresler and Karzand (2020), the triangle inequality gives L(2) p( ), ΠTCL (ˆp ) L(2) p( ), ΠTCL (p( )) + L(2) ΠTCL (p( )) , ΠTCL (ˆp ) , (64) Nikolakakis, Kalogerias and Sarwate and we find the required number of samples such that each of the terms L(2)(p( ), ΠTCL (p( ))) and L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) is no greater than η/2 with probability at least 1 δ. As we show the probability of the event Ecascade (γ ) (Lemma 25, Appendix) and the L(2) (between the true and estimated distribution) can be bounded by a constant uniformly over the set of all trees and is not affected by long paths. To prove these properties of the hidden model is non-trivial and ensures that the estimation error from noisy observations does not increase exponentially along paths as someone might expect. Specifically, the first quantity at the right hand-side of inequality (64) represents the loss due to graph estimation error, while the second term represents the loss due to parameter estimation error. Lemma 26 (Appendix) shows that under the event E (ϵ , γ ) Ecorr (ϵ ) Ecascade (γ ) Estrong (ϵ ) , (65) 3 and ϵ (1 2q)2e β h 20 1 + 2eβp 2 (1 q) q tanh β i 1 (66) then L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) η/2. Further Lemma 27 (Appendix) shows that if η 16(1 2q)2, (1 2q)2e β 24 1 + 2eβp 2 (1 q) q tanh β then L(2) p( ), ΠTCL (p( )) η 2 under the event Ecorr (ϵ ) Estrong (ϵ ). Both conditions (66) and (67) should be satisfied, so it is necessary to have 3 and ϵ min η 16(1 2q)2, (1 2q)2e β 24 1 + 2eβp 2 (1 q) q tanh β To guarantee that the errors γ and ϵ are sufficient small such that (68) is satisfied, we need to make sure that the number of samples n is sufficiently large. In fact, the upper bounds on the errors translate into lower bounds on the number of samples through the concentration bounds for the events. Specifically, Lemma 18 gives a sufficient sample size to ensure that the event Ecorr (ϵ ) occurs with probability at least 1 δ, Lemma 22 gives the concentration bound for the event Estrong (ϵ ) and Lemma 25 gives the concentration bound of the event Ecascade (γ ). Lemma 18, Lemma 22 and Lemma 25 together with (68) give the final bound of the sample complexity (see the proof 28) ( 512 η2(1 2q)4 , 1152e2βB(β, q) (1 2q)4 , 48e4β and its simplified but looser bound ( 512 η2(1 2q)4 , 1152 1 + 3 q 2 e2β(1+1q =0) (1 2q)4 , 48e4β Predictive Learning on Hidden Tree-Structured Ising Models that provides the condition of Theorem 3. Although the general structure of our argument follows that of the noiseless case, the presence of noise introduces several technical challenges whose solution may be of independent interest. In the sequel, we highlight the most important aspects of our approach that do not appear in the noiseless case. The proof of Theorem 3.3 is significantly different and includes additional steps and techniques compared with the approach by Bresler and Karzand (2020). Specifically, Lemma 23 is new and it is necessary for the hidden model and we use it later to prove (Lemma 25, Appendix). Lemma 24 is an non-trivial extension of the accurate estimation of edges correlation. Although the resulting expression seems complicated is important for the proof of Lemma 25. In fact Lemma 25, the proof of the concentration bound for the event Ecascade (γ ), is significantly more complicated and longer than the noiseless model (see Appendix E by Bresler and Karzand (2020) for comparison). To show this result we have to consider a martingale difference sequence and evaluate upper bounds for the conditional variance and bias of that sequence. The bias is crucial for the final result because it introduces an extra term in the final bound that does not exist in the noiseless case. It is interesting that this term does not involve any parameter related to the noise and shows how the result is affected by the structural inconsistency between the hidden and the observable layer. As a consequence, the expression of the bound (170) in Lemma 25 involves two inequalities to guarantee the high-probability bound. The first inequality which introduces restrictions on the parameter (see inequality 170) is an attribute of the noisy case. We continue by briefly explaining one of the main technical aspects of the proof. To begin with, consider a path of length d 2 in the original tree T, X1 X2 Xd+1 and we denote the edge (k, k + 1) as ek, for some k [d]. Recall that Y (i) k denotes the ith sample of Yk and k [d + 1]. We would like derive a concentration bound of the probability of the event Ecascade (γ ) (Lemma 25, Appendix). To do this, first we have to consider for all ℓ [n] and k {2, . . . , d} the random variables (Xk Nk Xk+1Nk+1)(ℓ) (1 2q)2 µ ek (1 2q)2 ˆµ ej (1 2q)2 µ ej (1 2q)2 . (71) Define the martingale difference sequence (MDS) {ξ(i) k } by setting ξ(0) k 0, ξ(1) k Z(1) k E h Z(1) k |ˆµ ek 1, . . . , ˆµ e1 i , ξ(i) k Z(i) k E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ ek 1, . . . , ˆµ e1 i . Let Fk i 1 be the σ-algebra generated by Z(i 1) k , . . . , Z(1) k , ˆµ ek 1, . . . , ˆµ e1. Then the pair (ξ(i) k , Fk i )i=1,...,n is an MDS. In contrast with the noiseless case, the conditional means are not zero, which makes the problem significantly harder. To proceed, we apply a concentration bound for supermartingales (generalized Bennett s inequality) by Fan et al. (2012). Secondly we have to evaluate the following expression P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ ek 1, . . . , ˆµ e1 = 1 µ ek 2 1 µ ek 1 ˆµ ek 1 1 (µ ek 1)2 + µ ek 1 1 µek 2 ˆµ ek 1 µ ek 1 1 (µ ek 1)2 . (72) In the noiseless case, the product variables X(ℓ) k X(ℓ) k+1 are independent, leading to a simple expression for this probability (see Lemma 13, Appendix). The closed form expression of Nikolakakis, Kalogerias and Sarwate (72) is given by Lemma 23. Finally, the expectations E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ ek 1, . . . , ˆµ e1 i are not zero, however when n , they approach zero. As a consequence, a bias exists that affects the sample complexity by introducing an additional term in the bound that that does not appear in the noiseless case, the quantity e4β/η2 (see Equation 69 and Equation 70). Finally, we continue by bounding the norm L(2) between the true and estimated distribution in Appendix E. The proof of Lemma 26 shows that in the noisy setting as well, the L(2) can be bounded by a constant uniformly over the set of all trees and it is not affected by long paths. This property of the hidden model is highly non-trivial and ensures that the estimation error from noisy observations does not increases along paths as someone might expect. Lemma 27 follows the corresponding approach of Lemma 6.1 by Bresler and Karzand (2020) and we provide only the required for the noisy setting differences. In Theorem 28, we combine the Lemmata of Appendices D and E, we find the appropriate choice of the parameter that satisfies the necessary conditions of Lemma 25 and we derive the final sample complexity bound. For further details about the proof of the main result see Appendix, Section D and Section E. 4.4 Estimating Higher Order Moments Our results also provide an analogue of Isserlis Theorem (Theorem 10) and the Matching Pairs algorithm, which returns the set CPT(V ) in (45). We provide a short proof sketch for the bound on the error of estimation (48). Proof sketch of Theorem 10: We prove that CT(V ) always exists (when k is even) by induction (see Appendix A, Lemma 15). We define the set of edges CPT(V ) as the union of the edge-disjoint paths4 CPT(V ) = w,w CT(V )path(w, w ). Combining the set CPT(V ) together with the independent products property (see Lemma 13), we derive the final expression (see Appendix A, proof of Theorem 10). Given the tree structure T and the correlations µe for all e E, we can calculate the higher order expectations. Notice that the collection of edge-disjoint paths CPT depends on the tree structure and as a consequence an algorithm is required to discover those paths. Different matching algorithms can be considered to find the set CPT. We propose Algorithm 2 which is simple and has low complexity of O(|E|). Matching Pairs Algorithm: Algorithm 2 requires as input the tree and the set of nodes V {i1, . . . , ik} V, and returns the set of edges CPT(V ). For each node in the tree, a flag variable is assigned to each node and indicates if the corresponding node is a candidate for the final set CT(V ) at the current step of the algorithm. The candidate nodes have to be matched with other nodes of the tree, such that the pairs generate edge-disjoint paths. Initially, the candidate nodes are the nodes of the set V . Starting from the nodes which appear in the deepest level of the tree, we move them to their ancestor. At each step, if two candidate nodes appear at the same point, we match them as pair, we store the pair in the set CPT(V ) and we remove both of them from the set V . We continue until V . The complexity of Algorithm 2 is O(|E|). Finally, Theorem 10 can be extended to any forest F structure by considering the set CPF(V ) instead of CPT(V ), where CPF(V ) i CPTi(V ) and Ti is the ith connected tree of the forest. 4. By edge-disjoint paths we refer to paths with no common edges. Predictive Learning on Hidden Tree-Structured Ising Models Estimation error of higher order moments: Inequality (48) bounds the error of estimation by the small set Total Variation (ss TV), that is guaranteed to be less than η > 0 by Theorem 7. Additionally, the bound on the error of the estimation in (48) can be found as follows ˆE e CPTCL (i1,i2,...,ik) ˆµ e (1 2q)2 Y e CPT(i1,i2,...,ik) µe e CPTCL (i1,i2,...,ik) ˆµ e (1 2q)2 Y e CPT(i1,i2,...,ik) µ e (1 2q)2 {w,w } CTCL (V ) path TCL (w,w ) ˆµ e (1 2q)2 Y {w,w } CT(V ) path T(w,w ) µ e (1 2q)2 {w,w } CTCL (V ) e path TCL (w,w ) ˆµ e (1 2q)2 Y {w,w } CT(V ) e path T(w,w ) µ e (1 2q)2 2|V |L(2) p( ), ΠTCL (ˆp ) , (75) where (73) holds due to (45) and (47), (74) comes from (44) and the last inequality (75) is being proved by Bresler and Karzand (2020, Lemma 1, supplementary material). Thus, if we can accurately estimate the distribution under the sense L(2)(p( ), ΠTCL (ˆp )) η , for a sufficiently small positive number η , then by using (47) and choosing η η/(2|V |), Theorem 7 guarantees accurate estimates for higher order moments with probability at least 1 δ. 5. Conclusion We have considered and analyzed the problem of predictive learning on hidden tree-structures from noisy observations, using the well-known Chow-Liu algorithm. In particular, we derived sample complexity guarantees for exact structure learning and marginal distributions estimation. Our bounds extend prior work (see Bresler and Karzand (2020)) to the hidden model, by introducing the cross-over probability q of the BSC(q)p. Our results exactly reduce to the noiseless setting when q = 0, and the explicit expressions of the bounds are also continuous functions of q. Additionally, by applying a graph property for tree structures and a probabilistic property for Ising models, we derived an equivalent of the well-known Isserlis theorem for Gaussian distributions, which yields to a consistent high-order moments Nikolakakis, Kalogerias and Sarwate estimator for Ising models. Further, we considered simulations based on synthetic data to validate our theoretical results. Our theoretical bounds exactly match with the experiment. indicating that our results correctly characterize the dependence on the model parameters. Our results show that the estimated structure statistic TCL is essential for successful statistical inference on the hidden (or observable) layer, while the sample complexity with respect to number of nodes and probability of error remains strictly logarithmic, as in the noiseless case. Our hidden setting constitutes a first step towards more technically challenging and potentially more realistic statistical models, such as, for instance, structure and distribution learning when the noise is generated by an erasure channel, or when the underlying hidden tree structured distribution has a larger, or even uncountable, support. Acknowledgments We would like to thank the action editor, Bert Huang, as well as the anonymous reviewers for providing valuable comments and suggestions, which have substantially improved the presentation of the results and the overall quality of our paper. This work was supported in part by DARPA and SSC Pacific under contract N6600115-C-4070 and the United States National Science Foundation under award CCF-1453432, and the United States National Institutes of Health under award 1R01DA040487. Appendix A. Preliminaries and Outline of Proof The chart in Figure 7 shows the various dependencies of the Lemmata and intermediate results either considered or developed in this paper, and the resulting Theorems. The proofs can be found in the corresponding section of the Appendix. For completeness, we start with some properties that hold for any distribution with support { 1, +1}p and tree-structured graphical model (Lauritzen, 1996). Later we derive explicit formulas for the Ising model (10). Lemma 12 Any distribution p(x) with respect to a forest F = (V, E), where x { 1, 1}p and uniform marginals P (Xi = 1) = 1/2, for all i V can be expressed as 1 + xixj E [Xi Xj] Proof We prove the result for an arbitrary tree T = (V, E) and then we extend it to any forest structure by applying cuts to E. The distribution factorizes according to the tree structure T and under the assumption of no external field (uniform marginal distributions), we have P(X = x) = Y i V p (xi) Y p(xi, xj) p(xi)p(xj) = 2p 2 Y 1 + xixj E [Xi Xj] 1 + xixj E [Xi Xj] Predictive Learning on Hidden Tree-Structured Ising Models Lemma B.1 Lemma B.4 Theorem 3.1 Lemma D.2 Lemma D.1 Theorem 3.3 Corollary G.1 SDPI of BSC by Polyanskiy Theorem 3.2 Lemmas A.1, A.2, Theorem 3.6 Theorem 3.5 Properties of the Ising Model Pairs Algorithm Order Moments Estimates 6.3 by Bresler and SDPI of BSC by Polyanskiy Theorem 3.4 Lemma B.2 Lemma B.3 Lemmas E.1 ,E.2 Corollary G.1 Figure 7: Stream mapping of the results (77) holds since the joint distribution of any pair (Xi, Xj) of distinct nodes i, j V is p(xi, xj) = E 1Xi=xi1Xj=xj = 1 + xixj E[Xi Xj] By setting E[Xi Xj] = 0 for some (i, j) E we derive the distribution with respect to a forest generated by cutting the edge (i, j) of T. In Lemma 13 we prove two fundamental properties of the model, the independence of the random variables {Xi Xj : (i, j) E} and the correlation decay property (CDP). To the best of our knowledge, these properties are known but there is no reference for the corresponding proofs in the literature. Lemma 13 Let X be a random binary vector in { 1, +1}p drawn according to a foreststructured distribution p( ) with uniform marginal distributions on each entry Xi for i [p]. Then the elements of the collection of |E| random variables {Xi Xj : (i, j) E}, are independent. Furthermore, we have E [Xi Xj] = Y e path(i,j) µe, (80) so the Correlation Decay Property (CDP) holds since |µe| 1 for all e E. Nikolakakis, Kalogerias and Sarwate Proof Let (ir)p r=1 be an arbitrary permutation of ℓ= {1, 2, . . . , p}. Notice that the singletons {ir}, r = 1, . . . , p form a partition of ℓ. Then, the set of edges E is defined as E = (ir, jr)p r=2 , and j1 = , jr {i1, . . . , ir 1} ℓ. (81) (81) defines a tree T = (V, E) with root the node i1 (since j1 = ). For the first part, it is sufficient to show that for any {cr : r = 2, 3, . . . , p} { 1, +1}p 1, the following holds r=2 {xirxjr = cr} r=2 P (xirxjr = cr) . (82) r=2 {Xir Xjr = cr} x:xirxjr=cr|p r=2 x:xirxjr=cr|p r=2 1 + xirxjr E [Xir Xjr] x:xir=crxjr|p r=2 1 + xirxjr E [Xir Xjr] xi1 { 1,+1} 1 + cr E [Xir Xjr] r=2 P (Xir Xjr = cr) , (84) (83) comes from (81) and Lemma 12 and the last from (79). For the second part of the statement note that for all i, j V there exists a unique path {i, k1, k2, . . . , kℓ, j} from i to j. Define the variable 1(i,j) (Xk1Xk1)(Xk2Xk2) . . . (XkℓXkℓ), which is equal to 1 almost surely, since X { 1, +1}p.5 Then, we have E[Xi Xj] = E[Xi1(i,j)Xj] = E[Xi(Xk1Xk1)(Xk2Xk2) . . . (XkℓXkℓ)Xj] = E[Xi Xk1] m=1 E[Xkm Xkm+1] E[XℓXj] = Y e path(i,j) µe, (85) and (85) comes from (82) and completes the proof. The next lemma relates the pairwise correlations to the parameters of the Ising model. Lemma 14 An equivalent expression of (10) is the following Q (i,j) E [1 + xixj tanh (θij)] P x Q (i,j) E [1 + xixj tanh (θij)] x { 1, 1}p. (86) Further, for a tree-structure Ising model E [Xi, Xj] = tanh (θij) , for all (i, j) E. 5. 1( ) should not be confused with 1A, where the last denotes the indicator function of a set A. Predictive Learning on Hidden Tree-Structured Ising Models Proof We can write exp (θijxixj) as exp (θijxixj) =exp (θijxixj) + exp ( θijxixj) 2 + exp (θijxixj) exp ( θijxixj) =exp (θij) + exp ( θij) 2 + xixj exp (θij) exp ( θij) = cosh (θij) [1 + xixj tanh (θij)] , (88) (87) holds because xixj { 1, +1}. The partition function can be written as (i,j) E exp (θijxixj) (i,j) E cosh (θij) [1 + xixj tanh (θij)] (i,j) E cosh (θij) X (i,j) E [1 + xixj tanh (θij)] = 2p Y (i,j) E cosh (θij) . (89) Notice that P x Q (i,j) E [1 + xixj tanh (θij)] = 2p under the tree-structure assumption. Then Q (i,j) E exp (θijxixj) Q (i,j) E cosh (θij) [1 + xixj tanh (θij)] 2p Q (i,j) E cosh (θij) (90) 1 + xixj tanh (θij) (88) and (89) give (90) and |E| = p 1 gives (91). Finally E [Xi Xj] (10) = ln Z (θ) (89) = ln h 2p Q (i,j) E cosh (θij) i θij = tanh(θij), (i, j) E, (92) and the latter gives the second part of the Lemma. Lemma 15 Let V be a set of nodes such that V V and |V | 2N. Then it exists a set CT(V ) of |V |/2 pairs of nodes of V , such that any two distinct pairs (w, w ), (v, v ) in CT(V ) are pairwise disjoint (their paths have no commons edge), that is, path T(w, w ) path T(v, v ) = , (w, w ), (v, v ) CT(V ) : (w, w ) (v, v ). (93) Proof We prove the existence of CT(V ) by contradiction. Assume that the two distinct paths path T(w, u ), path T(u, w ) share at least one edge. Let their common sub-path be path T(z, z ), Figure 8 and note that z and z do not necessarily differ from w, w , u, u . Notice that the common sub-path is unique (acyclic graph). Then we can always consider the permutation of the endpoints which gives the edge-disjoint paths path T(w, u) and path T(w , u ). Now the paths path T(w, u) and path T(w , u ) are disjoint, however it is possible that one of them or both, contain sub-paths with common edges. Then, we similarly Nikolakakis, Kalogerias and Sarwate proceed by removing the common sub-paths as previously. The set of common edges strictly decreases through the process, which terminates when there are only paths with no common edge. Figure 8: Proof of the existence of CT(V ), Lemma 15 Theorem 16 (Theorem 10) Assume X p(x) PT(α, β), {i1, i2, . . . , ik} V, then E [Xi1Xi2 . . . Xik] = (Q e CPT(i1,i2,...,ik) µe, k 2N 0, k 2N + 1. (94) Recall that the set of edges CPT(i1, . . . , ik) is a collection of k/2 edge-disjoint paths with endpoints pairs of the nodes i1, . . . , ik for each path. Given a tree structure T, CPT(i1, . . . , ik) is found by running Algorithm 2 on T. Proof Even k. We proceed by showing that the Algorithm 2 returns the unique set CPT. When k = 2 the expression is proved in Lemma 13. For k > 2 we proceed by using Lemmas 13 and 15. For all i, j V there exists a unique path {i, k1, k2, . . . , kℓ, j} from i to j. Define as previously the variable 1(i,j) (Xk1Xk1)(Xk2Xk2) . . . (XkℓXkℓ), which is equal to 1 almost surely, and define the set of nodes V {i1, i2, . . . , ik}. Without loss of generality we assume that the variables in the product Xi1Xi2 . . . Xik are ordered such such that the pairs Xij, Xij+1 for all j {1, 3, 5, . . . , k 1} [k 1]odd form edge-disjoint paths (Lemma 15), in other words path(ij, ij+1) path(ij , ij +1) = , j = j [k 1]odd. (95) Then, we have E [Xi1Xi2 . . . Xik] = E Xi11(i1,i2)Xi2Xi31(i3,i4)Xi4 Xik 11(ik 1,ik)Xik (96) j [k 1]odd E[Xij1(ij,ij+1)Xij+1] (97) e path(ij,ij+1) µe (98) e CPT(i1,i2,...,ik) µe, (99) Predictive Learning on Hidden Tree-Structured Ising Models where (96) and (97) come from (80), and (98) holds because of (95). Odd k. Lemma 12 gives p(x) = 2 p Q (i,j) E(1 + xixj E [Xi Xj]). Then E [Xi1Xi2 . . . Xik] = 1 x { 1,+1}ik xi1xi2 . . . xik Y 1 + xixj E [Xi Xj] 2 = 0, (100) gives the second part of (94). Lemma 17 The mutual information of Xi, Xj { 1, +1} is symmetric function of the correlation E [Xi Xj] and increasing with respect to |E [Xi Xj]|, I (Xi, Xj) = 1 2 log2 (1 E [Xi Xj])1 E[Xi Xj] (1 + E [Xi Xj])1+E[Xi Xj] . (101) The proof can be derived through the definition of I (Xi, Xj) and the expression (79), under the assumption of uniform marginal distributions. Appendix B. Bounding the Probability of Mis-Estimating Correlations The following lemma bounds the probability that the estimated pairwise correlations in the graph deviate from their true values. This follows from standard concentration of measure arguments. Lemma 18 Fix δ > 0. Then for any ϵ > 0, if n 2 log p2/δ /ϵ2 , (102) then the event Ecorr (ϵ ) defined in (61) holds with high probability: P Ecorr (ϵ ) 1 δ = 1 p2 exp Proof Let Z(i) be the ith sample of Z = Yw Y w = Nw Xw N w X w. Then ˆµ i,j = 1 n Pn i=1 Z(i) = 1 n Pn i=1 N(i) w X(i) w N(i) w X(i) w for all i = j V. Then Hoeffding s inequality and union bound over all pairs of nodes p 2 < p2/2 give (103). For the rest of the paper we consider ϵ = p 2 log (2p2/δ) /n , which satisfies Lemma 18. We apply Lemmata 19, 20, 21 to Lemma 22 to bound the required number of samples for exact structure recovery using noisy observations of the hidden model. To analyze the error event we use the Two trees lemma of Bresler and Karzand (2020, Appendix F, supplementary material). Informally, if two maximum spanning trees T, T differ in how a pair of nodes are connected then there exists at least one edge in ET which does not exist in ET and vice versa. Lemma 19 characterizes errors in the Chow-Liu in terms of correlations. Nikolakakis, Kalogerias and Sarwate Lemma 19 Suppose the error event {T = TCL } holds and let f (w, w) be an edge such that f T and f / TCL . Then there exists an edge g (u, u) TCL and g / T such that f path T (u, u) and g path TCL (w, w) and n X i=1 Z(i) f,u, u i=1 M(i) f,u, u and Zf,u, u Yw Y w Yu Y u and Mf,u, u Yw Y w + Yu Y u. Proof Using similar approaches to the procedures as in (Bresler and Karzand, 2020, Lemmata 8.2, 8.3) we have that the condition ˆµ f ˆµ g implies 0 ˆµ f 2 ˆµ g 2 = ˆµ f ˆµ g ˆµ f + ˆµ g i=1 N(i) w X(i) w N(i) w X(i) w N(i) u X(i) u N(i) u X(i) u i=1 N(i) w X(i) w N(i) w X(i) w + N(i) w X(i) u N(i) u X(i) u i=1 Z(i) f,u, u i=1 M(i) f,u, u that gives (104). 2 log (2p2/δ) 1 (1 2q)4 tanh β 1 tanh β (107) µe E [Xw X w] , (108) then µA is given through the following relationship E [Xw X w Xu X u] = µe(1 µA), (109) µ A (1 2q)4µA. (110) In Lemmata 20, 21 we derive two concentration of measure inequalities for the variables Z(i) f,u, u, M(i) f,u, u. In fact, we have that the event i=1 Z(i) e,u, u n E [Ze,u, u] n max 8ϵ2 , 4ϵ q : e E and u, u V Predictive Learning on Hidden Tree-Structured Ising Models happens with probability at least 1 δ 2 and the event i=1 M(i) e,u, u n E [Me,u, u] n max 8ϵ2 , 4ϵ q : e E and u, u V happens with probability at least 1 δ 2 . The parameters ϵ and µA, defined below, are decreasing functions of n . Finally, we apply the union bound to guarantee that the event EZ EM happens with probability at least 1 δ, where δ 2 } δ. The union bound is first applied over all tuples (w, w, u, u) in Lemmata 20 and 21 and then for the events EZ and EM. Lemma 20 Fix δ > 0 and let ϵ be given by (106). For all pairs of vertices u, u V and edges e = (w, w) in the path path T (u, u) from u to u, given n samples Z(1) e,u, u, Z(2) e,u, u, ..., Z(n) e,u, u of Ze,u, u = Yw Y w Yu Y u, it is true that i=1 Z(i) e,u, u n E [Ze,u, u] n max 8ϵ2 , 4ϵ q where A = path T (u, u) \ {e}. Proof The proof is an application of Bernstein s inequality. First, it is true that Ze,u, u = Xw Nw X w N w Nu Xu N u X u = Nw Xw N w X w (1 Nw Xw N w X w Nu Xu N u X u) . (114) E [Ze,u, u] = (1 2q)2 E [Xw X w Xu X u] = (1 2q)2 µe (1 µA) (115) Var (Ze,u, u) = E h (Ze,u, u)2i E [(Ze,u, u)]2 = E h (Xw Nw X w N w N u Xu N u X u)2i h (1 2q)2 E [Xw X w Xu X u] i2 = E [1 + 1 2Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w Xu X u]2 = 2 2E [Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w Xu X u]2 = 2 2 (1 2q)4 E [Xw X w Xu X u] (1 2q)4 E [Xw X w Xu X u]2 = 2 2 (1 2q)4 µA (1 2q)4 (µe (1 µA))2 = 2 (1 2q)4 h 2µA + µ2 e (1 µA)2i . (116) Using the expressions for the mean and the variance, we apply Bernstein s inequality (Bennett, 1962) for the noisy setting: for all i [n ] we have Z(i) e,u, u E [Ze,u, u] M almost surely. Then, Bernstein s inequality gives, for all t > 0 i=1 Z(i) e,u, u n E [Ze,u, u] 2n Var (Ze,u, u) + 2 Nikolakakis, Kalogerias and Sarwate Choose a δ > 0 and find t such that δ/2 = 2 exp 2n Var (Ze,u, u) + 2 After some algebra, we have 2n Var (Ze,u, u) + 2 From this we can solve for t: δ 2n Var (Ze,u, u) log 4 δ 2 + 8n Var (Ze,u, u) log 4 2 + 2n Var (Ze,u, u) log 4 Since t > 0, we have, setting M = 4: 2 + 2n Var (Ze,u, u) log 4 If the probability of the union u, u,w, w:(w, w) path T(u, u) i=1 Z(i) e,u, u n E [Ze,u, u] is at most δ 2p3 , then the union bound gives probability at most δ Var (Ze,u, u) = 2 (1 2q)4 h 2µA + µ2 e (1 µA)2i = 2 (1 2q)4 2µA (1 2q)4 µ2 e (1 µA)2 2 (1 2q)4 2µA + 0 = 2 1 (1 2q)4 µA = 2 1 µ A . (120) From (119) and (120), we have 2 + 4n 1 µ A log 4p3 4n 1 µ A log 4p3 Predictive Learning on Hidden Tree-Structured Ising Models the latter implies that 1 µ A log 4p3 8 3n log 4p3 1 µ A log 4p3 Define ϵ = p log (2p2/δ) 2/n (as it is defined in Bresler and Karzand (2020)), then we get n max 8ϵ2 , 4ϵ q By combining the above we derive the concentration bound for the event EZ in (111). The next Lemma gives the concentration of measure bound for the event EM in (112). Lemma 21 Fix δ > 0 and let ϵ be given by (106). For all pairs of vertices u, u V and edges e = (w, w) in the path path T (u, u) from u to u, given n samples M(1) e,u, u, M(2) e,u, u, ..., M(n) e,u, u of Me,u, u = Yw Y w + Yu Y u, it is true that i=1 M(i) e,u, u n E [Me,u, u] n max 8ϵ2 , 4ϵ q A = path T (u, u) \ {e}. Proof Similarly to the prior Lemma, we calculate the mean and the variance as E [Me,u, u] = (1 2q)2 E [Xw X w + Xu X u] = (1 2q)2 µe (1 + µA) (125) Var (Me,u, u) = E h (Me,u, u)2i E [(Me,u, u)]2 = E h (Xw Nw X w N w + Nu Xu N u X u)2i h (1 2q)2 E [Xw X w + Xu X u] i2 = E [1 + 1 + 2Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w + Xu X u]2 = 2 + 2E [Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w + Xu X u]2 = 2 + 2 (1 2q)4 E [Xw X w Xu X u] (1 2q)4 E [Xw X w + Xu X u]2 = 2 + 2 (1 2q)4 µA (1 2q)4 (µe (1 + µA))2 = 2 + (1 2q)4 h 2µA µ2 e (1 + µA)2i . (126) By applying Bernstein s inequality and we get that for any t > 0 i=1 M(i) e,u, u n E [Me,u, u] 2n Var (Me,u, u) + 2 Nikolakakis, Kalogerias and Sarwate Similarly, we find 8 3n log 4p3 2 n Var (Me,u, u) log 4p3 Var (Me,u, u) =2 + (1 2q)4 h 2µA µ2 e (1 + µA)2i 2 + (1 2q)4 2µA =2 1 + µ A . (128) We define ϵ p log (2p2/δ) 2/n , then n max 8ϵ2 , 4ϵ q which completes the proof. Appendix C. Recovering Strong Edges In Lemma 22, we define the set of strong edges for the hidden model and show that the event Estrong (ϵ ) defined in (62) occurs with high probability. That is, only the strong edges are guaranteed to exist in the estimated structure TCL We also find a lower bound for the necessary number of samples for exact structure recovery. In fact we have n n, as expected. Our bounds coincide with the noiseless case (Bresler and Karzand, 2020) by setting the noise level q = 0. Lemma 22 Fix δ (0, 1), and let ϵ = p 2 log (2p2/δ) /n , for any n > 0. Consider the set of strong edges 1 (1 2q)4 tanh β (1 tanh β) and (i, j) ET : |tanh θij| τ Then, the Chow-Liu algorithm recovers the strong edges with probability at least 1 δ. In other words, it is true that P h Estrong (ϵ ) i 1 2p2 exp Proof From Lemma 19, if there is an error then for an edge f not recovered in the tree TCL , we have i=1 Z(i) f,u, u i=1 M(i) f,u, u Predictive Learning on Hidden Tree-Structured Ising Models Therefore one of the sums must be negative. Expanding, one of the two following inequalities must hold: i=1 Z(i) f,u, u n E h Z(i) f,u, u i n E h Z(i) f,u, u i i=1 Y (i) f,u, u n E h Y (i) f,u, u i n E h M(i) f,u, u i . In addition, (115), (125), Lemma 20 and Lemma 21 give the following pairs of inequalities: (1 2q)2 µf (1 µA) max 8ϵ2 , 4ϵ q (1 2q)2 µf (1 + µA) max 8ϵ2 , 4ϵ q µ f (1 µA) 1 max 8ϵ2 , 4ϵ q µ f (1 + µA) 1 max 8ϵ2 , 4ϵ q Putting these together: 8ϵ2 (1 µA), 8ϵ2 (1 + µA), 4ϵ q 1 µ A (1 µA) , 4ϵ q 1 + µ A (1 + µA) 8ϵ2 (1 µA), 4ϵ q 1 µ A (1 µA) 1 µ A (1 µA) . (132) We get the last inequality for non trivial values of the bound 8ϵ2 1 µ A 1 and by using the following bound 8ϵ2 (1 µA) 16ϵ2 (1 µA) 4ϵ 1 µA = 4ϵ 1 µA (1 µA) 4ϵ q 1 µ A (1 µA) . (133) Finally the function f(µA) = 4ϵ q 1 µ A (1 µA) = 4ϵ q 1 (1 2q)4µA (1 µA) is increasing with respect to µA (for all µA 1) and µA tanh β < 1, so we have 1 µ A (1 µA) 4ϵ q 1 (1 2q)4 tanh β (1 tanh β) τ . (134) Nikolakakis, Kalogerias and Sarwate The weakest edge should satisfy µ f τ to guarantee the correct recovery of the tree under the event Estrong (ϵ ). This yields a condition on the edge strengths: µ f τ = (1 2q)2 tanh α 4ϵ q 1 (1 2q)4 tanh β tanh α 4ϵ q 1 (1 2q)4 tanh β (1 2q)2 (1 tanh β) , q [0, 1 The last inequality gives the definition of the strong edges in the noisy scheme. Based on the definition (134) we derive the following bound on τ 1 (1 2q)4 tanh β 1 (1 8q + 24q2 32q3 + 16q4) tanh β 1 tanh β + p (1 3q + 4q2 2q3) 8q tanh β (1 tanh β) 4ϵ eβ 1 + eβp (1 q) (2q2 2q + 1) 8q tanh β (136) < 4ϵ eβ 1 + 2eβp 2 (1 q) q tanh β . (137) (136) holds because 1 tanh(β) e 2β. We will later use (137) in Lemma 25. In comparison to the noiseless setting (see Bresler and Karzand (2020)), we can guarantee exact recover with high probability under the event Estrong (ϵ) when the weakest edge satisfies the inequality tanh α 4ϵ 1 tanh β . (138) Notice that (138) can be obtained by (135) when q = 0 and n = n . When q > 0 and n = n it is clear that the set of trees that can be recovered from noisy observations is a subset of the set of trees that can be recovered from the original observations. Also, we have 2 log (2p2/δ) /n = n = 2 ϵ2 log 2p2/δ and 2 log (2p2/δ) /n = n = 2 ϵ2 log 2p2/δ . (139) By combining (135) with (139) we found the number of samples that we need to recover the tree with probality at 1 δ (Theorem 5): n > 32 h 1 (1 2q)4 tanh β i (1 tanh β)2 (1 2q)4 tanh2 α log 2p2 Predictive Learning on Hidden Tree-Structured Ising Models On the other hand when there is no noise (Bresler and Karzand, 2020) we need n > 32 tanh2 α (1 tanh β) log 2p2 Appendix D. Analysis of the Event Ecascade (γ ) Lemma 23 Consider a path of length d 2 in the original tree T, and without loss of generality assume that path is X1 X2 Xd+1. Recall that Y (i) m is the ith sample of Ym and m [d + 1] and ˆµ k 1 n Pn i=1 Y (i) k Y (i) k+1, k [d]. Then P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1, . . . , ˆµ 1 = 1 (1 2q)2µk 2 1 µ k 1ˆµ k 1 1 (µ k 1)2 + µ k 1 1 µk 2 ˆµ k 1 µ k 1 1 (µ k 1)2 . (142) Proof Note that i=1 Y (i) k Y (i) k+1 = 1 i=1 (Xk Nk Xk+1Nk+1)(i) , (143) where eacch term (Xk Nk Xk+1Nk+1)(ℓ) ˆµ r r [1, 2, . . . k 2], ℓ [1, 2, . . . , n] (144) P (Xk Nk Xk+1Nk+1)(ℓ) = 1 ˆµ k 1, . . . , ˆµ 1 = P (Xk Nk Xk+1Nk+1)(ℓ) = 1 ˆµ k 1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 ˆµ k 1 = 1 i=1 (Xk 1Nk 1Xk Nk)(i) ! = P ˆµ k 1 = 1 n Pn i=1 (Xk 1Nk 1Xk Nk)(i) (Xk Nk Xk+1Nk+1)(ℓ) = 1 P ˆµ k 1 = 1 n Pn i=1 (Xk 1Nk 1Xk Nk)(i) P (Xk Nk Xk+1Nk+1)(ℓ) = 1 . (145) First we compute the probability P ˆµ k 1 = 1 n Pn i=1 (Xk 1Nk 1Xk Nk)(i) . Define the Bernoulli random variable Zk 1 as Zk 1 Xk 1Nk 1Xk Nk + 1 ( 0, w.p. 1 (1 2q)2µk 1 2 1, w.p. 1+(1 2q)2µk 1 i=1 (Xk 1Nk 1Xk Nk)(i) ! Nikolakakis, Kalogerias and Sarwate i=1 (2Zk 1 1)(i) ! i=1 Z(i) k 1 = n ˆµ k 1 + 1 1 (1 2q)2µk 1 n n ˆµ k 1+1 2 1 + (1 2q)2µk 1 As a second step we compute the probability i=1 (Xk 1Nk 1Xk Nk)(i) (Xk Nk Xk+1Nk+1)(ℓ) = 1 i=1,i =ℓ (Xk 1Nk 1Xk Nk)(i) + (Xk 1Nk 1Xk Nk)(ℓ) (Xk Nk Xk+1Nk+1)(ℓ) = 1 (Xk 1Nk 1Xk Nk)(i) (Xk Nk Xk+1Nk+1)(ℓ) , i = ℓ, (149) and we would like to find the conditional distribution of (Xk 1Nk 1Xk Nk)(ℓ) under the event {(Xk Nk Xk+1Nk+1)(ℓ) = 1}. We have P (Xk 1Nk 1Xk Nk)(ℓ) = c (Xk Nk Xk+1Nk+1)(ℓ) = 1 = P (Xk 1Nk 1Xk Nk)(ℓ) = c, (Xk Nk Xk+1Nk+1)(ℓ) = 1 P (Xk Nk Xk+1Nk+1)(ℓ) = 1 1 c E[Xk 1Nk 1Xk+1Nk+1]+c(1 2q)2µk 1 (1 2q)2µk 4 1 (1 2q)2µk = 1 c(1 2q)2µk 1µk + c(1 2q)2µk 1 (1 2q)2µk 2 (1 (1 2q)2µk) , c { 1, +1}. (150) P1 P (Xk 1Nk 1Xk Nk)(ℓ) = +1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 (151) P2 P (Xk 1Nk 1Xk Nk)(ℓ) = 1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 , (152) i=1,i =ℓ (Xk 1Nk 1Xk Nk)(i) Predictive Learning on Hidden Tree-Structured Ising Models + (Xk 1Nk 1Xk Nk)(ℓ) (Xk Nk Xk+1Nk+1)(ℓ) = 1 i=1,i =ℓ (2Zk 1 1)(i) + (2Zk 1 1)(ℓ) (Xk Nk Xk+1Nk+1)(ℓ) = 1 i=1,i =ℓ (Zk 1)(i) + (Zk 1)(ℓ) = n ˆµ k 1 + 1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 = P (Zk 1)(ℓ) = 0 (Xk Nk Xk+1Nk+1)(ℓ) = 1 P i=1,i =ℓ (Zk 1)(i) = n ˆµ k 1 + 1 + P (Zk 1)(ℓ) = 1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 P i=1,i =ℓ (Zk 1)(i) = n ˆµ k 1 + 1 1 (1 2q)2µk 1 n 1 n ˆµ k 1+1 2 1 + (1 2q)2µk 1 1 (1 2q)2µk 1 n 1 n ˆµ k 1+1 1 + (1 2q)2µk 1 2 1 . (153) P (Xk Nk Xk+1Nk+1)(ℓ) = 1 = 1 (1 2q)2µk and (145), (147), (148), (153) give P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1, . . . , ˆµ 1 P (Xk Nk Xk+1Nk+1)(ℓ) = 1 n 1 n ˆµ k 1+1 2 1+µ k 1 2 n n ˆµ k 1+1 2 1+µ k 1 2 n 1 n ˆµ k 1+1 2 +1 1+µ k 1 2 n n ˆµ k 1+1 2 1+µ k 1 2 Nikolakakis, Kalogerias and Sarwate 1 1 µ k 1 2 1 1 + µ k 1 2 = P2 n n ˆµ k 1+1 ! 1 + P1 n ˆµ k 1+1 1 + µ k 1 2 = P2 1 ˆµ k 1 1 µ k 1 + P1 1 + ˆµ k 1 1 + µ k 1 . (155) The latter and the definition of P1, P2 (see Equations 151 and 152) give P Y (ℓ) k Y (ℓ) k+1 1 ˆµ k 1, . . . , ˆµ 1 P2 1 ˆµ k 1 1 µ k 1 + P1 1 + ˆµ k 1 1 + µ k 1 P (Xk Nk Xk+1Nk+1)(ℓ) = 1 P2 1 ˆµ k 1 1 µ k 1 + P1 1 + ˆµ k 1 1 + µ k 1 # 1 (1 2q)2µk " 1 (1 2q)2µk 1µk (1 2q)2µk 1 (1 2q)2µk 2 (1 (1 2q)2µk) 1 ˆµ k 1 1 µ k 1 + 1 (1 2q)2µk 1µk + (1 2q)2µk 1 (1 2q)2µk 2 (1 (1 2q)2µk) 1 + ˆµ k 1 1 + µ k 1 # 1 (1 2q)2µk = 1 (1 2q)2µk 1µk (1 2q)2µk 1 (1 2q)2µk 4 1 ˆµ k 1 1 µ k 1 + 1 (1 2q)2µk 1µk + (1 2q)2µk 1 (1 2q)2µk 4 1 + ˆµ k 1 1 + µ k 1 = 1 (1 2q)2µk 1 ˆµ k 1 1 µ k 1 + 1 + ˆµ k 1 1 + µ k 1 + (1 2q)2µk 1µk + (1 2q)2µk 1 1 + ˆµ k 1 1 + µ k 1 1 ˆµ k 1 1 µ k 1 = 1 (1 2q)2µk 2 1 µ k 1ˆµ k 1 1 (µ k 1)2 + (1 2q)2µk 1 1 µk 2 ˆµ k 1 µ k 1 1 (µ k 1)2 . (156) Note that (1 2q)2µk 1 = µ k 1, and the proof is completed. Predictive Learning on Hidden Tree-Structured Ising Models Lemma 24 Define the function K(β, q) as K(β, q) 10(1 tanh2(β)) 9 + (1 2q)2 tanh2(β)(1 2q)2(9(1 2q)2 + 1) (157) and the event Eedge e, as Eedge e, n ˆµ e µ e γe o , e ET, γe > 0, (158) and Eedge (ET) e ET Eedge e, . If n 108e2β log(2p/δ) (1 2q)4K(β, q) and γe = n K(β, q) log(2p/δ) (159) then P h Eedge (ET) ci δ. Proof The variance of ˆµ e is (1 (µ e)2)/n and by applying Bernstein s inequality P h Eedge e, ci 2 exp nγ2 e 2 1 (µ e)2 + 4 , γe > 0. (160) We choose γe = q 3 1 µ2e n K(β,q) log(2p/δ) (because the parameter γe is free, that is, Bernstein s inequality holds for all γe > 0). If n satisfies (159) then 108e2β (1 2q)4 (1 2q)2 6 (1 µ2 e), (161) and the last is true because e 2β 1 tanh(β) 1 |µe| 1 µ2 e. By applying (159) and (161) on (160) we get P h Eedge e, ci nγ2 e 2 1 (µ e)2 + 4 3 1 µ2 e K(β,q) log(2p/δ) 2 (1 (1 2q)4µ2e) + 4 3 K(β, q) 1 µ2 e 2 + 2 9(1 2q)2 µ2e(1 2q)2(2(1 2q)2 + 2 9) log(2p/δ) 3 2K(β, q) 1 µ2 e 1 + 1 9(1 2q)2 µ2e(1 2q)2((1 2q)2 + 1 9) log(2p/δ) Nikolakakis, Kalogerias and Sarwate 10 9K(β, q) 1 µ2 e 1 + 1 9(1 2q)2 µ2e(1 2q)2((1 2q)2 + 1 9) log(2p/δ) The following function 9 1 x 1 + 1 9(1 2q)2 x(1 2q)2((1 2q)2 + 1 9), x [tanh2(α), tanh2(β)] (163) is strictly decreasing, thus we have f(tanh2(β)) f(x) for all x [tanh2(α), tanh2(β)]. Also K(β, q) f(tanh2(β)), the latter together with (162) give P h Eedge e, ci 2 exp 1 K(β, q)f(µ2 e) log(2p/δ) 2 exp 1 K(β, q)f(tanh2(β)) log(2p/δ) = 2 exp 1 K(β, q)K(β, q) log(2p/δ) Finally, by applying union over the p 1 edges of the tree we get P h Eedge (ET) ci δ. The next Lemma is the extension of Lemma 8.7 by Bresler and Karzand (2020). The sample complexity bound exactly recovers the noiseless case and its expression is continuous at q = 0. Further, the bound is independent of the length of the longest path d, similarly to the noiseless setting. Finally, we provide upper bounds on the functions that appear in the bound. The latter give a more tractable version of the result and a clear representation of the required number of samples as a function of the parameters. Lemma 25 (Concentration bound for the event Ecascade (γ )) For β > 0 and q [0, 1/2) we define the functions S( ), G( ), K( ), A( ), ( ) S(β, q) 2 + (1 2q)2 6 (1 (1 2q)2) tanh2(β) 3 (1 2q)2 S (165) A(β, q) A (1 2q)2[1 tanh(β)(1 (1 2q)2)] (166) G(β, q) 3 4(1 2q)2 d(1 A) A + 2 3 3e 11q =0 + 1 4(1 2q)2 G, (167) and the inequality in (167) holds because the function G(β, q) is bounded for all d N \{1}. K(β, q) 10(1 tanh2(β)) 9 + (1 2q)2 tanh2(β)(1 2q)2(9(1 2q)2 + 1) e 2β1q=0 K (168) 1 (1 2q)4 tanh2(β) 3 log(2p3/δ) n tanh2(β)e2β. (169) Predictive Learning on Hidden Tree-Structured Ising Models If < γ S(β, q)G(β, q)/3 + and ( S2(β, q)G2(β, q) 0.32 (γ )2 log(4p2/δ), 108e2β (1 2q)4K(β, q) log(2p3/δ) then for any path Ad = {e1, e2, . . . , ed} of T with d edges, it is true that ˆµ e (1 2q)2 Y µ e (1 2q)2 p2 , d > 2. (171) Proof For sake of space we proceed by using the notation µ k and ˆµ k instead of µ ek and ˆµ ek for k [d]. Define the random variable ˆµ i (1 2q)2 µ i (1 2q)2 ˆµ j (1 2q)2 µ j (1 2q)2 . (172) Then Pd i=1 M i = Qd i=1 ˆµ i (1 2q)2 Qd i=1 µ i (1 2q)2 , and define the sequence of paths with length k as Ak {e1, e2, . . . , ek} Ad, for 2 k d. Although we provided the definition of the event Eedge ( ) in Lemma 24, we restate it below for completeness. For some γe > 0 the definition follows Eedge (Ak) \ n ˆµ e µ e γe o . (173) The law of total probability gives Eedge (Ad 1) + P h Eedge (Ad 1) ci . (174) For second term, Lemma 24 gives that P h Eedge (Ad 1) ci δ/p2 if n 108e2β log(2p3/δ) (1 2q)4K(β, q) , (175) we define the function K(β, q) in Lemma 24 (157). Here we will find an upper bound for the first term of the right hand-side of (174). Note that M k is written as n Pn ℓ=1 (Yk Yk+1)(ℓ) (1 2q)2 µ k (1 2q)2 ˆµ j (1 2q)2 µ j (1 2q)2 (Xk Nk Xk+1Nk+1)(ℓ) (1 2q)2 µ k (1 2q)2 ˆµ j (1 2q)2 µ j (1 2q)2 . (176) Nikolakakis, Kalogerias and Sarwate and we define (Xk Nk Xk+1Nk+1)(ℓ) (1 2q)2 µ k (1 2q)2 ˆµ j (1 2q)2 µ j (1 2q)2 . (177) The random variables Z(ℓ) k for ℓ [n] and fixed k [d] are independent conditioned on the event Eedge (Ak 1). However the conditional expectation E[Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1] is not zero. To apply a concentration of measure result on Z(ℓ) k we use the extended Bennet s inequality for supermartingales (Fan et al., 2012). Martingale Differences: Define ξ(0) k 0, ξ(1) k Z(1) k E h Z(1) k |ˆµ k 1, . . . , ˆµ 1 i , ξ(i) k Z(i) k E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 i . Also, define as Fk i 1 the σ-algebra generated by Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1, then (ξ(i) k , Fk i )i=1....,n is a Martingale Difference Sequence (MDS). Additionally, conditioned on Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 we have 1 (1 2q)2d 1 µ k Qk 1 j=1 ˆµ j Qd j=k+1 µ j, w.p. P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 1 (1 2q)2d 1 + µ k Qk 1 j=1 ˆµ j Qd j=k+1 µ j, w.p. P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 , (178) and we have proved (Lemma 23) that P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1, . . . , ˆµ 1 = P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 = 1 µ k 2 1 µ k 1ˆµ k 1 1 (µ k 1)2 + µ k 1 1 µk 2 ˆµ k 1 µ k 1 1 (µ k 1)2 . (179) Thus we have E h Z(i) k |Fk i 1 i = E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 i = h 1 µ k P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 (1 + µ k)P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 i Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d . (180) Note that (179) gives h 1 µ k P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 (1 + µ k)P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 i (1 µ k)1 + µ k 2 (1 + µ k)1 µ k 2 µ k 1 1 µ k 1ˆµ k 1 1 (µ k 1)2 + (1 µ k)1 + µk 2 (1 + µ k)1 µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 Predictive Learning on Hidden Tree-Structured Ising Models = (1 µ k)1 + µk 2 (1 + µ k)1 µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 1 µ k + µk µ kµk 1 + µk µ k + µ kµk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 = (µk µ k)µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 . (181) Combine the latter with (180) to get E h Z(i) k |Fk i 1 i = µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2 Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d , i [n]. (182) If q = 0 then E h Z(i) k |Fk i 1 i = 0. Also limn E h Z(i) k |Fk i 1 i 0 for all q [0, 1/2) because limn ˆµ k 1 µ k 1. Note that " ξ(i) k 2 Fk i 1 " Z(i) k E h Z(i) k |Fk i 1 i 2 Fk i 1 " Z(i) k 2 Fk i 1 E2 h Z(i) k |Fk i 1 i . (183) We compute E h Z(i) k 2 Fk i 1 i : E Z(i) k 2 |Fk i 1 = E Z(i) k 2 |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 = 1 µ k 2 P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 + (1 + µ k)2P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d We use (179) to find 1 µ k 2 P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 + (1 + µ k)2P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 1 µ k 2 1 + µ k 2 + 1 + µ k 2 1 µ k 2 ! 1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 µ k 2 1 + µk 2 + 1 + µ k 2 1 µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 = (1 (µ k)2)1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 + (µ k)2 2µkµ k µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 Nikolakakis, Kalogerias and Sarwate = (1 (µ k)2)1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 + (µ k)2 2(µ k)2 + 2(µ k)2 2µkµ k µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 = (1 (µ k)2)1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 (µ k)2 µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 + 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 = 1 (µ k)2 " 1 µ k 1ˆµ k 1 1 (µ k 1)2 + µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 + 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 = 1 (µ k)2 + 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 . (185) Now we combine (182), (183), (184) and (185) to get " ξ(i) k 2 Fk i 1 " Z(i) k 2 Fk i 1 E2 h Z(i) k |Fk i 1 i " 1 (µ k)2 + 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2 Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d µ k + µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2 Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d For sake of space we define the function ˆfµk µk 1(q) f(µ k, µ k 1, ˆµ k 1, q) µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2 , (187) and then (182) and (186) can be written as E h Z(i) k |Fk i 1 i = ˆfµk µk 1(q) Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d , i [n], (188) " ξ(i) k 2 Fk i 1 = 1 µ k + ˆfµk µk 1(q) 2 Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d , i [n]. (189) Predictive Learning on Hidden Tree-Structured Ising Models We would like to find an upper bound on the summation Pd k=1 E h ξ(i) k 2 Fk i 1 i . Define A A(β, q) (1 2q)2[1 tanh(β)(1 (1 2q)2)], for all β > 0 and q [0, 1). Then 1 µ k + ˆfµk µk 1(q) 2 µ k + µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2 (1 2q)2 + µ k 1(1 (1 2q)2) ˆµ k 1 µ k 1 1 (µ k 1)2 (1 2q)2 |µ k 1|(1 (1 2q)2)|ˆµ k 1 µ k 1| h 1 µ2 k (1 2q)2 (1 2q)2 tanh(β)(1 (1 2q)2) 2i (190) [1 µ2 k A(β, q)], (191) and (190) holds because |ˆµ k 1 µ k 1| γ j (1 2q)2(1 µ2 j)/6 (1 2q)2(1 (µ j)2)/6. Then (189) and (191) give " ξ(i) k 2 Fk i 1 k=1 [1 µ2 k A(β, q)] Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d k=1 [1 µ2 k A(β, q)] µ2 j + 2 γ j (1 2q)2 The inequality (192) holds under the event Eedge (ET) defined in (158) (see Lemma 24) because ˆµ j (1 2q)2 µ j (1 2q)2 + 2 γ j (1 2q)2 , (193) since |µ j| (1 2q)2, |ˆµ j| (1 2q)2 under the assumption of known q. Next we define xj µ2 j + 2 γ j (1 2q)2 , then 3(1 A(β, q)xj)/2 + (A(β, q) 1)/2) 1 µ2 k A(β, q) and (192) gives " ξ(i) k 2 Fk i 1 2(1 Axj) + A 1 j=1,j =k xj Nikolakakis, Kalogerias and Sarwate 2(1 Ax) + A 1 xd 1. (194) The latter is maximized at x = (A + 2) 1 1 d /3 and A (0, 1], thus we have 2(1 Ax) + A 1 = d (1 2q)2 (x )d 1 + A(A + 2) 2(1 2q)2 (x )d 1 d 1 + A(A + 2) = d(A + 2)(1 A) d 1 + A(A + 2)2 d + 3 4(1 2q)2 G(β, q) (195) (1 A) 4(1 2q)2 3 e log 3 A+2 + 3 4(1 2q)2 = 3(1 A) e4(1 2q)2 log 3 A+2 + 3 4(1 2q)2 In (195) we define the function G(β, q) and we proved that it has an upper bound independent of d [p 1], G(β, q) d3(1 A(β, q)) A(β, q) + 2 d + 3 4(1 2q)2 3 3e 11q =0 + 1 4(1 2q)2 G. (196) For the rest of the proof and the final result G(β, q) can be replaced by its upper bound in (196), however the definition of G(β, q) shows the continuity of the result for q 0. The following inequality holds with probability 1 for all i [n] and k [d] under the event Eedge (ET) (Lemma 24), |ξ(i) k | 2 + E h Z(i) k |Fk i 1 i = 2 + | ˆfµk µk 1(q)| (1 2q)2 Qk 1 j=1 |ˆµ j| Qd j=k+1 |µ j| 2 + | ˆfµk µk 1(q)| (1 2q)2 Predictive Learning on Hidden Tree-Structured Ising Models = 2 + 1 (1 2q)2 |µ k 1||(µk µ k)||ˆµ k 1 µ k 1| = 2 + tanh2(β)(1 (1 2q)2) γ k 1 1 (µ k 1)2 2 + (1 2q)2 6 (1 (1 2q)2) tanh2(β) S(β, q). (197) the last step comes form the inequality γ e (1 2q)2 6 (1 (µe)2) (161), which holds if the inequality n > 108e2β K(β,q)(1 2q)4 log(4p) holds (see 159). Recall that Lemma 24 gives n K(β, q) log(2p3/δ) 31 tanh2(β) n K(β, q) log(2p3/δ) 3 log(2p3/δ) Also, for all i [n] we have k=1 E h Z(i) k |Fk i 1 i k=2 E h Z(i) k |Fk i 1 i E h Z(i) k |Fk i 1 i | ˆfµk µk 1(q)| (1 2q)2 Qk 1 j=1 |ˆµ j| Qd j=k+1 |µ j| k=2 |µ k 1|(µk µ k)|ˆµ k 1 µ k 1| tanh2(β)(1 (1 2q)2) γe 1 (1 2q)4 tanh2(β) (1 2q)2 (199) tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) (1 2q)2 (200) tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) |µ j| + γj (1 2q)2 (1 2q)2 (201) = tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) j=1 tanh(β) + γj (1 2q)2 j=k+1 tanh(β) tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) j=1,j =k tanh(β) + γj (1 2q)2 Nikolakakis, Kalogerias and Sarwate tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) tanh(β) + 1 6(1 tanh2(β)) = tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)(d 1) 5 6(tanh(β) 3)2 d 1 tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) 1 e log 5 6(tanh(β) 3)2 (1 (1 2q)2) tanh2(β) γ 1 (1 2q)4 tanh2(β) e2β 1. (1 (1 2q)2) tanh2(β)e2β 1 (1 2q)4 tanh2(β) 3 log(2p3/δ) where (199), (200) , (201) come from Lemma 24 and (198). Finally, 0 < 5 6(tanh(β) 3)2 < 1 for all β > 0 and 1/ log 5 6(tanh(β) 3)2 e2β. We use the symbol EAk 1[ ] to denote the conditional expectation given the the event Eedge (Ak 1), for instance ! Eedge (Ak 1) Further we define the function F( , ) as F(t, λ) = log 1 1 + te λt + t 1 + teλ . (204) For any k d we have ! Eedge (Ak 1) exp λM k ˆµ 1, . . . , ˆµ k 1 ! ˆµ 1, . . . , ˆµ k 1 i=1 E h Z(i) k |Fk i 1 i! ˆµ 1, . . . , ˆµ k 1 exp λE h Z(i) k |Fk i 1 i ! ˆµ 1, . . . , ˆµ k 1 Predictive Learning on Hidden Tree-Structured Ising Models exp λE h Z(i) k |Fk i 1 i " ξ(i) k 2 Fk i 1 S(β, q)2 , |λ|S(β, q) exp λE h Z(i) k |Fk i 1 i " ξ(i) k 2 Fk i 1 S(β, q)2 , |λ|S(β, q) The equation (205) comes from change of measure and tower property, the definitions (176) and (177) of M k and ξ(i) k respectively give (206) and (207). The (208) is derived by upper bounding the quantity E h Z(i) k |Fk i 1 i similarly to (197), (209) is the upper bound on the moment generating function of the supermartingale Fan et al. (2012). To get a recurrence we proceed as follows: ! Eedge (Ak 1) P Eedge (Ak 1) ! Eedge (Ak 2) Eedge ek 1, P Eedge (Ak 1) ! Eedge (Ak 2) Eedge ek 1, Eedge (Ak 1) Eedge (Ak 2) P Eedge (Ak 2) ! Eedge (Ak 2) Eedge ek 1, Eedge ek 1, Eedge (Ak 2) P Eedge (Ak 2) 1Eedge ek 1, Eedge (Ak 2) P Eedge (Ak 2) ! Eedge (Ak 2) P Eedge (Ak 2) . (211) Nikolakakis, Kalogerias and Sarwate By applying the recurrence d times, we derive the following bound ! Eedge (Ak 1) k=1 E h Z(i) k |Fk i 1 i! " ξ(i) k 2 Fk i 1 S(β, q)2 , |λ|S(β, q) k=1 E h Z(i) k |Fk i 1 i! " ξ(i) k 2 Fk i 1 S(β, q)2 , |λ|S(β, q) Further (196), (197), (202) and (212) give ! Eedge (Ak 1) exp (λ (β, q)) exp 1 d G(β, q) S2 ξ (β, q), |λ|S(β, q) exp nd F G(β, q) d , |λ|S(β, q) + λ (β, q) . (213) For sake of space, we denote the functions G(β, q), S(β, q), and (β, q) as G, S, and respectively. It is true that ! Eedge (Ad 1) P h Eedge (Ad 1) Eq i + λ . (214) Under the assumption n > 108e2β (1 2q)4K(β,q) log(4p), we have P h Eedge (Ad 1) ci 1 The latter gives ! Eedge (Ad 1) 2 exp nd F G + λ , (216) Predictive Learning on Hidden Tree-Structured Ising Models which implies that Eedge (Ad 1) 2 min λ>0 exp nd F G = 2 min λ>0 exp nd F G + λ ( γ) , (217) and we define γ γ . The minimum value is attained at λ = n/S 1 + G/d log 1 + γ and by substituting the optimal value we get ! G/d 1+G/d γ /(d S) ! 1 1+G/d 1 γ G/d 1+G/d γ /(d S) (1+G/d) 1 γ G/d 1+G/d+ γ /(d S) 1 1+G/d 1 γ 1 1+G/d 1 γ G/d 1+G/d γ /(d S) (1+G/d) 1 γ G/d 1+G/d+ γ /(d S) (1+G/d) 1 1 γ 1 1+G/d 1 γ 1 1+G/d 1 γ G/d 1+G/d γ /(d S) (1+G/d) 1 γ G/d 1+G/d+ γ /(d S) G/d 1+G/d γ /(d S) (1+G/d) 1 γ G/d 1+G/d+ γ /(d S) G/d 1+G/d γ /(d S) (1+G/d) 1 γ 1 1+G/d+ γ /(d S) Nikolakakis, Kalogerias and Sarwate Then (217) and (219) give Eedge (Ad 1) G/d 1+G/d γ /(d S) (1+G/d) 1 γ 1 1+G/d+ γ /(d S) As a final step we want to express the upper bound as an exponential function of γ, we define ζ γ /(SG) and we proceed as follows: d G/d 1 + G/d + γ /(d S) (1 + G/d) + 1 1 + G/4d γ /(dy) (1 + G/4d) " G/d 1 + G/d + γ /(d S) (1 + G/d) 1 1 + G/d γ /(d S) (1 + G/d) log 1 + γ /d S 1 γ /d S d + G + dγ /S d + G dγ /S γ /d S 1 γ /d S 1 γ /d S 1 γ /d S " G + γ /S γ d γ /S γ /d S 1 γ /d S 1 γ /d S 1 γ /d S " G + γ /S γ = d G d + G " 1 + γ /(SG) γ = d G d + G (1 + ζ) ζ 1 (1 + ζ) ζ 1 2 , γ (0, SG Predictive Learning on Hidden Tree-Structured Ising Models Recall that ζ γ /(SG), γ = γ . If < γ S(β, q)G(β, q)/3+ then (220) and (221) give Eedge (Ad 1) 0.32n (γ )2 S2(β, q)G2(β, q) In a similar way we derive the bound k=1 M k γ Eedge (Ad 1) 0.32n (γ )2 S2(β, q)G2(β, q) Finally, we combine (174), Lemma 24, (222) and (223) to derive the bound (170) which guarantees that ˆµ i (1 2q)2 µ i (1 2q)2 2δ/p2, d 2. (224) To summarize we proved that the event Ecascade (γ ) happens with probability at least 1 2δ/p2 by combining Bresler s and Karzand s technique, the Corollary 2.3 by Fan et al. (2012) and Lemma 23. Appendix E. Predictive Learning, Proof of Theorem 3 and Theorem 7 Recall that our goal is to guarantee that the quantity L(2)(p( ), ΠTCL (ˆp )) is smaller than a number η > 0 with probability at least 1 δ. To do this, we use the triangle inequality as L(2) p( ), ΠTCL (ˆp ) L(2) p( ), ΠTCL (p( )) + L(2) ΠTCL (p( )) , ΠTCL (ˆp ) (225) and we find the required number of samples such that each of the terms L(2)(p( ), ΠTCL (p( ))) and L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) in (225) is less than η/2 with probability at least 1 δ. The next Lemma provides the necessary bounds on γ and ϵ that guarantee L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) η/2. Lemma 26 If γ η ϵ (1 2q)2e β h 20 1 + 2eβp 2 (1 q) q tanh β i 1 , then L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) η/2 under the event Ecorr (ϵ ) Ecascade (γ ) Estrong (ϵ ). Proof The derivation of the bound is similar to the approach by Bresler and Karzand (2020, Section 6.1) but with different calculations. In the hidden model, we consider the path between two nodes i, j in the estimated structure TCL , namely path TCL (i, j), to be Nikolakakis, Kalogerias and Sarwate (F0, e1, F1, e1..., Ft 1, et, Ft), and Fi are segments with all strong edges and ei are all weak edges. We consider the case of at least one weak edge to exist in the path. If there is no weak edge the bound reduces to the case of Lemma 25. The length of each sub-path Fi is denoted as di, for all i {0, 1, . . . , t}. Each segment (sub-path) Fi has exactly di edges, and the total number of edges in the path are d; thus d = Pt i=0 di + t. Note that t 1 and di 0 for all i {0, 1, . . . , t}. Recall that ΠTCL (p( )) = 1 1 + xixj E [Xi Xj] 1 + xixj E[Yi Yj] (1 2q)2 (the latter comes from (19)), and ΠTCL (ˆp ) 1 1 + xixj ˆE[Yi Yj] (1 2q)2 Further, for any tree-structured Ising model distributions P, P with structures T = (V, E) and T = (V, E) respectively, we have L(2) P, P sup i,j V xi,xj { 1,+1}2 P(xi, xj) P(xi, xj) (228) = sup i,j V e path T(i,j) µe Y e path T(i,j) µe To upper bound the quantity L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) we have 2L(2) ΠTCL (p( )) , ΠTCL (ˆp ) e path TCL (i,j) µ e (1 2q)2 Y e path TCL (i,j) ˆµ e (1 2q)2 = 1 (1 2q)2d i=1 ˆµ Fi ˆµ ei µ F0 i=1 µ Fiµ ei " ˆµ F0 µ F0 ˆµ Fi ˆµ ei µ Fiµ ei ˆµ Fj ˆµ ej µ Fi (1 2q)2di µ ei (1 2q)2t Predictive Learning on Hidden Tree-Structured Ising Models ˆµ Fi ˆµ ei µ Fiµ ei (1 2q)2(di+1) |ˆµ F0| (1 2q)2d0 Qi 1 j=1 ˆµ Fj ˆµ ej Qt k=i+1 µ Fkµ ek (1 2q)2(d di d0 1) (232) t 1 + τ + ϵ ˆµ Fi ˆµ ei µ Fiµ ei (1 2q)2(di+1) (233) t 1 + τ + ϵ (ˆµ Fi µ Fi)ˆµ ei (1 2q)2(di+1) + µ Fi(ˆµ ei µ ei) (1 2q)2(di+1) t 1 + τ + ϵ γ + ϵ (1 2q)2 t 1 (2t + 1) max γ , ϵ (1 2q)2 4ϵ eβ 1 + 2eβp 2 (1 q) q tanh β + ϵ (2t + 1) max γ , ϵ (1 2q)2 t 1 1 + 2eβp 2 (1 q) q tanh β t 1 (2t + 1) max γ , ϵ (1 2q)2 4t 1 η 3 (236) Telescoping summation and triangle inequality give (230) and (231). We use the definition of d, d = Pt i=0 di + t to get (232). The inequalities µ Fi (1 2q)2di, ˆµ Fi (1 2q)2di, µ ei τ , ˆµ ei τ +ϵ hold under Ecorr (ϵ ), Estrong (ϵ ). Further, under event Ecascade (γ ) (Lemma 25) it is true that ˆµ Fi µ Fi γ , the latter give (233) and (234). The bound τ 4ϵ eβ(1 + 2eβp 2 (1 q) q tanh β) gives (235) (see inequality 137). Inequality (236) requires max ϵ (1 2q)2 , γ ϵ (1 2q)2e β h 20 1 + 2eβp 2 (1 q) q tanh β i 1 . (239) Finally (237) holds for all t N. The latter completes the proof. The next Lemma provides the set of values of ϵ that guarantee L(2)(p( ), ΠTCL (p( ))) η 2 with high probability. Nikolakakis, Kalogerias and Sarwate Lemma 27 If η 16(1 2q)2, (1 2q)2e β 24 1 + 2eβp 2 (1 q) q tanh β then L(2) p( ), ΠTCL (p( )) η 2 under the event Ecorr (ϵ ) Estrong (ϵ ). Proof Recall that L(2) p( ), ΠTCL (p( )) = 1 e path T(w, w) µ e (1 2q)2 Y e path TCL (w, w) µ e (1 2q)2 e path T(w, w) µe Y e path TCL (w, w) µe We follow Bresler s and Karzand s technique Loss due to graph estimation (Bresler and Karzand, 2020, Section 6.2) and highlight the difference that appears in our setting. For the noisy case/hidden model, the argument changes slightly in the following manner: 2L(2) p( ), ΠTCL (p( )) |µfµAµ AµBµ B||µ2 Cµ2 C 1| + |µf| (k) + ( k) + ( k) (k) µ f (1 2q)2 µAµ AµBµ B µ f (1 2q)2 (k) + ( k) + ( k) (k) 8 ϵ (1 2q)2 + τ (1 2q)2 (2η + η2) (242) (242) holds since |µ f| |µ g| 4ϵ , |µf| 1 µ2 Cµ2 C 2|µf| 2|µg|, |µ f| τ , and (243) holds for all the values of ϵ that satisfy η 16(1 2q)2, (1 2q)2e β 24 1 + 2eβp 2 (1 q) q tanh β The latter provides the statement of the Lemma. The next Theorem provides the sufficient number of samples for predictive learning that recovers exactly the noiseless setting for q = 0. Note that the dependence on β changes from e2β to e4β when the data are noisy. A key component of the bound is the following function Γ(β, q) 1 (1 2q)2 1 (1 2q)4 tanh2(β) 2 , β > 0 and q [0, 1/2). (245) Predictive Learning on Hidden Tree-Structured Ising Models Note that Γ(β, q) [0, 1] for all β > 0 and q [0, 1/2), and Γ(β, 0) = 0 for all β > 0. Further we define B(β, q) max 1 K(β, q), 1 + 2eβp 2 (1 q) q tanh β 2 , (246) and the expression of K(β, q) is given by (168). Theorem 28 Fix δ (0, 1). Choose η > 0 (independent of δ). If ( 512 η2(1 2q)4 , 1152e2βB(β, q) (1 2q)4 , 48e4β P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (248) Additionally, as a consequence of (247) , if ( 512 η2(1 2q)4 , 1152 1 + 3 q 2 e2β(1+1q =0) (1 2q)4 , 48e4β P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (250) Proof Recall that L(2) p( ), ΠTCL (p( )) = 1 e path T(w, w) µe Y e path TCL (w, w) µe We combine the triangle inequality L(2) p( ), ΠTCL (ˆp ) L(2) p( ), ΠTCL (p( )) + L(2) ΠTCL (p( )) , ΠTCL (ˆp ) , (252) Lemma 26, and Lemma 27 to get that L(2)(p( ), ΠTCL (ˆp )) η with probability at least 1 δ if 3 and ϵ min η 16(1 2q)2, (1 2q)2e β 24 1 + 2eβp 2 (1 q) q tanh β First, we find the necessary number of samples such that for γ η/3 the probability of the complement of Ecascade (γ ) is not greater than δ/3. Recall that G 3 3e 11q =0 + 1 4(1 2q)2 , (254) Nikolakakis, Kalogerias and Sarwate S 3 (1 2q)2. (255) Recall that Γ(β, q) 1 (1 2q)2 1 (1 2q)4 tanh2(β) 2 , β > 0 and q [0, 1/2). (256) Lemma 25 gives that for any > 0 and η > if n max 0.3 2S2G2 (η )2 , 108e2β (1 2q)4K(β, q), 3e4β 2 Γ(β, q) log 6p3 then the probability of the complement of Ecascade (γ ) is upper bounded by δ/3 and we write P Ecascade (γ ) c δ Second, we find the necessary number of samples such that the complements of the events Estrong (ϵ ) and Ecorr (ϵ ) occur with probability not greater than δ/3 each. In fact the upper bound on ϵ (253) and Lemma 18 gives that if n max 512 η2(1 2q)4 , 1152e2β 2 (1 q) q tanh β 2 log 6p3 then ϵ satisfies the inequality in (253) with probability at least 1 δ/3. Note that (257) holds for any (0, η) and we will choose = η/4. Under the choice = η/4 (η )2 = 0.3 2S2G2 (η η/4)2 < 512 η2 , η > 0, q [0, 1/2). (260) Recall that B(β, q) max 1 K(β, q), 1 + 2eβp 2 (1 q) q tanh β 2 . (261) Combining (257), (259), (260) and (261) yields ( 512 η2(1 2q)4 , 1152e2βB(β, q) (1 2q)4 , 48e4β The latter gives the sample complexity for accurate predictive learning, it reduces exactly to the noiseless setting of prior work by Bresler and Karzand (2020) and it is continuous because lim q 0+ Γ(β, q) = Γ(β, 0) = 0, β > 0 (263) lim q 0+ K(β, q) = K(β, 0) = 1, β > 0 (264) Predictive Learning on Hidden Tree-Structured Ising Models 2 (1 q) q tanh β 2 = 1 (265) lim q 0+ B(β, q) = Γ(β, 0) = 1, β > 0. (266) To derive a simplified version of (262) note that 1 K(β, q) e2β1q =0 (267) by the definition (168) of K(β, q) and 1 + 2eβp 2 (1 q) q tanh β 2 eβ1q =0 + 2eβ1q =0p 2 (1 q) q tanh β 2 (1 + 3 q)2 e2β1q =0. (268) Then (261), (267) and (268) give B(β, q) (1 + 3 q)2 e2β1q =0 (269) and by the definition (256) Γ(β, q) [0, 1) and Γ(β, 0) = 0 for all β > 0, thus Γ(β, q) 1q =0. (270) Finally, we combine (262), (269), (270) to get ( 512 η2(1 2q)4 , 1152 1 + 3 q 2 e2β(1+1q =0) (1 2q)4 , 48e4β This completes the proof. Appendix F. Theorem 11: KL-Divergence Loss Assume the Ising model tree distributions Pθ according to a tree Tθ = (V, Eθ) and the estimate Pθ according a tree Tθ = (V, Eθ ) The goal is to upper bound the symmetric KL divergence SKL θ||θ = X θst θ st µst µ st with high probability. Under the event Ecorr (ϵ) we can upper bound the quantity |µst µ st| for all (s, t) E with high probability. By using bounds |θst θ st| 2β and |µst µ st| ϵ for all (s, t) E under the event Ecorr (ϵ), we have SKL θ||θ = SKL θ||θ Nikolakakis, Kalogerias and Sarwate θst θ st µst µ st θst θ st µst µ st (p 1) |β ( β)| ϵ by assuming ϵ ηs 2β(p 1). The sufficient number of samples satisfies the inequality n 4 log p2/δ β2(p 1)2 η2s . (273) Now assume that n samples of Y are given, by using the estimate Pθ = ΠTCL (ˆp ) defined in (227) under the event Ecorr (ϵ ) we have µst ˆµ st (1 2q)2 ϵ (1 2q)2 from Lemma 18. In the same way by assuming ϵ ηs(1 2q)2 2β(p 1) , we get n 4 log p2/δ β2(p 1)2 (1 2q)4η2s . (274) Appendix G. Theorem 6 and Theorem 8: Proofs We combine Fano s inequality and a Strong Data Processing Inequality to prove the necessary number of samples in the hidden model setting, first for structure learning (Theorem 6) and then for inference (Theorem 8). We use the following variation of Fano s inequality. Corollary 29 (Tsybakov, 2009, Corollary 2.6): Assume that Θ is a family of M + 1 distributions θ0, θ1, . . . , θM such that M 2. Let Pθi be the distribution of the variable X under the model θi, if i=1 DKL (Pθi||Pθ0) γ log M, for γ (0, 1) (275) then for the probability of error pe the following inequality holds: pe log(M+1) log(2) The construction from the noiseless case, with Corollary 29 and the Strong Data Processing Inequality for the BSC yield the bound of Theorem 6. We start by presenting Bresler s and Karzand s construction, which gives a sufficiently tight upper bound on symmetric KL divergence. Proof of Theorem 6: Consider a family of M + 1 different Ising model distributions {Pθi : i {0, . . . , M}}. This family of the structured distributions is chosen such that the structure recovery task (through The Chow-Liu algorithm) is sufficiently hard. First, we define Pθ0 to be an Ising model distribution with underlying structure a chain with p nodes and parameters θ0 j,j+1 = α, when j is odd and θ0 j,j+1 = β when j is even. The rest of family is Predictive Learning on Hidden Tree-Structured Ising Models constructed as follows: the elements of each θi, i [M] are equal to the elements of θ0 apart from two elements θi i,i+1 = 0 and θi i,i+2 = arctanh(tanh(α) tanh(β)), for each odd value of j. There are (p + 1)/2 distinct distributions in the constructed family. Through the expression (49), we derive the following upper bound on the SKL(Pθ0||Pθi), for all i [M], (Bresler and Karzand, 2020, Section 7.1), SKL(Pθ0||Pθi) = α tanh(α) tanh(α) tanh2(β) 4α tanh(α)e 2β. (276) Strong Data Processing Inequality: For each distribution Pθi and i {0, . . . , M} we consider the distribution of the noisy variable in the hidden model P θi PY|X Pθi. We would like to find an upper bound for the quantities SKL(P θ0||P θi), i {0, . . . , M}. For that purpose, we a use a strong data processing inequality result for the BSC by Polyanskiy and Wu (2017). The input random variable X is considered to have correlated binary elements, while the noise variables Ni are i.i.d Rademacher(q). This scheme is equivalent to the hidden model that we consider in this paper. In fact we have the following bound ηKL 1 (4q(1 q))p, (277) that is proved by Polyanski (Polyanskiy and Wu, 2017, Evaluation for the BSC , equation (39)), where the quantity ηKL is defined as ηKL sup Q sup P:0