# predictive_learning_on_hidden_treestructured_ising_models__17602149.pdf

Journal of Machine Learning Research 22 (2021) 1-82 Submitted 2/19; Published 2/21

Predictive Learning on Hidden Tree-Structured Ising Models

Konstantinos E. Nikolakakis k.nikolakakis@rutgers.edu Department of Electrical & Computer Engineering Rutgers, The State University of New Jersey 94 Brett Road, Piscataway, NJ 08854, USA

Dionysios S. Kalogerias dionysis@msu.edu Department of Electrical & Computer Engineering Michigan State University 428 S. Shaw Lane, MI 48824, USA

Anand D. Sarwate anand.sarwate@rutgers.edu Department of Electrical & Computer Engineering Rutgers, The State University of New Jersey 94 Brett Road, Piscataway, NJ 08854, USA

Editor: Bert Huang

We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (ﬂipping each bit independently with some constant probability q [0, 1/2)). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantiﬁes how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case (q = 0). In fact, for any tree with p vertices and probability of incorrect recovery δ > 0, the suﬃcient number of samples remains logarithmic as in the noiseless case, i.e., O(log(p/δ)), while the dependence on q is O 1/(1 2q)4 , for both aforementioned tasks. We also present a new equivalent of Isserlis Theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation.

Keywords: Ising Model, Chow-Liu Algorithm, Structure Learning, Predictive Learning, Distribution Estimation, Noisy Data, Hidden Markov Random Fields

1. Introduction

Graphical models are a useful tool for modeling high-dimensional structured data. The graph captures structural dependencies: its edge set corresponds to (often physical) interactions between variables. There is a long and deep literature on graphical models (see Koller and Friedman (2009) for a comprehensive introduction), and they have found wide applications in areas such as image processing and vision (Schwing and Urtasun, 2015; Li and Wand, 2016; Lin et al., 2016a; Liu et al., 2017; Morningstar and Melko, 2018; Wu et al., 2017), artiﬁcial

c 2021 Konstantinos E. Nikolakakis, Dionysios S. Kalogerias and Anand D. Sarwate.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-149.html.

Nikolakakis, Kalogerias and Sarwate

intelligence more broadly (Wainwright et al., 2003; Wang et al., 2017), signal processing (Kim and Smaragdis, 2013; Wisdom et al., 2016), and gene regulatory networks (Zuo et al., 2017; Banf and Rhee, 2017), to name a few. An undirected graphical model, or Markov random ﬁeld (MRF) in particular, is deﬁned in terms of a hypergraph G = (V, E), that models the Markov properties of a joint distribution on p |V| node variables (X1, X2, . . . , Xp) X. A tree-structured graphical model is one in which G is a tree. We denote the tree-structured model as T = (V, E). In this paper, we consider binary models on 2p variables (X, Y), where the joint distribution p( ) of X is a tree-structured Ising model distribution on { 1, +1}p and Y = (Y1, Y2, . . . , Yp) is a noisy version of X, such that Yi = Ni Xi and {Ni} are independent and identically distributed (i.i.d.) Rademacher noise with P(Ni = 1) = 1 P(Ni = +1) = q, for all i V. We refer to X as the hidden layer and Y as the observed layer. Under this setting, our objective is to recover the underlying tree structure and accurately estimate the distribution of the hidden layer X (with high probability) using only the noisy observations Y. This is nontrivial because Y does not itself follow any tree structure; this is similar to more traditional problems in nonlinear ﬁltering, where a Markov process of known distribution (and thus, of known structure) is observed through noisy measurements (Arulampalam et al., 2002; Jazwinski, 2007; Van Handel, 2009; Douc et al., 2011; Kalogerias and Petropulu, 2016). The sample complexity of the noiseless version of our model was recently studied by Bresler and Karzand (2020), where the well-known Chow-Liu algorithm (Chow and Liu, 1968) is employed for tree reconstruction. Like them, we also analyze the Chow-Liu algorithm.

1.1 Applications and Motivating Examples

Models for joint distributions characterized by pairwise variable interactions have found many applications, with the Ising model being a popular model for binary variables. Our work is primarily motivated by examples of Ising models corrupted by noise. In many cases, the underlying graph-structured process cannot be observed directly; instead, only a noisy version of the process is available. Examples abound in physics, computer science, biology, medicine, psychology, social sciences, and ﬁnance. Some applications motivating this work include the following: 1) Statistical mechanics of population, social and pedestrian dynamics (see related work by Matsuda et al. (1992); Castellano et al. (2009)): The Ising model can be used to represent the statistical properties of the spreading of a feeling, behavior or the change of an emotional state among individuals in a crowd, where each individual interacts with his neighbors. 2) Epidemic dynamics and epidemiological models by Barnett et al. (2013); Erten et al. (2017): Disease spread can be modeled through the Ising model, where each individual is susceptible (spin down) or ineﬀective (spin up). 3) Neoplastic transitions and related applications in biology (Torquato (2011)): Each cell interacts with neighboring cells. Diﬀerent cases are studied in the literature, for instance, healthy versus cancerous cells, malignant versus benign cells, where both can be modeled as spin up and spin down observations. The probability of diagnostic error is not zero which gives rise to the hidden model that we consider. 4) Diﬀerential Privacy, originally proposed by Dwork et al. (2006a,b): In computer science, diﬀerential privacy is used to guarantee privacy for individuals. A hidden model

Predictive Learning on Hidden Tree-Structured Ising Models

describes data gathered using a locally diﬀerentially private mechanism (Warner, 1965; Kasiviswanathan et al., 2008) such as randomized response. 5) Trading and related applications in economics (see related work by Zhou and Sornette (2007); Takaishi (2015)): The Ising model has been considered in the literature to model increasing (spin up) or decreasing (spin down) price trends in a market.

1.2 Structure Learning for Undirected Graphical Models and Related Work

For a detailed review of methods for structure learning involving undirected and directed graphical models, see the relevant article by Drton and Maathuis (2017). In general, learning the structure of a graphical model from samples can be intractable (Karger and Srebro, 2001; Højsgaard et al., 2012). For general graphs, neighborhood selection methods (Jalali et al., 2011; Bresler, 2015; Ray et al., 2015) estimate the conditional distribution for each vertex in order to learn the neighborhood of each node and therefore the full structure. These approaches may use greedy search or ℓ1 regularization. For Gaussian or Ising models, ℓ1regularization (Ravikumar et al., 2010), the GLasso (Yuan and Lin, 2007; Banerjee et al., 2008), or coordinate descent approaches (Friedman et al., 2008) have been proposed, focusing on estimating the non-zero entries of the precision (or interaction) matrix. Model selection can also be performed using score matching methods (Hyvärinen, 2005, 2007; Nandy et al., 2018; Lin et al., 2016b), or Bayesian information criterion methods (Foygel and Drton, 2010; Gao et al., 2012; Barber et al., 2015). Other works address non-Gaussian models such as elliptical distributions, t-distribution models or latent Gaussian data (Finegold and Drton, 2011; Vogel and Fried, 2011; Vogel and Tyler, 2014; Bilodeau, 2014), or even mixed data (Fan et al., 2017). For treeor forest-structured models, exact inference and the structure learning problem are signiﬁcantly simpler: the Chow-Liu algorithm provides an estimate of the tree or forest structure of the underlying graph (Chow and Liu, 1968; Wainwright et al., 2008; Edwards et al., 2010; Tan et al., 2011; Liu et al., 2011; Daskalakis et al., 2018; Bresler and Karzand, 2020). Furthermore, marginal distributions and maximum values are simpler to compute using a variety of algorithms (sum-product, max-product, message passing, variational inference) (Pearl, 1988; Lauritzen, 1996; Wainwright et al., 2003, 2008)). The noiseless counterpart of the model considered in this paper was studied recently by Bresler and Karzand (2020); in this paper, we extend their results to the hidden case, where samples from a tree-structured Ising model are passed through a binary symmetric channel with crossover probability q [0, 1/2). Of course, in the special case of a linear graph, our model reduces to a hidden Markov model. Latent variable models are often considered in the literature when some variables of the graph are deterministically unobserved (Chandrasekaran et al., 2010; Anandkumar and Valluvan, 2013; Ma et al., 2013; Anandkumar et al., 2014). Our model is most similar to that studied by Chaganty et al. (Chaganty and Liang, 2014), in which a hidden model is considered with a discrete exponential distribution and Gaussian noise. They solve the parameter estimation problem by using moment matching and pseudo-likelihood methods; the structure can be recovered indirectly using the estimated parameters. Connection with Phylogenetic Estimation. In phylogenetic estimation problems the goal is to learn the structure of tree given only observations form the leaves (Erdős et al.,

Nikolakakis, Kalogerias and Sarwate

1999). The sample complexity of phylogenetic reconstruction algorithms grows exponentially with respect to the depth of the tree (Erdős et al., 1999), however if we are interested in reconstructing only parts of the tree which are close to the leaves then the depth of tree does not aﬀect the sample complexity (Daskalakis et al., 2009). The hidden structure learning problem that we consider in this paper is a special case of phylogeny estimation problem with constant depth; there is exactly one noisy observable for each hidden node of the tree. In contrast with phylogenetic estimation approaches, Chow-Liu algorithm is simple and computationally more eﬃcient, while the sample complexity is of the same order1

with the well-known phylogenetic reconstruction methods, to name a few Dyadic Closure method by Erdős et al. (1999), the Contractor-Extender and Cherry-picking algorithms by Daskalakis et al. (2006, 2009, 2013). On the other hand, the approach of distribution estimation by matching the structure and the correlations (Bresler and Karzand, 2020) has not been considered in the phylogenetic estimation literature. Based on the above discussion, the following interesting question naturally rises: How well can we estimate the distribution of a hidden tree structured model while having access only to the leaves of the tree? The latter remains open problem for future work.

1.3 Statement of Contributions

We are interested in answering the following general question: How does noise aﬀect the sample complexity of the structure and predictive learning procedure? That is, given only noisy observations, our goal is to learn the tree structure of the hidden layer in a well-deﬁned and meaningful sense. The MLE-structure from tree-structured (noiseless) data is the output of the Chow-Liu algorithm (Chow and Liu, 1968). However, the MLE-structure from noisy data is not consistent with the hidden structure in general because the graphical model of the observables is a complete graph. Further, the (latent) MLE of the actual interaction parameters θ of the hidden layer is intractable. In Sections 2.4 and 2.6 we explain the importance of Chow-Liu algorithm in our setting, we show why the classical MLE approach fails, and we discuss the connection between the output of the Chow-Liu algorithm and an alternative, projection-based MLE approach. The estimated structure is an essential statistic for estimating the underlying distribution of the hidden layer, allowing for predictive learning. Speciﬁcally, based on the structure estimate, we are also interested in appropriately approximating the tree-structured distribution under study, which can then be used for accurate predictions. We also consider the problem of hidden layer higher-order moment estimation of tree-structured Ising models and, in particular, how such estimation can be eﬃciently performed, on the basis of noisy observations. A summary of the main contributions of this paper is as follows:

A lower bound on the suﬃcient number of samples needed to recover the exact hidden structure with high probability, by using the Chow-Liu algorithm. We also show an upper bound on the necessary number of samples for any algorithm to estimate the hidden structure. The proof of the lower bound follows the general structure of Lemmata 8.1-8.4 by Bresler and Karzand (2020), however we need to extend the necessary events and prove new concentration bounds for the noisy setting. Although

1. while considering the depth ﬁxed

Predictive Learning on Hidden Tree-Structured Ising Models

the graphical model of the observables is a complete graph we show that the Chow-Liu algorithm (with input a ﬁnite number of noisy samples) returns the exact tree of the hidden layer with high probability and we characterize its sample complexity. The proof of the upper bound uses the same construction of the approach in Section 7.1 by Bresler and Karzand (2020) but requires the combination of Fano s inequality and a strong data processing inequality (SDPI) by Polyanskiy and Wu (2017). Speciﬁcally, we show that SDPI s can be a useful tool to derive minimax bounds when closed form expressions or upper bounds of the KL-divergence are hard to be found. The later is of independent interest and it can be applied to other machine learning problems that involve noisy observations. Determination of the suﬃcient and necessary number of samples for accurate predictive learning. We analyze the sample complexity of learning distribution estimates, which can accurately provide predictions on the hidden tree. The estimates are computed using the noisy data. Predictive learning under noisy samples is challenging because structural properties such as the independence of random variables Xi Xj and correlation estimates ˆE[Xi Xj] for (i, j) E do not hold for the noisy observable Y. To overcome this we evaluate the required conditional distributions of the dependent variables, construct a martingale diﬀerence sequence, and prove a high probability bound of the event that involves these variables by applying a concentration bound for supermartingales (generalized Bennet s inequality (Fan et al., 2012)). We refer the reader to Section 4.3 for a detailed discussion about the technical contributions and a sketch of proof of the main result. A closed-form expression and a computationally eﬃcient estimator for higher-order moment estimation in tree-structured Ising models. This result corresponds to an equivalent statement of Isserlis theorem for sign-valued tree models. Given pair-wise correlations and the tree (or estimates of both, from noisy or noiseless data) we provide an algorithm that runs on the tree and returns the expression of high-order moments. The proof involves the existence and identiﬁcation of (minimum length) disjoint paths among any set of pairs of nodes. The proposed algorithm (Algorithm 2) identiﬁes these paths that yield the expression of the moments. The results may be of independent interest for a computational eﬃcient exact or approximated higher-moment evaluation.

Our main results Theorem 5 and Theorem 7 provide the amount of ﬁnite samples needed for exact structure recovery and accurate predictive learning with high probability. Although we are interested in the ﬁnite sample complexity bounds, our results are also asymptotically optimal. That is, for any ﬁxed (constant) q [0, 1/2) the order of the upper bound (necessary number of samples) matches the corresponding (lower) minimax bound. The sample complexity bounds that we provide are the extended form of state of the art (noiseless setting) bounds by Bresler and Karzand (2020). By setting q = 0, our bounds reduce to the noiseless setting bounds. Further, the explicit version of our results (see Section 3) are continuous functions of the cross-over probability q.

1.4 Notation

Boldface indicates a vector or tuple and calligraphic face for sets and trees. The sets of even and odd natural numbers are 2N and 2N + 1 respectively. For an integer n, deﬁne

Nikolakakis, Kalogerias and Sarwate

Symbol Meaning p number of variables nodes in the tree p(x) exp P (i,j) E θi,jxixj /Z(θ), x { 1, +1}p, Z(θ): partition function α minimum |θij| in the Ising model, mini,j V |θij| β maximum |θij| in the Ising model, maxi,j V |θij| T Original tree of the model PT(α, β) set of tree-structured Ising models with α |θij| β n number of samples q crossover probability of the BSC, q [0, 1/2) cq 1 2q p( ) distribution of the hidden node variables, X p( ) PT(α, β) p ( ) distribution of the observable node variables Y p ( ) 1A indicator function of the set A DKL KL divergence SKL symmetric KL divergence I(X, Y ) mutual information of X,Y d TV total variation distance L(2)(P, Q) supi,j V d TV (Pij, Qij), and Pij, Qij the pairwise marginals of P, Q X1:n n independent observations of X Y1:n n independent observations of Y TCL Chow-Liu-estimated structure from noiseless data X1:n

TCL Chow-Liu-estimate of the hidden tree structure T from noisy data Y1:n

path T(w, w) the set of edges which connects the nodes w, w VT ˆµi,j 1 n Pn k=1 X(k) i X(k) j ˆµ i,j

1 n Pn k=1 Y (k) i Y (k) j ΠTCL (ˆp ) estimator of the distribution p( ) from noisy data Y1:n

η maximum error on the distribution estimation: L(2)(P, ˆP) η δ maximum probability of error, the notation depends on the task

in structure estimation: P(TCL = T) δ

in predictive learning: P L(2) p( ), ΠTCL (ˆp ) η 1 δ.

Table 1: Notation/Deﬁnitions.

[n] {1, 2, . . . n}. The indicator function of a set A is 1A. For a graph G = (V, E), V = [p] indexes the set of variables {X1, X2, . . . , Xp}, for any pair of vertices i, j V the correlation

µij = E [Xi Xj] and for any edge e = (i, j) E it is µe E[Xi Xj]. For two nodes w, w of a tree, the term path(w, w) denotes the set of edges in the unique path with endpoints w and w. Further, BSC(q)p denotes a binary symmetric channel with crossover probability q and block-length p. The BSC(q)p is a conditional distribution from { 1, 1}p { 1, 1}p that acts componentwise independently on X to generate Y, such that Xi = Ni Yi and N is a vector of i.i.d. Rademacher variables equal to +1 with probability 1 q. We use the symbol to indicate the corresponding quantity for the observable (noisy) layer. For instance, p ( ) is the probability mass function of Y and µ i,j E[Yi Yj] corresponds to the correlation of

Predictive Learning on Hidden Tree-Structured Ising Models

variables Yi, Yj, where Yi generates noisy observations of Xi, for any i V. For our readers convenience, we summarize the notation in Table 1.

Figure 1: The simulation corresponds to structure learning. Comparison of the experimental results (heat-map) and the theoretical bound of Theorem 5, the bound that yields (1). The colored regions denote diﬀerent values of the estimated probability of error δ (at least one edge has been missed). The value of δ varies between 0 and 1 while the parameters α = 0.2, β = 1.1, p = 100 are ﬁxed. The red line shows the bound from Theorem 5 (the explicit form of Theorem 1). The code of the experiment is available at https://github.com/Konstantinos Nikolakakis/ Structure-Learning.

1.5 Summary of the Results

In this section, we present a summary of the main results of our work up to constant factors C, C > 0. We refer the reader to Table 1 for the deﬁnition of the model parameters. We provide the explicit statements of the results, and we specify the constants in Section 3. Recall that, the random vector Y { 1, +1}p is the output of the binary symmetric channel BSC(q)p with input the random vector X p( ) PT(α, β).

1.5.1 Structure Learning

The ﬁrst results provides the suﬃcient number of samples for exact structure recovery.

Theorem 1 (Sample Complexity for Structure Learning.) The Chow-Liu algorithm with input n noisy samples Y1:n exactly estimates the hidden tree structure TCL T with probability at least 1 δ (0, 1), as long as

n > C e2β(1+1q =0)

(1 2q)4 tanh2(α) log(p/δ). (1)

Nikolakakis, Kalogerias and Sarwate

The order with respect to β is O(e4β) for all q > 0. The bound in (1) exactly reduces to the noiseless case (Bresler and Karzand, 2020, Theorem 3.2). Additionally, the explicit form of the result, Theorem 5, shows that the bound is also a continuous function of q [0, 1/2). The next proposition gives the necessary number of samples for exact structure recovery.

Proposition 2 No algorithm can recover the structure with probability great than 1/2 if

n < C e2β[1 (4q(1 q))p] 1

α tanh(α) log (p) . (2)

Note that the terms (1 2q) 4 and [1 (4q(1 q))p] 1 introduce a gap between the sample complexity of (1) and (2). However, the sample complexity of Theorem 1 is indeed accurate. To illustrate this experimentally, we show that the theoretical and experimental bounds exactly match, see Figure 1. The latter indicates that the Chow-Liu algorithm requires exactly the number of samples that our theoretical result suggest (see Figure 1). On the other hand, Proposition 2 provides the necessary number of samples, for any algorithm. Finally, we conjecture that the bound of Proposition 1.2 is tight only under the low temperature regime |θi,j| for all i, j E. The derivation of generalized tighter forms of the bound in (2) is challenging and left for future work.

1.5.2 Predictive Learning

To learn the tree-shaped distribution p( ) PT(α, β) of X from n noisy samples Y1:n, we ﬁrst estimate the correlations ˆµ i,j for all i, j V. We then estimate the tree structure TCL by running the Chow-Liu algorithm with input the candidate edge weights ˆµ i,j and ﬁnally evaluate the estimator of p( ) (by matching correlations) as follows2

ΠTCL (ˆp ) 1

1 + xixj ˆµ i,j (1 2q)2 2 , x { 1, +1}p. (3)

Note that one restriction of our approach is that the distribution estimator requires the value q to be known. The same restriction appears in other structure learning from noisy data approaches (Goel et al., 2019). However, in our setting q is required only for the predictive learning, while the Chow-Liu algorithm and the structure estimation does not require q to be known. Under the assumption that q is unknown, one can ﬁrst learn its value through an independent procedure (Goel et al., 2019, Section 5). The accuracy of the estimated distribution in (3) is measured by the small-set Total Variation (ss TV), that captures the estimation error on the kth-order marginals (Georgii, 2011; Rebeschini et al., 2015; Bresler and Karzand, 2020). Let PS, QS denote the marginals of P, Q on a set S V, and |S| = k. Then the kth order ss TV of P and Q is deﬁned as

L(k) (P, Q) sup S:|S|=k d TV (PS, QS) . (4)

2. The distribution in (3) is a function of x, however we suppress the notation for consistency with prior work and for sake of space.

Predictive Learning on Hidden Tree-Structured Ising Models

Figure 2: The simulation corresponds to predictive learning. Comparison of the experimental results (heat-map) and the theoretical bound of Theorem 3. The colored regions denote diﬀerent values of the estimated probability of error δ (ss TV to be greater than a ﬁxed number η). The value of δ varies between 0 and 1 while the parameters η = 0.03, β = 1.1, p = 31 are ﬁxed. The code of the experiment is available at https://github.com/Konstantinos Nikolakakis/ Predictive-Learning.

The next results provides the necessary number of samples for accurate distribution estimation by guaranteeing that the L(2) is less than a small positive number η with high probability. We provide guarantees on higher-order marginals (k > 2) in Section 3.3.

Theorem 3 (Sample Complexity for Predictive Learning) Fix δ (0, 1). Choose η > 0 (independent of δ). If

( 1 η2(1 2q)4 , e2β(1+1q =0)

(1 2q)4 , e4β

P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (6)

Note that the dependence on β is O(e4β) for accurate distribution learning from noisy data (similarly to the structure learning task, Theorem 1). The bound in (5) exactly reduces to the noiseless setting bound by (Bresler and Karzand, 2020, Theorem 3.3). Theorem 3 is a short version of the main result of the paper. The explicit statement, Theorem 7, shows that the bound is also continuous at q 0. Conversely, the following proposition gives an upper bound on the necessary number of samples for accurate marginal distributions estimation under the assumption β > α.

Nikolakakis, Kalogerias and Sarwate

Suﬃcient Number of Samples Task/Setting Noiseless (prior work) Noisy

Structure Learning C e2β

tanh2(α) log(p/δ) C e2β(1+1q =0)

(1 2q)4 tanh2(α) log(p/δ)

Predictive Learning C max{η 2, e2β} log(p/δ) C max n η 2 (1 2q)4 , e2β(1+1q =0)

(1 2q)4 , e4β1q =0

η2 o log (p/δ)

Table 2: Suﬃcient number of samples for accurate structure and predictive learning.

Necessary Number of Samples Task/Setting Noiseless (prior work) Noisy

Structure learning C e2β α tanh(α) log (p) C e2β[1 (4q(1 q))p] 1

α tanh(α) log (p)

Predictive learning

C η 2 log(p) C η 2[1 (4q(1 q))p] 1 log(p)

Table 3: Necessary number of samples for structure and predictive learning.

Proposition 4 Fix η > 0 such that η (tanh(β) tanh(α))/2. Then no algorithm can accurately estimate the distribution of the hidden variables (ss TV less than η > 0) with probability greater than 1/2 if

n < C η 2[1 (4q(1 q))p] 1 log(p). (7)

A quick comparison of (5) and (7) shows that there is a gap between the suﬃcient and necessary number of samples. Our experiments (Figure 2) conﬁrm the accuracy of our theoretical results. For instance the bound of Theorem 3 exactly matches the experimental curve. For further discussion related to the gap between the upper and lower bounds see Section 2.6. Further, we conjecture that bound in (7) is tight only under the low temperature regime, similarly to the Proposition 2. The derivation of tighter characterization of the necessary number of samples Propositions 2 and 4 remains an problem for future work. Additional plots of the experiments are provided in Section 3.4. Finally, Table 2 and 3 summarize the state-of-the-art bounds of the noiseless setting by Bresler and Karzand (2020) and the extended version under the noisy setting that we study in this paper. To summarize, the following holds for both structure and predictive learning: the dependence on the parameter β is of the order O(e2β) for q = 0 and becomes O(e4β) for positive values of q. Further, the bounds are continuous functions of q, as our results suggest (for the continuity see the explicit form of the results Theorem 5 and Theorem 7.) Similarly to the noiseless case, the following statement holds when noise exists as well: Under the high temperature regime (α close to zero), structure learning requires much more data than the predictive learning task, because of the tanh2(α) in the denominator of the bound in (1). On the contrary, the required number of samples for predictive learning (5) does not depend on α. Speciﬁcally, exact structure recovery is not necessary for learning the distribution

Predictive Learning on Hidden Tree-Structured Ising Models

eﬃciently, that is, weak edges identiﬁcation failure does not aﬀect the predictive learning task. We refer the reader to Section 4.3 for the deﬁnition of weak/strong edges and additional explanation. Finally, for q > 0 an extra term that involves both β and η appears in the bound of Theorem 3, while for values of q close to zero and q = 0 vanishes. The pairwise correlations of end-point vertices (E[Xi Xj]: (i, j) E) are suﬃcient statistics, and as expected, the accuracy of pairwise marginals corresponds to accuracy of higher order marginals and accurate estimation of higher order moments. In Sections 3.3 and 4.4 we provide a method for evaluating higher order moments (and marginals) from noisy observations. Our approach is based on an equivalent of Isserlis theorem for tree-structured Ising models that is also of independent interest.

2. Preliminaries and Problem Statement

In this section, we introduce our model of hidden sign-valued Markov random ﬁelds on trees.

2.1 Undirected Graphical Models

We consider sign-valued graphical models where the joint distribution p( ) has support { 1, +1}p. Let X = (X1, X2, . . . , Xp) { 1, +1}p be a collection of sign-valued (binary) random variables. Then, 1Xi=xi (1 + xi Xi)/2, and the distribution of X is

S V:|S|=k E

, x { 1, +1}p. (8)

In this paper we assume that the marginal distributions of the Xi are uniform, that is,

P (Xi = 1) = 1

2, i V. (9)

Thus, E[Xi] = 0, for all i V. A distribution is Markov with respect to a hypergraph G = (V, E) if for every node i in the set V it is true that P Xi| x V\{i} = P Xi| x N(i) , where N(i) is the set of neighbors of i in G. One subclass of distributions for which the Markov property holds is the Ising model, in which the random variables Xi are sign-valued and the hypergraph is a simple undirected graph, indicating that variables have only pairwise and unary interactions. The joint distribution for the Ising model with zero external ﬁeld is given by

p(x) = 1 Z(θ) exp

(s,t) E θstxsxt

, x { 1, 1}p. (10)

{θst : (s, t) E} are parameters of the model representing the interaction strength of the variables and Z( ) (0, ) is the partition function. These interactions are expressed through potential functions exp(θstxsxt) that ensure that the Markov property holds with respect to the graph G = (V, E). Next, we discuss the properties of distributions of the form of (8), which are Markov with respect to a tree.

Nikolakakis, Kalogerias and Sarwate

2.2 Sign-Valued Markov Fields on Trees

From prior work by Lauritzen (1996), it is known that any distribution p( ) that is Markov with respect to a tree (or forest) T = (V, E) factorizes as

i V p (xi) Y

p(xi, xj) p(xi)p(xj), x { 1, +1}p, (11)

and we call p( ) as tree (forest) structured distribution, to indicate the factorization property. If the distribution p( ) has the form of (8) with P(Xi = 1) = 1/2, for all i V, and is Markov with respect to a tree T, then

1 + xixj E [Xi Xj]

E [Xi Xj] = Y

e path(i,j) µe, for all i, j V. (13)

(see Appendix A, Lemma 12). Additionally, let us state the deﬁnition of the so-called Correlation (coeﬃcient) Decay Property (CDP), that will be of central importance in our analysis.

Deﬁnition 1 The CDP holds if and only if |E[Xi Xk]| |E[XℓXm]| for all tuples {i, k, ℓ, m} V such that path(i, k) path(ℓ, m).

The CDP is a well known attribute of acyclic Markov ﬁelds (see, e.g., Tan et al. (2010), Bresler and Karzand (2020)). Further, it is true that the products Xi Xj for all (i, j) E are independent and the CDP holds for every p( ) of the form of (8), that factorizes with respect to a tree (see Lemma 13, Appendix A). This is a consequence of property (13) and the inequality |µe| 1, for all e E. We can interpret the CDP as a type of data processing inequality (see Cover and Thomas (2012)). The connection is clear through the relationship between the mutual information I(Xi, Xj) and the correlations E[Xi Xj], namely,

I (Xi, Xj) = 1

2 log2 (1 E [Xi Xj])1 E[Xi Xj] (1 + E [Xi Xj])1+E[Xi Xj] , (14)

for any pair of nodes i, j V. This expression shows that the mutual information is a symmetric function of E [Xi Xj] and increasing with respect to |E [Xi Xj]| (see also Lemma 17, Appendix A). Tree-structured Ising models: Despite its simple form, the Ising model has numerous useful properties. In particular, (12), (13) hold for any tree-structured Ising model with uniform marginal distributions and θr = 0 for all r V. Furthermore,

E[Xi Xj] = tanh θij, (i, j) ET, (15)

the latter implies that

1 + xixj tanh θij

2 , x { 1, 1}p, α |θij| β, (16)

Predictive Learning on Hidden Tree-Structured Ising Models

E[Xi Xj] = Y

e path(i,j) µe = Y

e path(i,j) tanh (θe) , i, j V. (17)

A short argument showing (15) and (16) is included in Appendix A, Lemma 14. For the rest of the paper, we assume a tree-structured Ising model for the hidden variable X, that is, the distribution of X has the form of (12). We also impose a reasonable compactness assumption on the respective interaction parameters, as follows.

Assumption 1 There exist α and β such that for the distribution p( ), 0 < α |θst| β < for all (s, t) E.

For a ﬁxed tree structure T, and for future reference, we hereafter let PT(α, β) be the class of Ising models satisfying Assumption 1.

2.3 Hidden Sign-Valued Tree-Structured Models

The problem considered in this paper is that of learning a tree-structured model from corrupted observations. Because we have no access to the original samples X1:n, we obtain the noisy observations Y1:n. To formalize this, consider a hidden Markov random ﬁeld whose hidden layer X is an Ising model with respect to a tree, i.e., X p( ) PT(α, β), as deﬁned in (16). The observed variables Y are formed by setting Yr = Nr Xr for all r V, where {Nr} are i.i.d. Rademacher(q) random variables. Let p ( ) be the distribution of the observed variables Y. We can think of Y as the result of passing X through a binary symmetric channel BSC(q)p. We have the following expressions

E[Nr] = 1 2q cq, r V, and q [0, 1/2), (18)

µ r,s = E [Yr Ys] = E [Nr Xr Ns Xs] = (1 2q)2 E [Xr Xs] , r, s V. (19)

The distribution p ( ) of Y also has support { 1, +1}p, and so the joint distribution satisﬁes the general form (8). Since the marginal distribution of each Yr is also uniform, E[Yr] = 0 for all r V, (8) and (18) yield

k [p] 2N ck q X

S V:|S|=k E

, y { 1, 1}p. (20)

The moments of the hidden variables E Q s S Xs in (20) can be expressed as products of the pairwise correlations E[Xs Xt], for any (s, t) ET (Section 3.3, Theorem 10). From (20) it is clear that the distribution p ( ) of Y does not factorize with respect to any tree, that is, p ( ) / PT(α, β) in general.3

2.4 Hidden Structure Estimation

We are interested in characterizing the sample complexity of structure recovery: given data generated from p( ) PT(α, β) for an unknown tree T, what is the minimum number n

3. Lemma 30 shows the structure preserving property for the observable layer holds for the special case of single-edge forests.

Nikolakakis, Kalogerias and Sarwate

Algorithm 1 Chow Liu

Require: D = y(1), y(2), . . . , y(n) { 1, 1}p n, where y(k) is the kth observation of Y

1: Compute ˆµ i,j 1

n Pn k=1 y(k) i y(k) j , for all i, j V

2: return TCL Maximum Spanning Tree i =j n ˆµ i,j o

of samples {y(i), i [n ]} from p ( ) needed to recover the (unweighted) edge set of T with high probability? In particular, we would like to quantify how n depends on the crossover probability q. Intuitively, noise makes weak edges to appear weaker , and the sample complexity is expected to be an increasing function of q. Because the distribution p ( ) of the observable variables does not factorize according to any tree, this problem does not follow directly from the noiseless case. Although the classical MLE is the standard approach for the noiseless case, for the noisy setting the MLE estimation of parameters θ of the hidden model is intractable, due to the summation over the support of X. Additionally, the MLE structure estimate from noisy data is not in general consistent with the hidden structure as we explain in Section 2.6. However, for the model that we consider in this paper, the projected-MLE estimate of the observables onto the space of tree-structured models gives a consistent structure estimate. Additionally, that structure estimate is identical to the output of Chow-Liu algorithm (Algorithm 1) from noisy data. We refer the reader to Section 2.6 for the discussion about the MLE and the connection with the noisy Chow-Liu algorithm. In this work, we use and analyze the sample complexity of the classical Chow-Liu algorithm (Algorithm 1) for the following reasons: We show that given ﬁnite number of noisy data as input, the Chow-Liu algorithm recovers the original tree T with high probability. Further the sample complexity is asymptotically optimal for ﬁxed q < 1/2 (see Tables 2 and 3), and its order remains O(log p) in the high dimensional regime. The algorithm is computationally eﬃcient in comparison to other optimization techniques and it does not require the value q to be known. Additionally, Algorithm 1 solves the projected-MLE problem that we discuss in Section 2.6. The above reasons and our ﬁnite sample complexity bound Theorems 1 and 5 suggest that Algorithm 1 is an excellent approach for tree-structure learning from noisy data.

2.5 Evaluating the Accuracy of the Estimated Distribution

In addition to recovering the graph structure, we are interested in the goodness of ﬁt of the estimated distribution. Let PS, QS be the marginal distributions of P, Q on the set S V, let d TV denote the total variation distance, and ﬁx k = 2. We measure the error of distribution estimator through the small set Total Variation (or ss TV) distance as deﬁned by Bresler and Karzand (2020)

L(k) (P, Q) sup S:|S|=k d TV (PS, QS) . (21)

If Q is an estimate of P, the norm L(k) guarantees predictive accuracy because (Bresler and Karzand, 2020, Section 3, page 720)

EXS h |P (Xi = +1|XS) Q (Xi = +1|XS)| i 2L(|S|+1) (P, Q) . (22)

Predictive Learning on Hidden Tree-Structured Ising Models

The estimated (from noisy data) distribution of the hidden variables in (3) is a simple extension of the noiseless estimate. In fact the estimated distribution factorizes according to the estimated from noisy data tree structure, that is the output of Algorithm 1. Further, the pairwise correlations are normalized by the constant (1 2q). As a result, the estimator is consistent because if n then TCL T, ˆµ i,j/(1 2q) µi,j, and as a consequence the estimate ΠTCL (ˆp ) converges to the original distribution p( ) of X. Our main result gives a lower bound on the number of samples needed to guarantee accurate estimation (in the sense of small ss TV), with high probability.

2.6 Maximum Likelihood Estimate

A natural ﬁrst place to start in estimation is the maximum-likelihood estimate (MLE). We explain why this is problematic and show a method (the projected-MLE) which turns out to be equivalent to the Chow-Liu algorithm. This motivates why we study the Chow-Liu algorithm in the ﬁrst place. To begin, the distribution of the observables parametrized over the interaction parameters θ of the hidden layer is

(s,t) EG θstxsxt

p(y|x), y { 1, 1}p. (23)

It is known that above expression is intractable in closed form and it can be evaluated only through approximations. Secondly, the log-likelihood of Y can be written as

log p (y) = log X

x { 1,+1}p p(y|x) Y

i V p (xi) Y

p(xi, xj) p(xi)p(xj), y { 1, 1}p, (24)

and the logarithm of the summation cannot be expressed as summation of logarithms. Therefore we see the classical MLE structure estimation approach is not applicable for hidden models. Speciﬁcally, the structure of the observable layer is a complete graph and not a tree (there is no conditional independence between Y s). The maximum likelihood structure estimate with respect to the parameters θ of the observables in general will return a complete graph. Speciﬁcally, let G = (V, EG) be the graph (which is complete) of the observable layer, then the distribution p ( ) is an Ising-Model distribution and it can be written as

p (y) = 1 Z (θ ) exp

(s,t) EG θ stysyt

, y { 1, 1}p. (25)

Since all the edges exist in the edge set, none of the values θ st is zero. As a consequence, even asymptotically (n ) the maximum likelihood that estimates the parameters θ st gives a complete graph. Recall that we want to recover the structure of the hidden layer which is a tree. Thus, the maximum likelihood structure estimate directly applied on (25) is not consistent, because of the diﬀerent hidden and observables structure. To overcome the inconsistency that is introduced by the noise, we can project the distribution p (y) to a set of tree-structured distributions and then ﬁnd the maximum likelihood structure estimate. We denote the projection of p (y) onto the space of trees as p T (y)

Nikolakakis, Kalogerias and Sarwate

and we call the MLE with respect to p T (y) as projected-MLE (PMLE). Then the following questions are natural: Is the PMLE always consistent with respect to structure of the hidden layer? Is the PMLE asymptotically optimal (n )? Is the PMLE optimal for ﬁnite values of n? (by optimal we mean that the sample complexity bound matches the minimax bound). We continue by answering the questions above. First we present the structural consistency and then we continue by discussing the asymptotic optimality and optimality for ﬁnite n. Although, the PMLE is not in general consistent with structure of the hidden layer (see also related work by Nikolakakis et al. (2020)), for the setting of the BSC channel with i.i.d noise we do have ˆTPMLE T when n . In fact, the projected distribution as p T (y) is given by

p T (y) argmin Q( ) PT(α,β) DKL(p (y)||Q(y)). (26)

The proof of the claim follows by a standard argument (see also Lemma 1 and Lemma 2 by (Bresler and Karzand, 2020, Supplemetary material, Appendix A)) and it gives

DKL(p (y)||p T (y)) = 1 H(p (y)) + X

1 + (1 2q)2µi,j

As a consequence the projected-MLE ˆTPMLE is

ˆTPMLE = argmin T T

(i,j) ET HB

1 + ˆµ i,j 2

and the following

(i,j) ET HB

1 + (1 2q)2µi,j

(i,j) ET HB

gives that ˆTPMLE TCL T (almost surely) when n . Although, the above discussion of the consistency for n shows the connection with MLE, our results for instance Theorem 3.1 shows that the Chow-Liu algorithm returns the original tree for ﬁnite n with probability 1 δ. Additionally, the PMLE is asymptotically optimal, however for ﬁnite n it may be not optimal. For our structure/predictive learning problem our bounds are asymptotically optimal (up to constants). That is, for ﬁxed q the upper and lower bounds match as n . Nevertheless for ﬁnite n the PMLE is not optimal in general. It is known that under the presence of noise the MLE approach may be non-robust and sub-optimal and extra steps should be considered including pre-processing, statistical learning of the noise by using pilot samples, and detecting and rejecting bad samples (for further information see also Zoubir et al. (2012, page 62) and Nikolakakis et al. (2020)). The reason that we consider Chow-Liu algorithm in our work is that it is computationally eﬃcient, while its sample complexity remains logarithmic with respect to p even when noise exists. The latter makes the Chow Liu algorithm useful in practice when only noisy observations are available. Finally, to give

Predictive Learning on Hidden Tree-Structured Ising Models

further insight about the gap between the upper and lower bounds we present an example in Section H.1 (Appendix), for which perfect denoising is possible for p before running the Chow-Liu algorithm. As consequence, for p the bounds in Propositions 2 and 4 reduce to the noiseless case as they should. This example is a marginal case (since perfect denoising is not possible in general) and it aﬀects our converse results which are universal and owe to include corner cases.

3. Main Results

The main question asked by this paper is as follows: what is the impact of noise on the sample complexity of learning a tree-structured graphical model in order to make predictions? This corresponds to sampling variables Y generated by sampling X from the model (10) and randomly ﬂipping each sign independently with probability q. We use the Chow-Liu algorithm to estimate the hidden structure using the noise-corrupted samples. We ﬁrst ﬁnd upper (Theorem 5) and lower bounds (Theorem 6) on the sample complexity for exact hidden structure recovery using the Chow-Liu algorithm on noisy observations. Secondly, we use the structure statistic to derive an accurate estimate of the hidden layer s probability distribution. The distribution estimate is computed to be accurate under the ss TV utility measure, that was introduced by Bresler and Karzand (2020). Furthermore, the estimator of the distribution factorizes with respect to the structure estimate, while the ss TV metric ensures that the estimated distribution is a trustworthy predictor. Theorem 7 and Theorem 8 give the suﬃcient and necessary sample complexity for accurate distribution estimation from noisy samples. These theorems generalize the results for the noiseless case (q = 0) by Bresler and Karzand (2020) and lead to interesting connections between structure learning on hidden models and data processing inequalities (Raginsky, 2016; Polyanskiy and Wu, 2017). The third part of the results includes Theorem 10, which gives an equivalent of Isserlis theorem by providing closed form expressions for higher order moments of sign-valued Markov ﬁelds on trees. Based on Theorem 10 we propose a low complexity algorithm to estimate any higher order moment of the hidden variables given the estimated tree structure and estimates of the pairwise correlations (both evaluated from observations corrupted by noise). Finally, Theorem 11 gives the suﬃcient number of samples for distribution estimation, when the symmetric KL divergence is considered as utility measure. These give rise to extensions of testing algorithms Daskalakis et al. (2018) under a hidden model setting.

3.1 Tree Structure Learning from Noisy Observations

Our goal is to learn the tree structure T of an Ising model with parameters |θst| [α, β], when the nodes Xi are hidden variables and we observe Yi Ni Xi, i V, where Ni Rademacher(q) are i.i.d, for all i V and for all q [0, 1/2). We derive the estimated structure TCL by applying the Chow-Liu algorithm (Algorithm 1) (Chow and Liu, 1968). Instead of mutual information estimates, our Chow-Liu algorithm (Algorithm 1) requires correlation estimates; these are suﬃcient statistics because of (14). Further, it can consistently recover the hidden structure through noisy observations. The latter is true because of the order preserving property of the mutual information. That is, the stochastic map-

Nikolakakis, Kalogerias and Sarwate

ping X BSC(q)p Y allows structure recovery of X by observing Y, because for any tuple Xi, Xj, Xi , Xj such that I (Xi; Xj) I Xi , Xj , it is true that I (Yi; Yj) I Yi , Yj . The proof directly comes from (14) and (19). In addition, the monotonicity of mutual information with respect to the absolute values of correlations allows us to apply the Chow-Liu algorithm directly on the estimated correlations ˆµ i,j 1/n Pn k=1 (Yi)(k) (Yj)(k). Notice

that because of (19), ˆµ i,j can be used as an alternative of ˆµi,j. The algorithm returns the maximum spanning tree TCL . Further discussion about the Chow-Liu algorithm is given in Section 4.1. The following theorem provides the suﬃcient number of samples for exact structure recovery through noisy observations.

Theorem 5 (Suﬃcient number of samples for structure learning) Let Y be the output of a BSC(q)p, with input variable X p( ) PT(α, β). Fix a number δ (0, 1). If the number of samples n of Y satisﬁes the inequality

n 32 h 1 (1 2q)4 tanh β i

(1 2q)4 (1 tanh β)2 tanh2 α log 2p2

then Algorithm 1 returns TCL = T with probability at least 1 δ.

Theorem 5 characterizes the ﬁnite-sample performance of the Chow-Liu estimator and by taking n we can see that Algorithm 1 is consistent in the noisy setting. As a consequence of (30) and the inequality 1 tanh(β) e 2β, if the number of samples satisﬁes the following bound

h 1q=0 + e2β (1 2q) 4 tanh(β) 1q =0 i log(p/δ), (31)

then the structure is exactly recovered with probability at least 1 δ. The latter gives the statement of Theorem 1. Complementary to Theorem 5, our next result characterizes the necessary number of samples required for exact structure recovery. Speciﬁcally, we prove a lower bound on the sample complexity that characterizes the necessary number of samples for any estimator ψ.

Theorem 6 (Necessary number of samples for structure learning) Let Y be the output of a BSC(q)p, with input variable X p( ) PT(α, β). If the given number of samples of Y satisﬁes the inequality

n < [1 (4q(1 q))p] 1

16α tanh(α) e2β log (p) , (32)

then for any estimator ψ, it is true that

inf ψ sup T T p( ) PT(α,β)

P ψ Y1:n = T > 1

Predictive Learning on Hidden Tree-Structured Ising Models

It can be shown that the right hand-side of (30) is greater than the right-hand side of (32) for any q in [0, 1/2) (and for all possible values of p, β, α), by simply comparing the two terms. Theorems 5 and 6 reduce to the noiseless setting by setting q = 0 (Bresler and Karzand (2020)). The sample complexity is increasing with respect to q, and structure learning is always feasible as long as q = 1/2. Let n denote the required samples under a noiseless setting assumption, then for a ﬁxed probability of exact recovery, we always need n n because 1 (1 2q)4 tanh(β)

[(1 2q)4(1 tanh(β))] 1, q h 0, 1

and β R. (34)

Furthermore,

1 1 (4q(1 q))p 1, q [0, 1/2) and p N, (35)

the latter shows that the sample complexity in a hidden model is greater than the noiseless case (q = 0), for any measurable estimator (Theorem 6). When q approaches 1/2, the sample complexity approaches inﬁnity, n , and the structure learning is impossible. Theorem 6 extends Theorem 3.1 by Bresler and Karzand (2020) to our hidden model. Our results combines Bresler s and Karzand s method and a strong data processing inequality (SDPI) by Polyanskiy and Wu (2017, Evaluation of the BSC). Upper bounds on the symmetric KL divergence for the output distribution p ( ) can not be found in a closed form. However, by using the SDPI, we manage to capture the dependence of the bound on the parameters α, β, q and derive a non-trivial result. When p , the bound becomes trivial since limp 1/ [1 (4q(1 q))p] 1, giving the classical data processing inequality (contraction of KL divergence for ﬁnite alphabets, (Raginsky, 2016; Polyanskiy and Wu, 2017)). While direct application of the SDPI is simple and provides an upper bound which is almost insensitive to q (for suﬃciently large p), it introduces a gap between the lower and upper bounds. Nevertheless, it is important because it indicates a possible non-optimal performance of the classical Chow-Liu algorithm (under a hidden model). We conjecture that the sample complexity bounds in (Theorem 6 and Theorem 8) are tight only under the low temperature regime |θi,j| = β for all i, j E, while in general (θi,j [α, β]) the inequalities hold but they are not tight. For further explanation related to the gap between the upper and lower bounds see Section 2.6. The latter is a consequence of the SDPI, which is tight for the repetition code (Polyanskiy and Wu, 2017, Evaluation for the BSC, page 12). We performed extensive simulations (c.f. Figures 1, 2) that suggests that our bound does indeed accurately characterize the performance of Chow-Liu. These simulations choose p = 100, but our evidence shows that the dependence on q is not aﬀected for larger (p = 200) or smaller (p = 50) values of q. We believe that the term 1/[(1 (4q(1 q))p] does not characterize the Chow-Liu algorithm, but possibly a more complicated algorithm.

3.2 Predictive Learning from Noisy Observations

In addition to recovering the structure of the hidden Ising model, we are interested in estimating the distribution p( ) PT(α, β) itself. If the L(2) distance between the estimator and the true distribution is suﬃciently small, then the estimated distribution is appropriate

Nikolakakis, Kalogerias and Sarwate

for predictive learning because of (22). For consistency, this distribution should factorize according to the structure estimate TCL and for the predictive learning part, the estimate TCL is considered the output of the Chow-Liu algorithm (see Algorithm 1). We continue by deﬁning the distribution estimator of p( ) as

ΠTCL (ˆp ) 1

1 + xixj ˆµ i,j (1 2q)2 2 . (36)

The estimator (36) can be deﬁned for any q [0, 1/2). For q = 0 it reduces to that in the noiseless case, since TCL TCL, ˆµ i,j ˆµi,j, and thus ΠTCL ˆp ΠTCL ˆp . It is also closely related to the reverse information projection onto the tree-structured Ising models (Bresler and Karzand, 2020, supplementary material, Appendix A), in the sense that

ΠT(P) = argmin Q PT(α,β) DKL (P||Q) , P PT(α, β). (37)

To compute ΠTCL (ˆp ), two suﬃcient statistics are required: the structure TCL and the set of second order moments (Chow and Liu, 1968; Bresler and Karzand, 2020), under the assumption that q is known. The next result provides a suﬃcient condition on the number of samples to guarantee that the L(2) distance between the true distribution and the estimated distribution is small with probability at least 1 δ. Note that the dependence on β changes from e2β to e4β when the data are noisy q > 0, while for q = 0 our bound exactly recovers the noiseless case (Bresler and Karzand, 2020). A key component of the bound is the following function

Γ(β, q) 1 (1 2q)2

1 (1 2q)4 tanh2(β)

2 , β > 0 and q [0, 1/2). (38)

Further, notice that Γ(β, q) [0, 1] for all β > 0 and q [0, 1/2), and Γ(β, 0) = 0 for all β > 0. Additionally, we deﬁne the functions

K(β, q) 10(1 tanh2(β)) 9 + (1 2q)2 tanh2(β)(1 2q)2(9(1 2q)2 + 1), (39)

B(β, q) max 1 K(β, q), 1 + 2eβp

2 (1 q) q tanh β 2 . (40)

The latter constitute additional components of the main rusult that follows.

Theorem 7 Fix δ (0, 1) and choose η > 0. If

n max 512 η2(1 2q)4 , 1152e2βB(β, q)

(1 2q)4 , 48e4β

η2 Γ(β, q) log 6p3

P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (42)

Predictive Learning on Hidden Tree-Structured Ising Models

(41) and the inequalities Γ(β, q) 1q =0, B(β, q) 1 + 3 q 2 e2β1q =0 give Theorem 3. We provide the proof of Theorem 3 and Theorem 7 in Section E (Appendix). As we mentioned in Section 1.5.2, the sample complexity for accurate predictive learning does not depend on α, that is, even in the high temperature regime α 0 (and in contrast with the structure learning), the number of required samples does not increase. Conversely, the following result provides the necessary number of samples for small L(2)

distance by a minimax bound, that characterizes any possible estimator ψ. In other words, it provides the necessary number of samples required for accurate distribution estimation, appropriate for predictive learning (small L(2)( )).

Theorem 8 (Necessary number of samples for inference) Fix a number δ (0, 1). Choose η > 0 such that tanh(α) + 2η < tanh(β). If the given number of samples satisﬁes the inequality

n < 1 [tanh(α) + 2η]2

16η2[1 (4q(1 q))p] log p, (43)

then for any algorithm ψ, it is true that

inf ψ sup T T p( ) PT(α,β)

P L2 (p( ), ψ (Y1:n)) > η > 1

Theorems 7 and 8 reduce to the noiseless setting for q = 0, that has been studied earlier by Bresler and Karzand (2020). Similarly to our structure learning results, presented previously (Theorem 5, Theorem 6), when q 1/2 we have n , the latter indicates that the learning task becomes impossible for q = 1/2.

Remark 9 Theorem 8 requires the assumption α < β. The special case α = β can be derived by applying the same proof technique of Theorem 8 combined with Theorem 3 by (Bresler and Karzand, 2020, supplementary material) and the SDPI by Polyanskiy and Wu (2017).

Further details and proof sketches of Theorems 7 and 8 are provided in Section 4.3.

3.3 Estimating Higher Order Moments of Signed-Valued Trees

A collection of moments is suﬃcient to represent completely any probability mass function. For many distributions, the ﬁrst and second order moments are suﬃcient statistics; this is true, for instance, for the Gaussian distribution or the Ising model with unitary and pairwise interactions. Even further, in the Gaussian case, the well-known Isserlis Theorem (Isserlis (1918)) gives a closed form expression for all moments of every order. As part of this work, we derive the corresponding moment expressions, for any tree-structured Ising model. To derive the expression of higher order moments, we ﬁrst prove a key property of tree structures: for any tree structure T = (V, E) and a even-sized set of nodes V V, we can partition V into |V |/2 pairs of nodes, such that the path along any pair is disjoint with the path of any other pair (see Appendix A, Lemma 15). We denote as CT(V ) the set of distinct |V |/2 pairs of nodes in V , such that path(u, u ) path(w, w ) = , for all

Nikolakakis, Kalogerias and Sarwate

Algorithm 2 Matching Pairs Require: Tree structure T = (V, E), any set V V : |V | 2N

2: for i V do

3: if i V then

7: for k [d] do d is the depth of the tree

8: Store all nodes at level k to L(k)

9: for k [d] do

10: for i L(d + 1 k) do Visit each of the nodes at level d + 1 k

11: if p(i) = 1 then

12: V V \ {i}

13: CPT CPT (i, ancestor(i))

14: if p(ancestor(i)) = 1 then

15: V V \ {ancestor(i)}

16: p(ancestor(i)) 0

18: p(ancestor(i)) 1

19: if V then

20: return CPT

{u, u }, {w, w } CT(V }. Let CPT(V ) be the set of all edges in all mutually edge-disjoint paths with endpoints the pairs of nodes in V , that is,

{w,w } CT(V ) path T(w, w ). (44)

For any tree T, the set CPT(V ) can be computed via the Matching Pairs algorithm, Algorithm 2. By using the notation above, we can now present the equivalent of Isserlis Theorem. The closed form expression of moments is given by the next theorem.

Theorem 10 For any distribution of the form of (11), which factorizes according to a tree T and has support { 1, +1}p, it is true that

E [Xi1Xi2 . . . Xik] =

( 0 k odd Q e CPT(i1,i2,...,ik) µe k even. (45)

Theorem 10 is an equivalent of Isserlis theorem for tree-structured sign-valued distributions. Equation (45) is used later to deﬁne an estimator of higher order moments that requires two suﬃcient statistics: the estimated structure TCL and the correlation estimates

ˆµ e, for any e TCL . Together with the parameter q, the higher order moments completely characterize the distribution of the noisy variables of the hidden model (20). We provide the proof of Theorem 10 in Appendix A.

Predictive Learning on Hidden Tree-Structured Ising Models

A similar expression to (45) has been introduced in prior work. Speciﬁcally, Algorithm 2 solves the problem of ﬁnding the optimal matching, see Deﬁnition 1 by (Bresler and Karzand, 2020, supplementary material). The evaluation of higher order moments requires an explicit expression or a way to compute the set CPT. For a given tree T = (E, V) and a set ((i1, i2, . . . , ik) V, there is a unique set CPT(i1, i2, . . . , ik) (see Appendix A, proof of Theorem 10). Given a set of edges E, we show that the set CPT can be evaluated by running a matching pair algorithm. For that purpose, we provide Algorithm 2 (with complexity (O(E))) and we prove its consistency (See Appendix, Lemma 15). The latter yields to an explicit expression of higher order moments; the Theorem 10. Furthermore, it provides a concrete higher order moments estimator, that is based on the estimated structure TCL (or TCL ) and the set of estimated correlations {ˆµe : e CPTCL}. High Order Moments Estimator: A higher order moment is the expected value of the product of the hidden tree-structured Ising model variables {Xi : i V } where V V. Theorem 10 gives the closed form solution for such moments. We have the following estimator for higher order moments using only noisy observations and known q. In particular, we have

ˆE [Xi1Xi2 . . . Xik] 0, k 2N + 1, (46)

ˆE [Xi1Xi2 . . . Xik] Y

e CPTCL (i1,i2,...,ik)

ˆµ e (1 2q)2 , k 2N. (47)

The estimated structure and pairwise correlations are suﬃcient statistics: given those, (47) suggests a computationally eﬃcient estimator for higher order moments. First we run the classical Chow-Liu algorithm to estimate the tree structure TCL , and then we run Algorithm 2 with input the estimate TCL to evaluate the set CPTCL . Thus, by estimating TCL , CPTCL and ˆµ e for any e TCL , we can in turn estimate any higher order moment through (47). Considering the absolute estimation error, we have ˆE

# 2|V |L(2) p( ), ΠTCL ˆp . (48)

Theorem 7 guarantees small ss TV and in combination with (48) gives an upper bound on the higher order moment estimate (47). In Section 4.4, we provide further details and discussion about Theorem 10, Algorithm 2, that computes the sets CPT(V ), CPTCL (V ), and the bound on the error of estimation (48). So far we have studied the consistency of the estimator with respect to the L(2) metric. We are also interested in sample complexity bounds for φ-divergences. While general divergences may be challenging, the most widely-used is the KL-divergence, particularly in testing Ising models (Daskalakis et al., 2018). The next result gives a bound for the suﬃcient number of samples to guarantee a small symmetric KL divergence SKL(P||Q) DKL(P||Q) + DKL(Q||P) with high probability. For any Ising model distributions P, Q of the form (10) with respective interaction parameters θ, θ , we have

SKL θ||θ SKL(P||Q) = X

θst θ st µst µ st . (49)

Nikolakakis, Kalogerias and Sarwate

Figure 3: Probability of incorrect structure recovery, The theoretical bound is given by Theorem 5. The top view of the ﬁgure is Figure 1 and provides a clear comparison between the experimental and theoretical results.

Theorem 11 (Upper Bounds for the Symmetric KL Divergence) If the number of samples n of Y satisﬁes

n 4 β2(p 1)2

(1 2q)4η2s log p2

then for p( ) PT(α, β) we have

P SKL p( )||ΠTCL ˆp ηs 1 δ, (51)

where TCL is the Chow-Liu tree deﬁned in (52) and the estimate ΠTCL ˆp is given by (36).

The asymptotic behavior of the bound in (50) was recently studied by Daskalakis et al. (2018). In that work, a set of testing algorithms are proposed and analyzed under the assumption of an Ising model with respect to trees and arbitrary graphs. Theorem 11 gives rise to possible extensions of testing algorithms to the hidden model setting. We consider the latter as an interesting subject for future work.

3.4 Simulations

We provide empirical results based on synthetic data to illustrate the probability of error δ as function of the cross-over probability q and the number of samples n. For the simulations

Predictive Learning on Hidden Tree-Structured Ising Models

Figure 4: Estimate of the probability of the ss TV to be greater than η = 0.03. The theoretical bound is given by Theorem 7. The top view of the ﬁgure is Figure 2 and provides a clear comparison between the experimental and theoretical results.

Figure 5: Estimate of the distribution error metric ss TV as a function of q and n.

Nikolakakis, Kalogerias and Sarwate

Figure 6: The probability of error in the predictive learning task for diﬀerent values of n and estimation error of the parameter q. For both ﬁgures we consider p = 31, α = 0.2, β = 1, q = 0.1 and averaging over 500 independent runs. Left: ˆq [0.05, 0.15], η = 0.1, Right: ˆq [0, 0.2], η = 0.12.

of this paper the original tree structure T is generated randomly where, starting from the root, we choose the parent of each new node uniformly at random among the nodes that are currently in the tree, in a sequential fashion. First, we estimate the probability of error P TCL = T (named as δ) of the structure learning problem, Figure 3. For the structure learning experiments, the number of nodes is 100, β = arctanh(0.8), and α = arctanh(0.2). Further, we considering 100 Monte Carlo runs for averaging, and we plot the estimated probability of incorrect structure recovery while q and n vary. As a next step, we would like to see how well the theoretical bound of Theorem 5 matches with the experimental results. To do this we plot the top view of Figure 3 to get Figure 1. Quite remarkably, the theoretical and experimental bounds exactly match. The latter suggests that our theoretical bound that we derive, sample complexity of the Chow-Liu algorithm (Theorem 5), is indeed accurate. Second, we plot the probability of error for the predictive learning task, that is the probability of the ss TV to be greater than a positive number η (Figure 4). For the simulation part, we restrict our attention to the case that the ﬁrst of three terms in the maximization of (41) is the dominant. In fact, η = 0.03, p = 31, while α and β are the same as the structure learning. Finally, Figure 5 presents the ss TV itself for diﬀerent values of q and n. Finally, the top view of Figure 4 is Figure 2, the latter suggest that the bound of our main result, Theorem 7 is accurate.

Finally, we provide experimental results for the case of unknown q. Speciﬁcally, Figure 6 illustrates the relationship between the average probability of error and the relative error |ˆq q|/q for the predictive learning task. We notice that the distribution can be approximated by using an estimate ˆq of q even for relative error 30% or 60%.

Predictive Learning on Hidden Tree-Structured Ising Models

4. Discussion

In this section, we present sketches of proofs, we compare our results with prior work, we further elaborate on Algorithm 1, Algorithm 2 and the error of higher order moment estimates. First, we discuss the convergence of the estimate TCL (Section 4.1). In section 4.2, we explain the connection between the hidden and noiseless settings on the tree structure learning problem. Later, in Section 4.3, we present the analysis and a sketch of proof for Theorem 7. Finally, in Section 4.4, we provide further details about Theorem 10, discussion about the Matching Pairs algorithm (Algorithm 2) and the accuracy of the proposed higher order moments estimator (47).

4.1 Estimating the Tree Structure T

In this work, the structure learning algorithm is based on the classical Chow-Liu algorithm, and is summarized in Algorithm 1. We can express its output as

TCL = argmax T T

ˆµ i,j , (52)

(see also Section 2.6.). The diﬀerence between Algorithm 1 and the Chow-Liu algorithm of the noiseless scheme is the use of noisy observations as input, since we consider a hidden model, whereas Bresler and Karzand (2020) assume that observations directly from the treestructured model are available. Further, (52) shows the consistency of the estimate TCL for suﬃciently large n. The tree structure estimator TCL converges to T when n , since

lim n ˆµ i,j a.s. = c2µi,j. (53)

From (52) and (53) we have (under an appropriate metric)

lim n TCL a.s. = T. (54)

Asymptotically, both TCL and TCL converge to T, where TCL denotes the structure estimate from noiseless data (q = 0). Most importantly, the Chow-Liu algorithm also returns the exact hidden structure with high probability given ﬁnite number of noisy samples. We provide the ﬁnite sample complexity bound in Theorem 5. For a ﬁxed probability of exact structure recovery 1 δ, more samples are required in the hidden model setting, compared to the noiseless one. Additionally, the diﬀerence of the sample complexity between the noisy and noiseless setting comes from Theorem 5 by comparing the bound for the values q = 0 and q = 0.

4.2 Hidden Structure Recovery and Comparison with Prior Results

Theorem 5 and Theorem 6 extend the noiseless setting (Bresler and Karzand, 2020, Theorem 3.2, Theorem 3.1) to our hidden model; the noiseless results correspond to q = 0. In particular, in the presence of noise, the dependence on p remains strictly logarithmic, that is, O(log(p/δ)). To make the connection between suﬃcient conditions more explicit, by setting q = 0 in (30) of Theorem 5, we retrieve the corresponding structure learning result

Nikolakakis, Kalogerias and Sarwate

by Bresler and Karzand (2020, Theorem 3.2) exactly: Fix a number δ (0, 1). If the number of samples of X satisfy the inequality

n 32 tanh2 α (1 tanh β) log 2p2

then the Chow-Liu algorithm returns TCL = T with probability at least 1 δ. An equivalent condition of (55) is

tanh α 4ϵ 1 tanh β τ(ϵ), and ϵ p

2/n log (2p2/δ), (56)

the latter shows that the weight of weakest edge should satisfy the following inequality α > arctanh 4ϵ/ 1 tanh β (Bresler and Karzand (2020)). For the hidden model, the equivalent extended condition for the weakest edge is

tanh α 4ϵ q

1 (1 2q)4 tanh β

(1 2q)2 (1 tanh β) τ (ϵ ), and ϵ =

2 log (2p2/δ)

(see Appendix C, Lemma 22) Condition (56) is retrieved through (57) for q = 0. Note that, for q = 1/2, the mutual information of the hidden and observable variables is zero, thus structure recovery is impossible. Theorem 6 provides the necessary number of samples bound for exact structure recovery given noisy observations. In fact, it generalizes Theorem 3.1 by Bresler and Karzand (2020) to the hidden setting. By ﬁxing q = 0, Theorem 6 recovers the noiseless case. Fix δ (0, 1). If the number of samples of X satisﬁes the inequality

16e2β [α tanh(α)] 1 log (p) , (58)

then for any algorithmic mapping (estimator) ψ, it is true that

inf ψ sup T T P PT(α,β)

P (ψ (X1:n) = T) > 1

When there is no noise, q = 0, we retrieve the noiseless result, while for any q (0, 1/2) the sample complexity increases since [1 (4q(1 q))p] 1 > 1 in (32) and for q 1/2 the required number of samples n , which makes structure learning impossible. The ratio between the noiseless and noisy necessary conditions indicates the gap between the hidden model and the original (noiseless) setting, which reads

n [1 (4q(1 q))p] 1 1 ηKL , (60)

(see Appendix E.). The right hand-side of (60) is the strong data processing inequality for the binary symmetric channel, which was recently developed by Polyanskiy and Wu (2017, Equation (39)). We continue by providing the main idea and the important steps of the proof of Theorem 7.

Predictive Learning on Hidden Tree-Structured Ising Models

4.3 Theorem 7: A Sketch of the Proof

Recall that the indices i, j V of the quantities µ i,j and ˆµ i,j are pair of nodes, and in fact

they can be considered as one (pair) index. For sake of space we introduce the notation µ e and ˆµ e for some e ET, that is consistent with our previous deﬁnition and e represents a pair of nodes. Theorem 7 guarantees that the estimated pairwise marginal distributions are close to the the original distributions by ensuring that the L(2) is small. In this section we provide a sketch of the proof of the Theorem and we mention the main diﬀerences between the hidden model and the noiseless case (Bresler and Karzand, 2020). The intersection of three events is suﬃcient to guarantee that L(2) is upper bound by η > 0:

µ i,j ˆµ i,j ϵ

Estrong (ϵ ) e ET : |tanh θe| τ (ϵ )

Ecascade (γ )

e path T(i,j)

ˆµ e (1 2q)2 Y

e path T(i,j)

µ e (1 2q)2

where (57) gives the deﬁnition of τ (ϵ ). The three events are equivalent events of the noiseless case, but they are modiﬁed accordingly to guarantee accurate estimation based on noisy data. The event Ecorr (ϵ ) guarantees that the error of the correlation estimates is not greater than ϵ . Under the event Estrong (ϵ ) all the strong edges are recovered by the Chow-Liu algorithm. Similarly to the noiseless setting, the event Estrong (ϵ ) requires the Chow-Liu algorithm to recover all the strong edges, while the weak edges (those that do not satisfy the inequality in (62)) do not aﬀect the accuracy of the predictive learning, even if the Chow-Liu algorithm fails to recover them. In contrast with structure learning, exact structure recovery is not necessary for the predicative learning task. In other words, even if α is extremely small, assume tanh(α) τ (ϵ )/(1 2q)2 the required number of samples for accurate predictive learning will remain unaﬀected. Under the event Ecascade (γ ) the end-to-end error along paths is no greater than γ . In fact, each path between two nodes of the tree can be considered a sequence of segments with strong and weak edges. The end-to-end path error is determined by the strong edge segments of the path through the parameter γ for the Ecascade (γ ) event, while the eﬀect of weak edges parameters is controlled by the quantity τ (ϵ ) (for the segmentation of the tree and the detailed proof see 26). Our goal is to ﬁnd suﬃcient conditions on the parameters ϵ and γ that guarantee that the events Ecorr (ϵ ) , Ecascade (γ ) and Estrong (ϵ) occur with high probability. Recall that our goal is to guarantee that the quantity L(2)(p( ), ΠTCL (ˆp )) is smaller than a ﬁxed number η > 0 with probability at least 1 δ. To do this, we follow the technique of prior work by Bresler and Karzand (2020), the triangle inequality gives

L(2) p( ), ΠTCL (ˆp ) L(2) p( ), ΠTCL (p( )) + L(2) ΠTCL (p( )) , ΠTCL (ˆp ) , (64)

Nikolakakis, Kalogerias and Sarwate

and we ﬁnd the required number of samples such that each of the terms L(2)(p( ), ΠTCL (p( )))

and L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) is no greater than η/2 with probability at least 1 δ. As we

show the probability of the event Ecascade (γ ) (Lemma 25, Appendix) and the L(2) (between the true and estimated distribution) can be bounded by a constant uniformly over the set of all trees and is not aﬀected by long paths. To prove these properties of the hidden model is non-trivial and ensures that the estimation error from noisy observations does not increase exponentially along paths as someone might expect. Speciﬁcally, the ﬁrst quantity at the right hand-side of inequality (64) represents the loss due to graph estimation error, while the second term represents the loss due to parameter estimation error. Lemma 26 (Appendix) shows that under the event

E (ϵ , γ ) Ecorr (ϵ ) Ecascade (γ ) Estrong (ϵ ) , (65)

3 and ϵ (1 2q)2e β h 20 1 + 2eβp

2 (1 q) q tanh β i 1 (66)

then L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) η/2. Further Lemma 27 (Appendix) shows that if

η 16(1 2q)2, (1 2q)2e β

24 1 + 2eβp

2 (1 q) q tanh β

then L(2) p( ), ΠTCL (p( )) η

2 under the event Ecorr (ϵ ) Estrong (ϵ ). Both conditions (66) and (67) should be satisﬁed, so it is necessary to have

3 and ϵ min

η 16(1 2q)2, (1 2q)2e β

24 1 + 2eβp

2 (1 q) q tanh β

To guarantee that the errors γ and ϵ are suﬃcient small such that (68) is satisﬁed, we need to make sure that the number of samples n is suﬃciently large. In fact, the upper bounds on the errors translate into lower bounds on the number of samples through the concentration bounds for the events. Speciﬁcally, Lemma 18 gives a suﬃcient sample size to ensure that the event Ecorr (ϵ ) occurs with probability at least 1 δ, Lemma 22 gives the concentration bound for the event Estrong (ϵ ) and Lemma 25 gives the concentration bound of the event Ecascade (γ ). Lemma 18, Lemma 22 and Lemma 25 together with (68) give the ﬁnal bound of the sample complexity (see the proof 28)

( 512 η2(1 2q)4 , 1152e2βB(β, q)

(1 2q)4 , 48e4β

and its simpliﬁed but looser bound

( 512 η2(1 2q)4 , 1152 1 + 3 q 2 e2β(1+1q =0)

(1 2q)4 , 48e4β

Predictive Learning on Hidden Tree-Structured Ising Models

that provides the condition of Theorem 3. Although the general structure of our argument follows that of the noiseless case, the presence of noise introduces several technical challenges whose solution may be of independent interest. In the sequel, we highlight the most important aspects of our approach that do not appear in the noiseless case. The proof of Theorem 3.3 is signiﬁcantly diﬀerent and includes additional steps and techniques compared with the approach by Bresler and Karzand (2020). Speciﬁcally, Lemma 23 is new and it is necessary for the hidden model and we use it later to prove (Lemma 25, Appendix). Lemma 24 is an non-trivial extension of the accurate estimation of edges correlation. Although the resulting expression seems complicated is important for the proof of Lemma 25. In fact Lemma 25, the proof of the concentration bound for the event Ecascade (γ ), is signiﬁcantly more complicated and longer than the noiseless model (see Appendix E by Bresler and Karzand (2020) for comparison). To show this result we have to consider a martingale diﬀerence sequence and evaluate upper bounds for the conditional variance and bias of that sequence. The bias is crucial for the ﬁnal result because it introduces an extra term in the ﬁnal bound that does not exist in the noiseless case. It is interesting that this term does not involve any parameter related to the noise and shows how the result is aﬀected by the structural inconsistency between the hidden and the observable layer. As a consequence, the expression of the bound (170) in Lemma 25 involves two inequalities to guarantee the high-probability bound. The ﬁrst inequality which introduces restrictions on the parameter (see inequality 170) is an attribute of the noisy case. We continue by brieﬂy explaining one of the main technical aspects of the proof. To begin with, consider a path of length d 2 in the original tree T, X1 X2 Xd+1 and we denote the edge (k, k + 1) as ek, for some k [d]. Recall that Y (i) k denotes the ith

sample of Yk and k [d + 1]. We would like derive a concentration bound of the probability of the event Ecascade (γ ) (Lemma 25, Appendix). To do this, ﬁrst we have to consider for all ℓ [n] and k {2, . . . , d} the random variables

(Xk Nk Xk+1Nk+1)(ℓ)

(1 2q)2 µ ek (1 2q)2

ˆµ ej (1 2q)2

µ ej (1 2q)2 . (71)

Deﬁne the martingale diﬀerence sequence (MDS) {ξ(i) k } by setting ξ(0) k 0, ξ(1) k Z(1) k

E h Z(1) k |ˆµ ek 1, . . . , ˆµ e1 i , ξ(i) k Z(i) k E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ ek 1, . . . , ˆµ e1 i . Let Fk i 1 be

the σ-algebra generated by Z(i 1) k , . . . , Z(1) k , ˆµ ek 1, . . . , ˆµ e1. Then the pair (ξ(i) k , Fk i )i=1,...,n is an MDS. In contrast with the noiseless case, the conditional means are not zero, which makes the problem signiﬁcantly harder. To proceed, we apply a concentration bound for supermartingales (generalized Bennett s inequality) by Fan et al. (2012). Secondly we have to evaluate the following expression

P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ ek 1, . . . , ˆµ e1

= 1 µ ek 2 1 µ ek 1 ˆµ ek 1 1 (µ ek 1)2 + µ ek 1 1 µek

2 ˆµ ek 1 µ ek 1 1 (µ ek 1)2 . (72)

In the noiseless case, the product variables X(ℓ) k X(ℓ) k+1 are independent, leading to a simple expression for this probability (see Lemma 13, Appendix). The closed form expression of

Nikolakakis, Kalogerias and Sarwate

(72) is given by Lemma 23. Finally, the expectations E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ ek 1, . . . , ˆµ e1 i

are not zero, however when n , they approach zero. As a consequence, a bias exists that aﬀects the sample complexity by introducing an additional term in the bound that that does not appear in the noiseless case, the quantity e4β/η2 (see Equation 69 and Equation 70). Finally, we continue by bounding the norm L(2) between the true and estimated distribution in Appendix E. The proof of Lemma 26 shows that in the noisy setting as well, the L(2) can be bounded by a constant uniformly over the set of all trees and it is not aﬀected by long paths. This property of the hidden model is highly non-trivial and ensures that the estimation error from noisy observations does not increases along paths as someone might expect. Lemma 27 follows the corresponding approach of Lemma 6.1 by Bresler and Karzand (2020) and we provide only the required for the noisy setting diﬀerences. In Theorem 28, we combine the Lemmata of Appendices D and E, we ﬁnd the appropriate choice of the parameter that satisﬁes the necessary conditions of Lemma 25 and we derive the ﬁnal sample complexity bound. For further details about the proof of the main result see Appendix, Section D and Section E.

4.4 Estimating Higher Order Moments

Our results also provide an analogue of Isserlis Theorem (Theorem 10) and the Matching Pairs algorithm, which returns the set CPT(V ) in (45). We provide a short proof sketch for the bound on the error of estimation (48). Proof sketch of Theorem 10: We prove that CT(V ) always exists (when k is even) by induction (see Appendix A, Lemma 15). We deﬁne the set of edges CPT(V ) as the union of the edge-disjoint paths4 CPT(V ) = w,w CT(V )path(w, w ). Combining the set CPT(V ) together with the independent products property (see Lemma 13), we derive the ﬁnal expression (see Appendix A, proof of Theorem 10). Given the tree structure T and the correlations µe for all e E, we can calculate the higher order expectations. Notice that the collection of edge-disjoint paths CPT depends on the tree structure and as a consequence an algorithm is required to discover those paths. Diﬀerent matching algorithms can be considered to ﬁnd the set CPT. We propose Algorithm 2 which is simple and has low complexity of O(|E|). Matching Pairs Algorithm: Algorithm 2 requires as input the tree and the set of nodes V {i1, . . . , ik} V, and returns the set of edges CPT(V ). For each node in the tree, a ﬂag variable is assigned to each node and indicates if the corresponding node is a candidate for the ﬁnal set CT(V ) at the current step of the algorithm. The candidate nodes have to be matched with other nodes of the tree, such that the pairs generate edge-disjoint paths. Initially, the candidate nodes are the nodes of the set V . Starting from the nodes which appear in the deepest level of the tree, we move them to their ancestor. At each step, if two candidate nodes appear at the same point, we match them as pair, we store the pair in the set CPT(V ) and we remove both of them from the set V . We continue until V . The complexity of Algorithm 2 is O(|E|). Finally, Theorem 10 can be extended to any forest F structure by considering the set CPF(V ) instead of CPT(V ), where CPF(V ) i CPTi(V ) and Ti is the ith connected tree of the forest.

4. By edge-disjoint paths we refer to paths with no common edges.

Predictive Learning on Hidden Tree-Structured Ising Models

Estimation error of higher order moments: Inequality (48) bounds the error of estimation by the small set Total Variation (ss TV), that is guaranteed to be less than η > 0 by Theorem 7. Additionally, the bound on the error of the estimation in (48) can be found as follows ˆE

e CPTCL (i1,i2,...,ik)

ˆµ e (1 2q)2 Y

e CPT(i1,i2,...,ik) µe

e CPTCL (i1,i2,...,ik)

ˆµ e (1 2q)2 Y

e CPT(i1,i2,...,ik)

µ e (1 2q)2

{w,w } CTCL (V ) path TCL (w,w )

ˆµ e (1 2q)2 Y

{w,w } CT(V ) path T(w,w )

µ e (1 2q)2

{w,w } CTCL (V )

e path TCL (w,w )

ˆµ e (1 2q)2 Y

{w,w } CT(V )

e path T(w,w )

µ e (1 2q)2

2|V |L(2) p( ), ΠTCL (ˆp ) , (75)

where (73) holds due to (45) and (47), (74) comes from (44) and the last inequality (75) is being proved by Bresler and Karzand (2020, Lemma 1, supplementary material). Thus, if we can accurately estimate the distribution under the sense L(2)(p( ), ΠTCL (ˆp )) η ,

for a suﬃciently small positive number η , then by using (47) and choosing η η/(2|V |), Theorem 7 guarantees accurate estimates for higher order moments with probability at least 1 δ.

5. Conclusion

We have considered and analyzed the problem of predictive learning on hidden tree-structures from noisy observations, using the well-known Chow-Liu algorithm. In particular, we derived sample complexity guarantees for exact structure learning and marginal distributions estimation. Our bounds extend prior work (see Bresler and Karzand (2020)) to the hidden model, by introducing the cross-over probability q of the BSC(q)p. Our results exactly reduce to the noiseless setting when q = 0, and the explicit expressions of the bounds are also continuous functions of q. Additionally, by applying a graph property for tree structures and a probabilistic property for Ising models, we derived an equivalent of the well-known Isserlis theorem for Gaussian distributions, which yields to a consistent high-order moments

Nikolakakis, Kalogerias and Sarwate

estimator for Ising models. Further, we considered simulations based on synthetic data to validate our theoretical results. Our theoretical bounds exactly match with the experiment. indicating that our results correctly characterize the dependence on the model parameters. Our results show that the estimated structure statistic TCL is essential for successful statistical inference on the hidden (or observable) layer, while the sample complexity with respect to number of nodes and probability of error remains strictly logarithmic, as in the noiseless case. Our hidden setting constitutes a ﬁrst step towards more technically challenging and potentially more realistic statistical models, such as, for instance, structure and distribution learning when the noise is generated by an erasure channel, or when the underlying hidden tree structured distribution has a larger, or even uncountable, support.

Acknowledgments

We would like to thank the action editor, Bert Huang, as well as the anonymous reviewers for providing valuable comments and suggestions, which have substantially improved the presentation of the results and the overall quality of our paper. This work was supported in part by DARPA and SSC Paciﬁc under contract N6600115-C-4070 and the United States National Science Foundation under award CCF-1453432, and the United States National Institutes of Health under award 1R01DA040487.

Appendix A. Preliminaries and Outline of Proof

The chart in Figure 7 shows the various dependencies of the Lemmata and intermediate results either considered or developed in this paper, and the resulting Theorems. The proofs can be found in the corresponding section of the Appendix. For completeness, we start with some properties that hold for any distribution with support { 1, +1}p and tree-structured graphical model (Lauritzen, 1996). Later we derive explicit formulas for the Ising model (10).

Lemma 12 Any distribution p(x) with respect to a forest F = (V, E), where x { 1, 1}p

and uniform marginals P (Xi = 1) = 1/2, for all i V can be expressed as

1 + xixj E [Xi Xj]

Proof We prove the result for an arbitrary tree T = (V, E) and then we extend it to any forest structure by applying cuts to E. The distribution factorizes according to the tree structure T and under the assumption of no external ﬁeld (uniform marginal distributions), we have

P(X = x) = Y

i V p (xi) Y

p(xi, xj) p(xi)p(xj) = 2p 2 Y

1 + xixj E [Xi Xj]

1 + xixj E [Xi Xj]

Predictive Learning on Hidden Tree-Structured Ising Models

Lemma B.1 Lemma B.4

Theorem 3.1

Lemma D.2 Lemma D.1

Theorem 3.3

Corollary G.1

SDPI of BSC by Polyanskiy

Theorem 3.2

Lemmas A.1, A.2,

Theorem 3.6

Theorem 3.5

Properties of the Ising Model

Pairs Algorithm

Order Moments Estimates

6.3 by Bresler and

SDPI of BSC by Polyanskiy

Theorem 3.4

Lemma B.2 Lemma B.3

Lemmas E.1 ,E.2

Corollary G.1

Figure 7: Stream mapping of the results

(77) holds since the joint distribution of any pair (Xi, Xj) of distinct nodes i, j V is

p(xi, xj) = E 1Xi=xi1Xj=xj = 1 + xixj E[Xi Xj]

By setting E[Xi Xj] = 0 for some (i, j) E we derive the distribution with respect to a forest generated by cutting the edge (i, j) of T.

In Lemma 13 we prove two fundamental properties of the model, the independence of the random variables {Xi Xj : (i, j) E} and the correlation decay property (CDP). To the best of our knowledge, these properties are known but there is no reference for the corresponding proofs in the literature.

Lemma 13 Let X be a random binary vector in { 1, +1}p drawn according to a foreststructured distribution p( ) with uniform marginal distributions on each entry Xi for i [p]. Then the elements of the collection of |E| random variables {Xi Xj : (i, j) E}, are independent. Furthermore, we have

E [Xi Xj] = Y

e path(i,j) µe, (80)

so the Correlation Decay Property (CDP) holds since |µe| 1 for all e E.

Nikolakakis, Kalogerias and Sarwate

Proof Let (ir)p r=1 be an arbitrary permutation of ℓ= {1, 2, . . . , p}. Notice that the singletons {ir}, r = 1, . . . , p form a partition of ℓ. Then, the set of edges E is deﬁned as

E = (ir, jr)p r=2 , and j1 = , jr {i1, . . . , ir 1} ℓ. (81)

(81) deﬁnes a tree T = (V, E) with root the node i1 (since j1 = ). For the ﬁrst part, it is suﬃcient to show that for any {cr : r = 2, 3, . . . , p} { 1, +1}p 1, the following holds

r=2 {xirxjr = cr}

r=2 P (xirxjr = cr) . (82)

r=2 {Xir Xjr = cr}

x:xirxjr=cr|p r=2

x:xirxjr=cr|p r=2

1 + xirxjr E [Xir Xjr]

x:xir=crxjr|p r=2

1 + xirxjr E [Xir Xjr]

xi1 { 1,+1}

1 + cr E [Xir Xjr]

r=2 P (Xir Xjr = cr) , (84)

(83) comes from (81) and Lemma 12 and the last from (79). For the second part of the statement note that for all i, j V there exists a unique path {i, k1, k2, . . . , kℓ, j} from i to j. Deﬁne the variable 1(i,j) (Xk1Xk1)(Xk2Xk2) . . . (XkℓXkℓ), which is equal to 1 almost surely, since X { 1, +1}p.5 Then, we have

E[Xi Xj] = E[Xi1(i,j)Xj]

= E[Xi(Xk1Xk1)(Xk2Xk2) . . . (XkℓXkℓ)Xj]

= E[Xi Xk1]

m=1 E[Xkm Xkm+1]

E[XℓXj] = Y

e path(i,j) µe, (85)

and (85) comes from (82) and completes the proof.

The next lemma relates the pairwise correlations to the parameters of the Ising model.

Lemma 14 An equivalent expression of (10) is the following

Q (i,j) E [1 + xixj tanh (θij)] P x Q (i,j) E [1 + xixj tanh (θij)] x { 1, 1}p. (86)

Further, for a tree-structure Ising model E [Xi, Xj] = tanh (θij) , for all (i, j) E.

5. 1( ) should not be confused with 1A, where the last denotes the indicator function of a set A.

Predictive Learning on Hidden Tree-Structured Ising Models

Proof We can write exp (θijxixj) as

exp (θijxixj) =exp (θijxixj) + exp ( θijxixj)

2 + exp (θijxixj) exp ( θijxixj)

=exp (θij) + exp ( θij)

2 + xixj exp (θij) exp ( θij)

= cosh (θij) [1 + xixj tanh (θij)] , (88)

(87) holds because xixj { 1, +1}. The partition function can be written as

(i,j) E exp (θijxixj)

(i,j) E cosh (θij) [1 + xixj tanh (θij)]

(i,j) E cosh (θij) X

(i,j) E [1 + xixj tanh (θij)] = 2p Y

(i,j) E cosh (θij) . (89)

Notice that P x Q (i,j) E [1 + xixj tanh (θij)] = 2p under the tree-structure assumption. Then

Q (i,j) E exp (θijxixj)

Q (i,j) E cosh (θij) [1 + xixj tanh (θij)]

2p Q (i,j) E cosh (θij) (90)

1 + xixj tanh (θij)

(88) and (89) give (90) and |E| = p 1 gives (91). Finally

E [Xi Xj] (10) = ln Z (θ)

(89) = ln h 2p Q (i,j) E cosh (θij) i

θij = tanh(θij), (i, j) E, (92)

and the latter gives the second part of the Lemma.

Lemma 15 Let V be a set of nodes such that V V and |V | 2N. Then it exists a set CT(V ) of |V |/2 pairs of nodes of V , such that any two distinct pairs (w, w ), (v, v ) in CT(V ) are pairwise disjoint (their paths have no commons edge), that is,

path T(w, w ) path T(v, v ) = , (w, w ), (v, v ) CT(V ) : (w, w ) (v, v ). (93)

Proof We prove the existence of CT(V ) by contradiction. Assume that the two distinct paths path T(w, u ), path T(u, w ) share at least one edge. Let their common sub-path be path T(z, z ), Figure 8 and note that z and z do not necessarily diﬀer from w, w , u, u . Notice that the common sub-path is unique (acyclic graph). Then we can always consider the permutation of the endpoints which gives the edge-disjoint paths path T(w, u) and path T(w , u ). Now the paths path T(w, u) and path T(w , u ) are disjoint, however it is possible that one of them or both, contain sub-paths with common edges. Then, we similarly

Nikolakakis, Kalogerias and Sarwate

proceed by removing the common sub-paths as previously. The set of common edges strictly decreases through the process, which terminates when there are only paths with no common edge.

Figure 8: Proof of the existence of CT(V ), Lemma 15

Theorem 16 (Theorem 10) Assume X p(x) PT(α, β), {i1, i2, . . . , ik} V, then

E [Xi1Xi2 . . . Xik] =

(Q e CPT(i1,i2,...,ik) µe, k 2N

0, k 2N + 1. (94)

Recall that the set of edges CPT(i1, . . . , ik) is a collection of k/2 edge-disjoint paths with endpoints pairs of the nodes i1, . . . , ik for each path. Given a tree structure T, CPT(i1, . . . , ik) is found by running Algorithm 2 on T. Proof Even k. We proceed by showing that the Algorithm 2 returns the unique set CPT. When k = 2 the expression is proved in Lemma 13. For k > 2 we proceed by using Lemmas 13 and 15. For all i, j V there exists a unique path {i, k1, k2, . . . , kℓ, j} from i to j. Deﬁne as previously the variable 1(i,j) (Xk1Xk1)(Xk2Xk2) . . . (XkℓXkℓ), which is equal to 1 almost surely, and deﬁne the set of nodes V {i1, i2, . . . , ik}. Without loss of generality we assume that the variables in the product Xi1Xi2 . . . Xik are ordered such such that the pairs Xij, Xij+1 for all j {1, 3, 5, . . . , k 1} [k 1]odd form edge-disjoint paths (Lemma 15), in other words

path(ij, ij+1) path(ij , ij +1) = , j = j [k 1]odd. (95)

Then, we have

E [Xi1Xi2 . . . Xik] = E Xi11(i1,i2)Xi2Xi31(i3,i4)Xi4 Xik 11(ik 1,ik)Xik (96)

j [k 1]odd E[Xij1(ij,ij+1)Xij+1] (97)

e path(ij,ij+1) µe (98)

e CPT(i1,i2,...,ik) µe, (99)

Predictive Learning on Hidden Tree-Structured Ising Models

where (96) and (97) come from (80), and (98) holds because of (95). Odd k. Lemma 12 gives p(x) = 2 p Q (i,j) E(1 + xixj E [Xi Xj]). Then

E [Xi1Xi2 . . . Xik] = 1

x { 1,+1}ik xi1xi2 . . . xik Y

1 + xixj E [Xi Xj]

2 = 0, (100)

gives the second part of (94).

Lemma 17 The mutual information of Xi, Xj { 1, +1} is symmetric function of the correlation E [Xi Xj] and increasing with respect to |E [Xi Xj]|,

I (Xi, Xj) = 1

2 log2 (1 E [Xi Xj])1 E[Xi Xj] (1 + E [Xi Xj])1+E[Xi Xj] . (101)

The proof can be derived through the deﬁnition of I (Xi, Xj) and the expression (79), under the assumption of uniform marginal distributions.

Appendix B. Bounding the Probability of Mis-Estimating Correlations

The following lemma bounds the probability that the estimated pairwise correlations in the graph deviate from their true values. This follows from standard concentration of measure arguments.

Lemma 18 Fix δ > 0. Then for any ϵ > 0, if

n 2 log p2/δ /ϵ2 , (102)

then the event Ecorr (ϵ ) deﬁned in (61) holds with high probability:

P Ecorr (ϵ ) 1 δ = 1 p2 exp

Proof Let Z(i) be the ith sample of Z = Yw Y w = Nw Xw N w X w. Then ˆµ i,j = 1 n Pn i=1 Z(i) =

1 n Pn i=1 N(i) w X(i) w N(i) w X(i) w for all i = j V. Then Hoeﬀding s inequality and union bound

over all pairs of nodes p 2 < p2/2 give (103).

For the rest of the paper we consider ϵ = p

2 log (2p2/δ) /n , which satisﬁes Lemma 18. We apply Lemmata 19, 20, 21 to Lemma 22 to bound the required number of samples for exact structure recovery using noisy observations of the hidden model. To analyze the error event we use the Two trees lemma of Bresler and Karzand (2020, Appendix F, supplementary material). Informally, if two maximum spanning trees T, T diﬀer in how a pair of nodes are connected then there exists at least one edge in ET which does not exist in ET and vice versa. Lemma 19 characterizes errors in the Chow-Liu in terms of correlations.

Nikolakakis, Kalogerias and Sarwate

Lemma 19 Suppose the error event {T = TCL } holds and let f (w, w) be an edge such that f T and f / TCL . Then there exists an edge g (u, u) TCL and g / T such that f path T (u, u) and g path TCL (w, w) and n X

i=1 Z(i) f,u, u

i=1 M(i) f,u, u

and Zf,u, u Yw Y w Yu Y u and Mf,u, u Yw Y w + Yu Y u.

Proof Using similar approaches to the procedures as in (Bresler and Karzand, 2020, Lemmata 8.2, 8.3) we have that the condition ˆµ f ˆµ g implies

0 ˆµ f 2 ˆµ g 2

= ˆµ f ˆµ g ˆµ f + ˆµ g

i=1 N(i) w X(i) w N(i) w X(i) w N(i) u X(i) u N(i) u X(i) u

i=1 N(i) w X(i) w N(i) w X(i) w + N(i) w X(i) u N(i) u X(i) u

i=1 Z(i) f,u, u

i=1 M(i) f,u, u

that gives (104).

2 log (2p2/δ)

1 (1 2q)4 tanh β

1 tanh β (107)

µe E [Xw X w] , (108)

then µA is given through the following relationship

E [Xw X w Xu X u] = µe(1 µA), (109)

µ A (1 2q)4µA. (110)

In Lemmata 20, 21 we derive two concentration of measure inequalities for the variables Z(i) f,u, u, M(i) f,u, u. In fact, we have that the event

i=1 Z(i) e,u, u n E [Ze,u, u]

n max 8ϵ2 , 4ϵ q

: e E and u, u V

Predictive Learning on Hidden Tree-Structured Ising Models

happens with probability at least 1 δ

2 and the event

i=1 M(i) e,u, u n E [Me,u, u]

n max 8ϵ2 , 4ϵ q

: e E and u, u V

happens with probability at least 1 δ

2 . The parameters ϵ and µA, deﬁned below, are decreasing functions of n . Finally, we apply the union bound to guarantee that the event EZ EM happens with probability at least 1 δ, where δ

2 } δ. The union bound is ﬁrst applied over all tuples (w, w, u, u) in Lemmata 20 and 21 and then for the events EZ and EM.

Lemma 20 Fix δ > 0 and let ϵ be given by (106). For all pairs of vertices u, u V and edges e = (w, w) in the path path T (u, u) from u to u, given n samples Z(1) e,u, u, Z(2) e,u, u, ..., Z(n) e,u, u of Ze,u, u = Yw Y w Yu Y u, it is true that

i=1 Z(i) e,u, u n E [Ze,u, u]

n max 8ϵ2 , 4ϵ q

where A = path T (u, u) \ {e}.

Proof The proof is an application of Bernstein s inequality. First, it is true that

Ze,u, u = Xw Nw X w N w Nu Xu N u X u = Nw Xw N w X w (1 Nw Xw N w X w Nu Xu N u X u) . (114)

E [Ze,u, u] = (1 2q)2 E [Xw X w Xu X u]

= (1 2q)2 µe (1 µA) (115)

Var (Ze,u, u) = E h (Ze,u, u)2i E [(Ze,u, u)]2

= E h (Xw Nw X w N w N u Xu N u X u)2i h (1 2q)2 E [Xw X w Xu X u] i2

= E [1 + 1 2Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w Xu X u]2

= 2 2E [Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w Xu X u]2

= 2 2 (1 2q)4 E [Xw X w Xu X u] (1 2q)4 E [Xw X w Xu X u]2

= 2 2 (1 2q)4 µA (1 2q)4 (µe (1 µA))2

= 2 (1 2q)4 h 2µA + µ2 e (1 µA)2i . (116)

Using the expressions for the mean and the variance, we apply Bernstein s inequality (Bennett, 1962) for the noisy setting: for all i [n ] we have Z(i) e,u, u E [Ze,u, u] M almost surely. Then, Bernstein s inequality gives, for all t > 0

i=1 Z(i) e,u, u n E [Ze,u, u]

2n Var (Ze,u, u) + 2

Nikolakakis, Kalogerias and Sarwate

Choose a δ > 0 and ﬁnd t such that

δ/2 = 2 exp

2n Var (Ze,u, u) + 2

After some algebra, we have

2n Var (Ze,u, u) + 2

From this we can solve for t:

δ 2n Var (Ze,u, u) log 4

δ 2 + 8n Var (Ze,u, u) log 4

2 + 2n Var (Ze,u, u) log 4

Since t > 0, we have, setting M = 4:

2 + 2n Var (Ze,u, u) log 4

If the probability of the union

u, u,w, w:(w, w) path T(u, u)

i=1 Z(i) e,u, u n E [Ze,u, u]

is at most δ 2p3 , then the union bound gives probability at most δ

Var (Ze,u, u) = 2 (1 2q)4 h 2µA + µ2 e (1 µA)2i

= 2 (1 2q)4 2µA (1 2q)4 µ2 e (1 µA)2

2 (1 2q)4 2µA + 0

= 2 1 (1 2q)4 µA

= 2 1 µ A . (120)

From (119) and (120), we have

2 + 4n 1 µ A log 4p3

4n 1 µ A log 4p3

Predictive Learning on Hidden Tree-Structured Ising Models

the latter implies that

1 µ A log 4p3

8 3n log 4p3

1 µ A log 4p3

Deﬁne ϵ = p

log (2p2/δ) 2/n (as it is deﬁned in Bresler and Karzand (2020)), then we get

n max 8ϵ2 , 4ϵ q

By combining the above we derive the concentration bound for the event EZ in (111).

The next Lemma gives the concentration of measure bound for the event EM in (112).

Lemma 21 Fix δ > 0 and let ϵ be given by (106). For all pairs of vertices u, u V and edges e = (w, w) in the path path T (u, u) from u to u, given n samples M(1) e,u, u, M(2) e,u, u, ..., M(n) e,u, u of Me,u, u = Yw Y w + Yu Y u, it is true that

i=1 M(i) e,u, u n E [Me,u, u]

n max 8ϵ2 , 4ϵ q

A = path T (u, u) \ {e}.

Proof Similarly to the prior Lemma, we calculate the mean and the variance as

E [Me,u, u] = (1 2q)2 E [Xw X w + Xu X u] = (1 2q)2 µe (1 + µA) (125)

Var (Me,u, u) = E h (Me,u, u)2i E [(Me,u, u)]2

= E h (Xw Nw X w N w + Nu Xu N u X u)2i h (1 2q)2 E [Xw X w + Xu X u] i2

= E [1 + 1 + 2Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w + Xu X u]2

= 2 + 2E [Xw Nw X w N w Nu Xu N u X u] (1 2q)4 E [Xw X w + Xu X u]2

= 2 + 2 (1 2q)4 E [Xw X w Xu X u] (1 2q)4 E [Xw X w + Xu X u]2

= 2 + 2 (1 2q)4 µA (1 2q)4 (µe (1 + µA))2

= 2 + (1 2q)4 h 2µA µ2 e (1 + µA)2i . (126)

By applying Bernstein s inequality and we get that for any t > 0

i=1 M(i) e,u, u n E [Me,u, u]

2n Var (Me,u, u) + 2

Nikolakakis, Kalogerias and Sarwate

Similarly, we ﬁnd

8 3n log 4p3

2 n Var (Me,u, u) log 4p3

Var (Me,u, u) =2 + (1 2q)4 h 2µA µ2 e (1 + µA)2i

2 + (1 2q)4 2µA

=2 1 + µ A . (128)

We deﬁne ϵ p

log (2p2/δ) 2/n , then

n max 8ϵ2 , 4ϵ q

which completes the proof.

Appendix C. Recovering Strong Edges

In Lemma 22, we deﬁne the set of strong edges for the hidden model and show that the event Estrong (ϵ ) deﬁned in (62) occurs with high probability. That is, only the strong edges are guaranteed to exist in the estimated structure TCL We also ﬁnd a lower bound for the necessary number of samples for exact structure recovery. In fact we have n n, as expected. Our bounds coincide with the noiseless case (Bresler and Karzand, 2020) by setting the noise level q = 0.

Lemma 22 Fix δ (0, 1), and let ϵ = p

2 log (2p2/δ) /n , for any n > 0. Consider the set of strong edges

1 (1 2q)4 tanh β

(1 tanh β) and (i, j) ET : |tanh θij| τ

Then, the Chow-Liu algorithm recovers the strong edges with probability at least 1 δ. In other words, it is true that

P h Estrong (ϵ ) i 1 2p2 exp

Proof From Lemma 19, if there is an error then for an edge f not recovered in the tree TCL , we have

i=1 Z(i) f,u, u

i=1 M(i) f,u, u

Predictive Learning on Hidden Tree-Structured Ising Models

Therefore one of the sums must be negative. Expanding, one of the two following inequalities must hold:

i=1 Z(i) f,u, u n E h Z(i) f,u, u i n E h Z(i) f,u, u i

i=1 Y (i) f,u, u n E h Y (i) f,u, u i n E h M(i) f,u, u i .

In addition, (115), (125), Lemma 20 and Lemma 21 give the following pairs of inequalities:

(1 2q)2 µf (1 µA) max 8ϵ2 , 4ϵ q

(1 2q)2 µf (1 + µA) max 8ϵ2 , 4ϵ q

µ f (1 µA) 1 max 8ϵ2 , 4ϵ q

µ f (1 + µA) 1 max 8ϵ2 , 4ϵ q

Putting these together:

8ϵ2 (1 µA), 8ϵ2 (1 + µA), 4ϵ q

1 µ A (1 µA) , 4ϵ q

1 + µ A (1 + µA)

8ϵ2 (1 µA), 4ϵ q

1 µ A (1 µA)

1 µ A (1 µA) . (132)

We get the last inequality for non trivial values of the bound 8ϵ2 1 µ A 1 and by using the

following bound

8ϵ2 (1 µA) 16ϵ2 (1 µA) 4ϵ 1 µA = 4ϵ 1 µA (1 µA) 4ϵ q

1 µ A (1 µA) . (133)

Finally the function f(µA) = 4ϵ q

1 µ A (1 µA) = 4ϵ q

1 (1 2q)4µA

(1 µA) is increasing with respect to µA (for all µA 1) and µA tanh β < 1, so we have

1 µ A (1 µA) 4ϵ q

1 (1 2q)4 tanh β

(1 tanh β) τ . (134)

Nikolakakis, Kalogerias and Sarwate

The weakest edge should satisfy µ f τ to guarantee the correct recovery of the tree

under the event Estrong (ϵ ). This yields a condition on the edge strengths: µ f τ =

(1 2q)2 tanh α 4ϵ q

1 (1 2q)4 tanh β

tanh α 4ϵ q

1 (1 2q)4 tanh β

(1 2q)2 (1 tanh β) , q [0, 1

The last inequality gives the deﬁnition of the strong edges in the noisy scheme.

Based on the deﬁnition (134) we derive the following bound on τ

1 (1 2q)4 tanh β

1 (1 8q + 24q2 32q3 + 16q4) tanh β

1 tanh β + p

(1 3q + 4q2 2q3) 8q tanh β (1 tanh β)

4ϵ eβ 1 + eβp

(1 q) (2q2 2q + 1) 8q tanh β (136)

< 4ϵ eβ 1 + 2eβp

2 (1 q) q tanh β . (137)

(136) holds because 1 tanh(β) e 2β. We will later use (137) in Lemma 25. In comparison to the noiseless setting (see Bresler and Karzand (2020)), we can guarantee exact recover with high probability under the event Estrong (ϵ) when the weakest edge satisﬁes the inequality

tanh α 4ϵ 1 tanh β . (138)

Notice that (138) can be obtained by (135) when q = 0 and n = n . When q > 0 and n = n it is clear that the set of trees that can be recovered from noisy observations is a subset of the set of trees that can be recovered from the original observations. Also, we have

2 log (2p2/δ) /n = n = 2

ϵ2 log 2p2/δ and

2 log (2p2/δ) /n = n = 2

ϵ2 log 2p2/δ . (139)

By combining (135) with (139) we found the number of samples that we need to recover the tree with probality at 1 δ (Theorem 5):

n > 32 h 1 (1 2q)4 tanh β i

(1 tanh β)2 (1 2q)4 tanh2 α log 2p2

Predictive Learning on Hidden Tree-Structured Ising Models

On the other hand when there is no noise (Bresler and Karzand, 2020) we need

n > 32 tanh2 α (1 tanh β) log 2p2

Appendix D. Analysis of the Event Ecascade (γ )

Lemma 23 Consider a path of length d 2 in the original tree T, and without loss of generality assume that path is X1 X2 Xd+1. Recall that Y (i) m is the ith sample of Ym and m [d + 1] and ˆµ k 1

n Pn i=1 Y (i) k Y (i) k+1, k [d]. Then

P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1, . . . , ˆµ 1

= 1 (1 2q)2µk

2 1 µ k 1ˆµ k 1 1 (µ k 1)2 + µ k 1 1 µk

2 ˆµ k 1 µ k 1 1 (µ k 1)2 . (142)

Proof Note that

i=1 Y (i) k Y (i) k+1 = 1

i=1 (Xk Nk Xk+1Nk+1)(i) , (143)

where eacch term

(Xk Nk Xk+1Nk+1)(ℓ) ˆµ r r [1, 2, . . . k 2], ℓ [1, 2, . . . , n] (144)

P (Xk Nk Xk+1Nk+1)(ℓ) = 1 ˆµ k 1, . . . , ˆµ 1

= P (Xk Nk Xk+1Nk+1)(ℓ) = 1 ˆµ k 1

(Xk Nk Xk+1Nk+1)(ℓ) = 1 ˆµ k 1 = 1

i=1 (Xk 1Nk 1Xk Nk)(i) !

= P ˆµ k 1 = 1

n Pn i=1 (Xk 1Nk 1Xk Nk)(i) (Xk Nk Xk+1Nk+1)(ℓ) = 1

P ˆµ k 1 = 1

n Pn i=1 (Xk 1Nk 1Xk Nk)(i)

P (Xk Nk Xk+1Nk+1)(ℓ) = 1 . (145)

First we compute the probability P ˆµ k 1 = 1

n Pn i=1 (Xk 1Nk 1Xk Nk)(i) . Deﬁne the Bernoulli random variable Zk 1 as

Zk 1 Xk 1Nk 1Xk Nk + 1

( 0, w.p. 1 (1 2q)2µk 1

2 1, w.p. 1+(1 2q)2µk 1

i=1 (Xk 1Nk 1Xk Nk)(i) !

Nikolakakis, Kalogerias and Sarwate

i=1 (2Zk 1 1)(i) !

i=1 Z(i) k 1 = n ˆµ k 1 + 1

1 (1 2q)2µk 1

n n ˆµ k 1+1

2 1 + (1 2q)2µk 1

As a second step we compute the probability

i=1 (Xk 1Nk 1Xk Nk)(i) (Xk Nk Xk+1Nk+1)(ℓ) = 1

i=1,i =ℓ (Xk 1Nk 1Xk Nk)(i)

+ (Xk 1Nk 1Xk Nk)(ℓ)

(Xk Nk Xk+1Nk+1)(ℓ) = 1

(Xk 1Nk 1Xk Nk)(i) (Xk Nk Xk+1Nk+1)(ℓ) , i = ℓ, (149)

and we would like to ﬁnd the conditional distribution of (Xk 1Nk 1Xk Nk)(ℓ) under the event {(Xk Nk Xk+1Nk+1)(ℓ) = 1}. We have

P (Xk 1Nk 1Xk Nk)(ℓ) = c (Xk Nk Xk+1Nk+1)(ℓ) = 1

= P (Xk 1Nk 1Xk Nk)(ℓ) = c, (Xk Nk Xk+1Nk+1)(ℓ) = 1

P (Xk Nk Xk+1Nk+1)(ℓ) = 1

1 c E[Xk 1Nk 1Xk+1Nk+1]+c(1 2q)2µk 1 (1 2q)2µk

4 1 (1 2q)2µk

= 1 c(1 2q)2µk 1µk + c(1 2q)2µk 1 (1 2q)2µk

2 (1 (1 2q)2µk) , c { 1, +1}. (150)

P1 P (Xk 1Nk 1Xk Nk)(ℓ) = +1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 (151)

P2 P (Xk 1Nk 1Xk Nk)(ℓ) = 1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 , (152)

i=1,i =ℓ (Xk 1Nk 1Xk Nk)(i)

Predictive Learning on Hidden Tree-Structured Ising Models

+ (Xk 1Nk 1Xk Nk)(ℓ)

(Xk Nk Xk+1Nk+1)(ℓ) = 1

i=1,i =ℓ (2Zk 1 1)(i) + (2Zk 1 1)(ℓ)

(Xk Nk Xk+1Nk+1)(ℓ) = 1

i=1,i =ℓ (Zk 1)(i) + (Zk 1)(ℓ) = n ˆµ k 1 + 1

(Xk Nk Xk+1Nk+1)(ℓ) = 1

= P (Zk 1)(ℓ) = 0 (Xk Nk Xk+1Nk+1)(ℓ) = 1 P

i=1,i =ℓ (Zk 1)(i) = n ˆµ k 1 + 1

+ P (Zk 1)(ℓ) = 1 (Xk Nk Xk+1Nk+1)(ℓ) = 1 P

i=1,i =ℓ (Zk 1)(i) = n ˆµ k 1 + 1

1 (1 2q)2µk 1

n 1 n ˆµ k 1+1

2 1 + (1 2q)2µk 1

1 (1 2q)2µk 1

n 1 n ˆµ k 1+1

1 + (1 2q)2µk 1

2 1 . (153)

P (Xk Nk Xk+1Nk+1)(ℓ) = 1 = 1 (1 2q)2µk

and (145), (147), (148), (153) give

P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1, . . . , ˆµ 1

P (Xk Nk Xk+1Nk+1)(ℓ) = 1

n 1 n ˆµ k 1+1

2 1+µ k 1 2

n n ˆµ k 1+1

2 1+µ k 1 2

n 1 n ˆµ k 1+1

2 +1 1+µ k 1 2

n n ˆµ k 1+1

2 1+µ k 1 2

Nikolakakis, Kalogerias and Sarwate

1 1 µ k 1 2

1 1 + µ k 1 2

= P2 n n ˆµ k 1+1

! 1 + P1 n ˆµ k 1+1

1 + µ k 1 2

= P2 1 ˆµ k 1 1 µ k 1 + P1 1 + ˆµ k 1 1 + µ k 1 . (155)

The latter and the deﬁnition of P1, P2 (see Equations 151 and 152) give

P Y (ℓ) k Y (ℓ) k+1 1 ˆµ k 1, . . . , ˆµ 1

P2 1 ˆµ k 1 1 µ k 1 + P1 1 + ˆµ k 1 1 + µ k 1

P (Xk Nk Xk+1Nk+1)(ℓ) = 1

P2 1 ˆµ k 1 1 µ k 1 + P1 1 + ˆµ k 1 1 + µ k 1

# 1 (1 2q)2µk

" 1 (1 2q)2µk 1µk (1 2q)2µk 1 (1 2q)2µk

2 (1 (1 2q)2µk) 1 ˆµ k 1 1 µ k 1

+ 1 (1 2q)2µk 1µk + (1 2q)2µk 1 (1 2q)2µk

2 (1 (1 2q)2µk) 1 + ˆµ k 1 1 + µ k 1

# 1 (1 2q)2µk

= 1 (1 2q)2µk 1µk (1 2q)2µk 1 (1 2q)2µk

4 1 ˆµ k 1 1 µ k 1

+ 1 (1 2q)2µk 1µk + (1 2q)2µk 1 (1 2q)2µk

4 1 + ˆµ k 1 1 + µ k 1

= 1 (1 2q)2µk

1 ˆµ k 1 1 µ k 1 + 1 + ˆµ k 1 1 + µ k 1

+ (1 2q)2µk 1µk + (1 2q)2µk 1

1 + ˆµ k 1 1 + µ k 1 1 ˆµ k 1 1 µ k 1

= 1 (1 2q)2µk

2 1 µ k 1ˆµ k 1 1 (µ k 1)2 + (1 2q)2µk 1 1 µk

2 ˆµ k 1 µ k 1 1 (µ k 1)2 . (156)

Note that (1 2q)2µk 1 = µ k 1, and the proof is completed.

Predictive Learning on Hidden Tree-Structured Ising Models

Lemma 24 Deﬁne the function K(β, q) as

K(β, q) 10(1 tanh2(β)) 9 + (1 2q)2 tanh2(β)(1 2q)2(9(1 2q)2 + 1) (157)

and the event Eedge e, as

Eedge e, n ˆµ e µ e γe o , e ET, γe > 0, (158)

and Eedge (ET) e ET Eedge e, . If

n 108e2β log(2p/δ)

(1 2q)4K(β, q) and γe =

n K(β, q) log(2p/δ) (159)

then P h Eedge (ET) ci δ.

Proof The variance of ˆµ e is (1 (µ e)2)/n and by applying Bernstein s inequality

P h Eedge e, ci 2 exp

nγ2 e 2 1 (µ e)2 + 4

, γe > 0. (160)

We choose γe = q

3 1 µ2e n K(β,q) log(2p/δ) (because the parameter γe is free, that is, Bernstein s inequality holds for all γe > 0). If n satisﬁes (159) then

108e2β (1 2q)4 (1 2q)2

6 (1 µ2 e), (161)

and the last is true because e 2β 1 tanh(β) 1 |µe| 1 µ2 e. By applying (159) and (161) on (160) we get

P h Eedge e, ci

nγ2 e 2 1 (µ e)2 + 4

3 1 µ2 e K(β,q) log(2p/δ)

2 (1 (1 2q)4µ2e) + 4

3 K(β, q) 1 µ2 e 2 + 2

9(1 2q)2 µ2e(1 2q)2(2(1 2q)2 + 2

9) log(2p/δ)

3 2K(β, q) 1 µ2 e 1 + 1

9(1 2q)2 µ2e(1 2q)2((1 2q)2 + 1

9) log(2p/δ)

Nikolakakis, Kalogerias and Sarwate

10 9K(β, q) 1 µ2 e 1 + 1

9(1 2q)2 µ2e(1 2q)2((1 2q)2 + 1

9) log(2p/δ)

The following function

9 1 x 1 + 1

9(1 2q)2 x(1 2q)2((1 2q)2 + 1

9), x [tanh2(α), tanh2(β)] (163)

is strictly decreasing, thus we have f(tanh2(β)) f(x) for all x [tanh2(α), tanh2(β)]. Also K(β, q) f(tanh2(β)), the latter together with (162) give

P h Eedge e, ci

2 exp 1 K(β, q)f(µ2 e) log(2p/δ)

2 exp 1 K(β, q)f(tanh2(β)) log(2p/δ)

= 2 exp 1 K(β, q)K(β, q) log(2p/δ)

Finally, by applying union over the p 1 edges of the tree we get P h Eedge (ET) ci δ.

The next Lemma is the extension of Lemma 8.7 by Bresler and Karzand (2020). The sample complexity bound exactly recovers the noiseless case and its expression is continuous at q = 0. Further, the bound is independent of the length of the longest path d, similarly to the noiseless setting. Finally, we provide upper bounds on the functions that appear in the bound. The latter give a more tractable version of the result and a clear representation of the required number of samples as a function of the parameters.

Lemma 25 (Concentration bound for the event Ecascade (γ )) For β > 0 and q [0, 1/2) we deﬁne the functions S( ), G( ), K( ), A( ), ( )

S(β, q) 2 + (1 2q)2

6 (1 (1 2q)2) tanh2(β) 3 (1 2q)2 S (165)

A(β, q) A (1 2q)2[1 tanh(β)(1 (1 2q)2)] (166)

G(β, q) 3 4(1 2q)2

d(1 A) A + 2

3 3e 11q =0 + 1

4(1 2q)2 G, (167)

and the inequality in (167) holds because the function G(β, q) is bounded for all d N \{1}.

K(β, q) 10(1 tanh2(β)) 9 + (1 2q)2 tanh2(β)(1 2q)2(9(1 2q)2 + 1) e 2β1q=0 K (168)

1 (1 2q)4 tanh2(β)

3 log(2p3/δ)

n tanh2(β)e2β. (169)

Predictive Learning on Hidden Tree-Structured Ising Models

If < γ S(β, q)G(β, q)/3 + and

( S2(β, q)G2(β, q)

0.32 (γ )2 log(4p2/δ), 108e2β

(1 2q)4K(β, q) log(2p3/δ)

then for any path Ad = {e1, e2, . . . , ed} of T with d edges, it is true that

ˆµ e (1 2q)2 Y

µ e (1 2q)2

p2 , d > 2. (171)

Proof For sake of space we proceed by using the notation µ k and ˆµ k instead of µ ek and ˆµ ek for k [d]. Deﬁne the random variable

ˆµ i (1 2q)2 µ i (1 2q)2

ˆµ j (1 2q)2

µ j (1 2q)2 . (172)

Then Pd i=1 M i = Qd i=1 ˆµ i (1 2q)2 Qd i=1 µ i (1 2q)2 , and deﬁne the sequence of paths with length

k as Ak {e1, e2, . . . , ek} Ad, for 2 k d. Although we provided the deﬁnition of the event Eedge ( ) in Lemma 24, we restate it below for completeness. For some γe > 0 the deﬁnition follows

Eedge (Ak) \

n ˆµ e µ e γe o . (173)

The law of total probability gives

Eedge (Ad 1)

+ P h Eedge (Ad 1) ci . (174)

For second term, Lemma 24 gives that P h Eedge (Ad 1) ci δ/p2 if

n 108e2β log(2p3/δ)

(1 2q)4K(β, q) , (175)

we deﬁne the function K(β, q) in Lemma 24 (157). Here we will ﬁnd an upper bound for the ﬁrst term of the right hand-side of (174). Note that M k is written as

n Pn ℓ=1 (Yk Yk+1)(ℓ)

(1 2q)2 µ k (1 2q)2

ˆµ j (1 2q)2

µ j (1 2q)2

(Xk Nk Xk+1Nk+1)(ℓ)

(1 2q)2 µ k (1 2q)2

ˆµ j (1 2q)2

µ j (1 2q)2 . (176)

Nikolakakis, Kalogerias and Sarwate

and we deﬁne

(Xk Nk Xk+1Nk+1)(ℓ)

(1 2q)2 µ k (1 2q)2

ˆµ j (1 2q)2

µ j (1 2q)2 . (177)

The random variables Z(ℓ) k for ℓ [n] and ﬁxed k [d] are independent conditioned on the event Eedge (Ak 1). However the conditional expectation E[Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1]

is not zero. To apply a concentration of measure result on Z(ℓ) k we use the extended Bennet s inequality for supermartingales (Fan et al., 2012). Martingale Diﬀerences: Deﬁne ξ(0) k 0, ξ(1) k Z(1) k E h Z(1) k |ˆµ k 1, . . . , ˆµ 1 i , ξ(i) k

Z(i) k E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 i . Also, deﬁne as Fk i 1 the σ-algebra generated

by Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1, then (ξ(i) k , Fk i )i=1....,n is a Martingale Diﬀerence Sequence (MDS). Additionally, conditioned on Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 we have

1 (1 2q)2d 1 µ k Qk 1 j=1 ˆµ j Qd j=k+1 µ j, w.p. P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1

1 (1 2q)2d 1 + µ k Qk 1 j=1 ˆµ j Qd j=k+1 µ j, w.p. P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 , (178)

and we have proved (Lemma 23) that

P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1, . . . , ˆµ 1

= P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1

= 1 µ k 2 1 µ k 1ˆµ k 1 1 (µ k 1)2 + µ k 1 1 µk

2 ˆµ k 1 µ k 1 1 (µ k 1)2 . (179)

Thus we have

E h Z(i) k |Fk i 1 i = E h Z(i) k |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1 i

= h 1 µ k P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 (1 + µ k)P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 i

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d . (180)

Note that (179) gives h 1 µ k P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 (1 + µ k)P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1 i

(1 µ k)1 + µ k 2 (1 + µ k)1 µ k 2

µ k 1 1 µ k 1ˆµ k 1 1 (µ k 1)2

+ (1 µ k)1 + µk

2 (1 + µ k)1 µk

µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

Predictive Learning on Hidden Tree-Structured Ising Models

= (1 µ k)1 + µk

2 (1 + µ k)1 µk

µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

1 µ k + µk µ kµk 1 + µk µ k + µ kµk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

= (µk µ k)µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 . (181)

Combine the latter with (180) to get

E h Z(i) k |Fk i 1 i = µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d , i [n]. (182)

If q = 0 then E h Z(i) k |Fk i 1 i = 0. Also limn E h Z(i) k |Fk i 1 i 0 for all q [0, 1/2) because

limn ˆµ k 1 µ k 1. Note that

" ξ(i) k 2 Fk i 1

" Z(i) k E h Z(i) k |Fk i 1 i 2 Fk i 1

" Z(i) k 2 Fk i 1

E2 h Z(i) k |Fk i 1 i . (183)

We compute E h Z(i) k 2 Fk i 1 i :

E Z(i) k 2 |Fk i 1

= E Z(i) k 2 |Z(i 1) k , . . . , Z(1) k , ˆµ k 1, . . . , ˆµ 1

= 1 µ k 2 P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 + (1 + µ k)2P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d

We use (179) to ﬁnd 1 µ k 2 P Y (ℓ) k Y (ℓ) k+1 = +1 ˆµ k 1 + (1 + µ k)2P Y (ℓ) k Y (ℓ) k+1 = 1 ˆµ k 1

1 µ k 2 1 + µ k 2 + 1 + µ k 2 1 µ k 2

! 1 µ k 1ˆµ k 1 1 (µ k 1)2

+ 1 µ k 2 1 + µk

2 + 1 + µ k 2 1 µk

µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

= (1 (µ k)2)1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 + (µ k)2 2µkµ k µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

Nikolakakis, Kalogerias and Sarwate

= (1 (µ k)2)1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 + (µ k)2 2(µ k)2 + 2(µ k)2 2µkµ k µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

= (1 (µ k)2)1 µ k 1ˆµ k 1 1 (µ k 1)2 + 1 (µ k)2 µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

+ 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

= 1 (µ k)2 " 1 µ k 1ˆµ k 1 1 (µ k 1)2 + µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

+ 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

= 1 (µ k)2 + 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2 . (185)

Now we combine (182), (183), (184) and (185) to get

" ξ(i) k 2 Fk i 1

" Z(i) k 2 Fk i 1

E2 h Z(i) k |Fk i 1 i

" 1 (µ k)2 + 2µ k µ k µk µ k 1 ˆµ k 1 µ k 1 1 (µ k 1)2

µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d

µ k + µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d

For sake of space we deﬁne the function

ˆfµk µk 1(q) f(µ k, µ k 1, ˆµ k 1, q) µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2 , (187)

and then (182) and (186) can be written as

E h Z(i) k |Fk i 1 i = ˆfµk µk 1(q)

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d , i [n], (188)

" ξ(i) k 2 Fk i 1

= 1 µ k + ˆfµk µk 1(q) 2 Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d

, i [n]. (189)

Predictive Learning on Hidden Tree-Structured Ising Models

We would like to ﬁnd an upper bound on the summation Pd k=1 E h ξ(i) k 2 Fk i 1 i . Deﬁne

A A(β, q) (1 2q)2[1 tanh(β)(1 (1 2q)2)], for all β > 0 and q [0, 1). Then 1 µ k + ˆfµk µk 1(q) 2

µ k + µ k 1(µk µ k) ˆµ k 1 µ k 1 1 (µ k 1)2

(1 2q)2 + µ k 1(1 (1 2q)2) ˆµ k 1 µ k 1 1 (µ k 1)2

(1 2q)2 |µ k 1|(1 (1 2q)2)|ˆµ k 1 µ k 1|

h 1 µ2 k (1 2q)2 (1 2q)2 tanh(β)(1 (1 2q)2) 2i (190)

[1 µ2 k A(β, q)], (191)

and (190) holds because |ˆµ k 1 µ k 1| γ j (1 2q)2(1 µ2 j)/6 (1 2q)2(1 (µ j)2)/6. Then (189) and (191) give

" ξ(i) k 2 Fk i 1

k=1 [1 µ2 k A(β, q)]

Qk 1 j=1 ˆµ j Qd j=k+1 µ j (1 2q)2d

k=1 [1 µ2 k A(β, q)]

µ2 j + 2 γ j (1 2q)2

The inequality (192) holds under the event Eedge (ET) deﬁned in (158) (see Lemma 24) because ˆµ j (1 2q)2

µ j (1 2q)2

+ 2 γ j (1 2q)2 , (193)

since |µ j| (1 2q)2, |ˆµ j| (1 2q)2 under the assumption of known q. Next we deﬁne

xj µ2 j + 2 γ j (1 2q)2 , then 3(1 A(β, q)xj)/2 + (A(β, q) 1)/2) 1 µ2 k A(β, q) and (192) gives

" ξ(i) k 2 Fk i 1

2(1 Axj) + A 1

j=1,j =k xj

Nikolakakis, Kalogerias and Sarwate

2(1 Ax) + A 1

xd 1. (194)

The latter is maximized at x = (A + 2) 1 1

d /3 and A (0, 1], thus we have

2(1 Ax) + A 1

= d (1 2q)2

(x )d 1 + A(A + 2)

2(1 2q)2 (x )d 1

d 1 + A(A + 2)

= d(A + 2)(1 A)

d 1 + A(A + 2)2

d + 3 4(1 2q)2 G(β, q) (195)

(1 A) 4(1 2q)2 3 e log 3 A+2 + 3 4(1 2q)2

= 3(1 A) e4(1 2q)2 log 3 A+2 + 3 4(1 2q)2

In (195) we deﬁne the function G(β, q) and we proved that it has an upper bound independent of d [p 1],

G(β, q) d3(1 A(β, q))

A(β, q) + 2

d + 3 4(1 2q)2 3 3e 11q =0 + 1

4(1 2q)2 G. (196)

For the rest of the proof and the ﬁnal result G(β, q) can be replaced by its upper bound in (196), however the deﬁnition of G(β, q) shows the continuity of the result for q 0. The following inequality holds with probability 1 for all i [n] and k [d] under the event Eedge (ET) (Lemma 24),

|ξ(i) k | 2 + E h Z(i) k |Fk i 1 i

= 2 + | ˆfµk µk 1(q)| (1 2q)2

Qk 1 j=1 |ˆµ j| Qd j=k+1 |µ j|

2 + | ˆfµk µk 1(q)| (1 2q)2

Predictive Learning on Hidden Tree-Structured Ising Models

= 2 + 1 (1 2q)2 |µ k 1||(µk µ k)||ˆµ k 1 µ k 1|

= 2 + tanh2(β)(1 (1 2q)2) γ k 1 1 (µ k 1)2

2 + (1 2q)2

6 (1 (1 2q)2) tanh2(β) S(β, q). (197)

the last step comes form the inequality γ e (1 2q)2

6 (1 (µe)2) (161), which holds if the inequality n > 108e2β K(β,q)(1 2q)4 log(4p) holds (see 159). Recall that Lemma 24 gives

n K(β, q) log(2p3/δ)

31 tanh2(β)

n K(β, q) log(2p3/δ)

3 log(2p3/δ)

Also, for all i [n] we have

k=1 E h Z(i) k |Fk i 1 i

k=2 E h Z(i) k |Fk i 1 i

E h Z(i) k |Fk i 1 i

| ˆfµk µk 1(q)| (1 2q)2

Qk 1 j=1 |ˆµ j| Qd j=k+1 |µ j|

k=2 |µ k 1|(µk µ k)|ˆµ k 1 µ k 1|

tanh2(β)(1 (1 2q)2) γe 1 (1 2q)4 tanh2(β)

(1 2q)2 (199)

tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)

(1 2q)2 (200)

tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)

|µ j| + γj (1 2q)2

(1 2q)2 (201)

= tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)

j=1 tanh(β) + γj (1 2q)2

j=k+1 tanh(β)

tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)

j=1,j =k tanh(β) + γj (1 2q)2

Nikolakakis, Kalogerias and Sarwate

tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)

tanh(β) + 1

6(1 tanh2(β))

= tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β)(d 1) 5

6(tanh(β) 3)2 d 1

tanh2(β) (1 (1 2q)2) γ 1 (1 2q)4 tanh2(β) 1 e log 5

6(tanh(β) 3)2

(1 (1 2q)2) tanh2(β) γ

1 (1 2q)4 tanh2(β) e2β 1.

(1 (1 2q)2) tanh2(β)e2β

1 (1 2q)4 tanh2(β)

3 log(2p3/δ)

where (199), (200) , (201) come from Lemma 24 and (198). Finally, 0 < 5

6(tanh(β) 3)2 < 1 for all β > 0 and 1/ log 5

6(tanh(β) 3)2 e2β. We use the symbol EAk 1[ ] to denote the conditional expectation given the the event Eedge (Ak 1), for instance

! Eedge (Ak 1)

Further we deﬁne the function F( , ) as

F(t, λ) = log 1 1 + te λt + t 1 + teλ . (204)

For any k d we have

! Eedge (Ak 1)

exp λM k ˆµ 1, . . . , ˆµ k 1

! ˆµ 1, . . . , ˆµ k 1

i=1 E h Z(i) k |Fk i 1 i! ˆµ 1, . . . , ˆµ k 1

exp λE h Z(i) k |Fk i 1 i

! ˆµ 1, . . . , ˆµ k 1

Predictive Learning on Hidden Tree-Structured Ising Models

exp λE h Z(i) k |Fk i 1 i

" ξ(i) k 2 Fk i 1

S(β, q)2 , |λ|S(β, q)

exp λE h Z(i) k |Fk i 1 i

" ξ(i) k 2 Fk i 1

S(β, q)2 , |λ|S(β, q)

The equation (205) comes from change of measure and tower property, the deﬁnitions (176) and (177) of M k and ξ(i) k respectively give (206) and (207). The (208) is derived by upper

bounding the quantity E h Z(i) k |Fk i 1 i similarly to (197), (209) is the upper bound on the moment generating function of the supermartingale Fan et al. (2012). To get a recurrence we proceed as follows:

! Eedge (Ak 1)

P Eedge (Ak 1)

! Eedge (Ak 2) Eedge ek 1,

P Eedge (Ak 1)

! Eedge (Ak 2) Eedge ek 1,

Eedge (Ak 1)

Eedge (Ak 2)

P Eedge (Ak 2)

! Eedge (Ak 2) Eedge ek 1,

Eedge ek 1,

Eedge (Ak 2)

P Eedge (Ak 2)

1Eedge ek 1,

Eedge (Ak 2)

P Eedge (Ak 2)

! Eedge (Ak 2)

P Eedge (Ak 2) . (211)

Nikolakakis, Kalogerias and Sarwate

By applying the recurrence d times, we derive the following bound

! Eedge (Ak 1)

k=1 E h Z(i) k |Fk i 1 i!

" ξ(i) k 2 Fk i 1

S(β, q)2 , |λ|S(β, q)

k=1 E h Z(i) k |Fk i 1 i!

" ξ(i) k 2 Fk i 1

S(β, q)2 , |λ|S(β, q)

Further (196), (197), (202) and (212) give

! Eedge (Ak 1)

exp (λ (β, q)) exp

1 d G(β, q) S2 ξ (β, q), |λ|S(β, q)

exp nd F G(β, q)

d , |λ|S(β, q)

+ λ (β, q) . (213)

For sake of space, we denote the functions G(β, q), S(β, q), and (β, q) as G, S, and respectively. It is true that

! Eedge (Ad 1)

P h Eedge (Ad 1) Eq i

+ λ . (214)

Under the assumption n > 108e2β (1 2q)4K(β,q) log(4p), we have

P h Eedge (Ad 1) ci 1

The latter gives

! Eedge (Ad 1)

2 exp nd F G

+ λ , (216)

Predictive Learning on Hidden Tree-Structured Ising Models

which implies that

Eedge (Ad 1)

2 min λ>0 exp nd F G

= 2 min λ>0 exp nd F G

+ λ ( γ) , (217)

and we deﬁne γ γ . The minimum value is attained at

λ = n/S 1 + G/d log 1 + γ

and by substituting the optimal value we get

! G/d 1+G/d γ /(d S)

! 1 1+G/d 1 γ

G/d 1+G/d γ /(d S)

(1+G/d) 1 γ

G/d 1+G/d+ γ /(d S)

1 1+G/d 1 γ

1 1+G/d 1 γ

G/d 1+G/d γ /(d S)

(1+G/d) 1 γ

G/d 1+G/d+ γ /(d S)

(1+G/d) 1 1 γ

1 1+G/d 1 γ

1 1+G/d 1 γ

G/d 1+G/d γ /(d S)

(1+G/d) 1 γ

G/d 1+G/d+ γ /(d S)

G/d 1+G/d γ /(d S)

(1+G/d) 1 γ

G/d 1+G/d+ γ /(d S)

G/d 1+G/d γ /(d S)

(1+G/d) 1 γ

1 1+G/d+ γ /(d S)

Nikolakakis, Kalogerias and Sarwate

Then (217) and (219) give

Eedge (Ad 1)

G/d 1+G/d γ /(d S)

(1+G/d) 1 γ

1 1+G/d+ γ /(d S)

As a ﬁnal step we want to express the upper bound as an exponential function of γ, we deﬁne ζ γ /(SG) and we proceed as follows:

d G/d 1 + G/d + γ /(d S) (1 + G/d)

+ 1 1 + G/4d γ /(dy) (1 + G/4d)

" G/d 1 + G/d + γ /(d S) (1 + G/d)

1 1 + G/d γ /(d S) (1 + G/d)

log 1 + γ /d S 1 γ /d S

d + G + dγ /S

d + G dγ /S

γ /d S 1 γ /d S 1

γ /d S 1 γ /d S

" G + γ /S γ

d γ /S γ /d S 1 γ /d S 1

γ /d S 1 γ /d S

" G + γ /S γ

= d G d + G

" 1 + γ /(SG) γ

= d G d + G

(1 + ζ) ζ 1

(1 + ζ) ζ 1

2 , γ (0, SG

Predictive Learning on Hidden Tree-Structured Ising Models

Recall that ζ γ /(SG), γ = γ . If < γ S(β, q)G(β, q)/3+ then (220) and (221) give

Eedge (Ad 1)

0.32n (γ )2

S2(β, q)G2(β, q)

In a similar way we derive the bound

k=1 M k γ Eedge (Ad 1)

0.32n (γ )2

S2(β, q)G2(β, q)

Finally, we combine (174), Lemma 24, (222) and (223) to derive the bound (170) which guarantees that

ˆµ i (1 2q)2

µ i (1 2q)2

2δ/p2, d 2. (224)

To summarize we proved that the event Ecascade (γ ) happens with probability at least 1 2δ/p2 by combining Bresler s and Karzand s technique, the Corollary 2.3 by Fan et al. (2012) and Lemma 23.

Appendix E. Predictive Learning, Proof of Theorem 3 and Theorem 7

Recall that our goal is to guarantee that the quantity L(2)(p( ), ΠTCL (ˆp )) is smaller than a number η > 0 with probability at least 1 δ. To do this, we use the triangle inequality as

L(2) p( ), ΠTCL (ˆp ) L(2) p( ), ΠTCL (p( )) + L(2) ΠTCL (p( )) , ΠTCL (ˆp ) (225)

and we ﬁnd the required number of samples such that each of the terms L(2)(p( ), ΠTCL (p( )))

and L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) in (225) is less than η/2 with probability at least 1 δ. The next Lemma provides the necessary bounds on γ and ϵ that guarantee

L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) η/2.

Lemma 26 If γ η

ϵ (1 2q)2e β h 20 1 + 2eβp

2 (1 q) q tanh β i 1 ,

then L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) η/2 under the event Ecorr (ϵ ) Ecascade (γ ) Estrong (ϵ ).

Proof The derivation of the bound is similar to the approach by Bresler and Karzand (2020, Section 6.1) but with diﬀerent calculations. In the hidden model, we consider the path between two nodes i, j in the estimated structure TCL , namely path TCL (i, j), to be

Nikolakakis, Kalogerias and Sarwate

(F0, e1, F1, e1..., Ft 1, et, Ft), and Fi are segments with all strong edges and ei are all weak edges. We consider the case of at least one weak edge to exist in the path. If there is no weak edge the bound reduces to the case of Lemma 25. The length of each sub-path Fi is denoted as di, for all i {0, 1, . . . , t}. Each segment (sub-path) Fi has exactly di edges, and the total number of edges in the path are d; thus d = Pt i=0 di + t. Note that t 1 and di 0 for all i {0, 1, . . . , t}. Recall that

ΠTCL (p( )) = 1

1 + xixj E [Xi Xj]

1 + xixj E[Yi Yj] (1 2q)2

(the latter comes from (19)), and

ΠTCL (ˆp ) 1

1 + xixj ˆE[Yi Yj] (1 2q)2

Further, for any tree-structured Ising model distributions P, P with structures T = (V, E) and T = (V, E) respectively, we have

L(2) P, P sup i,j V

xi,xj { 1,+1}2

P(xi, xj) P(xi, xj) (228)

= sup i,j V

e path T(i,j) µe Y

e path T(i,j) µe

To upper bound the quantity L(2)(ΠTCL (p( )) , ΠTCL (ˆp )) we have

2L(2) ΠTCL (p( )) , ΠTCL (ˆp )

e path TCL (i,j)

µ e (1 2q)2 Y

e path TCL (i,j)

ˆµ e (1 2q)2

= 1 (1 2q)2d

i=1 ˆµ Fi ˆµ ei µ F0

i=1 µ Fiµ ei

" ˆµ F0 µ F0

ˆµ Fi ˆµ ei µ Fiµ ei

ˆµ Fj ˆµ ej

µ Fi (1 2q)2di µ ei (1 2q)2t

Predictive Learning on Hidden Tree-Structured Ising Models

ˆµ Fi ˆµ ei µ Fiµ ei

(1 2q)2(di+1) |ˆµ F0| (1 2q)2d0

Qi 1 j=1 ˆµ Fj ˆµ ej Qt k=i+1 µ Fkµ ek

(1 2q)2(d di d0 1) (232)

t 1 + τ + ϵ

ˆµ Fi ˆµ ei µ Fiµ ei

(1 2q)2(di+1) (233)

t 1 + τ + ϵ

(ˆµ Fi µ Fi)ˆµ ei

(1 2q)2(di+1) +

µ Fi(ˆµ ei µ ei)

(1 2q)2(di+1)

t 1 + τ + ϵ

γ + ϵ (1 2q)2

t 1 (2t + 1) max γ , ϵ (1 2q)2

4ϵ eβ 1 + 2eβp

2 (1 q) q tanh β + ϵ

(2t + 1) max γ , ϵ (1 2q)2

t 1 1 + 2eβp

2 (1 q) q tanh β t 1 (2t + 1) max γ , ϵ (1 2q)2

4t 1 η 3 (236)

Telescoping summation and triangle inequality give (230) and (231). We use the deﬁnition of d, d = Pt i=0 di + t to get (232). The inequalities µ Fi

(1 2q)2di, ˆµ Fi

(1 2q)2di, µ ei τ , ˆµ ei τ +ϵ hold under Ecorr (ϵ ), Estrong (ϵ ). Further, under event Ecascade (γ )

(Lemma 25) it is true that ˆµ Fi µ Fi

γ , the latter give (233) and (234). The bound

τ 4ϵ eβ(1 + 2eβp

2 (1 q) q tanh β) gives (235) (see inequality 137). Inequality (236) requires

max ϵ (1 2q)2 , γ

ϵ (1 2q)2e β h 20 1 + 2eβp

2 (1 q) q tanh β i 1 . (239)

Finally (237) holds for all t N. The latter completes the proof.

The next Lemma provides the set of values of ϵ that guarantee L(2)(p( ), ΠTCL (p( ))) η

2 with high probability.

Nikolakakis, Kalogerias and Sarwate

Lemma 27 If

η 16(1 2q)2, (1 2q)2e β

24 1 + 2eβp

2 (1 q) q tanh β

then L(2) p( ), ΠTCL (p( )) η

2 under the event Ecorr (ϵ ) Estrong (ϵ ).

Proof Recall that

L(2) p( ), ΠTCL (p( )) = 1

e path T(w, w)

µ e (1 2q)2 Y

e path TCL (w, w)

µ e (1 2q)2

e path T(w, w) µe Y

e path TCL (w, w) µe

We follow Bresler s and Karzand s technique Loss due to graph estimation (Bresler and Karzand, 2020, Section 6.2) and highlight the diﬀerence that appears in our setting. For the noisy case/hidden model, the argument changes slightly in the following manner:

2L(2) p( ), ΠTCL (p( )) |µfµAµ AµBµ B||µ2 Cµ2 C 1| + |µf| (k) + ( k) + ( k) (k)

µ f (1 2q)2 µAµ AµBµ B

µ f (1 2q)2

(k) + ( k) + ( k) (k)

8 ϵ (1 2q)2 + τ

(1 2q)2 (2η + η2) (242)

(242) holds since |µ f| |µ g| 4ϵ , |µf| 1 µ2 Cµ2 C

2|µf| 2|µg|, |µ f| τ , and (243) holds for all the values of ϵ that satisfy

η 16(1 2q)2, (1 2q)2e β

24 1 + 2eβp

2 (1 q) q tanh β

The latter provides the statement of the Lemma.

The next Theorem provides the suﬃcient number of samples for predictive learning that recovers exactly the noiseless setting for q = 0. Note that the dependence on β changes from e2β to e4β when the data are noisy. A key component of the bound is the following function

Γ(β, q) 1 (1 2q)2

1 (1 2q)4 tanh2(β)

2 , β > 0 and q [0, 1/2). (245)

Predictive Learning on Hidden Tree-Structured Ising Models

Note that Γ(β, q) [0, 1] for all β > 0 and q [0, 1/2), and Γ(β, 0) = 0 for all β > 0. Further we deﬁne

B(β, q) max 1 K(β, q), 1 + 2eβp

2 (1 q) q tanh β 2 , (246)

and the expression of K(β, q) is given by (168).

Theorem 28 Fix δ (0, 1). Choose η > 0 (independent of δ). If

( 512 η2(1 2q)4 , 1152e2βB(β, q)

(1 2q)4 , 48e4β

P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (248)

Additionally, as a consequence of (247) , if

( 512 η2(1 2q)4 , 1152 1 + 3 q 2 e2β(1+1q =0)

(1 2q)4 , 48e4β

P L(2) p( ), ΠTCL (ˆp ) η 1 δ. (250)

Proof Recall that

L(2) p( ), ΠTCL (p( )) = 1

e path T(w, w) µe Y

e path TCL (w, w) µe

We combine the triangle inequality

L(2) p( ), ΠTCL (ˆp ) L(2) p( ), ΠTCL (p( )) + L(2) ΠTCL (p( )) , ΠTCL (ˆp ) , (252)

Lemma 26, and Lemma 27 to get that L(2)(p( ), ΠTCL (ˆp )) η with probability at least 1 δ if

3 and ϵ min

η 16(1 2q)2, (1 2q)2e β

24 1 + 2eβp

2 (1 q) q tanh β

First, we ﬁnd the necessary number of samples such that for γ η/3 the probability of the complement of Ecascade (γ ) is not greater than δ/3. Recall that

G 3 3e 11q =0 + 1

4(1 2q)2 , (254)

Nikolakakis, Kalogerias and Sarwate

S 3 (1 2q)2. (255)

Recall that

Γ(β, q) 1 (1 2q)2

1 (1 2q)4 tanh2(β)

2 , β > 0 and q [0, 1/2). (256)

Lemma 25 gives that for any > 0 and η > if

n max 0.3 2S2G2

(η )2 , 108e2β

(1 2q)4K(β, q), 3e4β

2 Γ(β, q) log 6p3

then the probability of the complement of Ecascade (γ ) is upper bounded by δ/3 and we write

P Ecascade (γ ) c δ

Second, we ﬁnd the necessary number of samples such that the complements of the events Estrong (ϵ ) and Ecorr (ϵ ) occur with probability not greater than δ/3 each. In fact the upper bound on ϵ (253) and Lemma 18 gives that if

n max 512 η2(1 2q)4 , 1152e2β

2 (1 q) q tanh β 2 log 6p3

then ϵ satisﬁes the inequality in (253) with probability at least 1 δ/3. Note that (257) holds for any (0, η) and we will choose = η/4. Under the choice = η/4

(η )2 = 0.3 2S2G2

(η η/4)2 < 512

η2 , η > 0, q [0, 1/2). (260)

Recall that

B(β, q) max 1 K(β, q), 1 + 2eβp

2 (1 q) q tanh β 2 . (261)

Combining (257), (259), (260) and (261) yields

( 512 η2(1 2q)4 , 1152e2βB(β, q)

(1 2q)4 , 48e4β

The latter gives the sample complexity for accurate predictive learning, it reduces exactly to the noiseless setting of prior work by Bresler and Karzand (2020) and it is continuous because

lim q 0+ Γ(β, q) = Γ(β, 0) = 0, β > 0 (263)

lim q 0+ K(β, q) = K(β, 0) = 1, β > 0 (264)

Predictive Learning on Hidden Tree-Structured Ising Models

2 (1 q) q tanh β 2 = 1 (265)

lim q 0+ B(β, q) = Γ(β, 0) = 1, β > 0. (266)

To derive a simpliﬁed version of (262) note that

1 K(β, q) e2β1q =0 (267)

by the deﬁnition (168) of K(β, q) and 1 + 2eβp

2 (1 q) q tanh β 2 eβ1q =0 + 2eβ1q =0p

2 (1 q) q tanh β 2

(1 + 3 q)2 e2β1q =0. (268)

Then (261), (267) and (268) give

B(β, q) (1 + 3 q)2 e2β1q =0 (269)

and by the deﬁnition (256) Γ(β, q) [0, 1) and Γ(β, 0) = 0 for all β > 0, thus

Γ(β, q) 1q =0. (270)

Finally, we combine (262), (269), (270) to get

( 512 η2(1 2q)4 , 1152 1 + 3 q 2 e2β(1+1q =0)

(1 2q)4 , 48e4β

This completes the proof.

Appendix F. Theorem 11: KL-Divergence Loss

Assume the Ising model tree distributions Pθ according to a tree Tθ = (V, Eθ) and the estimate Pθ according a tree Tθ = (V, Eθ ) The goal is to upper bound the symmetric KL divergence

SKL θ||θ = X

θst θ st µst µ st

with high probability. Under the event Ecorr (ϵ) we can upper bound the quantity |µst µ st| for all (s, t) E with high probability. By using bounds |θst θ st| 2β and |µst µ st| ϵ for all (s, t) E under the event Ecorr (ϵ), we have

SKL θ||θ = SKL θ||θ

Nikolakakis, Kalogerias and Sarwate

θst θ st µst µ st

θst θ st µst µ st

(p 1) |β ( β)| ϵ

by assuming ϵ ηs 2β(p 1). The suﬃcient number of samples satisﬁes the inequality

n 4 log p2/δ β2(p 1)2

η2s . (273)

Now assume that n samples of Y are given, by using the estimate Pθ = ΠTCL (ˆp ) deﬁned

in (227) under the event Ecorr (ϵ ) we have µst ˆµ st (1 2q)2

ϵ (1 2q)2 from Lemma 18. In the

same way by assuming ϵ ηs(1 2q)2

2β(p 1) , we get

n 4 log p2/δ β2(p 1)2

(1 2q)4η2s . (274)

Appendix G. Theorem 6 and Theorem 8: Proofs

We combine Fano s inequality and a Strong Data Processing Inequality to prove the necessary number of samples in the hidden model setting, ﬁrst for structure learning (Theorem 6) and then for inference (Theorem 8). We use the following variation of Fano s inequality.

Corollary 29 (Tsybakov, 2009, Corollary 2.6): Assume that Θ is a family of M + 1 distributions θ0, θ1, . . . , θM such that M 2. Let Pθi be the distribution of the variable X under the model θi, if

i=1 DKL (Pθi||Pθ0) γ log M, for γ (0, 1) (275)

then for the probability of error pe the following inequality holds: pe log(M+1) log(2)

The construction from the noiseless case, with Corollary 29 and the Strong Data Processing Inequality for the BSC yield the bound of Theorem 6. We start by presenting Bresler s and Karzand s construction, which gives a suﬃciently tight upper bound on symmetric KL divergence. Proof of Theorem 6: Consider a family of M + 1 diﬀerent Ising model distributions {Pθi : i {0, . . . , M}}. This family of the structured distributions is chosen such that the structure recovery task (through The Chow-Liu algorithm) is suﬃciently hard. First, we deﬁne Pθ0 to be an Ising model distribution with underlying structure a chain with p nodes and parameters θ0 j,j+1 = α, when j is odd and θ0 j,j+1 = β when j is even. The rest of family is

Predictive Learning on Hidden Tree-Structured Ising Models

constructed as follows: the elements of each θi, i [M] are equal to the elements of θ0 apart from two elements θi i,i+1 = 0 and θi i,i+2 = arctanh(tanh(α) tanh(β)), for each odd value of j. There are (p + 1)/2 distinct distributions in the constructed family. Through the expression (49), we derive the following upper bound on the SKL(Pθ0||Pθi), for all i [M], (Bresler and Karzand, 2020, Section 7.1),

SKL(Pθ0||Pθi) = α tanh(α) tanh(α) tanh2(β) 4α tanh(α)e 2β. (276)

Strong Data Processing Inequality: For each distribution Pθi and i {0, . . . , M} we consider the distribution of the noisy variable in the hidden model P θi PY|X Pθi. We would like to ﬁnd an upper bound for the quantities SKL(P θ0||P θi), i {0, . . . , M}. For that purpose, we a use a strong data processing inequality result for the BSC by Polyanskiy and Wu (2017). The input random variable X is considered to have correlated binary elements, while the noise variables Ni are i.i.d Rademacher(q). This scheme is equivalent to the hidden model that we consider in this paper. In fact we have the following bound

ηKL 1 (4q(1 q))p, (277)

that is proved by Polyanski (Polyanskiy and Wu, 2017, Evaluation for the BSC , equation (39)), where the quantity ηKL is deﬁned as

ηKL sup Q sup P:0<DKL(P||Q)<

DKL PY|X P||PY|X Q

DKL (P||Q) , (278)

PY|X is the distribution of the BSC and P, Q are any distributions of the input variable X. Since the supremum in (278) is with respect to all possible distributions, it covers any pair of distributions in the desired family {Pθj : j {0, . . . , M}}. Thus, for all k, ℓ {0, 1, . . . , M} and k = ℓ, it is true that

DKL(P θk||P θℓ) DKL(Pθk||Pθℓ) 1 (4q(1 q))p, (279)

which comes from (277),(278) and implies the following

SKL(P θk||P θℓ) [1 (4q(1 q))p]SKL(Pθk||Pθℓ), k = ℓ {0, 1, . . . , M}. (280)

We combine (276) and (280) to get

SKL(P θk||P θi) [1 (4q(1 q))p]4α tanh(α)e 2β [1 (4q(1 q))p]4α2e 2β. (281)

Finally, from (281) and Corollary 29 we derive the ﬁrst part of Theorem 6. Proof of Theorem 8: Theorem 8 is the extended version of Theorem 3.4 by Bresler and Karzand (2020) to the hidden model. Following a similar technique, we consider chain structured Ising models with parameters θj for j [M] such that θj j,j+1 = α and θj i,i+1 = arctanh(tanh(α) + 2η), for all i = j. Then

L(2) Pθj, Pθj = max s,t Eθj[Xs Xt] Eθj [Xs Xt] 2η (282)

Nikolakakis, Kalogerias and Sarwate

SKL Pθj, Pθj 2η [arctanh(tanh(α) + 2η) α] 2η 2η 1 [tanh(α) + 2η]2 , (283)

where the last inequality is a consequence of Mean Value Theorem (see (Bresler and Karzand, 2020, Section 6.3) for the original statement). We derive the bound of Theorem 8 by combining the strong data processing inequality (278) with (277), (283), and Corollary 29.

Appendix H. Supplementary Discussion

In this section we provide supplementary material that supports the discussion in Sections 2.4 and 2.6. First, we present one marginal case for which perfect denoising is possible before applying the Chow-Liu algorithm. Then we show a structure-preserving case.

H.1 The Gap between the Upper and Lower Bounds

We continue by analyzing the gap that appears between the upper and lower bounds for an example where perfect denoising can be applied on a speciﬁc class of tree models in the high-dimensional regime. This shows why the eﬀect of noise vanishes in Theorems 1.2 and 1.4 for p . Further, while it seems counter-intuitive that when p the problem becomes easier, we show below one example that this is the case. Our lower bound is directly aﬀected by marginal cases like this, for instance see Proposition 1.4. The gap is introduced by the terms (1 2q)4 and 1 (4q(1 2q))p in the denominator of the lower and upper bounds respectively. Speciﬁcally, for p there exists a special case for which perfect denoising before running the Chow-Liu algorithm is possible, while in other cases that is not possible. Thus the minimax bound ought to be identical to noiseless case when p and 1 (4q(1 2q))p 1 in the large dimensional regime. We continue by providing the marginal case of a trivial tree structure and showing that perfect denoising is possible in this case before running the Chow-Liu algorithm. First notice that if p , then the sample size n , even in the noiseless regime. Consider the case of E[Xi Xj] 1 for all (i, j) E. Because an inﬁnite number of samples are available, we can estimate perfectly the correlations of the observables and we ﬁnd ˆE[Yi Yj] = E[Yi Yj] = (1 2q)2 for all (i, j) V. The latter as information is suﬃcient to ﬁnd that E[Xi Xj] 1 for all (i, j) E. The hidden layer X take two values, (+1, +1, . . .) +1p

(p values +1) or ( 1. 1, . . .) 1p (p values 1), because E[Xi Xj] 1 for all (i, j) V and the later allows us to denoise each sample. Deﬁne as d H(X, Y) the Hamming distance between X and Y. At this point we can perform perfect denoising for each sample ys of inﬁnite length p and ﬁnd the hidden sample xs with probability 1 because

P (X = xs|Y = ys) = P (Y = ys|Xs = xs) P(Xs = xs) P x P (Y = ys|X = x) P(X = x)

= qd H(xs,ys)(1 q)p d H(xs,ys)

qd H(xs,ys)(1 q)p d H(xs,ys) + qd H( xs,ys)(1 q)p d H( xs,ys) (284)

Predictive Learning on Hidden Tree-Structured Ising Models

and the last holds for both of the cases xs = +1p or xs = 1p because of symmetry. Further for any observation ys for any q (0, 1/2) we have

lim p d H( xs, ys) d H(xs, ys)

a.s. = 1 q q = 1 2q. (285)

We combine (284) and (285) to ﬁnd

lim p P (X = xs|Y = ys) = lim p 1

1 + q 1 q d H( xs,ys) d H(xs,ys)

p p = 1 (286)

lim p P (X = xs|Y = ys) = lim p 1

q d H( xs,ys) d H(xs,ys)

p p = 0. (287)

As a consequence there exists one case for which perfect denoising is possible before running the Chow-Liu algorithm. Because we want Theorems 1.2 and 1.4 to reduce to the noiseless case for p , the above best case scenario must be covered. However, perfect denoising is not possible in general (for instance E[Xi Xj] < 1 and ﬁnite p.).

H.2 A Structure-Preserving Case

Lemma 30 considers a special case of tree structures for the hidden variables, the set of edges is a set with disconnected edges, no edge is connected to any other. Then we show that the same structure is preserved for the observable variables.

Lemma 30 Let F = (V, E) be a forest with |V| = p and |E| = p/2 N such that no edge is connected to any other edge. Assume that Xi { 1, +1} and E [Xi] = 0 for all i [1, . . . , p]. If Y is the output of the BSC channel (in the hidden model) with distribution p (y), then p (y) also factorizes with respect to F.

Proof The pair variables (Yi, Yj) for (i, j) E are independent because of the disconnected edges of the hidden layer. The latter directly gives the factorization as

(i,j) E p (yi, yj) = Y

i V p (yi) Y

p (yi, yj) p (yi)p (yj), (288)

because |V| = p, |E| = p/2 and the marginal distributions are uniform.

Animashree Anandkumar and Ragupathyraj Valluvan. Learning loopy graphical models with latent variables: Eﬃcient methods and guarantees. The Annals of Statistics, pages 401 435, 2013. URL https://www.jstor.org/stable/23566568.

Nikolakakis, Kalogerias and Sarwate

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773 2832, 2014. URL http://jmlr.org/papers/v15/ anandkumar14b.html.

M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle ﬁlters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174 188, 2002. URL https://www.irisa.fr/aspi/legland/ ensta/ref/arulampalam02a.pdf.

Onureena Banerjee, Laurent El Ghaoui, and Alexandre d Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine learning research, 9(3):485 516, 2008. URL http://www.jmlr.org/ papers/volume9/banerjee08a/banerjee08a.pdf.

Michael Banf and Seung Y Rhee. Enhancing gene regulatory network inference through data integration with Markov random ﬁelds. Scientiﬁc Reports, 7, 2017. URL https: //www.nature.com/articles/srep41174.

Rina Foygel Barber, Mathias Drton, et al. High-dimensional Ising model selection with bayesian information criteria. Electronic Journal of Statistics, 9(1):567 607, 2015. URL https://projecteuclid.org/euclid.ejs/1427203129.

Lionel Barnett, Joseph T Lizier, Michael Harré, Anil K Seth, and Terry Bossomaier. Information ﬂow in a kinetic Ising model peaks in the disordered phase. Physical Review Letters, 111(17):177203, 2013. URL https://journals.aps.org/prl/abstract/10.1103/ Phys Rev Lett.111.177203.

George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33 45, 1962. URL https:// www.jstor.org/stable/2282438?seq=1#page_scan_tab_contents.

Martin Bilodeau. Graphical lassos for meta-elliptical distributions. Canadian Journal of Statistics, 42(2):185 203, 2014. URL https://www.jstor.org/stable/43185177?seq= 1#page_scan_tab_contents.

Guy Bresler. Eﬃciently learning Ising models on arbitrary graphs. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, pages 771 782. ACM, 2015. URL https://doi.org/10.1145/2746539.2746631.

Guy Bresler and Mina Karzand. Learning a tree-structured Ising model in order to make predictions. The Annals of Statistics, 48(2), April 2020. URL https://doi.org/10. 1214/19-AOS1808.

Claudio Castellano, Santo Fortunato, and Vittorio Loreto. Statistical physics of social dynamics. Reviews of modern physics, 81(2):591, 2009. URL https://journals.aps.org/ rmp/abstract/10.1103/Rev Mod Phys.81.591.

Predictive Learning on Hidden Tree-Structured Ising Models

Arun Tejasvi Chaganty and Percy Liang. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning, pages 1872 1880, 2014. URL http://proceedings.mlr.press/v32/chaganty14.html.

Venkat Chandrasekaran, Pablo A Parrilo, and Alan S Willsky. Latent variable graphical model selection via convex optimization. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1610 1613. IEEE, 2010. URL https://ieeexplore.ieee.org/document/5707106/.

C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462 467, 1968. URL https: //ieeexplore.ieee.org/abstract/document/1054142.

Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Optimal phylogenetic reconstruction. In Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, pages 159 168. ACM, 2006. URL https://dl.acm.org/doi/abs/10.1145/ 1132516.1132540.

Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Phylogenies without branch bounds: Contracting the short, pruning the deep. In Annual International Conference on Research in Computational Molecular Biology, pages 451 465. Springer, 2009. URL https://link.springer.com/chapter/10.1007/978-3-642-02008-7_32.

Constantinos Daskalakis, Sebastien Roch, et al. Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis. The Annals of Applied Probability, 23(2):693 721, 2013. URL https://projecteuclid.org/euclid.aoap/1360682027.

Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1989 2007. SIAM, 2018. URL https://dl.acm.org/doi/10.5555/ 3174304.3175435.

Randal Douc, Eric Moulines, Jimmy Olsson, Ramon Van Handel, et al. Consistency of the maximum likelihood estimator for general hidden Markov models. the Annals of Statistics, 39(1):474 513, 2011. URL https://projecteuclid.org/euclid.aos/1297779854.

Mathias Drton and Marloes H Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365 393, 2017. URL https: //www.annualreviews.org/doi/abs/10.1146/annurev-statistics-060116-053803? journal Code=statistics.

Cynthia Dwork, Krishnaram Kenthapadi, Frank Mc Sherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486 503. Springer, 2006a. URL https://www.iacr.org/archive/eurocrypt2006/40040493/ 40040493.pdf.

Nikolakakis, Kalogerias and Sarwate

Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265 284. Springer, 2006b. URL https://link.springer.com/chapter/10.1007/11681878_14.

David Edwards, Gabriel CG De Abreu, and Rodrigo Labouriau. Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests. BMC Bioinformatics, 11(1): 18, 2010. URL https://bmcbioinformatics.biomedcentral.com/articles/10.1186/ 1471-2105-11-18.

Péter L Erdős, Michael A Steel, László A Székely, and Tandy J Warnow. A few logs suﬃce to build (almost) all trees: Part ii. Theoretical Computer Science, 221(1-2):77 118, 1999.

E Yagmur Erten, Joseph T Lizier, Mahendra Piraveenan, and Mikhail Prokopenko. Criticality and information dynamics in epidemiological models. Entropy, 19(5):194, 2017. URL https://www.mdpi.com/1099-4300/19/5/194.

Jianqing Fan, Han Liu, Yang Ning, and Hui Zou. High dimensional semiparametric latent graphical model for mixed data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):405 421, 2017. URL https://rss.onlinelibrary.wiley.com/ doi/abs/10.1111/rssb.12168.

Xiequan Fan, Ion Grama, and Quansheng Liu. Hoeﬀding s inequality for supermartingales. Stochastic Processes and their Applications, 122(10):3545 3559, 2012. URL https://doi. org/10.1016/j.spa.2012.06.009.

Michael Finegold and Mathias Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, pages 1057 1080, 2011. URL https://projecteuclid.org/euclid.aoas/1310562216.

Rina Foygel and Mathias Drton. Extended Bayesian information criteria for Gaussian graphical models. In Advances in Neural Information Processing Systems 23, pages 604 612, 2010. URL https://arxiv.org/abs/1011.6640.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008. URL https: //www.ncbi.nlm.nih.gov/pubmed/18079126.

Xin Gao, Daniel Q Pu, Yuehua Wu, and Hong Xu. Tuning parameter selection for penalized likelihood estimation of Gaussian graphical model. Statistica Sinica, pages 1123 1146, 2012. URL https://www.jstor.org/stable/24309977.

Hans-Otto Georgii. Gibbs Measures and Phase Transitions, volume 9. Walter de Gruyter, 2011.

Surbhi Goel, Daniel M Kane, and Adam R Klivans. Learning ising models with independent failures. In Conference on Learning Theory, pages 1449 1469, 2019. URL http://proceedings.mlr.press/v99/goel19a.html.

Søren Højsgaard, David Edwards, and Steﬀen Lauritzen. Graphical Models with R. Springer Science & Business Media, 2012.

Predictive Learning on Hidden Tree-Structured Ising Models

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(1):695 709, 2005. URL jmlr.org/papers/volume6/ hyvarinen05a/old.pdf.

Aapo Hyvärinen. Some extensions of score matching. Computational statistics & data analysis, 51(5):2499 2512, 2007. URL https://www.cs.helsinki.fi/u/ahyvarin/papers/ CSDA07.pdf.

Leon Isserlis. On a formula for the product-moment coeﬃcient of any order of a normal frequency distribution in any number of variables. Biometrika, 12(1/2):134 139, 1918. URL https://www.jstor.org/stable/2331932?seq=1#page_scan_tab_contents.

Ali Jalali, Christopher C Johnson, and Pradeep K Ravikumar. On learning discrete graphical models using greedy methods. In Advances in Neural Information Processing Systems, pages 1935 1943, 2011. URL https://papers.nips.cc/paper/ 4290-on-learning-discrete-graphical-models-using-greedy-methods.

Andrew H Jazwinski. Stochastic Processes and Filtering Theory. Courier Corporation, 2007.

Dionysios S Kalogerias and Athina P Petropulu. Grid based nonlinear ﬁltering revisited: Recursive estimation & asymptotic optimality. IEEE Transactions on Signal Processing, 64(16):4244 4259, 2016. URL https://ieeexplore.ieee.org/document/7457725.

David Karger and Nathan Srebro. Learning Markov networks: Maximum bounded treewidth graphs. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 392 401. Society for Industrial and Applied Mathematics, 2001. URL https://www.biostat.wisc.edu/~page/markovnet.pdf.

S. A. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In IEEE 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 531 540, 2008. doi: 10.1109/FOCS.2008.27. URL http://doi.org/10.1109/FOCS.2008.27.

Minje Kim and Paris Smaragdis. Single channel source separation using smooth nonnegative matrix factorization with Markov random ﬁelds. In Machine Learning for Signal Processing (MLSP), 2013 IEEE International Workshop on, pages 1 6. IEEE, 2013. URL https://ieeexplore.ieee.org/document/6661909/.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT press, 2009.

Steﬀen L Lauritzen. Graphical Models, volume 17. The Clarendon Press Oxford University Press, New York, 1996.

Chuan Li and Michael Wand. Combining Markov random ﬁelds and convolutional neural networks for image synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. URL https://ieeexplore.ieee.org/document/ 7780641.

Nikolakakis, Kalogerias and Sarwate

Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Eﬃcient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194 3203, 2016a. URL http://openaccess.thecvf.com/content_cvpr_2016/papers/Lin_ Efficient_Piecewise_Training_CVPR_2016_paper.pdf.

Lina Lin, Mathias Drton, Ali Shojaie, et al. Estimation of high-dimensional graphical models using regularized score matching. Electronic Journal of Statistics, 10(1):806 854, 2016b. URL https://projecteuclid.org/euclid.ejs/1459967424.

Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Laﬀerty, and Larry Wasserman. Forest density estimation. Journal of Machine Learning Research, 12(Mar):907 951, 2011. URL http://jmlr.csail.mit.edu/papers/volume12/liu11a/liu11a.pdf.

Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning Markov random ﬁeld for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. URL https://ieeexplore.ieee.org/document/8006236.

Shiqian Ma, Lingzhou Xue, and Hui Zou. Alternating direction methods for latent variable Gaussian graphical model selection. Neural computation, 25(8):2172 2198, 2013. URL https://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00379? journal Code=neco.

Hirotsugu Matsuda, Naofumi Ogita, Akira Sasaki, and Kazunori Sat o. Statistical mechanics of population: The lattice Lotka-Volterra model. Progress of Theoretical Physics, 88(6): 1035 1049, 1992. URL http://doi.org/10.1143/ptp/88.6.1035.

Alan Morningstar and Roger G. Melko. Deep learning the ising model near criticality. Journal of Machine Learning Research, 18(163):1 17, 2018. URL http://jmlr.org/papers/ v18/17-527.html.

Preetam Nandy, Alain Hauser, and Marloes H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning. The Annals of Statistics, 46(6A):3151 3183, 2018. URL https://projecteuclid.org/euclid.aos/1536307246.

Konstantinos E Nikolakakis, Dionysios S Kalogerias, and Anand D Sarwate. Information thresholds for non-parametric structure learning on tree graphical models. ar Xiv preprint ar Xiv:1909.09596, 2020. URL https://arxiv.org/abs/1909.09596.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann Series in Representation and Reasoning), 1988.

Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration, pages 211 249. Springer, 2017. URL http://people.lids.mit.edu/yp/homepage/data/simple-IMA.pdf.

Maxim Raginsky. Strong data processing inequalities and φ-Sobolev inequalities for discrete channels. IEEE Transactions on Information Theory, 62(6):3355 3389, 2016. URL https: //doi.org/10.1109/TIT.2016.2549542.

Predictive Learning on Hidden Tree-Structured Ising Models

Pradeep Ravikumar, Martin J Wainwright, John D Laﬀerty, et al. High-dimensional Ising model selection using L1-regularized logistic regression. The Annals of Statistics, 38(3): 1287 1319, 2010. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1. 1.67.6940&rep=rep1&type=pdf.

Avik Ray, Sujay Sanghavi, and Sanjay Shakkottai. Improved greedy algorithms for learning graphical models. IEEE Transactions on Information Theory, 61(6):3457 3468, 2015. URL https://ieeexplore.ieee.org/document/7097023/.

Patrick Rebeschini, Ramon Van Handel, et al. Can local particle ﬁlters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809 2866, 2015. URL https://projecteuclid.org/euclid.aoap/1438261054#info/10.1214/14-AAP1061.

Alexander G Schwing and Raquel Urtasun. Fully connected deep structured networks. ar Xiv preprint ar Xiv:1503.02351, 2015. URL https://arxiv.org/abs/1503.02351.

Tetsuya Takaishi. Multiple time series Ising model for ﬁnancial market simulations. In Journal of Physics: Conference Series, volume 574, page 012149. IOP Publishing, 2015. URL https://iopscience.iop.org/article/10.1088/1742-6596/574/1/012149/pdf.

Vincent YF Tan, Animashree Anandkumar, and Alan S Willsky. Learning Gaussian tree models: Analysis of error exponents and extremal structures. IEEE Transactions on Signal Processing, 58(5):2701 2714, 2010. URL https://ieeexplore.ieee.org/document/ 5406101.

Vincent YF Tan, Animashree Anandkumar, and Alan S Willsky. Learning high-dimensional Markov forest distributions: Analysis of error rates. Journal of Machine Learning Research, 12(May):1617 1653, 2011. URL https://dl.acm.org/citation.cfm?id= 2021052.

Salvatore Torquato. Toward an Ising model of cancer and beyond. Physical Biology, 8(1): 015017, 2011. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3151151/.

Alexandre B Tsybakov. Introduction to Non-parametric Estimation. Revised and Extended from the 2004 French Original. Translated by Vladimir Zaiats, 2009.

Ramon Van Handel. Observability and nonlinear ﬁltering. Probability theory and related ﬁelds, 145(1-2):35 74, 2009. URL https://link.springer.com/article/10.1007/ s00440-008-0161-y.

Daniel Vogel and Roland Fried. Elliptical graphical modelling. Biometrika, 98(4): 935 951, 2011. URL https://www.jstor.org/stable/pdf/23076182.pdf?seq=1#page_ scan_tab_contents.

Daniel Vogel and David E Tyler. Robust estimators for non-decomposable elliptical graphical models. Biometrika, 101(4):865 882, 2014. URL https://doi.org/10.1093/biomet/ asu041.

Nikolakakis, Kalogerias and Sarwate

Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In AISTATS, 2003. URL https://www.jstor.org/stable/23566568?seq=1#page_scan_ tab_contents.

Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends R in Machine Learning, 1(1 2):1 305, 2008. URL https://www.nowpublishers.com/article/Download Summary/MAL-001.

Bin Wang, Zhijian Ou, and Zhiqiang Tan. Learning trans-dimensional random ﬁelds with applications to language modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. URL https://ieeexplore.ieee.org/document/7906616.

Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63 69, March 1965. doi: 10.2307/2283137. URL http://doi.org/10.2307/2283137.

Scott Wisdom, John Hershey, Jonathan Le Roux, and Shinji Watanabe. Deep unfolding for multichannel source separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 121 125. IEEE, 2016. URL https:// ieeexplore.ieee.org/document/7471649.

Baoyuan Wu, Bao-Gang Hu, and Qiang Ji. A coupled hidden Markov random ﬁeld model for simultaneous face clustering and tracking in videos. Pattern Recognition, 64:361 373, 2017. URL https://www.sciencedirect.com/science/article/abs/pii/ S0031320316303387.

Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19 35, 2007. URL https://www.jstor.org/stable/20441351?seq=1# page_scan_tab_contents.

W-X Zhou and Didier Sornette. Self-organizing Ising model of ﬁnancial markets. The European Physical Journal B, 55(2):175 181, 2007. URL https://epjb.epj.org/articles/ epjb/abs/2007/02/b06183/b06183.html.

Abdelhak M Zoubir, Visa Koivunen, Yacine Chakhchoukh, and Michael Muma. Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts. IEEE Signal Processing Magazine, 29(4):61 80, 2012. URL https:// ieeexplore.ieee.org/abstract/document/6217389?casa_token=a-j Zi HR6I9g AAAAA: EWBF9R3XSy Why LB8Xab0FUDihr BT_1sa Guspv5z AI8JZ_OWpv Crrwwo_By YH9WFtv0Sz7LC4Ww.

Yiming Zuo, Yi Cui, Guoqiang Yu, Ruijiang Li, and Habtom W Ressom. Incorporating prior biological knowledge for network-based diﬀerential gene expression analysis using diﬀerentially weighted graphical LASSO. BMC Bioinformatics, 18(1):99, 2017. URL https: //bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1515-1.