# Statistical Estimation from Dependent Data

Yuval Dagan¹, Constantinos Daskalakis¹, Nishanth Dikkala², Surbhi Goel³, Vardis Kandiros¹

¹MIT EECS, ²Google Research, ³Microsoft Research NYC. Correspondence to: Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Surbhi Goel, Vardis Kandiros.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where, e.g., these observations are collected on a spatial domain, a temporal domain, or a social network, which induce dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow these dependencies to be substantial, i.e. we do not assume that the Markov Random Field capturing these dependencies is in high temperature. As our main contribution we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds in logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e. external fields and interaction strengths) of Ising models from a single sample. We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies, across three text classification datasets: Cora, Citeseer and Pubmed.

1. Introduction

The standard supervised learning framework assumes access to a collection $(x_i, y_i)_{i=1}^n$ of observations, where the labels $y_1, \ldots, y_n \in \mathcal{Y}$ are independent conditioning on the feature vectors $x_1, \ldots, x_n \in \mathcal{X}$. Further, it is common to assume that each label $y_i$ is independent of $\{x_j\}_{j \neq i}$ conditioning on $x_i$, i.e. that $\mathbb{P}[y_{1 \ldots n} \mid x_{1 \ldots n}] = \prod_{i=1}^n \mathbb{P}[y_i \mid x_i]$, and, moreover, that the observations share the same generative process $\mathbb{P}[y \mid x]$ sampling a label conditioning on a feature vector. Under these assumptions, a common goal is to identify a model $\mathbb{P}_\theta[y \mid x]$ from some parametric class, which approximates the true generative process $\mathbb{P}[y \mid x]$ in some precise sense, or, under realizability assumptions, to estimate the parameter $\theta$ of the true generative process. A special case of this problem is the familiar logistic regression problem, where each label lies in $\mathcal{Y} = \{\pm 1\}$, each feature vector lies in $\mathbb{R}^d$, and for some $\theta \in \mathbb{R}^d$ it is assumed that

$$\mathbb{P}[y_{1 \ldots n} \mid x_{1 \ldots n}] = \prod_{i=1}^n \frac{1}{1 + \exp(-2 (\theta^\top x_i) y_i)}. \qquad (1)$$

The standard assumptions outlined above are, however, too strong and almost never truly hold in practice. Their violations become especially prominent for observations collected in a temporal domain, a spatial domain or a social network, which naturally induce dependencies among the observations. Such dependencies could arise from physical constraints, causal relationships among observations, or peer effects in a social network. They have been studied extensively in many practical fields, and from a theoretical standpoint in econometrics and statistical learning theory. See Section 1.2 for further discussion.
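As a baseline for what follows, the conditional likelihood in (1) factorizes across observations; the following minimal sketch (our own illustration, with hypothetical toy data, not code from the paper) evaluates it:

```python
import numpy as np

def logistic_conditional_likelihood(theta, X, y):
    """P[y_{1..n} | x_{1..n}] under model (1): labels in {-1, +1} are
    independent given the features, with
    P[y_i | x_i] = 1 / (1 + exp(-2 (theta^T x_i) y_i))."""
    margins = X @ theta                        # theta^T x_i for each i
    per_label = 1.0 / (1.0 + np.exp(-2.0 * margins * y))
    return np.prod(per_label)                  # independence: likelihood factorizes

# Hypothetical toy data: n = 4 points in d = 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
theta = np.array([0.5, -1.0])
y = np.sign(X @ theta)                         # noiseless labels, for illustration
print(logistic_conditional_likelihood(theta, X, y))
```

It is exactly this product structure that fails once the labels become dependent, as in the models studied next.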
In this paper we study such dependencies conforming to the following general class of models:

$$\mathbb{P}[y_{1 \ldots n} \mid x_{1 \ldots n}] \propto \exp\big(\beta \cdot H(y)\big) \prod_{i=1}^n \exp(f_\theta(x_i, y_i)) = \exp\Big(\beta \cdot H(y) + \sum_{i=1}^n f_\theta(x_i, y_i)\Big), \qquad (2)$$

where $f_\theta$ is an (unknown) function from some parametric class, $H$ is a (known) function that captures the dependency structure, and $\beta$ is an (unknown) parameter that captures the strength of the dependencies. It should be appreciated that Model (2) is more general than the standard supervised learning problem without dependencies, which results from setting $\beta = 0$. Once we allow $\beta \neq 0$, Model (2) becomes more expressive in capturing the dependencies among the observations, which become stronger with higher values of $\beta$. The challenging estimation problem that arises, which motivates our work, is whether the model parameters $\theta$ and/or $\beta$ can be identified, and at what rates, in the presence of the intricate dependencies arising from this model. Importantly, while the labels are intricately dependent, we do not have access to multiple independent samples from the conditional distribution (2), but a single sample from that distribution!

We focus here on a special case of Model (2) wherein the labels are binary and the function $H$ is pairwise separable, studying models of the following form:

$$\mathbb{P}_{\theta,\beta}[y_{1 \ldots n} \mid x_{1 \ldots n}] = \frac{\exp(\beta\, y^\top A y) \prod_{i=1}^n \exp(y_i f_\theta(x_i))}{Z_{\theta,\beta}}, \qquad (3)$$

where $f_\theta$ is an unknown function from some parametric class $\mathcal{F} = \{f_\theta : \mathcal{X} \to \mathbb{R} \mid \theta \in \Theta\}$, $A$ is a known, symmetric interaction matrix with zeros on the diagonal, $\beta$ is an unknown parameter, and $Z_{\theta,\beta}$ is the normalizing constant. In other words, under (3), the labels $y_1, \ldots, y_n$ are sampled from an $n$-variable Ising model with external field $f_\theta(x_i)$ on variable $i$, interaction strength $A_{ij} = A_{ji}$ between variables $i$ and $j$, and inverse temperature $\beta$. Notice that $A_{ij}$ encourages $y_i$ and $y_j$ to take the same or opposite values, depending on its sign; however, this local encouragement can be overwritten by indirect interactions through other variables $y_k$. Such indirect interactions make this model rich in spite of the simple form of $H(y) = y^\top A y$, and as a consequence it has found profound applications in a range of disciplines, including Statistical Physics, Computer Vision, Computational Biology, and the Social Sciences; see e.g. (Geman & Graffigne, 1986; Ellison, 1993; Felsenstein, 2004; Chatterjee, 2005; Daskalakis et al., 2011; 2017).

It is clear that Model (3) generalizes (1), which can be obtained by setting $\beta = 0$, $\mathcal{X} = \mathbb{R}^d$, and $f_\theta(x) = \theta^\top x$. It also generalizes the model studied by Daskalakis et al. (2019), which results from setting $f_\theta(x) = \theta^\top x$ and $0 \le \beta \le O(1)$, as well as the model studied by Ghosal & Mukherjee (2018); Bhattacharya & Mukherjee (2018); Chatterjee (2007), which results from taking $f_\theta(x)$ to be a constant function. We study under what conditions on the function class $\mathcal{F}$, the interaction matrix $A$, and the feature vectors $(x_i)_{i=1}^n$, and at what rates, the parameters $\theta^*$ and/or $\beta^*$ of Model (3) can be estimated given a collection $(x_i, y_i)_{i=1}^n$ of observations, where the labels $y_1, \ldots, y_n$ are sampled from (3) conditioning on the feature vectors $x_1, \ldots, x_n$. As explained earlier, in comparison to the standard supervised learning setting without dependencies, the statistical challenge that arises here is that, while the labels $y_1, \ldots, y_n$ are intricately dependent, we do not have access to multiple independent samples from the conditional distribution (3), but a single sample from that distribution.
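Because the labels in (3) are jointly dependent, even simulating the model requires care: a standard approach is Gibbs sampling, which repeatedly resamples each label from its conditional distribution given the rest (this conditional is made explicit in the MPLE objective (6) of Section 2). The sketch below is our own illustration, assuming the values $f_\theta(x_i)$ have been precomputed:

```python
import numpy as np

def gibbs_sample_labels(f_vals, A, beta, n_sweeps=500, seed=0):
    """Draw an approximate sample y in {-1,+1}^n from model (3) by Gibbs
    sampling. f_vals[i] = f_theta(x_i); A is symmetric with zero diagonal.
    Each coordinate is resampled from the single-label conditional used in
    the paper's objective (6): P[y_i = +1 | x, y_{-i}] = exp(m_i)/(2 cosh m_i),
    with local field m_i = f_theta(x_i) + beta * sum_j A_ij y_j.
    (Depending on the convention for y^T A y, a factor of 2 may be absorbed
    into beta; we follow the conditional as written in (6).)"""
    rng = np.random.default_rng(seed)
    n = len(f_vals)
    y = rng.choice([-1.0, 1.0], size=n)        # arbitrary initialization
    for _ in range(n_sweeps):
        for i in range(n):
            m = f_vals[i] + beta * (A[i] @ y)  # A[i, i] = 0, so y_i drops out
            p_plus = np.exp(m) / (2.0 * np.cosh(m))
            y[i] = 1.0 if rng.random() < p_plus else -1.0
    return y
```

Note that such a sampler yields exactly one vector $(y_1, \ldots, y_n)$; this single draw is all the estimation problem provides.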
Thus, it is not clear how to extract good estimates of the parameters from our observations, or where to find the statistical power to bound the error of these estimates relative to the true parameters. As a consequence, only limited theoretical results are known in this area.

1.1. Overview of Results

We provide a general algorithmic approach which yields efficient statistical rates for the estimation of $\theta^*$ and/or $\beta^*$ of Model (3) for general function classes $\mathcal{F}$, in terms of the metric entropy of $\mathcal{F}$. We also prove information-theoretic lower bounds, which combined with our upper bounds characterize the min-max estimation rate of the problem up to a certain factor, discussed below. Before stating our general result as Theorem 6, we present some corollaries of this theorem in more familiar settings. All the theorems that follow are also presented and proved in more detail in the Supplementary Material. Finally, in all statements below we use the following notation and assumptions, which summarize the already described setting.

Assumptions 1 (and useful Notation). We are given observations $(x_i, y_i)_{i=1}^n$, where $y_1, \ldots, y_n$ are sampled from (3) conditioning on $x_1, \ldots, x_n$, using some unknown parameters $\theta^* \in \Theta$ and $\beta^* \in [-B, B]$, and some known $A$, normalized such that $\|A\|_\infty = 1$. We further assume that $|f_\theta(x_i)| \le M$, for all $i$ and $\theta \in \Theta$. In all statements below, $\hat\theta$ and $\hat\beta$ refer to the estimates produced by the algorithm described in Section 2, i.e. the Maximum Pseudo-Likelihood Estimator (MPLE). Moreover, we let $\lesssim$ denote an inequality up to factors that are singly-exponential in $M$ and $B$ (a necessary dependence on those parameters wherever $\lesssim$ is used) and independent of all other parameters. In particular, when $M, B = O(1)$, $\lesssim$ denotes inequality up to a constant.

Under the assumptions on our observations and the notation introduced above, we consider two settings to illustrate our general result (Theorem 6), namely linear classes (Setting 1) and neural network classes (Setting 2).

Setting 1 (Linear Classes). Make Assumptions 1, suppose $x_i \in \mathbb{R}^d$ and $\|x_i\|_2 \le M$ for all $i$, and suppose that $f_\theta$ is linear, i.e. $f_\theta(x_i) = x_i^\top \theta$, for some $\theta \in \mathbb{R}^d$ with $\|\theta\|_2 \le 1$. Denote by $X$ the matrix whose rows are $x_1, \ldots, x_n$ and by $\kappa$ the minimum eigenvalue of $X^\top X / n$, or its minimum restricted eigenvalue in the sparse setting of Theorem 2. We suppress from our bounds of Theorems 1 and 2 a factor of $1/\kappa$.

Theorem 1 (Linear Class). Suppose Setting 1. Then, with probability $1 - \delta$,

$$\|\hat\theta - \theta^*\|_2^2 + |\hat\beta - \beta^*|^2 \lesssim \frac{d \log n + \log(1/\delta)}{\|A\|_F^2}.$$

Theorem 2 (Sparse Linear Class). Suppose Setting 1 and additionally that $\|\theta^*\|_1 \le \sqrt{s}$. Then, with probability $1 - \delta$,

$$\|\hat\theta - \theta^*\|_2^2 + |\hat\beta - \beta^*|^2 \lesssim \frac{(n^2 s \log d)^{1/3} + \log(1/\delta)}{\|A\|_F^2}.$$

Both bounds above are obtained by minimizing a convex function over a convex domain, which can be done in polynomial time. We note that the bound of Theorem 1 generalizes the main result of Daskalakis et al. (2019), which makes the additional assumption that $\|A\|_F = \Omega(\sqrt{n})$. We need no such assumption, and our bound gracefully degrades as $\|A\|_F$ decreases. Theorem 2 extends these results to the sparse linear model, for which no prior results exist. Note that our bound is non-vacuous as long as $\|A\|_F = \Omega(n^{1/3})$, which is a reasonable expectation given that $A$ is $n \times n$. Moreover, it is possible to remove the appearance of $n^2$ from the bound of this theorem if our model class satisfies $\|\theta\|_0 \le s$. Finally, we note that the factor $1/\|A\|_F^2$ which appears in our error bounds is tight, as per the following.

Theorem 3 (Lower bound).
For any $n$ and $r \in [1, n]$, there exists an instance of a $d = 1$-dimensional linear class that satisfies the assumptions of Theorems 1 and 2 and further has $\|A\|_F^2 = r$, such that any estimator $(\theta', \beta')$ satisfies, with probability at least $0.49$,

$$|\theta' - \theta^*|^2 \gtrsim \frac{1}{\|A\|_F^2}, \qquad |\beta' - \beta^*|^2 \gtrsim \frac{1}{\|A\|_F^2}.$$

While Theorem 3 shows that a dependence on $1/\|A\|_F^2$ is unavoidable in the worst case, under favorable assumptions we can remove this dependence, as per the following theorem.

Theorem 4 (Linear Class, Random Features). In the same setting as Theorem 1, remove all assumptions involving the feature vectors and suppose instead that $x_1, \ldots, x_n \sim_{\mathrm{i.i.d.}} N(0, I_d)$. Then, with probability $1 - \delta$,

$$\|\hat\theta - \theta^*\|_2^2 \lesssim \xi(n, 1/\delta) \cdot \frac{d + \log(1/\delta)}{n} \left( 1 + \frac{d + \log(1/\delta)}{\|A\|_F^2 / \|A\|_2^2} \right),$$

where $\xi(n, 1/\delta)$ is linear in $\log\log(1/\delta)$ and sub-polynomial (i.e. asymptotically smaller than any polynomial) in $n$.

Noticing that $\|A\|_F^2 / \|A\|_2^2 \ge 1$, Theorem 4 shows that no lower bound on $\|A\|_F$ is necessary at all if we are only looking to estimate $\theta^*$, which answers a main problem left open by Daskalakis et al. (2019). Moreover, when $\|A\|_F^2 / \|A\|_2^2 \ge d$, which is a reasonable expectation in our setting since $\|A\|_2 \le 1$ and $A$ is $n \times n$, our bound here essentially matches the estimation rates known for the familiar logistic regression problem, which corresponds to the case $\beta^* = 0$, even though we make no such assumption, and hence our labels are dependent.

Beyond linear and sparse linear function classes, our main result (Theorem 6) provides estimation rates for neural network regression, as in the following setting.

Setting 2 (Neural Networks). Make Assumptions 1 and suppose that the function $f_\theta$ in (3) is a neural network parameterized by $\theta$. We adopt the setting and terminology of (Bartlett et al., 2017). In particular, we assume that the neural network takes the form

$$f_\theta(x) = \sigma_L(W_L \sigma_{L-1}(W_{L-1} \cdots \sigma_1(W_1 x) \cdots)), \qquad (4)$$

where the depth $L$ of the network is fixed, $\sigma_1, \ldots, \sigma_L : \mathbb{R} \to \mathbb{R}$ are some fixed non-linearities, and $W_1, \ldots, W_L$ are (unknown) weight matrices. In particular, $\theta = (W_1, \ldots, W_L)$. We denote by $\rho_1, \ldots, \rho_L$ the Lipschitz constants of the non-linearities, and when, abusing notation, we apply some non-linearity $\sigma_i$ to a vector $v$, the result $\sigma_i(v)$ is the vector whose $j$-th coordinate is $\sigma_i(v_j)$. We also adopt from (Bartlett et al., 2017) the notion of spectral complexity $R_\theta$ of a neural network $f_\theta$ with respect to reference matrices $M_1, \ldots, M_L$ (of the same dimensions as $W_1, \ldots, W_L$ respectively), defined in terms of different matrix norms as follows:

$$R_\theta = \left( \prod_{i=1}^L \rho_i \|W_i\|_2 \right) \left( \sum_{i=1}^L \frac{\|W_i^\top - M_i^\top\|_{2,1}^{2/3}}{\|W_i\|_2^{2/3}} \right)^{3/2},$$

where $\|M\|_{2,1} = \sum_{j} \sqrt{\sum_{i} M_{ij}^2}$. Assuming a fixed bound on each matrix norm involved in the above expression, we take $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ to be the collection of all neural networks of Form (4) whose weight matrices satisfy those bounds. Suppose $R$ is the resulting bound on the spectral complexity of all networks in our family, implied by our assumed bounds on the various matrix norms. Finally, we assume that the widths of all networks $f_\theta \in \mathcal{F}$ are bounded by $d$.

Theorem 5. Suppose Setting 2, and let $K^2 = \frac{1}{n} \sum_i \|x_i\|_2^2$. Then, with probability $1 - \delta$,

$$\frac{1}{n} \sum_{i=1}^n (f_{\hat\theta}(x_i) - f_{\theta^*}(x_i))^2 + |\hat\beta - \beta^*|^2 \lesssim \frac{(n^2 K^2 R^2 \log d)^{1/3} + \log(n/\delta)}{\|A\|_F^2}.$$

Notice that, in this case, we do not provide guarantees for the estimation of $\theta^*$. Since these networks are often overparameterized, it might be impossible to recover $\theta^*$. All estimation results above, namely Theorems 1–5, are corollaries of our general estimation result given below.

Theorem 6 (General Estimation Result). Make Assumptions 1, where $f_{\theta^*}$ lies in some general class $\mathcal{F} = \{f_\theta\}_{\theta \in \Theta}$.
Then, with probability $1 - \delta$,

$$\frac{1}{n} \sum_{i=1}^n (f_{\hat\theta}(x_i) - f_{\theta^*}(x_i))^2 \lesssim C_1(\mathcal{F}, X, \beta^*, \theta^*) \cdot \inf_{\epsilon \ge 0} \big( \log(n/\delta) + \epsilon n + \log N(\mathcal{F}, X, \epsilon) \big), \qquad (5)$$

where $X$ denotes the collection of feature vectors, $N(\mathcal{F}, X, \epsilon)$ is the $\epsilon$-covering number of $\mathcal{F}$ under the distance $d(f, f') = \sqrt{\sum_{i=1}^n (f(x_i) - f'(x_i))^2 / n}$, and $C_1 \lesssim 1/\|A\|_F^2$ is a quantity that has a simple formula (both quantities are formally defined in Section 3.2). Further, if $\mathcal{F}$ is convex and closed under negation,¹ then for any estimator $(\theta', \beta')$ there exists $(\theta^*, \beta^*)$ such that, with probability at least $0.49$,

$$\frac{1}{n} \sum_{i=1}^n (f_{\theta'}(x_i) - f_{\theta^*}(x_i))^2 \gtrsim C_1(\mathcal{F}, X, \beta^*, \theta^*).$$

Similar upper and lower bounds hold for estimating $\beta^*$, with $C_1$ replaced by a different quantity $C_2 \lesssim 1/\|A\|_F^2$.

¹We say that $\mathcal{F}$ is convex if for any $f, f' \in \mathcal{F}$ and any $\lambda \in [0, 1]$ the function $f''(x) = (1 - \lambda) f(x) + \lambda f'(x)$ belongs to $\mathcal{F}$. We say that $\mathcal{F}$ is closed under negation if $-f \in \mathcal{F}$ for all $f \in \mathcal{F}$.

Theorem 6 is used to derive Theorems 1, 2, 4, and 5 by bounding the covering numbers of linear, sparse linear and neural network classes. It is also used to derive Theorem 3 in a straightforward way. It is worth emphasizing that we obtain separate general estimation rates for $\beta^*$ and $\theta^*$, which are tight or near-tight in a variety of settings.

1.2. Related Work

Data dependencies are pervasive in many applications of Statistics and Machine Learning, e.g. in financial, meteorological, epidemiological, and geographical applications, as well as in social-network analyses, where peer effects have been studied in topics as diverse as criminal activity (Glaeser et al., 1996), welfare participation (Bertrand et al., 2000), school achievement (Sacerdote, 2001), retirement plan participation (Duflo & Saez, 2003), and obesity (Christakis & Fowler, 2013; Trogdon et al., 2008). These applications have motivated substantial work in Econometrics (see e.g. Manski (1993); Bramoullé et al. (2009) and their references), where identification results have been pursued and debated, mostly in linear auto-regressive models; see also Daskalakis et al. (2019). In Statistical Learning Theory, learnability and uniform convergence bounds have been shown in the presence of sample dependencies; see e.g. Yu (1994); Gamarnik (2003); Berti et al. (2009); Mohri & Rostamizadeh (2009); Pestov (2010); Mohri & Rostamizadeh (2010); Shalizi & Kontorovich (2013); London et al. (2013); Kuznetsov & Mohri (2015); London et al. (2016); McDonald & Shalizi (2017); Dagan et al. (2019). Those learnability frameworks are not applicable to our setting, due to the exchangeability, fast-mixing, or weak-dependence properties that they exploit.

Close to our setting, recent work of Daskalakis et al. (2019) considers a special case of our problem, where the function $f_\theta$ in Model (3) is assumed linear. We obtain stronger estimation bounds under weaker assumptions, our bounds gracefully degrading with $\|A\|_F$, as we have already discussed. Similarly, earlier work by Chatterjee (2007); Bhattacharya & Mukherjee (2018); Ghosal & Mukherjee (2018); Dagan et al. (2020), motivated by single-sample estimation of Ising models, considers a special case of our problem where the function $f_\theta$ in Model (3) is assumed constant. Our bounds in this simple setting are as tight as the tightest bounds in that line of work. Overall, in comparison to these works, our general estimation result (Theorem 6) covers arbitrary classes $\mathcal{F}$, characterizing the estimation rate up to a factor that depends on the metric entropy of $\mathcal{F}$.
We thus obtain rates for sparse linear classes (Theorem 2), neural network classes (Theorem 5), and Lipschitz classes (discussed in the Supplementary Material), which had not been shown before. Finally, our bounds disentangle our ability to estimate $\theta$ and $\beta$, allowing for the estimation of $\theta$ even when the estimation of $\beta$ is impossible, as shown in Theorem 4 for linear classes, answering a main problem left open by Daskalakis et al. (2019). At a higher level, single-sample statistical estimation is both a classical and an emerging field (Besag, 1974; Bresler & Nagaraj, 2018; Chen et al., 2019; Dagan et al., 2020), with intimate connections to Statistical Physics, Combinatorics, and High-Dimensional Probability.

Roadmap. We present the estimator used to derive all our upper bounds in Section 2. We present a sketch of our proof of Theorem 6 in Section 3. We do this in two steps. First we present a sketch for the toy case of Theorem 1, i.e. the single-dimensional case, which illustrates some of the main ideas of the proof. We then provide the modifications necessary for the multi-dimensional case, which naturally lead us to the formulation of Theorem 6. While the main technical ideas are already illustrated in Section 3 in sufficient detail, the complete details can be found in the Supplementary Material. We conclude with experiments in Section 4, where we apply our estimator on citation datasets and compare its prediction accuracy to supervised learning approaches that do not take label dependencies into account.

2. The Estimation Algorithm

In all our theorems, the estimator we use is the Maximum Pseudo-Likelihood Estimator (MPLE), first proposed by Besag (1974) and defined as follows:

$$(\hat\theta, \hat\beta) := \arg\max_{\theta, \beta} \prod_{i=1}^n \mathbb{P}_{\theta,\beta}[y_i \mid x, y_{-i}] = \arg\max_{\theta, \beta} \prod_{i=1}^n \frac{\exp\big( y_i \big( f_\theta(x_i) + \beta \sum_{j=1}^n A_{ij} y_j \big) \big)}{2 \cosh\big( f_\theta(x_i) + \beta \sum_{j=1}^n A_{ij} y_j \big)}, \qquad (6)$$

where $\mathbb{P}_{\theta,\beta}$ is defined in (3), $x = (x_1, \ldots, x_n)$ and $y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)$. We optimize the MPLE over $\theta \in \Theta$ and $\beta \in [-B, B]$, for $\Theta, B$ as per Assumptions 1.

In comparison to the MPLE, the more common Maximum Likelihood Estimator (MLE) optimizes $\mathbb{P}_{\theta,\beta}[y_{1 \ldots n} \mid x_{1 \ldots n}]$. Notice that the MPLE coincides with the MLE in the case $\beta = 0$, which corresponds to $y_1, \ldots, y_n$ being independent conditioned on $x_{1 \ldots n}$. When $\beta \neq 0$, this conditional independence ceases to hold and the two methods target different objectives. In this case, the objective function of the MLE, which is (3), involves the normalizing factor $Z_{\theta,\beta}$, which is in general computationally hard to approximate (Sly & Sun, 2014). In contrast, the MPLE is efficiently computable in many cases. For example, in the linear case where $f_\theta(x_i) = x_i^\top \theta$, the logarithm of the objective in (6) is a concave function of $\theta$ and $\beta$ (equivalently, the negative log pseudo-likelihood is convex). Hence, we can use a variety of convex optimization algorithms to find the optimal solution. Even in cases where the objective is not concave, we can always use generic gradient-based optimization to find a local optimum quickly, since the derivative is easy to compute. Thus, the MPLE is a very appealing choice for a variety of models. In all the results that follow, both theoretical and practical, the algorithm used is the MPLE.
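In the linear case, the negative log of the objective in (6) can be written down and minimized directly; the following is a minimal sketch (our own illustration, using scipy's L-BFGS-B, not the authors' code):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_pseudolikelihood(params, X, y, A):
    """phi(theta, beta): minus the log of the MPLE objective (6),
    specialized to the linear class f_theta(x) = theta^T x.
    params = [theta_1, ..., theta_d, beta]; convex in params."""
    theta, beta = params[:-1], params[-1]
    m = X @ theta + beta * (A @ y)              # local fields m_i
    # -log P[y_i | x, y_{-i}] = -y_i m_i + log(2 cosh m_i), summed over i;
    # log(2 cosh m) = logaddexp(m, -m), which avoids overflow.
    return np.sum(-y * m + np.logaddexp(m, -m))

def mple_linear(X, y, A, B=1.0):
    """Minimize phi over theta in R^d and beta in [-B, B]."""
    d = X.shape[1]
    bounds = [(None, None)] * d + [(-B, B)]
    res = minimize(neg_log_pseudolikelihood, x0=np.zeros(d + 1),
                   args=(X, y, A), bounds=bounds, method="L-BFGS-B")
    return res.x[:-1], res.x[-1]                # (theta_hat, beta_hat)
```

Since the local fields are linear in $(\theta, \beta)$ and $\log(2\cosh(\cdot))$ is convex, the objective is convex, so the box-constrained quasi-Newton solve above finds the global optimum.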
3. Proof overview

In this section, we briefly describe the most important contributions of this work at the technical level. We start by discussing the case where $f_\theta$ is linear and $\theta$ is a one-dimensional parameter. We describe in detail the obstacles that had to be overcome to obtain tight rates for the estimation of $\theta$ and $\beta$ in this case, and highlight some of the most important features of the proof. In particular, we use the mean-field approximation, a tool from statistical physics, to derive the bounds. Later, we sketch the proof of the general Theorem 6.

Notation: Matrix Norms. We use the Frobenius norm $\|A\|_F$, the spectral norm $\|A\|_2$, and the infinity-to-infinity norm $\|A\|_\infty := \max_{1 \le i \le n} \sum_{j=1}^n |A_{ij}|$. In our setting $A$ is symmetric, so one has $\|A\|_2 \le \|A\|_\infty = 1$ and $\|A\|_F \le \sqrt{n}\, \|A\|_2 \le \sqrt{n}$.

3.1. Single-dimensional linear classes

We consider the setting of Theorem 1 when the dimension is $d = 1$. We denote $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$. To simplify the presentation, we assume $\kappa \ge \Omega(1)$, which implies that $\|x\|_2 \ge \Omega(\sqrt{n})$, and further that $M, B = O(1)$. In this sketch we focus on estimating $\theta^*$ (the bound on $\beta^*$ is obtained similarly), and our goal is to show the special case of Theorem 1 for dimension $d = 1$, namely that with probability $1 - \delta$,

$$|\hat\theta - \theta^*| \lesssim \frac{\sqrt{\log(n/\delta)}}{\|A\|_F}. \qquad (7)$$

In fact, we will show the tighter bound

$$|\hat\theta - \theta^*| \lesssim \sup_{\lambda \in \mathbb{R}} \frac{\sqrt{\log(n/\delta)}}{\lambda \|A\|_F + \|x - \lambda A \tanh((\beta^*/\lambda) x + \theta^* x)\|_2}, \qquad (8)$$

where $\tanh((z_1, \ldots, z_n)) := (\tanh(z_1), \ldots, \tanh(z_n))$ is applied coordinate-wise. We note that this bound is tight up to the factor of $\sqrt{\log(n/\delta)}$ (after a small tweak to these bounds that we omit for simplicity), and it can be obtained from our general bound of Theorem 6 with respect to the quantity $C_1$ (see Section 3.2). Before establishing (8), we note that it is stronger than (7). This follows from a simple exercise, considering cases for $\lambda$ and utilizing the fact that, under the assumptions stated above, $\|\lambda A \tanh((\beta^*/\lambda) x + \theta^* x)\|_2 \le O(\lambda \sqrt{n})$, while $\|x\|_2 \ge \Omega(\sqrt{n})$.

We proceed with sketching the proof of (8). Let $\phi(\theta, \beta)$ be the negative pseudo-log-likelihood of the pair $(\theta, \beta)$, namely minus the log of the quantity in (6). This is a convex function whose minimum is attained at $(\hat\theta, \hat\beta)$, and our goal is to show that $(\theta^*, \beta^*)$ lies in proximity to this minimum. To show this, it suffices to prove that the gradient of $\phi$ at $(\theta^*, \beta^*)$ is small while the function is strongly convex in its neighborhood. For a more rigorous proof, we write $\phi(\hat\theta, \hat\beta)$ using a Taylor expansion around $(\theta^*, \beta^*)$. Denoting $v = (v_\theta, v_\beta) = (\hat\theta - \theta^*, \hat\beta - \beta^*)$, we get

$$\phi(\hat\theta, \hat\beta) = \phi(\theta^*, \beta^*) + v^\top \nabla \phi(\theta^*, \beta^*) + \tfrac{1}{2} v^\top \nabla^2 \phi(\theta', \beta')\, v,$$

for some $(\theta', \beta')$ on the segment connecting $(\theta^*, \beta^*)$ and $(\hat\theta, \hat\beta)$. Since $(\hat\theta, \hat\beta)$ is the minimizer of the MPLE, one has $\phi(\hat\theta, \hat\beta) \le \phi(\theta^*, \beta^*)$, which implies that

$$\tfrac{1}{2} v^\top \nabla^2 \phi(\theta', \beta')\, v \le -v^\top \nabla \phi(\theta^*, \beta^*) \le |v^\top \nabla \phi(\theta^*, \beta^*)|. \qquad (9)$$

Using concentration inequalities from (Dagan et al., 2020), we can show that with probability $1 - \delta$ (with respect to the randomness of $y_1, \ldots, y_n$, which are implicit arguments of $\phi$), for any $u \in \mathbb{R}^2$ (with $(\theta', \beta')$ as above),

$$\frac{|u^\top \nabla \phi(\theta^*, \beta^*)|}{u^\top \nabla^2 \phi(\theta', \beta')\, u} \lesssim \frac{\sqrt{\log(n/\delta)}}{|u_\beta| \|A\|_F + \|u_\theta x + u_\beta A y\|_2}. \qquad (10)$$

After substituting $u = v$, it follows from (9) that the left-hand side of (10) is lower bounded by $1/2$. We derive that $\sqrt{\log(n/\delta)} \gtrsim |v_\beta| \|A\|_F + \|v_\theta x + v_\beta A y\|_2$. Dividing by $|v_\theta|$ and writing $\lambda = -v_\beta / v_\theta$, we have

$$|\hat\theta - \theta^*| = |v_\theta| \lesssim \frac{\sqrt{\log(n/\delta)}}{\lambda \|A\|_F + \|x - \lambda A y\|_2} \le \sup_{\lambda \in \mathbb{R}} \frac{\sqrt{\log(n/\delta)}}{\lambda \|A\|_F + \|x - \lambda A y\|_2}. \qquad (11)$$

At this point, we have bounded the rate by the solution to an optimization problem. However, notice that the right-hand side contains $y$, which is a random variable. We would like to show that the whole expression is bounded by a non-random quantity, and in particular by (8). This statement requires new insights and, as a result, a significant part of the proof is devoted to it.
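As an aside, the right-hand side of (8) is a deterministic function of $x$, $A$, $\theta^*$, $\beta^*$, so it can be evaluated numerically; below is a small sketch (our own illustration, approximating the supremum over $\lambda$ on a finite grid):

```python
import numpy as np

def rate_bound_rhs(x, A, theta_star, beta_star, delta=0.05):
    """Grid approximation of the RHS of (8):
    sup_lambda sqrt(log(n/delta)) /
      (lambda * ||A||_F + ||x - lambda * A @ tanh((beta*/lambda) x + theta* x)||_2)."""
    n = len(x)
    a_fro = np.linalg.norm(A, "fro")
    half = np.linspace(1e-3, 10.0, 1000)
    best = 0.0
    for lam in np.concatenate([-half, half]):   # crude grid over lambda != 0
        t = np.tanh((beta_star / lam) * x + theta_star * x)
        denom = abs(lam) * a_fro + np.linalg.norm(x - lam * (A @ t))
        best = max(best, np.sqrt(np.log(n / delta)) / denom)
    return best
```

On synthetic instances this lets one see directly how the bound degrades as $\|A\|_F$ shrinks, mirroring the discussion after Theorem 1.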
Here, we first sketch the main idea and then give a more technical explanation. We would like to bound the optimization problem in (11) by that in (8), which corresponds to showing

$$\lambda \|A\|_F + \|x - \lambda A y\|_2 \gtrsim \|x - \lambda A \tanh((\beta^*/\lambda) x + \theta^* x)\|_2. \qquad (12)$$

We start by describing a rough and informal intuition for proving (12), and later proceed with a more formal derivation. We use an approach from statistical physics called the mean-field approximation: we can substitute each $y_i$ with $\mathbb{E}[y_i \mid x, y_{-i}] = \tanh(\beta^* \sum_j A_{ij} y_j + \theta^* x_i)$. Applying this substitution for all $i$, we obtain

$$y \approx \tanh(\beta^* A y + \theta^* x). \qquad (13)$$

We assume towards contradiction that (12) does not hold; in this case we make the (false) substitution $\|x - \lambda A y\|_2 \approx 0$, which implies that $Ay \approx x/\lambda$. Substituting this into the right-hand side of (13), we obtain $y \approx \tanh((\beta^*/\lambda) x + \theta^* x)$. Making this substitution in $\|x - \lambda A y\|_2$, we obtain (12).

Now, we argue more formally to derive (12). Using the triangle inequality, we get

$$\|x - \lambda A \tanh((\beta^*/\lambda) x + \theta^* x)\|_2 \le \|x - \lambda A y\|_2 + \|\lambda A y - \lambda A \tanh(\beta^* A y + \theta^* x)\|_2 + \|\lambda A \tanh(\beta^* A y + \theta^* x) - \lambda A \tanh((\beta^*/\lambda) x + \theta^* x)\|_2. \qquad (14)$$

We would like to bound each of the three terms on the right-hand side by a constant times the left-hand side of (12). For the first term, this is trivial. Further, we can show that the third term on the right-hand side of (14) is bounded by the first term, using the Lipschitzness of $\tanh$:

$$\|\lambda A \tanh(\beta^* A y + \theta^* x) - \lambda A \tanh((\beta^*/\lambda) x + \theta^* x)\|_2 \le \|\lambda A \beta^* A y - A \beta^* x\|_2 \le \beta^* \|A\|_2 \|x - \lambda A y\|_2 \le O(\|x - \lambda A y\|_2),$$

where $\beta^* \|A\|_2 \le O(1)$ by the assumptions of this paper. As for the second term, it represents the error of the mean-field approximation for $y$, which corresponds to the substitution in (13). To bound this error term, we use the method of exchangeable pairs developed in (Chatterjee, 2005), which provides a strong and general concentration inequality for non-independent random variables. We can show that, with high probability, this term is $O(\lambda \sqrt{\beta^*} \|A\|_2) \le O(\lambda \|A\|_2) \le O(\lambda \|A\|_F)$, since $B = O(1)$. Combining the above bounds, we derive (12), as required.

3.2. Definitions of the terms in Theorem 6

We now sketch the proof of our general upper bound, Theorem 6. We first define the notion of covering numbers and the quantities $C_1$ and $C_2$ in the theorem statement.

Definition 1. Given a metric space $(\Omega, d)$ and $\epsilon > 0$, a subset $\Omega' \subseteq \Omega$ is an $\epsilon$-net for $\Omega$ if for any $\omega \in \Omega$ there exists $\omega' \in \Omega'$ such that $d(\omega, \omega') \le \epsilon$. The covering number at scale $\epsilon$, $N(\Omega, \epsilon)$, is the smallest size of an $\epsilon$-net. For a function class $\mathcal{F}$ and a collection of feature vectors $X = (x_1, \ldots, x_n)$, we denote by $N(\mathcal{F}, X, \epsilon)$ the covering number at scale $\epsilon$ of $\mathcal{F}$ with respect to the distance $d(f, g) = \sqrt{\|f(X) - g(X)\|_2^2 / n}$, where we use the convenient notation $f(X) = (f(x_1), \ldots, f(x_n))$, and similarly for $g(X)$.

Next, we define the quantities $C_1$ and $C_2$. We start by defining the following as a function of $\beta, \beta' \in \mathbb{R}$ and $h, h' \in \mathbb{R}^n$:

$$\psi(h, \beta; h', \beta') = (\beta - \beta')^2 \|A\|_F^2 + \left\| h - h' - (\beta - \beta') A \tanh\!\left( \frac{\beta'}{\beta - \beta'} (h - h') + h' \right) \right\|_2^2, \qquad (15)$$

where $\tanh((z_1, \ldots, z_n)) = (\tanh(z_1), \ldots, \tanh(z_n))$. Now, for some universal constant $c > 0$, we define

$$C_1(\mathcal{F}, X, \theta^*, \beta^*) := \sup_{(\theta, \beta) \in \Theta \times [-B, B]} \min\left\{ \frac{U_f}{\psi(f_\theta(X), \beta; f_{\theta^*}(X), \beta^*)},\ c \cdot U_f \right\},$$

where $U_f := \|f_\theta(X) - f_{\theta^*}(X)\|_2^2 / n$. The quantity $C_2$ is defined analogously, replacing $\|f_\theta(X) - f_{\theta^*}(X)\|_2^2 / n$ with $(\beta - \beta^*)^2$. Conveniently, we can use the following upper bound on $C_1$:

$$\bar{C}_1(\mathcal{F}, X, \theta^*, \beta^*) := \sup_{(\theta, \beta) \in \Theta \times [-B, B]} \frac{\|f_\theta(X) - f_{\theta^*}(X)\|_2^2 / n}{\psi(f_\theta(X), \beta; f_{\theta^*}(X), \beta^*)}. \qquad (16)$$
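Both $\psi$ and the ratio inside the supremum in (16) are directly computable for any candidate pair; the sketch below (our own illustration, following the formulas as reconstructed above, for $\beta \neq \beta^*$) makes this explicit:

```python
import numpy as np

def psi(h, beta, h_star, beta_star, A):
    """The quantity psi from (15); assumes beta != beta_star
    (the limiting case beta -> beta_star is handled separately)."""
    db = beta - beta_star
    inner = (beta_star / db) * (h - h_star) + h_star
    v = h - h_star - db * (A @ np.tanh(inner))
    return db**2 * np.linalg.norm(A, "fro")**2 + v @ v

def c1_bar_ratio(h, beta, h_star, beta_star, A):
    """One candidate value of the ratio inside the sup in (16),
    with U_f = ||h - h_star||_2^2 / n for h = f_theta(X)."""
    n = len(h)
    u_f = np.sum((h - h_star)**2) / n
    return u_f / psi(h, beta, h_star, beta_star, A)
```

Taking the maximum of `c1_bar_ratio` over a grid of candidate pairs gives a numerical approximation of $\bar{C}_1$ for a concrete instance.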
At this point, we can explain how the rate in (8) for $d = 1$ is derived from the bound of Theorem 6. In this case, $(x_1, \ldots, x_n)$ is simply a vector $x \in \mathbb{R}^n$ and $f_\theta(x_i) = \theta x_i$. Substituting $C_1 \le \bar{C}_1$ into (5), substituting $f_\theta(x_i) = \theta x_i$, and substituting $\lambda = (\beta - \beta^*)/(\theta - \theta^*)$, (8) follows.

3.3. Sketch of the upper bound in Theorem 6

Here, we sketch the proof of the upper bound in Theorem 6, albeit a weaker version in which $C_1$ is replaced by its upper bound $\bar{C}_1$ defined in (16). In particular, we sketch that, with probability $1 - \delta$,

$$\frac{1}{n} \|f_{\hat\theta}(X) - f_{\theta^*}(X)\|_2^2 \lesssim \bar{C}_1(\mathcal{F}, X, \beta^*, \theta^*) \cdot \inf_{\epsilon \ge 0} \big( \log(n/\delta) + \epsilon n + \log N(\mathcal{F}, X, \epsilon) \big). \qquad (17)$$

It is possible to prove that $\bar{C}_1 \le O(1/\|A\|_F^2)$, similarly to the corresponding argument in Section 3.1, and we focus below on proving (17). Notice that in the definitions of $C_1$ and $\bar{C}_1$ we do not need the set $\mathcal{F}$ itself, but only the vectors $f_\theta(X)$ for every $\theta$ in the class. Hence, if we define the set $H = \{f_\theta(X) : f_\theta \in \mathcal{F}\}$, we immediately observe that $C_1$ is in fact a function of $H$. In this setting, we can similarly define $h^* = f_{\theta^*}(X)$ and $\hat{h} = f_{\hat\theta}(X)$, and define the covering numbers $N(H, \epsilon)$ with respect to the distance $d(h, h') = \sqrt{\|h - h'\|_2^2 / n}$. In this language, (17) translates to

$$\frac{1}{n} \|\hat{h} - h^*\|_2^2 \lesssim \bar{C}_1(H, h^*, \beta^*) \cdot \inf_{\epsilon \ge 0} \big( \log(n/\delta) + \epsilon n + \log N(H, \epsilon) \big). \qquad (18)$$

In the remainder of the proof, we focus on proving (18), dividing the argument into multiple steps.

Step 1: A single-dimensional $H$. In this case, $H$ lies in a one-dimensional affine subspace of $\mathbb{R}^n$; namely, there exists $v \in \mathbb{R}^n$ such that $H = \{h^* + tv : t \in T \subseteq \mathbb{R}\}$. This is clearly reminiscent of the one-dimensional function class discussed in Section 3.1. Hence, using the exact same approach and the calculation of Section 3.2, we can prove that with probability $1 - \delta$,

$$\frac{1}{n} \|\hat{h} - h^*\|_2^2 \lesssim \bar{C}_1(H, h^*, \beta^*) \log(n/\delta).$$

Step 2: A union of single-dimensional classes. Now suppose that we have a finite set of directions (unit vectors) $v_1, \ldots, v_N$, and denote $H_i = \{h^* + t v_i : h^* + t v_i \in H\}$. In other words, $H_i$ is the restriction of $H$ to the line passing through $h^*$ with direction $v_i$. Suppose we run the MPLE on each direction, producing an output $\hat{h}_i$ for each. The calculations of Step 1 imply that for each $i \in [N]$, with probability $1 - \delta'$,

$$\frac{1}{n} \|\hat{h}_i - h^*\|_2^2 \lesssim \bar{C}_1(H, h^*, \beta^*) \log(n/\delta').$$

With a simple union bound over these $N$ events, we can set $\delta' = \delta/N$ and obtain that, with probability $1 - \delta$, for all $i \in [N]$,

$$\frac{1}{n} \|\hat{h}_i - h^*\|_2^2 \lesssim \bar{C}_1(H, h^*, \beta^*) \big( \log(n/\delta) + \log N \big). \qquad (19)$$

This essentially means that if we run the MPLE on the original set $H$ and its output happens to lie in any of the $H_i$'s, it will lie close to the optimal point $h^*$. Since we do not know in which direction the MPLE will lie, we have to establish a statement like (19) for all directions in $H$. The problem is that there are usually infinitely many directions, so the union bound approach does not automatically work. However, we can approximate the set of directions by a finite subset of directions that forms an $\epsilon$-net. Since any point $h \in H$ defines a direction $h - h^*$, we can take an $\epsilon$-net $U$ with respect to $H$, which has size $N = N(H, \epsilon)$, corresponding to the covering number defined in Definition 1 of Section 3.2. Due to the Lipschitzness of the optimization target, one can prove that the MPLE over $U$ is close to the MPLE over $H$. By selecting $\epsilon$ appropriately and substituting $N = N(H, \epsilon)$ in (19), we derive (18).

4. Experiments

When network information about dependencies between samples is available, it can be used to significantly boost the performance of supervised learning approaches. We demonstrate the improvements that MPLE-β obtains by including the dependency structure, compared to a baseline that assumes the data is i.i.d.; we call this baseline MPLE-0 (i.e. setting β = 0), which is tantamount to not having an underlying influence network between the nodes. We observe that MPLE-β consistently outperforms MPLE-0 by a significant margin.

Datasets.
We utilize three public citation datasets: Cora, Citeseer and Pubmed (Yang et al., 2016). These datasets consist of a network where each node corresponds to a publication and the edges correspond to citation links. Each node contains a bag-of-words representation of the publication and a corresponding label indicating the area of the publication. The adjacency information $A_{ij}$ between two nodes is given as a real value between 0 and 1 in all three of these datasets. Table 1 gives the statistics of the datasets.

Table 1. Datasets: Cora and Citeseer have probability vectors as features; Pubmed has TF-IDF frequencies as features.

DATASET   | CLASSES | NODES | EDGES | FEATURES
CORA      | 7       | 2708  | 5429  | 1433
CITESEER  | 6       | 3327  | 4732  | 3703
PUBMED    | 3       | 19717 | 44338 | 500

[Figure 1. From left to right: plots of the accuracy of MPLE-β (blue) vs MPLE-0 (orange) for Cora, Citeseer and Pubmed, respectively, as the training data size is gradually increased while maintaining the class probabilities.]

Experimental Setup. The datasets we use are common benchmarks for semi-supervised and fully-supervised learning on graph-structured data. The state of the art on many of these datasets is based on graph neural network (GNN) approaches (Chen et al., 2020). The setups considered in prior literature on these datasets differ from ours in the following sense: these works consider the transductive setting, that is, they assume access to the adjacency matrix of the entire graph as well as the features of the entire dataset (including those in the test set) at train time. In contrast, we work in the inductive setting, where we do not assume access to any information about the test set. However, at test time, our hypothesis uses the labels in the validation set (not the features). We perform three different experiments on each dataset, measuring prediction accuracy on the test labels. We run each experiment with 10 fixed random seeds and report the average and standard deviation.

1. Sparse data: Following the semi-supervised setup of (Kipf & Welling, 2016; Feng et al., 2020) and others, we compare the performance of MPLE-0 and MPLE-β over a public split which includes only 20 nodes per class for training, 500 nodes for validation and 1000 nodes for testing.

2. Increasing training data: We compare the gap in performance of the two methods as the training data is gradually increased from the semi-supervised setting towards the fully-supervised setting.

3. Fully-supervised: We consider the fully-supervised setup from (Pei et al., 2020). In this setup, we consider 10 random splits of the entire dataset. Each split maintains the class distribution by splitting the set of nodes of each class into 60% (train), 20% (validation) and 20% (test). For this experiment, we compare against an inductive variant of GCNII that we denote GCNII-In: we disable access to the test set features during training in order to have a fair comparison with our inductive setting.

Model Details. Since our classification task is multi-class, we extend the MPLE-β algorithm for Ising models to its natural Potts model generalization. With $K$ classes, the probability that label $y_i = k'$, conditioned on the other data and labels, is computed as follows:

$$\mathbb{P}_{\theta,\beta}[y_i = k' \mid x, y_{-i}] = \frac{\exp\big( f_\theta(x_i)_{k'} + \beta \sum_{j=1}^n A_{ij} \mathbb{1}[y_j = k'] \big)}{\sum_{k=1}^K \exp\big( f_\theta(x_i)_k + \beta \sum_{j=1}^n A_{ij} \mathbb{1}[y_j = k] \big)}.$$

Note that this extension is a strict generalization of the model used in our theory (which only deals with binary classification).
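As a minimal sketch of this conditional (our own illustration, not the experiment code), the per-node Potts probabilities can be computed in a vectorized way:

```python
import numpy as np

def potts_conditional(logits_i, A_i, y_onehot, beta):
    """P[y_i = k | x, y_{-i}] for the Potts generalization above.
    logits_i: f_theta(x_i) in R^K;  A_i: i-th row of A (with A_ii = 0);
    y_onehot: (n, K) one-hot encoding of the current labels of all nodes."""
    scores = logits_i + beta * (A_i @ y_onehot)  # neighbor "votes" per class
    scores -= scores.max()                       # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()
```

Summing $-\log$ of these conditionals over all nodes yields the MPLE-β objective used in the experiments; MPLE-0 is the special case β = 0.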
Even in this more general setting, we observe significant empirical benefits, which attests to the applicability of our approach beyond the settings covered by our theory. Using this conditional, we compute the MPLE-β objective. For both MPLE-0 and MPLE-β, our underlying model $f_\theta : \mathbb{R}^{\#\mathrm{features}} \to \mathbb{R}^{\#\mathrm{classes}}$ is a 2-layer neural network with 32 units in the hidden layer and ReLU activations. The only difference between the two models is the use of β. For comparison with graph neural networks (GNNs), we use the GCNII model (Chen et al., 2020), a state-of-the-art GNN, with depth 64 and hidden layer size 64. We run our code on a GPU and use Adam to train all our models. We use the tuned hyper-parameters for GCNII; for our own algorithms, however, we do not perform a hyper-parameter search but use the parameters from prior work (Feng et al., 2020).

Results. On the sparse-data experiment, for Cora, MPLE-β gives an accuracy of 65.8 ± 0.09% vs 60 ± 0.4% for MPLE-0. For Citeseer, MPLE-β gets 60.9 ± 0.7% vs 60.2 ± 0.3% for MPLE-0. For Pubmed, both approaches get 73.3 ± 0.2%. As we increase the training data size, as shown in Figure 1, our gains also tend to increase. Finally, in the fully-supervised setting we again outperform MPLE-0 and GCNII-In (Table 2). On Pubmed, our gains are smaller, as the TF-IDF feature vector already implicitly encodes some network information from the neighbors. Moreover, MPLE-β runs much faster than any of the GNN approaches and is simpler, adding only the low overhead of a scalar parameter to any given model, while remaining competitive in performance. However, it should be noted that we do not compare performance in the transductive setting, in which GCNII was probably intended to run. Finally, our experiments are based on an approach with provable end-to-end guarantees, in contrast with the GNN approaches.

Table 2. Accuracy comparison between MPLE-0, MPLE-β and GCNII-In for the fully-supervised experiment.

DATASET   | MPLE-0     | MPLE-β     | GCNII-IN
CORA      | 74.5 ± 1.8 | 85.3 ± 1.7 | 85.3 ± 1.3
CITESEER  | 72.3 ± 1.7 | 76.3 ± 1.0 | 68.6 ± 0.3
PUBMED    | 87.3 ± 0.2 | 89.0 ± 0.2 | 83.3 ± 0.6

References

Bartlett, P., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.

Berti, P., Crimaldi, I., Pratelli, L., and Rigo, P. Rate of convergence of predictive distributions for dependent data. Bernoulli, 15(4):1351–1367, 2009.

Bertrand, M., Luttmer, E. F., and Mullainathan, S. Network effects and welfare cultures. The Quarterly Journal of Economics, 115(3):1019–1055, 2000.

Besag, J. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):192–225, 1974.

Bhattacharya, B. B. and Mukherjee, S. Inference in Ising models. Bernoulli, 24(1):493–525, 2018.

Bramoullé, Y., Djebbari, H., and Fortin, B. Identification of peer effects through social networks. Journal of Econometrics, 150(1):41–55, 2009.

Bresler, G. and Nagaraj, D. Optimal single sample tests for structured versus unstructured network data. arXiv preprint arXiv:1802.06186, 2018.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Citeseer, 2005.

Chatterjee, S. Estimation in spin glasses: A first step. The Annals of Statistics, 35(5):1931–1946, 2007.

Chen, J. Y., Valiant, G., and Valiant, P. How bad is worst-case data if you know where it comes from? arXiv, abs/1911.03605, 2019.

Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and deep graph convolutional networks.
In International Conference on Machine Learning, pp. 1725–1735. PMLR, 2020.

Christakis, N. A. and Fowler, J. H. Social contagion theory: examining dynamic social networks and human behavior. Statistics in Medicine, 32(4):556–577, 2013.

Dagan, Y., Daskalakis, C., Dikkala, N., and Jayanti, S. Learning from weakly dependent data under Dobrushin's condition. In Conference on Learning Theory, pp. 914–928, 2019.

Dagan, Y., Daskalakis, C., Dikkala, N., and Kandiros, A. V. Learning Ising models from one or several samples. arXiv preprint arXiv:2004.09370, 2020.

Daskalakis, C., Mossel, E., and Roch, S. Evolutionary trees and the Ising model on the Bethe lattice: A proof of Steel's conjecture. Probability Theory and Related Fields, 149(1):149–189, 2011.

Daskalakis, C., Dikkala, N., and Kamath, G. Concentration of multilinear functions of the Ising model with applications to network data. In Advances in Neural Information Processing Systems, pp. 12–23, 2017.

Daskalakis, C., Dikkala, N., and Panageas, I. Regression from dependent observations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 881–889, 2019.

Duflo, E. and Saez, E. The role of information and social interactions in retirement plan decisions: Evidence from a randomized experiment. The Quarterly Journal of Economics, 118(3):815–842, 2003.

Ellison, G. Learning, local interaction, and coordination. Econometrica, 61(5):1047–1071, 1993.

Felsenstein, J. Inferring Phylogenies. Sinauer Associates, Sunderland, 2004.

Feng, W., Zhang, J., Dong, Y., Han, Y., Luan, H., Xu, Q., Yang, Q., Kharlamov, E., and Tang, J. Graph random neural networks for semi-supervised learning on graphs. Advances in Neural Information Processing Systems, 33, 2020.

Gamarnik, D. Extension of the PAC framework to finite and countable Markov chains. IEEE Transactions on Information Theory, 49(1):338–345, 2003.

Geman, S. and Graffigne, C. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, pp. 1496–1517. American Mathematical Society, 1986.

Ghosal, P. and Mukherjee, S. Joint estimation of parameters in Ising model. Annals of Statistics, 2018.

Glaeser, E. L., Sacerdote, B., and Scheinkman, J. A. Crime and social interactions. The Quarterly Journal of Economics, 111(2):507–548, 1996.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. ICLR, 2016.

Kuznetsov, V. and Mohri, M. Learning theory and algorithms for forecasting non-stationary time series. In Advances in Neural Information Processing Systems, pp. 541–549, 2015.

London, B., Huang, B., Taskar, B., and Getoor, L. Collective stability in structured prediction: Generalization from one example. In International Conference on Machine Learning, pp. 828–836, 2013.

London, B., Huang, B., and Getoor, L. Stability and generalization in structured prediction. The Journal of Machine Learning Research, 17(1):7808–7859, 2016.

Manski, C. F. Identification of endogenous social effects: The reflection problem. The Review of Economic Studies, 60(3):531–542, 1993.

McDonald, D. J. and Shalizi, C. R. Rademacher complexity of stationary sequences. arXiv preprint arXiv:1106.0730, 2017.

Mohri, M. and Rostamizadeh, A. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems, pp. 1097–1104, 2009.

Mohri, M. and Rostamizadeh, A. Stability bounds for stationary φ-mixing and β-mixing processes.
Journal of Machine Learning Research, 11(Feb):789–814, 2010.

Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geom-GCN: Geometric graph convolutional networks. ICLR, 2020.

Pestov, V. Predictive PAC learnability: A paradigm for learning from exchangeable input data. In Granular Computing (GrC), 2010 IEEE International Conference on, pp. 387–391. IEEE, 2010.

Sacerdote, B. Peer effects with random assignment: Results for Dartmouth roommates. The Quarterly Journal of Economics, 116(2):681–704, 2001.

Shalizi, C. and Kontorovich, A. Predictive PAC learning and process decompositions. In Advances in Neural Information Processing Systems, 2013.

Sly, A. and Sun, N. Counting in two-spin models on d-regular graphs. The Annals of Probability, 42(6):2383–2416, 2014.

Trogdon, J. G., Nonnemaker, J., and Pais, J. Peer effects in adolescent overweight. Journal of Health Economics, 27(5):1388–1399, 2008.

Yang, Z., Cohen, W., and Salakhutdinov, R. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pp. 40–48. PMLR, 2016.

Yu, B. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pp. 94–116, 1994.