# logistic_regression_under_network_dependence__d38bc214.pdf Journal of Machine Learning Research 25 (2024) 1-62 Submitted 9/22; Revised 4/24; Published 5/24 High Dimensional Logistic Regression Under Network Dependence Somabha Mukherjee somabha@nus.edu.sg Department of Statistics and Data Science National University of Singapore, Singapore Ziang Niu ziangniu@wharton.upenn.edu Department of Statistics and Data Science University of Pennsylvania, Philadelphia, USA Sagnik Halder shalder@ufl.edu Department of Statistics University of Florida, Gainesville, USA Bhaswar B. Bhattacharya bhaswar@wharton.upenn.edu Department of Statistics and Data Science University of Pennsylvania, Philadelphia, USA George Michailidis gmichail@ufl.edu Department of Statistics and Data Science University of California, Los Angeles, Los Angeles, USA Editor: Ji Zhu Logistic regression is a key method for modeling the probability of a binary outcome based on a collection of covariates. However, the classical formulation of logistic regression relies on the independent sampling assumption, which is often violated when the outcomes interact through an underlying network structure, such as over a temporal/spatial domain or on a social network. This necessitates the development of models that can simultaneously handle both the network peer-effect (arising from neighborhood interactions) and the effect of (possibly) high-dimensional covariates. In this paper, we develop a framework for incorporating such dependencies in a high-dimensional logistic regression model by introducing a quadratic interaction term, as in the Ising model, designed to capture the pairwise interactions from the underlying network. The resulting model can also be viewed as an Ising model, where the node-dependent external fields linearly encode the high-dimensional covariates. We propose a penalized maximum pseudo-likelihood method for estimating the network peer-effect and the effect of the covariates (the regression coefficients), which, in addition to handling the high-dimensionality of the parameters, conveniently avoids the computational intractability of the maximum likelihood approach. Under various standard regularity conditions, we show that the corresponding estimate attains the classical highdimensional rate of consistency. In particular, our results imply that even under network dependence it is possible to consistently estimate the model parameters at the same rate as in classical (independent) logistic regression, when the true parameter is sparse and the underlying network is not too dense. Consequently, we derive the rates of consistency of our proposed estimator for various natural graph ensembles, such as bounded degree graphs, sparse Erd os-R enyi random graphs, and stochastic block models. We also develop c 2024 Somabha Mukherjee, Ziang Niu, Sagnik Halder, Bhaswar B. Bhattacharya, and George Michailidis. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1040.html. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis an efficient algorithm for computing the estimates and validate our theoretical results in numerical experiments. An application to selecting genes in clustering spatial transcriptomics data is also discussed. Keywords: High-dimensional inference, Ising models, logistic regression, Markov random fields, network data, penalized regression, pseudo-likelihood, random graphs. 1. 
Introduction Logistic regression (Hosmer et al., 2013; Mc Cullagh and Nelder, 1989; Nelder and Wedderburn, 1972) is a very popular and widely used method for modeling the probability of a binary response based on multiple features/predictor variables. It is a mainstay of modern multivariate statistics that has found widespread applications in various fields, including machine learning, biological and medical sciences, economics, marketing and finance industries, and social sciences. For example, in machine learning it is regularly used for image classification and in the medical sciences it is often used to predict the risk of developing a particular disease based on the patients observed characteristics, among others. To describe the model formally, denote the vector of predictor variables (covariates) by Z1, . . . , ZN Rd and the independent response variables by X1, . . . , XN { 1, 1}. Then, the logistic regression model for the probability of a positive outcome conditional on the covariates is given by P(Xi = 1|Zi) = eθ Zi eθ Zi + e θ Zi , for 1 i N, where θ = (θ1, θ2, . . . , θd) Rd is the vector of regression coefficients.1 Using the independence of the response variables, the joint distribution of X := (X1, . . . , XN) given Z := (Z1, . . . , ZN) Rd N can be written as: eθ Zi + e θ Zi = 1 ZN(θ, Z) exp i=1 Xi(θ Zi) where ZN(θ, Z) = QN i=1 1 eθ Zi+e θ Zi is the normalizing constant. It is well-known from the classical theory of generalized linear models that the parameter θ in (1) can be estimated at rate O(1/ N) for fixed dimensions, using the maximum likelihood (ML) method (see, for example, Lehmann and Romano (2005); Mc Cullagh and Nelder (1989); Vaart (1998)). The classical framework of logistic regression described above is, however, inadequate if the independence assumption on the response variables is violated, which is often the case if the observations are collected over a temporal or spatial domain or on a social network. The recent accumulation of dependent network data in modern applications has accentuated the need to develop realistic and mathematically tractable methods for modeling high-dimensional distributions with underlying network dependencies (network peer-effect). Towards this, the Ising model, initially proposed in statistical physics to model ferromagnetism (Ising, 1925), has turned out to be a useful primitive for modeling such datasets, which arise naturally in spatial statistics, social networks, epidemic modeling, computer vision, neural networks, and computational biology, among others (see Banerjee et al. (2015); 1. Note that we are parameterizing the outcomes as { 1, 1} instead of the more standard {0, 1} to integrate this within the framework of the Ising model (defined in (2)), where the { 1, 1} notation is more common. Logistic Regression Under Network Dependence Geman and Graffigne (1986); Green and Richardson (2002); Hopfield (1982); Montanari and Saberi (2010) and references therein). This can be viewed as a discrete exponential family with binary outcomes, wherein the sufficient statistic is a quadratic form designed to capture correlations arising from pairwise interactions. Formally, given an interaction matrix A := ((aij))1 i,j N and binary vector X = (X1, X2, , XN) CN = { 1, 1}N, the Ising model with parameters β and h encodes the dependence among the coordinates of X as follows: Pβ,h(X) = 1 2NZN(β, h) exp 1 i 0, there exists M := M(δ, β) > 0 such that P( N|ˆβN(X) β| M) > 1 δ, for all N large enough. 
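To fix ideas, the two ingredients that the proposed model combines can be evaluated in a few lines of numpy: the independent logistic probability in (1) and the unnormalized Ising weight in (2). This is only an illustrative sketch; in particular, the quadratic term below assumes the common parameterization β∑_{i<j} a_ij X_i X_j + h∑_i X_i for the exponent in (2), and the function names are not part of any released code.

```python
import numpy as np

def logistic_prob(theta, Z):
    """P(X_i = +1 | Z_i) under the independent logistic model (1), with outcomes coded in {-1, +1}."""
    eta = Z @ theta                                   # linear predictors theta' Z_i, shape (N,)
    return np.exp(eta) / (np.exp(eta) + np.exp(-eta))

def ising_log_weight(beta, h, x, A):
    """Unnormalized log-probability of x in {-1, +1}^N under an Ising model with interaction matrix A.

    Assumes the convention beta * sum_{i<j} a_ij x_i x_j + h * sum_i x_i, i.e.
    (beta / 2) * x' A x + h * sum(x) for a symmetric A with zero diagonal."""
    return 0.5 * beta * x @ (A @ x) + h * x.sum()
```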
In other words, we assume that the parameter vector θ is s-sparse, that is, $\|\theta\|_0 := \sum_{i=1}^d \mathbf{1}\{\theta_i \neq 0\} \leq s$, because, despite the fact that many covariates are available, we only expect a few of them to be relevant. Under this assumption, we want to estimate the parameters (β, θ) given a single sample of dependent observations $(X_i, Z_i)_{1 \leq i \leq N}$ from model (3). One of the main reasons this problem is especially delicate is that we only have access to a single (dependent) sample from the underlying model. Consequently, classical results from the M-estimation framework, which require multiple independent samples, are inapplicable. To deal with this dependence (which leads to the intractable normalizing constant mentioned before) and the high-dimensionality of the parameter space, we propose a penalized maximum pseudo-likelihood approach for estimating the parameters. To this end, note that the conditional distribution of $X_i$ given $(X_j)_{j \neq i}$ is:

$$\mathbb{P}(X_i \mid (X_j)_{j \neq i}, Z) = \frac{e^{X_i \theta^\top Z_i + \beta X_i m_i(X)}}{e^{\theta^\top Z_i + \beta m_i(X)} + e^{-\theta^\top Z_i - \beta m_i(X)}}, \tag{4}$$

where, as before, $m_i(X) = \sum_{j=1}^N a_{ij} X_j$ is the local effect at node $i$, for $1 \leq i \leq N$. Therefore, by multiplying (4) over $1 \leq i \leq N$ and taking logarithms, we get the (negative) log-pseudolikelihood (LPL) function

$$L_N(\beta, \theta) = -\frac{1}{N} \sum_{i=1}^N \log \mathbb{P}(X_i \mid (X_j)_{j \neq i}, Z) = -\frac{1}{N} \sum_{i=1}^N \left\{ X_i(\theta^\top Z_i + \beta m_i(X)) - \log \cosh(\theta^\top Z_i + \beta m_i(X)) \right\} + \log 2. \tag{5}$$

To encode the sparsity of the high-dimensional parameters, we propose a penalized maximum pseudo-likelihood (PMPL) estimator of $(\beta, \theta^\top)^\top$, which, given a regularization (tuning) parameter λ, is defined as:

$$(\hat\beta, \hat\theta^\top)^\top := \arg\min_{(\beta, \theta)} \left\{ L_N(\beta, \theta) + \lambda \|\theta\|_1 \right\}, \tag{6}$$

where $\|\theta\|_1 := \sum_{i=1}^d |\theta_i|$. Under various regularity assumptions, we prove that if λ is chosen proportional to $\sqrt{\log d / N}$, then with high probability,

$$\left\| (\hat\beta, \hat\theta^\top)^\top - (\beta, \theta^\top)^\top \right\|_2 \lesssim_s \sqrt{\frac{\log d}{N}}, \tag{7}$$

whenever $d \rightarrow \infty$ such that $d = o(N)$ (Theorem 1). In particular, it follows from our results that for bounded degree networks, the PMPL estimate attains the same rate as in the independent logistic case (1), when $d = o(N)$ and the sparsity is bounded. We also have a more general result that quantifies the dependence on the network sparsity in the rate (7), which allows us to establish consistency of the PMPL estimate beyond bounded degree graphs (Proposition 7). Our results are fundamentally different from existing results on parameter estimation in high-dimensional graphical models based on multiple i.i.d. samples (see Section 1.1 for a review). Here, we only have access to a single sample from the model (3); hence, unlike in the multiple-samples case, one cannot treat the different neighborhoods in the network as independent, which renders classical techniques for proving consistency inapplicable. Consequently, to handle the dependencies among the different neighborhoods in the pseudo-likelihood function we need to use a different (non-classical) set of tools. Specifically, our proofs combine a conditioning technique introduced in Dagan et al. (2021), which transforms a general Ising model into a model satisfying the Dobrushin condition (where the dependence is sufficiently weak), with the concentration inequalities for functions of dependent random variables in the Dobrushin regime, based on the method of exchangeable pairs (Chatterjee, 2016). Next, we study the effect of dependence on estimating the regression parameters θ.
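To make the estimation procedure concrete, the following minimal numpy sketch evaluates the negative log-pseudolikelihood (5) and the penalized objective (6); the function names are illustrative, and the optimization itself is deferred to the algorithm in Section 4.1.

```python
import numpy as np

def neg_log_pseudolikelihood(beta, theta, X, Z, A):
    """Negative LPL L_N(beta, theta) from (5); X in {-1,+1}^N, Z is N x d, A is the interaction matrix."""
    m = A @ X                                   # local effects m_i(X) = sum_j a_ij X_j
    u = Z @ theta + beta * m                    # theta' Z_i + beta * m_i(X)
    return -np.mean(X * u - np.log(np.cosh(u))) + np.log(2.0)

def pmpl_objective(beta, theta, X, Z, A, lam):
    """PMPL objective in (6): L_N(beta, theta) + lambda * ||theta||_1 (beta itself is unpenalized)."""
    return neg_log_pseudolikelihood(beta, theta, X, Z, A) + lam * np.sum(np.abs(theta))
```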
Specifically, we want to understand how the presence of dependence through the underlying network structure effects the rate of estimation of the high-dimensional regression coefficient under sparsity constraints. While there have been several recent attempts to understand the implications of dependence in high-dimensional inference, most of them have focused on Gaussian models. Going beyond Gaussian models, Mukherjee, Mukherjee, and Yuan (2018) and Deb et al. (2020) considered the problem of testing the global null hypothesis (that is, θ = 0) against sparse alternatives in a special case of model (3) (where d = N and the design matrix Z = IN is the identity). However, to the best of our knowledge, the effect of dependence on parameter estimation in Ising models with covariates has not been explored before. In the sequel, we establish that the PMPL estimate for θ in model (3), given a dependence strength β and the sparsity constraint θ 0 = s, attains the classical O( p s log d/N) rate, despite the presence of dependence, in the entire high-dimensional regime (where d can be much larger than N) and also recovers the correct dependence on the sparsity s (see Theorem 2). As a consequence, we establish that the PMPL estimate is O(1/ N)-consistent (up to logarithmic factors) for the model (3) in various natural sparse graph ensembles, such as Erd os-R enyi random graphs and inhomogeneous random graphs that include the popular stochastic block model (Theorem 3 and Corollary 5). We also develop a proximal gradient algorithm for efficiently computing the PMPL estimate and evaluate its performance in numerical experiments for Erd os-R enyi random graphs, inhomogeneous random graph models, such as the stochastic block model and the β-model, and the preferential attachment model (Section 4). Finally, in Section 5, we illustrate the effectiveness of the proposed model in selecting relevant genes for clustering spatial gene expression data. 1.1 Related Work The asymptotic properties of penalized likelihood methods for logistic regression and generalized linear models in high dimensional settings have been extensively studied (see, for example, Bach (2010); Bunea (2008); Kakade et al. (2010); Meier et al. (2008); Salehi et al. (2019); van de Geer (2008); van de Geer et al. (2014) and references therein). These results allow d to be much bigger than N and provide rates of convergence for the high-dimensional parameters under various sparsity constraints. The performance of the ML estimate in the logistic regression model when the dimension d scales proportionally with N has also been studied in a series of recent papers (Cand es and Sur, 2020; Sur and Cand es, 2019; Sur et al., 2019). However, as discussed earlier, ML estimation is both computationally and mathematically intractable in model (3), because of the dependency induced by the underlying network. That we are able to recover rates similar to those in the classical high-dimensional Logistic Regression Under Network Dependence logistic regression, in spite of this underlying dependency, using the PMPL method, is one of the highlights of our results. The problem of estimation and structure learning in graphical models and Markov random fields is another related area of active research. Here, the goal is to estimate the model parameters or learn the underlying graph structure given access to multiple i.i.d. samples from a graphical model. For more on these results refer to Anandkumar et al. 
(2012); Bresler (2015); Bresler and Karzand (2020); Chow and Liu (1968); Guo et al. (2011); Hamilton et al. (2017); Klivans and Meka (2017); Ravikumar et al. (2010); Santhanam and Wainwright (2012); Vuffray et al. (2016) and references therein. In particular, Ravikumar et al. (2010) and Xue et al. (2012) use a regularized pseudo-likelihood approach to learn the structure of the interaction A given multiple i.i.d. samples from an Ising model. In a related direction, Daskalakis et al. (2019a) studied the related problems of identity and independence testing, and Cao et al. (2022); Neykov and Liu (2019) considered problems in graph property testing, given access to multiple samples from an Ising model. All the aforementioned results, however, are in contrast with the present work, where the underlying graph structure is assumed to be known and the goal is to derive rates of estimation for the parameters given a single sample from the model. This is motivated by the applications mentioned above where it is often difficult, if not impossible, to generate many independent samples from the model within a reasonable amount of time. More closely related to the present work are results of Li and Zhang (2010) and Li et al. (2015) on Bayesian methods for variable selection in high-dimensional covariate spaces with an underlying network structure, where an Ising prior is used on the model space for incorporating the structural information. Recently, Kim et al. (2021) developed a variational Bayes procedure using the pseudo-likelihood for estimation based on a single sample in a two-parameter Ising model. Convergence of coordinate ascent algorithms for mean field variational inference in the Ising model has been recently analyzed in Plummer et al. (2020). For continuous response with an underlying network/spatial dependence structure, a related model is the well-known spatial autoregressive (SAR) model and its variants, where likelihood estimation based on conditional distributions have been used as well (see Huang et al. (2019); Lee (2004); Lee et al. (2010); Zhu et al. (2020) and the references therein). 1.2 Notation The following notation will be used throughout the remainder of the paper. For a vector a := (a1, . . . , as) Rs and 0 < p < , a p := (Ps i=1 |ai|p) 1 p denotes its p-th norm and a := max1 i s |as| its maximum norm, respectively. Moreover, a 0 := Ps i=1 1{ai = 0} denote the zero-norm of a, which counts the number of non-zero coordinates of a. For a matrix M := ((Mij))1 i s,1 j t Rs t we define the following norms: M F := q Ps i=1 Pt j=1 M2 ij, M := max1 i s Pt j=1 |Mij| , M 1 := max1 j t Ps i=1 |Mij|, M 2 := σmax(M), where σmax(M) denotes the largest singular value of M. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Note that if M is a square matrix, then M 2 = max{|λmin(M)|, |λmax(M)|}, where λmax(M) and λmin(M) denote the maximum and minimum eigenvalues of M, respectively. For positive sequences {an}n 1 and {bn}n 1, an = O(bn) means an C1bn, an = Ω(bn) means an C2bn, and an = Θ(bn) means C2n an C1bn, for all n large enough and positive constants C1, C2. Similarly, an bn means an = O(bn), an bn means an = Ω(bn). Moreover, subscripts in the above notations, for example , O and Ω denote that the hidden constants may depend on the subscripted parameters . Finally, we say that an = O(bn), if an C1(log n)r1bn and an = Θ(bn), if C2(log n)r2bn an C1(log n)r1bn, for all n large enough and some positive constants C1, C2, r1, r2. 
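For reference, the matrix norms defined above can be computed directly in numpy; the snippet below is purely illustrative.

```python
import numpy as np

M = np.array([[1.0, -2.0], [0.5, 3.0]])
fro_norm = np.linalg.norm(M, 'fro')        # ||M||_F: square root of the sum of squared entries
inf_norm = np.abs(M).sum(axis=1).max()     # ||M||_infinity: maximum absolute row sum
one_norm = np.abs(M).sum(axis=0).max()     # ||M||_1: maximum absolute column sum
two_norm = np.linalg.norm(M, 2)            # ||M||_2: largest singular value of M
```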
1.3 Organization The remainder of the paper is organized as follows. The rates of consistency of the estimates are presented in Section 2. In Section 3, we apply our results to various common network models. The algorithm for computing the estimates and simulation results are presented in Section 4. The proofs of the technical results are given in the Appendix. 2. Rates of Consistency Next, we present our results on rates of convergence of the PMPL estimator. In Section 2.1, we present the rates of convergence of the PMPL estimates (ˆβ, ˆθ ). The rate for estimating the regression parameters is presented in Section 2.2. 2.1 Consistency of the PMPL Estimate We begin by stating the relevant assumptions: Assumption 1 The interaction matrix A in (3) satisfies the following comdition: sup N 1 A < . Assumption 2 The design matrix Z := (Z1, . . . , ZN) satisfies lim inf N λmin Assumption 3 The signal parameters θ and the covariates {Zi}1 i N are uniformly bounded, that is, there exist positive constants Θ and M such that θ < Θ and Zi < M, for all 1 i N. Under the above assumptions we establish the rate of convergence of the PMPL estimate (6) given a single sample of observations from the model (3), when the parameter vector is sparse, that is, (β, θ ) 0 = s. For notational convenience, we henceforth denote the (d + 1)-dimensional vector of parameters by γ := (β, θ ) and the (d + 1)-dimensional vector of PMPL estimates obtained from (6) by ˆγ = (ˆβ, ˆθ ) . Logistic Regression Under Network Dependence Theorem 1 Suppose that Assumptions 1, 2, 3 hold, and lim inf N 1 N A 2 F > 0. Then, there exists a constant δ > 0 such that by choosing λ := δ p log(d + 1)/N in the objective function in (22) we have, ˆγ γ 2 = Os and ˆγ γ 1 = Os with probability 1 o(1), as N and d such that d = o(N). The conditions in Theorem 1 combined aim to strike a balance between the signal from the peer-effects to that from the covariates, to ensure consistent estimation for all β. In particular, a control on A is required to ensure that the peer effects coming from the quadratic dependence term in the probability mass function (3) do not overpower the effect of the signal θ coming from the linear terms θ Zi, thereby hindering joint recovery of the correlation term β and the signal term θ. At the same time, we also require the interaction matrix to be not too sparse, and its entries to be not too small, in order to ensure that the effect of the correlation parameter β is not nullified. This is guaranteed by the condition A 2 F = Ω(N). For example, when A is the scaled adjacency of a graph GN, then Assumption 1 together with the condition A 2 F = Ω(N) implies that GN has bounded maximum degree (see Section 3). In fact, in the proof we keep track of the dependence on A F in the error rate (see Proposition 7 in Section A), which allows us to establish consistency of the PMPL estimate beyond bounded degree graphs (see Section 3.3 for details). Remark 1 It is worth noting that it may be impossible to estimate (β, θ) consistently without any diverging lower bound on A 2 F = Ω(N) or, in other words, if the graph GN is too dense. This phenomenon is observed in the Curie-Weiss Model (where the interaction matrix aij = 1/N, for 1 i = j N) (Comets and Gidas, 1991). In this example, each entry of A is O(1/N), and hence, A 2 F = O(1), and even when d = 1 (and Z1 = . . . = ZN) consistent estimation of the parameters β and θ is impossible (see Theorem 1.13 in Ghosal and Mukherjee (2020)). The proof of Theorem 1 is given in Section A. 
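The conditions of Theorem 1 are easy to inspect numerically for a given interaction matrix and design. The sketch below (with illustrative names) computes the three relevant quantities and, following Remark 1, shows that the Curie-Weiss matrix $a_{ij} = 1/N$ satisfies Assumption 1 but violates the condition $\|A\|_F^2 = \Omega(N)$.

```python
import numpy as np

def check_conditions(A, Z):
    """Numerically inspect the quantities appearing in Assumptions 1-2 and Theorem 1."""
    N = A.shape[0]
    inf_norm = np.abs(A).sum(axis=1).max()           # ||A||_infinity, should stay bounded in N (Assumption 1)
    fro_sq_over_N = (A ** 2).sum() / N               # ||A||_F^2 / N, should stay bounded away from zero
    lam_min = np.linalg.eigvalsh(Z.T @ Z / N).min()  # minimum eigenvalue of Z'Z / N (Assumption 2)
    return inf_norm, fro_sq_over_N, lam_min

# Curie-Weiss example from Remark 1: a_ij = 1/N off the diagonal, so ||A||_F^2 = O(1)
N, d = 500, 5
A_cw = (np.ones((N, N)) - np.eye(N)) / N
Z = np.random.default_rng(0).normal(size=(N, d))
print(check_conditions(A_cw, Z))                     # ||A||_F^2 / N is roughly 1/N, i.e. vanishing
```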
As mentioned before, this is a consequence of a more general result which gives rates of convergence for the PMPL estimate in terms of the A F (Proposition 7). Broadly speaking, the proof involves the following two steps: Concentration of the gradient: The first step in the proof of Theorem 1 is to show that the gradient of the logarithm of the pseudo-likelihood function LN (recall (5)) is concentrated around zero in the ℓ norm. For this step, we use the conditioning trick introduced in Dagan et al. (2021), which reduces a general Ising model to an Ising model in the high-temperature regime, where exponential concentration inequalities for functionals of Ising models are available (Chatterjee, 2016). The details are formalized in Lemma 24. Strong-concavity of the pseudo-likelihood: In the second step, we show that the logarithm of the pseudo-likelihood function is strongly concave with high probability. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis This entails showing that the lowest eigenvalue of the Hessian of LN is bounded away from zero with high probability. Towards this, the minimum eigenvalue condition in Assumption 2, which is standard in the high-dimensional literature (see Loh (2017); Ravikumar et al. (2010)), is crucial. In particular, this condition holds with high probability, if the covariates Z1, . . . , ZN are i.i.d. realizations from a sub-Gaussian distribution on Rd (see Theorem 2.1 in Daskalakis et al. (2019b)). Under Assumption 2 and using lower bounds on the variance of linear projections of X developed in Dagan et al. (2021) and concentration results from Adamczak et al. (2019), we establish the strong-concavity of the pseudo-likelihood in Lemma 9. Remark 2 Note that the rate in (8) suppresses the dependence on the sparsity parameter s in the order term. From the proof of Theorem 1, it will be seen that the dependence is, in general, exponential in s. However, if one replaces Assumption 3 with the stronger assumption that the ℓ2 norms of the parameters and the covariates are bounded, that is, θ 2 < Θ and Zi 2 < M, for all 1 i N, then our proof can be easily modified to recover the standard high-dimensional O( p s log d/N) rate (see Remark 16). In fact, this stronger assumption has been used recently in Kandiros et al. (2021) to derive rates of the pseudo-likelihood estimate under ℓ1 sparsity. Specifically, (Kandiros et al., 2021, Theorem 2) showed that if γ 1 s and the parameters and the covariates are ℓ2-bounded, their estimate γ under Assumption 1 satisfies: with high probability. Note that the dependence on N in the RHS above is worse than the expected 1/ N-rate. Moreover, the ℓ2-boundedness of the covariates is quite restrictive in the high-dimensional setup. On the other hand, this work derives rates under ℓ0 sparsity and a more realistic ℓ -bounded condition (Assumption 3). Under this condition, we are able to derive the correct dependence on N and d (and also on s, if the stronger ℓ2-boundedness is imposed as mentioned above) in the regime where d = o(N). An exponential dependence on d also appears in Daskalakis et al. (2019b) (see footnote in page 4), where the convergence rate of the MPL estimate is derived in the fixed d regime. This rate can be improved to O( p d/N) under the ℓ2boundedness assumption (see, for example, Daskalakis et al. (2020)). Our results show that this can be further improved to O( p log d/N) in the regime of constant sparsity. 
2.2 Estimation of the Regression Coefficients In this section, we consider the problem of estimating the regression coefficients θ, for fixed β. The goal is to understand how network dependence may affect our ability to estimate the high-dimensional regression coefficient under sparsity constraints. Towards this, we study the properties of the following PMPL estimator for the regression coefficients θ: ˆθ := arg min θ {Lβ,N(θ) + λ||θ||1}, (9) Logistic Regression Under Network Dependence where λ is a regularization parameter and (recalling (5)) Lβ,N(θ) = 1 i=1 {Xi(θ Zi + βmi(X)) log cosh(θ Zi + βmi(X))} + log 2. (10) To handle the high-dimensional regime, we need to make the following assumption on the design matrix Z := (Z1, . . . , ZN) . To this end, we define the Rademacher complexity of the {Zi}1 i N as: where {εi}1 i N is a sequence of i.i.d. Rademacher random variables and the expectation in (11) is taken jointly over the randomness of {Zi}1 i N and {εi}1 i N. Assumption 4 Suppose the covariates Z1, Z2, . . . , ZN are drawn i.i.d. from a distribution with mean zero and satisfying the following conditions: (1) There exist positive constants κ1, κ2 such that E η, Z1 2 κ1 and E η, Z1 4 κ2, for all η Rd such that η 2 = 1. (2) RN = O( p (3) There exists a constant C > 0 such that max1 j d 1 N PN i=1 Z2 ij C holds with probability 1. These types of conditions are standard in the high-dimensional statistics literature (see Bickel et al. (2009); Candes and Tao (2007); Meinshausen and Yu (2009); Negahban et al. (2012); Raskutti et al. (2011); van de Geer and B uhlmann (2009) and references therein), which are known to hold for many natural classes of design matrices. For example, if Z1, Z2, . . . , Zn are i.i.d. sub-Gaussian random variables with mean zero, then Wainwright (2019, Exercise 9.8) implies that RN = O( p Remark 3 As mentioned before, when Z1, Z2, . . . , Zn are i.i.d Gaussian with mean zero and covariance matrix Σ, then Assumption 4 (2) holds (by Wainwright (2019, Exercise 9.8)). To understand when Assumption 4 (1) holds, note that for η Rd such that η 2 = 1 we have E η, Z 2 = E(N(0, η Ση)2) = η Ση λmin(Σ) and E η, Z 4 = E(N(0, η Ση)4) = 3(η Ση)4 3λ2 max(Σ), where λmin(Σ) and λmax(Σ) are the minimum and maximum eigenvalues of Σ, respectively. Therefore, Assumption 4 (1) holds, if we assume that there exist positive constants c , c , such that the covariance matrix Σ satisfies c λmin(Σ) λmax(Σ) c . Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Under the above assumptions, we now show in the following theorem that one can consistently estimate the regression parameters of the model (3) at the same rate as the classical (independent) logistic regression model (1). Theorem 2 Fix β R. Suppose the interaction matrix A in (3) satisfies Assumptions 1 and the covariates Z1, Z2, . . . , ZN satisfy Assumption 4. Moreover, assume that there exists a positive constant Θ such that θ 2 Θ. Then, there exists a constant δ > 0 such that by choosing λ := δ p log d/N in the objective function in (9) we have, and ˆθ θ 1 = O with probability 1 o(1), as N, d such that s p log d/N = o(1). The proof of Theorem 2 is given in Section B in the Appendix. We follow the strategy outlined in Wainwright (2019, Chapter 9) for showing rates of consistency for highdimensional generalized linear models. In particular, we show that the pseudo-likelihood loss function satisfies the restricted strong concavity condition (Proposition 19) under Assumption 4. 
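The Rademacher complexity in (11) can be approximated by Monte Carlo. The sketch below assumes the standard form $R_N = \mathbb{E}\,\|\frac{1}{N}\sum_{i=1}^N \varepsilon_i Z_i\|_\infty$, which is consistent with the $\sqrt{\log d/N}$ bound cited from Wainwright (2019, Exercise 9.8); both this assumed form and the helper names are illustrative.

```python
import numpy as np

def rademacher_complexity_mc(Z, n_mc=2000, seed=0):
    """Monte Carlo estimate of R_N = E || (1/N) * sum_i eps_i Z_i ||_infinity (assumed form of (11)).

    For a fixed design Z this averages over the Rademacher signs only; averaging additionally
    over fresh draws of Z gives the joint expectation in (11)."""
    rng = np.random.default_rng(seed)
    N, d = Z.shape
    vals = np.empty(n_mc)
    for b in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=N)    # i.i.d. Rademacher signs
        vals[b] = np.abs(Z.T @ eps / N).max()    # sup-norm of the Rademacher average
    return vals.mean()

# For i.i.d. sub-Gaussian rows this is expected to scale like sqrt(log d / N).
```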
Consequently, we can establish the consistency of the PMPL estimate of the regression parameters in the entire high-dimensional regime (where d can be much larger than N) and also recover the correct dependence on the sparsity s. Remark 4 Note that, unlike in Theorem 1, the Frobenius norm assumption A 2 F = Ω(N), is not required in Theorem 2. In particular, the only assumption on A one needs in Theorem 2 is A < 1 (Assumption 1). For example, when A is the scaled adjacency matrix of a graph GN, the assumption A < 1 is equivalent to the maximum degree of GN being of the same order as the average degree of GN (see (13)). This is expected because when β is known, the parameter θ can be estimated at the classical high-dimensional rate, irrespective of the total edge density of the network, as long as the peer effects coming from the quadratic dependence term do not overshadow effect of the linear term θ Z, which is ensured by the condition A < 1. This condition, in particular, implies that no node of the network has an unduly large effect on the corresponding model, and is satisfied by most Ising models that are commonly studied in the literature. 3. Application to Various Network Structures In this section, we apply Theorem 1 to establish the consistency of the PMPL estimator (22) for various natural network models. To this end, let GN = (V (GN), E(GN)) be a sequence of graphs with V (GN) = [N] := {1, 2, . . . , N} and adjacency matrix A(GN) = ((aij))1 i,j N. We denote by dv the degree of the vertex v V (Gn). To ensure that the model (3) has non-trivial scaling properties, one needs to chose the interaction matrix A as the scaled adjacency matrix of GN. In particular, define AGN := N 2|E(GN)| A(GN) (12) Throughout this section, we consider model (3) with A = AGN as above. We also assume that the number of non-isolated vertices in GN (that is, the number of vertices in GN with Logistic Regression Under Network Dependence degree at least 1) is Ω(N). Note that this implies, |E(GN)| N. Finally, we also assume that the sparsity s = O(1) and consequently, absorb the dependence on s in the O-terms (recall Remark 2). 3.1 Bounded Degree Graphs A sequence of graphs {GN}N 1 is said to have bounded maximum degree if its maximum degree is uniformly bounded, that is, sup N 1 dmax < , where dmax := maxv V (Gn) dv is the maximum degree of GN. Note that if GN has bounded maximum degree and has Ω(N) non-isolated vertices, then |E(GN)| = Θ(N). Networks arising in certain applications, especially those with an underlying spatial or lattice structure generally have bounded degree. These include planar maps which encode neighborhood relations (Batra et al., 2010; Johnson et al., 2016), lattice models for capturing nearest-neighbor interactions between image pixels, and demand-aware networks (Avin et al., 2020) among others. It is easy to check that the conditions in Assumption 1 are satisfied for bounded degree graphs. Towards this, note that, under the scaling in (12), the condition sup N 1 A < is equivalent to AGN = N 2|E(GN)| maxv V (Gn) dv. Hence, under the scaling in (12), the assumption sup N 1 AGN < is equivalent to dmax := max v V (Gn) dv = O |E(G)| that is, the maximum degree of GN is of the same order as its average degree. Moreover, the condition lim inf N 1 N AGN 2 F > 0 is equivalent to that is, the average degree of GN is bounded. Therefore, (13) and (14) are together equivalent to the condition that GN has bounded maximum degree. 
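Assuming the graph is available as a symmetric 0/1 adjacency matrix, the scaling (12) and the degree quantities appearing in (13) and (14) can be computed directly; the helper names below are illustrative.

```python
import numpy as np

def scaled_interaction_matrix(adj):
    """A_{G_N} = N / (2 |E(G_N)|) * A(G_N), as in (12)."""
    N = adj.shape[0]
    n_edges = adj.sum() / 2.0                 # |E(G_N)| for a symmetric 0/1 adjacency matrix
    return (N / (2.0 * n_edges)) * adj

def degree_summary(adj):
    """Maximum and average degree; (13)-(14) ask for d_max = O(average degree) and bounded average degree."""
    deg = adj.sum(axis=1)
    return deg.max(), deg.mean()
```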
Hence, the PMPL estimate is p log d/N-consistent for any sequence of graphs of bounded maximum degree, whenever the assumptions in Theorem 1 hold. 3.2 Sparse Inhomogeneous Random Graphs Although Theorem 1 requires that the maximum degree of GN has to be of the same order as the average degree (see (13)), our proofs can be easily adapted to establish similar rates of consistency of the PMPL estimate (up to polylog(N) factors), if the maximum degree GN grows poly-logarithmically with respect to the average degree, which, in particular, is the case for sparse inhomogeneous random graphs. This is summarized in the following result. The proof is given in Section C.1 of the Appendix. Theorem 3 Suppose {GN}N 1 is a sequence of graphs with |E(GN)| = O(N), dmax = O(1), and Ω(N) non-isolated vertices. Then for λ = Θ p ˆγ γ 2 = O 1 with probability 1 o(1) as N, d such that d = o(N). Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Theorem 3 can be applied to obtain rates of convergence in sparse inhomogeneous random graph models. Definition 4 (Bollob as et al., 2007) Given a symmetric matrix P (N) = ((puv)) [0, 1]N N with zeroes on the diagonal, the inhomogeneous random graph G(N, P (N)) is the graph with vertex set [N] := {1, 2, . . . , N} where the edge (u, v) is present with probability puv, independent of the other edges, for every 1 u < v N. The class of inhomogeneous random graph models defined above includes several popular network models, such as the Chung-Lu model (Chung and Lu, 2002), the β-model (Chatterjee et al., 2011), random dot product graphs (Young and Scheinerman, 2007; Tang et al., 2017), and stochastic block models (Holland et al., 1983; Lei, 2016). Next, we consider the sparse regime wherein max 1 u,v N puv = O 1 In this regime, the expected degree remains bounded, although the maximum degree can diverge at rate O(log N) (Benaych-Georges et al., 2019; Krivelevich and Sudakov, 2003). We will also assume that there exists ε (0, 1) and Ω(N) vertices u GN, such that v=1 (1 puv) < ε. (16) This will ensure G(N, P (N)) has Ω(N) non-isolated vertices. Under these assumptions we have the following result: Corollary 5 Suppose GN is a realization of the inhomogeneous random graph G(N, P (N)), where P (N) satisfies the conditions in (15) and (16). Then for λ = Θ p log(d + 1)/N , ˆγ γ 2 = O 1 with probability 1 o(1) as N, d such that d = o(N). Corollary 5 is proved in Section C.2 of the Appendix. In the following example, we illustrate how it can be applied to sparse stochastic block models, in particular, sparse Erd os-R enyi random graphs. Example 1 (Sparse stochastic block models) Fix K 1, a vector of community proportions λ := (λ1, . . . , λK) (0, 1)K, such that PK j=1 λj = 1, and a symmetric probability matrix B := ((bij))1 i,j K, where bij [0, 1], for all 1 i, j K and bij > 0 for some 1 i, j K. The (sparse) stochastic block model with proportion vector λ and probability matrix B is the inhomogeneous random graph G(N, P (N)), with P (N) = ((puv))1 u,v N where N for (u, v) Bi Bj, (17) Logistic Regression Under Network Dependence where Bj := (N Pj 1 i=1 λi, N Pj i=1 λi] T[N], for j {1, . . . , K}. In other words, the set of vertices is divided into K blocks (communities) B1, B2, . . . , BK, such that the edge between vertices u Bi and v Bj occurs independently with probability bij/N. Clearly, in this case (15) holds. Next, to check that (16) holds, choose 1 i, j K such that bij > 0. Then for all u Bi, v=1 (1 puv) lim sup N = exp( λjbij) < 1, which verifies (16), since |Bj| = Ω(N). 
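The inhomogeneous random graph of Definition 4 and the block-constant edge probabilities $p_{uv} = b_{ij}/N$ of Example 1 are straightforward to simulate; the following sketch (with illustrative helper names) can be used to generate the graphs $G_N$ considered in this section.

```python
import numpy as np

def sample_inhomogeneous_graph(P, rng=None):
    """Sample G(N, P): edge (u, v) present independently with probability P[u, v] (Definition 4)."""
    rng = rng or np.random.default_rng()
    N = P.shape[0]
    upper = np.triu(rng.random((N, N)) < P, k=1)   # independent Bernoulli draws, one per unordered pair
    return (upper | upper.T).astype(int)

def sparse_sbm_probabilities(N, lam, B):
    """p_uv = b_ij / N for u in block i and v in block j, as in Example 1."""
    sizes = np.round(np.asarray(lam) * N).astype(int)
    sizes[-1] = N - sizes[:-1].sum()               # make the block sizes add up to N
    labels = np.repeat(np.arange(len(lam)), sizes)
    P = np.asarray(B)[labels][:, labels] / N
    np.fill_diagonal(P, 0.0)
    return P

# The sparse Erdos-Renyi graph G(N, c/N) is the special case with one block and B = [[c]].
```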
Hence, by Corollary 5, the PMPL estimate (6) is O(1/ N)-consistent in this example. As a consequence, the PMPL estimate is O(1/ N)- consistent for sparse Erd os-R enyi random graphs G(N, c/N), which corresponds to setting bij = c, for all 1 i, j K in (17). 3.3 Beyond Bounded Degree Graphs We can also establish the consistency of the PMPL estimate beyond bounded degree graphs using Proposition 7, which provides error rates in terms of A F . To this end, note that when A = AGN is the scaled adjacency matrix of GN as in (12), then Hence, whenever (13) holds, Proposition 7 implies, ˆγ γ 2 = Os |E(GN)|2 log d with probability 1 o(1), whenever d = o(N2/|E(GN)|). This shows that the PMPL estimate is consistent whenever |E(GN)| = o(N3/2) (up to log-factors) and d = o(N2/|E(GN)|). In particular, if GN is -regular (that is, all vertices have of GN has degree ), then the rate of convergence becomes Os( p log d/N), if d = o(N/ ). 4. Computation and Experiments Next, we discuss an algorithm for computing the PMPL estimates (Section 4.1) and evaluate its performance in numerical experiments using synthetic data (Section 4.2). 4.1 Computation of the PMPL Estimates A classical method developed for solving sparse estimation problems is the proximal descent algorithm (Friedman et al., 2010). We employ this algorithm to the optimization problem (9). To describe the algorithm, let f(z) := LN(z) + λ||z||1, (18) Mukherjee, Niu, Halder, Bhattacharya, and Michailidis for z Rd+1, with LN( ) as defined in (5). Also, for t R and x Rd+1 define t (x proxt(x t LN(x))) , proxt(x) := arg min z Rd+1 2t x z 2 2 + λ||z||1 is minimized by the soft thresholding estimator. To chose the step size in the proximal descent algorithm, we employ a backtracking line search, which is commonly used in gradientbased as well as in lasso-type problems (Qin et al., 2013). To justify this we invoke the following result applied to the function f defined in (18): Proposition 6 (Vandenberghe, 2022) Fix s 1 and a step size t > 0. Suppose at the s-th iteration the following bound holds: LN(γ(s) t Gt(γ(s))) LN(γ(s)) t LN(γ(s)) Gt(γ(s)) + t 2 Gt(γ(s)) 2 2. (19) Then for all γ Rd+1, f(γ(s) t Gt(γ(s))) f(γ) + Gt(γ(s)) (γ(s) γ) t 2 Gt(γ(s)) 2 2. (20) Note that setting γ = γ(s) in (20) gives, f(γ(s) t Gt(γ(s))) f(γ(s)) t 2 Gt(γ(s)) 2 2. This shows that whenever the line-search condition (19) holds, the descent of the objective function is guaranteed by setting γ(s+1) γ(s) t Gt(γ(s)) = proxt(γ(s) t LN(γ(s))). Therefore, the proximal gradient descent algorithm for optimization problem (9), with step size chosen by backtracking line search, proceeds in the following two steps: We initialize with γ(0) = 0 Rd+1 and t(0) = 1. If, at the s-th iteration (γ(s), t(s)) satisfies the line-search condition (19), then we update the estimates γ(s+1) proxt(s)(γ(s) t(s) LN(γ(s))) = γ(s) t(s) LN(γ(s)) 1 t(s)λ γ(s) t(s) LN(γ(s)) and keep the step size unchanged, that is, t(s+1) t(s). Logistic Regression Under Network Dependence Algorithm 1 Fix a value of τ (0, 1) and δ > 0. Initialize with γ(0) = 0 Rd+1 and t(0) = 1. while γ(s+1) γ(s) 1 > δ do if LN(γ(s) t(s)Gt(s)(γ(s))) LN(γ(s)) t(s) LN(γ(s)) Gt(s)(γ(s))+ t(s) 2 Gt(s)(γ(s)) 2 2 then γ(s+1) proxt(s)(γ(s) t(s) LN(γ(s))) and t(s+1) t(s). end if if LN(γ(s) t(s)Gt(s)(γ(s))) > LN(γ(s)) t(s) LN(γ(s)) Gt(s)(γ(s))+ t(s) 2 Gt(s)(γ(s)) 2 2 then γ(s+1) γ(s) and t(s+1) τt(s). 
end if end while If at the s-th iteration (γ(s), t(s)) does not satisfy the line-search condition (19) we shrink the step size by a factor of τ (0, 1), that is, t(s+1) τt(s), and keep the estimates unchanged, that is, γ(s+1) γ(s). The procedure is summarized in Algorithm 1. Note that the smooth part of the objective function LN is differentiable and its gradient LN is Lipschitz (by Lemma 24). Hence, Algorithm 1 reaches ε-close to the optimum value in O(1/ε) iterations (Vandenberghe, 2022). 4.2 Numerical Experiments We evaluate the performance of the PMPL estimator using Algorithm 1 on synthetic data. The first step is to develop an algorithm to sample from model (3). As mentioned before, direct sampling from the model (3) is computationally challenging due to the presence of an intractable normalizing constant. To circumvent this issue, we deploy a Gibbs sampling algorithm which iteratively updates each outcome variable Xi, for 1 i N, based on the conditional distribution P(Xi|(Xj)j =i, Z) (recall (4)). Formally, the sampling algorithm can be described as follows: Start with an initial configuration X(0) := (X(0) i )1 i N { 1, +1}N. At the (s + 1)-th step, for s 1, choose a vertex of GN uniformly at random. If the vertex 1 i N is selected, then update X(s) i to ( +1, with probability P(X(s) i = 1|(X(s) j )j =i, Z) 1, with probability P(X(s) i = 1|(X(s) j )j =i, Z) , and keep X(s+1) j = X(s) j , for j = i. Define X(s+1) := (X(s+1) i )1 i N. The Markov chain {X(s+1)}s 0 has stationary distribution (3) and, hence, can be used to generate approximate samples from (3). Mukherjee, Niu, Halder, Bhattacharya, and Michailidis For the numerical experiments, we consider β = 0.3 and choose the first s regression coefficients θ1, θ2, . . . , θs independently from Uniform([ 1, 1 2, 1]), while the remaining d s regression coefficients θs+1, θs+2, . . . , θd are set to zero. The covariates Z1, Z2, . . . , ZN are sampled i.i.d. from a d-dimensional multivariate Gaussian distribution with mean vector 0 and covariance matrix Σ = ((σij))1 i,j d, with σij = 0.2|i j|. With the aforementioned choices of the parameters and the covariates, we generate a sample from the model (3) by running the Gibbs sampling algorithm described above for 30000 iterations. We then apply Algorithm 1 by setting ε = 0.001, τ = 0.8, and consider the solution paths of the PMPL estimator as a function of log(λ), when the network GN is selected to be the Erd os-R enyi (ER) model and the stochastic block model (SBM). We set the range of λ to be a geometric sequence of length 100 from 0.001 to 0.1. Figure 1 depicts the solution paths of the PMPL estimate when GN is a realization of the Erd os-R enyi random graph G(N, 5/N). Figure 1 (a), corresponds to a setting N = 1200, d = 200, s = 5, while Figure 1 (b) to N = 1200, d = 600, s = 600. 0.5 0.0 0.5 1.0 Solution paths in the ER Model with N = 1200, d = 200, and s = 5 Coefficients Solution paths in the ER Model with N = 1200, d = 600, and s = 10 Coefficients Figure 1: Solution paths of the PMPL estimates in the Erd os-R enyi model G(N, 5/N): (a) N = 1200, d = 200, s = 5, and (b) N = 1200, d = 600, s = 10. Figure 2 shows the solution paths of the PMPL estimate when GN is a realization of a SBM with K = 2, λ1 = λ2 = 1 2, p11 = p22 = 4/N, and p12 = 8/N (that is, a SBM with 2 equal size blocks with within block connection probability 4/N and between block connection probability 8/N (recall Example 1)). In Figure 2 (a) we have N = 1200, d = 200, s = 5, and in Figure 2 (b) N = 1200, d = 600, s = 10. 
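For completeness, a minimal numpy implementation of Algorithm 1 is sketched below. It evaluates $L_N$ and its gradient in closed form, uses componentwise soft thresholding as the proximal map, and applies the threshold only to the θ block, matching the penalty in (6); all names are illustrative rather than part of any released code.

```python
import numpy as np

def neg_lpl_and_grad(gamma, X, Z, A):
    """L_N(beta, theta) from (5) and its gradient; gamma = (beta, theta) in R^{d+1}."""
    beta, theta = gamma[0], gamma[1:]
    m = A @ X                                    # local effects m_i(X)
    u = Z @ theta + beta * m
    val = -np.mean(X * u - np.log(np.cosh(u))) + np.log(2.0)
    r = X - np.tanh(u)                           # residual-type term
    grad = np.concatenate(([-np.mean(r * m)], -(Z.T @ r) / len(X)))
    return val, grad

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (componentwise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def pmpl_proximal_gradient(X, Z, A, lam, tau=0.8, tol=1e-3, max_iter=5000):
    """Sketch of Algorithm 1: proximal gradient descent with backtracking line search."""
    d = Z.shape[1]
    gamma, t = np.zeros(d + 1), 1.0
    for _ in range(max_iter):
        val, grad = neg_lpl_and_grad(gamma, X, Z, A)
        while True:
            cand = gamma - t * grad
            cand[1:] = soft_threshold(cand[1:], t * lam)        # prox step on the theta block
            G = (gamma - cand) / t                              # generalized gradient G_t(gamma)
            new_val, _ = neg_lpl_and_grad(cand, X, Z, A)
            if new_val <= val - t * grad @ G + 0.5 * t * (G @ G):   # line-search condition (19)
                break
            t *= tau                                            # otherwise shrink the step size
        if np.abs(cand - gamma).sum() <= tol:                   # stop when ||gamma^(s+1) - gamma^(s)||_1 <= delta
            return cand
        gamma = cand
    return gamma
```

Similarly, the Gibbs sampler described above, which is used to generate the synthetic data from model (3), can be sketched as follows (again with illustrative names).

```python
import numpy as np

def gibbs_sample(Z, A, beta, theta, n_sweeps=30000, rng=None):
    """Approximate sample from model (3) via the single-site Gibbs updates of Section 4.2."""
    rng = rng or np.random.default_rng()
    N = A.shape[0]
    X = rng.choice([-1.0, 1.0], size=N)          # initial configuration X^(0)
    eta = Z @ theta                              # covariate effects theta' Z_i (fixed throughout)
    for _ in range(n_sweeps):
        i = rng.integers(N)                      # pick a vertex uniformly at random
        u = eta[i] + beta * (A[i] @ X)           # theta' Z_i + beta * m_i(X), using a_ii = 0
        p_plus = np.exp(u) / (np.exp(u) + np.exp(-u))   # conditional probability (4) of X_i = +1
        X[i] = 1.0 if rng.random() < p_plus else -1.0
    return X
```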
From the plots in Figures 1 and 2, it is evident that the first 5 signal (non-zero) coefficients remain non-zero throughout the range of tuning parameters λ considered. Moreover, as Logistic Regression Under Network Dependence 0.5 0.0 0.5 1.0 Solution paths in the SBM with N = 1200, d = 200, and s = 5 Coefficients 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Solution paths in the SBM with N = 1200, d = 600, and s = 10 Coefficients Figure 2: Solution paths of the PMPL estimates in the stochastic block model: (a) N = 1200, d = 200, s = 5, and (b) N = 1200, d = 600, s = 10. expected, λ needs to be larger when d = 600 for the non-signal (zero) coefficients to shrink to zero exactly. Next, we investigate the estimation errors by varying the size N of the network GN. To select the regularization parameter λ, we use a Bayesian Information Criterion (BIC). Specifically, we define BIC(λ) = LN(ˆβλ, ˆθλ) + df(λ) log N, where ˆβλ, ˆθλ = (ˆθλ,i)1 i d are the PMPL estimates obtained from (6) for a fixed value of λ and df(λ) = |{1 i d : ˆθλ,i = 0}|. We choose ˆλ by minimizing BIC(λ) over a grid of values of λ and denote the corresponding PMPL estimates by γ = (ˆβˆλ, ˆθ ˆλ ). Figure 3 shows the average ℓ1 and ℓ2 estimation errors γ γ 1 and γ γ 2 and their 1-standard deviation error bars (over 200 repititions) for the Erd os-R enyi (ER) model and the SBM. We refer to these by Ising L1 and Ising L2 in Figure 3, respectively. For comparison purposes, we also show the ℓ1 and ℓ2 estimation errors for the classical penalized logistic regression (with no interaction term, that is, β = 0), denoted by Logistic L1 and Logistic L2 in Figure 3, respectively. The parameters in the numerical experiment are set as follows: β = 0.3, d = 50, the first s = 5 regression coefficients θ1, θ2, . . . , θ5 are independent samples from Uniform([ 1, 1 2, 1]) and the remaining 45 regression coefficients θ6, θ7, . . . , θ50 are set to zero. As before, the covariates Z1, Z2, . . . , ZN are sampled i.i.d. from a 50dimensional multivariate Gaussian distribution with mean vector 0 and covariance matrix Σ = ((σij))1 i,j 100, with σij = 0.2|i j|. Figure 3 (a) shows the estimation errors when GN is a realization of the Erd os-R enyi random graph G(N, 1/N), as N varies from 200 to 1200 over a grid of 6 values. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis 200 400 600 800 1000 1200 0 1 2 3 4 5 Error Comparison in the ER Model with d=50 and c=1 Ising L1 Logistic L1 Ising L2 Logistic L2 200 400 600 800 1000 1200 0 1 2 3 4 5 Error Comparison in the SBM with d=50 and c=1 Ising L1 Logistic L1 Ising L2 Logistic L2 200 400 600 800 1000 1200 0 1 2 3 4 5 Error Comparison in the β model with d=50 Ising L1 Logistic L1 Ising L2 Logistic L2 200 400 600 800 1000 1200 0 1 2 3 4 5 Error Comparison in the preferential attachment model with d=50 Ising L1 Logistic L1 Ising L2 Logistic L2 Figure 3: Estimation errors of the PMPL and the penalized logistic regression estimates in the (a) Erd os-R enyi model and (b) the stochastic block model, (c) the β-model, and (d) the preferential attachment model. Figure 3 (b) shows the estimation errors when GN is a realization of a SBM with K = 2, λ1 = λ2 = 1 2, p11 = p22 = 0.5/N, and p12 = 1/N, with N varying as before. Figure 3 (c) shows the estimation errors when GN is a realization from a β-model (Chatterjee et al., 2011). 
The β-model is an inhomogeneous random graph model Logistic Regression Under Network Dependence where each edge (i, j), for 1 u < v N, is present independently with probability puv = eβu+βv 1 + eβu+βv , with (β1, . . . , βN) Rn. In Figure 3 (c) we chose βu = c u log(log(u + 1)), for 1 u N, where c = 200/N and N varies from 200 to 1200 over a grid of 6 values. Figure 3 (d) shows the estimation errors when GN is a realization from the linear preferential attachment model with one edge added each time, with N varying as before. The linear preferential attachment graph evolves sequentially one vertex at a time, where each new vertex connects to an existing vertex with probability proportional to their degrees (see Bollobas et al. (2003); Krapivsky and Redner (2001)). Consequently, the model exhibits the rich gets richer phenomenon and the degree sequence follows a power law distribution (Barab asi and Albert, 1999). The plots in Figure 3 show that the estimation errors of PMPL estimates exhibit a decreasing trend as N increases, validating the consistency results established in Section 2. Although the ℓ2 errors of the PMPL and penalized logistic regression estimates are similar for small N, the PMPL errors are better as N increases. Also, the difference between the ℓ1 errors of the PMPL and penalized logistic regression estimates is significant. While the ℓ1 errors of PMPL estimates show consistent decreasing trends in all four settings, those for the penalized logistic regression estimates are much higher. Moreover, as expected, the empirical variances of the ℓ1 and ℓ2 errors for the penalized logistic regression estimate are significantly larger than those for the PMPL estimate. These findings illustrate the effectiveness of the proposed method for modeling dependent network data for range of network models, encompassing different network topologies, such as community structure and degree distribution. We also investigate how the PMPL estimate performs with respect to the density of the network. To this end, we consider the Erd os-R enyi random graph G(N, c/N), with N = 600, and vary c. Figure 4 shows the estimation errors as c increases, with dependence parameter (a) β = 0.15 and (b) β = 0.3 in the respective sub-plots. As expected, the error curves for the PMPL estimates are generally better than those for the penalized logistic regression estimates. Moreover, the estimation errors are relatively small to begin with (when c is small), but starts to show an increasing trend with c after a while. This is expected because as the network density increases the rate of convergence slows down and, as a result, consistent estimation becomes harder (recall the discussion in Section 3.3). 5. Application to Spatial Transcriptomics In this section, we illustrate how the proposed model can be useful in selecting relevant genes in spatial gene expression data. As mentioned in the Introduction, spatial transcriptomics is a new direction in molecular biology where, in addition to measuring the gene expression levels of individual cells, one also has information about the spatial location of the cells (Eng et al., 2019; Goltsev et al., 2018; Palla et al., 2022; Perkel, 2019). To understand how the spatial location of a cell affects its phenotype, it is natural to consider a model as in (3) with a nearest neighbor graph of the cell locations as the underlying network. 
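As a concrete illustration of this construction, the sketch below builds a symmetrized K-nearest-neighbor graph of the cell locations (using scikit-learn's kneighbors_graph) and scales it as in (12) to obtain the interaction matrix; the helper name is illustrative, and K = 1 recovers the graph used in the analysis that follows.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_interaction_matrix(coords, k=1):
    """Symmetrized k-nearest-neighbor graph of the cell locations, scaled as in (12)."""
    G = kneighbors_graph(coords, n_neighbors=k, mode="connectivity", include_self=False)
    adj = G.toarray()
    adj = np.maximum(adj, adj.T)           # symmetrize: keep an edge if either point lists the other
    N, n_edges = adj.shape[0], adj.sum() / 2.0
    return (N / (2.0 * n_edges)) * adj     # A_{G_N} = N / (2 |E(G_N)|) * A(G_N)
```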
Mukherjee, Niu, Halder, Bhattacharya, and Michailidis 0 5 10 15 20 0 1 2 3 4 5 Error Comparison in the ER Model with d=50, N=600 and β=0.15 Ising L1 Logistic L1 Ising L2 Logistic L2 0 2 4 6 8 10 12 0 1 2 3 4 5 Error Comparison in the ER Model with d=50, N=600 and β=0.3 Ising L1 Logistic L1 Ising L2 Logistic L2 Figure 4: Estimation errors of the PMPL and the penalized logistic regression estimates in the Erd os-R enyi model G(N, c/N), with N = 600, as c varies, where (a) β = 0.15 and (b) β = 0.3. We consider the Visium spatial gene expression data set for human breast cancer (see https://www.10xgenomics.com/spatial-transcriptomics for details about the spatial capture technology) available in the Python package scanpy. The data set is available at https://support.10xgenomics.com/spatial-gene-expression/datasets and can be loaded using the Python command: scanpy.datasets.visium_sge(sample_id= V1_Breast_Cancer_Block_A_Section_1 ) The data consists of 36601 genes and 3798 cells along with their spatial locations. To obtain the cell labels, we first filter out the top 50 highly variable genes, that is, the genes whose expression variance is within the top 50 among all genes. Subsequently, we cluster the cells based on the expression levels of these 50 gene into 2 types (clusters) using the Leiden algorithm (Tragg, 2019). The output of the clustering algorithm visualized using the Python command sc.pl.spatial is shown in Figure 5 (a). Using the cell labels obtained as above and the first 100 highly variable genes as the covariates, we then fit the model (3) with the 1-nearest neighbor graph of the spatial location of the cells as the underlying network, using the PMPL method. The optimal λ is chosen using the BIC criterion. The PMPL method with the BIC chosen regularization parameter, selects 6 genes among the top 100 highly variable genes. Among the selected ones, four of them are actually in the top 50 highly variable gene set obtained in the first filtering step. These genes are shown in Table 1. Next, we re-cluster the cells based on only the 6 selected genes (see Figure 5 (b)). Interestingly, just using the 6 selected genes we can recover the clustering result obtained with the top 50 variable genes with high accuracy. This illustrates how incorporating spatial Logistic Regression Under Network Dependence information can significantly reduce dimensionality for clustering single cell data and the usefulness of our method in selecting relevant genes. Selected genes among top 100 Estimated coefficients Selected genes among top 50 S100A9 0.0087 S100A9 CPB1 -0.0330 CPB1 SPP1 0.0123 SPP1 CRISP3 0.1465 CRISP3 SLITRK6 0.1276 IGLC2 -0.0392 Table 1: Names of the selected genes and the estimates of the corresponding regression coefficients. The estimate of β is ˆβ = 0.1203. Figure 5: Clustering results using the Leiden algorithm: (a) with the top 50 highly variable genes, (b) with the 6 selected genes. To capture the spatial dependence one can, more generally, consider the K-nearest neighbor graph (instead of the 1-nearest neighbor graph as above) of the spatial locations of the cells in the model (3). To understand the sensitivity of the PMPL method on the choice of the number of nearest neighbors, we repeat the experiment with K = 1, K = 2, and K = 3. The genes selected by the PMPL method and the estimates of the corresponding regression coefficients for each of these settings are shown in Table 2. 
It turns out that for K = 1 and K = 2 the genes selected are the same, and for K = 3 the genes selected match except one (the gene SPP1 is no longer selected). This shows that the PMPL method is quite robust to choice of the underlying nearest-neighbor graph as long as K is not too large. While one can incorporate more distant spatial dependencies by increasing K, this makes the graph denser and, as a result, the rate of estimation becomes slower (as shown Mukherjee, Niu, Halder, Bhattacharya, and Michailidis in Section 3.3). In practice, especially for spatial problems, where the dependence often decreases with distance, choosing a small value of K should suffice. Selected genes among top 100 Estimated coefficients Selected genes among top 50 S100A9 0.0087 S100A9 CPB1 -0.0330 CPB1 SPP1 0.0123 SPP1 CRISP3 0.1465 CRISP3 SLITRK6 0.1276 IGLC2 -0.0392 S100A9 0.0047 S100A9 CPB1 -0.0250 CPB1 SPP1 0.0037 SPP1 CRISP3 0.1121 CRISP3 SLITRK6 0.1131 IGLC2 -0.0317 S100A9 0.0034 S100A9 CPB1 -0.0177 CPB1 CRISP3 0.0805 CRISP3 SLITRK6 0.0845 IGLC2 -0.0241 Table 2: Names of the selected genes and the estimates of the corresponding regression coefficients for the K-nearest-neighbor graph, with K = 1, K = 2, and K = 3. The estimates of β are ˆβ = 0.1203, ˆβ = 0.2434, and ˆβ = 0.2870 for K = 1, K = 2, and K = 3, respectively. 6. Conclusion Understanding the effect of dependence in high-dimensional inference tasks for non-Gaussian models is an emerging research direction. In this paper, we develop a framework for efficient parameter estimation in a model for dependent network data with binary outcomes and high-dimensional covariates. The model combines the classical high-dimensional logistic regression with the Ising model from statistical physics to simultaneously capture dependence from the underlying network and the effect of high-dimensional covariates. This dependence makes the model different and the analysis more challenging compared to existing results based on independent samples. In the this paper we develop an efficient algorithm for jointly estimating the effect of dependence and the high-dimensional regressions parameters using a penalized maximum pseudo-likelihood (PMPL) method and derive its rate of consistency. To understand which of the covariates have an effect on the outcome under the presence of network dependence, we also consider the problem of estimation given a fixed (known) level of dependence. Towards this, we show that using the PMPL method the regression parameters can be estimated at the classical high-dimensional rate, despite the presence of dependence, in the entire high-dimensional regime. We expect the model to be broadly useful in network econometrics and spatial statistics for understanding dependent binary Logistic Regression Under Network Dependence data with an underlying network geometry. As an application, we apply the proposed model to select genes in spatial transcriptomics data. Various questions remain and future directions emerge. Theoretically, it would be interesting to see if the conditions for joint estimation can be relaxed. Computationally, it would be interesting to explore more efficient sampling schemes for Ising models with covariates. Incorporating dependence in other generalized linear models and high-dimensional distributions, through the lens of the Ising and more general graphical models, is another interesting direction for future research. Acknowledgments Bhaswar B. Bhattacharya thanks Rajarshi Mukherjee and Nancy R. 
Zhang for many helpful discussions. The authors also thank the anonymous referees for their insightful comments which improved the quality and the presentation of the paper. The authors are also grateful to Sagnik Nandy for his help with the data analysis and Baichen Yu for pointing out an important typo in a previous version of the manuscript. Bhaswar B. Bhattacharya was supported by NSF CAREER grant DMS 2046393 and a Sloan Research Fellowship. Somabha Mukherjee was supported by the National University of Singapore start-up grant WBS A0008523-00-00 and the Fo S Tier 1 grant WBS A-8001449-00-00. George Michailidis was supported by NSF grant DMS 2334735. Rados law Adamczak, Micha l Kotowski, Bart lomiej Polaczyk, and Micha l Strzelecki. A note on concentration for polynomials in the Ising model. Electronic Journal of Probability, 24: 1 22, January 2019. Animashree Anandkumar, Vincent Y. F. Tan, Furong Huang, and Alan S. Willsky. High-dimensional structure estimation in Ising models: Local separation criterion. The Annals of Statistics, 40 (3): 1346 1375, June 2012. Chen Avin, Kaushik Mondal, and Stefan Schmid. Demand-aware network designs of bounded degree. Distributed Computing, 33 (3-4): 311 325, June 2020. Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4: 384 414, January 2010. Sudipto Banerjee, Alan E. Gelfand, and Bradley P. Carlin. Hierarchical modeling and analysis for spatial data. Chapman & Hall/CRC Press, Boca Raton, Florida, second edition, 2015. Dhruv Batra, A. C. Gallagher, Devi Parikh, and Tsuhan Chen. Beyond trees: MRF inference via outer-planar decomposition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2496 2503, June 2010. Albert-L aszl o Barab asi and R eka Albert. Emergence of scaling in random networks. Science, 286 (5439):509 512, 1999. Florent Benaych-Georges, Charles Bordenave, and Antti Knowles. Largest eigenvalues of sparse inhomogeneous Erd os-R enyi graphs. The Annals of Probability, 47 (3): 1653 - 1676, May 2019. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Marianne Bertrand, Erzo F. P. Luttmer, and Sendhil Mullainathan. Network effects and welfare cultures. Quarterly Journal of Economics, 115 (3): 1019 1055, August 2000. Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological), 36 (2): 192 225, January 1974. Julian Besag. Statistical analysis of non-lattice data. The Statistician, 24 (3): 179 195, September 1975. Bhaswar B. Bhattacharya and Sumit Mukherjee. Inference in Ising models. Bernoulli, 24 (1): 493 525, February 2018. Peter J. Bickel, Ya acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37 (4): 1705 1732, August 2009. Bela Bollobas, Christian Borgs, Jennifer Chayes, and Oliver Riordan. Directed scale-free graphs. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 132 139, 2003. B ela Bollob as, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 31 (1): 3 122, August 2007. Guy Bresler. Efficiently Learning Ising Models on Arbitrary Graphs. In Proceedings of the fortyseventh annual ACM symposium on Theory of Computing, pages 771 782, Portland Oregon USA, June 2015. Guy Bresler and Mina Karzand. Learning a tree-structured Ising model in order to make predictions. 
The Annals of Statistics, 48 (2): 713 737, April 2020. Florentina Bunea. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1+ℓ2 penalization. Electronic Journal of Statistics, 2: 1153 1194, January 2008. Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35 (6): 2313 2351, December 2007. Emmanuel J. Cand es and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48 (1): 27 42, February 2020. Yuan Cao, Matey Neykov, and Han Liu. High-temperature structure detection in ferromagnets. Information and Inference: A Journal of the IMA, 11 (1): 55 102, March 2022. Sourav Chatterjee. Estimation in spin glasses: A first step. The Annals of Statistics, 35 (5): 1931 1946, October 2007. Sourav Chatterjee. Concentration inequalities with exchangeable pairs (Ph.D. thesis), March 2016. ar Xiv:math/0507526. Sourav Chatterjee, Persi Diaconis, and Allan Sly. Random graphs with a given degree sequence. The Annals of Applied Probability, 21 (4): 1400 1435, August 2011. C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14 (3): 462 467, May 1968. Nicholas A. Christakis and James H. Fowler. Social contagion theory: examining dynamic social networks and human behavior. Statistics in medicine, 32 (4): 556 577, February 2013. Logistic Regression Under Network Dependence Fan Chung and Linyuan Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6 (2): 125 145, November 2002. Francis Comets. On consistency of a class of estimators for exponential families of Markov Random Fields on the lattice. The Annals of Statistics, 20 (1): 557 578, March 1992. Francis Comets and Basilis Gidas. Asymptotics of Maximum Likelihood Estimators for the Curie Weiss Model. The Annals of Statistics, 19 (2): 557 578, June 1991. Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti. Learning from Weakly Dependent Data under Dobrushin s Condition. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 914 928. PMLR, June 2019. Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Anthimos Vardis Kandiros. Learning Ising models from one or multiple samples. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 161 168, June 2021. Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising Models. IEEE Transactions on Information Theory, 65 (11): 6829 6852, November 2019. Constantinos Daskalakis, Nishanth Dikkala, and Ioannis Panageas. Regression from dependent observations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 881 889, Phoenix AZ USA, June 2019. Constantinos Daskalakis, Nishanth Dikkala, and Ioannis Panageas. Logistic regression with peergroup effects via inference in higher-order Ising models. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 3653 3663. PMLR, June 2020. Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer, 2008. Nabarun Deb, Rajarshi Mukherjee, Sumit Mukherjee, and Ming Yuan. Detecting structured signals in Ising models. The Annals of Applied Probability, 34 (1A), 1 45, 2024. E. Duflo and E. Saez. 
The role of information and social interactions in retirement plan decisions: Evidence from a randomized experiment. The Quarterly Journal of Economics, 118 (3): 815 842, August 2003. Chee-Huat Linus Eng, Michael Lawson, Qian Zhu, Ruben Dries, Noushin Koulena, Yodai Takei, Jina Yun, Christopher Cronin, Christoph Karp, Guo-Cheng Yuan, and Long Cai. Transcriptome-scale super-resolved imaging in tissues by RNA seq FISH+. Nature, 568 (7751): 235 239, April 2019. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33 (1), 2010. Stuart Geman and Christine Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, pages 1496 1517, 1986. Promit Ghosal and Sumit Mukherjee. Joint estimation of parameters in Ising model. The Annals of Statistics, 48 (2): 785 810, 2020. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis B. Gidas. Consistency of maximum likelihood and pseudo-likelihood estimators for Gibbs distributions. In Wendell Fleming and Pierre-Louis Lions, editors, Stochastic Differential Systems, Stochastic Control Theory and Applications, The IMA Volumes in Mathematics and Its Applications, pages 129 145, New York, NY, 1988. Springer. E. L. Glaeser, B. Sacerdote, and J. A. Scheinkman. Crime and Social Interactions. The Quarterly Journal of Economics, 111 (2): 507 548, May 1996. Yury Goltsev, Nikolay Samusik, Julia Kennedy-Darling, Salil Bhate, Matthew Hale, Gustavo Vazquez, Sarah Black, and Garry P. Nolan. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell, 174 (4): 968 981.e15, August 2018. Peter J Green and Sylvia Richardson. Hidden Markov models and disease mapping. Journal of the American Statistical Association, 97 (460): 1055 1070, December 2002. J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint estimation of multiple graphical models. Biometrika, 98 (1): 1 15, March 2011. Xavier Guyon and Hans R. K unsch. Asymptotic comparison of estimators in the Ising model. In Piero Barone, Arnoldo Frigessi, and Mauro Piccioni, editors, Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, Lecture Notes in Statistics, pages 177 198, New York, NY, 1992. Springer. Linus Hamilton, Frederic Koehler, and Ankur Moitra. Information theoretic properties of Markov Random Fields, and their algorithmic applications. In Advances in Neural Information Processing Systems, volume 30: 2463 2472, Inc., 2017. K. M. Harris, National Longitudinal Study of Adolescent Health, et al. Waves i & ii, 1994-1996; wave iii, 2001-2002; wave iv, 2007-2009. Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109 137, 1983. J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79 (8): 2554 2558, April 1982. David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant. Applied logistic regression. Number 398 in Wiley series in probability and statistics. Wiley, Hoboken, New Jersey, third edition edition, 2013. Danyang Huang, Wei Lan, Hao Helen Zhang, and Hansheng Wang. Least squares estimation of spatial autoregressive models for large-scale social networks. Electronic Journal of Statistics, 13: 1135 1165, 2019. Ernst Ising. Beitrag zur Theorie des Ferromagnetismus. 
Zeitschrift f ur Physik, 31 (1): 253 258, February 1925. Jason K. Johnson, Diane Oyen, Michael Chertkov, and Praneeth Netrapalli. Learning planar ising models. The Journal of Machine Learning Research, 17 (1): 7539 7564, January 2016. Sham Kakade, Ohad Shamir, Karthik Sindharan, and Ambuj Tewari. Learning exponential families in high-dimensions: strong convexity and sparsity. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 381 388. JMLR Workshop and Conference Proceedings, March 2010. Logistic Regression Under Network Dependence Vardis Kandiros, Yuval Dagan, Nishanth Dikkala, Surbhi Goel, and Constantinos Daskalakis. Statistical estimation from dependent data. In Proceedings of the 38th International Conference on Machine Learning, pages 5269 5278. PMLR, July 2021. Minwoo Kim, Shrijita Bhattacharya, and Tapabrata Maiti. Variational Bayes algorithm and posterior consistency of Ising model parameter estimation, ar Xiv:2109.01548, September 2021. Adam Klivans and Raghu Meka. Learning graphical models using multiplicative weights. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 343 354, Berkeley, CA, October 2017. Paul L Krapivsky and Sidney Redner. Organization of growing random networks. Physical Review E, 63(6):066123, 2001. Michael Krivelevich and Benny Sudakov. The largest eigenvalue of sparse random graphs. Combinatorics, Probability and Computing, 12 (1): 61 72, January 2003. Erich Leo Lehmann and Joseph P Romano. Testing statistical hypotheses, volume 3. Springer, 2005. Lung-Fei Lee. Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica, 72(6):1899 1925, 2004. Lung-fei Lee, Xiaodong Liu, and Xu Lin. Specification and estimation of social interaction models with network structures. The Econometrics Journal, 13(2):145 176, 2010. Jing Lei. A goodness-of-fit test for stochastic block models. The Annals of Statistics, 44 (1): 401 424, February 2016. Fan Li and Nancy R. Zhang. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association, 105 (491): 1202 1214, September 2010. Fan Li, Tingting Zhang, Quanli Wang, Marlen Z. Gonzalez, Erin L. Maresh, and James A. Coan. Spatial Bayesian variable selection and grouping for high-dimensional scalar-on-image regression. The Annals of Applied Statistics, 9 (2): 687 713, June 2015. Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust $M$- estimators. The Annals of Statistics, 45 (2): 866-896, April 2017. P. Mc Cullagh and J. A. Nelder. Generalized Linear Models. Springer US, Boston, MA, 1989. Lukas Meier, Sara Van De Geer, and Peter B uhlmann. The group lasso for logistic regression: Group Lasso for Logistic Regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (1): 53 71, January 2008. Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37 (1): 246 270, February 2009. Andrea Montanari and Amin Saberi. The spread of innovations in social networks. Proceedings of the National Academy of Sciences, 107 (47): 20196 20201, November 2010. Rajarshi Mukherjee and Gourab Ray. On testing for parameters in Ising models. Annales de l Institut Henri Poincar e, Probabilit es et Statistiques, 58 (1): 164 187, February 2022. 
Rajarshi Mukherjee, Sumit Mukherjee, and Ming Yuan. Global testing against sparse alternatives under Ising models. The Annals of Statistics, 46 (5): 2062 2093, October 2018. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Somabha Mukherjee, Jaesung Son, and Bhaswar B Bhattacharya. Estimation in tensor Ising models. Information and Inference: A Journal of the IMA, June 2022. Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of $M$-Estimators with Decomposable Regularizers. Statistical Science, 27 (4): 538 557, November 2012. J. A. Nelder and R. W. M. Wedderburn. Generalized Linear Models. Journal of the Royal Statistical Society. Series A (General), 135 (3): 370, 1972. Matey Neykov and Han Liu. Property testing in high-dimensional Ising models. The Annals of Statistics, 47 (5): 2472 2503, October 2019. Giovanni Palla, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, Ignacio L. Ibarra, Olle Holmberg, Isaac Virshup, Mohammad Lotfollahi, Sabrina Richter, and Fabian J. Theis. Squidpy: a scalable framework for spatial omics analysis. Nature Methods, 19: 171 178, January 2022. Jeffrey M. Perkel. Starfish enterprise: finding RNA patterns in single cells. Nature, 572 (7770): 549 551, August 2019. David K. Pickard. Inference for discrete Markov Fields: The simplest nontrivial case. Journal of the American Statistical Association, 82 (397): 90 96, March 1987. Sean Plummer, Debdeep Pati, and Anirban Bhattacharya. Dynamics of coordinate ascent variational inference: A case study in 2D Ising models. Entropy, 22 (11): 1263, November 2020. Zhiwei Qin, Katya Scheinberg, and Donald Goldfarb. Efficient block-coordinate descent algorithms for the Group Lasso. Mathematical Programming Computation, 5 (2): 143 169, June 2013. Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for highdimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57 (10): 6976 6994, October 2011. Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38 (3): 1287 1319, June 2010. B. Sacerdote. Peer effects with random assignment: results for Dartmouth roommates. The Quarterly Journal of Economics, 116 (2): 681 704, May 2001. Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. The impact of regularization on high-dimensional logistic regression, ar Xiv:1906.03761, November 2019. Narayana P. Santhanam and Martin J. Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58 (7): 4117 4134, July 2012. Pragya Sur and Emmanuel J. Cand es. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116 (29): 14516 14525, July 2019. Pragya Sur, Yuxin Chen, and Emmanuel J. Cand es. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled Chi-square. Probability Theory and Related Fields, 175(1-2): 487 558, October 2019. Logistic Regression Under Network Dependence Minh Tang, Avanti Athreya, Daniel L. Sussman, Vince Lyzinski, and Carey E. Priebe. A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli, 23 (3): 1599 1630, August 2017. V. Traag, L. Waltman, and N. 
van Eck, From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports. 9, 5233, 2019. Justin G. Trogdon, James Nonnemaker, and Joanne Pais. Peer effects in adolescent overweight. Journal of Health Economics, 27 (5): 1388 1399, September 2008. A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998. ISBN 9780521784504. Sara van de Geer, Peter B uhlmann, Ya acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3): 1166 1202, June 2014. Sara A. van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2): 614 645, April 2008. Sara A. van de Geer and Peter B uhlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3: 1166 1202, January 2009. L. Vandenberghe, Proximal gradient method, ECE236C Lecture Notes, UCLA, 2022. (http://www. seas.ucla.edu/~vandenbe/236C/lectures/proxgrad.pdf) Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, volume 29: 2595 2603, 2016. Martin Wainwright. High-dimensional statistics: a non-asymptotic viewpoint. Number 48 in Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge; New York, NY, 2019. Bo-Ying Wang and Bo-Yan Xi. Some inequalities for singular values of matrix products. Linear Algebra and its Applications, 264: 109 115, October 1997. Lingzhou Xue, Hui Zou, and Tianxi Cai. Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. The Annals of Statistics, 40 (3): 1403 1429, June 2012. Stephen J Young and Edward R Scheinerman. Random dot product graph models for social networks. In International Workshop on Algorithms and Models for the Web-Graph, pages 138 149. Springer, 2007. Xuening Zhu, Danyang Huang, Rui Pan, and Hansheng Wang. Multivariate spatial autoregressive model for large scale social networks. Journal of Econometrics, 215(2):591 606, 2020. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Appendix A. Proof of Theorem 1 Theorem 1 is a consequence of the following more general result which provides rates of consistency for the PMPL estimate in terms of A 2 F . Proposition 7 Suppose that Assumptions 1, 2, and 3 hold. Then, there exists a constant δ > 0 such that by choosing λ := δ p log(d + 1)/N in the objective function in (22) we have, ˆγ γ 2 = Os log d A 4 F /N and ˆγ γ 1 = Os log d A 4 F /N with probability 1 o(1), as N and d such that d = o( A 2 F ) and log d = o( A 4 F /N). Note that when lim inf N 1 N A 2 F > 0, then rates in Theorem 7 is an immediate consequence of Proposition 7. The rest of this section is devoted the proof of Proposition 7. To this end, recall from (6) that our PMPL estimator is defined as: (ˆβ, ˆθ ) := argmin (β,θ ) Rd+1 LN(β, θ) + λ θ 1 (22) where λ > 0 is a tuning parameter and LN(β, θ) = 1 j=1 Aij Xj + θ Zi j=1 Aij Xj + θ Zi is as defined in (5) (where we have dropped the additive factor of log 2). To begin with, note that since Assumption 1 holds, by scaling the interaction matrix and the covariate vectors by A we can assume without loss of generality, sup N 1 A 1. 
(23) The first step towards the proof of Theorem 1 is to establish the concentration of the pseudolikelihood gradient vector LN(ˆγ) in the ℓ norm. This is formalized in the following lemma which is proved in Section A.1. Lemma 8 (Concentration of the gradient) For ˆγ and any λ > 0, LN(ˆγ) λ. (24) Moreover, there exists δ > 0 such that with λ := δ p log(d + 1)/N the following holds: P LN(γ) > λ = o(1), (25) where the o(1)-term goes to infinity as d . Logistic Regression Under Network Dependence The next lemma shows that the pseudo-likelihood function is strongly concave with high probability. The proof of this lemma is given in Section A.2. Lemma 9 (Strong concavity of pseudo-likelihood) Suppose the assumptions of Theorem 1 hold. Then, there exists a constant κ := κ(s, M, β, Θ) > 0, such that LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) κ A 2 F ˆγ γ 2 2 N , with probability 1 o(1). The proof of Theorem 1 can now be easily completed using the above lemmas. Towards this define: S := {1 i d : θi = 0}. Moreover, for any vector a Rp and any set Q {1, . . . , p}, we denote by a Q the vector (ai)i Q. Now, for the constant κ as in Lemma 9, consider the event EN := n X CN : LN(γ) λ and LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) κ A 2 F ˆγ γ 2 2 N Clearly, from Lemma 8 and Lemma 9, P(Ec N) = o(1). Next, suppose X EN. From the definition of ˆγ it follows that LN(ˆγ) + λ ˆθ 1 LN(γ) + λ θ 1. (26) λ( θ 1 ˆθ 1) LN(ˆγ) LN(γ) = LN(γ) (ˆγ γ) + (LN(ˆγ) LN(γ) LN(γ) (ˆγ γ)) LN(γ) ˆγ γ 1 + (LN(ˆγ) LN(γ) LN(γ) (ˆγ γ)) 2 + (LN(ˆγ) LN(γ) LN(γ) (ˆγ γ)), (27) where the last step uses LN(γ) λ 2, for X EN. Next, note that ˆθ 1 = θS + (ˆθ θ)S 1 + (ˆθ θ)Sc 1 θS 1 (ˆθ θ)S 1 + (ˆθ θ)Sc 1 = θ 1 (ˆθ θ)S 1 + (ˆθ θ)Sc 1. This implies, (ˆθ θ)S 1 (ˆθ θ)Sc 1 θ 1 ˆθ 1. (28) Combining (27) and (28) it follows that λ (ˆθ θ)S 1 (ˆθ θ)Sc 1 + ˆγ γ 1 LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) κ A 2 F ˆγ γ 2 2 N , (29) Mukherjee, Niu, Halder, Bhattacharya, and Michailidis where the last inequality uses LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) κ A 2 F ˆγ γ 2 2 N , for X EN. Using ˆγ γ 1 = (ˆθ θ)S 1 + (ˆθ θ)Sc 1 + |ˆβ β| in the LHS of (29) now gives, ˆγ γ 2 2 λN κ A 2 F 3 (ˆθ θ)S 1 2 (ˆθ θ)Sc 1 (ˆθ θ)S 1 + |ˆβ β| i S (ˆθi θi)2 + (ˆβ β)2 This implies, for X EN, ||ˆγ γ||2 = Oκ,δ This completes the proof of the ℓ2 error rate in Theorem 1, since P(EN) = 1 o(1). The bound on the ℓ1 error ||ˆγ γ||1 is shown in Lemma 15. A.1 Proof of Lemma 8 First, we establish that || LN(ˆγ)|| λ. Fix 1 i d and define the univariate function: f(x) := LN(ˆβ, ˆθ1, . . . , ˆθi 1, x, ˆθi+1, . . . , ˆθd). Note that f (ˆθi) = θi LN(γ) |γ=ˆγ. Now, by the definition of ˆγ we have, f(ˆθi) + λ|ˆθi| f(x) + λ|x|, which implies, f(x) f(ˆθi) λ(|ˆθi| |x|). Then consider the following cases: ˆθi > 0: Then, for all x > ˆθi, f(x) f(ˆθi) x ˆθi λ|ˆθi| |x| Similarly, for all 0 < x < ˆθi, f(x) f(ˆθi) x ˆθi λ |ˆθi| |x| x ˆθi = λ. This implies, f (ˆθi) = λ. ˆθi < 0: Then, for all 0 > x > ˆθi, f(x) f(ˆθi) x ˆθi λ|ˆθi| |x| Similarly, x < ˆθi, f(x) f(ˆθi) x ˆθi λ |ˆθi| |x| x ˆθi = λ. Hence, in this case, f (ˆθi) = λ. Logistic Regression Under Network Dependence ˆθi = 0: In this case, for all x > 0, and for all x < 0, f(x) f(0) x = λ. Since f exists, this implies that |f (0)| λ. Next, define g(x) := LN(x, ˆθ1, . . . , ˆθi 1, ˆθi, ˆθi+1, . . . , ˆθd). Note that g (ˆβ) = β LN(γ) |γ=ˆγ. By the definition of ˆγ we have, g(ˆβ) g(x). This implies that g (ˆβ) = 0. Combining the above, it follows that || LN(ˆγ)|| = max j [d] |f (ˆθi)| λ, completing the proof of (24). Next, we establish the concentration of LN(γ) as in (25). 
For this step, we require the following definitions. For 1 i N, denote j=1 aij Xj. Define functions φi : CN R, for 1 i N, as follows: n mi(x) xi tanh(βmi(x) + θ Zi) o , (30) for x = (x1, x2, . . . , xn) CN. Similarly, define functions φi,s : CN R, for 1 i N and 1 s d, as follows: φi,s(x) := 1 n Zi,s xi tanh(βmi(x) + θ Zi) o . (31) Note that LN = ( LN θ1 , . . . , LN i=1 φi(X) and LN i=1 φi,s(X), for 1 s d. To establish the concentration of LN(γ) , we use the conditioning trick from Dagan et al. (2021), which allows to reduce the model (3) to an Ising model in the Dobrushin regime (where the correlations are sufficiently weak and the model approximately behaves like a product measure), by conditioning on a subset of the nodes. To describe this, we need the following definition: Definition 10 Suppose that σ { 1, 1}N is a sample from the Ising model: Pβ,h(σ) exp Mukherjee, Niu, Halder, Bhattacharya, and Michailidis where h = (h1, h2, . . . , hn) Rn and D is a symmetric matrix with zeros on the diagonal with sup N 1 D R. Moreover, suppose that with probability 1, min 1 i N Var(σi|σ i) Υ, for some Υ 0. Then, the model (32) is referred to as an (R, Υ)-Ising model. Recently, Dagan et al. (2021) developed a technique for reducing an (R, Υ)-Ising model to an (η, Υ)-Ising model, for 0 < η < R, by conditioning on a subset of vertices. As a consequence, by choosing η one can ensure that the conditional model is in the Dobrushin high-temperature regime. Although the Ising model studied in Dagan et al. (2021) is different, the same proof extends to our model (3) as well. We formalize this in the following lemma, which is proved in Appendix D.2. Lemma 11 Fix R > 0 and η (0, R). Let X { 1, 1}N be a sample from an (R, Υ)- Ising model. Then there exist subsets I1, . . . , Iℓ [N] with ℓ R2 log N/η2 such that: 1. For all 1 i N, |{j [ℓ] : i Ij}| = ηℓ/8R 2. For all 1 j ℓ, the conditional distribution of XIj given XIc j := (Xu)u [N]\Ij is an (η, Υ)- Ising model. Furthermore, for any non-negative vector a RN, there exists j ℓsuch that i=1 ai. (33) We will apply the above result to our model (3). Towards this, set D = β 2 A and hi = θ Zi, for 1 i N in (32). Under this parametrization, (3) is an (R, Υ)-Ising model as shown below: Lemma 12 The model (3) is a (R, Υ)-Ising model with R = |β|/2 and any Υ = e 4ΘMs 4|β|. Proof Note that for every j [N], Var(Xj|X[N]\{j}) = 4p(1 p), where p = P(Xj = 1|X[N]\{j}). Now, denoting the elements of of the matrix D as (dij)1 i,j N, note that p = exp hj + 2 P v [N]\{j} djv Xv 2 cosh hj + 2 P v [N]\{j} djv Xv . Then using the inequality ex 2 cosh(x) 1 2e 2|x| gives, min{p, 1 p} 1 v [N]\{j} djv Xv Logistic Regression Under Network Dependence Next, using hj + 2 X v [N]\{j} djv Xv v [N]\{j} djv Xv |θ Zj| + 2 D ΘMs + |β|, it follows from (34) that min{p, 1 p} 1 2e 2ΘMs 2|β|. Hence, we have: Var(Xj|X[N]\{j}) e 4ΘMs 4|β|. This completes the proof of the lemma, since D |β| (since A 1 by (23)). By the above lemma, model (3) is an (R, Υ)-Ising model, with R := |β|/2 and Υ = e 4ΘMs 4|β|. Next, choose and suppose I1, . . . , Iℓare subsets of [N] as in Lemma 11. Then, defining ℓ := ηℓ/R we get, ℓ max r [ℓ] |Qr(X)| , (35) i Ir φi(X), (36) for r [ℓ]. Similarly, it follows that, for s [d], i=1 φi,s(X) i Ir φi,s(X) i Ir φi,s(X) ℓ max r [ℓ] |Qr,s(X)| , (37) Qr,s(X) := X i Ir φi,s(X). (38) The following lemma shows that the functions Qr and Qr,s are Lipschitz in the Hamming metric. The proof is given in Appendix D.1. 
Lemma 13 For r [ℓ] and s [d], let Qr and Qr,s be as defined in (36) and (38), respectively. Then for any two vectors X, X CN differing in just the k-th coordinate, the following hold: (1) For r [ℓ], |Qr(X) Qr(X )| 2|β| + 6 Mukherjee, Niu, Halder, Bhattacharya, and Michailidis (2) Similarly, for r [ℓ] and s [d], |Qr,s(X) Qr,s(X )| 2|Zk,s| i=1 |Zi,saik| =: ck. Using the above result together with Lemma 12, we can now establish the concentrations of Qr(X) and Qr,s(X), conditional on XIcr. To this end, recalling the definition of φi( ) from (30) note that E Qr(X)|XIcr = 0, for r [ℓ]. Moreover, by Lemma 12, X|XIcr is an (η, Υ)-Ising model, where η 1 16. Therefore, since Qr is Oβ(1/N)-Lipschitz (by Lemma 13), applying Theorem 4.3 and Lemma 4.4 in Chatterjee (2016) gives, for every t > 0, P |Qr(X)| t XIcr 2e Oβ(Nt2). (39) Similarly, recalling (31), it follows that for each r [ℓ] and s [d], E(Qr,s(X)|XIcr) = 0. Then, since PN k=1 c2 k = OM(1) under Assumptions 1 and 4, Lemma 13 together with Theorem 4.3 and Lemma 4.4 in Chatterjee (2016) gives, for every t 0, P |Qr,s(X)| t XIcr 2e Oβ,M(Nt2). (40) Hence, combining (35), (39), (37), (40), and Lemma 11 (which implies that ℓ= O(log N)) gives, 2e Oβ(Nt2) and P i=1 φi,s(X) 2e Oβ,M(Nt2), (41) for each s [d]. It thus follows from (41) and a union bound, that P ( LN(γ) t) 2(d + 1)e KNt2, (42) for some constant K > 0, depending on β and M. Now, choosing t = λ log(d + 1)/N in (42) above gives, 2(d + 1)1 Kδ2 whenever δ2 < 4/K. This completes the proof of Lemma 8. A.2 Proof of Lemma 9 Define the following (d + 1) (d + 1) dimensional matrix, m m m Z Z m Z Z The key step towards proving Lemma 9 is to show that the lowest eigenvalue of 2LN is bounded away from 0 with high probability. Logistic Regression Under Network Dependence Lemma 14 There exists a constant C > 0 (depending only on s, Θ and M), such that P λmin(G) C A 2 F N 1 e Ω( A 4 F /N), (44) as N, d , such that d = o( A 2 F ). The proof of Lemma 14 is given in Section A.2.1. We first show it can be used to complete the proof of Lemma 9. To this end, by a second order Taylor series expansion, we know that there exists α (0, 1) and γ = (β, θ ) = αγ + (1 α)ˆγ such that LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) = 1 2(ˆγ γ) 2LN(γ)(ˆγ γ) (ˆγ γ) Ui U i (ˆγ γ) cosh2(βmi(X) + θ Zi) , (45) where Ui := (mi(X), Z i ) . Now, note that: |βmi(X)| |β||mi(X)| |β|||A|| |β| + |ˆβ β| |β| + ˆγ γ 1 (46) |θ Zi| θ 1 Zi M ( θ θ 1 + θ 1) M( ˆγ γ 1 + sΘ). (47) Since cosh is an even function and increasing on the positive axis, we obtain (ˆγ γ) Ui U i (ˆγ γ) cosh2(βmi(X) + θ Zi) (ˆγ γ) G(ˆγ γ) 2 cosh2(|β| + (M + 1)( ˆγ γ 1) + s MΘ), (48) where m := (m1(X), . . . , m N(X)) , Z = (Z1, . . . , ZN) , and G is as defined in (43). Combining (45) and (48) gives, LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) = 1 2(ˆγ γ) 2LN(γ)(ˆγ γ) (ˆγ γ) G(ˆγ γ) 2 cosh2(|β| + (M + 1)( ˆγ γ 1) + s MΘ). (49) Next, we establish a high probability upper bound on ˆγ γ 1 whenever the conditions of Theorem 1 are satisfied. Lemma 15 Suppose (44) holds. Then, for λ := δ p log(d + 1)/N as in Lemma 8, ˆγ γ 1 = Os with probability 1 o(1), whenever N, d such that log d = o( A 4 F /N), Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Proof By the convexity of the function LN it follows from (26) that λ( θ 1 ˆθ 1) LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) LN(γ) ˆγ γ 1 where the last step uses LN(γ) λ 2, for X EN. Recall from (28) that ˆθ 1 θ 1 (ˆθ θ)Sc 1 (ˆθ θ)S 1. 
Therefore, from (50), we have: (ˆθ θ)Sc 1 (ˆθ θ)S 1 ˆθ 1 θ 1 ˆγ γ 1 = (ˆθ θ)S 1 2 + (ˆθ θ)Sc 1 This means, (ˆθ θ)Sc 1 3( (ˆθ θ)S 1 + |ˆβ β|), and hence, (ˆγ γ) 1 4( (ˆθ θ)S 1 + |ˆβ β|). (51) Denote K := (ˆθ θ)S 1 + |ˆβ β|. By the Cauchy-Schwarz inequality, i S (ˆθi θi)2 + (ˆβ β)2 s + 1||ˆγ γ||2. (52) Next, for t [0, 1], let γt := tˆγ + (1 t)γ, and g(t) := (ˆγ γ) LN(γt). Then |g(1) g(0)| = (ˆγ γ) ( LN(ˆγ) LN(γ)) ˆγ γ 1 LN(ˆγ) LN(γ) . (53) g (t) = (ˆγ γ) 2LN(γt)(ˆγ γ) (ˆγ γ) Ui U i (ˆγ γ) cosh2(βtmi(X) + θ t Zi) (where Ui := (mi(X), Z i ) ) (ˆγ γ) G(ˆγ γ) cosh2(|β| + (M + 1) γt γ 1 + s MΘ) (by (46) and (47)) ˆγ γ 2 2 λmin(G) cosh2(|β| + (M + 1) γt γ 1 + s MΘ) C ˆγ γ 2 2 A 2 F N cosh2(|β| + 4|t|(M + 1)K + MsΘ), Logistic Regression Under Network Dependence where the last step uses (44) (which holds with probability 1 o(1)), C is as in Lemma 14, and γt γ 1 = |t| (ˆγ γ) 1 4|t|K (by (51)). Hence, |g(1) g(0)| g(1) g(0) = Z 1 Z min{1, 1/K} s A 2 F N ˆγ γ 2 2 min{1, 1/K}. (54) Combining (53) and (54) gives, min{K, 1} s KN ˆγ γ 1 A 2 F ˆγ γ 2 2 LN(ˆγ) LN(γ) K2N A 2 F ˆγ γ 2 2 LN(ˆγ) LN(γ) (by (51)) s N A 2 F LN(ˆγ) LN(γ) , (55) using (52). Now, recall that, by Lemma 8, with probability 1 o(1), LN(γ) δ p log(d + 1)/N and LN(ˆγ) p log(d + 1)/N. Applying this in (55) shows that with probability 1 o(1), min{K, 1} = Os This implies, K = (ˆθ θ)S 1 + |ˆβ β| = Os with probability 1 o(1), whenever N, d such that log d = o( A 4 F /N). Therefore, by (51), ˆγ γ 1 = Os with probability 1 o(1). Using Lemma 14 and Lemma 15 in (49) it follows that, there exists κ (as the statement of Lemma 9) such that LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) (ˆγ γ) G(ˆγ γ) 2 cosh2(|β| + (M + 1) ˆγ γ 1 + s MΘ) β,M,κ A 2 F ˆγ γ 2 2 N with probability 1 o(1), as N, d such that d = o( A 2 F ) and log d = o( A 4 F /N). This completes the proof of Lemma 9. Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Remark 16 Note that if one assumes ||Zi||2 M, for all 1 i N, and ||θ||2 Θ (recall the discussion in Remark 2) then, for θ as in (47), by the Cauchy-Schwarz inequality, |θ Zi| θ 2 Zi 2 M ( θ θ 2 + θ 2) M( ˆγ γ 1 + Θ). Using this bound and (46) we get, (ˆγ γ) Ui U i (ˆγ γ) cosh2(βmi(X) + θ Zi) (ˆγ γ) G(ˆγ γ) 2 cosh2(|β| + (M + 1)( ˆγ γ 1) + MΘ). Note that the bound in the RHS above does not have any dependence on s in the cosh term (unlike in (48)). Hence, by the same arguments as before we can now get the following rate where the dependence on s matches that in the classical high-dimensional logistic regression: with probability 1 o(1), as N, d such that d = o(N). A.2.1 Proof of Lemma 14 The first step towards proving Lemma 14 is to observe that: det(G λI) = 1 N F m 2 2 λ det 1 where F := I Z(Z Z) 1Z . Hence, λmin(G) = min λmin N Z Z , 1 N F m 2 2 In view of Assumption 2, to prove Lemma 14 it suffices to show that there exists a constant C > 0 (depending only on s, Θ and M), such that N F m 2 2 C A 2 F N = 1 e Ω( A 4 F /N). (56) To this end, it suffices to prove the following conditional version of (56): N F m 2 2 C A 2 F N XJc = 1 e Ω( A 4 F /N) (57) where J is a suitably chosen subset of [N] and XJc := (Xi)i Jc. To this end, note that (3) is a (|β| A /2, Υ)-Ising model (Lemma 12). Now, applying Lemma 11 with Logistic Regression Under Network Dependence gives subsets I1, . . . , Iℓ [N] such that, for all 1 j ℓ, the conditional distribution of XIj given XIc j := (Xu)u [N]\Ij is an (η, Υ)-Ising model. Furthermore, for any vector a, there exists j [ℓ] such that a Ij 1 η 4|β| A a 1. 
(58) The proof of (57) now proceeds in the following two steps: First, we show that there exists j [ℓ] such that the expectation of N 1 F m 2 2 conditioned on XIj is Ω(1). Subsequently, we establish that conditioned on XIj, N 1 F m 2 2 concentrates around its conditional expectation. These steps are verified in Lemmas 17 and 18, respectively. Lemma 17 Under the assumptions of Theorem 1, there exists J {I1, I2, . . . , Iℓ} such that for all N 1, N F m 2 2 XJc C A 2 F N , where C > 0 is a constant depending only on Θ, M and s. Proof For any (d + 1) n dimensional matrix M, we will denote the i-th row of M by Mi and the i-th largest singular value of M by σi(M), for 1 i d + 1. Also, for J [N] denote (F A)i,J := ((F A)i,j)j J. Note that for any J {I1, I2, . . . , Iℓ} [N], since m = AX, E F m 2 2 XJc = i=1 E [(F A)i X]2 XJc i=1 Var (F A)i X XJc i=1 Var (F A)i,JXJ XJc i=1 (F A)i,J 2 2, (59) where the last step follows from Lemma 23. Now define a vector a = (a1, a2, . . . , a N) , with ai = (F A) ,i 2 2, where (F A) ,i denotes the i-th column of the matrix F A. Then by (58) there exists J {I1, I2, . . . , Iℓ} [N] such that i=1 (F A)i,J 2 2 η 4|β| A i=1 (F A) ,i 2 2. Therefore, by (59), E F m 2 2 XJc Υ2 |β| A F A 2 F s,B,M F A 2 F . (60) By Theorem 2 in Wang and Xi (1997), we get t=1 σ2 t (F A) t=1 σ2 t (F ) σ2 N t+1(A) (61) Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Since F is idempotent with trace N d, it follows that σ1(F ) = . . . = σN d(F ) = 1 and σN d+1(F ) = . . . = σN(F ) = 0. Hence, we have from (61), t=1 σ2 t (F A) t=1 σ2 N t+1(A) = A 2 F i=1 σ2 i (A). (62) where the last step uses σ2 i (A) 1 for all 1 i N, since A 2 A 1. Applying the bound in (62) to (60) gives, E F m 2 2 XJc Υ2 A 2 F d . (63) Lemma 17 now follows from the hypothesis d = o( A 2 F ). Next, we show that F m 2 2 concentrates around its conditional expectation E( F m 2 2|XJc), for the set J as defined above. Lemma 18 For any t > 0 and J {I1, I2, . . . , Iℓ}, P F m 2 2 < E( F m 2 2|XJc) t XJc 2 exp C min t2 where C is a constant depending only on Θ, M and s. Proof Denote W := F A and let H be the matrix obtained from W W by zeroing out all its diagonal elements. Moreover, throughout the proof we will denote I := Jc. Clearly, F m 2 2 E( F m 2 2|XI) = X HX E X HX|XI . By permuting the indices let us partition the vector X as (X I , X J ) and the matrix H as: HI,I HI,J H I,J HJ,J where for two subsets A, B of [N], we define HA,B := ((Hij))i A,j B. Note that X HX E X HX|XI = X J HJ,JXJ E X J HJ,JXJ|XI + 2X I HI,JXJ 2E X I HI,JXJ|XI . (65) Since X|XI is an (η, Υ)-Ising model, Example 2.5 in Adamczak et al. (2019) implies (by taking the parameters α and ρ in Adamczak et al. (2019) to be ΘMs + 1/16 and 7/8, respectively), P X J HJ,JXJ < E X J HJ,JXJ|XI t XI 2 exp c min t2 HJ,J 2 F + E(HJ,JXJ) 2 2 , t HJ,J 2 Logistic Regression Under Network Dependence where c is a constant depending only on Θ, M and s. Clearly, one has HJ,J 2 F H 2 F and, since the spectral norm of a matrix is always greater than or equal to the spectral norm of any of its submatrices, HJ,J 2 H 2. Moreover, if X 0 := (0 , X J ), then HJ,JXJ is a subvector of HX0, and hence, E(HJ,JXJ) 2 2 E(HX0) 2 2. 
Combining all these, we have from (66) P X J HJ,JXJ < E X J HJ,JXJ|XI t XI 2 exp c min t2 H 2 F + E(HX0) 2 2 , t H 2 Next, note that, since H W W is a diagonal matrix, H 2 F + E(HX0) 2 2 H 2 F + E(W W X0) 2 + E[(H W W )X0] 2 2 2 H 2 F + 2E [(H W W )X0] 2 2 + 2 E(W W X0) 2 2 1 i =j N (W W )2 ij + 2 i=1 (W W )2 ii + 2 E(W W X0) 2 2 = 2 W W 2 F + 2 E(W W X0) 2 2. (68) We also have H 2 W W 2, since for any vector a RN, a Ha = a W W a i=1 (W W )iia2 i a W W a. Hence, it follows from (67) and (68) that P X J HJ,JXJ < E X J HJ,JXJ|XI t XI 2 exp c min t2 W W 2 F + E(W W X0) 2 2 , t W W 2 where c := c/2. Next, recall that for any two matrices U and (ˆγ γ) such that U(ˆγ γ) exists, we have U(ˆγ γ) F U 2 ˆγ γ F . This implies, W W 2 F W 2 2 W 2 F , (70) Moreover, by the submultiplicativity of the matrix ℓ2 norm, W W 2 W 2 W 2 = W 2 2 (71) and E(W W X0) 2 2 = W E(W X0) 2 2 W 2 2 E(W X0) 2 2. (72) Mukherjee, Niu, Halder, Bhattacharya, and Michailidis It follows from (69), (70), (71) and (72), that: P X J HJ,JXJ < E X J HJ,JXJ|XI t XI F A 2 2 min t2 F A 2 F + E [W X0] 2 2 , t . (73) Since F A 2 2 F 2 2 A 2 2 1, F A 2 F N F A 2 2 N, and E [W X0] 2 2 E W X0 2 2 F A 2 2 E X0 2 2 N, we can conclude from (73) that: P X J HJ,JXJ < E X J HJ,JXJ|XI t XI 2 exp c min t2 2N , t . (74) Next, let us define y := H I,JXI. Then, X I HI,JXJ E X I HI,JXJ|XI = y XJ E(y XJ|XI). By Lemma 4.4 in Chatterjee (2016), Dobrushin s interdependence matrix for the model XJ|XI is given by 8D, where D denotes the interaction matrix for the Ising model XJ|XI. Since 8D 2 1 2, by Theorem 4.3 in Chatterjee (2016), P y XJ < E(y XJ|XI) t XI 2 exp t2 Using the submultiplicativity of the matrix ℓ2 norm and the fact that the spectral norm of a matrix is always greater than or equal to the spectral norm of any of its submatrices gives, y 2 2 = H I,JXI 2 2 H I,J 2 2 XI 2 2 N H 2 2 N W 4 2 = N F A 4 2 N. (76) Combining (75) and (76), P y XJ < E(y XJ|XI) t XI 2 exp t2 Finally, combining (74), (77), and (65) gives, P X HX < E X HX|XI t XI 2 exp c min t2 This completes the proof of Lemma 18. To complete the proof of (57), we choose J as in Lemma 17 and apply Lemma 18 with t = 1 2E( F m 2 2|XJc). This implies, there exists a constant C, depending only on Θ, M and s, such that N F m 2 2 C A 2 F N XJc = 1 e Ω( A 4 F /N). This proves (57) and completes the proof of Lemma 14. Logistic Regression Under Network Dependence Appendix B. Proof of Theorem 2 Recall the definition of the log-pseudo-likelihood function Lβ,N( ) from (10). As before, the proof of Theorem 1 entails showing the following: (1) concentration of the gradient of Lβ,N and (2) restricted strong concavity of Lβ,N. The concentration of the gradient Lβ,N follows by arguments similar to Lemma 8. Towards this, recall the definition of the functions φi,s : CN R, for 1 i N and 1 s d, from (31). Note that Lβ,N = ( Lβ,N θ1 , . . . , Lβ,N θd ) , where Lβ,N θs = PN i=1 φi,s(X) for 1 s d. For k [N], recall from Lemma 13 the definition of ck := 2|Zk,s| i=1 |Zi,saik| , where s [d]. Therefore, for s [d], k=1 c2 k 1 N2 k=1 Z2 k,s + β2 i=1 |Zi,saik| N max 1 s d 1 N k=1 Z2 k,s + β2 i=1 Z2 i,s|aik| (by the Cauchy-Schwarz inequality) N max 1 s d 1 N k=1 Z2 k,s + max 1 s d 1 N k=1 Z2 k,s + β2 A 2 1 max 1 s d 1 N where the last holds with probability 1 under Assumptions 1 and 4. Then by analogous arguments as in Lemma 8 it follows that there exists δ > 0 such that with λ := δ p log d/N the following holds: P Lβ,N(θ) > λ = o(1), (78) where the o(1)-term goes to infinity as d . 
To establish the strong concavity of Lβ,N we consider the second-order Taylor expansion: For any η Rd, Lβ,N(θ + η) Lβ,N(θ) Lβ,N(θ) η = 1 2η 2Lβ,N(θ + tη)η, Mukherjee, Niu, Halder, Bhattacharya, and Michailidis for some t (0, 1). Computing the Hessian matrix 2Lβ,N(θ + tη) gives, Lβ,N(θ + η) Lβ,N(θ) Lβ,N(θ) η η Zi Z i η cosh2(βmi(X) + (θ + tη) Zi) η Zi Z i η cosh2(|β| A + (θ + tη) Zi) (using |βmi(X)| |β|||A|| ) i=1 η, Zi 2 1 cosh2(|β| A + (θ + tη) Zi) i=1 η, Zi 2ψ( θ, Zi + t η, Zi ), (79) where ψ(x) := sech2(|β| A + x). Proposition 19 Suppose the assumptions of Theorem 2 hold. Then there exists positive constants ν, c0 (depending on κ1, κ2, β, and Θ) such that for all t (0, 1), i=1 η, Zi 2ψ( θ, Zi + t η, Zi ) ν η 2 2 c0RN η 2 1, with probability at least 1 o(1), for all η Rd with η 2 1. The proof of Proposition 19 is given in Section B.1. Here we use it to complete the proof of Theorem 2. Note that by (79) and Proposition 19, for any η Rd with η 2 1, Lβ,N(θ + η) Lβ,N(θ) Lβ,N(θ) η ν η 2 2 c0RN η 2 1 This establishes the restricted strong convexity (RSC) property for pseudo-likelihood loss function Lβ,N( ). Therefore, by Wainwright (2019, Corollary 9.20), whenever RNs = s p log d/n = o(1), then when the event { LN(ˆθ) λ 2} happens, ˆθ θ 2 2 ν sλ2 ν,δ s log d n and ˆθ θ 1 ν sλ ν,δ s Since the event { LN(ˆθ) λ 2} happens with probability 1 o(1) by (78) (jointly over the randomness of the data and the covariates), the bounds in (80) hold with probability 1 o(1), which completes the proof of Theorem 2. B.1 Proof of Proposition 19 Consider a fixed vector η Rd with η 2 = r (0, 1] and set L = L(r) := Kr, for a constant K > 0 to be chosen. Since the function φL(u) := u2I{|u| 2L} u2 and ψ is Logistic Regression Under Network Dependence i=1 η, Zi 2ψ( θ, Zi + t η, Zi ) 1 i=1 ψ( θ, Zi + t η, Zi )φL( η, Zi )1 {| θ, Zi | T} , where T is second truncation parameter to be chosen. Note that on the events {| η, Zi | 2L} and {| θ, Zi | T} one has, | θ, Zi + t η, Zi | T + 2L T + 2K, since t, r [0, 1]. This implies, i=1 η, Zi 2ψ( θ, Zi + t η, Zi ) i=1 ψ( θ, Zi + t η, Zi )φL( η, Zi )1 {| θ, Zi | T} i=1 φL( η, Zi )1 {| θ, Zi | T} , (81) where γ := min|u| T+2K ψ(u) > 0. Based on this lower bound, to prove Proposition 19 it is sufficient to show that for all r (0, 1] and for η Rd with η 2 = r, i=1 φL(r)( η, Zi )1 {| θ, Zi | T} c1r2 c2RN η 1r, (82) holds with probability 1 o(1), where L(r) = Kr and some positive constants c1, c2. This is because when (82) holds by substituting η 2 = r and using (81) gives, i=1 η, Zi 2ψ( θ, Zi + t η, Zi ) ν η 2 2 c0RN η 2 1, where the constants (ν, c0) depend on (c1, c2, γ) and we use the inequality η 2 η 1. In fact, it suffices to prove (82) for r = 1, that is, i=1 φK( η, Zi )1 {| θ, Zi | T} c1 c2RN η 1, (83) holds with probability 1 o(1). This is because given any vector in Rd with η 2 = r > 0, we can apply (83) to the rescaled unit-norm vector η/r to obtain i=1 φK( η/r, Zi )1 {| θ, Zi | T} c1 c2RN η 1 Noting that φK(u/r) = φL(1)(u/r) = (1/r)2φL(r)(u) and multiplying both sides of (84) by r2 gives (82). Mukherjee, Niu, Halder, Bhattacharya, and Michailidis B.1.1 Proof of (83) Define a new truncation function: φK(u) = u21 {|u| K} + (u 2K)21 {K < u 2K} + (u + 2K)21 { 2K u < K} . Note this function is Lipschitz with parameter 2K. Moreover, since φK φK, to prove (83) it suffices to show that for all vectors η Rd of unit norm, i=1 φK( η, Zi )1 {| θ, Zi | T} c1 c2RN η 1, (85) holds with probability 1 o(1). 
To this end, for a given ℓ1 radius b 1, define the random variable Yn(b) := sup η 2=1 b 2 η 1 b i=1 φK( η, Zi )1 {| θ, Zi | T} E φK( η, Z )1 {| θ, Z | T} . Lemma 20 Under the assumptions of Theorem 2 the following hold: (1) By choosing K2 = 8κ2/κ1 and T 2 = 8κ2Θ2/κ1, E φK( η, Z )1 {| θ, Z | T} 3 (2) There exist a positive constant c2 such that P Yn(b) > 1 2c2RNb e Oκ1,κ2(n). Lemma 20 implies that with probability 1 e Oκ1,κ2(n) for all η Rd such that η 2 = 1 and b 2 η 1 b, we have i=1 φK( η, Zi )1 {| θ, Zi | T} E φK( η, Z )1 {| θ, Z | T} 1 4κ1 c2RN η 1. (86) This establishes the bound (85) with c1 = κ1/4 for all vectors η Rd with η 2 = 1 and b 2 η 1 b, with probability 1 e Oκ1,κ2(n). We first prove Lemma 20. Then we will extend the bound in (85) for all vectors η Rd with η 2 = 1 using a peeling strategy to complete the proof. Logistic Regression Under Network Dependence Proof of Lemma 20 (1) To show Lemma 20 (1) it suffices to show that E φK( η, Z ) 7 8κ1 and E φK( η, Z )1 {| θ, Z | > T} 1 Indeed, if these two inequalities hold, then E φK( η, Z )1 {| θ, Z | T} = E φK( η, Z ) E φK( η, Z )1 {| θ, Z | > T} To prove first inequality in (87), note that E φK( η, Z ) E η, Z 21 {| η, Z | K} = E η, Z 2 E η, Z 21 {|η, Z| > K} κ1 E η, Z 21 {| η, Z | > K} . (88) To lower bound the second term, applying the Cauchy-Schwarz inequality yields, E η, Z 21 {| η, Z | > K} p E ( η, Z 4) p P (| η, Z | > K) K2 (by Markov s inequality) using the assumption E( η, Z 4) κ2, for all η such that η 2 1. Therefore, setting K2 = 8κ2/κ1 and using (88) proves the first inequality in (87). Now, we turn to prove the second inequality in (87). For this note that φK ( η, Z ) η, Z 2 and by Markov s inequality, P (| θ, Z | T) κ2 θ 4 2 T 4 . Then the Cauchy-Schwarz inequality implies, E φK ( η, Z ) 1 {| θ, Z | > T} κ2 θ 2 2 T 2 κ2Θ2 Thus setting T 2 = 8κ2Θ2/κ1 shows the second inequality in (87). Proof of Lemma 20 (2) We begin by recalling the functional Hoeffding s inequality. Towards this, let F be a symmetric collection of functions from Rd to R, that is, if f F, then f F. Suppose X1, X2, . . . , XN are i.i.d. from a distribution supported on X Rd and let Z := sup f F Then we have the following result: Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Lemma 21 ((Wainwright, 2019, Theorem 3.26)) For each f F assume that there are real numbers af bf such that f(x) [af, bf] for all x X. Then for all δ 0, P (Z E(Z) δ) exp nδ2 where L2 := supf F (bf af)2 . For η Rd such that η 2 = 1 and η 1 b, define fη(Z) := φK( η, Z )1 {| θ, Z | T} E φK( η, Z )1 {| θ, Z | T} Yn(b) = sup η 2=1 b 2 η 1 b Since |fη(z)| K2, for any positive constant c3 by Lemma 21, P Yn(b) E(Yn(b)) c3RNb + 1 e c4n R2 Nb2 c4n e c4n, (89) where c4 is a positive constant depending on K and c3. Now, suppose {εi}n i=1 be a sequence of i.i.d. Rademacher variables. Then a symmetrization argument (Wainwright, 2019, Proposition 4.11) implies that E (Yn(b)) 2EZ,ε sup η 2=1 b 2 η 1 b i=1 εi φK ( η, Zi ) 1 {| θ, Zi | T} Since 1 {| θ, Zi | T} 1 and φK is Lipschitz with parameter 2K, the contraction principle yields E (Yn(b)) 8KEZ,ε sup η 2=1 b 2 η 1 b i=1 εi η, Zi (where η := η/ η 1) where the final step follows by applying H older s inequality. Then choosing c2 = 32K gives, P Yn(b) > 1 2c2RNb P Yn(b) E(Yn(b)) > 1 2κ1 + 8Kb RN e Oκ1,κ2(n), using (89) with c3 = 8K. This proves Lemma 20 (2). Logistic Regression Under Network Dependence Final Details: Peeling Strategy Recall from (86) that we have proved (85) holds for any fixed b such that b 2 η 1 b and η 2 1. 
The final step in the proof of Proposition 19 is to remove the restriction on the ℓ1 norm of η. For this, we apply a peeling strategy as in the proof of Wainwright (2019, Theorem 9.34). To this end, consider the set Sℓ:= η Rd : 2ℓ 1 η 1 η 2 2ℓ n η Rd : η 2 1 o , for ℓ= 1, . . . , log2( d) . Then for η Rd Sℓby (86), i=1 η, Zi 2ψ( θ, Zi + t η, Zi ) ν η 2 2 c0RN η 2 1, with probability at least 1 e Oκ1,κ2(n). Then by a union bound, i=1 η, Zi 2ψ( θ, Zi + t η, Zi ) ν η 2 2 c0RN η 2 1, for all η Rd such that η 2 1, with probability at least 1 log2(d) e Oκ1,κ2(n) = 1 o(1). This completes the proof of Proposition 19. Appendix C. Proofs from Section 3 In this section, we prove the results stated in Section 3. We begin with the proof of Theorem 3 in Section C.1. Corollary 5 is proved in Section C.2. C.1 Proof of Theorem 3 The proof of Theorem C.1 follows along the same lines as in Theorem 1, so to avoid repetition we sketch the steps and highlight the relevant modifications. As in the proof of Theorem C.1, the first step is to show the concentration of LN(γ). Towards this, following the proof of Lemma 13 shows that Qr (as defined in (36)) is Oβ(poly(dmax)/N)-Lipschitz, for each r [ℓ]. Therefore, by arguments as in Lemma 13, P max r [ℓ] |Qr(X)| tℓ e Oβ,M(Nt2/poly(dmax)). since ℓ= O(log N) and ℓ /ℓ= Θ(1). In this case, following the notations in the proof of Lemma 8, we have ℓ/ℓ = Θ(dmax). Similarly, for s [d], i=1 φi,s(X) e Oβ,M(Nt2/poly(dmax)). Mukherjee, Niu, Halder, Bhattacharya, and Michailidis Hence, choosing λ := δpoly(dmax) p log(d + 1)/N and t := λ/2 gives, for some constant C depending only on β and M, (d + 1) Cδ2/4 and P i=1 φi,j(X) (d + 1) Cδ2/4, for all j [s]. A final union bound over the (d + 1) coordinates now shows that P LN(γ) > λ = o(1), (90) where λ := δpoly(dmax) p log(d + 1)/N as above and the o(1)-term goes to infinity as d . This establishes the concentration of the gradient. Next, we need to show the strong concavity of the Hessian. To this end, first note that since |E(GN)| = O(N) and the number of non-isolated vertices of GN is Ω(N), there exist constants L1, L2 > 0, such that |E(GN)| L1N, and the number of non-isolated vertices of GN is larger than L2N, for all N large enough. For D 1 define, VN(D) := {v V (GN) : dv [1, D]}. |VN(D)| N(1 (2L1/D)) N(1 L2) = N(L2 (2L1/D)), since GN has at least N(1 (2L1/D)) vertices with degree not exceeding D, among which at most N(1 L2) are isolated. Hereafter, we choose D := 4L1/L2 , so that |VN(D)| L2N/2. Then, with notations as in (45) we have, LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) 2(ˆγ γ) 2LN(γ)(ˆγ γ) (ˆγ γ) Ui U i (ˆγ γ) cosh2(βmi(X) + θ Zi) (ˆγ γ) Ui U i (ˆγ γ) cosh2(βmi(X) + θ Zi) (ˆγ γ) Ui U i (ˆγ γ) cosh2(|β|D + (D + M) ˆγ γ 1 + s MΘ), (91) where the last step uses the bound in (47) and the bound |βmi(X)| |β||mi(X)| |β|D |β|D + D ˆγ γ 1. Next, using the bound |VN(D)| L2N/2 in (91) gives, LN(ˆγ) LN(γ) LN(γ) (ˆγ γ) L2 4 cosh2(|β|D + (D + M) ˆγ γ 1 + s MΘ) 1 |VN(D)| i VN(D) (ˆγ γ) Ui U i (ˆγ γ) = L2 4 cosh2(|β|D + (D + M) ˆγ γ 1 + s MΘ)(ˆγ γ) G(ˆγ γ), (92) Logistic Regression Under Network Dependence where G := 1 |VN(D)| m m m Z Z m Z Z with m := (mi(X)) i VN(D) and Z = (Zi) i VN(D). The calculation in (92) implies that in order to establish the strong concavity of the Hessian in the setting of Theorem 3, we need to show λmin( G) 1 with probability going to 1, where λmin( G) denotes the minimum eigenvalue of G. For this step, we follow the steps of Lemma 14 with the matrix F replaced by F := I Z( Z Z) 1 Z . 
Then, repeating the proof of (63) in Lemma 17, with A replaced by A := A|VN(D) [N], we can find a set J [N] such that E F m 2 2 XJc Υ2 A 2 F d polylog(N) , since A 2 A = O(polylog(N)). This implies, E F m 2 2 XJc N, (93) since A 2 F |VN(D)| N (because every vertex in VN(D) has degree at least 1) and d = o(N). The final step is to establish that F m concentrates around F m 2 2 XJc, conditional on XJc. This follows by repeating the proof of Lemma 18, which introduces an extra 1/polylog(N) factor in each of the two exponential terms in the RHS of (64), since A 2 A = O(polylog(N)). This, combined with (93), shows P(λmin( G) C) 1 e Ω(N/polylog(N)), (94) for some constant C > 0. The proof of Theorem 3 can be now completed using (90) and (94), as in Theorem 1. C.2 Proof of Corollary 5 To prove Corollary 5 we verify that the hypotheses of Theorem 3 are satisfied. Note that we can write |E(GN)| = X 1 u 0. Then for any vector a RN, Var(a X) a 2 2Υ2 Proof First, consider the case R Υ/32 1/32. In this case, 1 8R 3/4, so Now, it follows from Lemma 22 that for every i, j [N]\{i} |E (Xj|Xi = 1) E (Xj|Xi = 1)| X j [N]\{i} W1 PXj|Xi=1, PXj|Xi= 1) W1 PX i|Xi=1, PX i|Xi= 1 16R 1 8R. (106) Next, note that: Cov(Xi, Xj) = E[(Xi EXi)Xj] = E [(Xi EXi)E(Xj|Xi)] = P(Xi = 1)(1 EXi)E(Xj|Xi = 1) P(Xi = 1)(1 + EXi)E(Xj|Xi = 1) 2 (1 EXi)E(Xj|Xi = 1) 1 EXi 2 (1 + EXi)E(Xj|Xi = 1) 2[1 (EXi)2] [E(Xj|Xi = 1) E(Xj|Xi = 1)] 2 [E(Xj|Xi = 1) E(Xj|Xi = 1)] (107) Logistic Regression Under Network Dependence Combining (105), (106) and (107), we have for every i, j [N]\{i} |Cov(Xi, Xj)| 8R 1 8R Υ Hence, we have by (108), i=1 a2 i Var(Xi) X i =j |aiaj Cov(Xi, Xj)| i=1 a2 i Var(Xi) X (a2 i + a2 j) |Cov(Xi, Xj)| j [N]\{i} |Cov(Xi, Xj)| 3 a 2 2 a 2 2Υ2 Now consider the case R > Υ/32. By Lemma 11, we choose a subset I of [N] such that conditioned on XIc, XI is a (Υ/32, Υ)- Ising model, and a I 2 2 Υ 256R a 2 2. (110) Hence, we have from (109) and (110), Var a X|XIc = Var a I XI|XIc 2Υ a I 2 2 3 Υ2 a 2 2 384R . (111) Lemma 23 now follows from (111) on observing that Var(a X) E Var a X|XIc . D.4 Lipschitz Condition for the Gradient of LN In this section we show that LN, the gradient of the pseudo-likelihood loss function LN defined in (5), is Lipschitz. Lemma 24 Suppose the design matrix Z := (Z1, . . . , ZN) satisfies λmax 1 N Z Z = O(1). Then there exists a constant L > 0 such that for any two γ1, γ2 Rd+1, || LN(γ1) LN(γ2)||2 L||γ1 γ2||2. Proof For any two γ1, γ2 Rd+1, there exists γ Rd+1 such that LN(γ1) LN(γ2) = (γ1 γ2) 2LN(γ ). Mukherjee, Niu, Halder, Bhattacharya, and Michailidis || LN(γ1) LN(γ2)||2 sup γ Rd+1 λmax( 2LN(γ ))||γ1 γ2||2, (112) where λmax( 2LN(γ )) is the largest eigenvalue of the Hessian matrix 2LN(γ ). Recall from (45) that 2LN(γ ) = 1 Ui U i cosh2(β mi(X) + (θ ) Zi), where γ = (β , (θ ) ) and Ui := (mi(X), Z i ) , for 1 i N. Using sech2(x) 1 it follows that sup γ Rd+1 λmax( 2LN(γ )) λmax Now, suppose w = (u, v ) Rd+1 be such that ||w||2 = 1. Then i=1 w Ui U i w = 1 i=1 (w Ui)2 = 1 i=1 (umi(X) + v Zi)2 n u2(mi(X))2 + (v Zi)2o i=1 mi(X)2 + 1 i=1 v Zi Z i v. Note that, since |mi(X)| ||A|| 1 (by (23)), for 1 i N, we have 1 N PN i=1 mi(X)2 1. Moreover, 1 N PN i=1 v Zi Z i v = 1 N v Z Zv λmax( 1 N Z Z) = O(1). This implies i=1 Ui U i ) = O(1), hence by (112) and (113), there exists a constant L > 0 such that || LN(γ1) LN(γ2)||2 L||γ1 γ2||2. This completes the proof of Lemma 24.
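Lemma 24 provides exactly the kind of smoothness condition under which a proximal gradient scheme with a constant step size converges, which is one natural way to compute the PMPL estimate. The following is a minimal sketch, in Python with NumPy, of such an update for the objective L_N(β, θ) + λ||θ||_1, with β left unpenalized as in (22). The function names, the cycle-graph toy data, and the step size are assumptions made here for illustration and are not claimed to match the implementation used in the paper's experiments.

```python
import numpy as np

def grad_LN(beta, theta, X, Z, A):
    """Gradient of the smooth average negative log-pseudo-likelihood
    L_N(beta, theta) = (1/N) * sum_i [ log(2*cosh(u_i)) - X_i * u_i ],
    where u_i = beta * m_i(X) + theta' Z_i and m_i(X) = sum_j A_ij X_j."""
    m = A @ X
    u = beta * m + Z @ theta
    r = np.tanh(u) - X                     # derivative of each summand with respect to u_i
    return np.mean(r * m), Z.T @ r / len(X)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pmpl_prox_grad(X, Z, A, lam, step=0.5, n_iter=1000):
    """Proximal gradient for L_N(beta, theta) + lam * ||theta||_1.
    A constant step size is appropriate because the gradient is Lipschitz
    whenever lambda_max(Z'Z / N) = O(1), as in Lemma 24."""
    beta, theta = 0.0, np.zeros(Z.shape[1])
    for _ in range(n_iter):
        g_beta, g_theta = grad_LN(beta, theta, X, Z, A)
        beta = beta - step * g_beta                                   # unpenalized coordinate
        theta = soft_threshold(theta - step * g_theta, step * lam)   # l1-penalized block
    return beta, theta

# toy usage: a cycle graph (every row sum of A equals 1) and random covariates;
# here X is just a vector of random signs, so the output is purely illustrative
rng = np.random.default_rng(1)
N, d = 300, 20
idx = np.arange(N)
A = np.zeros((N, N))
A[idx, (idx + 1) % N] = A[idx, (idx - 1) % N] = 0.5
Z = rng.normal(size=(N, d)) / np.sqrt(d)
X = rng.choice([-1.0, 1.0], size=N)
beta_hat, theta_hat = pmpl_prox_grad(X, Z, A, lam=np.sqrt(np.log(d + 1) / N))
```

A backtracking line search or an accelerated variant can replace the fixed step size; the Lipschitz bound in Lemma 24 only guarantees that some constant step size works.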