# Nonparametric Neighborhood Selection in Graphical Models

Journal of Machine Learning Research 23 (2022) 1-36. Submitted 2/22; Revised 7/22; Published 10/22.

Hao Dong (haodong@pstat.ucsb.edu)
Department of Statistics and Applied Probability, University of California, Santa Barbara, Santa Barbara, CA, USA

Yuedong Wang (yuedong@pstat.ucsb.edu)
Department of Statistics and Applied Probability, University of California, Santa Barbara, Santa Barbara, CA, USA

Editor: Daniela Witten

©2022 Hao Dong and Yuedong Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/22-0207.html.

Abstract

The neighborhood selection method directly explores the conditional dependence structure and has been widely used to construct undirected graphical models. However, except for some special cases with discrete data, there is little research on nonparametric methods for neighborhood selection with mixed data. This paper develops a fully nonparametric neighborhood selection method under a consolidated smoothing spline ANOVA (SS ANOVA) decomposition framework. The proposed model is flexible and contains many existing models as special cases. The proposed method provides a unified framework for mixed data without any restrictions on the type of each random variable. We detect edges by applying an $L_1$ regularization to interactions in the SS ANOVA decomposition. We propose an iterative procedure to compute the estimates and establish the convergence rates for the conditional density and interactions. Simulations indicate that the proposed methods perform well under Gaussian and non-Gaussian settings. We illustrate the proposed methods using two real data examples.

Keywords: conditional density estimation, mixed data, regularization, reproducing kernel Hilbert space, smoothing spline ANOVA

1. Introduction

Discovering conditional independence among random variables is an essential task in statistics. Undirected probabilistic graphical models play a pivotal role in characterizing conditional independence. They have been utilized in a wide range of scientific and engineering domains, including statistical physics, computer vision, machine learning, and computational biology (Koller and Friedman, 2009).

A graphical model is constructed based on an undirected graph $G = (V, E)$ with node set $V = \{1, \dots, p\}$ representing $p$ random variables $X_1, \dots, X_p$ and edge set $E \subseteq V \times V$ describing the conditional dependence among $X_1, \dots, X_p$. Let $X = (X_1, \dots, X_p)$, and let $X_{\setminus\{i_1,\dots,i_k\}}$ be the sub-vector of $X$ without the elements in $\{i_1, \dots, i_k\}$. Then $\{i, j\} \notin E$ corresponds to conditional independence between $X_i$ and $X_j$ given the other variables in $X$, denoted as $X_i \perp X_j \mid X_{\setminus\{i,j\}}$.

As the joint density ultimately determines the conditional relationships, methods for edge detection based on estimating the joint density have been proposed (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008; Hsieh et al., 2014; Liu et al., 2009). Under the Gaussian assumption $X \sim N(0, \Sigma)$, the task of edge detection reduces to the estimation of the precision matrix $\Sigma^{-1}$. Yuan and Lin (2007), Banerjee et al. (2008), and Friedman et al. (2008) proposed regularization methods that minimize the log-likelihood with an $L_1$ penalty on the entries of $\Sigma^{-1}$. Hsieh et al. (2014) proposed a fast second-order algorithm for solving the $L_1$-regularized Gaussian MLE.
Liu et al. (2009) extended the $L_1$-regularized Gaussian MLE approach to the setting where there exist monotone transformations $f_1, \dots, f_p$ such that $(f_1(X_1), \dots, f_p(X_p)) \sim N(0, \Sigma)$. These parametric and semi-parametric methods may be too restrictive for some applications and cannot handle mixed data since they rely on the Gaussian assumption.

Let $f(x)$ be the joint density function of $X$ with $f > 0$ and $\int f = 1$, and consider the transformation $f(x) = e^{\eta(x)} / \int e^{\eta(x)} dx$, where $\eta(x)$ is the logistic transformation of $f$. The SS ANOVA decomposition represents $\eta(x)$ as a summation of a constant, main effects, and interactions:

$$\eta(x_1, \dots, x_p) = c + \sum_{j=1}^{p} \eta_j(x_j) + \sum_{1 \le j < k \le p} \eta_{jk}(x_j, x_k) + \cdots. \tag{2}$$

An SS ANOVA model for $\eta$ in (3) may contain any subset of components in the SS ANOVA decomposition (2). For simplicity, we assume that $\eta \in \mathcal{M}_\alpha$, where

$$\mathcal{M}_\alpha = \bigoplus_{j=1}^{p} \mathcal{H}_{(j)} \oplus \Big\{ \bigoplus_{k \ne \alpha} \big[ \mathcal{H}_{(\alpha)} \otimes \mathcal{H}_{(k)} \big] \Big\} \tag{4}$$

is a subspace with main effects and two-way interactions only. A function $\eta \in \mathcal{M}_\alpha$ can be decomposed as follows:

$$\eta(x) = \sum_{j=1}^{p} \eta_j(x_j) + \sum_{k \ne \alpha} \eta_{\alpha k}(x_\alpha, x_k), \tag{5}$$

where each functional component in (5) belongs to the corresponding subspace in (4). We note that the proposed method can be easily extended to include higher-order interactions.

Remark 1 The SS ANOVA model (1) for the joint density with main effects and two-way interactions only is a pairwise graphical model, which is commonly assumed in the existing literature.

Remark 2 We consider both the log-likelihood and pseudo log-likelihood approaches for estimating the conditional density (Gu, 2013). We present the pseudo-likelihood approach in the main text since it is computationally more efficient. The log-likelihood approach is presented in Appendix A. For the pseudo log-likelihood approach, the model space $\mathcal{M}_\alpha$ includes constant functions. The model space for the log-likelihood approach eliminates the constant functions for identifiability.

Remark 3 To compare estimation between the joint and neighborhood approaches under pairwise graphical models, we consider the SS ANOVA decomposition (1) with main effects and two-way interactions only for the joint density and the SS ANOVA decomposition (5) for the conditional density. The joint density approach needs to estimate all main effects and two-way interactions simultaneously, with a total number of components proportional to $p^2$. Our experience indicates that this joint approach is computationally infeasible even with moderately large $p$ due to memory constraints. On the other hand, the neighborhood approach needs to estimate $p$ main effects and $p-1$ two-way interactions for each node, which significantly reduces the computational cost and memory requirement and is parallelizable.

Remark 4 Model (5) contains many parametric models as special cases. Specifically, the Gaussian graphical model is a special case with $\mathcal{X}_j = \mathbb{R}$, $\eta_j(x_j) = \beta_j x_j - x_j^2/2$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} x_\alpha x_k$ for some constants $\beta_j$ and $\beta_{\alpha k}$. The Ising model for binary data is a special case with $\mathcal{X}_j = \{0, 1\}$, $\eta_j(x_j) = \beta_j x_j$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} x_\alpha x_k$. The Poisson graphical model for discrete data is a special case with $\mathcal{X}_j = \{0, 1, 2, \dots\}$, $\eta_j(x_j) = \beta_j x_j - \log(x_j!)$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} x_\alpha x_k$. The exponential family model proposed by Suggala et al. (2017),

$$\log f(x_\alpha \mid x_{\setminus\{\alpha\}}) \propto \beta_\alpha B_\alpha(x_\alpha) + \sum_{\{\alpha,k\} \in E} \beta_{\alpha k} B_\alpha(x_\alpha) B_k(x_k) + C_\alpha(x_\alpha), \tag{6}$$

is also a special case with $\eta_j(x_j) = \beta_j B_j(x_j) + C_j(x_j)$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} B_\alpha(x_\alpha) B_k(x_k)$. Note that many existing exponential family models, including (6), assume a multiplicative interaction, while model (5) does not assume any specific form for the interactions. Therefore, the proposed model is more general.
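To make the Gaussian special case in Remark 4 concrete, the following numerical sketch is ours rather than the authors': it builds $\eta$ from a unit-diagonal precision matrix (so $\beta_{\alpha k} = -\omega_{\alpha k}$), and checks on a grid that $e^{\eta}$ recovers the Gaussian conditional density of $X_\alpha$ given the rest up to normalization. The variable names and grid-based normalization are our own choices.

```python
# A minimal numerical check (ours) of the Gaussian special case in Remark 4:
# with eta_alpha(x_a) = beta_a * x_a - x_a**2 / 2 and
# eta_ak(x_a, x_k) = beta_ak * x_a * x_k, exp(eta) is proportional to the
# conditional density of X_alpha given the rest.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, alpha = 4, 0
Omega = np.eye(p)                        # precision matrix with unit diagonal
Omega[0, 1] = Omega[1, 0] = 0.4          # a single edge
beta_a = 0.0                             # linear coefficient of the main effect
beta_ak = -Omega[alpha]                  # interaction coefficients, k != alpha

x_rest = rng.normal(size=p)              # conditioning values for the other nodes
grid = np.linspace(-8, 8, 4001)

def eta(x_a):
    main = beta_a * x_a - x_a**2 / 2                       # eta_alpha(x_alpha)
    inter = sum(beta_ak[k] * x_a * x_rest[k]               # eta_{alpha,k}
                for k in range(p) if k != alpha)
    return main + inter

dens = np.exp([eta(t) for t in grid])
dens /= dens.sum() * (grid[1] - grid[0])                   # normalize on the grid

# The Gaussian conditional is N(mu, 1) with mu = -sum_{k != alpha} w_ak x_k.
mu = -sum(Omega[alpha, k] * x_rest[k] for k in range(p) if k != alpha)
assert np.allclose(dens, norm.pdf(grid, loc=mu), atol=1e-4)
```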
2.3 Penalized Pseudo Log-likelihood Estimation

For each node $\alpha \in V$, we assume that $\eta(x) \in \mathcal{M}_\alpha$, where $\mathcal{M}_\alpha$ is given in (4) and $\eta$ is decomposed as in (5). We further decompose $\mathcal{H}_{(j)}$ as $\mathcal{H}_{(j)} = \mathcal{H}^0_{(j)} \oplus \mathcal{H}^1_{(j)}$, where $\mathcal{H}^0_{(j)}$ is a finite-dimensional space containing functions that are not subject to penalty. We estimate $\eta$ in (5) by minimizing the following penalized pseudo log-likelihood in $\mathcal{M}_\alpha$:

$$l_\alpha + \lambda_1 \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2 + \tau_1 \sum_{k \ne \alpha} w_{\alpha k} \|\eta_{\alpha k}\|, \tag{7}$$

where $l_\alpha = n^{-1} \sum_{i=1}^{n} \big\{ e^{-\eta(x_i)} + \int_{\mathcal{X}_\alpha} \eta(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha \big\}$ is the pseudo log-likelihood, with $x_i^\alpha$ denoting $x_i$ with its $\alpha$th element replaced by the integration variable $x_\alpha$, $\rho(\cdot)$ is a known density of $X_\alpha$ conditional on $X_{\setminus\{\alpha\}} = x_{i,\setminus\{\alpha\}}$, $P_j$ is the projection operator onto $\mathcal{H}^1_{(j)}$, $\lambda_1$, $\tau_1$, and the $\theta_j$'s are tuning parameters, $0 \le w_{\alpha k} < \infty$ are pre-specified weights, and $\|\cdot\|$ is an induced norm in $\mathcal{M}_\alpha$. The pseudo log-likelihood $l_\alpha$ measures the goodness-of-fit. The second element in (7) is the roughness $L_2$ penalty on the main effects. The third element in (7) is the $L_1$ penalty for selecting the neighborhood $\mathrm{nb}_G(\alpha)$. We allow different weights in the $L_1$ penalty for flexibility.

Remark 5 The idea of the pseudo log-likelihood was first developed in Jeon and Lin (2006) for joint density estimation. Gu (2013) extended this approach to conditional density estimation. We present the pseudo log-likelihood estimation in the main text since this approach is computationally more efficient. The log-likelihood approach to conditional density estimation needs to calculate the integral $\int_{\mathcal{X}_\alpha} e^{\eta(x_i^\alpha)} dx_\alpha$ repeatedly, which can be computationally intensive. With a proper choice of $\rho$, the pseudo log-likelihood approach needs to calculate an integral only once.

Remark 6 The proposed method replaces the $L_2$ penalty on interactions in Gu (2013) with the $L_1$ penalty for neighborhood selection, and it differs from that in Jeon and Lin (2006) in two aspects. First, Jeon and Lin's (2006) approach is a global method that estimates the joint density; it is thus computationally intensive and can only handle small dimensions $p$. Second, Jeon and Lin (2006) imposed the $L_1$ penalty on both main effects and interactions. Consequently, their method selects both nodes and edges. In practice, the nodes are usually given, and the goal is to detect edges. Therefore, we apply the smoothness-promoting $L_2$ penalty to the main effects and the sparsity-promoting $L_1$ penalty to the interactions.

Let $\mathcal{G}$ collect the nonconstant components of $\mathcal{M}_\alpha$, that is, the main-effect subspaces and the interaction subspaces $\bigoplus_{k \ne \alpha} [\mathcal{H}_{(\alpha)} \otimes \mathcal{H}_{(k)}]$. We can rewrite $\eta(x) = \varsigma + g(x)$, where $g(x) = \sum_{j=1}^{p} \eta_j(x_j) + \sum_{k \ne \alpha} \eta_{\alpha k}(x_\alpha, x_k) \in \mathcal{G}$. We first estimate $\varsigma$ with fixed $g$ and then estimate $g$ using the profiled pseudo log-likelihood. The results are summarized in the following proposition.

Proposition 1 With fixed $g$, the minimizing $\varsigma$ in (7) is $\hat\varsigma = \log\{n^{-1} \sum_{i=1}^{n} e^{-g(x_i)}\}$, and the penalized pseudo log-likelihood (7) reduces to the following penalized profiled pseudo log-likelihood:

$$l(\hat\varsigma(g), g) + \lambda_1 \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2 + \tau_1 \sum_{k \ne \alpha} w_{\alpha k} \|\eta_{\alpha k}\|, \tag{9}$$

where $l(\hat\varsigma(g), g) = \log\{n^{-1} \sum_{i=1}^{n} e^{-g(x_i)}\} + n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} g(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$ is the profiled pseudo log-likelihood.

The proof can be found in Appendix C.
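As a small illustration of Proposition 1, the profiling step only needs $g$ evaluated at the observations and the precomputed integrals $\int_{\mathcal{X}_\alpha} g(x_i^\alpha)\rho(x_i^\alpha)dx_\alpha$. The sketch below is ours (the function names are our own, not the authors' code):

```python
# Sketch (ours) of Proposition 1: profile out the constant varsigma and
# evaluate the profiled pseudo log-likelihood l(varsigma_hat(g), g).
import numpy as np

def varsigma_hat(g_obs):
    """varsigma_hat = log{ n^{-1} sum_i exp(-g(x_i)) }, via log-sum-exp."""
    n = g_obs.size
    m = (-g_obs).max()                       # shift for numerical stability
    return m + np.log(np.exp(-g_obs - m).sum()) - np.log(n)

def profiled_pseudo_loglik(g_obs, int_g):
    """g_obs[i] = g(x_i); int_g[i] = integral of g(x_i^alpha) rho dx_alpha."""
    return varsigma_hat(g_obs) + int_g.mean()
```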
Instead of minimizing (9), which involves $L_1$ penalties on functions, as in Lin and Zhang (2006) we solve an equivalent but more convenient minimization problem that involves $L_1$ penalties on the smoothing parameters.

Proposition 2 Minimizing

$$l(\hat\varsigma(g), g) + \lambda_1 \Big\{ \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2 + \sum_{k \ne \alpha} w_{\alpha k} \theta_{\alpha k}^{-1} \|\eta_{\alpha k}\|^2 \Big\} + \lambda_2 \sum_{k \ne \alpha} w_{\alpha k} \theta_{\alpha k}, \tag{10}$$

subject to $\theta_{\alpha k} \ge 0$ for $k = 1, \dots, p$ and $k \ne \alpha$, is equivalent to minimizing (9).

The proof of the equivalence can be found in Appendix C. Proposition 2 transforms the selection of nonzero functions $\eta_{\alpha k}$ in (9) into a selection of nonzero parameters $\theta_{\alpha k}$. The minimization problem (10) consists of $L_2$ penalties on functions and $L_1$ penalties on parameters, and existing methods can be modified to solve each part. Computational details for solving (10) are presented in Section 3. Since the pseudo log-likelihood is used for estimation, we compute the conditional density estimate using the following proposition.

Proposition 3 The resulting estimate of the conditional density is $\hat f(x_\alpha \mid x_{\setminus\{\alpha\}}) \propto e^{\hat g(x)} \rho(x)$, where $\hat g$ is the minimizer of (10).

The proof of Proposition 3 is given in Appendix C.

Notice that the minimization problem (10) involves $p-1$ two-way interaction terms. Solving (10) for all $\alpha = 1, \dots, p$ leads to two estimates for each two-way interaction, denoted as $\hat\eta_{\alpha k}$ and $\hat\eta_{k\alpha}$ for $\alpha, k = 1, \dots, p$ and $\alpha \ne k$. There are two commonly used rules to combine the results: the AND-rule ($\{\alpha, k\} \in E$ iff $\hat\eta_{\alpha k} \ne 0$ and $\hat\eta_{k\alpha} \ne 0$) or the OR-rule ($\{\alpha, k\} \in E$ iff $\hat\eta_{\alpha k} \ne 0$ or $\hat\eta_{k\alpha} \ne 0$) (Hastie et al., 2015). As discussed in Section 4.2 of Chen et al. (2015), when the $\alpha$th and $k$th nodes are of the same type (same marginal distribution) or are both non-Gaussian, there is no clear reason to prefer one edge estimate over the other. We adopt the AND-rule in all simulations and real data examples.

3. Algorithm

In this section, we propose a computational algorithm that solves (10) iteratively. Denote $\theta_1 = (\theta_1, \dots, \theta_p)^T$, $\theta_2 = (\theta_{\alpha 1}, \dots, \theta_{\alpha(\alpha-1)}, \theta_{\alpha(\alpha+1)}, \dots, \theta_{\alpha p})^T$, and $w = (w_{\alpha 1}, \dots, w_{\alpha(\alpha-1)}, w_{\alpha(\alpha+1)}, \dots, w_{\alpha p})^T$. Let $\mathcal{H}_{(j)} = \mathcal{H}^0_{(j)} \oplus \mathcal{H}^1_{(j)}$, where $\mathcal{H}^0_{(j)}$ is a finite-dimensional space containing functions that are not subject to the $L_2$ penalty. Denote $\phi_{j1}, \dots, \phi_{jm_j}$ as basis functions of $\mathcal{H}^0_{(j)}$, and $R^1_j$, $R_j$, and $R_{\alpha k}$ as the reproducing kernels of $\mathcal{H}^1_{(j)}$, $\mathcal{H}_{(j)}$, and $\mathcal{H}_{(\alpha k)}$, respectively. We collect all basis functions $\phi_{jk}$ for $j = 1, \dots, p$ and $k = 1, \dots, m_j$ and denote them as $\phi = (\phi_1, \dots, \phi_m)^T$, a vector of functions of $x$ with dimension $m = \sum_{j=1}^{p} m_j$.

Since in general the minimization problem (10) does not have a solution in a finite-dimensional space, as in Gu (2013), we approximate the solution by a subset of representers. Specifically, let $\{\tilde x_u = (\tilde x_{u,1}, \dots, \tilde x_{u,p}),\ u = 1, \dots, q\}$ be a subset of all observations $\{x_i,\ i = 1, \dots, n\}$. Let $\xi_{1ju}(x_j) = R^1_j(\tilde x_{u,j}, x_j)$ and $\xi_{\alpha k u}(x_\alpha, x_k) = R_{\alpha k}((\tilde x_{u,\alpha}, \tilde x_{u,k}), (x_\alpha, x_k))$ for $u = 1, \dots, q$, $k = 1, \dots, p$, and $k \ne \alpha$. Let $\xi_{\theta_1,u}(x) = \sum_{j=1}^{p} \theta_j \xi_{1ju}(x_j)$, $\xi_{\theta_1}(x) = (\xi_{\theta_1,1}, \dots, \xi_{\theta_1,q})^T$, $\xi_{\theta_2,u}(x) = \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} \xi_{\alpha k u}(x_\alpha, x_k)$, $\xi_{\theta_2}(x) = (\xi_{\theta_2,1}, \dots, \xi_{\theta_2,q})^T$, and $\xi(x) = \xi_{\theta_1}(x) + \xi_{\theta_2}(x)$. The approximate solution can be represented as a linear combination of basis functions and representers:

$$\hat g(x) = \sum_{v=1}^{m} d_v \phi_v(x) + \sum_{u=1}^{q} c_u \Big\{ \sum_{j=1}^{p} \theta_j \xi_{1ju}(x_j) + \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} \xi_{\alpha k u}(x_\alpha, x_k) \Big\} = \phi^T(x) d + \xi^T(x) c, \tag{11}$$

where $c = (c_1, \dots, c_q)^T$ and $d = (d_1, \dots, d_m)^T$ are coefficients. Let

$$Q = \sum_{j=1}^{p} \theta_j Q_j + \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} Q_{\alpha k},$$

where $Q_j = \big\{ R^1_j(\tilde x_{u,j}, \tilde x_{v,j}) \big\}_{u,v=1}^{q}$ are kernel matrices for the main effects and $Q_{\alpha k} = \big\{ R_{\alpha k}((\tilde x_{u,\alpha}, \tilde x_{u,k}), (\tilde x_{v,\alpha}, \tilde x_{v,k})) \big\}_{u,v=1}^{q}$ are kernel matrices for the two-way interactions.
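These kernel matrices are easy to assemble once the reproducing kernels are available. The sketch below is ours, with generic kernel callables `R1` and `Rint` standing in for $R^1_j$ and $R_{\alpha k}$ (hypothetical names, not an API from the paper):

```python
# Sketch (ours) of assembling Q = sum_j theta_j Q_j
#                                + sum_{k != alpha} theta_ak Q_ak / w_ak.
import numpy as np

def gram(kernel, U):
    """q x q kernel matrix over the entries of U."""
    return np.array([[kernel(u, v) for v in U] for u in U])

def assemble_Q(Xt, alpha, R1, Rint, theta1, theta2, w):
    """Xt: (q, p) array of representer points x~_u."""
    q, p = Xt.shape
    Q = np.zeros((q, q))
    for j in range(p):                           # main-effect blocks Q_j
        Q += theta1[j] * gram(lambda a, b, j=j: R1(j, a, b), Xt[:, j])
    others = [k for k in range(p) if k != alpha]
    for idx, k in enumerate(others):             # interaction blocks Q_ak
        pairs = [(Xt[u, alpha], Xt[u, k]) for u in range(q)]
        Q += (theta2[idx] / w[idx]) * gram(Rint, pairs)
    return Q
```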
We can rewrite (10) in a vector form:

$$A(c, d, \theta_2) = \log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \xi_i^T c} \Big\} + b_\phi^T d + b_\xi^T c + \frac{\lambda_1}{2} c^T Q c + \lambda_2 w^T \theta_2, \tag{12}$$

where $\phi_i = \phi(x_i)$, $\xi_i = \xi(x_i)$, $b_\phi = n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} \phi(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$, and $b_\xi = n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} \xi(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$. We solve (12) by updating $c$, $d$, and $\theta_2$, alternating between the two steps discussed in the following two subsections.

3.1 Newton-Raphson Procedure

We fix $\theta_2$ and update $c$ and $d$ at this step. Dropping the last term, which is independent of $c$ and $d$, (12) reduces to

$$A_1(c, d) = \log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \xi_i^T c} \Big\} + b_\phi^T d + b_\xi^T c + \frac{\lambda_1}{2} c^T Q c. \tag{13}$$

Note that (13) has the same form as (10.31) in Gu (2013). Therefore, we can solve (13) using the Newton-Raphson procedure, with $\lambda_1$ and $\theta_1$ selected by the approximate cross-validation (ACV) method (Gu, 2013). We note that $\theta_2$ is fixed at this step. Therefore, the existing function in the gss R package cannot be used directly. More implementation details can be found in Appendix B.1.

3.2 Quadratic Programming

We fix $c$, $d$, $\lambda_1$, and $\theta_1$ and update $\theta_2$ at this step. We rewrite $\hat g$ in (11) as

$$\hat g(x) = \sum_{v=1}^{m} d_v \phi_v(x) + \sum_{j=1}^{p} \theta_j \sum_{u=1}^{q} c_u \xi_{1ju}(x_j) + \sum_{k=1, k \ne \alpha}^{p} \theta_{\alpha k} w_{\alpha k}^{-1} \sum_{u=1}^{q} c_u \xi_{\alpha k u}(x_\alpha, x_k) = \phi^T(x) d + \psi_1^T(x) \theta_1 + \psi_2^T(x) \theta_2. \tag{14}$$

Let $Q^{(2)} = \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} Q_{\alpha k}$. Plugging in $\hat g(x_i)$ and keeping the terms involving $\theta_2$ only, (12) reduces to

$$\log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \psi_{1i}^T \theta_1 - \psi_{2i}^T \theta_2} \Big\} + b_{\psi_2}^T \theta_2 + \frac{\lambda_1}{2} c^T Q^{(2)} c + \lambda_2 w^T \theta_2 \tag{15}$$

subject to $\theta_2 \ge 0$, where $\psi_{1i} = \psi_1(x_i)$, $\psi_{2i} = \psi_2(x_i)$, and $b_{\psi_2} = n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} \psi_2(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$. Furthermore, the constrained minimization problem (15) is equivalent to

$$A_2(\theta_2) = \log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \psi_{1i}^T \theta_1 - \psi_{2i}^T \theta_2} \Big\} + b_{\psi_2}^T \theta_2 + \frac{\lambda_1}{2} c^T Q^{(2)} c \tag{16}$$

subject to $\theta_2 \ge 0$ and $w^T \theta_2 \le M$ for some constant $M$, where $M$ controls the sparsity in $\theta_2$. We note that $A_2(\theta_2)$ is a convex function of $\theta_2$ (see Appendix C for a brief proof). We solve (16) iteratively using quadratic programming. We apply K-fold cross-validation or the BIC method to select $M$. Implementation details can be found in Appendix B.2.

3.3 Algorithm

We summarize the whole algorithm as follows. A parameter with superscript $(t)$ denotes its value at the $t$th iteration.

Algorithm 1
Input: Data frame $X$ containing $n$ observations with $p$ dimensions.
Output: Estimated $c$, $d$, $\theta_2$, and the neighborhood set $\mathrm{nb}_G(\alpha)$.
1: Initialize $\theta_2^{(1)} = \theta_{2,0}$, $\theta_2^{(0)} = 0$, and $t = 1$.
2: while $\|\theta_2^{(t)} - \theta_2^{(t-1)}\|_2 / (\|\theta_2^{(t-1)}\|_2 + 10^{-6}) \ge \varepsilon$ or $t = 1$ do
3: Fix $\theta_2^{(t)}$; $(c^{(t)}, d^{(t)}) \leftarrow \arg\min_{c,d} A_1(c, d)$, with tuning parameters $\lambda_1^{(t)}$ and $\theta_1^{(t)}$ selected by the ACV method.
4: Fix $d^{(t)}$, $c^{(t)}$, $\lambda_1^{(t)}$, and $\theta_1^{(t)}$; $\theta_2^{(t+1)} \leftarrow \arg\min_{\theta_2} A_2(\theta_2)$, subject to $\theta_2 \ge 0$ and $w^T \theta_2 \le M^{(t)}$, where the tuning parameter $M^{(t)}$ is selected by K-fold cross-validation or the BIC method.
5: $t \leftarrow t + 1$.
6: end while

More implementation details can be found in Appendix B, including the initialization of $\theta_2$, the convergence criterion, and the selection of $M$.
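The alternation in Algorithm 1 is straightforward to organize in code. The sketch below is ours: `solve_newton` and `solve_qp` are hypothetical stand-ins for the Newton-Raphson update of (13) and the quadratic program (16), not the authors' implementation.

```python
# Schematic sketch (ours) of Algorithm 1. solve_newton minimizes A1 (with
# lambda1, theta1 chosen by ACV); solve_qp minimizes A2 subject to
# theta2 >= 0 and w^T theta2 <= M (M chosen by K-fold CV or BIC).
import numpy as np

def algorithm1(X, alpha, theta2_init, solve_newton, solve_qp,
               eps=1e-4, max_iter=50):
    theta2 = theta2_init.copy()                  # theta_2^{(1)}
    theta2_prev = np.zeros_like(theta2)          # theta_2^{(0)} = 0
    for t in range(1, max_iter + 1):
        # Step 3: fix theta2, update (c, d).
        c, d, lam1, theta1 = solve_newton(X, alpha, theta2)
        # Step 4: fix (c, d, lambda1, theta1), update theta2 by QP.
        theta2_prev, theta2 = theta2, solve_qp(X, alpha, c, d, lam1, theta1)
        # Step 2: relative-change stopping rule.
        rel = (np.linalg.norm(theta2 - theta2_prev)
               / (np.linalg.norm(theta2_prev) + 1e-6))
        if t > 1 and rel < eps:
            break
    others = [k for k in range(X.shape[1]) if k != alpha]
    neighborhood = {k for k, th in zip(others, theta2) if th > 0}
    return c, d, theta2, neighborhood
```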
4. Theoretical Analysis

In this section, we study the theoretical properties of the proposed method. Following similar steps and under the same regularity conditions as Gu (2013), we derive the convergence rate for the conditional density estimate $\hat g$ subject to both $L_1$ and $L_2$ penalties. In addition, we derive the convergence rate for the interactions in the SS ANOVA decomposition, which is new and important for edge detection.

Let $f_0(x_\alpha \mid x_{\setminus\{\alpha\}}) = e^{g_0(x)} \rho(x)$ be the true conditional density to be estimated. Let $g = g^{(1)} + g^{(2)}$, where $g^{(1)} = \sum_{j=1}^{p} \eta_j$ and $g^{(2)} = \sum_{k \ne \alpha} \eta_{\alpha k}$ are the main effects and interactions, respectively. Denote $\hat g$ as the minimizer of (9). Define

$$V(h_1, h_2) = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} h_1(x) h_2(x) \rho(x)\, dx_\alpha\, dx_{\setminus\{\alpha\}},$$
$$J_1(h_1, h_2) = \sum_{j=1}^{p} \theta_j^{-1} \int_{\mathcal{X}_j} (P_j h_1)(P_j h_2)\, dx_j,$$
$$J_2(h_1, h_2) = \sum_{k \ne \alpha} w_{\alpha k} \Big( \int_{\mathcal{X}_\alpha} \int_{\mathcal{X}_k} |h_{1,\alpha k} h_{2,\alpha k}|\, dx_\alpha\, dx_k \Big)^{1/2},$$
$$\tilde J_2(h_1, h_2) = \sum_{k \ne \alpha} \theta_{\alpha k}^{-1} \int_{\mathcal{X}_\alpha} \int_{\mathcal{X}_k} h_{1,\alpha k} h_{2,\alpha k}\, dx_\alpha\, dx_k,$$

for any functions $h_1, h_2 \in \mathcal{G}$, where $h_{1,\alpha k}$ and $h_{2,\alpha k}$ are the interaction components of $h_1$ and $h_2$, and $f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}})$ is the density of $X_{\setminus\{\alpha\}}$ on $\mathcal{X}_{\setminus\{\alpha\}} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_{\alpha-1} \times \mathcal{X}_{\alpha+1} \times \cdots \times \mathcal{X}_p$. Furthermore, we define $V(g) = V(g, g)$, $V_1(g^{(1)}) = V(g^{(1)})$, $V_2(g^{(2)}) = [V(g^{(2)})]^{1/2}$, $J_1(g^{(1)}) = J_1(g^{(1)}, g^{(1)}) = \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2$, $J_2(g^{(2)}) = J_2(g^{(2)}, g^{(2)}) = \sum_{k \ne \alpha} w_{\alpha k} \|\eta_{\alpha k}\|$, and $\tilde J_2(g^{(2)}) = \tilde J_2(g^{(2)}, g^{(2)}) = \sum_{k \ne \alpha} \theta_{\alpha k}^{-1} \|\eta_{\alpha k}\|^2$. Without loss of generality, we assume $w_{\alpha k} = 1$ in the proofs, simulations, and real data examples.

We note that $V$, $J_1$, and $\tilde J_2$ are quadratic functionals. In the proof of Corollary 1 in Appendix C, it is shown that $V(g)$, $J_1(g^{(1)})$, and $\tilde J_2(g^{(2)})$ are equivalent to $\|g\|_2^2$, $\sum_{j=1}^{p} \|P_j \eta_j\|_2^2$, and $\sum_{k \ne \alpha} \|\eta_{\alpha k}\|_2^2$, respectively, where $\|\cdot\|_2$ is the $L_2$ norm. It is also shown that $V_2(g^{(2)})$ and $J_2(g^{(2)})$ are equivalent to the square roots of $V(g^{(2)})$ and $\tilde J_2(g^{(2)})$. Let $\bar V(g) = V_1(g^{(1)}) + V_2(g^{(2)})$, $J = J_1 + J_2$, and $\tilde J(g) = J_1(g^{(1)}) + \tilde J_2(g^{(2)})$. To derive the convergence rates, we need the following conditions.

Condition 1 $V$ is completely continuous with respect to $\tilde J$.

From Theorem 3.1 of Weinberger (1974), there exist eigenvalues $\gamma_v$ of $\tilde J$ with respect to $V$ and associated eigenfunctions $\zeta_v$ such that $V(\zeta_v, \zeta_u) = \delta_{v,u}$ and $\tilde J(\zeta_v, \zeta_u) = \gamma_v \delta_{v,u}$, where $0 \le \gamma_v$ and $\delta_{v,u}$ is the Kronecker delta. Functions satisfying $\tilde J(g) < \infty$ can be expressed as a Fourier series expansion $g = \sum_v a_v \zeta_v$, where $a_v = V(g, \zeta_v)$ are the Fourier coefficients.

Condition 2 For $v$ sufficiently large and some $\varphi > 0$, the eigenvalues $\gamma_v$ of $\tilde J$ with respect to $V$ satisfy $\gamma_v > \varphi v^r$, where $r > 1$.

Consider the quadratic functional

$$-\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} g(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} g(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \frac{1}{2} V(g - g_0) + \frac{\lambda_1}{2} \tilde J(g), \tag{17}$$

and denote the minimizer of (17) as $\tilde g$. Plugging the Fourier series expansions $g = \sum_v a_v \zeta_v$ and $g_0 = \sum_v a_{v,0} \zeta_v$ into (17), $\tilde g$ has Fourier coefficients $\tilde a_v = (\kappa_v + a_{v,0})/(1 + \lambda_1 \gamma_v)$, where $\kappa_v = n^{-1} \sum_{i=1}^{n} \{ e^{-g_0(X_i)} \zeta_v(X_i) - \int_{\mathcal{X}_\alpha} \zeta_v(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha \}$. It is not difficult to verify that $E(\kappa_v) = 0$ and $E(\kappa_v^2) \le n^{-1} \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v^2(x) e^{-g_0(x)} \rho(x) dx_\alpha dx_{\setminus\{\alpha\}}$.

Condition 3 For some $c_1 < \infty$, $e^{-g_0} < c_1$.

Under Condition 3, noting that $V(\zeta_v) = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} = 1$ by the definition of $V$ and $\zeta_v$, we have $E(\kappa_v^2) \le n^{-1} c_1$.

Condition 4 For $g$ in a convex set $B_0$ around $g_0$ containing $\hat g$ and $\tilde g$, $c_2 < e^{g_0 - g} < c_3$ holds uniformly for some $0 < c_2 < c_3 < \infty$.

Condition 5 For any $u, v = 1, 2, \dots$, $\int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v^2 \zeta_u^2 e^{-g_0} \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} < c_4$ for some $c_4 < \infty$.

Conditions 1-5 are common assumptions for convergence rate analysis of SS ANOVA estimates, which were also made in Gu (2013). Condition 2 states that the eigenvalues $\gamma_v$ grow at the rate $v^r$, which controls how fast $\lambda_1$ approaches zero. Many commonly used smoothing spline models, including tensor products of cubic splines, thin-plate splines, and spherical splines, satisfy Conditions 1 and 2; see Chapter 9 of Gu (2013) for examples. Condition 4 bounds $e^{g_0 - g}$ for $g$ in a convex set $B_0$ around $g_0$. Condition 5 requires a bounded fourth moment of $\zeta_v$.

We consider the metrics $\bar V + \lambda_1 J$ and $V + \lambda_1 \tilde J$. For $Y > 0$, we denote $X = O_p(Y)$ if $\lim_{C \to \infty} \limsup_{n \to \infty} P(|X| > C Y) = 0$, and $X = o_p(Y)$ if $P(|X| > \epsilon Y) \to 0$ for every $\epsilon > 0$.

Theorem 1 Assume $\sum_v \gamma_v^l a_{v,0}^2 < \infty$ for some $l \in [1, 2]$. Under Conditions 1-5, for some $r > 1$, as $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$,

$$(V + \lambda_1 \tilde J)(\hat g - g_0) = O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l).$$
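To read the rate in Theorem 1, one can balance its two terms. The following short calculation is ours (a standard smoothing-parameter trade-off, not taken from the paper):

```latex
n^{-1}\lambda_1^{-1/r} \asymp \lambda_1^{l}
\;\Longleftrightarrow\;
\lambda_1 \asymp n^{-r/(lr+1)},
\qquad\text{so that}\qquad
(V + \lambda_1 \tilde J)(\hat g - g_0) = O_p\!\big(n^{-lr/(lr+1)}\big).
```

With this choice, the condition $n\lambda_1^{2/r} \to \infty$ is satisfied, since $n\lambda_1^{2/r} = n^{(lr-1)/(lr+1)}$ and $lr > 1$ whenever $l \ge 1$ and $r > 1$.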
Theorem 2 Under the conditions in Theorem 1,

$$(\bar V + \lambda_1 J)(\hat g - g_0) = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}).$$

Corollary 1 Assume the conditions in Theorem 2 hold, and $0 < c_5 < \rho(x) < c_6$ and $0 < c_7 < f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) < c_8$ for some positive constants $c_5$, $c_6$, $c_7$, and $c_8$. Then

$$\|\hat\eta_{\alpha k} - \eta_{0\alpha k}\|_2 = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}), \quad k \ne \alpha,\ k = 1, \dots, p,$$

where the $\eta_{0\alpha k}$ are the two-way interactions in the true function $g_0$.

We note that $\bar V + \lambda_1 J$ and $V + \lambda_1 \tilde J$ are associated with the $L_2$ norm and its square, respectively. Consequently, the convergence rate in Theorem 2 is the square root of the rate in Theorem 1. Corollary 1 holds because $V_2$ and $J_2$, associated with the two-way interactions, are equivalent to the $L_2$ norm. Consequently, the two-way interactions under the $L_2$ norm have the same convergence rate as that in Theorem 2. We only show the convergence rate for the interactions in Corollary 1 since we are mainly interested in edge selection. Proofs of all theoretical results are in Appendix C.

5. Simulation Results

We conduct simulations to evaluate the performance of the proposed method and compare it with some existing methods. We consider four scenarios: multivariate Gaussian, multivariate skewed Gaussian, a directed acyclic graph, and a Gaussian-Bernoulli mixed graphical model.

In implementing the proposed method, we estimate the conditional density for each continuous variable on the data range and transform the data into $[0, 1]$. We construct an SS ANOVA model using tensor products of cubic spline models. Specifically, let $\mathcal{H}_{(j)} = W_2^2[0, 1]$, where

$$W_2^2[0, 1] = \Big\{ f : f, f' \text{ are absolutely continuous},\ \int_0^1 (f'')^2 dx < \infty \Big\} \tag{18}$$

is the Sobolev space for cubic spline models. Each $\mathcal{H}_{(j)}$ can be decomposed as $\mathcal{H}_{(j)} = \{1_{(j)}\} \oplus \bar{\mathcal{H}}_{(j)}$ and $\bar{\mathcal{H}}_{(j)} = \mathcal{H}^0_{(j)} \oplus \mathcal{H}^1_{(j)}$, where $\mathcal{H}^0_{(j)}$ and $\mathcal{H}^1_{(j)}$ are RKHS's with reproducing kernels $R^0_j(x, z) = k_1(x) k_1(z)$ and $R^1_j(x, z) = k_2(x) k_2(z) - k_4(|x - z|)$, respectively, where $k_1(x) = x - 0.5$, $k_2(x) = \frac{1}{2}\big(k_1^2(x) - \frac{1}{12}\big)$, and $k_4(x) = \frac{1}{24}\big(k_1^4(x) - \frac{k_1^2(x)}{2} + \frac{7}{240}\big)$. The SS ANOVA decomposition of $\otimes_{j=1}^{p} \mathcal{H}_{(j)}$ can then be constructed based on these decompositions. More details can be found in Wang (2011).

In all simulations and real data applications, when using the pseudo log-likelihood method, we set

$$\rho(x_\alpha, x_{\setminus\{\alpha\}}) = \frac{\phi\big( (x_\alpha - \mu(x_{\setminus\{\alpha\}}))/\sigma \big)}{\Phi\big( (1 - \mu(x_{\setminus\{\alpha\}}))/\sigma \big) - \Phi\big( (-\mu(x_{\setminus\{\alpha\}}))/\sigma \big)}, \tag{19}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and cumulative distribution functions, and $\mu(\cdot)$ and $\sigma$ are estimated by fitting a nonparametric regression model in the model space (4) with covariates $x_{\setminus\{\alpha\}}$. More estimation details can be found in Chapter 3 of Gu (2013). We select the tuning parameter $M$ using 5-fold cross-validation in all simulations.
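The kernels above are simple polynomials and can be transcribed directly. The sketch below is ours; it implements $k_1$, $k_2$, $k_4$, and the reproducing kernels $R^0_j$ and $R^1_j$ on $[0, 1]$, and evaluates a small Gram matrix of the kind used for $Q_j$:

```python
# Direct transcription (ours) of the cubic-spline kernels on [0, 1].
import numpy as np

def k1(x): return x - 0.5
def k2(x): return (k1(x)**2 - 1/12) / 2
def k4(x): return (k1(x)**4 - k1(x)**2 / 2 + 7/240) / 24

def R0(x, z):
    """Reproducing kernel of H^0_(j): k1(x) k1(z)."""
    return k1(x) * k1(z)

def R1(x, z):
    """Reproducing kernel of H^1_(j): k2(x) k2(z) - k4(|x - z|)."""
    return k2(x) * k2(z) - k4(np.abs(x - z))

# Example: Gram matrix over five representer points, as in Q_j.
xt = np.linspace(0.1, 0.9, 5)
Qj = R1(xt[:, None], xt[None, :])
assert np.allclose(Qj, Qj.T)    # symmetric, as a kernel matrix must be
```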
For the first three scenarios, where all variables are continuous, we compare the proposed method with four existing parametric and semiparametric methods: space (Sparse PArtial Correlation Estimation) (Peng et al., 2009), QUIC (QUadratic Inverse Covariance estimation) (Hsieh et al., 2011), nonparanormal (NPN) (Liu et al., 2009), and SpaCE JAM (Voorman et al., 2014). Due to memory constraints, we do not compare the proposed method with the nonparametric joint density estimation method in Gu et al. (2013).

The space method assumes that $E(X) = 0$ and $\mathrm{Cov}(X) = \Sigma$. Denote the precision matrix $\Omega = \Sigma^{-1} = (\sigma^{ij})_{p \times p}$ and $\rho^{ij} = \sigma^{ij}/\sqrt{\sigma^{ii} \sigma^{jj}}$ as the partial correlation between $X_i$ and $X_j$. Denote $x^{(i)} = (x_{1,i}, \dots, x_{n,i})^T$ as the vector of $n$ observations on the $i$th variable, $i = 1, \dots, p$. Peng et al. (2009) solved the following regularization problem for edge selection:

$$\frac{1}{2} \sum_{i=1}^{p} w_i \Big\| x^{(i)} - \sum_{j \ne i} \rho^{ij} \sqrt{\frac{\sigma^{jj}}{\sigma^{ii}}}\, x^{(j)} \Big\|^2 + \lambda \sum_{1 \le i < j \le p} |\rho^{ij}|,$$

where the $w_i$ are nonnegative weights.

Appendix C. Proofs

Proof of Proposition 3: The second derivative of $l_\alpha$ at $\eta$ in the direction $h$ is $n^{-1} \sum_{i=1}^{n} e^{-\eta(x_i)} h^2(x_i) > 0$ for any nonzero $h \in \mathcal{M}_\alpha$, and consequently, $l_\alpha$ is strictly convex. Therefore, if $\hat\eta$ is the solution to (7), the estimate of the conditional density equals $e^{\hat\eta(x)} \rho(x)$. We note that $\hat\eta = \hat\varsigma + \hat g$, where $\hat\eta$ is the solution to (7), $\hat\varsigma$ is given in Proposition 1, and $\hat g$ is the solution to (10). Since $\hat\varsigma$ is a constant independent of $x_\alpha$, the estimate of the conditional density $\hat f(x_\alpha \mid x_{\setminus\{\alpha\}})$ is proportional to $e^{\hat g(x)} \rho(x)$.

Proof of Convexity of $A_2(\theta_2)$: We show that the Hessian matrix $H_A(\theta_2)$ of $A_2(\theta_2)$ is positive semi-definite. For any vector $\nu \ne 0$, let $s_i = e^{-g(x_i)}$ and $t_i = \nu^T \psi_2(x_i)$; then

$$\nu^T H_A(\theta_2) \nu = \frac{\big( \sum_{i=1}^{n} s_i t_i^2 \big) \big( \sum_{i=1}^{n} s_i \big) - \big( \sum_{i=1}^{n} s_i t_i \big)^2}{\big( \sum_{i=1}^{n} s_i \big)^2} \ge 0$$

by the Cauchy-Schwarz inequality.

In the remainder of the Appendix, we first introduce three lemmas and then provide the proofs of Theorem 1, Theorem 2, and Corollary 1.

Lemma 1 Assume $\tilde J(g_0) < \infty$. Under Conditions 1-3, as $\lambda_1 \to 0$ and $n \to \infty$, $(V + \lambda_1 \tilde J)(\tilde g - g_0) = O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1)$.

Proof: By the Fourier series expansions of $\tilde g$ and $g_0$, we have

$$V(\tilde g - g_0) = \sum_v (\tilde a_v - a_{v,0})^2 = \sum_v \frac{\kappa_v^2 - 2 \kappa_v \lambda_1 \gamma_v a_{v,0} + \lambda_1^2 \gamma_v^2 a_{v,0}^2}{(1 + \lambda_1 \gamma_v)^2},$$
$$\lambda_1 \tilde J(\tilde g - g_0) = \sum_v \lambda_1 \gamma_v (\tilde a_v - a_{v,0})^2 = \sum_v \lambda_1 \gamma_v \frac{\kappa_v^2 - 2 \kappa_v \lambda_1 \gamma_v a_{v,0} + \lambda_1^2 \gamma_v^2 a_{v,0}^2}{(1 + \lambda_1 \gamma_v)^2}.$$

Since $E(\kappa_v) = 0$ and $E(\kappa_v^2) \le c_1/n$, we have

$$E[V(\tilde g - g_0)] \le \frac{c_1}{n} \sum_v \frac{1}{(1 + \lambda_1 \gamma_v)^2} + \lambda_1 \sum_v \frac{\lambda_1 \gamma_v}{(1 + \lambda_1 \gamma_v)^2} \gamma_v a_{v,0}^2,$$
$$E[\lambda_1 \tilde J(\tilde g - g_0)] \le \frac{c_1}{n} \sum_v \frac{\lambda_1 \gamma_v}{(1 + \lambda_1 \gamma_v)^2} + \lambda_1 \sum_v \frac{(\lambda_1 \gamma_v)^2}{(1 + \lambda_1 \gamma_v)^2} \gamma_v a_{v,0}^2. \tag{47}$$

Following similar arguments to the proof of Lemma 9.1 in Gu (2013), we have

$$\sum_v \frac{\lambda_1 \gamma_v}{(1 + \lambda_1 \gamma_v)^2} = O(\lambda_1^{-1/r}), \qquad \sum_v \frac{1}{(1 + \lambda_1 \gamma_v)^2} = O(\lambda_1^{-1/r}), \qquad \sum_v \frac{1}{1 + \lambda_1 \gamma_v} = O(\lambda_1^{-1/r}).$$

The lemma follows from (47) and the fact that $\sum_v \gamma_v a_{v,0}^2 = \tilde J(g_0) < \infty$. As in Gu (2013), when $g_0$ is supersmooth in the sense that $\sum_v \gamma_v^l a_{v,0}^2 < \infty$ for some $l \in [1, 2]$, which is assumed in Theorem 1, the rate can be improved to $O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l)$.

Now we bound the approximation error $\hat g - \tilde g$. Define

$$A_{h_1,h_2}(\tau) = \frac{1}{n} \sum_{i=1}^{n} e^{-(h_1 + \tau h_2)(X_i)} + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (h_1 + \tau h_2)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \frac{\lambda_1}{2} \tilde J(h_1 + \tau h_2),$$
$$B_{h_1,h_2}(\tau) = -\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (h_1 + \tau h_2)(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (h_1 + \tau h_2)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \frac{1}{2} V(h_1 + \tau h_2 - g_0) + \frac{\lambda_1}{2} \tilde J(h_1 + \tau h_2).$$

Taking derivatives of $A_{h_1,h_2}$ and $B_{h_1,h_2}$ with respect to $\tau$ and evaluating them at $\tau = 0$, we obtain

$$\dot A_{h_1,h_2}(0) = -\frac{1}{n} \sum_{i=1}^{n} e^{-h_1(X_i)} h_2(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} h_2(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \lambda_1 \tilde J(h_1, h_2), \tag{48}$$
$$\dot B_{h_1,h_2}(0) = -\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_2(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} h_2(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + V(h_1 - g_0, h_2) + \lambda_1 \tilde J(h_1, h_2). \tag{49}$$

Setting $h_1 = \hat g$ and $h_2 = \hat g - \tilde g$ in (48), we have

$$-\frac{1}{n} \sum_{i=1}^{n} e^{-\hat g(X_i)} (\hat g - \tilde g)(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (\hat g - \tilde g)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \lambda_1 \tilde J(\hat g, \hat g - \tilde g) = 0. \tag{50}$$

Setting $h_1 = \tilde g$ and $h_2 = \hat g - \tilde g$ in (49), we have

$$-\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (\hat g - \tilde g)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + V(\tilde g - g_0, \hat g - \tilde g) + \lambda_1 \tilde J(\tilde g, \hat g - \tilde g) = 0. \tag{51}$$

Subtracting (51) from (50), we obtain

$$-\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} \big\} (\hat g - \tilde g)(X_i) + \lambda_1 \tilde J(\hat g - \tilde g) = \frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\tilde g(X_i)} - e^{-g_0(X_i)} \big\} (\hat g - \tilde g)(X_i) + V(\hat g - \tilde g, \tilde g - g_0). \tag{52}$$

Applying the mean value theorem, we have $e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} = -e^{-(\tilde g + \tau_i(\hat g - \tilde g))(X_i)} (\hat g - \tilde g)(X_i)$, where $\tau_i \in [0, 1]$. Since $\hat g$ and $\tilde g$ belong to $B_0$, which is a convex set around $g_0$, under Condition 4 there exists a $b_0^{(i)} \in (c_2, c_3)$ such that $e^{-(\tilde g + \tau_i(\hat g - \tilde g))(X_i)} = b_0^{(i)} e^{-g_0(X_i)}$. Then

$$-\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} \big\} (\hat g - \tilde g)(X_i) = \frac{1}{n} \sum_{i=1}^{n} b_0^{(i)} e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i) \ge \frac{c_2}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i). \tag{53}$$

By the same argument, there exists a $c_0^{(i)} \in (c_2, c_3)$ such that

$$\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\tilde g(X_i)} - e^{-g_0(X_i)} \big\} (\hat g - \tilde g)(X_i) = -\frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i). \tag{54}$$
Lemma 2 Under Conditions 1, 2, and 5, suppose $h_1$ and $h_2$ are functions satisfying $\tilde J(h_1) < \infty$ and $\tilde J(h_2) < \infty$. As $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$, one has

$$\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - V(h_1, h_2) = o_p\Big( \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} \Big). \tag{55}$$

Proof: Since $\tilde J(h_1) < \infty$ and $\tilde J(h_2) < \infty$, $h_1$ and $h_2$ can be expressed as Fourier series $h_1 = \sum_v h_{1,v} \zeta_v$ and $h_2 = \sum_v h_{2,v} \zeta_v$. Let

$$U_i^{(v,u)} = \zeta_v(X_i) \zeta_u(X_i) e^{-g_0(X_i)} - \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v(x) \zeta_u(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}}.$$

Note that the $U_i^{(v,u)}$ are i.i.d. random variables with $E(U_i^{(v,u)}) = 0$. Under Condition 5, $\mathrm{Var}\big( \zeta_v(X_1) \zeta_u(X_1) e^{-g_0(X_1)} \big) < c_4$, and hence $\mathrm{Var}\big( n^{-1} \sum_{i=1}^{n} U_i^{(v,u)} \big) < c_4/n$. Furthermore,

$$\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - V(h_1, h_2) = \sum_v \sum_u h_{1,v} h_{2,u} \Big( \frac{1}{n} \sum_{i=1}^{n} U_i^{(v,u)} \Big)$$
$$\le \Big\{ \sum_v \sum_u \frac{\big( n^{-1} \sum_{i=1}^{n} U_i^{(v,u)} \big)^2}{(1 + \lambda_1 \gamma_v)(1 + \lambda_1 \gamma_u)} \Big\}^{1/2} \Big\{ \sum_v \sum_u (1 + \lambda_1 \gamma_v)(1 + \lambda_1 \gamma_u) h_{1,v}^2 h_{2,u}^2 \Big\}^{1/2}$$
$$= O_p(n^{-1/2} \lambda_1^{-1/r}) \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} = o_p\Big( \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} \Big),$$

where the first inequality follows from the Cauchy-Schwarz inequality, and the second equality holds because of the fact that $\sum_v \frac{1}{1 + \lambda_1 \gamma_v} = O(\lambda_1^{-1/r})$ and the strong law of large numbers.

Lemma 3 Under Conditions 1, 2, and 5, as $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$,

$$\Big| \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - \frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \Big| \le 2 c_0 \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} \tag{56}$$

holds with probability 1, where $c_0 = \max\{|c_2 - 1|, |c_3 - 1|\}$.

Proof: Note that

$$E\big| e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \big| = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} |h_1(x) h_2(x)| \rho(x) dx_\alpha dx_{\setminus\{\alpha\}}$$
$$\le \Big\{ \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} h_1^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} h_2^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} \Big\}^{1/2}$$
$$= \{ V(h_1) V(h_2) \}^{1/2} \le \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2},$$

where the first inequality follows from the Cauchy-Schwarz inequality. Then we have

$$\Big| \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - \frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \Big| = \Big| \frac{1}{n} \sum_{i=1}^{n} (1 - c_0^{(i)}) e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \Big|$$
$$\le \frac{1}{n} \sum_{i=1}^{n} |1 - c_0^{(i)}| \big| e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \big| \le \frac{c_0}{n} \sum_{i=1}^{n} \big| e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \big| \le 2 c_0 \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2},$$

where the last inequality holds with probability 1 for $n$ sufficiently large by the strong law of large numbers.

Proof of Theorem 1: Note that $E\{ e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i) \} = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} (\hat g - \tilde g)^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} = V(\hat g - \tilde g)$. Substituting (53) into the left-hand side of (52), we have

$$-\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} \big\} (\hat g - \tilde g)(X_i) + \lambda_1 \tilde J(\hat g - \tilde g) \ge \frac{c_2}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i) + \lambda_1 \tilde J(\hat g - \tilde g) \ge \frac{c_2}{2} V(\hat g - \tilde g) + \lambda_1 \tilde J(\hat g - \tilde g), \tag{57}$$

where the last inequality holds with probability 1 for $n$ sufficiently large by the strong law of large numbers. Substituting (54) into the right-hand side of (52) and letting $h_1 = \hat g - \tilde g$ and $h_2 = \tilde g - g_0$, we have

$$\Big| \frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\tilde g(X_i)} - e^{-g_0(X_i)} \big\} (\hat g - \tilde g)(X_i) + V(\hat g - \tilde g, \tilde g - g_0) \Big|$$
$$\le \Big| V(\hat g - \tilde g, \tilde g - g_0) - \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i) \Big| + \Big| \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i) - \frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i) \Big|$$
$$\le (o_p(1) + 2 c_0) \big\{ (V + \lambda_1 \tilde J)(\hat g - \tilde g) (V + \lambda_1 \tilde J)(\tilde g - g_0) \big\}^{1/2}, \tag{58}$$

where the first inequality follows from (54) and the triangle inequality, and the second inequality follows from Lemmas 2 and 3. Combining (52), (57), and (58), we obtain

$$\frac{c_2}{2} (V + \lambda_1 \tilde J)(\hat g - \tilde g) \le (o_p(1) + 2 c_0) \big\{ (V + \lambda_1 \tilde J)(\hat g - \tilde g) (V + \lambda_1 \tilde J)(\tilde g - g_0) \big\}^{1/2}. \tag{59}$$

Combining (59) with Lemma 1, as $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$, we have $(V + \lambda_1 \tilde J)(\hat g - g_0) = O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l)$, and Theorem 1 holds.

Proof of Theorem 2: We know that

$$\sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 \le \Big( \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\| \Big)^2 \le (p - 1) \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2. \tag{60}$$

Therefore, there exists some constant $C \in [1, \sqrt{p-1}]$ such that $C \big\{ \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 \big\}^{1/2} = \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|$. Since $\sum_{k \ne \alpha} \theta_{\alpha k}$ is bounded by $M$, we can scale $\lambda_1$ and $\lambda_2$ such that $\theta_{\alpha k} \le 1$. Since $\tilde J_2(g) = \sum_{k \ne \alpha} \theta_{\alpha k}^{-1} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 = c^T \big( \sum_{k \ne \alpha} \theta_{\alpha k} Q_{\alpha k} \big) c$ and $\sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 = c^T \big( \sum_{k \ne \alpha} \theta_{\alpha k}^2 Q_{\alpha k} \big) c$, we have $J_2^2(g) = C^2 \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 \le C^2 \tilde J_2(g)$, and consequently $J_2 \le C (\tilde J_2)^{1/2}$. Furthermore, since $V_2^2(g^{(2)}) = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \big( g^{(2)}(x) \big)^2 \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} = V(g^{(2)})$, we have $V_2(g^{(2)}) = [V(g^{(2)})]^{1/2}$.
Therefore,

$$(V_2 + \lambda_1 J_2)(g^{(2)}) \le \big( V^{1/2} + C \sqrt{\lambda_1}\, (\lambda_1 \tilde J)^{1/2} \big)(g^{(2)}) \le (1 + C^2 \lambda_1)^{1/2} \big\{ (V + \lambda_1 \tilde J)(g^{(2)}) \big\}^{1/2}$$

by the Cauchy-Schwarz inequality. Finally,

$$(\bar V + \lambda_1 J)(\hat g - g_0) = (V_1 + \lambda_1 J_1)(\hat g^{(1)} - g_0^{(1)}) + (V_2 + \lambda_1 J_2)(\hat g^{(2)} - g_0^{(2)})$$
$$\le (V + \lambda_1 \tilde J)(\hat g^{(1)} - g_0^{(1)}) + (1 + C^2 \lambda_1)^{1/2} \big\{ (V + \lambda_1 \tilde J)(\hat g^{(2)} - g_0^{(2)}) \big\}^{1/2}$$
$$= O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l) + O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}) = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}). \tag{61}$$

Proof of Corollary 1: By the definition of $\bar V(\cdot)$, $\bar V(\hat g - g_0) = V_1(\hat g^{(1)} - g_0^{(1)}) + V_2(\hat g^{(2)} - g_0^{(2)}) = V(\hat g^{(1)} - g_0^{(1)}) + [V(\hat g^{(2)} - g_0^{(2)})]^{1/2}$. Following (61), $[V(\hat g^{(2)} - g_0^{(2)})]^{1/2} = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2})$. Following Lin et al. (2000), under the conditions $0 < c_5 < \rho(x) < c_6$ and $0 < c_7 < f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) < c_8$ for some positive constants $c_5$, $c_6$, $c_7$, and $c_8$, $[V(g)]^{1/2}$ is equivalent to the $L_2$ norm. Specifically, $V(g) \asymp \|g\|_2^2 = \sum_{j=1}^{p} \|\eta_j\|_2^2 + \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|_2^2$, $V(g^{(1)}) \asymp \sum_{j=1}^{p} \|\eta_j\|_2^2$, and $V(g^{(2)}) \asymp \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|_2^2$, where $\asymp$ denotes equivalence. By definition, $V_2(g^{(2)}) = [V(g^{(2)})]^{1/2} \asymp \big( \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|_2^2 \big)^{1/2}$. Consequently, the two-way interactions under the $L_2$ norm have the same convergence rate as $[V(\hat g^{(2)} - g_0^{(2)})]^{1/2}$:

$$\|\hat\eta_{\alpha k} - \eta_{0\alpha k}\|_2 = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}), \quad k \ne \alpha,\ k = 1, \dots, p.$$

References

Rajiv Agarwal, Joyce L. Davis, and Linda Smith. Serum albumin is strongly associated with erythropoietin sensitivity in hemodialysis patients. Clinical Journal of the American Society of Nephrology, 3(1):98-104, 2008.

Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. The multiple quantile graphical model. arXiv preprint arXiv:1607.00515, 2016.

Adelchi Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Biometrika, 83(4):715-726, 1996.

Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

Shizhe Chen, Daniela M. Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47-64, 2015.

Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365-393, 2017.

Rina Foygel and Mathias Drton. Extended Bayesian information criteria for Gaussian graphical models. arXiv preprint arXiv:1011.6640, 2010.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

Chong Gu. Smoothing Spline ANOVA Models, volume 297. Springer Science & Business Media, 2013.

Chong Gu and Ping Ma. Nonparametric regression with cross-classified responses. Canadian Journal of Statistics, 39(4):591-609, 2011.

Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Statistica Sinica, pages 1131-1153, 2013.

Chong Gu et al. Smoothing spline ANOVA models: R package gss. Journal of Statistical Software, 58(5):1-25, 2014.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Holger Höfling and Robert Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10(4), 2009.

Cho-Jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Mátyás A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330-2338, 2011.
Cho-Jui Hsieh, Mátyás A. Sustik, Inderjit S. Dhillon, Pradeep Ravikumar, et al. QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15(1):2911-2947, 2014.

Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density ANOVA estimation, with application to nonparametric graphical model building. Statistica Sinica, pages 353-374, 2006.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Eduardo Lacson, John Rogus, Ming Teng, Michael Lazarus, and Raymond Hakim. The association of race with erythropoietin dose in patients on long-term hemodialysis. American Journal of Kidney Diseases, 52(6):1104-1114, 2008.

John Lafferty, Han Liu, Larry Wasserman, et al. Sparse nonparametric graphical models. Statistical Science, 27(4):519-537, 2012.

Ginette Lafit, Francis Tuerlinckx, Inez Myin-Germeys, and Eva Ceulemans. A partial correlation screening approach for controlling the false positive rate in sparse Gaussian graphical models. Scientific Reports, 9(1):1-24, 2019.

Jason D. Lee and Trevor J. Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230-253, 2015.

Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272-2297, 2006.

Yi Lin et al. Tensor product space ANOVA models. The Annals of Statistics, 28(3):734-755, 2000.

Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295-2328, 2009.

Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735-746, 2009.

Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.

Jian Shi, Anna Liu, and Yuedong Wang. Spline density estimation and inference with model-based penalties. Journal of Nonparametric Statistics, 31:596-611, 2019.

Arun Suggala, Mladen Kolar, and Pradeep K. Ravikumar. The expxorcist: Nonparametric graphical models via conditional exponential densities. In Advances in Neural Information Processing Systems, pages 4446-4456, 2017.

Berwin A. Turlach and Andreas Weingessel. quadprog: Functions to solve quadratic programming problems. CRAN package quadprog, 2007.

Peter Kehinde Uduagbamen, John Omotola Ogunkoya, Igwebuike Chukwuyerem Nwogbe, Solomon Olubunmi Eigbe, and Oluwamayowa Ruth Timothy. Ultrafiltration volume: Surrogate marker of the extraction ratio, determinants, clinical correlates and relationship with the dialysis dose. Journal of Clinical Nephrology and Renal Care, 7:068, 2021.

Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models. Biometrika, 101(1):85-101, 2014.

Yuedong Wang. Smoothing Splines: Methods and Applications. CRC Press, 2011.

Hans F. Weinberger. Variational Methods for Eigenvalue Approximation. SIAM, 1974.
Anja Wille, Philip Zimmermann, Eva Vranová, Andreas Fürholz, Oliver Laule, Stefan Bleuler, Lars Hennig, Amela Prelić, Peter von Rohr, Lothar Thiele, et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11):1-13, 2004.

Jiahui Yu, Jian Shi, Anna Liu, and Yuedong Wang. Smoothing spline semiparametric density models. Journal of the American Statistical Association, 2020.

Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

Hao Helen Zhang, Guang Cheng, and Yufeng Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099-1112, 2011.