# Nonparametric Neighborhood Selection in Graphical Models

Journal of Machine Learning Research 23 (2022) 1-36. Submitted 2/22; Revised 7/22; Published 10/22.

Hao Dong (haodong@pstat.ucsb.edu)
Department of Statistics and Applied Probability, University of California, Santa Barbara, Santa Barbara, CA, USA

Yuedong Wang (yuedong@pstat.ucsb.edu)
Department of Statistics and Applied Probability, University of California, Santa Barbara, Santa Barbara, CA, USA

Editor: Daniela Witten

©2022 Hao Dong and Yuedong Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/22-0207.html.

Abstract

The neighborhood selection method directly explores the conditional dependence structure and has been widely used to construct undirected graphical models. However, except for some special cases with discrete data, there is little research on nonparametric methods for neighborhood selection with mixed data. This paper develops a fully nonparametric neighborhood selection method under a consolidated smoothing spline ANOVA (SS ANOVA) decomposition framework. The proposed model is flexible and contains many existing models as special cases. The proposed method provides a unified framework for mixed data without any restrictions on the type of each random variable. We detect edges by applying an $L_1$ regularization to interactions in the SS ANOVA decomposition. We propose an iterative procedure to compute the estimates and establish the convergence rates for the conditional density and interactions. Simulations indicate that the proposed methods perform well under Gaussian and non-Gaussian settings. We illustrate the proposed methods using two real data examples.

Keywords: conditional density estimation, mixed data, regularization, reproducing kernel Hilbert space, smoothing spline ANOVA

1. Introduction

Discovering conditional independence among random variables is an essential task in statistics. Undirected probabilistic graphical models play a pivotal role in characterizing conditional independence. They have been utilized in a wide range of scientific and engineering domains, including statistical physics, computer vision, machine learning, and computational biology (Koller and Friedman, 2009).

A graphical model is constructed based on an undirected graph $G = (V, E)$ with node set $V = \{1, \dots, p\}$ representing $p$ random variables $X_1, \dots, X_p$ and edge set $E \subseteq V \times V$ describing the conditional dependence among $X_1, \dots, X_p$. Let $X = (X_1, \dots, X_p)$, and let $X_{\setminus\{i_1,\dots,i_k\}}$ be the sub-vector of $X$ without the elements in $\{i_1, \dots, i_k\}$. Then $\{i, j\} \notin E$ corresponds to conditional independence between $X_i$ and $X_j$ given the other variables in $X$, denoted as $X_i \perp X_j \mid X_{\setminus\{i,j\}}$.

As the joint density ultimately determines the conditional relationships, methods for edge detection based on estimating the joint density have been proposed (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008; Hsieh et al., 2014; Liu et al., 2009). Under the Gaussian assumption $X \sim N(0, \Sigma)$, the task of edge detection reduces to the estimation of the precision matrix $\Sigma^{-1}$. Yuan and Lin (2007), Banerjee et al. (2008), and Friedman et al. (2008) proposed regularization methods that minimize the log-likelihood with an $L_1$ penalty on the entries of $\Sigma^{-1}$. Hsieh et al. (2014) proposed a fast second-order algorithm for solving the $L_1$-regularized Gaussian MLE.
Liu et al. (2009) extended the $L_1$-regularized Gaussian MLE approach to the setting where there exist monotone transformations $f_1, \dots, f_p$ such that $(f_1(X_1), \dots, f_p(X_p)) \sim N(0, \Sigma)$. These parametric and semi-parametric methods may be too restrictive for some applications and cannot handle mixed data since they rely on the Gaussian assumption.

Let $f(x)$ be the joint density function of $X$ with $f > 0$ and $\int f = 1$, and consider the transformation $f(x) = e^{\eta(x)} / \int e^{\eta(x)} dx$, where $\eta(x)$ is the logistic transformation of $f$. The SS ANOVA decomposition represents $\eta(x)$ as a summation of a constant, main effects, and interactions:

$$\eta(x_1, \dots, x_p) = c + \sum_{j=1}^{p} \eta_j(x_j) + \sum_{1 \le j < k \le p} \eta_{jk}(x_j, x_k) + \cdots. \tag{2}$$

An SS ANOVA model for $\eta$ in (3) may contain any subset of components in the SS ANOVA decomposition (2). For simplicity, we assume that $\eta \in \mathcal{M}_\alpha$, where

$$\mathcal{M}_\alpha = \bigoplus_{j=1}^{p} \mathcal{H}_{(j)} \oplus \Big\{ \bigoplus_{k \ne \alpha} \big[ \mathcal{H}_{(\alpha)} \otimes \mathcal{H}_{(k)} \big] \Big\} \tag{4}$$

is a subspace with main effects and two-way interactions only. A function $\eta \in \mathcal{M}_\alpha$ can be decomposed as follows:

$$\eta(x) = \sum_{j=1}^{p} \eta_j(x_j) + \sum_{k \ne \alpha} \eta_{\alpha k}(x_\alpha, x_k), \tag{5}$$

where each functional component in (5) belongs to the corresponding subspace in (4). We note that the proposed method can be easily extended to include higher-order interactions.

Remark 1 The SS ANOVA model (1) for the joint density with main effects and two-way interactions only is a pairwise graphical model, which is commonly assumed in the existing literature.

Remark 2 We consider both the log-likelihood and pseudo log-likelihood approaches for estimating the conditional density (Gu, 2013). We present the pseudo-likelihood approach in the main text since it is computationally more efficient. The log-likelihood approach is presented in Appendix A. For the pseudo log-likelihood approach, the model space $\mathcal{M}_\alpha$ includes constant functions. The model space for the log-likelihood approach eliminates the constant functions for identifiability.

Remark 3 To compare estimation between the joint and neighborhood approaches under pairwise graphical models, we consider the SS ANOVA decomposition (1) with main effects and two-way interactions only for the joint density and the SS ANOVA decomposition (5) for the conditional density. The joint density approach needs to estimate all main effects and two-way interactions simultaneously, with a total number of components proportional to $p^2$. Our experience indicates that this joint approach is computationally infeasible even with moderately large $p$ due to memory constraints. On the other hand, the neighborhood approach needs to estimate $p$ main effects and $p-1$ two-way interactions for each node, which significantly reduces the computational cost and memory requirement and is parallelizable.

Remark 4 Model (5) contains many parametric models as special cases. Specifically, the Gaussian graphical model is a special case with $\mathcal{X}_j = \mathbb{R}$, $\eta_j(x_j) = \beta_j x_j - x_j^2/2$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} x_\alpha x_k$ for some constants $\beta_j$ and $\beta_{\alpha k}$. The Ising model for binary data is a special case with $\mathcal{X}_j = \{0, 1\}$, $\eta_j(x_j) = \beta_j x_j$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} x_\alpha x_k$. The Poisson graphical model for discrete data is a special case with $\mathcal{X}_j = \{0, 1, 2, \dots\}$, $\eta_j(x_j) = \beta_j x_j - \log(x_j!)$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} x_\alpha x_k$. The exponential family model proposed by Suggala et al. (2017),

$$\log f(x_\alpha \mid x_{\setminus\{\alpha\}}) \propto \beta_\alpha B_\alpha(x_\alpha) + \sum_{\{\alpha,k\} \in E} \beta_{\alpha k} B_\alpha(x_\alpha) B_k(x_k) + C_\alpha(x_\alpha), \tag{6}$$

is also a special case with $\eta_j(x_j) = \beta_j B_j(x_j) + C_j(x_j)$ for $j = \alpha$ and $0$ otherwise, and $\eta_{\alpha k}(x_\alpha, x_k) = \beta_{\alpha k} B_\alpha(x_\alpha) B_k(x_k)$. Note that many existing exponential family models, including (6), assume a multiplicative interaction, while model (5) does not assume any specific form for the interactions. Therefore, the proposed model is more general.
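To make the Gaussian special case in Remark 4 concrete, the following numerical sketch is ours rather than the authors': it builds $\eta$ from a unit-diagonal precision matrix (so $\beta_{\alpha k} = -\omega_{\alpha k}$), and checks on a grid that $e^{\eta}$ recovers the Gaussian conditional density of $X_\alpha$ given the rest up to normalization. The variable names and grid-based normalization are our own choices.

```python
# A minimal numerical check (ours) of the Gaussian special case in Remark 4:
# with eta_alpha(x_a) = beta_a * x_a - x_a**2 / 2 and
# eta_ak(x_a, x_k) = beta_ak * x_a * x_k, exp(eta) is proportional to the
# conditional density of X_alpha given the rest.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, alpha = 4, 0
Omega = np.eye(p)                        # precision matrix with unit diagonal
Omega[0, 1] = Omega[1, 0] = 0.4          # a single edge
beta_a = 0.0                             # linear coefficient of the main effect
beta_ak = -Omega[alpha]                  # interaction coefficients, k != alpha

x_rest = rng.normal(size=p)              # conditioning values for the other nodes
grid = np.linspace(-8, 8, 4001)

def eta(x_a):
    main = beta_a * x_a - x_a**2 / 2                       # eta_alpha(x_alpha)
    inter = sum(beta_ak[k] * x_a * x_rest[k]               # eta_{alpha,k}
                for k in range(p) if k != alpha)
    return main + inter

dens = np.exp([eta(t) for t in grid])
dens /= dens.sum() * (grid[1] - grid[0])                   # normalize on the grid

# The Gaussian conditional is N(mu, 1) with mu = -sum_{k != alpha} w_ak x_k.
mu = -sum(Omega[alpha, k] * x_rest[k] for k in range(p) if k != alpha)
assert np.allclose(dens, norm.pdf(grid, loc=mu), atol=1e-4)
```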
2.3 Penalized Pseudo Log-likelihood Estimation

For each node $\alpha \in V$, we assume that $\eta(x) \in \mathcal{M}_\alpha$, where $\mathcal{M}_\alpha$ is given in (4) and $\eta$ is decomposed as in (5). We further decompose $\mathcal{H}_{(j)}$ as $\mathcal{H}_{(j)} = \mathcal{H}^0_{(j)} \oplus \mathcal{H}^1_{(j)}$, where $\mathcal{H}^0_{(j)}$ is a finite-dimensional space containing functions that are not subject to penalty. We estimate $\eta$ in (5) by minimizing the following penalized pseudo log-likelihood in $\mathcal{M}_\alpha$:

$$l_\alpha + \lambda_1 \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2 + \tau_1 \sum_{k \ne \alpha} w_{\alpha k} \|\eta_{\alpha k}\|, \tag{7}$$

where $l_\alpha = n^{-1} \sum_{i=1}^{n} \big\{ e^{-\eta(x_i)} + \int_{\mathcal{X}_\alpha} \eta(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha \big\}$ is the pseudo log-likelihood, with $x_i^\alpha$ denoting $x_i$ with its $\alpha$th element replaced by the integration variable $x_\alpha$, $\rho(\cdot)$ is a known density of $X_\alpha$ conditional on $X_{\setminus\{\alpha\}} = x_{i,\setminus\{\alpha\}}$, $P_j$ is the projection operator onto $\mathcal{H}^1_{(j)}$, $\lambda_1$, $\tau_1$, and the $\theta_j$'s are tuning parameters, $0 \le w_{\alpha k} < \infty$ are pre-specified weights, and $\|\cdot\|$ is an induced norm in $\mathcal{M}_\alpha$. The pseudo log-likelihood $l_\alpha$ measures the goodness-of-fit. The second element in (7) is the roughness $L_2$ penalty on the main effects. The third element in (7) is the $L_1$ penalty for selecting the neighborhood $\mathrm{nb}_G(\alpha)$. We allow different weights in the $L_1$ penalty for flexibility.

Remark 5 The idea of the pseudo log-likelihood was first developed in Jeon and Lin (2006) for joint density estimation. Gu (2013) extended this approach to conditional density estimation. We present the pseudo log-likelihood estimation in the main text since this approach is computationally more efficient. The log-likelihood approach to conditional density estimation needs to calculate the integral $\int_{\mathcal{X}_\alpha} e^{\eta(x_i^\alpha)} dx_\alpha$ repeatedly, which can be computationally intensive. With a proper choice of $\rho$, the pseudo log-likelihood approach needs to calculate an integral only once.

Remark 6 The proposed method replaces the $L_2$ penalty on interactions in Gu (2013) with the $L_1$ penalty for neighborhood selection, and it differs from that in Jeon and Lin (2006) in two aspects. First, Jeon and Lin's (2006) approach is a global method that estimates the joint density; it is thus computationally intensive and can only handle small dimensions $p$. Second, Jeon and Lin (2006) imposed the $L_1$ penalty on both main effects and interactions. Consequently, their method selects both nodes and edges. In practice, the nodes are usually given, and the goal is to detect edges. Therefore, we apply the smoothness-promoting $L_2$ penalty to the main effects and the sparsity-promoting $L_1$ penalty to the interactions.

Let $\mathcal{G}$ collect the nonconstant components of $\mathcal{M}_\alpha$, that is, the main-effect subspaces and the interaction subspaces $\bigoplus_{k \ne \alpha} [\mathcal{H}_{(\alpha)} \otimes \mathcal{H}_{(k)}]$. We can rewrite $\eta(x) = \varsigma + g(x)$, where $g(x) = \sum_{j=1}^{p} \eta_j(x_j) + \sum_{k \ne \alpha} \eta_{\alpha k}(x_\alpha, x_k) \in \mathcal{G}$. We first estimate $\varsigma$ with fixed $g$ and then estimate $g$ using the profiled pseudo log-likelihood. The results are summarized in the following proposition.

Proposition 1 With fixed $g$, the minimizing $\varsigma$ in (7) is $\hat\varsigma = \log\{n^{-1} \sum_{i=1}^{n} e^{-g(x_i)}\}$, and the penalized pseudo log-likelihood (7) reduces to the following penalized profiled pseudo log-likelihood:

$$l(\hat\varsigma(g), g) + \lambda_1 \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2 + \tau_1 \sum_{k \ne \alpha} w_{\alpha k} \|\eta_{\alpha k}\|, \tag{9}$$

where $l(\hat\varsigma(g), g) = \log\{n^{-1} \sum_{i=1}^{n} e^{-g(x_i)}\} + n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} g(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$ is the profiled pseudo log-likelihood.

The proof can be found in Appendix C.
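As a small illustration of Proposition 1, the profiling step only needs $g$ evaluated at the observations and the precomputed integrals $\int_{\mathcal{X}_\alpha} g(x_i^\alpha)\rho(x_i^\alpha)dx_\alpha$. The sketch below is ours (the function names are our own, not the authors' code):

```python
# Sketch (ours) of Proposition 1: profile out the constant varsigma and
# evaluate the profiled pseudo log-likelihood l(varsigma_hat(g), g).
import numpy as np

def varsigma_hat(g_obs):
    """varsigma_hat = log{ n^{-1} sum_i exp(-g(x_i)) }, via log-sum-exp."""
    n = g_obs.size
    m = (-g_obs).max()                       # shift for numerical stability
    return m + np.log(np.exp(-g_obs - m).sum()) - np.log(n)

def profiled_pseudo_loglik(g_obs, int_g):
    """g_obs[i] = g(x_i); int_g[i] = integral of g(x_i^alpha) rho dx_alpha."""
    return varsigma_hat(g_obs) + int_g.mean()
```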
Instead of minimizing (9), which involves $L_1$ penalties on functions, as in Lin and Zhang (2006) we solve an equivalent but more convenient minimization problem that involves $L_1$ penalties on the smoothing parameters.

Proposition 2 Minimizing

$$l(\hat\varsigma(g), g) + \lambda_1 \Big\{ \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2 + \sum_{k \ne \alpha} w_{\alpha k} \theta_{\alpha k}^{-1} \|\eta_{\alpha k}\|^2 \Big\} + \lambda_2 \sum_{k \ne \alpha} w_{\alpha k} \theta_{\alpha k}, \tag{10}$$

subject to $\theta_{\alpha k} \ge 0$ for $k = 1, \dots, p$ and $k \ne \alpha$, is equivalent to minimizing (9).

The proof of the equivalence can be found in Appendix C. Proposition 2 transforms the selection of nonzero functions $\eta_{\alpha k}$ in (9) into a selection of nonzero parameters $\theta_{\alpha k}$. The minimization problem (10) consists of $L_2$ penalties on functions and $L_1$ penalties on parameters, and existing methods can be modified to solve each part. Computational details for solving (10) are presented in Section 3. Since the pseudo log-likelihood is used for estimation, we compute the conditional density estimate using the following proposition.

Proposition 3 The resulting estimate of the conditional density is $\hat f(x_\alpha \mid x_{\setminus\{\alpha\}}) \propto e^{\hat g(x)} \rho(x)$, where $\hat g$ is the minimizer of (10).

The proof of Proposition 3 is given in Appendix C.

Notice that the minimization problem (10) involves $p-1$ two-way interaction terms. Solving (10) for all $\alpha = 1, \dots, p$ leads to two estimates for each two-way interaction, denoted as $\hat\eta_{\alpha k}$ and $\hat\eta_{k\alpha}$ for $\alpha, k = 1, \dots, p$ and $\alpha \ne k$. There are two commonly used rules to combine the results: the AND-rule ($\{\alpha, k\} \in E$ iff $\hat\eta_{\alpha k} \ne 0$ and $\hat\eta_{k\alpha} \ne 0$) or the OR-rule ($\{\alpha, k\} \in E$ iff $\hat\eta_{\alpha k} \ne 0$ or $\hat\eta_{k\alpha} \ne 0$) (Hastie et al., 2015). As discussed in Section 4.2 of Chen et al. (2015), when the $\alpha$th and $k$th nodes are of the same type (same marginal distribution) or are both non-Gaussian, there is no clear reason to prefer one edge estimate over the other. We adopt the AND-rule in all simulations and real data examples.

3. Algorithm

In this section, we propose a computational algorithm that solves (10) iteratively. Denote $\theta_1 = (\theta_1, \dots, \theta_p)^T$, $\theta_2 = (\theta_{\alpha 1}, \dots, \theta_{\alpha(\alpha-1)}, \theta_{\alpha(\alpha+1)}, \dots, \theta_{\alpha p})^T$, and $w = (w_{\alpha 1}, \dots, w_{\alpha(\alpha-1)}, w_{\alpha(\alpha+1)}, \dots, w_{\alpha p})^T$. Let $\mathcal{H}_{(j)} = \mathcal{H}^0_{(j)} \oplus \mathcal{H}^1_{(j)}$, where $\mathcal{H}^0_{(j)}$ is a finite-dimensional space containing functions that are not subject to the $L_2$ penalty. Denote $\phi_{j1}, \dots, \phi_{jm_j}$ as basis functions of $\mathcal{H}^0_{(j)}$, and $R^1_j$, $R_j$, and $R_{\alpha k}$ as the reproducing kernels of $\mathcal{H}^1_{(j)}$, $\mathcal{H}_{(j)}$, and $\mathcal{H}_{(\alpha k)}$, respectively. We collect all basis functions $\phi_{jk}$ for $j = 1, \dots, p$ and $k = 1, \dots, m_j$ and denote them as $\phi = (\phi_1, \dots, \phi_m)^T$, a vector of functions of $x$ with dimension $m = \sum_{j=1}^{p} m_j$.

Since in general the minimization problem (10) does not have a solution in a finite-dimensional space, as in Gu (2013), we approximate the solution by a subset of representers. Specifically, let $\{\tilde x_u = (\tilde x_{u,1}, \dots, \tilde x_{u,p}),\ u = 1, \dots, q\}$ be a subset of all observations $\{x_i,\ i = 1, \dots, n\}$. Let $\xi_{1ju}(x_j) = R^1_j(\tilde x_{u,j}, x_j)$ and $\xi_{\alpha k u}(x_\alpha, x_k) = R_{\alpha k}((\tilde x_{u,\alpha}, \tilde x_{u,k}), (x_\alpha, x_k))$ for $u = 1, \dots, q$, $k = 1, \dots, p$, and $k \ne \alpha$. Let $\xi_{\theta_1,u}(x) = \sum_{j=1}^{p} \theta_j \xi_{1ju}(x_j)$, $\xi_{\theta_1}(x) = (\xi_{\theta_1,1}, \dots, \xi_{\theta_1,q})^T$, $\xi_{\theta_2,u}(x) = \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} \xi_{\alpha k u}(x_\alpha, x_k)$, $\xi_{\theta_2}(x) = (\xi_{\theta_2,1}, \dots, \xi_{\theta_2,q})^T$, and $\xi(x) = \xi_{\theta_1}(x) + \xi_{\theta_2}(x)$. The approximate solution can be represented as a linear combination of basis functions and representers:

$$\hat g(x) = \sum_{v=1}^{m} d_v \phi_v(x) + \sum_{u=1}^{q} c_u \Big\{ \sum_{j=1}^{p} \theta_j \xi_{1ju}(x_j) + \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} \xi_{\alpha k u}(x_\alpha, x_k) \Big\} = \phi^T(x) d + \xi^T(x) c, \tag{11}$$

where $c = (c_1, \dots, c_q)^T$ and $d = (d_1, \dots, d_m)^T$ are coefficients. Let

$$Q = \sum_{j=1}^{p} \theta_j Q_j + \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} Q_{\alpha k},$$

where $Q_j = \big\{ R^1_j(\tilde x_{u,j}, \tilde x_{v,j}) \big\}_{u,v=1}^{q}$ are kernel matrices for the main effects and $Q_{\alpha k} = \big\{ R_{\alpha k}((\tilde x_{u,\alpha}, \tilde x_{u,k}), (\tilde x_{v,\alpha}, \tilde x_{v,k})) \big\}_{u,v=1}^{q}$ are kernel matrices for the two-way interactions.
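These kernel matrices are easy to assemble once the reproducing kernels are available. The sketch below is ours, with generic kernel callables `R1` and `Rint` standing in for $R^1_j$ and $R_{\alpha k}$ (hypothetical names, not an API from the paper):

```python
# Sketch (ours) of assembling Q = sum_j theta_j Q_j
#                                + sum_{k != alpha} theta_ak Q_ak / w_ak.
import numpy as np

def gram(kernel, U):
    """q x q kernel matrix over the entries of U."""
    return np.array([[kernel(u, v) for v in U] for u in U])

def assemble_Q(Xt, alpha, R1, Rint, theta1, theta2, w):
    """Xt: (q, p) array of representer points x~_u."""
    q, p = Xt.shape
    Q = np.zeros((q, q))
    for j in range(p):                           # main-effect blocks Q_j
        Q += theta1[j] * gram(lambda a, b, j=j: R1(j, a, b), Xt[:, j])
    others = [k for k in range(p) if k != alpha]
    for idx, k in enumerate(others):             # interaction blocks Q_ak
        pairs = [(Xt[u, alpha], Xt[u, k]) for u in range(q)]
        Q += (theta2[idx] / w[idx]) * gram(Rint, pairs)
    return Q
```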
We can rewrite (10) in a vector form:

$$A(c, d, \theta_2) = \log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \xi_i^T c} \Big\} + b_\phi^T d + b_\xi^T c + \frac{\lambda_1}{2} c^T Q c + \lambda_2 w^T \theta_2, \tag{12}$$

where $\phi_i = \phi(x_i)$, $\xi_i = \xi(x_i)$, $b_\phi = n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} \phi(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$, and $b_\xi = n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} \xi(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$. We solve (12) by updating $c$, $d$, and $\theta_2$, alternating between the two steps discussed in the following two subsections.

3.1 Newton-Raphson Procedure

We fix $\theta_2$ and update $c$ and $d$ at this step. Dropping the last term, which is independent of $c$ and $d$, (12) reduces to

$$A_1(c, d) = \log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \xi_i^T c} \Big\} + b_\phi^T d + b_\xi^T c + \frac{\lambda_1}{2} c^T Q c. \tag{13}$$

Note that (13) has the same form as (10.31) in Gu (2013). Therefore, we can solve (13) using the Newton-Raphson procedure, with $\lambda_1$ and $\theta_1$ selected by the approximate cross-validation (ACV) method (Gu, 2013). We note that $\theta_2$ is fixed at this step. Therefore, the existing function in the gss R package cannot be used directly. More implementation details can be found in Appendix B.1.

3.2 Quadratic Programming

We fix $c$, $d$, $\lambda_1$, and $\theta_1$ and update $\theta_2$ at this step. We rewrite $\hat g$ in (11) as

$$\hat g(x) = \sum_{v=1}^{m} d_v \phi_v(x) + \sum_{j=1}^{p} \theta_j \sum_{u=1}^{q} c_u \xi_{1ju}(x_j) + \sum_{k=1, k \ne \alpha}^{p} \theta_{\alpha k} w_{\alpha k}^{-1} \sum_{u=1}^{q} c_u \xi_{\alpha k u}(x_\alpha, x_k) = \phi^T(x) d + \psi_1^T(x) \theta_1 + \psi_2^T(x) \theta_2. \tag{14}$$

Let $Q^{(2)} = \sum_{k=1, k \ne \alpha}^{p} w_{\alpha k}^{-1} \theta_{\alpha k} Q_{\alpha k}$. Plugging in $\hat g(x_i)$ and keeping the terms involving $\theta_2$ only, (12) reduces to

$$\log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \psi_{1i}^T \theta_1 - \psi_{2i}^T \theta_2} \Big\} + b_{\psi_2}^T \theta_2 + \frac{\lambda_1}{2} c^T Q^{(2)} c + \lambda_2 w^T \theta_2 \tag{15}$$

subject to $\theta_2 \ge 0$, where $\psi_{1i} = \psi_1(x_i)$, $\psi_{2i} = \psi_2(x_i)$, and $b_{\psi_2} = n^{-1} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} \psi_2(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha$. Furthermore, the constrained minimization problem (15) is equivalent to

$$A_2(\theta_2) = \log\Big\{ \frac{1}{n} \sum_{i=1}^{n} e^{-\phi_i^T d - \psi_{1i}^T \theta_1 - \psi_{2i}^T \theta_2} \Big\} + b_{\psi_2}^T \theta_2 + \frac{\lambda_1}{2} c^T Q^{(2)} c \tag{16}$$

subject to $\theta_2 \ge 0$ and $w^T \theta_2 \le M$ for some constant $M$, where $M$ controls the sparsity in $\theta_2$. We note that $A_2(\theta_2)$ is a convex function of $\theta_2$ (see Appendix C for a brief proof). We solve (16) iteratively using quadratic programming. We apply K-fold cross-validation or the BIC method to select $M$. Implementation details can be found in Appendix B.2.

3.3 Algorithm

We summarize the whole algorithm as follows. A parameter with superscript $(t)$ denotes its value at the $t$th iteration.

Algorithm 1
Input: Data frame $X$ containing $n$ observations with $p$ dimensions.
Output: Estimated $c$, $d$, $\theta_2$, and the neighborhood set $\mathrm{nb}_G(\alpha)$.
1: Initialize $\theta_2^{(1)} = \theta_{2,0}$, $\theta_2^{(0)} = 0$, and $t = 1$.
2: while $\|\theta_2^{(t)} - \theta_2^{(t-1)}\|_2 / (\|\theta_2^{(t-1)}\|_2 + 10^{-6}) \ge \varepsilon$ or $t = 1$ do
3: Fix $\theta_2^{(t)}$; $(c^{(t)}, d^{(t)}) \leftarrow \arg\min_{c,d} A_1(c, d)$, with tuning parameters $\lambda_1^{(t)}$ and $\theta_1^{(t)}$ selected by the ACV method.
4: Fix $d^{(t)}$, $c^{(t)}$, $\lambda_1^{(t)}$, and $\theta_1^{(t)}$; $\theta_2^{(t+1)} \leftarrow \arg\min_{\theta_2} A_2(\theta_2)$, subject to $\theta_2 \ge 0$ and $w^T \theta_2 \le M^{(t)}$, where the tuning parameter $M^{(t)}$ is selected by K-fold cross-validation or the BIC method.
5: $t \leftarrow t + 1$.
6: end while

More implementation details can be found in Appendix B, including the initialization of $\theta_2$, the convergence criterion, and the selection of $M$.
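The alternation in Algorithm 1 is straightforward to organize in code. The sketch below is ours: `solve_newton` and `solve_qp` are hypothetical stand-ins for the Newton-Raphson update of (13) and the quadratic program (16), not the authors' implementation.

```python
# Schematic sketch (ours) of Algorithm 1. solve_newton minimizes A1 (with
# lambda1, theta1 chosen by ACV); solve_qp minimizes A2 subject to
# theta2 >= 0 and w^T theta2 <= M (M chosen by K-fold CV or BIC).
import numpy as np

def algorithm1(X, alpha, theta2_init, solve_newton, solve_qp,
               eps=1e-4, max_iter=50):
    theta2 = theta2_init.copy()                  # theta_2^{(1)}
    theta2_prev = np.zeros_like(theta2)          # theta_2^{(0)} = 0
    for t in range(1, max_iter + 1):
        # Step 3: fix theta2, update (c, d).
        c, d, lam1, theta1 = solve_newton(X, alpha, theta2)
        # Step 4: fix (c, d, lambda1, theta1), update theta2 by QP.
        theta2_prev, theta2 = theta2, solve_qp(X, alpha, c, d, lam1, theta1)
        # Step 2: relative-change stopping rule.
        rel = (np.linalg.norm(theta2 - theta2_prev)
               / (np.linalg.norm(theta2_prev) + 1e-6))
        if t > 1 and rel < eps:
            break
    others = [k for k in range(X.shape[1]) if k != alpha]
    neighborhood = {k for k, th in zip(others, theta2) if th > 0}
    return c, d, theta2, neighborhood
```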
4. Theoretical Analysis

In this section, we study the theoretical properties of the proposed method. Following similar steps and under the same regularity conditions as Gu (2013), we derive the convergence rate for the conditional density estimate $\hat g$ subject to both $L_1$ and $L_2$ penalties. In addition, we derive the convergence rate for the interactions in the SS ANOVA decomposition, which is new and important for edge detection.

Let $f_0(x_\alpha \mid x_{\setminus\{\alpha\}}) = e^{g_0(x)} \rho(x)$ be the true conditional density to be estimated. Let $g = g^{(1)} + g^{(2)}$, where $g^{(1)} = \sum_{j=1}^{p} \eta_j$ and $g^{(2)} = \sum_{k \ne \alpha} \eta_{\alpha k}$ are the main effects and interactions, respectively. Denote $\hat g$ as the minimizer of (9). Define

$$V(h_1, h_2) = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} h_1(x) h_2(x) \rho(x)\, dx_\alpha\, dx_{\setminus\{\alpha\}},$$
$$J_1(h_1, h_2) = \sum_{j=1}^{p} \theta_j^{-1} \int_{\mathcal{X}_j} (P_j h_1)(P_j h_2)\, dx_j,$$
$$J_2(h_1, h_2) = \sum_{k \ne \alpha} w_{\alpha k} \Big( \int_{\mathcal{X}_\alpha} \int_{\mathcal{X}_k} |h_{1,\alpha k} h_{2,\alpha k}|\, dx_\alpha\, dx_k \Big)^{1/2},$$
$$\tilde J_2(h_1, h_2) = \sum_{k \ne \alpha} \theta_{\alpha k}^{-1} \int_{\mathcal{X}_\alpha} \int_{\mathcal{X}_k} h_{1,\alpha k} h_{2,\alpha k}\, dx_\alpha\, dx_k,$$

for any functions $h_1, h_2 \in \mathcal{G}$, where $h_{1,\alpha k}$ and $h_{2,\alpha k}$ are the interaction components of $h_1$ and $h_2$, and $f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}})$ is the density of $X_{\setminus\{\alpha\}}$ on $\mathcal{X}_{\setminus\{\alpha\}} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_{\alpha-1} \times \mathcal{X}_{\alpha+1} \times \cdots \times \mathcal{X}_p$. Furthermore, we define $V(g) = V(g, g)$, $V_1(g^{(1)}) = V(g^{(1)})$, $V_2(g^{(2)}) = [V(g^{(2)})]^{1/2}$, $J_1(g^{(1)}) = J_1(g^{(1)}, g^{(1)}) = \sum_{j=1}^{p} \theta_j^{-1} \|P_j \eta_j\|^2$, $J_2(g^{(2)}) = J_2(g^{(2)}, g^{(2)}) = \sum_{k \ne \alpha} w_{\alpha k} \|\eta_{\alpha k}\|$, and $\tilde J_2(g^{(2)}) = \tilde J_2(g^{(2)}, g^{(2)}) = \sum_{k \ne \alpha} \theta_{\alpha k}^{-1} \|\eta_{\alpha k}\|^2$. Without loss of generality, we assume $w_{\alpha k} = 1$ in the proofs, simulations, and real data examples.

We note that $V$, $J_1$, and $\tilde J_2$ are quadratic functionals. In the proof of Corollary 1 in Appendix C, it is shown that $V(g)$, $J_1(g^{(1)})$, and $\tilde J_2(g^{(2)})$ are equivalent to $\|g\|_2^2$, $\sum_{j=1}^{p} \|P_j \eta_j\|_2^2$, and $\sum_{k \ne \alpha} \|\eta_{\alpha k}\|_2^2$, respectively, where $\|\cdot\|_2$ is the $L_2$ norm. It is also shown that $V_2(g^{(2)})$ and $J_2(g^{(2)})$ are equivalent to the square roots of $V(g^{(2)})$ and $\tilde J_2(g^{(2)})$. Let $\bar V(g) = V_1(g^{(1)}) + V_2(g^{(2)})$, $J = J_1 + J_2$, and $\tilde J(g) = J_1(g^{(1)}) + \tilde J_2(g^{(2)})$. To derive the convergence rates, we need the following conditions.

Condition 1 $V$ is completely continuous with respect to $\tilde J$.

From Theorem 3.1 of Weinberger (1974), there exist eigenvalues $\gamma_v$ of $\tilde J$ with respect to $V$ and associated eigenfunctions $\zeta_v$ such that $V(\zeta_v, \zeta_u) = \delta_{v,u}$ and $\tilde J(\zeta_v, \zeta_u) = \gamma_v \delta_{v,u}$, where $0 \le \gamma_v$ and $\delta_{v,u}$ is the Kronecker delta. Functions satisfying $\tilde J(g) < \infty$ can be expressed as a Fourier series expansion $g = \sum_v a_v \zeta_v$, where $a_v = V(g, \zeta_v)$ are the Fourier coefficients.

Condition 2 For $v$ sufficiently large and some $\varphi > 0$, the eigenvalues $\gamma_v$ of $\tilde J$ with respect to $V$ satisfy $\gamma_v > \varphi v^r$, where $r > 1$.

Consider the quadratic functional

$$-\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} g(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} g(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \frac{1}{2} V(g - g_0) + \frac{\lambda_1}{2} \tilde J(g), \tag{17}$$

and denote the minimizer of (17) as $\tilde g$. Plugging the Fourier series expansions $g = \sum_v a_v \zeta_v$ and $g_0 = \sum_v a_{v,0} \zeta_v$ into (17), $\tilde g$ has Fourier coefficients $\tilde a_v = (\kappa_v + a_{v,0})/(1 + \lambda_1 \gamma_v)$, where $\kappa_v = n^{-1} \sum_{i=1}^{n} \{ e^{-g_0(X_i)} \zeta_v(X_i) - \int_{\mathcal{X}_\alpha} \zeta_v(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha \}$. It is not difficult to verify that $E(\kappa_v) = 0$ and $E(\kappa_v^2) \le n^{-1} \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v^2(x) e^{-g_0(x)} \rho(x) dx_\alpha dx_{\setminus\{\alpha\}}$.

Condition 3 For some $c_1 < \infty$, $e^{-g_0} < c_1$.

Under Condition 3, noting that $V(\zeta_v) = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} = 1$ by the definition of $V$ and $\zeta_v$, we have $E(\kappa_v^2) \le n^{-1} c_1$.

Condition 4 For $g$ in a convex set $B_0$ around $g_0$ containing $\hat g$ and $\tilde g$, $c_2 < e^{g_0 - g} < c_3$ holds uniformly for some $0 < c_2 < c_3 < \infty$.

Condition 5 For any $u, v = 1, 2, \dots$, $\int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v^2 \zeta_u^2 e^{-g_0} \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} < c_4$ for some $c_4 < \infty$.

Conditions 1-5 are common assumptions for convergence rate analysis of SS ANOVA estimates, which were also made in Gu (2013). Condition 2 states that the eigenvalues $\gamma_v$ grow at the rate $v^r$, which controls how fast $\lambda_1$ approaches zero. Many commonly used smoothing spline models, including tensor products of cubic splines, thin-plate splines, and spherical splines, satisfy Conditions 1 and 2; see Chapter 9 of Gu (2013) for examples. Condition 4 bounds $e^{g_0 - g}$ for $g$ in a convex set $B_0$ around $g_0$. Condition 5 requires a bounded fourth moment of $\zeta_v$.

We consider the metrics $\bar V + \lambda_1 J$ and $V + \lambda_1 \tilde J$. For $Y > 0$, we denote $X = O_p(Y)$ if $\lim_{C \to \infty} \limsup_{n \to \infty} P(|X| > C Y) = 0$, and $X = o_p(Y)$ if $P(|X| > \epsilon Y) \to 0$ for every $\epsilon > 0$.

Theorem 1 Assume $\sum_v \gamma_v^l a_{v,0}^2 < \infty$ for some $l \in [1, 2]$. Under Conditions 1-5, for some $r > 1$, as $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$,

$$(V + \lambda_1 \tilde J)(\hat g - g_0) = O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l).$$
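To read the rate in Theorem 1, one can balance its two terms. The following short calculation is ours (a standard smoothing-parameter trade-off, not taken from the paper):

```latex
n^{-1}\lambda_1^{-1/r} \asymp \lambda_1^{l}
\;\Longleftrightarrow\;
\lambda_1 \asymp n^{-r/(lr+1)},
\qquad\text{so that}\qquad
(V + \lambda_1 \tilde J)(\hat g - g_0) = O_p\!\big(n^{-lr/(lr+1)}\big).
```

With this choice, the condition $n\lambda_1^{2/r} \to \infty$ is satisfied, since $n\lambda_1^{2/r} = n^{(lr-1)/(lr+1)}$ and $lr > 1$ whenever $l \ge 1$ and $r > 1$.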
Theorem 2 Under the conditions in Theorem 1,

$$(\bar V + \lambda_1 J)(\hat g - g_0) = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}).$$

Corollary 1 Assume the conditions in Theorem 2 hold, and $0 < c_5 < \rho(x) < c_6$ and $0 < c_7 < f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) < c_8$ for some positive constants $c_5$, $c_6$, $c_7$, and $c_8$. Then

$$\|\hat\eta_{\alpha k} - \eta_{0\alpha k}\|_2 = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}), \quad k \ne \alpha,\ k = 1, \dots, p,$$

where the $\eta_{0\alpha k}$ are the two-way interactions in the true function $g_0$.

We note that $\bar V + \lambda_1 J$ and $V + \lambda_1 \tilde J$ are associated with the $L_2$ norm and its square, respectively. Consequently, the convergence rate in Theorem 2 is the square root of the rate in Theorem 1. Corollary 1 holds because $V_2$ and $J_2$, associated with the two-way interactions, are equivalent to the $L_2$ norm. Consequently, the two-way interactions under the $L_2$ norm have the same convergence rate as that in Theorem 2. We only show the convergence rate for the interactions in Corollary 1 since we are mainly interested in edge selection. Proofs of all theoretical results are in Appendix C.

5. Simulation Results

We conduct simulations to evaluate the performance of the proposed method and compare it with some existing methods. We consider four scenarios: multivariate Gaussian, multivariate skewed Gaussian, a directed acyclic graph, and a Gaussian-Bernoulli mixed graphical model.

In implementing the proposed method, we estimate the conditional density for each continuous variable on the data range and transform the data into $[0, 1]$. We construct an SS ANOVA model using tensor products of cubic spline models. Specifically, let $\mathcal{H}_{(j)} = W_2^2[0, 1]$, where

$$W_2^2[0, 1] = \Big\{ f : f, f' \text{ are absolutely continuous},\ \int_0^1 (f'')^2 dx < \infty \Big\} \tag{18}$$

is the Sobolev space for cubic spline models. Each $\mathcal{H}_{(j)}$ can be decomposed as $\mathcal{H}_{(j)} = \{1_{(j)}\} \oplus \bar{\mathcal{H}}_{(j)}$ and $\bar{\mathcal{H}}_{(j)} = \mathcal{H}^0_{(j)} \oplus \mathcal{H}^1_{(j)}$, where $\mathcal{H}^0_{(j)}$ and $\mathcal{H}^1_{(j)}$ are RKHS's with reproducing kernels $R^0_j(x, z) = k_1(x) k_1(z)$ and $R^1_j(x, z) = k_2(x) k_2(z) - k_4(|x - z|)$, respectively, where $k_1(x) = x - 0.5$, $k_2(x) = \frac{1}{2}\big(k_1^2(x) - \frac{1}{12}\big)$, and $k_4(x) = \frac{1}{24}\big(k_1^4(x) - \frac{k_1^2(x)}{2} + \frac{7}{240}\big)$. The SS ANOVA decomposition of $\otimes_{j=1}^{p} \mathcal{H}_{(j)}$ can then be constructed based on these decompositions. More details can be found in Wang (2011).

In all simulations and real data applications, when using the pseudo log-likelihood method, we set

$$\rho(x_\alpha, x_{\setminus\{\alpha\}}) = \frac{\phi\big( (x_\alpha - \mu(x_{\setminus\{\alpha\}}))/\sigma \big)}{\Phi\big( (1 - \mu(x_{\setminus\{\alpha\}}))/\sigma \big) - \Phi\big( (-\mu(x_{\setminus\{\alpha\}}))/\sigma \big)}, \tag{19}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and cumulative distribution functions, and $\mu(\cdot)$ and $\sigma$ are estimated by fitting a nonparametric regression model in the model space (4) with covariates $x_{\setminus\{\alpha\}}$. More estimation details can be found in Chapter 3 of Gu (2013). We select the tuning parameter $M$ using 5-fold cross-validation in all simulations.
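The kernels above are simple polynomials and can be transcribed directly. The sketch below is ours; it implements $k_1$, $k_2$, $k_4$, and the reproducing kernels $R^0_j$ and $R^1_j$ on $[0, 1]$, and evaluates a small Gram matrix of the kind used for $Q_j$:

```python
# Direct transcription (ours) of the cubic-spline kernels on [0, 1].
import numpy as np

def k1(x): return x - 0.5
def k2(x): return (k1(x)**2 - 1/12) / 2
def k4(x): return (k1(x)**4 - k1(x)**2 / 2 + 7/240) / 24

def R0(x, z):
    """Reproducing kernel of H^0_(j): k1(x) k1(z)."""
    return k1(x) * k1(z)

def R1(x, z):
    """Reproducing kernel of H^1_(j): k2(x) k2(z) - k4(|x - z|)."""
    return k2(x) * k2(z) - k4(np.abs(x - z))

# Example: Gram matrix over five representer points, as in Q_j.
xt = np.linspace(0.1, 0.9, 5)
Qj = R1(xt[:, None], xt[None, :])
assert np.allclose(Qj, Qj.T)    # symmetric, as a kernel matrix must be
```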
For the first three scenarios, where all variables are continuous, we compare the proposed method with four existing parametric and semiparametric methods: space (Sparse PArtial Correlation Estimation) (Peng et al., 2009), QUIC (QUadratic Inverse Covariance estimation) (Hsieh et al., 2011), nonparanormal (NPN) (Liu et al., 2009), and SpaCE JAM (Voorman et al., 2014). Due to memory constraints, we do not compare the proposed method with the nonparametric joint density estimation method in Gu et al. (2013).

The space method assumes that $E(X) = 0$ and $\mathrm{Cov}(X) = \Sigma$. Denote the precision matrix $\Omega = \Sigma^{-1} = (\sigma^{ij})_{p \times p}$ and $\rho^{ij} = \sigma^{ij}/\sqrt{\sigma^{ii} \sigma^{jj}}$ as the partial correlation between $X_i$ and $X_j$. Denote $x^{(i)} = (x_{1,i}, \dots, x_{n,i})^T$ as the vector of $n$ observations on the $i$th variable, $i = 1, \dots, p$. Peng et al. (2009) solved the following regularization problem for edge selection:

$$\frac{1}{2} \sum_{i=1}^{p} w_i \Big\| x^{(i)} - \sum_{j \ne i} \rho^{ij} \sqrt{\frac{\sigma^{jj}}{\sigma^{ii}}}\, x^{(j)} \Big\|^2 + \lambda \sum_{1 \le i < j \le p} |\rho^{ij}|,$$

where the $w_i$ are nonnegative weights.

Appendix C. Proofs

Proof of Proposition 3: The second derivative of $l_\alpha$ at $\eta$ in the direction $h$ is $n^{-1} \sum_{i=1}^{n} e^{-\eta(x_i)} h^2(x_i) > 0$ for any nonzero $h \in \mathcal{M}_\alpha$, and consequently, $l_\alpha$ is strictly convex. Therefore, if $\hat\eta$ is the solution to (7), the estimate of the conditional density equals $e^{\hat\eta(x)} \rho(x)$. We note that $\hat\eta = \hat\varsigma + \hat g$, where $\hat\eta$ is the solution to (7), $\hat\varsigma$ is given in Proposition 1, and $\hat g$ is the solution to (10). Since $\hat\varsigma$ is a constant independent of $x_\alpha$, the estimate of the conditional density $\hat f(x_\alpha \mid x_{\setminus\{\alpha\}})$ is proportional to $e^{\hat g(x)} \rho(x)$.

Proof of Convexity of $A_2(\theta_2)$: We show that the Hessian matrix $H_A(\theta_2)$ of $A_2(\theta_2)$ is positive semi-definite. For any vector $\nu \ne 0$, let $s_i = e^{-g(x_i)}$ and $t_i = \nu^T \psi_2(x_i)$; then

$$\nu^T H_A(\theta_2) \nu = \frac{\big( \sum_{i=1}^{n} s_i t_i^2 \big) \big( \sum_{i=1}^{n} s_i \big) - \big( \sum_{i=1}^{n} s_i t_i \big)^2}{\big( \sum_{i=1}^{n} s_i \big)^2} \ge 0$$

by the Cauchy-Schwarz inequality.

In the remainder of the Appendix, we first introduce three lemmas and then provide the proofs of Theorem 1, Theorem 2, and Corollary 1.

Lemma 1 Assume $\tilde J(g_0) < \infty$. Under Conditions 1-3, as $\lambda_1 \to 0$ and $n \to \infty$, $(V + \lambda_1 \tilde J)(\tilde g - g_0) = O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1)$.

Proof: By the Fourier series expansions of $\tilde g$ and $g_0$, we have

$$V(\tilde g - g_0) = \sum_v (\tilde a_v - a_{v,0})^2 = \sum_v \frac{\kappa_v^2 - 2 \kappa_v \lambda_1 \gamma_v a_{v,0} + \lambda_1^2 \gamma_v^2 a_{v,0}^2}{(1 + \lambda_1 \gamma_v)^2},$$
$$\lambda_1 \tilde J(\tilde g - g_0) = \sum_v \lambda_1 \gamma_v (\tilde a_v - a_{v,0})^2 = \sum_v \lambda_1 \gamma_v \frac{\kappa_v^2 - 2 \kappa_v \lambda_1 \gamma_v a_{v,0} + \lambda_1^2 \gamma_v^2 a_{v,0}^2}{(1 + \lambda_1 \gamma_v)^2}.$$

Since $E(\kappa_v) = 0$ and $E(\kappa_v^2) \le c_1/n$, we have

$$E[V(\tilde g - g_0)] \le \frac{c_1}{n} \sum_v \frac{1}{(1 + \lambda_1 \gamma_v)^2} + \lambda_1 \sum_v \frac{\lambda_1 \gamma_v}{(1 + \lambda_1 \gamma_v)^2} \gamma_v a_{v,0}^2,$$
$$E[\lambda_1 \tilde J(\tilde g - g_0)] \le \frac{c_1}{n} \sum_v \frac{\lambda_1 \gamma_v}{(1 + \lambda_1 \gamma_v)^2} + \lambda_1 \sum_v \frac{(\lambda_1 \gamma_v)^2}{(1 + \lambda_1 \gamma_v)^2} \gamma_v a_{v,0}^2. \tag{47}$$

Following similar arguments to the proof of Lemma 9.1 in Gu (2013), we have

$$\sum_v \frac{\lambda_1 \gamma_v}{(1 + \lambda_1 \gamma_v)^2} = O(\lambda_1^{-1/r}), \qquad \sum_v \frac{1}{(1 + \lambda_1 \gamma_v)^2} = O(\lambda_1^{-1/r}), \qquad \sum_v \frac{1}{1 + \lambda_1 \gamma_v} = O(\lambda_1^{-1/r}).$$

The lemma follows from (47) and the fact that $\sum_v \gamma_v a_{v,0}^2 = \tilde J(g_0) < \infty$. As in Gu (2013), when $g_0$ is supersmooth in the sense that $\sum_v \gamma_v^l a_{v,0}^2 < \infty$ for some $l \in [1, 2]$, which is assumed in Theorem 1, the rate can be improved to $O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l)$.

Now we bound the approximation error $\hat g - \tilde g$. Define

$$A_{h_1,h_2}(\tau) = \frac{1}{n} \sum_{i=1}^{n} e^{-(h_1 + \tau h_2)(X_i)} + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (h_1 + \tau h_2)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \frac{\lambda_1}{2} \tilde J(h_1 + \tau h_2),$$
$$B_{h_1,h_2}(\tau) = -\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (h_1 + \tau h_2)(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (h_1 + \tau h_2)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \frac{1}{2} V(h_1 + \tau h_2 - g_0) + \frac{\lambda_1}{2} \tilde J(h_1 + \tau h_2).$$

Taking derivatives of $A_{h_1,h_2}$ and $B_{h_1,h_2}$ with respect to $\tau$ and evaluating them at $\tau = 0$, we obtain

$$\dot A_{h_1,h_2}(0) = -\frac{1}{n} \sum_{i=1}^{n} e^{-h_1(X_i)} h_2(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} h_2(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \lambda_1 \tilde J(h_1, h_2), \tag{48}$$
$$\dot B_{h_1,h_2}(0) = -\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_2(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} h_2(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + V(h_1 - g_0, h_2) + \lambda_1 \tilde J(h_1, h_2). \tag{49}$$

Setting $h_1 = \hat g$ and $h_2 = \hat g - \tilde g$ in (48), we have

$$-\frac{1}{n} \sum_{i=1}^{n} e^{-\hat g(X_i)} (\hat g - \tilde g)(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (\hat g - \tilde g)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + \lambda_1 \tilde J(\hat g, \hat g - \tilde g) = 0. \tag{50}$$

Setting $h_1 = \tilde g$ and $h_2 = \hat g - \tilde g$ in (49), we have

$$-\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) + \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{X}_\alpha} (\hat g - \tilde g)(x_i^\alpha) \rho(x_i^\alpha) dx_\alpha + V(\tilde g - g_0, \hat g - \tilde g) + \lambda_1 \tilde J(\tilde g, \hat g - \tilde g) = 0. \tag{51}$$

Subtracting (51) from (50), we obtain

$$-\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} \big\} (\hat g - \tilde g)(X_i) + \lambda_1 \tilde J(\hat g - \tilde g) = \frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\tilde g(X_i)} - e^{-g_0(X_i)} \big\} (\hat g - \tilde g)(X_i) + V(\hat g - \tilde g, \tilde g - g_0). \tag{52}$$

Applying the mean value theorem, we have $e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} = -e^{-(\tilde g + \tau_i(\hat g - \tilde g))(X_i)} (\hat g - \tilde g)(X_i)$, where $\tau_i \in [0, 1]$. Since $\hat g$ and $\tilde g$ belong to $B_0$, which is a convex set around $g_0$, under Condition 4 there exists a $b_0^{(i)} \in (c_2, c_3)$ such that $e^{-(\tilde g + \tau_i(\hat g - \tilde g))(X_i)} = b_0^{(i)} e^{-g_0(X_i)}$. Then

$$-\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} \big\} (\hat g - \tilde g)(X_i) = \frac{1}{n} \sum_{i=1}^{n} b_0^{(i)} e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i) \ge \frac{c_2}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i). \tag{53}$$

By the same argument, there exists a $c_0^{(i)} \in (c_2, c_3)$ such that

$$\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\tilde g(X_i)} - e^{-g_0(X_i)} \big\} (\hat g - \tilde g)(X_i) = -\frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i). \tag{54}$$
Lemma 2 Under Conditions 1, 2, and 5, suppose $h_1$ and $h_2$ are functions satisfying $\tilde J(h_1) < \infty$ and $\tilde J(h_2) < \infty$. As $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$, one has

$$\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - V(h_1, h_2) = o_p\Big( \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} \Big). \tag{55}$$

Proof: Since $\tilde J(h_1) < \infty$ and $\tilde J(h_2) < \infty$, $h_1$ and $h_2$ can be expressed as Fourier series $h_1 = \sum_v h_{1,v} \zeta_v$ and $h_2 = \sum_v h_{2,v} \zeta_v$. Let

$$U_i^{(v,u)} = \zeta_v(X_i) \zeta_u(X_i) e^{-g_0(X_i)} - \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \zeta_v(x) \zeta_u(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}}.$$

Note that the $U_i^{(v,u)}$ are i.i.d. random variables with $E(U_i^{(v,u)}) = 0$. Under Condition 5, $\mathrm{Var}\big( \zeta_v(X_1) \zeta_u(X_1) e^{-g_0(X_1)} \big) < c_4$, and hence $\mathrm{Var}\big( n^{-1} \sum_{i=1}^{n} U_i^{(v,u)} \big) < c_4/n$. Furthermore,

$$\frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - V(h_1, h_2) = \sum_v \sum_u h_{1,v} h_{2,u} \Big( \frac{1}{n} \sum_{i=1}^{n} U_i^{(v,u)} \Big)$$
$$\le \Big\{ \sum_v \sum_u \frac{\big( n^{-1} \sum_{i=1}^{n} U_i^{(v,u)} \big)^2}{(1 + \lambda_1 \gamma_v)(1 + \lambda_1 \gamma_u)} \Big\}^{1/2} \Big\{ \sum_v \sum_u (1 + \lambda_1 \gamma_v)(1 + \lambda_1 \gamma_u) h_{1,v}^2 h_{2,u}^2 \Big\}^{1/2}$$
$$= O_p(n^{-1/2} \lambda_1^{-1/r}) \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} = o_p\Big( \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} \Big),$$

where the first inequality follows from the Cauchy-Schwarz inequality, and the second equality holds because of the fact that $\sum_v \frac{1}{1 + \lambda_1 \gamma_v} = O(\lambda_1^{-1/r})$ and the strong law of large numbers.

Lemma 3 Under Conditions 1, 2, and 5, as $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$,

$$\Big| \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - \frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \Big| \le 2 c_0 \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2} \tag{56}$$

holds with probability 1, where $c_0 = \max\{|c_2 - 1|, |c_3 - 1|\}$.

Proof: Note that

$$E\big| e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \big| = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} |h_1(x) h_2(x)| \rho(x) dx_\alpha dx_{\setminus\{\alpha\}}$$
$$\le \Big\{ \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} h_1^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} h_2^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} \Big\}^{1/2}$$
$$= \{ V(h_1) V(h_2) \}^{1/2} \le \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2},$$

where the first inequality follows from the Cauchy-Schwarz inequality. Then we have

$$\Big| \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) - \frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \Big| = \Big| \frac{1}{n} \sum_{i=1}^{n} (1 - c_0^{(i)}) e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \Big|$$
$$\le \frac{1}{n} \sum_{i=1}^{n} |1 - c_0^{(i)}| \big| e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \big| \le \frac{c_0}{n} \sum_{i=1}^{n} \big| e^{-g_0(X_i)} h_1(X_i) h_2(X_i) \big| \le 2 c_0 \big\{ (V + \lambda_1 \tilde J)(h_1) (V + \lambda_1 \tilde J)(h_2) \big\}^{1/2},$$

where the last inequality holds with probability 1 for $n$ sufficiently large by the strong law of large numbers.

Proof of Theorem 1: Note that $E\{ e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i) \} = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} (\hat g - \tilde g)^2(x) \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} = V(\hat g - \tilde g)$. Substituting (53) into the left-hand side of (52), we have

$$-\frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\hat g(X_i)} - e^{-\tilde g(X_i)} \big\} (\hat g - \tilde g)(X_i) + \lambda_1 \tilde J(\hat g - \tilde g) \ge \frac{c_2}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)^2(X_i) + \lambda_1 \tilde J(\hat g - \tilde g) \ge \frac{c_2}{2} V(\hat g - \tilde g) + \lambda_1 \tilde J(\hat g - \tilde g), \tag{57}$$

where the last inequality holds with probability 1 for $n$ sufficiently large by the strong law of large numbers. Substituting (54) into the right-hand side of (52) and letting $h_1 = \hat g - \tilde g$ and $h_2 = \tilde g - g_0$, we have

$$\Big| \frac{1}{n} \sum_{i=1}^{n} \big\{ e^{-\tilde g(X_i)} - e^{-g_0(X_i)} \big\} (\hat g - \tilde g)(X_i) + V(\hat g - \tilde g, \tilde g - g_0) \Big|$$
$$\le \Big| V(\hat g - \tilde g, \tilde g - g_0) - \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i) \Big| + \Big| \frac{1}{n} \sum_{i=1}^{n} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i) - \frac{1}{n} \sum_{i=1}^{n} c_0^{(i)} e^{-g_0(X_i)} (\hat g - \tilde g)(X_i) (\tilde g - g_0)(X_i) \Big|$$
$$\le (o_p(1) + 2 c_0) \big\{ (V + \lambda_1 \tilde J)(\hat g - \tilde g) (V + \lambda_1 \tilde J)(\tilde g - g_0) \big\}^{1/2}, \tag{58}$$

where the first inequality follows from (54) and the triangle inequality, and the second inequality follows from Lemmas 2 and 3. Combining (52), (57), and (58), we obtain

$$\frac{c_2}{2} (V + \lambda_1 \tilde J)(\hat g - \tilde g) \le (o_p(1) + 2 c_0) \big\{ (V + \lambda_1 \tilde J)(\hat g - \tilde g) (V + \lambda_1 \tilde J)(\tilde g - g_0) \big\}^{1/2}. \tag{59}$$

Combining (59) with Lemma 1, as $\lambda_1 \to 0$ and $n \lambda_1^{2/r} \to \infty$, we have $(V + \lambda_1 \tilde J)(\hat g - g_0) = O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l)$, and Theorem 1 holds.

Proof of Theorem 2: We know that

$$\sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 \le \Big( \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\| \Big)^2 \le (p - 1) \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2. \tag{60}$$

Therefore, there exists some constant $C \in [1, \sqrt{p-1}]$ such that $C \big\{ \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 \big\}^{1/2} = \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|$. Since $\sum_{k \ne \alpha} \theta_{\alpha k}$ is bounded by $M$, we can scale $\lambda_1$ and $\lambda_2$ such that $\theta_{\alpha k} \le 1$. Since $\tilde J_2(g) = \sum_{k \ne \alpha} \theta_{\alpha k}^{-1} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 = c^T \big( \sum_{k \ne \alpha} \theta_{\alpha k} Q_{\alpha k} \big) c$ and $\sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 = c^T \big( \sum_{k \ne \alpha} \theta_{\alpha k}^2 Q_{\alpha k} \big) c$, we have $J_2^2(g) = C^2 \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|^2 \le C^2 \tilde J_2(g)$, and consequently $J_2 \le C (\tilde J_2)^{1/2}$. Furthermore, since $V_2^2(g^{(2)}) = \int_{\mathcal{X}_{\setminus\{\alpha\}}} f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) \int_{\mathcal{X}_\alpha} \big( g^{(2)}(x) \big)^2 \rho(x) dx_\alpha dx_{\setminus\{\alpha\}} = V(g^{(2)})$, we have $V_2(g^{(2)}) = [V(g^{(2)})]^{1/2}$.
Therefore,

$$(V_2 + \lambda_1 J_2)(g^{(2)}) \le \big( V^{1/2} + C \sqrt{\lambda_1}\, (\lambda_1 \tilde J)^{1/2} \big)(g^{(2)}) \le (1 + C^2 \lambda_1)^{1/2} \big\{ (V + \lambda_1 \tilde J)(g^{(2)}) \big\}^{1/2}$$

by the Cauchy-Schwarz inequality. Finally,

$$(\bar V + \lambda_1 J)(\hat g - g_0) = (V_1 + \lambda_1 J_1)(\hat g^{(1)} - g_0^{(1)}) + (V_2 + \lambda_1 J_2)(\hat g^{(2)} - g_0^{(2)})$$
$$\le (V + \lambda_1 \tilde J)(\hat g^{(1)} - g_0^{(1)}) + (1 + C^2 \lambda_1)^{1/2} \big\{ (V + \lambda_1 \tilde J)(\hat g^{(2)} - g_0^{(2)}) \big\}^{1/2}$$
$$= O_p(n^{-1} \lambda_1^{-1/r} + \lambda_1^l) + O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}) = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}). \tag{61}$$

Proof of Corollary 1: By the definition of $\bar V(\cdot)$, $\bar V(\hat g - g_0) = V_1(\hat g^{(1)} - g_0^{(1)}) + V_2(\hat g^{(2)} - g_0^{(2)}) = V(\hat g^{(1)} - g_0^{(1)}) + [V(\hat g^{(2)} - g_0^{(2)})]^{1/2}$. Following (61), $[V(\hat g^{(2)} - g_0^{(2)})]^{1/2} = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2})$. Following Lin et al. (2000), under the conditions $0 < c_5 < \rho(x) < c_6$ and $0 < c_7 < f_{\setminus\{\alpha\}}(x_{\setminus\{\alpha\}}) < c_8$ for some positive constants $c_5$, $c_6$, $c_7$, and $c_8$, $[V(g)]^{1/2}$ is equivalent to the $L_2$ norm. Specifically, $V(g) \asymp \|g\|_2^2 = \sum_{j=1}^{p} \|\eta_j\|_2^2 + \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|_2^2$, $V(g^{(1)}) \asymp \sum_{j=1}^{p} \|\eta_j\|_2^2$, and $V(g^{(2)}) \asymp \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|_2^2$, where $\asymp$ denotes equivalence. By definition, $V_2(g^{(2)}) = [V(g^{(2)})]^{1/2} \asymp \big( \sum_{k \ne \alpha} \|\eta_{\alpha k}(x_\alpha, x_k)\|_2^2 \big)^{1/2}$. Consequently, the two-way interactions under the $L_2$ norm have the same convergence rate as $[V(\hat g^{(2)} - g_0^{(2)})]^{1/2}$:

$$\|\hat\eta_{\alpha k} - \eta_{0\alpha k}\|_2 = O_p(n^{-1/2} \lambda_1^{-1/(2r)} + \lambda_1^{l/2}), \quad k \ne \alpha,\ k = 1, \dots, p.$$

References

Rajiv Agarwal, Joyce L. Davis, and Linda Smith. Serum albumin is strongly associated with erythropoietin sensitivity in hemodialysis patients. Clinical Journal of the American Society of Nephrology, 3(1):98-104, 2008.

Alnur Ali, J. Zico Kolter, and Ryan J. Tibshirani. The multiple quantile graphical model. arXiv preprint arXiv:1607.00515, 2016.

Adelchi Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Biometrika, 83(4):715-726, 1996.

Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

Shizhe Chen, Daniela M. Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47-64, 2015.

Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365-393, 2017.

Rina Foygel and Mathias Drton. Extended Bayesian information criteria for Gaussian graphical models. arXiv preprint arXiv:1011.6640, 2010.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

Chong Gu. Smoothing Spline ANOVA Models, volume 297. Springer Science & Business Media, 2013.

Chong Gu and Ping Ma. Nonparametric regression with cross-classified responses. Canadian Journal of Statistics, 39(4):591-609, 2011.

Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Statistica Sinica, pages 1131-1153, 2013.

Chong Gu et al. Smoothing spline ANOVA models: R package gss. Journal of Statistical Software, 58(5):1-25, 2014.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Holger Höfling and Robert Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10(4), 2009.

Cho-Jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Mátyás A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330-2338, 2011.
Cho-Jui Hsieh, Mátyás A. Sustik, Inderjit S. Dhillon, Pradeep Ravikumar, et al. QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15(1):2911-2947, 2014.

Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density ANOVA estimation, with application to nonparametric graphical model building. Statistica Sinica, pages 353-374, 2006.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Eduardo Lacson, John Rogus, Ming Teng, Michael Lazarus, and Raymond Hakim. The association of race with erythropoietin dose in patients on long-term hemodialysis. American Journal of Kidney Diseases, 52(6):1104-1114, 2008.

John Lafferty, Han Liu, Larry Wasserman, et al. Sparse nonparametric graphical models. Statistical Science, 27(4):519-537, 2012.

Ginette Lafit, Francis Tuerlinckx, Inez Myin-Germeys, and Eva Ceulemans. A partial correlation screening approach for controlling the false positive rate in sparse Gaussian graphical models. Scientific Reports, 9(1):1-24, 2019.

Jason D. Lee and Trevor J. Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230-253, 2015.

Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272-2297, 2006.

Yi Lin et al. Tensor product space ANOVA models. The Annals of Statistics, 28(3):734-755, 2000.

Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295-2328, 2009.

Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735-746, 2009.

Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.

Jian Shi, Anna Liu, and Yuedong Wang. Spline density estimation and inference with model-based penalties. Journal of Nonparametric Statistics, 31:596-611, 2019.

Arun Suggala, Mladen Kolar, and Pradeep K. Ravikumar. The expxorcist: Nonparametric graphical models via conditional exponential densities. In Advances in Neural Information Processing Systems, pages 4446-4456, 2017.

Berwin A. Turlach and Andreas Weingessel. quadprog: Functions to solve quadratic programming problems. CRAN package quadprog, 2007.

Peter Kehinde Uduagbamen, John Omotola Ogunkoya, Igwebuike Chukwuyerem Nwogbe, Solomon Olubunmi Eigbe, and Oluwamayowa Ruth Timothy. Ultrafiltration volume: Surrogate marker of the extraction ratio, determinants, clinical correlates and relationship with the dialysis dose. Journal of Clinical Nephrology and Renal Care, 7:068, 2021.

Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models. Biometrika, 101(1):85-101, 2014.

Yuedong Wang. Smoothing Splines: Methods and Applications. CRC Press, 2011.

Hans F. Weinberger. Variational Methods for Eigenvalue Approximation. SIAM, 1974.
Anja Wille, Philip Zimmermann, Eva Vranová, Andreas Fürholz, Oliver Laule, Stefan Bleuler, Lars Hennig, Amela Prelić, Peter von Rohr, Lothar Thiele, et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11):1-13, 2004.

Jiahui Yu, Jian Shi, Anna Liu, and Yuedong Wang. Smoothing spline semiparametric density models. Journal of the American Statistical Association, 2020.

Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

Hao Helen Zhang, Guang Cheng, and Yufeng Liu. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 106(495):1099-1112, 2011.