Journal of Machine Learning Research 21 (2020) 1-33. Submitted 6/18; Revised 3/20; Published 10/20.

Beyond Trees: Classification with Sparse Pairwise Dependencies

Yaniv Tenzer, yaniv.tenzer@weizmann.ac.il
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, 76100, Israel

Amit Moscovich, amit@moscovich.org
Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA

Mary Frances Dorn, mfdorn@lanl.gov
Los Alamos National Laboratory, P.O. Box 1663, MS F600, Los Alamos, NM 87545, USA

Boaz Nadler, boaz.nadler@weizmann.ac.il
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, 76100, Israel

Clifford Spiegelman
Department of Statistics, Texas A&M University, College Station, TX 77843, USA

Editor: Gal Elidan

Abstract

Several classification methods assume that the underlying distributions follow tree-structured graphical models. Indeed, trees capture statistical dependencies between pairs of variables, which may be crucial to attaining low classification errors. In this setting, the optimal classifier is linear in the log-transformed univariate and bivariate densities that correspond to the tree edges. In practice, observed data may not be well approximated by trees. Yet, motivated by the importance of pairwise dependencies for accurate classification, here we propose to approximate the optimal decision boundary by a sparse linear combination of the univariate and bivariate log-transformed densities. Our proposed approach is semi-parametric in nature: we non-parametrically estimate the univariate and bivariate densities, remove pairs of variables that are nearly independent using the Hilbert-Schmidt independence criterion, and finally construct a linear SVM using the retained log-transformed densities.
We demonstrate on synthetic and real data sets that our classifier, named SLB (sparse log-bivariate density), is competitive with other popular classification methods.

Keywords: binary classification, graphical model, Bayesian network, sparsity, semi-parametric

(c) 2020 Yaniv Tenzer, Amit Moscovich, Mary Frances Dorn, Boaz Nadler and Clifford Spiegelman. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-348.html.

1. Introduction

Consider a binary classification problem where the vector of explanatory variables X = (X_1, ..., X_d) belongs to some d-dimensional space Ω ⊆ R^d, and the response variable Y takes values in the set Y = {+1, -1}. Given a training set of labeled samples {x_ℓ, y_ℓ}_{ℓ=1}^n, the goal is to construct a classifier f : Ω → Y with a small misclassification error rate on future unlabeled instances of X. It is well known that the optimal classifier is a threshold of the likelihood ratio statistic,

    LR(x) = p(x | Y = +1) / p(x | Y = -1).    (1)

A key challenge in applying this result is the estimation of the class-conditional densities p(x | Y = y). Unfortunately, accurate non-parametric density estimation requires a number of labeled samples that is exponential in the dimension d. This is known as the curse of dimensionality (see Chapter 2 of Tsybakov (2009)). Hence, in high-dimensional settings with a limited number of labeled samples, classification methods based on non-parametric multivariate density estimates may perform poorly and are not widely used in practice. Instead of a non-parametric estimate of p(x | y), a popular alternative is to model this class-conditional multivariate density with Bayesian networks (Lauritzen, 1996). Bayesian networks provide a powerful probabilistic representation of multivariate distributions, which in turn often yields accurate classifiers.
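To make the likelihood-ratio rule of Eq. (1) concrete, here is a small sketch of ours (not from the paper): when both class-conditional densities are known univariate Gaussians with unit variance and means +1 and -1, the optimal classifier thresholds the log-likelihood ratio, which in this toy case reduces to 2x.

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood_ratio(x, mu_pos=1.0, mu_neg=-1.0, sigma=1.0):
    # log LR(x) = log p(x | Y=+1) - log p(x | Y=-1); for these Gaussians it equals 2x
    return math.log(gaussian_pdf(x, mu_pos, sigma)) - math.log(gaussian_pdf(x, mu_neg, sigma))

def bayes_classify(x):
    # threshold the log-likelihood ratio at zero (equal priors assumed)
    return 1 if log_likelihood_ratio(x) >= 0.0 else -1
```

With the densities known exactly the rule is trivial; the difficulty discussed above is precisely that in dimension d these densities must be estimated.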
However, learning an unrestricted Bayesian network may require a large number of labeled samples. Moreover, it may be computationally prohibitive (Friedman et al., 1997; Grossman and Domingos, 2004). To ease the computational burden, many works have suggested various simplifying assumptions on the underlying distribution. One popular approach is to restrict attention to specific parametric forms, such as the multivariate Gaussian or copula-based distributions (Lauritzen, 1996; Elidan, 2012). A different approach, most relevant to our work and reviewed in Section 2, is based on the assumption that the multivariate distribution of each class follows a tree-structured Bayesian network (Tan et al., 2011; Friedman et al., 1997). In this case, the tree of each class may be learned by the computationally efficient algorithm of Chow and Liu (1968) or extensions thereof (Lafferty et al., 2012). As we show in Section 3, the tree structure implies that the likelihood ratio statistic of Eq. (1) takes the form of a sparse linear combination of log-transformed univariate and bivariate density terms. Hence, these tree-based approaches require only univariate and bivariate density estimates, and may be applied to high-dimensional settings with a limited number of samples. However, as real data sets may not follow a tree distribution, the resulting tree-based classifier may not accurately capture the optimal decision boundary.

Motivated by the success of forest-based approaches, in Section 3 we present a new semi-parametric classifier, denoted the sparse log-bivariate density classifier (SLB). As its name suggests, it is also based on univariate and bivariate densities. However, instead of imposing hard tree constraints, we assume that the decision boundary can be well approximated by a sparse linear combination of the logarithms of the univariate and bivariate densities.
Concretely, the SLB classifier is constructed as follows: first, we run a feature selection procedure, removing pairs of variables that are nearly independent. Next, for each class, we non-parametrically estimate its univariate densities, as well as the bivariate densities of those pairs of variables not excluded by the feature selection procedure. Finally, we construct a linear SVM in the space of these log-transformed density estimates.

Our method can be viewed as a generalization of tree-based classifiers. As we prove in Lemma 1, when the conditional distributions of both classes are tree-structured Bayesian networks, the log-likelihood ratio is a linear function of the log-transformed univariate and bivariate densities. The vector of coefficients of the resulting linear classifier, whose entries correspond to these log-transformed densities, is sparse. Furthermore, its non-zero entries are integers, determined by the tree structure. Our approach eliminates the tree-structure assumption, thus relaxing the integer constraints on the coefficients. Instead, it learns a sparse linear classifier in the space of these log-transformed densities. Our approach also extends the recent work of Fan et al. (2016), which considered a logarithmic transformation of only univariate densities to address the same problem of binary classification.

In Section 4 we present some theoretical properties of our classifier, in particular its asymptotics as the sample size tends to infinity. In Section 5 we compare our approach to several popular classifiers on both synthetic and real data sets. As our empirical results show, when the underlying distribution is forest structured, the accuracy of our classifier is comparable to methods that explicitly assume a forest structure.
However, when the underlying distributions are not forest structured, but instead follow more complicated Bayesian network models, our more flexible method often outperforms the other methods. Furthermore, our experiments highlight the importance of incorporating bivariate features. Finally, as we illustrate with several real data sets, SLB is competitive with popular classifiers. We conclude the paper in Section 6 with a discussion and potential extensions.

2. Background

In Section 2.1 we briefly review Bayesian networks and their relation to our approach. In Section 2.2 we describe the Hilbert-Schmidt independence criterion, used in our feature selection step.

2.1 Bayesian Network Classifiers

Bayesian networks provide a general framework for representing and performing inference on high-dimensional densities (Koller and Friedman, 2009). A Bayesian network is described by two components: (i) a directed acyclic graph G whose nodes correspond to the random variables X_1, ..., X_d; (ii) a set of conditional densities {f(X_i | Par_i)}, where Par_i denotes the parents of variable X_i in the graph G. For a root node X_i without parents, Par_i = ∅ and f(X_i | Par_i) is its marginal density. The structure of the graph G encodes the statistical dependencies among the d variables X_i. Specifically, given the values of its parents, each random variable X_i is conditionally independent of all other variables X_j which are not its descendants (Lauritzen, 1996). The full density p(x) is the product of all of these conditional densities. It can be shown that this indeed defines a valid joint density, which satisfies the conditional independence properties encoded by G as outlined above. One popular application of Bayesian networks is to construct classifiers using the likelihood ratio.
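As a minimal sketch of the factorization just described (our toy example, not from the paper): a hypothetical three-node linear-Gaussian chain X1 → X2 → X3, whose joint density is the product of the per-node conditional densities f(X_i | Par_i).

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical DAG X1 -> X2 -> X3 with linear-Gaussian conditionals.
# Each node maps to (parent indices, conditional density f(x_i | Par_i)).
toy_bn = {
    0: ([], lambda x, par: normal_pdf(x, 0.0, 1.0)),      # root: X1 ~ N(0, 1)
    1: ([0], lambda x, par: normal_pdf(x, par[0], 1.0)),  # X2 | X1 ~ N(X1, 1)
    2: ([1], lambda x, par: normal_pdf(x, par[0], 1.0)),  # X3 | X2 ~ N(X2, 1)
}

def joint_density(bn, x):
    # p(x) = product over nodes of f(x_i | Par_i)
    p = 1.0
    for i, (parents, cond) in bn.items():
        p *= cond(x[i], [x[j] for j in parents])
    return p
```

A likelihood-ratio classifier built from two such networks, one per class, would simply compare their two joint densities at the query point.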
Since in practice neither the graph structure nor the conditional densities are known, an essential part of constructing a Bayesian-network-based classifier is learning, for each class label, its graph G and the corresponding conditional densities. This problem is known as structure learning and has received considerable interest in recent years in the context of Bayesian networks and more general graphical models (Lee and Hastie, 2015; Yang et al., 2015; Beretta et al., 2018; Yu et al., 2020). Unfortunately, even with as few as 20 variables, learning a general real-valued graphical model beyond the simple Gaussian Bayesian network can be computationally impractical (Friedman et al., 1997; Grossman and Domingos, 2004). To overcome this computational burden, a common approach is to resort to tree-structured networks. Under this assumption, the probability density p(x) factorizes as

    p_T(x) = p(x_{m_1}) ∏_{i=2}^{d} p(x_{m_i} | x_{m_{j(i)}}),    (2)

where (m_1, ..., m_d) is a permutation of {1, ..., d} and j(i) ∈ {1, ..., i - 1} is the parent of i in the tree. An important property resulting from the tree-structured assumption is that for each class, the corresponding d-dimensional density depends only on its univariate and bivariate marginals. Given n samples, the structure learning task simplifies to estimating the optimal tree structure and its associated univariate and bivariate densities. For a discrete multivariate distribution, a tree structure may be estimated by the computationally efficient algorithm of Chow and Liu (1968). If the underlying distribution follows a tree structure, then as the sample size n → ∞ the estimated tree is consistent (Chow and Wagner, 1973). Tan et al. (2011) extended this approach to sparser structures, by removing weak edges from the learned tree.
The result is a forest, or union of disjoint trees, with the following factorization,

    p_F(x) = ∏_{(i,j)∈E_F} [ p(x_i, x_j) / (p(x_i) p(x_j)) ] ∏_{i∈V_F} p(x_i),    (3)

where E_F and V_F are the edge and vertex sets of the forest. Lafferty et al. (2012) extended the Chow-Liu approach to the case of multivariate continuous data by using kernel density estimates of the univariate and bivariate densities in Eq. (2). In their paper, the authors presented conditions under which this approach is consistent.

Several classifiers have been proposed which are based on specific forest models. Perhaps the simplest one is the popular naive Bayes classifier (Duda and Hart, 1973; Langley et al., 1992). This classifier assumes that the class-conditional joint density is simply the product distribution p_y(x) = ∏_i p_y(x_i), whose corresponding graph has no edges. A more sophisticated model was considered by Park et al. (2011). They assumed that both classes are distributed according to a Markov chain with the same structure. Namely, there exists a permutation π ∈ Sym(d) such that for all x and y,

    p_y(x) = p_y(x_{π_1}) p_y(x_{π_2} | x_{π_1}) ⋯ p_y(x_{π_d} | x_{π_{d-1}}).

Finally, Friedman et al. (1997) introduced the tree-augmented naive Bayes model (TAN). Their approach learns a tree structure over the set of variables for each class, and then applies a likelihood-ratio-based classifier. A different approach that makes a tree assumption is discriminative. For example, Tan et al. (2010) proposed to learn trees that are specifically tailored for classification. Their procedure fits each tree distribution to the observed data from that class while simultaneously maximizing the distance between the two distributions. Another example of a discriminative approach is the work of Meshi et al. (2013), where instead of fitting a tree to the observed data, the authors suggest learning a tree structure that directly minimizes a classification loss function.
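The Chow-Liu step underlying several of the tree-based methods reviewed above can be sketched as follows for discrete data (an illustration of ours of the basic idea; the continuous-data extension of Lafferty et al. uses kernel density estimates instead): estimate the mutual information of every pair of variables, then take a maximum-weight spanning tree.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    # plug-in estimate of I(X; Y) for two discrete samples of equal length
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def chow_liu_tree(data):
    # data: list of samples, each a tuple of d discrete values; returns the
    # edge list of a maximum-weight spanning tree under pairwise mutual
    # information (Kruskal's algorithm with union-find)
    d = len(data[0])
    weights = {(i, j): mutual_information([r[i] for r in data], [r[j] for r in data])
               for i, j in combinations(range(d), 2)}
    parent = list(range(d))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Dropping the weakest edges of the returned tree, as in Tan et al. (2011), would yield a forest rather than a tree.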
2.2 Hilbert-Schmidt Independence Criterion

An important component of our semi-parametric approach is a feature selection step, whereby we remove bivariate densities that correspond to (nearly) independent variables. We perform this step using the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005a,b). For completeness we briefly review this method, restricting attention to the case of continuous real-valued random variables.

Let (Z, W) be a pair of real-valued random variables on a compact domain Ω_z × Ω_w ⊆ R² with joint density P_zw and marginals P_z, P_w. HSIC is a method to assess whether Z and W are independent, i.e., whether P_zw = P_z P_w. The basic idea is that while Cov(Z, W) = 0 does not imply that Z and W are independent, having Cov(s(Z), t(W)) = 0 for all bounded continuous functions s and t does imply independence. Unfortunately, going over all bounded continuous functions is not tractable. Instead, Gretton et al. (2004) proposed evaluating sup_{s∈F, t∈G} Cov(s(Z), t(W)), where F, G are universal Reproducing Kernel Hilbert Spaces (RKHS). They proved that if the RKHS is universal and the supremum is zero, then Z and W are independent. Next, in Gretton et al. (2005a), the authors introduced a quantity HSIC(Z, W, F, G), which upper bounds the supremum, and still satisfies that HSIC(Z, W, F, G) = 0 if and only if Z and W are independent.

Defining HSIC(Z, W, F, G) in its most general form requires some background in the theory of RKHS. Here, however, we only present the properties that are essential for reading this paper. In particular, any RKHS F is associated with a kernel k(·,·) and a mapping function φ from R to F, such that k(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩_F. Next, let F, G be two RKHS with associated kernels k(·,·) and l(·,·), respectively. Let (Z′, W′) be an independent copy of (Z, W) with an identical joint distribution. Gretton et al.
(2005a) showed that HSIC(Z, W, F, G), defined as the Hilbert-Schmidt norm of the cross-covariance operator of Z and W, can be equivalently expressed in terms of the two kernels as follows, assuming all expectations exist,

    HSIC(Z, W, F, G) = E_{ZWZ′W′}[k(Z, Z′) l(W, W′)] + E_{ZZ′}[k(Z, Z′)] E_{WW′}[l(W, W′)]
                       - 2 E_{ZW}[ E_{Z′}[k(Z, Z′)] E_{W′}[l(W, W′)] ].

Given a sample of n independent pairs S = {(z_i, w_i)}_{i=1}^n drawn from the joint distribution P_zw, one can estimate the HSIC by the following formula with complexity O(n²),

    ĤSIC(Z, W, F, G) = (n - 1)^{-2} Tr(KHLH),

where K, L, H ∈ R^{n×n}, K_ij = k(z_i, z_j), L_ij = l(w_i, w_j), and H = I_n - n^{-1} 1 1^T, with 1 denoting a vector of all ones and Tr(A) denoting the trace of a matrix A. Furthermore, as proven in Gretton et al. (2005a), this estimator is nearly unbiased,

    E_S[ĤSIC(Z, W, F, G)] = HSIC(Z, W, F, G) + O(n^{-1}),

with E_S denoting expectation over the sample S.

3. The Sparse Log-Bivariate Density Classifier

To motivate the construction of the sparse log-bivariate classifier, let us first consider the optimal log-likelihood-ratio classifier when the distributions of both classes y ∈ {-1, +1} are forest-structured, possibly with different graphs G_y = (V_y, E_y). This is described in the following lemma, which follows directly from Eq. (3).

Lemma 1 If both classes are forest-structured then their log-likelihood ratio is

    log [ p_{+1}(x) / p_{-1}(x) ] = Σ_{(i,j)∈E_{+1}} log p_{+1}(x_i, x_j) + Σ_{i∈V_{+1}} (1 - deg_{+1}(i)) log p_{+1}(x_i)
                                   - Σ_{(i,j)∈E_{-1}} log p_{-1}(x_i, x_j) - Σ_{i∈V_{-1}} (1 - deg_{-1}(i)) log p_{-1}(x_i),

where deg_y(i) is the degree of x_i in the graph of class y. An immediate corollary of Lemma 1 is that if both classes follow a forest-structured graphical model then the optimal decision boundary is a linear combination of their univariate and bivariate log-transformed densities.
Specifically, let T_o : R^d → R^{d(d+1)} be the oracle transformation that maps an input vector x ∈ R^d to a vector containing all of x's univariate and bivariate log-density terms for both classes,

    T_o(x) := ( log p_y(x_i, x_j) : y ∈ {-1, +1}, i, j ∈ {1, ..., d}, i ≤ j ),

where p_y(x_i, x_i) is an alias for p_y(x_i). In this notation, Lemma 1 can be restated as log LR(x) = w_o^T T_o(x), where the coefficients of w_o depend on the graph structures of both classes. It follows that the Bayes classifier has the form

    ŷ(x) = sign(w_o^T T_o(x) - b_o).    (4)

Under the forest model, the coefficients in w_o that correspond to the bivariate densities are all either 0, +1 or -1, depending on the existence of edges in the respective graphical models. Similarly, the coefficients that multiply the univariate densities are integers that depend on the degrees of the corresponding variables in the graphs. The intercept b_o depends on the class imbalance and misclassification costs. Note that by the forest assumption, there are only O(d) non-zero coefficients in the vector w_o. As its length is O(d²), it is thus sparse.

To derive the SLB classifier, we relax the forest-structure constraints and simply assume that the optimal decision boundary can be well approximated by a sparse linear combination of the univariate and bivariate log-densities, w^T T_o(x) - b. To construct such a sparse vector w, we first exclude bivariate densities that correspond to nearly independent pairs of variables. Specifically, we measure the strength of dependence between all pairs of variables using their empirical HSIC measure, as defined in Section 2.2, and filter out bivariate densities with estimated HSIC values smaller than a given threshold λ. We then fit a linear SVM to learn the coefficients of all the univariate log-densities and of the bivariate log-densities that correspond to the remaining pairs. Since T_o is typically unknown, we replace it by a vector of estimated densities:

    T̂(x) := ( log p̂_y(x_i, x_j) : y ∈ {-1, +1}, i, j ∈ {1, ..., d}, i ≤ j ).    (5)

Lastly, setting λ in a principled manner can be done in several ways. One approach is to run a cross-validation procedure and choose the threshold whose corresponding classifier has the smallest misclassification error. Another option is to use a multiple hypothesis testing procedure on all pairwise ĤSIC values. The SLB classifier is outlined in Algorithm 1.

Algorithm 1 The Sparse Log-Bivariate Density Classifier (SLB)
Input: Sample (x_1, y_1), ..., (x_n, y_n) of feature vectors x ∈ R^d and labels y ∈ {-1, +1}.
Step 1: For each class y ∈ {+1, -1}, estimate the statistical dependencies ĤSIC between all pairs of variables i, j ∈ {1, ..., d} and filter out weakly-dependent pairs of variables.
Step 2: Compute all univariate density estimates p̂_y(x_i), and the bivariate density estimates p̂_y(x_i, x_j) for the variable pairs that were not filtered in the previous step.
Step 3: Fit a linear SVM (ŵ, b̂) to the transformed samples {(T̂(x_ℓ), y_ℓ)}, where T̂ is the log-density transformation defined in Eq. (5).
Output: Classifier given by x ↦ sign(ŵ^T T̂(x) - b̂).

4. Theoretical Analysis

In this section we study the theoretical properties of the linear SVM solution (ŵ, b̂) based on the estimated densities p̂_y(x_i, x_j), given a training set of n i.i.d. samples D = {(x_ℓ, y_ℓ)}_{ℓ=1}^n. Concretely, we prove that under certain conditions, the risk of the linear SVM solution (ŵ, b̂) based on the estimated densities p̂_y(x_i, x_j) converges to that of the optimal SVM classifier based on the exact densities p_y(x_i, x_j). To simplify the theoretical analysis we make the following two assumptions: (i) the input to the SVM classifier consists of all the univariate and bivariate log-transformed densities, without the HSIC-based filtering step. This can be viewed as an analysis of the large-sample regime where the HSIC is able to detect all of the dependent pairs.
(ii) the n training samples were split into two disjoint sets D_0 and D_1 of sizes n_0 and n_1, respectively, with n_0 + n_1 = n. The set D_0 is used to estimate the densities p̂_y(x_i) and p̂_y(x_i, x_j) for each class y ∈ {-1, +1}, and the set D_1 is used to construct the SVM classifier. Specifically, given the estimated densities, we apply the feature map T̂ : R^d → R^{d(d+1)} of Eq. (5) to the n_1 samples in D_1 and construct a linear classifier using the transformed samples {(T̂(x_ℓ), y_ℓ)}_{ℓ=n_0+1}^{n}. We add a feature that is identically equal to 1 to the feature map T, so that the intercept b can be subsumed into w. The class label predicted by a linear classifier w applied to a feature-mapped sample is thus equal to sign(w^T T(x)).

In the rest of this section, we define several risk functions and then formulate SLB as an empirical risk minimizer. The classification error rate, or 0-1 risk, of the classifier (w, T) is

    R_01(w, T) := E_{(x,y)∼D} [ 1(y ≠ sign(w^T T(x))) ].

From Eq. (4) it follows that w_o minimizes R_01(w, T_o). The empirical 0-1 risk is the average loss over the n_1 samples in D_1,

    R̂_01(w, T) := (1/n_1) Σ_{ℓ=n_0+1}^{n} 1(y_ℓ ≠ sign(w^T T(x_ℓ))).

Minimizing this risk with respect to w is computationally difficult (Ben-David et al., 2003). To circumvent this difficulty, one may consider instead the risk and empirical risk with respect to the hinge loss φ(z) := max{0, 1 - z},

    R_φ(w, T) := E_{(x,y)∼D} [ φ(y w^T T(x)) ],    R̂_φ(w, T) := (1/n_1) Σ_{ℓ=n_0+1}^{n} φ(y_ℓ w^T T(x_ℓ)).

The hinge loss is a convex function and can therefore be minimized efficiently. Since the hinge loss bounds the 0-1 loss from above, minimizing the hinge loss is a way to control the misclassification error. A common formulation of the SVM classifier minimizes the empirical hinge loss with an additional Tikhonov regularization term.
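A quick numeric check (ours, not from the paper) of the pointwise domination just used: writing z = y·w^T T(x) for the margin, the hinge loss φ(z) = max{0, 1 - z} upper bounds the 0-1 loss for every margin value.

```python
# phi(z) = max(0, 1 - z) dominates the 0-1 loss 1[z <= 0] for every margin z,
# which is why minimizing the hinge risk controls the misclassification rate.

def hinge(z):
    return max(0.0, 1.0 - z)

def zero_one(z):
    # a zero margin is counted as an error, matching 1(y != sign(w^T T(x)))
    return 1.0 if z <= 0.0 else 0.0

margins = [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]
assert all(hinge(z) >= zero_one(z) for z in margins)
```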
For the transformed data set {(T̂(x_ℓ), y_ℓ)}_{ℓ=n_0+1}^{n} this is

    ŵ_Tikhonov := argmin_w  R̂_φ(w, T̂) + λ ‖w‖²,    (6)

where λ > 0 controls the regularization strength (see, for example, Section 15.2 of Shalev-Shwartz and Ben-David (2014)). For SLB, we consider instead the SVM solution with Ivanov regularization,

    ŵ := argmin_{w : ‖w‖_2 ≤ B}  R̂_φ(w, T̂),    (7)

where the radius B controls the complexity of the hypothesis class. Both the Tikhonov form in Eq. (6) and the Ivanov form in Eq. (7) share the same regularization path (Oneto et al., 2015). Namely, for every choice of Tikhonov regularization parameter λ there is an Ivanov regularizer B_λ which yields the same solution. Finally, we denote by w_φ the SVM population minimizer using the oracle feature map T_o,

    w_φ := argmin_{w : ‖w‖_2 ≤ B}  R_φ(w, T_o).    (8)

In the rest of this section, we prove that as the sample size tends to infinity, the risk of the SLB classifier R_φ(ŵ, T̂) converges to the oracle risk R_φ(w_φ, T_o), and we give high-probability bounds on their difference.

4.1 Convergence of the Data-dependent Feature Map

We begin by studying the convergence rate of the estimated feature map T̂ to the oracle map T_o. For simplicity, we use the bivariate notation p̂_y(x_i, x_i) to also denote the univariate density estimates p̂_y(x_i). For our theoretical analysis we need three assumptions on the class-conditional densities and their estimation procedures:

Assumption 1 (bounded density) There are constants p_min, p_max > 0 such that for all x ∈ Ω and all y, i, j we have that p_min ≤ p_y(x_i, x_j) ≤ p_max.

We make this assumption for the sake of simplicity. See Remark 2 at the end of the next section for a discussion of how it may be relaxed.

Assumption 2 (density estimator consistency) There is a function U(n_0) that satisfies U(n_0) → 0 as n_0 → ∞, such that the following uniform bound holds with high probability,

    sup_{x ∈ Ω, y, i, j} | p̂_y(x_i, x_j) - p_y(x_i, x_j) | < U(n_0).
This assumption holds under various conditions on the underlying density and method of density estimation. For example, for estimating β-Hölder bivariate densities, the minimax bound U(n) = n^{-β/(2β+2)} holds (up to log factors) with high probability, both for kernel density estimation (Liu et al., 2011; Jiang, 2017) and for k-nearest-neighbor density estimation (Dasgupta and Kpotufe, 2014).

Assumption 3 The following expectation is finite,

    E(n_0, d) := E_x ‖T̂(x) - T_o(x)‖_2 = E_x [ ( Σ_{y,i,j} ( log p̂_y(x_i, x_j) - log p_y(x_i, x_j) )² )^{1/2} ],

and E(n_0, d) → 0 as n_0 → ∞.

Remark 1 A potential problem with non-parametric kernel density estimators is that p̂_y(x_i, x_j) may be zero (or even negative) for some values of (x_i, x_j). In that case, log p̂_y(x_i, x_j) is undefined. However, the uniform consistency in Assumption 2, combined with the lower bound p_y(x_i, x_j) ≥ p_min in Assumption 1, means that the probability of this event vanishes as n → ∞. To avoid the issue altogether, one may construct an adjusted density estimator of the form p̃_y(x_i, x_j) := max{p̂_y(x_i, x_j), c} for some c ∈ (0, p_min). It is easy to check that Assumptions 1, 2 and 3 still hold for the adjusted estimator p̃.

The following lemma, proved in the appendix, quantifies the convergence rate of the feature map T̂ to the oracle map.

Lemma 2 Under Assumption 1,

    ‖T_o(x)‖_2 < √(d(d+1)) · L,   where L := max{ |log p_min|, |log p_max| }.

Under Assumptions 1 and 2, the following bound holds w.h.p.

    sup_{x ∈ Ω} ‖T̂(x) - T_o(x)‖_2 < (2 / p_min) √(d(d+1)) · U(n_0).

Corollary 3 Under Assumptions 1 and 2, w.h.p.

    sup_{x ∈ Ω} ‖T̂(x)‖_2 < √(d(d+1)) ( L + (2 / p_min) U(n_0) ).

4.2 Convergence of the SVM Risk to the Oracle Risk

In this section we present our main theoretical result. The proofs appear in the appendix. We begin with a technical lemma that bounds the difference in risk between using the exact log-densities T_o and using the estimated log-densities T̂.

Lemma 4 Under Assumption 3, for any w ∈ R^{d(d+1)},

    | R_φ(w, T̂) - R_φ(w, T_o) | ≤ ‖w‖_2 · E(n_0, d).
This holds for the empirical risk as well,

    | R̂_φ(w, T̂) - R̂_φ(w, T_o) | ≤ ‖w‖_2 · E(n_0, d).

We now show that for the hinge loss, the population risk of the estimated ŵ of Eq. (7) converges in probability to the population risk of the oracle SVM solution w_φ, defined in Eq. (8). In other words, R_φ(ŵ, T̂) converges to R_φ(w_φ, T_o).

Theorem 5 Let T̂ be the feature map constructed using n_0 samples and let ŵ be the SVM classifier of SLB, trained using n_1 samples. Then, under Assumptions 1, 2 and 3, the following bound holds w.h.p.

    R_φ(ŵ, T̂) - R_φ(w_φ, T_o) < B [ E(n_0, d) + √( d(d+1) ln(n_1) / n_1 ) ( 2L + (2 / p_min) U(n_0) ) ].

The terms E(n_0, d) and U(n_0) are due to errors in the univariate and bivariate density estimates, whereas the term √(ln(n_1)/n_1) is an SVM generalization term.

Remark 2 The lower bound on the density, p_y(x_i, x_j) ≥ p_min > 0, in Assumption 1 may be relaxed by dividing the (possibly infinite) support of the distribution into the following domain and its complement,

    Ω(p_min) := { x : ∀ y, i, j, p_y(x_i, x_j) ≥ p_min }.    (9)

Our theoretical analysis holds on the domain Ω(p_min). The remaining part contributes at most a factor of B · Pr[x ∉ Ω(p_min)] to the excess SVM risk in Theorem 5. Taking p_min → 0 at an appropriate rate as the number of samples tends to infinity yields risk-consistency. Further distributional assumptions, such as sub-Gaussianity, may be used to tighten the bounds on the excess risk.

5. Experimental Results

We compare the predictive accuracy of SLB to several widely used classifiers, both on synthetic data and on various public data sets. Our method may be tuned to improve performance on a specific domain by considering such things as different kernels and bandwidths in the density estimation step, as well as different misclassification costs and penalty terms for the norm of the vector of coefficients in the construction of the SVM classifier. However, we refrain from doing so and instead demonstrate SLB's competitiveness across a broad range of data sets.
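To make Algorithm 1 concrete before turning to the comparisons, the following sketch of ours chains its three steps together. Two stand-ins differ from the actual R implementation described below and are assumptions of this sketch, not the paper's method: parametric Gaussian fits replace the kernel density estimates, and a plain hinge-loss subgradient descent replaces a packaged linear SVM solver. The filtering step uses the biased empirical HSIC of Section 2.2 with Gaussian kernels.

```python
import math
import random

def gaussian_kernel_matrix(v, bw=1.0):
    return [[math.exp(-((a - b) ** 2) / (2 * bw * bw)) for b in v] for a in v]

def center(K):
    # returns H K H for H = I - (1/n) 1 1^T (double centering)
    n = len(K)
    rm = [sum(r) / n for r in K]
    cm = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tm = sum(rm) / n
    return [[K[i][j] - rm[i] - cm[j] + tm for j in range(n)] for i in range(n)]

def hsic(z, w):
    # biased empirical HSIC: (n-1)^{-2} Tr(K H L H) = (n-1)^{-2} Tr((H K H) L)
    n = len(z)
    Kc = center(gaussian_kernel_matrix(z))
    L = gaussian_kernel_matrix(w)
    return sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n)) / (n - 1) ** 2

def fit_g1(v):  # univariate Gaussian fit (stand-in for a univariate KDE)
    mu = sum(v) / len(v)
    return mu, max(sum((x - mu) ** 2 for x in v) / len(v), 1e-9)

def log_g1(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def fit_g2(u, v):  # bivariate Gaussian fit (stand-in for a bivariate KDE)
    (mu, vu), (mv, vv) = fit_g1(u), fit_g1(v)
    r = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) * math.sqrt(vu * vv))
    return mu, vu, mv, vv, max(min(r, 0.99), -0.99)

def log_g2(x, y, p):
    mu, vu, mv, vv, r = p
    zx, zy = (x - mu) / math.sqrt(vu), (y - mv) / math.sqrt(vv)
    c = 1.0 - r * r
    return (-math.log(2 * math.pi * math.sqrt(vu * vv * c))
            - (zx * zx - 2 * r * zx * zy + zy * zy) / (2 * c))

def slb_fit(X, y, hsic_threshold=0.02, epochs=200, lr=0.01):
    d = len(X[0])
    cols = {lab: [[row[i] for row, l in zip(X, y) if l == lab] for i in range(d)]
            for lab in (-1, 1)}
    # Step 1: retain pairs whose empirical HSIC is large in at least one class
    pairs = [(i, j) for i in range(d) for j in range(i + 1, d)
             if any(hsic(cols[lab][i], cols[lab][j]) > hsic_threshold for lab in (-1, 1))]
    # Step 2: univariate fits for all variables, bivariate fits for retained pairs
    uni = {lab: [fit_g1(c) for c in cols[lab]] for lab in (-1, 1)}
    biv = {lab: {(i, j): fit_g2(cols[lab][i], cols[lab][j]) for i, j in pairs}
           for lab in (-1, 1)}
    def features(x):
        f = [1.0]  # constant feature absorbs the intercept
        for lab in (-1, 1):
            f += [log_g1(x[i], *uni[lab][i]) for i in range(d)]
            f += [log_g2(x[i], x[j], biv[lab][(i, j)]) for i, j in pairs]
        return f
    # Step 3: linear classifier via hinge-loss subgradient descent
    w = [0.0] * len(features(X[0]))
    for _ in range(epochs):
        for x, lab in zip(X, y):
            f = features(x)
            if lab * sum(wi * fi for wi, fi in zip(w, f)) < 1.0:
                w = [wi + lr * lab * fi for wi, fi in zip(w, f)]
    return lambda x: 1 if sum(wi * fi for wi, fi in zip(w, features(x))) >= 0 else -1
```

On data with one strongly dependent pair of coordinates, the HSIC filter retains that pair, and the trained classifier is a sparse linear rule over the retained log-density features, mirroring the structure (though not the exact estimators) of the SLB classifier.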
The seven classification methods we tested appear below. All compared methods were implemented in the R programming language. When required, univariate or bivariate densities were estimated by the ks package.

SLB: Our sparse log-bivariate density classifier. We estimated all pairwise HSIC values using the dHSIC package with a Gaussian kernel. The SVM classifier was constructed by the e1071 package with a default parameter λ = 1/2.

LU: Log-univariate density classifier. This classifier trains an SVM model using only the log univariate densities of the d variables of each class as its input features. This is similar to the approach of Fan et al. (2016), which trained a logistic regression model, rather than an SVM, on the same features.

5NN: 5-nearest-neighbor classifier, as implemented in the FNN package.

SVM: Support vector machine classifier with the radial basis function kernel, as implemented in the e1071 package. Note that by default λ = 1.

NB: Naive Bayes classifier. Here we used our own implementation.

TAN: Tree-augmented naive Bayes. We used the mpmi package to estimate the mutual information between each pair of variables, and the optrees package to estimate the maximum spanning tree of each class.

RF: Random forest classifier, as implemented in the randomForest package, with 50 trees and the default number of √d randomly selected variables at each split.

5.1 Synthetic Data Experiments

5.1.1 Forest Structured Bayesian Networks

We begin with the simple setting where the underlying distribution of each class is a forest-structured Bayesian network with d = 20 variables. We consider the following configurations, which differ in the number of training samples, class imbalance proportion and complexity of the Conditional Probability Distributions (CPDs) that parametrize the Bayesian network.
Concretely, our simulations follow a 5 × 2² factorial design:

Sample size of the training data: n = 200, 400, 600, 800, 1000.

Class imbalance: either balanced (50%–50%) or imbalanced (75%–25%).

Complexity of the conditional probability distributions: let Z be the single parent of X in the forest. We consider the following two cases for P(X | Z = z): (i) a simple setting where X | Z = z ~ N(z, 1); (ii) a more complex setting in which

$$ X \mid Z = z \;\sim\; \begin{cases} t + z & \text{w.p. } 1/2, \\[2pt] \tfrac{1}{2}N(z+1,\,1) + \tfrac{1}{2}N(z-1,\,1) & \text{w.p. } 1/2, \end{cases} $$

where t follows a Student's t-distribution with 5 degrees of freedom. In both cases, root variables follow a standard Gaussian distribution.

| Marginals | Class labels | n | SLB | LU | TAN | NB | RF | SVM | 5-NN |
|---|---|---|---|---|---|---|---|---|---|
| Gaussian | Balanced | 200 | 15.4 ± 2.98 | 21.2 ± 3.64 | 8.17 ± 3.78 | 23 ± 3.75 | 15 ± 2.54 | 11.9 ± 2.36 | 13.8 ± 2.65 |
| | | 400 | 9.37 ± 2.34 | 19.4 ± 3.5 | 6.68 ± 3.88 | 21.8 ± 3.66 | 11.6 ± 2.08 | 8.07 ± 2.02 | 11.4 ± 2.56 |
| | | 600 | 6.91 ± 1.76 | 17.6 ± 3.45 | 4.98 ± 2.64 | 20.3 ± 3.36 | 9.73 ± 1.55 | 6.34 ± 1.42 | 9.91 ± 2.1 |
| | | 800 | 5.82 ± 1.82 | 18.3 ± 3.19 | 5.56 ± 4.02 | 21 ± 3.14 | 8.87 ± 1.67 | 5.46 ± 1.54 | 8.89 ± 2.22 |
| | | 1000 | 5.11 ± 1.36 | 17.3 ± 3.46 | 4.66 ± 2.87 | 20.7 ± 3.44 | 8.23 ± 1.63 | 4.84 ± 1.16 | 8.37 ± 1.93 |
| Gaussian | Imbalanced | 200 | 20.4 ± 4.7 | 24.6 ± 4.57 | 11.4 ± 5.51 | 26.1 ± 4.41 | 30.3 ± 4.6 | 25.8 ± 4.83 | 24.3 ± 3.89 |
| | | 400 | 13.2 ± 3.71 | 22.7 ± 4.88 | 7.37 ± 4.32 | 25.1 ± 4.29 | 25.5 ± 4.08 | 18.7 ± 3.43 | 19.2 ± 3.72 |
| | | 600 | 9.48 ± 2.98 | 21.5 ± 4.04 | 6.52 ± 3.96 | 23.7 ± 4.25 | 22 ± 3.65 | 14.7 ± 2.79 | 17.2 ± 3.71 |
| | | 800 | 7.43 ± 2 | 21.3 ± 3.99 | 6.13 ± 4.05 | 23.4 ± 3.74 | 20.3 ± 2.71 | 12.7 ± 2.43 | 15.9 ± 2.95 |
| | | 1000 | 7.04 ± 2.12 | 20.5 ± 3.6 | 6.56 ± 4.54 | 22.6 ± 3.86 | 18.9 ± 2.61 | 11.6 ± 2.23 | 15 ± 3.18 |
| Complex | Balanced | 200 | 33.3 ± 3.23 | 33.3 ± 3.23 | 35.5 ± 3.16 | 33.1 ± 3.38 | 30.8 ± 2.89 | 37.2 ± 2.92 | 38.2 ± 2.38 |
| | | 400 | 31.1 ± 3.71 | 30.7 ± 3.97 | 32.8 ± 3.85 | 30.8 ± 3.66 | 26.8 ± 2.87 | 33.9 ± 3.18 | 36.1 ± 2.95 |
| | | 600 | 29.4 ± 3.34 | 29.5 ± 3.46 | 30.8 ± 4.93 | 29.7 ± 3.37 | 25.5 ± 2.49 | 31.7 ± 3.19 | 34.8 ± 2.91 |
| | | 800 | 27.7 ± 3.13 | 28.1 ± 3.32 | 28.4 ± 4.83 | 27.9 ± 3.34 | 24.2 ± 2.28 | 29.3 ± 2.62 | 33.5 ± 2.64 |
| | | 1000 | 27.6 ± 3.33 | 28.1 ± 3.65 | 27.8 ± 4.79 | 28.1 ± 3.68 | 23.8 ± 2.41 | 28.7 ± 2.61 | 33.1 ± 2.65 |
| Complex | Imbalanced | 200 | 37.1 ± 3.72 | 37.1 ± 3.86 | 38 ± 3.41 | 37.9 ± 3.35 | 43.2 ± 3.13 | 48.3 ± 1.57 | 43.8 ± 1.75 |
| | | 400 | 34.9 ± 3.71 | 35.3 ± 3.87 | 33.3 ± 4.05 | 35.6 ± 3.6 | 39.7 ± 3.12 | 45.9 ± 2.69 | 41.3 ± 2.48 |
| | | 600 | 32.7 ± 3.66 | 33 ± 3.72 | 31.2 ± 4.65 | 33.6 ± 3.58 | 37.6 ± 3.07 | 43.7 ± 2.93 | 39.9 ± 2.26 |
| | | 800 | 32.2 ± 3.59 | 33.2 ± 3.75 | 30.5 ± 4.71 | 33.5 ± 3.65 | 36.4 ± 3.13 | 42.5 ± 2.99 | 39.3 ± 2.06 |
| | | 1000 | 31.6 ± 3.68 | 32.9 ± 3.93 | 29 ± 5.04 | 33.1 ± 3.7 | 35.7 ± 3.32 | 41.4 ± 3.03 | 38.7 ± 2.68 |

Table 1: Misclassification error rates (mean ± standard deviation, in %) for the various classifiers, in the forest setting, averaged across 100 replicate data sets generated at each factor level combination. The classifier with the best performance, as well as classifiers that achieve comparable error rates, are printed in boldface (see main text for details).

For each combination of factor levels, 100 Bayesian networks were generated. A training data set drawn from each model was used to construct the different classifiers. Finally, we evaluated the predictive accuracy on an independent test set of 1000 i.i.d. samples generated from the same model; in each test set, half of the observations were generated from each class. Box plots of the misclassification error rates across the 100 replicates are shown in Appendix B, Figures 1–4, and the full results appear in Table 1. In addition, in each setting, we compare the classifier with the best performance to each of the other classifiers, using a two-sided Wilcoxon signed-rank test. Since there are 6 comparisons per configuration, we use a Bonferroni correction with α = 0.05/6. The classifier with the best performance, as well as classifiers that were found to be statistically compatible (i.e., for which the null hypothesis is not rejected), are printed in the table in boldface. As shown, under the Gaussian setting, TAN either achieves the minimum error rate or is comparable to the best performing classifier. This is somewhat to be expected, since the TAN classifier is based on the Chow-Liu algorithm (Chow and Liu, 1968), which is consistent under the Gaussian-tree setting. In our Gaussian-forest setting, we expect the tree constructed by TAN to include all true forest edges, plus a few additional spurious edges that make the forest a connected graph.
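For reference, the Chow-Liu construction at the heart of TAN builds a maximum-weight spanning tree over the pairwise mutual information. The following is a minimal illustrative sketch, not the paper's R pipeline (which uses the mpmi and optrees packages): for brevity we plug in the Gaussian mutual-information formula −½ log(1 − ρ²) instead of a nonparametric estimate, and the function name is ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X):
    """Edges of the Chow-Liu tree of a sample X of shape (n, d):
    a maximum-weight spanning tree over pairwise mutual information."""
    rho = np.corrcoef(X, rowvar=False)
    # Gaussian plug-in mutual information for each pair of variables.
    mi = -0.5 * np.log(np.clip(1.0 - rho ** 2, 1e-12, None))
    np.fill_diagonal(mi, 0.0)
    # scipy computes a *minimum* spanning tree, so negate the weights.
    mst = minimum_spanning_tree(-mi).tocoo()
    return sorted((int(min(i, j)), int(max(i, j)))
                  for i, j in zip(mst.row, mst.col))
```

On a strongly correlated Markov chain X0 → X1 → X2 → X3, this recovers the chain edges (0, 1), (1, 2), (2, 3).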
Nevertheless, in all 20 configurations of the 5 × 2² design, the gap between SLB and TAN decreases as the sample size increases. It is also apparent that both SLB and TAN outperform all other classifiers under the Gaussian-imbalanced configuration.

| Common structure | Marginals | Class labels | n | SLB- | SLB | LU | TAN | NB | RF | SVM | 5-NN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No | Gaussian | Balanced | 200 | 12.3 ± 3.23 | 11.7 ± 2.92 | 15.2 ± 4.33 | 16.8 ± 6.81 | 28.7 ± 6.45 | 15 ± 3.68 | 17.8 ± 2.66 | 17.1 ± 3.09 |
| | | | 400 | 7.53 ± 2.1 | 7.3 ± 2.04 | 13.2 ± 3.83 | 12 ± 6.83 | 26.4 ± 6.13 | 10.3 ± 2.82 | 12.5 ± 2.74 | 13 ± 2.76 |
| | | | 600 | 5.4 ± 1.53 | 5.24 ± 1.46 | 12.9 ± 3.54 | 12.9 ± 7.59 | 26.7 ± 5.72 | 8.92 ± 2.17 | 10.4 ± 2.46 | 10.9 ± 2.3 |
| | | | 800 | 4.24 ± 1.31 | 4.21 ± 1.32 | 11.7 ± 3.3 | 9.96 ± 5.22 | 24.4 ± 6.18 | 7.47 ± 2.24 | 8.83 ± 2.49 | 9.26 ± 2.09 |
| | | | 1000 | 3.73 ± 1.22 | 3.67 ± 1.24 | 11.7 ± 3.79 | 10.8 ± 7.81 | 24.8 ± 6.82 | 6.5 ± 1.91 | 7.75 ± 2.33 | 8.89 ± 2.47 |
| No | Gaussian | Imbalanced | 200 | 15.1 ± 4.52 | 15.1 ± 3.73 | 17.1 ± 4.27 | 16.2 ± 8.36 | 29.6 ± 6.7 | 25.7 ± 6.07 | 30.1 ± 5.86 | 26.1 ± 5.24 |
| | | | 400 | 10.4 ± 2.61 | 8.87 ± 2.51 | 14.9 ± 3.95 | 14.4 ± 8.41 | 27.8 ± 6.93 | 20.3 ± 5.39 | 25.3 ± 4.21 | 20.5 ± 4.43 |
| | | | 600 | 7.69 ± 2.12 | 6.31 ± 1.72 | 13.3 ± 3.94 | 12.7 ± 7.66 | 25.8 ± 5.67 | 16.9 ± 4.44 | 20.9 ± 3.81 | 17.6 ± 4.01 |
| | | | 800 | 6.21 ± 1.78 | 4.99 ± 1.77 | 13.3 ± 4.02 | 10.5 ± 6.41 | 26.1 ± 5.41 | 16 ± 4.21 | 19 ± 2.93 | 16.2 ± 3.31 |
| | | | 1000 | 5.29 ± 1.54 | 4.11 ± 1.18 | 13 ± 4.09 | 10.7 ± 6.8 | 25 ± 5.01 | 13.9 ± 3.87 | 18.2 ± 3.45 | 14.4 ± 3.44 |
| No | Complex | Balanced | 200 | 14.8 ± 3.71 | 14.3 ± 3.59 | 17.9 ± 4.57 | 18.7 ± 8.01 | 31.2 ± 6.77 | 17.6 ± 4.31 | 20.3 ± 3.71 | 21.4 ± 4.26 |
| | | | 400 | 9.56 ± 2.59 | 9.42 ± 2.53 | 15.5 ± 4.93 | 16.5 ± 7.97 | 29 ± 7.08 | 12.6 ± 3.42 | 14.8 ± 3.03 | 15.9 ± 3.52 |
| | | | 600 | 7.02 ± 1.79 | 6.92 ± 1.74 | 14.6 ± 4.24 | 13.8 ± 7.56 | 27.7 ± 6.55 | 10.5 ± 2.62 | 11.7 ± 3.05 | 13.7 ± 3.13 |
| | | | 800 | 5.91 ± 1.73 | 5.82 ± 1.69 | 13.7 ± 4.21 | 14.1 ± 7.37 | 26.7 ± 6.21 | 9.28 ± 2.8 | 10.7 ± 2.45 | 12.5 ± 2.98 |
| | | | 1000 | 5.59 ± 1.85 | 5.57 ± 1.9 | 14.2 ± 4.03 | 14.4 ± 7.91 | 27.6 ± 6.39 | 8.78 ± 2.77 | 9.61 ± 2.56 | 12.1 ± 3.15 |
| No | Complex | Imbalanced | 200 | 19 ± 5.44 | 17.7 ± 5.18 | 21.4 ± 5.66 | 21.2 ± 9.18 | 32.9 ± 6.1 | 29.5 ± 6.45 | 35.2 ± 6.87 | 31 ± 4.89 |
| | | | 400 | 12 ± 3.52 | 11.6 ± 3.33 | 18.8 ± 5.04 | 19.6 ± 9.23 | 31.2 ± 6.95 | 24 ± 5.6 | 28.3 ± 5.26 | 25.6 ± 4.79 |
| | | | 600 | 9.16 ± 2.87 | 8.97 ± 2.82 | 18.1 ± 5.73 | 17.1 ± 8.19 | 29.7 ± 6.59 | 21.1 ± 5.89 | 24.7 ± 5.29 | 23 ± 4.79 |
| | | | 800 | 7.96 ± 2.24 | 7.8 ± 2.16 | 18 ± 5.4 | 17.1 ± 8.3 | 30.2 ± 5.85 | 20 ± 5.04 | 22.9 ± 4.03 | 21.3 ± 3.95 |
| | | | 1000 | 6.77 ± 2.2 | 6.67 ± 2.21 | 17.7 ± 5.45 | 16 ± 8.13 | 29.8 ± 6.65 | 18.4 ± 5.49 | 21 ± 4.28 | 20 ± 4.73 |
| Yes | Gaussian | Balanced | 200 | 14 ± 3.29 | 13.5 ± 3.13 | 17.9 ± 4.71 | 16.9 ± 7.79 | 30.1 ± 6.54 | 16.6 ± 3.78 | 18.7 ± 3.33 | 18.9 ± 3.47 |
| | | | 400 | 8.54 ± 1.84 | 8.32 ± 1.86 | 15.3 ± 3.71 | 13.6 ± 6.91 | 27.7 ± 5.61 | 11.8 ± 2.69 | 13.3 ± 2.67 | 14.6 ± 2.45 |
| | | | 600 | 6.37 ± 1.65 | 6.24 ± 1.69 | 15.1 ± 3.93 | 13.2 ± 7.49 | 27.1 ± 5.46 | 10 ± 2.58 | 10.4 ± 2.57 | 12.2 ± 2.73 |
| | | | 800 | 5.17 ± 1.44 | 5.13 ± 1.42 | 14.5 ± 3.63 | 13.4 ± 7.52 | 27 ± 5.52 | 8.86 ± 2.12 | 9.01 ± 2.37 | 11.1 ± 2.24 |
| | | | 1000 | 4.55 ± 1.15 | 4.48 ± 1.13 | 13.5 ± 3.96 | 11.2 ± 7.45 | 25.8 ± 6.1 | 7.76 ± 1.95 | 8.08 ± 2.31 | 10.3 ± 2.13 |
| Yes | Gaussian | Imbalanced | 200 | 18.3 ± 5 | 17.1 ± 4.98 | 21 ± 4.73 | 18.4 ± 7.8 | 31.4 ± 5.6 | 29.4 ± 5.07 | 32.4 ± 5.79 | 29.3 ± 4.43 |
| | | | 400 | 10.9 ± 2.76 | 10.6 ± 2.83 | 18.7 ± 4.88 | 14.8 ± 7.45 | 29.5 ± 5.15 | 23.6 ± 5.19 | 25.8 ± 4.17 | 23.4 ± 3.91 |
| | | | 600 | 8.18 ± 2.29 | 8 ± 2.34 | 18.1 ± 5.08 | 13 ± 8.36 | 29 ± 5.8 | 20.6 ± 4.82 | 23.2 ± 4.31 | 20.6 ± 4.09 |
| | | | 800 | 7.04 ± 1.8 | 6.89 ± 1.82 | 18.3 ± 4.5 | 15 ± 8.81 | 28.8 ± 5.6 | 19.8 ± 4.2 | 20.9 ± 3.38 | 19.5 ± 3.5 |
| | | | 1000 | 5.77 ± 1.62 | 5.69 ± 1.61 | 17.4 ± 4.74 | 13.1 ± 7.98 | 27.7 ± 5.63 | 17.8 ± 4.18 | 18.9 ± 2.84 | 18.1 ± 3.81 |
| Yes | Complex | Balanced | 200 | 17.3 ± 4.02 | 16.5 ± 3.86 | 20.9 ± 5.39 | 19.7 ± 7.28 | 32.8 ± 6.65 | 19.4 ± 4.26 | 21.1 ± 3.89 | 22.5 ± 3.8 |
| | | | 400 | 11.5 ± 2.94 | 11.1 ± 2.92 | 19 ± 4.63 | 18.3 ± 9 | 30.9 ± 5.29 | 14.6 ± 3.48 | 15.6 ± 3.53 | 18.3 ± 3.73 |
| | | | 600 | 8.64 ± 2.6 | 8.58 ± 2.57 | 16.7 ± 4.61 | 15.8 ± 8.03 | 28.6 ± 6.21 | 11.8 ± 3.06 | 13 ± 3.11 | 16.1 ± 3.51 |
| | | | 800 | 7.43 ± 2.22 | 7.36 ± 2.17 | 17.1 ± 4.77 | 15.3 ± 7.53 | 29.6 ± 6.41 | 11 ± 2.96 | 11.2 ± 2.79 | 14.7 ± 3.04 |
| | | | 1000 | 6.38 ± 1.73 | 6.34 ± 1.7 | 15.9 ± 4.18 | 14.9 ± 8.07 | 28 ± 5.81 | 9.83 ± 2.42 | 10.1 ± 2.57 | 13.5 ± 3.02 |
| Yes | Complex | Imbalanced | 200 | 20.8 ± 5.64 | 19.7 ± 5.58 | 24.1 ± 6.42 | 23.3 ± 8.8 | 35.1 ± 5.91 | 32.6 ± 6.77 | 37.1 ± 7.2 | 32.9 ± 5 |
| | | | 400 | 14.2 ± 3.76 | 13.7 ± 3.59 | 21.6 ± 5.17 | 19.9 ± 9.47 | 32.5 ± 6 | 26.8 ± 5.86 | 29.7 ± 5.18 | 27.4 ± 4.23 |
| | | | 600 | 11.5 ± 2.95 | 11.2 ± 2.78 | 21.6 ± 5.24 | 19.6 ± 8.94 | 32.5 ± 5.23 | 24.4 ± 4.92 | 26.5 ± 4.72 | 25.5 ± 4.09 |
| | | | 800 | 9.22 ± 2.59 | 9.18 ± 2.63 | 20.5 ± 5.99 | 17.8 ± 9.27 | 30.8 ± 6.51 | 22 ± 5.65 | 23.6 ± 4.14 | 23.1 ± 4.47 |
| | | | 1000 | 8.26 ± 2.73 | 8.18 ± 2.69 | 20 ± 5.79 | 17.2 ± 8.01 | 30.8 ± 5.53 | 20.6 ± 5.7 | 21.7 ± 4.25 | 21.9 ± 4.49 |

Table 2: Misclassification error rates (mean ± standard deviation, in %) for the various classifiers, in the general Bayesian network setting, averaged across 100 replicate data sets generated at each factor level combination. The classifier with the best performance, as well as classifiers that achieve comparable error rates, are printed in boldface.
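The statistical comparison reported in Tables 1 and 2 — paired two-sided Wilcoxon signed-rank tests of every classifier against the best one, Bonferroni-corrected — can be sketched as follows. The input format (a dict mapping classifier names to per-replicate error-rate arrays) and the function name are our own illustrative choices.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_to_best(errors, alpha=0.05):
    """errors maps classifier name -> per-replicate error rates, paired
    across the same simulated data sets. Returns the best classifier and
    the set of classifiers statistically compatible with it."""
    best = min(errors, key=lambda name: np.mean(errors[name]))
    n_tests = len(errors) - 1              # Bonferroni correction
    compatible = {best}
    for name, err in errors.items():
        if name == best:
            continue
        _, p = wilcoxon(errors[best], err)  # paired, two-sided
        if p >= alpha / n_tests:            # fail to reject: compatible
            compatible.add(name)
    return best, compatible
```

Classifiers in the returned set are exactly those printed in boldface alongside the winner.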
| Marginals | Class labels | n | SLB (HSIC) | SLB (Pearson) |
|---|---|---|---|---|
| Gaussian | Balanced | 200 | 0.12 ± 0.03 | 0.12 ± 0.03 |
| | | 400 | 0.07 ± 0.02 | 0.08 ± 0.02 |
| | | 600 | 0.05 ± 0.01 | 0.06 ± 0.02 |
| | | 800 | 0.04 ± 0.01 | 0.05 ± 0.02 |
| | | 1000 | 0.04 ± 0.01 | 0.05 ± 0.01 |
| Gaussian | Imbalanced | 200 | 0.16 ± 0.05 | 0.16 ± 0.04 |
| | | 400 | 0.10 ± 0.03 | 0.11 ± 0.03 |
| | | 600 | 0.07 ± 0.02 | 0.08 ± 0.02 |
| | | 800 | 0.06 ± 0.02 | 0.07 ± 0.02 |
| | | 1000 | 0.06 ± 0.02 | 0.06 ± 0.02 |
| Complex | Balanced | 200 | 0.14 ± 0.03 | 0.15 ± 0.03 |
| | | 400 | 0.08 ± 0.02 | 0.10 ± 0.02 |
| | | 600 | 0.06 ± 0.02 | 0.08 ± 0.02 |
| | | 800 | 0.05 ± 0.01 | 0.07 ± 0.01 |
| | | 1000 | 0.05 ± 0.01 | 0.06 ± 0.01 |
| Complex | Imbalanced | 200 | 0.19 ± 0.05 | 0.19 ± 0.04 |
| | | 400 | 0.11 ± 0.03 | 0.12 ± 0.03 |
| | | 600 | 0.08 ± 0.02 | 0.10 ± 0.03 |
| | | 800 | 0.07 ± 0.02 | 0.08 ± 0.01 |
| | | 1000 | 0.06 ± 0.01 | 0.07 ± 0.01 |

Table 3: Misclassification error rates (mean ± standard deviation, as proportions) of the SLB(HSIC) and SLB(Pearson) classifiers, in the general Bayesian network setting, averaged across 100 replicate data sets generated at each factor level combination. Results that are significantly better are printed in boldface (see main text for details).

5.1.2 General Bayesian Networks

Next, we consider a more challenging setting where each class follows a Bayesian network in which every non-root node has three parents. For a random variable X at a non-root node, let Pa(X) = {Z1, Z2, Z3} be its three parents. We consider the following two settings for P(X | Pa(X)): (i) a simple setting where

$$ X \mid \mathrm{Pa}(X) \sim N\Big( \textstyle\sum_{Z_i \in \mathrm{Pa}(X)} z_i,\; 1 \Big); $$

(ii) a setting in which

$$ X \mid \mathrm{Pa}(X) \;\sim\; \begin{cases} t + \sum_{Z_i \in \mathrm{Pa}(X)} z_i & \text{w.p. } 1/2, \\[2pt] \tfrac{1}{2}N(z_1 + z_2,\,1) + \tfrac{1}{2}N(z_2 + z_3,\,1) & \text{w.p. } 1/2, \end{cases} $$

where t follows a Student's t-distribution with 5 degrees of freedom. As before, we set d = 20, and root variables follow a standard Gaussian distribution. To accommodate further variety, we extend our factorial design and allow for partial common structure between the Bayesian networks of the two classes. In the simpler setting, the two networks are constructed independently of each other. In the more challenging setting, roughly a third of the nodes share the same parent sets and associated conditional probability distributions.
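The "complex" conditional distribution above can be sampled directly. A minimal sketch (the function name and module-level random generator are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_complex_cpd(z1, z2, z3):
    """Draw X given parent values (z1, z2, z3): with probability 1/2 a
    Student-t(5) variable shifted by z1 + z2 + z3, and otherwise an equal
    mixture of N(z1 + z2, 1) and N(z2 + z3, 1)."""
    if rng.random() < 0.5:
        return rng.standard_t(5) + z1 + z2 + z3
    if rng.random() < 0.5:
        return rng.normal(z1 + z2, 1.0)
    return rng.normal(z2 + z3, 1.0)
```

Note that the resulting conditional mean, (z1 + z2 + z3)/2 + (z1 + 2·z2 + z3)/4, is no longer a simple sum of the parents, and the heavy t tail makes the CPD non-Gaussian.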
Overall, our simulations now follow a 5 × 2³ factorial design. We repeat the same protocol as in the forest setting, with 100 Bayesian networks generated for each configuration. Classifiers are evaluated on an independent test set of 1000 samples; as before, half of the observations in each test set were generated from each class. To highlight the importance of removing bivariate densities corresponding to nearly independent variables, we also consider a variant of SLB in which we skip the first step of Algorithm 1. The resulting classifier, denoted SLB-, constructs a linear SVM using all log bivariate densities. Box plots of the misclassification error rates across the 100 replicates are shown in Appendix B, Figures 5–12. The full results, including standard deviations, are displayed in Table 2. Again, we compare the classifier with the best performance to each of the other classifiers, using a two-sided Wilcoxon signed-rank test at the Bonferroni-corrected α = 0.05/7 level (since we now have a total of 8 classifiers). The classifier with the best performance, as well as classifiers that were found to be statistically compatible, are printed in boldface. As can be seen, in all configurations SLB achieves the lowest misclassification error rate, whereas TAN attains compatible results in only five (out of 40) configurations, all five having Gaussian distributions. SLB is significantly more accurate than SLB-, particularly at small sample sizes. As the sample size increases, the estimated bivariate densities of independent variables approach the product of the corresponding univariate densities, so their logarithm is approximately the sum of the corresponding log univariate densities. Since the latter are already incorporated into the SVM model, the effect of removing such pairwise densities becomes negligible, and SLB and SLB- achieve similar error rates. Lastly, in this work we used the Hilbert-Schmidt independence criterion to decide whether two variables are nearly independent.
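Putting the pieces together, the three steps of SLB — nonparametric density estimation, HSIC-based screening of variable pairs, and a linear SVM on the log-density features — can be sketched in Python. This is only a minimal illustration, not the authors' R implementation: the biased V-statistic HSIC estimator, the pooled top-k pair selection, and scikit-learn's LinearSVC are our simplifications (the paper screens pairs per class and selects the retained set by cross-validation).

```python
import numpy as np
from itertools import combinations
from scipy.stats import gaussian_kde
from sklearn.svm import LinearSVC

EPS = 1e-300  # guard against log(0) when a KDE underflows

def hsic(x, y, sigma=1.0):
    """Biased V-statistic HSIC estimate with Gaussian kernels, O(n^2) time."""
    n = len(x)
    K = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * sigma ** 2))
    L = np.exp(-np.subtract.outer(y, y) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - 1.0 / n          # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def _features(X, models, pairs):
    """Log univariate and log bivariate density features, per class."""
    cols = []
    for uni, biv in models.values():
        cols += [np.log(kde(X[:, j]) + EPS) for j, kde in enumerate(uni)]
        cols += [np.log(biv[jk](X[:, list(jk)].T) + EPS) for jk in pairs]
    return np.column_stack(cols)

def fit_slb(X, y, n_pairs=10):
    """X: (n, d) array; y: labels in {-1, +1}."""
    d = X.shape[1]
    # Step 1: retain the n_pairs variable pairs with the largest HSIC.
    ranked = sorted(combinations(range(d), 2),
                    key=lambda jk: hsic(X[:, jk[0]], X[:, jk[1]]),
                    reverse=True)
    pairs = ranked[:n_pairs]
    # Step 2: per-class univariate KDEs and bivariate KDEs of retained pairs.
    models = {}
    for c in (-1, +1):
        Xc = X[y == c]
        uni = [gaussian_kde(Xc[:, j]) for j in range(d)]
        biv = {jk: gaussian_kde(Xc[:, list(jk)].T) for jk in pairs}
        models[c] = (uni, biv)
    # Step 3: linear SVM on the log-density features.
    svm = LinearSVC(C=1.0, max_iter=10000).fit(_features(X, models, pairs), y)
    return models, pairs, svm

def predict_slb(X, models, pairs, svm):
    return svm.predict(_features(X, models, pairs))
```

In the true SLB, pairs whose bivariate density factorizes contribute (approximately) redundant features, which is precisely what the screening step removes.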
Given a d-dimensional sample of size n, computing the HSIC for all pairs of variables takes O(d²n²) time (Gretton et al., 2004), which can be computationally expensive for large data sets. One may therefore prefer measures of dependence that are faster to compute, even if they do not enjoy the nice theoretical properties of HSIC; for example, Pearson's correlation coefficient or Spearman's ρ. To quantify the potential impact of this change, we consider a variant of SLB that uses Pearson's correlation coefficient to measure pairwise variable dependence. We focus on the setting of independent Bayesian networks and compare the performance of the two variants, denoted SLB(HSIC) and SLB(Pearson). Full results, including standard deviations, are displayed in Table 3. We use a two-sided Wilcoxon signed-rank test at significance level α = 0.05 to compare the performance; in cases where a statistically significant difference was found, the superior result is shown in boldface. As can be seen, under most configurations the two variants show compatible results. Nevertheless, in two configurations (out of 20), SLB(HSIC) performs better.

5.2 Real Data Experiments

We next evaluate the various classifiers on 16 real data sets, publicly available at the UCI Machine Learning Repository (Dheeru and Karra Taniskidou, 2017) and Kaggle.¹ The data sets differ in sample size and dimensionality. In all data sets, instances with missing values were removed. In data sets whose features are not commensurate, such as the Ozone data set in which one feature is the temperature and another is the wind speed, the features were standardized prior to applying 5NN and SVM (RBF), both of which are known to be sensitive to the scale of the data.

1. www.kaggle.com

| Data set | d | n | (n1, n2) | SLB | LU | TAN | NB | RF | SVM (RBF) | 5NN |
|---|---|---|---|---|---|---|---|---|---|---|
| Blood | 4 | 748 | (570, 178) | 40 ± 6.3 | 49.7 ± 1.7 | 42.6 ± 2.9 | 36.8 ± 6.7 | 40.1 ± 3.7 | 44.1 ± 3.2 | 37.4 ± 3.9 |
| Climate model sim. crashes | 18 | 540 | (46, 494) | 25.2 ± 9.8 | 50 ± 1.3 | 43.8 ± 2.7 | 47.8 ± 3 | 46.7 ± 4.9 | 42.4 ± 4.8 | 36.2 ± 7.9 |
| Hill-valley without noise | 100 | 606 | (305, 301) | 38.9 ± 7.1 | 51.3 ± 6.1 | 46.7 ± 3 | 43.9 ± 3.1 | 46.6 ± 3.1 | 47.2 ± 2.2 | 47.2 ± 6.6 |
| Hill-valley with noise | 100 | 606 | (307, 299) | 37.7 ± 2.3 | 51.8 ± 6.4 | 47.3 ± 6.1 | 44.3 ± 4.6 | 47.9 ± 3.6 | 47 ± 5.8 | 48.4 ± 2.4 |
| Ionosphere | 32 | 351 | (126, 225) | 7.5 ± 5 | 45.4 ± 5.4 | 7.6 ± 4.1 | 10.2 ± 4.3 | 7.9 ± 4.4 | 8 ± 4.3 | 22.1 ± 7.4 |
| Liver | 6 | 345 | (145, 200) | 30.8 ± 2.7 | 48.6 ± 2.8 | 35.5 ± 5.2 | 38.1 ± 4.08 | 29.9 ± 7.3 | 33.2 ± 4.7 | 40.6 ± 3.7 |
| Ozone | 71 | 1847 | (1719, 128) | 32.3 ± 2.8 | 48.8 ± 1.1 | 42.5 ± 3.9 | 24.9 ± 3.3 | 40.1 ± 5.1 | 49.6 ± 0.8 | 41.4 ± 3.7 |
| Parkinson | 22 | 195 | (48, 147) | 18.2 ± 7.1 | 54.7 ± 3.9 | 21.3 ± 5.5 | 20.6 ± 7.5 | 13.1 ± 1.7 | 27.6 ± 6.7 | 13.6 ± 6.3 |
| Pima | 8 | 768 | (500, 268) | 28.6 ± 3.8 | 49.2 ± 1.5 | 34.5 ± 6.6 | 27.4 ± 4.9 | 26.8 ± 2.4 | 29 ± 1.4 | 31.2 ± 2.6 |
| Pulsar stars | 8 | 17898 | (16259, 1639) | 7.1 ± 1.3 | 49.6 ± 2.6 | 8.6 ± 0.8 | 8.31 ± 0.9 | 8.2 ± 1.2 | 9.3 ± 1.2 | 8.8 ± 1.2 |
| Ringnorm | 20 | 7400 | (3664, 3736) | 1.4 ± 0.3 | 31.8 ± 1.6 | 2.7 ± 0.9 | 1.4 ± 0.3 | 3.8 ± 0.5 | 1.4 ± 0.3 | 31.7 ± 0.8 |
| Sonar | 60 | 208 | (111, 97) | 18.1 ± 6.5 | 52.9 ± 5.1 | 20.5 ± 4.9 | 26.2 ± 9.2 | 19.6 ± 9.8 | 20.1 ± 5.9 | 22.9 ± 2.7 |
| Spectf | 44 | 80 | (40, 40) | 21.2 ± 7.1 | 45 ± 21.8 | 26.2 ± 20.4 | 22.5 ± 7.1 | 21.2 ± 13.7 | 20 ± 11.2 | 27.5 ± 14.4 |
| WI prog | 32 | 198 | (151, 47) | 35.7 ± 13.5 | 52.7 ± 11.6 | 42.9 ± 6.9 | 39.8 ± 9.6 | 42 ± 4.9 | 44 ± 4.1 | 40.3 ± 6.3 |
| WI diag | 30 | 569 | (357, 212) | 4.52 ± 2.2 | 47.9 ± 4.7 | 11.5 ± 2.1 | 6.3 ± 3 | 4.8 ± 2.3 | 2.9 ± 2.4 | 4 ± 2.6 |
| Vertebral | 6 | 310 | (210, 100) | 23.2 ± 5.3 | 50 ± 1.2 | 18.8 ± 4.9 | 22.5 ± 3.8 | 18.8 ± 7.3 | 19.2 ± 7.9 | 22.1 ± 7 |

Table 4: Summary of balanced error rates (mean ± standard deviation, in %) on data sets from the UCI ML and Kaggle repositories. Here n is the total number of labeled samples, and n1 and n2 are the numbers of samples in the positive and negative classes, respectively. The classifier with the best performance, as well as classifiers that achieve comparable error rates, are printed in boldface (see main text for details).
The misclassification error was estimated by 5-fold cross-validation, with the folds sampled in a stratified manner so that they have approximately the same proportions of class labels as the full data set. For each data set, the various classifiers were learned on the same training sets and their performance was evaluated on the same test sets. We evaluated each classifier by its Balanced Error Rate (BER),

BER = 1 − (Sensitivity + Specificity)/2.

Sensitivity is the true positive rate, i.e., the proportion of positive-class samples that are predicted positive. Specificity is the true negative rate, i.e., the proportion of negative-class samples that are predicted negative. Note that the misclassification error rates for the simulated data sets in the previous subsection were estimated on a test set with an equal number of samples in each class, and hence are equivalent to the BER. The empirical mean and standard deviation of the BER across the 5 cross-validation folds are given in Table 4. For each data set, the classifier with the smallest error rate, together with classifiers that are not significantly different from it according to a two-sided Wilcoxon signed-rank test, appear in bold. Note that the tests were performed at the Bonferroni-corrected α = 0.05/6 level. As shown, in 13 out of 16 data sets, SLB either achieves the minimum misclassification rate or is on par with the best performing model. Comparing SLB to LU and NB, which use only the univariate density features, highlights the importance of including at least some of the bivariate features. Finally, note that in 15 out of 16 data sets, SLB achieves balanced error rates that are either better than or comparable with those of the TAN classifier. These results demonstrate the potential benefit of SLB over the TAN classifier, which relies on a tree assumption that may be violated in practice.
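For concreteness, the balanced error rate used throughout this subsection can be computed as follows (a small sketch of ours, assuming the {−1, +1} label encoding of Section 1):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER = 1 - (sensitivity + specificity) / 2 for labels in {-1, +1}."""
    sensitivity = np.mean(y_pred[y_true == 1] == 1)    # true positive rate
    specificity = np.mean(y_pred[y_true == -1] == -1)  # true negative rate
    return 1.0 - 0.5 * (sensitivity + specificity)
```

On a class-balanced test set this reduces to the ordinary misclassification rate, which is why the simulation errors of Section 5.1 are already balanced error rates.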
6. Discussion

In this work, we proposed the sparse log-bivariate density classifier (SLB), a semiparametric procedure that generalizes tree- and forest-based classification approaches. As in forest-based methods, SLB requires the estimation of only univariate and bivariate densities. However, it is more flexible and does not require learning the structure of the class-conditional distributions. In recent years, the winners of many data modeling competitions have made use of extensive feature engineering, whereby many new features are generated from the original features in the data set (for example, Koren (2009); Narayanan et al. (2011)). Our work suggests that the estimated log densities log p̂_y(x_i) and log p̂_y(x_i, x_j) are interesting features to consider for a wide range of classification problems. Of course, one may choose to also keep the original features, as in Fan et al. (2016), or to combine the log-density features with other non-linear features. At test time, SLB requires the evaluation of the estimated log densities T̂(x) at new instances x. Using a standard kernel density estimator, this requires access to the original training data at test time. Nevertheless, since the computation involves only 1-dimensional and 2-dimensional density estimates, there are efficient methods to approximate these estimated densities without access to the original data (Ram et al., 2009). SLB uses the HSIC measure to rank the bivariate densities and applies a cross-validation procedure to select only some of them. A different approach is to directly incorporate a sparsity-inducing penalty term into the SVM loss function, similar to the approaches presented in Neumann et al. (2005) and Zhu et al. (2004).

Acknowledgments

This work was supported in part by a Texas A&M-Weizmann research grant from Paul and Tina Gardner. Parts of this work were done while AM was at the Weizmann Institute of Science and at Tel-Aviv University.
BN is the incumbent of the William Petschek Professorial Chair in Mathematics. Part of this work was done while BN was on sabbatical at the Institute for Advanced Study in Princeton; BN would like to thank the IAS and the Charles Simonyi endowment for their generous support. Professor Clifford H. Spiegelman passed away during the final review process of this article. We would like to dedicate this paper to his memory.

Appendix A. Proofs

Proof of Lemma 2. The first part follows immediately from Assumption 1, since

$$ \|T_o(x)\|_2^2 = \sum_{y,i,j} \log^2 p_y(x_i, x_j) < d(d+1)\, \max\{ |\log p_{\min}|, |\log p_{\max}| \}^2. $$

We now turn to the second part of the lemma. We have

$$ \sup_{x \in \Omega} \|\widehat{T}(x) - T_o(x)\|_2^2 = \sup_{x \in \Omega} \sum_{y,i,j} \big( \log \widehat{p}_y(x_i, x_j) - \log p_y(x_i, x_j) \big)^2. $$

By the mean value theorem, for any a, b > 0 there exists some ξ ∈ [a, b] such that log b − log a = (b − a)/ξ. It follows that

$$ \frac{|b - a|}{\max\{a, b\}} \;\le\; |\log b - \log a| \;\le\; \frac{|b - a|}{\min\{a, b\}}. \tag{10} $$

Hence,

$$ \sup_{x \in \Omega} \|\widehat{T}(x) - T_o(x)\|_2^2 \le \sup_{x \in \Omega} \sum_{y,i,j} \left( \frac{\widehat{p}_y(x_i, x_j) - p_y(x_i, x_j)}{\min\{ \widehat{p}_y(x_i, x_j),\, p_y(x_i, x_j) \}} \right)^2. $$

By Assumptions 1 and 2, w.h.p. min{p̂_y(x_i, x_j), p_y(x_i, x_j)} > p_min/2, and therefore

$$ \sup_{x \in \Omega} \|\widehat{T}(x) - T_o(x)\|_2^2 \le \frac{4}{p_{\min}^2} \sup_{x \in \Omega} \sum_{y,i,j} \big( \widehat{p}_y(x_i, x_j) - p_y(x_i, x_j) \big)^2 \le \frac{4}{p_{\min}^2} \sum_{y,i,j} \sup_{x \in \Omega} \big( \widehat{p}_y(x_i, x_j) - p_y(x_i, x_j) \big)^2 < \frac{4}{p_{\min}^2}\, d(d+1)\, U^2(n_0). $$

Proof of Lemma 4. By definition,

$$ \big| R_\varphi(w, \widehat{T}) - R_\varphi(w, T_o) \big| = \Big| \mathbb{E}\big[ \varphi(y w^T \widehat{T}(x)) - \varphi(y w^T T_o(x)) \big] \Big|. $$

Using the triangle inequality, we have

$$ \big| R_\varphi(w, \widehat{T}) - R_\varphi(w, T_o) \big| \le \mathbb{E}\, \big| \varphi(y w^T \widehat{T}(x)) - \varphi(y w^T T_o(x)) \big|. $$

Since the hinge loss φ is 1-Lipschitz and |y| = 1, we can simplify the bound further to

$$ \big| R_\varphi(w, \widehat{T}) - R_\varphi(w, T_o) \big| \le \mathbb{E}\, \big| y w^T (\widehat{T}(x) - T_o(x)) \big| = \mathbb{E}\, \big| w^T (\widehat{T}(x) - T_o(x)) \big|. $$

By the Cauchy-Schwarz inequality and Assumption 3,

$$ \big| R_\varphi(w, \widehat{T}) - R_\varphi(w, T_o) \big| \le \|w\|_2\, \mathbb{E}\, \|\widehat{T}(x) - T_o(x)\|_2 = \|w\|_2\, E(n_0, d). $$

The bound for the empirical risk is proved in the same manner.

Proof of Theorem 5.
We decompose the difference of risks into four terms,

$$ R_\varphi(\widehat{w}, \widehat{T}) - R_\varphi(w_\varphi, T_o) = \big[ R_\varphi(\widehat{w}, \widehat{T}) - \widehat{R}_\varphi(\widehat{w}, \widehat{T}) \big] + \big[ \widehat{R}_\varphi(\widehat{w}, \widehat{T}) - \widehat{R}_\varphi(w_\varphi, \widehat{T}) \big] + \big[ \widehat{R}_\varphi(w_\varphi, \widehat{T}) - \widehat{R}_\varphi(w_\varphi, T_o) \big] + \big[ \widehat{R}_\varphi(w_\varphi, T_o) - R_\varphi(w_\varphi, T_o) \big]. \tag{11} $$

We now bound each of these terms separately. To bound the first and fourth terms, we apply a generalization bound for the soft-margin SVM. First, recall that by Lemma 2, ‖T_o(x)‖₂ < √(d(d+1)) L. It follows from Theorem 26.12 of Ben-David and Shalev-Shwartz (2014) that, w.h.p. over D₁, …, D_{n₁},

$$ \sup_{w : \|w\|_2 \le B} \big| R_\varphi(w, T_o) - \widehat{R}_\varphi(w, T_o) \big| < \sqrt{d(d+1)}\, L\, B \cdot O(1/\sqrt{n_1}). $$

In particular, this gives a high-probability bound on the fourth term of Eq. (11). Using Corollary 3, we can similarly bound the first term of Eq. (11): w.h.p. over D₁, …, D_{n₀+n₁},

$$ \big| R_\varphi(\widehat{w}, \widehat{T}) - \widehat{R}_\varphi(\widehat{w}, \widehat{T}) \big| < \sqrt{d(d+1)} \left( L + \frac{2}{p_{\min}}\, U(n_0) \right) B \cdot O(1/\sqrt{n_1}). $$

To bound the second term of Eq. (11), note that ŵ is the minimizer of R̂_φ(w, T̂); hence R̂_φ(ŵ, T̂) − R̂_φ(w_φ, T̂) ≤ 0. The proof concludes by bounding the third term of Eq. (11) using Lemma 4:

$$ \big| \widehat{R}_\varphi(w_\varphi, \widehat{T}) - \widehat{R}_\varphi(w_\varphi, T_o) \big| \le B\, E(n_0, d). $$

Appendix B. Simulation Results

In this section, we present the simulation results (in the form of box plots) of the various experiments described in Section 5.

Figure 1: Misclassification error rates for the various classifiers, under the Gaussian forest regime and balanced training set, across 100 replicate data sets (box plots for n = 200, 400, 600, 800).

Figure 2: Misclassification error rates for the various classifiers, under the Gaussian forest regime and imbalanced training set, across 100 replicate data sets (box plots for n = 200, 400, 600, 800).
Figure 3: Misclassification error rates for the various classifiers, under the complex forest regime and balanced training set, across 100 replicate data sets (box plots for n = 200, 400, 600, 800).

Figure 4: Misclassification error rates for the various classifiers, under the complex forest regime and imbalanced training set, across 100 replicate data sets (box plots for n = 200, 400, 600, 800).

Figure 5: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are balanced and sampled from Gaussian Bayesian networks with no common structure (box plots for n = 200, 400, 600, 800).

Figure 6: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are imbalanced and sampled from Gaussian Bayesian networks with no common structure (box plots for n = 200, 400, 600, 800).

Figure 7: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are balanced and sampled from Gaussian Bayesian networks with a common structure (box plots for n = 200, 400, 600, 800).
Figure 8: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are imbalanced and sampled from Gaussian Bayesian networks with a common structure (box plots for n = 200, 400, 600, 800).

Figure 9: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are balanced and sampled from complex Bayesian networks with no common structure (box plots for n = 200, 400, 600, 800).

Figure 10: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are imbalanced and sampled from complex Bayesian networks with no common structure (box plots for n = 200, 400, 600, 800).

Figure 11: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are balanced and sampled from complex Bayesian networks with a common structure (box plots for n = 200, 400, 600, 800).

Figure 12: Misclassification error rates for the various classifiers, across 100 replicate data sets. Data sets are imbalanced and sampled from complex Bayesian networks with a common structure (box plots for n = 200, 400, 600, 800).

References

Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shai Ben-David, Nadav Eiron, and Philip M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.

Stefano Beretta, Mauro Castelli, Ivo Gonçalves, Roberto Henriques, and Daniele Ramazzotti. Learning the structure of Bayesian networks: A quantitative assessment of the effect of different algorithmic schemes. Complexity, 2018, 2018.

C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

C. K. Chow and T. J. Wagner. Consistency of an estimate of tree-dependent probability distributions (Corresp.). IEEE Transactions on Information Theory, 19(3):369–371, 1973.

Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. In Neural Information Processing Systems (NIPS), pages 2555–2563, 2014.

Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.

Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis, 1973.

Gal Elidan. Copula Network Classifiers (CNCs). In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, pages 346–354, 2012.

Jianqing Fan, Yang Feng, Jiancheng Jiang, and Xin Tong. Feature augmentation via nonparametrics and selection (FANS) in high-dimensional classification. Journal of the American Statistical Association, 111(513):275–287, 2016.

Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2):131–163, 1997.

Arthur Gretton, Alexander Smola, Olivier Bousquet, Ralf Herbrich, Bernhard Schölkopf, and Nikos K. Logothetis. Behaviour and convergence of the constrained covariance. Technical report, Max Planck Institute for Biological Cybernetics, 2004.

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory (ALT), pages 63–77. Springer, 2005a.

Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, and Bernhard Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005b.

Daniel Grossman and Pedro Domingos. Learning Bayesian network classifiers by maximizing conditional likelihood. In International Conference on Machine Learning (ICML), page 46. ACM, 2004.

Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning (ICML), volume 70, pages 1694–1703, 2017.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Yehuda Koren. The BellKor solution to the Netflix grand prize. Netflix prize documentation, 2009.

John Lafferty, Han Liu, and Larry Wasserman. Sparse nonparametric graphical models. Statistical Science, 27(4):519–537, 2012.

Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of Bayesian classifiers. In National Conference on Artificial Intelligence (AAAI), volume 90, pages 223–228, 1992.

Steffen L. Lauritzen. Graphical Models. Clarendon Press, 1996.

Jason D. Lee and Trevor J. Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2015.

Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, and Larry Wasserman. Forest density estimation. Journal of Machine Learning Research, 12:907–951, 2011.

Ofer Meshi, Elad Eban, Gal Elidan, and Amir Globerson. Learning max-margin tree predictors. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 411–420, 2013.

Arvind Narayanan, Elaine Shi, and Benjamin I. P. Rubinstein. Link prediction by de-anonymization: How we won the Kaggle social network challenge. In International Joint Conference on Neural Networks (IJCNN), 2011.

Julia Neumann, Christoph Schnörr, and Gabriele Steidl. Combined SVM-based feature selection and classification. Machine Learning, 61(1-3):129–150, 2005.

Luca Oneto, Sandro Ridella, and Davide Anguita. Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Machine Learning, 103(1):103–136, 2015.

Eun Sug Park, Clifford Spiegelman, and Jeongyoun Ahn. A nonparametric approach based on a Markov like property for classification. Chemometrics and Intelligent Laboratory Systems, 108(2):87–92, 2011.

Parikshit Ram, Dongryeol Lee, William March, and Alexander G. Gray. Linear-time algorithms for pairwise statistical problems. In Neural Information Processing Systems (NIPS), pages 1527–1535, 2009.

Vincent Y. F. Tan, Sujay Sanghavi, John W. Fisher, and Alan S. Willsky. Learning graphical models for hypothesis testing and classification. IEEE Transactions on Signal Processing, 58(11):5481–5495, 2010.

Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning high-dimensional Markov forest distributions: Analysis of error rates. Journal of Machine Learning Research, 12:1617–1653, 2011.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(115):3813–3847, 2015.

Ming Yu, Varun Gupta, and Mladen Kolar. Simultaneous inference for pairwise graphical models with generalized score matching. Journal of Machine Learning Research, 21(91):1–51, 2020.

Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J. Hastie. 1-norm support vector machines. In Neural Information Processing Systems (NIPS), pages 49–56, 2004.