# Core Dependency Networks

Alejandro Molina (alejandro.molina@tu-dortmund.de), CS Department, TU Dortmund, Germany
Alexander Munteanu (alexander.munteanu@tu-dortmund.de), CS Department, TU Dortmund, Germany
Kristian Kersting (kersting@cs.tu-darmstadt.de), CS Department and Centre for Cognitive Science, TU Darmstadt, Germany

Many applications infer the structure of a probabilistic graphical model from data to elucidate the relationships between variables. But how can we train graphical models on a massive data set? In this paper, we show how to construct coresets, compressed data sets which can be used as proxy for the original data and have provably bounded worst-case error, for Gaussian dependency networks (DNs), i.e., cyclic directed graphical models over Gaussians, where the parents of each variable are its Markov blanket. Specifically, we prove that Gaussian DNs admit coresets of size independent of the size of the data set. Unfortunately, this does not extend to DNs over members of the exponential family in general. As we will prove, Poisson DNs do not admit small coresets. Despite this worst-case result, we will provide an argument why our coreset construction for DNs can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluated the resulting Core DNs on real data sets. The results demonstrate significant gains over no or naive sub-sampling, even in the case of count data.

Artificial intelligence and machine learning have achieved considerable successes in recent years, and an ever-growing number of disciplines rely on them. Data is now ubiquitous, and there is great value in understanding the data, e.g., building probabilistic graphical models to elucidate the relationships between variables. In the big data era, however, scalability has become crucial for any useful machine learning approach.

In this paper, we consider the problem of training graphical models, in particular Dependency Networks (Heckerman et al. 2000), on massive data sets. They are cyclic directed graphical models, where the parents of each variable are its Markov blanket, and have been proven successful in various tasks, such as collaborative filtering (Heckerman et al. 2000), phylogenetic analysis (Carlson et al. 2008), genetic analysis (Dobra 2009; Phatak et al. 2010), network inference from sequencing data (Allen and Liu 2013), and traffic as well as topic modeling (Hadiji et al. 2015).

Specifically, we show that Dependency Networks over Gaussians, arguably one of the most prominent types of distribution in statistical machine learning, admit coresets of size independent of the size of the data set. Coresets are weighted subsets of the data, which guarantee that models fitting them will also provide a good fit for the original data set, and have been studied before for clustering (Badoiu, Har-Peled, and Indyk 2002; Feldman, Faulkner, and Krause 2011; Feldman, Schmidt, and Sohler 2013; Lucic, Bachem, and Krause 2016), classification (Har-Peled, Roth, and Zimak 2007; Har-Peled 2015; Reddi, Póczos, and Smola 2015), regression (Drineas, Mahoney, and Muthukrishnan 2006; 2008; Dasgupta et al. 2009; Geppert et al. 2017), and the smallest enclosing ball problem (Badoiu and Clarkson 2003; 2008; Feldman, Munteanu, and Sohler 2014; Agarwal and Sharathkumar 2015); we refer to (Phillips 2017) for a recent extensive literature overview.
Our contribution continues this line of research and generalizes the use of coresets to probabilistic graphical modeling. Unfortunately, this coreset result does not extend to Dependency Networks over members of the exponential family in general. We prove that Dependency Networks over Poisson random variables (Allen and Liu 2013; Hadiji et al. 2015) do not admit (sublinear size) coresets: every single input point is important for the model and needs to appear in the coreset. This is unfortunate when modeling count data, the primary target of Poisson distributions, which is at the center of many scientific endeavors such as citation counts, numbers of web page hits, counts of procedures in medicine, etc. Therefore, despite our worst-case result, we will provide an argument why our coreset construction for Dependency Networks can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluated the resulting Core Dependency Networks (CDNs) on several real data sets and demonstrate significant gains over no or naive sub-sampling, even for count data.

We proceed as follows. We review Dependency Networks (DNs), prove that Gaussian DNs admit sublinear size coresets, and discuss the possibility to generalize this result to count data. Before concluding, we present empirical results.

## Dependency Networks

Most of the existing AI and machine learning literature on graphical models is dedicated to binary, multinomial, or certain classes of continuous (e.g., Gaussian) random variables. Undirected models, aka Markov Random Fields (MRFs), such as Ising (binary random variables) and Potts (multinomial random variables) models, have found a lot of applications in various fields such as robotics, computer vision, and statistical physics, among others. Whereas MRFs allow for cycles in the structures, directed models, aka Bayesian Networks (BNs), require acyclic directed relationships among the random variables. Dependency Networks (DNs), the focus of the present paper, combine concepts from the directed and undirected worlds and are due to Heckerman et al. (2000). Specifically, like BNs, DNs have directed arcs, but they allow for networks with cycles and bi-directional arcs, akin to MRFs. This makes DNs quite appealing for many applications because we can build multivariate models from univariate distributions (Allen and Liu 2013; Yang et al. 2015; Hadiji et al. 2015), while still permitting efficient structure learning using local estimators or gradient tree boosting. If the data are fully observed, learning is done locally on the level of the conditional probability distributions for each variable, mixing directed and undirected as needed. Based on these local distributions, samples from the joint distribution are obtained via Gibbs sampling. Indeed, the Gibbs sampling neglects the question of a consistent joint probability distribution and instead makes only use of local distributions.

Formally, let $X = (X^{(1)}, \ldots, X^{(d)})$ denote a random vector and $x$ its instantiation. A Dependency Network (DN) on $X$ is a pair $(G, \Psi)$ where $G = (V, E)$ is a directed, possibly cyclic, graph where each node $i$ in $V = [d] = \{1, \ldots, d\}$ corresponds to the random variable $X^{(i)}$.
In the set of directed edges $E \subseteq (V \times V) \setminus \{(i, i) \mid i \in [d]\}$, each edge models a dependency between variables, i.e., if there is no edge between $i$ and $j$, then the variables $X^{(i)}$ and $X^{(j)}$ are conditionally independent given the other variables $X^{\setminus i,j}$ indexed by $[d] \setminus \{i, j\}$ in the network. We refer to the nodes that have an edge pointing to $X^{(i)}$ as its parents, denoted by $\mathrm{pa}_i = \{X^{(j)} \mid (j, i) \in E\}$. $\Psi = \{p_i \mid i \in [d]\}$ is a set of conditional probability distributions associated with each variable $X^{(i)} \sim p_i$, where
$$p_i = p(x^{(i)} \mid \mathrm{pa}_i) = p(x^{(i)} \mid x^{\setminus i}).$$
As an example of such a local model, consider Poisson conditional probability distributions as illustrated in Fig. 1 (left):
$$p(x^{(i)} \mid \mathrm{pa}_i) = \frac{\lambda_i(x^{\setminus i})^{x^{(i)}}}{x^{(i)}!}\, e^{-\lambda_i(x^{\setminus i})}.$$
Here, $\lambda_i(x^{\setminus i})$ highlights the fact that the mean can have a functional form that depends on $X^{(i)}$'s parents. Often, we will refer to it simply as $\lambda_i$. The construction of the local conditional probability distribution is similar to the (multinomial) Bayesian network case. However, in the case of DNs, the graph is not necessarily acyclic, and $p(x^{(i)} \mid x^{\setminus i})$ typically has an infinite range, and hence cannot be represented using a finite table of probability values. Finally, the full joint distribution is simply defined as the product of local distributions,
$$\prod_{i \in [d]} p(x^{(i)} \mid x^{\setminus i}),$$
also called the pseudo likelihood.

Figure 1: Illustration of Dependency Networks (DNs) using Poissons. (left) The number of goals scored in soccer games follows a Poisson distribution. The plot shows the distribution of home goals in the season 2012/13 of the German Bundesliga by the home team. The home team scored on average λ = 1.59 goals per game. (right) Example structure of a Poisson DN. The conditional distribution of each count variable given its neighbors is a Poisson distribution. Similar to a Bayesian network, a Poisson DN is directed. However, it also contains cycles. (Best viewed in color)

For the Poisson case, this reads
$$\prod_{i \in [d]} \frac{\lambda_i^{x^{(i)}}}{x^{(i)}!}\, e^{-\lambda_i}.$$
Note, however, that doing so does not guarantee the existence of a consistent joint distribution, i.e., a joint distribution of which the local distributions are the conditionals. Bengio et al. (2014), however, have recently proven the existence of a consistent distribution per given evidence, which does not have to be known in closed form, as long as an unordered Gibbs sampler converges.
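To make the role of the local conditionals concrete, the following is a minimal sketch of such an unordered pseudo Gibbs sweep for a Poisson DN. The log-linear parameterization of $\lambda_i$ and the coefficient vectors `gamma` are illustrative assumptions, not the exact setup used later in the paper.

```python
import numpy as np

def pseudo_gibbs_sweep(x, gamma, rng):
    """One unordered pseudo Gibbs sweep over a Poisson DN.

    x     : current state of the d count variables, shape (d,)
    gamma : list of d coefficient vectors; gamma[i][i] == 0, so the
            conditional of X^(i) depends only on the other variables
            (hypothetical log-linear parameterization)
    """
    d = len(x)
    for i in rng.permutation(d):                      # visit variables in random order
        lam = np.exp(np.clip(gamma[i] @ x, -10, 10))  # lambda_i(x^{\i}), clipped for safety
        x[i] = rng.poisson(lam)                       # resample X^(i) | X^{\i}
    return x

rng = np.random.default_rng(0)
d = 5
gamma = [0.05 * rng.standard_normal(d) for _ in range(d)]
for i in range(d):
    gamma[i][i] = 0.0                                 # no self-dependency
x = rng.poisson(1.0, size=d).astype(float)
for _ in range(100):                                  # run the sampler for a few sweeps
    x = pseudo_gibbs_sweep(x, gamma, rng)
print(x)
```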
## Core Dependency Networks

As argued, learning Dependency Networks (DNs) amounts to determining the conditional probability distributions from a given set of $n$ training instances $x_i \in \mathbb{R}^d$ representing the rows of the data matrix $X \in \mathbb{R}^{n \times d}$ over $d$ variables. Assuming that $p(x^{(i)} \mid \mathrm{pa}_i)$ is parameterized as a generalized linear model (GLM) (McCullagh and Nelder 1989), this amounts to estimating the parameters $\gamma^{(i)}$ of the GLM associated with each variable $X^{(i)}$, since this completely determines the local distributions; but $p(x^{(i)} \mid \mathrm{pa}_i)$ will possibly depend on all other variables in the network, and these dependencies define the structure of the network. This view of training DNs as fitting $d$ GLMs to the data allows us to develop Core Dependency Networks (CDNs): sample a coreset and train a DN over certain members of the GLM family on the sampled coreset. A coreset is a (possibly) weighted and usually considerably smaller subset of the input data that approximates a given objective function for all candidate solutions (Badoiu, Har-Peled, and Indyk 2002):

Definition 1 (ε-coreset). Let $X$ be a set of points from a universe $U$ and let $\Gamma$ be a set of candidate solutions. Let $f: U \times \Gamma \to \mathbb{R}_{\geq 0}$ be a non-negative measurable function. Then a set $C \subseteq X$ is an ε-coreset of $X$ for $f$ if
$$\forall \gamma \in \Gamma: \quad |f(X, \gamma) - f(C, \gamma)| \leq \varepsilon \, f(X, \gamma).$$

We now introduce the formal framework that we need towards the design of coresets for learning dependency networks. A very useful structural property for $\ell_2$ based objective (or loss) functions is the concept of an ε-subspace embedding (Drineas, Mahoney, and Muthukrishnan 2006; 2008).

Definition 2 (ε-subspace embedding). An ε-subspace embedding for the columnspace of $X$ is a matrix $S$ such that
$$\forall \gamma \in \mathbb{R}^d: \quad (1 - \varepsilon)\, \|X\gamma\|^2 \leq \|SX\gamma\|^2 \leq (1 + \varepsilon)\, \|X\gamma\|^2.$$

We can construct a sampling matrix $S$ which forms an ε-subspace embedding with constant probability in the following way: Let $U$ be any orthonormal basis for the columnspace of $X$. This basis can be obtained from the singular value decomposition (SVD) $X = U\Sigma V^T$ of the data matrix. Now let $\rho = \mathrm{rank}(U) = \mathrm{rank}(X)$ and define the leverage scores $l_i = \|U_i\|^2 / \|U\|_F^2 = \|U_i\|^2 / \rho$ for $i \in [n]$. Now we fix a sampling size parameter $k = O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$, sample the input points one-by-one with probability $q_i = \min\{1, k\, l_i\}$, and reweight their contribution to the loss function by $w_i = 1/q_i$. Note that, for the sum of squares loss, this corresponds to defining a diagonal (sampling) matrix $S$ by $S_{ii} = 1/\sqrt{q_i}$ with probability $q_i$ and $S_{ii} = 0$ otherwise. Also note that the expected number of samples is $k = O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$, which also holds with constant probability by Markov's inequality. Moreover, to give an intuition why this works, note that for any fixed $\gamma \in \mathbb{R}^d$ we have
$$\mathbb{E}\, \|SX\gamma\|^2 = \sum_i q_i \left(\frac{x_i\gamma}{\sqrt{q_i}}\right)^2 = \sum_i (x_i\gamma)^2 = \|X\gamma\|^2.$$
The significantly stronger property of forming an ε-subspace embedding, according to Definition 2, follows from a matrix approximation bound given in (Rudelson and Vershynin 2007; Drineas, Mahoney, and Muthukrishnan 2008).

Lemma 3. Let $X$ be an input matrix with $\mathrm{rank}(X) = \rho$. Let $S$ be a sampling matrix constructed as stated above with sampling size parameter $k = O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$. Then $S$ forms an ε-subspace embedding for the columnspace of $X$ with constant probability.

Proof. Let $X = U\Sigma V^T$ be the SVD of $X$. By Theorem 7 in (Drineas, Mahoney, and Muthukrishnan 2008) there exists an absolute constant $C > 1$ such that
$$\mathbb{E}\, \big\|U^T S^T S U - U^T U\big\| \;\leq\; C \sqrt{\frac{\log k}{k}}\, \|U\|_F\, \|U\| \;=\; C \sqrt{\frac{\rho \log k}{k}} \;\leq\; \varepsilon,$$
where we used the fact that $\|U\|_F = \sqrt{\rho}$ and $\|U\| = 1$ by orthonormality of $U$. The last inequality holds by the choice of $k = D\rho \log(\rho/\varepsilon)/\varepsilon^2$ for a large enough absolute constant $D > 1$ such that $\frac{1 + \log D}{D} < \frac{1}{4C^2}$, since
$$\frac{\log k}{k} = \frac{\log(D\rho \log(\rho/\varepsilon)/\varepsilon^2)}{D\rho \log(\rho/\varepsilon)/\varepsilon^2} \leq \frac{2\varepsilon^2 \log(D\rho \log(\rho/\varepsilon)/\varepsilon)}{D\rho \log(\rho/\varepsilon)} \leq \frac{4\varepsilon^2 (\log(\rho/\varepsilon) + \log D)}{D\rho \log(\rho/\varepsilon)} \leq \frac{4\varepsilon^2 (1 + \log D)}{D\rho} < \frac{\varepsilon^2}{C^2 \rho}.$$
By an application of Markov's inequality and rescaling ε, we can assume with constant probability
$$\big\|U^T S^T S U - U^T U\big\| \leq \varepsilon. \quad (1)$$
We show that this implies the ε-subspace embedding property. To this end, fix $\gamma \in \mathbb{R}^d$. Then
$$\big|\, \|SX\gamma\|^2 - \|X\gamma\|^2 \,\big| = \big|\, \gamma^T X^T S^T S X \gamma - \gamma^T X^T X \gamma \,\big| = \big|\, \gamma^T V \Sigma U^T S^T S U \Sigma V^T \gamma - \gamma^T V \Sigma U^T U \Sigma V^T \gamma \,\big|$$
$$= \big|\, \gamma^T V \Sigma\, \big(U^T S^T S U - U^T U\big)\, \Sigma V^T \gamma \,\big| \leq \big\|U^T S^T S U - U^T U\big\|\, \|\Sigma V^T \gamma\|^2 = \big\|U^T S^T S U - U^T U\big\|\, \|X\gamma\|^2 \leq \varepsilon\, \|X\gamma\|^2.$$
The first inequality follows by submultiplicativity, and the second from rotational invariance of the spectral norm. Finally, we conclude the proof by Inequality (1).
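To illustrate the construction, here is a minimal numpy sketch of the leverage score sampling step (thin SVD, leverage scores $l_i$, inclusion probabilities $q_i = \min\{1, k\, l_i\}$, and rescaling by $1/\sqrt{q_i}$). The constant hidden in the $O(\cdot)$ is set to 1, the reweighted rows are returned directly instead of the full $n \times n$ matrix $S$, and the function name is our own choice.

```python
import numpy as np

def leverage_score_coreset(X, eps, rng):
    """Sample and reweight rows of X so that the result is an
    eps-subspace embedding for the columnspace of X (with constant prob.)."""
    n, d = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    rho = int(np.sum(s > s.max() * max(n, d) * np.finfo(float).eps))  # numerical rank
    U = U[:, :rho]
    lev = np.sum(U**2, axis=1) / rho               # leverage scores l_i, summing to 1
    k = int(np.ceil(rho * np.log(rho / eps) / eps**2))  # sampling size parameter
    q = np.minimum(1.0, k * lev)                   # inclusion probabilities
    keep = rng.random(n) < q                       # independent coin flips per row
    SX = X[keep] / np.sqrt(q[keep])[:, None]       # kept rows rescaled by 1/sqrt(q_i)
    return SX, keep, q

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 20))
SX, keep, q = leverage_score_coreset(X, eps=0.5, rng=rng)
print(SX.shape)                                    # roughly k rows instead of 10000
```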
The question arises whether we can do better than $O(\rho \log(\rho/\varepsilon)/\varepsilon^2)$. One can show by reduction from the coupon collector's theorem that there is a lower bound of $\Omega(\rho \log \rho)$, matching the upper bound up to its dependency on ε. The hard instance is a $d^m \times d$, $m \in \mathbb{N}$, orthonormal matrix in which the scaled canonical basis $I_d/\sqrt{d^{m-1}}$ is stacked $d^{m-1}$ times. The leverage scores are all equal to $1/d^m$, implying a uniform sampling distribution with probability $1/d$ for each basis vector. Any rank $\rho = d$ preserving sample must comprise at least one copy of each of the $d$ basis vectors. This is exactly the coupon collector's theorem with $d$ coupons, which has a lower bound of $\Omega(d \log d)$ (Motwani and Raghavan 1995). The fact that the sampling is without replacement does not change this, since the reduction holds for arbitrarily large $m$, creating sufficiently many copies of each element to simulate sampling with replacement (Tropp 2011).

Now we know that with constant probability over the randomness of the construction algorithm, $S$ satisfies the ε-subspace embedding property for a given input matrix $X$. This is the crucial structural property to show that $SX$ is actually a coreset for Gaussian linear regression models and dependency networks.

Consider $(G, \Psi)$, a Gaussian dependency network (GDN), i.e., a collection of Gaussian linear regression models $\Psi = \{p_i(X^{(i)} \mid X^{\setminus i}, \gamma^{(i)}) = \mathcal{N}(X^{\setminus i}\gamma^{(i)}, \sigma^2) \mid i \in [d]\}$ on an arbitrary digraph structure $G$ (Heckerman et al. 2000). The logarithm of the (pseudo-)likelihood (Besag 1975) of the above model is given by
$$\ln L(\Psi) = \ln \prod_{i \in [d]} p_i = \sum_{i \in [d]} \ln p_i.$$
A maximum likelihood estimate can be obtained by maximizing this function with respect to $\gamma = (\gamma^{(1)}, \ldots, \gamma^{(d)})$, which is equivalent to minimizing the GDN loss function
$$f_G(X, \gamma) = \sum_{i \in [d]} \big\|X^{\setminus i}\gamma^{(i)} - X^{(i)}\big\|^2.$$

Theorem 4. Given $S$, an ε-subspace embedding for the columnspace of $X$ as constructed above, $SX$ is an ε-coreset of $X$ for the GDN loss function.

Proof. Fix an arbitrary $\gamma = (\gamma^{(1)}, \ldots, \gamma^{(d)}) \in \mathbb{R}^{d(d-1)}$. Consider the affine map $\Phi: \mathbb{R}^{d-1} \times [d] \to \mathbb{R}^d$, defined by $\Phi(\gamma^{(i)}) = I_d^{\setminus i}\gamma^{(i)} - e_i$. Clearly, $\Phi$ extends its argument from $d-1$ to $d$ dimensions by inserting a $-1$ entry at position $i$ and leaving the other entries in their original order. Let $\beta^{(i)} = \Phi(\gamma^{(i)}) \in \mathbb{R}^d$. Note that for each $i \in [d]$ we have
$$X\beta^{(i)} = X\Phi(\gamma^{(i)}) = X^{\setminus i}\gamma^{(i)} - X^{(i)}, \quad (2)$$
and each $\beta^{(i)}$ is a vector in $\mathbb{R}^d$. Thus, the triangle inequality and the universal quantifier in Definition 2 guarantee that
$$\Big|\, \sum_{i \in [d]} \|SX\beta^{(i)}\|^2 - \sum_{i \in [d]} \|X\beta^{(i)}\|^2 \,\Big| = \Big|\, \sum_{i \in [d]} \big( \|SX\beta^{(i)}\|^2 - \|X\beta^{(i)}\|^2 \big) \Big| \leq \sum_{i \in [d]} \big|\, \|SX\beta^{(i)}\|^2 - \|X\beta^{(i)}\|^2 \,\big| \leq \sum_{i \in [d]} \varepsilon\, \|X\beta^{(i)}\|^2 = \varepsilon \sum_{i \in [d]} \|X\beta^{(i)}\|^2.$$
The claim follows by substituting Identity (2).

It is noteworthy that computing one single coreset for the columnspace of $X$ is sufficient, rather than computing $d$ coresets for the $d$ different subspaces spanned by $X^{\setminus i}$. From Theorem 4 it is straightforward to show that the minimizer found for the coreset is a good approximation of the minimizer for the original data.

Corollary 5. Given an ε-coreset $C$ of $X$ for the GDN loss function, let $\tilde\gamma \in \operatorname{argmin}_{\gamma \in \mathbb{R}^{d(d-1)}} f_G(C, \gamma)$. Then it holds that
$$f_G(X, \tilde\gamma) \leq (1 + 4\varepsilon) \min_{\gamma \in \mathbb{R}^{d(d-1)}} f_G(X, \gamma).$$

Proof. Let $\gamma^* \in \operatorname{argmin}_{\gamma \in \mathbb{R}^{d(d-1)}} f_G(X, \gamma)$. Then
$$f_G(X, \tilde\gamma) \leq \frac{1}{1-\varepsilon}\, f_G(C, \tilde\gamma) \leq \frac{1}{1-\varepsilon}\, f_G(C, \gamma^*) \leq \frac{1+\varepsilon}{1-\varepsilon}\, f_G(X, \gamma^*) \leq (1 + 4\varepsilon)\, f_G(X, \gamma^*).$$
The first and third inequalities are direct applications of the coreset property, the second holds by optimality of $\tilde\gamma$ for the coreset, and the last follows from $\varepsilon \leq 1/2$.

Moreover, the coreset does not affect inference within GDNs. Recently, it was shown for (Bayesian) Gaussian linear regression models that the entire multivariate normal distribution over the parameter space is approximately preserved by ε-subspace embeddings (Geppert et al. 2017), which generalizes the above. This implies that the coreset yields a useful pointwise approximation in Markov Chain Monte Carlo inference via random walks like the pseudo Gibbs sampler in (Heckerman et al. 2000).
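Putting the pieces together, training a Gaussian CDN amounts to solving $d$ least squares problems on the single reweighted sample $SX$. A minimal sketch (our own glue code, reusing the hypothetical `leverage_score_coreset` from the sketch above) could look like this:

```python
import numpy as np

def fit_gdn(Z):
    """Fit a Gaussian DN on data matrix Z by regressing each column
    on all remaining columns (d separate least squares problems)."""
    n, d = Z.shape
    gammas = []
    for i in range(d):
        rest = np.delete(Z, i, axis=1)            # Z^{\i}
        target = Z[:, i]                          # Z^{(i)}
        g, *_ = np.linalg.lstsq(rest, target, rcond=None)
        gammas.append(g)
    return gammas

def gdn_loss(Z, gammas):
    """GDN loss f_G(Z, gamma) = sum_i ||Z^{\i} gamma^{(i)} - Z^{(i)}||^2."""
    return sum(
        np.sum((np.delete(Z, i, axis=1) @ g - Z[:, i]) ** 2)
        for i, g in enumerate(gammas)
    )

# Train on the coreset SX and compare against training on the full data X:
# gammas_core = fit_gdn(SX)
# print(gdn_loss(X, gammas_core), gdn_loss(X, fit_gdn(X)))
```

Because the rows of $SX$ are already rescaled by $1/\sqrt{q_i}$, plain least squares on $SX$ minimizes exactly the weighted coreset loss $f_G(C, \gamma)$, so Corollary 5 applies to the fitted coefficients.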
## Negative Result on Coresets for Poisson DNs

Unfortunately, there is no (sublinear size) coreset for the simpler problem of Poisson regression, which implies the result for Poisson DNs. We show this formally by reduction from the communication complexity problem known as indexing. Recall that the negative log-likelihood for Poisson regression is (McCullagh and Nelder 1989; Winkelmann 2008)
$$\ell(\gamma) := \ell(\gamma \mid X, Y) = \sum_{i} \exp(x_i\gamma) - y_i x_i\gamma + \ln(y_i!).$$

Theorem 6. Let $\Sigma_D$ be a data structure for $D = [X, Y]$ that approximates likelihood queries $\Sigma_D(\gamma)$ for Poisson regression, such that
$$\forall \gamma \in \mathbb{R}^d: \quad \eta^{-1}\, \ell(\gamma \mid D) \leq \Sigma_D(\gamma) \leq \eta\, \ell(\gamma \mid D).$$
If $\eta < \frac{\exp(n/4)}{2n^2}$, then $\Sigma_D$ requires $\Omega(n)$ bits of storage.

Proof. We reduce from the indexing problem, which is known to have $\Omega(n)$ one-way randomized communication complexity (Jayram, Kumar, and Sivakumar 2008). Alice is given a vector $b \in \{0, 1\}^n$. She produces for every $i$ with $b_i = 1$ the points $x_i = (r\,\omega_i, 1) \in \mathbb{R}^3$, where $\omega_i$, $i \in \{0, \ldots, n-1\}$, denote the $n$th unit roots in the plane, i.e., the vertices of a regular $n$-polygon of radius $r = n/(1 - \cos(\frac{2\pi}{n})) \leq n^3$, in canonical order. The corresponding counts are set to $y_i = 1$. She builds and sends $\Sigma_D$ of size $s(n)$ to Bob, whose task is to guess the bit $b_j$. He chooses to query $\gamma = (\omega_j, -r \cos(\frac{2\pi}{n})) \in \mathbb{R}^3$. Note that this affine hyperplane separates $r\,\omega_j$ from the other scaled unit roots, since it passes exactly through $r\,\omega_{(j-1) \bmod n}$ and $r\,\omega_{(j+1) \bmod n}$. Also, all points are within distance $2r$ from each other by construction, and consequently from the hyperplane. Thus, $-2r \leq x_i\gamma \leq 0$ for all $i \neq j$. If $b_j = 0$, then $x_j$ does not exist and the cost is at most
$$\ell(\gamma) = \sum_{i \neq j} \exp(x_i\gamma) - y_i x_i\gamma + \ln(y_i!) \leq \sum_{i \neq j} (1 + 2r + 1) \leq 2n + 2nr \leq 4n^4.$$
If $b_j = 1$, then $x_j$ is in the expensive halfspace and at distance exactly
$$x_j\gamma = (r\,\omega_j)^T \omega_j - r \cos\Big(\frac{2\pi}{n}\Big) = r\Big(1 - \cos\Big(\frac{2\pi}{n}\Big)\Big) = n.$$
So the cost is bounded below by $\ell(\gamma) \geq \exp(n) - n \geq \exp(n/2)$. Given $\eta < \frac{\exp(n/4)}{2n^2}$, Bob can distinguish these two cases based on the data structure only, by deciding whether $\Sigma_D(\gamma)$ is strictly smaller or larger than $\exp(n/4) \cdot 2n^2$. Consequently, $s(n) = \Omega(n)$, since this solves the indexing problem.

Note that the bound is given in bit complexity, but restricting the data structure to a sampling-based coreset and assuming every data point can be expressed in $O(d \log n)$ bits, this means we still have a lower bound of $k = \Omega(\frac{n}{\log n})$ samples.

Corollary 7. Every sampling based coreset for Poisson regression with approximation factor $\eta < \frac{\exp(n/4)}{2n^2}$ as in Theorem 6 requires at least $k = \Omega(\frac{n}{\log n})$ samples.

At this point, it seems very likely that a similar argument can be used to rule out any $o(n)$-space constant approximation algorithm. This remains an open problem for now.

## Why Core DNs for Count Data can still work

In the Gaussian setting, the loss is measured in squared Euclidean distance, and the number of important points, i.e., those having significantly large leverage scores, is bounded essentially by $O(d)$. This is implicit in the original early works (Drineas, Mahoney, and Muthukrishnan 2008) and has been explicitly formalized later (Langberg and Schulman 2010; Clarkson and Woodruff 2013). It is crucial to understand that this is an inherent property of the norm function, and thus holds for arbitrary data. For the Poisson GLM, in contrast, we have shown that its loss function does not come with such properties from scratch. We constructed a worst case scenario, where basically every single input point is important for the model and needs to appear in the coreset. Usually, this is not the case with statistical models, where the data is assumed to be generated i.i.d. from some generating distribution that fits the model assumptions.
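The $O(d)$ bound on the number of important points can be made tangible: the unnormalized leverage scores always sum to the rank ρ, so at most ρ/t of them can exceed any threshold t, and for benign i.i.d. data essentially none do. A small sanity check along these lines (our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.standard_normal((n, d))          # benign i.i.d. Gaussian design

U, s, Vt = np.linalg.svd(X, full_matrices=False)
rho = np.linalg.matrix_rank(X)
lev_mass = np.sum(U**2, axis=1)          # unnormalized scores ||U_i||^2

print(lev_mass.sum())                    # equals rho, independent of the data
t = 10 * rho / n                         # threshold: 10x the average score
print(np.sum(lev_mass > t))              # at most rho/t points can exceed t;
                                         # for this benign data typically none do
```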
Consider, for instance, a data reduction for Gaussian linear regression via leverage score sampling vs. uniform sampling. It was shown that the leverage scores are then quite uniform; in the presence of more and more outliers, generated by the heavier tails of t-distributions, leverage score sampling increasingly outperforms uniform sampling (Ma, Mahoney, and Yu 2015).

The Poisson model
$$y_i \sim \mathrm{Poi}(\lambda_i), \quad \lambda_i = \exp(x_i\gamma), \quad (3)$$
suffers from its inherent limitation to equidispersed data, since $\mathbb{E}[y_i \mid x_i] = \mathbb{V}[y_i \mid x_i] = \exp(x_i\gamma)$. Count data, however, is often overdispersed, especially for large counts. This is due to unobserved variables or problem specific heterogeneity and contagion-effects. The log-normal Poisson model is known to be inferior for data which specifically follows the Poisson model, but turns out to be more powerful in modeling the effects that cannot be captured by the simple Poisson model. We review the log-normal Poisson model for count data (Winkelmann 2008):
$$y_i \sim \mathrm{Poi}(\lambda_i), \quad \lambda_i = \exp(x_i\gamma)\, u_i = \exp(x_i\gamma + v_i), \quad v_i = \ln u_i \sim \mathcal{N}(\mu, \sigma^2).$$
A natural choice for the parameters of the log-normal distribution is $\mu = -\frac{\sigma^2}{2}$, in which case we have
$$\mathbb{E}[y_i \mid x_i] = \exp(x_i\gamma + \mu + \sigma^2/2) = \exp(x_i\gamma), \qquad \mathbb{V}[y_i \mid x_i] = \mathbb{E}[y_i \mid x_i] + (\exp(\sigma^2) - 1)\, \mathbb{E}[y_i \mid x_i]^2.$$
It follows that $\mathbb{V}[y_i \mid x_i] = \exp(x_i\gamma) + \Omega(\exp(x_i\gamma)^2) > \exp(x_i\gamma)$, where the constant $\sigma^2$, which is independent of $x_i$, controls the amount of overdispersion. Taking the limit for $\sigma \to 0$, we arrive at the simple model (3), since the distribution of $v_i = \ln u_i$ tends to $\delta_0$, the deterministic Dirac delta distribution which puts all mass on 0. The inference might aim for the log-normal Poisson model directly as in (Zhou et al. 2012), or it can be performed by (pseudo-)maximum likelihood estimation of the simple Poisson model. The latter provides a consistent estimator as long as the log-linear mean function is correctly specified, even if higher moments do not possess the limitations inherent in the simple Poisson model (Winkelmann 2008). Preserving the log-linear mean function in a Poisson model is thus crucial towards consistency of the estimator.

Moreover, modeling counts in a log-normal model gives us intuition why leverage score sampling can capture the underlying linear model accurately: In the log-normal Poisson model, $u$ follows a log-normal distribution. It thus holds for $\ln\lambda = X\gamma + \ln u = X\gamma + v$ that, by independence of the observations,
$$\ln\lambda \sim \mathcal{N}\Big(X\gamma - \frac{\sigma^2}{2}\mathbf{1},\; \sigma^2 I\Big).$$
Omitting the bias $\mu = -\frac{\sigma^2}{2}$ in each intercept term (which can be cast into $X$), we notice that this yields again an ordinary least squares problem $\|X\gamma - \ln(\lambda)\|^2$ defined in the column space of $X$.

There is still a missing piece in our argumentation. In the previous section, we have used that the coreset construction is an ε-subspace embedding for the columnspace of the whole data set including the dependent variable, i.e., for $[X, \ln(\lambda)]$. We face two problems. First, $\lambda$ is only implicitly given in the data, but is not explicitly available. Second, $\lambda$ is a vector derived from $X^{\setminus i}$ in our setting and might be different for any of the $d$ instances. Fortunately, it was shown via more complicated arguments (Drineas, Mahoney, and Muthukrishnan 2008) that it is sufficient for a good approximation if the sampling is done obliviously to the dependent variable. The intuition comes from the fact that the loss of any point in the subspace can be expressed via the projection of $\ln(\lambda)$ onto the subspace spanned by $X$, and the residual of its projection. A good approximation of the subspace implicitly approximates the projection of any fixed vector, which is then applied to the residual vector of the orthogonal projection. This solves the first problem, since it is only necessary to have a subspace embedding for $X$. The second issue can be addressed by increasing the sample size by a factor of $O(\log d)$, reducing the failure probability to $O(1/d)$, and taking a union bound.
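The mean and variance identities above are easy to check by simulation; the following is a minimal sketch (our own illustration, for one arbitrary value of the linear predictor $x_i\gamma$):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.7
mu = -sigma**2 / 2                          # bias choice from the text
xg = 1.2                                    # a fixed linear predictor x_i gamma

v = rng.normal(mu, sigma, size=2_000_000)   # v_i = ln u_i ~ N(mu, sigma^2)
lam = np.exp(xg + v)                        # lambda_i = exp(x_i gamma + v_i)
y = rng.poisson(lam)                        # y_i ~ Poi(lambda_i)

print(y.mean(), np.exp(xg))                 # E[y|x] = exp(x gamma)
print(y.var(),
      np.exp(xg) + (np.exp(sigma**2) - 1) * np.exp(xg)**2)
                                            # V[y|x] = E + (e^{sigma^2} - 1) E^2
```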
## Empirical Illustration

Our intention here is to corroborate our theoretical results by investigating the following questions empirically: (Q1) How does the performance of CDNs compare to DNs with access to the full training data set and to a uniform sample from the training data set? And how does the empirical error behave with the sample size? (Q2) Do coresets affect the structure recovered by the DN?

To this aim, we implemented (C)DNs in Python (code available at https://github.com/alejandromolinaml/CoreDNs). All experiments ran on a Linux machine (56 cores, 4 GPUs, and 512GB RAM). All DNs were trained using iteratively reweighted least squares (IRWLS); however, coresets do not depend on the learning algorithm used (one such update on weighted data is sketched after the dataset descriptions below).

Figure 2: (Q1) Performance (the lower, the better) of Gaussian CDNs on MNIST (upper row) and Poisson CDNs on the traffic dataset (lower row), 10-fold cross-validated. Shown are the negative log pseudo-likelihood (left), the squared error loss (middle, in log-space), as well as the training time (right, in log-space) on the y-axis for different proportions of the data sampled (x-axis). Please note the jump in the x-axis after 40%. As one can see, CDNs (blue) quickly approach the predictive performance of the full dataset (Full, black). Uniform sampling (Uniform, red) does not perform as well as CDNs. Moreover, CDNs can be orders of magnitude faster than DNs on the full dataset and scale similarly to uniform sampling. This is also supported by the vertical lines. They denote the mean performances (the more to the left, the better) on the top axes. (Best viewed in color)

Benchmarks on MNIST and Traffic Data (Q1): We considered two datasets. We used the MNIST data set of handwritten labeled digits (http://yann.lecun.com/exdb/mnist/). We employed the training set consisting of 55000 images, each with 784 pixels, for a total of 43,120,000 measurements, and trained Gaussian DNs on it. The second dataset contains traffic count measurements on selected roads around the city of Cologne in Germany (Ide et al. 2015). It consists of 7994 time-stamped measurements taken by 184 sensors, for a total of 1,470,896 measurements, and we trained Poisson DNs on it.
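For reference, one IRWLS update for a Poisson GLM with log link on weighted data (e.g., coreset points with weights $w_i = 1/q_i$) can be sketched as follows; this is a generic textbook-style iteration, not the exact implementation used in our experiments:

```python
import numpy as np

def irls_poisson(X, y, w, n_iter=25):
    """Fit a Poisson GLM with log link by IRLS on weighted data.

    X : (n, p) design matrix, y : (n,) counts,
    w : (n,) observation weights (e.g. coreset weights 1/q_i).
    """
    n, p = X.shape
    gamma = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ gamma                        # linear predictor
        mu = np.exp(np.clip(eta, -30, 30))     # conditional mean, clipped for safety
        z = eta + (y - mu) / mu                # working response
        W = w * mu                             # IRLS weights times observation weights
        XtW = X.T * W                          # X^T diag(W)
        gamma = np.linalg.solve(XtW @ X, XtW @ z)
    return gamma

# e.g. local model of variable i on a coreset with rows C and weights wc:
# gamma_i = irls_poisson(np.delete(C, i, axis=1), C[:, i], wc)
```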
For each dataset, we performed ten-fold cross-validation for training a full DN (Full) using all the data, leverage score sampling coresets (CDNs), and uniform samples (Uniform), for different sample sizes. We then compared the predictions made by all the DNs and the time taken to train them. The traffic dataset is comparably easy to approximate by larger uniform samples due to the regularities in daily traffic patterns (commuting people cause peaks in the morning and evening, little traffic at night, more traffic at daytime). The challenging task is to be good at small sample sizes, where CDNs are superior. It can also be seen that CDNs are better in predictive performance (RMSE). For the predictions on the MNIST dataset, we clipped the values to the range [0,1] for all the DNs. For the traffic dataset, we rounded the prediction $\hat{x}$ of every measurement $x$ down to the largest integer less than or equal to $\hat{x}$.

Fig. 2 summarizes the results. As one can see, CDNs outperform DNs trained on the full data and are orders of magnitude faster. Compared to uniform sampling, coresets are competitive. As seen on the traffic dataset, CDNs can have more predictive power than the optimal model using the full data. This is in line with Mahoney (2011), who observed that coresets implicitly introduce regularization and lead to more robust output. Table 1 summarizes the empirical relative errors $|f(X, \tilde\gamma) - f(X, \gamma^*)| / f(X, \gamma^*)$ between (C/U)DNs $\tilde\gamma$ and DNs $\gamma^*$ trained on all the data. CDNs clearly recover the original model at a fraction of the training data. Overall, this answers (Q1) affirmatively.

| Sample portion | MNIST GCDN | MNIST GUDN | Traffic PCDN | Traffic PUDN |
|---|---|---|---|---|
| 10% | **18.03%** | 11162.01% | **6.81%** | 9.6% |
| 20% | **0.57%** | 13.86% | **2.9%** | 3.17% |
| 30% | **0.01%** | 13.33% | 2.04% | **1.68%** |
| 40% | **0.01%** | 2.3% | 1.59% | **0.99%** |

Table 1: (Q1) Comparison of the empirical relative error (the lower, the better). Best results per dataset are bold. Both Gaussian (GCDNs) and Poisson (PCDNs) CDNs recover the model well, with a fraction of the training data. Uniformly sampled DNs (UDNs) lag behind as the sample size drops.

Relationship Elucidation (Q2): We investigated the performance of CDNs when recovering the graph structure of word interactions from a text corpus. For this purpose, we used the NIPS bag-of-words dataset (https://archive.ics.uci.edu/ml/datasets/bag+of+words). It contains 1,500 documents with a vocabulary above 12k words. We considered the 100 most frequent words. Fig. 3 illustrates the results qualitatively.
It shows three CDNs of sampling sizes 40%, 70%, and 100%, for Gaussians (top), after a log(x + 1) transformation, and for Poissons (bottom): CDNs capture well the gist of the NIPS corpus. Table 2 confirms this quantitatively. It shows the Frobenius norms between the DNs: CDNs capture the gist better than naive, i.e., uniform, sampling. This answers (Q2) affirmatively. To summarize our empirical results, the answers to questions (Q1) and (Q2) show the benefits of CDNs.

Figure 3: (Q2) Elucidating the relationships between random variables. Shown are the (positive) dependency structures of Gaussian (top) and Poisson (bottom) CDNs on NIPS and different learning sampling sizes: using 40% (left), 70% (middle), and 100% (right). The edges show the 70 top thresholded positive coefficients of the GLMs. The colors of the edges represent modularity. As one can see, CDNs elucidate relationships among the words that make semantic sense and approach the structure learned using the full dataset. For a quantitative assessment, see Tab. 2. (Best viewed in color)

| Sample portion | UDN Gaussian | UDN Poisson | CDN Gaussian | CDN Poisson |
|---|---|---|---|---|
| 40% | 9.0676 | 6.4042 | **3.9135** | **0.6497** |
| 70% | 4.8487 | 1.6262 | **2.6327** | **0.3821** |

Table 2: (Q2) Frobenius norm of the difference of the adjacency matrices (the lower, the better) recovered by DNs trained on the full data and trained on a uniform subsample (UDN) resp. coresets (CDNs) of the training data. The best results per statistical type (Gaussian/Poisson) are bold. CDNs recover the structure better than UDNs.

## Conclusions

Inspired by the question of how we can train graphical models on massive datasets, we have studied coresets for estimating Dependency Networks (DNs). We presented the first rigorous guarantees for obtaining compressed ε-approximations of Gaussian DNs for large data sets. We proved worst-case impossibility results on coresets for Poisson DNs. A review of log-normal Poisson modeling of counts provided deep insights into why our coreset construction still performs well for count data in practice. Our experimental results demonstrate that the resulting Core Dependency Networks (CDNs) can achieve significant gains over no or naive sub-sampling, even in the case of count data, making it possible to learn models on much larger datasets using the same hardware.

CDNs provide several interesting avenues for future work. The conditional independence assumption opens the door to explore hybrid multivariate models, where each variable can potentially come from a different GLM family or link function, on massive data sets. This can further be used to hint at independencies among variables in the multivariate setting, making them useful in other large data applications.
Our results may pave the way to establish coresets for deep models using the close connection between dependency networks and deep generative stochastic networks (Bengio et al. 2014), sum-product networks (Poon and Domingos 2011; Molina, Natarajan, and Kersting 2017), as well as other statistical models that build multivariate distributions from univariate ones (Yang et al. 2015).

Acknowledgements: The authors would like to thank the anonymous reviewers for their feedback and acknowledge the support by the German Science Foundation (DFG) Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", projects B4 and C4. KK acknowledges the support by the Centre for Cognitive Science at the TU Darmstadt.

## References

Agarwal, P. K., and Sharathkumar, R. 2015. Streaming algorithms for extent problems in high dimensions. Algorithmica 72(1):83-98.

Allen, G. I., and Liu, Z. 2013. A local Poisson graphical model for inferring networks from sequencing data. IEEE Transactions on Nanobioscience 12(3):189-198.

Badoiu, M., and Clarkson, K. L. 2003. Smaller core-sets for balls. In Proc. of SODA, 801-802.

Badoiu, M., and Clarkson, K. L. 2008. Optimal core-sets for balls. Computational Geometry 40(1):14-22.

Badoiu, M.; Har-Peled, S.; and Indyk, P. 2002. Approximate clustering via core-sets. In Proc. of STOC, 250-257.

Bengio, Y.; Laufer, E.; Alain, G.; and Yosinski, J. 2014. Deep generative stochastic networks trainable by backprop. In Proc. of ICML, 226-234.

Besag, J. 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D 24(3):179-195.

Carlson, J. M.; Brumme, Z. L.; Rousseau, C. M.; Brumme, C. J.; Matthews, P.; Kadie, C. M.; Mullins, J. I.; Walker, B. D.; Harrigan, P. R.; Goulder, P. J. R.; and Heckerman, D. 2008. Phylogenetic dependency networks: Inferring patterns of CTL escape and codon covariation in HIV-1 Gag. PLoS Computational Biology 4(11).

Clarkson, K. L., and Woodruff, D. P. 2013. Low rank approximation and regression in input sparsity time. In Proc. of STOC, 81-90.

Dasgupta, A.; Drineas, P.; Harb, B.; Kumar, R.; and Mahoney, M. W. 2009. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing 38(5):2060-2078.

Dobra, A. 2009. Variable selection and dependency networks for genomewide data. Biostatistics 10(4):621-639.

Drineas, P.; Mahoney, M. W.; and Muthukrishnan, S. 2006. Sampling algorithms for ℓ2 regression and applications. In Proc. of SODA, 1127-1136.

Drineas, P.; Mahoney, M. W.; and Muthukrishnan, S. 2008. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications 30(2):844-881.

Feldman, D.; Faulkner, M.; and Krause, A. 2011. Scalable training of mixture models via coresets. In Proc. of NIPS.

Feldman, D.; Munteanu, A.; and Sohler, C. 2014. Smallest enclosing ball for probabilistic data. In Proc. of SOCG, 214-223.

Feldman, D.; Schmidt, M.; and Sohler, C. 2013. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proc. of SODA, 1434-1453.

Geppert, L. N.; Ickstadt, K.; Munteanu, A.; Quedenfeld, J.; and Sohler, C. 2017. Random projections for Bayesian regression. Statistics and Computing 27(1):79-101.

Hadiji, F.; Molina, A.; Natarajan, S.; and Kersting, K. 2015. Poisson dependency networks: Gradient boosted models for multivariate count data. MLJ 100(2-3):477-507.

Har-Peled, S.; Roth, D.; and Zimak, D. 2007. Maximum margin coresets for active and noise tolerant learning. In Proc. of IJCAI, 836-841.

Har-Peled, S. 2015. A simple algorithm for maximum margin classification, revisited. arXiv 1507.01563.

Heckerman, D.; Chickering, D.; Meek, C.; Rounthwaite, R.; and Kadie, C. 2000. Dependency networks for density estimation, collaborative filtering, and data visualization. Journal of Machine Learning Research 1:49-76.

Ide, C.; Hadiji, F.; Habel, L.; Molina, A.; Zaksek, T.; Schreckenberg, M.; Kersting, K.; and Wietfeld, C. 2015. LTE connectivity and vehicular traffic prediction based on machine learning approaches. In Proc. of IEEE VTC Fall.

Jayram, T. S.; Kumar, R.; and Sivakumar, D. 2008. The one-way communication complexity of Hamming distance. Theory of Computing 4(1):129-135.

Langberg, M., and Schulman, L. J. 2010. Universal epsilon-approximators for integrals. In Proc. of SODA.

Lucic, M.; Bachem, O.; and Krause, A. 2016. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In Proc. of AISTATS, 1-9.

Ma, P.; Mahoney, M. W.; and Yu, B. 2015. A statistical perspective on algorithmic leveraging. JMLR 16:861-911.

Mahoney, M. W. 2011. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3(2):123-224.

McCullagh, P., and Nelder, J. 1989. Generalized Linear Models. Chapman and Hall.

Molina, A.; Natarajan, S.; and Kersting, K. 2017. Poisson sum-product networks: A deep architecture for tractable multivariate Poisson distributions. In Proc. of AAAI.

Motwani, R., and Raghavan, P. 1995. Randomized Algorithms. Cambridge Univ. Press.

Phatak, A.; Kiiveri, H. T.; Clemmensen, L. H.; and Wilson, W. J. 2010. NetRaVE: constructing dependency networks using sparse linear regression. Bioinformatics 26(12):1576-1577.

Phillips, J. M. 2017. Coresets and sketches. In Handbook of Discrete and Computational Geometry.

Poon, H., and Domingos, P. 2011. Sum-product networks: A new deep architecture. In Proc. of UAI.

Reddi, S. J.; Póczos, B.; and Smola, A. J. 2015. Communication efficient coresets for empirical loss minimization. In Proc. of UAI, 752-761.

Rudelson, M., and Vershynin, R. 2007. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM 54(4):21.

Tropp, J. A. 2011. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis 3(1-2):115-126.

Winkelmann, R. 2008. Econometric Analysis of Count Data. Springer, 5th edition.

Yang, E.; Ravikumar, P.; Allen, G. I.; and Liu, Z. 2015. On graphical models via univariate exponential family distributions. JMLR 16:3813-3847.

Zhou, M.; Li, L.; Dunson, D. B.; and Carin, L. 2012. Lognormal and gamma mixed negative binomial regression. In Proc. of ICML.