Journal of Machine Learning Research 17 (2016) 1-35. Submitted 7/13; Revised 5/15; Published 4/16

Scalable Learning of Bayesian Network Classifiers

Ana M. Martínez (anam.martinezf@gmail.com), Faculty of Information Technology, Monash University, VIC 3800, Australia
Geoffrey I. Webb (geoff.webb@monash.edu), Faculty of Information Technology, Monash University, VIC 3800, Australia
Shenglei Chen (tristan chen@126.com), College of Information Science, Nanjing Audit University, China / Faculty of Information Technology, Monash University, Australia
Nayyar A. Zaidi (nayyar.zaidi@monash.edu), Faculty of Information Technology, Monash University, VIC 3800, Australia

Editor: Russ Greiner

Abstract

Ever-increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, the capacity to learn very complex multivariate probability distributions), is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub-model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time, than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.

Keywords: scalable Bayesian classification, feature selection, out-of-core learning, big data

1. Introduction

Until very recently most machine learning research has been conducted in the context of relatively small datasets, that is, no more than 50,000 instances and a few hundred attributes.
It is only in the last five years that larger datasets have started to appear in the literature on a regular basis (Erhan et al., 2010; Sonnenburg and Franc, 2010; Agarwal et al., 2014; Sariyar et al., 2011). We argue that larger data quantities do not simply call for scaling-up of existing learning algorithms. Rather, we believe, as first argued by Brain and Webb (1999), that the most accurate learners for large data will have much lower bias than the most accurate learners for small data. Thus we need a new generation of computationally efficient low-bias learners. To achieve the necessary efficiency, a learning algorithm for very large data should ideally require only a few passes through the training data. Further, to remove memory size as a bottleneck, the algorithm should be able to process data out-of-core.

©2016 Ana M. Martínez, Geoffrey I. Webb, Shenglei Chen and Nayyar A. Zaidi.

In this paper, we extend the KDB classifier. KDB is a form of restricted Bayesian network classifier (BNC). It has numerous desirable properties in the context of learning from large quantities of data, including: the capacity to control its bias/variance trade-off with a single parameter, k; no requirement that data be held in-core; and the capacity to learn in just two passes through the training data. The contributions of this paper are as follows. We extend KDB to perform discriminative model selection in a single additional pass through the training data. In this single pass, our algorithm selects between both attribute subsets and network structures. The resulting highly scalable algorithm combines the computational efficiency of classical generative learning with the low bias of discriminative learning. We compare the performance of our algorithm with other state-of-the-art classifiers on several large datasets, ranging in size from 165 thousand to 54 million instances and 5 to 784 attributes.
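The single-pass selection idea can be sketched abstractly: in one extra sweep over the training data, every candidate model (an attribute-subset/structure pair) is scored by leave-one-out cross validation and the lowest-loss candidate is kept. The sketch below is illustrative only; `candidates`, `predict` and `loss` are hypothetical stand-ins, not the paper's implementation (which subtracts each instance's counts from the network's tables rather than retraining).

```python
def loo_select(candidates, data, loss):
    """One sweep over `data`: accumulate leave-one-out loss for every
    candidate model, then return the name of the lowest-loss one.
    candidates: dict name -> predict(x, held_out_index) callable."""
    totals = {name: 0.0 for name in candidates}
    for idx, (x, y) in enumerate(data):
        for name, predict in candidates.items():
            totals[name] += loss(predict(x, idx), y)
    best = min(totals, key=totals.get)
    return best, totals

# Toy usage with 0-1 loss and two dummy "models".
data = [((0,), 0), ((1,), 1), ((2,), 0)]
zero_one = lambda pred, true: 0.0 if pred == true else 1.0
cands = {
    "always-zero": lambda x, idx: 0,
    "first-attr":  lambda x, idx: x[0] % 2,
}
best, totals = loo_select(cands, data, zero_one)
```

Because the loss totals for all candidates are accumulated in the same sweep, the cost of selection is one pass regardless of how many nested candidates are scored.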
We show that our highly scalable algorithms achieve comparable or lower error on large data than a range of out-of-core and in-core state-of-the-art classification learning algorithms. To illustrate our motivations, in Figure 1 we present learning curves for the poker-hand dataset (which is described in Table 7 in Appendix A). As can be seen, for lower quantities of data the lower variance delivered by low values of k results in lower error for KDB, while for larger quantities of data the lower bias of higher values of k results in lower error. While selective KDB sometimes selects a k that overfits this data with smaller training set sizes, when the training set size exceeds 550,000 it successfully selects the best k, and by selecting a subset of the attributes it can substantially reduce error across the entire range of training set sizes relative to the KDB classifier using the same k. Section 2 reviews the state of the art in out-of-core BNCs and introduces our proposal: selective KDB (SKDB, Section 2.1). In Section 3 we present a set of comparisons for our proposed algorithm on 16 big datasets with out-of-core BNCs (Section 3.1), out-of-core linear classifiers using stochastic gradient descent (Section 3.2), and with the in-core (batch) learners Bayes Net and Random Forest (Section 3.3). Finally, Section 4 presents the main conclusions and outlines future work.

2. Scalable Bayesian Networks for Classification

We define the classification learning task as the assignment of a label y ∈ Ω_Y, from a set of c labels of the class variable Y, to a given example x = (x_1, . . . , x_d), with values for the d attributes A = {X_1, . . . , X_d}. Bayesian networks (BNs) (Pearl, 1988) provide a framework for computing the joint probability of these d random variables.
Figure 1: Learning curves for KDB and Selective KDB on the poker-hand dataset (training set sizes up to 1,000,000; curves shown for KDB k=5, SKDB k=5, SKDB k=5 selecting only k, and SKDB k=5 selecting only attributes). The values plotted are averages over 10 runs. For each run 1,000 examples are selected as a test set and samples of successive sizes are selected from the remaining data for training. As can be seen, for lower quantities of data the lower variance delivered by low values of k results in lower error for KDB, while for larger quantities of data the lower bias of higher values of k results in lower error, although there is not sufficient data for k = 5 to asymptote on this dataset. While only selecting a k (SKDB k=5 only k) causes selective KDB to overfit this data with smaller training set sizes (between 20,000 and 550,000), when the training set size is large enough it successfully selects the best k. Only selecting attributes (SKDB k=5 only atts) always reduces error (relative to KDB k=5). By selecting both k and attributes (SKDB k=5), selective KDB substantially reduces error relative to KDB with any value of k once there are more than approximately 450,000 training examples.

A BN is a directed acyclic graph in which the joint probability of all attributes can be written as the product of the individual probabilities of all attributes given their parents, that is,

    p(y, x) = p(y | π_y) ∏_{i=1}^{d} p(x_i | π_{x_i}),

where π_{x_i} denotes the parents of attribute X_i. If the structure is known, the likelihood of the BN given the data can be maximized by estimating p(x_i | π_{x_i}) using the empirical estimates from the training data T, composed of t training instances. While unrestricted BNs are the least biased, training such a model on even moderate-size data sets can be extremely challenging, as the search space that needs to be explored grows exponentially with the number of attributes.
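For concreteness, the factorisation above can be evaluated directly once the conditional probability tables (CPTs) are known. The sketch below uses a toy dict-based parameterisation for a BNC whose class node has no parents; the data structures are illustrative, not the paper's implementation.

```python
def joint_prob(y, x, p_class, cpts, parents):
    """p(y, x) = p(y) * prod_i p(x_i | y, parents of X_i).
    cpts[i] maps (x_i, y, *parent_values) -> probability."""
    prob = p_class[y]
    for i, xi in enumerate(x):
        key = (xi, y) + tuple(x[j] for j in parents[i])
        prob *= cpts[i][key]
    return prob

# Toy network: X0 depends only on Y; X1 depends on Y and X0 (so k=1 suffices).
p_class = {0: 0.5, 1: 0.5}
parents = {0: [], 1: [0]}
cpts = {
    0: {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8},
    1: {(0, 0, 0): 0.6, (1, 0, 0): 0.4, (0, 0, 1): 0.5, (1, 0, 1): 0.5,
        (0, 1, 0): 0.1, (1, 1, 0): 0.9, (0, 1, 1): 0.3, (1, 1, 1): 0.7},
}
p = joint_prob(0, (0, 1), p_class, cpts, parents)  # 0.5 * 0.7 * 0.4
```

Classification then simply picks the y maximising p(y, x), since p(y | x) is proportional to p(y, x).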
This has led to restricted BNCs, including naive Bayes (NB) (Duda and Hart, 1973; Minsky, 1961; Lewis, 1998), tree-augmented naive Bayes (TAN) (Friedman et al., 1997), averaged n-dependence estimators (AnDE) (Webb et al., 2012) and k-dependence estimators (KDB) (Sahami, 1996). These classifiers have in common either no structural learning, or minimal structural learning that requires only one or two passes through the data. They have been referred to as semi-naive BNCs (Zheng and Webb, 2005). Even though highly biased due to their inherent conditional attribute independence assumptions, they have been shown to be very effective classification methods. NB is the simplest of the BNCs, assuming that all attributes are independent given the class. It estimates the joint probability using p̂(y) ∏_{i=1}^{d} p̂(x_i | y), where p̂(·) denotes an estimate of p(·). TAN is a structural augmentation of NB in which every attribute has as parents the class and at most one other attribute. The structure is determined using an extension of the Chow-Liu tree (Chow and Liu, 1968) that utilizes conditional mutual information to find a maximum spanning tree. This alleviates some of NB's independence assumption and therefore reduces its bias at the expense of increasing its variance, which results in better performance on larger data sets. TAN provides an intermediate bias-variance trade-off, standing between NB on one hand and unrestricted BNCs on the other. There has been much prior work exploring approaches to alleviate NB's independence assumption. One approach is to add a bounded number of additional interdependencies (Friedman et al., 1997; Zheng and Webb, 2000; Webb et al., 2005; Zhang et al., 2005; Yang et al., 2007; Flores et al., 2009a,b; Zheng et al., 2012; Zaidi et al., 2013). At the other extreme, Su and Zhang (2006) propose the use of a full Bayesian network.
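Both TAN's Chow-Liu step and KDB's parent assignment (Algorithm 2 below) are driven by the empirical conditional mutual information MI(X_i; X_j | Y). A minimal sketch of that computation from raw value lists, plus a greedy up-to-k parent assignment of the kind KDB performs, is given below; this is illustrative code, not the paper's out-of-core counting implementation.

```python
from collections import Counter
from math import log

def cond_mutual_info(xi, xj, y):
    """Empirical MI(Xi; Xj | Y) in nats, from parallel lists of values."""
    n = len(y)
    c_aby, c_ay = Counter(zip(xi, xj, y)), Counter(zip(xi, y))
    c_by, c_y = Counter(zip(xj, y)), Counter(y)
    mi = 0.0
    for (a, b, cy), n_aby in c_aby.items():
        # p(a,b|y) / (p(a|y) * p(b|y)), expressed with raw counts
        mi += (n_aby / n) * log(n_aby * c_y[cy] / (c_ay[(a, cy)] * c_by[(b, cy)]))
    return mi

def assign_parents(order, cmi, k):
    """Give each attribute (already ranked by MI(Xi; Y)) up to k parents
    among the higher-ranked attributes, greedily by CMI (KDB-style)."""
    return {a: sorted(order[:pos], key=lambda b: -cmi[a][b])[:k]
            for pos, a in enumerate(order)}

perfect = cond_mutual_info([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1])  # log 2
cmi = {0: {1: 0.2, 2: 0.4}, 1: {0: 0.2, 2: 0.6}, 2: {0: 0.4, 1: 0.6}}
parents = assign_parents([1, 2, 0], cmi, k=1)
```

Perfectly correlated binary attributes give MI of log 2 nats, and conditionally independent ones give 0, which matches the role CMI plays in ranking candidate parents.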
Of these algorithms, only two allow the number of interdependencies to be adjusted to best accommodate the trade-off between bias and variance for differing data quantities. AnDE (Webb et al., 2012) averages many BNCs, where a parameter n controls the number of interdependencies to be modeled. KDB uses a parameter k to control the number of interdependencies modeled. KDB relaxes NB's independence assumption by allowing every attribute to be conditioned on the class and, at most, k other attributes. Like TAN, KDB requires two passes over the training data for learning: the first pass collects the statistics needed to compute the mutual information used for structure learning, and the second performs the parameter learning based on the structure created in the first step (Algorithm 1). An advantage over the AnDE model is that KDB does not need to collect statistics for all combinations of n or k attributes, allowing it to scale to higher-order dependencies. It also has the capacity to select a model and hence fit the data more closely. This is a disadvantage for small data, which it may overfit, but KDB will often deliver higher accuracy than AnDE on large data, which is more difficult to overfit. Recently, some techniques have been proposed that address the classification problem with Bayesian networks from a relatively efficient discriminative perspective. Carvalho et al. (2011) propose a decomposable approximation to the conditional log-likelihood scoring criterion. Pernkopf and Bilmes (2010) propose a BNC learning algorithm with some resemblance to KDB but using discriminative evaluation for parent assignment. Neither

Algorithm 1: The KDB algorithm
Input: Training set T with attributes {X_1, . . . , X_a, Y} and k.
Output: KDB model.
1  Let G be a directed graph G = (V, E), in which V is a set of vertices and E is a set of links.
2  Let Θ be a Bayesian network with structure G.
3  First pass begin
4    G = learnStructure(T)   /* Algorithm 2 */
6  Second pass begin
7    Θ = learnParameters(T, G)   /* Algorithm 3 */
9  Let KDB be a BN with structure G and conditional probability distributions in Θ.
10 return KDB

Algorithm 2: learnStructure(T)
Input: Training set T.
Output: G, the network structure.
1  Calculate MI(X_i; Y) from T for all attributes.
2  Calculate MI(X_i; X_j | Y) from T for each pair of attributes (i ≠ j).
3  Let L be a list of all X_i in decreasing order of MI(X_i; Y).
4  V = {Y}; E = ∅
5  for i = 1 to L.size do
6    V = V ∪ {L_i}
7    E = E ∪ {(Y, L_i)}
8    v_k = min(i − 1, k)
9    while v_k > 0 do
10     m = argmax_j { MI(L_i; L_j | Y) : 1 ≤ j < i and (L_j, L_i) ∉ E }

Cases where results could not be obtained are marked by >600h, when more than 600 CPU hours are required, or by >138G, when more than 138GB of RAM memory are required. Note that while each of these algorithms has a number of parameters that could potentially be tuned using cross validation, doing so would not produce reasonable comparators to SKDB. First, SKDB uses cross-validation to choose between multiple models rather than to choose between parameterizations. Cross validation could also be used to tune SKDB parameters such as the smoothing technique. Second, SKDB uses cross-validation in a constrained manner that requires only a single additional pass through the data, while parameter tuning of these algorithms would increase compute time proportionally to the number of folds employed.

3.3.1 Augmented Bayesian Network

We compare performance against a state-of-the-art generic in-core Bayesian network classifier that uses a hill-climbing algorithm for adding, deleting and reversing arcs (starting with a NB structure, this hill climber also considers the arcs that are part of the NB structure for deletion, that is, arcs from the class to the attributes can be removed). It optimizes the K2 metric (Cooper and Herskovits, 1992).

Table 5: Win-draw-loss records for SKDB against Bayes Net in terms of RMSE.

            Bayes Net
  SKDB      6(+4)-3-3
The maximum number of parents allowed for a node in the Bayes net is set to 5. The win-draw-loss comparison is presented in Table 5. The detailed results can be found in Table 8 (Appendix A), row Bayes Net. The darker grey background corresponds to those cases where Bayes Net significantly outperforms SKDB. Note that there are records for only 12 out of the 16 datasets, because there are four cases for which results could not be computed for Bayes Net due to memory constraints, since the network cannot be learned incrementally. Even though Bayes Net performs a more exhaustive search than SKDB over the model space of Bayesian networks with up to 5 parents, SKDB obtains lower error than Bayes Net more often than the reverse. We hypothesize that this is due to SKDB's use of both generative and discriminative measures for model selection, in contrast to Bayes Net's exclusive use of generative measures.

3.3.2 Random Forest

Random Forest (RF) is a state-of-the-art learning algorithm introduced by Breiman (2001). It uses bagging (Breiman, 1996) to aggregate an ensemble of trees that are each grown using a process that involves a stochastic element to increase diversity. We have conducted experiments with RF using 100 trees. We present results both with pre-discretized data, to show relative performance on categorical data, and with undiscretized data. The detailed results can be found in the fifth- and sixth-from-last rows of Table 8 in Appendix A. Table 6 shows the win-draw-loss records when compared with SKDB. The superscripts count those datasets for which results cannot be obtained for RF due to main memory constraints (more than 138GB are required). The results show that SKDB is competitive with RF in terms of RMSE, even when RF performs its own discretization (5 wins for SKDB and 6 wins for RF), for which the computational cost in terms of memory and time is much higher than SKDB's.
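The win-draw-loss records used throughout these tables can be computed from paired per-dataset error scores (lower is better). A small sketch follows; the tie tolerance `eps` is an illustrative stand-in for however ties are actually decided in the paper.

```python
def win_draw_loss(errs_a, errs_b, eps=1e-6):
    """Record for learner A against learner B over paired error scores
    (lower is better): (wins, draws, losses) for A."""
    wins = sum(a < b - eps for a, b in zip(errs_a, errs_b))
    draws = sum(abs(a - b) <= eps for a, b in zip(errs_a, errs_b))
    return wins, draws, len(errs_a) - wins - draws

# A wins on the first dataset, draws the second, loses the third.
record = win_draw_loss([0.10, 0.20, 0.30], [0.15, 0.20, 0.25])
```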
When 0-1 loss is considered, RF wins more frequently than the out-of-core SKDB, even when 0-1 loss is used as the objective function during structure selection. When MDL discretization is used instead of 5EF for SKDB, the results improve slightly for SKDB compared to RF performing its own discretization. When interpreting these results it is important to keep in mind that SKDB is a three-pass out-of-core algorithm (four-pass with discretization), in contrast to the inherently computationally intensive in-core processing of RF. Note that the sum of some cells is smaller than 16, due to computational constraints for RF. RF is an in-core algorithm, and hence we have not been able to learn classification models for the largest datasets. Interestingly, RF is more negatively affected (in terms of computation) by a large number of classes than the BNCs; for example, no 100-tree model can be learnt for MSDYearPrediction, which has 90 class labels. Specifically, there are three cases where RF with 10 trees cannot be run, and four in the case of RF with 100 trees. These four datasets are MSDYearPrediction, satellite, mnist8m and splice, so the largest samples for which results could be computed have been used instead.

Table 6: Win-draw-loss records for SKDB against RF in terms of RMSE and 0-1 loss.

                          RF (5EF)     RF (Num)
  RMSE      SKDB          5(+4)-1-6    5(+5)-0-6
  0-1 loss  SKDB          2(+4)-3-7    2(+5)-2-7
            SKDB(0-1)     3(+4)-2-7    3(+5)-1-7
            SKDB(MDL)     8(+4)-2-2    3(+5)-2-6

Figure 7: Scatter plots of RMSE (logarithmic scale) for (a) Bayes Net versus SKDB (max k=5) and (b) RF versus SKDB (max k=5).
RMSE comparisons across all datasets for Bayes Net and RF against SKDB are shown in Figures 7a and 7b, where points marked with an X indicate datasets for which sampling was required. While Bayes Net is not competitive with SKDB, RF provides competitive results for some datasets. PAMAP2 provides a notable case, for which a sample of only 2 million instances out of 3.5 million suffices for RF to provide better classification performance than SKDB using all the training data. This is, however, the only dataset for which RF learning from a sample can achieve lower RMSE than SKDB learning from all the data; the reverse is true for splice, mnist and satellite. Figures 8a and 8b display the training and classification times per dataset for these classifiers compared to SKDB. For most datasets the in-core learners require substantially more time for learning and classifying.³

3. RF requires more than 8GB of main memory to complete for poker-hand and uscensus1990; since the experiments to measure time were conducted on a desktop computer, no time measurements can be provided to compare with the rest of the experiments. Weka's implementation was used.

Figure 8: (a) Training and (b) classification time comparisons per dataset for Bayes Net and RF against SKDB.

3.4 Global Comparison of All Classifiers

In this section we analyze how all the classifiers explored in this paper perform when compared as a set. Figure 9 plots the average rankings across all datasets, along with the standard deviation for each learner. This ranking corresponds to the ranking obtained when a Friedman test is conducted. In this analysis we present the 0-1 loss performance of SKDB when it is optimized for RMSE; its average rank on 0-1 loss improves marginally from 2.94 to 2.84 when it is instead optimized for 0-1 loss.
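The average ranks plotted in Figure 9 are the per-dataset ranks of each learner (tied learners sharing the mean of the tied ranks), averaged across datasets; this is the same statistic the Friedman test operates on. A sketch with illustrative inputs:

```python
def average_ranks(errors):
    """errors[learner] = per-dataset error list (lower is better).
    Returns each learner's mean rank across datasets; tied learners
    share the mean of the ranks they jointly occupy."""
    learners = list(errors)
    n_datasets = len(next(iter(errors.values())))
    totals = {l: 0.0 for l in learners}
    for d in range(n_datasets):
        ordered = sorted(learners, key=lambda l: errors[l][d])
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and errors[ordered[j]][d] == errors[ordered[i]][d]:
                j += 1
            mean_rank = (i + 1 + j) / 2  # ranks i+1 .. j share their mean
            for l in ordered[i:j]:
                totals[l] += mean_rank
            i = j
    return {l: totals[l] / n_datasets for l in learners}

# Two datasets; A and B tie on the second one.
ranks = average_ranks({"A": [0.1, 0.2], "B": [0.2, 0.2], "C": [0.3, 0.1]})
```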
When assessing the calibration of the probability estimates using RMSE, SKDB obtains the lowest average rank of 2.53, followed by RF with 2.91 and KDB with 3.34 (very close to those for VW and Bayes Net). When assessing performance using 0-1 loss, RF using its own internal discretization of numeric attributes enjoys an average advantage of 1 in rank over SKDB, and SKDB enjoys an average advantage of almost one third over VWLF. We find NB at the other extreme on both measures, with average ranks 7.63 and 7.72 out of a total of 8 classifiers. SKDB obtains in the worst case the 4.5th position (the half being due to a tie with KDB-best-k) for the linkage dataset and the 4th position for census-income when ranked on RMSE, and the 4th position for MITFace Set B and kddcup when ranked on 0-1 loss. On both measures the largest standard deviation is observed for VW, which tends to rank either very well or rather poorly across datasets. Still, VW obtains the largest number of first-place rankings for RMSE, with a total of 7, against 5 for RF and 4 for SKDB. On 0-1 loss, RF ranks first 8 times, VW 4 times and SKDB twice (one of those times being a tie with KDB-best-k). A Friedman test (Demšar, 2006) was performed for the set of 8 classifiers, yielding statistical differences for both measures. To identify where exactly these differences are found, we ran a set of post-hoc Nemenyi tests. These tests confirm the existence of two sets of algorithms in terms of performance on both RMSE and 0-1 loss for these large datasets: on one hand NB, AODE and TAN; and on the other hand Bayes Net, VW, KDB, RF and SKDB. The Nemenyi tests⁴ showed significant differences for the second set of algorithms over NB and AODE, and only for RF and SKDB over TAN.

4. Similarly for the Holm and Shaffer post-hoc tests.
Figure 9: Average rankings (with standard deviations) in terms of (a) RMSE and (b) 0-1 loss for NB, TAN, AODE, KDB (with best k), SKDB, RF using numeric attributes, Bayes Net and VW with a logistic function.

Even though the small number of datasets allows only tentative conclusions to be drawn, the following dataset characteristics appear to indicate a preference for one or another learning scheme. RF performs better on datasets with a small number of attributes. RF derives an advantage with respect to 0-1 loss from its internal discretization of numeric values; it does not achieve the lowest RMSE or 0-1 loss on any of the three categorical-only datasets: poker-hand, PAMAP2 or splice. SKDB seems to cope well with high-dimensional datasets and datasets with more than 1 million points. However, its complexity is quadratic in the number of attributes, so, like most existing BNCs, it is not feasible for very high dimensionality.

Figure 10: (a) Training and (b) classification time (logarithmic scale) comparisons for the out-of-core classifiers NB, TAN, AODE, KDB5, SKDB, VWSF and VWLF.

VW appears to have an advantage for sparse numeric datasets. A more detailed study of the domains of competence of these classifiers remains an important area for future work. We can also analyze the performance of these classifiers with respect to the complexity of the decision function they provide. NB and linear regression, represented in this work by VW with linear features, are classifiers that learn linear decision boundaries. They have low variance and are particularly useful for small data quantities and sparse data.
An extra level of expressiveness is offered by AODE, KDB (k = 1), TAN and VWSF/VWLF with quadratic features, which produce quadratic decision functions. A2DE, KDB (k = 2) and VWSF/VWLF with cubic features produce cubic decision functions. More generally, AnDE and KDB can separate order-n and order-k polynomially separable concepts, respectively (Jaeger, 2003). These two families of classifiers provide a spectrum of models of increasing complexity, allowing a fine-tuned trade-off between expressivity on the one hand, and efficiency of learning and guarding against overfitting on the other; that is, the well-known bias-variance trade-off (Brain and Webb, 1999). Figures 10a and 10b show the time comparisons for the out-of-core classifiers considered: namely NB, TAN, AODE, KDB with k = 5, SKDB, and LRSGD with quadratic features in VW (using either a squared or a logistic loss function). The time required to pre-randomize the data for VW has not been included. No parallelization techniques have been used in any case, although most of the algorithms could be parallelized. The average over the 10-fold CV is shown. It is important to note that a logarithmic scale has been used for classification times in order to make visible the smaller differences among the faster learners. Again, only the 11 smallest of the 16 datasets have been considered for time measurement, since these experiments have been conducted on a personal computer to provide a controlled environment. Note that training SKDB with k = 5 (maximum) takes only a little more time than training KDB with k = 5. The classification time for KDB is slightly higher than SKDB's (since all attributes are considered). Training times for LRSGD with quadratic features and 3 passes (VWSF and VWLF), which are the configurations that obtain results comparable with SKDB, are also higher, presumably because they have to compute all pairs of quadratic features as indicated above. Similar reasoning holds for classification time.
We believe that the extra time for VW is also due to the need to binarize discrete features: since VW cannot deal with discrete attributes directly, v binary attributes are created for each originally discrete attribute with v values; and also because of the one-against-all approach required to handle multiple classes. This overhead may be noticeable since we are using quadratic features. The time required for VWLF is much larger than for VWSF, and in any case much larger, both for training and testing, than that for SKDB (note especially the logarithmic scale in Figure 10b).

4. Conclusions and Future Work

We have developed an extension to KDB which performs discriminative attribute and best-k selection using leave-one-out cross validation. SKDB requires only a single extra pass over the training data compared to KDB. In return for this extra pass during training, it often increases prediction accuracy relative to KDB with the best possible k, and always improves classification time (and space) by reducing the number of attributes considered, which can be quite significant for some datasets (for example, for MITFace Set C only 60% of the attributes, 215.8 out of 361, are selected, improving classification performance at the same time). It is embarrassingly parallelizable. Furthermore, it can optimize any loss function that is a function, over each training example x, of p̂(Y, x) and y_x, making it extremely flexible. Due to its low bias, when datasets are large enough to avoid overfitting, SKDB attains lower error than the other in-core and out-of-core Bayesian network algorithms considered. It even provides competitive error compared to Random Forest, one of the most powerful in-core classification methods, and to online linear and non-linear classifiers trained with stochastic gradient descent (that is, linear regression with quadratic and cubic features). Compared to the latter, it is in most cases substantially more efficient at both training and classification time.
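The binarisation step referred to above, in which a discrete attribute with v values becomes v 0/1 indicator features, is plain one-hot encoding. A minimal sketch (attribute domains passed explicitly; this is not VW's actual input pipeline):

```python
def one_hot(rows, domains):
    """Expand each discrete attribute i into len(domains[i]) 0/1 features,
    one indicator per possible value."""
    encoded = []
    for row in rows:
        vec = []
        for i, value in enumerate(row):
            vec.extend(1 if value == v else 0 for v in domains[i])
        encoded.append(vec)
    return encoded

# Two instances, two discrete attributes with 2 values each -> 4 features.
enc = one_hot([("a", 1), ("b", 0)], domains=[("a", "b"), (0, 1)])
```

The feature count, and hence the cost of forming all quadratic feature pairs, grows with the sum of the attribute domain sizes, which is consistent with the overhead discussed above.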
Further, it provides well-calibrated posterior class probability estimates. Future directions for research include:
1. using only a sample of the data in the leave-one-out cross validation, in order to use more complex loss functions efficiently;
2. tackling SKDB's tendency to overfit small training sets;
3. studying a customized selection of the different links and attributes in a discriminative manner, that is, different numbers of parents for different attributes, with attributes discarded in an order that does not follow the conditional mutual information rank;
4. consideration of other means of generating alternative classifiers to be selected in the final LOOCV pass; and
5. study of more appropriate loss functions for imbalanced datasets.

SKDB provides an efficient and effective solution to the problem of scalable low-bias learning. Its low bias allows it to capture and take advantage of the additional fine detail that is inherent in very large data, while its efficiency makes it feasible to deploy. SKDB thus sets new standards for classification learning from big data.

Acknowledgments

This research has been supported by the Australian Research Council grant DP140100087 and the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research, contract FA2386-15-1-4007. It has also been supported in part by the Monash eResearch Centre and eSolutions-Research Support Services through the use of the Monash Campus HPC Cluster and the LIEF grant. This research was also undertaken on the NCI National Facility in Canberra, Australia, which is supported by the Australian Commonwealth Government. The authors would also like to thank the researchers from CESBIO (D. Ducrot, C. Marais-Sicre, O. Hagolle and M. Huc) for providing the land cover maps and the geometrically and radiometrically corrected FORMOSAT-2 images (satellite dataset).
We also want to thank the anonymous reviewers for their insightful comments, which have helped to greatly improve the revised version of the paper.

Appendix A. Tables of the Experimental Section

Columns of Table 7: id; name; t (millions of instances); a (attributes); c (classes); missing values (and, if present, the maximum percentage per attribute and the average ± sd); source; relevant papers; description; size.

1. localization. t=0.165M, a=5, c=11, missing: none. Source: UCI; Kaluža et al. (2010). Recordings of 5 people performing different activities. Each person wore 3 sensors while performing the same scenario 5 times. 11MB.
2. census-income. t=0.299M, a=41, c=2, missing: yes (max 50%, avg 4.9 ± 14.6%). Source: UCI; Bache and Lichman (2013); Oza and Russell (2001). Weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. 136MB.
3. USPSExtended. t=0.341M, a=676, c=2, missing: none. Source: CVM; Tsang et al. (2005). 0/1 digit classification (extended version of the USPS data set^a). 603MB^s.
4. MITFace Set A. t=0.474M, a=361, c=2, missing: none. Source: CVM; Tsang et al. (2005). Face detection using an extended version of the MIT face database^b, created by adding non-faces to the original training set. 584MB.
5. MITFace Set B. t=0.489M, a=361, c=2, missing: none. Source: CVM; Tsang et al. (2005). Each training face is blurred and added to set A; they are then flipped laterally. 603MB.
6. MSDYearPrediction. t=0.515M, a=90, c=90, missing: none. Source: UCI; Bertin-Mahieux et al. (2011). Prediction of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s^c.
7. covtype. t=0.581M, a=54, c=7, missing: none. Source: UCI; Blackard (1998); Gama et al. (2003); Oza and Russell (2001). Predicting forest cover type from cartographic attributes only (no remotely sensed data). 72MB.
8. MITFace Set C. t=0.839M, a=361, c=2, missing: none. Source: CVM; Tsang et al. (2005). Each face in set B is rotated. 1.1GB.
9. poker-hand. t=1.025M, a=10, c=10, missing: none. Source: UCI; Cattral et al. (2002). Each record is an example of a hand consisting of 5 cards drawn from a deck of 52. Each card is described using 2 attributes (suit and rank), giving 10 predictive attributes in total. The class label describes the Poker Hand. The order of cards is important.
10. uscensus1990. t=2.458M, a=67, c=4, missing: none. Source: UCI. Discretized version of the USCensus1990raw dataset, a 1% sample from the full 1990 census. Temp. Absence From Work has been selected as the class.
11. PAMAP2. t=3.850M, a=54, c=19, missing: yes (max 91%, avg 1.6 ± 12.2%). Source: UCI; Reiss and Stricker (2012). Data from 18 different physical activities (such as walking, cycling, playing soccer, etc.), performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor.
12. kddcup. t=5.209M, a=41, c=40, missing: none. Source: UCI (KDD Cup 1999). Contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment: bad connections, called intrusions or attacks, and good normal connections.
13. linkage. t=5.749M, a=11, c=2, missing: yes (max 100%, avg 16.5 ± 36.9%). Source: Schmidtmann et al. (2009); Sariyar et al. (2011). Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.
14. mnist8m. t=8.100M, a=784, c=10, missing: none. Source: CVM; Loosli et al. (2007); Erhan et al. (2010); Bottou et al. (1994). Handwritten digit classification; extended version created by performing random elastic deformations of the original MNIST digits. 19GB^s.
15. satellite. t=8.705M, a=138, c=24, missing: yes (max 93%, avg 26.4 ± 33%). Source: Petitjean et al. (2012). Satellite image time series to predict land cover. 3.6GB.
16. splice. t=54.627M, a=141, c=2, missing: none. Source: Sonnenburg and Franc (2010); Agarwal et al. (2014). Recognising a human acceptor splice site (the largest public dataset for which subsampling is not an effective learning strategy). 7.3GB.

Table 7: Very large datasets for classification. From left to right: identification number; name of the dataset; number of instances, attributes and classes; presence of missing values (and, if present, the maximum percentage per attribute and the average ± sd); source of the dataset; relevant papers; and description and size.
a. http://c2inet.sce.ntu.edu.sg/ivor/cvm.html
b. http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html
[c] Subset of the Million Song Dataset (http://labrosa.ee.columbia.edu/millionsong/). Songs from a given artist may end up in both the train and test sets.
[s] The superscript s stands for sparse format.

Appendix B. Implementation Details

All the experiments for the out-of-core algorithms have been carried out in C++ software specifically designed to deal with out-of-core classification methods. The following implementation and configuration details should be considered:
- 10-fold cross-validation has been used.
- Equal-frequency pre-discretization with 5 bins (EF5) has been used to discretize all numeric attributes (except where otherwise specified [5]). We have observed that EF5 and MDL (Minimum Description Length) discretization (Fayyad and Irani, 1993) each provide the best results in approximately half of the datasets. EF5 has been chosen because it is faster to perform than MDL, and also because it is not supervised and hence does not potentially provide the classifier with class information from the holdout data when used for pre-discretization. Using a pre-fixed number of bins gives us the added advantage of not having to deal with the huge number of values per attribute that MDL discretization creates in some cases.
- Missing values have been treated as a distinct attribute value in all cases.
- The root mean square error is calculated exclusively on the true class label (unlike Weka's implementation (Hall et al., 2009), where all class labels are considered).
- KDB's implementation benefits from a tree structure that resembles an ADTree (Moore and Lee, 1998): sparseness is achieved by storing a NULL instead of a node for any query that matches zero records. In our case, to store the probability distributions of the different attributes once the structure has been determined, we build a (distribution) tree for each attribute.
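A minimal sketch of such a per-attribute distribution tree follows. This is illustrative C++, not the authors' implementation; the names `DistTreeNode` and `update` are hypothetical. The root accumulates counts(x_i, y); each deeper level expands on one parent's value, so depth d holds counts for the child attribute, the class and the first d parents. A branch that matches no records is simply absent, mirroring the NULL-for-zero-count idea.

```cpp
#include <map>
#include <memory>
#include <vector>

// One node of a per-attribute distribution tree.
struct DistTreeNode {
    // counts[x][y] = number of records reaching this branch with X_i = x, Y = y
    std::map<int, std::map<int, long>> counts;
    // Children are keyed by the value of the next parent in KDB order; a value
    // never seen on this branch has no child at all (sparse storage).
    std::map<int, std::unique_ptr<DistTreeNode>> children;
};

// Incorporate one training instance: x is the child attribute's value,
// y the class label, parents the parent values in KDB order.
void update(DistTreeNode& root, int x, int y, const std::vector<int>& parents) {
    DistTreeNode* node = &root;
    node->counts[x][y]++;                  // counts(x_i, y) at the root
    for (int pv : parents) {
        auto& child = node->children[pv];  // descend on this parent's value
        if (!child) child.reset(new DistTreeNode());
        node = child.get();
        node->counts[x][y]++;              // counts(x_i, y, parents so far)
    }
}
```

After a single pass over the data, every level of the tree already holds the joint counts needed both for the conditional probability estimates and for the back-off described below.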
Given, for example, attribute X4 with parents X0 and X3 (in this order, and apart from the class), the counts for (X4, Y) are stored in the root node. The next level expands according to the number of values of X0, the first parent, and stores the counts for (X4, Y, X0). The following level expands according to the number of values of X3, the second parent, and stores the counts for (X4, Y, X0, X3), and so forth if there were more parents. Note that, as opposed to the ADTree, all the parameters are stored, that is, more than strictly required: a branch is created for each attribute value, when one branch fewer would suffice to represent the model parameters. The reason is to avoid overloading classification time by having to explore and combine branches. Two combined techniques are used for smoothing. First, we use m-estimates [6] (Mitchell, 1997) as follows:

\hat{p}(x_i \mid \pi_{x_i}) = \frac{\mathrm{counts}(x_i, \pi_{x_i}) + \frac{m}{|X_i|}}{\mathrm{counts}(\pi_{x_i}) + m} \qquad (4)

[5] Only for the experiments in the antepenultimate row of Table 8, as specified.
[6] Also known as Schurmann-Grassberger's Law when m = 1, which is a particular case of Lidstone's Law (Lidstone, 1920; Hardy, 1920) with λ = 1/|X_i|, also based on a Dirichlet prior.

learner args localiz.
cens-inc USPS MITA MITB MSDY covtype MITC poker uscensus PAMAP kddcup linkage mnist8 satellite splice

NB: 0.7106±0.0007 0.4660±0.0018 0.2256±0.0026 0.0982±0.0029 0.1394±0.0014 0.9600±0.0002 0.5103±0.0013 0.2367±0.0021 0.5801±0.0006 0.2911±0.0006 0.4647±0.0006 0.1849±0.0008 0.0125±0.0006 0.4957±0.0005 0.6540±0.0002 0.0971±0.0002
TAN: 0.6321±0.0014 0.2247±0.0025 0.1153±0.0022 0.0202±0.0025 0.1213±0.0020 0.9436±0.0002 0.4862±0.0008 0.2068±0.0017 0.4987±0.0006 0.1845±0.0009 0.3232±0.0007 0.0572±0.0006 0.0081±0.0009 0.3817±0.0006 0.5424±0.0003 0.1020±0.0003
AODE: 0.6520±0.0010 0.2932±0.0020 0.1538±0.0028 0.1001±0.0036 0.1682±0.0025 0.9459±0.0002 0.4587±0.0013 0.1564±0.0014 0.5392±0.0006 0.2154±0.0010 0.3881±0.0008 0.0979±0.0007 0.0120±0.0005 0.3497±0.0005 0.5783±0.0003 0.1034±0.0003
KDB:
k = 1: 0.6501±0.0012 0.2219±0.0024 0.1125±0.0025 0.0209±0.0021 0.1291±0.0019 0.9447±0.0003 0.4709±0.0026 0.1940±0.0021 0.4987±0.0005 0.1814±0.0008 0.3276±0.0007 0.0606±0.0005 0.0082±0.0009 0.3751±0.0005 0.5506±0.0003 0.0940±0.0003
k = 2: 0.5840±0.0020 0.2064±0.0021 0.0990±0.0029 0.0126±0.0019 0.0633±0.0018 0.9393±0.0002 0.4576±0.0013 0.0906±0.0012 0.4055±0.0007 0.1677±0.0008 0.2721±0.0005 0.0497±0.0008 0.0062±0.0007 0.2922±0.0004 0.5072±0.0006 0.0953±0.0002
k = 3: 0.5485±0.0020 0.2102±0.0015 0.0928±0.0025 0.0142±0.0018 0.0468±0.0018 0.9361±0.0004 0.4399±0.0015 0.0705±0.0018 0.2859±0.0011 0.1664±0.0008 0.2290±0.0004 0.0485±0.0008 0.0060±0.0006 0.2500±0.0005 0.4769±0.0004 0.1008±0.0005
k = 4: 0.5225±0.0023 0.2107±0.0012 0.0877±0.0026 0.0514±0.0031 0.0387±0.0021 0.9383±0.0006 0.4126±0.0019 0.0616±0.0016 0.2048±0.0012 0.1601±0.0007 0.1937±0.0004 0.0478±0.0008 0.0060±0.0006 0.2169±0.0005 0.4583±0.0005 0.1084±0.0003
k = 5: 0.5225±0.0023 0.2143±0.0022 0.0839±0.0034 0.1350±0.0035 0.0841±0.0023 0.9447±0.0003 0.3903±0.0020 0.0494±0.0015 0.2842±0.0011 0.1638±0.0006 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0924±0.0002
kmax = 5: 0.5225±0.0023 0.2064±0.0021 0.0839±0.0034 0.0126±0.0019 0.0387±0.0021 0.9361±0.0004 0.3903±0.0020 0.0494±0.0015 0.2048±0.0012 0.1601±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0924±0.0002
bestk: 4 2 5 2 4 3 5 5 4 4 5 5 3.8 5 5 5
SKDBk:
k = 1: 0.6501±0.0012 0.2054±0.0014 0.1125±0.0025 0.0207±0.0020 0.0746±0.0012 0.9446±0.0003 0.4704±0.0016 0.1227±0.0014 0.4987±0.0005 0.1540±0.0007 0.3276±0.0007 0.0606±0.0005 0.0082±0.0009 0.3751±0.0005 0.5485±0.0002 0.0527±0.0002
k = 2: 0.5840±0.0020 0.2021±0.0021 0.0990±0.0029 0.0123±0.0018 0.0574±0.0011 0.9393±0.0002 0.4576±0.0013 0.0782±0.0017 0.4055±0.0007 0.1510±0.0007 0.2721±0.0005 0.0497±0.0008 0.0062±0.0007 0.2922±0.0004 0.5065±0.0004 0.0523±0.0002
k = 3: 0.5485±0.0020 0.2056±0.0016 0.0928±0.0026 0.0144±0.0021 0.0439±0.0021 0.9361±0.0004 0.4399±0.0015 0.0596±0.0020 0.2847±0.0011 0.1508±0.0007 0.2290±0.0004 0.0485±0.0008 0.0060±0.0006 0.2500±0.0005 0.4769±0.0004 0.0523±0.0002
k = 4: 0.5225±0.0023 0.2041±0.0015 0.0877±0.0026 0.0386±0.0019 0.0376±0.0026 0.9382±0.0005 0.4126±0.0019 0.0517±0.0020 0.1868±0.0012 0.1504±0.0007 0.1937±0.0004 0.0478±0.0008 0.0060±0.0006 0.2169±0.0005 0.4583±0.0005 0.0524±0.0002
k = 5: 0.5225±0.0023 0.2070±0.0015 0.0839±0.0033 0.0552±0.0032 0.0844±0.0025 0.9447±0.0004 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1505±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0526±0.0002
kmax = 5: 0.5225±0.0023 0.2021±0.0021 0.0839±0.0033 0.0123±0.0018 0.0376±0.0026 0.9361±0.0004 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1504±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0523±0.0002
bestk: 4 2 5 2 4 3 5 5 4 4 5 5 3.8 5 5 3
# of atts.: 5 19.2 647.5 337.9 275.6 90 54 215.8 5 22 54 37 11 773.6 138 16
SKDBk (# of attributes selected):
k = 1: 5 7 584.9 354.9 17 80.5 45.8 59 10 11 54 38.2 11 772.2 75 13
k = 2: 5 19.2 646.5 337.9 87 90 54 206.7 10 12 54 38.1 11 773.1 125.8 14.9
k = 3: 5 18 648.1 356.1 271.3 90 54 209.6 5 22 54 37.6 11 774.2 138 16
k = 4: 5 18 640.9 21.3 275.6 78.8 54 214.3 5 22 54 37.9 11 774.2 138 14
k = 5: 5 18 647.5 18 359.3 81.2 54 215.8 5 22 54 37 11 773.6 138 13
n: 5 41 676 361 361 90 54 361 10 67 54 41 11 784 138 141
Bayes Net (k = 5): 0.5217±0.0023 0.2009±0.0016 0.0920±0.0023 0.0379±0.0136 0.0354±0.0063 0.9369±0.0004 0.4231±0.0016 >138G 0.2821±0.0011 0.1573±0.0008 0.1615±0.0006 0.0477±0.0007 0.0059±0.0005 >138G >138G >138G
Random Forest (100 trees): 0.5194±0.0024 0.1992±0.0013 0.0338±0.0010 0.0353±0.0017 0.0566±0.0012 >138G 0.3478±0.0018 0.0888±0.0006 0.3190±0.0028 0.1573±0.0008 0.0962±0.0012 0.0466±0.0008 0.0060±0.0007 >138G >138G >138G
Random Forest (Num, 100 trees): 0.4199±0.0017 0.1906±0.0014 0.0291±0.0007 0.0262±0.0016 0.0395±0.0010 >138G 0.1998±0.0011 0.0465±0.0004 0.3190±0.0028 0.1573±0.0008 >600h 0.0298±0.0003 0.0029±0.0006 >138G >138G >138G
SKDB0-1 (kmax = 5): 0.5225±0.0023 0.2045±0.0015 0.0839±0.0033 0.0123±0.0018 0.0376±0.0026 0.9447±0.0003 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1509±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0523±0.0002
SKDB (MDL disc.): 0.5252±0.0036 0.1923±0.0025 0.0837±0.0023 0.0147±0.0020 0.0370±0.0026 0.9430±0.0005 0.2595±0.0029 0.1810±0.0048 0.1868±0.0012 0.1504±0.0007 0.0377±0.0029 0.0326±0.0005 0.0037±0.0009 0.1449±0.0007 0.4460±0.0006 0.0523±0.0002
m (millions): 0.1649 0.2993 0.3415 0.4741 0.4894 0.5153 0.5811 0.8393 1.025 2.4583 3.8505 5.2095 5.7491 8.1 8.7052 50
c: 11 2 9 2 2 90 7 2 10 4 19 40 2 10 24 2

Table 8: Results in terms of RMSE for NB, TAN, AODE, KDB, Bayes Net, RF, SKDBk and SKDB classifiers.

learner args localiz.
Censusincome USPS MITA MITB MSDY covtype MITC poker uscensus PAMAP kddcup linkage mnist8 satellite splice

NB: 0.5449±0.0026 0.2410±0.0017 0.0532±0.0012 0.0100±0.0006 0.0199±0.0004 0.9514±0.0005 0.3536±0.0025 0.0582±0.0010 0.4988±0.0018 0.0896±0.0003 0.2365±0.0007 0.0361±0.0005 0.0002±0.0000 0.2521±0.0005 0.4425±0.0002 0.0121±0.0001
TAN: 0.4367±0.0033 0.0675±0.0016 0.0149±0.0006 0.0008±0.0001 0.0161±0.0005 0.9268±0.0010 0.3151±0.0016 0.0455±0.0008 0.3295±0.0015 0.0390±0.0005 0.1171±0.0005 0.0034±0.0001 0.0001±0.0000 0.1327±0.0007 0.3240±0.0004 0.0133±0.0001
AODE: 0.433±0.0027 0.1106±0.0015 0.0244±0.0008 0.0104±0.0007 0.0294±0.0008 0.9281±0.0013 0.2859±0.0016 0.0254±0.0005 0.4812±0.0028 0.0532±0.0004 0.1654±0.0007 0.0154±0.0002 0.0002±0.0000 0.1271±0.0004 0.3537±0.0004 0.0134±0.0001
KDB:
k = 1: 0.4642±0.0040 0.0667±0.0014 0.0142±0.0006 0.0005±0.0001 0.0183±0.0009 0.9279±0.0008 0.2968±0.0034 0.0406±0.0009 0.3291±0.0012 0.0383±0.0004 0.1209±0.0006 0.0048±0.0001 0.0001±0.0000 0.1500±0.0004 0.3321±0.0005 0.0114±0.0001
k = 2: 0.3710±0.0037 0.0562±0.0013 0.0110±0.0006 0.0002±0.0001 0.0043±0.0002 0.9239±0.0008 0.2799±0.0025 0.0088±0.0003 0.1961±0.0009 0.0333±0.0004 0.0850±0.0003 0.0028±0.0001 0.0000±0.0000 0.0919±0.0001 0.2903±0.0006 0.0116±0.0001
k = 3: 0.3338±0.0023 0.0556±0.0010 0.0097±0.0005 0.0002±0.0001 0.0023±0.0002 0.9187±0.0009 0.2641±0.0020 0.0053±0.0003 0.0838±0.0007 0.0331±0.0004 0.0609±0.0002 0.0026±0.0001 0.0000±0.0000 0.0682±0.0002 0.2613±0.0003 0.0127±0.0001
k = 4: 0.3064±0.0036 0.0547±0.0008 0.0087±0.0005 0.0028±0.0003 0.0016±0.0002 0.9131±0.0012 0.2337±0.0023 0.0040±0.0002 0.0486±0.0007 0.0297±0.0003 0.0438±0.0002 0.0026±0.0001 0.0000±0.0000 0.0519±0.0002 0.2426±0.0005 0.0145±0.0001
k = 5: 0.3064±0.0036 0.0547±0.0010 0.0080±0.0006 0.0188±0.0010 0.0073±0.0004 0.9124±0.0006 0.2077±0.0023 0.0026±0.0002 0.0877±0.0008 0.0313±0.0002 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0104±0.0001
SKDBonlyk (kmax = 5): 0.3064±0.0036 0.0562±0.0013 0.0080±0.0006 0.0028±0.0003 0.0199±0.0004 0.9187±0.0009 0.2077±0.0023 0.0582±0.0010 0.0486±0.0007 0.0297±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDBk:
k = 1: 0.4642±0.0040 0.0553±0.0009 0.0142±0.0007 0.0005±0.0001 0.0071±0.0002 0.9276±0.0010 0.2961±0.0021 0.0185±0.0004 0.3291±0.0012 0.0293±0.0004 0.1209±0.0006 0.0048±0.0001 0.0001±0.0000 0.1500±0.0004 0.3391±0.0003 0.0029±0.0000
k = 2: 0.3710±0.0037 0.0542±0.0013 0.0110±0.0006 0.0002±0.0000 0.0040±0.0002 0.9239±0.0008 0.2799±0.0025 0.0069±0.0003 0.1961±0.0009 0.0271±0.0003 0.0850±0.0003 0.0028±0.0001 0.0000±0.0000 0.0919±0.0002 0.2907±0.0022 0.0029±0.0000
k = 3: 0.3338±0.0023 0.0547±0.0009 0.0097±0.0005 0.0002±0.0001 0.0021±0.0002 0.9187±0.0009 0.2641±0.0020 0.0040±0.0003 0.0835±0.0008 0.0268±0.0003 0.0609±0.0002 0.0026±0.0001 0.0000±0.0000 0.0682±0.0002 0.2613±0.0003 0.0029±0.0000
k = 4: 0.3064±0.0036 0.0532±0.0008 0.0087±0.0005 0.0018±0.0002 0.0015±0.0002 0.9144±0.0011 0.2337±0.0023 0.0030±0.0002 0.0318±0.0008 0.0266±0.0003 0.0438±0.0002 0.0026±0.0001 0.0000±0.0000 0.0519±0.0002 0.2426±0.0005 0.0029±0.0000
k = 5: 0.3064±0.0036 0.0536±0.0009 0.0080±0.0006 0.0038±0.0005 0.0074±0.0004 0.9135±0.0008 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0266±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDB (kmax = 5): 0.3064±0.0036 0.0542±0.0013 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9187±0.0009 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0266±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDB0-1 (kmax = 5): 0.3064±0.0036 0.0533±0.0011 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9124±0.0006 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0261±0.0002 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
Bayes Net (k = 5): 0.3053±0.0039 0.0533±0.0012 0.0097±0.0006 0.0017±0.0006 0.0014±0.0004 0.9204±0.0015 0.2442±0.0020 >138G 0.0831±0.0010 0.0288±0.0003 0.0308±0.0003 0.0026±0.0001 0.0000±0.0000 >138G >138G >138G
Random Forest (100 trees): 0.3061±0.0036 0.0527±0.0009 0.0009±0.0001 0.0018±0.0002 0.0031±0.0003 >138G 0.1157±0.0013 0.0015±0.0002 0.0527±0.0013 0.0266±0.0003 0.0046±0.0001 0.0025±0.0001 0.0000±0.0000 >138G >138G >138G
Random Forest (Num, 100 trees): 0.2030±0.0014 0.0481±0.0011 0.0005±0.0001 0.0009±0.0002 0.0013±0.0001 >138G 0.0405±0.0006 0.0004±0.0000 0.0527±0.0013 0.0266±0.0003 >138G 0.0014±0.0000 0.0000±0.0000 >138G >138G >138G
SKDB (MDL disc.): 0.3013±0.0048 0.0486±0.0015 0.0078±0.0005 0.0002±0.0000 0.0015±0.0002 0.9129±0.0006 0.0824±0.0019 0.0035±0.0002 0.0318±0.0008 0.0266±0.0003 0.0016±0.0001 0.0017±0.0001 0.0000±0.0000 0.0229±0.0002 0.2175±0.0005 0.0029±0.0000
n: 5 41 676 361 361 90 54 361 10 67 54 41 11 784 138 141
m (millions): 0.1649 0.2993 0.3415 0.4741 0.4894 0.5153 0.5811 0.8393 1.025 2.4583 3.8505 5.2095 5.7491 8.1 8.7052 50
c: 11 2 9 2 2 90 7 2 10 4 19 40 2 10 24 2

Table 9: Results in terms of 0-1 loss for NB, TAN, AODE, KDB, Bayes Net, RF, SKDBk and SKDB classifiers.

learner args localiz.
Censusincome USPS MITA MITB MSDY covtype MITC poker uscensus PAMAP kddcup linkage mnist8 satellite splice

RMSE:
SKDB (kmax = 5): 0.5225±0.0023 0.2021±0.0021 0.0839±0.0033 0.0123±0.0018 0.0376±0.0026 0.9361±0.0004 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1504±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1449±0.0007 0.4448±0.0004 0.0523±0.0002
VW -p3: 0.6249±0.0006 0.2709±0.0048 0.2156±0.0011 0.0326±0.0076 0.0696±0.0069 0.7101±0.0001 0.5227±0.0010 0.1361±0.0013 0.5355±0.0006 0.2038±0.0014 0.5309±0.0006 0.1240±0.0020 0.0184±0.0038 0.4176±0.0009 0.5107±0.0006 0.2085±0.0045
VW -p20: 0.6243±0.0006 0.2662±0.0028 0.2157±0.0008 0.0266±0.0035 0.0714±0.0047 0.7093±0.0001 0.5116±0.0009 0.1366±0.0008 0.5355±0.0005 0.2018±0.0011 0.5217±0.0006 0.1144±0.0025 0.0151±0.0032 0.4173±0.0008 0.5024±0.0006 0.2115±0.0038
VW -p3 -q: 0.5955±0.0009 0.2426±0.0080 0.0402±0.0006 0.0277±0.0072 0.0403±0.0058 0.7276±0.0005 0.5018±0.0018 0.0518±0.0042 0.3679±0.0019 0.1938±0.0040 0.3633±0.0010 0.2426±0.0080 0.0103±0.0029 0.2884±0.0063 0.4505±0.0015 0.0855±0.0082
VW -p3 -c: 0.6217±0.0006 0.2681±0.0047 0.1894±0.0014 0.0319±0.0058 0.0645±0.0065 0.7088±0.0001 0.5177±0.0012 0.1232±0.0022 0.5342±0.0006 0.2023±0.0015 0.4689±0.0007 0.1123±0.0026 0.0079±0.0017 0.4173±0.0009 0.5031±0.0007 0.1964±0.0045
VW -p3 -q -l: 0.7357±0.0021 0.1895±0.0020 0.0190±0.0074 0.0238±0.0021 0.0303±0.0016 0.9273±0 0.5467±0.0020 0.0308±0.0016 0.3545±0.0018 0.1482±0.0009 0.3920±0.0017 0.0519±0.0007 0.0043±0.0006 0.2773±0.0039 >400h >400h

0-1 loss:
SKDB0-1 (kmax = 5): 0.3064±0.0036 0.0533±0.0011 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9124±0.0006 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0261±0.0002 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDB (kmax = 5): 0.3064±0.0036 0.0542±0.0013 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9187±0.0009 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0266±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
VW -p3: 0.6036±0.0041 0.0495±0.0012 0.0311±0.0011 0.0006±0.0003 0.0031±0.0005 0.9192±0.0012 0.3688±0.0029 0.0117±0.0006 0.4989±0.0015 0.0290±0.0003 0.3488±0.0013 0.0040±0.0002 0.0000±0.0000 0.1344±0.0006 0.3053±0.0013 0.0029±0.0000
VW -p20: 0.6025±0.0036 0.0490±0.0011 0.0307±0.0009 0.0005±0.0002 0.0028±0.0003 0.9173±0.0013 0.3530±0.0023 0.0115±0.0004 0.4988±0.0015 0.0283±0.0003 0.3316±0.0009 0.0035±0.0001 0.0000±0.0000 0.1341±0.0006 0.2970±0.0010 0.0029±0
VW -p3 -q: 0.5254±0.0050 0.0502±0.0020 0.0000±0.0000 0.0003±0.0001 0.0007±0.0004 0.9198±0.0012 0.3313±0.0032 0.0007±0.0003 0.0740±0.0007 0.0255±0.0004 0.0980±0.0006 0.0020±0.0001 0.0000±0.0000 0.0440±0.0010 0.2341±0.0015 0.0027±0.0001
VW -p3 -c: 0.5932±0.0039 0.0493±0.0011 0.0193±0.0009 0.0005±0.0002 0.0028±0.0005 0.9162±0.0013 0.3619±0.0029 0.0091±0.0005 0.5015±0.0026 0.0285±0.0003 0.2286±0.0011 0.0024±0.0001 0.0000±0.0000 0.1342±0.0006 0.2947±0.0010 0.0029±0.0000
VW -p3 -q -l: 0.5452±0.0051 0.0475±0.0011 0.0004±0.0005 0.0003±0.0001 0.0007±0.0002 0.9205±0 0.3418±0.0030 0.0009±0.0001 0.0746±0.0008 0.0250±0.0003 0.1328±0.0005 0.0030±0.0001 0.0000±0.0000 0.0565±0.0005 >400h >400h
m (millions): 0.1649 0.2993 0.3415 0.4741 0.4894 0.5153 0.5811 0.8393 1.025 2.4583 3.8505 5.2095 5.7491 8.1 8.7052 50
c: 11 2 9 2 2 90 7 2 10 4 19 40 2 10 24 2

Table 10: Results in terms of RMSE and 0-1 loss for SKDB and VW with different options: -p indicates the number of passes (3 or 20), -q indicates quadratic features (all possible combinations), -c refers to cubic features (all possible combinations), and -l the use of a logistic function for optimization. SKDB0-1 is SKDB optimized for 0-1 loss. * Results shown for 1 experiment only due to time limitations when using the logistic function in VW. ** Results shown for 1 fold only due to time limitations when using the logistic function in VW.

where π_{x_i} are the parent values of X_i and m = 1.
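The m-estimate of Equation (4), combined with the back-off over the distribution tree described next, might be sketched as follows. This is illustrative C++ under stated assumptions, not the authors' implementation: `jointCounts[d]` is assumed to hold counts of the attribute value with its first d parent values, and `parentCounts[d]` the counts of those parent values alone.

```cpp
#include <vector>

// Hypothetical sketch of Equation (4) with back-off: if the deepest joint
// count is zero, drop the last parent and try again, falling back as far as
// the shallowest level (d = 0), which is assumed to always have a count.
double mEstimate(const std::vector<long>& jointCounts,   // counts(x_i, parents 1..d)
                 const std::vector<long>& parentCounts,  // counts(parents 1..d)
                 int cardinality,                        // |X_i|
                 double m = 1.0) {
    // Find the deepest level whose joint count is non-zero.
    std::size_t level = jointCounts.size() - 1;
    while (level > 0 && jointCounts[level] == 0) --level;
    // m-estimate at the chosen level, as in Equation (4).
    return (jointCounts[level] + m / cardinality) / (parentCounts[level] + m);
}
```

For instance, with counts(x4, x0, x3) = 0 the loop discards the deepest level and the estimate is computed from counts(x4, x0) and counts(x0), matching the back-off rule described below.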
Secondly, if zero counts are found, we back off as many levels in the tree as necessary to find at least one count; that is, if counts(x4, x0, x3) is equal to zero, then p̂(x4|x0) is considered instead of p̂(x4|x0, x3). In principle, any loss function that can be computed incrementally is suitable for SKDB. We have chosen RMSE (as defined above) for our first set of experiments. We use Averaged One-Dependence Estimators (AODE) as a representative of the ANDE family of algorithms, with n = 1. In order to avoid an extensive number of passes over each of the datasets, no dataset-tailored parameter tuning has been carried out for any of the algorithms considered. However, as detailed in Section 3.2, for logistic regression we pre-conducted experiments with varying numbers of passes for stochastic gradient descent, varying loss functions and combinations of features, and used the default advanced (step-size) updates provided by Vowpal Wabbit. In the case of random forest, we have run experiments with and without pre-discretizing continuous attributes, as well as learning a variable number of trees (see Section 3.3.2 for more details).

Appendix C. Detailed Analysis of Results

The upper part of Table 8 (in the Appendix) shows the results in terms of RMSE ± standard deviation on the 16 datasets described above. We decided to stop at k = 5, since some of the datasets already show worse results for k = 5 (due to the increase in variance), so increasing the complexity further is not productive. The results for NB, TAN and AODE are displayed as baselines. A dark grey background on certain outputs for SKDBk corresponds to experiments that outperform KDB for the same value of k; a light grey background indicates a tie; the absence of either indicates that plain KDB outperforms SKDBk. For SKDB, the background coloring has the same meaning, but each entry is compared with the best KDB value over all possible k.
Note that only the average over the 10 folds is shown for each experiment; the background colors indicate statistical difference (Wilcoxon signed-rank tests), dark grey or white (in favor of or against the algorithm in the row, respectively), or lack of statistical difference (light grey). Table 8 also shows the number of attributes selected by SKDBk. There are a few cases for which the number of selected attributes for all values of k is the total number of attributes in the dataset, such as localization, PAMAP and linkage. For some other datasets, for example MSDYearPrediction, USPSExtended or mnist8m, the performance of selective KDB and KDB is similar, but fewer attributes are considered, so classification time (and space) is reduced. In most cases the number of selected attributes is smaller than the original, sometimes even less than half, while also reducing the RMSE compared to KDB; for example, MITFace Set C with 215.8 attributes selected (out of 361) obtains 0.0446 vs 0.0494 for KDB (k = 5), and uscensus with 22 attributes selected (out of 67) obtains 0.1504 vs 0.1601 for KDB (k = 4), a statistically significant improvement in both cases. The results for SKDB using MDL discretization are also shown in Table 8 (antepenultimate row). Note that there are some datasets, such as linkage, for which the discretization method used, in this case EF5, is not an adequate one, whereas a technique based on entropy minimization provides better results. Just the opposite occurs for datasets such as MITFace Set C.
A priori selection of the most appropriate discretization method is an open question, even more so for very large datasets, where we should consider the same bias/variance trade-off in terms of discretization bias and variance (Yang and Webb, 2009): on the one hand, we want the number of intervals to be large enough to provide low discretization bias, but not so large as to increase discretization variance. The latter should not be a problem in large datasets, but a large number of intervals implies more parameters, which has an impact on the classifier's complexity.

References

Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A Reliable Effective Terascale Linear Learning System. J. Mach. Learn. Res., 15(1):1111–1133, January 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2638571.

Kevin Bache and Moshe Lichman. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

Jock A. Blackard. Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types. PhD thesis, Fort Collins, CO, USA, 1998. AAI9921979.

Léon Bottou, Corinna Cortes, John S. Denker, Harris Drucker, Isabelle Guyon, Lawrence D. Jackel, Yann Le Cun, Urs A. Muller, Eduard Säckinger, Patrice Simard, and Vladimir Vapnik. Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition. In Proceedings of the 12th International Conference on Pattern Recognition, Los Alamitos, CA, volume 2, pages 77–82, 1994.

Remco R. Bouckaert. Voting Massive Collections of Bayesian Network Classifiers for Data Streams. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, AI '06, pages 243–252, Berlin, Heidelberg, 2006. Springer-Verlag.
Damien Brain and Geoffrey I. Webb. On the Effect of Data Set Size on Bias and Variance in Classification Learning. In D. Richards, G. Beydoun, A. Hoffmann, and P. Compton, editors, Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW '99), pages 117–128, Sydney, 1999. The University of New South Wales.

Leo Breiman. Bagging Predictors. Mach. Learn., 24(2):123–140, 1996.

Leo Breiman. Random Forests. Mach. Learn., 45(1):5–32, 2001.

Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 75:1–3, 1950.

Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, and Petri Myllymäki. Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood. J. Mach. Learn. Res., 12:2181–2210, 2011.

Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary Data Mining with Automatic Rule Generalization. In Recent Advances in Computers, Computing and Communications, pages 296–300, 2002.

C. K. Chow and C. N. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory, 14:462–467, 1968.

Gregory F. Cooper and Edward Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Mach. Learn., 9(4):309–347, 1992.

Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res., 7:1–30, 2006.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 12:2121–2159, 2011.

Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1st edition, 1973.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why Does Unsupervised Pre-training Help Deep Learning? J. Mach. Learn. Res., 11:625–660, 2010.

Usama M. Fayyad and Keki B. Irani.
Multi-interval Discretization of Continuous-valued Attributes for Classification Learning. In Ruzena Bajcsy, editor, IJCAI, pages 1022–1029. Morgan Kaufmann, 1993.

M. Julia Flores, José A. Gámez, Ana M. Martínez, and José M. Puerta. GAODE and HAODE: Two Proposals Based on AODE to Deal with Continuous Variables. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 313–320. ACM, 2009a.

M. Julia Flores, José A. Gámez, Ana M. Martínez, and José Miguel Puerta. HODE: Hidden One-Dependence Estimator. In Claudio Sossai and Gaetano Chemello, editors, ECSQARU, volume 5590 of Lecture Notes in Computer Science, pages 481–492. Springer, 2009b. ISBN 978-3-642-02905-9.

Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian Network Classifiers. Mach. Learn., 29(2-3):131–163, 1997.

João Gama, Ricardo Rocha, and Pedro Medas. Accurate Decision Trees for Mining High-speed Data Streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 523–528, New York, NY, USA, 2003. ACM.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 11(1):10–18, 2009.

G. Hardy. Correspondence. Insurance Record (1889). Transactions of the Faculty of Actuaries, 8, 1920.

Geoff Hulten and Pedro Domingos. Mining Complex Models from Arbitrarily Large Databases in Constant Time. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 525–531, New York, NY, USA, 2002. ACM.

Manfred Jaeger. Probabilistic Classifiers and the Concepts They Recognize. In Tom Fawcett and Nina Mishra, editors, ICML, pages 266–273. AAAI Press, 2003.

Boštjan Kaluža, Violeta Mirchevska, Erik Dovgan, Mitja Luštrek, and Matjaž Gams. An Agent-based Approach to Care in Independent Living.
In Proceedings of the First International Joint Conference on Ambient Intelligence, AmI '10, pages 177–186, Berlin, Heidelberg, 2010. Springer-Verlag.

Nikos Karampatziakis and John Langford. Online Importance Weight Aware Updates. In Fabio Gagliardi Cozman and Avi Pfeffer, editors, UAI, pages 392–399. AUAI Press, 2011.

Ron Kohavi. The Power of Decision Tables. In Nada Lavrac and Stefan Wrobel, editors, Proceedings of the European Conference on Machine Learning (ECML-95), volume 912 of Lecture Notes in Computer Science, pages 174–189. Springer Berlin Heidelberg, 1995.

Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. In Lorenza Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 284–292. Morgan Kaufmann Publishers, 1996.

Pat Langley and Stephanie Sage. Induction of Selective Bayesian Classifiers. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 399–406, San Francisco, CA, 1994. Morgan Kaufmann.

David D. Lewis. Naive Bayes at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the 10th European Conference on Machine Learning, ECML '98, pages 4–15, London, UK, 1998. Springer-Verlag.

George James Lidstone. Note on the General Case of the Bayes-Laplace Formula for Inductive or a Posteriori Probabilities. Transactions of the Faculty of Actuaries, 8:182–192, 1920.

Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training Invariant Support Vector Machines using Selective Sampling. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.

Oded Maron and Andrew Moore. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In Jack D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 59–66, San Francisco, CA, 1994.
Morgan Kaufmann.

B.W. Matthews. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, 1975.

H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Marvin Minsky. Steps Toward Artificial Intelligence. Proceedings of the Institute of Radio Engineers, 49:8–30, 1961.

Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

Andrew Moore and Mary Soon Lee. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets. J. Artif. Int. Res., 8(1):67–91, 1998.

Nikunj C. Oza and Stuart Russell. Experimental Comparisons of Online and Batch Versions of Bagging and Boosting. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 359–364, New York, NY, USA, 2001. ACM.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

Franz Pernkopf and Jeff A. Bilmes. Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers. J. Mach. Learn. Res., 11:2323–2360, 2010.

François Petitjean, Jordi Inglada, and Pierre Gançarski. Satellite Image Time Series Analysis under Time Warping. IEEE Transactions on Geoscience and Remote Sensing, 50(8):3081–3095, 2012.

Attila Reiss and Didier Stricker. Creating and Benchmarking a New Dataset for Physical Activity Monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, PETRA '12, pages 40:1–40:8, New York, NY, USA, 2012. ACM.

Stéphane Ross, Paul Mineiro, and John Langford. Normalized Online Learning. arXiv preprint arXiv:1305.6646, 2013.

Arcadio Rubio and José Antonio Gámez. Flexible Learning of K-dependence Bayesian Network Classifiers.
In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pages 1219–1226, New York, NY, USA, 2011. ACM.

Mehran Sahami. Learning Limited Dependence Bayesian Classifiers. In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 335–338. AAAI Press, 1996.

Murat Sariyar, Andreas Borg, and Klaus Pommerening. Controlling False Match Rates in Record Linkage Using Extreme Value Theory. J. of Biomedical Informatics, 44(4):648–654, 2011.

I. Schmidtmann, G. Hammer, M. Sariyar, and A. Gerhold-Ay. Evaluation des Krebsregisters NRW - Schwerpunkt Record Linkage - Abschlussbericht [Evaluation of the NRW Cancer Registry - Focus on Record Linkage - Final Report]. Technical report, Institut für medizinische Biometrie, Epidemiologie und Informatik, Universitätsmedizin Mainz, 2009.

Sören Sonnenburg and Vojtech Franc. COFFIN: A Computational Framework for Linear SVMs. In Proc. ICML 2010, 2010.

Jiang Su and Harry Zhang. Full Bayesian Network Classifiers. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 897–904, New York, NY, USA, 2006. ACM.

Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core Vector Machines: Fast SVM Training on Very Large Data Sets. J. Mach. Learn. Res., 6:363–392, 2005.

Geoffrey I. Webb, Janice R. Boughton, and Zhihai Wang. Not So Naive Bayes: Aggregating One-Dependence Estimators. Mach. Learn., 58(1):5–24, 2005.

Geoffrey I. Webb, Janice R. Boughton, Fei Zheng, Kai Ming Ting, and Houssam Salem. Learning by Extrapolation from Marginal to Full-multivariate Probability Distributions: Decreasingly Naive Bayesian Classification. Mach. Learn., 86(2):233–272, 2012.

Y. Yang, G.I. Webb, J. Cerquides, K. Korb, J. Boughton, and K.-M. Ting. To Select or To Weigh: A Comparative Study of Linear Combination Schemes for SuperParent One-Dependence Estimators. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(12):1652–1665, 2007.

Ying Yang and Geoffrey I. Webb.
Discretization for Naive-Bayes Learning: Managing Discretization Bias and Variance. Mach. Learn., 74(1):39–74, 2009.

Nayyar A. Zaidi, Jesús Cerquides, Mark James Carman, and Geoffrey I. Webb. Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting. J. Mach. Learn. Res., 14(1):1947–1988, 2013.

Harry Zhang, Liangxiao Jiang, and Jiang Su. Hidden Naive Bayes. In Proceedings of the 20th National Conference on Artificial Intelligence, Volume 2, pages 919–924. AAAI Press, 2005.

F. Zheng, G.I. Webb, P. Suraweera, and L. Zhu. Subsumption Resolution: An Efficient and Effective Technique for Semi-Naive Bayesian Learning. Mach. Learn., 87(1):93–125, 2012.

Fei Zheng and Geoffrey I. Webb. A Comparative Study of Semi-naive Bayes Methods in Classification Learning. In Proceedings of the Fourth Australasian Data Mining Conference (AusDM05), pages 141–156, 2005.

Zijian Zheng and Geoffrey I. Webb. Lazy Learning of Bayesian Rules. Mach. Learn., 41(1):53–84, 2000.