Journal of Machine Learning Research 17 (2016) 1-35. Submitted 7/13; Revised 5/15; Published 4/16

Scalable Learning of Bayesian Network Classifiers

Ana M. Martínez (anam.martinezf@gmail.com), Faculty of Information Technology, Monash University, VIC 3800, Australia
Geoffrey I. Webb (geoff.webb@monash.edu), Faculty of Information Technology, Monash University, VIC 3800, Australia
Shenglei Chen (tristan chen@126.com), College of Information Science, Nanjing Audit University, China / Faculty of Information Technology, Monash University, Australia
Nayyar A. Zaidi (nayyar.zaidi@monash.edu), Faculty of Information Technology, Monash University, VIC 3800, Australia

Editor: Russ Greiner

Abstract

Ever-increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, the capacity to learn very complex multivariate probability distributions), is extremely desirable. This paper presents such a learner. We propose an extension to the k-dependence Bayesian classifier (KDB) that discriminatively selects a sub-model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on 16 large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time, than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.

Keywords: scalable Bayesian classification, feature selection, out-of-core learning, big data

1. Introduction

Until very recently most machine learning research has been conducted in the context of relatively small datasets, that is, no more than 50,000 instances and a few hundred attributes.
It is only in the last five years that larger datasets have started to appear in the literature on a regular basis (Erhan et al., 2010; Sonnenburg and Franc, 2010; Agarwal et al., 2014; Sariyar et al., 2011). We argue that larger data quantities do not simply call for scaling-up of existing learning algorithms. Rather, we believe, as first argued by Brain and Webb (1999), that the most accurate learners for large data will have much lower bias than the most accurate learners for small data. Thus we need a new generation of computationally efficient low-bias learners. To achieve the necessary efficiency, a learning algorithm for very large data should ideally require only a few passes through the training data. Further, to remove memory size as a bottleneck, the algorithm should be able to process data out-of-core.

©2016 Ana M. Martínez, Geoffrey I. Webb, Shenglei Chen and Nayyar A. Zaidi.

In this paper, we extend the KDB classifier. KDB is a form of restricted Bayesian network classifier (BNC). It has numerous desirable properties in the context of learning from large quantities of data, including: the capacity to control its bias/variance trade-off with a single parameter, k; no requirement that data be held in-core; and the capacity to learn in just two passes through the training data. The contributions of this paper are as follows. We extend KDB to perform discriminative model selection in a single additional pass through the training data. In this single pass, our algorithm selects between both attribute subsets and network structures. The resulting highly scalable algorithm combines the computational efficiency of classical generative learning with the low bias of discriminative learning. We compare the performance of our algorithm with other state-of-the-art classifiers on several large datasets, ranging in size from 165 thousand to 54 million instances and 5 to 784 attributes.
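The single-pass selection idea can be sketched abstractly: in one extra sweep over the training data, every candidate model (an attribute-subset/structure pair) is scored by leave-one-out cross validation and the lowest-loss candidate is kept. The sketch below is illustrative only; `candidates`, `predict` and `loss` are hypothetical stand-ins, not the paper's implementation (which subtracts each instance's counts from the network's tables rather than retraining).

```python
def loo_select(candidates, data, loss):
    """One sweep over `data`: accumulate leave-one-out loss for every
    candidate model, then return the name of the lowest-loss one.
    candidates: dict name -> predict(x, held_out_index) callable."""
    totals = {name: 0.0 for name in candidates}
    for idx, (x, y) in enumerate(data):
        for name, predict in candidates.items():
            totals[name] += loss(predict(x, idx), y)
    best = min(totals, key=totals.get)
    return best, totals

# Toy usage with 0-1 loss and two dummy "models".
data = [((0,), 0), ((1,), 1), ((2,), 0)]
zero_one = lambda pred, true: 0.0 if pred == true else 1.0
cands = {
    "always-zero": lambda x, idx: 0,
    "first-attr":  lambda x, idx: x[0] % 2,
}
best, totals = loo_select(cands, data, zero_one)
```

Because the loss totals for all candidates are accumulated in the same sweep, the cost of selection is one pass regardless of how many nested candidates are scored.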
We show that our highly scalable algorithms achieve comparable or lower error on large data than a range of out-of-core and in-core state-of-the-art classification learning algorithms. To illustrate our motivations, in Figure 1 we present learning curves for the poker-hand dataset (which is described in Table 7 in Appendix A). As can be seen, for lower quantities of data the lower variance delivered by low values of k results in lower error for KDB, while for larger quantities of data the lower bias of higher values of k results in lower error. While selective KDB sometimes selects a k that overfits this data with smaller training set sizes, when the training set size exceeds 550,000 it successfully selects the best k, and by selecting a subset of the attributes it can substantially reduce error across the entire range of training set sizes relative to the KDB classifier using the same k. Section 2 reviews the state of the art in out-of-core BNCs and introduces our proposal: selective KDB (SKDB, Section 2.1). In Section 3 we present a set of comparisons for our proposed algorithm on 16 big datasets with out-of-core BNCs (Section 3.1), out-of-core linear classifiers using stochastic gradient descent (Section 3.2), and with the in-core (batch) learners Bayes Net and Random Forest (Section 3.3). Finally, Section 4 presents the main conclusions and outlines future work.

2. Scalable Bayesian Networks for Classification

We define the classification learning task as the assignment of a label y ∈ Ω_Y, from a set of c labels of the class variable Y, to a given example x = (x_1, . . . , x_d), with values for the d attributes A = {X_1, . . . , X_d}. Bayesian networks (BNs) (Pearl, 1988) provide a framework for computing the joint probability of these d random variables.
Figure 1: Learning curves for KDB and Selective KDB on the poker-hand dataset (training set sizes up to 1,000,000; curves shown for KDB k=5, SKDB k=5, SKDB k=5 selecting only k, and SKDB k=5 selecting only attributes). The values plotted are averages over 10 runs. For each run 1,000 examples are selected as a test set and samples of successive sizes are selected from the remaining data for training. As can be seen, for lower quantities of data the lower variance delivered by low values of k results in lower error for KDB, while for larger quantities of data the lower bias of higher values of k results in lower error, although there is not sufficient data for k = 5 to asymptote on this dataset. While only selecting a k (SKDB k=5 only k) causes selective KDB to overfit this data with smaller training set sizes (between 20,000 and 550,000), when the training set size is large enough it successfully selects the best k. Only selecting attributes (SKDB k=5 only atts) always reduces error (relative to KDB k=5). By selecting both k and attributes (SKDB k=5), selective KDB substantially reduces error relative to KDB with any value of k once there are more than approximately 450,000 training examples.

A BN is a directed acyclic graph in which the joint probability of all attributes can be written as the product of the individual probabilities of all attributes given their parents, that is,

    p(y, x) = p(y | π_y) ∏_{i=1}^{d} p(x_i | π_{x_i}),

where π_{x_i} denotes the parents of attribute X_i. If the structure is known, the likelihood of the BN given the data can be maximized by estimating p(x_i | π_{x_i}) using the empirical estimates from the training data T, composed of t training instances. While unrestricted BNs are the least biased, training such a model on even moderate-size data sets can be extremely challenging, as the search space that needs to be explored grows exponentially with the number of attributes.
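For concreteness, the factorisation above can be evaluated directly once the conditional probability tables (CPTs) are known. The sketch below uses a toy dict-based parameterisation for a BNC whose class node has no parents; the data structures are illustrative, not the paper's implementation.

```python
def joint_prob(y, x, p_class, cpts, parents):
    """p(y, x) = p(y) * prod_i p(x_i | y, parents of X_i).
    cpts[i] maps (x_i, y, *parent_values) -> probability."""
    prob = p_class[y]
    for i, xi in enumerate(x):
        key = (xi, y) + tuple(x[j] for j in parents[i])
        prob *= cpts[i][key]
    return prob

# Toy network: X0 depends only on Y; X1 depends on Y and X0 (so k=1 suffices).
p_class = {0: 0.5, 1: 0.5}
parents = {0: [], 1: [0]}
cpts = {
    0: {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8},
    1: {(0, 0, 0): 0.6, (1, 0, 0): 0.4, (0, 0, 1): 0.5, (1, 0, 1): 0.5,
        (0, 1, 0): 0.1, (1, 1, 0): 0.9, (0, 1, 1): 0.3, (1, 1, 1): 0.7},
}
p = joint_prob(0, (0, 1), p_class, cpts, parents)  # 0.5 * 0.7 * 0.4
```

Classification then simply picks the y maximising p(y, x), since p(y | x) is proportional to p(y, x).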
This has led to restricted BNCs, including naive Bayes (NB) (Duda and Hart, 1973; Minsky, 1961; Lewis, 1998), tree-augmented naive Bayes (TAN) (Friedman et al., 1997), averaged n-dependence estimators (AnDE) (Webb et al., 2012) and k-dependence estimators (KDB) (Sahami, 1996). These classifiers have in common either no structural learning, or minimal structural learning that requires only one or two passes through the data. They have been referred to as semi-naive BNCs (Zheng and Webb, 2005). Even though highly biased due to their inherent conditional attribute independence assumptions, they have been shown to be very effective classification methods. NB is the simplest of the BNCs, assuming that all attributes are independent given the class. It estimates the joint probability using p̂(y) ∏_{i=1}^{d} p̂(x_i | y), where p̂(·) denotes an estimate of p(·). TAN is a structural augmentation of NB in which every attribute has as parents the class and at most one other attribute. The structure is determined using an extension of the Chow-Liu tree (Chow and Liu, 1968) that utilizes conditional mutual information to find a maximum spanning tree. This alleviates some of NB's independence assumption and therefore reduces its bias at the expense of increasing its variance, which results in better performance on larger data sets. TAN provides an intermediate bias-variance trade-off, standing between NB on one hand and unrestricted BNCs on the other. There has been much prior work exploring approaches to alleviate NB's independence assumption. One approach is to add a bounded number of additional interdependencies (Friedman et al., 1997; Zheng and Webb, 2000; Webb et al., 2005; Zhang et al., 2005; Yang et al., 2007; Flores et al., 2009a,b; Zheng et al., 2012; Zaidi et al., 2013). At the other extreme, Su and Zhang (2006) propose the use of a full Bayesian network.
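Both TAN's Chow-Liu step and KDB's parent assignment (Algorithm 2 below) are driven by the empirical conditional mutual information MI(X_i; X_j | Y). A minimal sketch of that computation from raw value lists, plus a greedy up-to-k parent assignment of the kind KDB performs, is given below; this is illustrative code, not the paper's out-of-core counting implementation.

```python
from collections import Counter
from math import log

def cond_mutual_info(xi, xj, y):
    """Empirical MI(Xi; Xj | Y) in nats, from parallel lists of values."""
    n = len(y)
    c_aby, c_ay = Counter(zip(xi, xj, y)), Counter(zip(xi, y))
    c_by, c_y = Counter(zip(xj, y)), Counter(y)
    mi = 0.0
    for (a, b, cy), n_aby in c_aby.items():
        # p(a,b|y) / (p(a|y) * p(b|y)), expressed with raw counts
        mi += (n_aby / n) * log(n_aby * c_y[cy] / (c_ay[(a, cy)] * c_by[(b, cy)]))
    return mi

def assign_parents(order, cmi, k):
    """Give each attribute (already ranked by MI(Xi; Y)) up to k parents
    among the higher-ranked attributes, greedily by CMI (KDB-style)."""
    return {a: sorted(order[:pos], key=lambda b: -cmi[a][b])[:k]
            for pos, a in enumerate(order)}

perfect = cond_mutual_info([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1])  # log 2
cmi = {0: {1: 0.2, 2: 0.4}, 1: {0: 0.2, 2: 0.6}, 2: {0: 0.4, 1: 0.6}}
parents = assign_parents([1, 2, 0], cmi, k=1)
```

Perfectly correlated binary attributes give MI of log 2 nats, and conditionally independent ones give 0, which matches the role CMI plays in ranking candidate parents.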
Of these algorithms, only two allow the number of interdependencies to be adjusted to best accommodate the trade-off between bias and variance for differing data quantities. AnDE (Webb et al., 2012) averages many BNCs, where a parameter n controls the number of interdependencies to be modeled. KDB uses a parameter k to control the number of interdependencies modeled. KDB relaxes NB's independence assumption by allowing every attribute to be conditioned on the class and, at most, k other attributes. Like TAN, KDB requires two passes over the training data for learning: the first pass collects the statistics needed to compute the mutual information used for structure learning, and the second performs the parameter learning based on the structure created in the first step (Algorithm 1). An advantage over the AnDE model is that KDB does not need to collect statistics for all combinations of n or k attributes, allowing it to scale to higher-order dependencies. It also has the capacity to select a model and hence fit the data more closely. This is a disadvantage for small data, which it may overfit, but KDB will often deliver higher accuracy than AnDE on large data, which is more difficult to overfit. Recently, some techniques have been proposed that address the classification problem with Bayesian networks from a relatively efficient discriminative perspective. Carvalho et al. (2011) propose a decomposable approximation to the conditional log-likelihood scoring criterion. Pernkopf and Bilmes (2010) propose a BNC learning algorithm with some resemblance to KDB but using discriminative evaluation for parent assignment. Neither

Algorithm 1: The KDB algorithm
Input: Training set T with attributes {X_1, . . . , X_a, Y} and k.
Output: KDB model.
1  Let G be a directed graph G = (V, E), in which V is a set of vertices and E is a set of links.
2  Let Θ be a Bayesian network with structure G.
3  First pass begin
4    G = learnStructure(T)   /* Algorithm 2 */
6  Second pass begin
7    Θ = learnParameters(T, G)   /* Algorithm 3 */
9  Let KDB be a BN with structure G and conditional probability distributions in Θ.
10 return KDB

Algorithm 2: learnStructure(T)
Input: Training set T.
Output: G, the network structure.
1  Calculate MI(X_i; Y) from T for all attributes.
2  Calculate MI(X_i; X_j | Y) from T for each pair of attributes (i ≠ j).
3  Let L be a list of all X_i in decreasing order of MI(X_i; Y).
4  V = {Y}; E = ∅
5  for i = 1 to L.size do
6    V = V ∪ {L_i}
7    E = E ∪ {(Y, L_i)}
8    v_k = min(i − 1, k)
9    while v_k > 0 do
10     m = argmax_j { MI(L_i; L_j | Y) : 1 ≤ j < i and (L_j, L_i) ∉ E }

Cases where results could not be obtained are marked by >600h, when more than 600 CPU hours are required, or by >138G, when more than 138GB of RAM memory are required. Note that while each of these algorithms has a number of parameters that could potentially be tuned using cross validation, doing so would not produce reasonable comparators to SKDB. First, SKDB uses cross-validation to choose between multiple models rather than to choose between parameterizations. Cross validation could also be used to tune SKDB parameters such as the smoothing technique. Second, SKDB uses cross-validation in a constrained manner that requires only a single additional pass through the data, while parameter tuning of these algorithms would increase compute time proportionally to the number of folds employed.

3.3.1 Augmented Bayesian Network

We compare performance against a state-of-the-art generic in-core Bayesian network classifier that uses a hill-climbing algorithm for adding, deleting and reversing arcs (starting with a NB structure, this hill climber also considers the arcs that are part of the NB structure for deletion, that is, arcs from the class to the attributes can be removed). It optimizes the K2 metric (Cooper and Herskovits, 1992).

Table 5: Win-draw-loss records for SKDB against Bayes Net in terms of RMSE.

            Bayes Net
  SKDB      6(+4)-3-3
The maximum number of parents allowed for a node in the Bayes net is set to 5. The win-draw-loss comparison is presented in Table 5. The detailed results can be found in Table 8 (Appendix A), row Bayes Net. The darker grey background corresponds to those cases where Bayes Net significantly outperforms SKDB. Note that there are records for only 12 out of the 16 datasets, because there are four cases for which results could not be computed for Bayes Net due to memory constraints, since the network cannot be learned incrementally. Even though Bayes Net performs a more exhaustive search than SKDB over the model space of Bayesian networks with up to 5 parents, SKDB obtains lower error than Bayes Net more often than the reverse. We hypothesize that this is due to SKDB's use of both generative and discriminative measures for model selection, in contrast to Bayes Net's exclusive use of generative measures.

3.3.2 Random Forest

Random Forest (RF) is a state-of-the-art learning algorithm introduced by Breiman (2001). It uses bagging (Breiman, 1996) to aggregate an ensemble of trees that are each grown using a process that involves a stochastic element to increase diversity. We have conducted experiments with RF using 100 trees. We present results both with pre-discretized data, to show relative performance on categorical data, and with undiscretized data. The detailed results can be found in the fifth- and sixth-from-last rows of Table 8 in Appendix A. Table 6 shows the win-draw-loss records when compared with SKDB. The superscripts count those datasets for which results cannot be obtained for RF due to main memory constraints (more than 138GB are required). The results show that SKDB is competitive with RF in terms of RMSE, even when RF performs its own discretization (5 wins for SKDB and 6 wins for RF), for which the computational cost in terms of memory and time is much higher than SKDB's.
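The win-draw-loss records used throughout these tables can be computed from paired per-dataset error scores (lower is better). A small sketch follows; the tie tolerance `eps` is an illustrative stand-in for however ties are actually decided in the paper.

```python
def win_draw_loss(errs_a, errs_b, eps=1e-6):
    """Record for learner A against learner B over paired error scores
    (lower is better): (wins, draws, losses) for A."""
    wins = sum(a < b - eps for a, b in zip(errs_a, errs_b))
    draws = sum(abs(a - b) <= eps for a, b in zip(errs_a, errs_b))
    return wins, draws, len(errs_a) - wins - draws

# A wins on the first dataset, draws the second, loses the third.
record = win_draw_loss([0.10, 0.20, 0.30], [0.15, 0.20, 0.25])
```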
When 0-1 loss is considered, RF wins more frequently than the out-of-core SKDB, even when 0-1 loss is used as the objective function during structure selection. When MDL discretization is used instead of 5EF for SKDB, the results improve slightly for SKDB compared to RF performing its own discretization. When interpreting these results it is important to keep in mind that SKDB is a three-pass out-of-core algorithm (four-pass with discretization), in contrast to the inherently computationally intensive in-core processing of RF. Note that the sum of some cells is smaller than 16, due to computational constraints for RF. RF is an in-core algorithm, and hence we have not been able to learn classification models for the largest datasets. Interestingly, RF is more negatively affected (in terms of computation) by a large number of classes than the BNCs; for example, no 100-tree model can be learnt for MSDYearPrediction, which has 90 class labels. Specifically, there are three cases where RF with 10 trees cannot be run, and four in the case of RF with 100 trees. These four datasets are MSDYearPrediction, satellite, mnist8m and splice, so the largest samples for which results could be computed have been used instead.

Table 6: Win-draw-loss records for SKDB against RF in terms of RMSE and 0-1 loss.

                          RF (5EF)     RF (Num)
  RMSE      SKDB          5(+4)-1-6    5(+5)-0-6
  0-1 loss  SKDB          2(+4)-3-7    2(+5)-2-7
            SKDB(0-1)     3(+4)-2-7    3(+5)-1-7
            SKDB(MDL)     8(+4)-2-2    3(+5)-2-6

Figure 7: Scatter plots of RMSE (logarithmic scale) for (a) Bayes Net versus SKDB (max k=5) and (b) RF versus SKDB (max k=5).
RMSE comparisons across all datasets for Bayes Net and RF against SKDB are shown in Figures 7a and 7b, where points marked with an X indicate datasets for which sampling was required. While Bayes Net is not competitive with SKDB, RF provides competitive results for some datasets. PAMAP2 provides a notable case, for which a sample of only 2 million instances out of 3.5 million suffices for RF to provide better classification performance than SKDB using all the training data. This is, however, the only dataset for which RF learning from a sample can achieve lower RMSE than SKDB learning from all the data; the reverse is true for splice, mnist and satellite. Figures 8a and 8b display the training and classification times per dataset for these classifiers compared to SKDB. For most datasets the in-core learners require substantially more time for learning and classifying.³

3. RF requires more than 8GB of main memory to complete for poker-hand and uscensus1990; since the experiments to measure time were conducted on a desktop computer, no time measurements can be provided to compare with the rest of the experiments. Weka's implementation was used.

Figure 8: (a) Training and (b) classification time comparisons per dataset for Bayes Net and RF against SKDB.

3.4 Global Comparison of All Classifiers

In this section we analyze how all the classifiers explored in this paper perform when compared as a set. Figure 9 plots the average rankings across all datasets, along with the standard deviation for each learner. This ranking corresponds to the ranking obtained when a Friedman test is conducted. In this analysis we present the 0-1 loss performance of SKDB when it is optimized for RMSE; its average rank on 0-1 loss improves marginally from 2.94 to 2.84 when it is instead optimized for 0-1 loss.
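The average ranks plotted in Figure 9 are the per-dataset ranks of each learner (tied learners sharing the mean of the tied ranks), averaged across datasets; this is the same statistic the Friedman test operates on. A sketch with illustrative inputs:

```python
def average_ranks(errors):
    """errors[learner] = per-dataset error list (lower is better).
    Returns each learner's mean rank across datasets; tied learners
    share the mean of the ranks they jointly occupy."""
    learners = list(errors)
    n_datasets = len(next(iter(errors.values())))
    totals = {l: 0.0 for l in learners}
    for d in range(n_datasets):
        ordered = sorted(learners, key=lambda l: errors[l][d])
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and errors[ordered[j]][d] == errors[ordered[i]][d]:
                j += 1
            mean_rank = (i + 1 + j) / 2  # ranks i+1 .. j share their mean
            for l in ordered[i:j]:
                totals[l] += mean_rank
            i = j
    return {l: totals[l] / n_datasets for l in learners}

# Two datasets; A and B tie on the second one.
ranks = average_ranks({"A": [0.1, 0.2], "B": [0.2, 0.2], "C": [0.3, 0.1]})
```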
When assessing the calibration of the probability estimates using RMSE, SKDB obtains the lowest average rank of 2.53, followed by RF with 2.91 and KDB with 3.34 (very close to those for VW and Bayes Net). When assessing performance using 0-1 loss, RF using its own internal discretization of numeric attributes enjoys an average advantage of 1 in rank over SKDB, and SKDB enjoys an average advantage of almost one third over VWLF. We find NB at the other extreme on both measures, with average ranks 7.63 and 7.72 out of a total of 8 classifiers. SKDB obtains in the worst case the 4.5th position (the half being due to a tie with KDB-best-k) for the linkage dataset and the 4th position for census-income when ranked on RMSE, and the 4th position for MITFace Set B and kddcup when ranked on 0-1 loss. On both measures the largest standard deviation is observed for VW, which tends to rank either very well or rather poorly across datasets. Still, VW obtains the largest number of first-place rankings for RMSE, with a total of 7, against 5 for RF and 4 for SKDB. On 0-1 loss, RF ranks first 8 times, VW 4 times and SKDB twice (one of those times being a tie with KDB-best-k). A Friedman test (Demšar, 2006) was performed for the set of 8 classifiers, yielding statistical differences for both measures. To identify where exactly these differences are found, we ran a set of post-hoc Nemenyi tests. These tests confirm the existence of two sets of algorithms in terms of performance on both RMSE and 0-1 loss for these large datasets: on one hand NB, AODE and TAN; and on the other hand Bayes Net, VW, KDB, RF and SKDB. The Nemenyi tests⁴ showed significant differences for the second set of algorithms over NB and AODE, and only for RF and SKDB over TAN.

4. Similarly for the Holm and Shaffer post-hoc tests.
Figure 9: Average rankings (with standard deviations) in terms of (a) RMSE and (b) 0-1 loss for NB, TAN, AODE, KDB (with best k), SKDB, RF using numeric attributes, Bayes Net and VW with a logistic function.

Even though the small number of datasets allows only tentative conclusions to be drawn, the following dataset characteristics appear to indicate a preference for one or another learning scheme. RF performs better on datasets with a small number of attributes. RF derives an advantage with respect to 0-1 loss from its internal discretization of numeric values; it does not achieve the lowest RMSE or 0-1 loss on any of the three categorical-only datasets: poker-hand, PAMAP2 or splice. SKDB seems to cope well with high-dimensional datasets and datasets with more than 1 million points. However, its complexity is quadratic in the number of attributes, so, like most existing BNCs, it is not feasible for very high dimensionality.

Figure 10: (a) Training and (b) classification time (logarithmic scale) comparisons for the out-of-core classifiers NB, TAN, AODE, KDB5, SKDB, VWSF and VWLF.

VW appears to have an advantage for sparse numeric datasets. A more detailed study of the domains of competence of these classifiers remains an important area for future work. We can also analyze the performance of these classifiers with respect to the complexity of the decision function they provide. NB and linear regression, represented in this work by VW with linear features, are classifiers that learn linear decision boundaries. They have low variance and are particularly useful for small data quantities and sparse data.
An extra level of expressiveness is offered by AODE, KDB (k = 1), TAN and VWSF/VWLF with quadratic features, which produce quadratic decision functions. A2DE, KDB (k = 2) and VWSF/VWLF with cubic features produce cubic decision functions. More generally, AnDE and KDB can separate order-n and order-k polynomially separable concepts, respectively (Jaeger, 2003). These two families of classifiers provide a spectrum of models of increasing complexity, allowing a fine-tuned trade-off between expressivity on the one hand, and efficiency of learning and guarding against overfitting on the other; that is, the well-known bias-variance trade-off (Brain and Webb, 1999). Figures 10a and 10b show the time comparisons for the out-of-core classifiers considered: namely NB, TAN, AODE, KDB with k = 5, SKDB, and LRSGD with quadratic features in VW (using either a squared or a logistic loss function). The time required to pre-randomize the data for VW has not been included. No parallelization techniques have been used in any case, although most of the algorithms could be parallelized. The average over the 10-fold CV is shown. It is important to note that a logarithmic scale has been used for classification times in order to make visible the smaller differences among the faster learners. Again, only the 11 smallest of the 16 datasets have been considered for time measurement, since these experiments have been conducted on a personal computer to provide a controlled environment. Note that training SKDB with k = 5 (maximum) takes only a little more time than training KDB with k = 5. The classification time for KDB is slightly higher than SKDB's (since all attributes are considered). Training times for LRSGD with quadratic features and 3 passes (VWSF and VWLF), which are the configurations that obtain results comparable with SKDB, are also higher, presumably because they have to compute all pairs of quadratic features as indicated above. Similar reasoning holds for classification time.
We believe that the extra time for VW is also due to the need to binarize discrete features: since VW cannot deal with discrete attributes directly, v binary attributes are created for each originally discrete attribute with v values; and also because of the one-against-all approach required to handle multiple classes. This overhead may be noticeable since we are using quadratic features. The time required for VWLF is much larger than for VWSF, and in any case much larger, both for training and testing, than that for SKDB (note especially the logarithmic scale in Figure 10b).

4. Conclusions and Future Work

We have developed an extension to KDB which performs discriminative attribute and best-k selection using leave-one-out cross validation. SKDB requires only a single extra pass over the training data compared to KDB. In return for this extra pass during training, it often increases prediction accuracy relative to KDB with the best possible k, and always improves classification time (and space) by reducing the number of attributes considered, which can be quite significant for some datasets (for example, for MITFace Set C only 60% of the attributes, 215.8 out of 361, are selected, improving classification performance at the same time). It is embarrassingly parallelizable. Furthermore, it can optimize any loss function that is a function, over each training example x, of p̂(Y, x) and y_x, making it extremely flexible. Due to its low bias, when datasets are large enough to avoid overfitting, SKDB attains lower error than the other in-core and out-of-core Bayesian network algorithms considered. It even provides competitive error compared to Random Forest, one of the most powerful in-core classification methods, and to online linear and non-linear classifiers trained with stochastic gradient descent (that is, linear regression with quadratic and cubic features). Compared to the latter, it is in most cases substantially more efficient at both training and classification time.
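The binarisation step referred to above, in which a discrete attribute with v values becomes v 0/1 indicator features, is plain one-hot encoding. A minimal sketch (attribute domains passed explicitly; this is not VW's actual input pipeline):

```python
def one_hot(rows, domains):
    """Expand each discrete attribute i into len(domains[i]) 0/1 features,
    one indicator per possible value."""
    encoded = []
    for row in rows:
        vec = []
        for i, value in enumerate(row):
            vec.extend(1 if value == v else 0 for v in domains[i])
        encoded.append(vec)
    return encoded

# Two instances, two discrete attributes with 2 values each -> 4 features.
enc = one_hot([("a", 1), ("b", 0)], domains=[("a", "b"), (0, 1)])
```

The feature count, and hence the cost of forming all quadratic feature pairs, grows with the sum of the attribute domain sizes, which is consistent with the overhead discussed above.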
Further, it provides well-calibrated posterior class probability estimates. Future directions for research include:
1. using only a sample of the data in the leave-one-out cross validation, in order to use more complex loss functions efficiently;
2. tackling SKDB's tendency to overfit small training sets;
3. studying a customized selection of the different links and attributes in a discriminative manner, that is, different numbers of parents for different attributes, with attributes discarded in an order that does not follow the conditional mutual information rank;
4. consideration of other means of generating alternative classifiers to be selected in the final LOOCV pass; and
5. study of more appropriate loss functions for imbalanced datasets.

SKDB provides an efficient and effective solution to the problem of scalable low-bias learning. Its low bias allows it to capture and take advantage of the additional fine detail that is inherent in very large data, while its efficiency makes it feasible to deploy. SKDB thus sets new standards for classification learning from big data.

Acknowledgments

This research has been supported by the Australian Research Council grant DP140100087 and the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research, contract FA2386-15-1-4007. It has also been supported in part by the Monash eResearch Centre and eSolutions-Research Support Services through the use of the Monash Campus HPC Cluster and the LIEF grant. This research was also undertaken on the NCI National Facility in Canberra, Australia, which is supported by the Australian Commonwealth Government. The authors would also like to thank the researchers from CESBIO (D. Ducrot, C. Marais-Sicre, O. Hagolle and M. Huc) for providing the land cover maps and the geometrically and radiometrically corrected FORMOSAT-2 images (satellite dataset).
We also want to thank the anonymous reviewers for their insightful comments, which have helped to greatly improve the revised version of the paper.

Appendix A. Tables of the Experimental Section

Columns of Table 7: id; name; t (millions of instances); a (attributes); c (classes); missing values (and, if present, the maximum percentage per attribute and the average ± sd); source; relevant papers; description; size.

1. localization. t=0.165M, a=5, c=11, missing: none. Source: UCI; Kaluža et al. (2010). Recordings of 5 people performing different activities. Each person wore 3 sensors while performing the same scenario 5 times. 11MB.
2. census-income. t=0.299M, a=41, c=2, missing: yes (max 50%, avg 4.9 ± 14.6%). Source: UCI; Bache and Lichman (2013); Oza and Russell (2001). Weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. 136MB.
3. USPSExtended. t=0.341M, a=676, c=2, missing: none. Source: CVM; Tsang et al. (2005). 0/1 digit classification (extended version of the USPS data set^a). 603MB^s.
4. MITFace Set A. t=0.474M, a=361, c=2, missing: none. Source: CVM; Tsang et al. (2005). Face detection using an extended version of the MIT face database^b, created by adding non-faces to the original training set. 584MB.
5. MITFace Set B. t=0.489M, a=361, c=2, missing: none. Source: CVM; Tsang et al. (2005). Each training face is blurred and added to set A; they are then flipped laterally. 603MB.
6. MSDYearPrediction. t=0.515M, a=90, c=90, missing: none. Source: UCI; Bertin-Mahieux et al. (2011). Prediction of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s^c.
7. covtype. t=0.581M, a=54, c=7, missing: none. Source: UCI; Blackard (1998); Gama et al. (2003); Oza and Russell (2001). Predicting forest cover type from cartographic attributes only (no remotely sensed data). 72MB.
8. MITFace Set C. t=0.839M, a=361, c=2, missing: none. Source: CVM; Tsang et al. (2005). Each face in set B is rotated. 1.1GB.
9. poker-hand. t=1.025M, a=10, c=10, missing: none. Source: UCI; Cattral et al. (2002). Each record is an example of a hand consisting of 5 cards drawn from a deck of 52. Each card is described using 2 attributes (suit and rank), giving 10 predictive attributes in total. The class label describes the Poker Hand. The order of cards is important.
10. uscensus1990. t=2.458M, a=67, c=4, missing: none. Source: UCI. Discretized version of the USCensus1990raw dataset, a 1% sample from the full 1990 census. Temp. Absence From Work has been selected as the class.
11. PAMAP2. t=3.850M, a=54, c=19, missing: yes (max 91%, avg 1.6 ± 12.2%). Source: UCI; Reiss and Stricker (2012). Data from 18 different physical activities (such as walking, cycling, playing soccer, etc.), performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor.
12. kddcup. t=5.209M, a=41, c=40, missing: none. Source: UCI (KDD Cup 1999). Contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment: bad connections, called intrusions or attacks, and good normal connections.
13. linkage. t=5.749M, a=11, c=2, missing: yes (max 100%, avg 16.5 ± 36.9%). Source: Schmidtmann et al. (2009); Sariyar et al. (2011). Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.
14. mnist8m. t=8.100M, a=784, c=10, missing: none. Source: CVM; Loosli et al. (2007); Erhan et al. (2010); Bottou et al. (1994). Handwritten digit classification; extended version created by performing random elastic deformations of the original MNIST digits. 19GB^s.
15. satellite. t=8.705M, a=138, c=24, missing: yes (max 93%, avg 26.4 ± 33%). Source: Petitjean et al. (2012). Satellite image time series to predict land cover. 3.6GB.
16. splice. t=54.627M, a=141, c=2, missing: none. Source: Sonnenburg and Franc (2010); Agarwal et al. (2014). Recognising a human acceptor splice site (the largest public dataset for which subsampling is not an effective learning strategy). 7.3GB.

Table 7: Very large datasets for classification. From left to right: identification number; name of the dataset; number of instances, attributes and classes; presence of missing values (and, if present, the maximum percentage per attribute and the average ± sd); source of the dataset; relevant papers; and description and size.
a. http://c2inet.sce.ntu.edu.sg/ivor/cvm.html
b. http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html
[c] Subset of the Million Song Dataset (http://labrosa.ee.columbia.edu/millionsong/). Songs from a given artist may end up in both the train and test sets.
[s] The superscript s stands for sparse format.

Appendix B. Implementation Details

All the experiments for the out-of-core algorithms have been carried out in C++ software specifically designed to deal with out-of-core classification methods. The following implementation and configuration details should be considered:
- 10-fold cross-validation has been used.
- Equal-frequency pre-discretization with 5 bins (EF5) has been used to discretize all numeric attributes (except where otherwise specified [5]). We have observed that EF5 and MDL (Minimum Description Length) discretization (Fayyad and Irani, 1993) each provide the best results in approximately half of the datasets. EF5 has been chosen because it is faster to perform than MDL, and also because it is not supervised and hence does not potentially provide the classifier with class information from the holdout data when used for pre-discretization. Using a pre-fixed number of bins gives us the added advantage of not having to deal with the huge number of values per attribute that MDL discretization creates in some cases.
- Missing values have been treated as a distinct attribute value in all cases.
- The root mean square error is calculated exclusively on the true class label (unlike Weka's implementation (Hall et al., 2009), where all class labels are considered).
- KDB's implementation benefits from a tree structure that resembles an ADTree (Moore and Lee, 1998): sparseness is achieved by storing a NULL instead of a node for any query that matches zero records. In our case, to store the probability distributions of the different attributes once the structure has been determined, we build a (distribution) tree for each attribute.
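A minimal sketch of such a per-attribute distribution tree follows. This is illustrative C++, not the authors' implementation; the names `DistTreeNode` and `update` are hypothetical. The root accumulates counts(x_i, y); each deeper level expands on one parent's value, so depth d holds counts for the child attribute, the class and the first d parents. A branch that matches no records is simply absent, mirroring the NULL-for-zero-count idea.

```cpp
#include <map>
#include <memory>
#include <vector>

// One node of a per-attribute distribution tree.
struct DistTreeNode {
    // counts[x][y] = number of records reaching this branch with X_i = x, Y = y
    std::map<int, std::map<int, long>> counts;
    // Children are keyed by the value of the next parent in KDB order; a value
    // never seen on this branch has no child at all (sparse storage).
    std::map<int, std::unique_ptr<DistTreeNode>> children;
};

// Incorporate one training instance: x is the child attribute's value,
// y the class label, parents the parent values in KDB order.
void update(DistTreeNode& root, int x, int y, const std::vector<int>& parents) {
    DistTreeNode* node = &root;
    node->counts[x][y]++;                  // counts(x_i, y) at the root
    for (int pv : parents) {
        auto& child = node->children[pv];  // descend on this parent's value
        if (!child) child.reset(new DistTreeNode());
        node = child.get();
        node->counts[x][y]++;              // counts(x_i, y, parents so far)
    }
}
```

After a single pass over the data, every level of the tree already holds the joint counts needed both for the conditional probability estimates and for the back-off described below.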
Given, for example, attribute X4 with parents X0 and X3 (in this order, and apart from the class), the counts for (X4, Y) are stored in the root node. The next level expands according to the number of values of X0, the first parent, and stores the counts for (X4, Y, X0). The following level expands according to the number of values of X3, the second parent, and stores the counts for (X4, Y, X0, X3), and so forth if there were more parents. Note that, as opposed to the ADTree, all the parameters are stored, that is, more than strictly required: a branch is created for each attribute value, when one branch fewer would suffice to represent the model parameters. The reason is to avoid overloading classification time by having to explore and combine branches. Two combined techniques are used for smoothing. First, we use m-estimates [6] (Mitchell, 1997) as follows:

\hat{p}(x_i \mid \pi_{x_i}) = \frac{\mathrm{counts}(x_i, \pi_{x_i}) + \frac{m}{|X_i|}}{\mathrm{counts}(\pi_{x_i}) + m} \qquad (4)

[5] Only for the experiments in the antepenultimate row of Table 8, as specified.
[6] Also known as Schurmann-Grassberger's Law when m = 1, which is a particular case of Lidstone's Law (Lidstone, 1920; Hardy, 1920) with λ = 1/|X_i|, also based on a Dirichlet prior.

learner args localiz.
cens-inc USPS MITA MITB MSDY covtype MITC poker uscensus PAMAP kddcup linkage mnist8 satellite splice

NB: 0.7106±0.0007 0.4660±0.0018 0.2256±0.0026 0.0982±0.0029 0.1394±0.0014 0.9600±0.0002 0.5103±0.0013 0.2367±0.0021 0.5801±0.0006 0.2911±0.0006 0.4647±0.0006 0.1849±0.0008 0.0125±0.0006 0.4957±0.0005 0.6540±0.0002 0.0971±0.0002
TAN: 0.6321±0.0014 0.2247±0.0025 0.1153±0.0022 0.0202±0.0025 0.1213±0.0020 0.9436±0.0002 0.4862±0.0008 0.2068±0.0017 0.4987±0.0006 0.1845±0.0009 0.3232±0.0007 0.0572±0.0006 0.0081±0.0009 0.3817±0.0006 0.5424±0.0003 0.1020±0.0003
AODE: 0.6520±0.0010 0.2932±0.0020 0.1538±0.0028 0.1001±0.0036 0.1682±0.0025 0.9459±0.0002 0.4587±0.0013 0.1564±0.0014 0.5392±0.0006 0.2154±0.0010 0.3881±0.0008 0.0979±0.0007 0.0120±0.0005 0.3497±0.0005 0.5783±0.0003 0.1034±0.0003
KDB:
k = 1: 0.6501±0.0012 0.2219±0.0024 0.1125±0.0025 0.0209±0.0021 0.1291±0.0019 0.9447±0.0003 0.4709±0.0026 0.1940±0.0021 0.4987±0.0005 0.1814±0.0008 0.3276±0.0007 0.0606±0.0005 0.0082±0.0009 0.3751±0.0005 0.5506±0.0003 0.0940±0.0003
k = 2: 0.5840±0.0020 0.2064±0.0021 0.0990±0.0029 0.0126±0.0019 0.0633±0.0018 0.9393±0.0002 0.4576±0.0013 0.0906±0.0012 0.4055±0.0007 0.1677±0.0008 0.2721±0.0005 0.0497±0.0008 0.0062±0.0007 0.2922±0.0004 0.5072±0.0006 0.0953±0.0002
k = 3: 0.5485±0.0020 0.2102±0.0015 0.0928±0.0025 0.0142±0.0018 0.0468±0.0018 0.9361±0.0004 0.4399±0.0015 0.0705±0.0018 0.2859±0.0011 0.1664±0.0008 0.2290±0.0004 0.0485±0.0008 0.0060±0.0006 0.2500±0.0005 0.4769±0.0004 0.1008±0.0005
k = 4: 0.5225±0.0023 0.2107±0.0012 0.0877±0.0026 0.0514±0.0031 0.0387±0.0021 0.9383±0.0006 0.4126±0.0019 0.0616±0.0016 0.2048±0.0012 0.1601±0.0007 0.1937±0.0004 0.0478±0.0008 0.0060±0.0006 0.2169±0.0005 0.4583±0.0005 0.1084±0.0003
k = 5: 0.5225±0.0023 0.2143±0.0022 0.0839±0.0034 0.1350±0.0035 0.0841±0.0023 0.9447±0.0003 0.3903±0.0020 0.0494±0.0015 0.2842±0.0011 0.1638±0.0006 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0924±0.0002
kmax = 5: 0.5225±0.0023 0.2064±0.0021 0.0839±0.0034 0.0126±0.0019 0.0387±0.0021 0.9361±0.0004 0.3903±0.0020 0.0494±0.0015 0.2048±0.0012 0.1601±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0924±0.0002
bestk: 4 2 5 2 4 3 5 5 4 4 5 5 3.8 5 5 5
SKDBk:
k = 1: 0.6501±0.0012 0.2054±0.0014 0.1125±0.0025 0.0207±0.0020 0.0746±0.0012 0.9446±0.0003 0.4704±0.0016 0.1227±0.0014 0.4987±0.0005 0.1540±0.0007 0.3276±0.0007 0.0606±0.0005 0.0082±0.0009 0.3751±0.0005 0.5485±0.0002 0.0527±0.0002
k = 2: 0.5840±0.0020 0.2021±0.0021 0.0990±0.0029 0.0123±0.0018 0.0574±0.0011 0.9393±0.0002 0.4576±0.0013 0.0782±0.0017 0.4055±0.0007 0.1510±0.0007 0.2721±0.0005 0.0497±0.0008 0.0062±0.0007 0.2922±0.0004 0.5065±0.0004 0.0523±0.0002
k = 3: 0.5485±0.0020 0.2056±0.0016 0.0928±0.0026 0.0144±0.0021 0.0439±0.0021 0.9361±0.0004 0.4399±0.0015 0.0596±0.0020 0.2847±0.0011 0.1508±0.0007 0.2290±0.0004 0.0485±0.0008 0.0060±0.0006 0.2500±0.0005 0.4769±0.0004 0.0523±0.0002
k = 4: 0.5225±0.0023 0.2041±0.0015 0.0877±0.0026 0.0386±0.0019 0.0376±0.0026 0.9382±0.0005 0.4126±0.0019 0.0517±0.0020 0.1868±0.0012 0.1504±0.0007 0.1937±0.0004 0.0478±0.0008 0.0060±0.0006 0.2169±0.0005 0.4583±0.0005 0.0524±0.0002
k = 5: 0.5225±0.0023 0.2070±0.0015 0.0839±0.0033 0.0552±0.0032 0.0844±0.0025 0.9447±0.0004 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1505±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0526±0.0002
kmax = 5: 0.5225±0.0023 0.2021±0.0021 0.0839±0.0033 0.0123±0.0018 0.0376±0.0026 0.9361±0.0004 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1504±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0523±0.0002
bestk: 4 2 5 2 4 3 5 5 4 4 5 5 3.8 5 5 3
# of atts.: 5 19.2 647.5 337.9 275.6 90 54 215.8 5 22 54 37 11 773.6 138 16
SKDBk (# of attributes selected):
k = 1: 5 7 584.9 354.9 17 80.5 45.8 59 10 11 54 38.2 11 772.2 75 13
k = 2: 5 19.2 646.5 337.9 87 90 54 206.7 10 12 54 38.1 11 773.1 125.8 14.9
k = 3: 5 18 648.1 356.1 271.3 90 54 209.6 5 22 54 37.6 11 774.2 138 16
k = 4: 5 18 640.9 21.3 275.6 78.8 54 214.3 5 22 54 37.9 11 774.2 138 14
k = 5: 5 18 647.5 18 359.3 81.2 54 215.8 5 22 54 37 11 773.6 138 13
n: 5 41 676 361 361 90 54 361 10 67 54 41 11 784 138 141
Bayes Net (k = 5): 0.5217±0.0023 0.2009±0.0016 0.0920±0.0023 0.0379±0.0136 0.0354±0.0063 0.9369±0.0004 0.4231±0.0016 >138G 0.2821±0.0011 0.1573±0.0008 0.1615±0.0006 0.0477±0.0007 0.0059±0.0005 >138G >138G >138G
Random Forest (100 trees): 0.5194±0.0024 0.1992±0.0013 0.0338±0.0010 0.0353±0.0017 0.0566±0.0012 >138G 0.3478±0.0018 0.0888±0.0006 0.3190±0.0028 0.1573±0.0008 0.0962±0.0012 0.0466±0.0008 0.0060±0.0007 >138G >138G >138G
Random Forest (Num, 100 trees): 0.4199±0.0017 0.1906±0.0014 0.0291±0.0007 0.0262±0.0016 0.0395±0.0010 >138G 0.1998±0.0011 0.0465±0.0004 0.3190±0.0028 0.1573±0.0008 >600h 0.0298±0.0003 0.0029±0.0006 >138G >138G >138G
SKDB0-1 (kmax = 5): 0.5225±0.0023 0.2045±0.0015 0.0839±0.0033 0.0123±0.0018 0.0376±0.0026 0.9447±0.0003 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1509±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1839±0.0005 0.4448±0.0004 0.0523±0.0002
SKDB (MDL disc.): 0.5252±0.0036 0.1923±0.0025 0.0837±0.0023 0.0147±0.0020 0.0370±0.0026 0.9430±0.0005 0.2595±0.0029 0.1810±0.0048 0.1868±0.0012 0.1504±0.0007 0.0377±0.0029 0.0326±0.0005 0.0037±0.0009 0.1449±0.0007 0.4460±0.0006 0.0523±0.0002
m (millions): 0.1649 0.2993 0.3415 0.4741 0.4894 0.5153 0.5811 0.8393 1.025 2.4583 3.8505 5.2095 5.7491 8.1 8.7052 50
c: 11 2 9 2 2 90 7 2 10 4 19 40 2 10 24 2

Table 8: Results in terms of RMSE for NB, TAN, AODE, KDB, Bayes Net, RF, SKDBk and SKDB classifiers.

learner args localiz.
Censusincome USPS MITA MITB MSDY covtype MITC poker uscensus PAMAP kddcup linkage mnist8 satellite splice

NB: 0.5449±0.0026 0.2410±0.0017 0.0532±0.0012 0.0100±0.0006 0.0199±0.0004 0.9514±0.0005 0.3536±0.0025 0.0582±0.0010 0.4988±0.0018 0.0896±0.0003 0.2365±0.0007 0.0361±0.0005 0.0002±0.0000 0.2521±0.0005 0.4425±0.0002 0.0121±0.0001
TAN: 0.4367±0.0033 0.0675±0.0016 0.0149±0.0006 0.0008±0.0001 0.0161±0.0005 0.9268±0.0010 0.3151±0.0016 0.0455±0.0008 0.3295±0.0015 0.0390±0.0005 0.1171±0.0005 0.0034±0.0001 0.0001±0.0000 0.1327±0.0007 0.3240±0.0004 0.0133±0.0001
AODE: 0.433±0.0027 0.1106±0.0015 0.0244±0.0008 0.0104±0.0007 0.0294±0.0008 0.9281±0.0013 0.2859±0.0016 0.0254±0.0005 0.4812±0.0028 0.0532±0.0004 0.1654±0.0007 0.0154±0.0002 0.0002±0.0000 0.1271±0.0004 0.3537±0.0004 0.0134±0.0001
KDB:
k = 1: 0.4642±0.0040 0.0667±0.0014 0.0142±0.0006 0.0005±0.0001 0.0183±0.0009 0.9279±0.0008 0.2968±0.0034 0.0406±0.0009 0.3291±0.0012 0.0383±0.0004 0.1209±0.0006 0.0048±0.0001 0.0001±0.0000 0.1500±0.0004 0.3321±0.0005 0.0114±0.0001
k = 2: 0.3710±0.0037 0.0562±0.0013 0.0110±0.0006 0.0002±0.0001 0.0043±0.0002 0.9239±0.0008 0.2799±0.0025 0.0088±0.0003 0.1961±0.0009 0.0333±0.0004 0.0850±0.0003 0.0028±0.0001 0.0000±0.0000 0.0919±0.0001 0.2903±0.0006 0.0116±0.0001
k = 3: 0.3338±0.0023 0.0556±0.0010 0.0097±0.0005 0.0002±0.0001 0.0023±0.0002 0.9187±0.0009 0.2641±0.0020 0.0053±0.0003 0.0838±0.0007 0.0331±0.0004 0.0609±0.0002 0.0026±0.0001 0.0000±0.0000 0.0682±0.0002 0.2613±0.0003 0.0127±0.0001
k = 4: 0.3064±0.0036 0.0547±0.0008 0.0087±0.0005 0.0028±0.0003 0.0016±0.0002 0.9131±0.0012 0.2337±0.0023 0.0040±0.0002 0.0486±0.0007 0.0297±0.0003 0.0438±0.0002 0.0026±0.0001 0.0000±0.0000 0.0519±0.0002 0.2426±0.0005 0.0145±0.0001
k = 5: 0.3064±0.0036 0.0547±0.0010 0.0080±0.0006 0.0188±0.0010 0.0073±0.0004 0.9124±0.0006 0.2077±0.0023 0.0026±0.0002 0.0877±0.0008 0.0313±0.0002 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0104±0.0001
SKDBonlyk (kmax = 5): 0.3064±0.0036 0.0562±0.0013 0.0080±0.0006 0.0028±0.0003 0.0199±0.0004 0.9187±0.0009 0.2077±0.0023 0.0582±0.0010 0.0486±0.0007 0.0297±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDBk:
k = 1: 0.4642±0.0040 0.0553±0.0009 0.0142±0.0007 0.0005±0.0001 0.0071±0.0002 0.9276±0.0010 0.2961±0.0021 0.0185±0.0004 0.3291±0.0012 0.0293±0.0004 0.1209±0.0006 0.0048±0.0001 0.0001±0.0000 0.1500±0.0004 0.3391±0.0003 0.0029±0.0000
k = 2: 0.3710±0.0037 0.0542±0.0013 0.0110±0.0006 0.0002±0.0000 0.0040±0.0002 0.9239±0.0008 0.2799±0.0025 0.0069±0.0003 0.1961±0.0009 0.0271±0.0003 0.0850±0.0003 0.0028±0.0001 0.0000±0.0000 0.0919±0.0002 0.2907±0.0022 0.0029±0.0000
k = 3: 0.3338±0.0023 0.0547±0.0009 0.0097±0.0005 0.0002±0.0001 0.0021±0.0002 0.9187±0.0009 0.2641±0.0020 0.0040±0.0003 0.0835±0.0008 0.0268±0.0003 0.0609±0.0002 0.0026±0.0001 0.0000±0.0000 0.0682±0.0002 0.2613±0.0003 0.0029±0.0000
k = 4: 0.3064±0.0036 0.0532±0.0008 0.0087±0.0005 0.0018±0.0002 0.0015±0.0002 0.9144±0.0011 0.2337±0.0023 0.0030±0.0002 0.0318±0.0008 0.0266±0.0003 0.0438±0.0002 0.0026±0.0001 0.0000±0.0000 0.0519±0.0002 0.2426±0.0005 0.0029±0.0000
k = 5: 0.3064±0.0036 0.0536±0.0009 0.0080±0.0006 0.0038±0.0005 0.0074±0.0004 0.9135±0.0008 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0266±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDB (kmax = 5): 0.3064±0.0036 0.0542±0.0013 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9187±0.0009 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0266±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDB0-1 (kmax = 5): 0.3064±0.0036 0.0533±0.0011 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9124±0.0006 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0261±0.0002 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
Bayes Net (k = 5): 0.3053±0.0039 0.0533±0.0012 0.0097±0.0006 0.0017±0.0006 0.0014±0.0004 0.9204±0.0015 0.2442±0.0020 >138G 0.0831±0.0010 0.0288±0.0003 0.0308±0.0003 0.0026±0.0001 0.0000±0.0000 >138G >138G >138G
Random Forest (100 trees): 0.3061±0.0036 0.0527±0.0009 0.0009±0.0001 0.0018±0.0002 0.0031±0.0003 >138G 0.1157±0.0013 0.0015±0.0002 0.0527±0.0013 0.0266±0.0003 0.0046±0.0001 0.0025±0.0001 0.0000±0.0000 >138G >138G >138G
Random Forest (Num, 100 trees): 0.2030±0.0014 0.0481±0.0011 0.0005±0.0001 0.0009±0.0002 0.0013±0.0001 >138G 0.0405±0.0006 0.0004±0.0000 0.0527±0.0013 0.0266±0.0003 >138G 0.0014±0.0000 0.0000±0.0000 >138G >138G >138G
SKDB (MDL disc.): 0.3013±0.0048 0.0486±0.0015 0.0078±0.0005 0.0002±0.0000 0.0015±0.0002 0.9129±0.0006 0.0824±0.0019 0.0035±0.0002 0.0318±0.0008 0.0266±0.0003 0.0016±0.0001 0.0017±0.0001 0.0000±0.0000 0.0229±0.0002 0.2175±0.0005 0.0029±0.0000
n: 5 41 676 361 361 90 54 361 10 67 54 41 11 784 138 141
m (millions): 0.1649 0.2993 0.3415 0.4741 0.4894 0.5153 0.5811 0.8393 1.025 2.4583 3.8505 5.2095 5.7491 8.1 8.7052 50
c: 11 2 9 2 2 90 7 2 10 4 19 40 2 10 24 2

Table 9: Results in terms of 0-1 loss for NB, TAN, AODE, KDB, Bayes Net, RF, SKDBk and SKDB classifiers.

learner args localiz.
Censusincome USPS MITA MITB MSDY covtype MITC poker uscensus PAMAP kddcup linkage mnist8 satellite splice

RMSE:
SKDB (kmax = 5): 0.5225±0.0023 0.2021±0.0021 0.0839±0.0033 0.0123±0.0018 0.0376±0.0026 0.9361±0.0004 0.3903±0.0020 0.0446±0.0020 0.1868±0.0012 0.1504±0.0007 0.1697±0.0004 0.0477±0.0008 0.0060±0.0006 0.1449±0.0007 0.4448±0.0004 0.0523±0.0002
VW -p3: 0.6249±0.0006 0.2709±0.0048 0.2156±0.0011 0.0326±0.0076 0.0696±0.0069 0.7101±0.0001 0.5227±0.0010 0.1361±0.0013 0.5355±0.0006 0.2038±0.0014 0.5309±0.0006 0.1240±0.0020 0.0184±0.0038 0.4176±0.0009 0.5107±0.0006 0.2085±0.0045
VW -p20: 0.6243±0.0006 0.2662±0.0028 0.2157±0.0008 0.0266±0.0035 0.0714±0.0047 0.7093±0.0001 0.5116±0.0009 0.1366±0.0008 0.5355±0.0005 0.2018±0.0011 0.5217±0.0006 0.1144±0.0025 0.0151±0.0032 0.4173±0.0008 0.5024±0.0006 0.2115±0.0038
VW -p3 -q: 0.5955±0.0009 0.2426±0.0080 0.0402±0.0006 0.0277±0.0072 0.0403±0.0058 0.7276±0.0005 0.5018±0.0018 0.0518±0.0042 0.3679±0.0019 0.1938±0.0040 0.3633±0.0010 0.2426±0.0080 0.0103±0.0029 0.2884±0.0063 0.4505±0.0015 0.0855±0.0082
VW -p3 -c: 0.6217±0.0006 0.2681±0.0047 0.1894±0.0014 0.0319±0.0058 0.0645±0.0065 0.7088±0.0001 0.5177±0.0012 0.1232±0.0022 0.5342±0.0006 0.2023±0.0015 0.4689±0.0007 0.1123±0.0026 0.0079±0.0017 0.4173±0.0009 0.5031±0.0007 0.1964±0.0045
VW -p3 -q -l: 0.7357±0.0021 0.1895±0.0020 0.0190±0.0074 0.0238±0.0021 0.0303±0.0016 0.9273±0 0.5467±0.0020 0.0308±0.0016 0.3545±0.0018 0.1482±0.0009 0.3920±0.0017 0.0519±0.0007 0.0043±0.0006 0.2773±0.0039 >400h >400h

0-1 loss:
SKDB0-1 (kmax = 5): 0.3064±0.0036 0.0533±0.0011 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9124±0.0006 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0261±0.0002 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
SKDB (kmax = 5): 0.3064±0.0036 0.0542±0.0013 0.0080±0.0006 0.0002±0.0000 0.0015±0.0002 0.9187±0.0009 0.2077±0.0023 0.0022±0.0002 0.0318±0.0008 0.0266±0.0003 0.0340±0.0002 0.0026±0.0001 0.0000±0.0000 0.0378±0.0002 0.2284±0.0004 0.0029±0.0000
VW -p3: 0.6036±0.0041 0.0495±0.0012 0.0311±0.0011 0.0006±0.0003 0.0031±0.0005 0.9192±0.0012 0.3688±0.0029 0.0117±0.0006 0.4989±0.0015 0.0290±0.0003 0.3488±0.0013 0.0040±0.0002 0.0000±0.0000 0.1344±0.0006 0.3053±0.0013 0.0029±0.0000
VW -p20: 0.6025±0.0036 0.0490±0.0011 0.0307±0.0009 0.0005±0.0002 0.0028±0.0003 0.9173±0.0013 0.3530±0.0023 0.0115±0.0004 0.4988±0.0015 0.0283±0.0003 0.3316±0.0009 0.0035±0.0001 0.0000±0.0000 0.1341±0.0006 0.2970±0.0010 0.0029±0
VW -p3 -q: 0.5254±0.0050 0.0502±0.0020 0.0000±0.0000 0.0003±0.0001 0.0007±0.0004 0.9198±0.0012 0.3313±0.0032 0.0007±0.0003 0.0740±0.0007 0.0255±0.0004 0.0980±0.0006 0.0020±0.0001 0.0000±0.0000 0.0440±0.0010 0.2341±0.0015 0.0027±0.0001
VW -p3 -c: 0.5932±0.0039 0.0493±0.0011 0.0193±0.0009 0.0005±0.0002 0.0028±0.0005 0.9162±0.0013 0.3619±0.0029 0.0091±0.0005 0.5015±0.0026 0.0285±0.0003 0.2286±0.0011 0.0024±0.0001 0.0000±0.0000 0.1342±0.0006 0.2947±0.0010 0.0029±0.0000
VW -p3 -q -l: 0.5452±0.0051 0.0475±0.0011 0.0004±0.0005 0.0003±0.0001 0.0007±0.0002 0.9205±0 0.3418±0.0030 0.0009±0.0001 0.0746±0.0008 0.0250±0.0003 0.1328±0.0005 0.0030±0.0001 0.0000±0.0000 0.0565±0.0005 >400h >400h
m (millions): 0.1649 0.2993 0.3415 0.4741 0.4894 0.5153 0.5811 0.8393 1.025 2.4583 3.8505 5.2095 5.7491 8.1 8.7052 50
c: 11 2 9 2 2 90 7 2 10 4 19 40 2 10 24 2

Table 10: Results in terms of RMSE and 0-1 loss for SKDB and VW with different options: -p indicates the number of passes (3 or 20), -q indicates quadratic features (all possible combinations), -c refers to cubic features (all possible combinations), and -l the use of a logistic function for optimization. SKDB0-1 is SKDB optimized for 0-1 loss. * Results shown for 1 experiment only due to time limitations when using the logistic function in VW. ** Results shown for 1 fold only due to time limitations when using the logistic function in VW.

where π_{x_i} are the parent values of X_i and m = 1.
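The m-estimate of Equation (4), combined with the back-off over the distribution tree described next, might be sketched as follows. This is illustrative C++ under stated assumptions, not the authors' implementation: `jointCounts[d]` is assumed to hold counts of the attribute value with its first d parent values, and `parentCounts[d]` the counts of those parent values alone.

```cpp
#include <vector>

// Hypothetical sketch of Equation (4) with back-off: if the deepest joint
// count is zero, drop the last parent and try again, falling back as far as
// the shallowest level (d = 0), which is assumed to always have a count.
double mEstimate(const std::vector<long>& jointCounts,   // counts(x_i, parents 1..d)
                 const std::vector<long>& parentCounts,  // counts(parents 1..d)
                 int cardinality,                        // |X_i|
                 double m = 1.0) {
    // Find the deepest level whose joint count is non-zero.
    std::size_t level = jointCounts.size() - 1;
    while (level > 0 && jointCounts[level] == 0) --level;
    // m-estimate at the chosen level, as in Equation (4).
    return (jointCounts[level] + m / cardinality) / (parentCounts[level] + m);
}
```

For instance, with counts(x4, x0, x3) = 0 the loop discards the deepest level and the estimate is computed from counts(x4, x0) and counts(x0), matching the back-off rule described below.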
Secondly, if zero counts are found, we back off as many levels in the tree as necessary to find at least one count; that is, if counts(x4, x0, x3) is equal to zero, then p̂(x4|x0) is considered instead of p̂(x4|x0, x3). In principle, any loss function that can be computed incrementally is suitable for SKDB. We have chosen RMSE (as defined above) for our first set of experiments. We use Averaged One-Dependence Estimators (AODE) as a representative of the ANDE family of algorithms, with n = 1. In order to avoid an extensive number of passes over each of the datasets, no dataset-tailored parameter tuning has been carried out for any of the algorithms considered. However, as detailed in Section 3.2, for logistic regression we pre-conducted experiments with varying numbers of passes for stochastic gradient descent, varying loss functions and combinations of features, and used the default advanced (step-size) updates provided by Vowpal Wabbit. In the case of random forest, we have run experiments with and without pre-discretizing continuous attributes, as well as learning a variable number of trees (see Section 3.3.2 for more details).

Appendix C. Detailed Analysis of Results

The upper part of Table 8 (in the Appendix) shows the results in terms of RMSE ± standard deviation on the 16 datasets described above. We decided to stop at k = 5, since some of the datasets already show worse results for k = 5 (due to the increase in variance), so increasing the complexity further is not productive. The results for NB, TAN and AODE are displayed as baselines. A dark grey background on certain outputs for SKDBk corresponds to experiments that outperform KDB for the same value of k; a light grey background indicates a tie; the absence of either indicates that plain KDB outperforms SKDBk. For SKDB, the background coloring has the same meaning, but each entry is compared with the best KDB value over all possible k.
Note that only the average over the 10 folds is shown for each experiment; the background colors indicate statistical difference (Wilcoxon signed-rank tests), dark grey or white (in favor of or against the algorithm in the row, respectively), or lack of statistical difference (light grey). Table 8 also shows the number of attributes selected by SKDBk. There are a few cases for which the number of selected attributes for all values of k is the total number of attributes in the dataset, such as localization, PAMAP and linkage. For some other datasets, for example MSDYearPrediction, USPSExtended or mnist8m, the performance of selective KDB and KDB is similar, but fewer attributes are considered, so classification time (and space) is reduced. In most cases the number of selected attributes is smaller than the original, sometimes even less than half, while also reducing the RMSE compared to KDB; for example, MITFace Set C with 215.8 attributes selected (out of 361) obtains 0.0446 vs 0.0494 for KDB (k = 5), and uscensus with 22 attributes selected (out of 67) obtains 0.1504 vs 0.1601 for KDB (k = 4), a statistically significant improvement in both cases. The results for SKDB using MDL discretization are also shown in Table 8 (antepenultimate row). Note that there are some datasets, such as linkage, for which the discretization method used, in this case EF5, is not an adequate one, whereas a technique based on entropy minimization provides better results. Just the opposite occurs for datasets such as MITFace Set C.
A priori selection of the most appropriate discretization method is an open question, even more so for very large datasets, where we should consider the same bias/variance trade-off in terms of discretization bias and variance (Yang and Webb, 2009): on the one hand, we want the number of intervals to be large enough to provide low discretization bias, but not so large as to increase discretization variance. The latter should not be a problem in large datasets, but a large number of intervals implies more parameters, which has an impact on the classifier's complexity.

References

Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A Reliable Effective Terascale Linear Learning System. J. Mach. Learn. Res., 15(1):1111–1133, January 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2638571.

Kevin Bache and Moshe Lichman. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

Jock A. Blackard. Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types. PhD thesis, Fort Collins, CO, USA, 1998. AAI9921979.

Léon Bottou, Corinna Cortes, John S. Denker, Harris Drucker, Isabelle Guyon, Lawrence D. Jackel, Yann Le Cun, Urs A. Muller, Eduard Säckinger, Patrice Simard, and Vladimir Vapnik. Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition. In Proceedings of the 12th International Conference on Pattern Recognition, Los Alamitos, CA, volume 2, pages 77–82, 1994.

Remco R. Bouckaert. Voting Massive Collections of Bayesian Network Classifiers for Data Streams. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, AI '06, pages 243–252, Berlin, Heidelberg, 2006. Springer-Verlag.
Damien Brain and Geoffrey I. Webb. On the Effect of Data Set Size on Bias and Variance in Classification Learning. In D. Richards, G. Beydoun, A. Hoffmann, and P. Compton, editors, Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW '99), pages 117–128, Sydney, 1999. The University of New South Wales.

Leo Breiman. Bagging Predictors. Mach. Learn., 24(2):123–140, 1996.

Leo Breiman. Random Forests. Mach. Learn., 45(1):5–32, 2001.

Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 75:1–3, 1950.

Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, and Petri Myllymäki. Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood. J. Mach. Learn. Res., 12:2181–2210, 2011.

Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary Data Mining with Automatic Rule Generalization. In Recent Advances in Computers, Computing and Communications, pages 296–300, 2002.

C. K. Chow and C. N. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory, 14:462–467, 1968.

Gregory F. Cooper and Edward Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Mach. Learn., 9(4):309–347, 1992.

Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res., 7:1–30, 2006.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 12:2121–2159, 2011.

Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1st edition, 1973.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why Does Unsupervised Pre-training Help Deep Learning? J. Mach. Learn. Res., 11:625–660, 2010.

Usama M. Fayyad and Keki B. Irani.
Multi-interval Discretization of Continuous-valued Attributes for Classification Learning. In Ruzena Bajcsy, editor, IJCAI, pages 1022–1029. Morgan Kaufmann, 1993.

M. Julia Flores, José A. Gámez, Ana M. Martínez, and José M. Puerta. GAODE and HAODE: Two Proposals Based on AODE to Deal with Continuous Variables. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 313–320. ACM, 2009a.

M. Julia Flores, José A. Gámez, Ana M. Martínez, and José Miguel Puerta. HODE: Hidden One-Dependence Estimator. In Claudio Sossai and Gaetano Chemello, editors, ECSQARU, volume 5590 of Lecture Notes in Computer Science, pages 481–492. Springer, 2009b. ISBN 978-3-642-02905-9.

Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian Network Classifiers. Mach. Learn., 29(2-3):131–163, 1997.

João Gama, Ricardo Rocha, and Pedro Medas. Accurate Decision Trees for Mining High-speed Data Streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 523–528, New York, NY, USA, 2003. ACM.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 11(1):10–18, 2009.

G. Hardy. Correspondence. Insurance Record (1889). Transactions of the Faculty of Actuaries, 8, 1920.

Geoff Hulten and Pedro Domingos. Mining Complex Models from Arbitrarily Large Databases in Constant Time. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 525–531, New York, NY, USA, 2002. ACM.

Manfred Jaeger. Probabilistic Classifiers and the Concepts They Recognize. In Tom Fawcett and Nina Mishra, editors, ICML, pages 266–273. AAAI Press, 2003.

Boštjan Kaluža, Violeta Mirchevska, Erik Dovgan, Mitja Luštrek, and Matjaž Gams. An Agent-based Approach to Care in Independent Living.
In Proceedings of the First International Joint Conference on Ambient Intelligence, AmI '10, pages 177–186, Berlin, Heidelberg, 2010. Springer-Verlag.

Nikos Karampatziakis and John Langford. Online Importance Weight Aware Updates. In Fabio Gagliardi Cozman and Avi Pfeffer, editors, UAI, pages 392–399. AUAI Press, 2011.

Ron Kohavi. The Power of Decision Tables. In Nada Lavrac and Stefan Wrobel, editors, Proceedings of the European Conference on Machine Learning (ECML-95), volume 912 of Lecture Notes in Computer Science, pages 174–189. Springer Berlin Heidelberg, 1995.

Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. In Lorenza Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 284–292. Morgan Kaufmann Publishers, 1996.

Pat Langley and Stephanie Sage. Induction of Selective Bayesian Classifiers. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 399–406, San Francisco, CA, 1994. Morgan Kaufmann.

David D. Lewis. Naive Bayes at Forty: The Independence Assumption in Information Retrieval. In Proceedings of the 10th European Conference on Machine Learning, ECML '98, pages 4–15, London, UK, 1998. Springer-Verlag.

George James Lidstone. Note on the General Case of the Bayes-Laplace Formula for Inductive or a Posteriori Probabilities. Transactions of the Faculty of Actuaries, 8:182–192, 1920.

Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training Invariant Support Vector Machines using Selective Sampling. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007.

Oded Maron and Andrew Moore. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In Jack D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 59–66, San Francisco, CA, 1994.
Morgan Kaufmann.

B.W. Matthews. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, 1975.

H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Marvin Minsky. Steps Toward Artificial Intelligence. Proceedings of the Institute of Radio Engineers, 49:8–30, 1961.

Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

Andrew Moore and Mary Soon Lee. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets. J. Artif. Int. Res., 8(1):67–91, 1998.

Nikunj C. Oza and Stuart Russell. Experimental Comparisons of Online and Batch Versions of Bagging and Boosting. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 359–364, New York, NY, USA, 2001. ACM.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

Franz Pernkopf and Jeff A. Bilmes. Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers. J. Mach. Learn. Res., 11:2323–2360, 2010.

François Petitjean, Jordi Inglada, and Pierre Gançarski. Satellite Image Time Series Analysis under Time Warping. IEEE Transactions on Geoscience and Remote Sensing, 50(8):3081–3095, 2012.

Attila Reiss and Didier Stricker. Creating and Benchmarking a New Dataset for Physical Activity Monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, PETRA '12, pages 40:1–40:8, New York, NY, USA, 2012. ACM.

Stéphane Ross, Paul Mineiro, and John Langford. Normalized Online Learning. arXiv preprint arXiv:1305.6646, 2013.

Arcadio Rubio and José Antonio Gámez. Flexible Learning of K-dependence Bayesian Network Classifiers.
In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pages 1219–1226, New York, NY, USA, 2011. ACM.

Mehran Sahami. Learning Limited Dependence Bayesian Classifiers. In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 335–338. AAAI Press, 1996.

Murat Sariyar, Andreas Borg, and Klaus Pommerening. Controlling False Match Rates in Record Linkage Using Extreme Value Theory. J. of Biomedical Informatics, 44(4):648–654, 2011.

I. Schmidtmann, G. Hammer, M. Sariyar, and A. Gerhold-Ay. Evaluation des Krebsregisters NRW - Schwerpunkt Record Linkage - Abschlussbericht [Evaluation of the NRW Cancer Registry - Focus on Record Linkage - Final Report]. Technical report, Institut für medizinische Biometrie, Epidemiologie und Informatik, Universitätsmedizin Mainz, 2009.

Sören Sonnenburg and Vojtech Franc. COFFIN: A Computational Framework for Linear SVMs. In Proc. ICML 2010, 2010.

Jiang Su and Harry Zhang. Full Bayesian Network Classifiers. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 897–904, New York, NY, USA, 2006. ACM.

Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core Vector Machines: Fast SVM Training on Very Large Data Sets. J. Mach. Learn. Res., 6:363–392, 2005.

Geoffrey I. Webb, Janice R. Boughton, and Zhihai Wang. Not So Naive Bayes: Aggregating One-Dependence Estimators. Mach. Learn., 58(1):5–24, 2005.

Geoffrey I. Webb, Janice R. Boughton, Fei Zheng, Kai Ming Ting, and Houssam Salem. Learning by Extrapolation from Marginal to Full-multivariate Probability Distributions: Decreasingly Naive Bayesian Classification. Mach. Learn., 86(2):233–272, 2012.

Y. Yang, G.I. Webb, J. Cerquides, K. Korb, J. Boughton, and K.-M. Ting. To Select or To Weigh: A Comparative Study of Linear Combination Schemes for SuperParent One-Dependence Estimators. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(12):1652–1665, 2007.

Ying Yang and Geoffrey I. Webb.
Discretization for Naive-Bayes Learning: Managing Discretization Bias and Variance. Mach. Learn., 74(1):39–74, 2009.

Nayyar A. Zaidi, Jesús Cerquides, Mark James Carman, and Geoffrey I. Webb. Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting. J. Mach. Learn. Res., 14(1):1947–1988, 2013.

Harry Zhang, Liangxiao Jiang, and Jiang Su. Hidden Naive Bayes. In Proceedings of the 20th National Conference on Artificial Intelligence, Volume 2, pages 919–924. AAAI Press, 2005.

F. Zheng, G.I. Webb, P. Suraweera, and L. Zhu. Subsumption Resolution: An Efficient and Effective Technique for Semi-Naive Bayesian Learning. Mach. Learn., 87(1):93–125, 2012.

Fei Zheng and Geoffrey I. Webb. A Comparative Study of Semi-naive Bayes Methods in Classification Learning. In Proceedings of the Fourth Australasian Data Mining Conference (AusDM05), pages 141–156, 2005.

Zijian Zheng and Geoffrey I. Webb. Lazy Learning of Bayesian Rules. Mach. Learn., 41(1):53–84, 2000.