# Parsimonious Bayesian Deep Networks

Mingyuan Zhou, Department of IROM, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712. mingyuan.zhou@mccombs.utexas.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

## Abstract

Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.

## 1 Introduction

To separate two linearly separable classes, a simple linear classifier such as logistic regression will often suffice; in that scenario, adding the capability to model nonlinearity not only complicates the model and increases computation, but also often harms rather than improves the performance by increasing the risk of overfitting. On the other hand, for two classes not well separated by a single hyperplane, a linear classifier is often inadequate, and hence it is common to use either kernel support vector machines [1, 2] or deep neural networks [3-5] to nonlinearly transform the covariates, making the two classes more linearly separable in the transformed covariate space. While both are able to achieve high classification accuracy, they have clear limitations. For a kernel based classifier, the number of support vectors often increases linearly in the size of the training data [6], making it not only computationally expensive and memory inefficient to train for big data, but also slow in out-of-sample prediction. A deep neural network can be made scalable with an appropriate network structure, but it is often cumbersome to tune the network depth (number of layers) and the width (number of hidden units) of each hidden layer [5], and a deep neural network, which is often equipped with a larger than necessary modeling capacity, risks overfitting the training data if it is not carefully regularized.

Rather than making an uneasy choice in the first place between a linear classifier, which has fast computation and resists overfitting but may not provide sufficient class separation, and an over-capacitized model, which often wastes computation and requires careful regularization to prevent overfitting, we propose a parsimonious Bayesian deep network (PBDN) that builds its capacity regularization into the greedy layer-wise construction and training of the deep network. More specifically, we transform the covariates in a layer-wise manner, with each layer of transformation designed to facilitate class separation via the use of the noisy-OR interactions of multiple weighted linear hyperplanes.
Related to kernel support vector machines, the hyperplanes play a similar role as support vectors in transforming the covariate space, but they are inferred from the data and their number increases at a much slower rate with the training data size. Related to deep neural networks, the proposed multi-layer structure gradually increases its modeling capability by increasing its number of layers, but allows inferring from the data both the width of each hidden layer and the depth of the network, to prevent building a model whose capacity is larger than necessary.

To obtain a parsimonious deep neural network, one may also consider using a two-step approach that first trains an over-capacitized model and then compresses its size [7-10]. The design of PBDN represents a distinct philosophy. Moreover, PBDN does not contradict model compression, as these post-processing compression techniques [7-10] may be used to further compress PBDN.

For capacity regularization of the proposed PBDN, we choose to shrink both its width and depth. To shrink the width of a hidden layer, we propose the use of a gamma process [11], a draw from which consists of countably infinite atoms, each of which is used to represent a hyperplane in the covariate space. The gamma process has an inherent shrinkage mechanism, as its number of atoms whose random weights are larger than a certain positive constant $\epsilon > 0$ follows a Poisson distribution, whose mean is finite almost surely (a.s.) and reduces towards zero as $\epsilon$ increases. To shrink the depth of the network, we propose a layer-wise greedy-learning strategy that increases the depth by adding one hidden layer at a time, and uses an appropriate model selection criterion to decide when to stop adding another one. Note that, related to our work, Zhou et al. [12, 13] combine the gamma process and greedy layer-wise training to build a Bayesian deep network in an unsupervised manner.

Our experiments show the proposed capacity regularization strategy helps successfully build a PBDN, providing state-of-the-art classification accuracy while maintaining low computational complexity for out-of-sample prediction. We have also tried applying a highly optimized off-the-shelf deep neural network based classifier, whose network architecture for a given dataset is set to be the same as that inferred by the PBDN. However, we have found no performance gains, suggesting the efficacy of the PBDN's greedy training procedure that requires neither cross-validation nor fine-tuning. Note that PBDN, like a conventional deep neural network, could also be further improved by introducing convolutional layers if the covariates have spatial, temporal, or other types of structure.

## 2 Layer-width learning via infinite support hyperplane machines

The first essential component of the proposed capacity regularization strategy is to learn the width of a hidden layer. To fulfill this goal, we define the infinite support hyperplane machine (iSHM), a label-asymmetric classifier that places in the covariate space countably infinite hyperplanes $\{\beta_k\}_{1,\infty}$, where each $\beta_k \in \mathbb{R}^{V+1}$ is associated with a weight $r_k > 0$. We use a gamma process [11] to generate $\{r_k, \beta_k\}_{1,\infty}$, making the infinite sum $\sum_{k=1}^{\infty} r_k$ finite almost surely (a.s.).
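To make the shrinkage mechanism of the gamma process concrete, the following is a minimal numpy sketch (an illustration, not the released PBDN code) that draws the hyperplane weights under the finite truncation used later in Section 2.4, $r_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c_0)$ with a large truncation level $K$; the threshold `eps` and the parameter values are arbitrary choices for this illustration. Most of the $K$ weights come out negligibly small, so only a handful of hyperplanes remain active.

```python
import numpy as np

rng = np.random.default_rng(0)
K, gamma_0, c_0 = 100, 1.0, 1.0                    # truncation level and gamma process hyperparameters

# numpy parameterizes the gamma by (shape, scale); here scale = 1 / c_0
r = rng.gamma(shape=gamma_0 / K, scale=1.0 / c_0, size=K)

eps = 1e-2
print("sum of weights r_k   :", r.sum())                   # finite, close to gamma_0 / c_0 in expectation
print("weights exceeding eps:", int((r > eps).sum()))      # only a few atoms are non-negligible
```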
We measure the proximity of a covariate vector $x_i \in \mathbb{R}^{V+1}$ to $\beta_k$ using the softplus function of their inner product, $\ln(1 + e^{\beta_k^\top x_i})$, which is a smoothed version of $\mathrm{ReLU}(\beta_k^\top x_i) = \max(0, \beta_k^\top x_i)$ that is widely used in deep neural networks [14-17]. We consider that $x_i$ is far from hyperplane $k$ if $\ln(1 + e^{\beta_k^\top x_i})$ is close to zero; as $x_i$ moves away from hyperplane $k$, this proximity measure monotonically increases on one side of the hyperplane while decreasing on the other. We pass $\lambda_i = \sum_{k=1}^{\infty} r_k \ln(1 + e^{\beta_k^\top x_i})$, a non-negative weighted combination of these proximity measures, through the Bernoulli-Poisson link [18], $f(\lambda_i) = 1 - e^{-\lambda_i}$, to define the conditional class probability as

$$P(y_i = 1 \mid \{r_k, \beta_k\}_k, x_i) = 1 - \prod_{k=1}^{\infty}(1 - p_{ik}), \qquad p_{ik} = 1 - e^{-r_k \ln(1 + e^{\beta_k^\top x_i})}. \tag{1}$$

Note that the model treats the data labeled as 0 and 1 differently: flipping the labels and retraining does not in general yield the complementary class probabilities. We will show that, as formalized in (2), the $p_{ik}$ can be used to represent the $k$th data subtype discovered by the algorithm.

### 2.1 Nonparametric Bayesian hierarchical model

One may readily notice from (1) that the noisy-OR construction, widely used in probabilistic reasoning [19-21], is generalized by iSHM to attribute a binary outcome of $y_i = 1$ to countably infinite hidden causes $p_{ik}$. Denoting $\vee$ as the logical OR operator, since $P(y = 0) = \prod_k P(b_k = 0)$ if $y = \vee_k b_k$ and $\mathbb{E}[e^{-\theta}] = e^{-r\ln(1+e^{x})}$ if $\theta \sim \mathrm{Gamma}(r, e^{x})$, we have an augmented form of (1) as

$$y_i = \vee_{k=1}^{\infty} b_{ik}, \quad b_{ik} \sim \mathrm{Bernoulli}(p_{ik}), \quad p_{ik} = 1 - e^{-\theta_{ik}}, \quad \theta_{ik} \sim \mathrm{Gamma}(r_k, e^{\beta_k^\top x_i}), \tag{3}$$

where $b_{ik} \sim \mathrm{Bernoulli}(p_{ik})$ can be further augmented as $b_{ik} = \delta(m_{ik} \geq 1)$, $m_{ik} \sim \mathrm{Pois}(\theta_{ik})$, where $m_{ik} \in \mathbb{Z}$, $\mathbb{Z} := \{0, 1, \ldots\}$, and $\delta(x)$ equals 1 if the condition $x$ is satisfied and 0 otherwise.

We now marginalize $b_{ik}$ out to formally define iSHM. Let $G \sim \Gamma\mathrm{P}(G_0, 1/c)$ denote a gamma process defined on the product space $\mathbb{R}_+ \times \Omega$, where $\mathbb{R}_+ = \{x : x > 0\}$, $c \in \mathbb{R}_+$, and $G_0$ is a finite and continuous base measure over a complete separable metric space $\Omega$. As illustrated in Fig. 2 (b) in the Appendix, given a draw from $G$, expressed as $G = \sum_{k=1}^{\infty} r_k \delta_{\beta_k}$, where $\beta_k$ is an atom and $r_k$ is its weight, the iSHM generates the label under the Bernoulli-Poisson link [18] as

$$y_i \mid G, x_i \sim \mathrm{Bernoulli}\!\left(1 - e^{-\sum_{k=1}^{\infty} r_k \ln(1 + e^{x_i^\top \beta_k})}\right), \tag{4}$$

which can be represented as a noisy-OR model as in (3) or, as shown in Fig. 2 (a), constructed as

$$y_i = \delta(m_i \geq 1), \quad m_i = \sum_{k=1}^{\infty} m_{ik}, \quad m_{ik} \sim \mathrm{Pois}(\theta_{ik}), \quad \theta_{ik} \sim \mathrm{Gamma}(r_k, e^{\beta_k^\top x_i}). \tag{5}$$

From (3) and (5), it is clear that one may declare hyperplane $k$ as inactive if $\sum_i m_{ik} = 0$.

### 2.2 Inductive bias and distinction from multilayer perceptron

Below we reveal the inductive bias of iSHM in prioritizing the fit of the data labeled as 1, due to the use of the Bernoulli-Poisson link that has previously been applied for network analysis [18, 22, 23] and multi-label learning [24]. As the negative log-likelihood (NLL) for $x_i$ can be expressed as

$$\mathrm{NLL}(x_i) = -y_i \ln\!\left(1 - e^{-\lambda_i}\right) + (1 - y_i)\lambda_i, \qquad \lambda_i = \sum_{k=1}^{\infty} r_k \ln(1 + e^{x_i^\top \beta_k}),$$

we have $\mathrm{NLL}(x_i) = \lambda_i - \ln(e^{\lambda_i} - 1)$ if $y_i = 1$ and $\mathrm{NLL}(x_i) = \lambda_i$ if $y_i = 0$. As $\lambda_i - \ln(e^{\lambda_i} - 1)$ quickly explodes towards $+\infty$ as $\lambda_i \to 0$, when $y_i = 1$, iSHM would adjust $r_k$ and $\beta_k$ to avoid at all cost overly suppressing $x_i$ (i.e., making $\lambda_i$ too small). By contrast, it has a high tolerance of failing to sufficiently suppress an $x_i$ with $y_i = 0$. Thus each $x_i$ with $y_i = 1$ would be made sufficiently close to at least one active support hyperplane. By contrast, while each $x_i$ with $y_i = 0$ is desired to be far away from any support hyperplanes, violating that is typically not strongly penalized.
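To make this asymmetry concrete, here is a minimal numpy sketch (an illustration under a finite truncation, not the authors' released code) that computes $\lambda_i$ from the softplus proximities and weights as in (1), and evaluates the Bernoulli-Poisson negative log-likelihood for both label values at a few activation levels; the numbers are arbitrary.

```python
import numpy as np

def neg_log_lik(lam, y):
    """Bernoulli-Poisson NLL: -y*ln(1 - exp(-lam)) + (1 - y)*lam."""
    return -y * np.log1p(-np.exp(-lam)) + (1 - y) * lam

rng = np.random.default_rng(0)
betas = rng.normal(size=(5, 3))                     # K = 5 hyperplanes in a 2-D covariate space + bias
r = rng.gamma(0.2, 1.0, size=5)                     # non-negative hyperplane weights
x = np.concatenate(([1.0], rng.normal(size=2)))     # bias term plus covariates

lam = np.sum(r * np.logaddexp(0.0, betas @ x))      # lambda_i = sum_k r_k * softplus(beta_k^T x_i)
print("P(y=1 | x) =", 1.0 - np.exp(-lam))           # equivalent to the noisy-OR form in Eq. (1)

for lam_small in (1e-3, 0.1, 1.0):
    print(lam_small, neg_log_lik(lam_small, y=1), neg_log_lik(lam_small, y=0))
    # as lambda -> 0 the y = 1 loss blows up, while the y = 0 loss only shrinks
```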
Therefore, by training a pair of iSHMs under two opposite labeling settings, two sets of support hyperplanes can be inferred to sufficiently cover the covariate space occupied by the training data from both classes.

Note that, as in (4), iSHM may be viewed as an infinite-wide single-hidden-layer neural network that connects the input layer to the $k$th hidden unit via the connection weights $\beta_k$ and the softplus nonlinear activation function $\ln(1 + e^{\beta_k^\top x_i})$, and further passes a non-negative weighted combination of these hidden units through the Bernoulli-Poisson link to obtain the conditional class probability. From this point of view, it can be related to a single-hidden-layer multilayer perceptron (MLP) [5, 25] that uses a softplus activation function and cross-entropy loss, with the output activation expressed as $\sigma[w^\top \ln(1 + e^{B x_i})]$, where $\sigma(x) = 1/(1 + e^{-x})$, $K$ is the number of hidden units, $B = (\beta_1, \ldots, \beta_K)^\top$, and $w = (w_1, \ldots, w_K)^\top \in \mathbb{R}^K$. Note that minimizing the cross-entropy loss is equivalent to maximizing the likelihood of $y_i \mid w, B, x_i \sim \mathrm{Bernoulli}\big((1 + e^{-\sum_{k=1}^{K} w_k \ln(1 + e^{x_i^\top \beta_k})})^{-1}\big)$, which is biased towards fitting neither the data with $y_i = 1$ nor those with $y_i = 0$, since $\mathrm{NLL}(x_i) = \ln\big(e^{-y_i w^\top \ln(1 + e^{B x_i})} + e^{(1 - y_i) w^\top \ln(1 + e^{B x_i})}\big)$. Therefore, while iSHM is structurally similar to an MLP, it is distinct in its unbounded layer width, its positivity constraint on the weights $r_k$ connecting the hidden and output layers, its ability to rigorously define whether a hyperplane is active or inactive, and its inductive bias towards fitting the data labeled as 1. As in practice labeling which class as 1 may be arbitrary, we predict the class label with

$$\tfrac{1}{2}\Big(1 - e^{-\sum_{k=1}^{\infty} r_k \ln(1 + e^{x_i^\top \beta_k})} + e^{-\sum_{k=1}^{\infty} r'_k \ln(1 + e^{x_i^\top \beta'_k})}\Big),$$

where $\{r_k, \beta_k\}_{1,\infty}$ and $\{r'_k, \beta'_k\}_{1,\infty}$ are from a pair of iSHMs trained by labeling the data belonging to this class as 1 and 0, respectively.

### 2.3 Convex polytope geometric constraint

It is straightforward to show that iSHM with a single unit-weighted hyperplane reduces to logistic regression, $y_i \sim \mathrm{Bernoulli}[1/(1 + e^{-x_i^\top \beta})]$. To interpret the role of each individual support hyperplane when multiple non-negligibly weighted ones are inferred by iSHM, we analogize each $\beta_k$ to an expert of a committee that collectively makes binary decisions. For expert (hyperplane) $k$, the weight $r_k$ indicates how strongly its opinion is weighted by the committee, $b_{ik} = 0$ represents that it votes "No," and $b_{ik} = 1$ represents that it votes "Yes." Since $y_i = \vee_{k=1}^{\infty} b_{ik}$, the committee would vote "No" if and only if all its experts vote "No" (i.e., all $b_{ik}$ are zeros); in other words, the committee would vote "Yes" even if only a single expert votes "Yes."

Let us now examine the confined covariate space that satisfies the inequality $P(y_i = 1 \mid x_i) \leq p_0$, where a data point is labeled as 1 with a probability no greater than $p_0$. The following theorem shows that it defines a confined space bounded by a convex polytope, as defined by the intersection of the countably infinite half-spaces specified by $p_{ik} \leq p_0$.

Theorem 1 (Convex polytope). For iSHM, the confined space specified by the inequality

$$P(y_i = 1 \mid \{r_k, \beta_k\}_k, x_i) \leq p_0 \tag{6}$$

is bounded by a convex polytope defined by the set of solutions to countably infinite inequalities,

$$x_i^\top \beta_k \leq \ln\!\left((1 - p_0)^{-\frac{1}{r_k}} - 1\right), \quad k \in \{1, 2, \ldots\}. \tag{7}$$

The convex polytope defined in (7) is enclosed by the intersection of countably infinite half-spaces.
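The half-space description in (7) can be checked directly. Below is a minimal numpy sketch (an illustrative assumption, not the released code) that, for a point $x_i$ and a finite set of active hyperplanes, counts how many of the inequalities in (7) are violated; a count of zero means the point lies inside the convex polytope, while a nonzero count means it is predicted as class 1 with probability greater than $p_0$. The same count is what the contour maps in Fig. 1 visualize.

```python
import numpy as np

def num_violated_inequalities(x, betas, r, p0=0.5):
    """Count violations of Eq. (7). x: (V+1,) incl. bias; betas: (K, V+1); r: (K,) active weights."""
    thresholds = np.log((1.0 - p0) ** (-1.0 / r) - 1.0)   # right-hand side of Eq. (7), one per hyperplane
    return int(np.sum(betas @ x > thresholds))            # 0 means x is inside the convex polytope
```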
If we set $p_0 = 0.5$ as the probability threshold to make binary decisions, then the convex polytope assigns a label of $y_i = 0$ to an $x_i$ inside the convex polytope (i.e., an $x_i$ that satisfies all the inequalities in Eq. 7) with a relatively high probability, and assigns a label of $y_i = 1$ to an $x_i$ outside the convex polytope (i.e., an $x_i$ that violates at least one of the inequalities in Eq. 7) with a probability of at least 50%. Note that hyperplane $k$ with $r_k \to 0$ has a negligible impact on the conditional class probability. Choosing the gamma process as the nonparametric Bayesian prior sidesteps the need to tune the number of experts. It shrinks the weights of all unnecessary experts, allowing a finite number of non-negligibly weighted ones (support hyperplanes) to be automatically inferred from the data. We provide in Appendix B the connections to previously proposed multi-hyperplane models [26-30].

### 2.4 Gibbs sampling and MAP inference via SGD

For convenience of implementation, we truncate the gamma process with a finite and discrete base measure $G_0 = \sum_{k=1}^{K} \frac{\gamma_0}{K} \delta_{\beta_k}$, where $K$ will be set sufficiently large to approximate the truly countably infinite model. We express iSHM using (5) together with

$$r_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c_0), \quad \gamma_0 \sim \mathrm{Gamma}(a_0, 1/b_0), \quad c_0 \sim \mathrm{Gamma}(e_0, 1/f_0),$$
$$\beta_{vk} \sim \int N(0, \alpha_{vk}^{-1})\,\mathrm{Gamma}(\alpha_{vk}; a_\beta, 1/b_{\beta k})\, d\alpha_{vk}, \ \ v = 0, \ldots, V, \qquad b_{\beta k} \sim \mathrm{Gamma}(e_0, 1/f_0),$$

where the normal-gamma construction promotes sparsity on the connection weights $\beta_k$ [31]. We describe both Gibbs sampling, desirable for uncertainty quantification, and maximum a posteriori (MAP) inference, suitable for large-scale training, in Algorithm 1. We use data augmentation and marginalization to derive Gibbs sampling, with the details deferred to Appendix B. For MAP inference, we use Adam [32] in TensorFlow to minimize a stochastic objective function $f(\{\beta_k, \ln r_k\}_{1}^{K}, \{y_i, x_i\}_{i_1}^{i_M}) + f(\{\beta'_k, \ln r'_k\}_{1}^{K}, \{y'_i, x_i\}_{i_1}^{i_M})$, which embeds the hierarchical Bayesian model's inductive bias and inherent shrinkage mechanism into optimization, where $M$ is the size of a randomly selected mini-batch, $y'_i := 1 - y_i$, $\lambda_i := \sum_{k=1}^{K} e^{\ln r_k} \ln(1 + e^{x_i^\top \beta_k})$, and

$$f(\{\beta_k, \ln r_k\}_{1}^{K}, \{y_i, x_i\}_{i_1}^{i_M}) = \sum_{k=1}^{K}\Big[-\tfrac{\gamma_0}{K} \ln r_k + c_0 e^{\ln r_k}\Big] + \big(a_\beta + \tfrac{1}{2}\big) \sum_{v=0}^{V} \sum_{k=1}^{K} \ln\!\Big(1 + \tfrac{\beta_{vk}^2}{2 b_{\beta k}}\Big) + \tfrac{N}{M} \sum_{i=i_1}^{i_M}\Big[-y_i \ln\!\big(1 - e^{-\lambda_i}\big) + (1 - y_i)\lambda_i\Big]. \tag{8}$$

## 3 Network-depth learning via forward model selection

The second essential component of the proposed capacity regularization strategy is to find a way to increase the network depth and determine how deep is deep enough. Our solution is to sequentially stack a pair of iSHMs on top of the previously trained one, and to develop a forward model selection criterion to decide when to stop stacking another pair. We refer to the resulting model as the parsimonious Bayesian deep network (PBDN), as described below in detail.

The noisy-OR hyperplane interactions allow iSHM to go beyond simple linear separation, but with limited capacity due to the convex-polytope constraint imposed on the decision boundary. On the other hand, it is the convex-polytope constraint that provides an implicit regularization, determining how many non-negligibly weighted support hyperplanes are necessary in the covariate space to sufficiently activate all data of class 1, while somewhat suppressing the data of class 0. In this paper, we find that the model capacity can be quickly enhanced by sequentially stacking such convex-polytope constraints under a feedforward deep structure, while preserving the virtue of being able to learn the number of support hyperplanes in the (transformed) covariate space.
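Returning briefly to the MAP objective (8) of Section 2.4, here is a minimal numpy sketch (an assumption for illustration; the paper's released implementation uses Adam in TensorFlow) that evaluates the per-mini-batch objective for one iSHM: the gamma prior term on $r_k$, the sparsity-promoting penalty on $\beta_{vk}$, and the rescaled Bernoulli-Poisson negative log-likelihood. The default hyperparameter values follow the MAP settings reported in Section 4, and treating $b_{\beta k}$ as a fixed scalar is a simplification. In the paired objective, the same function is evaluated a second time with the flipped labels $y'_i = 1 - y_i$ and its own parameters, and the two are summed.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def map_objective(beta, log_r, X, y, N, gamma_0=1.0, c_0=1.0, a_beta=1e-6, b_beta=1e-6):
    """Eq. (8) for one iSHM. beta: (K, V+1); log_r: (K,); X: (M, V+1) mini-batch incl. bias; y: (M,)."""
    K, M = beta.shape[0], X.shape[0]
    prior_r = np.sum(-(gamma_0 / K) * log_r + c_0 * np.exp(log_r))                # gamma prior on r_k
    prior_beta = (a_beta + 0.5) * np.sum(np.log1p(beta ** 2 / (2.0 * b_beta)))    # sparsity penalty on beta_vk
    lam = softplus(X @ beta.T) @ np.exp(log_r)                                    # lambda_i for the mini-batch
    nll = -y * np.log1p(-np.exp(-lam)) + (1.0 - y) * lam                          # Bernoulli-Poisson NLL
    return prior_r + prior_beta + (N / M) * np.sum(nll)
```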
More specifically, as shown in Fig. 2 (c) of the Appendix, we first train a pair of iSHMs that regress the current labels $y_i \in \{0, 1\}$ and the flipped ones $y'_i = 1 - y_i$, respectively, on the original covariates $x_i \in \mathbb{R}^{V}$. After obtaining $K_2$ support hyperplanes $\{\beta_k^{(1\to 2)}\}_{1,K_2}$, constituted by the active support hyperplanes inferred by both the iSHM trained with $y_i$ and the one trained with $y'_i$, we use $\ln(1 + e^{x_i^\top \beta_k^{(1\to 2)}})$ as the hidden units of the second layer (first hidden layer). More precisely, with $t \in \{1, 2, \ldots\}$, $K_0 := 0$, $K_1 := V$, $\tilde{x}^{(0)}_i := \emptyset$, and $\tilde{x}^{(1)}_i := x_i$, denoting $x^{(t)}_i := [1, (\tilde{x}^{(t-1)}_i)^\top, (\tilde{x}^{(t)}_i)^\top]^\top \in \mathbb{R}^{K_{t-1} + K_t + 1}$ as the input data vector to layer $t + 1$, the $t$th added pair of iSHMs transforms $x^{(t)}_i$ into the hidden units of layer $t + 1$, expressed as

$$\tilde{x}^{(t+1)}_i = \Big[\ln\big(1 + e^{(x^{(t)}_i)^\top \beta^{(t\to t+1)}_1}\big), \ldots, \ln\big(1 + e^{(x^{(t)}_i)^\top \beta^{(t\to t+1)}_{K_{t+1}}}\big)\Big]^\top.$$

Hence the input vectors used to train the next layer would be $x^{(t+1)}_i = [1, (\tilde{x}^{(t)}_i)^\top, (\tilde{x}^{(t+1)}_i)^\top]^\top \in \mathbb{R}^{K_t + K_{t+1} + 1}$. Therefore, if the computational cost of a single inner product $x_i^\top \beta_k$ (e.g., logistic regression) is one, then that for $T$ hidden layers would be about $\sum_{t=1}^{T} (K_{t-1} + K_t + 1) K_{t+1} / (V + 1)$. Note that one may also use $x^{(t+1)}_i = \tilde{x}^{(t+1)}_i$, or $x^{(t+1)}_i = [(x^{(t)}_i)^\top, (\tilde{x}^{(t+1)}_i)^\top]^\top$, or other related concatenation methods to construct the covariates to train the next layer.

Our intuition for why PBDN, constructed in this greedy layer-wise manner, works well is that for two iSHMs trained on the same covariate space under two opposite labeling settings, one iSHM places enough hyperplanes to define the complement of a convex polytope to sufficiently activate all data labeled as 1, while the other does so for all data labeled as 0. Thus, for any $x_i$, at least one $p_{ik}$ would be sufficiently activated; in other words, $x_i$ would be sufficiently close to at least one of the active hyperplanes of the iSHM pair. This mechanism prevents any $x_i$ from being completely suppressed after transformation. Consequently, these transformed covariates $\ln(1 + e^{\beta_k^\top x_i})$, which can also be concatenated with $x_i$, will be further used to train another iSHM pair. Thus, even though a single iSHM pair may not be powerful enough, by keeping all covariate vectors sufficiently activated after transformation, iSHM pairs can simply be stacked sequentially to gradually enhance the model capacity, with a strong resistance to overfitting and hence without the necessity of cross-validation.

While stacking an additional iSHM pair on PBDN could enhance the model capacity, once the network with $T$ hidden layers is sufficiently deep that the two classes become well separated, there is no need to add an extra iSHM pair. To detect when it is appropriate to stop adding another iSHM pair, as shown in Algorithm 2 of the Appendix, we consider a forward model selection strategy that sequentially stacks one iSHM pair after another, until the following criterion starts to rise:

$$\mathrm{AIC}(T) = \sum_{t=1}^{T} 2(K_t + 1)K_{t+1} + 2K_{T+1} - 2\sum_i \Big[\ln P\big(y_i \mid x^{(T)}_i\big) + \ln P\big(y'_i \mid x^{(T)}_i\big)\Big], \tag{9}$$

where $2(K_t + 1)K_{t+1}$ represents the cost of adding the $t$th hidden layer and $2K_{T+1}$ represents the cost of using the $K_{T+1}$ nonnegative weights $\{r^{(T+1)}_k\}_k$ to connect the $T$th hidden layer and the output layer. With (9), we choose the PBDN with $T$ hidden layers if $\mathrm{AIC}(t+1) \leq \mathrm{AIC}(t)$ for $t = 1, \ldots, T - 1$ and $\mathrm{AIC}(T+1) > \mathrm{AIC}(T)$.
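The following is a minimal sketch (an assumed outline, not the released code) of this greedy procedure: train an iSHM pair on the current covariates, score the deeper model with an AIC-style criterion in the spirit of (9), stop when the criterion rises, and otherwise transform the covariates with the newly inferred hyperplanes and continue. `train_ishm_pair` is a hypothetical helper standing in for Gibbs or SGD training of one iSHM pair, and the parameter-count penalty here simply uses the input width of the new layer rather than the exact $2(K_t + 1)K_{t+1}$ of (9).

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def grow_pbdn(X, y, train_ishm_pair, max_depth=10):
    """Greedy depth selection. X: (N, V) raw covariates; y: (N,) binary labels in {0, 1}."""
    N = X.shape[0]
    tilde_prev = np.empty((N, 0))        # tilde_x^{(t-1)}, empty for t = 1
    tilde_curr = X                       # tilde_x^{(t)},  the raw covariates for t = 1
    layers, best_aic, penalty = [], np.inf, 0.0
    for t in range(1, max_depth + 1):
        X_in = np.hstack([np.ones((N, 1)), tilde_prev, tilde_curr])    # x^{(t)} = [1, tilde_x^{(t-1)}, tilde_x^{(t)}]
        betas, r, log_lik = train_ishm_pair(X_in, y)                   # active hyperplanes of the t-th pair
        penalty += 2.0 * X_in.shape[1] * betas.shape[0]                # cost of adding the t-th hidden layer
        aic = penalty + 2.0 * betas.shape[0] - 2.0 * log_lik           # AIC-style criterion, cf. Eq. (9)
        if aic > best_aic:                                             # criterion starts to rise: stop here
            break
        best_aic = aic
        layers.append((betas, r))
        tilde_prev, tilde_curr = tilde_curr, softplus(X_in @ betas.T)  # hidden units become tilde_x^{(t+1)}
    return layers
```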
We also consider another model selection criterion that accounts for the sparsity of $\beta^{(t\to t+1)}_k$, the connection weights between adjacent layers, using

$$\mathrm{AIC}_\epsilon(T) = \sum_{t=1}^{T} 2\Big(\big\|\,|B_t| > \epsilon\,\beta^{\max}_t\big\|_0 + \big\|\,|B'_t| > \epsilon\,\beta'^{\max}_t\big\|_0\Big) + 2K_{T+1} - 2\sum_i \Big[\ln P\big(y_i \mid x^{(T)}_i\big) + \ln P\big(y'_i \mid x^{(T)}_i\big)\Big], \tag{10}$$

where $\|B\|_0$ is the number of nonzero elements in matrix $B$, $\epsilon > 0$ is a small constant, and $B_t$ and $B'_t$ consist of the $\beta^{(t\to t+1)}_k$ trained by the first and second iSHMs of the $t$th iSHM pair, respectively, with $\beta^{\max}_t$ and $\beta'^{\max}_t$ as their respective maximum absolute values.

Note that if the covariates have any spatial or temporal structure to exploit, one may replace each $x_i^\top \beta_k$ in iSHM with $[\mathrm{CNN}_\phi(x_i)]^\top \beta_k$, where $\mathrm{CNN}_\phi(\cdot)$ represents a convolutional neural network parameterized by $\phi$, to construct a CNN-iSHM, which can then be further greedily grown into a CNN-PBDN. Customizing PBDNs for structured covariates, such as image pixels and audio measurements, is a promising research topic; it is, however, beyond the scope of this paper.

## 4 Illustrations and experimental results

Code for reproducible research is available at https://github.com/mingyuanzhou/PBDN.

Figure 1: Visualization of PBDN, each layer of which is a pair of iSHMs trained on a two-spirals dataset under two opposite labeling settings. Subfigures: 1) a pair of infinite support hyperplane machines (iSHMs); 2) PBDN with 2 pairs of iSHMs; 3) PBDN with 4 pairs of iSHMs; 4) PBDN with 10 pairs of iSHMs. For Subfigure 1), (a) shows the weights $r_k$ of the inferred active support hyperplanes, ordered by their values, and (d) shows the trace plot of the average per-data-point log-likelihood. For the iSHM trained by labeling the red (blue) points as ones (zeros), (b) shows a contour map, the value of each point of which represents how many inequalities specified in (7) are violated, and whose region with zero values corresponds to the convex polytope enclosed by the intersection of the hyperplanes defined in (7), and (c) shows the contour map of predicted class probabilities. (e-f) are analogous plots to (b-c) for the iSHM trained with the blue points labeled as ones. The inferred data subtypes as in (2) are represented as red circles and blue pentagrams in subplots (b) and (d). Subfigures 2)-4) are analogous to 1), with two main differences: (i) the transformed covariates used to train the newly added iSHM pair are obtained by propagating the original 2-D covariates through the previously trained iSHM pairs, and (ii) the contour maps in subplots (b) and (e) visualize the iSHM linear hyperplanes in the transformed space by projecting them back to the original 2-D covariate space.

To illustrate the imposed geometric constraint and inductive bias of a single iSHM, we first consider a challenging 2-D two-spirals dataset, as shown in Fig. 1, whose two classes are not fully separable by a convex polytope. We train 10 pairs of iSHMs one pair after another, which are organized into a ten-hidden-layer PBDN, whose numbers of hidden units from the 1st to 10th hidden layers (i.e., numbers of support hyperplanes of the 1st to 10th iSHM pairs) are inferred to be 8, 14, 15, 11, 19, 22, 23, 18, 19, and 29, respectively. Both AIC and AIC$_{\epsilon=0.01}$ infer the depth as $T = 4$. For Fig. 1, we first train an iSHM by labeling the red points as 1 and the blue as 0, whose results are shown in subplots 1) (b-c), and another iSHM under the opposite labeling setting, whose results are shown in subplots 1) (e-f).
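As a small illustration of the sparsity-aware parameter count used by AIC$_\epsilon$ in (10) (a sketch based on our reading of the criterion, not the released code), the following counts, for each layer, the connection weights whose magnitude exceeds $\epsilon$ times that layer's maximum absolute weight; weights below the threshold are treated as effectively zero and do not add to the model-complexity penalty.

```python
import numpy as np

def aic_eps_penalty(B_list, B_prime_list, eps=0.01):
    """Per-layer penalty term of Eq. (10). B_list / B_prime_list: weight matrices of the
    first / second iSHM in each pair, one entry per hidden layer."""
    cost = 0.0
    for B, B_prime in zip(B_list, B_prime_list):
        cost += 2.0 * np.sum(np.abs(B) > eps * np.abs(B).max())              # effectively nonzero weights
        cost += 2.0 * np.sum(np.abs(B_prime) > eps * np.abs(B_prime).max())
    return cost   # add 2*K_{T+1} and subtract twice the log-likelihood terms to obtain AIC_eps(T)
```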
It is evident that when labeling the red points as 1, as shown in 1) (a-c), iSHM infers a convex polytope, intersected by three active support hyperplanes, to enclose the blue points, but at the same time allows the red points within that convex polytope to pass through with appreciable activations. When labeling the blue points as 1, iSHM infers five active support hyperplanes, two of which are visible in the covariate space shown in 1) (e), to enclose the red points, but at the same time allows the blue points within that convex polytope to pass through with appreciable activations, as shown in 1) (f). Being able to use only a convex polytope to enclose the data labeled as 0 is a restriction of iSHM, but it is also why the iSHM pair can appropriately place two parsimonious sets of active support hyperplanes in the covariate space, ensuring that the maximum distance of any data point to these support hyperplanes is sufficiently small.

Second, we concatenate the 8-D transformed and 2-D original covariates as 10-D covariates, which are further augmented with a constant bias term of one, to train the second pair of iSHMs. As in subplot 2) (a), five active support hyperplanes are inferred in the 10-D covariate space when labeling the blue points as 1, which can be matched to five nonzero smooth segments of the original 2-D space shown in 2) (e), which align well with the highly activated regions in 2) (f). Again, with the inductive bias, as in 2) (f), all positively labeled data points are sufficiently activated at the expense of allowing some negative ones to be only moderately suppressed. Nine support hyperplanes are inferred when labeling the red points as 1, and similar activation behaviors can also be observed in 2) (b-c).

Table 1: Visualization of the subtypes inferred by PBDN in a random trial and comparison of classification error rates (mean ± standard deviation over five random trials) between PBDN and a two-hidden-layer DNN (128-64) on four different MNIST binary classification tasks. The subtype panels are: (a) subtypes of 3 in 3 vs 5; (b) subtypes of 3 in 3 vs 8; (c) subtypes of 4 in 4 vs 7; (d) subtypes of 4 in 4 vs 9; (e) subtypes of 5 in 3 vs 5; (f) subtypes of 8 in 3 vs 8; (g) subtypes of 7 in 4 vs 7; (h) subtypes of 9 in 4 vs 9.

|      | 3 vs 5 | 3 vs 8 | 4 vs 7 | 4 vs 9 |
|------|--------|--------|--------|--------|
| PBDN | 2.53% ± 0.22% | 2.66% ± 0.27% | 1.37% ± 0.18% | 2.95% ± 0.47% |
| DNN  | 2.78% ± 0.36% | 2.93% ± 0.40% | 1.21% ± 0.12% | 2.98% ± 0.17% |

We continue the same greedy layer-wise training strategy to add another eight iSHM pairs. From Fig. 1, Subfigures 3)-4), it becomes more and more clear that a support hyperplane, inferred in the transformed covariate space of a deep hidden layer, can be used to represent the boundary of a complex but smooth segment of the original covariate space that well encloses all or a subset of the data labeled as 1.

We apply PBDN to four different MNIST binary classification tasks and compare its performance with DNN (128-64), a two-hidden-layer deep neural network that will be described in detail below. As in Tab. 1, both AIC and AIC$_{\epsilon=0.01}$ infer the depth as $T = 1$ for PBDN, and infer for each class only a few active hyperplanes, each of which represents a distinct data subtype, as calculated with (2). In a random trial, the inferred networks of PBDN for all four tasks have only a single hidden layer with at most 6 active hidden units. Thus its testing computation is much lower than that of DNN (128-64), while providing an overall lower testing error rate (both trained with 4000 mini-batches of size 100).
Below we provide a comprehensive comparison on eight widely used benchmark datasets between the proposed PBDNs and a variety of algorithms, including logistic regression, the Gaussian radial basis function (RBF) kernel support vector machine (SVM), the relevance vector machine (RVM) [31], the adaptive multi-hyperplane machine (AMM) [27], the convex polytope machine (CPM) [30], and the deep neural network (DNN) classifier (DNNClassifier) provided in TensorFlow [33]. Except for logistic regression, which is a linear classifier, both kernel SVM and RVM are widely used nonlinear classifiers relying on the kernel trick, both AMM and CPM intersect multiple hyperplanes to construct their decision boundaries, and DNN uses a multilayer feedforward network, whose network structure often needs to be tuned to achieve a good balance between data fitting and model complexity, to handle nonlinearity. We consider DNN (8-4), a two-hidden-layer DNN that uses 8 and 4 hidden units for its first and second hidden layers, respectively, DNN (32-16), and DNN (128-64). In the Appendix, we summarize in Tab. 4 the information of the eight benchmark datasets: banana, breast cancer, titanic, waveform, german, image, ijcnn1, and a9a. For a fair comparison, to ensure the same training/testing partitions for all algorithms across all datasets, we report the results obtained by using either widely used open-source software packages or the code made publicly available by the original authors. We describe in the Appendix the settings of all competing algorithms.

For all datasets, we follow Algorithm 1 to first train a single-hidden-layer PBDN (PBDN1), i.e., a pair of iSHMs fitted under two opposite labeling settings. We then follow Algorithm 2 to train another pair of iSHMs to construct a two-hidden-layer PBDN (PBDN2), and repeat the same procedure to train PBDN3 and PBDN4. Note that we observe that PBDN's log-likelihood increases rapidly during the first few hundred MCMC/SGD iterations, then keeps increasing at a slower pace, and eventually fluctuates. However, it often takes more iterations to shrink the weights of unneeded hyperplanes towards deactivation. Thus, although insufficient iterations may not necessarily degrade the final out-of-sample prediction accuracy, they may lead to a less compact network and hence higher computational cost for out-of-sample prediction. For each iSHM, we set the upper bound on the number of support hyperplanes as $K_{\max} = 20$. For Gibbs sampling, we run 5000 iterations and record the $\{r_k, \beta_k\}_k$ with the highest likelihood during the last 2500 iterations; for MAP, we process 4000 mini-batches of size $M = 100$, with $0.05/(4 + T)$ as the Adam learning rate for the $T$th added iSHM pair. We use the inferred $\{r_k, \beta_k\}_k$ to either produce out-of-sample predictions or generate transformed covariates for the next layer. We set $a_0 = b_0 = 0.01$, $e_0 = f_0 = 1$, and $a_\beta = 10^{-6}$ for Gibbs sampling. We fix $\gamma_0 = c_0 = 1$ and $a_\beta = b_{\beta k} = 10^{-6}$ for MAP inference. As in Algorithm 1, we prune inactive support hyperplanes once every 200 MCMC or 500 SGD iterations to facilitate computation.

Table 2: Comparison of classification error rates between a variety of algorithms and the proposed PBDNs with 1, 2, or 4 hidden layers, and PBDN-AIC and PBDN-AIC$_{\epsilon=0.01}$ trained with Gibbs sampling or SGD. Error rates are in % (mean ± standard deviation). Displayed in the last two rows are, for each algorithm, the average of its error rates and the average of its computational complexities, normalized by those of kernel SVM.
| Dataset | LR | SVM | RVM | AMM | CPM | DNN (8-4) | DNN (32-16) | DNN (128-64) | PBDN1 | PBDN2 | PBDN4 | AIC-Gibbs | AICϵ-Gibbs | AIC-SGD | AICϵ-SGD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| banana | 47.76 ± 4.38 | 10.85 ± 0.57 | 11.08 ± 0.69 | 18.76 ± 4.09 | 21.59 ± 3.00 | 14.07 ± 5.87 | 10.88 ± 0.36 | 10.91 ± 0.45 | 22.54 ± 5.28 | 10.94 ± 0.46 | 10.85 ± 0.43 | 10.97 ± 0.46 | 10.97 ± 0.46 | 11.49 ± 0.65 | 11.79 ± 0.89 |
| breast cancer | 28.05 ± 3.68 | 28.44 ± 4.52 | 31.56 ± 4.66 | 31.82 ± 4.47 | 30.13 ± 5.26 | 32.73 ± 4.77 | 33.51 ± 6.47 | 32.21 ± 6.64 | 29.22 ± 3.53 | 30.26 ± 4.86 | 30.26 ± 5.55 | 29.22 ± 3.53 | 29.22 ± 3.53 | 32.08 ± 6.18 | 29.87 ± 5.27 |
| titanic | 22.67 ± 0.98 | 22.33 ± 0.63 | 23.20 ± 1.08 | 28.85 ± 8.56 | 23.38 ± 3.23 | 22.32 ± 1.38 | 22.35 ± 1.36 | 22.28 ± 1.37 | 22.88 ± 0.59 | 22.25 ± 1.27 | 22.16 ± 1.13 | 22.88 ± 0.59 | 22.88 ± 0.59 | 23.00 ± 0.37 | 23.00 ± 0.37 |
| waveform | 13.33 ± 0.59 | 10.73 ± 0.86 | 11.16 ± 0.72 | 11.81 ± 1.13 | 13.52 ± 1.97 | 12.43 ± 0.91 | 11.66 ± 0.89 | 11.20 ± 0.63 | 11.67 ± 0.90 | 11.40 ± 0.86 | 11.69 ± 1.34 | 11.42 ± 0.94 | 11.45 ± 0.93 | 12.38 ± 0.59 | 12.04 ± 0.72 |
| german | 23.63 ± 1.70 | 23.30 ± 2.51 | 23.67 ± 2.28 | 25.13 ± 3.73 | 23.57 ± 2.15 | 27.83 ± 3.60 | 28.47 ± 3.00 | 25.73 ± 2.62 | 23.57 ± 2.43 | 23.80 ± 2.22 | 23.77 ± 2.27 | 23.77 ± 2.52 | 23.90 ± 2.56 | 26.87 ± 2.60 | 23.93 ± 2.01 |
| image | 17.53 ± 1.05 | 2.84 ± 0.52 | 3.82 ± 0.59 | 3.82 ± 0.87 | 3.59 ± 0.71 | 4.83 ± 1.51 | 2.54 ± 0.45 | 2.44 ± 0.56 | 3.32 ± 0.59 | 2.43 ± 0.51 | 2.30 ± 0.40 | 2.36 ± 0.54 | 2.30 ± 0.52 | 2.18 ± 0.41 | 2.27 ± 0.36 |
| ijcnn1 | 8.41 ± 0.60 | 4.01 ± 0.53 | 5.37 ± 1.40 | 4.82 ± 1.99 | 4.83 ± 1.14 | 6.35 ± 1.33 | 5.17 ± 0.83 | 4.17 ± 0.97 | 5.46 ± 1.32 | 4.33 ± 0.84 | 4.09 ± 0.76 | 4.33 ± 0.84 | 4.06 ± 0.84 | 4.18 ± 0.73 | 4.15 ± 0.68 |
| a9a | 15.34 ± 0.11 | 15.87 ± 0.33 | 15.39 ± 0.12 | 15.97 ± 0.23 | 15.56 ± 0.23 | 15.82 ± 0.23 | 16.16 ± 0.22 | 16.97 ± 0.55 | 15.74 ± 0.12 | 15.64 ± 0.10 | 15.70 ± 0.09 | 15.74 ± 0.12 | 15.74 ± 0.12 | 19.90 ± 0.30 | 17.13 ± 0.19 |
| Mean of SVM-normalized errors | 2.237 | 1.000 | 1.110 | 1.234 | 1.227 | 1.260 | 1.087 | 1.031 | 1.219 | 1.009 | 0.998 | 1.006 | 0.996 | 1.073 | 1.029 |
| Mean of SVM-normalized K | 0.006 | 1.000 | 0.113 | 0.069 | 0.046 | 0.073 | 0.635 | 8.050 | 0.042 | 0.060 | 0.160 | 0.057 | 0.064 | 0.128 | 0.088 |

Table 3: The inferred depth of PBDN, which increases its depth until a model selection criterion starts to rise (mean ± standard deviation).

| Dataset | banana | breast cancer | titanic | waveform | german | image | ijcnn1 | a9a |
|---|---|---|---|---|---|---|---|---|
| AIC-Gibbs | 2.30 ± 0.48 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.90 ± 0.74 | 1.30 ± 0.67 | 2.40 ± 0.52 | 2.00 ± 0.00 | 1.00 ± 0.00 |
| AICϵ=0.01-Gibbs | 2.30 ± 0.48 | 1.00 ± 0.00 | 1.00 ± 0.00 | 2.00 ± 0.67 | 1.60 ± 0.84 | 2.60 ± 0.52 | 3.40 ± 0.55 | 1.00 ± 0.00 |
| AIC-SGD | 3.20 ± 0.78 | 1.90 ± 0.99 | 1.00 ± 0.00 | 2.40 ± 0.52 | 2.80 ± 0.63 | 2.90 ± 0.74 | 3.20 ± 0.45 | 3.20 ± 0.45 |
| AICϵ=0.01-SGD | 2.80 ± 0.63 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.50 ± 0.53 | 1.00 ± 0.00 | 2.00 ± 0.00 | 3.00 ± 0.00 | 1.00 ± 0.00 |

We record the out-of-sample-prediction errors and computational complexities of the various algorithms over these eight benchmark datasets in Tab. 2 and in Tab. 5 of the Appendix, respectively, and summarize in Tab. 2 the means of SVM-normalized errors and numbers of support hyperplanes/vectors. Overall, PBDN using AIC$_\epsilon$ in (10) with $\epsilon = 0.01$ to determine the depth, referred to as PBDN-AIC$_{\epsilon=0.01}$, has the highest out-of-sample prediction accuracy, followed by PBDN4, the RBF kernel SVM, PBDN using AIC in (9) to determine the depth, referred to as PBDN-AIC, PBDN2, PBDN-AIC$_{\epsilon=0.01}$ solved with SGD, DNN (128-64), PBDN-AIC solved with SGD, and DNN (32-16). Overall, logistic regression does not perform well, which is not surprising as it is a linear classifier that uses a single hyperplane to partition the covariate space into two halves to separate one class from the other. As shown in Tab. 2, for breast cancer, titanic, german, and a9a, all classifiers have comparable classification errors, suggesting minor or no advantages of using a nonlinear classifier on them. By contrast, for banana, waveform, image, and ijcnn1, all nonlinear classifiers clearly outperform logistic regression.
Note that PBDN1, which clearly reduces the classification errors of logistic regression, performs similarly to both AMM and CPM. These results are not surprising, as CPM, closely related to AMM, uses a convex polytope, defined as the intersection of multiple hyperplanes, to enclose one class, whereas the classification decision boundaries of PBDN1 can be bounded within a convex polytope that encloses the negative examples. Note that the number of hyperplanes is automatically inferred from the data by PBDN1, thanks to the inherent shrinkage mechanism of the gamma process, whereas those of AMM and CPM are both selected via cross-validation. While PBDN1 can partially remedy their sensitivity to how the data are labeled by combining the results obtained under two opposite labeling settings, the decision boundaries of the two iSHMs and those of both AMM and CPM are still restricted to a confined space related to a single convex polytope, which may explain why on banana, image, and ijcnn1 they all clearly underperform a PBDN with more than one hidden layer.

As shown in Tab. 2, DNN (8-4) clearly underperforms DNN (32-16) in terms of classification accuracy on both image and ijcnn1, indicating that having 8 and 4 hidden units for the first and second hidden layers, respectively, is far from enough for DNN to provide a sufficiently high nonlinear modeling capacity for these two datasets. Note that the equivalent number of hyperplanes for DNN ($K_1$, $K_2$), a two-hidden-layer DNN with $K_1$ and $K_2$ hidden units in the first and second hidden layers, respectively, is computed as $[(V + 1)K_1 + K_1 K_2]/(V + 1)$. Thus the computational complexity quickly increases as the network size increases. For example, DNN (8-4) is comparable to PBDN1 and PBDN-AIC in terms of out-of-sample-prediction computational complexity, as shown in Tabs. 2 and 5, but it clearly underperforms both of them in terms of classification accuracy, as shown in Tab. 2. While DNN (128-64) performs well in terms of classification accuracy, as shown in Tab. 2, its out-of-sample-prediction computational complexity is clearly higher than that of the other algorithms with comparable or better accuracy, such as RVM and PBDN, as shown in Tab. 5. In practice, moreover, the search space for a DNN with two or more hidden layers is enormous, making it difficult to determine a network that is neither too large nor too small and that achieves a good compromise between fitting the data well and having low complexity for both training and out-of-sample prediction. E.g., while DNN (128-64) could further improve the performance of DNN (32-16) on these two datasets, it uses a much larger network and has clearly higher computational complexity for out-of-sample prediction.

We show the number of active support hyperplanes inferred by PBDN in a single random trial in Figs. 3-6. For PBDN, the computation in both training and out-of-sample prediction also increases in $T$, the network depth. It is clear from Tab. 2 that increasing $T$ from 1 to 2 generally leads to the most significant improvement when there is a clear advantage of increasing $T$, and once $T$ is sufficiently large, further increasing $T$ leads to small performance fluctuations but does not appear to lead to clear overfitting. As shown in Tab. 3, the use of the AIC based greedy model selection criterion eliminates the need to tune the depth $T$, allowing it to be inferred from the data. Note that we have tried stacking CPMs in the same way that we stack iSHMs, but found that the accuracy often quickly deteriorates rather than improves.
E.g., for CPMs with 2, 3, or 4 layers, the error rates become (0.131, 0.177, 0.223) on waveform and (0.046, 0.080, 0.216) on image. The reason could be that CPM infers redundant unweighted hyperplanes that lead to strong multicollinearity in the covariates of deep layers. Note that on each given dataset, we have also tried training a DNN with the same network architecture inferred by a PBDN. While such a DNN jointly trains all its hidden layers, it provides no performance gain over the corresponding PBDN. More specifically, the DNNs using the network architectures inferred by PBDNs with AIC-Gibbs, AIC$_{\epsilon=0.01}$-Gibbs, AIC-SGD, and AIC$_{\epsilon=0.01}$-SGD have means of SVM-normalized errors of 1.047, 1.011, 1.076, and 1.144, respectively. These observations suggest the efficacy of the greedy layer-wise training strategy of the PBDN, which requires no cross-validation.

For out-of-sample prediction, the computation of a classification algorithm generally increases linearly in the number of support hyperplanes/vectors. Using logistic regression with a single hyperplane for reference, we summarize the computational complexity in Tab. 2, which indicates that, in comparison to SVM, which consistently requires the largest number of support vectors, PBDN often requires significantly less time for predicting the class label of a new data sample. For example, for out-of-sample prediction on the image dataset, as shown in Tab. 5, on average SVM uses about 212 support vectors, whereas on average PBDNs with one to five hidden layers use about 13, 16, 29, 50, and 64 hyperplanes, respectively, and PBDN-AIC uses about 22 hyperplanes, showing that in comparison to kernel SVM, PBDN can be much more computationally efficient in making out-of-sample predictions.

## 5 Conclusions

The infinite support hyperplane machine (iSHM), which interacts countably infinite non-negatively weighted hyperplanes via a noisy-OR mechanism, is employed as the building unit to greedily construct a capacity-regularized parsimonious Bayesian deep network (PBDN). iSHM has an inductive bias towards fitting the positively labeled data, and employs the gamma process to infer a parsimonious set of active hyperplanes to enclose the negatively labeled data within a convex-polytope bounded space. Due to the inductive bias and label asymmetry, iSHMs are trained in pairs to ensure a sufficient coverage of the covariate space occupied by the data from both classes. The sequentially trained iSHM pairs can be stacked into a PBDN, a feedforward deep network that gradually enhances its modeling capacity as the network depth increases, achieving high accuracy while having low computational complexity for out-of-sample prediction. PBDN can be trained using either Gibbs sampling, which is suitable for quantifying posterior uncertainty, or SGD based MAP inference, which is scalable to big data. One may potentially construct PBDNs for regression analysis of count, categorical, and continuous response variables by following the same three-step strategy: constructing a nonparametric Bayesian model that infers the number of components for the task of interest, greedily adding layers one at a time, and using a forward model selection criterion to decide how deep is deep enough.
For the first step, the recently proposed Lomax distribution based racing framework [34] could be a promising candidate for both categorical and non-negative response variables, and Dirichlet process mixtures of generalized linear models [35] could be promising candidates for continuous response variables and many other types of variables via appropriate link functions.

## Acknowledgments

M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation, the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research, and the computational support of the Texas Advanced Computing Center.

## References

[1] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[2] B. Schölkopf, C. J. C. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[3] G. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[4] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[6] I. Steinwart, Sparseness of support vector machines, J. Mach. Learn. Res., vol. 4, pp. 1071-1105, 2003.
[7] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, Model compression, in KDD, pp. 535-541, ACM, 2006.
[8] Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing deep convolutional networks using vector quantization, arXiv preprint arXiv:1412.6115, 2014.
[9] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, 2015.
[10] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149, 2015.
[11] T. S. Ferguson, A Bayesian analysis of some nonparametric problems, Ann. Statist., vol. 1, no. 2, pp. 209-230, 1973.
[12] M. Zhou, Y. Cong, and B. Chen, The Poisson gamma belief network, in NIPS, 2015.
[13] M. Zhou, Y. Cong, and B. Chen, Augmentable gamma belief networks, J. Mach. Learn. Res., vol. 17, no. 163, pp. 1-44, 2016.
[14] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in ICML, pp. 807-814, 2010.
[15] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in AISTATS, pp. 315-323, 2011.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in NIPS, pp. 1097-1105, 2012.
[17] W. Shang, K. Sohn, D. Almeida, and H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in ICML, 2016.
[18] M. Zhou, Infinite edge partition models for overlapping community detection and link prediction, in AISTATS, pp. 1135-1143, 2015.
[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[20] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning, vol. 37, no. 2, pp. 183-233, 1999.
[21] S. Arora, R. Ge, T. Ma, and A. Risteski, Provable learning of noisy-or networks, arXiv preprint arXiv:1612.08795, 2016.
[22] F. Caron and E. B. Fox, Sparse graphs using exchangeable random measures, J. R. Stat. Soc.: Series B, vol. 79, no. 5, pp. 1295-1366, 2017.
[23] M. Zhou, Discussion on "Sparse graphs using exchangeable random measures" by Francois Caron and Emily B. Fox, arXiv preprint arXiv:1802.07721, 2018.
[24] P. Rai, C. Hu, R. Henao, and L. Carin, Large-scale Bayesian multi-label learning via topic-based label embeddings, in NIPS, 2015.
[25] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[26] F. Aiolli and A. Sperduti, Multiclass classification with multi-prototype support vector machines, J. Mach. Learn. Res., vol. 6, pp. 817-850, 2005.
[27] Z. Wang, N. Djuric, K. Crammer, and S. Vucetic, Trading representability for scalability: Adaptive multi-hyperplane machine for nonlinear classification, in KDD, pp. 24-32, 2011.
[28] N. Manwani and P. S. Sastry, Learning polyhedral classifiers using logistic function, in ACML, pp. 17-30, 2010.
[29] N. Manwani and P. S. Sastry, Polyceptron: A polyhedral learning algorithm, arXiv:1107.1564, 2011.
[30] A. Kantchelian, M. C. Tschantz, L. Huang, P. L. Bartlett, A. D. Joseph, and J. D. Tygar, Large-margin convex polytope machine, in NIPS, pp. 3248-3256, 2014.
[31] M. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res., vol. 1, pp. 211-244, 2001.
[32] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[33] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[34] Q. Zhang and M. Zhou, Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks, in NeurIPS, 2018.
[35] L. A. Hannah, D. M. Blei, and W. B. Powell, Dirichlet process mixtures of generalized linear models, J. Mach. Learn. Res., vol. 12, pp. 1923-1953, 2011.
[36] K. Crammer and Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res., vol. 2, pp. 265-292, 2002.
[37] D. B. Dunson and A. H. Herring, Bayesian latent variable models for mixed discrete outcomes, Biostatistics, vol. 6, no. 1, pp. 11-25, 2005.
[38] M. Zhou, L. Hannah, D. Dunson, and L. Carin, Beta-negative binomial process and Poisson factor analysis, in AISTATS, pp. 1462-1471, 2012.
[39] M. Zhou and L. Carin, Negative binomial process count and mixture modeling, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 307-320, 2015.
[40] N. G. Polson and J. G. Scott, Default Bayesian analysis for multi-way tables: A data-augmentation approach, arXiv:1109.4180v1, 2011.
[41] M. Zhou, L. Li, D. Dunson, and L. Carin, Lognormal and gamma mixed negative binomial regression, in ICML, pp. 1343-1350, 2012.
[42] N. G. Polson, J. G. Scott, and J. Windle, Bayesian inference for logistic models using Pólya-Gamma latent variables, J. Amer. Statist. Assoc., vol. 108, no. 504, pp. 1339-1349, 2013.
[43] M. Zhou, Softplus regressions and convex polytopes, arXiv:1608.06383, 2016.
[44] G. Rätsch, T. Onoda, and K.-R. Müller, Soft margins for AdaBoost, Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.
[45] T. Diethe, 13 benchmark datasets derived from the UCI, DELVE and STATLOG repositories. https://github.com/tdiethe/gunnar_raetsch_benchmark_datasets/, 2015.
[46] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, Training and testing low-degree polynomial data mappings via linear SVM, J. Mach. Learn. Res., vol. 11, pp. 1471-1490, 2010.
[47] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, LIBLINEAR: A library for large linear classification, J. Mach. Learn. Res., vol. 9, pp. 1871-1874, 2008.
[48] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[49] N. Djuric, L. Lan, S. Vucetic, and Z. Wang, BudgetedSVM: A toolbox for scalable SVM approximations, J. Mach. Learn. Res., vol. 14, pp. 3813-3817, 2013.