# Parsimonious Bayesian Deep Networks

Mingyuan Zhou, Department of IROM, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712. mingyuan.zhou@mccombs.utexas.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

## Abstract

Combining Bayesian nonparametrics and a forward model selection strategy, we construct parsimonious Bayesian deep networks (PBDNs) that infer capacity-regularized network architectures from the data and require neither cross-validation nor fine-tuning when training the model. One of the two essential components of a PBDN is the development of a special infinite-wide single-hidden-layer neural network, whose number of active hidden units can be inferred from the data. The other is the construction of a greedy layer-wise learning algorithm that uses a forward model selection criterion to determine when to stop adding another hidden layer. We develop both Gibbs sampling and stochastic gradient descent based maximum a posteriori inference for PBDNs, providing state-of-the-art classification accuracy and interpretable data subtypes near the decision boundaries, while maintaining low computational complexity for out-of-sample prediction.

## 1 Introduction

To separate two linearly separable classes, a simple linear classifier such as logistic regression will often suffice; in that scenario, adding the capability to model nonlinearity not only complicates the model and increases computation, but also often harms rather than improves the performance by increasing the risk of overfitting. On the other hand, for two classes not well separated by a single hyperplane, a linear classifier is often inadequate, and hence it is common to use either kernel support vector machines [1, 2] or deep neural networks [3-5] to nonlinearly transform the covariates, making the two classes more linearly separable in the transformed covariate space. While both are able to achieve high classification accuracy, they have clear limitations. For a kernel based classifier, the number of support vectors often increases linearly in the size of the training data [6], making it not only computationally expensive and memory inefficient to train for big data, but also slow in out-of-sample prediction. A deep neural network can be made scalable with an appropriate network structure, but it is often cumbersome to tune the network depth (number of layers) and the width (number of hidden units) of each hidden layer [5], and a deep neural network, which is often equipped with a larger than necessary modeling capacity, risks overfitting the training data if it is not carefully regularized.

Rather than making an uneasy choice in the first place between a linear classifier, which has fast computation and resists overfitting but may not provide sufficient class separation, and an over-capacitized model, which often wastes computation and requires careful regularization to prevent overfitting, we propose a parsimonious Bayesian deep network (PBDN) that builds its capacity regularization into the greedy layer-wise construction and training of the deep network. More specifically, we transform the covariates in a layer-wise manner, with each layer of transformation designed to facilitate class separation via the use of the noisy-OR interactions of multiple weighted linear hyperplanes.
Related to kernel support vector machines, the hyperplanes play a similar role as support vectors in transforming the covariate space, but they are inferred from the data and their number increases at a much slower rate with the training data size. Related to deep neural networks, the proposed multi-layer structure gradually increases its modeling capability by increasing its number of layers, but allows inferring from the data both the width of each hidden layer and the depth of the network, to prevent building a model whose capacity is larger than necessary.

To obtain a parsimonious deep neural network, one may also consider using a two-step approach that first trains an over-capacitized model and then compresses its size [7-10]. The design of PBDN represents a distinct philosophy. Moreover, PBDN does not contradict model compression, as these post-processing compression techniques [7-10] may be used to further compress PBDN.

For capacity regularization of the proposed PBDN, we choose to shrink both its width and depth. To shrink the width of a hidden layer, we propose the use of a gamma process [11], a draw from which consists of countably infinite atoms, each of which is used to represent a hyperplane in the covariate space. The gamma process has an inherent shrinkage mechanism, as its number of atoms whose random weights are larger than a certain positive constant $\epsilon > 0$ follows a Poisson distribution, whose mean is finite almost surely (a.s.) and reduces towards zero as $\epsilon$ increases. To shrink the depth of the network, we propose a layer-wise greedy-learning strategy that increases the depth by adding one hidden layer at a time, and uses an appropriate model selection criterion to decide when to stop adding another one. Note that, related to our work, Zhou et al. [12, 13] combine the gamma process and greedy layer-wise training to build a Bayesian deep network in an unsupervised manner.

Our experiments show the proposed capacity regularization strategy helps successfully build a PBDN, providing state-of-the-art classification accuracy while maintaining low computational complexity for out-of-sample prediction. We have also tried applying a highly optimized off-the-shelf deep neural network based classifier, whose network architecture for a given dataset is set to be the same as that inferred by the PBDN. However, we have found no performance gains, suggesting the efficacy of the PBDN's greedy training procedure that requires neither cross-validation nor fine-tuning. Note that PBDN, like a conventional deep neural network, could also be further improved by introducing convolutional layers if the covariates have spatial, temporal, or other types of structure.

## 2 Layer-width learning via infinite support hyperplane machines

The first essential component of the proposed capacity regularization strategy is to learn the width of a hidden layer. To fulfill this goal, we define the infinite support hyperplane machine (iSHM), a label-asymmetric classifier that places in the covariate space countably infinite hyperplanes $\{\beta_k\}_{1,\infty}$, where each $\beta_k \in \mathbb{R}^{V+1}$ is associated with a weight $r_k > 0$. We use a gamma process [11] to generate $\{r_k, \beta_k\}_{1,\infty}$, making the infinite sum $\sum_{k=1}^{\infty} r_k$ finite almost surely (a.s.).
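To make the shrinkage mechanism of the gamma process concrete, the following is a minimal numpy sketch (an illustration, not the released PBDN code) that draws the hyperplane weights under the finite truncation used later in Section 2.4, $r_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c_0)$ with a large truncation level $K$; the threshold `eps` and the parameter values are arbitrary choices for this illustration. Most of the $K$ weights come out negligibly small, so only a handful of hyperplanes remain active.

```python
import numpy as np

rng = np.random.default_rng(0)
K, gamma_0, c_0 = 100, 1.0, 1.0                    # truncation level and gamma process hyperparameters

# numpy parameterizes the gamma by (shape, scale); here scale = 1 / c_0
r = rng.gamma(shape=gamma_0 / K, scale=1.0 / c_0, size=K)

eps = 1e-2
print("sum of weights r_k   :", r.sum())                   # finite, close to gamma_0 / c_0 in expectation
print("weights exceeding eps:", int((r > eps).sum()))      # only a few atoms are non-negligible
```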
We measure the proximity of a covariate vector $x_i \in \mathbb{R}^{V+1}$ to $\beta_k$ using the softplus function of their inner product, $\ln(1 + e^{\beta_k^\top x_i})$, which is a smoothed version of $\mathrm{ReLU}(\beta_k^\top x_i) = \max(0, \beta_k^\top x_i)$ that is widely used in deep neural networks [14-17]. We consider that $x_i$ is far from hyperplane $k$ if $\ln(1 + e^{\beta_k^\top x_i})$ is close to zero; as $x_i$ moves away from hyperplane $k$, this proximity measure monotonically increases on one side of the hyperplane while decreasing on the other. We pass $\lambda_i = \sum_{k=1}^{\infty} r_k \ln(1 + e^{\beta_k^\top x_i})$, a non-negative weighted combination of these proximity measures, through the Bernoulli-Poisson link [18], $f(\lambda_i) = 1 - e^{-\lambda_i}$, to define the conditional class probability as

$$P(y_i = 1 \mid \{r_k, \beta_k\}_k, x_i) = 1 - \prod_{k=1}^{\infty}(1 - p_{ik}), \qquad p_{ik} = 1 - e^{-r_k \ln(1 + e^{\beta_k^\top x_i})}. \tag{1}$$

Note that the model treats the data labeled as 0 and 1 differently: flipping the labels and retraining does not in general yield the complementary class probabilities. We will show that, as formalized in (2), the $p_{ik}$ can be used to represent the $k$th data subtype discovered by the algorithm.

### 2.1 Nonparametric Bayesian hierarchical model

One may readily notice from (1) that the noisy-OR construction, widely used in probabilistic reasoning [19-21], is generalized by iSHM to attribute a binary outcome of $y_i = 1$ to countably infinite hidden causes $p_{ik}$. Denoting $\vee$ as the logical OR operator, since $P(y = 0) = \prod_k P(b_k = 0)$ if $y = \vee_k b_k$ and $\mathbb{E}[e^{-\theta}] = e^{-r\ln(1+e^{x})}$ if $\theta \sim \mathrm{Gamma}(r, e^{x})$, we have an augmented form of (1) as

$$y_i = \vee_{k=1}^{\infty} b_{ik}, \quad b_{ik} \sim \mathrm{Bernoulli}(p_{ik}), \quad p_{ik} = 1 - e^{-\theta_{ik}}, \quad \theta_{ik} \sim \mathrm{Gamma}(r_k, e^{\beta_k^\top x_i}), \tag{3}$$

where $b_{ik} \sim \mathrm{Bernoulli}(p_{ik})$ can be further augmented as $b_{ik} = \delta(m_{ik} \geq 1)$, $m_{ik} \sim \mathrm{Pois}(\theta_{ik})$, where $m_{ik} \in \mathbb{Z}$, $\mathbb{Z} := \{0, 1, \ldots\}$, and $\delta(x)$ equals 1 if the condition $x$ is satisfied and 0 otherwise.

We now marginalize $b_{ik}$ out to formally define iSHM. Let $G \sim \Gamma\mathrm{P}(G_0, 1/c)$ denote a gamma process defined on the product space $\mathbb{R}_+ \times \Omega$, where $\mathbb{R}_+ = \{x : x > 0\}$, $c \in \mathbb{R}_+$, and $G_0$ is a finite and continuous base measure over a complete separable metric space $\Omega$. As illustrated in Fig. 2 (b) in the Appendix, given a draw from $G$, expressed as $G = \sum_{k=1}^{\infty} r_k \delta_{\beta_k}$, where $\beta_k$ is an atom and $r_k$ is its weight, the iSHM generates the label under the Bernoulli-Poisson link [18] as

$$y_i \mid G, x_i \sim \mathrm{Bernoulli}\!\left(1 - e^{-\sum_{k=1}^{\infty} r_k \ln(1 + e^{x_i^\top \beta_k})}\right), \tag{4}$$

which can be represented as a noisy-OR model as in (3) or, as shown in Fig. 2 (a), constructed as

$$y_i = \delta(m_i \geq 1), \quad m_i = \sum_{k=1}^{\infty} m_{ik}, \quad m_{ik} \sim \mathrm{Pois}(\theta_{ik}), \quad \theta_{ik} \sim \mathrm{Gamma}(r_k, e^{\beta_k^\top x_i}). \tag{5}$$

From (3) and (5), it is clear that one may declare hyperplane $k$ as inactive if $\sum_i m_{ik} = 0$.

### 2.2 Inductive bias and distinction from multilayer perceptron

Below we reveal the inductive bias of iSHM in prioritizing the fit of the data labeled as 1, due to the use of the Bernoulli-Poisson link that has previously been applied for network analysis [18, 22, 23] and multi-label learning [24]. As the negative log-likelihood (NLL) for $x_i$ can be expressed as

$$\mathrm{NLL}(x_i) = -y_i \ln\!\left(1 - e^{-\lambda_i}\right) + (1 - y_i)\lambda_i, \qquad \lambda_i = \sum_{k=1}^{\infty} r_k \ln(1 + e^{x_i^\top \beta_k}),$$

we have $\mathrm{NLL}(x_i) = \lambda_i - \ln(e^{\lambda_i} - 1)$ if $y_i = 1$ and $\mathrm{NLL}(x_i) = \lambda_i$ if $y_i = 0$. As $\lambda_i - \ln(e^{\lambda_i} - 1)$ quickly explodes towards $+\infty$ as $\lambda_i \to 0$, when $y_i = 1$, iSHM would adjust $r_k$ and $\beta_k$ to avoid at all cost overly suppressing $x_i$ (i.e., making $\lambda_i$ too small). By contrast, it has a high tolerance of failing to sufficiently suppress an $x_i$ with $y_i = 0$. Thus each $x_i$ with $y_i = 1$ would be made sufficiently close to at least one active support hyperplane. By contrast, while each $x_i$ with $y_i = 0$ is desired to be far away from any support hyperplanes, violating that is typically not strongly penalized.
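To make this asymmetry concrete, here is a minimal numpy sketch (an illustration under a finite truncation, not the authors' released code) that computes $\lambda_i$ from the softplus proximities and weights as in (1), and evaluates the Bernoulli-Poisson negative log-likelihood for both label values at a few activation levels; the numbers are arbitrary.

```python
import numpy as np

def neg_log_lik(lam, y):
    """Bernoulli-Poisson NLL: -y*ln(1 - exp(-lam)) + (1 - y)*lam."""
    return -y * np.log1p(-np.exp(-lam)) + (1 - y) * lam

rng = np.random.default_rng(0)
betas = rng.normal(size=(5, 3))                     # K = 5 hyperplanes in a 2-D covariate space + bias
r = rng.gamma(0.2, 1.0, size=5)                     # non-negative hyperplane weights
x = np.concatenate(([1.0], rng.normal(size=2)))     # bias term plus covariates

lam = np.sum(r * np.logaddexp(0.0, betas @ x))      # lambda_i = sum_k r_k * softplus(beta_k^T x_i)
print("P(y=1 | x) =", 1.0 - np.exp(-lam))           # equivalent to the noisy-OR form in Eq. (1)

for lam_small in (1e-3, 0.1, 1.0):
    print(lam_small, neg_log_lik(lam_small, y=1), neg_log_lik(lam_small, y=0))
    # as lambda -> 0 the y = 1 loss blows up, while the y = 0 loss only shrinks
```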
Therefore, by training a pair of iSHMs under two opposite labeling settings, two sets of support hyperplanes can be inferred to sufficiently cover the covariate space occupied by the training data from both classes.

Note that, as in (4), iSHM may be viewed as an infinite-wide single-hidden-layer neural network that connects the input layer to the $k$th hidden unit via the connection weights $\beta_k$ and the softplus nonlinear activation function $\ln(1 + e^{\beta_k^\top x_i})$, and further passes a non-negative weighted combination of these hidden units through the Bernoulli-Poisson link to obtain the conditional class probability. From this point of view, it can be related to a single-hidden-layer multilayer perceptron (MLP) [5, 25] that uses a softplus activation function and cross-entropy loss, with the output activation expressed as $\sigma[w^\top \ln(1 + e^{B x_i})]$, where $\sigma(x) = 1/(1 + e^{-x})$, $K$ is the number of hidden units, $B = (\beta_1, \ldots, \beta_K)^\top$, and $w = (w_1, \ldots, w_K)^\top \in \mathbb{R}^K$. Note that minimizing the cross-entropy loss is equivalent to maximizing the likelihood of $y_i \mid w, B, x_i \sim \mathrm{Bernoulli}\big((1 + e^{-\sum_{k=1}^{K} w_k \ln(1 + e^{x_i^\top \beta_k})})^{-1}\big)$, which is biased towards fitting neither the data with $y_i = 1$ nor those with $y_i = 0$, since $\mathrm{NLL}(x_i) = \ln\big(e^{-y_i w^\top \ln(1 + e^{B x_i})} + e^{(1 - y_i) w^\top \ln(1 + e^{B x_i})}\big)$. Therefore, while iSHM is structurally similar to an MLP, it is distinct in its unbounded layer width, its positivity constraint on the weights $r_k$ connecting the hidden and output layers, its ability to rigorously define whether a hyperplane is active or inactive, and its inductive bias towards fitting the data labeled as 1. As in practice labeling which class as 1 may be arbitrary, we predict the class label with

$$\tfrac{1}{2}\Big(1 - e^{-\sum_{k=1}^{\infty} r_k \ln(1 + e^{x_i^\top \beta_k})} + e^{-\sum_{k=1}^{\infty} r'_k \ln(1 + e^{x_i^\top \beta'_k})}\Big),$$

where $\{r_k, \beta_k\}_{1,\infty}$ and $\{r'_k, \beta'_k\}_{1,\infty}$ are from a pair of iSHMs trained by labeling the data belonging to this class as 1 and 0, respectively.

### 2.3 Convex polytope geometric constraint

It is straightforward to show that iSHM with a single unit-weighted hyperplane reduces to logistic regression, $y_i \sim \mathrm{Bernoulli}[1/(1 + e^{-x_i^\top \beta})]$. To interpret the role of each individual support hyperplane when multiple non-negligibly weighted ones are inferred by iSHM, we analogize each $\beta_k$ to an expert of a committee that collectively makes binary decisions. For expert (hyperplane) $k$, the weight $r_k$ indicates how strongly its opinion is weighted by the committee, $b_{ik} = 0$ represents that it votes "No," and $b_{ik} = 1$ represents that it votes "Yes." Since $y_i = \vee_{k=1}^{\infty} b_{ik}$, the committee would vote "No" if and only if all its experts vote "No" (i.e., all $b_{ik}$ are zeros); in other words, the committee would vote "Yes" even if only a single expert votes "Yes."

Let us now examine the confined covariate space that satisfies the inequality $P(y_i = 1 \mid x_i) \leq p_0$, where a data point is labeled as 1 with a probability no greater than $p_0$. The following theorem shows that it defines a confined space bounded by a convex polytope, as defined by the intersection of the countably infinite half-spaces specified by $p_{ik} \leq p_0$.

Theorem 1 (Convex polytope). For iSHM, the confined space specified by the inequality

$$P(y_i = 1 \mid \{r_k, \beta_k\}_k, x_i) \leq p_0 \tag{6}$$

is bounded by a convex polytope defined by the set of solutions to countably infinite inequalities,

$$x_i^\top \beta_k \leq \ln\!\left((1 - p_0)^{-\frac{1}{r_k}} - 1\right), \quad k \in \{1, 2, \ldots\}. \tag{7}$$

The convex polytope defined in (7) is enclosed by the intersection of countably infinite half-spaces.
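The half-space description in (7) can be checked directly. Below is a minimal numpy sketch (an illustrative assumption, not the released code) that, for a point $x_i$ and a finite set of active hyperplanes, counts how many of the inequalities in (7) are violated; a count of zero means the point lies inside the convex polytope, while a nonzero count means it is predicted as class 1 with probability greater than $p_0$. The same count is what the contour maps in Fig. 1 visualize.

```python
import numpy as np

def num_violated_inequalities(x, betas, r, p0=0.5):
    """Count violations of Eq. (7). x: (V+1,) incl. bias; betas: (K, V+1); r: (K,) active weights."""
    thresholds = np.log((1.0 - p0) ** (-1.0 / r) - 1.0)   # right-hand side of Eq. (7), one per hyperplane
    return int(np.sum(betas @ x > thresholds))            # 0 means x is inside the convex polytope
```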
If we set $p_0 = 0.5$ as the probability threshold to make binary decisions, then the convex polytope assigns a label of $y_i = 0$ to an $x_i$ inside the convex polytope (i.e., an $x_i$ that satisfies all the inequalities in Eq. 7) with a relatively high probability, and assigns a label of $y_i = 1$ to an $x_i$ outside the convex polytope (i.e., an $x_i$ that violates at least one of the inequalities in Eq. 7) with a probability of at least 50%. Note that hyperplane $k$ with $r_k \to 0$ has a negligible impact on the conditional class probability. Choosing the gamma process as the nonparametric Bayesian prior sidesteps the need to tune the number of experts. It shrinks the weights of all unnecessary experts, allowing a finite number of non-negligibly weighted ones (support hyperplanes) to be automatically inferred from the data. We provide in Appendix B the connections to previously proposed multi-hyperplane models [26-30].

### 2.4 Gibbs sampling and MAP inference via SGD

For convenience of implementation, we truncate the gamma process with a finite and discrete base measure $G_0 = \sum_{k=1}^{K} \frac{\gamma_0}{K} \delta_{\beta_k}$, where $K$ will be set sufficiently large to approximate the truly countably infinite model. We express iSHM using (5) together with

$$r_k \sim \mathrm{Gamma}(\gamma_0/K, 1/c_0), \quad \gamma_0 \sim \mathrm{Gamma}(a_0, 1/b_0), \quad c_0 \sim \mathrm{Gamma}(e_0, 1/f_0),$$
$$\beta_{vk} \sim \int N(0, \alpha_{vk}^{-1})\,\mathrm{Gamma}(\alpha_{vk}; a_\beta, 1/b_{\beta k})\, d\alpha_{vk}, \ \ v = 0, \ldots, V, \qquad b_{\beta k} \sim \mathrm{Gamma}(e_0, 1/f_0),$$

where the normal-gamma construction promotes sparsity on the connection weights $\beta_k$ [31]. We describe both Gibbs sampling, desirable for uncertainty quantification, and maximum a posteriori (MAP) inference, suitable for large-scale training, in Algorithm 1. We use data augmentation and marginalization to derive Gibbs sampling, with the details deferred to Appendix B. For MAP inference, we use Adam [32] in TensorFlow to minimize a stochastic objective function $f(\{\beta_k, \ln r_k\}_{1}^{K}, \{y_i, x_i\}_{i_1}^{i_M}) + f(\{\beta'_k, \ln r'_k\}_{1}^{K}, \{y'_i, x_i\}_{i_1}^{i_M})$, which embeds the hierarchical Bayesian model's inductive bias and inherent shrinkage mechanism into optimization, where $M$ is the size of a randomly selected mini-batch, $y'_i := 1 - y_i$, $\lambda_i := \sum_{k=1}^{K} e^{\ln r_k} \ln(1 + e^{x_i^\top \beta_k})$, and

$$f(\{\beta_k, \ln r_k\}_{1}^{K}, \{y_i, x_i\}_{i_1}^{i_M}) = \sum_{k=1}^{K}\Big[-\tfrac{\gamma_0}{K} \ln r_k + c_0 e^{\ln r_k}\Big] + \big(a_\beta + \tfrac{1}{2}\big) \sum_{v=0}^{V} \sum_{k=1}^{K} \ln\!\Big(1 + \tfrac{\beta_{vk}^2}{2 b_{\beta k}}\Big) + \tfrac{N}{M} \sum_{i=i_1}^{i_M}\Big[-y_i \ln\!\big(1 - e^{-\lambda_i}\big) + (1 - y_i)\lambda_i\Big]. \tag{8}$$

## 3 Network-depth learning via forward model selection

The second essential component of the proposed capacity regularization strategy is to find a way to increase the network depth and determine how deep is deep enough. Our solution is to sequentially stack a pair of iSHMs on top of the previously trained one, and to develop a forward model selection criterion to decide when to stop stacking another pair. We refer to the resulting model as the parsimonious Bayesian deep network (PBDN), as described below in detail.

The noisy-OR hyperplane interactions allow iSHM to go beyond simple linear separation, but with limited capacity due to the convex-polytope constraint imposed on the decision boundary. On the other hand, it is the convex-polytope constraint that provides an implicit regularization, determining how many non-negligibly weighted support hyperplanes are necessary in the covariate space to sufficiently activate all data of class 1, while somewhat suppressing the data of class 0. In this paper, we find that the model capacity can be quickly enhanced by sequentially stacking such convex-polytope constraints under a feedforward deep structure, while preserving the virtue of being able to learn the number of support hyperplanes in the (transformed) covariate space.
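Returning briefly to the MAP objective (8) of Section 2.4, here is a minimal numpy sketch (an assumption for illustration; the paper's released implementation uses Adam in TensorFlow) that evaluates the per-mini-batch objective for one iSHM: the gamma prior term on $r_k$, the sparsity-promoting penalty on $\beta_{vk}$, and the rescaled Bernoulli-Poisson negative log-likelihood. The default hyperparameter values follow the MAP settings reported in Section 4, and treating $b_{\beta k}$ as a fixed scalar is a simplification. In the paired objective, the same function is evaluated a second time with the flipped labels $y'_i = 1 - y_i$ and its own parameters, and the two are summed.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def map_objective(beta, log_r, X, y, N, gamma_0=1.0, c_0=1.0, a_beta=1e-6, b_beta=1e-6):
    """Eq. (8) for one iSHM. beta: (K, V+1); log_r: (K,); X: (M, V+1) mini-batch incl. bias; y: (M,)."""
    K, M = beta.shape[0], X.shape[0]
    prior_r = np.sum(-(gamma_0 / K) * log_r + c_0 * np.exp(log_r))                # gamma prior on r_k
    prior_beta = (a_beta + 0.5) * np.sum(np.log1p(beta ** 2 / (2.0 * b_beta)))    # sparsity penalty on beta_vk
    lam = softplus(X @ beta.T) @ np.exp(log_r)                                    # lambda_i for the mini-batch
    nll = -y * np.log1p(-np.exp(-lam)) + (1.0 - y) * lam                          # Bernoulli-Poisson NLL
    return prior_r + prior_beta + (N / M) * np.sum(nll)
```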
More specifically, as shown in Fig. 2 (c) of the Appendix, we first train a pair of iSHMs that regress the current labels $y_i \in \{0, 1\}$ and the flipped ones $y'_i = 1 - y_i$, respectively, on the original covariates $x_i \in \mathbb{R}^{V}$. After obtaining $K_2$ support hyperplanes $\{\beta_k^{(1\to 2)}\}_{1,K_2}$, constituted by the active support hyperplanes inferred by both the iSHM trained with $y_i$ and the one trained with $y'_i$, we use $\ln(1 + e^{x_i^\top \beta_k^{(1\to 2)}})$ as the hidden units of the second layer (first hidden layer). More precisely, with $t \in \{1, 2, \ldots\}$, $K_0 := 0$, $K_1 := V$, $\tilde{x}^{(0)}_i := \emptyset$, and $\tilde{x}^{(1)}_i := x_i$, denoting $x^{(t)}_i := [1, (\tilde{x}^{(t-1)}_i)^\top, (\tilde{x}^{(t)}_i)^\top]^\top \in \mathbb{R}^{K_{t-1} + K_t + 1}$ as the input data vector to layer $t + 1$, the $t$th added pair of iSHMs transforms $x^{(t)}_i$ into the hidden units of layer $t + 1$, expressed as

$$\tilde{x}^{(t+1)}_i = \Big[\ln\big(1 + e^{(x^{(t)}_i)^\top \beta^{(t\to t+1)}_1}\big), \ldots, \ln\big(1 + e^{(x^{(t)}_i)^\top \beta^{(t\to t+1)}_{K_{t+1}}}\big)\Big]^\top.$$

Hence the input vectors used to train the next layer would be $x^{(t+1)}_i = [1, (\tilde{x}^{(t)}_i)^\top, (\tilde{x}^{(t+1)}_i)^\top]^\top \in \mathbb{R}^{K_t + K_{t+1} + 1}$. Therefore, if the computational cost of a single inner product $x_i^\top \beta_k$ (e.g., logistic regression) is one, then that for $T$ hidden layers would be about $\sum_{t=1}^{T} (K_{t-1} + K_t + 1) K_{t+1} / (V + 1)$. Note that one may also use $x^{(t+1)}_i = \tilde{x}^{(t+1)}_i$, or $x^{(t+1)}_i = [(x^{(t)}_i)^\top, (\tilde{x}^{(t+1)}_i)^\top]^\top$, or other related concatenation methods to construct the covariates to train the next layer.

Our intuition for why PBDN, constructed in this greedy layer-wise manner, works well is that for two iSHMs trained on the same covariate space under two opposite labeling settings, one iSHM places enough hyperplanes to define the complement of a convex polytope to sufficiently activate all data labeled as 1, while the other does so for all data labeled as 0. Thus, for any $x_i$, at least one $p_{ik}$ would be sufficiently activated; in other words, $x_i$ would be sufficiently close to at least one of the active hyperplanes of the iSHM pair. This mechanism prevents any $x_i$ from being completely suppressed after transformation. Consequently, these transformed covariates $\ln(1 + e^{\beta_k^\top x_i})$, which can also be concatenated with $x_i$, will be further used to train another iSHM pair. Thus, even though a single iSHM pair may not be powerful enough, by keeping all covariate vectors sufficiently activated after transformation, iSHM pairs can simply be stacked sequentially to gradually enhance the model capacity, with a strong resistance to overfitting and hence without the necessity of cross-validation.

While stacking an additional iSHM pair on PBDN could enhance the model capacity, once the network with $T$ hidden layers is sufficiently deep that the two classes become well separated, there is no need to add an extra iSHM pair. To detect when it is appropriate to stop adding another iSHM pair, as shown in Algorithm 2 of the Appendix, we consider a forward model selection strategy that sequentially stacks one iSHM pair after another, until the following criterion starts to rise:

$$\mathrm{AIC}(T) = \sum_{t=1}^{T} 2(K_t + 1)K_{t+1} + 2K_{T+1} - 2\sum_i \Big[\ln P\big(y_i \mid x^{(T)}_i\big) + \ln P\big(y'_i \mid x^{(T)}_i\big)\Big], \tag{9}$$

where $2(K_t + 1)K_{t+1}$ represents the cost of adding the $t$th hidden layer and $2K_{T+1}$ represents the cost of using the $K_{T+1}$ nonnegative weights $\{r^{(T+1)}_k\}_k$ to connect the $T$th hidden layer and the output layer. With (9), we choose the PBDN with $T$ hidden layers if $\mathrm{AIC}(t+1) \leq \mathrm{AIC}(t)$ for $t = 1, \ldots, T - 1$ and $\mathrm{AIC}(T+1) > \mathrm{AIC}(T)$.
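The following is a minimal sketch (an assumed outline, not the released code) of this greedy procedure: train an iSHM pair on the current covariates, score the deeper model with an AIC-style criterion in the spirit of (9), stop when the criterion rises, and otherwise transform the covariates with the newly inferred hyperplanes and continue. `train_ishm_pair` is a hypothetical helper standing in for Gibbs or SGD training of one iSHM pair, and the parameter-count penalty here simply uses the input width of the new layer rather than the exact $2(K_t + 1)K_{t+1}$ of (9).

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def grow_pbdn(X, y, train_ishm_pair, max_depth=10):
    """Greedy depth selection. X: (N, V) raw covariates; y: (N,) binary labels in {0, 1}."""
    N = X.shape[0]
    tilde_prev = np.empty((N, 0))        # tilde_x^{(t-1)}, empty for t = 1
    tilde_curr = X                       # tilde_x^{(t)},  the raw covariates for t = 1
    layers, best_aic, penalty = [], np.inf, 0.0
    for t in range(1, max_depth + 1):
        X_in = np.hstack([np.ones((N, 1)), tilde_prev, tilde_curr])    # x^{(t)} = [1, tilde_x^{(t-1)}, tilde_x^{(t)}]
        betas, r, log_lik = train_ishm_pair(X_in, y)                   # active hyperplanes of the t-th pair
        penalty += 2.0 * X_in.shape[1] * betas.shape[0]                # cost of adding the t-th hidden layer
        aic = penalty + 2.0 * betas.shape[0] - 2.0 * log_lik           # AIC-style criterion, cf. Eq. (9)
        if aic > best_aic:                                             # criterion starts to rise: stop here
            break
        best_aic = aic
        layers.append((betas, r))
        tilde_prev, tilde_curr = tilde_curr, softplus(X_in @ betas.T)  # hidden units become tilde_x^{(t+1)}
    return layers
```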
We also consider another model selection criterion that accounts for the sparsity of $\beta^{(t\to t+1)}_k$, the connection weights between adjacent layers, using

$$\mathrm{AIC}_\epsilon(T) = \sum_{t=1}^{T} 2\Big(\big\|\,|B_t| > \epsilon\,\beta^{\max}_t\big\|_0 + \big\|\,|B'_t| > \epsilon\,\beta'^{\max}_t\big\|_0\Big) + 2K_{T+1} - 2\sum_i \Big[\ln P\big(y_i \mid x^{(T)}_i\big) + \ln P\big(y'_i \mid x^{(T)}_i\big)\Big], \tag{10}$$

where $\|B\|_0$ is the number of nonzero elements in matrix $B$, $\epsilon > 0$ is a small constant, and $B_t$ and $B'_t$ consist of the $\beta^{(t\to t+1)}_k$ trained by the first and second iSHMs of the $t$th iSHM pair, respectively, with $\beta^{\max}_t$ and $\beta'^{\max}_t$ as their respective maximum absolute values.

Note that if the covariates have any spatial or temporal structure to exploit, one may replace each $x_i^\top \beta_k$ in iSHM with $[\mathrm{CNN}_\phi(x_i)]^\top \beta_k$, where $\mathrm{CNN}_\phi(\cdot)$ represents a convolutional neural network parameterized by $\phi$, to construct a CNN-iSHM, which can then be further greedily grown into a CNN-PBDN. Customizing PBDNs for structured covariates, such as image pixels and audio measurements, is a promising research topic; it is, however, beyond the scope of this paper.

## 4 Illustrations and experimental results

Code for reproducible research is available at https://github.com/mingyuanzhou/PBDN.

Figure 1: Visualization of PBDN, each layer of which is a pair of iSHMs trained on a two-spirals dataset under two opposite labeling settings. Subfigures: 1) a pair of infinite support hyperplane machines (iSHMs); 2) PBDN with 2 pairs of iSHMs; 3) PBDN with 4 pairs of iSHMs; 4) PBDN with 10 pairs of iSHMs. For Subfigure 1), (a) shows the weights $r_k$ of the inferred active support hyperplanes, ordered by their values, and (d) shows the trace plot of the average per-data-point log-likelihood. For the iSHM trained by labeling the red (blue) points as ones (zeros), (b) shows a contour map, the value of each point of which represents how many inequalities specified in (7) are violated, and whose region with zero values corresponds to the convex polytope enclosed by the intersection of the hyperplanes defined in (7), and (c) shows the contour map of predicted class probabilities. (e-f) are analogous plots to (b-c) for the iSHM trained with the blue points labeled as ones. The inferred data subtypes as in (2) are represented as red circles and blue pentagrams in subplots (b) and (d). Subfigures 2)-4) are analogous to 1), with two main differences: (i) the transformed covariates used to train the newly added iSHM pair are obtained by propagating the original 2-D covariates through the previously trained iSHM pairs, and (ii) the contour maps in subplots (b) and (e) visualize the iSHM linear hyperplanes in the transformed space by projecting them back to the original 2-D covariate space.

To illustrate the imposed geometric constraint and inductive bias of a single iSHM, we first consider a challenging 2-D two-spirals dataset, as shown in Fig. 1, whose two classes are not fully separable by a convex polytope. We train 10 pairs of iSHMs one pair after another, which are organized into a ten-hidden-layer PBDN, whose numbers of hidden units from the 1st to 10th hidden layers (i.e., numbers of support hyperplanes of the 1st to 10th iSHM pairs) are inferred to be 8, 14, 15, 11, 19, 22, 23, 18, 19, and 29, respectively. Both AIC and AIC$_{\epsilon=0.01}$ infer the depth as $T = 4$. For Fig. 1, we first train an iSHM by labeling the red points as 1 and the blue as 0, whose results are shown in subplots 1) (b-c), and another iSHM under the opposite labeling setting, whose results are shown in subplots 1) (e-f).
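As a small illustration of the sparsity-aware parameter count used by AIC$_\epsilon$ in (10) (a sketch based on our reading of the criterion, not the released code), the following counts, for each layer, the connection weights whose magnitude exceeds $\epsilon$ times that layer's maximum absolute weight; weights below the threshold are treated as effectively zero and do not add to the model-complexity penalty.

```python
import numpy as np

def aic_eps_penalty(B_list, B_prime_list, eps=0.01):
    """Per-layer penalty term of Eq. (10). B_list / B_prime_list: weight matrices of the
    first / second iSHM in each pair, one entry per hidden layer."""
    cost = 0.0
    for B, B_prime in zip(B_list, B_prime_list):
        cost += 2.0 * np.sum(np.abs(B) > eps * np.abs(B).max())              # effectively nonzero weights
        cost += 2.0 * np.sum(np.abs(B_prime) > eps * np.abs(B_prime).max())
    return cost   # add 2*K_{T+1} and subtract twice the log-likelihood terms to obtain AIC_eps(T)
```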
It is evident that when labeling the red points as 1, as shown in 1) (a-c), iSHM infers a convex polytope, intersected by three active support hyperplanes, to enclose the blue points, but at the same time allows the red points within that convex polytope to pass through with appreciable activations. When labeling the blue points as 1, iSHM infers five active support hyperplanes, two of which are visible in the covariate space shown in 1) (e), to enclose the red points, but at the same time allows the blue points within that convex polytope to pass through with appreciable activations, as shown in 1) (f). Being able to use only a convex polytope to enclose the data labeled as 0 is a restriction of iSHM, but it is also why the iSHM pair can appropriately place two parsimonious sets of active support hyperplanes in the covariate space, ensuring that the maximum distance of any data point to these support hyperplanes is sufficiently small.

Second, we concatenate the 8-D transformed and 2-D original covariates as 10-D covariates, which are further augmented with a constant bias term of one, to train the second pair of iSHMs. As in subplot 2) (a), five active support hyperplanes are inferred in the 10-D covariate space when labeling the blue points as 1, which can be matched to five nonzero smooth segments of the original 2-D space shown in 2) (e), which align well with the highly activated regions in 2) (f). Again, with the inductive bias, as in 2) (f), all positively labeled data points are sufficiently activated at the expense of allowing some negative ones to be only moderately suppressed. Nine support hyperplanes are inferred when labeling the red points as 1, and similar activation behaviors can also be observed in 2) (b-c).

Table 1: Visualization of the subtypes inferred by PBDN in a random trial and comparison of classification error rates (mean ± standard deviation over five random trials) between PBDN and a two-hidden-layer DNN (128-64) on four different MNIST binary classification tasks. The subtype panels are: (a) subtypes of 3 in 3 vs 5; (b) subtypes of 3 in 3 vs 8; (c) subtypes of 4 in 4 vs 7; (d) subtypes of 4 in 4 vs 9; (e) subtypes of 5 in 3 vs 5; (f) subtypes of 8 in 3 vs 8; (g) subtypes of 7 in 4 vs 7; (h) subtypes of 9 in 4 vs 9.

|      | 3 vs 5 | 3 vs 8 | 4 vs 7 | 4 vs 9 |
|------|--------|--------|--------|--------|
| PBDN | 2.53% ± 0.22% | 2.66% ± 0.27% | 1.37% ± 0.18% | 2.95% ± 0.47% |
| DNN  | 2.78% ± 0.36% | 2.93% ± 0.40% | 1.21% ± 0.12% | 2.98% ± 0.17% |

We continue the same greedy layer-wise training strategy to add another eight iSHM pairs. From Fig. 1, Subfigures 3)-4), it becomes more and more clear that a support hyperplane, inferred in the transformed covariate space of a deep hidden layer, can be used to represent the boundary of a complex but smooth segment of the original covariate space that well encloses all or a subset of the data labeled as 1.

We apply PBDN to four different MNIST binary classification tasks and compare its performance with DNN (128-64), a two-hidden-layer deep neural network that will be described in detail below. As in Tab. 1, both AIC and AIC$_{\epsilon=0.01}$ infer the depth as $T = 1$ for PBDN, and infer for each class only a few active hyperplanes, each of which represents a distinct data subtype, as calculated with (2). In a random trial, the inferred networks of PBDN for all four tasks have only a single hidden layer with at most 6 active hidden units. Thus its testing computation is much lower than that of DNN (128-64), while providing an overall lower testing error rate (both trained with 4000 mini-batches of size 100).
Below we provide a comprehensive comparison on eight widely used benchmark datasets between the proposed PBDNs and a variety of algorithms, including logistic regression, the Gaussian radial basis function (RBF) kernel support vector machine (SVM), the relevance vector machine (RVM) [31], the adaptive multi-hyperplane machine (AMM) [27], the convex polytope machine (CPM) [30], and the deep neural network (DNN) classifier (DNNClassifier) provided in TensorFlow [33]. Except for logistic regression, which is a linear classifier, both kernel SVM and RVM are widely used nonlinear classifiers relying on the kernel trick, both AMM and CPM intersect multiple hyperplanes to construct their decision boundaries, and DNN uses a multilayer feedforward network, whose network structure often needs to be tuned to achieve a good balance between data fitting and model complexity, to handle nonlinearity. We consider DNN (8-4), a two-hidden-layer DNN that uses 8 and 4 hidden units for its first and second hidden layers, respectively, DNN (32-16), and DNN (128-64). In the Appendix, we summarize in Tab. 4 the information of the eight benchmark datasets: banana, breast cancer, titanic, waveform, german, image, ijcnn1, and a9a. For a fair comparison, to ensure the same training/testing partitions for all algorithms across all datasets, we report the results obtained by using either widely used open-source software packages or the code made publicly available by the original authors. We describe in the Appendix the settings of all competing algorithms.

For all datasets, we follow Algorithm 1 to first train a single-hidden-layer PBDN (PBDN1), i.e., a pair of iSHMs fitted under two opposite labeling settings. We then follow Algorithm 2 to train another pair of iSHMs to construct a two-hidden-layer PBDN (PBDN2), and repeat the same procedure to train PBDN3 and PBDN4. Note that we observe that PBDN's log-likelihood increases rapidly during the first few hundred MCMC/SGD iterations, then keeps increasing at a slower pace, and eventually fluctuates. However, it often takes more iterations to shrink the weights of unneeded hyperplanes towards deactivation. Thus, although insufficient iterations may not necessarily degrade the final out-of-sample prediction accuracy, they may lead to a less compact network and hence higher computational cost for out-of-sample prediction. For each iSHM, we set the upper bound on the number of support hyperplanes as $K_{\max} = 20$. For Gibbs sampling, we run 5000 iterations and record the $\{r_k, \beta_k\}_k$ with the highest likelihood during the last 2500 iterations; for MAP, we process 4000 mini-batches of size $M = 100$, with $0.05/(4 + T)$ as the Adam learning rate for the $T$th added iSHM pair. We use the inferred $\{r_k, \beta_k\}_k$ to either produce out-of-sample predictions or generate transformed covariates for the next layer. We set $a_0 = b_0 = 0.01$, $e_0 = f_0 = 1$, and $a_\beta = 10^{-6}$ for Gibbs sampling. We fix $\gamma_0 = c_0 = 1$ and $a_\beta = b_{\beta k} = 10^{-6}$ for MAP inference. As in Algorithm 1, we prune inactive support hyperplanes once every 200 MCMC or 500 SGD iterations to facilitate computation.

Table 2: Comparison of classification error rates between a variety of algorithms and the proposed PBDNs with 1, 2, or 4 hidden layers, and PBDN-AIC and PBDN-AIC$_{\epsilon=0.01}$ trained with Gibbs sampling or SGD. Error rates are in % (mean ± standard deviation). Displayed in the last two rows are, for each algorithm, the average of its error rates and the average of its computational complexities, normalized by those of kernel SVM.
| Dataset | LR | SVM | RVM | AMM | CPM | DNN (8-4) | DNN (32-16) | DNN (128-64) | PBDN1 | PBDN2 | PBDN4 | AIC-Gibbs | AICϵ-Gibbs | AIC-SGD | AICϵ-SGD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| banana | 47.76 ± 4.38 | 10.85 ± 0.57 | 11.08 ± 0.69 | 18.76 ± 4.09 | 21.59 ± 3.00 | 14.07 ± 5.87 | 10.88 ± 0.36 | 10.91 ± 0.45 | 22.54 ± 5.28 | 10.94 ± 0.46 | 10.85 ± 0.43 | 10.97 ± 0.46 | 10.97 ± 0.46 | 11.49 ± 0.65 | 11.79 ± 0.89 |
| breast cancer | 28.05 ± 3.68 | 28.44 ± 4.52 | 31.56 ± 4.66 | 31.82 ± 4.47 | 30.13 ± 5.26 | 32.73 ± 4.77 | 33.51 ± 6.47 | 32.21 ± 6.64 | 29.22 ± 3.53 | 30.26 ± 4.86 | 30.26 ± 5.55 | 29.22 ± 3.53 | 29.22 ± 3.53 | 32.08 ± 6.18 | 29.87 ± 5.27 |
| titanic | 22.67 ± 0.98 | 22.33 ± 0.63 | 23.20 ± 1.08 | 28.85 ± 8.56 | 23.38 ± 3.23 | 22.32 ± 1.38 | 22.35 ± 1.36 | 22.28 ± 1.37 | 22.88 ± 0.59 | 22.25 ± 1.27 | 22.16 ± 1.13 | 22.88 ± 0.59 | 22.88 ± 0.59 | 23.00 ± 0.37 | 23.00 ± 0.37 |
| waveform | 13.33 ± 0.59 | 10.73 ± 0.86 | 11.16 ± 0.72 | 11.81 ± 1.13 | 13.52 ± 1.97 | 12.43 ± 0.91 | 11.66 ± 0.89 | 11.20 ± 0.63 | 11.67 ± 0.90 | 11.40 ± 0.86 | 11.69 ± 1.34 | 11.42 ± 0.94 | 11.45 ± 0.93 | 12.38 ± 0.59 | 12.04 ± 0.72 |
| german | 23.63 ± 1.70 | 23.30 ± 2.51 | 23.67 ± 2.28 | 25.13 ± 3.73 | 23.57 ± 2.15 | 27.83 ± 3.60 | 28.47 ± 3.00 | 25.73 ± 2.62 | 23.57 ± 2.43 | 23.80 ± 2.22 | 23.77 ± 2.27 | 23.77 ± 2.52 | 23.90 ± 2.56 | 26.87 ± 2.60 | 23.93 ± 2.01 |
| image | 17.53 ± 1.05 | 2.84 ± 0.52 | 3.82 ± 0.59 | 3.82 ± 0.87 | 3.59 ± 0.71 | 4.83 ± 1.51 | 2.54 ± 0.45 | 2.44 ± 0.56 | 3.32 ± 0.59 | 2.43 ± 0.51 | 2.30 ± 0.40 | 2.36 ± 0.54 | 2.30 ± 0.52 | 2.18 ± 0.41 | 2.27 ± 0.36 |
| ijcnn1 | 8.41 ± 0.60 | 4.01 ± 0.53 | 5.37 ± 1.40 | 4.82 ± 1.99 | 4.83 ± 1.14 | 6.35 ± 1.33 | 5.17 ± 0.83 | 4.17 ± 0.97 | 5.46 ± 1.32 | 4.33 ± 0.84 | 4.09 ± 0.76 | 4.33 ± 0.84 | 4.06 ± 0.84 | 4.18 ± 0.73 | 4.15 ± 0.68 |
| a9a | 15.34 ± 0.11 | 15.87 ± 0.33 | 15.39 ± 0.12 | 15.97 ± 0.23 | 15.56 ± 0.23 | 15.82 ± 0.23 | 16.16 ± 0.22 | 16.97 ± 0.55 | 15.74 ± 0.12 | 15.64 ± 0.10 | 15.70 ± 0.09 | 15.74 ± 0.12 | 15.74 ± 0.12 | 19.90 ± 0.30 | 17.13 ± 0.19 |
| Mean of SVM-normalized errors | 2.237 | 1.000 | 1.110 | 1.234 | 1.227 | 1.260 | 1.087 | 1.031 | 1.219 | 1.009 | 0.998 | 1.006 | 0.996 | 1.073 | 1.029 |
| Mean of SVM-normalized K | 0.006 | 1.000 | 0.113 | 0.069 | 0.046 | 0.073 | 0.635 | 8.050 | 0.042 | 0.060 | 0.160 | 0.057 | 0.064 | 0.128 | 0.088 |

Table 3: The inferred depth of PBDN, which increases its depth until a model selection criterion starts to rise (mean ± standard deviation).

| Dataset | banana | breast cancer | titanic | waveform | german | image | ijcnn1 | a9a |
|---|---|---|---|---|---|---|---|---|
| AIC-Gibbs | 2.30 ± 0.48 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.90 ± 0.74 | 1.30 ± 0.67 | 2.40 ± 0.52 | 2.00 ± 0.00 | 1.00 ± 0.00 |
| AICϵ=0.01-Gibbs | 2.30 ± 0.48 | 1.00 ± 0.00 | 1.00 ± 0.00 | 2.00 ± 0.67 | 1.60 ± 0.84 | 2.60 ± 0.52 | 3.40 ± 0.55 | 1.00 ± 0.00 |
| AIC-SGD | 3.20 ± 0.78 | 1.90 ± 0.99 | 1.00 ± 0.00 | 2.40 ± 0.52 | 2.80 ± 0.63 | 2.90 ± 0.74 | 3.20 ± 0.45 | 3.20 ± 0.45 |
| AICϵ=0.01-SGD | 2.80 ± 0.63 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.50 ± 0.53 | 1.00 ± 0.00 | 2.00 ± 0.00 | 3.00 ± 0.00 | 1.00 ± 0.00 |

We record the out-of-sample-prediction errors and computational complexities of the various algorithms over these eight benchmark datasets in Tab. 2 and in Tab. 5 of the Appendix, respectively, and summarize in Tab. 2 the means of SVM-normalized errors and numbers of support hyperplanes/vectors. Overall, PBDN using AIC$_\epsilon$ in (10) with $\epsilon = 0.01$ to determine the depth, referred to as PBDN-AIC$_{\epsilon=0.01}$, has the highest out-of-sample prediction accuracy, followed by PBDN4, the RBF kernel SVM, PBDN using AIC in (9) to determine the depth, referred to as PBDN-AIC, PBDN2, PBDN-AIC$_{\epsilon=0.01}$ solved with SGD, DNN (128-64), PBDN-AIC solved with SGD, and DNN (32-16). Overall, logistic regression does not perform well, which is not surprising as it is a linear classifier that uses a single hyperplane to partition the covariate space into two halves to separate one class from the other. As shown in Tab. 2, for breast cancer, titanic, german, and a9a, all classifiers have comparable classification errors, suggesting minor or no advantages of using a nonlinear classifier on them. By contrast, for banana, waveform, image, and ijcnn1, all nonlinear classifiers clearly outperform logistic regression.
Note that PBDN1, which clearly reduces the classification errors of logistic regression, performs similarly to both AMM and CPM. These results are not surprising, as CPM, closely related to AMM, uses a convex polytope, defined as the intersection of multiple hyperplanes, to enclose one class, whereas the classification decision boundaries of PBDN1 can be bounded within a convex polytope that encloses the negative examples. Note that the number of hyperplanes is automatically inferred from the data by PBDN1, thanks to the inherent shrinkage mechanism of the gamma process, whereas those of AMM and CPM are both selected via cross-validation. While PBDN1 can partially remedy their sensitivity to how the data are labeled by combining the results obtained under two opposite labeling settings, the decision boundaries of the two iSHMs and those of both AMM and CPM are still restricted to a confined space related to a single convex polytope, which may explain why on banana, image, and ijcnn1 they all clearly underperform a PBDN with more than one hidden layer.

As shown in Tab. 2, DNN (8-4) clearly underperforms DNN (32-16) in terms of classification accuracy on both image and ijcnn1, indicating that having 8 and 4 hidden units for the first and second hidden layers, respectively, is far from enough for DNN to provide a sufficiently high nonlinear modeling capacity for these two datasets. Note that the equivalent number of hyperplanes for DNN ($K_1$, $K_2$), a two-hidden-layer DNN with $K_1$ and $K_2$ hidden units in the first and second hidden layers, respectively, is computed as $[(V + 1)K_1 + K_1 K_2]/(V + 1)$. Thus the computational complexity quickly increases as the network size increases. For example, DNN (8-4) is comparable to PBDN1 and PBDN-AIC in terms of out-of-sample-prediction computational complexity, as shown in Tabs. 2 and 5, but it clearly underperforms both of them in terms of classification accuracy, as shown in Tab. 2. While DNN (128-64) performs well in terms of classification accuracy, as shown in Tab. 2, its out-of-sample-prediction computational complexity is clearly higher than that of the other algorithms with comparable or better accuracy, such as RVM and PBDN, as shown in Tab. 5. In practice, moreover, the search space for a DNN with two or more hidden layers is enormous, making it difficult to determine a network that is neither too large nor too small and that achieves a good compromise between fitting the data well and having low complexity for both training and out-of-sample prediction. E.g., while DNN (128-64) could further improve the performance of DNN (32-16) on these two datasets, it uses a much larger network and has clearly higher computational complexity for out-of-sample prediction.

We show the number of active support hyperplanes inferred by PBDN in a single random trial in Figs. 3-6. For PBDN, the computation in both training and out-of-sample prediction also increases in $T$, the network depth. It is clear from Tab. 2 that increasing $T$ from 1 to 2 generally leads to the most significant improvement when there is a clear advantage of increasing $T$, and once $T$ is sufficiently large, further increasing $T$ leads to small performance fluctuations but does not appear to lead to clear overfitting. As shown in Tab. 3, the use of the AIC based greedy model selection criterion eliminates the need to tune the depth $T$, allowing it to be inferred from the data. Note that we have tried stacking CPMs in the same way that we stack iSHMs, but found that the accuracy often quickly deteriorates rather than improves.
E.g., for CPMs with 2, 3, or 4 layers, the error rates become (0.131, 0.177, 0.223) on waveform and (0.046, 0.080, 0.216) on image. The reason could be that CPM infers redundant unweighted hyperplanes that lead to strong multicollinearity in the covariates of deep layers. Note that on each given dataset, we have also tried training a DNN with the same network architecture inferred by a PBDN. While such a DNN jointly trains all its hidden layers, it provides no performance gain over the corresponding PBDN. More specifically, the DNNs using the network architectures inferred by PBDNs with AIC-Gibbs, AIC$_{\epsilon=0.01}$-Gibbs, AIC-SGD, and AIC$_{\epsilon=0.01}$-SGD have means of SVM-normalized errors of 1.047, 1.011, 1.076, and 1.144, respectively. These observations suggest the efficacy of the greedy layer-wise training strategy of the PBDN, which requires no cross-validation.

For out-of-sample prediction, the computation of a classification algorithm generally increases linearly in the number of support hyperplanes/vectors. Using logistic regression with a single hyperplane for reference, we summarize the computational complexity in Tab. 2, which indicates that, in comparison to SVM, which consistently requires the largest number of support vectors, PBDN often requires significantly less time for predicting the class label of a new data sample. For example, for out-of-sample prediction on the image dataset, as shown in Tab. 5, on average SVM uses about 212 support vectors, whereas on average PBDNs with one to five hidden layers use about 13, 16, 29, 50, and 64 hyperplanes, respectively, and PBDN-AIC uses about 22 hyperplanes, showing that in comparison to kernel SVM, PBDN can be much more computationally efficient in making out-of-sample predictions.

## 5 Conclusions

The infinite support hyperplane machine (iSHM), which interacts countably infinite non-negatively weighted hyperplanes via a noisy-OR mechanism, is employed as the building unit to greedily construct a capacity-regularized parsimonious Bayesian deep network (PBDN). iSHM has an inductive bias towards fitting the positively labeled data, and employs the gamma process to infer a parsimonious set of active hyperplanes to enclose the negatively labeled data within a convex-polytope bounded space. Due to the inductive bias and label asymmetry, iSHMs are trained in pairs to ensure a sufficient coverage of the covariate space occupied by the data from both classes. The sequentially trained iSHM pairs can be stacked into a PBDN, a feedforward deep network that gradually enhances its modeling capacity as the network depth increases, achieving high accuracy while having low computational complexity for out-of-sample prediction. PBDN can be trained using either Gibbs sampling, which is suitable for quantifying posterior uncertainty, or SGD based MAP inference, which is scalable to big data. One may potentially construct PBDNs for regression analysis of count, categorical, and continuous response variables by following the same three-step strategy: constructing a nonparametric Bayesian model that infers the number of components for the task of interest, greedily adding layers one at a time, and using a forward model selection criterion to decide how deep is deep enough.
For the first step, the recently proposed Lomax distribution based racing framework [34] could be a promising candidate for both categorical and non-negative response variables, and Dirichlet process mixtures of generalized linear models [35] could be promising candidates for continuous response variables and many other types of variables via appropriate link functions.

## Acknowledgments

M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation, the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research, and the computational support of the Texas Advanced Computing Center.

## References

[1] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[2] B. Schölkopf, C. J. C. Burges, and A. J. Smola, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[3] G. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[4] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[6] I. Steinwart, Sparseness of support vector machines, J. Mach. Learn. Res., vol. 4, pp. 1071-1105, 2003.
[7] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, Model compression, in KDD, pp. 535-541, ACM, 2006.
[8] Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing deep convolutional networks using vector quantization, arXiv preprint arXiv:1412.6115, 2014.
[9] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, 2015.
[10] S. Han, H. Mao, and W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149, 2015.
[11] T. S. Ferguson, A Bayesian analysis of some nonparametric problems, Ann. Statist., vol. 1, no. 2, pp. 209-230, 1973.
[12] M. Zhou, Y. Cong, and B. Chen, The Poisson gamma belief network, in NIPS, 2015.
[13] M. Zhou, Y. Cong, and B. Chen, Augmentable gamma belief networks, J. Mach. Learn. Res., vol. 17, no. 163, pp. 1-44, 2016.
[14] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in ICML, pp. 807-814, 2010.
[15] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in AISTATS, pp. 315-323, 2011.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in NIPS, pp. 1097-1105, 2012.
[17] W. Shang, K. Sohn, D. Almeida, and H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in ICML, 2016.
[18] M. Zhou, Infinite edge partition models for overlapping community detection and link prediction, in AISTATS, pp. 1135-1143, 2015.
[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[20] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning, vol. 37, no. 2, pp. 183-233, 1999.
[21] S. Arora, R. Ge, T. Ma, and A. Risteski, Provable learning of noisy-or networks, arXiv preprint arXiv:1612.08795, 2016.
[22] F. Caron and E. B. Fox, Sparse graphs using exchangeable random measures, J. R. Stat. Soc.: Series B, vol. 79, no. 5, pp. 1295-1366, 2017.
[23] M. Zhou, Discussion on "Sparse graphs using exchangeable random measures" by Francois Caron and Emily B. Fox, arXiv preprint arXiv:1802.07721, 2018.
[24] P. Rai, C. Hu, R. Henao, and L. Carin, Large-scale Bayesian multi-label learning via topic-based label embeddings, in NIPS, 2015.
[25] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[26] F. Aiolli and A. Sperduti, Multiclass classification with multi-prototype support vector machines, J. Mach. Learn. Res., vol. 6, pp. 817-850, 2005.
[27] Z. Wang, N. Djuric, K. Crammer, and S. Vucetic, Trading representability for scalability: Adaptive multi-hyperplane machine for nonlinear classification, in KDD, pp. 24-32, 2011.
[28] N. Manwani and P. S. Sastry, Learning polyhedral classifiers using logistic function, in ACML, pp. 17-30, 2010.
[29] N. Manwani and P. S. Sastry, Polyceptron: A polyhedral learning algorithm, arXiv:1107.1564, 2011.
[30] A. Kantchelian, M. C. Tschantz, L. Huang, P. L. Bartlett, A. D. Joseph, and J. D. Tygar, Large-margin convex polytope machine, in NIPS, pp. 3248-3256, 2014.
[31] M. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res., vol. 1, pp. 211-244, 2001.
[32] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[33] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[34] Q. Zhang and M. Zhou, Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks, in NeurIPS, 2018.
[35] L. A. Hannah, D. M. Blei, and W. B. Powell, Dirichlet process mixtures of generalized linear models, J. Mach. Learn. Res., vol. 12, pp. 1923-1953, 2011.
[36] K. Crammer and Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res., vol. 2, pp. 265-292, 2002.
[37] D. B. Dunson and A. H. Herring, Bayesian latent variable models for mixed discrete outcomes, Biostatistics, vol. 6, no. 1, pp. 11-25, 2005.
[38] M. Zhou, L. Hannah, D. Dunson, and L. Carin, Beta-negative binomial process and Poisson factor analysis, in AISTATS, pp. 1462-1471, 2012.
[39] M. Zhou and L. Carin, Negative binomial process count and mixture modeling, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 307-320, 2015.
[40] N. G. Polson and J. G. Scott, Default Bayesian analysis for multi-way tables: A data-augmentation approach, arXiv:1109.4180v1, 2011.
[41] M. Zhou, L. Li, D. Dunson, and L. Carin, Lognormal and gamma mixed negative binomial regression, in ICML, pp. 1343-1350, 2012.
[42] N. G. Polson, J. G. Scott, and J. Windle, Bayesian inference for logistic models using Pólya-Gamma latent variables, J. Amer. Statist. Assoc., vol. 108, no. 504, pp. 1339-1349, 2013.
[43] M. Zhou, Softplus regressions and convex polytopes, arXiv:1608.06383, 2016.
[44] G. Rätsch, T. Onoda, and K.-R. Müller, Soft margins for AdaBoost, Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.
[45] T. Diethe, 13 benchmark datasets derived from the UCI, DELVE and STATLOG repositories. https://github.com/tdiethe/gunnar_raetsch_benchmark_datasets/, 2015.
[46] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, Training and testing low-degree polynomial data mappings via linear SVM, J. Mach. Learn. Res., vol. 11, pp. 1471-1490, 2010.
[47] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, LIBLINEAR: A library for large linear classification, J. Mach. Learn. Res., vol. 9, pp. 1871-1874, 2008.
[48] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[49] N. Djuric, L. Lan, S. Vucetic, and Z. Wang, BudgetedSVM: A toolbox for scalable SVM approximations, J. Mach. Learn. Res., vol. 14, pp. 3813-3817, 2013.